FAQ Style Article

Linux Scheduler – CFS and Virtual Run Time (vruntime)

In Linux Scheduler, Work In Progress on July 3, 2012 at 5:20 pm

This article explains the concept of virtual run time (vruntime) as used in the Linux CFS (Completely Fair Scheduler).

What is Virtual run time?
Virtual run time is the weighted time a task has run on the CPU

Where is it stored?
Virtual run time is stored in vruntime member of struct sched_entity
Note that sched_entity holds the scheduling-related information of a task, and is itself a member of task_struct.

struct task_struct
{
	/* ... */
	struct sched_entity se;
	/* ... */
};

struct sched_entity
{
	/* ... */
	u64 vruntime;
	/* ... */
};

What is the unit of vruntime?
vruntime is measured in nanoseconds

Where is vruntime updated?
vruntime of current task is updated in __update_curr function defined in sched_fair.c.
__update_curr() is called from update_curr() function, again defined in sched_fair.c
update_curr() is called from many places, including periodic timer interrupt.

How is vruntime updated?
First, the time elapsed since the last call to update_curr is determined.
Then, this time is weighted by the task's load and added to the vruntime of the current task.

How is weighting factor calculated?
Factor depends on nice value of task.
weighting factor = weight of task with nice value 0 / weight of current task
where 'weight' is roughly equivalent to 1024 * (1.25)^(-nice)

Some example values of weight:
weight is 820 for nice value 1
weight is 1024 for nice value 0
weight is 1277 for nice value -1

In other words, the weighting factor = 1.25^nice

How does this impact tasks with different nice values?

For nice value equal to 0, factor is 1; vruntime is same as real time spent by task on CPU.

For nice value less than 0, factor is < 1; vruntime is less than real time spent. vruntime grows slower than real time.

For nice value more than 0, factor is > 1; vruntime is more than real time spent. vruntime grows faster than real time.

Which functions are involved in updating vruntime?
Four functions are involved
(a) update_curr()
(b) __update_curr()
(c) calc_delta_fair()
(d) calc_delta_mine()

Brief description of what each of these functions does
update_curr() determines the time elapsed since its last call, then calls __update_curr() with that time.
__update_curr() determines the weighted time using calc_delta_fair() (which in turn uses calc_delta_mine()) and adds it to the vruntime of the current task.

Detailed description of what each of these functions does
calc_delta_mine()
Prototype
static unsigned long calc_delta_mine(unsigned long delta_exec, unsigned long weight, struct load_weight *lw)
Description
Returns (delta_exec * weight)/lw->weight;

calc_delta_fair
Prototype
static inline unsigned long calc_delta_fair(unsigned long delta, struct sched_entity *se)
Description:
Returns (delta * 1024)/se->load.weight;
Calls calc_delta_mine to do the processing.

__update_curr()
Prototype
static inline void __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr, unsigned long delta_exec)
Description
Determines (delta_exec * 1024)/curr->load.weight using calc_delta_fair
Adds that value to curr->vruntime

update_curr()
Prototype
static void update_curr(struct cfs_rq *cfs_rq)
Description
Determines time spent since last call to update_curr
Calls __update_curr to add weighted runtime to vruntime

================================================
Minimum virtual run time of CFS RunQueue (min_vruntime)
================================================
What is min_vruntime?
Represents the minimum virtual run time of the tasks in the CFS run queue. It is monotonically increasing.

Where is it stored?
It is part of CFS run queue structure (struct cfs_rq)
struct cfs_rq
{
	/* ... */
	u64 min_vruntime;
	/* ... */
};

What is its initial value?
min_vruntime is set to (u64)(-(1LL << 20)) in the init_cfs_rq() function defined in sched.c

Where is min_vruntime updated?
It is updated in update_min_vruntime() function defined in sched_fair.c
update_min_vruntime() is called from many places including
(a) __update_curr(), which is in turn called from update_curr()
(b) dequeue_entity(), which removes an entity from RB tree

How is min_vruntime updated?
Take the minimum of the vruntime of the current task and of the leftmost task in the run queue.
If this value is greater than the current min_vruntime, it becomes the new min_vruntime (so min_vruntime never decreases).
In other words:

min_vruntime = MAX(min_vruntime, MIN(current_running_task.vruntime, cfs_rq.left_most_task.vruntime))

================================================
Time related members in sched_entity
================================================
What are various time related members in sched_entity?
The following members are related to time:
struct sched_entity {
	/* ... */
	u64 exec_start;
	u64 sum_exec_runtime;
	u64 vruntime;
	u64 prev_sum_exec_runtime;
	/* ... */
};

What are these members?
exec_start
– Time stamp taken when the task last (re)started running on the CPU
– Used in update_curr() to find the duration the current process has run on the CPU
– Reset in update_curr(), after its previous value has been used to determine how long the current process ran

sum_exec_runtime
– Total time process ran on CPU
– In real time
– Nano second units
– Updated in __update_curr(), called from update_curr()

prev_sum_exec_runtime
– When a process is picked to run on the CPU, its current sum_exec_runtime value is preserved in prev_sum_exec_runtime.

  1. Very good explanation. Really cleared many doubts regarding min_vruntime.
    This article along with http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html made many things clear.

    Thanks 🙂

  2. These notes have been very helpful to me but I do have a question…

    Can you explain why min_vruntime is initially set to (u64)(-(1LL << 20)) ?

    As far as I can ascertain, this will set the 44 most significant bits to 1 and the 20 least significant bits to 0.

    min_vruntime will then overflow or "clock over" to zero in 20 bits' worth of nanoseconds, or about 1 ms.

    I would like to know why this value would be better than any other (eg. zero).

    Maybe it might allow for some subtracting shortly after the run queue is instantiated ?

    And could you elaborate on what happens when min_vruntime does clock over ?

    Regards

    Richard

  3. I’m wondering whether the answer could be related to what is done with the initialisation of the jiffies counter ?

    from Linux/include/linux/jiffies.h :

    162 /*
    163 * Have the 32 bit jiffies value wrap 5 minutes after boot
    164 * so jiffies wrap bugs show up earlier.
    165 */
    166 #define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))

    So (at least for jiffies) starting with an offset of -300 seconds (i.e..5 minutes) is done on purpose, to expose bugs in drivers which don’t handle wrapping of the jiffies (see comment in the code).

    Could a similar rationale be the answer to the starting value for min_vruntime ?

    I’ve been trying to wrap my head around the double cast (unsigned long)(unsigned int) used here…..

    Here’s my best explanation… can you tell me if you agree (I’m no expert but I can use Google) ?

    My thoughts are that on the first (rightmost) cast, the negative signed int (-300*HZ) is converted to an unsigned int with sign extension to whatever size an unsigned int is on that implementation (typically 32 bits), then the second cast (to unsigned long) casts that up to the size of an unsigned long (32 or 64 bits depending on implementation) with zero extension (as we are going from unsigned to unsigned type).

    In the case of 32 bit unsigned long, the second (leftmost) cast does nothing.

    In the case of 64 bit unsigned long, the second cast would cause the upper 32 bits to be set to 0.

    In any case, the jiffies variable (32 or 64 bit) is overlaid on the jiffies_64 (type u64) variable, aligned at the least significant bit, so if the upper 32 bits of jiffies_64 are not already set to zero (in the case of 32 bit unsigned long), they will be when jiffies_64 gets initialised due to implicit type conversion of unsigned 32 bit integer to unsigned 64 bit integer with zero extension:

    from Linux/kernel/timer.c :
    55 u64 jiffies_64 __cacheline_aligned_in_smp = INITIAL_JIFFIES;

    In other words, by whatever path, the upper 32 bits of the jiffies/jiffies_64 combo will start at zero, and the lower 32 bits are set so that they will clock over in 5 mins.

    If a real time in jiffies since startup is needed, the initial value can always be subtracted from the current value (as a 64 bit subtraction using jiffies_64 and INITIAL_JIFFIES).

    If we used a single typecast to get us from signed int to unsigned long in the definition of INITIAL_JIFFIES, we would sign extend right up to the most significant bit, and in the case of a 64 bit unsigned long, set the upper 32 bits to 1.

    We are only trying to establish proper wrapping behaviour for the lower 32 bits, as the full 64 bit version will not be expected to wrap around in any of our lifetimes as long as the upper 32 bits start out at 0.

    Could you comment on that ?

    Regards

    Richard

  4. Very good explanation.
