FAQ Style Article

Archive for July, 2012|Monthly archive page

CFS and Periodic Scheduler (scheduler_tick())

In Linux Scheduler, Work In Progress on July 3, 2012 at 5:26 pm

This article explains scheduler_tick(), the periodic linux scheduler.

What is scheduler_tick()
It is function defined in sched.c which gets called by timer code with HZ frequency.
HZ is a preprocessor define, which can be set at compile time

What does it do?
Determine whether current task has to be scheduled out and new task has to be brought in

How does it do it?
If current task needs to be scheduled out, then TIF_NEED_RESCHED flag is set for it
Note that task itself will not be switched, just the flag is set
if flag is set, then task will be scheduled out when schedule() function is subsequently called.

What is schedule() function?
It implements the Main Scheduler. It is defined in kernel/sched.c
It is called from many points in the kernel to allocate the CPU to a process other than the currently active one
Usually called after returning from system calls, if TIF_NEED_RESCHED is set for current task

Which functions are called by scheduler_tick to do its job?
It mainly calls curr->sched_class->task_tick function.
If task is CFS task, then this will be task_tick_fair function.

What does task_tick_fair function do?
Nothing much, just ends up calling entity_tick() function, defined in sched_fair.c

What does entity_tick() do?
Does two main tasks
(a) Updates current task timing information by calling update_curr()
(b) If there are more than 1 tasks in current queue, calls check_preempt_tick() to check if current task has to be scheduled out.

What does check_preempt_tick() do?
First, it determines time slice for the current task
Second, it set resched flag on current task if either of two conditions are satisifed
(a) if current task has run beyond its time slice
(b) if current task vruntime exceeds next task vruntime by time slice, provided time run by current task is more than sysctl_sched_min_granularity

For both these operations, it uses many helper functions.

Which helper functions are used for determining time slice of current task?
sched_slice() provides time slice of current process.

How does sched_slice determine time slice?
It does using two steps
(a) Determines scheduler period, in which all runnable tasks should run at least once. It uses __sched_period helper function for this purpose.
(b) Determines share of current task in the scheduler period.
Shared is determined on the basis of ratio of current task weight to overall weight of CFS run queue.
In other words, share = scheduler period * (current_task.weight/cfs_rq.total.weight)

How does __sched_period find scheduler period?
if(number of runnable tasks(nr_running) < nr_latency)
period = sysctl_sched_latency;
period = sysctl_sched_min_granularity * nr_running;

Idea is to prevent period from going below sysctl_sched_min_granularity.

What helper function does check_preempt_tick() function use to set resched flag?
It uses resched_task() function, which is implemented in sched.c

Linux Scheduler – CFS and Virtual Run Time (vruntime)

In Linux Scheduler, Work In Progress on July 3, 2012 at 5:20 pm

This article explains concept of virtual run time as used in Linux CFS scheduler

What is Virtual run time?
Virtual run time is the weighted time a task has run on the CPU

Where is it stored?
Virtual run time is stored in vruntime member of struct sched_entity
Note that sched_entity contains scheduling related information of task and it is member of task_struct.

struct task_struct
struct sched_entity se;

struct sched_entity

u64 vruntime;

What is the unit of vruntime?
vruntime is measured in nano seconds

Where is vruntime updated?
vruntime of current task is updated in __update_curr function defined in sched_fair.c.
__update_curr() is called from update_curr() function, again defined in sched_fair.c
update_curr() is called from many places, including periodic timer interrupt.

How is vruntime updated?
First, time elapsed since last call to update_curr is determined
Then, this time is weighted and added to vruntime of current task.

How is weighting factor calculated?
Factor depends on nice value of task.
weighting factor = weight of process with nice value 0 / weight of current task;
where ‘weight’ is roughly equivalent to 1024 * (1.25)^(-nice)

Some examples values of weight
weight is 820 for nice value 1
weight is 1024 for nice value 0
weight is 1277 for nice value -1

In other words, the weighting factor = 1.25^nice
How does this impact tasks with different nice values?

For nice value equal to 0, factor is 1; vruntime is same as real time spent by task on CPU.

For nice value less than 0, factor is < 1; vruntime is less than real time spent. vruntime grows slower than real time.

For nice value more than 0, factor is > 1; vruntime is more than real time spent. vruntime grows faster than real time.

Which functions are involved in updating vruntime?
Four functions are involved
(a) update_curr()
(b) __update_curr()
(c) calc_delta_fair()
(d) calc_delta_mine()

Brief description of what each of these functions do?
update_curr determines time elapsed since last call to update_curr, then calls __update_curr with that time
__update_curr determines weighted time using calc_delta_fair (which in turn uses calc_delta_mine) and updates vruntime of current task.

Detailed description of what each of these functions do?
static unsigned long calc_delta_mine(unsigned long delta_exec, unsigned long weight, struct load_weight *lw)
Returns (delta_exec * weight)/lw->weight;

static inline unsigned long calc_delta_fair(unsigned long delta, struct sched_entity *se)
Returns (delta * 1024)/se->load->weight;
Calls calc_delta_mine to do the processing.

static inline void __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr, unsigned long delta_exec)
Determines (delta_exec * 1024)/curr->load->weight using calc_delta_fair
Adds that value to curr->vruntime

static void update_curr(struct cfs_rq *cfs_rq)
Determines time spent since last call to update_curr
Calls __update_curr to add weighted runtime to vruntime

Minimum virtual run time of CFS RunQueue (min_vruntime)
What is min_vruntime?
Represents minimum run time of any task in the CFS run queue.

Where is stored?
It is part of CFS run queue structure (struct cfs_rq)
struct cfs_rq
u64 min_vruntime;

What is its initial value?
min_vruntime is set to -1 << 20 in init_cfs_rq() function defined in sched.c

Where is min_vruntime updated?
It is updated in update_min_vruntime() function defined in sched_fair.c
update_min_vruntime() is called from many places including
(a) __update_curr(), which is in turn called from update_curr()
(b) dequeue_entity(), which removes an entity from RB tree

How is min_vruntime updated?
Find the minimum vruntime of current task and leftmost task in run queue
Set this runtime as min_vruntime if it is greater than current value of min_vruntime
In other words

min_vruntime = MAX(min_vruntime, MIN(current_running_task.vruntime, cfs_rq.left_most_task.vruntime))

Time related members in sched_entity
What are various time related members in sched_entity?
There are following members, related to time
struct sched_entity {
u64 exec_start;
u64 sum_exec_runtime;
u64 vruntime;
u64 prev_sum_exec_runtime;

What are these members?
– Time when task started running on CPU
– Used in update_curr() to find time duration run by current process on CPU
– Reset in update_curr(), after using previous value to determine time current process ran on CPU

– Total time process ran on CPU
– In real time
– Nano second units
– Updated in __update_curr(), called from update_curr()

– When a process is taken to the CPU, its current sum_exec_runtime value is preserved in prev_exec_runtime.