FAQ Style Article

Archive for the ‘Linux Scheduler’ Category

CGroup – CPU allocation (cpu.shares) examples

In Linux Scheduler, Work In Progress on September 2, 2012 at 9:07 am

This article shows how cpu.shares can be used to control the CPU time allocated to processes.


What are Cgroups?
Cgroups (also known as control groups) are a Linux mechanism to control the allocation of resources, such as CPU and memory, to processes.

Which aspects of CPU can be controlled?
(a) CPU time for processes using cpu cgroup
(b) CPU on which processes will run in case of multi-core system using cpuset cgroup

Briefly explain the cpu cgroup?
The cpu cgroup provides a mechanism to control CPU allocation to various processes. Each cpu cgroup contains a tuneable called cpu.shares. This tuneable allows the user to control the relative share of CPU available to all processes within that cgroup.

Briefly explain the various cgroup-related Linux tools
(a) cgcreate – create a cgroup
(b) cgdelete – remove a cgroup
(c) cgclassify – move a running process to a cgroup
(d) cgexec – run a task in the given control group

How to install cgroup tools?
sudo apt-get install cgroup-bin


#Create a mount directory and mount the cpu cgroup (mounting requires root)
mkdir cgmntdir
sudo mount -t cgroup -o cpu test cgmntdir

Example – Create two busy-wait processes such that each gets 50% of the CPU.
(a) Create cgroups A and B. By default, cpu.shares will be set to 1024.
sudo cgcreate -g cpu:A
sudo cgcreate -g cpu:B
(b) List the current cpu.shares value – should be 1024
cgget -r cpu.shares A
cgget -r cpu.shares B
(c) Create two processes and assign one to each cgroup
sudo cgexec -g cpu:A dd if=/dev/zero of=/dev/null &
sudo cgexec -g cpu:B dd if=/dev/zero of=/dev/null &
(d) Check that the processes take an equal amount of CPU

Example – Create two processes such that one takes 75% of the CPU and the other 25%.
Create processes P1 and P2.
Assign P1 to cgroup A and P2 to cgroup B.
Set cpu.shares of A to 768 and cpu.shares of B to 256.
Now P1 will get 3 times as much CPU as P2.
sudo cgset -r cpu.shares=768 A
sudo cgset -r cpu.shares=256 B
sudo cgexec -g cpu:A dd if=/dev/zero of=/dev/null &
sudo cgexec -g cpu:B dd if=/dev/zero of=/dev/null &

Example – What will the CPU allocation be in various scenarios?

In a cgroup hierarchy, A and B are at the top level. A has two sub-cgroups, A1 and A2, while B has one sub-cgroup, B1.
A cpu.shares is 1024
B cpu.shares is 2048
A1 cpu.shares is 512
A2 cpu.shares is 1024
B1 cpu.shares is 2048

Question – How will CPU be distributed between A and B?
A share = 1024/(1024+2048) = 1/3
B share = 2048/(1024+2048) = 2/3

Question – How will CPU be distributed between A1 and A2 (sub-cgroups of A)?
Case 1: No processes running directly in the A cgroup.
A1 and A2 will share the CPU allocated to the A cgroup.
A1 share = 512/(512+1024) = 1/3
A2 share = 1024/(512+1024) = 2/3

Case 2: A process running directly in the A cgroup (assume a single nice-0 process, weight 1024). It competes with A1 and A2 at the same level.
A process share = 1024 / (1024 + 512 + 1024) = 2/5
A1 share = 512 / (1024 + 512 + 1024) = 1/5
A2 share = 1024 / (1024 + 512 + 1024) = 2/5

Question – What will be B1's (sub-cgroup of B) share of CPU?
Case 1: No processes running directly in the B cgroup.
B1 share = B's CPU share, and B1's cpu.shares value does not matter.

Case 2: A process (assume nice 0, weight 1024) running directly in the B cgroup.
Then the shares are calculated using the following formula:
B1 = 2048/(2048+1024) = 2/3
B process = 1024/(2048+1024) = 1/3

Question – What will the CPU allocation be in the following case?
A – 1 process
B – 1 process
A1 – 1 process

The B process will get 2/3 of the CPU.
A and A1 will share the remaining 1/3.
Within that, the A process share = 1024/(512+1024) = 2/3 and the A1 share = 512/(512+1024) = 1/3.
In effect, B share = 2/3 = 6/9; A share = 1/3 * 2/3 = 2/9; A1 share = 1/9.

CFS and Periodic Scheduler (scheduler_tick())

In Linux Scheduler, Work In Progress on July 3, 2012 at 5:26 pm

This article explains scheduler_tick(), the periodic Linux scheduler function.

What is scheduler_tick()?
It is a function defined in kernel/sched.c which gets called by the timer code with HZ frequency.
HZ is a preprocessor define which can be set at compile time.

What does it do?
It determines whether the current task has to be scheduled out and a new task brought in.

How does it do it?
If the current task needs to be scheduled out, the TIF_NEED_RESCHED flag is set for it.
Note that the task itself is not switched immediately; only the flag is set.
If the flag is set, the task will be scheduled out when the schedule() function is subsequently called.

What is the schedule() function?
It implements the main scheduler and is defined in kernel/sched.c.
It is called from many points in the kernel to allocate the CPU to a process other than the currently active one.
It is usually called after returning from system calls, if TIF_NEED_RESCHED is set for the current task.

Which functions are called by scheduler_tick() to do its job?
It mainly calls the curr->sched_class->task_tick function.
If the current task is a CFS task, this will be the task_tick_fair function.

What does the task_tick_fair function do?
Nothing much; it just ends up calling the entity_tick() function, defined in sched_fair.c.

What does entity_tick() do?
It does two main things:
(a) Updates the current task's timing information by calling update_curr()
(b) If there is more than one task in the run queue, calls check_preempt_tick() to check whether the current task has to be scheduled out.

What does check_preempt_tick() do?
First, it determines the time slice for the current task.
Second, it sets the resched flag on the current task if either of two conditions is satisfied:
(a) the current task has run beyond its time slice
(b) the current task's vruntime exceeds the next task's vruntime by more than the time slice, provided the current task has run for more than sysctl_sched_min_granularity

For both these operations, it uses many helper functions.

Which helper function is used to determine the time slice of the current task?
sched_slice() provides the time slice of the current process.

How does sched_slice() determine the time slice?
It does so in two steps:
(a) Determines the scheduler period, in which all runnable tasks should run at least once. It uses the __sched_period helper function for this purpose.
(b) Determines the share of the current task in the scheduler period.
The share is determined on the basis of the ratio of the current task's weight to the overall weight of the CFS run queue.
In other words, share = scheduler period * (current_task.weight / cfs_rq.total_weight)

How does __sched_period find the scheduler period?
if (nr_running <= nr_latency)
    period = sysctl_sched_latency;
else
    period = sysctl_sched_min_granularity * nr_running;

The idea is to stretch the period when there are many tasks, so that each task's time slice does not fall below sysctl_sched_min_granularity.

Which helper function does check_preempt_tick() use to set the resched flag?
It uses the resched_task() function, which is implemented in sched.c.

Linux Scheduler – CFS and Virtual Run Time (vruntime)

In Linux Scheduler, Work In Progress on July 3, 2012 at 5:20 pm

This article explains the concept of virtual run time as used in the Linux CFS scheduler.

What is Virtual run time?
Virtual run time is the weighted time a task has run on the CPU

Where is it stored?
Virtual run time is stored in the vruntime member of struct sched_entity.
Note that sched_entity contains the scheduling-related information of a task and is a member of task_struct.

struct task_struct {
    ...
    struct sched_entity se;
    ...
};

struct sched_entity {
    ...
    u64 vruntime;
    ...
};

What is the unit of vruntime?
vruntime is measured in nanoseconds.

Where is vruntime updated?
The vruntime of the current task is updated in the __update_curr() function defined in sched_fair.c.
__update_curr() is called from the update_curr() function, also defined in sched_fair.c.
update_curr() is called from many places, including the periodic timer interrupt.

How is vruntime updated?
First, the time elapsed since the last call to update_curr() is determined.
Then, this time is weighted and added to the vruntime of the current task.

How is the weighting factor calculated?
The factor depends on the nice value of the task:
weighting factor = weight of a process with nice value 0 / weight of the current task
where ‘weight’ is roughly equivalent to 1024 * (1.25)^(-nice)

Some example values of weight:
weight is 820 for nice value 1
weight is 1024 for nice value 0
weight is 1277 for nice value -1

In other words, the weighting factor = 1.25^nice.
How does this impact tasks with different nice values?

For a nice value equal to 0, the factor is 1; vruntime is the same as the real time spent by the task on the CPU.

For a nice value less than 0, the factor is < 1; vruntime is less than the real time spent. vruntime grows slower than real time.

For a nice value greater than 0, the factor is > 1; vruntime is more than the real time spent. vruntime grows faster than real time.

Which functions are involved in updating vruntime?
Four functions are involved
(a) update_curr()
(b) __update_curr()
(c) calc_delta_fair()
(d) calc_delta_mine()

Briefly, what does each of these functions do?
update_curr() determines the time elapsed since the last call to update_curr(), then calls __update_curr() with that time.
__update_curr() determines the weighted time using calc_delta_fair() (which in turn uses calc_delta_mine()) and updates the vruntime of the current task.

Detailed description of what each of these functions do?
static unsigned long calc_delta_mine(unsigned long delta_exec, unsigned long weight, struct load_weight *lw)
Returns (delta_exec * weight)/lw->weight;

static inline unsigned long calc_delta_fair(unsigned long delta, struct sched_entity *se)
Returns (delta * 1024)/se->load.weight;
Calls calc_delta_mine to do the processing.

static inline void __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr, unsigned long delta_exec)
Determines (delta_exec * 1024)/curr->load.weight using calc_delta_fair
Adds that value to curr->vruntime

static void update_curr(struct cfs_rq *cfs_rq)
Determines time spent since last call to update_curr
Calls __update_curr to add weighted runtime to vruntime

Minimum virtual run time of the CFS run queue (min_vruntime)
What is min_vruntime?
It tracks the minimum vruntime of any task in the CFS run queue, and it only ever increases.

Where is it stored?
It is part of the CFS run queue structure (struct cfs_rq):
struct cfs_rq {
    ...
    u64 min_vruntime;
    ...
};

What is its initial value?
min_vruntime is set to -(1 << 20) in the init_cfs_rq() function defined in sched.c.

Where is min_vruntime updated?
It is updated in the update_min_vruntime() function defined in sched_fair.c.
update_min_vruntime() is called from several places, including:
(a) __update_curr(), which is in turn called from update_curr()
(b) dequeue_entity(), which removes an entity from the RB tree

How is min_vruntime updated?
Find the minimum of the current task's vruntime and the leftmost task's vruntime in the run queue.
Set this value as min_vruntime if it is greater than the current value of min_vruntime.
In other words:

min_vruntime = MAX(min_vruntime, MIN(current_running_task.vruntime, cfs_rq.left_most_task.vruntime))

Time-related members in sched_entity
What are the various time-related members in sched_entity?
The following members are related to time:
struct sched_entity {
    ...
    u64 exec_start;
    u64 sum_exec_runtime;
    u64 vruntime;
    u64 prev_sum_exec_runtime;
    ...
};

What are these members?
exec_start
– Time when the task started running on the CPU
– Used in update_curr() to find how long the current process has run on the CPU
– Reset in update_curr(), after the previous value is used to determine the time the current process ran on the CPU

sum_exec_runtime
– Total time the process has run on the CPU
– In real time, not virtual time
– In nanosecond units
– Updated in __update_curr(), called from update_curr()

prev_sum_exec_runtime
– When a process is put on the CPU, its current sum_exec_runtime value is preserved in prev_sum_exec_runtime.

Linux Scheduler – Table of Contents

In Linux Scheduler, Work In Progress on June 25, 2012 at 5:12 pm

List of FAQ style articles on Linux CFS Scheduler

  1. Introduction – Terminology
  2. CFS and Nice
  3. CFS and Latency (sched_min_granularity_ns and sched_latency_ns)
  4. CFS and Red Black Tree
  5. CFS and Virtual Run Time (vruntime)
  6. CFS and Periodic scheduler (scheduler_tick())

Linux Scheduler – CFS and Red Black Tree

In Linux Scheduler, Work In Progress on June 8, 2012 at 1:56 pm

This article explains how the Red Black Tree data structure is used in the Linux scheduler.

What is a Red Black Tree?
It is a type of self-balancing binary search tree, a data structure used in computer science, typically to implement associative arrays.

What is a binary tree?
A tree of nodes, where each node has up to two children – left and right – and one parent.

What are the properties of RB tree?
(a) It is self balancing: no path in the tree will ever be more than twice as long as any other.
(b) Operations (insertion, search and delete) occur in O(log n) time (where n is the number of nodes in the tree)

Where is it used in Linux?
It is used by CFS scheduler implementation and many other places.

How does CFS use the RB tree?
To represent runnable tasks and to find out which task to run next.
Each task is stored in the RB tree based on its virtual run time (vruntime).
The leftmost node in the tree will be the one with the least vruntime.
When CFS needs to pick the next task to run, it picks the leftmost node.

Where is the RB tree declared and defined in Linux?
include/linux/rbtree.h – declarations of the RB tree
lib/rbtree.c – implementation of the RB tree functions

What are major structures related to RB tree
struct rb_root – RB tree itself
struct rb_node – Node of RB tree

What does the rb_root structure contain?
Just a single member, a pointer to the root node:
struct rb_root {
    struct rb_node *rb_node;
};

What does rb_node contain?
Three members:
struct rb_node {
    unsigned long rb_parent_color; /* Parent pointer and node color packed together – needed by the RB tree implementation */
    struct rb_node *rb_right; /* Right child */
    struct rb_node *rb_left; /* Left child */
};

What are the major functions/macros provided by the RB tree?
rb_erase(node, tree)
– Delete a node from the tree

rb_link_node(node, parent, link)
– Insert a node into either the right or left subtree, as indicated by link.

rb_insert_color(node, tree)
– Rebalance the tree; called in conjunction with rb_link_node

rb_entry(node, type, member)
– Return the address of the structure of type ‘type’ that contains ‘node’ as its member ‘member’

rb_next(node)
– Return the node next to ‘node’ (in sorted order)

How does CFS use the RB tree?
There are two related things:
(a) Where and how does CFS use rb_node (the RB node structure)?
(b) Where and how does CFS use rb_root (the RB tree structure)?

Which CFS structure uses rb_root?
CFS’s main context structure is struct cfs_rq.
It represents the CFS run queue (in the simplest case, the list of tasks that can run on a CPU).
Defined in sched.c, cfs_rq contains information such as the number of tasks, amongst other things.

What is the relation between struct cfs_rq and struct rb_root?
struct cfs_rq contains a member of type rb_root:
struct cfs_rq {
    ...
    struct rb_root tasks_timeline;
    ...
};
cfs_rq.tasks_timeline is the root of the RB tree representing the runnable tasks in the CFS run queue.

Which CFS structure(s) uses rb_node?
rb_node is used by struct cfs_rq and struct sched_entity.

What is struct sched_entity?
CFS declares struct sched_entity to represent a schedulable entity.
In the basic case, a schedulable entity consists of a single task’s scheduling information.
Defined in sched.h, the sched_entity structure contains the weight of the task, amongst other information.

struct sched_entity contains a member of type rb_node:
struct sched_entity {
    ...
    struct rb_node run_node;
    ...
};
This allows sched_entity to be treated as if it were of type struct rb_node, as far as the RB tree functions are concerned.

For what does cfs_rq use rb_node?
struct cfs_rq contains a member of type struct rb_node *:
struct cfs_rq {
    ...
    struct rb_node *rb_leftmost;
    ...
};
rb_leftmost points to the leftmost node in the RB tree, i.e. the task with the least vruntime. Caching it lets CFS pick the next task without walking the tree.

Which are the key functions of CFS related to the RB tree?

__enqueue_entity()
– Add a scheduling entity to the RB tree, based on the scheduling entity’s virtual run time.

__dequeue_entity()
– Removes a scheduling entity from the RB tree

__pick_first_entity()
– Returns the leftmost scheduling entity of the CFS run queue

__pick_next_entity()
– Returns the scheduling entity next to the one passed.

More about __enqueue_entity
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se);
Based on the entity’s virtual run time (se->vruntime), the RB node (se->run_node) is inserted at the right place in the CFS run queue tree (cfs_rq->tasks_timeline).
The run queue’s leftmost node pointer (cfs_rq->rb_leftmost) is updated if the new node has the least vruntime.

More about __dequeue_entity
static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
Removes the RB node (se->run_node) from the CFS run queue RB tree (cfs_rq->tasks_timeline).
If the leftmost node is being removed, cfs_rq->rb_leftmost is updated to the node next to the removed one.

More about __pick_first_entity
static struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
Returns the scheduling entity of the RB tree’s leftmost node (cfs_rq->rb_leftmost)

More about __pick_next_entity
static struct sched_entity *__pick_next_entity(struct sched_entity *se)
Returns the scheduling entity of the RB node next to the one passed (se->run_node)

Linux Scheduler – CFS and Nice

In Linux Scheduler, Work In Progress on June 6, 2012 at 10:07 am

This article explains how nice (the user-level process priority) affects the Linux scheduler (CFS).

What is nice?
In simple words, nice is a way to influence how a process is scheduled in Linux.

What is the possible range of nice?
The possible nice value range is -20 to 19.

How does the nice value affect priority?
The higher the nice value, the “nicer” the process is to other processes!
In other words, the higher the nice value, the lower the priority.
So a process with a nice value of -20 has the highest priority, and one with 19 has the lowest.

How can I see the nice value of a process?
Two options:
(a) Use the ps command.
ps -o pid,comm,nice -p <PID> will display the PID, command, and nice value of the process.
(b) Use the top command.
top displays a column NI, indicating the nice value.

What is default nice value of a process?
The default nice value is zero

How can one change the nice value of a process?
There are two options: command-line tools and system calls.

Which commands can I use?
nice – to set the nice value when launching a new command:
nice -n <increment> <command>

renice – to change the nice value of an existing process:
renice -n <increment> -p <PID>

Which system calls?
The nice() system call:
int nice(int inc);
nice() adds inc to the nice value of the calling process.

Nice and Weights of a task
How is nice value mapped to weight?
weight is roughly equivalent to 1024 / (1.25)^ (nice)

Some examples please
weight is 1024 for nice value 0
weight is 820 for nice value 1
weight is 1277 for nice value -1

If weight is indicative of how much time a process gets on the CPU, nice seems to have a non-linear effect on the time consumed!
Yes, that’s right.
The nice value has an exponential effect!

Is there an easy way to understand this effect?
The whole scheme has been designed to implement the idea below:
nice value down by 1 – about 10% more CPU
nice value up by 1 – about 10% less CPU

How does the weight of a process affect its CPU availability?
In simple terms:
weight(RunQueue) = sum of the weights of all tasks in the queue
proportion of CPU allotted = weight(process)/weight(RunQueue)

Too much theory…some examples please…

Consider two processes, A and B with nice value 0 (default) and running on single CPU.
nice(A) = 0 => weight(A) = 1024
nice(B) = 0 => weight(B) = 1024.

Assuming single CPU and no other process running, run queue of CPU has two tasks.
weight(RunQueue) = weight(A) + weight(B) = 2048.

proportion of CPU allotted = weight(process)/weight(RunQueue)
cpuPercentage(A) = weight(A)/weight(RunQueue) = 1024/2048 = 50%
Needless to say cpuPercentage(B) = 50%

What if A has nice value 1 and B has 0?
weight(A) with nice 1 is 820
weight(B) with nice 0 is 1024
weight(RunQueue) = weight(A) + weight(B) = 1844.
cpuPercentage(A) = weight(A)/weight(RunQueue) = 820/1844 = ~45%
cpuPercentage(B) = weight(B)/weight(RunQueue) = 1024/1844 = ~55%

So by increasing the nice value of A to 1, its CPU availability becomes about 10 percentage points lower than that of the nice-0 process…

Experiments with NICE on real Linux Machine

#Create two busy-wait processes – note that the taskset command forces the task to run on the specified CPU. I have used CPU #5 on my system.
taskset -c 5 dd if=/dev/zero of=/dev/null &
taskset -c 5 dd if=/dev/zero of=/dev/null &

# See the CPU load using top – note that pressing ‘1’ while top is running shows individual CPU loads.
# This showed that the processes were using 50% CPU each.

#Change the nice value of one of the processes (10211 was the PID of one of them). CPU usage changed to 55% for one and 44% for the other, as expected!
renice -n 1 10211

#Now raise the nice value of the same process to 5. CPU usage changed to 75% for one and 25% for the other, as expected!
renice -n 5 10211

Linux Scheduler – Introduction

In Linux Scheduler, Work In Progress on June 6, 2012 at 10:04 am

This article explains Linux scheduler’s terminology.

What is a task?
The Linux scheduler deals with tasks. Each process and each thread is a task in the eyes of the scheduler. A task is represented by the task_struct structure.

What is a scheduling policy?
Scheduling policy controls how scheduler manages the tasks.

What policies does Linux support?
Normal policies: SCHED_OTHER (also called SCHED_NORMAL), SCHED_BATCH, and SCHED_IDLE. Real-time policies: SCHED_FIFO and SCHED_RR.

Which is the most common scheduling policy?
SCHED_OTHER – the default Linux time-sharing scheduling policy, used by the majority of processes

What is CFS?
CFS stands for Completely Fair Scheduler. It is the default Linux kernel scheduler.

What is a run queue?
Run queue can be thought of as list of tasks that can run on a CPU.

What is time slice?
Time a task runs on CPU before being pre-empted out.

What is task weight?
Each task has a weight. This weight is related to the process (or thread) priority: the higher the priority, the higher the weight.

Linux Scheduler – CFS and Latency (sched_min_granularity_ns and sched_latency_ns)

In Linux Scheduler, Work In Progress on June 6, 2012 at 10:02 am

This article explains Linux scheduler’s latency parameters sched_min_granularity_ns and sched_latency_ns.

What is sched_min_granularity_ns?
sched_min_granularity_ns is a scheduler tuneable.
This tuneable decides the minimum time a task will be allowed to run on the CPU before being pre-empted out.
By default, it is set to 4ms. So by default, any task will run at least 4ms before getting pre-empted out.

What is sched_latency_ns?
sched_latency_ns is a scheduler tuneable.
sched_latency_ns and sched_min_granularity_ns together decide the scheduler period, the period in which all run queue tasks are scheduled at least once.
By default, sched_latency_ns is set to 20ms.

How does the scheduler period depend on sched_latency_ns and sched_min_granularity_ns?
If the number of runnable tasks does not exceed sched_latency_ns/sched_min_granularity_ns:
    scheduler period = sched_latency_ns
Otherwise:
    scheduler period = number_of_running_tasks * sched_min_granularity_ns

Example – On a Linux system with default values, there are two processes, both busy-waiting. What will be the time slice of each task?
By default, sched_latency_ns = 20ms and sched_min_granularity_ns = 4ms.
Time slice = scheduling period * (task’s weight / total weight of tasks in the run queue)
Number of running tasks = 2
The weight of each task will be the same, hence (task’s weight / total weight of tasks in the run queue) will be 1/2.
Hence the time slice will be 20ms * 0.5, or 10ms.

What happens when there are 20 such busy-wait processes?
Naively, each task should get 1ms (= 20ms * 1/20). But this is less than sched_min_granularity_ns.
sched_min_granularity_ns decides the minimum time a task will run; by default it is 4ms.
So in the case of 20 busy-wait processes, each process will run for 4ms before it is pre-empted, and the scheduler period stretches to 20 * 4ms = 80ms.