List of FAQ style articles on Linux CFS Scheduler
Archive for June, 2012
This article explains how the red-black (RB) tree data structure is used in the Linux scheduler.
What is a Red Black Tree?
It is a type of self-balancing binary search tree, a data structure used in computer science, typically to implement associative arrays.
What is a binary tree?
A tree of nodes in which each node has at most two children (left and right) and, except for the root, exactly one parent.
What are the properties of an RB tree?
(a) It is self-balancing: no path from the root to a leaf is ever more than twice as long as any other such path.
(b) Insertion, search and deletion take O(log n) time, where n is the number of nodes in the tree.
Where is it used in Linux?
It is used by the CFS scheduler implementation and in many other places in the kernel.
How does CFS use the RB tree?
To hold the runnable tasks and to find out which task to run next.
Each task is stored in the RB tree, keyed by its virtual run time (vruntime).
The leftmost node in the tree is the one with the least vruntime.
When CFS needs to pick the next task to run, it picks the leftmost node.
Where is the RB tree declared and defined in Linux?
include/linux/rbtree.h – Declarations of the RB tree
lib/rbtree.c – Implementation of the RB tree functions
What are the major structures related to the RB tree?
struct rb_root – RB tree itself
struct rb_node – Node of RB tree
What does rb_root structure contain?
Just a single member, a pointer to the root node.
struct rb_node *rb_node;
What does rb_node contain?
unsigned long rb_parent_color; /* Pointer to the parent, with this node's color packed into the low bits */
struct rb_node *rb_right; /* Right child */
struct rb_node *rb_left; /* Left child */
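How one word can hold both a parent pointer and a color is easier to see in code. The following is a minimal sketch, not the kernel header: the struct is a simplified stand-in, the helper names set_parent_color, node_parent and node_color are made up for illustration, and it assumes (as on Linux) that an unsigned long can hold a pointer and that nodes are aligned so the two lowest address bits are always zero.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct rb_node; not the real kernel definition. */
struct rb_node {
    unsigned long rb_parent_color;   /* parent pointer | color bit */
    struct rb_node *rb_right;
    struct rb_node *rb_left;
};

/* Color values chosen for this sketch. */
enum { RB_RED = 0, RB_BLACK = 1 };

/* Hypothetical helper: store parent pointer and node color in one word. */
static void set_parent_color(struct rb_node *n, struct rb_node *parent, int color)
{
    /* The trick only works if the two lowest address bits are free. */
    assert(((uintptr_t)parent & 3) == 0);
    n->rb_parent_color = (unsigned long)(uintptr_t)parent | (unsigned long)color;
}

/* Hypothetical helper: mask off the low bits to recover the parent. */
static struct rb_node *node_parent(const struct rb_node *n)
{
    return (struct rb_node *)(uintptr_t)(n->rb_parent_color & ~3UL);
}

/* Hypothetical helper: the lowest bit is the color of this node. */
static int node_color(const struct rb_node *n)
{
    return (int)(n->rb_parent_color & 1);
}
```

Packing the color into the pointer saves one field per node, which matters when a node is embedded in every schedulable entity.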
What are the major functions/macros provided by the RB tree?
rb_erase(node, root)
– Delete a node from the tree
rb_link_node(node, parent, link)
– Insert a node as the right or left child of parent, as indicated by link
rb_insert_color(node, root)
– Rebalance the tree; called in conjunction with rb_link_node
rb_entry(node, type, member)
– Return the address of the structure of the given type whose member is the given node
rb_next(node)
– Return the node next to the given node
How does CFS use RB tree?
There are two related things
(a) Where and how CFS uses rb_node (RB node structure)?
(b) Where and how CFS uses rb_root (RB tree structure)?
Which CFS structure uses rb_root?
The main CFS context structure is struct cfs_rq.
It represents the CFS run queue (in the simplest case, the list of tasks that can run on a CPU).
Defined in sched.c, cfs_rq holds, among other things, the number of runnable tasks.
What is the relation between struct cfs_rq and struct rb_root?
struct cfs_rq contains a member of type struct rb_root:
struct rb_root tasks_timeline;
cfs_rq.tasks_timeline is the root of the RB tree that holds the runnable tasks of the CFS run queue.
Which CFS structure(s) uses rb_node?
rb_node is used by struct cfs_rq and struct sched_entity.
What is struct sched_entity?
CFS declares struct sched_entity to represent a schedulable entity.
In the basic case, a schedulable entity holds the scheduling information of a single task.
Defined in sched.h, the sched_entity structure contains the weight of the task, among other information.
struct sched_entity contains a member of type struct rb_node:
struct rb_node run_node;
Embedding the node lets the RB tree functions link a sched_entity into the tree through run_node, and lets CFS get from a node back to the sched_entity that contains it.
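Getting from an embedded node back to its containing structure can be sketched as follows. This is an illustration under assumptions, not kernel code: both structs are cut down to a few fields, and entry_of and entity_of are made-up names that mimic what rb_entry does with pointer arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures. */
struct rb_node {
    struct rb_node *rb_right;
    struct rb_node *rb_left;
};

struct sched_entity {
    unsigned long weight;            /* simplified load weight */
    struct rb_node run_node;         /* node embedded in the entity */
    unsigned long long vruntime;
};

/* Same idea as rb_entry(): step back from the member to its container. */
#define entry_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical helper: recover the entity that embeds this node. */
static struct sched_entity *entity_of(struct rb_node *node)
{
    assert(node != NULL);
    return entry_of(node, struct sched_entity, run_node);
}
```

The tree code only ever sees struct rb_node pointers; the subtraction of the member offset is what turns such a pointer back into the scheduler's own type.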
For what does cfs_rq use rb_node?
struct cfs_rq contains a member of type struct rb_node *:
struct rb_node *rb_leftmost;
rb_leftmost points to the leftmost node in the RB tree. Why this is needed is explained further below.
Which are the key functions of CFS related to the RB tree?
__enqueue_entity
– Adds a scheduling entity to the RB tree, based on the entity's virtual run time
__dequeue_entity
– Removes a scheduling entity from the RB tree
__pick_first_entity
– Returns the leftmost scheduling entity of the CFS run queue
__pick_next_entity
– Returns the scheduling entity next to the one passed
More about __enqueue_entity
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se);
Based on the entity's virtual run time (se->vruntime), the RB node (se->run_node) is inserted at the right place inside the CFS run queue tree (cfs_rq->tasks_timeline).
The run queue's leftmost node (cfs_rq->rb_leftmost) is updated if the new node has the least vruntime.
More about __dequeue_entity
static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
Removes the RB node (se->run_node) from the CFS run queue RB tree (cfs_rq->tasks_timeline).
If the leftmost node is being removed, cfs_rq->rb_leftmost is updated to the node next to the removed one.
More about __pick_first_entity
static struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
Returns the scheduling entity of the leftmost node of the RB tree (cfs_rq->rb_leftmost). Because the leftmost node is cached, no tree walk is needed.
More about __pick_next_entity
static struct sched_entity *__pick_next_entity(struct sched_entity *se)
Returns the scheduling entity of the RB node next to the one passed (se->run_node).
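The interplay of enqueue, the cached leftmost pointer and "next node" can be tried out with a small sketch. It makes loud simplifications: a plain, unbalanced binary search tree stands in for the kernel's red-black tree (no recoloring or rotation), one struct plays the role of both rb_node and sched_entity, and all names (struct node, struct runqueue, enqueue, pick_first, next_of, dequeue_first) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct node {                        /* stand-in for rb_node + sched_entity */
    unsigned long long vruntime;
    struct node *left, *right, *parent;
};

struct runqueue {                    /* stand-in for cfs_rq */
    struct node *root;               /* plays the role of tasks_timeline */
    struct node *leftmost;           /* plays the role of rb_leftmost */
};

/* Insert by vruntime and keep the leftmost cache current (no rebalancing). */
static void enqueue(struct runqueue *rq, struct node *se)
{
    struct node **link = &rq->root, *parent = NULL;
    int is_leftmost = 1;

    assert(se != NULL);
    while (*link) {
        parent = *link;
        if (se->vruntime < parent->vruntime) {
            link = &parent->left;
        } else {
            link = &parent->right;
            is_leftmost = 0;         /* went right at least once */
        }
    }
    se->left = se->right = NULL;
    se->parent = parent;
    *link = se;
    if (is_leftmost)
        rq->leftmost = se;
}

/* O(1): just return the cached leftmost node. */
static struct node *pick_first(const struct runqueue *rq)
{
    return rq->leftmost;
}

/* In-order successor, the job rb_next() does in the kernel. */
static struct node *next_of(struct node *n)
{
    if (n->right) {
        n = n->right;
        while (n->left)
            n = n->left;
        return n;
    }
    while (n->parent && n == n->parent->right)
        n = n->parent;
    return n->parent;
}

/* Remove the leftmost node; it never has a left child. */
static struct node *dequeue_first(struct runqueue *rq)
{
    struct node *n = rq->leftmost;

    if (!n)
        return NULL;
    rq->leftmost = next_of(n);       /* update the cache before unlinking */
    if (n->right)
        n->right->parent = n->parent;
    if (n->parent)
        n->parent->left = n->right;
    else
        rq->root = n->right;
    return n;
}
```

Enqueueing entities with vruntime 30, 10 and 20 and then calling dequeue_first repeatedly hands them back in the order 10, 20, 30, which is the order in which CFS would pick them.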
This article explains how nice (the user-level process priority) affects the Linux scheduler (CFS).
What is nice?
In simple words, nice is a way to influence how a process is scheduled in Linux.
What is possible range of nice?
Possible nice value range is: -20 to 19
How does nice value affect priority?
The higher the nice value, the nicer the process is to other processes!
In other words, the higher the nice value, the lower the priority.
So a process with a nice value of -20 has the highest priority and one with 19 has the lowest.
How can I see nice value of a process?
(a) Use ps command.
ps -o pid,comm,nice -p <PID> will display the PID, command and nice value of the process.
(b) Use top command
top displays a column NI, indicating nice value.
What is default nice value of a process?
The default nice value is zero
How can one change the nice value of a process?
There are two options: the command line and system calls.
Which commands can I use?
nice to set the priority when issuing a new command
renice to change the priority of an existing process
Which system calls?
nice() system call
int nice(int inc);
nice() adds inc to the nice value for the calling process.
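A minimal sketch of calling nice() from a program is shown below. The wrapper raise_nice and its error marker -100 are made up for illustration; the sketch assumes Linux with glibc, where getpriority() reports the nice value in the range -20 to 19. Because -1 is also a legal return value of nice(), errno has to be cleared first and checked afterwards.

```c
#define _DEFAULT_SOURCE              /* expose nice() under strict -std modes */
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical wrapper: raise the caller's nice value by inc (inc >= 0
 * needs no privilege) and return the nice value read back afterwards.
 * Returns -100, a marker outside -20..19, if nice() fails. */
static int raise_nice(int inc)
{
    assert(inc >= 0);
    errno = 0;
    if (nice(inc) == -1 && errno != 0)
        return -100;
    return getpriority(PRIO_PROCESS, 0);
}
```

Calling raise_nice(1) in a process that runs at the default nice value 0 should leave it at 1. Going back down again (a negative inc) normally requires privilege, which is why the sketch only allows non-negative values.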
Nice and Weights of a task
How is the nice value mapped to a weight?
The weight is roughly 1024 / (1.25 ^ nice)
Some examples please
weight is 1024 for nice value 0
weight is 820 for nice value 1
weight is 1277 for nice value -1
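The difference between the rule of thumb and the values the kernel actually uses can be checked with a short sketch. The table below is reproduced from the kernel's prio_to_weight array as I know it, so treat it as something to verify against your own kernel tree; the helper names nice_to_weight and approx_weight are made up for illustration.

```c
#include <assert.h>

/* Weights for nice -20..19, as in the kernel's prio_to_weight table
 * (reproduced here for illustration; verify against your kernel source). */
static const int prio_to_weight[40] = {
 /* -20 */ 88761, 71755, 56483, 46273, 36291,
 /* -15 */ 29154, 23254, 18705, 14949, 11916,
 /* -10 */  9548,  7620,  6100,  4904,  3906,
 /*  -5 */  3121,  2501,  1991,  1586,  1277,
 /*   0 */  1024,   820,   655,   526,   423,
 /*   5 */   335,   272,   215,   172,   137,
 /*  10 */   110,    87,    70,    56,    45,
 /*  15 */    36,    29,    23,    18,    15,
};

/* Hypothetical helper: table lookup for a nice value. */
static int nice_to_weight(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return prio_to_weight[nice + 20];
}

/* Hypothetical helper: the rule of thumb, 1024 / 1.25^nice. */
static double approx_weight(int nice)
{
    double w = 1024.0;

    for (int i = 0; i < nice; i++)
        w /= 1.25;                   /* positive nice: lighter */
    for (int i = 0; i > nice; i--)
        w *= 1.25;                   /* negative nice: heavier */
    return w;
}
```

For nice 1 the formula gives 819.2 where the table has 820, and for nice -1 it gives 1280 where the table has 1277, which is why the text says "roughly".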
If weight indicates how much CPU time a process gets, nice seems to have a non-linear effect on the time consumed!
Yes, that's right.
The nice value has an exponential effect!
Is there an easy way to understand this effect?
Whole scheme has been designed to implement the below idea
nice value down by 1 – 10% more CPU
nice value up by 1 – 10% less CPU
How does weight of a process affect its CPU availability?
In simple terms,
weight(RunQueue) = sum of the weights of all tasks in the queue.
proportion of CPU allotted = weight(process)/weight(RunQueue)
Too much theory…some examples please…
Consider two processes, A and B with nice value 0 (default) and running on single CPU.
nice(A) = 0 => weight(A) = 1024
nice(B) = 0 => weight(B) = 1024.
Assuming single CPU and no other process running, run queue of CPU has two tasks.
weight(RunQueue) = weight(A) + weight(B) = 2048.
proportion of CPU allotted = weight(process)/weight(RunQueue)
cpuPercentage(A) = weight(A)/weight(RunQueue) = 1024/2048 = 50%
Needless to say cpuPercentage(B) = 50%
What if A has nice value 1 and B has 0
weight(A) with nice 1 is 820
weight(B) with nice 0 is 1024
weight(RunQueue) = weight(A) + weight(B) = 1844.
cpuPercentage(A) = weight(A)/weight(RunQueue) = 820/1844 = ~45%
cpuPercentage(B) = weight(B)/weight(RunQueue) = 1024/1844 = ~55%
So by increasing the nice value of A to 1, its CPU share ends up about 10 percentage points below that of B, the process with nice value 0.
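The arithmetic of these examples can be written as one small function. This is a sketch of the proportion formula only, not of how the kernel accounts CPU time; the function name cpu_share is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: share of one CPU for the task at index idx,
 * i.e. its weight divided by the sum of all weights in the run queue. */
static double cpu_share(const int *weights, size_t n, size_t idx)
{
    long total = 0;

    assert(idx < n);
    for (size_t i = 0; i < n; i++)
        total += weights[i];
    return (double)weights[idx] / (double)total;
}
```

With weights {1024, 1024} both tasks get 0.5; with {820, 1024} the results are about 0.445 and 0.555, matching the 45% and 55% worked out above.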
Experiments with nice on a real Linux machine
# Create two busy-wait processes. The taskset command pins each task to the specified CPU; CPU #5 was used on this system.
taskset -c 5 dd if=/dev/zero of=/dev/null &
taskset -c 5 dd if=/dev/zero of=/dev/null &
# Check the CPU load using top. Pressing '1' while top is running shows the load of individual CPUs.
# top showed that the processes were using 50% CPU each.
# Change the nice value of one of the processes to 1; 10211 was its PID. CPU usage then changed to about 55% for one and 44% for the other, as expected.
renice -n 1 10211
# Change the nice value of the same process (PID 10211) to 5. CPU usage then changed to about 75% for one and 25% for the other, as expected.
renice -n 5 10211
This article explains the Linux scheduler's terminology.
What is a task?
The Linux scheduler deals with tasks. Each process and each thread is a task in the eyes of the scheduler. A task is represented by the task_struct structure.
What is a scheduling policy?
The scheduling policy controls how the scheduler manages tasks.
What policies does Linux support?
Linux supports five of them: SCHED_FIFO, SCHED_RR, SCHED_IDLE, SCHED_BATCH and SCHED_OTHER
Which is the most common scheduling policy?
SCHED_OTHER – the default Linux time-sharing scheduling policy, used by the majority of processes
What is CFS?
CFS stands for Completely Fair Scheduler. It is the default Linux kernel scheduler.
What is a run queue?
A run queue can be thought of as the list of tasks that can run on a CPU.
What is a time slice?
The time a task runs on the CPU before being preempted.
What is task weight?
Each task has a weight. This weight is related to the priority of the process (or thread): the higher the priority, the larger the weight.
This article explains the Linux scheduler's latency parameters sched_min_granularity_ns and sched_latency_ns.
What is sched_min_granularity_ns?
sched_min_granularity_ns is a scheduler tunable.
It decides the minimum time a task is allowed to run on the CPU before being preempted.
By default, it is set to 4ms. So by default, any task runs for at least 4ms before getting preempted.
What is sched_latency_ns?
sched_latency_ns is a scheduler tunable.
Together, sched_latency_ns and sched_min_granularity_ns decide the scheduler period, the period in which all tasks on the run queue are scheduled at least once.
By default, sched_latency_ns is set to 20ms.
How does the scheduler period depend on sched_latency_ns and sched_min_granularity_ns?
If the number of runnable tasks does not exceed sched_latency_ns/sched_min_granularity_ns:
scheduler period = sched_latency_ns
Otherwise:
scheduler period = number_of_running_tasks * sched_min_granularity_ns
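The rule above translates into a few lines of C. This is a sketch of the rule as stated in the text, not a copy of the kernel function; the name sched_period_ns and its parameter list are made up for illustration, and all times are in nanoseconds.

```c
#include <assert.h>

/* Hypothetical helper: scheduler period for nr_running runnable tasks,
 * given the two tunables (all values in nanoseconds). */
static unsigned long long sched_period_ns(unsigned long long latency_ns,
                                          unsigned long long min_gran_ns,
                                          unsigned int nr_running)
{
    assert(min_gran_ns > 0);         /* guards the division below */

    /* Number of tasks that fit into one latency period. */
    unsigned long long nr_latency = latency_ns / min_gran_ns;

    if (nr_running <= nr_latency)
        return latency_ns;
    /* Too many tasks: stretch the period so each still gets min_gran_ns. */
    return (unsigned long long)nr_running * min_gran_ns;
}
```

With the defaults quoted here (20ms and 4ms), up to 5 runnable tasks share a 20ms period; 6 tasks give 24ms and 20 tasks give 80ms.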
Example – On a Linux system with default values, there are two processes, both busy waiting. What will be the time slice of each task?
By default, sched_latency_ns = 20ms and sched_min_granularity_ns = 4ms
Time slice = scheduler period * (task's weight/total weight of tasks in the run queue)
Number of running tasks = 2, which does not exceed 20ms/4ms = 5, so the scheduler period is 20ms
The weight of each task is the same. Hence (task's weight/total weight of tasks in the run queue) is 1/2
Hence the time slice is 20ms * 0.5 = 10ms
What happens when there are 20 such busy-wait processes?
With a 20ms period, each task would get only 1ms (= 20ms * 1/20). But this is less than sched_min_granularity_ns.
sched_min_granularity_ns decides the minimum time a task will run. By default, sched_min_granularity_ns is set to 4ms.
So with 20 busy-wait processes the scheduler period stretches to 20 * 4ms = 80ms, and each process runs for 4ms before it is preempted.
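Both worked examples follow from the time slice formula once the scheduler period is known. The sketch below takes the period as an input and applies the formula; the function name time_slice_ns is made up for illustration and times are in nanoseconds.

```c
#include <assert.h>

/* Hypothetical helper: time slice = period * weight / total weight. */
static unsigned long long time_slice_ns(unsigned long long period_ns,
                                        unsigned long weight,
                                        unsigned long total_weight)
{
    assert(total_weight > 0 && weight <= total_weight);
    return period_ns * weight / total_weight;
}
```

Two equal tasks of weight 1024 in a 20ms period give time_slice_ns(20000000, 1024, 2048), which is 10ms. Twenty such tasks in the stretched 80ms period give time_slice_ns(80000000, 1024, 20480), which is 4ms, the minimum granularity.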