Linux内核进程调度

[TOC]

## 多任务操作系统的两种形式：

1. 抢占式多任务操作系统

2. 协同式多任务操作系统

## 进程分为：

1. I/O繁忙

"I/O"意味着任意类型的阻塞资源。例如键盘输入或者网络I/O，而不仅仅是磁盘I/O。大多数的GUI应用是高I/O的，即使他们并没有花费很多的时间在磁盘读写上，它们大部分时间花费在等待用户的键盘或鼠标输入上。

2. CPU繁忙

高CPU意味着花费大量的时间在执行代码。它们一直运行直到被抢占。例如一个执行无线循环的程序，ssh-keygen，MATLAB.

Linux，旨在提供很好的交互响应和桌面表现，优化了进程响应，因此相较于CPU忙的进程更偏爱I/O忙的进程。

## 进程优先级

基于优先级的调度是一种典型的调度算法。即拥有高优先级的进程比低优先级的进程先执行，相同优先级的进程间round-robin（轮训）。

Linux内核实现了两种不同的优先级范围：

- nice value：从-20到+19的整数。

- 更小的值拥有更高的优先级

- linux系统使用nice值控制时间片比例。其他的基于Unix的系统，如Mac OS X，则是基于nice值来调整绝对时间片的分配数量。

- ps -el命令可以查看进程的nice值。

- Real-time 优先级：范围从0-99

- 数值更大优先级更高

- 所有的实时进程比普通进程拥有更高的优先级

- ps -eo state,uid,pid,ppid,rtprio,time,comm 命令可以查看进程的实时优先级。“-”表示该进程不是实时进程。

Linux系统上的时间片：

Linux的CFS调度没有直接给进程分配时间片，而是进程在cpu上执行时间的比例。进程的执行时间比例受每一个进程的nice值影响。nice值如同一个权值，动态调整进程获得的cpu时间。nice值越大的进程优先级越低，因而获得的执行时间越少。

## 调度策略

```

* Scheduling policies

#define SCHED_NORMAL 0

#define SCHED_FIFO 1

#define SCHED_RR 2

#define SCHED_BATCH 3

/* SCHED_ISO: reserved but not implemented yet */

#define SCHED_IDLE 5

#define SCHED_DEADLINE 6

```

## Linux的进程调度算法

### 调度类

不同的调度类实现了不同的调度算法，用于满足不同类型的进程。

内核里面包含了一下几种调度类实现：

> dl_sched_class

fair_sched_class

idle_sched_class

rt_sched_class

stop_sched_class

普通进程注册CFS调度类，通过SCHED_NORMAL宏。

```c

const struct sched_class fair_sched_class = {

.next = &idle_sched_class,

.enqueue_task = enqueue_task_fair,

.dequeue_task = dequeue_task_fair,

.yield_task = yield_task_fair,

.yield_to_task = yield_to_task_fair,

.check_preempt_curr = check_preempt_wakeup,

.pick_next_task = pick_next_task_fair,

.put_prev_task = put_prev_task_fair,

#ifdef CONFIG_SMP

.select_task_rq = select_task_rq_fair,

#ifdef CONFIG_FAIR_GROUP_SCHED

.migrate_task_rq = migrate_task_rq_fair,

#endif

.rq_online = rq_online_fair,

.rq_offline = rq_offline_fair,

.task_waking = task_waking_fair,

#endif

.set_curr_task = set_curr_task_fair,

.task_tick = task_tick_fair,

.task_fork = task_fork_fair,

.prio_changed = prio_changed_fair,

.switched_from = switched_from_fair,

.switched_to = switched_to_fair,

.get_rr_interval = get_rr_interval_fair,

.update_curr = update_curr_fair,

#ifdef CONFIG_FAIR_GROUP_SCHED

.task_move_group = task_move_group_fair,

#endif

};

```

CFS为其近似无线小的调度周期设置了一个target。这个target被称为调度延迟（targeted latency）。CFS设置了一个CPU执行时间的最小粒度，默认为1毫秒。即存在无线个进程，每一个的最小执行时间也是1毫秒。

## Linux内核调度实现

CFS的实现代码在kernel/sched_fair.c文件中。主要实现CFS的4个组成部分：

- 时间统计

- 进程选择

- 调度进入点

- 休眠和唤醒

### Time Accounting

CFS使用调度实体结构体，struct sched_entity，来记录time accounting，结构体在<linux/sched.h>文件中定义。

```c

struct sched_entity {

struct load_weight load; /* for load-balancing */

struct rb_node run_node;

struct list_head group_node;

unsigned int on_rq;

u64 exec_start;

u64 sum_exec_runtime;

u64 vruntime;

u64 prev_sum_exec_runtime;

u64 nr_migrations;

#ifdef CONFIG_FAIR_GROUP_SCHED

struct sched_entity *parent;

/* rq on which this entity is (to be) queued: */

struct cfs_rq *cfs_rq;

/* rq "owned" by this entity/group: */

struct cfs_rq *my_q;

#endif

* Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be

* removed when useful for applications beyond shares distribution (e.g.

* load-balance).

#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)

/* Per-entity load-tracking */

struct sched_avg avg;

#endif

RH_KABI_USE(1, struct sched_statistics *statistics)

/* reserved for Red Hat */

RH_KABI_RESERVE(2)

RH_KABI_RESERVE(3)

RH_KABI_RESERVE(4)

};

```

调度实体作为进程结构体（struct task_struct）中的一个成员变量se。

优先级nice值与权值得映射关系定义在*prio_to_weight*中。计算公式为：weight=1024/(1.25)^(nice)。

**virtual runtime**

vruntime存储一个进程的虚拟运行时间，即实际运行时间由可运行的进程数量加权计算获得。可用来衡量一个进程运行了多长时间以及多久未运行。

函数*update_curr()*用来计算当前进程的执行时间，定义在kernel/sched_fair.c文件中。

```

* Update the current task's runtime statistics.

static void update_curr(struct cfs_rq *cfs_rq)

{

struct sched_entity *curr = cfs_rq->curr;

u64 now = rq_clock_task(rq_of(cfs_rq));

u64 delta_exec;

if (unlikely(!curr))

return;

delta_exec = now - curr->exec_start;

if (unlikely((s64)delta_exec <= 0))

return;

curr->exec_start = now;

schedstat_set(curr->statistics->exec_max,

max(delta_exec, curr->statistics->exec_max));

curr->sum_exec_runtime += delta_exec;

schedstat_add(cfs_rq, exec_clock, delta_exec);

curr->vruntime += calc_delta_fair(delta_exec, curr);

update_min_vruntime(cfs_rq);

if (entity_is_task(curr)) {

struct task_struct *curtask = task_of(curr);

trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);

cpuacct_charge(curtask, delta_exec);

account_group_exec_runtime(curtask, delta_exec);

}

account_cfs_rq_runtime(cfs_rq, delta_exec);

}

```

update_curr()计算当前进程的执行时间存储在*delta_exec*中。然后传递给__update_curr()，其使用可运行的进程数量进行加权计算。最后当前进程的vruntime值为之前的值再加上加权计算的值。

```

* Update the current task's runtime statistics. Skip current tasks that

* are not in our scheduling class.

static inline void

__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,

unsigned long delta_exec)

{

unsigned long delta_exec_weighted;

schedstat_set(curr->exec_max, max((u64)delta_exec, curr->exec_max));

curr->sum_exec_runtime += delta_exec;

schedstat_add(cfs_rq, exec_clock, delta_exec);

delta_exec_weighted = calc_delta_fair(delta_exec, curr);

curr->vruntime += delta_exec_weighted;

update_min_vruntime(cfs_rq);

}

```

update_curr()被系统定时器调用，以及当一个进程变为可执行或者阻塞或者挂起的时候。因此，vruntime可以最为一个进程运行时间和下一个执行进程的精确衡量标准。

### 进程选择

CFS balance一个进程的vruntime使用一下简单的规则：当决定下一个执行进程的时候，挑选具有最小vruntime的进程执行。

CFS使用红黑树管理可执行进程，内核红黑树定义在include/linux/rbtree.h，lib/rbtree.c。

```

static struct sched_entity *__pick_next_entity(struct sched_entity *se)

{

struct rb_node *next = rb_next(&se->run_node);

if (!next)

return NULL;

return rb_entry(next, struct sched_entity, run_node);

}

```

__pick_next_entity()，不同的内核版本对于该函数的实现不太一样，我使用的内核版本直接遍历了红黑树，而有些版本是缓存在rb_leftmost变量里的。这里遍历红黑树的代价并不高，红黑树遍历的算法复杂度为O（logN），N是节点数量。

该函数的返回值是CFS下一个将要执行的进程。如果返回NULL，表明最左边的节点为空。这种情况说明没有可运行的进程，因此CFS会调度空闲任务。

### 添加一个进程

当一个进程变成可执行状态（唤醒）或者通过fork（）首次创建的时候，CFS会调用*enqueue_entity()*将其添加到红黑树中。

```

static void

enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)

{

* Update the normalized vruntime before updating min_vruntime

* through callig update_curr().

if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_WAKING))

se->vruntime += cfs_rq->min_vruntime;

* Update run-time statistics of the 'current'.

update_curr(cfs_rq);

enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);

account_entity_enqueue(cfs_rq, se);

update_cfs_shares(cfs_rq);

if (flags & ENQUEUE_WAKEUP) {

place_entity(cfs_rq, se, 0);

if (schedstat_enabled())

enqueue_sleeper(cfs_rq, se);

}

check_schedstat_required();

if (schedstat_enabled()) {

update_stats_enqueue(cfs_rq, se);

check_spread(cfs_rq, se);

}

if (se != cfs_rq->curr)

__enqueue_entity(cfs_rq, se);

se->on_rq = 1;

if (cfs_rq->nr_running == 1) {

list_add_leaf_cfs_rq(cfs_rq);

check_enqueue_throttle(cfs_rq);

}

```

该函数会调用实际的插入操作函数*__enqueue_entity()*。

```

* Enqueue an entity into the rb-tree:

static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)

{

struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;

struct rb_node *parent = NULL;

struct sched_entity *entry;

int leftmost = 1;

* Find the right place in the rbtree:

while (*link) {

parent = *link;

entry = rb_entry(parent, struct sched_entity, run_node);

* We dont care about collisions. Nodes with

* the same key stay together.

if (entity_before(se, entry)) {

link = &parent->rb_left;

} else {

link = &parent->rb_right;

leftmost = 0;

}

* Maintain a cache of leftmost tree entries (it is frequently

* used):

if (leftmost)

cfs_rq->rb_leftmost = &se->run_node;

rb_link_node(&se->run_node, parent, link);

rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);

}

```

Linux内核进程调度

推荐阅读更多精彩内容