Explicit block device plugging

http://lwn.net/Articles/438256/

Since the dawn of time, or for at least as long as I have been involved, the Linux kernel has deployed a concept called "plugging" on block devices. When I/O is queued to an empty device, that device enters a plugged state. This means that I/O isn't immediately dispatched to the low level device driver, instead it is held back by this plug. When a process is going to wait on the I/O to finish, the device is unplugged and request dispatching to the device driver is started. The idea behind plugging is to allow a buildup of requests to better utilize the hardware and to allow merging of sequential requests into one single larger request. The latter is an especially big win on most hardware; writing or reading bigger chunks of data at the time usually yields good improvements in bandwidth. With the release of the 2.6.39-rc1 kernel, block device plugging was drastically changed. Before we go into that, lets take a historic look at how plugging has evolved.

Back in the early days, plugging a device involved global state. This was before SMP scalability was an issue, and having global state made it easier to handle the unplugging. If a process was about to block for I/O, any plugged device was simply unplugged. This scheme persisted in pretty much the same form until the early versions of the 2.6 kernel, where it began to severely impact SMP scalability on I/O-heavy workloads.

In response to this problem, the plug state was turned into a per-device entity in 2004. This scaled well, but now you suddenly had no way to unplug all devices when going to sleep waiting for page I/O. This meant that the virtual memory subsystem had to be able to unplug the specific device that would be servicing page I/O. A special hack was added for this: <tt>sync_page()</tt> in <tt>struct address_space_operations</tt>; this hook would unplug the device of interest.

If you have a more complicated I/O setup with device mapper or RAID components, those layers would in turn unplug any lower-level device. The unplug event would thus percolate down the stack. Some heuristics were also added to auto-unplug the device if a certain depth of requests had been added, or if some period of time had passed before the unplug event was seen. With the asymmetric nature of plugging where the device was automatically plugged but had to be explicitly unplugged, we've had our fair share of I/O stall bugs in the kernel. While crude, the auto-unplug would at least ensure that we would chuck along if someone missed an unplug call after I/O submission.

With really fast devices hitting the market, once again plugging had become a scalability problem and hacks were again added to avoid this. Essentially we disabled plugging on solid-state devices that were able to do queueing. While plugging originally was a good win, it was time to reevaluate things. The asymmetric nature of the API was always ugly and a source of bugs, and the <tt>sync_page()</tt> hook was always hated by the memory management people. The time had come to rewrite the whole thing.

The primary use of plugging was to allow an I/O submitter to send down multiple pieces of I/O before handing it to the device. Instead of maintaining these I/O fragments as shared state in the device, a new on-stack structure was created to contain this I/O for a short period, allowing the submitter to build up a small queue of related requests. The state is now tracked in <tt>struct blk_plug</tt>, which is little more than a linked list and a <tt>should_sort</tt> flag informing <tt>blk_finish_plug()</tt> whether or not to sort this list before flushing the I/O. We'll come back to that later.

<pre style="overflow: visible; white-space: pre; color: rgb(0, 0, 0); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;"> struct blk_plug {
unsigned long magic;
struct list_head list;
unsigned int should_sort;
};
</pre>

The magic member is a temporary addition to detect uninitialized use cases, it will eventually be removed. The new API to do this is straightforward and simple to use:

<pre style="overflow: visible; white-space: pre; color: rgb(0, 0, 0); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;"> struct blk_plug plug;

blk_start_plug(&plug);
submit_batch_of_io();
blk_finish_plug(&plug);

</pre>

<tt>blk_start_plug()</tt> takes care of initializing the structure and tracking it inside the task structure of the current process. The latter is important to be able to automatically flush the queued I/O should the task end up blocking between the call to <tt>blk_start_plug()</tt> and <tt>blk_finish_plug()</tt>. If that happens, we want to ensure that pending I/O is sent off to the devices immediately. This is important from a performance perspective, but also to ensure that we don't deadlock. If the task is blocking for a memory allocation, memory management reclaim could end up wanting to free a page belonging to a request that is currently residing on our private plug. Similarly, the caller may itself end up waiting for some of the plugged I/O to finish. By flushing this list when the process goes to sleep, we avoid these types of deadlocks.

If <tt>blk_start_plug()</tt> is called and the task already has a plug structure registered, it is simply ignored. This can happen in cases where the upper layers plug for submitting a series of I/O, and further down in the call chain someone else does the same. I/O submitted without the knowledge of the original plugger will thus end up on the originally assigned plug, and be flushed whenever the original caller ends the plug by calling<tt>blk_finish_plug()</tt>, or if some part of the call path goes to sleep or is scheduled out.

Since the plug state is now device agnostic, we may end up in a situation where multiple devices have pending I/O on this plug list. These may end up on the plug list in an interleaved fashion, potentially causing <tt>blk_finish_plug()</tt> to grab and release the related queue locks multiple times. To avoid this problem, a <tt>should_sort</tt> flag in the <tt>blk_plug</tt> structure is used to keep track of whether we have I/O belonging to more than I/O distinct queue pending. If we do, the list is sorted to group identical queues together. This scales better than grabbing and releasing the same locks multiple times.

With this new scheme in place, the device need no longer be notified of unplug events. The queue <tt>unplug_fn()</tt> used to exist for this purpose alone, it has now been removed. For most drivers it is safe to just remove this hook and the related code. However, some drivers used plugging to delay I/O operations in response to resource shortages. One example of that was the SCSI midlayer; if we failed to map a new SCSI request due to a memory shortage, the queue was plugged to ensure that we would call back into the dispatch functions later on. Since this mechanism no longer exists, a similar API has been provided for such use cases. Drivers may now use blk_delay_queue() for this:

<pre style="overflow: visible; white-space: pre; color: rgb(0, 0, 0); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;"> blk_delay_queue(queue, delay_in_msecs);
</pre>

The block layer will re-invoke request queueing after the specified number of milliseconds have passed. It will be invoked from process context, just as it would have been with the unplug event. <tt>blk_delay_queue()</tt> honors the queue stopped state, so if <tt>blk_stop_queue()</tt> was called before <tt>blk_delay_queue()</tt>, or if is called after the fact but before the delay has passed, the request handler will not be invoked. <tt>blk_delay_queue()</tt> must only be used for conditions where the caller doesn't necessarily know when that condition will change states. If resources internal to the driver cause it to need to halt operations for a while, it is more efficient to use <tt>blk_stop_queue()</tt> and <tt>blk_start_queue()</tt> to manage those directly.

These changes have been merged for the 2.6.39 kernel. While a few problems have been found (and fixed), it would appear that the plugging changes have been integrated without greatly disturbing Linus's calm development cycle.

©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 204,293评论 6 478
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 85,604评论 2 381
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 150,958评论 0 337
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,729评论 1 277
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,719评论 5 366
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,630评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 38,000评论 3 397
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,665评论 0 258
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 40,909评论 1 299
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,646评论 2 321
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,726评论 1 330
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,400评论 4 321
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 38,986评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,959评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,197评论 1 260
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 44,996评论 2 349
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,481评论 2 342

推荐阅读更多精彩内容