最近在用netty处理Http请求时,需要用到队列,一直听说Disruptor的RingBuffer比JDK的队列性能更好,因此准备先大概了解下实现原理;
RingBuffer
RingBuffer本质上就是个队列,它的成员变量包括:
private final Object[] entries;
protected final int bufferSize;
protected final Sequencer sequencer;
其中entries的length为bufferSize,首尾相连,形成一个环状;Sequence是Disruptor中比较重要的一个概念,后续会讲到;
Disruptor的RingBuffer实现有几个要注意的地方:
CacheLine Padding:
定义了8个long类型变量,长度为8*8=64,而通常情况下的CPU缓存行大小为64字节(可选范围为32-256),因此可以解决多核环境下的缓存行失效问题;
查看cpu一级缓存大小:
cat /sys/devices/system/cpu/cpu0/cache/index0/size
cat /sys/devices/system/cpu/cpu0/cache/index0/type
cat /sys/devices/system/cpu/cpu0/cache/index0/level
查看缓存行大小:
cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
生产者:
RingBuffer有生产者和消费者,生产者将元素添加到队列;在具体的实现上是通过next和publish方法实现的:
- next方法:
public long next(){ return sequencer.next();}
可以看到具体的实现是通过MultiProducerSequencer或SingleProducerSequencer的next方法实现的:
- SingleProducerSequencer
public long next(int n)
{
if (n < 1)
{
throw new IllegalArgumentException("n must be > 0");
}
long nextValue = this.nextValue;
long nextSequence = nextValue + n;
long wrapPoint = nextSequence - bufferSize;
long cachedGatingSequence = this.cachedValue;
if (wrapPoint > cachedGatingSequence || cachedGatingSequence > nextValue)
{
long minSequence;
while (wrapPoint > (minSequence = Util.getMinimumSequence(gatingSequences, nextValue)))
{
LockSupport.parkNanos(1L); // TODO: Use waitStrategy to spin?
}
this.cachedValue = minSequence;
}
this.nextValue = nextSequence;
return nextSequence;
}
由于生产者为单线程,因此不用考虑并发,通过wrapPoint > (minSequence = Util.getMinimumSequence(gatingSequences, nextValue))
判断队列是否已满(消费者速度慢,导致缓存队列满);如果队列已满,则通过LockSupport.parkNanos(1L)暂停线程执行,循环判断直至可用;可以看到作者在此添加了注释,或许后续版本会采用WaitStrategy来实现;
- MultiProducerSequencer
public long next(int n)
{
if (n < 1)
{
throw new IllegalArgumentException("n must be > 0");
}
long current;
long next;
do
{
current = cursor.get();
next = current + n;
long wrapPoint = next - bufferSize;
long cachedGatingSequence = gatingSequenceCache.get();
if (wrapPoint > cachedGatingSequence || cachedGatingSequence > current)
{
long gatingSequence = Util.getMinimumSequence(gatingSequences, current);
if (wrapPoint > gatingSequence)
{
LockSupport.parkNanos(1); // TODO, should we spin based on the wait strategy?
continue;
}
gatingSequenceCache.set(gatingSequence);
}
else if (cursor.compareAndSet(current, next))
{
break;
}
}
while (true);
return next;
}
可用看到是通过cursor.compareAndSet(current, next),也就是CAS的方式来实现的;
生产者占位成功以后,通过publish方法将元素添加到队列,并更新状态,具体的逻辑如下:
SingleProducerSequencer:
public void publish(long sequence)
{
cursor.set(sequence);
waitStrategy.signalAllWhenBlocking();
}
MultiProducerSequencer:
public void publish(final long sequence)
{
setAvailable(sequence);
waitStrategy.signalAllWhenBlocking();
}
可以看到disruptor生产者唯一存在竞争的地方在于获取sequence;
消费者
直接上代码,见Disruptor类:
EventHandlerGroup<T> createEventProcessors(final Sequence[] barrierSequences,
final EventHandler<? super T>[] eventHandlers)
{
checkNotStarted();
final Sequence[] processorSequences = new Sequence[eventHandlers.length];
final SequenceBarrier barrier = ringBuffer.newBarrier(barrierSequences);
for (int i = 0, eventHandlersLength = eventHandlers.length; i < eventHandlersLength; i++)
{
final EventHandler<? super T> eventHandler = eventHandlers[i];
final BatchEventProcessor<T> batchEventProcessor = new BatchEventProcessor<T>(ringBuffer, barrier, eventHandler);
if (exceptionHandler != null)
{
batchEventProcessor.setExceptionHandler(exceptionHandler);
}
consumerRepository.add(batchEventProcessor, eventHandler, barrier);
processorSequences[i] = batchEventProcessor.getSequence();
}
if (processorSequences.length > 0)
{
consumerRepository.unMarkEventProcessorsAsEndOfChain(barrierSequences);
}
return new EventHandlerGroup<T>(this, consumerRepository, processorSequences);
}
可用看到,每个EventProcessor(默认BatchEventProcessor)都有自己的Sequence,也就是说队列中的每个消息,每个EventProcessor都要消费一次;那么当有多个Handler时,Handler的处理顺序是怎么样的?这个问题留待后面再回答;
Disruptor通过start方法,启动消费者:
public RingBuffer<T> start()
{
final Sequence[] gatingSequences = consumerRepository.getLastSequenceInChain(true);
ringBuffer.addGatingSequences(gatingSequences);
checkOnlyStartedOnce();
for (final ConsumerInfo consumerInfo : consumerRepository)
{
consumerInfo.start(executor);
}
return ringBuffer;
}
public void start(final Executor executor)
{
executor.execute(eventprocessor);
}
executor是java.util.concurrent.Executor类型的对象,根据需要可用采用JDK提供的ThreadPoolExecutor,也可以自己实现;但从上面代码也可以看出,通过executor.execute方式启动eventprocessor,也就是disruptor通常会将每个handler放在独立的线程中进行处理;但这有个问题,比如http请求,如果使用一个handler进行实现,那意味着所有http请求将由一个线程进行处理,那岂不是性能反而下降了?那disruptor的优势体现在哪呢?
继续分析代码,consumerInfo.start(executor)实际上调用的是BatchEventProcessor的run方法:
public void run()
{
if (!running.compareAndSet(false, true))//避免重复启动
{
throw new IllegalStateException("Thread is already running");
}
sequenceBarrier.clearAlert();//清空alerted标志
notifyStart();//如果实现了LifecycleAware,触发onStart事件
T event = null;
long nextSequence = sequence.get() + 1L;//下一个消息的读索引
try
{
while (true)
{
try
{
final long availableSequence = sequenceBarrier.waitFor(nextSequence);//从队列中获取可用的消息索引,如果队列为空,则会等待,依赖于WaitStrategy
while (nextSequence <= availableSequence)
{
event = dataProvider.get(nextSequence);
eventHandler.onEvent(event, nextSequence, nextSequence == availableSequence);
nextSequence++;
}
sequence.set(availableSequence);
}
catch (final TimeoutException e)
{
notifyTimeout(sequence.get());
}
catch (final AlertException ex)
{
if (!running.get())
{
break;
}
}
catch (final Throwable ex)
{//如果发生异常,会调用exceptionHandler进行处理,流程并不会中断
exceptionHandler.handleEventException(ex, nextSequence, event);
sequence.set(nextSequence);
nextSequence++;
}
}
}
finally
{
notifyShutdown();
running.set(false);
}
}
可以看到run方法是通过sequenceBarrier.waitFor(nextSequence)获取下一个可用的事件;
下面看其具体实现,见ProcessingSequenceBarrier类:
public long waitFor(final long sequence)
throws AlertException, InterruptedException, TimeoutException
{
checkAlert();
long availableSequence = waitStrategy.waitFor(sequence, cursorSequence, dependentSequence, this);
if (availableSequence < sequence)
{
return availableSequence;
}
return sequencer.getHighestPublishedSequence(sequence, availableSequence);
}
这段代码中,waitStrategy出场了,waitStrategy有如下几种实现:
- BlockingWaitStrategy
- BusySpinWaitStrategy
- LiteBlockingWaitStrategy
- PhasedBackoffWaitStrategy
- SleepingWaitStrategy
- TimeoutBlockingWaitStrategy
- YieldingWaitStrategy
策略比较多,就不一一看了,拿其中的BlockingWaitStrategy和BusySpinWaitStrategy为例:
public long waitFor(long sequence, Sequence cursorSequence, Sequence dependentSequence, SequenceBarrier barrier)
throws AlertException, InterruptedException
{
long availableSequence;
if ((availableSequence = cursorSequence.get()) < sequence)
{
lock.lock();
try
{
while ((availableSequence = cursorSequence.get()) < sequence)
{
barrier.checkAlert();
processorNotifyCondition.await();
}
}
finally
{
lock.unlock();
}
}
while ((availableSequence = dependentSequence.get()) < sequence)
{
barrier.checkAlert();
}
return availableSequence;
}
可以看到BlockingWaitStrategy是通过锁的方式实现的,因此性能比较一般,但cpu使用率会比较稳定;
另外注意到这里有个变量dependentSequence,可以看到不同handler之间是存在依赖关系的,定义依赖关系的示例代码如下:
Executor executor = Executors.newFixedThreadPool(4);
Disruptor<DataEvent> disruptor = new Disruptor<DataEvent>(
DataEvent.FACTORY, 1024, DaemonThreadFactory.INSTANCE);
TransformingHandler handler1 = new TransformingHandler(0);
TransformingHandler handler2 = new TransformingHandler(1);
TransformingHandler handler3 = new TransformingHandler(2);
CollatingHandler collator = new CollatingHandler();
disruptor.handleEventsWith(handler1, handler2, handler3).then(collator);
disruptor.start();
BusySpinWaitStrategy实现:
public long waitFor(final long sequence, Sequence cursor, final Sequence dependentSequence, final SequenceBarrier barrier)
throws AlertException, InterruptedException
{
long availableSequence;
while ((availableSequence = dependentSequence.get()) < sequence)
{
barrier.checkAlert();
}
return availableSequence;
}
可以看到当队列没有可消费的数据时,会一直执行循环,因此会造成CPU的使用率很高,需要特别注意;
Sequence
Disruptor中很多地方都要用到Sequence,比如生产者获取下一个可写的索引,因此特别看看它的实现是如何解决并发的:
public class Sequence extends RhsPadding
{
private static final Unsafe UNSAFE;
private static final long VALUE_OFFSET;//value在内存中的偏移量
static
{
UNSAFE = Util.getUnsafe();
try
{
VALUE_OFFSET = UNSAFE.objectFieldOffset(Value.class.getDeclaredField("value"));
}
catch (final Exception e)
{
throw new RuntimeException(e);
}
}
public Sequence(final long initialValue)
{
UNSAFE.putOrderedLong(this, VALUE_OFFSET, initialValue);
}
/**
* Perform an ordered write of this sequence. The intent is
* a Store/Store barrier between this write and any previous
* store.
该方法会插入StoreStore Memory Barrier,保证在该方法之前的Store操作都会提交;
*
* @param value The new value for the sequence.
*/
public void set(final long value)
{
UNSAFE.putOrderedLong(this, VALUE_OFFSET, value);
}
/**
* Performs a volatile write of this sequence. The intent is
* a Store/Store barrier between this write and any previous
* write and a Store/Load barrier between this write and any
* subsequent volatile read.
* 插入StoreStore和StoreLoad Memory Barrier,保证变量在写之后会立即提交到主内存,从而保证了可见性;
* @param value The new value for the sequence.
*/
public void setVolatile(final long value)
{
UNSAFE.putLongVolatile(this, VALUE_OFFSET, value);
}
/**
* Perform a compare and set operation on the sequence.
*
* @param expectedValue The expected current value.
* @param newValue The value to update to.
* @return true if the operation succeeds, false otherwise.
*/
public boolean compareAndSet(final long expectedValue, final long newValue)
{
return UNSAFE.compareAndSwapLong(this, VALUE_OFFSET, expectedValue, newValue);
}
/**
* Atomically increment the sequence by one.
*
* @return The value after the increment
*/
public long incrementAndGet()
{
return addAndGet(1L);
}
/**
* Atomically add the supplied value.
*采用CAS方式,避免加锁
* @param increment The value to add to the sequence.
* @return The value after the increment.
*/
public long addAndGet(final long increment)
{
long currentValue;
long newValue;
do
{
currentValue = get();
newValue = currentValue + increment;
}
while (!compareAndSet(currentValue, newValue));
return newValue;
}
}
源码比较长,我删除了一些不关注的方法,下面看看它的实现:
1.CacheLine Padding:15个long变量,15*8=120, 在加上对象头的长度,即使在压缩指针的情况下,也会有120+8=128个字节;
2.采用了JDK底层的Unsafe,直接访问操作系统底层:
3.采用CAS避免加锁;
结论
可以看到Disruptor采用的是类似pub-sub模式,事件会被多个消费者消费;原因在于createEventProcessors时,每个BatchEventProcessor都有单独的Sequence;
如果要避免事件被消费多次,可以采用WorkPool,一个WorkPool中的多个WorkProcessor共用一个Sequence;当然由于Disruptor为每个WorkProcessor起一个线程,因此需要注册多个WorkProcessor;
在我的项目中,根据cpu个个数,决定WorkProcessor的个数,这些WorkProcessor的逻辑都是相同的(同一个类的多个实例),从RingBuffer中抢占式获取事件进行处理;