Storm official website en

Storm

Documentation

Official website
github
github-2
Wiki
Chinese
Chinese-2


concepts

Why use Storm? more

  • Apache Storm is a free and open source distributed realtime computation system.
  • Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.
  • Storm is fast: a benchmark clocked it at over a million tuples processed per second per node.
  • It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Topologies more

  • The logic for a realtime application is packaged into a Storm topology.
  • A Storm topology is analogous to a MapReduce job. One key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (or until you kill it, of course).
  • A topology is a graph of spouts and bolts that are connected with stream groupings.

Topologies -> TopologyBuilder more

TopologyBuilder exposes the Java API for specifying a topology for Storm to execute. Topologies are Thrift structures in the end, but since the Thrift API is so verbose, TopologyBuilder greatly eases the process of creating topologies. The template for creating and submitting a topology looks something like:

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("1", new TestWordSpout(true), 5);
builder.setSpout("2", new TestWordSpout(true), 3);
builder.setBolt("3", new TestWordCounter(), 3)
         .fieldsGrouping("1", new Fields("word"))
         .fieldsGrouping("2", new Fields("word"));
builder.setBolt("4", new TestGlobalCount())
         .globalGrouping("1");

Map conf = new HashMap();
conf.put(Config.TOPOLOGY_WORKERS, 4);

StormSubmitter.submitTopology("mytopology", conf, builder.createTopology());

Running the exact same topology in local mode (in process), and configuring it to log all tuples emitted, looks like the following. Note that it lets the topology run for 10 seconds before shutting down the local cluster.

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("1", new TestWordSpout(true), 5);
builder.setSpout("2", new TestWordSpout(true), 3);
builder.setBolt("3", new TestWordCounter(), 3)
         .fieldsGrouping("1", new Fields("word"))
         .fieldsGrouping("2", new Fields("word"));
builder.setBolt("4", new TestGlobalCount())
         .globalGrouping("1");

Map conf = new HashMap();
conf.put(Config.TOPOLOGY_WORKERS, 4);
conf.put(Config.TOPOLOGY_DEBUG, true);

LocalCluster cluster = new LocalCluster();
cluster.submitTopology("mytopology", conf, builder.createTopology());
Utils.sleep(10000);
cluster.shutdown();

Streams more

  • The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion.
  • Streams are defined with a schema that names the fields in the stream's tuples. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples.
  • Every stream is given an id when declared. Since single-stream spouts and bolts are so common, OutputFieldsDeclarer has convenience methods for declaring a single stream without specifying an id. In this case, the stream is given the default id of "default".

Streams -> Tuple more

  • The tuple is the main data structure in Storm. A tuple is a named list of values, where each value can be any type. Tuples are dynamically typed – the types of the fields do not need to be declared. Tuples have helper methods like getInteger and getString to get field values without having to cast the result.
  • Storm needs to know how to serialize all the values in a tuple. By default, Storm knows how to serialize the primitive types, strings, and byte arrays. If you want to use another type, you’ll need to implement and register a serializer for that type.

Streams -> Serialization more

Spouts more

  • A spout is a source of streams in a topology. Generally spouts will read tuples from an external source and emit them into the topology (e.g. a Kestrel queue or the Twitter API).
  • Spouts can either be reliable or unreliable. A reliable spout is capable of replaying a tuple if it failed to be processed by Storm, whereas an unreliable spout forgets about the tuple as soon as it is emitted.
  • Spouts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on SpoutOutputCollector.
  • The main method on spouts is nextTuple. nextTuple either emits a new tuple into the topology or simply returns if there are no new tuples to emit. It is imperative that nextTuple does not block for any spout implementation, because Storm calls all the spout methods on the same thread.
  • The other main methods on spouts are ack and fail. These are called when Storm detects that a tuple emitted from the spout either successfully completed through the topology or failed to be completed. ack and fail are only called for reliable spouts.

Spouts -> ISpout more

Spouts -> IRichSpout more

Spouts -> Guaranteeing Message Processing more

Bolts more

  • All processing in topologies is done in bolts. Bolts can do anything from filtering, functions, aggregations, joins, talking to databases, and more.
  • Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts. For example, transforming a stream of tweets into a stream of trending images requires at least two steps: a bolt to do a rolling count of retweets for each image, and one or more bolts to stream out the top X images (you can do this particular stream transformation in a more scalable way with three bolts than with two).
  • Bolts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on OutputCollector.

Understanding the Parallelism of a Storm Topology

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
【社区内容提示】社区部分内容疑似由AI辅助生成,浏览时请结合常识与多方信息审慎甄别。
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

相关阅读更多精彩内容

  • rljs by sennchi Timeline of History Part One The Cognitiv...
    sennchi阅读 12,183评论 0 10
  • 定而后能静,静而后能安,安而后能虑,虑而后能得 我一直以为自己是个聪明人。 以前我以为,只要我想,我是能够拿起所有...
    jimmyfool阅读 1,409评论 0 0
  • 今天的故事很多,先从第一个开始吧。 一年一度的阳光行,又拉开了序幕,顺利去到小学跟老师联系后进了学校。因为提前说过...
    天涯路人各不相干阅读 2,910评论 0 0
  • 欲钓鱼,先设美味的饵 1.天底下只有一种人可以影响他人,提出他们的需要,并让他们知道如何去获得 2.成功的人际关系...
    肥肉666阅读 1,486评论 0 0
  • 回到嗜血海的时候,她的表情再次变成阴冷的诡异,但是已经不向先前那么自然了。忽然我问她:假如没有光明和黑暗分别,你会...
    鹿将归阅读 1,625评论 0 1

友情链接更多精彩内容