Spark 2.0

What is Apache Spark

Apache Spark is a highly scalable open source cluster computing framework and data processing engine. Originally developed at UC Berkeley’s AMPLab in 2009, it was open sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation (ASF) in 2013 and is now distributed under the Apache License 2.0.
Spark provides a unified, comprehensive framework that can handle the varied requirements of processing large datasets. It offers high-level APIs in Java, Scala, Python and R, along with a rich set of higher-level tools referred to as libraries.

Spark 2.0 – What’s New
With the upcoming release of Spark 2.0, there have been some significant improvements in the API, libraries and abstraction layers. Spark 2.0 attempts to improve on these three components and is said to be 10X faster than Spark 1.x.
Let’s take a look at some of the changes in Spark 2.0.

More SQL Friendly – SQL 2003 Compliant
SQL is one of the primary interfaces that Spark applications use. Spark 2.0 introduces a new ANSI SQL parser that provides good error reporting. It also supports subqueries, both correlated and uncorrelated, and can run all 99 TPC-DS queries.
This is a major improvement that should encourage moving applications from traditional SQL engines to Spark.
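
To make the two kinds of subquery concrete, here is a minimal sketch in standard SQL. It uses Python's built-in sqlite3 module as a stand-in engine so that it runs without a cluster; it is not Spark code, and the employees table and its values are made up for illustration. The statements show the general shape of correlated and uncorrelated subqueries; they have not been run against Spark here.

```python
import sqlite3

# Hypothetical sample data. sqlite3 stands in for a SQL engine; this is not Spark.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("ann", "eng", 120),
        ("bob", "eng", 100),
        ("cat", "ops", 90),
        ("dan", "ops", 70),
    ],
)

# Uncorrelated subquery: the inner query does not refer to the outer row,
# so it can be evaluated once (the company-wide average salary).
uncorrelated = conn.execute(
    """
    SELECT name FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
    ORDER BY name
    """
).fetchall()

# Correlated subquery: the inner query refers to the outer row (e.dept),
# so it is logically evaluated per outer row (the average within the same department).
correlated = conn.execute(
    """
    SELECT e.name FROM employees AS e
    WHERE e.salary > (
        SELECT AVG(i.salary) FROM employees AS i WHERE i.dept = e.dept
    )
    ORDER BY e.name
    """
).fetchall()

above_company_avg = [row[0] for row in uncorrelated]
above_dept_avg = [row[0] for row in correlated]
print(above_company_avg)
print(above_dept_avg)
```

The only difference between the two queries is whether the inner SELECT mentions a column of the outer row, which is what separates an uncorrelated subquery from a correlated one.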

Unified API – DataFrames & Datasets
DataFrame is a higher-level structured data API introduced in Spark 1.3 in 2015. In a nutshell, a DataFrame is a collection of rows with a schema. It provides better performance, ease of use and flexibility than the RDD (Resilient Distributed Dataset) API.
For users who prefer type safety, a new API called Dataset was introduced in Spark 1.6. Dataset is an attempt to provide type safety on top of DataFrame.
In Spark 2.0 the two APIs are unified into a single API. Starting in Spark 2.0, DataFrame is just a type alias for Dataset of Row. The new Dataset API includes both typed and untyped methods.

SparkSession – Single Entry Point
Spark 1.6 provided the SparkContext API to connect to a Spark cluster, along with several other contexts for different APIs. For instance, working with SQL required SQLContext, and streaming required StreamingContext. A common source of confusion when using the DataFrame API was deciding which “context” to use.
Spark 2.0 introduces SparkSession, which provides a single entry point for the DataFrame and Dataset APIs. For now SparkSession covers SQLContext and HiveContext. The plan is to extend it to StreamingContext as well.
Please note that SQLContext and HiveContext are still present in Spark 2.0 for backward compatibility.

Spark as a Compiler – Faster Spark
Spark is known for its performance and speed. Spark 2.0 attempts to take this performance a step further. Spark 1.x, like many other modern data engines, executes a query through many function calls, and a large share of the CPU cycles goes to this overhead rather than to useful work.
Spark 2.0 includes the second-generation Tungsten engine. This new engine works by taking the query plan and collapsing it into a single function, which eliminates the unnecessary function calls. The engine keeps intermediate data in CPU registers, unlike the traditional method of storing intermediate data in memory. This approach promises around a 10X improvement in performance, depending on the workload.
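
The difference between the two execution styles can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's Tungsten engine: the Scan, Filter and Project classes, the sample query and the hand-written fused function are all hypothetical, and a Python local variable only stands in for the idea of keeping intermediate data in registers.

```python
from typing import Callable, Iterable, Iterator, List


# Style 1: one operator object per plan node, chained through per-row calls.
class Scan:
    def __init__(self, data: Iterable[int]) -> None:
        self.data = data

    def rows(self) -> Iterator[int]:
        yield from self.data


class Filter:
    def __init__(self, child, predicate: Callable[[int], bool]) -> None:
        self.child = child
        self.predicate = predicate

    def rows(self) -> Iterator[int]:
        for row in self.child.rows():
            if self.predicate(row):  # one extra call per row
                yield row


class Project:
    def __init__(self, child, fn: Callable[[int], int]) -> None:
        self.child = child
        self.fn = fn

    def rows(self) -> Iterator[int]:
        for row in self.child.rows():
            yield self.fn(row)  # another call per row


def run_operator_tree(data: List[int]) -> int:
    # Plan: scan -> keep even numbers -> multiply by 10 -> sum
    plan = Project(Filter(Scan(data), lambda x: x % 2 == 0), lambda x: x * 10)
    return sum(plan.rows())


# Style 2: the same plan collapsed by hand into a single function with one loop.
def run_fused(data: List[int]) -> int:
    total = 0  # intermediate state stays in a local variable
    for x in data:
        if x % 2 == 0:
            total += x * 10
    return total


data = list(range(10))
tree_result = run_operator_tree(data)
fused_result = run_fused(data)
print(tree_result, fused_result)
```

Both functions compute the same answer; the fused version simply does it without passing each row through a chain of operator calls, which is the kind of overhead the text says the new engine removes.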

Structured Streaming – Continuous Applications
The current Spark Streaming API, called DStream, was introduced in Spark 0.7. It provides the ability to stream real-time data and process it. Spark 2.0 introduces Structured Streaming.
Spark Structured Streaming is a declarative API that extends DataFrames and Datasets. It is largely built on Spark SQL and also includes ideas from Spark Streaming.
Spark Streaming, which uses what has been called a “micro-batch” architecture for streaming applications, is among the most popular Spark engines. The new Structured Streaming engine represents Spark’s second attempt at solving some of the tough problems that developers face when building real-time applications.
Essentially, Structured Streaming enables Spark developers to run the same type of DataFrame queries against data streams as they had previously been running against static data. Thanks to the Catalyst optimizer, the framework figures out the best way to make this all work efficiently, freeing the developer from worrying about the underlying plumbing.
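
The idea of one query definition serving both static data and a stream can be sketched in plain Python. This is a toy model, not the Structured Streaming API: the function names, the word-count query and the lists used to simulate micro-batches are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_by_word(state: Counter, rows: Iterable[str]) -> Counter:
    """The 'query': count occurrences of each word, updating running state."""
    state.update(rows)
    return state


def run_on_static(rows: List[str]) -> Dict[str, int]:
    # Static data: the whole input is available at once.
    return dict(count_by_word(Counter(), rows))


def run_on_stream(batches: Iterable[List[str]]) -> List[Dict[str, int]]:
    # Streaming data: the same query is applied incrementally to each
    # arriving batch, and a snapshot of the result is kept after each one.
    state: Counter = Counter()
    snapshots = []
    for batch in batches:
        count_by_word(state, batch)
        snapshots.append(dict(state))
    return snapshots


events = ["a", "b", "a", "c", "b", "a"]
static_result = run_on_static(events)
snapshots = run_on_stream([events[:2], events[2:4], events[4:]])
print(static_result)
print(snapshots[-1])
```

Once every batch has arrived, the streaming result matches the static one. In this sketch the incremental bookkeeping is written by hand; the point of the paragraph above is that the framework takes over that part.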
Upcoming releases of Spark 2.x will include more features and improvements in Spark Structured Streaming.

DataFrame based ML API
In Spark 2.0, the DataFrame-based Machine Learning “Pipeline” API becomes the primary Machine Learning API.

Conclusion
Spark has already made its mark by providing an easy-to-use, unified and fast data framework. With Spark 2.0 we can expect further improvements in Spark's overall performance. We can look forward to the GA release of Apache Spark 2.0 in the coming days.
