1 Overview
Spark's memory consumption falls into three areas:
the amount of memory used by your objects (storing the dataset);
the cost of accessing those objects;
the overhead of garbage collection.
By default, Java objects are fast to access, but they often consume 2-5x more memory than the raw data in their fields, for the following reasons:
Each distinct Java object has an "object header";
Java Strings have about 40 bytes of overhead over the raw string data;
Common collection classes not only have a header, but also pointers to their elements;
Collections of primitive types often store them as "boxed" objects.
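The boxing point can be seen directly in Scala: an Array[Int] stores raw 4-byte ints back to back, while a List[Int] stores one pointer per element to a boxed java.lang.Integer object, each carrying its own object header. A minimal sketch (names are illustrative):

```scala
object BoxingDemo {
  // Raw 4-byte ints, stored contiguously with no per-element object header.
  val primitives: Array[Int] = Array(1, 2, 3)

  // Each element is a pointer to a boxed java.lang.Integer object.
  val boxed: List[Int] = List(1, 2, 3)

  // Inspect the runtime class of a list element: it is the wrapper class.
  val boxedElementClassName: String =
    boxed.head.asInstanceOf[AnyRef].getClass.getName

  def main(args: Array[String]): Unit = {
    println(boxedElementClassName)        // java.lang.Integer
    println(primitives.getClass.getName)  // [I  (a primitive int array)
  }
}
```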
2 Memory Model
Tasks running inside an Executor can use both on-heap and off-heap memory.
JVM on-heap memory: its size is set by the "--executor-memory" flag (i.e. spark.executor.memory). All concurrent tasks running in the Executor share this on-heap memory.
JVM off-heap memory: its size is set by the "spark.yarn.executor.memoryOverhead" parameter; it mainly covers the JVM's own overhead, interned strings, NIO buffers, and similar costs.
Off-heap memory is disabled by default; it can be enabled via the spark.memory.offHeap.enabled parameter, and its size is set by spark.memory.offHeap.size. Off-heap memory is divided the same way as on-heap memory, except that it has no "other" region; all concurrently running tasks share its storage and execution memory.
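A spark-submit invocation wiring these settings together might look like the following sketch (all sizes and names are illustrative, not recommendations):

```shell
# Sketch: 4g on-heap per executor, 1 GiB YARN overhead, plus 1g of
# explicitly managed off-heap memory.
spark-submit \
  --executor-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=1g \
  --class com.example.MyApp my-app.jar
```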
(On-heap and off-heap memory layout diagrams omitted.)
3 Determining Memory Consumption
Dataset: create an RDD, cache it, and read its size off the Storage page of the web UI.
Individual objects: use SizeEstimator's estimate() method.
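For example, a sketch of the SizeEstimator call (requires spark-core on the classpath, e.g. inside spark-shell; the sample object is made up):

```scala
import org.apache.spark.util.SizeEstimator

// Estimated footprint in bytes, including object headers, pointers, and
// boxing overhead - not just the raw data.
val sample = Array.fill(1000)("some string value")
println(SizeEstimator.estimate(sample))
```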
4 Tuning
4.1 Tuning Data Structures
Prefer arrays of objects and primitive types instead of the standard Java or Scala collection classes;
Avoid nested structures with a lot of small objects and pointers when possible;
Consider using numeric IDs or enumeration objects instead of strings for keys;
If you have less than 32 GiB of RAM, set the JVM flag -XX:+UseCompressedOops to make pointers four bytes instead of eight.
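A small sketch combining the first three tips (all names here are made up): replace a per-record Map[String, Double] with an enumeration of metric IDs and a flat primitive array.

```scala
object DataStructureDemo {
  // Enumeration instead of string keys like "pageviews" / "clicks".
  object Metric extends Enumeration { val PageViews, Clicks = Value }

  // One flat primitive array per record, indexed by metric ID: no boxing,
  // no per-entry hash-map nodes, no string keys.
  val record: Array[Double] = new Array[Double](Metric.maxId)
  record(Metric.PageViews.id) = 12.0
  record(Metric.Clicks.id) = 3.0

  def main(args: Array[String]): Unit =
    println(record.mkString(","))
}
```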
4.2 Serialized RDD Storage
Store RDDs in serialized form, using the serialized StorageLevels in the RDD persistence API, such as MEMORY_ONLY_SER. The only downside of storing data in serialized form is slower access times, since each object has to be deserialized on the fly.
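A sketch of the call (assumes an existing SparkContext sc; the input path is a placeholder):

```scala
import org.apache.spark.storage.StorageLevel

// Cache the RDD as serialized bytes (one large buffer per partition)
// instead of as deserialized Java objects.
val lengths = sc.textFile("/path/to/input").map(_.length)
lengths.persist(StorageLevel.MEMORY_ONLY_SER)
```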
4.3 Garbage Collection Tuning
Before trying other techniques, the first thing to try if GC is a problem is to use serialized caching.
GC can also be a problem due to interference between your tasks' working memory (the amount of space needed to run the task) and the RDDs cached on your nodes.
Measuring the Impact of GC
The first step in GC tuning is to collect statistics on how frequently garbage collection occurs and the amount of time spent on GC. This can be done by adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to the Java options when submitting the job.
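A sketch of attaching these flags to the executor JVMs (the flag names shown are for JDK 8 and earlier):

```shell
# Sketch: enable GC logging on every executor.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  ... # remaining submit arguments unchanged
```

Note that the GC logs appear in each executor's stdout on the worker nodes, not on the driver.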
Advanced GC Tuning
JVM basics
Java heap space is divided into two regions, Young and Old. The Young generation is meant to hold short-lived objects, while the Old generation is intended for objects with longer lifetimes.
The Young generation is further divided into three regions: Eden, Survivor1, and Survivor2. The two Survivor regions are the same size: at any time one holds surviving objects while the other is kept empty, ready to receive copies during the next collection.
A simplified description of the garbage collection procedure: when Eden is full, a minor GC is run on Eden, and objects that are alive in Eden and Survivor1 are copied to Survivor2. The Survivor regions are then swapped. If an object is old enough, or Survivor2 is full, it is moved to Old. Finally, when Old is close to full, a full GC is invoked.
The goal of GC tuning in Spark is to ensure that only long-lived RDDs are stored in the Old generation, and that the Young generation is sufficiently sized to store short-lived objects (temporary objects created during task execution), thereby avoiding full GCs.
Tuning steps
If a full GC is invoked multiple times before a task completes, it means that there isn't enough memory available for executing tasks.
If there are too many minor collections but not many major GCs, allocating more memory for Eden would help.
If the OldGen is close to being full, reduce the amount of memory used for caching by lowering spark.memory.fraction; alternatively, consider decreasing the size of the Young generation.
Try the G1GC garbage collector with -XX:+UseG1GC.
Monitor how the frequency and time taken by garbage collection change with the new settings, and keep iterating on the parameters.
GC tuning flags for executors can be specified by setting spark.executor.defaultJavaOptions or spark.executor.extraJavaOptions in a job’s configuration.
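Several of the steps above are submit-time settings; a sketch (the 0.4 value and the G1 flag are illustrative, not recommendations):

```shell
# Sketch: lower the unified memory fraction (default 0.6) so less heap is
# used for caching, and switch the executors to the G1 collector.
spark-submit \
  --conf spark.memory.fraction=0.4 \
  --conf "spark.executor.defaultJavaOptions=-XX:+UseG1GC" \
  ... # remaining submit arguments unchanged
```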
4.4 Other Considerations
Level of Parallelism
In general, we recommend 2-3 tasks per CPU core in your cluster.
Parallel Listing on Input Paths
Sometimes you may also need to increase directory listing parallelism when the job input has a large number of directories.
Memory Usage of Reduce Tasks
Spark's shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc.) build a hash table within each task to perform the grouping, which can often be large. The simplest fix here is to increase the level of parallelism; you can safely increase it to more than the number of cores in your cluster.
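For instance, a shuffle's partition count can be raised directly; a sketch (assumes an existing pair RDD named pairs, and 400 is purely illustrative):

```scala
// More partitions -> fewer keys per task -> a smaller hash table per task.
val counts = pairs.reduceByKey(_ + _, numPartitions = 400)
```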
Broadcasting Large Variables
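The idea is to ship a large read-only value to each executor once via SparkContext.broadcast, instead of serializing it into every task closure that references it. A sketch (assumes an existing SparkContext sc and an RDD named ids; the table contents are made up):

```scala
// Broadcast the lookup table once per executor rather than once per task.
val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
val names = ids.map(id => lookup.value.getOrElse(id, "unknown"))
```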
Data Locality
In situations where there is no unprocessed data on any idle executor, Spark switches to lower locality levels. There are two options: a) wait until a busy CPU frees up to start a task on data on the same server, or b) immediately start a new task in a farther away place that requires moving data there.
What Spark typically does is wait a bit in the hopes that a busy CPU frees up. Once that timeout expires, it starts moving the data from far away to the free CPU. The wait timeout for fallback between each level can be configured individually or all together in one parameter.
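The wait is controlled by spark.locality.wait (3s by default), with per-level overrides such as spark.locality.wait.node and spark.locality.wait.process. A sketch (6s is illustrative):

```shell
# Wait longer for a data-local slot before falling back to a less-local level.
spark-submit --conf spark.locality.wait=6s \
  ... # remaining submit arguments unchanged
```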
5 Summary
For most programs, switching to Kryo serialization and persisting data in serialized form will solve most common performance issues.
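Enabling Kryo is a small configuration change; a sketch (Record is a placeholder application class):

```scala
import org.apache.spark.SparkConf

case class Record(id: Int, payload: Array[Byte])

// Use Kryo instead of Java serialization, and register application classes
// so Kryo can write compact class identifiers instead of full class names.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Record]))
```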