Tez线上部署及性能测试:

背景:

如果作业由多个MR任务完成,则必然经过多次完整的Map–shuffer–Reduce,中间节点的数据多次写入HDFS,浪费IO读写。(可以将HDFS理解为多个任务之间的共享存储。)Tez的引入可以较小的代价的解决这一问题。

Tez采用了DAG(有向无环图)来组织MR任务。
核心思想:将Map任务和Reduce任务进一步拆分,Map任务拆分为Input-Processor-Sort-Merge-Output,Reduce任务拆分为Input-Shuffer-Sort-Merge-Process-output,Tez将若干小任务灵活重组,形成一个大的DAG作业。
Tez与oozie不同:oozie只能以MR任务为整体来管理、组织,本质上仍然是多个MR任务的执行,不能解决上面提到的多个任务之间硬盘IO冗余的问题。 Tez只是一个Client,部署很方便。
目前Hive使用了Tez(Hive是一个将用户的SQL请求翻译为MR任务,最终查询HDFS的工具Tez采用了DAG(有向无环图)来组织MR任务。 核心思想:将Map任务和Reduce任务进一步拆分,Map任务拆分为Input-Processor-Sort-Merge-Output,Reduce任务拆分为Input-Shuffer-Sort-Merge-Process-output,Tez将若干小任务灵活重组,形成一个大的DAG作业。 Tez与oozie不同:oozie只能以MR任务为整体来管理、组织,本质上仍然是多个MR任务的执行,不能解决上面提到的多个任务之间硬盘IO冗余的问题。 Tez只是一个Client,部署很方便。 目前Hive使用了Tez(Hive是一个将用户的SQL请求翻译为MR任务,最终查询HDFS的工具)

传统的MR:

image.png

tez:

image.png

TEZ技术:

  • Application Master Pool 初始化AM池。Tez先将作业提交到AMPoolServer服务上。AMPoolServer服务启动时就申请多个AM,Tez提交作业会优先使用缓冲池资源
  • Container Pool AM启动时会预先申请多个Container
  • Container重用

Tez实现方法:

Tez对外提供了6种可编程组件,分别是:

  • Input:对输入数据源的抽象,它解析输入数据格式,并吐出一个个Key/value
  • Output:对输出数据源的抽象,它将用户程序产生的Key/value写入文件系统
  • Paritioner:对数据进行分片,类似于MR中的Partitioner
  • Processor:对计算的抽象,它从一个Input中获取数据,经处理后,通过Output输出
  • Task:对任务的抽象,每个Task由一个Input、Ouput和Processor组成
  • Maser :管理各个Task的依赖关系,并按顺依赖关系执行他们

除了以上6种组件,Tez还提供了两种算子,分别是Sort(排序)和Shuffle(混洗),为了用户使用方便,它还提供了多种Input、Output、Task和Sort的实现

TEZ执行引擎的问世,可以帮助我们解决现有MR框架的一些不足,比如迭代计算和交互计算,除了Hive组件,Pig组件也将TEZ用到了自己的优化中。
另外,TEZ是基于YARN的,所以可以与原有的MR共存,不会相互冲突,在实际的应用中,我们只需在hadoop-env.sh文件中配置TEZ的环境变量,并在mapred-site.xml设置执行作业的架构为yarn-tez,这样在YARN上运行的作业就会跑TEZ计算模式,所以原有的系统接入TEZ很便捷。当然,如果我们只想Hive使用TEZ,并不想对整个系统做修改,那我们也可以单独在Hive中做修改,也很简单,这样Hive可以在MR和TEZ之间自由切换而对原有的Hadoop MR任务没有影响,所以TEZ这款计算框架的耦合很低,让我们使用很容易和方便。




1 CDH集群测试环境

组件 版本
CDH 5.11.0
HADOOP 2.6.0
HIVE 1.1.0

CDH集群中Tez部署

版本:tez-0.8.5

部署

上传Tez tar包及lib

 # 上传tez-tar包至hdfs;
    $ hdfs dfs -put ./tez-0.8.5.tar.gz /apps/tez/
    # 拷贝编译完的Tez,apache-tez-0.8.5-src-2.6.0-cdh5.11.0/tez-dist/target/tez-0.8.5目录下的lib包至CDH-Hive的lib目录(所有Hiveserver2以及Hive Metastore Server)
    $ cd apache-tez-0.8.5-src-2.6.0-cdh5.11.0/tez-dist/target/tez-0.8.5
    $ cp ./*.jar  /opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/hive/lib
    $ cp ./lib/*.jar  /opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/hive/lib
    ###################################################################
    # 同步所有CDH-Hive节点(包括gateway,metastore,hievserver机器)
    ###################################################################

配置Tez配置文件

image.png

添加Tez配置文件:

<property>
   <name>tez.lib.uris</name>
   <value>${fs.defaultFS}/apps/tez/tez-0.8.5.tar.gz</value>
</property>
<!-- 由于TEZ-UI与timeline之间问题没解决,暂时注释掉 -->
<!-- <property>
     <name>tez.tez-ui.history-url.base</name>
     <value>http://tez-ui-serverIP:port/tez-ui/</value>
  </property>
  <property>
     <name>yarn.timeline-service.hostname</name>
     <value>${timelineServerIP}</value>
   </property> -->
<property>
   <name>tez.runtime.io.sort.mb</name>
   <value>1600</value>
   <description>40%*hive.tez.container.size</description>
</property>
<property>
   <name>hive.auto.convert.join.noconditionaltask.size</name>
   <value>1300</value>
   <description>多个mapjoin转换为1个时,所有小表的文件大小总和的最大值,这个值只是限制输入的表文件的大小,并不代表实际mapjoin时hashtable的大小。 建议值:1/3* hive.tez.container.size</description>
</property>
<property>
   <name>tez.runtime.unordered.output.buffer.size-mb</name>
   <value>400</value>
   <description>Size of the buffer to use if not writing directly to disk.。 建议值:10%* hive.tez.container.size</description>
</property>
<property>
   <name>hive.tez.container.size</name>
   <value>4096</value>
   <description>Set hive.tez.container.size to be the same as or a small multiple(1 or 2 times that) of YARN container size yarn.scheduler.minimum-allocation-mb but NEVER more than yarn.scheduler.maximum-allocation-mb</description>
</property>

###################################################################
同步配置至所有Hiveserver2以及Hive Metastore Server、gateway机器;

###################################################################

hive服务端配置

image.png

在图示位置添加Tez配置:

<property>
      <name>tez.lib.uris</name>
      <value>${fs.defaultFS}/apps/tez/tez-0.8.5.tar.gz</value>
   </property>
   <property>
      <name>tez.runtime.io.sort.mb</name>
      <value>1600</value>
      <description>40%*hive.tez.container.size</description>
   </property>
   <property>
      <name>hive.auto.convert.join.noconditionaltask.size</name>
      <value>1300</value>
      <description>多个mapjoin转换为1个时,所有小表的文件大小总和的最大值,这个值只是限制输入的表文件的大小,并不代表实际mapjoin时hashtable的大小。 建议值:1/3* hive.tez.container.size</description>
   </property>
   <property>
      <name>tez.runtime.unordered.output.buffer.size-mb</name>
      <value>400</value>
      <description>Size of the buffer to use if not writing directly to disk.。 建议值:10%* hive.tez.container.size</description>
   </property>
   <property>
      <name>hive.tez.container.size</name>
      <value>4096</value>
      <description>Set hive.tez.container.size to be the same as or a small multiple(1 or 2 times that) of YARN container size yarn.scheduler.minimum-allocation-mb but NEVER more than yarn.scheduler.maximum-allocation-mb</description>
   </property>

TEZ配置注意事项

"hive.tez.container.size" and "hive.tez.java.opts" are the parameters that alter Tez memory settings in Hive. If "hive.tez.container.size" is set to "-1" (default value), it picks the value of "mapreduce.map.memory.mb". If "hive.tez.java.opts" is not specified, it relies on the "mapreduce.map.java.opts" setting. Thus, if Tez specific memory settings are left as default values, memory sizes are picked from mapreduce mapper memory settings "mapreduce.map.memory.mb".

Important: Please note that the setting for "hive.tez.java.opts" must be smaller than the size specified for "hive.tez.container.size", or "mapreduce.{map|reduce}.memory.mb" if "hive.tez.container.size" is not specified. Don't forget to review both of them when setting either one to ensure "hive.tez.java.opts" is smaller then "hive.tez.container.size" or "mapreduce.{map|reduce}.java.opts" is smaller then "mapreduce.{map|reduce}.memory.mb".

See Configuring Heapsize for Mappers and Reducers in Hadoop 2 for more information about the "mapreduce.map.memory.mb" and "mapreduce.map.java.opts" properties.

### Yarn Timeline server 配置(本次未启用Timeline server)

.tez使用timelineserver存储application数据,由于tez-ui与yarn-timeline server目前还有问题没解决,暂时没有添加tez-ui。贴出yarn timeline server配置

CDH集群yarn-site配置添加

<property>
  <description>The hostname of the Timeline service web application.</description>
  <name>yarn.timeline-service.hostname</name>
  <value>10.10.15.107</value>
</property>
<property>
  <description>Address for the Timeline server to start the RPC server.</description>
  <name>yarn.timeline-service.address</name>
  <value>${yarn.timeline-service.hostname}:10200</value>
</property>
 
<property>
  <description>The http address of the Timeline service web application.</description>
  <name>yarn.timeline-service.webapp.address</name>
  <value>${yarn.timeline-service.hostname}:8188</value>
</property>
 
<property>
  <description>The https address of the Timeline service web application.</description>
  <name>yarn.timeline-service.webapp.https.address</name>
  <value>${yarn.timeline-service.hostname}:8190</value>
</property>
 
<property>
  <description>Handler thread count to serve the client RPC requests.</description>
  <name>yarn.timeline-service.handler-thread-count</name>
  <value>10</value>
</property>
 
<property>
  <description>Enables cross-origin support (CORS) for web services where
  cross-origin web response headers are needed. For example, javascript making
  a web services request to the timeline server.</description>
  <name>yarn.timeline-service.http-cross-origin.enabled</name>
  <value>true</value>
</property>
 
<property>
  <description>Comma separated list of origins that are allowed for web
  services needing cross-origin (CORS) support. Wildcards (*) and patterns
  allowed</description>
  <name>yarn.timeline-service.http-cross-origin.allowed-origins</name>
  <value>*</value>
</property>
 
<property>
  <description>Comma separated list of methods that are allowed for web
  services needing cross-origin (CORS) support.</description>
  <name>yarn.timeline-service.http-cross-origin.allowed-methods</name>
  <value>GET,POST,HEAD</value>
</property>
 
<property>
  <description>Comma separated list of headers that are allowed for web
  services needing cross-origin (CORS) support.</description>
  <name>yarn.timeline-service.http-cross-origin.allowed-headers</name>
  <value>X-Requested-With,Content-Type,Accept,Origin</value>
</property>
 
<property>
  <description>The number of seconds a pre-flighted request can be cached
  for web services needing cross-origin (CORS) support.</description>
  <name>yarn.timeline-service.http-cross-origin.max-age</name>
  <value>1800</value>
</property>
<property>
  <description>Indicate to clients whether Timeline service is enabled or not.
  If enabled, the TimelineClient library used by end-users will post entities
  and events to the Timeline server.</description>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>
 
<property>
  <description>Store class name for timeline store.</description>
  <name>yarn.timeline-service.store-class</name>
  <value>org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore</value>
</property>
 
<property>
  <description>Enable age off of timeline store data.</description>
  <name>yarn.timeline-service.ttl-enable</name>
  <value>true</value>
</property>
 
<property>
  <description>Time to live for timeline store data in milliseconds.</description>
  <name>yarn.timeline-service.ttl-ms</name>
  <value>604800000</value>
</property>
 
<property>
  <name>yarn.resourcemanager.system-metrics-publisher.enabled</name>
  <value>true</value>
</property>

记得改完配置,重启Hive服务

Tez出现问题

Q1: java.lang.ArithmeticException: / by zero

Vertex failed, vertexName=Map 1, vertexId=vertex_1557110571873_0003_1_00, diagnostics=[Vertex vertex_1557110571873_0003_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: tez initializer failed, vertex=vertex_1557110571873_0003_1_00 [Map 1], java.lang.ArithmeticException: / by zero
    at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:123)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

解决:

SET hive.tez.container.size = ;(上面配置已经补充,根据集群设置而定)

Q2: map 0%卡住

解决

查看了一下hive中hive.map.aggr.hash.percentmemory属性的说明: Hive Map 端聚合的哈稀存储所占用虚拟机的内存比例。 意思是说,当内存的Map大小,占到JVM配置的Map进程的25%的时候(默认是50%),就将这个数据flush到reducer去,以释放内存Map的空间。 错误原因:Map端聚合时hash表所占用的内存比例默认为0.5,这个值超过可用内存大小,导致内存溢出。

Q3: java.lang.NoClassDefFoundError: com/esotericsoftware/kryo/Serializer

ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1558679295627_0001_1_00, diagnostics=[Vertex vertex_1558679295627_0001_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: tez initializer failed, vertex=vertex_1558679295627_0001_1_00 [Map 1], java.lang.NoClassDefFoundError: com/esotericsoftware/kryo/Serializer at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:107) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269) at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.ClassNotFoundException: com.esotericsoftware.kryo.Serializer at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 12 more

解决

# 切换指定目录,拷贝缺少jar
$ cd /opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/jars
###################################################################
# 同步配置至所有Hiveserver2以及Hive Metastore Server、gateway机器;
###################################################################
cp ./kryo-2.22.jar ../lib/hive/auxlib/
# 如果在hive命令行执行,则不会报错,如果在hue中执行,需要重启hive。

Q4: Caused by: java.lang.OutOfMemoryError: Java heap space

Caused by: java.lang.OutOfMemoryError: Java heap space
 at org.apache.hadoop.hive.serde2.WriteBuffers.nextBufferToWrite(WriteBuffers.java:206)
 at org.apache.hadoop.hive.serde2.WriteBuffers.write(WriteBuffers.java:182)
 at org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer$LazyBinaryKvWriter.writeValue(MapJoinBytesTableContainer.java:248)
 at org.apache.hadoop.hive.ql.exec.persistence.BytesBytesMultiHashMap.writeFirstValueRecord(BytesBytesMultiHashMap.java:574)
 at org.apache.hadoop.hive.ql.exec.persistence.BytesBytesMultiHashMap.put(BytesBytesMultiHashMap.java:229)
 at org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer.putRow(MapJoinBytesTableContainer.java:288)

解决
https://mapr.com/support/s/article/How-to-change-Tez-container-heapsize?language=en_US

修改配置:
<property>
  <name>tez.am.resource.memory.mb</name>
  <value>2048</value>
</property>

Q5 Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.tez.dag.api.TezException): App master already running a DAG

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.tez.dag.api.TezException): App master already running a DAG
    at org.apache.tez.dag.app.DAGAppMaster.submitDAGToAppMaster(DAGAppMaster.java:1379)
    at org.apache.tez.dag.api.client.DAGClientHandler.submitDAG(DAGClientHandler.java:140)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.submitDAG(DAGClientAMProtocolBlockingPBServerImpl.java:175)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7636)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2220)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2214)
image.png
添加配置;
[beeswax]
    max_number_of_sessions=10

Q6 java.lang.AbstractMethodError: org.codehaus.jackson.map.AnnotationIntrospector.findSerializer(Lorg/codehaus/jackson/map/introspect/Annotated;)Ljava/lang/Object;

java.lang.AbstractMethodError: org.codehaus.jackson.map.AnnotationIntrospector.findSerializer(Lorg/codehaus/jackson/map/introspect/Annotated;)Ljava/lang/Object;
    at org.codehaus.jackson.map.ser.BasicSerializerFactory.findSerializerFromAnnotation(BasicSerializerFactory.java:362)
    at org.codehaus.jackson.map.ser.BeanSerializerFactory.createSerializer(BeanSerializerFactory.java:252)
    at org.codehaus.jackson.map.ser.StdSerializerProvider._createUntypedSerializer(StdSerializerProvider.java:782)
    at org.codehaus.jackson.map.ser.StdSerializerProvider._createAndCacheUntypedSerializer(StdSerializerProvider.java:735)
    at org.codehaus.jackson.map.ser.StdSerializerProvider.findValueSerializer(StdSerializerProvider.java:344)
    at org.codehaus.jackson.map.ser.StdSerializerProvider.findTypedValueSerializer(StdSerializerProvider.java:420)
    at org.codehaus.jackson.map.ser.StdSerializerProvider._serializeValue(StdSerializerProvider.java:601)
    at org.codehaus.jackson.map.ser.StdSerializerProvider.serializeValue(StdSerializerProvider.java:256)
    at org.codehaus.jackson.map.ObjectMapper.writeValue(ObjectMapper.java:1604)
    at org.codehaus.jackson.jaxrs.JacksonJsonProvider.writeTo(JacksonJsonProvider.java:527)
    at com.sun.jersey.api.client.RequestWriter.writeRequestEntity(RequestWriter.java:300)
    at com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:204)
    at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:226)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:162)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:237)
    at com.sun.jersey.api.client.Client.handle(Client.java:648)
    at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
    at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
    at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:472)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:321)
    at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:301)
    at org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.handleEvents(ATSHistoryLoggingService.java:358)
    at org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.serviceStop(ATSHistoryLoggingService.java:233)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
    at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
    at org.apache.tez.dag.history.HistoryEventHandler.serviceStop(HistoryEventHandler.java:111)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:65)
    at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1949)
    at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2140)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHook.run(DAGAppMaster.java:2438)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

解决:

更改tez的jackson JAR为1.9.13版本

Tez与MR在Hive的性能测试

在Hive中创建待测试表

# 表UserVisits
$ create table UserVisits (sourceIP VARCHAR(116), destURL VARCHAR(100), visitDate DATE, adRevenue FLOAT, userAgent VARCHAR(256), countryCode CHAR(3), languageCode CHAR(6), searchWord VARCHAR(32), duration INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
# 表Rankings
$ create table Rankings (pageURL varchar(300), pageRank INT ,avgDuration INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

导入数据(一次1G数据集测试,一次2G数据集测试)

load data local inpath 'path' into table Rankings; ....

测试

scan操作

$ SELECT pageURL, pageRank FROM rankings WHERE pageRank > 1000000;

聚合操作

$ SELECT SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 4);

join操作

$ SELECT sourceIP, totalRevenue FROM ( SELECT sourceIP, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL GROUP BY UV.sourceIP ) test ORDER BY totalRevenue DESC LIMIT 1;

External Script Query

$ CREATE TABLE url_counts_total AS SELECT SUM(pageRank) AS totalCount, pageURL FROM Rankings GROUP BY pageURL;

测试结果

image.png

image.png
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 213,099评论 6 492
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 90,828评论 3 387
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 158,540评论 0 348
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 56,848评论 1 285
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 65,971评论 6 385
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 50,132评论 1 291
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,193评论 3 412
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 37,934评论 0 268
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 44,376评论 1 303
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 36,687评论 2 327
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 38,846评论 1 341
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 34,537评论 4 335
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,175评论 3 317
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 30,887评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,134评论 1 267
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 46,674评论 2 362
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 43,741评论 2 351