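We have found that Hive on Spark jobs occasionally show RunningTasksCount shrinking steadily during execution,
which lowers job throughput. Checking resources showed no shortage, and HDFS RPC showed no anomalies either.
So what is the actual cause?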
Abnormal run
2023-09-10 02:12:14,765 Stage-10_0: 0/1099 Stage-9_0: 930(+85)/9467
2023-09-10 02:12:15,767 Stage-10_0: 0/1099 Stage-9_0: 932(+85)/9467
2023-09-10 02:12:16,769 Stage-10_0: 0/1099 Stage-9_0: 934(+84)/9467
2023-09-10 02:12:17,772 Stage-10_0: 0/1099 Stage-9_0: 939(+83)/9467
2023-09-10 02:12:18,774 Stage-10_0: 0/1099 Stage-9_0: 941(+83)/9467
2023-09-10 02:12:19,776 Stage-10_0: 0/1099 Stage-9_0: 949(+83)/9467
2023-09-10 02:12:20,778 Stage-10_0: 0/1099 Stage-9_0: 951(+82)/9467
2023-09-10 02:12:21,780 Stage-10_0: 0/1099 Stage-9_0: 957(+82)/9467
2023-09-10 02:12:22,782 Stage-10_0: 0/1099 Stage-9_0: 959(+81)/9467
2023-09-10 02:12:23,783 Stage-10_0: 0/1099 Stage-9_0: 968(+79)/9467
2023-09-10 02:12:24,785 Stage-10_0: 0/1099 Stage-9_0: 972(+79)/9467
2023-09-10 02:12:25,787 Stage-10_0: 0/1099 Stage-9_0: 975(+77)/9467
2023-09-10 02:12:26,792 Stage-10_0: 0/1099 Stage-9_0: 978(+76)/9467
2023-09-10 02:12:27,795 Stage-10_0: 0/1099 Stage-9_0: 980(+76)/9467
2023-09-10 02:12:28,797 Stage-10_0: 0/1099 Stage-9_0: 981(+75)/9467
2023-09-10 02:12:30,800 Stage-10_0: 0/1099 Stage-9_0: 984(+73)/9467
2023-09-10 02:12:31,802 Stage-10_0: 0/1099 Stage-9_0: 988(+71)/9467
2023-09-10 02:12:32,804 Stage-10_0: 0/1099 Stage-9_0: 993(+68)/9467
2023-09-10 02:12:33,806 Stage-10_0: 0/1099 Stage-9_0: 998(+65)/9467
2023-09-10 02:12:34,808 Stage-10_0: 0/1099 Stage-9_0: 1006(+61)/9467
2023-09-10 02:12:35,810 Stage-10_0: 0/1099 Stage-9_0: 1009(+61)/9467
2023-09-10 02:12:36,812 Stage-10_0: 0/1099 Stage-9_0: 1011(+61)/9467
2023-09-10 02:12:37,814 Stage-10_0: 0/1099 Stage-9_0: 1014(+61)/9467
2023-09-10 02:12:38,816 Stage-10_0: 0/1099 Stage-9_0: 1019(+58)/9467
2023-09-10 02:12:39,818 Stage-10_0: 0/1099 Stage-9_0: 1022(+57)/9467
2023-09-10 02:12:40,820 Stage-10_0: 0/1099 Stage-9_0: 1025(+54)/9467
2023-09-10 02:12:41,822 Stage-10_0: 0/1099 Stage-9_0: 1028(+54)/9467
2023-09-10 02:12:42,824 Stage-10_0: 0/1099 Stage-9_0: 1030(+53)/9467
2023-09-10 02:12:43,826 Stage-10_0: 0/1099 Stage-9_0: 1036(+50)/9467
2023-09-10 02:12:44,828 Stage-10_0: 0/1099 Stage-9_0: 1038(+50)/9467
2023-09-10 02:12:45,830 Stage-10_0: 0/1099 Stage-9_0: 1040(+50)/9467
2023-09-10 02:12:46,832 Stage-10_0: 0/1099 Stage-9_0: 1042(+49)/9467
2023-09-10 02:12:47,834 Stage-10_0: 0/1099 Stage-9_0: 1043(+49)/9467
2023-09-10 02:12:48,836 Stage-10_0: 0/1099 Stage-9_0: 1048(+47)/9467
Normal run
2023-09-13 02:12:16,887 Stage-10_0: 0/1099 Stage-9_0: 472(+480)/9478
2023-09-13 02:12:17,892 Stage-10_0: 0/1099 Stage-9_0: 474(+478)/9478
2023-09-13 02:12:18,895 Stage-10_0: 0/1099 Stage-9_0: 477(+478)/9478
2023-09-13 02:12:19,907 Stage-10_0: 0/1099 Stage-9_0: 486(+478)/9478
2023-09-13 02:12:20,908 Stage-10_0: 0/1099 Stage-9_0: 491(+476)/9478
2023-09-13 02:12:21,910 Stage-10_0: 0/1099 Stage-9_0: 494(+475)/9478
2023-09-13 02:12:22,912 Stage-10_0: 0/1099 Stage-9_0: 498(+474)/9478
2023-09-13 02:12:23,914 Stage-10_0: 0/1099 Stage-9_0: 505(+469)/9478
2023-09-13 02:12:24,915 Stage-10_0: 0/1099 Stage-9_0: 507(+467)/9478
2023-09-13 02:12:25,917 Stage-10_0: 0/1099 Stage-9_0: 511(+465)/9478
2023-09-13 02:12:26,919 Stage-10_0: 0/1099 Stage-9_0: 515(+464)/9478
2023-09-13 02:12:27,922 Stage-10_0: 0/1099 Stage-9_0: 522(+461)/9478
2023-09-13 02:12:28,924 Stage-10_0: 0/1099 Stage-9_0: 527(+458)/9478
2023-09-13 02:12:29,925 Stage-10_0: 0/1099 Stage-9_0: 550(+452)/9478
2023-09-13 02:12:30,928 Stage-10_0: 0/1099 Stage-9_0: 561(+446)/9478
2023-09-13 02:12:31,930 Stage-10_0: 0/1099 Stage-9_0: 568(+444)/9478
2023-09-13 02:12:32,932 Stage-10_0: 0/1099 Stage-9_0: 576(+442)/9478
2023-09-13 02:12:33,933 Stage-10_0: 0/1099 Stage-9_0: 587(+439)/9478
2023-09-13 02:12:34,935 Stage-10_0: 0/1099 Stage-9_0: 597(+436)/9478
2023-09-13 02:12:35,937 Stage-10_0: 0/1099 Stage-9_0: 605(+431)/9478
2023-09-13 02:12:36,939 Stage-10_0: 0/1099 Stage-9_0: 612(+429)/9478
2023-09-13 02:12:37,941 Stage-10_0: 0/1099 Stage-9_0: 621(+425)/9478
2023-09-13 02:12:38,942 Stage-10_0: 0/1099 Stage-9_0: 633(+418)/9478
2023-09-13 02:12:39,944 Stage-10_0: 0/1099 Stage-9_0: 639(+414)/9478
2023-09-13 02:12:40,946 Stage-10_0: 0/1099 Stage-9_0: 647(+406)/9478
2023-09-13 02:12:41,948 Stage-10_0: 0/1099 Stage-9_0: 652(+403)/9478
2023-09-13 02:12:42,950 Stage-10_0: 0/1099 Stage-9_0: 660(+398)/9478
2023-09-13 02:12:43,952 Stage-10_0: 0/1099 Stage-9_0: 671(+391)/9478
2023-09-13 02:12:44,954 Stage-10_0: 0/1099 Stage-9_0: 682(+383)/9478
2023-09-13 02:12:45,956 Stage-10_0: 0/1099 Stage-9_0: 692(+378)/9478
2023-09-13 02:12:46,959 Stage-10_0: 0/1099 Stage-9_0: 699(+375)/9478
2023-09-13 02:12:47,962 Stage-10_0: 0/1099 Stage-9_0: 705(+371)/9478
2023-09-13 02:12:48,964 Stage-10_0: 0/1099 Stage-9_0: 721(+363)/9478
2023-09-13 02:12:49,966 Stage-10_0: 0/1099 Stage-9_0: 731(+358)/9478
2023-09-13 02:12:50,968 Stage-10_0: 0/1099 Stage-9_0: 741(+351)/9478
2023-09-13 02:12:51,970 Stage-10_0: 0/1099 Stage-9_0: 754(+346)/9478
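This article covers Hive on Spark only.
Compared with earlier runs, the log shows the number of running tasks dropping suddenly, which makes the SQL execute slowly.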
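To compare two runs numerically rather than by eye, the progress lines shown above can be parsed into a series of running-task counts. The following is a minimal sketch that assumes the exact line layout seen in these logs (`finished(+running)/total` per stage); the function names `parse_progress_line` and `running_series` are made up for illustration, and other Hive versions may print extra fields that this pattern does not handle.

```python
import re
from datetime import datetime

# Matches one stage entry such as "Stage-9_0: 930(+85)/9467" or "Stage-10_0: 0/1099".
# Assumption: only the layout shown in the logs above is handled.
STAGE_RE = re.compile(r"(Stage-\d+_\d+): (\d+)(?:\(\+(\d+)\))?/(\d+)")


def parse_progress_line(line):
    """Return (timestamp, {stage: (finished, running, total)}) for one progress line."""
    # The first 23 characters hold the timestamp, e.g. "2023-09-10 02:12:14,765".
    timestamp = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
    stages = {}
    for name, finished, running, total in STAGE_RE.findall(line):
        # A stage with no "(+N)" part has no running tasks at that moment.
        stages[name] = (int(finished), int(running or 0), int(total))
    return timestamp, stages


def running_series(lines, stage):
    """Collect the running-task count of one stage across all progress lines."""
    series = []
    for line in lines:
        timestamp, stages = parse_progress_line(line)
        if stage in stages:
            series.append((timestamp, stages[stage][1]))
    return series


abnormal = [
    "2023-09-10 02:12:14,765 Stage-10_0: 0/1099 Stage-9_0: 930(+85)/9467",
    "2023-09-10 02:12:48,836 Stage-10_0: 0/1099 Stage-9_0: 1048(+47)/9467",
]
print([running for _, running in running_series(abnormal, "Stage-9_0")])  # [85, 47]
```

Applied to both logs, this makes the gap explicit: the abnormal run has fewer than 90 tasks of Stage-9_0 running at once, while the normal run has several hundred.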
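Compare the execution plans of the two runs. If the number of tasks has dropped, the table statistics are most likely stale or wrong, and recomputing them can resolve the problem: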
ANALYZE TABLE ods_fact_sale_partion PARTITION(sale_date='2010-04-12') COMPUTE STATISTICS;
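If the job shows a long tail, consider data skew first.
Once data skew is ruled out, check in the logs which nodes the slow tasks ran on. If the slow tasks are concentrated on a few nodes, those machines are most likely faulty.
On those nodes, look first at CPU, memory, I/O (CPU iowait), and disk read/write throughput.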
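The node check described above can be sketched as a small grouping step. This is only an illustration under stated assumptions: the task records are assumed to have been collected beforehand as `(task_id, host, duration_seconds)` tuples (for example, copied from the task table of the stage page in the Spark UI), the helper name `slow_task_hosts` is hypothetical, and "slow" is arbitrarily defined as longer than three times the median duration.

```python
from collections import Counter
from statistics import median


def slow_task_hosts(tasks, factor=3.0):
    """Count slow tasks per host; a task is slow if it ran longer than factor * median."""
    durations = [duration for _, _, duration in tasks]
    threshold = factor * median(durations)
    slow = Counter(host for _, host, duration in tasks if duration > threshold)
    total = Counter(host for _, host, _ in tasks)
    # Report (host, slow task count, total task count), worst hosts first.
    return [(host, count, total[host]) for host, count in slow.most_common()]


# Hypothetical sample data: host names and durations are invented for illustration.
tasks = [
    (0, "node-a", 10.0),
    (1, "node-a", 12.0),
    (2, "node-b", 11.0),
    (3, "node-c", 95.0),
    (4, "node-c", 120.0),
    (5, "node-b", 9.0),
]
print(slow_task_hosts(tasks))  # [('node-c', 2, 2)]
```

If one or two hosts account for nearly all slow tasks, as `node-c` does in this made-up sample, those are the machines whose CPU, memory, and disk metrics deserve a closer look. If slow tasks are spread evenly across hosts, data skew or another job-level cause is the more likely explanation.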