Too Many Tasks

Background

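When the stage that writes Spark's final output files has a very large number of tasks, the driver performs a large number of rename operations on a single thread, which is slow. How can this be fixed?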

Analysis

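The culprit is the commitJobInternal function in Hadoop's FileOutputCommitter. It calls mergePaths on a single thread, moving each task's output from the _temporary directory into the final output directory.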

  @VisibleForTesting
  protected void commitJobInternal(JobContext context) throws IOException {
    if (hasOutputPath()) {
      Path finalOutput = getOutputPath();
      FileSystem fs = finalOutput.getFileSystem(context.getConfiguration());

      // v1 only: merge every committed task's output into the final directory
      if (algorithmVersion == 1) {
        for (FileStatus stat: getAllCommittedTaskPaths(context)) {
          mergePaths(fs, stat, finalOutput);
        }
      }

      if (skipCleanup) {
        LOG.info("Skip cleanup the _temporary folders under job's output " +
            "directory in commitJob.");
      } else {
        // delete the _temporary folder and create a _done file in the o/p
        // folder
        try {
          cleanupJob(context);
        } catch (IOException e) {
          if (ignoreCleanupFailures) {
            // swallow exceptions in cleanup as user configure to make sure
            // commitJob could be success even when cleanup get failure.
            LOG.error("Error in cleanup job, manually cleanup is needed.", e);
          } else {
            // throw back exception to fail commitJob.
            throw e;
          }
        }
      }
      // True if the job requires output.dir marked on successful job.
      // Note that by default it is set to true.
      if (context.getConfiguration().getBoolean(
          SUCCESSFUL_JOB_OUTPUT_DIR_MARKER, true)) {
        Path markerPath = new Path(outputPath, SUCCEEDED_FILE_NAME);
        // If job commit is repeatable and previous/another AM could write
        // mark file already, we need to set overwritten to be true explicitly
        // in case other FS implementations don't overwritten by default.
        if (isCommitJobRepeatable(context)) {
          fs.create(markerPath, true).close();
        } else {
          fs.create(markerPath).close();
        }
      }
    } else {
      LOG.warn("Output Path is null in commitJob()");
    }
  }

If mapreduce.fileoutputcommitter.algorithm.version is set to 1 (v1), the for loop above runs. Where there is a v1 there is a v2, so how does v2 behave?

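Put simply, in v1 each task writes its final result under the _temporary directory, and only after all tasks have finished does the job commit on the driver (the for loop above) rename everything into the final output directory in one pass.
In v2, each task's result goes directly into the final output directory when the task commits, so there is nothing left to rename at job commit.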

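So why not just use v2? Does v2 have any problems?
It does: a consistency problem. Consistency here simply means correctness.
Bad case 1:
Suppose 100 tasks run, 50 finish, and the rest fail. Under v2, the output of the 50 finished tasks is left behind in the final output directory.
Bad case 2:
Again 100 tasks run, but at different speeds: 95 finish quickly and 5 are still running. Anyone who reads the output files at that moment gets an incorrect result.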
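
Both bad cases come from the same fact: under v2, files become visible as soon as each task commits. The sketch below is a toy model of that behavior on the local filesystem, not Hadoop's actual code. The class CommitSketch and the method visibleAfterFailure are hypothetical names invented for this illustration, and each task is assumed to write exactly one flat file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Toy model of the v1/v2 commit algorithms (hypothetical, for illustration only).
class CommitSketch {

    /**
     * Runs a job of totalTasks tasks in which only the first succeededTasks
     * commit and the rest fail, so the job commit never happens. Returns the
     * file names a reader would see in the final output directory.
     */
    public static List<String> visibleAfterFailure(int algorithmVersion,
                                                   int totalTasks,
                                                   int succeededTasks) {
        try {
            // The temp directory is left on disk; this is only a demo.
            Path finalOutput = Files.createTempDirectory("commit-sketch");
            Path temporary = Files.createDirectory(finalOutput.resolve("_temporary"));

            for (int task = 0; task < totalTasks; task++) {
                if (task >= succeededTasks) {
                    continue; // this task failed before it could commit
                }
                // Every task first writes its result under _temporary.
                Path taskDir = Files.createDirectory(temporary.resolve("task_" + task));
                Path part = Files.write(taskDir.resolve("part-" + task),
                        ("data " + task).getBytes(StandardCharsets.UTF_8));

                if (algorithmVersion == 2) {
                    // v2: task commit moves the file straight into the final directory.
                    Files.move(part, finalOutput.resolve(part.getFileName()));
                }
                // v1: the file stays under _temporary until the job commit,
                // which never runs here because some tasks failed.
            }

            try (Stream<Path> entries = Files.list(finalOutput)) {
                return entries
                        .map(p -> p.getFileName().toString())
                        .filter(name -> !name.startsWith("_")) // ignore _temporary
                        .sorted()
                        .collect(Collectors.toList());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 100 tasks, 50 succeed, the rest fail (bad case 1).
        System.out.println("v1 visible files: " + visibleAfterFailure(1, 100, 50).size()); // 0
        System.out.println("v2 visible files: " + visibleAfterFailure(2, 100, 50).size()); // 50
    }
}
```

Under v1 the failed job exposes nothing, because everything is still under _temporary. Under v2 the 50 committed files are already sitting in the output directory, and a reader arriving mid-job (bad case 2) would see the same kind of partial result.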

For details, see the article "Spark CommitCoordinator 保证数据一致性" (Spark CommitCoordinator guarantees data consistency).

So if the consistency problems above matter little for your job, use v2; otherwise you are stuck with v1. How can v1 be optimized?

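Reduce the number of tasks in the stage that writes the output files. Too many tasks often means the individual files are too small anyway.
Could the single-threaded for loop in the Hadoop source above be changed to run on multiple threads?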
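
To make the multi-threading idea concrete, here is a minimal sketch on the local filesystem using java.nio and a fixed thread pool. It is not a patch to Hadoop: ParallelMergeSketch, mergeInParallel, mergeOne and demo are hypothetical names, the real mergePaths also handles nested directories and name collisions, and this version assumes each task directory holds only flat files with unique names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: the job-commit loop with one rename job per task directory.
class ParallelMergeSketch {

    /** Moves the files of every committed task into finalOutput; returns how many were moved. */
    public static int mergeInParallel(List<Path> committedTaskDirs, Path finalOutput, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (Path taskDir : committedTaskDirs) {
                Callable<Integer> job = () -> mergeOne(taskDir, finalOutput);
                pending.add(pool.submit(job));
            }
            int moved = 0;
            for (Future<Integer> result : pending) {
                moved += result.get(); // surfaces any failure from a worker thread
            }
            return moved;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while merging", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("merge failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }

    /** Moves the flat files of one task directory into finalOutput. */
    static int mergeOne(Path taskDir, Path finalOutput) throws IOException {
        List<Path> files;
        try (Stream<Path> entries = Files.list(taskDir)) {
            files = entries.collect(Collectors.toList());
        }
        for (Path file : files) {
            Files.move(file, finalOutput.resolve(file.getFileName()));
        }
        return files.size();
    }

    /** Builds a throwaway _temporary layout with one file per task, then merges it. */
    public static int demo(int tasks, int threads) {
        try {
            Path finalOutput = Files.createTempDirectory("parallel-merge");
            Path temporary = Files.createDirectory(finalOutput.resolve("_temporary"));
            List<Path> taskDirs = new ArrayList<>();
            for (int task = 0; task < tasks; task++) {
                Path taskDir = Files.createDirectory(temporary.resolve("task_" + task));
                Files.write(taskDir.resolve("part-" + task),
                        ("data " + task).getBytes(StandardCharsets.UTF_8));
                taskDirs.add(taskDir);
            }
            return mergeInParallel(taskDirs, finalOutput, threads);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("files moved: " + demo(1000, 8));
    }
}
```

Whether this actually shortens the commit depends on the filesystem behind the renames, so it is worth measuring on the real storage before relying on it. The consistency trade-off of v1 is unchanged: files still only appear in the final directory during job commit.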

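References

分布式系统中的一致性 (Consistency in distributed systems)
Spark CommitCoordinator 保证数据一致性 (Spark CommitCoordinator guarantees data consistency)
二阶段提交 (Two-phase commit), Wikipedia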
