Spark 用代码实现求百分位数Percentile（Quentile）的方法

参考下文得到的启发

https://stackoverflow.com/questions/28805602/how-to-compute-percentiles-in-apache-spark

简单说明下分位数的定义

Scala求分位数的方法：

  /**
   * compute percentile from an unsorted Spark RDD
   * @param data: input data set of Long integers
   * @param tile: percentile to compute (eg. 85 percentile)
   * @return value of input data at the specified percentile
   */
  def computePercentile(data: RDD[Long], tile: Double): Double = {
    // NIST method; data to be sorted in ascending order
    val r = data.sortBy(x => x)
    val c = r.count()
    if (c == 1) r.first()
    else {
      val n = (tile / 100d) * (c + 1d)
      val k = math.floor(n).toLong
      val d = n - k
      if (k <= 0) r.first()
      else {
        val index = r.zipWithIndex().map(_.swap)
        val last = c
        if (k >= c) {
          index.lookup(last - 1).head
        } else {
          index.lookup(k - 1).head + d * (index.lookup(k).head - index.lookup(k - 1).head)
        }
      }
    }
  }

请注意，事例代码中求分位数的方式，是求的加权分位数，关键代码如下：

在实际工作中，可以自行改写（算术平均值）。

===================================

下面的重点来了，如何求出Spark dataframe中某一列的分位数？

思路： DataFrame得出某一列，转为Rdd，调用刚才写的函数即可。

在写此贴之前，本人曾用sparksql的方式实现了分位数的功能，但因为执行效率不高，在这里就不展示此种方法了。

下面我列出我认为比较关键的代码，仅供参考：

import org.apache.spark.sql.Row
import com.google.gson.Gson
import org.apache.spark.sql.Dataset
import org.apache.ibatis.jdbc.SQL
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions._
 
//initDF是一个DataFrame
//求出initDF的price列，并转为Rdd
//brand是品牌名称,需要自己去定义，或者从某个列表中获取
var price = initDF.select("price").filter($"brand_name" === brand)
                  .rdd.map(x => x(0).toString().toDouble)
 
//compute Percentile 
               
var q1 = computePercentile(price, 25)
               
var q2 = computePercentile(price, 50)
               
var q3 = computePercentile(price, 75)
 
 
//computePercentile方法，我做了些许改写，求的是算术平均分位数 
    def computePercentile(data: RDD[Double], tile: Double): Double = {
        // NIST method; data to be sorted in ascending order
        val r = data.sortBy(x => x)
        val c = r.count()
        if (c < 4) {0}
        else {
          val n = (tile / 100d) * c
          val k = math.floor(n).toLong
          val d = math.ceil(n).toLong
          val index = r.zipWithIndex().map(_.swap)
          val last = c
          
          if (k >= c) {
              index.lookup(last - 1).head
            } else {
              (index.lookup(k - 1).head + index.lookup(d-1).head)/2
            }
          }
        }

得到你想要的分位数后，你就可以和其他变量组合在一起，愉快的玩耍，生成任意你想要的数据格式(DataFrame，json等等)

Spark 用代码实现求百分位数Percentile（Quentile）的方法

推荐阅读更多精彩内容