DataFrame API&SQL练习


测试数据

  • customers.csv
  • orders.csv
  • order_items.csv
  • products.csv
  • 哪位用户买的商品最多?
    • SQL
    spark.sql("select c.customer_id, c.customer_fname, c.customer_lname, tc.order_nums 
    from customers c 
    left join 
      (select t.order_customer_id as customer_id, count(t.order_customer_id) as   order_nums 
        from (select * 
              from orders 
              where order_status = 'COMPLETE') t 
        group by t.order_customer_id) tc 
        on tc.customer_id = c.customer_id 
        order by tc.order_nums desc").show(3)
    
    +-----------+--------------+--------------+----------+                          
    |customer_id|customer_fname|customer_lname|order_nums|
    +-----------+--------------+--------------+----------+
    |       9337|          Mary|         Smith|        10|
    |       7802|          Mary|       Acevedo|         9|
    |        749|         Jesse|      Matthews|         9|
    +-----------+--------------+--------------+----------+
    
    
    spark.sql("select c.customer_id, c.customer_fname, c.customer_lname,count(c.customer_id) as order_nums, sum(oi.order_item_subtotal) as total 
    from customers c 
    inner join orders o on c.customer_id = o.order_customer_id 
    inner join order_items oi on o.order_id = oi.order_item_order_id 
    group by c.customer_id, c.customer_fname, c.customer_lname 
    order by total desc").show
     
    +-----------+--------------+--------------+----------+------------------+       
    |customer_id|customer_fname|customer_lname|order_nums|             total|
    +-----------+--------------+--------------+----------+------------------+
    |        791|          Mary|         Smith|        43|10524.169999999995|
    |       9371|          Mary|     Patterson|        44| 9299.029999999993|
    |       8766|          Mary|        Duncan|        38| 9296.139999999998|
    +-----------+--------------+--------------+----------+------------------+
    
    
    • DataFrame API
    dfCustomer.alias("c")
    .join(dfOrder.alias("o"), $"c.customer_id" === $"o.order_customer_id", "right_outer")
    .filter($"o.order_status" === "COMPLETE")
    .select("c.customer_id", "c.customer_fname", "c.customer_lname", "o.order_id")
    .groupBy($"customer_id",$"customer_fname", $"customer_lname")
    .agg(count($"order_id").as("order_nums"))
    .sort($"order_nums".desc).show(3)
    
    +-----------+--------------+--------------+----------+                          
    |customer_id|customer_fname|customer_lname|order_nums|
    +-----------+--------------+--------------+----------+
    |       9337|          Mary|         Smith|        10|
    |       7802|          Mary|       Acevedo|         9|
    |       3710|        Ashley|         Smith|         9|
    +-----------+--------------+--------------+----------+
    
  • 那个商品卖的最多?
    • SQL
    // 不考虑是否交易完成
    spark.sql("select p.product_id, p.product_name, p.product_category_id, p.product_price, tp.nums 
    from products p 
    left join 
      (select order_item_product_id as product_id, count(*) as nums  
        from order_items 
        group by order_item_product_id) tp 
     on p.product_id = tp.product_id order by tp.nums desc").show(3)
    
      +----------+--------------------+-------------------+-------------+-----+       
      |product_id|        product_name|product_category_id|product_price| nums|
      +----------+--------------------+-------------------+-------------+-----+
      |       365|Perfect Fitness P...|                 17|        59.99|24515|
      |       403|Nike Men's CJ Eli...|                 18|       129.99|22246|
      |       502|Nike Men's Dri-FI...|                 24|         50.0|21035|
      +----------+--------------------+-------------------+-------------+-----+
    
    
    // 只考虑已完成订单中的商品
    spark.sql("select p.product_id, p.product_name, p.product_category_id, p.product_price, tp.nums 
      from products p 
      right join (select order_item_product_id as product_id, count(*) as nums  
                  from (select * 
                        from order_items ot 
                        left join orders o 
                        on o.order_id = ot.order_item_order_id ) s 
                   where s.order_status = 'COMPLETE' 
                   group by order_item_product_id) tp 
                   on p.product_id = tp.product_id 
                   order by tp.nums desc").show(3)
    
      +----------+--------------------+-------------+------------+                    
      |product_id|        product_name|product_price|product_nums|
      +----------+--------------------+-------------+------------+
      |       365|Perfect Fitness P...|        59.99|        8071|
      |       403|Nike Men's CJ Eli...|       129.99|        7369|
      |       502|Nike Men's Dri-FI...|         50.0|        7036|
      +----------+--------------------+-------------+------------+
    
    
    • DataFrame API
     dfProduct.alias("p")
      .join(dfOrderItem.alias("ot"), $"p.product_id" === $"ot.order_item_product_id", "right_outer")
      .select($"product_id",$"product_name", $"product_price", $"order_item_order_id", $"product_category_id")   // 可省略
      .join(dfOrder.alias("o"), $"o.order_id" === $"order_item_order_id", "left_outer")
      .filter($"order_status" === "COMPLETE").groupBy($"product_id", $"product_name", $"product_price")
      .agg(count($"product_id").as("product_nums"))
      .sort($"product_nums".desc).show(3)
    
      +----------+--------------------+-------------+------------+                    
      |product_id|        product_name|product_price|product_nums|
      +----------+--------------------+-------------+------------+
      |       365|Perfect Fitness P...|        59.99|        8071|
      |       403|Nike Men's CJ Eli...|       129.99|        7369|
      |       502|Nike Men's Dri-FI...|         50.0|        7036|
      +----------+--------------------+-------------+------------+
    

    注:此结果只统计成功交易的订单中的商品


  • 周一到周末的购物分布情况?
    • SQL
    spark.sql("select dayOfWeek(order_date) as dayofweek, count(order_id) as order_nums 
               from orders 
                  where order_status = 'COMPLETE' 
               group by dayofweek 
               order by order_nums desc").show
    
     +---------+----------+                                                          
     |dayofweek|order_nums|
     +---------+----------+
     |        6|      3387|
     |        5|      3371|
     |        2|      3356|
     |        4|      3261|
     |        3|      3232|
     |        7|      3203|
     |        1|      3089|
     +---------+----------+
    
    • DataFrame API
    dfOrder.filter($"order_status" === "COMPLETE").withColumn("dayofweek", day_Of_Week($"order_date"))
    .select("dayofweek", "order_id").groupBy($"dayofweek").agg(count($"order_id").as("order_nums"))
    .sort($"order_nums".desc).show  
    
    +---------+----------+                                                          
    |dayofweek|order_nums|
    +---------+----------+
    |        6|      3387|
    |        5|      3371|
    |        2|      3356|
    |        4|      3261|
    |        3|      3232|
    |        7|      3203|
    |        1|      3089|
    +---------+----------+
    

    注:dayOfWeek和day_Of_Week是自定义函数


  • 一月到十二月的购物分布情况?
    • SQL
    // 使用date_format函数
    spark.sql("select t.month as month, count(t.month) as order_nums 
    from (select order_id, date_format(order_date, 'MM') as month, order_customer_id 
        from orders 
        where order_status = 'COMPLETE') t 
    group by t.month 
    order by order_nums desc").show(12)
    
      +-----+----------+                                                              
      |month|order_nums|
      +-----+----------+
      |   11|      2141|
      |   03|      1967|
      |   07|      1934|
      |   09|      1933|
      |   04|      1932|
      |   01|      1911|
      |   12|      1898|
      |   08|      1880|
      |   02|      1869|
      |   05|      1854|
      |   06|      1797|
      |   10|      1783|
      +-----+----------+
    
    
    • DataFrame API
    dfOrder.filter($"order_status" === "COMPLETE").withColumn("month", month_Of_Year($"order_date"))
    .select("month", "order_id").groupBy($"month").agg(count($"order_id").as("order_nums"))
    .sort($"order_nums".desc).show
    
      +-----+----------+                                                              
      |month|order_nums|
      +-----+----------+
      |   11|      2141|
      |    3|      1967|
      |    7|      1934|
      |    9|      1933|
      |    4|      1932|
      |    1|      1911|
      |   12|      1898|
      |    8|      1880|
      |    2|      1869|
      |    5|      1854|
      |    6|      1797|
      |   10|      1783|
      +-----+----------+
    
    

注:month_Of_Year是自定义函数

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 220,367评论 6 512
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 93,959评论 3 396
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 166,750评论 0 357
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 59,226评论 1 295
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 68,252评论 6 397
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,975评论 1 308
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,592评论 3 420
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 39,497评论 0 276
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 46,027评论 1 319
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 38,147评论 3 340
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 40,274评论 1 352
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,953评论 5 347
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 41,623评论 3 331
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 32,143评论 0 23
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 33,260评论 1 272
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 48,607评论 3 375
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 45,271评论 2 358

推荐阅读更多精彩内容