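Here is the problem: we have distance data for multiple trips from a start to a destination, and we need a SQL query that returns the average trip distance between each pair of places. Trips from Chengdu to Chongqing and from Chongqing to Chengdu both count as trips between the same two places, so they must be combined. The data lives in a table named `distance` with the columns `start`, `end`, and `distance`.
Below are two approaches. The first aggregates by start and destination to get the total distance and the number of trips for each direction, numbers the aggregated rows with a simple window function, and then joins the aggregated table to itself once; a simple condition on the join gives the result we want.
Start with the aggregation step. It is straightforward and needs only GROUP BY, but we also add a ROW_NUMBER function to assign each row an id for later use. The SQL is as follows: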
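To follow along, a minimal sketch of such a table is below. The table and column names are taken from the queries in this post; the rows and distance values are made up for illustration and are not the original data. `start` and `end` are quoted with backticks because END is an SQL keyword; the queries in this post use them unquoted, which MySQL accepts, but other databases may not.

```sql
-- Hypothetical sample data: the distances are illustrative only.
CREATE TABLE distance (
  `start`  VARCHAR(20) NOT NULL,  -- place of departure
  `end`    VARCHAR(20) NOT NULL,  -- destination
  distance INT         NOT NULL   -- distance travelled on this trip
);

INSERT INTO distance (`start`, `end`, distance) VALUES
  ('Chengdu',   'Chongqing', 300),
  ('Chongqing', 'Chengdu',   320),
  ('Chengdu',   'Chongqing', 310),
  ('Beijing',   'Tianjin',   120),
  ('Tianjin',   'Beijing',   130);
```

With these rows, the expected averages are 310 between Chengdu and Chongqing (930 over 3 trips) and 125 between Beijing and Tianjin (250 over 2 trips).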
SELECT
start,end,
SUM(distance) AS tot_distance,
COUNT(*) AS no_of_time,
ROW_NUMBER() OVER(ORDER BY start) as id
FROM
`distance`
GROUP BY
start,end
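Next, join the result above to itself. Using that result twice, match the start column of the first copy against the end column of the second copy, which pairs each direction with its reverse. Each pair would then appear twice, so one more condition is needed: the id of the first copy must be less than the id of the second. The SQL is as follows: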
WITH cte AS
(SELECT
start,end,
SUM(distance) AS tot_distance,
COUNT(*) AS no_of_time,
ROW_NUMBER() OVER(ORDER BY start) as id
FROM
`distance`
GROUP BY
start,end)
SELECT
t1.start,t1.end,
t1.tot_distance,t2.tot_distance,t1.no_of_time,t2.no_of_time
FROM
cte AS t1
JOIN cte AS t2 ON t1.start = t2.end AND t1.id < t2.id
That is essentially it. All that remains is to add up the distances and the trip counts from both directions and divide one by the other.
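That last step can be sketched as a single query. This is a minimal sketch, assuming MySQL 8.0 or later (for CTEs and window functions). The inline `distance` CTE stands in for the real table so the query runs on its own, and its rows are made-up values.

```sql
WITH distance (`start`, `end`, distance) AS (
  -- Hypothetical sample rows standing in for the real table
  SELECT 'Chengdu',   'Chongqing', 300 UNION ALL
  SELECT 'Chongqing', 'Chengdu',   320 UNION ALL
  SELECT 'Chengdu',   'Chongqing', 310 UNION ALL
  SELECT 'Beijing',   'Tianjin',   120 UNION ALL
  SELECT 'Tianjin',   'Beijing',   130
),
cte AS (
  -- One row per direction, with its total distance, trip count and an id
  SELECT `start`, `end`,
         SUM(distance) AS tot_distance,
         COUNT(*)      AS no_of_time,
         ROW_NUMBER() OVER (ORDER BY `start`) AS id
  FROM distance
  GROUP BY `start`, `end`
)
SELECT t1.`start`, t1.`end`,
       -- Combine both directions, then divide total distance by total trips
       (t1.tot_distance + t2.tot_distance)
         / (t1.no_of_time + t2.no_of_time) AS avg_dist
FROM cte AS t1
JOIN cte AS t2 ON t1.`start` = t2.`end` AND t1.id < t2.id;
```

With this sample data the query returns two rows: an average of 125 for Beijing and Tianjin, and 310 for Chengdu and Chongqing.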
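That completes the first approach. It works here only because no place appears in more than one pair in our data. The join matches on start = end alone, so if we add a Chengdu to Shanghai route, the Chengdu rows will also be matched with rows from other pairs and the result will be wrong. That is why we need a second approach.
First, add two rows of Chengdu-Shanghai data.
Then concatenate the start and the destination so that each direction of a route gets a unique identifier. The SQL is as follows: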
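One possible fix that stays within the first approach is to require the reverse match on both columns, not just one. This is not the route the post takes next; it is a minimal sketch assuming MySQL 8.0 or later, with made-up rows in an inline `distance` CTE standing in for the real table.

```sql
WITH distance (`start`, `end`, distance) AS (
  -- Hypothetical sample rows: Chengdu now appears in two different pairs
  SELECT 'Chengdu',   'Chongqing', 300  UNION ALL
  SELECT 'Chongqing', 'Chengdu',   320  UNION ALL
  SELECT 'Chengdu',   'Shanghai',  1960 UNION ALL
  SELECT 'Shanghai',  'Chengdu',   1980
),
cte AS (
  SELECT `start`, `end`,
         SUM(distance) AS tot_distance,
         COUNT(*)      AS no_of_time,
         ROW_NUMBER() OVER (ORDER BY `start`) AS id
  FROM distance
  GROUP BY `start`, `end`
)
SELECT t1.`start`, t1.`end`,
       (t1.tot_distance + t2.tot_distance)
         / (t1.no_of_time + t2.no_of_time) AS avg_dist
FROM cte AS t1
JOIN cte AS t2
  ON  t1.`start` = t2.`end`
  AND t1.`end`   = t2.`start`   -- added condition: t2 must be the exact reverse of t1
  AND t1.id < t2.id;            -- keep each pair only once
```

With this sample data the query returns 310 for Chengdu and Chongqing and 1970 for Chengdu and Shanghai. Because this is an inner join, a pair that was only ever travelled in one direction has no reverse row and drops out of the result.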
SELECT
CONCAT(start,"-",end) AS start,
CONCAT(end,"-",start) AS end,
distance
FROM
distance
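Then, by stacking the start-destination string and its swapped version with UNION ALL and grouping on the result, we get the total distance and trip count between the two places; one more division gives the average distance. The SQL is as follows: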
WITH ts AS
(SELECT
CONCAT(start,"-",end) AS start,
CONCAT(end,"-",start) AS end,
distance
FROM
distance)
SELECT start,SUM(distance),COUNT(*),SUM(distance)/COUNT(*) AS avg_dist FROM
(SELECT start,distance FROM ts AS t1
UNION ALL
SELECT end,distance FROM ts AS t1) tss
GROUP BY
start
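Note that the query above returns every pair twice, once per orientation (for example both Chengdu-Chongqing and Chongqing-Chengdu), with identical totals. A common alternative, not used in this post, is to build an order-independent key with LEAST and GREATEST so that each pair appears once. The sketch below assumes MySQL 8.0 or later; the column names `place_a` and `place_b` and the sample rows are hypothetical.

```sql
WITH distance (`start`, `end`, distance) AS (
  -- Hypothetical sample rows; Chengdu to Shanghai exists in one direction only
  SELECT 'Chengdu',   'Chongqing', 300  UNION ALL
  SELECT 'Chongqing', 'Chengdu',   320  UNION ALL
  SELECT 'Chengdu',   'Chongqing', 310  UNION ALL
  SELECT 'Chengdu',   'Shanghai',  1960
)
SELECT
  LEAST(`start`, `end`)    AS place_a,   -- alphabetically first of the two places
  GREATEST(`start`, `end`) AS place_b,   -- alphabetically last of the two places
  SUM(distance)            AS tot_distance,
  COUNT(*)                 AS no_of_time,
  ROUND(AVG(distance), 2)  AS avg_dist
FROM distance
GROUP BY place_a, place_b;               -- MySQL allows select aliases in GROUP BY
```

With this sample data the result has one row per pair: Chengdu and Chongqing with 3 trips averaging 310, and Chengdu and Shanghai with 1 trip averaging 1960. No join is involved, so one-way routes are kept.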
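Finally, we can split the first column back into two. On top of the query above, all it takes is a string-splitting function. The SQL is as follows:
-- ts: one identifier per direction, plus its reversed form
WITH ts AS
(SELECT
CONCAT(start,"-",end) AS start,
CONCAT(end,"-",start) AS end,
distance
FROM
distance),
-- ts1: average distance per identifier, counting both directions
ts1 AS
(SELECT start,SUM(distance)/COUNT(*) AS avg_dist FROM
(SELECT start,distance FROM ts AS t1
UNION ALL
SELECT end,distance FROM ts AS t2) tss
GROUP BY
start)
-- split the identifier back into its two place names
SELECT
SUBSTRING_INDEX(start,"-",1) AS start,
SUBSTRING_INDEX(start,"-",-1) AS end,
ROUND(avg_dist,2) AS avg_dist
FROM
ts1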
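As a quick illustration of how the split works, SUBSTRING_INDEX returns everything before the first delimiter when the count is 1, and everything after the last delimiter when the count is -1. The literal below is a made-up example value.

```sql
SELECT
  SUBSTRING_INDEX('Chengdu-Chongqing', '-', 1)  AS `start`,  -- 'Chengdu'
  SUBSTRING_INDEX('Chengdu-Chongqing', '-', -1) AS `end`;    -- 'Chongqing'
```

This relies on the place names themselves not containing the delimiter. If a name could contain "-", a different separator would be needed in the CONCAT step.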
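At this point, the concatenated start-destination strings no longer overlap across pairs, so the first approach can be used after all. The SQL is as follows: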
WITH cte AS
(SELECT
start,end,
SUM(distance) AS tot_distance,
COUNT(*) AS no_of_time,
ROW_NUMBER() OVER(ORDER BY start) as id
FROM
(SELECT
CONCAT(start,"-",end) AS start,
CONCAT(end,"-",start) AS end,
distance
FROM
distance) ts
GROUP BY
start,end)
SELECT
t1.start,
(t1.tot_distance+t2.tot_distance)/(t1.no_of_time+t2.no_of_time) AS avg_dist
FROM
cte AS t1
JOIN cte AS t2 ON t1.start = t2.end AND t1.id < t2.id
That works nicely. We now have the result in several forms, so use whichever one fits your needs.