数据:
+-------+----------+
|user_id| dt|
+-------+----------+
| 1|2020-09-01|
| 1|2020-09-02|
| 1|2020-09-03|
| 1|2020-09-04|
| 1|2020-09-05|
| 1|2020-09-07|
| 2|2020-08-01|
| 2|2020-09-02|
| 2|2020-09-03|
| 2|2020-09-04|
| 2|2020-09-05|
| 2|2020-09-07|
| 3|2020-08-01|
| 3|2020-09-02|
| 3|2020-09-04|
| 3|2020-09-04|
| 3|2020-09-05|
| 3|2020-09-07|
+-------+----------+
注意:
| 3|2020-09-04|
| 3|2020-09-04|
这里是有重复的,所以
第一步是去重复:
SELECT DISTINCT user_id, dt FROM test
+-------+----------+
|user_id| dt|
+-------+----------+
| 3|2020-09-04|
| 1|2020-09-07|
| 3|2020-08-01|
| 1|2020-09-05|
| 2|2020-09-07|
| 2|2020-09-02|
| 3|2020-09-07|
| 1|2020-09-03|
| 2|2020-09-05|
| 3|2020-09-05|
| 1|2020-09-04|
| 2|2020-09-04|
| 1|2020-09-01|
| 2|2020-09-03|
| 2|2020-08-01|
| 3|2020-09-02|
| 1|2020-09-02|
+-------+----------+
第二步:
同一个user_id的登录时间进行排序
SELECT user_id, dt, row_number() over(partition BY user_id order by dt) AS rk FROM (SELECT DISTINCT user_id, dt FROM test)
+-------+----------+---+
|user_id| dt| rk|
+-------+----------+---+
| 3|2020-08-01| 1|
| 3|2020-09-02| 2|
| 3|2020-09-04| 3|
| 3|2020-09-05| 4|
| 3|2020-09-07| 5|
| 1|2020-09-01| 1|
| 1|2020-09-02| 2|
| 1|2020-09-03| 3|
| 1|2020-09-04| 4|
| 1|2020-09-05| 5|
| 1|2020-09-07| 6|
| 2|2020-08-01| 1|
| 2|2020-09-02| 2|
| 2|2020-09-03| 3|
| 2|2020-09-04| 4|
| 2|2020-09-05| 5|
| 2|2020-09-07| 6|
+-------+----------+---+
第三步:
用dt减去排名之后,如果时间是连续的,那么结果相同。
SELECT user_id, date_add(dt, - rk) AS sim FROM (SELECT user_id, dt, row_number() over(partition BY user_id order by dt) AS rk FROM (SELECT DISTINCT user_id, dt FROM test))
+-------+----------+
|user_id| sim|
+-------+----------+
| 3|2020-07-31|
| 3|2020-08-31|
| 3|2020-09-01|
| 3|2020-09-01|
| 3|2020-09-02|
| 1|2020-08-31|
| 1|2020-08-31|
| 1|2020-08-31|
| 1|2020-08-31|
| 1|2020-08-31|
| 1|2020-09-01|
| 2|2020-07-31|
| 2|2020-08-31|
| 2|2020-08-31|
| 2|2020-08-31|
| 2|2020-08-31|
| 2|2020-09-01|
+-------+----------+
第四步:
对相同的时间进行求和:
SELECT user_id, sim, COUNT(sim) AS cs FROM (SELECT user_id, date_add(dt, - rk) AS sim FROM (SELECT user_id, dt, row_number() over(partition BY user_id order by dt) AS rk FROM (SELECT DISTINCT user_id, dt FROM test))) GROUP BY user_id, sim
+-------+----------+---+
|user_id| sim| cs|
+-------+----------+---+
| 3|2020-07-31| 1|
| 3|2020-08-31| 1|
| 3|2020-09-01| 2|
| 3|2020-09-02| 1|
| 1|2020-08-31| 5|
| 1|2020-09-01| 1|
| 2|2020-07-31| 1|
| 2|2020-08-31| 4|
| 2|2020-09-01| 1|
+-------+----------+---+
第五步:
求最大连续天数:
SELECT user_id, MAX(cs) AS max_date FROM (SELECT user_id, sim, COUNT(sim) AS cs FROM (SELECT user_id, date_add(dt, - rk) AS sim FROM (SELECT user_id, dt, row_number() over(partition BY user_id order by dt) AS rk FROM (SELECT DISTINCT user_id, dt FROM test))) GROUP BY user_id, sim) GROUP BY user_id
+-------+--------+
|user_id|max_date|
+-------+--------+
| 3| 2|
| 1| 5|
| 2| 4|
+-------+--------+
整个sql:
SELECT user_id, MAX(cs) AS max_date
FROM
(
SELECT user_id, sim, COUNT(sim) AS cs
FROM
(
SELECT user_id, date_add(dt, - rk) AS sim
FROM
(
SELECT dt, user_id, row_number() over(partition BY user_id order by dt) AS rk
FROM
(
SELECT DISTINCT user_id, dt FROM test
)
)
) GROUP BY user_id, sim
) GROUP BY user_id