对用户行为记录进行物理切分session

背景

需要对用户的访问日志记录进行转化分析,由于日志记录没有业务session,需要物理地对访问记录进行session切分,规则是30分钟内的用户操作为一个会话(session),如果两个操作之间的时间间隔超过30分钟,则判定为两个session。
环境:GreenPlum(postgresql)

数据

/*
用户ID:bid
访问时间:event_time
页面key:event_key
*/
WITH log_data AS(
    SELECT
        bid,
        event_time::TIMESTAMP AS event_time,
        event_key,
        /* 访问序号 */
        ROW_NUMBER() OVER(PARTITION BY bid ORDER BY event_time) AS num
    FROM
    (
        SELECT '123' AS bid, '2017-11-20 10:01:56' AS event_time, 'page1' AS event_key
        UNION ALL
        SELECT '123' AS bid, '2017-11-20 10:05:56' AS event_time, 'page2' AS event_key
        UNION ALL
        SELECT '123' AS bid, '2017-11-20 10:43:56' AS event_time, 'page1' AS event_key
        UNION ALL
        SELECT '123' AS bid, '2017-11-20 10:46:56' AS event_time, 'page2' AS event_key
        UNION ALL
        SELECT '123' AS bid, '2017-11-20 10:49:56' AS event_time, 'page3' AS event_key
    )tmp
)
SELECT * FROM log_data;
图一

切分session过程

需要找到一个记录A的时间减去上一个记录B的时间差值大于30分钟,如下图中,需要找到箭头指向的那条10:43:56访问的记录

图二

问题转化为寻找每个session的起始访问记录,详细sql如下:

SELECT
    t1.*,
    t2.num,
    /* 如果记录为首次或两个记录时间差大于30分钟则打上标记 */
    CASE WHEN t2. event_time is NULL OR t1.event_time - t2.event_time > INTERVAL '30 minute' THEN t1.num ELSE null END AS time_diff
FROM
    log_data t1
LEFT JOIN
    /* 前后记录关联 */
    log_data t2 on t1.bid = t2.bid AND t1.num = t2.num + 1

根据t1.bid = t2.bid AND t1.num = t2.num + 1关联得到切分点的num,如下图中的time_diff


图三

通过time_diff is not null 得到具体的切分点

SELECT
    bid,
    time_diff,
    event_time
FROM
    (
    SELECT
        t1.*,
        t2.num,
        /* 如果记录为首次或两个记录时间差大于30分钟则打上标记 */
        CASE WHEN t2. event_time is NULL OR t1.event_time - t2.event_time > INTERVAL '30 minute' THEN t1.num ELSE null END AS time_diff
    FROM
        log_data t1
    LEFT JOIN
        /* 前后记录关联 */
        log_data t2 on t1.bid = t2.bid AND t1.num = t2.num + 1
    )t02
WHERE 
    t02.time_diff IS NOT NULL
图四

图一跟图四通过t01.bid = t02.bid AND t01.num >= t02.time_diff关联排序过滤最终完成session切分

SELECT
    *
FROM
    (
        SELECT
            t01.*,
            t02.event_time,
            time_diff,
            /* 取分等值连接后的第一条数据 */
            ROW_NUMBER() OVER(PARTITION BY t01.bid,t01.num ORDER BY t02.time_diff desc) AS num_2
        FROM
            log_data t01
        LEFT JOIN
        (
            SELECT
                bid,
                time_diff,
                event_time
            FROM
                (
                SELECT
                    t1.*,
                    t2.num,
                    /* 如果记录为首次或两个记录时间差大于30分钟则打上标记 */
                    CASE WHEN t2. event_time is NULL OR t1.event_time - t2.event_time > INTERVAL '30 minute' THEN t1.num ELSE null END AS time_diff
                FROM
                    log_data t1
                LEFT JOIN
                    /* 前后记录关联 */
                    log_data t2 on t1.bid = t2.bid AND t1.num = t2.num + 1
                )t02
            WHERE 
                t02.time_diff IS NOT NULL
        )t02 ON t01.bid = t02.bid AND t01.num >= t02.time_diff
    )tmp
/* 取num_2为1的记录 */
WHERE num_2 = 1
图五
图六

再添加session_id字段即可,本文采用md5(bid+event_time1)作为session_id

©著作权归作者所有,转载或内容合作请联系作者
平台声明:文章内容(如有图片或视频亦包括在内)由作者上传并发布,文章内容仅代表作者本人观点,简书系信息发布平台,仅提供信息存储服务。

推荐阅读更多精彩内容