Paper Reading: Alibaba - Deep Interest Evolution Network for Click-Through Rate Prediction

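The previous post covered Alibaba's Deep Interest Network for CTR Prediction. The paper discussed here is Alibaba's follow-up to that work; it was posted to arXiv in 2018 and published at AAAI 2019.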

Outline

The problem addressed
Method
Takeaways and open questions

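1. The problem addressed

A user's interests change over time. In the Taobao shopping scenario, however, a user's current purchase behavior is not necessarily strongly related to the previous one, so this paper addresses how to model the evolution of user interest in that setting.

To address this, the paper makes two main contributions:

auxiliary loss (interest extractor layer)
AUGRU (interest evolving layer)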

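2. Method

Features are processed the same way as in the previous post, so I won't repeat that here and will start directly from the model. The model has two main parts, the Interest Extractor Layer and the Interest Evolving Layer; the Behavior Layer is simply the user's time-ordered behavior sequence.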

[Figure: DIEN model architecture]

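At the level of the whole model, the framework is the same as in the previous post: the input features again consist of user behavior, target Ad, context features, and user profile features. The difference is that user behavior is now modeled as a sequence, and that sequence model is the core of DIEN. In the Interest Extractor Layer the paper uses a GRU and improves that layer's loss with an auxiliary loss. In the Interest Evolving Layer, because users browse many kinds of items, a behavior is not necessarily strongly related to the previous one, so the paper again uses attention: attention scores are computed between the outputs of the Interest Extractor Layer and the target Ad, and an AUGRU then produces the final user behavior representation. Finally, the representations of user behavior, target Ad, context features, and user profile features are concatenated/flattened and fed into an MLP.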

2.1 Interest Extractor Layer

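The paper uses a two-week history window; if user behavior is sparse, the window could also be lengthened.

The paper treats a click as a 0/1 classification problem and uses the log-loss: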

$L_{target} = -\frac{1}{N}\sum_{(\textbf{x}, y) \in D}^{N}\left(y\log p(\textbf{x}) + (1-y)\log (1-p(\textbf{x}))\right)$

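$N$ is the size of the dataset and $\textbf{x}=[\textbf{x}_p, \textbf{x}_a, \textbf{x}_c, \textbf{x}_b]$ , where $\textbf{x}_p, \textbf{x}_a, \textbf{x}_c, \textbf{x}_b$ denote user profile, ad, context, and user behavior respectively. User behavior here is the user's historical click sequence; in the paper, context covers features such as time.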
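To make the formula concrete, here is a minimal NumPy sketch of the log-loss. The function name `log_loss`, the `eps` clipping, and the toy labels and probabilities are my own illustrative choices, not something taken from the paper.

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-12):
    """Mean negative log-likelihood for binary click labels (L_target)."""
    y = np.asarray(y_true, dtype=float)
    # Clip predictions so that log(0) never occurs (an implementation detail, assumed here).
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Toy data: 1 = click, 0 = no click; probs stand in for the model output p(x).
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4]
loss = log_loss(labels, probs)
```

Confident, correct predictions push the loss toward zero, while confident wrong ones are penalized heavily.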

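The paper points out that because the target item is triggered by the final interest, $L_{target}$ can only supervise the final interest, so the earlier hidden states are not learned effectively. The paper assumes that each behavior directly influences the next one, and therefore proposes an auxiliary loss so that every $h_t$ is also properly supervised.

Original text: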

As the click behavior of target item is triggered by final interest, the label used in $L_{target}$ only contains the ground truth that supervises final interest’s prediction, while history state $h_t$ (t < T) can’t obtain proper supervision.

$L_{aux} = -\frac{1}{M}(\sum_{i=1}^{M}\sum_t \log \sigma(\textbf{h}_t, \textbf{e}_b^i[t+1]) + \log (1-\sigma(\textbf{h}_t, \hat{\textbf{e}}_b^i[t+1])))$

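$M$ is the number of pairs $\{\textbf{e}_b^i,\hat {\textbf{e}} _b^i\}$ , where $\textbf{e}_b^i$ is the clicked behavior sequence and $\hat {\textbf{e}} _b^i$ is the non-clicked (negative sample) sequence.

$\textbf{e}_b^i[t] \in G$ is the embedding vector of the item that user $i$ clicked at step $t$, and $\hat {\textbf{e}} _b^i[t]\in G-\textbf{e}_b^i[t]$ ; note that the negative samples here are drawn from the full set of items.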
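The sampling rule above (any item except the one clicked at that step) can be sketched as follows. This is only an illustration under my own assumptions: items are identified by integer ids `0 .. num_items - 1`, sampling is uniform, and `sample_negatives` is a hypothetical helper name; the paper does not prescribe this exact procedure.

```python
import numpy as np

def sample_negatives(clicked_ids, num_items, rng):
    """For each clicked item id, draw one id uniformly from the other items."""
    clicked = np.asarray(clicked_ids)
    # Draw from num_items - 1 slots, then shift every draw at or above the
    # clicked id by one so the clicked id itself can never be returned.
    draws = rng.integers(0, num_items - 1, size=clicked.shape)
    return draws + (draws >= clicked)

rng = np.random.default_rng(0)
clicked = np.array([0, 5, 9, 3])   # toy click sequence of one user
negatives = sample_negatives(clicked, num_items=10, rng=rng)
```

Looking up the embeddings of `negatives` would then give the $\hat {\textbf{e}} _b^i[t]$ sequence.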

$L = L_{target} + \alpha * L_{aux}$

$\alpha$ is a hyperparameter that balances interest representation against CTR prediction.
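Here is a rough NumPy sketch of $L_{aux}$ and the combined loss. I am assuming that $\sigma(\textbf{h}_t, \textbf{e})$ means the sigmoid of the inner product of the two vectors, that hidden states and embeddings share the same dimension, and that all sequences have equal length; the function names and the `eps` term are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def auxiliary_loss(h, e_pos, e_neg, eps=1e-12):
    """Sketch of L_aux.

    h:     (M, T, d) GRU hidden states h_t
    e_pos: (M, T, d) embeddings of the clicked items
    e_neg: (M, T, d) embeddings of the sampled negative items
    """
    # h_t is scored against the behavior at step t+1, so drop the last
    # hidden state and the first behavior of every sequence.
    pos_logit = np.sum(h[:, :-1] * e_pos[:, 1:], axis=-1)  # (M, T-1)
    neg_logit = np.sum(h[:, :-1] * e_neg[:, 1:], axis=-1)  # (M, T-1)
    log_lik = (np.log(sigmoid(pos_logit) + eps)
               + np.log(1.0 - sigmoid(neg_logit) + eps))
    return float(-log_lik.sum() / h.shape[0])

def total_loss(l_target, l_aux, alpha):
    """L = L_target + alpha * L_aux."""
    return l_target + alpha * l_aux

rng = np.random.default_rng(0)
M, T, d = 2, 4, 3                      # toy sizes
h = rng.normal(size=(M, T, d))
e_pos = rng.normal(size=(M, T, d))
e_neg = rng.normal(size=(M, T, d))
l_aux = auxiliary_loss(h, e_pos, e_neg)
```

The loss shrinks when each $\textbf{h}_t$ points toward the next clicked item and away from the sampled negative.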

Benefits of the auxiliary loss:

helps each hidden state of GRU represent interest expressively.
reduces the difficulty of back propagation when GRU models long history behavior sequence
gives more semantic information for the learning of embedding layer, which leads to a better embedding matrix

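(Personal thought: drawing the negative samples from items that were actually shown to the user might work better; the Airbnb paper mentions something similar.)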

2.2 Interest Evolving Layer

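First, a quick look at the GRU.

[Figure: GRU update equations]

Here $\textbf{i}_t$ denotes the user's behavior at step $t$, which is the input to the GRU.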
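For reference, the GRU update that the variants in this section modify can be written as follows. This is my own transcription of the usual GRU formulation; the weight names $W^u, U^u, \textbf{b}^u$ (and likewise for $r$ and $h$) are notation assumed here, $\sigma$ is the sigmoid function, and $\circ$ is the element-wise product.

$\textbf{u}_t = \sigma(W^u \textbf{i}_t + U^u \textbf{h}_{t-1} + \textbf{b}^u)$
$\textbf{r}_t = \sigma(W^r \textbf{i}_t + U^r \textbf{h}_{t-1} + \textbf{b}^r)$
$\tilde{\textbf{h}}_t = \tanh(W^h \textbf{i}_t + \textbf{r}_t \circ U^h \textbf{h}_{t-1} + \textbf{b}^h)$
$\textbf{h}_t = (1-\textbf{u}_t) \circ \textbf{h}_{t-1} + \textbf{u}_t \circ \tilde{\textbf{h}}_t$

$\textbf{u}_t$ is the update gate and $\textbf{r}_t$ the reset gate; the last line is the one that AGRU and AUGRU change.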

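The attention score is computed as follows, where $\textbf{e}_a$ is the embedding of the target Ad and $\textbf{W}$ is a learned matrix:
$a_t = \frac{\exp (\textbf{h}_t \textbf{W} \textbf{e}_a)}{\sum_{j=1}^T{\exp (\textbf{h}_j \textbf{W} \textbf{e}_a)}}$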
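The attention formula is a softmax over the bilinear scores of each hidden state against the target Ad. A small NumPy sketch, with `attention_scores` and the toy shapes being my own assumptions:

```python
import numpy as np

def attention_scores(H, W, e_a):
    """Softmax over h_t W e_a.

    H:   (T, n_h) hidden states h_1 .. h_T
    W:   (n_h, n_a) learned matrix
    e_a: (n_a,) target-ad embedding
    """
    logits = H @ W @ e_a              # one score per time step, shape (T,)
    logits = logits - logits.max()    # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Toy example: three time steps, identity W for readability.
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
W = np.eye(2)
e_a = np.array([1.0, 2.0])
a = attention_scores(H, W, e_a)       # raw scores are 1, 2, 3 before the softmax
```

The step whose hidden state matches the target Ad best (the third one here) receives the largest weight.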

The paper tries several ways of combining the attention mechanism with the GRU:

AIGRU

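Apply the attention score directly to the hidden state $\textbf{h}_t$ of the interest extractor layer, and use the result as the input of the evolving GRU:

$\textbf{i}'_t=\textbf{h}_t * a_t$
My understanding is that $a_t$ is applied directly to the hidden state (the historical information).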

However, AIGRU works not very well. Because even zero input can also change the hidden state of GRU, so the less relative interests also affect the learning of interest evolving.
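The point about zero input is easy to check numerically. Below is a minimal sketch of a single GRU step with randomly initialized weights; `gru_step`, the parameter dictionary, and the sizes are hypothetical names chosen for illustration. Even when the input is scaled all the way down to zero, the recurrent terms still move the hidden state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(i_t, h_prev, p):
    """One standard GRU step; p holds the weight matrices and biases."""
    u = sigmoid(p["Wu"] @ i_t + p["Uu"] @ h_prev + p["bu"])   # update gate
    r = sigmoid(p["Wr"] @ i_t + p["Ur"] @ h_prev + p["br"])   # reset gate
    h_tilde = np.tanh(p["Wh"] @ i_t + r * (p["Uh"] @ h_prev) + p["bh"])
    return (1.0 - u) * h_prev + u * h_tilde

rng = np.random.default_rng(0)
n_in, n_h = 4, 3                        # toy sizes
params = {}
for gate in ("u", "r", "h"):
    params["W" + gate] = rng.normal(size=(n_h, n_in))
    params["U" + gate] = rng.normal(size=(n_h, n_h))
    params["b" + gate] = np.zeros(n_h)

h_prev = rng.normal(size=n_h)
# AIGRU with a_t = 0 feeds a zero vector as input.
h_next = gru_step(np.zeros(n_in), h_prev, params)
changed = not np.allclose(h_next, h_prev)
```

So a low attention score on the input does not stop an unrelated behavior from altering the state, which is the weakness the next two variants address.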

AGRU

Replace the update gate $\textbf{u}_t$ with $a_t$ :

$\textbf{h}'_t = (1-a_t) * \textbf{h}'_{t-1} + a_t * \tilde{\textbf{h}}_t'$

AGRU weakens the effect from less related interest during interest evolving. The embedding of attention into GRU improves the influence of attention mechanism, and helps AGRU overcome the defects of AIGRU.

AUGRU

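$\tilde{\textbf{u}}_t' = a_t * \textbf{u}_t'$
$\textbf{h}_t' = (1-\tilde{\textbf{u}}_t') \circ \textbf{h}_{t-1}' + \tilde{\textbf{u}}_t' \circ \tilde{\textbf{h}}_t'$
In the words of one of the paper's authors: "AUGRU's small improvement targets the problem that AGRU ignores directional information by directly replacing a vector with a scalar."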

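My understanding: applying $a_t$ to $\textbf{u}_t$ gives the attention a broader reach, because it rescales every dimension of the update gate instead of discarding the gate's per-dimension information.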

Based on the differentiated information, we use attention score $a_t$ to scale all dimensions of update gate, which results that less related interest make less effects on the hidden state. AUGRU avoids the disturbance from interest drifting more effectively, and pushes the relative interest to evolve smoothly.
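To see the difference between AGRU and AUGRU side by side, here is a small sketch of just the final state-update line of each variant. The candidate state and update gate are passed in as given values; the function names and the toy numbers are mine.

```python
import numpy as np

def gru_update(h_prev, h_tilde, u_t):
    """Plain GRU: the vector update gate u_t mixes old and candidate state."""
    return (1.0 - u_t) * h_prev + u_t * h_tilde

def agru_update(h_prev, h_tilde, a_t):
    """AGRU: the scalar attention score a_t replaces the update gate."""
    return (1.0 - a_t) * h_prev + a_t * h_tilde

def augru_update(h_prev, h_tilde, u_t, a_t):
    """AUGRU: a_t rescales every dimension of the update gate u_t."""
    return gru_update(h_prev, h_tilde, a_t * u_t)

h_prev = np.array([1.0, -1.0, 0.5])
h_tilde = np.array([0.0, 1.0, 0.5])
u_t = np.array([0.2, 0.8, 0.5])        # per-dimension update gate

h_agru = agru_update(h_prev, h_tilde, a_t=0.5)          # every dimension moves by the same fraction
h_augru = augru_update(h_prev, h_tilde, u_t, a_t=0.5)   # each dimension keeps its own gate, scaled by a_t
```

With `a_t = 0` both variants leave the state untouched; with `a_t = 1` AUGRU falls back to the plain GRU update, whereas AGRU jumps straight to the candidate state.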

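3. Takeaways and open questions

The paper's experimental comparisons do not offer much that stands out, so I have skipped the experiments section. Here are two thoughts of mine:

The cycle for buying a home is long, yet in my earlier analysis of user behavior I found that users click most listings only once, while a small number of listings may keep appearing throughout a user's whole purchase cycle. Decaying purely by time may lose information about listings the user is genuinely interested in; could the approach in this paper help with that?
I have always been curious about how AUGRU came about, going from a scalar to a vector; perhaps it simply came from accumulated experience.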

References

Deep Interest Evolution Network for Click-Through Rate Prediction
GitHub repository for the paper's code