Thoughts on the Recall Stage of Recommender Systems

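Recommender systems are a broad topic with many modules. This note only surveys the mainstream approaches to the recall (candidate retrieval) stage, which generally fall into the following lines.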

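Content-based recommendation

Method: use only content information, with no direct use of user behavior data. The system analyzes content and recommends items similar to those in the user's browsing history, so the key problem is how to compute similarity between pieces of content. The pipeline usually has three stages: word segmentation, term weighting, and dimensionality reduction. Each stage leaves a lot of room for optimization, and different choices lead to noticeably different recommendation experiences.
Pros: (1) It does not depend on user behavior data, so new content has no cold-start problem. (2) User behavior can easily be introduced at the dimensionality-reduction stage, which lets the approach absorb some of the strengths of collaborative filtering (CF).
Cons: (1) Many places need careful tuning; without a craftsman's patience, good results are hard to get. (2) The most important module is the one that infers user intent from the user's session in real time.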
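
As a minimal sketch of the first two stages, content similarity could be computed roughly as follows. It assumes whitespace splitting as a stand-in for a real word segmenter and a smoothed TF-IDF weighting. Both are choices made here for illustration, not something the post prescribes. The dimensionality-reduction stage is left out, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Stage 1 (word segmentation): whitespace split is only a placeholder
    # for a real segmenter.
    return text.lower().split()

def tfidf_vectors(docs):
    # Stage 2 (term weighting): term frequency times a smoothed inverse
    # document frequency, stored as sparse {term: weight} dicts.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({t: (c / len(toks)) * idf[t] for t, c in counts.items()})
    return vectors

def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(viewed_idx, vectors, k=2):
    # Rank every other document by similarity to the one the user viewed.
    scored = [(cosine(vectors[viewed_idx], v), i)
              for i, v in enumerate(vectors) if i != viewed_idx]
    return [i for _, i in sorted(scored, reverse=True)[:k]]

docs = [
    "deep learning for image recognition",
    "image recognition with convolutional networks",
    "stock market falls on rate fears",
    "central bank raises interest rate",
]
vectors = tfidf_vectors(docs)
nearest = most_similar(0, vectors, k=1)
```

Here the nearest neighbor of document 0 is document 1, purely because they share the terms "image" and "recognition". This keyword dependence is the limitation that the dimensionality-reduction stage is meant to soften.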

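CF-based recommendation

Method: use only user behavior data and ignore content information. Working from user behavior vectors, item-based and user-based methods recommend similar items, or items liked by similar users.
Pros: (1) When user behavior data is plentiful, item-based and user-based collaborative filtering is very general and readily produces good results. (2) Behavioral co-occurrence can surface content that is related at the topic level rather than being limited to keywords, so relevance is usually very good.
Cons: the cold-start problem for new content is severe and can only be mitigated through EE (exploration and exploitation).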
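
The item-based variant can be sketched as below, assuming implicit feedback (a set of clicked item ids per user) and cosine similarity over item co-occurrence. The data, the function names, and the scoring rule (sum of similarities to the user's items) are illustrative assumptions rather than a specific production method.

```python
import math
from collections import defaultdict

def item_similarities(user_items):
    # Invert the data to item -> set of users, then compute cosine
    # similarity between the binary user sets of each item pair.
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)
    sims = defaultdict(dict)
    items = sorted(item_users)
    for idx, a in enumerate(items):
        for b in items[idx + 1:]:
            common = len(item_users[a] & item_users[b])
            if common:
                s = common / math.sqrt(len(item_users[a]) * len(item_users[b]))
                sims[a][b] = s
                sims[b][a] = s
    return sims

def recommend(user, user_items, sims, k=3):
    # Score each unseen item by summing its similarity to the user's items.
    seen = user_items[user]
    scores = defaultdict(float)
    for item in seen:
        for other, s in sims.get(item, {}).items():
            if other not in seen:
                scores[other] += s
    return sorted(scores, key=lambda it: (-scores[it], it))[:k]

user_items = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"b", "c", "d"},
    "u4": {"a"},
}
sims = item_similarities(user_items)
recs_u4 = recommend("u4", user_items, sims)
```

A brand-new item with no interactions never enters `sims`, so it can never be recommended. That is the cold-start problem in miniature.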

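Hybrid recommendation combining content and CF

Method: use both user behavior and content information, typically through a feature-based model.
Pros: (1) It is theoretically complete: the model's generalization ability addresses the cold start of new content, and on small datasets its offline metrics often beat CF. (2) Results can keep improving as model complexity increases.
Cons: the engineering is hard, because it requires a service that scores each user against a massive item catalog. On this point, see the Facebook article recommending-items-to-more-than-a-billion-people, excerpted below.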
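
To illustrate how a feature-based model can score an item it has never seen, here is a small sketch using logistic regression trained with plain SGD. The feature scheme (user id crossed with each content tag), the hyperparameters, and the sample data are all assumptions made for this example. Real systems use far richer features and models.

```python
import math

def features(user, item_tags):
    # Cross the user id with each content tag of the item. The item id
    # itself is not a feature, so unseen items can still be scored.
    return [f"{user}|{tag}" for tag in item_tags]

def predict(weights, feats):
    # Logistic model: sigmoid of the sum of active feature weights.
    z = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, epochs=50, lr=0.5):
    # Plain SGD on the log loss; samples are (user, item_tags, clicked).
    weights = {}
    for _ in range(epochs):
        for user, tags, label in samples:
            feats = features(user, tags)
            err = label - predict(weights, feats)
            for f in feats:
                weights[f] = weights.get(f, 0.0) + lr * err
    return weights

# Behavior data (clicks) joined with content data (tags).
samples = [
    ("u1", ["sports"], 1), ("u1", ["finance"], 0),
    ("u2", ["finance"], 1), ("u2", ["sports"], 0),
]
weights = train(samples)

# A new item with zero interactions, described only by its tags.
new_item_tags = ["sports"]
p_u1 = predict(weights, features("u1", new_item_tags))
p_u2 = predict(weights, features("u2", new_item_tags))
```

The new item gets a high score for `u1` and a low one for `u2` even though nobody has interacted with it, which is the generalization benefit described above. The cost is that every (user, item) pair now needs a model evaluation at serving time.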

Item recommendation computation
In order to get the actual recommendations for all users, we need to find items with highest predicted ratings for each user. When dealing with the huge data sets, checking the dot product for each (user, item) pair becomes unfeasible, even if we distribute the problem to more workers. We needed a faster way to find the top K recommendations for each user, or a good approximation of it.

One possible solution is to use a ball tree data structure to hold our item vectors. A ball tree is a binary tree where leafs contain some subset of item vectors, and each inner node defines a ball that surrounds all vectors within its subtree. Using formulas for the upper bound on the dot product for the query vector and any vector within the ball, we can do greedy tree traversal, going first to the more promising branch, and prune subtrees that can’t contain the solution better than what we have already found. This approach showed to be 10-100x faster than looking into each pair, making search for recommendations on our data sets finish in reasonable time. We also added an option to allow for specified error when looking for top recommendations to speed up calculations even more.
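
The ball tree idea can be sketched in a few dozen lines. This is not Facebook's implementation: the split rule (median of the widest dimension), the leaf size, and the random data are assumptions chosen for brevity. The pruning bound is the standard one, q·v ≤ q·c + ‖q‖·r for any vector v inside a ball with center c and radius r.

```python
import heapq
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class BallNode:
    """A ball (center, radius) enclosing the item vectors listed in ids."""

    def __init__(self, ids, vectors, leaf_size=8):
        self.ids = ids
        dim = len(vectors[ids[0]])
        self.center = [sum(vectors[i][d] for i in ids) / len(ids) for d in range(dim)]
        self.radius = max(math.dist(vectors[i], self.center) for i in ids)
        self.left = self.right = None
        if len(ids) > leaf_size:
            # Split at the median of the widest dimension (a simple heuristic).
            def spread(d):
                values = [vectors[i][d] for i in ids]
                return max(values) - min(values)
            widest = max(range(dim), key=spread)
            ordered = sorted(ids, key=lambda i: vectors[i][widest])
            mid = len(ordered) // 2
            self.left = BallNode(ordered[:mid], vectors, leaf_size)
            self.right = BallNode(ordered[mid:], vectors, leaf_size)

    def upper_bound(self, query, query_norm):
        # Largest possible dot product between the query and any vector in the ball.
        return dot(query, self.center) + query_norm * self.radius

def top_k(root, vectors, query, k):
    query_norm = math.sqrt(dot(query, query))
    best = []  # min-heap of (score, item_id); best[0] is the current k-th best

    def visit(node):
        # Prune: nothing in this subtree can beat the current k-th best score.
        if len(best) == k and node.upper_bound(query, query_norm) <= best[0][0]:
            return
        if node.left is None:
            for i in node.ids:
                score = dot(query, vectors[i])
                if len(best) < k:
                    heapq.heappush(best, (score, i))
                elif score > best[0][0]:
                    heapq.heapreplace(best, (score, i))
            return
        # Greedy traversal: descend into the more promising child first.
        children = sorted((node.left, node.right),
                          key=lambda n: n.upper_bound(query, query_norm),
                          reverse=True)
        for child in children:
            visit(child)

    visit(root)
    return sorted(best, reverse=True)

rng = random.Random(0)
item_vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
query = [rng.gauss(0, 1) for _ in range(8)]
root = BallNode(list(range(len(item_vectors))), item_vectors)
result = top_k(root, item_vectors, query, k=5)
```

`result` holds `(score, item_id)` pairs in descending score order and matches an exhaustive scan up to floating-point rounding. The allowed-error option mentioned in the excerpt would correspond to loosening the pruning test, which this sketch does not do.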

Another way the problem can be approximately solved is by clustering items based on the item feature vectors — which reduces the problem to finding top cluster recommendations and then extracting the actual items based on these top clusters. This approach speeds up the computation, while slightly degrading the quality of recommendations based on the experimental results. On the other hand, the items in a cluster are similar, and we can get a diverse set of recommendations by taking a limited number of the items from each cluster. Note that we also have k-means clustering implementation on top of Giraph, and incorporating this step in the calculation was very easy.
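
The clustering shortcut can be sketched the same way. The version below assumes a tiny hand-rolled k-means (the excerpt mentions a Giraph implementation, which is not reproduced here) and hypothetical parameters `n_probe` (how many top clusters to open) and `per_cluster` (a cap on items taken from each cluster, used for diversity).

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kmeans(vectors, k, iters=10, seed=0):
    # Minimal Lloyd's algorithm; centroids start at k sampled data points.
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        for i, v in enumerate(vectors):
            assign[i] = min(range(k), key=lambda c: math.dist(v, centroids[c]))
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

def top_k_via_clusters(user, vectors, centroids, assign, k, n_probe, per_cluster=None):
    # Step 1: rank clusters by the user's dot product with each centroid.
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(user, centroids[c]), reverse=True)
    # Step 2: score only the items inside the top n_probe clusters.
    picked = []
    for c in ranked[:n_probe]:
        members = [i for i, a in enumerate(assign) if a == c]
        members.sort(key=lambda i: dot(user, vectors[i]), reverse=True)
        picked.extend(members[:per_cluster])  # per_cluster=None keeps all
    picked.sort(key=lambda i: dot(user, vectors[i]), reverse=True)
    return picked[:k]

rng = random.Random(1)
item_vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(300)]
user_vector = [rng.gauss(0, 1) for _ in range(8)]
centroids, assign = kmeans(item_vectors, k=10)

# Approximate: open only the 3 most promising clusters.
approx = top_k_via_clusters(user_vector, item_vectors, centroids, assign,
                            k=5, n_probe=3)
# Diverse: at most one item from each of the top 5 clusters.
diverse = top_k_via_clusters(user_vector, item_vectors, centroids, assign,
                             k=5, n_probe=5, per_cluster=1)
```

Opening fewer clusters means fewer dot products but a greater chance of missing a true top item, which is the quality trade-off the excerpt describes. Opening every cluster degenerates to the exhaustive scan.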
