Notes on 《美团机器学习实践》 (Meituan Machine Learning in Practice)

https://book.douban.com/subject/30243136/

Performance Metric

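  • F1 score: the harmonic mean of precision P and recall R, i.e., 2/F = 1/P + 1/R
  • Other interpretations of AUC:
    • Wilcoxon rank-sum (Mann-Whitney U) statistic: the probability that a randomly chosen positive is scored above a randomly chosen negative
    • Gini index: Gini + 1 = 2 * AUC
    • Depends only on the ranking of the predicted scores, not on their absolute values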
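
The metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not the book's code: the function names `f1_score`, `auc_by_pairs`, and `gini_from_auc` are made up for this note, and the pairwise AUC is quadratic in the number of samples, so it only suits small examples.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: 2/F = 1/P + 1/R."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2.0 / (1.0 / precision + 1.0 / recall)


def auc_by_pairs(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (the rank-sum convention).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc):
    """Gini index from the identity Gini + 1 = 2 * AUC."""
    return 2.0 * auc - 1.0


# Toy data (invented for illustration).
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]

auc = auc_by_pairs(labels, scores)
# A monotone transform changes the score values but not their ranking.
auc_cubed = auc_by_pairs(labels, [s ** 3 for s in scores])
```

Here 5 of the 6 positive-negative pairs are ordered correctly, so `auc` is 5/6, and `auc_cubed` is identical because only the ranking matters.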

Feature Engineering and Feature Selection

Continuous Variables

  • Bucketing continuous variables (by equal width or by percentile), for example before feeding them to logistic regression
  • Missing value treatment (imputation, or a dummy variable that encodes missingness)
  • Feed the tree nodes of an RF (random forest) into linear models as features
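
As a rough sketch of the first two items, the snippet below imputes a missing value with the median while keeping a missingness indicator, then buckets the result by equal width and by percentile. The helper names and the toy data are assumptions made for this note, and the tree-node idea from the third item is not covered here.

```python
import bisect
import statistics


def equal_width_edges(values, n_buckets):
    """Interior cut points that split [min, max] into equal-width buckets."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_buckets
    return [lo + step * i for i in range(1, n_buckets)]


def percentile_edges(values, n_buckets):
    """Interior cut points read off the sorted data, so bucket sizes are roughly equal."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // n_buckets] for i in range(1, n_buckets)]


def bucketize(value, edges):
    """Index of the bucket that `value` falls into."""
    return bisect.bisect_right(edges, value)


def impute_with_indicator(values):
    """Replace None with the median and keep a 0/1 'was missing' dummy."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [(fill, 1) if v is None else (v, 0) for v in values]


# Toy column with one missing value and one outlier (invented data).
raw = [1, 2, 3, None, 4, 5, 6, 7, 100]
filled = impute_with_indicator(raw)
values = [v for v, _ in filled]

width_edges = equal_width_edges(values, 4)
pct_edges = percentile_edges(values, 4)
width_buckets = [bucketize(v, width_edges) for v in values]
pct_buckets = [bucketize(v, pct_edges) for v in values]
```

With the outlier at 100, equal-width bucketing puts every other value into bucket 0, while percentile bucketing spreads the values across all four buckets, which is one reason to prefer percentiles for skewed features.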

Discrete Variables

  • Feature crosses (interactions between discrete variables)
  • Statistics (e.g., the number of unique values of B for each value of A)
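
A small sketch of both ideas, assuming rows are plain dictionaries. The field names (`user`, `city`, `category`) and the helpers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def cross_feature(a, b):
    """Combine two discrete values into one crossed categorical value."""
    return f"{a}_x_{b}"


def distinct_count_per_key(rows, key, other):
    """For each value of `key`, count the distinct values of `other`."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key]].add(row[other])
    return {k: len(v) for k, v in seen.items()}


# Invented toy rows.
rows = [
    {"user": "u1", "city": "Beijing", "category": "food"},
    {"user": "u1", "city": "Beijing", "category": "hotel"},
    {"user": "u1", "city": "Shanghai", "category": "food"},
    {"user": "u2", "city": "Beijing", "category": "food"},
]

crossed = [cross_feature(r["city"], r["category"]) for r in rows]
categories_per_user = distinct_count_per_key(rows, "user", "category")
```

The crossed value can then be one-hot encoded like any other discrete feature, and the per-key counts can be joined back onto each row as a numeric feature.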

Time, Space, Text Features

Popular Models

Logistic Regression:

  • Why not OLS: it is sensitive to outliers
  • How to solve: gradient descent (GD) or stochastic GD (e.g., Google's FTRL)
  • Advantages: fast, scalable
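
To make the solver bullet concrete, here is a minimal sketch of logistic regression trained with plain stochastic gradient descent. It is not FTRL, which adds per-coordinate learning rates and regularization. The learning rate, epoch count, and toy data are assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, w, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def sgd_logistic(X, y, lr=0.1, epochs=500, seed=0):
    """Fit weights by updating on one shuffled example at a time."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            # Gradient of the log loss with respect to the logit is (p - y).
            g = predict_proba(X[i], w, b) - y[i]
            w = [wj - lr * g * xj for wj, xj in zip(w, X[i])]
            b -= lr * g
    return w, b


# Invented one-dimensional toy data: larger x means the positive class.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = sgd_logistic(X, y)
```

Each update touches a single example, which is why this style of solver scales to large, streaming datasets.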

FM

  • Motivation:
    • Feature interaction (not done manually)
    • Polynomial kernel (too many parameters, and the data matrix is too sparse to estimate them)
  • Approach:
    • Instead of learning a separate weight for every co-occurrence of features i and j, the weight w_ij is computed as the dot product of two k-dimensional latent vectors, v_i and v_j.
    • This imposes a low-rank assumption on the weight matrix W so that it can be decomposed.
    • The parameters for different feature combinations are no longer independent.
  • Improvement:
    • FFM (field-aware FM) groups similar features into a field, and each feature learns a separate latent vector per field
  • Application:
    • Serves as an embedding for neural networks (e.g., user and ad similarity)
    • Outperforms GBDT at learning complicated feature interactions when the feature combinations are sparse
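
The factorized pairwise weight can be sketched as follows. The snippet scores one example with a second-order FM in two ways: a naive double loop over feature pairs, and the standard reformulation that needs only one pass per latent dimension. The parameter values are invented, and training is left out.

```python
def fm_predict_naive(x, w0, w, V):
    """Sum w_ij * x_i * x_j over all pairs, with w_ij = dot(v_i, v_j)."""
    n = len(x)
    out = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            w_ij = sum(a * b for a, b in zip(V[i], V[j]))
            out += w_ij * x[i] * x[j]
    return out


def fm_predict(x, w0, w, V):
    """Same score in O(k * n) using
    sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2].
    """
    n, k = len(x), len(V[0])
    out = w0 + sum(w[i] * x[i] for i in range(n))
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        out += 0.5 * (s * s - s_sq)
    return out


# Invented parameters: 3 features, latent dimension k = 2.
x = [1.0, 0.0, 1.0]
w0 = 0.1
w = [0.2, 0.3, 0.4]
V = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]

score_naive = fm_predict_naive(x, w0, w, V)
score_fast = fm_predict(x, w0, w, V)
```

Both functions give the same score. Because every feature shares its latent vector across all pairs it appears in, a pair that never co-occurs in training still gets a non-zero interaction weight.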

GBDT
Advantages over linear models: handles missing values, differences in attribute ranges, outliers, interactions, and non-linear decision boundaries

Data Mining
