Feature Selection

Ensemble Learning
  • two main methods: bagging and boosting
Bagging
  • sampling with replacement (bootstrap)
  • decreases variance by introducing randomness into the model framework
  • random forest = bagging + decision trees
Random Forest
  • description of random forest
    there are n samples with m features in the training data
    • sample n observations with replacement each time (row-wise sampling with replacement)
    • from these n observations, pick k features (k < m) without replacement (column-wise sampling without replacement) and fit the best decision tree on them
    • repeat the above steps several times, then combine all the decision trees into a random forest
  • features:
    • decreases variance by introducing randomness into the model framework
    • the distribution of each bootstrap sample of n observations is the same as that of the original training data
    • we don't need to do pruning for each "weak" decision tree
    • less overfitting
    • trees can be trained in parallel
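The row/column sampling scheme above can be sketched as a hand-rolled bagging loop over decision trees (a minimal illustration, assuming scikit-learn is available; the dataset, tree count, and the values of n, m, k are all illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy training data: n samples, m features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
n, m = X.shape
k = 4                                             # features per tree, k < m

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    rows = rng.integers(0, n, size=n)             # rows: sampled with replacement
    cols = rng.choice(m, size=k, replace=False)   # columns: sampled without replacement
    tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
    trees.append((tree, cols))                    # each unpruned "weak" tree + its columns

# The forest prediction is a majority vote over all trees.
votes = np.array([tree.predict(X[:, cols]) for tree, cols in trees])
pred = (votes.mean(axis=0) > 0.5).astype(int)
acc = float((pred == y).mean())
```

Because each tree only sees a bootstrap sample and a feature subset, the trees are decorrelated, which is where the variance reduction comes from.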
  • Feature Importance value in Random Forest
    importance(i) = performance(RF) - performance(RF^{random \, i}) (advanced topic: out-of-bag evaluation)
    how to measure the performance drop? replace the feature's column with random values (e.g. a random permutation), run the trained model on the modified data, and compare the loss
    • the sign carries no meaning; the value only shows how strongly a particular feature influences the model
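The column-scrambling idea above can be sketched like this (assuming scikit-learn; the synthetic dataset and all parameters are illustrative; a trained forest is re-scored after randomly permuting one column at a time):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False, the 2 informative features are the first 2 columns.
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
base = rf.score(X, y)                        # performance(RF)

rng = np.random.default_rng(0)
importance = []
for i in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, i] = rng.permutation(Xp[:, i])     # scramble feature i, keep its distribution
    importance.append(base - rf.score(Xp, y))  # performance(RF) - performance(RF^{random i})
```

Informative columns should show a large accuracy drop when scrambled, while noise columns should show a drop near zero.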
Support Vector Machine
  • SVM: maximize the minimum margin


    if the data is not linearly separable, introduce slack variables to tolerate noise, and map the data to a higher-dimensional space by applying a kernel function
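The effect of the kernel trick can be shown on a dataset that no straight line can separate (a minimal sketch, assuming scikit-learn; the dataset and kernel choice are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: impossible to separate with a straight line in 2-D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)   # stays in the original 2-D space
rbf = SVC(kernel="rbf").fit(X, y)         # implicit map to a higher-dimensional space

acc_linear = linear.score(X, y)
acc_rbf = rbf.score(X, y)
```

The RBF-kernel SVM separates the circles cleanly, while the linear SVM cannot do much better than chance on this geometry.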
Why Feature Selection?
  • reduce overfitting
  • better understanding of your model
  • improve model stability (i.e. improve generalization)
    It depends on the goal. If you are running an analysis and want to study each feature's contribution, you need to remove some data to reduce the influence of highly correlated features on the model; if you only want prediction, you care only about accuracy, and removing features matters much less. Poor model stability means a tiny change in one feature causes a large change in the coefficients, i.e. the variance is very high; the cause is usually an overly complex model or too many correlated features. The most direct remedy: regularization.
Pearson Correlation

to measure the linear dependency between features
\rho_{x_1, x_2} = \frac{\operatorname{cov}(x_1, x_2)}{\sigma_{x_1} \sigma_{x_2}}

  • cov(x_1, x_2) means covariance and \sigma means standard deviation
  • covariance:
    cov(x_1, x_2) = E[(x_1 - E(x_1))(x_2 - E(x_2))] = E(x_1x_2) - E(x_1)E(x_2), where \sigma_{x_1}^2 = E(x_1^2) - E(x_1)^2
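The covariance identity above can be checked numerically (a minimal sketch in NumPy; the synthetic x_1, x_2 and the noise scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=1000)   # nearly linear in x1

# cov(x1, x2) = E(x1 * x2) - E(x1) E(x2)
cov = (x1 * x2).mean() - x1.mean() * x2.mean()
rho = cov / (x1.std() * x2.std())
```

Since x2 is almost a linear function of x1, rho comes out close to 1, and the hand-computed value matches `np.corrcoef`.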
Regularization Models

L1 tends to produce sparse solutions
L2 tends to spread the weights out more equally

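The sparsity contrast can be demonstrated by fitting both penalties on the same data (a minimal sketch, assuming scikit-learn; the dataset and alpha values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Only 3 of 20 features carry signal; the rest are pure noise.
X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

n_zero_l1 = int(np.sum(lasso.coef_ == 0))   # L1: many coefficients exactly zero
n_zero_l2 = int(np.sum(ridge.coef_ == 0))   # L2: shrunk, but rarely exactly zero
```

The L1 fit zeroes out most of the noise features, which is why Lasso doubles as a feature-selection method, while the L2 fit keeps small nonzero weights on all of them.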
Principal Component Analysis