Feature Selection

Ensemble Learning
  • two main methods: bagging and boosting
Bagging
  • sampling with replacement (bootstrap)
  • decreases variance by introducing randomness into the model framework
  • random forest = bagging + decision trees
Random Forest
  • description of random forest
    there are n samples with m features in the training data
    • sample n observations with replacement each time (row-wise sampling with replacement)
    • from these n observations, pick k features (k < m) without replacement (column-wise sampling without replacement) and fit the best decision tree on them
    • repeat the above steps several times, then combine all the decision trees into a random forest
  • features:
    • decreases variance by introducing randomness into the model framework
    • the distribution of each bootstrap sample of n observations is the same as that of the original training data
    • we don't need to do pruning for each "weak" decision tree
    • less overfitting
    • trees can be trained in parallel
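The row/column sampling scheme above can be sketched as a hand-rolled bagging loop over decision trees (a minimal illustration, assuming scikit-learn is available; the dataset, tree count, and the values of n, m, k are all illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy training data: n samples, m features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
n, m = X.shape
k = 4                                             # features per tree, k < m

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    rows = rng.integers(0, n, size=n)             # rows: sampled with replacement
    cols = rng.choice(m, size=k, replace=False)   # columns: sampled without replacement
    tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
    trees.append((tree, cols))                    # each unpruned "weak" tree + its columns

# The forest prediction is a majority vote over all trees.
votes = np.array([tree.predict(X[:, cols]) for tree, cols in trees])
pred = (votes.mean(axis=0) > 0.5).astype(int)
acc = float((pred == y).mean())
```

Because each tree only sees a bootstrap sample and a feature subset, the trees are decorrelated, which is where the variance reduction comes from.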
  • Feature Importance value in Random Forest
    importance(i) = performance(RF) - performance(RF^{random \, i}) (advanced topic: out-of-bag evaluation)
    how to measure the performance drop? replace the feature's column with random values (e.g. a random permutation), run the trained model on the modified data, and compare the loss
    • the sign carries no meaning; the value only shows how strongly a particular feature influences the model
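The column-scrambling idea above can be sketched like this (assuming scikit-learn; the synthetic dataset and all parameters are illustrative; a trained forest is re-scored after randomly permuting one column at a time):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False, the 2 informative features are the first 2 columns.
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
base = rf.score(X, y)                        # performance(RF)

rng = np.random.default_rng(0)
importance = []
for i in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, i] = rng.permutation(Xp[:, i])     # scramble feature i, keep its distribution
    importance.append(base - rf.score(Xp, y))  # performance(RF) - performance(RF^{random i})
```

Informative columns should show a large accuracy drop when scrambled, while noise columns should show a drop near zero.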
Support Vector Machine
  • SVM: maximize the minimum margin


    if the data is not linearly separable, introduce slack variables to tolerate noise, and map the data to a higher-dimensional space by applying a kernel function
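The effect of the kernel trick can be shown on a dataset that no straight line can separate (a minimal sketch, assuming scikit-learn; the dataset and kernel choice are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: impossible to separate with a straight line in 2-D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)   # stays in the original 2-D space
rbf = SVC(kernel="rbf").fit(X, y)         # implicit map to a higher-dimensional space

acc_linear = linear.score(X, y)
acc_rbf = rbf.score(X, y)
```

The RBF-kernel SVM separates the circles cleanly, while the linear SVM cannot do much better than chance on this geometry.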
Why Feature Selection?
  • reduce overfitting
  • better understanding of your model
  • improve model stability (i.e. improve generalization)
    It depends on the goal. If you are running an analysis and want to study each feature's contribution, you need to remove some data to reduce the influence of highly correlated features on the model; if you only want prediction, you care only about accuracy, and removing features matters much less. Poor model stability means a tiny change in one feature causes a large change in the coefficients, i.e. the variance is very high; the cause is usually an overly complex model or too many correlated features. The most direct remedy: regularization.
Pearson Correlation

to measure the linear dependency between features
\rho_{x_1, x_2} = \frac{\operatorname{cov}(x_1, x_2)}{\sigma_{x_1} \sigma_{x_2}}

  • cov(x_1, x_2) means covariance and \sigma means standard deviation
  • covariance:
    cov(x_1, x_2) = E[(x_1 - E(x_1))(x_2 - E(x_2))] = E(x_1x_2) - E(x_1)E(x_2), where \sigma_{x_1}^2 = E(x_1^2) - E(x_1)^2
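The covariance identity above can be checked numerically (a minimal sketch in NumPy; the synthetic x_1, x_2 and the noise scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=1000)   # nearly linear in x1

# cov(x1, x2) = E(x1 * x2) - E(x1) E(x2)
cov = (x1 * x2).mean() - x1.mean() * x2.mean()
rho = cov / (x1.std() * x2.std())
```

Since x2 is almost a linear function of x1, rho comes out close to 1, and the hand-computed value matches `np.corrcoef`.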
Regularization Models

L1 tends to produce sparse solutions
L2 tends to spread the weights out more equally

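The sparsity contrast can be demonstrated by fitting both penalties on the same data (a minimal sketch, assuming scikit-learn; the dataset and alpha values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Only 3 of 20 features carry signal; the rest are pure noise.
X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

n_zero_l1 = int(np.sum(lasso.coef_ == 0))   # L1: many coefficients exactly zero
n_zero_l2 = int(np.sum(ridge.coef_ == 0))   # L2: shrunk, but rarely exactly zero
```

The L1 fit zeroes out most of the noise features, which is why Lasso doubles as a feature-selection method, while the L2 fit keeps small nonzero weights on all of them.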
Principal Component Analysis