Feature engineering is the process of creating new features from raw data in order to improve the learning performance of an algorithm.
Feature engineering is not the same as feature selection:
- feature engineering: This process attempts to create additional relevant features from the existing raw features in the data, in order to increase the predictive power of the learning algorithm.
- feature selection: This process selects the key subset of original data features in an attempt to reduce the dimensionality of the training problem.
Normally feature engineering is applied first to generate additional features, and then the feature selection step is performed to eliminate irrelevant, redundant, or highly correlated features.
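As a minimal sketch of that selection step (assuming the numeric features already sit in a pandas DataFrame named df; the 0.95 correlation threshold is an arbitrary choice, not from the original notes):

import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    # Absolute pairwise correlations between the numeric features
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    # Drop one column from every pair whose correlation exceeds the threshold
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)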
The following uses TensorFlow as an example to briefly show how feature engineering is applied in a real project:
Facets: Google's visualization tool for exploring and analyzing datasets
-
For numeric features (numeric values)
e.g. the relationship between age and income is non-linear.
Bucketing can be used so that a different weight is learned for each bucket.
In TensorFlow this can be done directly:
import tensorflow as tf

# Bucketize the numeric 'age' column; each resulting bucket gets its own weight
age_buckets = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('age'),
    boundaries=[31, 46, 60, 75, 90]
)
-
For categorical features (categorical values)
For a small vocabulary: use the raw values, as in the sketch below.
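A minimal sketch (the 'education' column and its values are assumptions for illustration, not taken from the original notes):

# The full set of values is small and known, so it can be listed explicitly
education = tf.feature_column.categorical_column_with_vocabulary_list(
    'education',
    vocabulary_list=['Bachelors', 'HS-grad', 'Masters', 'Doctorate']
)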
For linear classifiers, feature crossing is often a useful way to create new features, as in the sketch below.
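A minimal sketch of a crossed column, reusing age_buckets from above and assuming a raw 'occupation' input column; the hash_bucket_size of 1000 is an arbitrary choice:

# Let a linear model learn a separate weight per (age bucket, occupation) pair
age_x_occupation = tf.feature_column.crossed_column(
    [age_buckets, 'occupation'],
    hash_bucket_size=1000
)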
For a larger vocabulary: use hashing or an embedding.
Hashing is suitable when a complete vocabulary list cannot be provided, or when it is not practical to build a fully connected network over every category (it saves memory but introduces noisy data through hash collisions), e.g.:
occupation = tf.feature_column.categorical_column_with_hash_bucket('occupation', hash_bucket_size=1080)
Embeddings
Dense vectors vs. one-hot (sparse) encodings
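A minimal sketch of an embedding column wrapping the hashed occupation column defined above; the dimension of 8 is an arbitrary assumption:

# Map each hashed occupation id to a trainable 8-dimensional dense vector
occupation_emb = tf.feature_column.embedding_column(occupation, dimension=8)

Any of these feature columns can then be passed to a canned estimator such as tf.estimator.LinearClassifier via its feature_columns argument.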
TensorFlow Projector: a website for visualizing the learned embeddings