Kaggle|Exercise7:Categorical Variables[Very Important]

By encoding categorical variables, you'll obtain your best results thus far!

Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex3 import *
print("Setup Complete")

In this exercise, you will work with data from the Housing Prices Competition for Kaggle Learn Users.

Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data 读取数据
X = pd.read_csv('../input/train.csv', index_col='Id') 
X_test = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors 删去包含缺失目标变量的样本,从数据集中分离出目标变量y
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, we'll drop columns with missing values为了使模型简单,使用删除缺失值方法
cols_with_missing = [col for col in X.columns if X[col].isnull().any()] 
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)

# Break off validation set from training data 划分训练集验证集
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)

Notice that the dataset contains both numerical and categorical variables. You'll need to encode the categorical data before training a model.

To compare different models, you'll use the same score_dataset() function from the tutorial. This function reports the mean absolute error (MAE) from a random forest model.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

Step 1: Drop columns with categorical data

You'll get started with the most straightforward approach. Use the code cell below to preprocess the data in X_train and X_valid to remove columns with categorical data. Set the preprocessed DataFrames to drop_X_train and drop_X_valid, respectively.

# Fill in the lines below: drop columns in training and validation data
drop_X_train = X_train.select_dtypes(exclude = ["object"])
drop_X_valid = X_valid.select_dtypes(exclude = ["object"])

# Check your answers
step_1.check()

Hint: Use the select_dtypes() method to drop all columns with the object dtype.

DataFrame.select_dtypes(include=None, exclude=None)

获得该方法的评分

print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

Step 2: Label encoding

Before jumping into label encoding, we'll investigate the dataset. Specifically, we'll look at the 'Condition2' column. The code cell below prints the unique entries in both the training and validation sets.在我们进行标签编码之前,先研究一下数据集。特别低,我们将观察Condition2列,下面的代码将返回训练和验证集中指定列的唯一值。

print("Unique values in 'Condition2' column in training data:", X_train['Condition2'].unique())
print("\nUnique values in 'Condition2' column in validation data:", X_valid['Condition2'].unique())

#Output
Unique values in 'Condition2' column in training data: ['Norm' 'PosA' 'Feedr' 'PosN' 'Artery' 'RRAe']
Unique values in 'Condition2' column in validation data: ['Norm' 'RRAn' 'RRNn' 'Artery' 'Feedr' 'PosN']

其中 .unique()是pandas中的唯一值函数,相当于把list变成set,输出变量中包含的不重复类别

If you now write code to:

  • fit a label encoder to the training data, and then
  • use it to transform both the training and validation data,

you'll get an error. Can you see why this is the case? (You'll need to use the above output to answer this question.)
如果现在就进行标签编码的训练和转换,会报错,原因是——
【是否有变量存在于验证集中却不存在于训练集中?】
使用标签编码器拟合训练集上的每一列,为训练集特征中出现的每一个唯一类别创建一个对应的整数值标签。如果验证集包含的特征未出现在训练集中,编码器将报错。因为这些特征不会被分配整数。请注意,验证数据中的“ Condition2”列包含值“ RRAn”和“ RRNn”,但它们未出现在训练数据中,因此,如果我们尝试将标签编码器与scikit-learn一起使用,代码将出错。

This is a common problem that you'll encounter with real-world data, and there are many approaches to fixing this issue. For instance, you can write a custom label encoder to deal with new categories. The simplest approach, however, is to drop the problematic categorical columns.

Run the code cell below to save the problematic columns to a Python list bad_label_cols. Likewise, columns that can be safely label encoded are stored in good_label_cols.运行下面的代码单元,将有问题的列保存在列表bad_label_cols中,同样地,将正确的编码的列储存在good_label_cols中。

# All categorical columns依然是经典的[for in if]表达式
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely label encoded 这里要比较真实元素,要想当用集合
good_label_cols = [col for col in object_cols if 
                   set(X_train[col]) == set(X_valid[col])]
        
# Problematic columns that will be dropped from the dataset 所有分类变量除去好的就是有问题的,补集运算
bad_label_cols = list(set(object_cols)-set(good_label_cols))
        
print('Categorical columns that will be label encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)

才写了一半不到,To be continued

Use the next code cell to label encode the data in X_train and X_valid. Set the preprocessed DataFrames to label_X_train and label_X_valid, respectively.

  • We have provided code below to drop the categorical columns in bad_label_cols from the dataset.
  • You should label encode the categorical columns in good_label_cols.
    下面开始进行标签编码,这里我们直接将bad_label_cols中的类从数据集中删除
from sklearn.preprocessing import LabelEncoder

# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)

# Apply label encoder 这里请重点注意 是包含一个遍历循环的
label_encoder = LabelEncoder()
for col in set(good_label_cols): 
    label_X_train[col] = label_encoder.fit_transform(label_X_train[col])
    label_X_valid[col] = label_encoder.transform(label_X_valid[col])
 # Your code here
    
# Check your answer
step_2.b.check()

Run the next code cell to get the MAE for this approach.

print("MAE from Approach 2 (Label Encoding):") 
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

MAE from Approach 2 (Label Encoding):
17575.291883561644

Step 3: Investigating cardinality

So far, you've tried two different approaches to dealing with categorical variables. And, you've seen that encoding categorical data yields better results than removing columns from the dataset.对分类数据进行编码比直接删除分类变量有更好的效果

Soon, you'll try one-hot encoding. Before then, there's one additional topic we need to cover. Begin by running the next code cell without changes.

# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])

.nuique()pandas函数,用于获取唯一值的统计次数,即有几个唯一类别。

输出为

[('Street', 2),
 ('Utilities', 2),
 ('CentralAir', 2),
 ('LandSlope', 3),
 ('PavedDrive', 3),
 ('LotShape', 4),
 ('LandContour', 4),
 ('ExterQual', 4),
 ('KitchenQual', 4),
 ('MSZoning', 5),
 ('LotConfig', 5),
 ('BldgType', 5),
 ('ExterCond', 5),
 ('HeatingQC', 5),
 ('Condition2', 6),
 ('RoofStyle', 6),
 ('Foundation', 6),
 ('Heating', 6),
 ('Functional', 6),
 ('SaleCondition', 6),
 ('RoofMatl', 7),
 ('HouseStyle', 8),
 ('Condition1', 9),
 ('SaleType', 9),
 ('Exterior1st', 15),
 ('Exterior2nd', 16),
 ('Neighborhood', 25)]

The output above shows, for each column with categorical data, the number of unique values in the column. For instance, the 'Street' column in the training data has two unique values: 'Grvl' and 'Pave', corresponding to a gravel road and a paved road, respectively.

We refer to the number of unique entries of a categorical variable as the cardinality of that categorical variable. For instance, the 'Street' variable has cardinality 2.

Use the output above to answer the questions below.

# Fill in the line below: How many categorical variables in the training data
# have cardinality greater than 10?
high_cardinality_numcols = 3

# Fill in the line below: How many columns are needed to one-hot encode the 
# 'Neighborhood' variable in the training data?
num_cols_neighborhood = 25

# Check your answers
step_3.a.check()

To one-hot encode a variable, we need one column for each unique entry.

For large datasets with many rows, one-hot encoding can greatly expand the size of the dataset. For this reason, we typically will only one-hot encode columns with relatively low cardinality. Then, high cardinality columns can either be dropped from the dataset, or we can use label encoding.

As an example, consider a dataset with 10,000 rows, and containing one categorical column with 100 unique entries.

  • If this column is replaced with the corresponding one-hot encoding, how many entries are added to the dataset?
  • If we instead replace the column with the label encoding, how many entries are added?
    对于具有多行的大型数据集,OH编码会大幅度增加数据集的规模,因此,我们通常只对categorical较少的列进行一次OH编码。对于包含较多分类的列,可以删去或使用label encoding.
    Use your answers to fill in the lines below.

Hint: To calculate how many entries are added to the dataset through the one-hot encoding, begin by calculating how many entries are needed to encode the categorical variable (by multiplying the number of rows by the number of columns in the one-hot encoding). Then, to obtain how many entries are added to the dataset, subtract the number of entries in the original column.提示:要计算通过一次编码将多少项添加到数据集中,请先计算需要多少项来对分类变量进行编码(通过在一次编码中将行数乘以列数) )。 然后,要获得添加到数据集中的条目数量,请减去原始列中的条目数量。

# Fill in the line below: How many entries are added to the dataset by 
# replacing the column with a one-hot encoding?
OH_entries_added = 1e4*100-1e4

# Fill in the line below: How many entries are added to the dataset by
# replacing the column with a label encoding?
label_entries_added = 0

# Check your answers
step_3.b.check()

Step 4: One-hot encoding

In this step, you'll experiment with one-hot encoding. But, instead of encoding all of the categorical variables in the dataset, you'll only create a one-hot encoding for columns with cardinality less than 10.(仅对于类别数小于10的列进行OH编码)

Run the code cell below without changes to set low_cardinality_cols to a Python list containing the columns that will be one-hot encoded. 将被OH编码 Likewise, high_cardinality_cols contains a list of categorical columns that will be dropped from the dataset.将被从数据集分类列中删除

# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))

print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)
Categorical columns that will be one-hot encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'PavedDrive', 'SaleType', 'SaleCondition']

Categorical columns that will be dropped from the dataset: ['Neighborhood', 'Exterior2nd', 'Exterior1st']

Use the next code cell to one-hot encode the data in X_train and X_valid. Set the preprocessed DataFrames to OH_X_train and OH_X_valid, respectively.

  • The full list of categorical columns in the dataset can be found in the Python list object_cols.
  • You should only one-hot encode the categorical columns in low_cardinality_cols. All other categorical columns should be dropped from the dataset.

The next code cell is VERY IMPORTANT!

from sklearn.preprocessing import OneHotEncoder

# Use as many lines of code as you need!
#Apply one-hot encoder to each column with categorical data
onehotencoder = OneHotEncoder(handle_unknown = 'ignore',sparse = False)
OH_cols_train = pd.DataFrame(onehotencoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(onehotencoder.transform(X_valid[low_cardinality_cols]))

#不要忘记OH会删除索引,一定要牢记恢复索引 One-hot encoding removes index,put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

#获得数值型变量列 删去分类列即是 Remove categorical columns (will replace with one-hot encoding)
num_cols_train = X_train.drop(object_cols,axis=1)
num_cols_valid = X_valid.drop(object_cols,axis=1)

#拼接OH编码过的分类列和数值列 注意pd.concat函数的用法
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([OH_cols_train,num_cols_train],axis=1)   # Your code here
OH_X_valid = pd.concat([OH_cols_valid,num_cols_valid],axis=1) # Your code here

# Check your answer
step_4.check()

The code cell above is VERY IMPORTANT!

Run the next code cell to get the MAE for this approach.

print("MAE from Approach 3 (One-Hot Encoding):") 
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

Step 5: Generate test predictions and submit your results

After you complete Step 4, if you'd like to use what you've learned to submit your results to the leaderboard, you'll need to preprocess the test data before generating predictions.

This step is completely optional, and you do not need to submit results to the leaderboard to successfully complete the exercise.

Check out the previous exercise if you need help with remembering how to join the competition or save your results to CSV. Once you have generated a file with your results, follow the instructions below:

测试集预处理代码 To be continued

  1. Begin by clicking on the blue Save Version button in the top right corner of this window. This will generate a pop-up window.
  2. Ensure that the Save and Run All option is selected, and then click on the blue Save button.
  3. This generates a window in the bottom left corner of the notebook. After it has finished running, click on the number to the right of the Save Version button. This pulls up a list of versions on the right of the screen. Click on the ellipsis (...) to the right of the most recent version, and select Open in Viewer. This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
  4. Click on the Output tab on the right of the screen. Then, click on the Submit to Competition button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

  1. If you want to keep working to improve your performance, select the blue Edit button in the top right of the screen. Then you can change your model and repeat the process. There's a lot of room to improve your model, and you will climb up the leaderboard as you work.

Keep going

With missing value handling and categorical encoding, your modeling process is getting complex. This complexity gets worse when you want to save your model to use in the future. The key to managing this complexity is something called pipelines.

Learn to use pipelines to preprocess datasets with categorical variables, missing values and any other messiness your data throws at you.

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 219,110评论 6 508
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 93,443评论 3 395
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 165,474评论 0 356
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 58,881评论 1 295
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 67,902评论 6 392
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,698评论 1 305
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,418评论 3 419
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 39,332评论 0 276
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,796评论 1 316
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,968评论 3 337
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 40,110评论 1 351
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,792评论 5 346
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 41,455评论 3 331
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 32,003评论 0 22
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 33,130评论 1 272
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 48,348评论 3 373
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 45,047评论 2 355