Background
https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection
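This competition is an anti-fraud project: detecting click fraud on mobile app ads from click logs. You are given a large set of click records, each with the clicker's ip, phone model (device), phone operating system (os), the ad channel (channel), the click time (click_time), and the app that was clicked (app). The task is to predict whether the user went on to download the app (is_attributed): 1 if downloaded, 0 if not.
The evaluation metric is AUC.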
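As a reminder of what the metric measures: AUC is the probability that a randomly chosen positive (downloaded) click is scored higher than a randomly chosen negative one. The sketch below computes it by brute force over all pairs in plain Python. The function name `pairwise_auc` and the toy labels and scores are illustrative assumptions; on real data a library routine such as scikit-learn's `roc_auc_score` would normally be used instead.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Cost is O(n_pos * n_neg),
    so this is only suitable for toy inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up labels and predicted probabilities
labels = [0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.9]
auc = pairwise_auc(labels, scores)
print(auc)
```

Because AUC depends only on the ordering of the scores, rescaling the predicted probabilities (for example through scale_pos_weight) does not by itself change the metric.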
Dataset sizes: train.csv is 7.01 GB and test.csv is 823 MB.
Code
Extract the day of week and day of year from click_time, then drop the datetime columns.
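import gc
import time
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

def timeFeatures(df):
    # Derive day-of-week and day-of-year features from the click_time column
    df['datetime'] = pd.to_datetime(df['click_time'])
    df['dow'] = df['datetime'].dt.dayofweek
    df['doy'] = df['datetime'].dt.dayofyear
    df.drop(['click_time', 'datetime'], axis=1, inplace=True)
    return df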
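To make the effect of the function concrete, here is a small usage sketch on two made-up rows (the sample values are assumptions, not competition data). It repeats the function so that the block runs on its own.

```python
import pandas as pd


def timeFeatures(df):
    # Same logic as the function above: derive day-of-week and day-of-year
    df['datetime'] = pd.to_datetime(df['click_time'])
    df['dow'] = df['datetime'].dt.dayofweek   # Monday=0 ... Sunday=6
    df['doy'] = df['datetime'].dt.dayofyear   # 1 ... 366
    df.drop(['click_time', 'datetime'], axis=1, inplace=True)
    return df


# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    'app': [3, 12],
    'click_time': ['2017-11-06 14:32:21', '2017-11-07 00:05:10'],
})
features = timeFeatures(sample)
print(features)
```

Note that the function modifies the frame in place and also returns it, so `sample` and `features` refer to the same object afterwards.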
train_columns = ['ip', 'app', 'device', 'os', 'channel', 'click_time', 'is_attributed']
test_columns = ['ip', 'app', 'device', 'os', 'channel', 'click_time', 'click_id']
dtypes = {
    'ip'            : 'uint32',
    'app'           : 'uint16',
    'device'        : 'uint16',
    'os'            : 'uint16',
    'channel'       : 'uint16',
    'is_attributed' : 'uint8',
    'click_id'      : 'uint32'
}
Load the datasets
Remove is_attributed from the training features (it is the target)
Remove click_id from the test set
# `path` (the data directory) and `start_time` (e.g. time.time() at script start) are assumed to be defined earlier
train = pd.read_csv(path+"train.csv", skiprows=range(1,123903891), nrows=61000000, usecols=train_columns, dtype=dtypes)
test = pd.read_csv(path+"test_supplement.csv", usecols=test_columns, dtype=dtypes)
print('[{}] Finished loading data'.format(time.time() - start_time))
# Separate the target from the training features
y = train['is_attributed']
train.drop(['is_attributed'], axis=1, inplace=True)
# Drop the click_id column from the test rows
sub = pd.DataFrame()
#sub['click_id'] = test['click_id'].astype('int')
test.drop(['click_id'], axis=1, inplace=True)
# Free memory
gc.collect()
Concatenate train and test
nrow_train = train.shape[0]
merge = pd.concat([train, test])
Count the number of clicks per ip (ip: IP address of the click) and store it as clicks_by_ip
Reference: merging and joining DataFrames (merge, join, concat)
Split back into train and test
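# Count the number of clicks by ip
ip_count = merge.groupby(['ip'])['channel'].count().reset_index()
ip_count.columns = ['ip', 'clicks_by_ip']
merge = pd.merge(merge, ip_count, on='ip', how='left', sort=False)
# uint32 rather than uint16: a very active ip can exceed 65535 clicks, which would wrap around in uint16
merge['clicks_by_ip'] = merge['clicks_by_ip'].astype('uint32')
merge.drop('ip', axis=1, inplace=True)
# Note: click_time still has to be converted with timeFeatures() before training; that call is not shown in these notes
train = merge[:nrow_train]
test = merge[nrow_train:]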
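The concat, count, merge, and split steps are easier to follow on a toy example. The sketch below uses made-up frames and also shows `groupby(...).transform('count')` as a one-step alternative to the groupby plus merge; that alternative is a suggestion, not what the original code does.

```python
import pandas as pd

# Toy stand-ins for train and test (made-up values)
train = pd.DataFrame({'ip': [1, 1, 2], 'channel': [10, 11, 10]})
test = pd.DataFrame({'ip': [1, 3], 'channel': [12, 10]})

nrow_train = train.shape[0]
merged = pd.concat([train, test], ignore_index=True)

# Approach from the notes: count rows per ip, then left-join the counts back
ip_count = merged.groupby('ip')['channel'].count().reset_index()
ip_count.columns = ['ip', 'clicks_by_ip']
via_merge = pd.merge(merged, ip_count, on='ip', how='left', sort=False)

# Alternative: transform broadcasts each group's count to every row of the group
via_transform = merged.groupby('ip')['channel'].transform('count')

# Split back by position, as in the notes
train_part = via_merge[:nrow_train]
test_part = via_merge[nrow_train:]
print(via_merge)
```

Because the count is taken over the concatenated frame, ip 1 gets a count of 3 even though only two of its clicks are in the training rows; the feature reflects activity across train and test together.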
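XGBoost parameter tuning
Setting max_depth to 0 means no depth limit; with grow_policy set to lossguide, tree size is then bounded by max_leaves.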
params = {'eta': 0.3,
          'tree_method': "hist",
          'grow_policy': "lossguide",
          'max_leaves': 1400,
          'max_depth': 0,
          'subsample': 0.9,
          'colsample_bytree': 0.7,
          'colsample_bylevel': 0.7,
          'min_child_weight': 0,
          'alpha': 4,
          'objective': 'binary:logistic',
          'scale_pos_weight': 9,
          'eval_metric': 'auc',
          'nthread': 8,
          'random_state': 99,
          'silent': True}  # 'silent' is deprecated in newer XGBoost releases in favor of 'verbosity'
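The notes do not explain how the scale_pos_weight value was chosen. A common starting point, suggested in the XGBoost documentation, is the ratio of negative to positive examples. The sketch below computes that ratio on a made-up label list; the helper name `neg_pos_ratio` and the labels are illustrative assumptions, and the value actually used in the notes may well have been tuned by hand.

```python
def neg_pos_ratio(labels):
    """Ratio of negative to positive labels, a common first guess for scale_pos_weight."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive examples")
    return negatives / positives


# Made-up labels: 4 downloads out of 20 clicks
labels = [1] * 4 + [0] * 16
ratio = neg_pos_ratio(labels)
print(ratio)
```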
Training
if is_valid:
    # Hold out 10% of the training data as a validation set
    x1, x2, y1, y2 = train_test_split(train, y, test_size=0.1, random_state=99)
    # Build the XGBoost DMatrix objects
    dtrain = xgb.DMatrix(x1, y1)
    dvalid = xgb.DMatrix(x2, y2)
    del x1, y1, x2, y2
    gc.collect()
    watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
    model = xgb.train(params, dtrain, 200, watchlist, maximize=True, early_stopping_rounds=25, verbose_eval=5)
    del dvalid
else:
    # No validation split: train on everything for a fixed number of rounds
    dtrain = xgb.DMatrix(train, y)
    del train, y
    gc.collect()
    watchlist = [(dtrain, 'train')]
    model = xgb.train(params, dtrain, 30, watchlist, maximize=True, verbose_eval=1)
Tips (saving memory)
1. Delete variables as soon as they are no longer needed, then run garbage collection
del
gc.collect()
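A minimal sketch of why the explicit gc.collect() can matter, assuming CPython: objects without reference cycles are freed as soon as the last name is deleted, but objects caught in a reference cycle stay alive until the cycle collector runs. The `Block` class is a hypothetical stand-in for a large object such as a DataFrame.

```python
import gc
import weakref


class Block:
    """Hypothetical stand-in for a large object such as a DataFrame."""

    def __init__(self):
        self.payload = bytearray(10_000_000)  # roughly 10 MB
        self.partner = None


a = Block()
b = Block()
a.partner, b.partner = b, a   # reference cycle: refcounts cannot reach zero on their own
probe = weakref.ref(a)        # lets us observe whether the object is still alive

del a, b                      # the names are gone, but the cycle keeps both objects alive
gc.collect()                  # the cycle collector reclaims them
print(probe() is None)
```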
2. Declare data types up front
dtypes = {
    'ip'            : 'uint32',
    'app'           : 'uint16',
    'device'        : 'uint16',
    'os'            : 'uint16',
    'channel'       : 'uint16',
    'is_attributed' : 'uint8',
    'click_id'      : 'uint32'
}
pandas normally infers the data types itself; declaring them up front saved more than half of the memory.
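The saving can be checked on a small scale. The sketch below reads the same made-up CSV text twice, once with inferred types and once with the declared ones, and compares the column memory. The rows are synthetic, so the exact ratio on the real files may differ.

```python
import io
import pandas as pd

# Made-up rows in the same column layout as the competition files
csv_text = "ip,app,device,os,channel,is_attributed\n" + "\n".join(
    f"{100000 + i},{i % 50},{i % 3},{i % 20},{i % 200},{i % 2}" for i in range(1000)
)

dtypes = {
    'ip': 'uint32', 'app': 'uint16', 'device': 'uint16',
    'os': 'uint16', 'channel': 'uint16', 'is_attributed': 'uint8',
}

inferred = pd.read_csv(io.StringIO(csv_text))             # integers are inferred as int64
typed = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)  # compact unsigned types

inferred_bytes = int(inferred.memory_usage(index=False).sum())
typed_bytes = int(typed.memory_usage(index=False).sum())
print(inferred_bytes, typed_bytes)
```

Here each row takes 48 bytes with inferred int64 columns and 13 bytes with the declared types, which is consistent with the "more than half" saving reported above.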
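3. Read only selected rows of the CSV file
Number of rows to read: nrows=xxx
Rows to skip: skiprows=xxx
Sampling, subprocess. The statistical features are still computed over the full data; once the features have been computed, the negative samples are downsampled to 5% to reduce memory consumption.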
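A minimal sketch of the negative downsampling step, assuming the features have already been computed on the full data. The frame, the column values, and the random seed are made up for illustration.

```python
import pandas as pd

# Made-up labelled clicks: 20 positives and 1000 negatives,
# with clicks_by_ip standing in for an already computed feature
df = pd.DataFrame({
    'clicks_by_ip': range(1020),
    'is_attributed': [1] * 20 + [0] * 1000,
})

positives = df[df['is_attributed'] == 1]
negatives = df[df['is_attributed'] == 0]

# Keep every positive, but only 5% of the negatives
sampled_negatives = negatives.sample(frac=0.05, random_state=99)
reduced = pd.concat([positives, sampled_negatives]).sort_index()
print(len(df), '->', len(reduced))
```

Downsampling after feature computation keeps the count features global, which is the ordering described above; sampling first would change the values of those features.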
train has 184,903,891 lines in total; 61 million rows are selected.
train = pd.read_csv(path+"train.csv", skiprows=range(1,123903891), nrows=61000000, usecols=train_columns, dtype=dtypes)
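The skiprows=range(1, N) idiom skips data rows while keeping the header on file line 0. If the line count above includes the header, skipping lines 1 to 123,903,890 leaves exactly the last 61,000,000 data rows. The sketch below shows the same idiom on a made-up 10-row CSV.

```python
import io
import pandas as pd

# A made-up 10-row CSV; file line 0 is the header
csv_text = "row_id,value\n" + "\n".join(f"{i},{i * i}" for i in range(1, 11))

# skiprows=range(1, 6) skips file lines 1..5 (the first five data rows)
# while keeping the header; nrows=3 then reads the next three rows.
chunk = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 6), nrows=3)
print(chunk)
```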
4. Load only the columns you need (usecols=...)
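A short sketch of column selection at read time, using made-up rows. Columns left out of usecols are never materialized in memory.

```python
import io
import pandas as pd

# Made-up rows; attributed_time stands in for a column the model does not use
csv_text = (
    "ip,app,click_time,attributed_time,is_attributed\n"
    "1001,3,2017-11-06 14:32:21,,0\n"
    "1002,9,2017-11-06 14:33:34,2017-11-06 15:01:02,1\n"
)

wanted = ['ip', 'app', 'click_time', 'is_attributed']
df = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(df.columns.tolist())
```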
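5. Select data from the same time periods as test
train covers the whole time range, while test only covers a few fixed periods; keeping only the training data from those same periods reduces the size of the training set.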
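One way this filtering could be done, sketched on made-up timestamps: read the hours of day that occur in test from the test data itself, then keep only the training clicks that fall in those hours. Matching on hour of day is an assumption about what "same time periods" means here.

```python
import pandas as pd

# Made-up timestamps for illustration
train = pd.DataFrame({'click_time': pd.to_datetime([
    '2017-11-07 03:15:00', '2017-11-07 04:20:00',
    '2017-11-08 09:05:00', '2017-11-08 17:45:00',
])})
test = pd.DataFrame({'click_time': pd.to_datetime([
    '2017-11-10 04:00:00', '2017-11-10 09:30:00',
])})

# Hours of day that actually occur in test
test_hours = test['click_time'].dt.hour.unique()

# Keep only training clicks that fall in those hours
same_hours = train[train['click_time'].dt.hour.isin(test_hours)]
print(same_hours)
```

This filter needs the original click_time column, so it would have to run before any step that drops it.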