The advantages of GPU computing are already well established in deep learning. For applications in the tax domain, see my articles 《升级HanLP并使用GPU后端识别发票货物劳务名称》, 《HanLP识别发票货物劳务名称之三 GPU加速》 and 《外一篇:深度学习之VGG16模型雪豹识别》. HanLP runs on the TensorFlow and PyTorch deep-learning frameworks; interested vendors can also try frameworks of their own.
Those articles all run on Python. In R, TensorFlow and Keras have corresponding interface packages (the backend still runs in Python), see 《R语言深度学习》; more recently there are also R-native deep-learning frameworks such as Torch for R and Apache MXNet. I have not tried the latter two yet and may test them when time permits.
For installing and using GPUs on Linux, see my series of articles on Jianshu (简书).
In traditional machine learning, which is mostly classification and regression, some algorithm implementations also try to use GPU computing power to improve performance. In the earlier articles 《墨尔本房价回归模型(Python)》 and 《用Tidy Models实现墨尔本房价回归模型(R)》, the three GBDT (gradient-boosted decision tree) implementations widely regarded on Kaggle as world-class, XGBoost, LightGBM and CatBoost, all support running on a GPU. That raises both a question and an opportunity: to explore whether, and under what conditions, the advantage of GPU computing can be realized in this area. The question has real practical value: with so many GPUs in the major clouds and in PCs and laptops, whether they can be put to good use is an important criterion when choosing algorithm implementations and technical approaches. The vendors have already done the engineering, and plenty of examples online show GPU speed-ups for classification and regression on large datasets, so the possibility is not in doubt; the task is to find the conditions under which it holds in your own application scenario, and that requires measurement.
In the Melbourne house-price regression example, the measurements showed that both the Python and the R implementations ran slower on the GPU (NVIDIA GeForce RTX 2060 Max-Q, 1,920 CUDA cores) than on the CPU (8-core/16-thread Intel Core i7). The reasons need digging into: are the GPU parameters wrong, is it the nature of the dataset, or is this simply what the hardware can do? Answering that clarifies the conditions for realizing GPU advantages in a production scenario.
I. Tests in R
I have been writing an introduction to Tidy Models recently, so let's look at R first and then at Python; the conclusions are the same.
1. XGBoost
The open-source XGBoost framework is led by the University of Washington. The default CRAN build of XGBoost does not support the GPU; install the release from its GitHub page instead, which provides prebuilt Windows and Linux binaries (currently version 1.7.3.1). At run time, the only change needed is the extra argument tree_method="gpu_hist".
set_engine('xgboost', tree_method="gpu_hist")
# -----------------------------------------------------------------------------------------
library(tidymodels)
library(kableExtra)
library(tidyr)
# All operating systems,注册并行处理
library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores())
registerDoParallel(cl)
# 优先使用tidymodels的同名函数。
tidymodels_prefer()
# 异常值阈值30
threshold<- 30
# ----------------------------------------------------------------------------------------
# 加载经过预处理的数据
melbourne<- read.csv("D:/temp/data/Melbourne_housing/Melbourne_housing_pre.csv")
# 过滤缺失值
# Error: Missing data in columns: BuildingArea.
# 47 obs.
missing <- filter(melbourne, BuildingArea==0)
melbourne <- filter(melbourne, BuildingArea!=0)
# 划分训练集与测试集
set.seed(2023)
melbourne_split <- initial_split(melbourne, prop = 0.80)
melbourne_train <- training(melbourne_split)
melbourne_test <- testing(melbourne_split)
# ----------------------------------------------------------------------------------------------------
# 贝叶斯优化
# 可以调整菜谱参数、模型主参数及引擎相关参数。
# 定义菜谱:回归公式与预处理
melbourne_rec<-
recipe(LogPrice ~ Year + YearBuilt + Distance + Lattitude + Longtitude + BuildingArea
+ Rooms + Bathroom + Car + Type_h + Type_t + Type_u, data = melbourne_train) %>%
# 标准化数值型变量
step_normalize(all_numeric_predictors())
# 定义模型:XGB, 定义要调整的参数,tree_method="gpu_hist",使用GPU。
xgb_spec <-
boost_tree(tree_depth = tune(), trees = tune(), learn_rate = tune(), min_n = tune(), loss_reduction = tune(), sample_size = tune(), stop_iter = tune()) %>%
set_engine('xgboost', tree_method="gpu_hist") %>%
set_mode('regression')
# 定义工作流
xgb_wflow <-
workflow() %>%
add_model(xgb_spec) %>%
add_recipe(melbourne_rec)
# 全部参数的边界都已确定。
xgb_param <- xgb_wflow %>%
extract_parameter_set_dials() %>%
update(learn_rate = threshold(c(0.01,0.5))) %>%
update(trees = trees(c(500,1000))) %>%
update(tree_depth = tree_depth(c(5,15))) %>%
update(sample_size = threshold(c(0.5,1))) %>%
finalize(melbourne_train)
xgb_param
# 查看参数边界,都已确定
xgb_param %>% extract_parameter_dials("trees")
xgb_param %>% extract_parameter_dials("min_n")
xgb_param %>% extract_parameter_dials("tree_depth")
xgb_param %>% extract_parameter_dials("learn_rate")
xgb_param %>% extract_parameter_dials("loss_reduction")
xgb_param %>% extract_parameter_dials("sample_size")
xgb_param %>% extract_parameter_dials("stop_iter")
melbourne_folds <- vfold_cv(melbourne, v = 5)
# 执行贝叶斯优化
ctrl <- control_bayes(verbose = TRUE, no_improve = Inf)
# set.seed(2023)
t1<-proc.time()
xgb_res_bo <-
xgb_wflow %>%
tune_bayes(
resamples = melbourne_folds,
metrics = metric_set(rsq, rmse, mae),
initial = 10,
param_info = xgb_param,
iter = 100,
control = ctrl
)
t2<-proc.time()
cat(t2-t1)
# CPU 3014 1.99 3989.22 NA NA
# GPU 5892.78 4.16 8416.28 NA NA
The GPU run is actually more than twice as slow. Watching the transfer activity during training shows a fair amount of data being copied back and forth between CPU and GPU.
2. LightGBM
LightGBM is developed by Microsoft. The default CRAN build again has no GPU support; the GPU version has to be built from source, see 《Installation Guide: Build GPU Version》 and the LightGBM R-package GitHub page. After building the GPU-enabled LightGBM, run build_r.R in the project root to package and install the GPU-enabled lightgbm R package. The GPU build uses the OpenCL API by default, which NVIDIA also supports; to build against the CUDA-specific API, see 《Installation Guide: Build CUDA Version》. I had built version 3.2.1.99 with CMake + VS Build Tools for the earlier Python tests; the latest release is 3.3.4, whose main change is R 4.2 compatibility, and the releases after 3.2.1 contain no major changes to GPU support, so I kept testing with 3.2.1.99 and will upgrade later if needed. The documentation includes a small test example for LightGBM's native R API, and on that small dataset the CPU is clearly faster than the GPU.
Rscript build_r.R --use-gpu
According to the LightGBM parameter documentation, the OpenCL API needs two parameters to identify the GPU vendor and device: gpu_platform_id and gpu_device_id. The tool GPUCapsViewer can list them, as shown in the figure below, but LightGBM numbers them from 0, so subtract 1 when passing them. For example, on my laptop the integrated Intel GPU is platform 1 and the NVIDIA GPU is platform 2 in the tool, so in the R code the NVIDIA card is gpu_platform_id = 2 - 1 = 1 and gpu_device_id = 1 - 1 = 0.
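As an alternative to the GUI tool, the 0-based indices can also be listed programmatically. Below is a minimal Python sketch assuming the optional pyopencl package is installed (it is not used elsewhere in this article); the printed indices are the values LightGBM expects.
# Minimal sketch: enumerate OpenCL platforms/devices with 0-based indices,
# i.e. the numbering that LightGBM's gpu_platform_id / gpu_device_id expect.
# Assumes the optional pyopencl package: pip install pyopencl
import pyopencl as cl

for p_id, platform in enumerate(cl.get_platforms()):
    for d_id, device in enumerate(platform.get_devices()):
        print(f"gpu_platform_id={p_id}, gpu_device_id={d_id}: "
              f"{platform.name} / {device.name}")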
The data-loading code is the same as before and is not repeated; passing device="gpu" and the related arguments is enough to use the GPU.
set_engine('lightgbm', device="gpu", gpu_platform_id=1, gpu_device_id = 0)
# 为LightGBM提供 parsnip接口支持
library(bonsai)
# 贝叶斯优化
# 可以调整菜谱参数、模型主参数及引擎相关参数。
# 定义菜谱:回归公式与预处理
melbourne_rec<-
recipe(LogPrice ~ Year + YearBuilt + Distance + Lattitude + Longtitude + BuildingArea
+ Rooms + Bathroom + Car + Type_h + Type_t + Type_u, data = melbourne_train) %>%
# 标准化数值型变量
step_normalize(all_numeric_predictors())
# 定义模型:Light GBM, 定义要调整的参数
lgbm_spec <-
boost_tree(tree_depth = tune(), trees = tune(), learn_rate = tune(), min_n = tune(),
loss_reduction = tune(), sample_size = tune(), mtry=tune()) %>%
# set_engine('lightgbm') %>%
# 有一个集成的Intel显卡,它的gpu_platform_id=0,gpu_device_id = 0,Nvidia独立显卡的gpu_platform_id=1
set_engine('lightgbm', device="gpu", gpu_platform_id=1, gpu_device_id = 0) %>%
set_mode('regression')
# 定义工作流
lgbm_wflow <-
workflow() %>%
add_model(lgbm_spec) %>%
add_recipe(melbourne_rec)
# mtry参数的边界未完全确定,用finalize()函数确定。
lgbm_param <- lgbm_wflow %>%
extract_parameter_set_dials() %>%
update(learn_rate = threshold(c(0.01,0.5))) %>%
update(trees = trees(c(500,1000))) %>%
update(tree_depth = tree_depth(c(5,15))) %>%
update(mtry = mtry(c(3,6))) %>%
update(sample_size = threshold(c(0.5,1))) %>%
finalize(melbourne_train)
# 查看参数边界,都已确定
lgbm_param %>% extract_parameter_dials("trees")
lgbm_param %>% extract_parameter_dials("min_n")
lgbm_param %>% extract_parameter_dials("tree_depth")
lgbm_param %>% extract_parameter_dials("learn_rate")
lgbm_param %>% extract_parameter_dials("loss_reduction")
lgbm_param %>% extract_parameter_dials("sample_size")
lgbm_param %>% extract_parameter_dials("mtry")
melbourne_folds <- vfold_cv(melbourne, v = 5)
# 执行贝叶斯优化
ctrl <- control_bayes(verbose = TRUE, no_improve = Inf)
# set.seed(2023)
t1<-proc.time()
lgbm_res_bo <-
lgbm_wflow %>%
tune_bayes(
resamples = melbourne_folds,
metrics = metric_set(rsq, rmse, mae),
initial = 10,
param_info = lgbm_param,
iter = 100,
control = ctrl
)
t2<-proc.time()
cat(t2-t1)
#CPU 4760.83 2.64 5503.5 NA NA
#GPU 5834.04 5.57 8285.5 NA NA
The CPU is again faster: the GPU run took roughly 50% longer (8,285 s vs 5,504 s elapsed).
3. CatBoost
CatBoost is an open-source GBDT framework developed by the Russian search-engine company Yandex. Its prebuilt binaries for every operating system already support the GPU; download the latest release from the project page (currently 1.1.1). Using the GPU only requires the extra argument task_type = 'GPU'; see the parameter documentation.
# 为catboost提供 parsnip接口支持
library(treesnip)
# 定义菜谱:回归公式与预处理
# 'Year','YearBuilt','Distance','Lattitude','Longtitude','Propertycount',
# 'Landsize','BuildingArea', 'Rooms','Bathroom', 'Car','Type_h','Type_t','Type_u'
melbourne_rec<-
recipe(LogPrice ~ Year + YearBuilt + Distance + Lattitude + Longtitude + BuildingArea
+ Rooms + Bathroom + Car + Type_h + Type_t + Type_u, data = melbourne_train) %>%
#step_log(BuildingArea, base = 10) %>%
# 标准化数值型变量
step_normalize(all_numeric_predictors())
# 定义模型:Cat
cat_model<-
boost_tree(trees = 1000, learn_rate=0.05) %>%
set_engine("catboost",
loss_function = "RMSE",
eval_metric='RMSE',
task_type = 'GPU' # Catboost GPU上运行的效率还不如CPU, 可能是数据集还不够大。
) %>%
set_mode("regression")
# 定义工作流
cat_wflow <-
workflow() %>%
add_model(cat_model) %>%
add_recipe(melbourne_rec)
# 训练模型
t1<-proc.time()
cat_fit <- fit(cat_wflow, melbourne_train)
t2<-proc.time()
cat(t2-t1)
# CPU 2.42 0.07 2.68 NA NA
# GPU 12.77 3.44 12.78 NA NA
CatBoost was about five times slower on the GPU, so I skipped the 100-iteration Bayesian optimization for now; during 5-fold cross-validation, though, both the CPU and the GPU were close to fully loaded.
II. Tests in Python
1. XGBoost
The Python build of XGBoost is a precompiled binary that already supports the GPU; installing it with pip is enough.
pip install xgboost
For Bayesian optimization, training on the GPU again only needs the extra argument tree_method='gpu_hist'. The Bayesian optimization implementation in Python differs from the one in R: hyperopt (which uses TPE rather than a Gaussian process) is fast per step and proposes essentially one candidate at a time, whereas tidymodels evaluates thousands of candidate combinations per step; so in Python I iterate 1,000 times versus 100 in R. Switching between CPU and GPU only requires editing the objective function f_xgb().
The part shared by all algorithms: loading the packages and the data.
# 加载公用包
# Ignore Warnings
import warnings
warnings.filterwarnings('ignore')
# Basic Imports
import numpy as np
import pandas as pd
import time
# Preprocessing
from sklearn.model_selection import train_test_split, KFold, cross_val_score
# Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Model Tuning
from hyperopt import fmin, tpe, hp, Trials
from hyperopt.fmin import generate_trials_to_calculate
# 加载数据,划分训练集与测试集,标准化数据
# 9015
df_NN = pd.read_csv("D:/temp/data/Melbourne_housing/Melbourne_housing_pre.csv", encoding="utf-8")
X=df_NN[['Year','YearBuilt','Distance','Lattitude','Longtitude','Propertycount',
'Landsize','BuildingArea', 'Rooms','Bathroom', 'Car','Type_h','Type_t','Type_u']]
y=df_NN['LogPrice']
train_X, valid_X, train_y, valid_y = train_test_split(X,y, test_size = .20, random_state=42)
train_X2 = train_X.copy()
valid_X2 = valid_X.copy()
# Data standardization
mean = train_X.mean(axis=0)
train_X -= mean
std = train_X.std(axis=0)
train_X /= std
valid_X -= mean
valid_X /= std
XGBoost:
# ML Models
from xgboost import XGBRegressor
# 定义参数搜索空间,缩小参数取值范围,搜索会快很多
space_xgb = {
'max_bin': hp.choice('max_bin', range(8, 128)), # CPU 50-501 GPU 8-128
'max_depth': hp.choice('max_depth', range(3, 11)),
'n_estimators': hp.choice('n_estimators', range(100, 1001)),
'learning_rate': hp.uniform('learning_rate', 0.01, 0.3),
'subsample': hp.uniform('subsample', 0.5, 0.99),
'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 0.99),
'reg_alpha': hp.uniform('reg_alpha', 0, 5), # lambda_l1
'reg_lambda': hp.uniform('reg_lambda', 0, 3), # lambda_l2
'gamma': hp.uniform('gamma',0.0, 10), # min_split_loss, min_split_gain
'min_child_weight': hp.uniform('min_child_weight',0.0001, 50),
}
# 定义代价函数
def f_xgb(params):
# Set extra_trees=True to avoid overfitting
# CPU 4.96s/trial
# xgb = XGBRegressor(objective ='reg:squarederror', seed = 0,verbosity=0, **params)
# GPU 8.68s/trial
xgb = XGBRegressor(tree_method='gpu_hist', objective ='reg:squarederror', seed = 0,verbosity=0,**params)
#xgb_model = xgb.fit(train_X, train_y)
#acc = xgb_model.score(valid_X,valid_y)
# acc = cross_val_score(xgb, train_X, train_y).mean() # CPU
acc = cross_val_score(xgb, train_X, train_y, n_jobs=6).mean() # GPU
return -acc
# trials = Trials()
# Set initial values, start searching from the best point of GridSearchCV(), and default values
trials = generate_trials_to_calculate([{
'max_bin':4, # default 256
'max_depth':5, # default 6
'n_estimators':578, # default 100
'learning_rate':0.05508679239402551, # default 0.3
'subsample':0.8429852720715357, # default 1.0
'colsample_bytree':0.8413894273173292, # default 1.0
'reg_alpha': 0.809791155072757, # default 0.0
'reg_lambda':1.4490119256389808, # default 1.0
'gamma':0.008478702584417519, # default 0.0
'min_child_weight':24.524635200338793, # default 1
}])
t1 = time.time()
# GPU: 1000trial [2:24:41, 8.68s/trial, best loss: -0.9080128034320879]
best_params = fmin(f_xgb, space_xgb, algo=tpe.suggest, max_evals=999, trials=trials)
t2 = time.time()
# 8681.310757875443
print("Time elapsed: ", t2-t1)
print('best:')
print(best_params)
The Python build of XGBoost also trains nearly twice as slowly on the GPU as on the CPU (8.68 s/trial vs 4.96 s/trial).
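To separate raw training speed from the overhead of the tuning loop, a single fit can be timed directly. The sketch below reuses train_X/train_y/valid_X/valid_y prepared above; the parameter values are illustrative, not the tuned optimum.
# Minimal sketch: time one XGBoost fit on CPU ('hist') vs GPU ('gpu_hist')
# with identical parameters, reusing the Melbourne data prepared above.
import time
from xgboost import XGBRegressor

for method in ["hist", "gpu_hist"]:
    model = XGBRegressor(tree_method=method, n_estimators=500, max_depth=6,
                         learning_rate=0.05, objective="reg:squarederror",
                         random_state=0, verbosity=0)
    t0 = time.time()
    model.fit(train_X, train_y)
    print(f"{method}: {time.time() - t0:.1f} s, "
          f"R^2 on validation = {model.score(valid_X, valid_y):.4f}")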
2. LightGBM
For installing the Python package, see the LightGBM python-package documentation; I upgraded to the latest 3.3.4. The install option below builds the default OpenCL API, which is what is tested here.
pip install lightgbm --install-option=--gpu
To install the CUDA-specific build on Windows, a Visual Studio build environment has to be set up first, because pip compiles it from source.
pip install lightgbm --install-option=--cuda
For the Bayesian optimization, call lightgbm with a few extra arguments: device='gpu', gpu_platform_id=1, gpu_device_id=0. Note that the valid range of max_bin differs between the CPU and GPU builds; setting it incorrectly on the GPU triggers an index-out-of-range error. The shared code is not repeated.
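For reference, the engine-level switch is just a few constructor arguments. Here is a minimal single-fit sketch under the same assumptions (train_X/train_y/valid_X/valid_y from the shared code, NVIDIA card at platform 1 / device 0, and a GPU-safe max_bin).
# Minimal sketch: one LightGBM fit on the GPU with the arguments discussed above;
# max_bin=63 is a conservative value for the OpenCL build.
from lightgbm import LGBMRegressor

lgbm_gpu = LGBMRegressor(device="gpu", gpu_platform_id=1, gpu_device_id=0,
                         max_bin=63, n_estimators=1000, learning_rate=0.05)
lgbm_gpu.fit(train_X, train_y)
print("R^2 on validation:", lgbm_gpu.score(valid_X, valid_y))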
# ML Models
from lightgbm import LGBMRegressor
# --------------------------------------------------------------------------------------------------
# Auto search for better hyper parameters with hyperopt, only need to give a range
# Reference: https://www.pythonf.cn/read/6998
# https://lightgbm.readthedocs.io/en/latest/Parameters.html
# https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html#deal-with-over-fitting
# https://lightgbm.readthedocs.io/en/latest/GPU-Performance.html
# 处理过拟合
# 设置较少的直方图数目 max_bin
# 设置较小的叶节点数 num_leaves
# 使用 min_child_samples(min_data_in_leaf) 和 min_child_weight(= min_sum_hessian_in_leaf)
# 通过设置 subsample(bagging_fraction) 和 subsample_freq(= bagging_freq) 来使用 bagging
# 通过设置 colsample_bytree(feature_fraction) 来使用特征子抽样
# 使用更大的训练数据
# 使用 reg_alpha(lambda_l1) , reg_lambda(lambda_l2) 和 min_split_gain(min_gain_to_split) 来使用正则
# 尝试 max_depth 来避免生成过深的树
# Try extra_trees
# Try increasing path_smooth
# trials = generate_trials_to_calculate([{'max_bin':63-8, # default CPU 255 GPU 63
# 'max_depth':5-3, # default -1
# 'num_leaves':31-20, # default 31
# 'min_child_samples':20-10, # default 20
# 'subsample_freq':1-1, # default 1
# 'n_estimators':6000-1000, # default 10
# 'learning_rate':0.01, # default 0.1
# 'subsample':0.75, # default 1.0
# 'colsample_bytree':0.8, # default 1.0
# 'lambda_l1':0.0, # default 0.0
# 'lambda_l2':0.0, # default 0.0
# 'min_child_weight':0.001, # default 0.001
# 'min_split_gain':0.0, # default 0.0
# #'path_smooth':0.0 # default 0.0
# }])
# 缩小参数取值范围,搜索会快很多
space_lgbm = {
'max_bin': hp.choice('max_bin', range(8, 128)), # CPU 50-501 GPU 8-128
'max_depth': hp.choice('max_depth', range(3, 31)),
'num_leaves': hp.choice('num_leaves', range(10, 256)),
'min_child_samples': hp.choice('min_child_samples', range(10, 51)),
'subsample_freq': hp.choice('subsample_freq', range(1, 6)),
'n_estimators': hp.choice('n_estimators', range(500, 6001)),
'learning_rate': hp.uniform('learning_rate', 0.005, 0.15),
'subsample': hp.uniform('subsample', 0.5, 0.99),
'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 0.99),
'reg_alpha': hp.uniform('reg_alpha', 0, 5), # lambda_l1
'reg_lambda': hp.uniform('reg_lambda', 0, 3), # lambda_l2
'min_child_weight': hp.uniform('min_child_weight',0.0001, 50),
'min_split_gain': hp.uniform('min_split_gain',0.0, 1),
#'path_smooth': hp.uniform('path_smooth',0.0, 3)
}
def f_lgbm(params):
# Set extra_trees=True to avoid overfitting
# lgbm = LGBMRegressor(seed=0,verbose=-1, **params) # CPU 4.96s/trial
lgbm = LGBMRegressor(device='gpu', gpu_platform_id=1, gpu_device_id = 0, num_threads =3, **params) # GPU 65.93s/trial
#lgb_model = lgbm.fit(train_X, train_y)
#acc = lgb_model.score(valid_X,valid_y)
# acc = cross_val_score(lgbm, train_X, train_y).mean() # CPU
acc = cross_val_score(lgbm, train_X, train_y, n_jobs=6).mean() # GPU
return -acc
# trials = Trials()
# Set initial values, start searching from the best point of GridSearchCV(), and default values
trials = generate_trials_to_calculate([{'max_bin':63, # default CPU 255 GPU 63
'max_depth':17, # default -1
'num_leaves':12, # default 31
'min_child_samples':14, # default 20
'subsample_freq':0, # default 1
'n_estimators':2647, # default 10
'learning_rate':0.0203187560767722, # default 0.1
'subsample':0.788703175392162, # default 1.0
'colsample_bytree':0.5203150334508861, # default 1.0
'reg_alpha': 0.988139501870491, # default 0.0
'reg_lambda':2.789779486137205, # default 0.0
'min_child_weight':21.813225361674828, # default 0.001
'min_split_gain':0.00039636685518264865, # default 0.0
#'path_smooth':0.0 # default 0.0
}])
t1 = time.time()
# 1000trial [5:23:09, 19.39s/trial, best loss: -0.9082183160929432] CPU
# 1000trial [1:22:39, 4.96s/trial, best loss: -0.9079837941918502] CPU
# 1000trial [1:02:28, 3.75s/trial, best loss: -0.9080477825539048] CPU
best_params = fmin(f_lgbm, space_lgbm, algo=tpe.suggest, max_evals=9, trials=trials)
t2 = time.time()
print("Time elapsed: ", t2-t1)
print(best_params)
Over 100 iterations the GPU run averaged 15.09 s/trial, with the CPU fully loaded, the NVIDIA GPU at a bit over half load, low memory use on this small dataset, and some copy traffic between CPU and GPU, which all looks normal; the CPU version, at 3.28 s/trial, is still more than four times faster.
3. CatBoost
The pip install already supports the GPU by default:
pip install catboost
To use the GPU, CatBoost only needs task_type='GPU'. When tuning, note that a few parameters exist only in the CPU build and are not supported on the GPU: random_strength, subsample and rsm; the valid range of border_count also differs between the GPU and CPU builds. The shared code is again omitted, see above.
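As a sanity check before tuning, the CPU-only parameters can simply be dropped when building the GPU estimator. The values below are illustrative placeholders, not tuned results, and train_X/train_y/valid_X/valid_y are the data frames prepared earlier.
# Minimal sketch: reuse a CPU parameter dict on the GPU by removing the
# parameters that the GPU build rejects in this setup.
from catboost import CatBoostRegressor

cpu_params = {"border_count": 112, "iterations": 989, "depth": 4,
              "learning_rate": 0.078, "l2_leaf_reg": 8.1,
              "random_strength": 6.6, "subsample": 0.95, "rsm": 0.72}
gpu_params = {k: v for k, v in cpu_params.items()
              if k not in ("random_strength", "subsample", "rsm")}  # CPU-only here
cat_gpu = CatBoostRegressor(task_type="GPU", random_seed=0, verbose=False, **gpu_params)
cat_gpu.fit(train_X, train_y)
print("R^2 on validation:", cat_gpu.score(valid_X, valid_y))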
# ML Models
from catboost import CatBoostRegressor
# Auto search for better hyper parameters with hyperopt, only need to give a range
# Reference: https://github.com/talperetz/hyperspace/tree/master/GBDTs
# https://catboost.ai/docs/concepts/python-reference_parameters-list.html#python-reference_parameters-list
# https://catboost.ai/docs/concepts/parameter-tuning.html
# https://affine.ai/catboost-a-new-game-of-machine-learning/
'''
https://catboost.ai/en/docs/concepts/speed-up-training
Speeding up the training
1. iterations, worked
2. learning_rate, worked
2. boosting_type, Ordered, Plain, not worked
3. bootstrap_type, Bayesian, Bernoulli, MVS, Poisson, not worked
4. subsample, not worked
This parameter can be used if one of the following bootstrap types is selected:
Poisson Bernoulli MVS
5. one_hot_max_size, One-hot encoding
6. rsm, colsample_bylevel, Random subspace method
7.leaf_estimation_iterations, worked, set to 1.
Try setting the value to "1" or "5" to speed up the training on datasets with a small number of features.
8. max_ctr_complexity, worked, 0 or 2 to speed up trainning.
This parameter can affect the training time only if the dataset contains categorical features.
9. border_count, worked, set to less.
10.Reusing quantized datasets in Python, not applyable to cross_val_score()
11.Golden features. If the dataset has a feature, which is a strong predictor of the result, the
pre-quantisation of this feature may decrease the information that the model can get from it.
It is recommended to use an increased number of borders (1024) for this feature.
per_float_feature_quantization=['0:border_count=1024', '1:border_count=1024']
'''
# default values
# trials = generate_trials_to_calculate([{'border_count':254-150, # default CPU 254 GPU 128
# 'iterations':1000-500, # default 1000
# 'depth': 6-2, # default 6
# 'random_strength':1.0, # default 1.0, CPU only
# 'learning_rate': 0.03, # default 0.03
# 'subsample':0.8, # default 0.8
# 'l2_leaf_reg': 3.0, # default 3
# 'rsm':0.8, # default 1.0 CPU only
# 'fold_len_multiplier':2.0, # default 2.0
# 'bagging_temperature':1.0 # default 1.0
# }])
# 缩小参数取值范围,搜索会快很多
space_cat = {'border_count': hp.choice('border_count', range(8, 128)), # CPU 150-351 GPU 8-128
'iterations': hp.choice('iterations', range(500, 1501)),
'depth': hp.choice('depth', range(2, 10)),
#'random_strength': hp.uniform('random_strength', 1, 20),
'learning_rate': hp.uniform('learning_rate', 0.005, 0.15),
# 'subsample': hp.uniform('subsample', 0.5, 1),
'l2_leaf_reg': hp.uniform('l2_leaf_reg', 1, 100),
# 'rsm': hp.uniform('rsm', 0.5, 0.99), # colsample_bylevel
'fold_len_multiplier': hp.uniform('fold_len_multiplier', 1.0, 10.0),
'bagging_temperature': hp.uniform('bagging_temperature', 0.0, 1.0) }
def f_cat(params):
# cat = CatBoostRegressor(task_type='CPU', random_seed=0,
cat = CatBoostRegressor(task_type='GPU', random_seed=0,
# boosting_type='Plain', bootstrap_type = 'Bayesian', max_ctr_complexity=1,
one_hot_max_size=3,
leaf_estimation_iterations=1,
#per_float_feature_quantization=['3:border_count=1024', '4:border_count=1024'], # Golden features: lat, long
verbose=False, **params) # CPU 13.05s/trial
acc = cross_val_score(cat, train_X, train_y, n_jobs=3).mean()
return -acc
# trials = Trials()
# Set initial values, start searching from the best point of GridSearchCV(), and default values
trials = generate_trials_to_calculate([{'border_count':112, # default CPU 254 GPU 128
'iterations':989, # default 1000
'depth': 4, # default 6
#'random_strength':6.6489521372262645, # default 1.0, CPU only
'learning_rate': 0.07811835381238333, # default 0.03
#'subsample':0.9484820488113903, # default 0.8
'l2_leaf_reg': 8.070279328038293, # default 3
#'rsm':0.7188098046587024, # default 1.0 CPU only
'fold_len_multiplier': 6.034216410528531, # default 2.0
'bagging_temperature':0.47787665340753926 # default 1.0
}])
t1 = time.time()
# 1000trial [50:28, 3.03s/trial, best loss: -0.905859099632395]
best_params = fmin(f_cat, space_cat, algo=tpe.suggest, max_evals=9, trials=trials)
t2 = time.time()
print("Time elapsed: ", t2-t1)
print('best:')
On such a small dataset, with the GPU fully loaded, the CPU at over half load and memory maxed out, about 350 seconds per trial is clearly abnormal; some parameters are probably still not set correctly, or the dataset needs some other preprocessing.
By contrast, the CPU version keeps the CPU fully loaded with modest memory use, and 7.46 s/trial is a reasonable speed.
III. The official LightGBM example
The official LightGBM HIGGS classification example claims the GPU should give a speed-up of three times or more. On my laptop the two are roughly on par (the GPU never runs at full load, because the GPU builds of these GBDT algorithms only move part of the computation onto the GPU; much of it still runs on the CPU, so typically the CPU is maxed out while the GPU is not). That suggests a hardware limit: the NVIDIA GeForce RTX 2060 Max-Q in my Lenovo Legion Y9000X is not powerful enough. The dataset has 11 million rows and 28 variables, about 7.5 GB uncompressed, which is not small, so the test should be a useful reference. 9.9 million rows go to the training set and 1.1 million (10%) to the validation set. See the data download and reference links.
The part shared by all algorithms: loading the data.
# -*- coding: utf-8 -*-
"""
Created on Thu Sep 23 15:58:22 2021
@author: Jean
"""
'''
This is a classification problem to distinguish between a signal process
which produces Higgs bosons and a background process which does not.
The first column is the class label (1 for signal, 0 for background),
followed by the 28 features (21 low-level features then 7 high-level features):
lepton pT, lepton eta, lepton phi, missing energy magnitude, missing energy phi,
jet 1 pt, jet 1 eta, jet 1 phi, jet 1 b-tag, jet 2 pt, jet 2 eta, jet 2 phi, jet 2 b-tag,
jet 3 pt, jet 3 eta, jet 3 phi, jet 3 b-tag, jet 4 pt, jet 4 eta, jet 4 phi, jet 4 b-tag,
m_jj, m_jjj, m_lv, m_jlv, m_bb, m_wbb, m_wwbb.
'''
import pandas as pd
import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
t1 = time.time()
# Load dataset
df = pd.read_csv("D:/temp/data/HIGGS/HIGGS.csv", encoding="utf-8", header=None)
# Target column changed to int
df.iloc[:,0] = df.iloc[:,0].astype(int)
t2 = time.time()
# 50.2845721244812
print(t2-t1)
df.shape
df.head(2)
t1 = time.time()
X = df.iloc[:,1:]
y = df.iloc[:,0]
t1 = time.time()
train_X, valid_X, train_y, valid_y = train_test_split(X,y, test_size = .10, random_state=2023)
t2 = time.time()
print(t2-t1)
# 5.838041305541992
Compare CPU and GPU performance at different iteration counts.
# Set objective=regression to change to a regression problem
import lightgbm as lgb
# create dataset for lightgbm
dtrain = lgb.Dataset(train_X, train_y)
dvalid = lgb.Dataset(valid_X, valid_y, reference=dtrain)
t_cpu = []; t_nvidia = []; t_intel=[]
a_cpu = []; a_nvidia = []; a_intel=[]
t3 = time.time()
for num_iterations in [50,100,150,200]:
# CPU --------------------------------------------------------------------------------
params = {'objective':'binary',
'num_iterations':num_iterations,
'max_bin': 63,
'num_leaves': 255,
'learning_rate': 0.1,
'tree_learner': 'serial',
'task': 'train',
'is_training_metric': 'false',
'min_data_in_leaf': 1,
'min_sum_hessian_in_leaf': 100,
'ndcg_eval_at': [1, 3, 5, 10],
'device': 'cpu'
}
t0 = time.time()
gbm = lgb.train(params, train_set=dtrain, num_boost_round=10,
valid_sets=dvalid, feature_name='auto', categorical_feature='auto')
t1 = time.time()
t = round(t1-t0,2)
t_cpu.append(t)
# 50: 46.00722551345825 100: 138.17840361595154 150: 195.13047289848328
print('cpu version elapse time: {}'.format(t1-t0))
# predict
y_pred = gbm.predict(valid_X, num_iteration=gbm.best_iteration)
# AUC 0.8207000864285819 0.8304736031418638 0.8353184609588433
auc_score = roc_auc_score(valid_y,y_pred)
a_cpu.append(round(auc_score,4))
print(auc_score)
# NVIDIA GeForce RTX 2060 with Max-Q Design ------------------------------------------
params = {'objective':'binary',
'num_iterations':num_iterations,
'max_bin': 63,
'num_leaves': 255,
'learning_rate': 0.1,
'tree_learner': 'serial',
'task': 'train',
'is_training_metric': 'false',
'min_data_in_leaf': 1,
'min_sum_hessian_in_leaf': 100,
'ndcg_eval_at': [1, 3, 5, 10],
'device': 'gpu',
'gpu_platform_id': 1,
'gpu_device_id': 0
}
t0 = time.time()
gbm = lgb.train(params, train_set=dtrain, num_boost_round=10,
valid_sets=dvalid, feature_name='auto', categorical_feature='auto')
t1 = time.time()
t = round(t1-t0,2)
t_nvidia.append(t)
# 50: 54.93808197975159 100: 103.01487278938293 150: 146.14963364601135
print('gpu version elapse time: {}'.format(t1-t0))
# predict
y_pred = gbm.predict(valid_X, num_iteration=gbm.best_iteration)
# AUC 0.8207000821252757 0.8304736011279031 0.8353184579727403
auc_score = roc_auc_score(valid_y,y_pred)
a_nvidia.append(round(auc_score,4))
print(auc_score)
# Intel(R) UHD Graphics ------------------------------------------------------------
params = {'objective':'binary',
'num_iterations':num_iterations,
'max_bin': 63,
'num_leaves': 255,
'learning_rate': 0.1,
'tree_learner': 'serial',
'task': 'train',
'is_training_metric': 'false',
'min_data_in_leaf': 1,
'min_sum_hessian_in_leaf': 100,
'ndcg_eval_at': [1, 3, 5, 10],
'device': 'gpu'
}
t0 = time.time()
gbm = lgb.train(params, train_set=dtrain, num_boost_round=10,
valid_sets=dvalid, feature_name='auto', categorical_feature='auto')
t1 = time.time()
t = round(t1-t0,2)
t_intel.append(t)
# 62.83425784111023
print('gpu version elapse time: {}'.format(t1-t0))
# predict
y_pred = gbm.predict(valid_X, num_iteration=gbm.best_iteration)
# AUC 0.8207000820323747
auc_score = roc_auc_score(valid_y,y_pred)
a_intel.append(round(auc_score,4))
print(auc_score)
t4 = time.time()
print('Total elapse time: {}'.format(t4-t3))
Plotting.
perf_t = pd.DataFrame({"iterations":[50,100,150,200],"cpu":t_cpu,"Nvidia":t_nvidia,"Intel":t_intel})
perf_a = pd.DataFrame({"iterations":[50,100,150,200],"cpu":a_cpu,"Nvidia":a_nvidia,"Intel":a_intel})
perf_a["cpu"] = perf_a["cpu"]*100
perf_a["Nvidia"] = perf_a["Nvidia"]*100
perf_a["Intel"] = perf_a["Intel"]*100
iterations = [50,100,150,200]
import matplotlib.pyplot as plt
plt.rcParams["font.sans-serif"]=["SimHei"] #设置字体
plt.rcParams["axes.unicode_minus"]=False #正常显示负号
fig,ax1 = plt.subplots()
ax2 = ax1.twinx() # 做镜像处理
ax1.plot(iterations,t_cpu,'b', label="CPU")
ax1.plot(iterations,t_nvidia,'g', label="Nvidia")
ax1.plot(iterations,t_intel,'r', label="Intel")
ax1.legend(loc="upper left")
ax2.plot(iterations,perf_a["cpu"],"b--", label="CPU")
ax2.plot(iterations,perf_a["Nvidia"],"g--", label="Nvidia")
ax2.plot(iterations,perf_a["Intel"] ,"r--", label="Intel")
ax2.legend(loc="lower right")
ax1.set_xlabel('迭代次数') #设置x轴标题
ax1.set_ylabel('时间(秒)') #设置Y1轴标题
ax2.set_ylabel('AUC(%)') #设置Y2轴标题
plt.show()
On the CPU, num_iterations=50 takes about 48 seconds. The first run has to load the data and takes a bit longer, so use the second run as the reference (note: the logs below come from runs that trained on all 11 million rows).
[LightGBM] [Info] Number of positive: 5829123, number of negative: 5170877
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.191587 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1524
[LightGBM] [Info] Number of data points in the train set: 11000000, number of used features: 28
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.529920 -> initscore=0.119824
[LightGBM] [Info] Start training from score 0.119824
cpu version elapse time: 47.76462769508362
On the NVIDIA GPU, num_iterations=50 takes about 60 seconds; once the iteration count rises, say beyond 100, the GPU becomes faster than the CPU.
[LightGBM] [Info] Number of positive: 5829123, number of negative: 5170877
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1524
[LightGBM] [Info] Number of data points in the train set: 11000000, number of used features: 28
[LightGBM] [Info] Using requested OpenCL platform 1 device 0
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 2060 with Max-Q Design, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 64 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 28 dense feature groups (293.73 MB) transferred to GPU in 0.480376 secs. 0 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.529920 -> initscore=0.119824
[LightGBM] [Info] Start training from score 0.119824
gpu version elapse time: 59.27095651626587
The Intel GPU integrated in the CPU takes about 70 seconds:
[LightGBM] [Info] Number of positive: 5829123, number of negative: 5170877
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1524
[LightGBM] [Info] Number of data points in the train set: 11000000, number of used features: 28
[LightGBM] [Info] Using GPU Device: Intel(R) UHD Graphics, Vendor: Intel(R) Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 64 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 28 dense feature groups (293.73 MB) transferred to GPU in 0.262090 secs. 0 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.529920 -> initscore=0.119824
[LightGBM] [Info] Start training from score 0.119824
gpu version elapse time: 70.56596112251282

The figure above shows that as num_iterations increases (50, 100, 150, 200), the NVIDIA GPU (green) gradually overtakes the CPU (blue); at num_iterations=200 it is already about 50% faster, and the solid red line shows that by then even the integrated Intel GPU has become faster than the CPU. The dashed lines show the AUC rising with the iteration count; with identical parameters the accuracy does not depend on the device, so the three dashed lines coincide. So the advantage of GPU computing can be verified in this example: when higher accuracy is needed, more training iterations are needed, and that is where the GPU accelerates training. In other words, the dataset must be large and the iteration count high for the advantage to show.
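The crossover can also be read off numerically as a speed-up ratio computed from the timing table built above.
# Small follow-up: CPU time divided by GPU time per iteration count;
# values above 1 mean the GPU run finished faster.
perf_t["nvidia_speedup"] = perf_t["cpu"] / perf_t["Nvidia"]
perf_t["intel_speedup"] = perf_t["cpu"] / perf_t["Intel"]
print(perf_t)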
IV. XGBoost classification on the HIGGS dataset
The summary thread on XGBoost parameters for the Kaggle HIGGS competition does not report a good AUC, so I train a model from scratch with Bayesian optimization starting from the defaults; see the XGBoost parameter documentation. The data-loading code is not repeated.
from xgboost import XGBClassifier
space_xgb = {
'max_bin': hp.choice('max_bin', range(50, 512)), # CPU 50-501
'max_depth': hp.choice('max_depth', range(3, 11)),
'n_estimators': hp.choice('n_estimators', range(100, 1001)),
'learning_rate': hp.uniform('learning_rate', 0.01, 0.5),
'subsample': hp.uniform('subsample', 0.5, 1),
'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
'reg_alpha': hp.uniform('reg_alpha', 0, 5), # lambda_l1
'reg_lambda': hp.uniform('reg_lambda', 0, 3), # lambda_l2
'gamma': hp.uniform('gamma',0.0, 10), # min_split_loss, min_split_gain
'min_child_weight': hp.uniform('min_child_weight',0.0001, 50),
}
def f_xgb(params):
# xgb = XGBClassifier(objective ='binary:logistic', use_label_encoder=False, seed = 2023,\
# nthread=-1, verbosity=0, **params) # CPU
xgb = XGBClassifier(tree_method='gpu_hist', objective ='binary:logistic', use_label_encoder=False,\
nthread=-1, seed = 2023, verbosity=0,**params) # GPU
xgb_model = xgb.fit(train_X, train_y)
acc = xgb_model.score(valid_X,valid_y)
# acc = cross_val_score(xgb, train_X, train_y).mean() # CPU
# acc = cross_val_score(xgb, train_X, train_y, n_jobs=6).mean() # GPU
return -acc
# trials = Trials()
# Set initial values, start searching from the best point of GridSearchCV(), and default values
trials = generate_trials_to_calculate([{'max_bin':256-50, # default 256
'max_depth':6-3, # default 6 [0,∞]
'n_estimators':200-100, # default 10
'learning_rate':0.3, # default 0.3 [0,1]
'subsample':1.0, # default 1.0 (0,1]
'colsample_bytree':1.0, # default 1.0 (0,1]
'reg_alpha':0, # default 0.0
'reg_lambda':1.0, # default 1.0
'gamma':0, # default 0.0 [0,∞]
'min_child_weight':1 # default 1 [0,∞]
}])
t1 = time.time()
best_params = fmin(f_xgb, space_xgb, algo=tpe.suggest, max_evals=9, trials=trials)
t2 = time.time()
print("Time elapsed: ", t2-t1)
First evaluate with the parameters found after only 10 trials; although this parameter set's AUC is not high, the GPU speed-up is already quite noticeable. The plotting code is not repeated either.
t_cpu = []; t_nvidia = []
a_cpu = []; a_nvidia = []
t3 = time.time()
# for num_iterations in [50]:
for num_iterations in [50,100,150,200]:
# num_iterations =689
# CPU --------------------------------------------------------------------------------
params = {'objective':'binary:logistic',
'max_bin':286,
'n_estimators':num_iterations,
'learning_rate': 0.3359071085471539,
'max_depth': 4,
'min_child_weight':6.4817419839798385,
'colsample_bytree': 0.7209249276177966,
'subsample': 0.5532140686826488,
'reg_alpha': 2.2793074958255986,
'reg_lambda': 2.4142485681002315,
'gamma' :2.9324177415122934,
'nthread': -1,
'tree_method': 'hist'
}
t0 = time.time()
xgb = XGBClassifier(random_state =2023, use_label_encoder=False, **params)
xgb_model = xgb.fit(train_X, train_y)
t1 = time.time()
t = round(t1-t0,2)
t_cpu.append(t)
print('cpu version elapse time: {}'.format(t1-t0))
# predict
y_pred = xgb_model.predict(valid_X)
auc_score = roc_auc_score(valid_y,y_pred)
a_cpu.append(round(auc_score,4))
print(auc_score)
# NVIDIA GeForce RTX 2060 with Max-Q Design ------------------------------------------
params = {'objective':'binary:logistic',
'max_bin':286,
'n_estimators':num_iterations,
'learning_rate': 0.3359071085471539,
'max_depth': 4,
'min_child_weight':6.4817419839798385,
'colsample_bytree': 0.7209249276177966,
'subsample': 0.5532140686826488,
'reg_alpha': 2.2793074958255986,
'reg_lambda': 2.4142485681002315,
'gamma' :2.9324177415122934,
'nthread': -1,
'tree_method': 'gpu_hist'
}
t0 = time.time()
xgb = XGBClassifier(random_state =2023, use_label_encoder=False, **params)
xgb_model = xgb.fit(train_X, train_y)
t1 = time.time()
t = round(t1-t0,2)
t_nvidia.append(t)
print('gpu version elapse time: {}'.format(t1-t0))
# predict
y_pred = xgb_model.predict(valid_X)
auc_score = roc_auc_score(valid_y,y_pred)
a_nvidia.append(round(auc_score,4))
print(auc_score)
t4 = time.time()
print('Total elapse time: {}'.format(t4-t3))
V. CatBoost classification on the HIGGS dataset
The two previous examples used GPU acceleration from Python, so this time let's try R.
The Higgs Boson Machine Learning Challenge is a Kaggle competition that closed eight years ago, with 1,784 participating teams. There is an XGBoost implementation here, and the related discussion is in that thread. Since no good AUC figure was reported, I decided to try training a model from scratch in R with Bayesian optimization, to see whether a better parameter combination could be found. Training on a dataset this large is very time-consuming, and the GPU speed-up turns out to be quite noticeable.
Just generating the first 10 parameter sets for the Bayesian optimization took more than an hour (tidymodels' Gaussian process requires more initial parameter sets than there are parameters, while in Python one set is enough; tidymodels generates thousands of candidate combinations per step, which takes longer per step but needs few iterations, whereas Python generates a single candidate per step and needs many more iterations; this is a notable difference between the two implementations). Training the model once per parameter set takes roughly 5 to 10 minutes. To speed things up, I used bootstraps() resampling with a single resample instead of 5-fold cross-validation, and ran only 10 iterations.
However, five of those runs failed because the required memory could not be allocated. Searching online shows this is a known issue when XGBoost runs on the GPU: the memory held by the previous training run is not released proactively, see that thread. There are some workarounds in Python, see the Memory Usage section of 《XGBoost GPU Support》 and the thread "How do I free all memory on GPU in XGBoost". So I tried CatBoost for this test instead.
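On the Python side, the workaround referenced above amounts to dropping the model object and forcing garbage collection between trainings, so that XGBoost can release the GPU memory held by the previous booster. A hedged sketch (behaviour can vary across XGBoost versions; train_X/train_y are the HIGGS frames from the Python code earlier):
# Hedged sketch: release the booster so XGBoost can free its GPU allocations
# before the next training run starts.
import gc
from xgboost import XGBClassifier

xgb = XGBClassifier(tree_method="gpu_hist", use_label_encoder=False, verbosity=0)
xgb.fit(train_X, train_y)
# ... evaluate the model here ...
del xgb       # drop the Python reference to the trained booster
gc.collect()  # encourage XGBoost to release the GPU memory it allocated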
CatBoost's k-fold cross-validation and grid search do support GPU parallelism; the figure below shows a 2-process, 5-fold cross-validation on Windows via the doParallel package, with both CPU and GPU under fairly high load. During Bayesian optimization, however, enabling parallelism makes parsnip report that no CatBoost implementation of boost_tree can be found; apparently the pkgs argument has no effect, so the parallel workers never load the treesnip and catboost packages.
Following that thread, preloading the required packages into each worker process of the parallel cluster solves the problem.
# 对于使用GPU的大数据集训练,验证2路最低限度并行。
# All operating systems,注册并行处理,并为每个并行处理worker加载需要的包。
# # https://github.com/tidymodels/tune/issues/157
library(doParallel)
cl <- makePSOCKcluster(2) # parallel::detectCores()
registerDoParallel(cl)
# 显示有几个worker
foreach::getDoParWorkers()
# 为每个并行处理worker加载需要的包
clusterEvalQ(cl,
{library(tidymodels)
library(treesnip)
library(catboost)
})
CatBoost on HIGGS with the GPU: 2-process parallel Bayesian optimization over 6 parameters.
After running overnight, however, the main process ultimately failed to establish the connection to the worker processes to read back the results. It was also much slower than a single process: generating the 10 initial points, i.e. the first 10 trainings, takes just over an hour in a single process but a whole night with two processes (probably too many parameters and not enough memory, see below).
Forced gc(): 0.1 Seconds.
> Generating a set of 10 initial parameter results
Error in unserialize(socklist[[n]]) : error reading from connection
> t2<-proc.time()
> cat(t2-t1)
17.41 18.23 20864.81 NA NA
Next I tried LightGBM and found that it can run two worker processes in parallel on the GPU. Tuning a single parameter with Bayesian optimization, tree_depth in the figure below for instance, completes without trouble; tuning many parameters, say seven, runs out of memory, triggers heavy swapping to disk and hangs the program. LightGBM keeps the CPU and memory maxed out, with the GPU peaking around 30% under 2-way parallelism. Single-process training reportedly needs roughly three times the memory footprint of the data, so two processes need more than six times; the 24 GB in my laptop is not enough for multi-process parallel training on the HIGGS dataset (a rough estimate is sketched below).
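A back-of-the-envelope check of that memory estimate (an approximation, not a measurement):
# Rough estimate: HIGGS held as float64 in memory, label column included.
rows, cols, bytes_per_value = 11_000_000, 29, 8
data_gb = rows * cols * bytes_per_value / 1024**3          # about 2.4 GB raw
print(f"raw data ~{data_gb:.1f} GB; ~3x per worker -> {3 * data_gb:.1f} GB; "
      f"two workers -> {6 * data_gb:.1f} GB before counting the OS and R itself")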
Tuning a single LightGBM parameter with Bayesian optimization, starting from 2 grid-search initial values and iterating 10 times, found better parameters in 5 of the iterations; the effect is clear.
Forced gc(): 0.2 Seconds.
-- Iteration 4 -----------------------------------------------------------------------------------------------------
i Current best: roc_auc=0.7369 (@iter 2)
i Gaussian process model
! Gaussian process model: X should be in range (0, 1)
√ Gaussian process model
i Generating 5000 candidates
i Predicted candidates
i learn_rate=0.046
i Estimating performance
√ Estimating performance
<3 Newest results: roc_auc=0.8011
Forced gc(): 0.3 Seconds.
-- Iteration 5 -----------------------------------------------------------------------------------------------------
i Current best: roc_auc=0.8011 (@iter 4)
i Gaussian process model
√ Gaussian process model
i Generating 5000 candidates
i Predicted candidates
i learn_rate=0.0999
i Estimating performance
√ Estimating performance
<3 Newest results: roc_auc=0.8116
Then I tried CatBoost on the GPU with 2-process parallelism tuning a single parameter, and it ran to completion, averaging about 530 seconds per training round. The tests show that for large datasets you need plenty of memory and should not tune too many parameters at once. :)
-- Iteration 10 ----------------------------------------------------------------------------------------------------
i Current best: roc_auc=0.8295 (@iter 3)
i Gaussian process model
√ Gaussian process model
i Generating 5000 candidates
i Predicted candidates
i learn_rate=0.0999
i Estimating performance
√ Estimating performance
(x) Newest results: roc_auc=0.8293
> t2<-proc.time()
> cat(t2-t1)
105.55 5711.49 10611.05 NA NA
To get a better AUC, several parameters have to be tuned, so in the end I ran the 6-parameter CatBoost GPU tuning in a single process; only the first 50 iterations were run here, and that already took a whole night.
library(tidymodels)
library(kableExtra)
library(tidyr)
# 这个版本的treesnip支持classification。
# remotes::install_github("Glemhel/treesnip", INSTALL_opts = c("--no-multiarch"))
# library(catsnip)
library(treesnip)
library(data.table)
# 对于使用GPU的大数据集训练,验证2路最低限度并行,一个参数通过,多参数跑不出来,内存不够大。
# All operating systems,注册并行处理,并为每个并行处理worker加载需要的包。
# # https://github.com/tidymodels/tune/issues/157
# library(doParallel)
# cl <- makePSOCKcluster(2) # parallel::detectCores()
# registerDoParallel(cl)
# # 显示有几个worker
# foreach::getDoParWorkers()
# # 为每个并行处理worker加载需要的包
# clusterEvalQ(cl,
# {library(tidymodels)
# library(treesnip)
# library(catboost)
# })
# 优先使用tidymodels的同名函数。
tidymodels_prefer()
# ----------------------------------------------------------------------------------------
# 加载经过预处理的数据
t1<-proc.time()
higgs<- fread("D:/temp/data/HIGGS/HIGGS.csv", header=FALSE, encoding="UTF-8")
higgs$V1<-as.factor(higgs$V1)
t2<-proc.time()
cat(t2-t1)
# 17.41 16.25 34.02 NA NA
names(higgs)
# 划分训练集与测试集
t1<-proc.time()
set.seed(2023)
higgs_split <- initial_split(higgs, prop = 0.90)
higgs_train <- training(higgs_split)
higgs_test <- testing(higgs_split)
t2<-proc.time()
cat(t2-t1)
# 6.63 0.55 7.2 NA NA
# ----------------------------------------------------------------------------------------
# 贝叶斯优化
# 可以调整菜谱参数、模型主参数及引擎相关参数。
# 定义菜谱:回归公式与预处理
higgs_rec<-
recipe(V1 ~ ., data = higgs_train) %>%
# 标准化数值型变量
step_normalize(all_numeric_predictors())
# 定义模型:CatBoost, 定义要调整的参数,task_type = 'GPU',使用GPU。
cat_spec <-
boost_tree(mtry=tune(), tree_depth = tune(), trees = tune(), learn_rate = tune(), min_n = tune()) %>%
set_engine('catboost', subsample = tune("subsample"), task_type = 'GPU') %>% #
set_mode('classification')
# 定义工作流
cat_wflow <-
workflow() %>%
add_model(cat_spec) %>%
add_recipe(higgs_rec)
# 全部参数的边界都已确定。
cat_param <- cat_wflow %>%
extract_parameter_set_dials() %>%
update(learn_rate = threshold(c(0.01,0.5))) %>%
update(trees = trees(c(500,1000))) %>%
update(tree_depth = tree_depth(c(5,15))) %>%
update(mtry = mtry(c(3,6))) %>%
update(subsample = threshold(c(0.5,1)))
# 查看参数边界,都已确定
cat_param
# 查看参数边界,都已确定
cat_param %>% extract_parameter_dials("trees")
cat_param %>% extract_parameter_dials("min_n")
cat_param %>% extract_parameter_dials("tree_depth")
cat_param %>% extract_parameter_dials("learn_rate")
cat_param %>% extract_parameter_dials("mtry")
cat_param %>% extract_parameter_dials("subsample")
# 对于大数据集来说,多折交叉验证的时间太长了,用boostraps抽样验证,只做一次加快训练速度。
#higgs_folds <- vfold_cv(higgs_train, v = 5)
higgs_folds <- bootstraps(higgs_train, times = 1)
gc()
# 执行贝叶斯优化
ctrl <- control_bayes(verbose = TRUE, no_improve = Inf)
# set.seed(2023)
t1<-proc.time()
cat_res_bo <-
cat_wflow %>%
tune_bayes(
resamples = higgs_folds,
# metrics = metric_set(recall, precision, f_meas, accuracy, kap,roc_auc, sens, spec)
    metrics = metric_set(accuracy, roc_auc, precision),
initial = 10,
param_info = cat_param,
iter = 100,
control = ctrl,
# Hack了一下tune_bayes()函数,增加参数force_gc,迭代中每次训练前可以选择强制回收内存。
force_gc = TRUE
)
t2<-proc.time()
cat(t2-t1)
# 9435.55 269.2 9085.21 NA NA
# 画图查看贝叶斯优化效果
autoplot(cat_res_bo, type = "performance", metric="roc_auc")
# 查看准确率最高的模型
show_best(cat_res_bo, metric="precision")
show_best(cat_res_bo, metric="accuracy")
show_best(cat_res_bo, metric="roc_auc")
# 选择准确率最高的模型
select_best(cat_res_bo, metric="roc_auc")
# 直接读取调参的最佳结果
cat_param_best<- select_best(cat_res_bo, metric="roc_auc")
# 最佳参数回填到工作流
cat_wflow_bo <-
cat_wflow %>%
finalize_workflow(cat_param_best)
cat_wflow_bo
# 用最佳参数在训练集全集上训练模型
t1<-proc.time()
# 回收内存,否则训练可能因申请不到内存而失败,
# 前面贝叶斯优化函数中如果加入回收内存的机制,应该就可以避免训练失败。
gc()
cat_fit_bo<- cat_wflow_bo %>% fit(higgs_train)
t2<-proc.time()
cat(t2-t1)
# 647.2 183.99 507.11 NA NA
# 测试集
# 预测值
# https://parsnip.tidymodels.org/reference/predict.model_fit.html
# https://yardstick.tidymodels.org/reference/roc_auc.html
t1<-proc.time()
higgs_test_bo <- predict(cat_fit_bo, new_data = higgs_test %>% select(-V1), type = "prob") %>%
bind_cols(predict(cat_fit_bo, new_data = higgs_test %>% select(-V1), type = "class"))
t2<-proc.time()
cat(t2-t1)
# 67.8 0.39 5.83 NA NA
# 合并真实值
higgs_test_bo <- bind_cols(higgs_test_bo, higgs_test %>% select(V1))
higgs_metrics <- metric_set(precision, accuracy)
higgs_metrics(higgs_test_bo, truth = V1, estimate = .pred_class)
roc_auc(
higgs_test_bo,
truth = V1,
estimate=.pred_0,
options = list(smooth = TRUE)
)
> show_best(cat_res_bo, metric="precision")
# A tibble: 5 x 13
mtry trees min_n tree_depth learn_rate subsample .metric .estimator mean n std_err .config .iter
<int> <int> <int> <int> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <int>
1 4 993 25 15 0.198 0.602 precision binary 0.703 1 NA Iter9 9
2 5 918 14 15 0.294 0.627 precision binary 0.702 1 NA Iter1 1
3 5 931 5 15 0.175 0.703 precision binary 0.701 1 NA Iter12 12
4 4 981 18 15 0.153 0.522 precision binary 0.701 1 NA Iter10 10
5 3 988 21 15 0.153 0.945 precision binary 0.701 1 NA Iter4 4
> show_best(cat_res_bo, metric="accuracy")
# A tibble: 5 x 13
mtry trees min_n tree_depth learn_rate subsample .metric .estimator mean n std_err .config .iter
<int> <int> <int> <int> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <int>
1 4 997 16 15 0.118 0.685 accuracy binary 0.754 1 NA Iter18 18
2 3 985 36 15 0.124 0.583 accuracy binary 0.753 1 NA Iter17 17
3 5 966 8 15 0.128 0.869 accuracy binary 0.753 1 NA Iter11 11
4 5 989 3 15 0.117 0.816 accuracy binary 0.753 1 NA Iter15 15
5 4 981 18 15 0.153 0.522 accuracy binary 0.753 1 NA Iter10 10
> show_best(cat_res_bo, metric="roc_auc")
# A tibble: 5 x 13
mtry trees min_n tree_depth learn_rate subsample .metric .estimator mean n std_err .config .iter
<int> <int> <int> <int> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <int>
1 6 986 24 14 0.117 0.558 roc_auc binary 0.849 1 NA Iter16 16
2 5 957 11 15 0.109 0.839 roc_auc binary 0.849 1 NA Iter14 14
3 4 997 16 15 0.118 0.685 roc_auc binary 0.849 1 NA Iter18 18
4 5 989 3 15 0.117 0.816 roc_auc binary 0.849 1 NA Iter15 15
5 3 985 36 15 0.124 0.583 roc_auc binary 0.848 1 NA Iter17 17
> select_best(cat_res_bo, metric="roc_auc")
# A tibble: 1 x 7
mtry trees min_n tree_depth learn_rate subsample .config
<int> <int> <int> <int> <dbl> <dbl> <chr>
1 6 986 24 14 0.117 0.558 Iter16
> higgs_metrics(higgs_test_bo, truth = V1, estimate = .pred_class)
# A tibble: 2 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 precision binary 0.698
2 accuracy binary 0.755
> roc_auc(
+ higgs_test_bo,
+ truth = V1,
+ estimate=.pred_0,
+ options = list(smooth = TRUE)
+ )
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.854
During this run neither the CPU nor the GPU load is high, under 30%, so the hardware's capability is still not being fully used.
Note that this parallelism is multi-process: the processes do not share data and communicate with the parent process through the doParallel package, so there are as many copies of the data as there are processes, hence the large memory requirement. In CPU mode, as shown below, CatBoost supports multi-threading (within the same parent process), where the threads share a single copy of the data and the memory overhead is small. The later tests show, however, that the nthread parameter has no effect in CatBoost's GPU mode, which apparently does not support multi-threading, i.e. you cannot add threads on top of multiple processes for extra speed. So to make fuller use of the GPU, the only option is more memory.
Now compare training and prediction times on the GPU and the CPU. CatBoost's fit() supports multi-threaded parallelism, set through the nthread parameter in the treesnip wrapper. Compared at the same thread count the GPU should of course be much faster (and indeed it is), since an extra GPU is helping with the computation. My laptop has 8 physical cores (16 logical); 12 threads saturate the CPU. Testing shows nthread has no effect on the GPU, and fit() has no multi-process option like the Bayesian optimization does; it runs in a single process. See the references.
library(tidymodels)
library(kableExtra)
library(tidyr)
# 这个版本的treesnip支持classification。
# remotes::install_github("Glemhel/treesnip", INSTALL_opts = c("--no-multiarch"))
# library(catsnip)
library(treesnip)
library(data.table)
# 对于使用GPU的大数据集训练,验证2路最低限度并行。
# All operating systems,注册并行处理,并为每个并行处理worker加载需要的包。
# # https://github.com/tidymodels/tune/issues/157
# https://curso-r.github.io/treesnip/articles/parallel-processing.html
library(doParallel)
# cl <- makePSOCKcluster(parallel::detectCores())
cl <- makePSOCKcluster(12) # CPU fit
# cl <- makePSOCKcluster(2) # GPU fit
registerDoParallel(cl)
# 显示有几个worker
foreach::getDoParWorkers()
# 为每个并行处理worker加载需要的包
clusterEvalQ(cl,
{library(tidymodels)
library(treesnip)
library(catboost)
})
# 优先使用tidymodels的同名函数。
tidymodels_prefer()
# ----------------------------------------------------------------------------------------
# 加载经过预处理的数据
t1<-proc.time()
higgs<- fread("D:/temp/data/HIGGS/HIGGS.csv", header=FALSE, encoding="UTF-8")
higgs$V1<-as.factor(higgs$V1)
t2<-proc.time()
cat(t2-t1)
# 17.41 16.25 34.02 NA NA
names(higgs)
# 划分训练集与测试集
t1<-proc.time()
set.seed(2023)
higgs_split <- initial_split(higgs, prop = 0.90)
higgs_train <- training(higgs_split)
higgs_test <- testing(higgs_split)
t2<-proc.time()
cat(t2-t1)
# 6.63 0.55 7.2 NA NA
# 用一组较好的参数比较CPU和GPU的性能----------------------------------------------------
# https://curso-r.github.io/treesnip/articles/parallel-processing.html
# 定义菜谱:回归公式与预处理
higgs_rec<-
recipe(V1 ~ ., data = higgs_train) %>%
# 标准化数值型变量
step_normalize(all_numeric_predictors())
# 定义模型:Catboost, 定义要调整的参数
cat_spec <-
boost_tree(mtry=tune(), tree_depth = tune(), trees = tune(), learn_rate = tune(), min_n = tune()) %>%
#set_engine('catboost', subsample = tune("subsample"), task_type = 'GPU', nthread = 2) %>% # GPU
set_engine('catboost', subsample = tune("subsample"), task_type = 'CPU', nthread = 12) %>% # CPU
set_mode('classification')
# 定义工作流
cat_wflow <-
workflow() %>%
add_model(cat_spec) %>%
add_recipe(higgs_rec)
# 构造最佳参数
cat_param_best<-
tibble(
mtry = 6,
trees = 986,
min_n = 24,
tree_depth = 14,
learn_rate = 0.117 ,
subsample = 0.558
)
# 最佳参数回填到工作流
cat_wflow_bo <-
cat_wflow %>%
finalize_workflow(cat_param_best)
# 用最佳参数在训练集全集上训练模型
t1<-proc.time()
# fit函数没有并行,比的都是单进程。
cat_fit_bo<- cat_wflow_bo %>% fit(higgs_train)
t2<-proc.time()
cat(t2-t1)
#GPU单线程 650.52 183.77 511.34 NA NA
# CPU 12线程 65252.86 2728.51 6305.28 NA NA
# CPU 12线程 15980.84 672.2 1944.65 NA NA
#生成训练、测试预测及性能数据
t1<-proc.time()
higgs_test_bo <- predict(cat_fit_bo, new_data = higgs_test %>% select(-V1), type = "prob")
t2<-proc.time()
cat(t2-t1)
#GPU 36.42 0.07 2.69 NA NA
#CPU 40.11 0.9 3.35 NA NA
higgs_test_bo <- bind_cols(higgs_test_bo, higgs_test %>% select(V1))
roc_auc(
higgs_test_bo,
truth = V1,
estimate=.pred_0,
options = list(smooth = TRUE)
)
#GPU 85.4
#CPU 85.4
Because this parameter set trains for 986 boosting rounds it is slow: 6,305 seconds with 12 CPU threads versus 511 seconds on the GPU, roughly a 12x speed-up, so the advantage of GPU computing is fully demonstrated here (and the hardware load is not even high). Prediction times are about the same; the main acceleration is in training, and the larger the dataset and the higher the iteration count, the more pronounced the GPU advantage.
Reference: 《When to Choose CatBoost Over XGBoost or LightGBM [Practical Guide]》, which covers the main parameters that control overfitting and training speed, and compares the algorithms.