Realizing the Advantage of GPU Computing in Machine Learning Algorithms

  The advantage of GPU computing is already well established in deep learning. For applications in the tax domain, see my articles 《升级HanLP并使用GPU后端识别发票货物劳务名称》, 《HanLP识别发票货物劳务名称之三 GPU加速》, and 《外一篇:深度学习之VGG16模型雪豹识别》. HanLP uses the TensorFlow and PyTorch deep learning frameworks; interested vendors can also try their own frameworks.
  Those articles all ran on Python. In R, TensorFlow and Keras have corresponding interface packages (the backend still runs in Python); see 《R语言深度学习》. More recently a native deep learning framework for R, Torch for R, has been developed, as well as native Apache MXNet; I have not run the latter two yet and may try them when time permits.
  For installing and using GPUs on Linux, see my series of articles on Jianshu (简书).
  In traditional machine learning applications, mainly classification and regression, some algorithm implementations also try to use GPU computing to improve performance. In the earlier articles 《墨尔本房价回归模型(Python)》 and 《用Tidy Models实现墨尔本房价回归模型(R)》, the three GBDT (gradient-boosted decision tree) implementations widely regarded on Kaggle as world-class, XGBoost, LightGBM, and CatBoost, all support running on the GPU. This raises both a question and an opportunity: to explore whether, and under what conditions, GPU computing can show its advantage in this area. It is a question of real practical value: with so many GPUs on the major cloud platforms, PCs, and laptops, whether they can be used effectively is an important criterion when choosing algorithm implementations and technical approaches. The implementations already exist, and there are plenty of examples online demonstrating the GPU's advantage for classification or regression on large data sets, so the possibility is not in doubt; the question is finding the conditions for realizing it in one's own application scenarios, which requires actual measurement.
  In the Melbourne housing price regression example, measurements showed that both the Python and the R implementations were slower on the GPU (an Nvidia GeForce RTX 2060 Max-Q with 1,920 CUDA cores) than on the CPU (an Intel Core i7 with 8 cores / 16 logical cores). This needs a closer look: is the GPU being used incorrectly (wrong parameters), is it a property of the data set, or is this simply the limit of the hardware? Answering that clarifies the conditions for realizing the GPU's advantage in real applications.
I. Tests in R
  I have recently been writing an introductory article on Tidy Models, so I will cover the situation in R first and Python afterwards; the conclusions are the same.
1. The XGBoost algorithm
  The XGBoost open-source framework is developed mainly at the University of Washington. The XGBoost package installed by default from CRAN does not support the GPU; you need to install the release from its GitHub page, which provides prebuilt Windows and Linux binaries (currently version 1.7.3.1). Download and install it, then add one parameter at run time: tree_method="gpu_hist".

set_engine('xgboost', tree_method="gpu_hist")
# -----------------------------------------------------------------------------------------
library(tidymodels)
library(kableExtra)
library(tidyr)
# All operating systems: register parallel processing
library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores())
registerDoParallel(cl)
# Prefer the tidymodels functions when names conflict
tidymodels_prefer()

# Outlier threshold of 30
threshold<- 30

# ----------------------------------------------------------------------------------------
# Load the preprocessed data
melbourne<- read.csv("D:/temp/data/Melbourne_housing/Melbourne_housing_pre.csv")
# Filter out rows with missing values
# Error: Missing data in columns: BuildingArea.
# 47 obs.
missing <- filter(melbourne, BuildingArea==0)
melbourne <- filter(melbourne, BuildingArea!=0)

# Split into training and test sets
set.seed(2023)
melbourne_split <- initial_split(melbourne, prop = 0.80)
melbourne_train <- training(melbourne_split)
melbourne_test  <-  testing(melbourne_split)

# ----------------------------------------------------------------------------------------------------
# Bayesian optimization
# Recipe parameters, main model parameters and engine-specific parameters can all be tuned.

# Define the recipe: regression formula and preprocessing
melbourne_rec<-
  recipe(LogPrice ~ Year + YearBuilt + Distance + Lattitude + Longtitude + BuildingArea
         + Rooms + Bathroom + Car + Type_h + Type_t + Type_u, data = melbourne_train) %>%
  # Standardize the numeric predictors
  step_normalize(all_numeric_predictors())

# Define the model: XGBoost; declare the parameters to tune. tree_method="gpu_hist" enables the GPU.
xgb_spec <-
  boost_tree(tree_depth = tune(), trees = tune(), learn_rate = tune(), min_n = tune(), loss_reduction = tune(), sample_size = tune(), stop_iter = tune()) %>%
  set_engine('xgboost', tree_method="gpu_hist") %>%
  set_mode('regression')

# Define the workflow
xgb_wflow <- 
  workflow() %>% 
  add_model(xgb_spec) %>% 
  add_recipe(melbourne_rec)

# The ranges of all parameters are fully determined.
xgb_param <- xgb_wflow %>%
  extract_parameter_set_dials() %>%
  update(learn_rate = threshold(c(0.01,0.5))) %>%
  update(trees = trees(c(500,1000))) %>%
  update(tree_depth = tree_depth(c(5,15))) %>%
  update(sample_size = threshold(c(0.5,1))) %>%
  finalize(melbourne_train)

xgb_param
# Inspect the parameter ranges; all are determined
xgb_param %>% extract_parameter_dials("trees")
xgb_param %>% extract_parameter_dials("min_n")
xgb_param %>% extract_parameter_dials("tree_depth")
xgb_param %>% extract_parameter_dials("learn_rate")
xgb_param %>% extract_parameter_dials("loss_reduction")
xgb_param %>% extract_parameter_dials("sample_size")
xgb_param %>% extract_parameter_dials("stop_iter")

melbourne_folds <- vfold_cv(melbourne, v = 5)

# Run the Bayesian optimization
ctrl <- control_bayes(verbose = TRUE, no_improve = Inf)
# set.seed(2023)
t1<-proc.time()
xgb_res_bo <-
  xgb_wflow %>%
  tune_bayes(
    resamples = melbourne_folds,
    metrics = metric_set(rsq, rmse, mae),
    initial = 10,
    param_info = xgb_param,
    iter = 100,
    control = ctrl
  )
t2<-proc.time()
cat(t2-t1)
# CPU 3014 1.99 3989.22 NA NA
# GPU 5892.78 4.16 8416.28 NA NA

  You can see that the GPU was actually more than twice as slow. Watching the traffic during training showed a fair amount of data being copied back and forth between the GPU and the CPU.
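  To separate the raw training time from the resampling and tuning overhead, a single fit can also be timed on both devices with the native xgboost API, outside of tidymodels. The sketch below is only an illustration: it assumes the melbourne_train data frame from above, and the hyperparameter values are fixed and untuned.

# A minimal sketch (not the tuned model): time one xgboost fit on CPU vs GPU.
# Assumes melbourne_train from above; hyperparameter values are illustrative only.
library(xgboost)

predictors <- c("Year", "YearBuilt", "Distance", "Lattitude", "Longtitude", "BuildingArea",
                "Rooms", "Bathroom", "Car", "Type_h", "Type_t", "Type_u")
dtrain <- xgb.DMatrix(data = as.matrix(melbourne_train[, predictors]),
                      label = melbourne_train$LogPrice)

time_fit <- function(tree_method) {
  params <- list(objective = "reg:squarederror", max_depth = 8, eta = 0.1,
                 tree_method = tree_method)
  system.time(xgb.train(params, dtrain, nrounds = 500, verbose = 0))
}

time_fit("hist")      # CPU histogram algorithm
time_fit("gpu_hist")  # GPU; on a small data set the host-to-device copies may dominate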

2. The LightGBM algorithm
  LightGBM is developed by Microsoft. The default CRAN installation does not support the GPU either; the GPU build must be compiled from source. See 《Installation Guide: Build GPU Version》 and the LightGBM R-package GitHub page. After building the GPU version of LightGBM, run build_r.R in the project root directory to package and install the GPU-enabled lightgbm R package. The GPU build uses the OpenCL API by default, which Nvidia also supports; to build the CUDA-specific API, see 《Installation Guide: Build CUDA Version》. For the earlier Python tests I had built version 3.2.1.99 with CMake + VS Build Tools. Looking at LightGBM's release notes, the latest version is 3.3.4; the releases after 3.2.1 mainly adapt to R 4.2 and contain no major changes to GPU support, so I kept testing with 3.2.1.99 and will upgrade later if needed. That documentation also provides a simple test example with LightGBM's native R API; on a small data set the CPU is clearly faster than the GPU.

Rscript build_r.R --use-gpu

  As described in the LightGBM parameter documentation, under the OpenCL API two parameters identify the GPU vendor and device: gpu_platform_id and gpu_device_id. You can inspect them with the GPUCapsViewer tool, as shown in the figure below. Note that LightGBM numbers them from 0, so subtract 1 from the values the tool reports: on my laptop the integrated Intel GPU is platform 1 and the Nvidia GPU is platform 2 in the tool, so in the R program the Nvidia card is gpu_platform_id = 2-1 = 1 and gpu_device_id = 1-1 = 0.

Viewing the list of GPUs in the system
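  Before wiring these ids into parsnip, the mapping can be sanity-checked with LightGBM's native R API; the console log reports which device was actually selected. A minimal sketch, assuming the melbourne_train data frame from above and purely illustrative parameter values:

# Minimal sketch: verify the 0-based platform/device ids with the native lightgbm API.
# Assumes melbourne_train from above; parameter values are illustrative only.
library(lightgbm)

predictors <- c("Year", "YearBuilt", "Distance", "Lattitude", "Longtitude", "BuildingArea",
                "Rooms", "Bathroom", "Car", "Type_h", "Type_t", "Type_u")
dtrain <- lgb.Dataset(data = as.matrix(melbourne_train[, predictors]),
                      label = melbourne_train$LogPrice)

params_nvidia <- list(objective = "regression", device = "gpu",
                      gpu_platform_id = 1, gpu_device_id = 0)  # Nvidia: platform 2, device 1 in GPUCapsViewer
params_intel  <- list(objective = "regression", device = "gpu",
                      gpu_platform_id = 0, gpu_device_id = 0)  # integrated Intel GPU

bst <- lgb.train(params = params_nvidia, data = dtrain, nrounds = 100)
# The log should print "Using GPU Device: NVIDIA GeForce RTX 2060 ..." if the ids are right.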

  The data-loading code is the same as before and is not repeated; specifying device="gpu" and the related parameters is enough to use the GPU.

set_engine('lightgbm', device="gpu", gpu_platform_id=1, gpu_device_id = 0) 
# bonsai provides the parsnip interface for LightGBM
library(bonsai)

# Bayesian optimization
# Recipe parameters, main model parameters and engine-specific parameters can all be tuned.

# Define the recipe: regression formula and preprocessing
melbourne_rec<-
  recipe(LogPrice ~ Year + YearBuilt + Distance + Lattitude + Longtitude + BuildingArea
         + Rooms + Bathroom + Car + Type_h + Type_t + Type_u, data = melbourne_train) %>%
  # Standardize the numeric predictors
  step_normalize(all_numeric_predictors())

# Define the model: LightGBM; declare the parameters to tune
lgbm_spec <-
  boost_tree(tree_depth = tune(), trees = tune(), learn_rate = tune(), min_n = tune(),
             loss_reduction = tune(), sample_size = tune(), mtry=tune()) %>%
  # set_engine('lightgbm') %>%
  # The integrated Intel GPU has gpu_platform_id=0, gpu_device_id=0; the discrete Nvidia GPU has gpu_platform_id=1
  set_engine('lightgbm', device="gpu", gpu_platform_id=1, gpu_device_id = 0) %>%
  set_mode('regression')

# Define the workflow
lgbm_wflow <- 
  workflow() %>% 
  add_model(lgbm_spec) %>% 
  add_recipe(melbourne_rec)

# The range of mtry is not fully determined here; finalize() determines it from the data.
lgbm_param <- lgbm_wflow %>%
  extract_parameter_set_dials() %>%
  update(learn_rate = threshold(c(0.01,0.5))) %>%
  update(trees = trees(c(500,1000))) %>%
  update(tree_depth = tree_depth(c(5,15))) %>%
  update(mtry = mtry(c(3,6))) %>%
  update(sample_size = threshold(c(0.5,1))) %>%
  finalize(melbourne_train)

# Inspect the parameter ranges; all are determined
lgbm_param %>% extract_parameter_dials("trees")
lgbm_param %>% extract_parameter_dials("min_n")
lgbm_param %>% extract_parameter_dials("tree_depth")
lgbm_param %>% extract_parameter_dials("learn_rate")
lgbm_param %>% extract_parameter_dials("loss_reduction")
lgbm_param %>% extract_parameter_dials("sample_size")
lgbm_param %>% extract_parameter_dials("mtry")

melbourne_folds <- vfold_cv(melbourne, v = 5)

# Run the Bayesian optimization
ctrl <- control_bayes(verbose = TRUE, no_improve = Inf)
# set.seed(2023)
t1<-proc.time()
lgbm_res_bo <-
  lgbm_wflow %>%
  tune_bayes(
    resamples = melbourne_folds,
    metrics = metric_set(rsq, rmse, mae),
    initial = 10,
    param_info = lgbm_param,
    iter = 100,
    control = ctrl
  )
t2<-proc.time()
cat(t2-t1)
#CPU 4760.83 2.64 5503.5 NA NA
#GPU 5834.04 5.57 8285.5 NA NA

  Again the CPU is nearly twice as fast as the GPU.


LightGBM training on the GPU

3. The CatBoost algorithm
  CatBoost is an open-source GBDT framework developed by the Russian search-engine company Yandex. Its prebuilt packages for every operating system already support the GPU and can be downloaded from the latest release on the project page; the current version is 1.1.1. To use the GPU, just add the parameter task_type = 'GPU'; see the parameter documentation.

# treesnip provides the parsnip interface for catboost
library(treesnip)

# Define the recipe: regression formula and preprocessing
# 'Year','YearBuilt','Distance','Lattitude','Longtitude','Propertycount',
# 'Landsize','BuildingArea', 'Rooms','Bathroom', 'Car','Type_h','Type_t','Type_u'
melbourne_rec<-
  recipe(LogPrice ~ Year + YearBuilt + Distance + Lattitude + Longtitude + BuildingArea
         + Rooms + Bathroom + Car + Type_h + Type_t + Type_u, data = melbourne_train) %>%
  #step_log(BuildingArea, base = 10) %>%
  # Standardize the numeric predictors
  step_normalize(all_numeric_predictors())

# Define the model: CatBoost
cat_model<-
  boost_tree(trees = 1000, learn_rate=0.05) %>%
  set_engine("catboost", 
             loss_function = "RMSE", 
             eval_metric='RMSE',
             task_type = 'GPU'           # CatBoost runs less efficiently on the GPU than on the CPU here, perhaps because the data set is not big enough.
  )  %>%
  set_mode("regression")

# Define the workflow
cat_wflow <- 
  workflow() %>% 
  add_model(cat_model) %>% 
  add_recipe(melbourne_rec)

# Train the model
t1<-proc.time()
cat_fit <- fit(cat_wflow, melbourne_train)
t2<-proc.time()
cat(t2-t1)
# CPU  2.42 0.07 2.68 NA NA
# GPU  12.77 3.44 12.78 NA NA

  CatBoost was more than five times slower on the GPU, so I have not yet run the 100-iteration Bayesian optimization; during 5-fold cross-validation, however, both the CPU and the GPU were close to fully loaded.
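  For reference, the 5-fold cross-validation run mentioned above, during which both processors were busy, can be reproduced with fit_resamples() on the same workflow. A minimal sketch, assuming the cat_wflow workflow and the melbourne_train split defined earlier; nothing is tuned here, the parameters stay fixed.

# Minimal sketch of the 5-fold cross-validation run mentioned above.
# Assumes cat_wflow and melbourne_train from above; parameters stay fixed, nothing is tuned.
set.seed(2023)
melbourne_folds_cv <- vfold_cv(melbourne_train, v = 5)

t1<-proc.time()
cat_cv <- fit_resamples(
  cat_wflow,
  resamples = melbourne_folds_cv,
  metrics = metric_set(rsq, rmse, mae)
)
t2<-proc.time()
cat(t2-t1)

collect_metrics(cat_cv)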


CatBoost training on the GPU

II. Tests in Python
1. The XGBoost algorithm
  The Python build of XGBoost is a prebuilt binary that already supports the GPU; installing it with pip is enough.

pip install xgboost

  For the Bayesian optimization, training on the GPU again only needs the extra parameter tree_method='gpu_hist'. The Bayesian optimization implementation in Python may differ from the one in R: its Gaussian process step is very fast and appears to evaluate only one candidate parameter set per iteration (R generates thousands), so Python is run for 1,000 iterations while R is run for only 100. Switching between CPU and GPU modes only requires updating the objective function f_xgb() of the Bayesian optimization.
  The code shared by all the algorithms: load the packages and the data.

# Load the shared packages
# Ignore Warnings 
import warnings
warnings.filterwarnings('ignore')

# Basic Imports 
import numpy as np
import pandas as pd
import time

# Preprocessing
from sklearn.model_selection import train_test_split, KFold, cross_val_score

# Metrics 
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Model Tuning 
from hyperopt import fmin, tpe, hp, Trials
from hyperopt.fmin import generate_trials_to_calculate
    
# Load the data, split into training and validation sets, and standardize
# 9015
df_NN = pd.read_csv("D:/temp/data/Melbourne_housing/Melbourne_housing_pre.csv",  encoding="utf-8")

X=df_NN[['Year','YearBuilt','Distance','Lattitude','Longtitude','Propertycount',
          'Landsize','BuildingArea', 'Rooms','Bathroom', 'Car','Type_h','Type_t','Type_u']]
y=df_NN['LogPrice']
train_X, valid_X, train_y, valid_y = train_test_split(X,y, test_size = .20, random_state=42)

train_X2 = train_X.copy()
valid_X2 = valid_X.copy()

# Data standardization
mean = train_X.mean(axis=0)
train_X -= mean
std = train_X.std(axis=0)
train_X /= std
valid_X -= mean
valid_X /= std

XGBoost:

# ML Models
from xgboost import XGBRegressor 

# Define the parameter search space; narrowing the ranges makes the search much faster
space_xgb = {
    'max_bin': hp.choice('max_bin', range(8, 128)),                  # CPU 50-501 GPU 8-128
    'max_depth': hp.choice('max_depth', range(3, 11)),    
    'n_estimators': hp.choice('n_estimators', range(100, 1001)),
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.3),    
    'subsample': hp.uniform('subsample', 0.5, 0.99),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 0.99),
    'reg_alpha': hp.uniform('reg_alpha', 0, 5),                       # lambda_l1
    'reg_lambda': hp.uniform('reg_lambda', 0, 3),                     # lambda_l2
    'gamma': hp.uniform('gamma',0.0, 10),                             # min_split_loss, min_split_gain
    'min_child_weight': hp.uniform('min_child_weight',0.0001, 50),
}

# Define the objective function
def f_xgb(params):
    # Set extra_trees=True to avoid overfitting
     # CPU 4.96s/trial
    # xgb = XGBRegressor(objective ='reg:squarederror', seed = 0,verbosity=0, **params) 
    # GPU 8.68s/trial
    xgb = XGBRegressor(tree_method='gpu_hist', objective ='reg:squarederror', seed = 0,verbosity=0,**params)   
    #xgb_model = xgb.fit(train_X, train_y)
    #acc = xgb_model.score(valid_X,valid_y)    
    # acc = cross_val_score(xgb, train_X, train_y).mean()              # CPU
    acc = cross_val_score(xgb, train_X, train_y, n_jobs=6).mean()  # GPU
    return -acc
# trials = Trials()
# Set initial values, start searching from the best point of GridSearchCV(), and default values
trials = generate_trials_to_calculate([{
                                        'max_bin':4,                                 # default 256
                                        'max_depth':5,                               # default 6
                                        'n_estimators':578,                          # default 100
                                        'learning_rate':0.05508679239402551,         # default 0.3
                                        'subsample':0.8429852720715357,              # default 1.0
                                        'colsample_bytree':0.8413894273173292,       # default 1.0
                                        'reg_alpha': 0.809791155072757,              # default 0.0
                                        'reg_lambda':1.4490119256389808,             # default 1.0
                                        'gamma':0.008478702584417519,                # default 0.0                                        
                                        'min_child_weight':24.524635200338793,       # default 1
                                        }])

t1 = time.time()  
# GPU: 1000trial [2:24:41,  8.68s/trial, best loss: -0.9080128034320879]                           
best_params = fmin(f_xgb, space_xgb, algo=tpe.suggest, max_evals=999, trials=trials)
t2 = time.time()
# 8681.310757875443
print("Time elapsed: ", t2-t1)

print('best:')
print(best_params)

  The Python build of XGBoost is also nearly twice as slow training on the GPU as on the CPU.


Python XGBoost training on the GPU

2. The LightGBM algorithm
  For installing the Python package, see the LightGBM python-package documentation. I upgraded to the latest 3.3.4; the install option below uses the default OpenCL API, which is what is tested here.

pip install lightgbm --install-option=--gpu

  Installing the CUDA-specific build on Windows requires a working Visual Studio build environment first, because pip invokes it to compile.

pip install lightgbm --install-option=--cuda

  For the Bayesian optimization, a few parameters are added to the lightgbm call: device='gpu', gpu_platform_id=1, gpu_device_id=0. Note that the valid range of max_bin differs between CPU and GPU; setting it incorrectly on the GPU causes an index-out-of-range error. The shared code is not repeated here.

# ML Models
from lightgbm import LGBMRegressor 
    
# --------------------------------------------------------------------------------------------------
# Auto search for better hyper parameters with hyperopt, only need to give a range
# Reference: https://www.pythonf.cn/read/6998
#            https://lightgbm.readthedocs.io/en/latest/Parameters.html
#            https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html#deal-with-over-fitting
#            https://lightgbm.readthedocs.io/en/latest/GPU-Performance.html
# Dealing with overfitting

#     Use fewer histogram bins (max_bin)
#     Use fewer leaf nodes (num_leaves)
#     Use min_child_samples (min_data_in_leaf) and min_child_weight (= min_sum_hessian_in_leaf)
#     Use bagging via subsample (bagging_fraction) and subsample_freq (= bagging_freq)
#     Use feature sub-sampling via colsample_bytree (feature_fraction)
#     Use more training data
#     Use regularization via reg_alpha (lambda_l1), reg_lambda (lambda_l2) and min_split_gain (min_gain_to_split)
#     Try max_depth to avoid overly deep trees
#     Try extra_trees
#     Try increasing path_smooth

# trials = generate_trials_to_calculate([{'max_bin':63-8,               # default CPU 255 GPU 63
#                                         'max_depth':5-3,              # default -1
#                                         'num_leaves':31-20,           # default 31
#                                         'min_child_samples':20-10,    # default 20
#                                         'subsample_freq':1-1,         # default 1
#                                         'n_estimators':6000-1000,     # default 10
#                                         'learning_rate':0.01,         # default 0.1
#                                         'subsample':0.75,             # default 1.0
#                                         'colsample_bytree':0.8,       # default 1.0
#                                         'lambda_l1':0.0,              # default 0.0
#                                         'lambda_l2':0.0,              # default 0.0
#                                         'min_child_weight':0.001,     # default 0.001
#                                         'min_split_gain':0.0,         # default 0.0
#                                         #'path_smooth':0.0            # default 0.0
#                                         }])
# Narrowing the parameter ranges makes the search much faster
space_lgbm = {
    'max_bin': hp.choice('max_bin', range(8, 128)),                  # CPU 50-501 GPU 8-128
    'max_depth': hp.choice('max_depth', range(3, 31)),    
    'num_leaves': hp.choice('num_leaves', range(10, 256)),
    'min_child_samples': hp.choice('min_child_samples', range(10, 51)), 
    'subsample_freq': hp.choice('subsample_freq', range(1, 6)),      
    'n_estimators': hp.choice('n_estimators', range(500, 6001)),
    'learning_rate': hp.uniform('learning_rate', 0.005, 0.15),    
    'subsample': hp.uniform('subsample', 0.5, 0.99),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 0.99),
    'reg_alpha': hp.uniform('reg_alpha', 0, 5),                       # lambda_l1
    'reg_lambda': hp.uniform('reg_lambda', 0, 3),                     # lambda_l2
    'min_child_weight': hp.uniform('min_child_weight',0.0001, 50),
    'min_split_gain': hp.uniform('min_split_gain',0.0, 1),
    #'path_smooth': hp.uniform('path_smooth',0.0, 3)
}
def f_lgbm(params):
    # Set extra_trees=True to avoid overfitting
    # lgbm = LGBMRegressor(seed=0,verbose=-1, **params)                 # CPU 4.96s/trial
    lgbm = LGBMRegressor(device='gpu', gpu_platform_id=1, gpu_device_id = 0, num_threads =3, **params)    # GPU 65.93s/trial
    #lgb_model = lgbm.fit(train_X, train_y)
    #acc = lgb_model.score(valid_X,valid_y)
    # acc = cross_val_score(lgbm, train_X, train_y).mean()             # CPU
    acc = cross_val_score(lgbm, train_X, train_y, n_jobs=6).mean()  # GPU
    return -acc
# trials = Trials()
# Set initial values, start searching from the best point of GridSearchCV(), and default values
trials = generate_trials_to_calculate([{'max_bin':63,                # default CPU 255 GPU 63
                                        'max_depth':17,                             # default -1
                                        'num_leaves':12,                            # default 31
                                        'min_child_samples':14,                     # default 20
                                        'subsample_freq':0,                         # default 1
                                        'n_estimators':2647,                        # default 10
                                        'learning_rate':0.0203187560767722,         # default 0.1
                                        'subsample':0.788703175392162,              # default 1.0
                                        'colsample_bytree':0.5203150334508861,      # default 1.0
                                        'reg_alpha': 0.988139501870491,             # default 0.0
                                        'reg_lambda':2.789779486137205,             # default 0.0
                                        'min_child_weight':21.813225361674828,      # default 0.001
                                        'min_split_gain':0.00039636685518264865,    # default 0.0
                                        #'path_smooth':0.0                          # default 0.0
                                        }])

t1 = time.time()  
# 1000trial [5:23:09, 19.39s/trial, best loss: -0.9082183160929432] CPU  
# 1000trial [1:22:39,  4.96s/trial, best loss: -0.9079837941918502] CPU 
# 1000trial [1:02:28,  3.75s/trial, best loss: -0.9080477825539048] CPU
best_params = fmin(f_lgbm, space_lgbm, algo=tpe.suggest, max_evals=9, trials=trials)
t2 = time.time()
print("Time elapsed: ", t2-t1)
print(best_params)
Python LightGBM training on the GPU

  Over 100 iterations the GPU run took 15.09 s/trial, with the CPU at full load and the Nvidia GPU above half load; memory use was low on this small data set, and there was some data-copy traffic between GPU and CPU, which seems normal. The result again: the CPU run, at 3.28 s/trial, was more than four times faster.

3. The CatBoost algorithm
  The default pip install already supports the GPU:

pip install catboost

  CatBoost only needs the parameter task_type='GPU' to use the GPU. When tuning with Bayesian optimization, note that a few parameters exist only in the CPU build and are not supported on the GPU: random_strength, subsample and rsm; also, the valid range of border_count differs between the GPU and CPU builds. The shared code is again not repeated; see above.

# ML Models
from catboost import CatBoostRegressor

# Auto search for better hyper parameters with hyperopt, only need to give a range
# Reference: https://github.com/talperetz/hyperspace/tree/master/GBDTs
#            https://catboost.ai/docs/concepts/python-reference_parameters-list.html#python-reference_parameters-list
#            https://catboost.ai/docs/concepts/parameter-tuning.html
#            https://affine.ai/catboost-a-new-game-of-machine-learning/
'''
https://catboost.ai/en/docs/concepts/speed-up-training
Speeding up the training
1. iterations, worked
2. learning_rate, worked
2. boosting_type, Ordered, Plain,  not worked
3. bootstrap_type, Bayesian, Bernoulli, MVS, Poisson, not worked
4. subsample, not worked
   This parameter can be used if one of the following bootstrap types is selected:
    Poisson   Bernoulli    MVS
5. one_hot_max_size, One-hot encoding
6. rsm, colsample_bylevel, Random subspace method
7.leaf_estimation_iterations, worked, set to 1.
   Try setting the value to "1" or "5" to speed up the training on datasets with a small number of features.
8. max_ctr_complexity, worked, 0 or 2 to speed up trainning.
   This parameter can affect the training time only if the dataset contains categorical features.
9. border_count, worked, set to less.
10.Reusing quantized datasets in Python, not applyable to cross_val_score()
11.Golden features. If the dataset has a feature, which is a strong predictor of the result, the
   pre-quantisation of this feature may decrease the information that the model can get from it. 
   It is recommended to use an increased number of borders (1024) for this feature.
   per_float_feature_quantization=['0:border_count=1024', '1:border_count=1024']
'''
# default values
# trials = generate_trials_to_calculate([{'border_count':254-150,       # default CPU 254 GPU 128
#                                         'iterations':1000-500,        # default 1000
#                                         'depth': 6-2,                 # default 6
#                                         'random_strength':1.0,        # default 1.0, CPU only
#                                         'learning_rate': 0.03,        # default 0.03
#                                         'subsample':0.8,              # default 0.8
#                                         'l2_leaf_reg': 3.0,           # default 3
#                                         'rsm':0.8,                    # default 1.0  CPU only
#                                         'fold_len_multiplier':2.0,    # default 2.0
#                                         'bagging_temperature':1.0     # default 1.0
#                                         }])

# Narrowing the parameter ranges makes the search much faster
space_cat = {'border_count': hp.choice('border_count', range(8, 128)), # CPU 150-351 GPU 8-128
             'iterations': hp.choice('iterations', range(500, 1501)),    
             'depth': hp.choice('depth', range(2, 10)),  
             #'random_strength': hp.uniform('random_strength', 1, 20),           
             'learning_rate': hp.uniform('learning_rate', 0.005, 0.15), 
             # 'subsample': hp.uniform('subsample', 0.5, 1),    
             'l2_leaf_reg': hp.uniform('l2_leaf_reg', 1, 100),
             # 'rsm': hp.uniform('rsm', 0.5, 0.99),                        # colsample_bylevel
             'fold_len_multiplier': hp.uniform('fold_len_multiplier', 1.0, 10.0),
             'bagging_temperature': hp.uniform('bagging_temperature', 0.0, 1.0) }

def f_cat(params):
    # cat = CatBoostRegressor(task_type='CPU', random_seed=0,
    cat = CatBoostRegressor(task_type='GPU', random_seed=0,                         
        # boosting_type='Plain', bootstrap_type = 'Bayesian', max_ctr_complexity=1,
        one_hot_max_size=3, 
        leaf_estimation_iterations=1,
        #per_float_feature_quantization=['3:border_count=1024', '4:border_count=1024'], # Golden features: lat, long
        verbose=False, **params)  # CPU 13.05s/trial
    acc = cross_val_score(cat, train_X, train_y,  n_jobs=3).mean()     
    return -acc

# trials = Trials()
# Set initial values, start searching from the best point of GridSearchCV(), and default values
trials = generate_trials_to_calculate([{'border_count':112,                         # default CPU 254 GPU 128
                                        'iterations':989,                           # default 1000
                                        'depth': 4,                                 # default 6
                                        #'random_strength':6.6489521372262645,       # default 1.0, CPU only
                                        'learning_rate': 0.07811835381238333,       # default 0.03
                                        #'subsample':0.9484820488113903,             # default 0.8
                                        'l2_leaf_reg': 8.070279328038293,           # default 3
                                        #'rsm':0.7188098046587024,                   # default 1.0  CPU only
                                        'fold_len_multiplier': 6.034216410528531,   # default 2.0
                                        'bagging_temperature':0.47787665340753926   # default 1.0
                                        }])

t1 = time.time()  
# 1000trial [50:28,  3.03s/trial, best loss: -0.905859099632395]                         
best_params = fmin(f_cat, space_cat, algo=tpe.suggest, max_evals=9, trials=trials)
t2 = time.time()
print("Time elapsed: ", t2-t1)

print('best:')
Python CatBoost training on the GPU

  On such a small data set, the GPU is at full load, the CPU above half, memory is maxed out, and a single trial takes about 350 seconds. That is not normal; some parameters are probably still set incorrectly, or the data set needs some other preprocessing.


Python CatBoost training on the CPU

  By contrast, the CPU run keeps the CPU at full load with modest memory use, and 7.46 s/trial is a decent speed.

III. The official LightGBM example
  The official LightGBM HIGGS classification example claims the GPU should give more than a 3x speedup. On my laptop the performance is roughly on par (the GPU never runs at full load: the GPU version of a GBDT algorithm only moves part of the computation onto the GPU, much of it still runs on the CPU, so typically the CPU is maxed out while the GPU is not). This points to a hardware limitation: the NVIDIA GeForce RTX 2060 Max-Q in the Lenovo Legion Y9000X is simply not powerful enough. The data set has 11 million rows and 28 variables, about 7.5 GB uncompressed, which is not small, so this test should be a useful reference. 9.9 million rows go to the training set and 1.1 million (10%) to the validation set. See the data download link and the reference material.
  The part shared by all the algorithms: loading the data.

# -*- coding: utf-8 -*-
"""
Created on Thu Sep 23 15:58:22 2021

@author: Jean
"""
'''
This is a classification problem to distinguish between a signal process 
which produces Higgs bosons and a background process which does not.
The first column is the class label (1 for signal, 0 for background), 
followed by the 28 features (21 low-level features then 7 high-level features): 
    lepton pT, lepton eta, lepton phi, missing energy magnitude, missing energy phi, 
    jet 1 pt, jet 1 eta, jet 1 phi, jet 1 b-tag, jet 2 pt, jet 2 eta, jet 2 phi, jet 2 b-tag,
    jet 3 pt, jet 3 eta, jet 3 phi, jet 3 b-tag, jet 4 pt, jet 4 eta, jet 4 phi, jet 4 b-tag, 
    m_jj, m_jjj, m_lv, m_jlv, m_bb, m_wbb, m_wwbb.  
'''
import pandas as pd
import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

t1 = time.time()
# Load dataset
df = pd.read_csv("D:/temp/data/HIGGS/HIGGS.csv",  encoding="utf-8", header=None)
# Target column changed to int
df.iloc[:,0] = df.iloc[:,0].astype(int)
t2 = time.time()
# 50.2845721244812
print(t2-t1)
df.shape
df.head(2)

t1 = time.time()
X = df.iloc[:,1:]
y = df.iloc[:,0]
t1 = time.time()
train_X, valid_X, train_y, valid_y = train_test_split(X,y, test_size = .10, random_state=2023)
t2 = time.time()
print(t2-t1)
# 5.838041305541992

  Compare CPU and GPU performance at different iteration counts.

# Set objective=regression to change to a regression problem
import lightgbm as lgb
# create dataset for lightgbm
dtrain = lgb.Dataset(train_X, train_y)
dvalid = lgb.Dataset(valid_X, valid_y, reference=dtrain)

t_cpu = []; t_nvidia = []; t_intel=[]
a_cpu = []; a_nvidia = []; a_intel=[]

t3 = time.time()
for num_iterations in [50,100,150,200]:

    # CPU --------------------------------------------------------------------------------
    params = {'objective':'binary',
              'num_iterations':num_iterations,
              'max_bin': 63,
              'num_leaves': 255,
              'learning_rate': 0.1,
              'tree_learner': 'serial',
              'task': 'train',
              'is_training_metric': 'false',
              'min_data_in_leaf': 1,
              'min_sum_hessian_in_leaf': 100,
              'ndcg_eval_at': [1, 3, 5, 10],
              'device': 'cpu'
              }
    
    t0 = time.time()
    gbm = lgb.train(params, train_set=dtrain, num_boost_round=10,
                    valid_sets=dvalid, feature_name='auto', categorical_feature='auto')
    t1 = time.time()
    t = round(t1-t0,2)
    t_cpu.append(t)
    # 50: 46.00722551345825 100: 138.17840361595154 150: 195.13047289848328
    print('cpu version elapse time: {}'.format(t1-t0))
    # predict
    y_pred = gbm.predict(valid_X, num_iteration=gbm.best_iteration)
    # AUC 0.8207000864285819 0.8304736031418638 0.8353184609588433
    auc_score = roc_auc_score(valid_y,y_pred)
    a_cpu.append(round(auc_score,4))
    print(auc_score)
    
    # NVIDIA GeForce RTX 2060 with Max-Q Design ------------------------------------------
    params = {'objective':'binary',
              'num_iterations':num_iterations,          
              'max_bin': 63,
              'num_leaves': 255,
              'learning_rate': 0.1,
              'tree_learner': 'serial',
              'task': 'train',
              'is_training_metric': 'false',
              'min_data_in_leaf': 1,
              'min_sum_hessian_in_leaf': 100,
              'ndcg_eval_at': [1, 3, 5, 10],
              'device': 'gpu',
              'gpu_platform_id': 1,
              'gpu_device_id': 0
    }
    
    t0 = time.time()
    gbm = lgb.train(params, train_set=dtrain, num_boost_round=10,
                    valid_sets=dvalid, feature_name='auto', categorical_feature='auto')
    t1 = time.time()
    t = round(t1-t0,2)
    t_nvidia.append(t)

    # 50: 54.93808197975159 100: 103.01487278938293  150: 146.14963364601135
    print('gpu version elapse time: {}'.format(t1-t0))
    # predict
    y_pred = gbm.predict(valid_X, num_iteration=gbm.best_iteration)
    # AUC 0.8207000821252757 0.8304736011279031 0.8353184579727403
    auc_score = roc_auc_score(valid_y,y_pred)
    a_nvidia.append(round(auc_score,4))
    print(auc_score)
    
    # Intel(R) UHD Graphics ------------------------------------------------------------
    params = {'objective':'binary',
              'num_iterations':num_iterations,          
              'max_bin': 63,
              'num_leaves': 255,
              'learning_rate': 0.1,
              'tree_learner': 'serial',
              'task': 'train',
              'is_training_metric': 'false',
              'min_data_in_leaf': 1,
              'min_sum_hessian_in_leaf': 100,
              'ndcg_eval_at': [1, 3, 5, 10],
              'device': 'gpu'
              }
    
    
    t0 = time.time()
    gbm = lgb.train(params, train_set=dtrain, num_boost_round=10,
                    valid_sets=dvalid, feature_name='auto', categorical_feature='auto')
    t1 = time.time()
    t = round(t1-t0,2)
    t_intel.append(t)
    
    # 62.83425784111023
    print('gpu version elapse time: {}'.format(t1-t0))
    # predict
    y_pred = gbm.predict(valid_X, num_iteration=gbm.best_iteration)
    # AUC 0.8207000820323747
    auc_score = roc_auc_score(valid_y,y_pred)
    a_intel.append(round(auc_score,4))    
    print(auc_score)

t4 = time.time()
print('Total elapse time: {}'.format(t4-t3))

  Plot the results.

perf_t = pd.DataFrame({"iterations":[50,100,150,200],"cpu":t_cpu,"Nvidia":t_nvidia,"Intel":t_intel})
perf_a = pd.DataFrame({"iterations":[50,100,150,200],"cpu":a_cpu,"Nvidia":a_nvidia,"Intel":a_intel})
perf_a["cpu"] =  perf_a["cpu"]*100
perf_a["Nvidia"] =  perf_a["Nvidia"]*100
perf_a["Intel"] =  perf_a["Intel"]*100

iterations =  [50,100,150,200]

import matplotlib.pyplot as plt 

plt.rcParams["font.sans-serif"]=["SimHei"] #设置字体
plt.rcParams["axes.unicode_minus"]=False #正常显示负号
fig,ax1 = plt.subplots()
ax2 = ax1.twinx()           # 做镜像处理
ax1.plot(iterations,t_cpu,'b', label="CPU")
ax1.plot(iterations,t_nvidia,'g', label="Nvidia")
ax1.plot(iterations,t_intel,'r', label="Intel")
ax1.legend(loc="upper left")

ax2.plot(iterations,perf_a["cpu"],"b--", label="CPU")
ax2.plot(iterations,perf_a["Nvidia"],"g--", label="Nvidia")
ax2.plot(iterations,perf_a["Intel"] ,"r--", label="Intel")
ax2.legend(loc="lower right") 
ax1.set_xlabel('迭代次数')    # x-axis label (iterations)
ax1.set_ylabel('时间(秒)')   # left y-axis label (time in seconds)
ax2.set_ylabel('AUC(%)')   # right y-axis label
plt.show()

  On the CPU, num_iterations=50 takes about 48 seconds. The first run also has to load the data and takes longer, so use the second run as the reference. (Note: the screenshots below are from runs in which all 11 million rows were used for training.)

[LightGBM] [Info] Number of positive: 5829123, number of negative: 5170877
[LightGBM] [Warning] Auto-choosing row-wise multi-threading, the overhead of testing was 0.191587 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1524
[LightGBM] [Info] Number of data points in the train set: 11000000, number of used features: 28
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.529920 -> initscore=0.119824
[LightGBM] [Info] Start training from score 0.119824
cpu version elapse time: 47.76462769508362

Python LightGBM training the HIGGS data set on the CPU

  On the Nvidia GPU, num_iterations=50 takes about 60 seconds; as the iteration count grows, say beyond 100, the GPU becomes faster than the CPU.

[LightGBM] [Info] Number of positive: 5829123, number of negative: 5170877
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1524
[LightGBM] [Info] Number of data points in the train set: 11000000, number of used features: 28
[LightGBM] [Info] Using requested OpenCL platform 1 device 0
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 2060 with Max-Q Design, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 64 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 28 dense feature groups (293.73 MB) transferred to GPU in 0.480376 secs. 0 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.529920 -> initscore=0.119824
[LightGBM] [Info] Start training from score 0.119824
gpu version elapse time: 59.27095651626587
Python LightGBM training the HIGGS data set on the Nvidia GPU; GPU load is not high

  The Intel GPU integrated in the CPU takes about 70 seconds:

[LightGBM] [Info] Number of positive: 5829123, number of negative: 5170877
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 1524
[LightGBM] [Info] Number of data points in the train set: 11000000, number of used features: 28
[LightGBM] [Info] Using GPU Device: Intel(R) UHD Graphics, Vendor: Intel(R) Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 64 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 28 dense feature groups (293.73 MB) transferred to GPU in 0.262090 secs. 0 sparse feature groups
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.529920 -> initscore=0.119824
[LightGBM] [Info] Start training from score 0.119824
gpu version elapse time: 70.56596112251282
Python LightGBM training the HIGGS data set on the integrated Intel GPU

LightGBM performance curves

  The figure shows that as num_iterations increases (50, 100, 150, 200), the Nvidia GPU (green) gradually overtakes the CPU (blue); at num_iterations=200 it is already 50% faster, and the solid red line shows that the integrated Intel GPU is by then also faster than the CPU. The dashed lines show the AUC improving as the iteration count grows; with identical parameters the accuracy is the same on every device, so the three dashed lines coincide. This confirms that the advantage of GPU computing can be demonstrated in this example: when higher accuracy is needed, more training iterations are required, and that is when the GPU accelerates training. In other words, the data set must be large and the iteration count high for the advantage to show.

IV. HIGGS classification with XGBoost
  The summary thread on XGBoost parameter tuning for the Kaggle HIGGS competition does not report a good AUC. I therefore used Bayesian optimization to train a model from scratch, starting from the default values; see the XGBoost parameter documentation. The data-loading code is not repeated here.

from xgboost import XGBClassifier
space_xgb = {
    'max_bin': hp.choice('max_bin', range(50, 512)),                  # CPU 50-501
    'max_depth': hp.choice('max_depth', range(3, 11)),    
    'n_estimators': hp.choice('n_estimators', range(100, 1001)),
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.5),    
    'subsample': hp.uniform('subsample', 0.5, 1),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
    'reg_alpha': hp.uniform('reg_alpha', 0, 5),                       # lambda_l1
    'reg_lambda': hp.uniform('reg_lambda', 0, 3),                     # lambda_l2
    'gamma': hp.uniform('gamma',0.0, 10),                             # min_split_loss, min_split_gain
    'min_child_weight': hp.uniform('min_child_weight',0.0001, 50),
}

def f_xgb(params):
    # xgb = XGBClassifier(objective ='binary:logistic', use_label_encoder=False, seed = 2023,\
    #                     nthread=-1, verbosity=0, **params)  # CPU 
    xgb = XGBClassifier(tree_method='gpu_hist', objective ='binary:logistic', use_label_encoder=False,\
                        nthread=-1, seed = 2023, verbosity=0,**params)  # GPU    
    xgb_model = xgb.fit(train_X, train_y)
    acc = xgb_model.score(valid_X,valid_y)    
    # acc = cross_val_score(xgb, train_X, train_y).mean()            # CPU
    # acc = cross_val_score(xgb, train_X, train_y, n_jobs=6).mean()  # GPU
    return -acc
# trials = Trials()
# Set initial values, start searching from the best point of GridSearchCV(), and default values
trials = generate_trials_to_calculate([{'max_bin':256-50,                      # default 256
                                        'max_depth':6-3,                       # default 6 [0,∞]
                                        'n_estimators':200-100,                # default 10
                                        'learning_rate':0.3,                   # default 0.3 [0,1]
                                        'subsample':1.0,                       # default 1.0 (0,1]
                                        'colsample_bytree':1.0,                # default 1.0 (0,1]
                                        'reg_alpha':0,                         # default 0.0
                                        'reg_lambda':1.0,                      # default 1.0
                                        'gamma':0,                             # default 0.0 [0,∞]                                      
                                        'min_child_weight':1                   # default 1 [0,∞]
                                        }])
t1 = time.time()  
best_params = fmin(f_xgb, space_xgb, algo=tpe.suggest, max_evals=9, trials=trials)
t2 = time.time()
print("Time elapsed: ", t2-t1)
XGBoost-HIGGS-GPU model training; CPU load is low, GPU sometimes above half

  First, evaluate with the parameters found after only 10 trials. Although the AUC of this parameter set is not high, the GPU speedup is already quite noticeable. The plotting code is not repeated either.


XGBoost-HIGGS-GPU already shows a clear speedup at low iteration counts
t_cpu = []; t_nvidia = []
a_cpu = []; a_nvidia = [] 

t3 = time.time()
# for num_iterations in [50]:
for num_iterations in [50,100,150,200]:
    # num_iterations =689
    # CPU --------------------------------------------------------------------------------
    params = {'objective':'binary:logistic',
              'max_bin':286,
              'n_estimators':num_iterations,
              'learning_rate': 0.3359071085471539,
              'max_depth': 4,
              'min_child_weight':6.4817419839798385,
              'colsample_bytree': 0.7209249276177966,
              'subsample': 0.5532140686826488,
              'reg_alpha': 2.2793074958255986,
              'reg_lambda': 2.4142485681002315,
              'gamma' :2.9324177415122934,
              'nthread': -1,              
              'tree_method': 'hist'
              }
           
    t0 = time.time()
    xgb = XGBClassifier(random_state =2023, use_label_encoder=False, **params)                         
    xgb_model = xgb.fit(train_X, train_y)
    t1 = time.time()
    t = round(t1-t0,2)
    t_cpu.append(t)
    
    print('cpu version elapse time: {}'.format(t1-t0))
    # predict
    y_pred = xgb_model.predict(valid_X)
    auc_score = roc_auc_score(valid_y,y_pred)
    a_cpu.append(round(auc_score,4))
    print(auc_score)
    
    # NVIDIA GeForce RTX 2060 with Max-Q Design ------------------------------------------
    params = {'objective':'binary:logistic',
              'max_bin':286,
              'n_estimators':num_iterations,
              'learning_rate': 0.3359071085471539,
              'max_depth': 4,
              'min_child_weight':6.4817419839798385,
              'colsample_bytree': 0.7209249276177966,
              'subsample': 0.5532140686826488,
              'reg_alpha': 2.2793074958255986,
              'reg_lambda': 2.4142485681002315,
              'gamma' :2.9324177415122934,
              'nthread': -1,              
              'tree_method': 'gpu_hist'
              }
    
    t0 = time.time()
    xgb = XGBClassifier(random_state =2023, use_label_encoder=False, **params)                         
    xgb_model = xgb.fit(train_X, train_y)
    t1 = time.time()
    t = round(t1-t0,2)
    t_nvidia.append(t)
    
    print('gpu version elapse time: {}'.format(t1-t0))
    # predict
    y_pred = xgb_model.predict(valid_X)
    auc_score = roc_auc_score(valid_y,y_pred)
    a_nvidia.append(round(auc_score,4))
    print(auc_score)    

t4 = time.time()
print('Total elapse time: {}'.format(t4-t3))

V. HIGGS classification with CatBoost
  There are already two Python examples of GPU acceleration above, so this time I wanted to try R.
  The Higgs Boson Machine Learning Challenge is a Kaggle competition that ended eight years ago, with 1,784 participating teams. There is an XGBoost implementation for it, with the related discussion in the accompanying thread. Since no good AUC was reported there, I decided to train a model from scratch in R with Bayesian optimization to see whether a better parameter combination could be found. With a data set this large, training is quite time-consuming, and the GPU speedup turns out to be fairly significant.
  The Bayesian optimization took over an hour just to generate the first 10 parameter sets (tidymodels' Gaussian process requires more initial parameter sets than there are parameters, whereas in Python a single one is enough; tidymodels also generates thousands of candidate combinations per iteration, which takes longer but needs fewer iterations, while Python generates only one candidate per iteration and therefore needs many more iterations: a notable difference between the two Gaussian process implementations). Training a model with each parameter set takes roughly 5 to 10 minutes. To speed things up, I used bootstraps() resampling done only once instead of 5-fold cross-validation, and ran only 10 iterations.
  However, 5 of those runs failed to train because the required memory could not be allocated. Searching online shows this is a known issue when XGBoost runs on the GPU: after a training run finishes it does not proactively release the memory it occupied (see the linked thread). Python has some workarounds, see the Memory Usage section of 《XGBoost GPU Support》 and the thread 《How do I free all memory on GPU in XGBoost》. So I switched to CatBoost for this test.
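  In R, a comparable workaround is to drop the previous fit and force a garbage collection before training again, so that the finalizer of the xgboost handle gets a chance to release the GPU memory it holds. The sketch below is only an illustration of the idea: the object names are placeholders, and whether the memory is actually returned depends on xgboost's finalizer. It is the same idea as the force_gc hack applied to tune_bayes() further below.

# Sketch of the workaround: free the previous booster before training again.
# xgb_wflow_final and higgs_train are placeholder names for a finalized XGBoost
# workflow and the training split; whether the GPU memory is really returned
# depends on xgboost's finalizer, so treat this as a mitigation, not a guarantee.
fit1 <- fit(xgb_wflow_final, higgs_train)
rm(fit1)          # drop the R reference to the fitted booster
invisible(gc())   # run finalizers so the GPU allocation can be released
fit2 <- fit(xgb_wflow_final, higgs_train)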
  CatBoost's k-fold cross-validation and grid search do support GPU parallelism: the figure below shows a 2-process, 5-fold cross-validation on Windows with the doParallel package, with both the CPU and the GPU under fairly high load. During Bayesian optimization, however, enabling parallelism produced an error saying parsnip could not find a CatBoost implementation of the boost_tree algorithm; presumably the pkgs argument did not take effect, so the worker processes never loaded the treesnip and catboost packages.

CatBoost-HIGGS-GPU: 2-process parallel 5-fold cross-validation

  Following that thread, preloading the required packages into every worker process of the parallel cluster solves the problem.

# For large-data-set training on the GPU, verify a minimal 2-way parallel setup.
# All operating systems: register parallel processing and load the required packages in every worker.
# # https://github.com/tidymodels/tune/issues/157 
library(doParallel)
cl <- makePSOCKcluster(2)   # parallel::detectCores()
registerDoParallel(cl)
# Show how many workers there are
foreach::getDoParWorkers()
# Load the required packages in each worker
clusterEvalQ(cl, 
             {library(tidymodels)
              library(treesnip)
              library(catboost)
})

  CatBoost-HIGGS-GPU: Bayesian optimization over 6 parameters with 2 parallel processes.


CatBoost-HIGGS-GPU Bayesian optimization with parallel processing

  After running overnight, however, the main process was ultimately unable to read the results back from the worker processes over the connection. It was also much slower than a single process: generating the 10 initial points, i.e. the first 10 training runs, finishes in a little over an hour in a single process, but took all night with two processes (probably because with this many parameters there is not enough memory; see below).

Forced gc():  0.1  Seconds.

>  Generating a set of 10 initial parameter results
Error in unserialize(socklist[[n]]) : error reading from connection
> t2<-proc.time()
> cat(t2-t1)
17.41 18.23 20864.81 NA NA

  I then tried LightGBM again and found that it can run two worker processes in parallel on the GPU: tuning just one parameter with Bayesian optimization, for example tree_depth in the figure below, runs to completion without problems. With more tuned parameters, say 7, it runs out of memory and produces heavy swap I/O that effectively hangs the program. LightGBM keeps the CPU and memory maxed out, while the GPU peaks at about 30% with 2-way parallelism. Reportedly a single training process needs roughly 3 times the memory footprint of the data, so two processes need more than 6 times; the 24 GB in my laptop is still not enough for multi-process parallel training on the HIGGS data set.


LightGBM-HIGGS-GPU: two parallel workers do work; with a single tuned parameter the GPU load is not high either

  LightGBM Bayesian optimization of a single parameter, with 2 grid-search initial values and 10 iterations: 5 of the iterations found better parameters, a clear improvement.

Forced gc():  0.2  Seconds.

-- Iteration 4 -----------------------------------------------------------------------------------------------------

i Current best:     roc_auc=0.7369 (@iter 2)
i Gaussian process model
! Gaussian process model: X should be in range (0, 1)
√ Gaussian process model
i Generating 5000 candidates
i Predicted candidates
i learn_rate=0.046
i Estimating performance
√ Estimating performance
<3 Newest results:  roc_auc=0.8011

Forced gc():  0.3  Seconds.

-- Iteration 5 -----------------------------------------------------------------------------------------------------

i Current best:     roc_auc=0.8011 (@iter 4)
i Gaussian process model
√ Gaussian process model
i Generating 5000 candidates
i Predicted candidates
i learn_rate=0.0999
i Estimating performance
√ Estimating performance
<3 Newest results:  roc_auc=0.8116

  Next I tried CatBoost on the GPU with 2 parallel processes tuning a single parameter; it ran to completion, averaging about 530 seconds per training round. The tests show that for large data sets you need plenty of memory and should not tune too many parameters at once. :)

-- Iteration 10 ----------------------------------------------------------------------------------------------------

i Current best:     roc_auc=0.8295 (@iter 3)
i Gaussian process model
√ Gaussian process model
i Generating 5000 candidates
i Predicted candidates
i learn_rate=0.0999
i Estimating performance
√ Estimating performance
(x) Newest results: roc_auc=0.8293
> t2<-proc.time()
> cat(t2-t1)
105.55 5711.49 10611.05 NA NA

  For a better AUC, several parameters have to be tuned, so in the end I ran the 6-parameter CatBoost GPU optimization in a single process; only the first 50 iterations were actually run here, which already took a whole night.

library(tidymodels)
library(kableExtra)
library(tidyr)
# This fork of treesnip supports classification.
# remotes::install_github("Glemhel/treesnip", INSTALL_opts = c("--no-multiarch"))
# library(catsnip)
library(treesnip)
library(data.table)
# For GPU training on large data sets a minimal 2-way parallel setup was verified:
# one tuned parameter runs through, several parameters do not (not enough memory).
# All operating systems: register parallel processing and load the required packages in every worker.
# # https://github.com/tidymodels/tune/issues/157 
# library(doParallel)
# cl <- makePSOCKcluster(2)   # parallel::detectCores()
# registerDoParallel(cl)
# # Show how many workers there are
# foreach::getDoParWorkers()
# # Load the required packages in each worker
# clusterEvalQ(cl, 
#              {library(tidymodels)
#                library(treesnip)
#                library(catboost)
#              })

# Prefer the tidymodels functions when names conflict
tidymodels_prefer()

# ----------------------------------------------------------------------------------------
# Load the HIGGS data
t1<-proc.time()
higgs<- fread("D:/temp/data/HIGGS/HIGGS.csv", header=FALSE, encoding="UTF-8")
higgs$V1<-as.factor(higgs$V1)
t2<-proc.time()
cat(t2-t1)
# 17.41 16.25 34.02 NA NA
names(higgs)

# Split into training and test sets
t1<-proc.time()
set.seed(2023)
higgs_split <- initial_split(higgs, prop = 0.90)
higgs_train <- training(higgs_split)
higgs_test  <-  testing(higgs_split)
t2<-proc.time()
cat(t2-t1)
# 6.63 0.55 7.2 NA NA

# ----------------------------------------------------------------------------------------
# Bayesian optimization
# Recipe parameters, main model parameters and engine-specific parameters can all be tuned.

# Define the recipe: classification formula and preprocessing
higgs_rec<-
  recipe(V1 ~ ., data = higgs_train) %>%
# Standardize the numeric predictors
  step_normalize(all_numeric_predictors())

# Define the model: CatBoost; declare the parameters to tune. task_type = 'GPU' enables the GPU.
cat_spec <-
  boost_tree(mtry=tune(), tree_depth = tune(), trees = tune(), learn_rate = tune(), min_n = tune()) %>%
  set_engine('catboost', subsample = tune("subsample"), task_type = 'GPU') %>%    # 
  set_mode('classification')

# Define the workflow
cat_wflow <- 
  workflow() %>% 
  add_model(cat_spec) %>% 
  add_recipe(higgs_rec)

# The ranges of all parameters are fully determined.
cat_param <- cat_wflow %>%
  extract_parameter_set_dials() %>%
  update(learn_rate = threshold(c(0.01,0.5))) %>%
  update(trees = trees(c(500,1000))) %>%
  update(tree_depth = tree_depth(c(5,15))) %>%
  update(mtry = mtry(c(3,6))) %>%
  update(subsample = threshold(c(0.5,1)))

# Inspect the parameter ranges; all are determined
cat_param

# Inspect the parameter ranges; all are determined
cat_param %>% extract_parameter_dials("trees")
cat_param %>% extract_parameter_dials("min_n")
cat_param %>% extract_parameter_dials("tree_depth")
cat_param %>% extract_parameter_dials("learn_rate")
cat_param %>% extract_parameter_dials("mtry")
cat_param %>% extract_parameter_dials("subsample")  

# k-fold cross-validation takes too long on a data set this large; use bootstraps() resampling, done once, to speed up training.
#higgs_folds <- vfold_cv(higgs_train, v = 5)
higgs_folds <- bootstraps(higgs_train, times = 1)

gc()

# Run the Bayesian optimization
ctrl <- control_bayes(verbose = TRUE, no_improve = Inf)
# set.seed(2023)
t1<-proc.time()
cat_res_bo <-
  cat_wflow %>%
  tune_bayes(
    resamples = higgs_folds,
    # metrics = metric_set(recall, precision, f_meas, accuracy, kap,roc_auc, sens, spec)
    metrics = metric_set(accuracy, roc_auc, precision),  
    initial = 10,
    param_info = cat_param,
    iter = 100,
    control = ctrl,
    # tune_bayes() has been hacked to add a force_gc argument, so memory can optionally be reclaimed before each training run during the iterations.
    force_gc = TRUE
  )
t2<-proc.time()
cat(t2-t1)
# 9435.55 269.2 9085.21 NA NA

# Plot the Bayesian optimization progress
autoplot(cat_res_bo, type = "performance", metric="roc_auc")

# Show the best models for each metric
show_best(cat_res_bo, metric="precision")
show_best(cat_res_bo, metric="accuracy")
show_best(cat_res_bo, metric="roc_auc")


# Select the model with the best roc_auc
select_best(cat_res_bo, metric="roc_auc")
# Read the best tuning result directly
cat_param_best<- select_best(cat_res_bo, metric="roc_auc")

# Fill the best parameters back into the workflow
cat_wflow_bo <-
  cat_wflow %>%
  finalize_workflow(cat_param_best)
cat_wflow_bo

# Train the model on the full training set with the best parameters
t1<-proc.time()
# Reclaim memory first, otherwise training may fail because the required memory cannot be allocated;
# adding the same memory-reclaiming step inside the Bayesian optimization above should avoid those failures.
gc()
cat_fit_bo<- cat_wflow_bo %>% fit(higgs_train)
t2<-proc.time()
cat(t2-t1)
#  647.2 183.99 507.11 NA NA

# Test set
# Predictions
# https://parsnip.tidymodels.org/reference/predict.model_fit.html
# https://yardstick.tidymodels.org/reference/roc_auc.html
t1<-proc.time()
higgs_test_bo <- predict(cat_fit_bo, new_data = higgs_test %>% select(-V1), type = "prob") %>%
  bind_cols(predict(cat_fit_bo, new_data = higgs_test %>% select(-V1), type = "class"))
t2<-proc.time()
cat(t2-t1)
#  67.8 0.39 5.83 NA NA

# Bind in the true values
higgs_test_bo <- bind_cols(higgs_test_bo, higgs_test %>% select(V1))
higgs_metrics <- metric_set(precision, accuracy)
higgs_metrics(higgs_test_bo, truth = V1, estimate = .pred_class)
roc_auc(
  higgs_test_bo,
  truth = V1,
  estimate=.pred_0,
  options = list(smooth = TRUE)
)
> show_best(cat_res_bo, metric="precision")
# A tibble: 5 x 13
   mtry trees min_n tree_depth learn_rate subsample .metric   .estimator  mean     n std_err .config .iter
  <int> <int> <int>      <int>      <dbl>     <dbl> <chr>     <chr>      <dbl> <int>   <dbl> <chr>   <int>
1     4   993    25         15      0.198     0.602 precision binary     0.703     1      NA Iter9       9
2     5   918    14         15      0.294     0.627 precision binary     0.702     1      NA Iter1       1
3     5   931     5         15      0.175     0.703 precision binary     0.701     1      NA Iter12     12
4     4   981    18         15      0.153     0.522 precision binary     0.701     1      NA Iter10     10
5     3   988    21         15      0.153     0.945 precision binary     0.701     1      NA Iter4       4
> show_best(cat_res_bo, metric="accuracy")
# A tibble: 5 x 13
   mtry trees min_n tree_depth learn_rate subsample .metric  .estimator  mean     n std_err .config .iter
  <int> <int> <int>      <int>      <dbl>     <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>   <int>
1     4   997    16         15      0.118     0.685 accuracy binary     0.754     1      NA Iter18     18
2     3   985    36         15      0.124     0.583 accuracy binary     0.753     1      NA Iter17     17
3     5   966     8         15      0.128     0.869 accuracy binary     0.753     1      NA Iter11     11
4     5   989     3         15      0.117     0.816 accuracy binary     0.753     1      NA Iter15     15
5     4   981    18         15      0.153     0.522 accuracy binary     0.753     1      NA Iter10     10
> show_best(cat_res_bo, metric="roc_auc")
# A tibble: 5 x 13
   mtry trees min_n tree_depth learn_rate subsample .metric .estimator  mean     n std_err .config .iter
  <int> <int> <int>      <int>      <dbl>     <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>   <int>
1     6   986    24         14      0.117     0.558 roc_auc binary     0.849     1      NA Iter16     16
2     5   957    11         15      0.109     0.839 roc_auc binary     0.849     1      NA Iter14     14
3     4   997    16         15      0.118     0.685 roc_auc binary     0.849     1      NA Iter18     18
4     5   989     3         15      0.117     0.816 roc_auc binary     0.849     1      NA Iter15     15
5     3   985    36         15      0.124     0.583 roc_auc binary     0.848     1      NA Iter17     17
> select_best(cat_res_bo, metric="roc_auc")
# A tibble: 1 x 7
   mtry trees min_n tree_depth learn_rate subsample .config
  <int> <int> <int>      <int>      <dbl>     <dbl> <chr>  
1     6   986    24         14      0.117     0.558 Iter16 
> higgs_metrics(higgs_test_bo, truth = V1, estimate = .pred_class)
# A tibble: 2 x 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 precision binary         0.698
2 accuracy  binary         0.755
> roc_auc(
+   higgs_test_bo,
+   truth = V1,
+   estimate=.pred_0,
+   options = list(smooth = TRUE)
+ )
# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.854

  At this point neither the CPU nor the GPU is heavily loaded, both under 30%; the hardware's capacity is still not being fully used.

CatBoost-HIGGS-GPU single-process training

  Note that this kind of parallelism is multi-process: the processes do not share data and communicate with the parent process through the doParallel package, so there are as many copies of the data as there are processes, which is why the memory requirement is high. In CPU mode (below), CatBoost supports multi-threading within the same parent process, where the threads share one copy of the data and the memory overhead is small. The later tests show, however, that the nthread parameter has no effect in CatBoost's GPU mode; it apparently does not support multi-threading, so the multi-process workers cannot be accelerated further with threads. To make full use of the GPU's processing power, the only option is more memory.
  Next, compare the training and prediction times on the GPU and the CPU. CatBoost's fit() supports multi-threaded parallelism, exposed through the nthread parameter in the treesnip wrapper. Compared at the same thread count, the GPU is without question much faster (and in fact it is), since an extra GPU is helping with the computation. My laptop has 8 physical cores (16 logical cores), and 12 threads keep the CPU at full load. Testing showed that nthread has no effect on the GPU, and fit() has no option to spawn multiple processes the way the Bayesian optimization does; it runs in a single process. See the reference material.

library(tidymodels)
library(kableExtra)
library(tidyr)
# This fork of treesnip supports classification.
# remotes::install_github("Glemhel/treesnip", INSTALL_opts = c("--no-multiarch"))
# library(catsnip)
library(treesnip)
library(data.table)
# For large-data-set training on the GPU, a minimal 2-way parallel setup was verified.
# All operating systems: register parallel processing and load the required packages in every worker.
# # https://github.com/tidymodels/tune/issues/157 
# https://curso-r.github.io/treesnip/articles/parallel-processing.html
library(doParallel)
# cl <- makePSOCKcluster(parallel::detectCores()) 
cl <- makePSOCKcluster(12)   # CPU fit
# cl <- makePSOCKcluster(2)   # GPU fit
registerDoParallel(cl)
# Show how many workers there are
foreach::getDoParWorkers()
# Load the required packages in each worker
clusterEvalQ(cl,
             {library(tidymodels)
               library(treesnip)
               library(catboost)
             })

# Prefer the tidymodels functions when names conflict
tidymodels_prefer()

# ----------------------------------------------------------------------------------------
# Load the HIGGS data
t1<-proc.time()
higgs<- fread("D:/temp/data/HIGGS/HIGGS.csv", header=FALSE, encoding="UTF-8")
higgs$V1<-as.factor(higgs$V1)
t2<-proc.time()
cat(t2-t1)
# 17.41 16.25 34.02 NA NA
names(higgs)

# Split into training and test sets
t1<-proc.time()
set.seed(2023)
higgs_split <- initial_split(higgs, prop = 0.90)
higgs_train <- training(higgs_split)
higgs_test  <-  testing(higgs_split)
t2<-proc.time()
cat(t2-t1)
# 6.63 0.55 7.2 NA NA

# Compare CPU and GPU performance with one good set of parameters ----------------------------------------
# https://curso-r.github.io/treesnip/articles/parallel-processing.html
# Define the recipe: classification formula and preprocessing
higgs_rec<-
  recipe(V1 ~ ., data = higgs_train) %>%
# Standardize the numeric predictors
  step_normalize(all_numeric_predictors())

# Define the model: CatBoost; declare the parameters to tune
cat_spec <-
  boost_tree(mtry=tune(), tree_depth = tune(), trees = tune(), learn_rate = tune(), min_n = tune()) %>%
  #set_engine('catboost', subsample = tune("subsample"), task_type = 'GPU', nthread = 2) %>%  # GPU
  set_engine('catboost', subsample = tune("subsample"), task_type = 'CPU', nthread = 12) %>%  # CPU
  set_mode('classification')

# Define the workflow
cat_wflow <- 
  workflow() %>% 
  add_model(cat_spec) %>% 
  add_recipe(higgs_rec)

# Construct the best parameter set
cat_param_best<-
  tibble(
    mtry = 6,
    trees = 986,
    min_n = 24,
    tree_depth = 14,
    learn_rate = 0.117 ,
    subsample =  0.558
  )

# Fill the best parameters back into the workflow
cat_wflow_bo <-
  cat_wflow %>%
  finalize_workflow(cat_param_best)

# Train the model on the full training set with the best parameters
t1<-proc.time()
# fit() itself does not run in parallel; both runs compared here are single-process.
cat_fit_bo<- cat_wflow_bo %>% fit(higgs_train)
t2<-proc.time()

cat(t2-t1)
# GPU single thread 650.52 183.77 511.34 NA NA
# CPU 12 threads 65252.86 2728.51 6305.28 NA NA
# CPU 12 threads 15980.84 672.2 1944.65 NA NA
# Generate predictions on the test set and performance data

t1<-proc.time()
higgs_test_bo <- predict(cat_fit_bo, new_data = higgs_test %>% select(-V1), type = "prob")
t2<-proc.time()
cat(t2-t1)
#GPU 36.42 0.07 2.69 NA NA
#CPU 40.11 0.9 3.35 NA NA

higgs_test_bo <- bind_cols(higgs_test_bo, higgs_test %>% select(V1))
roc_auc(
  higgs_test_bo,
  truth = V1,
  estimate=.pred_0,
  options = list(smooth = TRUE)
)
#GPU 85.4
#CPU 85.4
CatBoost-HIGGS-CPU 12-thread training; the CPU is at full load.

  Because this parameter set trains for 986 boosting rounds, it is slow: 6305.28 seconds with 12 CPU threads versus 511.34 seconds on the GPU, roughly 12 times faster, so the advantage of GPU computing is fully demonstrated here (and without the hardware even being heavily loaded). Prediction times are about the same; the main speedup is in training. The larger the data set and the more boosting iterations, the more pronounced the GPU's advantage.

Reference: 《When to Choose CatBoost Over XGBoost or LightGBM [Practical Guide]》, which covers the main parameters for controlling overfitting and training speed, and comparisons between the algorithms.
