1. Overview
Ray Tune is a tool for automatic hyperparameter tuning of AI models. Its main advantages:
- Easy definition of the parameter space, covering both model hyperparameters and training-specific control parameters
- Parallel training with distributed support; server resources such as CPUs and GPUs can be configured
- Model selection: multiple metrics can be combined to filter for the models you need
- Support for the major AI training frameworks, e.g. PyTorch, Keras, Scikit-learn, XGBoost, Transformers, TensorFlow
- Visualization of training results, so the model-selection process can be inspected
This article gives an introductory tour of Tune and then trains an XGBoost model with k-fold cross-validation.
2. Tests
2.1. QuickStart
Reference: https://docs.ray.io/en/latest/tune/index.html
from ray import train, tune
def objective(config): # ①
score = config["a"] ** 2 + config["b"]
return {"score": score}
search_space = { # ②
"a": tune.grid_search([0.001, 0.01, 0.1, 1.0]),
"b": tune.choice([1, 2, 3]),
}
tuner = tune.Tuner(objective, param_space=search_space) # ③
results = tuner.fit()
print(results.get_best_result(metric="score", mode="min").config)
- ① Define an objective function.
- ② Define a search space.
- ③ Start a Tune run and print the best result.
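A small variation on the quickstart, as a sketch: with tune.grid_search every grid value becomes its own trial, and num_samples in tune.TuneConfig repeats the whole grid with fresh random draws of b. The snippet below reuses objective and search_space from above.
from ray import tune

tuner = tune.Tuner(
    objective,  # objective and search_space as defined above
    param_space=search_space,
    # Repeat the 4-point grid for "a" twice -> 8 trials, each drawing "b" anew.
    tune_config=tune.TuneConfig(num_samples=2),
)
results = tuner.fit()
print(results.get_best_result(metric="score", mode="min").config)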
2.2. async_hyperband_example
Reference: https://docs.ray.io/en/latest/tune/examples/includes/async_hyperband_example.html
#!/usr/bin/env python
import argparse
import time
from ray import train, tune
from ray.tune.schedulers import AsyncHyperBandScheduler
def evaluation_fn(step, width, height):
time.sleep(0.1)
return (0.1 + width * step / 100) ** (-1) + height * 0.1
def easy_objective(config):
# Hyperparameters
width, height = config["width"], config["height"]
for step in range(config["steps"]):
# Iterative training function - can be an arbitrary training procedure
intermediate_score = evaluation_fn(step, width, height)
# Feed the score back to Tune.
train.report({"iterations": step, "mean_loss": intermediate_score})
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--smoke-test", action="store_true", help="Finish quickly for testing"
)
args, _ = parser.parse_known_args()
# AsyncHyperBand enables aggressive early stopping of bad trials.
scheduler = AsyncHyperBandScheduler(grace_period=5, max_t=100)
# 'training_iteration' is incremented every time `trainable.step` is called
stopping_criteria = {"training_iteration": 1 if args.smoke_test else 9999}
tuner = tune.Tuner(
tune.with_resources(easy_objective, {"cpu": 1, "gpu": 0}),
run_config=train.RunConfig(
name="asynchyperband_test",
stop=stopping_criteria,
verbose=1,
),
tune_config=tune.TuneConfig(
metric="mean_loss", mode="min", scheduler=scheduler, num_samples=20
),
param_space={ # Hyperparameter space
"steps": 100,
"width": tune.uniform(10, 100),
"height": tune.uniform(0, 100),
},
)
results = tuner.fit()
print("Best hyperparameters found were: ", results.get_best_result().config)
- Output
Trial status: 20 TERMINATED
Current time: 2024-11-18 09:46:02. Total running time: 13s
Logical resource usage: 1.0/104 CPUs, 0/6 GPUs (0.0/1.0 accelerator_type:G)
Current best trial: d6052_00005 with mean_loss=1.6583436014787982 and params={'steps': 100, 'width': 30.1545328711941, 'height': 16.24957950094885}
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status width height loss iter total time (s) iterations │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ easy_objective_d6052_00000 TERMINATED 91.3495 70.1548 7.28186 5 0.504913 4 │
│ easy_objective_d6052_00001 TERMINATED 70.6207 20.7465 2.08894 100 10.0803 99 │
│ easy_objective_d6052_00002 TERMINATED 14.3436 56.1975 7.10399 5 0.504261 4 │
│ easy_objective_d6052_00003 TERMINATED 35.7087 91.5464 9.80894 5 0.504937 4 │
│ easy_objective_d6052_00004 TERMINATED 48.1253 52.0651 5.70033 5 0.504423 4 │
│ easy_objective_d6052_00005 TERMINATED 30.1545 16.2496 1.65834 100 10.075 99 │
│ easy_objective_d6052_00006 TERMINATED 21.9189 21.8973 2.2356 100 10.084 99 │
│ easy_objective_d6052_00007 TERMINATED 60.62 44.5165 4.84772 5 0.504675 4 │
│ easy_objective_d6052_00008 TERMINATED 58.0047 57.4326 6.15645 5 0.504766 4 │
│ easy_objective_d6052_00009 TERMINATED 43.5265 21.1586 2.23533 20 2.01825 19 │
│ easy_objective_d6052_00010 TERMINATED 53.6623 96.5493 10.1001 5 0.504696 4 │
│ easy_objective_d6052_00011 TERMINATED 78.2983 47.806 5.09001 5 0.504365 4 │
│ easy_objective_d6052_00012 TERMINATED 22.5627 36.9524 4.69274 5 0.50439 4 │
│ easy_objective_d6052_00013 TERMINATED 30.4316 22.9093 3.05008 5 0.504323 4 │
│ easy_objective_d6052_00014 TERMINATED 49.3256 33.6124 3.84363 5 0.504481 4 │
│ easy_objective_d6052_00015 TERMINATED 87.1597 51.4711 5.42595 5 0.504122 4 │
│ easy_objective_d6052_00016 TERMINATED 97.0634 73.4299 7.59408 5 0.504775 4 │
│ easy_objective_d6052_00017 TERMINATED 97.0855 48.7491 5.12595 5 0.504584 4 │
│ easy_objective_d6052_00018 TERMINATED 48.7951 57.7981 6.26718 5 0.504418 4 │
│ easy_objective_d6052_00019 TERMINATED 91.0838 43.6023 4.62737 5 0.504944 4 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Best hyperparameters found were: {'steps': 100, 'width': 30.1545328711941, 'height': 16.24957950094885}
- TensorBoard
tensorboard --logdir /tmp/ray/session_2024-11-18_09-45-40_738364_1022831/artifacts/2024-11-18_09-45-48/asynchyperband_test/driver_artifacts
3. XGBoost
Reference: https://docs.ray.io/en/latest/tune/examples/tune-xgboost.html
3.1. Code
- train.py
from typing import List, Dict
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from ray import train, tune
from ray.train import ScalingConfig, RunConfig
from ray.tune.schedulers import ASHAScheduler
from ray.tune.integration.xgboost import TuneReportCheckpointCallback
import wandb
from wandb.integration.xgboost import WandbCallback
from src.train_with_kfold import train_xgboost
# wandb.init(project="kpuu-model", name="xgboost")
search_space = dict(
cv=dict(
n_splits=tune.choice([5,10]),
),
xgb=dict(
objective="binary:logistic",
eval_metric="auc",
# eval_metric=["logloss", "error"],
subsample=tune.uniform(0.8, 1.0),
scale_pos_weight=2,
reg_lambda=10,
reg_alpha=0.1,
n_estimators=tune.randint(300, 500),
min_child_weight=tune.choice([1, 2]),
max_depth=tune.randint(5, 9),
learning_rate=tune.loguniform(1e-2, 1e-1),
colsample_bytree=0.8,
early_stopping_rounds=10,
# callbacks=[WandbCallback(log_model=False)],
# callbacks=[TuneReportCheckpointCallback(frequency=1)],
)
)
scheduler = ASHAScheduler(
max_t=10, grace_period=5, reduction_factor=2 # 10 training iterations
)
def tune_xgboost(smoke_test=False):
tuner = tune.Tuner(
train_xgboost,
# tune.with_resources(train_xgboost, {"cpu": 80, "gpu": 0}),  # inefficient process scheduling, see notes below
tune_config=tune.TuneConfig(
metric="mean_auc",
mode="max",
scheduler=scheduler,
num_samples=1 if smoke_test else 100,
max_concurrent_trials=40,
),
param_space=search_space,
run_config=RunConfig(
name="kpuu-model",
verbose=1,
log_to_file="./tune.log",  # did not take effect
)
)
results = tuner.fit()
# print(f"{results=}")
return results
if __name__ == '__main__':
results = tune_xgboost(smoke_test=False)
best_trial = results.get_best_result()
print(f"Best trial: {best_trial.path}")
print(f"With hyperparameters: {best_trial.config}")
print(f"All metrics: {best_trial.metrics}")
- src/train_with_kfold.py
import numpy as np
from pathlib import Path
import xgboost as xgb
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import roc_auc_score
from ray import train
import joblib
from .dataset import ds
def train_xgboost(params, save_dir=None, report=True):
# TODO: a dataclass might be cleaner here
cv_params = params.get("cv", {})
xgb_params = params.get("xgb", {})
n_splits = cv_params.get("n_splits", 5)
submodels = []
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
cv_scores = []
for train_index, test_index in kf.split(ds.features):
X_train, X_test = ds.features[train_index], ds.features[test_index]
y_train, y_test = ds.labels[train_index], ds.labels[test_index]
model: xgb.XGBClassifier = xgb.XGBClassifier(**xgb_params)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
submodels.append(model)
# Evaluate on the held-out fold
# preds = model.predict(dtest)
preds = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, preds)
cv_scores.append(auc)
mean_auc = np.mean(cv_scores)
if report:
train.report({"mean_auc": mean_auc, "done": True})
if save_dir and Path(save_dir).is_dir():
save_models(submodels, save_dir)
print(f"{mean_auc=}, {cv_scores=}")
def save_models(models, model_dir):
model_dir = Path(model_dir)
if not model_dir.is_dir():
print(f"{model_dir=} not exists or not a dir")
return
for i, model in enumerate(models):
name = model_dir / f"fold_{i}_model.pkl"
# model.save_model(name)
joblib.dump(model, name)
Code notes
- Use train.report to report the metrics of the k-fold run once training finishes, so that Ray Tune can pick the best model.
- Each training run is one Trial. Avoid saving checkpoints if possible: they take up a lot of disk space and also slow training down.
- Once the best parameters are found, you can retrain a standalone model with them and obtain the same model, provided the seeds are fixed, both the data-splitting seed and the model-training seed (see the sketch after this list).
- Putting model hyperparameters and training control parameters together in the search space is flexible, but do not make the space too large, otherwise training takes very long and the best parameters may still not be found.
- The exact role of ASHAScheduler is still not entirely clear to me, especially the notion of iterations. Note that training_iteration is incremented on every train.report call, and train_xgboost reports only once at the end, so ASHA has little opportunity to stop trials early here.
- For limiting resources, use the max_concurrent_trials parameter; in my tests tune.with_resources was very inefficient at process scheduling and is not recommended.
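A minimal sketch of retraining with the best configuration, reusing train_xgboost from src/train_with_kfold.py; the save_dir path and the import of tune_xgboost from train.py are assumptions for illustration.
from pathlib import Path

from src.train_with_kfold import train_xgboost
from train import tune_xgboost  # assumes train.py is importable from the working directory

# Run the search, then retrain once with the winning config.
results = tune_xgboost(smoke_test=False)
best_config = results.get_best_result(metric="mean_auc", mode="max").config

# KFold uses random_state=42 and the XGBoost params come straight from best_config,
# so this run should reproduce the submodels evaluated during the search
# (assuming the XGBoost seed is also fixed).
save_dir = Path("models/best")  # example path
save_dir.mkdir(parents=True, exist_ok=True)
train_xgboost(best_config, save_dir=save_dir, report=False)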
3.2. TensorBoard
To visualize your results with TensorBoard, run:
tensorboard --logdir /tmp/ray/session_2024-11-14_08-31-54_388227_2726263/artifacts/2024-11-14_08-31-57/kpuu-model/driver_artifacts
3.3. FAQ
Reference: https://docs.ray.io/en/latest/tune/faq.html
- How can Ray Tune support cross-validation with XGBoost? Using the callback approach, callbacks=[TuneReportCheckpointCallback(frequency=1)], raises an error saying the specified metric cannot be found (see below). The likely cause is that after each submodel finishes training, only metrics such as 'validation_0-auc' are reported. Hence the manual train.report approach is used instead.
ValueError: Trial returned a result which did not include the specified metric(s) `mean_auc` that `tune.TuneConfig()` expects. Make sure your calls to `tune.report()` include the metric, or set the TUNE_DISABLE_STRICT_METRIC_CHECKING environment variable to 1. Result: OrderedDict([('validation_0-auc', 0.6905393457117595), ('timestamp', 1731452298), ('checkpoint_dir_name', 'checkpoint_000000'), ('should_checkpoint', True), ('done', False), ('training_iteration', 1), ('trial_id', '98313_00000'), ('date', '2024-11-13_06-58-18'), ('time_this_iter_s', 0.08122944831848145), ('time_total_s', 0.08122944831848145), ('pid', 3466022), ('hostname', 'yfzy-NF5468M5'), ('node_ip', '101.237.34.11'), ('time_since_restore', 0.08122944831848145), ('iterations_since_restore', 1), ('config/objective', 'binary:logistic'), ('config/eval_metric', 'auc'), ('config/subsample', 0.9502511731447341), ('config/scale_pos_weight', 2), ('config/reg_lambda', 10), ('config/reg_alpha', 0.1), ('config/n_estimators', 400), ('config/min_child_weight', 3), ('config/max_depth', 1), ('config/learning_rate', 0.016444553380492718), ('config/colsample_bytree', 0.8), ('config/early_stopping_rounds', 10), ('config/callbacks', [<ray.train.xgboost._xgboost_utils.RayTrainReportCallback object at 0x7fbcc8254130>])])
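As the error message above suggests, strict metric checking can also be disabled via an environment variable; a minimal sketch (set it before calling tuner.fit()), although reporting mean_auc manually remains the cleaner fix:
import os

# Tune will then accept results that do not contain the metric named in TuneConfig.
os.environ["TUNE_DISABLE_STRICT_METRIC_CHECKING"] = "1"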
4. Summary
There are many other tools for automatic hyperparameter tuning of machine-learning models; see the relevant GitHub topics, for example:
- https://github.com/optuna/optuna
- https://github.com/hyperopt/hyperopt
- https://github.com/abhishekkrthakur/autoxgb
This article only tried Ray Tune; the other tools will have to wait until there is time to try them.
For visualizing the training process you can also try wandb, which is simple to integrate and only takes a few lines of code (see the sketch below).
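A minimal sketch of logging Tune trials to Weights & Biases through Ray's built-in WandbLoggerCallback, assuming wandb is installed and logged in; the project name reuses kpuu-model from the code above and is only an example.
from ray import train, tune
from ray.air.integrations.wandb import WandbLoggerCallback

def objective(config):
    # Report a dummy metric so something shows up in the wandb dashboard.
    train.report({"score": config["a"] ** 2})

tuner = tune.Tuner(
    objective,
    param_space={"a": tune.uniform(0, 1)},
    run_config=train.RunConfig(
        name="wandb-demo",  # example run name
        # Logs each trial's config and reported metrics to the wandb project.
        callbacks=[WandbLoggerCallback(project="kpuu-model")],
    ),
)
tuner.fit()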