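Before learning PyCaret, set up Jupyter Notebook first; the code in this post runs in Jupyter.
Installing PyCaret (the default is the CPU version)
See the PyCaret GitHub repository for the latest installation and usage instructions.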
# create a conda environment
conda create --name pycaret3 python=3.9
# activate conda environment
conda activate pycaret3
# install pycaret with all optional dependencies
# (no space before [full]; the quotes keep the shell from expanding the brackets)
pip install "pycaret[full]"
# create a notebook kernel
python -m ipykernel install --user --name pycaret3 --display-name "pycaret3"
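After installation it helps to confirm that the new kernel actually sees the package. The following is a minimal sketch using only the standard library; `installed_version` is a helper name made up for this example, not part of PyCaret.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "pycaret") -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("pycaret is not installed in this environment")
else:
    print(f"pycaret {version} is installed")
```

Run it in a notebook cell with the pycaret3 kernel selected; if it reports that the package is missing, the notebook is probably using a different environment.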
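If you have a GPU, consider installing PyCaret with GPU support.
The initial steps are exactly the same as for the CPU version above. The following has to be installed manually:
pip3 uninstall lightgbm -y
# downgrade pip first, otherwise the --install-option flag is not available
pip3 install pip==22.2.1
pip3 install lightgbm --install-option=--gpu --install-option="--opencl-include-dir=~/CUDA11.8/include/" --install-option="--opencl-library=~/CUDA11.8/lib64/libOpenCL.so"
In the command above, ~/CUDA11.8/ is where CUDA is installed on my machine; change it to your own CUDA installation path.
cuML is also required. Pick the version that matches your setup; see Installation Guide - RAPIDS Docs.
cuML is part of RAPIDS.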
PyCaret is a wrapper around several machine learning libraries and frameworks,
including scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, and Ray.
Supervised machine learning
Classification
- Binary classification
- Multiclass classification
pycaret.classification
Official API reference for all classification functions
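Regression
pycaret.regression
Official API reference for all regression functions

Unsupervised machine learning
Anomaly Detection
pycaret.anomaly
Official API reference for anomaly detection

Clustering
pycaret.clustering
Official API reference for clustering

Time Series Forecasting
pycaret.time_series
Official API reference for time series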

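Basic steps of a PyCaret analysis
- Read the data with get_data
- Initialize the experiment with setup, importing the module for the type of analysis
- Train and select models
- Visualize the best model
- Predict on the test set
- Predict on new data
- Save the model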
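The steps above can be sketched end to end with the classification module. This is a minimal sketch, not a definitive recipe: `run_basic_workflow` is a name made up for this example, the 10% split that stands in for "new data" is my own choice, and the imports sit inside the function so the definition loads even where PyCaret is not installed.

```python
def run_basic_workflow(dataset_name="hepatitis", target="Class", model_name="best_model"):
    """Walk through the basic PyCaret steps and return the reloaded pipeline and predictions."""
    from pycaret.datasets import get_data
    from pycaret.classification import (
        setup, compare_models, plot_model, predict_model, save_model, load_model,
    )

    # 1. Read the data (get_data downloads the sample dataset).
    data = get_data(dataset_name)
    # Set aside 10% of the rows to play the role of new, unseen data.
    unseen = data.sample(frac=0.1, random_state=42)
    train = data.drop(unseen.index)

    # 2. Initialize the experiment; session_id fixes the random seed.
    setup(data=train, target=target, session_id=123)

    # 3. Train the available models and keep the best one.
    best = compare_models()

    # 4. Visualize the best model.
    plot_model(best, plot="confusion_matrix")

    # 5. Predict on the hold-out test set created by setup.
    holdout_predictions = predict_model(best)

    # 6. Predict on new data that has no target column.
    new_predictions = predict_model(best, data=unseen.drop(columns=[target]))

    # 7. Save the model (writes model_name + ".pkl") and load it back.
    save_model(best, model_name)
    return load_model(model_name), holdout_predictions, new_predictions
```

Calling `run_basic_workflow()` in a notebook cell runs all seven steps on the hepatitis sample data used later in this post.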
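Data preprocessing
Original data preprocessing documentation
Missing values usually appear as blanks or NaN.
Calling setup initializes the experiment and imputes missing values automatically.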
# load dataset
from pycaret.datasets import get_data
hepatitis = get_data('hepatitis')

# init setup
from pycaret.classification import *
clf1 = setup(data = hepatitis, target = 'Class')


The lower the MAPE, the closer the imputed values are to the true values.
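To make the MAPE comparison concrete, here is a small sketch in plain Python. MAPE is the mean of |actual - predicted| / |actual|, expressed as a percentage. The numbers below are made up for illustration and do not come from any real imputation run.

```python
def mape(actual, predicted):
    """Mean absolute percentage error in percent; actual values must be non-zero."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical true values that were masked, and two hypothetical sets of imputed values.
true_values = [10.0, 20.0, 40.0]
imputed_a = [12.0, 18.0, 40.0]
imputed_b = [10.5, 20.0, 38.0]

print(f"MAPE of imputation A: {mape(true_values, imputed_a):.2f}%")  # 10.00%
print(f"MAPE of imputation B: {mape(true_values, imputed_b):.2f}%")  # 3.33%
```

Imputation B has the lower MAPE, so its filled-in values are closer to the true ones.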
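Default imputation of missing data
Numeric values
numeric_imputation: int, float, or string, default: mean. Accepted values:
- drop: drop rows that contain missing values
- mean: fill with the mean
- median: fill with the median
- mode: fill with the most frequent value
- knn: fill using the k-nearest neighbors method
- int or float: fill with the given number
Categorical values
categorical_imputation: string, default: mode. Accepted values:
- drop: drop rows that contain missing values
- mode: fill with the most frequent value
- str: fill with the given string
imputation_type sets the type of imputation.
The default is simple.
Accepted values: simple, iterative, None
If None, no imputation is performed.
Models used for iterative imputation
numeric_iterative_imputer: str or sklearn estimator, default: lightgbm
categorical_iterative_imputer: str or sklearn estimator, default: lightgbm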
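The imputation parameters above are passed to setup as keyword arguments. The sketch below shows one way to group them; the dictionary and function names are made up for this example, and the chosen values (median, mode) are only illustrations of the accepted options.

```python
# Simple imputation: one statistic per column.
SIMPLE_IMPUTATION = {
    "imputation_type": "simple",
    "numeric_imputation": "median",
    "categorical_imputation": "mode",
}

# Iterative imputation: a model predicts each missing value from the other columns.
ITERATIVE_IMPUTATION = {
    "imputation_type": "iterative",
    "numeric_iterative_imputer": "lightgbm",
    "categorical_iterative_imputer": "lightgbm",
}


def setup_with_imputation(data, target, imputation_kwargs=None):
    """Call pycaret.classification.setup with the chosen imputation options."""
    # Imported here so the option dictionaries can be inspected without pycaret.
    from pycaret.classification import setup

    kwargs = dict(SIMPLE_IMPUTATION if imputation_kwargs is None else imputation_kwargs)
    return setup(data=data, target=target, session_id=123, **kwargs)
```

For the hepatitis data loaded earlier, `setup_with_imputation(hepatitis, "Class", ITERATIVE_IMPUTATION)` would switch from the default simple imputation to the iterative one.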
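Data types include numeric, categorical, and datetime; PyCaret detects them automatically.
If the detected type does not match what you expect, you can specify the correct type manually.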
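Overriding the detected types is done through setup. A minimal sketch, assuming the PyCaret 3 arguments `numeric_features`, `categorical_features`, and `date_features`; the wrapper name `setup_with_explicit_types` is made up for this example.

```python
def setup_with_explicit_types(data, target, numeric=None, categorical=None, dates=None):
    """Initialize a classification experiment with manually specified column types.

    Each argument is a list of column names, or None to keep automatic detection.
    """
    from pycaret.classification import setup

    return setup(
        data=data,
        target=target,
        numeric_features=numeric,
        categorical_features=categorical,
        date_features=dates,
        session_id=123,
    )
```

A typical case is an integer-coded category that PyCaret reads as numeric: `setup_with_explicit_types(df, "label", categorical=["region_code"])`, where `df`, `label`, and `region_code` are placeholder names.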
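One-hot encoding: for categorical features in the dataset that contain label values.
Ordinal encoding: for categorical features with an intrinsic natural order, for example (low, medium, high).
Cardinal encoding: for categorical features with many levels (high cardinality).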
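The difference between one-hot and ordinal encoding is easiest to see on a tiny example. The sketch below is plain Python written for illustration; it is not how PyCaret implements encoding internally. In PyCaret 3 the related setup arguments are, as far as I know, `ordinal_features` and `max_encoding_ohe`.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows


def ordinal_encode(values, order):
    """Map each value to its position in `order`, preserving the natural ranking."""
    rank = {level: index for index, level in enumerate(order)}
    return [rank[value] for value in values]


colors = ["red", "green", "red"]
categories, rows = one_hot_encode(colors)
print(categories)  # ['green', 'red']
print(rows)        # [[0, 1], [1, 0], [0, 1]]

levels = ["low", "high", "medium"]
print(ordinal_encode(levels, order=["low", "medium", "high"]))  # [0, 2, 1]
```

One-hot encoding adds one column per category, which is why features with many levels are usually handled with a different encoding.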
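Target imbalance: when the target classes in the training data are unevenly distributed, the fix_imbalance parameter of setup can be used to correct it.
Removing outliers: the remove_outliers parameter of setup.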
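Both options are switches on setup. A minimal sketch, assuming the PyCaret 3 arguments `fix_imbalance`, `remove_outliers`, and `outliers_threshold`; the names `CLEANING_OPTIONS` and `setup_with_cleaning` are made up for this example.

```python
# Illustrative values; outliers_threshold is the share of rows treated as outliers.
CLEANING_OPTIONS = {
    "fix_imbalance": True,
    "remove_outliers": True,
    "outliers_threshold": 0.05,
}


def setup_with_cleaning(data, target, **overrides):
    """Initialize a classification experiment with resampling and outlier removal."""
    from pycaret.classification import setup

    # Caller-supplied keyword arguments take precedence over the defaults above.
    options = {**CLEANING_OPTIONS, **overrides}
    return setup(data=data, target=target, session_id=123, **options)
```

For example, `setup_with_cleaning(hepatitis, "Class", remove_outliers=False)` keeps the class rebalancing but leaves outliers in place.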
Models available in PyCaret 3
Classification models
| Abbreviation | Full name |
|---|---|
| lr | Logistic Regression |
| knn | K Neighbors Classifier |
| nb | Naive Bayes |
| dt | Decision Tree Classifier |
| svm | SVM - Linear Kernel |
| rbfsvm | SVM - Radial Kernel |
| gpc | Gaussian Process Classifier |
| mlp | MLP Classifier |
| ridge | Ridge Classifier |
| rf | Random Forest Classifier |
| qda | Quadratic Discriminant Analysis |
| ada | Ada Boost Classifier |
| gbc | Gradient Boosting Classifier |
| lda | Linear Discriminant Analysis |
| et | Extra Trees Classifier |
| xgboost | Extreme Gradient Boosting |
| lightgbm | Light Gradient Boosting Machine |
| catboost | CatBoost Classifier |
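The abbreviations in the table are the IDs passed to create_model. The sketch below checks an ID against the table returned by `models()` before training; `train_by_id` is a helper name made up for this example.

```python
def train_by_id(data, target, model_id="rf"):
    """Train one classifier chosen by its abbreviation and return the fitted model."""
    from pycaret.classification import setup, models, create_model

    setup(data=data, target=target, session_id=123)

    # models() returns a DataFrame indexed by model ID for the installed libraries.
    available = models()
    if model_id not in available.index:
        raise ValueError(f"unknown or unavailable model ID: {model_id!r}")

    # Trains the model with cross-validation and prints the score grid.
    return create_model(model_id)
```

With the hepatitis data from earlier, `train_by_id(hepatitis, "Class", "lightgbm")` would train a Light Gradient Boosting Machine classifier.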
Regression models
| Abbreviation | Full name |
|---|---|
| lr | Linear Regression |
| lasso | Lasso Regression |
| ridge | Ridge Regression |
| en | Elastic Net |
| lar | Least Angle Regression |
| llar | Lasso Least Angle Regression |
| omp | Orthogonal Matching Pursuit |
| br | Bayesian Ridge |
| ard | Automatic Relevance Determination |
| par | Passive Aggressive Regressor |
| ransac | Random Sample Consensus |
| tr | TheilSen Regressor |
| huber | Huber Regressor |
| kr | Kernel Ridge |
| svm | Support Vector Regression |
| knn | K Neighbors Regressor |
| dt | Decision Tree Regressor |
| rf | Random Forest Regressor |
| et | Extra Trees Regressor |
| ada | AdaBoost Regressor |
| gbr | Gradient Boosting Regressor |
| mlp | MLP Regressor |
| xgboost | Extreme Gradient Boosting |
| lightgbm | Light Gradient Boosting Machine |
| catboost | CatBoost |
Time series models
| Abbreviation | Full name |
|---|---|
| naive | Naive Forecaster |
| grand_means | Grand Means Forecaster |
| snaive | Seasonal Naive Forecaster (disabled when seasonal_period = 1) |
| polytrend | Polynomial Trend Forecaster |
| arima | ARIMA family of models (ARIMA, SARIMA, SARIMAX) |
| auto_arima | Auto ARIMA |
| exp_smooth | Exponential Smoothing |
| stlf | STL Forecaster |
| croston | Croston Forecaster |
| ets | ETS |
| theta | Theta Forecaster |
| tbats | TBATS |
| bats | BATS |
| prophet | Prophet Forecaster |
| lr_cds_dt | Linear w/ Cond. Deseasonalize & Detrending |
| en_cds_dt | Elastic Net w/ Cond. Deseasonalize & Detrending |
| ridge_cds_dt | Ridge w/ Cond. Deseasonalize & Detrending |
| lasso_cds_dt | Lasso w/ Cond. Deseasonalize & Detrending |
| llar_cds_dt | Lasso Least Angular Regressor w/ Cond. Deseasonalize & Detrending |
| br_cds_dt | Bayesian Ridge w/ Cond. Deseasonalize & Detrending |
| huber_cds_dt | Huber w/ Cond. Deseasonalize & Detrending |
| omp_cds_dt | Orthogonal Matching Pursuit w/ Cond. Deseasonalize & Detrending |
| knn_cds_dt | K Neighbors w/ Cond. Deseasonalize & Detrending |
| dt_cds_dt | Decision Tree w/ Cond. Deseasonalize & Detrending |
| rf_cds_dt | Random Forest w/ Cond. Deseasonalize & Detrending |
| et_cds_dt | Extra Trees w/ Cond. Deseasonalize & Detrending |
| gbr_cds_dt | Gradient Boosting w/ Cond. Deseasonalize & Detrending |
| ada_cds_dt | AdaBoost w/ Cond. Deseasonalize & Detrending |
| lightgbm_cds_dt | Light Gradient Boosting w/ Cond. Deseasonalize & Detrending |
| catboost_cds_dt | CatBoost w/ Cond. Deseasonalize & Detrending |
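Time series models are created the same way, with a forecast horizon given to setup. A minimal sketch using the functional API of pycaret.time_series; `forecast_with_model` is a name made up for this example, and the horizon of 12 is an arbitrary choice.

```python
def forecast_with_model(series, horizon=12, model_id="naive"):
    """Fit one forecaster by its abbreviation and return forecasts for the horizon."""
    from pycaret.time_series import setup, create_model, predict_model

    # fh is the forecast horizon: the number of future periods to predict.
    setup(data=series, fh=horizon, session_id=123)
    model = create_model(model_id)
    # Without an explicit fh, predict_model uses the horizon given to setup.
    return predict_model(model)
```

A quick way to try it is the airline sample data: `forecast_with_model(get_data("airline"), model_id="ets")`, with `get_data` imported from pycaret.datasets.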
Clustering models
| Abbreviation | Full name |
|---|---|
| kmeans | K-Means Clustering |
| ap | Affinity Propagation |
| meanshift | Mean shift Clustering |
| sc | Spectral Clustering |
| hclust | Agglomerative Clustering |
| dbscan | Density-Based Spatial Clustering |
| optics | OPTICS Clustering |
| birch | Birch Clustering |
| kmodes | K-Modes Clustering |
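Clustering has no target column, so setup takes only the data. A minimal sketch; `cluster_rows` is a name made up for this example and the cluster count of 4 is arbitrary.

```python
def cluster_rows(data, model_id="kmeans", num_clusters=4):
    """Fit a clustering model and return the data with cluster labels attached."""
    from pycaret.clustering import setup, create_model, assign_model

    setup(data=data, session_id=123)
    model = create_model(model_id, num_clusters=num_clusters)
    # assign_model returns the training data with a cluster label for each row.
    return assign_model(model)
```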
Anomaly detection models
| Abbreviation | Full name |
|---|---|
| abod | Angle-based Outlier Detection |
| cluster | Clustering-Based Local Outlier |
| cof | Connectivity-Based Outlier Factor |
| histogram | Histogram-based Outlier Detection |
| iforest | Isolation Forest |
| knn | k-Nearest Neighbors Detector |
| lof | Local Outlier Factor |
| svm | One-class SVM detector |
| pca | Principal Component Analysis |
| mcd | Minimum Covariance Determinant |
| sod | Subspace Outlier Detection |
| sos | Stochastic Outlier Selection |
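An anomaly detector from the table can be fitted and applied in a few lines. A minimal sketch; `flag_anomalies` is a name made up for this example, and `fraction` is the assumed share of outliers in the data.

```python
def flag_anomalies(data, model_id="iforest", fraction=0.05):
    """Fit an anomaly detector and return the data with anomaly flags and scores."""
    from pycaret.anomaly import setup, create_model, assign_model

    setup(data=data, session_id=123)
    model = create_model(model_id, fraction=fraction)
    # assign_model adds an anomaly flag and an anomaly score to each row.
    return assign_model(model)
```

Raising `fraction` flags more rows as anomalies, so it should reflect how contaminated you expect the data to be.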
