Python Machine Learning: Model Training and Prediction with Scikit-learn

1. Environment Setup and Scikit-learn Basics

1.1 Installation and Imports

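Scikit-learn (sklearn for short) is one of the most popular machine learning libraries for Python and is built on NumPy and SciPy. We recommend Python 3.8 or later; install the required packages with: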

pip install scikit-learn numpy pandas matplotlib
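To confirm that the installation worked, a quick check along the following lines can be run. This is a minimal sketch that only reports which interpreter and package versions are in use; the variable names are illustrative.

```python
import sys

import numpy
import sklearn

# Report the interpreter and library versions that are actually in use.
python_version = sys.version_info[:3]
print("Python:", ".".join(str(part) for part in python_version))
print("scikit-learn:", sklearn.__version__)
print("NumPy:", numpy.__version__)

# The article recommends Python 3.8 or newer.
meets_recommendation = python_version >= (3, 8)
print("Meets the Python 3.8+ recommendation:", meets_recommendation)
```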

A typical project imports the following core modules:

# Data preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Model selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Evaluation metrics
from sklearn.metrics import accuracy_score

1.2 Loading and Exploring Data

Scikit-learn ships with several classic datasets that make it easy to validate a model quickly. Take the Iris dataset as an example:

from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()
X = iris.data    # feature matrix, shape (150, 4)
y = iris.target  # target vector, shape (150,)

# Inspect the feature names and class names
print(iris.feature_names)  # Output: ['sepal length (cm)', ...]
print(iris.target_names)   # Output: ['setosa' 'versicolor' 'virginica']
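For a slightly deeper look at the data, the arrays can be wrapped in a pandas DataFrame. The sketch below is one possible way to do this; the `df` and `species` names are illustrative choices, not part of the original walkthrough.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Put the features in a DataFrame so the column names travel with the data.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer labels (0, 1, 2) to their species names.
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(df.shape)                      # (150, 5): four features plus the label
print(df.describe())                 # per-feature summary statistics
print(df["species"].value_counts())  # 50 samples for each species
```

Summary statistics like these show that the four features have different ranges, which is the motivation for scaling them before training a distance-based model.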

2. Implementing the Machine Learning Workflow

2.1 Data Preprocessing

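Feature standardization is an important step for scale-sensitive models such as SVMs. StandardScaler rescales each feature to zero mean and unit variance: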

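# Split the dataset first (70% training, 30% test)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the scaler on the training split only, so that no test-set
# statistics leak into preprocessing, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)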
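As a quick sanity check on what standardization does, the self-contained sketch below fits the scaler on the training split only and prints the resulting per-feature statistics. The `_raw` and `_std` suffixes are naming choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Learn the mean and standard deviation from the training split only.
scaler = StandardScaler().fit(X_train_raw)
X_train_std = scaler.transform(X_train_raw)
X_test_std = scaler.transform(X_test_raw)

# On the training split every feature now has mean 0 and unit variance.
print(np.round(X_train_std.mean(axis=0), 6))
print(np.round(X_train_std.std(axis=0), 6))

# The test split is close to, but not exactly, standardized, because it
# reuses the statistics learned from the training split.
print(np.round(X_test_std.mean(axis=0), 3))
```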

2.2 Model Training

We use a Support Vector Machine (SVM) to demonstrate how a classifier is trained:

model = SVC(kernel='rbf', C=1.0, gamma='scale')

model.fit(X_train, y_train)

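Key parameters:

kernel: the kernel function type ('linear', 'rbf', 'poly', or 'sigmoid')

C: the regularization parameter (default 1.0); smaller values mean stronger regularization

gamma: the kernel coefficient for the 'rbf', 'poly', and 'sigmoid' kernels ('scale', 'auto', or a float)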
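To get a feel for how the kernel parameter affects results, the kernels can be compared with cross-validation. The following is a rough sketch that wraps scaling and the classifier in a pipeline; the `kernel_scores` dictionary is just an illustrative container, and the scores will vary with the dataset and settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

kernel_scores = {}
for kernel in ("linear", "rbf", "poly"):
    # Scaling sits inside the pipeline, so it is refitted on every CV fold.
    clf = make_pipeline(
        StandardScaler(),
        SVC(kernel=kernel, C=1.0, gamma="scale"),
    )
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold accuracy
    kernel_scores[kernel] = scores.mean()
    print(f"{kernel:>6}: mean accuracy = {scores.mean():.3f}")
```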

2.3 Model Evaluation

Evaluate classification performance with the accuracy score and a classification report:

from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))  # e.g. 0.977; the exact value depends on the split
print(classification_report(y_test, y_pred))
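A confusion matrix complements these metrics by showing which classes get mistaken for which. The sketch below is self-contained and assumes the same 70/30 split and SVC settings as above, with scaling wrapped in a pipeline for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# The diagonal holds the correct predictions, so accuracy is trace / total.
accuracy_from_cm = cm.trace() / cm.sum()
print(accuracy_from_cm, accuracy_score(y_test, y_pred))
```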

3. Advanced Techniques

3.1 Hyperparameter Tuning

Use grid search to find the best parameter combination automatically:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10],
    'gamma': ['scale', 'auto'],
    'kernel': ['rbf', 'linear']
}

# 5-fold cross-validation over all 12 parameter combinations
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
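Beyond the single best combination, it can help to look at how the other candidates ranked and to score the refitted model on held-out data. The sketch below is one way to do that; it uses unscaled features and a stratified split purely to keep the example short, which are assumptions of this illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

param_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", "auto"],
    "kernel": ["rbf", "linear"],
}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# cv_results_ holds one row per parameter combination.
results = pd.DataFrame(grid_search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score"]
top = results.sort_values("rank_test_score")[columns].head(3)
print(top.to_string(index=False))

# best_score_ is the mean cross-validated accuracy of the best candidate;
# score() evaluates the refitted best estimator on data it has never seen.
print("Best CV accuracy:", round(grid_search.best_score_, 3))
print("Held-out accuracy:", round(grid_search.score(X_test, y_test), 3))
```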

3.2 Feature Engineering in Practice

Use a Pipeline to chain preprocessing and the model into one complete workflow:

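from sklearn.pipeline import make_pipeline

# Chain scaling and classification so that both are fitted together
pipeline = make_pipeline(
    StandardScaler(),
    SVC()
)

# X_train was already standardized in section 2.1; in a new project,
# pass the raw features and let the pipeline do the scaling
pipeline.fit(X_train, y_train)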
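A pipeline can also be passed to GridSearchCV, in which case the scaler is refitted inside every cross-validation fold. The sketch below illustrates this under the assumption of a plain 70/30 split on the raw Iris features; parameter names take the form `<step name>__<parameter>`, where make_pipeline derives the step name from the lowercased class name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipeline = make_pipeline(StandardScaler(), SVC())
print(list(pipeline.named_steps))  # ['standardscaler', 'svc']

# Address the SVC step's parameters through the "svc__" prefix.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["rbf", "linear"],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy of the refitted best pipeline
```

Because the raw features go in and the pipeline handles scaling, the same object can later be used for prediction without a separate preprocessing step.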

4. Regression Case Study

4.1 Boston Housing Price Prediction

The following example demonstrates random forest regression:

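from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestRegressor

# Download the dataset from OpenML as plain NumPy arrays
boston = fetch_openml(name='boston', version=1, as_frame=False)
X, y = boston.data, boston.target

# Split the housing data (train_test_split was imported in section 1.1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Compute the R² score on the test set
print(rf.score(X_test, y_test))  # typical output: 0.85-0.92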
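R² alone does not say how large the errors are in the target's own units. The sketch below adds the mean absolute error and the feature importances. Note that it swaps in scikit-learn's bundled diabetes dataset (`load_diabetes`) as a stand-in so that it runs without a download; this is an assumption of the example, and the same calls apply to the housing data.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Stand-in dataset that ships with scikit-learn (no network access needed).
data = load_diabetes()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# r2_score matches what rf.score() reports; MAE is in the target's units.
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R^2: {r2:.3f}  MAE: {mae:.2f}")

# Impurity-based importances sum to 1 across all features.
ranked = sorted(
    zip(data.feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked[:3]:
    print(f"{name}: {importance:.3f}")
```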

5. Model Deployment and Production Use

Use joblib to persist the trained model:

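from joblib import dump, load

# Save the trained classifier to disk
dump(model, 'iris_classifier.joblib')

# Load the model and make a prediction
loaded_model = load('iris_classifier.joblib')

# The model was trained on standardized features, so new samples
# must go through the same fitted scaler before prediction
new_pred = loaded_model.predict(scaler.transform([[5.1, 3.5, 1.4, 0.2]]))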
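One way to avoid handling the scaler separately at prediction time is to persist a whole pipeline. The following sketch assumes a pipeline of StandardScaler and SVC trained on the full Iris data, and writes to a temporary directory; the file name `iris_pipeline.joblib` is hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# Scaling and classification are stored together in one object.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "iris_pipeline.joblib"  # hypothetical file name
    dump(pipeline, model_path)
    restored = load(model_path)

# Raw measurements in cm can be passed directly; the pipeline scales them.
sample = np.array([[5.1, 3.5, 1.4, 0.2]])
restored_pred = restored.predict(sample)
print(iris.target_names[restored_pred])

# The restored pipeline should reproduce the original predictions exactly.
same_predictions = np.array_equal(restored.predict(X), pipeline.predict(X))
print(same_predictions)
```

Files written by joblib should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.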
