- Naive Bayes
Finally, let's look at the impact of the lambda parameter on the naive Bayes model. This parameter controls additive smoothing, which handles the case where a particular combination of class and feature value never occurs together in the data.
- As before, we first create a convenient helper function to train the model at different lambda levels:
def trainNBWithParams(input: RDD[LabeledPoint], lambda: Double) = {
  val nb = new NaiveBayes
  nb.setLambda(lambda)
  nb.run(input)
}
val nbResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
  val model = trainNBWithParams(dataNB, param)
  val scoreAndLabels = dataNB.map { point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (s"$param lambda", metrics.areaUnderROC)
}
nbResults.foreach { case (param, auc) =>
  println(f"$param, AUC = ${auc * 100}%2.2f%%")
}
0.001 lambda, AUC = 60.51%
0.01 lambda, AUC = 60.51%
0.1 lambda, AUC = 60.51%
1.0 lambda, AUC = 60.51%
10.0 lambda, AUC = 60.51%
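Under the hood, additive smoothing adds lambda to every class/feature-value count before normalizing, so a combination that never appears in the training data still receives a non-zero probability. A minimal plain-Scala sketch of the estimate (the helper name and the counts are hypothetical, not part of MLlib's API):

```scala
// Additive (Laplace) smoothing for a conditional probability estimate:
// P(feature = v | class) = (count(v, class) + lambda) /
//                          (count(class) + lambda * numFeatureValues)
def smoothedProb(featureClassCount: Long, classCount: Long,
                 numFeatureValues: Int, lambda: Double): Double =
  (featureClassCount + lambda) / (classCount + lambda * numFeatureValues)

// A feature value never seen with this class still gets probability > 0:
val p = smoothedProb(featureClassCount = 0L, classCount = 100L,
                     numFeatureValues = 2, lambda = 1.0)
```

With lambda = 0 the estimate for an unseen combination collapses to zero, which would zero out the whole class posterior; any positive lambda avoids that.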
- Cross-validation
- Split the dataset into a 60% training set and a 40% test set (for ease of explanation, we use a fixed random seed of 123 in the code so that every run of the experiment yields the same results):
val trainTestSplit = scaledDataCats.randomSplit(Array(0.6, 0.4), 123)
val train = trainTestSplit(0)
val test = trainTestSplit(1)
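Note that `randomSplit` samples each record independently, so the 60/40 proportions are approximate rather than exact; the fixed seed is what makes the split reproducible across runs. The same behaviour can be sketched in plain Scala without Spark (`split` here is a hypothetical helper, not MLlib code):

```scala
import scala.util.Random

// Seeded per-record split: each element goes to the "train" side with
// probability trainFraction; the same seed always yields the same split.
def split[T](data: Seq[T], trainFraction: Double, seed: Long): (Seq[T], Seq[T]) = {
  val rng = new Random(seed)
  data.partition(_ => rng.nextDouble() < trainFraction)
}

val data = (1 to 1000).toSeq
val (train1, test1) = split(data, 0.6, seed = 123)
val (train2, test2) = split(data, 0.6, seed = 123)
// Identical seeds give identical partitions; sizes are only roughly 60/40.
```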
- Next, evaluate the model's performance on the test set under different regularization parameters:
val regResultsTest = Seq(0.0, 0.001, 0.0025, 0.005, 0.01).map { param =>
  val model = trainWithParams(train, param, numIterations,
    new SquaredL2Updater, 1.0)
  createMetrics(s"$param L2 regularization parameter", test, model)
}
regResultsTest.foreach { case (param, auc) =>
  println(f"$param, AUC = ${auc * 100}%2.6f%%")
}
0.0 L2 regularization parameter, AUC = 66.126842%
0.001 L2 regularization parameter, AUC = 66.126842%
0.0025 L2 regularization parameter, AUC = 66.126842%
0.005 L2 regularization parameter, AUC = 66.126842%
0.01 L2 regularization parameter, AUC = 66.093195%
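The L2 regularization parameter scales a penalty on the weight vector that is added to the training loss, pulling the weights toward zero as the parameter grows. A minimal sketch of the penalty term itself (the helper is illustrative, not part of MLlib):

```scala
// L2 penalty added to the loss: (regParam / 2) * ||w||^2
def l2Penalty(weights: Array[Double], regParam: Double): Double =
  0.5 * regParam * weights.map(w => w * w).sum

val weights = Array(0.5, -1.0, 2.0)
// ||w||^2 = 0.25 + 1.0 + 4.0 = 5.25, so the penalty is 0.5 * 0.01 * 5.25
val penalty = l2Penalty(weights, regParam = 0.01)
```

With the tiny parameter values tried above the penalty barely moves the solution, which is consistent with the AUC staying flat until 0.01.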