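When building a RandomForest model, the input features include both numeric and string columns. Preprocessing all of them with a single df.na.fill(0.0) before training leads to a NullPointerException or to "Cannot have an empty string for name", because fill(0.0) only replaces nulls in numeric columns and leaves null and empty strings untouched.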
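The behaviour described above can be checked on a toy DataFrame. This is a minimal sketch, assuming a local SparkSession and two made-up columns, duration and industry; the object and function names are hypothetical and not part of the original job.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object NumericFillDemo {
  // Apply the numeric-only fill, then count the nulls left in each column.
  def nullsAfterNumericFill(df: DataFrame): Map[String, Long] = {
    val filled = df.na.fill(0.0)
    filled.columns.map(c => c -> filled.filter(col(c).isNull).count()).toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("numeric-fill-demo")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical toy data: the second row has a null in both columns.
    val df = Seq[(Option[Double], Option[String])](
      (Some(1.0), Some("finance")),
      (None, None)
    ).toDF("duration", "industry")

    // The numeric null is filled; the string null is still there.
    println(nullsAfterNumericFill(df))
    spark.stop()
  }
}
```

In this sketch the duration column ends up with no nulls while industry still has one, which is the value that later stages of the pipeline trip over.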
If the string columns are not needed as features, simply exclude them from the formula.
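import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.feature.RFormula

// splitData and getRFModel are project helpers that are not shown here.
val (trainingData, testData) = splitData(featsDFWithLabel, trainingSampleRatio)
val formula = new RFormula()
  // "label ~ ." uses every column as a feature; each "- col" term removes one of them.
  .setFormula("label ~ . - user_id - relation_type - industry - is_phonenum - relation_type_definite - " +
    "type - category - mark - name - job_first_level - job_second_level - result - " +
    "duration - phone_label - dt")
  .setFeaturesCol("features")
  .setLabelCol("label")
val pipelineModel: PipelineModel = getRFModel(formula, trainingData)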
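A long exclusion list is easy to mistype by hand (a stray "-" or a line break inside the string literal is enough to break it). As a minimal sketch, the formula string could be assembled from a list of column names instead; FormulaBuilder and buildFormula are hypothetical names introduced only for illustration.

```scala
object FormulaBuilder {
  // Build "label ~ . - colA - colB ..." from a list of columns to exclude.
  // Blank names are dropped so that no empty "- -" term can appear.
  def buildFormula(label: String, excluded: Seq[String]): String = {
    val cleaned = excluded.map(_.trim).filter(_.nonEmpty)
    (s"$label ~ ." +: cleaned).mkString(" - ")
  }
}

// Usage with a few of the columns from the formula above:
// FormulaBuilder.buildFormula("label", Seq("user_id", "industry", "dt"))
// returns "label ~ . - user_id - industry - dt"
```

The resulting string can then be passed to setFormula in place of the hand-written literal.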
If the string columns are needed as features, they have to be handled separately from the numeric ones.
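val trainingDFNew = trainingDF
  // Fill nulls in the string columns with a placeholder, then nulls in the numeric columns with 0.0.
  .na.fill(Map("industry" -> "empty", "category" -> "empty", "phone_label" -> "empty"))
  .na.fill(0.0)
  // Replace empty strings with the same placeholder.
  .na.replace("industry", Map("" -> "empty"))
  .na.replace("category", Map("" -> "empty"))
  .na.replace("phone_label", Map("" -> "empty"))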
Error raised when empty strings are left in the string columns:
ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.
java.lang.IllegalArgumentException: requirement failed: Cannot have an empty string for name.
Empty values in the string columns (both null and "") must be handled as well; otherwise one-hot encoding fails with the error above.
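Listing every string column by hand does not scale once more categorical features are added. A minimal sketch, assuming the same "empty" placeholder as above, that picks the string columns from the schema and cleans all of them in one pass; StringColumnCleaner and cleanStringColumns are hypothetical names, not part of the original job.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StringType

object StringColumnCleaner {
  // Replace null and "" in every string column with a placeholder,
  // then fill the remaining nulls in numeric columns with 0.0.
  def cleanStringColumns(df: DataFrame, placeholder: String = "empty"): DataFrame = {
    // Column names whose declared type is StringType.
    val stringCols: Seq[String] = df.schema.fields.collect {
      case f if f.dataType == StringType => f.name
    }.toSeq

    if (stringCols.isEmpty) {
      df.na.fill(0.0)
    } else {
      df.na.fill(placeholder, stringCols)               // null -> placeholder
        .na.replace(stringCols, Map("" -> placeholder)) // ""   -> placeholder
        .na.fill(0.0)                                   // numeric nulls -> 0.0
    }
  }
}
```

With this helper the preprocessing step reduces to something like val trainingDFNew = StringColumnCleaner.cleanStringColumns(trainingDF). Note that it touches every string column, including ones such as IDs that are later excluded by the formula.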