阿里云安全算法挑战赛网页文本分类总结

第一次参加阿里天池的比赛，虽然总成绩没能进前十，但是网页题做的还不错（最终排名第二，成绩0.8312分）。在这里记录一下网页文本分类题的解题思路和赛后总结：

出题背景是阿里云服务器上存在着很多非法网页，需要通过机器学习算法识别出网页的类别，对非法网页进行打击。题目的详细描述见赛题官方网址赛题二。原始数据是网页静态html代码，但可转换为文本分类问题。训练集的标签已给出，共有4个类别：

fake_card(证件卡票)：提供伪造证件、发票、证明材料等制作服务；
gambling(赌博类)：包括赌博交易、网络博彩、赌博器具等；
sexy(色情类)：包括色情传播、文图视频、网络招嫖等；
normal(正常类)：不包含以上内容的网页。

网页分类

算法整体流程图：

整体流程图

关键步骤：

数据预处理

训练数据中，id字段是经过hash处理的，会出现id重复的情况，需要根据id对数据进行去重；另外对各类型网页html文本长度的统计结果如下所示：

网页文本长度统计图

网页文本长度大于200000的不到网页数据总量的1%，且基本不包含风险网页，所以对其进行清除，减少后续的计算量；观察出字符长度小于200的网页中大部分内容为"404 ERROR","正在等待响应...",'"排队中..."等内容，对模型的训练没有用处，也要进行清除。代码在下方：

--计算id的出现次数 
select id, risk, html, count(id) over (partition by id) as count from ${t1} ;
--计算html文本的长度
select *, length(html) as html_length from ${t1} ;
--数据清理
select * from ${t1} where count = 1 and html_length > 200 and html_length < 200000 ;

网页文本提取

html字段是网页的源代码，包含了结构化的html标签，<style>、<script>等标签中的内容不包含语义信息，参考文献[5]中结论，实验中只对title、keywords、description、body标签中的文本进行提取，另外去除文本中的特殊字符，仅保留英文、中文以及常见的标点符号。html中文本内容的提取通过在MapReduce的map函数中调用jsoup库实现（因为ODPS有沙箱隔离的限制，实验中抽取了jsoup的核心代码放到自定义的包中打包上传）

package cn.edu.whu.tianchi_htmlclassify;

import cn.edu.whu.tianchi_htmlclassify.nodes.Document;
import cn.edu.whu.tianchi_htmlclassify.nodes.Element;
import cn.edu.whu.tianchi_htmlclassify.select.Elements;

import com.aliyun.odps.OdpsException;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.data.TableInfo;
import com.aliyun.odps.mapred.JobClient;
import com.aliyun.odps.mapred.MapperBase;
import com.aliyun.odps.mapred.conf.JobConf;
import com.aliyun.odps.mapred.utils.InputUtils;
import com.aliyun.odps.mapred.utils.OutputUtils;
import com.aliyun.odps.mapred.utils.SchemaUtils;

import java.io.IOException;

public class HtmlExtractor {
    public static void main(String[] args) throws OdpsException {
        JobConf job = new JobConf();
        job.setMapperClass(MapperClass.class);
        job.setMapOutputKeySchema(SchemaUtils.fromString("id:string"));
        job.setMapOutputValueSchema(SchemaUtils.fromString(
                "title:string,keywords:string,description:string,bodytext:string"));
        job.setNumReduceTasks(0);

        InputUtils.addTable(TableInfo.builder().tableName(args[0]).build(), job);
        OutputUtils.addTable(TableInfo.builder().tableName(args[1]).build(), job);

        JobClient.runJob(job);
    }

    public static class MapperClass extends MapperBase {
        private Record output;

        @Override
        public void setup(TaskContext context) throws IOException {
            output = context.createOutputRecord();
        }

        @Override
        public void map(long key, Record record, TaskContext context)
            throws IOException {
            //id原样设置
            output.set(0, record.get("id").toString());

            String html = record.get("html").toString();
            String bodyText = new String();
            String title = new String();
            String description = new String();
            String keywords = new String();

            Document doc = null;
            try {
                //使用Jsoup解析html
                doc = Jsoup.parse(html); 
                //去除隐藏的html标签
                doc.select("*[style*=display:none]").remove();   
                //获取title文本
                Elements titles = doc.select("title");    
                for (Element element : titles) {
                    title = title + element.text();
                }
                //获取keywords文本
                Elements metaKey = doc.select("meta[name=keywords]");    
                for (Element element : metaKey) {
                    keywords = keywords + element.attr("content");
                }
                //获取description文本
                Elements metaDes = doc.select("meta[name=description]");    
                for (Element element : metaDes) {
                    description = description + element.attr("content");
                }
                //获取body文本
                Elements body = doc.select("body");    
                for (Element element : body) {
                    bodyText = bodyText + element.text();
                }
            } catch (Exception e) {
                bodyText = "";
            } finally {
                //仅保留数字、字母、中文、常用的标点符号
                output.set(1,title.replaceAll(
                          "[^0-9a-zA-Z\u4e00-\u9fa5.，、,。？“”]+", ""));
                output.set(2,keywords.replaceAll(
                          "[^0-9a-zA-Z\u4e00-\u9fa5.，、,。？“”]+", ""));
                output.set(3,description.replaceAll(
                          "[^0-9a-zA-Z\u4e00-\u9fa5.，、,。？“”]+", ""));
                output.set(4,bodyText.replaceAll(
                          "[^0-9a-zA-Z\u4e00-\u9fa5.，、,。？“”]+", ""));
                context.write(output);
            }
        }
    }
}

文本分词

在中文文本分类中，常用的特征单元有字、词、字串（character n-gram）及其组合，中文中单个字的语义质量不是很好，一般不作为特征单元，词与二字串的效果相似。在天池PAI平台上，有基于AliWS(Alibaba Word Segmenter的简称)的分词组件，若用户指定了词性标注或语义标注相关参数，则会将分词结果、词性标注结果和语义标注结果一同输出。在这里勾选词性标注，在词的初步过滤中会用到词性。

文本分词参数

词初步过滤

停用词过滤
使用 https://github.com/JNU-MINT/TextBayesClassifier 中停用词；
词性过滤（仅保留动词、名词）
参考[7]中实验结果，仅适用动词、名词作为特征词。

词频统计

在对文章进行分词的基础上，按行保序输出对应文章ID列(docId)对应文章的词，统计指定文章ID列(docId)对应文章内容(docContent)的词频。
使用词频统计组件分别对title、keywords、description、bodytext进行词频统计，得到文档i中词j出现在title、keywords、description、bodytext中的个数

词频统计结果表

可以对title、keywords、description、body设置不同的权重，测试了不同权重的组合，当四个标签权重都为1时分类精度最高，词频权重组合的方法如下所示：

select *, (title_count * 1 + key_count * 1 + des_count * 1 + body_count) as count from ${t1}

特征词选择

特征词的选择在很多博客与论文[1][6]中都有叙述，常见的特征提取方法有文档词频（Document Frequency, DF）、信息增益(Information Gain, IG)、互信息（Mutual Information, MI）、卡方统计（CHI）、期望交叉熵（Expected Cross Entropy）等等，由于时间限制，我们试验了几种通常情况表现好的方法进行结合：文档词频 + 信息增益 + 卡方增益，对比了几组不同的阈值，得到了较好的分类效果。

文档词频：在训练文本集中对每个词计算其文档频数，若该词的DF值小于某个阈值或者大于某个阈值都要进行去除，因为小于某个阈值代表了该词“没有代表性”，大于某个阈值代表了“没有区分度”。
信息增益：信息增益是一种基于熵的评估方法，定义为某个词为分类能够提供的信息量，其值为不考虑任何词的熵与考虑该词后的熵的差值。计算公式如下：

信息增益计算公式

卡方统计：卡方统计体现特征词与类别之间独立性的缺乏程度，它同时考虑了特征词存在与不存在的情况。卡方统计值越大，独立性越小，相关性越大。计算公式如下：

卡方统计计算公式

计算步骤：

统计训练样本的文档的总数记为total_count；
统计每个类别的文档总数分别为sexy_total_count、gamble_total_count、fake_total_count、normal_total_count；
计算每个词在各类别文档中出现的总次数分别记为sexy_count、gamble_count、fake_count、normal_count；
每个词的文档频数doc_count = sexy_count + gamble_count + fake_count + normal_count；
根据下图中对各类型ABCD值的定义，求出sexyA、sexyB、sexyC、sexyD、gambleA、gambleB、gambleC、gambleD、fakeA、fakeB、fakeC、fakeD、normalA、normalB、normalC、normalD；

ABCD定义

--计算特征词在各类型中的ABCD值
select *,
sexy_count as sexyA,
doc_count - sexy_count as sexyB,
sexy_total_count - sexy_count as sexyC,
total_count - doc_count - sexy_total_count + sexy_count as sexyD,

gamble_count as gambleA,
doc_count - gamble_count as gambleB,
gamble_total_count - gamble_count as gambleC,
total_count - doc_count - gamble_total_count + gamble_count as gambleD,

fake_count as fakeA,
doc_count - fake_count as fakeB,
fake_total_count - fake_count as fakeC,
total_count - doc_count - fake_total_count + fake_count as fakeD,

normal_count as normalA,
doc_count - normal_count as normalB,
normal_total_count - normal_count as normalC,
total_count - doc_count - normal_total_count + normal_count as normalD

from ${t1}

根据卡方统计计算出各类型CHI值，再用下式计算词t对于整个样本的CHI值；

卡方统计值

--计算特征词在各类型中的卡方统计
select *,
total_count * (sexyA * sexyD - sexyC * sexyB) * (sexyA * sexyD - sexyC * sexyB) / 
((sexyA + sexyC) * (sexyB + sexyD) * (sexyA + sexyB) * (sexyC + sexyD)) 
as sexy_x2,

total_count * (gambleA * gambleD - gambleC * gambleB) * (gambleA * gambleD - gambleC * gambleB) / 
((gambleA + gambleC) * (gambleB + gambleD) * (gambleA + gambleB) * (gambleC + gambleD)) 
as gamble_x2,

total_count * (fakeA * fakeD - fakeC * fakeB) * (fakeA * fakeD - fakeC * fakeB) / 
((fakeA + fakeC) * (fakeB + fakeD) * (fakeA + fakeB) * (fakeC + fakeD)) 
as fake_x2,

total_count * (normalA * normalD - normalC * normalB) * (normalA * normalD - normalC * normalB) / 
((normalA + normalC) * (normalB + normalD) * (normalA + normalB) * (normalC + normalD)) 
as normal_x2

from ${t1};

select *, GREATEST(sexy_x2,  gamble_x2, fake_x2, normal_x2) as x2 from ${t1};

代入信息增益公式计算出word_ig值；

p(c_sexy) = sexy_total_count / total_count
p(t) = doc_count / total_count
p(c_sexy| t) = sexy_count / doc_count
p(t^━) = 1 - doc_count / total_count
p(c_sexy| t^━) = (sexy_total_count - sexy_count) / (total_count- doc_count)

select *,
- ((sexy_total_count / total_count * log(2, sexy_total_count / total_count )
 +  gamble_total_count / total_count * log(2, gamble_total_count / total_count )
 +  fake_total_count / total_count * log(2, fake_total_count / total_count ) 
 +  normal_total_count / total_count * log(2, normal_total_count / total_count ) ) 
+ doc_count / total_count * 
  (sexy_count / doc_count * log(2, sexy_count / doc_count + 0.01) + 
   gamble_count / doc_count * log(2, gamble_count / doc_count  + 0.01) + 
   fake_count / doc_count * log(2, fake_count / doc_count  + 0.01) + 
   normal_count / doc_count * log(2, normal_count / doc_count  + 0.01)) 
+ (1 - doc_count / total_count ) * 
  ((sexy_total_count- sexy_count) / (total_count- doc_count) * 
      log(2, (sexy_total_count - sexy_count) / (total_count- doc_count) + 0.01) + 
   (gamble_total_count - gamble_count) / (total_count - doc_count) * 
      log(2, (gamble_total_count - gamble_count) / (total_count- doc_count) + 0.01) + 
   (fake_total_count - fake_count) / (total_count - doc_count) * 
      log(2, (fake_total_count  - fake_count) / (total_count - doc_count) + 0.01) + 
   (normal_total_count- normal_count) / (total_count - doc_count) * 
      log(2, (normal_total_count - normal_count) / (total_count - doc_count) + 0.01))  
as word_ig from ${t1}

结合doc_count，信息增益，卡方统计选择特征词（设置各自的阈值）。

特征计算

特征计算常用tfidf，它的形式为：

tfidf计算公式

其中tf(t_i, d_j)是词在当前文本中的局部因子，idf(t_i)是只与t_i相关的全局因子，在论文[2]中认为idf不能充分反映词的重要程度，总结了一些改进的方法：

其他特征值计算方法

我们实验了几种在论文结论中表现较好的几种方法：TFIDF、Probability based、tf * info gain，感到惊讶的是传统的TFIDF的特征计算方法表现最好（当然实验次数有限，不能充分证明）。tfidf的计算方法如下：

-- total_count 为总文档数
-- doc_count 为特征词在文档中出现总次数
-- count 为在文档j中特征词i出现的频数
select *, sum(tfidf_square) over ( partition by id) as tfidf_square_sum from 
(select *, count * log(2, total_count / doc_count + 0.01) * count * log(2, total_count / doc_count + 0.01) 
as tfidf_square from ${t1} ) c

select *, count * log(2, total_count / doc_count + 0.01) / sqrt(tfidf_square_sum) as tfidf from ${t1}

模型训练及预测

在训练样本中，各类型的比值大约为50:5:3:3，需要考虑类别不平衡对模型训练的影响[4]。常见的解决方法有随机采样、SMOTE算法、代价矩阵等。我们采用的是随机采样算法，将各类型的比值降采样到10:5:3:3时分类效果最好。

总结

比赛之前对我们对特征工程、类别不平衡问题、特征穿越这些基本问题一无所知，都是在做题的过程中遇到了问题再去查找解决方法，加上比赛经验不足，过程上可谓是磕磕绊绊。给自己总结几个经验和教训：

经典的算法适合入手，学习成本低，通过完善往往能够得到不错的结果；
比赛的前期要不断的尝试不同的思路，关注赛题官方论坛、查找相关文献、和队友交流思路、动手实验要同步进行；
比赛后期要考虑试错的成本，特别是最后关键时刻还是要采用最稳妥的方法提高成绩；
选择好的队友很关键，要找和自己有同样的激情和目标的人，在这里感谢我的队友们，一起努力走到最后。

比赛中和队友一起进步了很多，对学习机器学习的热情也增加了不少，感谢阿里天池给我们提供了这么好的学习平台。

参考文献

Zheng Z, Wu X, Srihari R. Feature selection for text categorization on imbalanced data[J]. Acm Sigkdd Explorations Newsletter, 2004, 6(1):80-89.
Liu Y, Han T L, Sun A. Imbalanced text classification: A term weighting approach[J]. Expert Systems with Applications, 2009, 36(1):690-701.
Shen D, Chen Z, Yang Q, et al. Web-page classification through summarization[C]// International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2004:242-249.
He H, Garcia E A. Learning from Imbalanced Data[J]. IEEE Transactions on Knowledge & Data Engineering, 2009, 21(9):1263-1284.
Golub K, Ard? A. Importance of HTML structural elements and metadata in automated subject classification[C]// European Conference on Research and Advanced Technology for Digital Libraries. Springer-Verlag, 2005:368-378.
王小青. 中文文本分类特征选择方法研究[D]. 西南大学, 2010.
段军峰, 黄维通, 陆玉昌. 中文网页分类研究与系统实现[J]. 计算机科学, 2007, 34(6):210-213