Machine Learning in Action reading notes: Naive Bayes

Core idea: ask the classifier for its best guess of the class, together with an estimate of the probability of that guess.

We call it naive because the whole formulation makes only the most basic, simplest assumptions.

Probability basics

p(c_i|x,y) = \frac{p(x,y|c_i)p(c_i)}{p(x,y)}

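Here p(c_i|x,y) means: given a data point described by x and y, what is the probability that it comes from class c_i?

If p(c_1|x,y) > p(c_2|x,y), the point belongs to class 1; if p(c_1|x,y) < p(c_2|x,y), it belongs to class 2.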
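
The decision rule above can be sketched in a few lines. The likelihoods and priors below are made-up numbers chosen only for illustration, and `posterior` is a hypothetical helper name, not something from the book.

```python
def posterior(likelihoods, priors):
    """Return p(c_i | x, y) for every class using Bayes' rule."""
    # numerator of Bayes' rule for each class: p(x, y | c_i) * p(c_i)
    joint = [l * p for l, p in zip(likelihoods, priors)]
    # denominator p(x, y): the numerators summed over all classes
    evidence = sum(joint)
    return [j / evidence for j in joint]


# Hypothetical values: p(x, y | c_1), p(x, y | c_2) and equal priors
likelihoods = [0.2, 0.6]
priors = [0.5, 0.5]

post = posterior(likelihoods, priors)
# pick the class with the larger posterior (index 0 is class 1, index 1 is class 2)
predicted = post.index(max(post))
print(post, predicted)
```

Because p(x,y) is the same for every class, comparing the numerators alone would give the same decision.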

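Independence: if each feature needs N samples, then 10 features would need N^{10} samples. If the features are independent of one another, the number of samples required drops from N^{10} to 10N. Independence here means statistical independence: the probability that a feature or word appears does not depend on which other words it appears next to. We know this assumption is not really true, and that is exactly what "naive" means.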
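
To get a feel for the size of that reduction, here is a small arithmetic sketch. `samples_needed` is a hypothetical helper and N = 100 is an arbitrary choice.

```python
def samples_needed(n_per_feature, n_features, independent):
    """Rough sample count from the argument above."""
    if independent:
        # each feature can be estimated on its own
        return n_per_feature * n_features
    # every combination of feature values has to be covered
    return n_per_feature ** n_features


N = 100  # assumed number of samples per feature
joint = samples_needed(N, 10, independent=False)
indep = samples_needed(N, 10, independent=True)
print(joint, indep)
```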

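Naive Bayes assumptions

  • Features are independent of one another (the word bacon is as likely to appear after unhealthy as after delicious)
  • Every feature is equally important (to judge whether a message is appropriate, every word has to be read)

Although these two assumptions usually do not hold, naive Bayes makes them anyway.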

Text classification with Python

Training the algorithm

p(c_i|\textbf{w})=\frac{p(\textbf{w}|c_i)p(c_i)}{p(\textbf{w})}

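The x, y in the earlier formula are replaced with \textbf{w}; the bold type indicates a vector. Thanks to the naive Bayes independence assumption, p(\textbf{w}|c_i) can be computed with the following formula, which simplifies the calculation.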
p(\textbf{w}|c_i) = p(w_0,w_1,...,w_n|c_i) =\prod_{j}p(w_j|c_i)
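
A small numeric sketch of this factorization follows. The per-word probabilities are invented for illustration, and only the words present in the document contribute a factor, which is also how `classify_naive` later in these notes uses the vector.

```python
import numpy as np

# Hypothetical p(w_j | c_i) for a four-word vocabulary
word_probs = np.array([0.5, 0.2, 0.1, 0.2])
# 0/1 vector marking which vocabulary words occur in the document
doc_vector = np.array([1, 0, 1, 1])

# product over the words that are present: 0.5 * 0.1 * 0.2
likelihood = np.prod(word_probs ** doc_vector)

# the same quantity in log space: a sum instead of a product
log_likelihood = np.sum(doc_vector * np.log(word_probs))

print(likelihood, np.exp(log_likelihood))
```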

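Count the number of documents in each class
For each training document:
    For each class:
        If a token appears in the document, increment the count for that token
        Increment the count of all tokens
For each class:
    For each token:
        Divide the token's count by the total token count to get the conditional probability
Return the conditional probabilities for each class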
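
The pseudocode above could be written roughly as follows. This is a minimal sketch with plain dictionaries and no smoothing; `train_counts` and the three toy documents are hypothetical, and every occurrence of a token is counted, which equals the "token appears" rule only when a document does not repeat words.

```python
from collections import Counter, defaultdict


def train_counts(docs, labels):
    """Estimate p(token | class) by simple counting."""
    doc_counts = Counter(labels)         # number of documents in each class
    token_counts = defaultdict(Counter)  # per-class count of each token
    total_tokens = Counter()             # per-class count of all tokens
    for tokens, label in zip(docs, labels):
        for token in tokens:
            token_counts[label][token] += 1
            total_tokens[label] += 1
    # divide each token count by the class total to get conditional probabilities
    cond_prob = {
        label: {tok: n / total_tokens[label] for tok, n in counts.items()}
        for label, counts in token_counts.items()
    }
    return doc_counts, cond_prob


# Toy data, made up for this example (1 = abusive, 0 = not abusive)
docs = [["stupid", "dog"], ["cute", "dog"], ["stupid", "garbage"]]
labels = [1, 0, 1]
doc_counts, cond_prob = train_counts(docs, labels)
print(doc_counts, cond_prob)
```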

Hands-on

  • Text classification

    Classify whether a comment is abusive

    import numpy as np
    
    
    def load_data_set():
        """
        Generate train data set and associated classify result
        :return: (train_data_set, classify_result)
        """
        posting_list = [
            ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
            ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
            ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
            ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
            ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
            ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
        class_vec = [0, 1, 0, 1, 0, 1]  # 1 is an abuse, 0 is not
        return posting_list, class_vec
    
    
    def create_vocab_list(data_set):
        """
        Get a set of words which appear in the train data set
        :param data_set: train_data_set
        :return: set of words
        """
        vocab_set = set()
        for doc in data_set:
            # union of 2 sets
            vocab_set = vocab_set | set(doc)
        return list(vocab_set)
    
    
    def word2vec_set(vocab_list, input_sentence):
        """
        Convert a sentence to a 0/1 vector marking which vocabulary words appear in it (set-of-words model)
        :param vocab_list: All words that appear in the train set
        :param input_sentence: Input sentence
        :return: The vector representation of the input sentence
        """
        ret_vector = [0] * len(vocab_list)
        for word in input_sentence:
            if word in vocab_list:
                ret_vector[vocab_list.index(word)] = 1
            else:
                print("the word {} is not in the vocabulary".format(word))
        return ret_vector
    
    
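    def word2vec_bag(vocab_list, input_sentence):
        """
        Convert a sentence to a vector using the bag-of-words model, which counts how many times each word occurs in the sentence
        :param vocab_list: All words that appear in the train set
        :param input_sentence: Input sentence
        :return: The vector of word counts for the input sentence
        """
        ret_vector = [0] * len(vocab_list)
        for word in input_sentence:
            # skip words outside the vocabulary; vocab_list.index would raise ValueError for them
            if word in vocab_list:
                ret_vector[vocab_list.index(word)] += 1
        return ret_vector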
    
    
    def train_naive(train_matrix, train_category):
        """
        Get probabilities to calculate bayes classify result
        :param train_matrix: All sentence vector of train set
        :param train_category: The classify result of train set
        :return: p(w|c_0) p(w|c_1) p(c_1)
        """
        # number of comments
        doc_num = len(train_matrix)
        # number of words in the vocabulary
        word_num = len(train_matrix[0])
    
        # prior probability of an abusive comment, p(c_1)
        # Since this is a two-class problem, p(c_0) is simply 1 - p_abuse
        p_abuse = sum(train_category) / float(doc_num)
    
        # Start every word count at 1 and each denominator at 2 rather than 0,
        # so a word never seen in one class cannot force the product p(w|c_i) to zero
        p0_num = np.ones(word_num)
        p1_num = np.ones(word_num)
    
        p0_denominator = 2.0
        p1_denominator = 2.0
    
        for i in range(doc_num):
            # if this comment is abusive
            if train_category[i] == 1:
                p1_num += train_matrix[i]
                p1_denominator += sum(train_matrix[i])
            else:
                p0_num += train_matrix[i]
                p0_denominator += sum(train_matrix[i])
        # pi_condition holds log p(w_j|c_i) for every word w_j in the vocabulary
        p1_condition = np.log(p1_num / p1_denominator)
        p0_condition = np.log(p0_num / p0_denominator)
        return p0_condition, p1_condition, p_abuse
    
    
    def classify_naive(test_vector, p0_condition, p1_condition, p_1):
        # p0_condition and p1_condition already hold log probabilities, so the product
        # p(w|c_i) = p(w_0|c_i)p(w_1|c_i)p(w_2|c_i) ... becomes a sum of logs
        # Asterisk means element-wise multiply in numpy
        p1 = sum(test_vector * p1_condition) + np.log(p_1)
        p0 = sum(test_vector * p0_condition) + np.log(1 - p_1)
        if p1 > p0:
            return 1
        else:
            return 0
    
    
    def test_naive():
        post_list, class_list = load_data_set()
        vocab = create_vocab_list(post_list)
        train_matrix = []
        for post in post_list:
            train_matrix.append(word2vec_set(vocab, post))
        p0_condition, p1_condition, p_abusive = train_naive(train_matrix, class_list)
        test_entry = ["love", "my", "dalmation"]
        test_vector = word2vec_set(vocab, test_entry)
        print("The vector of input sentence is: ", test_vector)
        print("Classify result is: ", classify_naive(test_vector, p0_condition, p1_condition, p_abusive))
    
    
    post_list, classes = load_data_set()
    print(post_list)
    vocab = create_vocab_list(post_list)
    print(word2vec_set(vocab, post_list[0]))
    print(vocab)
    
    train_matrix = []
    for post in post_list:
        train_matrix.append(word2vec_set(vocab, post))
    p_non_abusive_condition, p_abusive_condition, p_abusive = train_naive(train_matrix, classes)
    
    print(p_abusive)
    print(p_abusive_condition)
    
    max_index = p_abusive_condition.argmax()
    # the argmax of p_abusive_condition is 'stupid': this word contributes the most to a comment being classified as abusive
    print(vocab[max_index])
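
`train_naive` above starts its word counts at 1 and its denominators at 2.0 instead of 0. The sketch below, with made-up counts for a single class, illustrates what that initialization avoids: a word that never occurred in the class would otherwise get probability 0 and wipe out the whole product.

```python
import numpy as np

# Hypothetical word counts for one class; the second word was never seen
counts = np.array([3.0, 0.0, 1.0])

# zero initialization: plain relative frequencies
unsmoothed = counts / counts.sum()
# same initialization as train_naive: counts start at 1, denominator at 2
smoothed = (counts + 1.0) / (counts.sum() + 2.0)

# a document that contains the first two words
doc = np.array([1, 1, 0])
p_unsmoothed = np.prod(unsmoothed ** doc)  # 0.75 * 0.0
p_smoothed = np.prod(smoothed ** doc)      # (4/6) * (1/6)
print(p_unsmoothed, p_smoothed)
```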
    
    
  • Spam filtering

    import re
    import random
    import numpy as np
    
    
    def create_vocab_list(data_set):
        """
        Get a set of words which appear in the train data set
        :param data_set: train_data_set
        :return: set of words
        """
        vocab_set = set()
        for doc in data_set:
            # union of 2 sets
            vocab_set = vocab_set | set(doc)
        return list(vocab_set)
    
    
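    def word2vec_bag(vocab_list, input_sentence):
        """
        Convert a sentence to a vector using the bag-of-words model, which counts how many times each word occurs in the sentence
        :param vocab_list: All words that appear in the train set
        :param input_sentence: Input sentence
        :return: The vector of word counts for the input sentence
        """
        ret_vector = [0] * len(vocab_list)
        for word in input_sentence:
            # skip words outside the vocabulary; vocab_list.index would raise ValueError for them
            if word in vocab_list:
                ret_vector[vocab_list.index(word)] += 1
        return ret_vector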
    
    
    def train_naive(train_matrix, train_category):
        """
        Get probabilities to calculate bayes classify result
        :param train_matrix: All sentence vector of train set
        :param train_category: The classify result of train set
        :return: p(w|c_0) p(w|c_1) p(c_1)
        """
        # number of documents
        doc_num = len(train_matrix)
        # number of words in the vocabulary
        word_num = len(train_matrix[0])
    
        # prior probability of class 1, p(c_1)
        # Since this is a two-class problem, p(c_0) is simply 1 - p_abuse
        p_abuse = sum(train_category) / float(doc_num)
    
        # Start every word count at 1 and each denominator at 2 rather than 0,
        # so a word never seen in one class cannot force the product p(w|c_i) to zero
        p0_num = np.ones(word_num)
        p1_num = np.ones(word_num)
    
        p0_denominator = 2.0
        p1_denominator = 2.0
    
        for i in range(doc_num):
            # if this comment is abusive
            if train_category[i] == 1:
                p1_num += train_matrix[i]
                p1_denominator += sum(train_matrix[i])
            else:
                p0_num += train_matrix[i]
                p0_denominator += sum(train_matrix[i])
        # pi_condition holds log p(w_j|c_i) for every word w_j in the vocabulary
        p1_condition = np.log(p1_num / p1_denominator)
        p0_condition = np.log(p0_num / p0_denominator)
        return p0_condition, p1_condition, p_abuse
    
    
    def classify_naive(test_vector, p0_condition, p1_condition, p_1):
        # p0_condition and p1_condition already hold log probabilities, so the product
        # p(w|c_i) = p(w_0|c_i)p(w_1|c_i)p(w_2|c_i) ... becomes a sum of logs
        # Asterisk means element-wise multiply in numpy
        p1 = sum(test_vector * p1_condition) + np.log(p_1)
        p0 = sum(test_vector * p0_condition) + np.log(1 - p_1)
        if p1 > p0:
            return 1
        else:
            return 0
    
    
    def parse_text(input_sentence):
        token_list = re.split(r'\W+', input_sentence)
        return [token.lower() for token in token_list if len(token) > 2]
    
    
    def spam_test():
        # Import and parse files
        doc_list = []
        class_list = []
        for i in range(1, 26):
            # read one spam file (class 1) and one ham file (class 0) per index
            for label, folder in ((1, "spam"), (0, "ham")):
                path = "email/{}/{}.txt".format(folder, i)
                try:
                    with open(path) as f:
                        words = parse_text(f.read())
                except UnicodeDecodeError:
                    # fall back to Windows-1252 when the default encoding cannot decode the file
                    with open(path, encoding="cp1252") as f:
                        words = parse_text(f.read())
                doc_list.append(words)
                class_list.append(label)
        vocab = create_vocab_list(doc_list)
    
        # Generate Training Set and Test Set
        test_set = [int(num) for num in random.sample(range(50), 10)]
        training_set = list(set(range(50)) - set(test_set))
    
        training_matrix = []
        training_class = []
        for doc_index in training_set:
            training_matrix.append(word2vec_bag(vocab, doc_list[doc_index]))
            training_class.append(class_list[doc_index])
        p0_condition, p1_condition, p_spam = train_naive(np.array(training_matrix), np.array(training_class))
    
        # Test the classify result
        err_count = 0
        for doc_index in test_set:
            test_vector = word2vec_bag(vocab, doc_list[doc_index])
            classify_result = classify_naive(test_vector, p0_condition, p1_condition, p_spam)
            if classify_result != class_list[doc_index]:
                err_count += 1
        print("The error rate is {}".format(err_count / len(test_set)))
    
    
    spam_test()
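
Because `spam_test` picks its 10 test documents at random, the reported error rate changes from run to run. One way to get a steadier number is to repeat the hold-out split several times and average. The sketch below only shows that pattern: `holdout_split`, `mean_error_rate` and `fake_run` are hypothetical names, and `fake_run` stands in for a real train-and-test run.

```python
import random


def holdout_split(n_docs, n_test, rng):
    """Randomly pick n_test indices for testing; the rest are for training."""
    test_idx = rng.sample(range(n_docs), n_test)
    train_idx = sorted(set(range(n_docs)) - set(test_idx))
    return train_idx, test_idx


def mean_error_rate(run_once, repeats, seed=0):
    """Average the error rate returned by run_once over several random splits."""
    rng = random.Random(seed)
    rates = [run_once(rng) for _ in range(repeats)]
    return sum(rates) / len(rates)


def fake_run(rng):
    # Placeholder for training and testing on one split; it always reports 1 error in 10
    train_idx, test_idx = holdout_split(50, 10, rng)
    return 1 / len(test_idx)


train_idx, test_idx = holdout_split(50, 10, random.Random(0))
average = mean_error_rate(fake_run, repeats=5)
print(len(train_idx), len(test_idx), average)
```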
    

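Summary

  • Naive Bayes and Bayes' rule provide a way to estimate unknown values from known ones;
  • The conditional independence assumption between features reduces the amount of data needed; although the assumption is overly simple, naive Bayes is still an effective classifier
  • Implementing naive Bayes involves a number of practical issues, for example taking natural logarithms to avoid underflow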
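
The underflow problem is easy to reproduce. In the sketch below, 400 per-word probabilities of 0.001 (arbitrary values chosen for illustration) multiply to a number too small for a 64-bit float, while the sum of their logarithms stays well within range.

```python
import numpy as np

# 400 hypothetical per-word probabilities, each equal to 0.001
probs = np.full(400, 1e-3)

# the true product is 1e-1200, which underflows to 0.0 in float64
product = np.prod(probs)

# summing logarithms keeps the same information without underflow
log_sum = np.sum(np.log(probs))

print(product, log_sum)
```

Since the logarithm is monotonic, comparing the log scores of two classes gives the same decision as comparing the original products.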