In this post, I explain the naive Bayes classifier through an example and also provide the source code.
1. Concept
Naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.
For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features. For a fuller introduction to naive Bayes classifiers, see Wikipedia.
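The independence assumption means the likelihood of a class is just the product of the per-feature probabilities. Here is a minimal sketch of that idea; the function name `naive_likelihood` and the three probability values are hypothetical, chosen only for illustration.

```python
from math import prod

def naive_likelihood(feature_probs):
    """Multiply per-feature conditional probabilities P(feature | class).

    Treating the features as independent given the class is exactly
    the "naive" assumption.
    """
    return prod(feature_probs)

# Hypothetical values for P(red | apple), P(round | apple), P(~10 cm | apple).
apple_feature_probs = [0.7, 0.9, 0.8]
likelihood = naive_likelihood(apple_feature_probs)
print(likelihood)
```

Real implementations usually sum log-probabilities instead of multiplying, to avoid numerical underflow when there are many features.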
2. Introduction
Assume a dataset of two-dimensional points with two classes:
- The red points belong to class A.
- The green points belong to class B.
We use PA(x,y) to denote the probability that point (x,y) belongs to class A, and PB(x,y) to denote the probability that it belongs to class B. Given a new point (x1,y1), we predict its class using the following strategy:
- If PA(x1,y1)>PB(x1,y1), then point (x1,y1) belongs to class A.
- If PA(x1,y1)<PB(x1,y1), then point (x1,y1) belongs to class B.
This means that the point is assigned to the class with the higher probability.
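The decision rule above can be sketched in a few lines. The function name `classify` and the scores are hypothetical; in practice the scores would come from a fitted model.

```python
def classify(scores):
    """Return the label with the highest score.

    `scores` maps a class label to its probability for one point.
    On a tie, the label that was inserted first wins.
    """
    return max(scores, key=scores.get)

# Hypothetical probabilities PA(x1, y1) and PB(x1, y1) for one new point.
scores = {"A": 0.7, "B": 0.3}
label = classify(scores)
print(label)
```

The same function works unchanged for more than two classes, since it simply picks the largest value.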
In Bayes classification, we usually use the following equation to calculate the conditional probability:

$$p(c_i \mid w_j) = \frac{p(w_j \mid c_i)\,p(c_i)}{p(w_j)}$$

- p(ci) is the frequency with which class i appears in the prior knowledge.
- p(wj|ci) is the frequency with which word wj appears in class i.
- p(wj) is the frequency with which word wj appears across all classes.
- p(ci|wj) is the probability that a sentence belongs to class i given that word wj appears in it.
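Combining the quantities in this list gives Bayes' rule: posterior = likelihood × prior / evidence. Here is a minimal sketch; the function name `posterior` and its parameter names are assumptions for illustration, and the numbers in the usage line are made up.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' rule: p(ci | wj) = p(wj | ci) * p(ci) / p(wj).

    likelihood -- p(wj | ci), frequency of the word within class i
    prior      -- p(ci), frequency of class i
    evidence   -- p(wj), frequency of the word across all classes
    """
    if evidence <= 0:
        raise ValueError("evidence p(wj) must be positive")
    return likelihood * prior / evidence

# Hypothetical inputs: p(wj | ci) = 0.2, p(ci) = 0.5, p(wj) = 0.4.
result = posterior(0.2, 0.5, 0.4)
print(result)
```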
3. Example
In the following, I will work through an example to give a better understanding of naive Bayes classification.
In the sentence "this dog is very cute", the word vector is
w = ["this", "dog", "is", "very", "cute"].
The frequency of each word in this sentence is
word | this | dog | is | very | cute |
---|---|---|---|---|---|
frequency | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
Assume that in some dataset, the frequencies of these words, p(wj), are:
word | this | dog | is | very | cute |
---|---|---|---|---|---|
frequency | 0.1 | 0.2 | 0.5 | 0.3 | 0.1 |
In class i, the frequencies of these words, p(wj|ci), are:
word | this | dog | is | very | cute |
---|---|---|---|---|---|
frequency | 0.3 | 0.2 | 0.5 | 0.3 | 0.1 |
In the dataset, the frequency of class i, p(ci), is 0.4.
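Then, since the word "this" appears in the sentence, the probability that the sentence belongs to class i is

$$p(c_i \mid \text{this}) = \frac{p(\text{this} \mid c_i)\,p(c_i)}{p(\text{this})} = \frac{0.3 \times 0.4}{0.1} = 1.2$$

This value exceeds 1, which a true probability cannot do: the illustrative frequencies above are not mutually consistent, because p(wj|ci)·p(ci) can never exceed p(wj) in real data. The numbers only show how the formula is applied.
The values for the other words can be calculated in the same way.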
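As a sketch of the full calculation, the snippet below applies the same formula to every word, using the frequencies from the tables above and p(ci) = 0.4. The variable names are my own. Because the example frequencies are illustrative, a result can exceed 1 (as it does for "this"), which would not happen with frequencies counted from a real dataset.

```python
words = ["this", "dog", "is", "very", "cute"]

# p(wj): frequency of each word in the whole dataset (from the table above).
p_w = {"this": 0.1, "dog": 0.2, "is": 0.5, "very": 0.3, "cute": 0.1}

# p(wj | ci): frequency of each word within class i (from the table above).
p_w_given_c = {"this": 0.3, "dog": 0.2, "is": 0.5, "very": 0.3, "cute": 0.1}

# p(ci): frequency of class i in the dataset.
p_c = 0.4

# Apply p(ci | wj) = p(wj | ci) * p(ci) / p(wj) to each word.
p_c_given_w = {w: p_w_given_c[w] * p_c / p_w[w] for w in words}

for w in words:
    print(w, round(p_c_given_w[w], 3))
```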
Reference
Book: Machine Learning in Action.