| title | Naive Bayes Classifier |
|---|---|
| tags | Machine Learning, Classification, Naive Bayes |
Naive Bayes is a simple, fast, and reliable probabilistic model. It is widely used for classification tasks and works especially well on NLP problems.
Naive Bayes rests on the assumption that the features/indicators are independent (uncorrelated). That is, the presence of one particular feature does not affect any other, whereas in real-world problems this rarely happens. Hence the name "Naive". An additional assumption can be taken into account: all predictors/features impact the outcome equally (no feature is more important than the others). Ex: the day being windy is not more important than any other feature when deciding whether or not to play tennis.
NBC is a probabilistic model that classifies an instance by computing the probability of each class; the class with the highest probability is the chosen class for that instance. NBC does this using Bayes' theorem.
$$ P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} $$
- $P(A|B)$ : Posterior probability
- $P(B|A)$ : The likelihood
- $P(A)$ : The prior probability (priori)
- $P(B)$ : The marginal likelihood
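As a quick numeric illustration, Bayes' theorem can be evaluated directly; the spam-filter setting and all numbers below are made up:

```python
# Worked example of Bayes' theorem with made-up numbers:
# A = "email is spam", B = "email contains the word 'offer'".
p_A = 0.2              # prior P(A): 20% of emails are spam
p_B_given_A = 0.6      # likelihood P(B|A): 60% of spam contains "offer"
p_B_given_notA = 0.05  # 5% of non-spam contains "offer"

# Marginal likelihood P(B) via the law of total probability
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Posterior P(A|B) = P(B|A) * P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)  # probability of spam given that "offer" occurs
```

Seeing the word raises the probability of spam from the 0.2 prior to a 0.75 posterior.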
With Bayes' theorem, one can find the probability of A happening given that B has occurred (B is the evidence and A is the hypothesis).
- The posterior probability is just the conditional probability output by Bayes' theorem. There is nothing special about it; it does not differ in any way from any other conditional probability, it just has its own name.
- Posterior: $P(A|B)$ is the probability of $A$ "after" learning about $B$.
- Prior: $P(A)$ is the probability of $A$ "before" learning about $B$.
$$ P(y|X) = \frac{P(X|y)\,P(y)}{P(X)} $$ Notice: $X$ is uppercase since it stands for the whole set of features $(x_1, \ldots, x_n)$, while $y$ is a single class variable.
- $y$: Class variable
- $X = (x_1, x_2, x_3, \ldots, x_n)$: Features/predictors
Substituting $X$ into the above formula and applying the chain rule together with the naive independence assumption, we obtain: $$ P(y|x_1,\ldots,x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i|y)}{P(x_1) P(x_2) \cdots P(x_n)} $$
Recall that NBC calculates all the class probabilities and chooses the highest one as the final class. It is noteworthy that the denominator is the same for all classes; it remains static. Therefore, for the sake of simplicity, we can simply drop the denominator and compare only the numerators. However, in a case where one wants to obtain the exact probability value (of one instance belonging to a class), the denominator should be kept (if we eliminate it, the probability is no longer valid).
Therefore, this can be re-written as: $$ P (y|x_1,...,x_n) \propto P(y) \prod_{i=1}^{n} P(x_i|y) $$
Most real-world problems are multi-class classification tasks. Therefore, we need to find the class $y$ with maximum probability: $$ y = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i|y) $$
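The argmax rule can be implemented with nothing more than counting; a minimal from-scratch sketch for categorical features, using a made-up toy "play tennis" dataset (no smoothing yet, so unseen feature values get probability zero):

```python
from collections import Counter, defaultdict

# Toy "play tennis" data (made up): features -> label
data = [
    ({"outlook": "sunny",    "windy": "yes"}, "no"),
    ({"outlook": "sunny",    "windy": "no"},  "no"),
    ({"outlook": "overcast", "windy": "no"},  "yes"),
    ({"outlook": "rain",     "windy": "no"},  "yes"),
    ({"outlook": "rain",     "windy": "yes"}, "no"),
    ({"outlook": "overcast", "windy": "yes"}, "yes"),
]

# "Training" is just counting.
class_counts = Counter(label for _, label in data)
feat_counts = defaultdict(Counter)  # (class, feature) -> value counts
for x, label in data:
    for feat, value in x.items():
        feat_counts[(label, feat)][value] += 1

def predict(x):
    """Return the class maximizing P(y) * prod_i P(x_i | y)."""
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / len(data)  # prior P(y)
        for feat, value in x.items():
            score *= feat_counts[(c, feat)][value] / n_c  # P(x_i | y)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict({"outlook": "overcast", "windy": "no"}))
```

In practice the product of many small probabilities underflows, so implementations usually sum log-probabilities instead of multiplying.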
When the test set contains a categorical value that was not observed in the training dataset, NBC assigns it zero probability and is unable to make a prediction. This is often referred to as the Zero Frequency problem. To tackle it, we can use a smoothing technique; one of the simplest is Laplace estimation.
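One way Laplace (add-one) smoothing can be applied to the conditional probability estimates, sketched as a standalone helper (the function name and the counts are illustrative):

```python
# Laplace (add-one) smoothing for P(x_i = v | y), applied to raw counts.
# count_v: times value v appeared with class y in training
# n_y:     number of training instances of class y
# k:       number of distinct values the feature can take
def smoothed_prob(count_v, n_y, k, alpha=1.0):
    return (count_v + alpha) / (n_y + alpha * k)

# An unseen value no longer gets probability zero:
print(smoothed_prob(0, 10, 3))  # 1/13 instead of 0
print(smoothed_prob(4, 10, 3))  # (4 + 1) / (10 + 3)
```

With `alpha = 1` this is classic Laplace smoothing; smaller `alpha` values give the more general additive (Lidstone) smoothing.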
- Easy to implement and fast in both training and testing.
- When the assumption of independence holds, NBC performs better than other models and requires less training data.
- It performs well with categorical features/indicators compared to numerical ones. For numerical variables, a normal/Gaussian distribution is assumed.
- The assumption of independent predictors/features usually does not hold in most real-world problems. This weakens the accuracy of NBC.
- The advantages indicate that NBC is a good choice as an initial baseline classifier.
- If it performs suitably, then you have a very fast, interpretable classifier for your problem. If it does not perform well, you can begin exploring more sophisticated models, with some baseline knowledge of how well they should perform.
The parameters of a Naive Bayes classifier, i.e., the priors (prior probabilities) and the conditional probabilities, are 'learnt', or rather determined, using a deterministic set of steps. This involves two very trivial operations that can be blindingly fast on modern-day computers:
- Counting
- Dividing
There is no "iteration". There is no "epoch". There is no "optimization of a cost function". There is no "error back-propagation". There are no operations involving "solving a matrix equation".
This makes the TRAINING FAST!
The model parameters (the priors and conditional probabilities) are stored for inference on new data. During the inference phase, the model uses the Bayes equation to compute the posterior probabilities. This involves trivial arithmetic operations like addition and multiplication, and the final normalization is only a division by a scalar. This is why INFERENCE is FAST!
So, Naive Bayes - on the whole - is FAST!!
Multinomial NB: mostly used for document classification problems, i.e., the problem of predicting whether a document belongs to Sports, Economics, Politics, etc. The features used in this case can be the frequencies of the words occurring within a document.
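A sketch of such a document classifier using scikit-learn's `CountVectorizer` and `MultinomialNB`, assuming scikit-learn is installed (the tiny corpus and labels are made up):

```python
# Document classification with word-frequency features (toy corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [
    "the team won the match",          # Sports
    "stocks fell as markets closed",   # Economics
    "the election results are in",     # Politics
    "a great goal in the final game",  # Sports
]
labels = ["Sports", "Economics", "Politics", "Sports"]

vectorizer = CountVectorizer()        # features = word frequencies
X = vectorizer.fit_transform(docs)
clf = MultinomialNB().fit(X, labels)  # includes Laplace smoothing (alpha=1)

new_doc = vectorizer.transform(["the team scored a goal"])
print(clf.predict(new_doc)[0])
```

`MultinomialNB` applies additive smoothing by default (`alpha=1.0`), so words unseen for a class do not zero out its score.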
Bernoulli NB: similar to Multinomial NB, but the features/predictors are boolean/binary variables. The features used in this case can be whether or not a word occurs in the document.
Gaussian NB: used for continuous (not discrete) predictors/features; it is assumed that these values are sampled from a Gaussian (normal) distribution.
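A minimal sketch with scikit-learn's `GaussianNB`, which estimates a per-class mean and variance for each feature (toy 1-D data; assumes scikit-learn is installed):

```python
# Gaussian NB on continuous features: two classes with separated means.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0], [1.2], [0.9], [5.0], [5.3], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB().fit(X, y)  # "training" = per-class mean and variance
print(clf.predict(np.array([[1.1], [5.1]])))
```

Each likelihood $P(x_i|y)$ is evaluated with the normal density using the fitted class mean and variance, then plugged into the same argmax rule as before.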
