Discriminative classifiers divide the feature space into regions that separate the data belonging to different classes, such that each region contains samples belonging to a single class. The decision boundary is determined by constructing and solving an equation of the form $f(\mathbf{x}; \mathbf{\theta}) = 0$, where $f$ is a function of the feature vector $\mathbf{x}$ parameterized by $\mathbf{\theta}$.
In practice, equations of this kind may not always be solvable exactly, so discriminative classifiers minimize a loss function to find an approximate solution. Algorithms of this type are not necessarily concerned with the probability distributions of the features or the labels.
Additionally, determining probabilities using the conditional distribution of $y$ given $\mathbf{x}$, i.e. $P(y \mid \mathbf{x})$, gives us a measure of how strongly a data instance is associated with a specific class.
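For instance, logistic regression is a discriminative classifier of this kind. A minimal sketch, assuming scikit-learn and its bundled iris dataset, showing how such a model exposes $P(y \mid \mathbf{x})$ directly:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load the iris dataset: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# A discriminative classifier: it learns P(y | x) directly,
# without modelling how the features themselves are distributed.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns P(y = c | x) for every class c,
# i.e. how strongly each sample is associated with each class.
print(clf.predict_proba(X[:3]))
```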
Generative classifiers are concerned with finding the joint probability of features and labels, i.e. they try to estimate the probabilities of all pairs of features and labels found in the training data. Generative models assume that there is an underlying probability distribution from which the data has been sampled, and try to estimate the parameters of this distribution. Thus, they see the data as being "generated" by an estimated distribution (discriminative models, in contrast, need not make this assumption).
The Naïve Bayes algorithm is a family of generative classifiers. The fundamental assumption in a Naïve Bayes algorithm is that
conditional on the class, features are independent, i.e. one feature value appearing in a given class is independent of the value of any other feature in the same class.
E.g. given that a sample is drawn from the "setosa" class of the iris dataset, knowing its sepal length should tell us nothing about its petal width.
The algorithm assumes that a change in the value of one feature does not impact any other feature (within the subset of samples belonging to the same class).
This is impractical in real life, since features are likely to be interrelated (e.g. in the Titanic dataset, given that a passenger survived and paid a low fare, we might be able to conclude that they were a woman; in text data, we know that occurrences of words are not independent of each other). Hence it is a ‘naïve’ (oversimplifying) assumption.
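This can be checked empirically. A minimal sketch, assuming scikit-learn's copy of the iris dataset and pandas, that measures the correlation between two features within a single class:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame  # feature columns plus a 'target' column

# Restrict to one class ("setosa" is encoded as target == 0)
# and check whether two features are really uncorrelated within it.
setosa = df[df["target"] == 0]
corr = setosa["sepal length (cm)"].corr(setosa["petal width (cm)"])
print(f"Within-class correlation: {corr:.2f}")  # rarely exactly zero in practice
```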
Thus, the algorithm:
- assumes that the features in each class of the dataset are sampled from a distribution
- estimates the parameters for each such distribution
Ultimately, the model ends up with one distribution per class. Note that these distributions are all from the same family (e.g. Normal, Binomial, Gamma, etc.).
Then, for each sample in the test dataset, the algorithm evaluates the probability of the sample's features under each of the class-wise distributions, and the class with the highest probability is the predicted class.
For example, consider a height-weight-gender (HWG) dataset, where the task is to predict a person's gender from their height and weight. A Gaussian Naïve Bayes model fits one normal distribution per feature per class: one for the heights within each gender and one for the weights within each gender. To classify a new person, it evaluates their height and weight under each class's distributions and picks the class under which the measurements are most probable.
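A minimal sketch of this idea, assuming scikit-learn's `GaussianNB` and a small made-up height/weight sample (the numbers are illustrative, not taken from a real HWG dataset):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Illustrative (made-up) data: [height in cm, weight in kg]
X = np.array([
    [180, 80], [175, 77], [182, 85], [178, 79],   # class 0
    [162, 55], [158, 52], [165, 60], [160, 54],   # class 1
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# GaussianNB fits one normal distribution per feature per class.
model = GaussianNB().fit(X, y)

# Predict the class of a new sample and inspect the posteriors.
new_person = np.array([[170, 65]])
print(model.predict(new_person))        # predicted class
print(model.predict_proba(new_person))  # P(class | features)
```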
The goal of the prediction is to find the probability that a feature vector belongs to a given class. In other words, for a set of classes $\{c_1, c_2, \ldots, c_k\}$ and a feature vector $\mathbf{x}$, we want to compute $P(y = c_i \mid \mathbf{x})$ for each $i$, where $y$ denotes the class label of the sample.
However, from the training dataset, we can only estimate the class-conditional likelihood, i.e. we can only compute $P(\mathbf{x} \mid y = c_i)$, which denotes the probability of finding a feature vector within a given class.
So, the algorithm uses Bayes' theorem to convert these quantities into posterior probabilities.
As in typical machine learning problems, the parameters of the class-wise distributions are unknown. These are typically computed with maximum likelihood estimation.
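For Gaussian class-wise distributions, the maximum likelihood estimates are just the per-class, per-feature sample means and variances (plus the class frequencies for the priors). A minimal NumPy sketch, with a hypothetical helper name `fit_gaussian_nb`:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Maximum likelihood estimates for a Gaussian Naive Bayes model."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),   # class prior P(y = c)
            "mean": Xc.mean(axis=0),     # MLE mean of each feature in class c
            "var": Xc.var(axis=0),       # MLE variance (ddof=0) of each feature
        }
    return params
```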
For a sample $\mathbf{x}$, we want to find the probability of the sample belonging to class $c_i$, i.e. $P(y = c_i \mid \mathbf{x}; \mathbf{\theta})$ - the probability of the class being $c_i$ given the sample $\mathbf{x}$, parameterized by $\mathbf{\theta}$.
(Note: $\mathbf{\theta}$ denotes the parameters of the class-wise distributions, which are estimated from the training data.)
This expression can then be expanded using Bayes' theorem as follows:
$$P(y = c_i \mid \mathbf{x}; \mathbf{\theta}) = \frac{P(\mathbf{x} \mid y = c_i; \mathbf{\theta}) \, P(y = c_i)}{P(\mathbf{x})}$$
We know (or can estimate) the following quantities from the training data:
- the class-conditional likelihoods $P(\mathbf{x} \mid y = c_i)$, and
- the class priors $P(y = c_i) \; \forall \, i \in \{1, \ldots, k\}$.
The denominator on the RHS, $P(\mathbf{x})$, can be ignored, since it is only a normalizing factor and does not affect the relative probabilities of a sample belonging to different classes.
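As a small worked example with made-up numbers: suppose there are two classes with priors $P(y=c_1) = 0.6$ and $P(y=c_2) = 0.4$, and that the likelihoods of an observed sample are $P(\mathbf{x} \mid y=c_1) = 0.02$ and $P(\mathbf{x} \mid y=c_2) = 0.05$. Comparing only the numerators,

$$P(y=c_1 \mid \mathbf{x}) \propto 0.02 \times 0.6 = 0.012, \qquad P(y=c_2 \mid \mathbf{x}) \propto 0.05 \times 0.4 = 0.020,$$

so the sample is assigned to $c_2$, even though $c_1$ has the larger prior.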
The expression for the class-conditional likelihood, $P(\mathbf{x} \mid y = c_i)$, can be factorized using the conditional independence assumption:
$$P(\mathbf{x} \mid y = c_i) = \prod_{j=1}^{d} P(x_j \mid y = c_i)$$
(note that $\mathbf{x} = (x_1, x_2, \ldots, x_d)$ is a vector of $d$ features). Now, this is a representation which can be computed from the training data, since each factor involves only a single feature.
With all these quantities in place for each class, we can compute the LHS for all $i \in \{1, \ldots, k\}$, and predict the class with the highest posterior probability: $\hat{y} = \arg\max_{c_i} P(y = c_i \mid \mathbf{x}; \mathbf{\theta})$.
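Putting these pieces together, a minimal prediction routine (a sketch, reusing the hypothetical `fit_gaussian_nb` helper from above and working in log-space for numerical stability):

```python
import numpy as np
from scipy.stats import norm

def predict(x, params):
    """Return the class with the highest (unnormalized) log-posterior."""
    scores = {}
    for c, p in params.items():
        log_prior = np.log(p["prior"])  # log P(y = c)
        # Sum of log P(x_j | y = c) over all features j,
        # using the per-class Gaussian parameters.
        log_likelihood = norm.logpdf(x, loc=p["mean"], scale=np.sqrt(p["var"])).sum()
        scores[c] = log_prior + log_likelihood
    return max(scores, key=scores.get)

# e.g. predict(np.array([170, 65]), fit_gaussian_nb(X, y))
```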
For example, in the Titanic dataset, we can predict whether a passenger survived from categorical features such as sex, passenger class and port of embarkation. A Categorical (or Bernoulli) Naïve Bayes model estimates, for each class (survived / did not survive), the probability of each feature value, and combines them with the class priors exactly as above.
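A minimal sketch, assuming seaborn's bundled copy of the Titanic dataset (downloaded on first use) and scikit-learn's `CategoricalNB`:

```python
import seaborn as sns
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# seaborn ships a small copy of the Titanic dataset.
titanic = sns.load_dataset("titanic").dropna(subset=["sex", "pclass", "embarked"])

# CategoricalNB expects integer-encoded categorical features.
X = OrdinalEncoder().fit_transform(titanic[["sex", "pclass", "embarked"]])
y = titanic["survived"]

model = CategoricalNB().fit(X, y)
print(model.score(X, y))  # training accuracy as a quick sanity check
```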
When to use a Naïve Bayes algorithm?
- Small datasets (n <= m)
- Features are (approximately) uncorrelated
- Categorical features (Bernoulli/Categorical NB)
- Gaussian Naïve Bayes can be used when the data is non-linear

Advantages:
- Naïve Bayes classifiers generally have high bias and low variance; they generalise the properties of the training dataset rather than memorising it. For example, in the Titanic dataset considered above, variables like Name, Age and Fare take a large range of values and are hence highly prone to overfitting if we use classifiers like linear regression, logistic regression or decision trees.
- Naïve Bayes is a good baseline model owing to its simple assumption and generalisation property.
- Naïve Bayes models train quickly.
- Non-linear datasets do not need to be transformed.

Disadvantages:
- Features are rarely conditionally independent in practice, so the naïve assumption fails in most cases and can lower accuracy.
- Gaussian Naïve Bayes assumes normally distributed numerical features, and performs poorly when this assumption does not hold.
Sources:
- Generative vs discriminative models: https://cs229.stanford.edu/summer2019/cs229-notes2.pdf and https://stats.stackexchange.com/questions/12421/generative-vs-discriminative
- Naïve Bayes vs logistic regression: https://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
- Naïve Bayes as a baseline classifier: https://www.cl.cam.ac.uk/teaching/1617/MLRD/handbook/nb.pdf
- Also refer: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote05.html