# Bayes classifier

In many applications the relationship between the attribute set and the class variable is non-deterministic. In other words, the class label of a test record cannot be predicted with certainty even though its attribute set is identical to some of the training examples. This situation may arise because of noisy data or the presence of certain confounding factors that affect classification but are not included in the analysis. Hence an approach is required for modeling probabilistic relationships between the attribute set and the class variable.

## Bayes Theorem

Let $X$ and $Y$ be a pair of random variables. Their joint probability, $P(X = x, Y=y)$, refers to the probability that variable $X$ will take on the value $x$ and variable $Y$ will take on the value $y$. A conditional probability is the probability that a random variable will take on a particular value given that the outcome for another random variable is known. For example, the conditional probability $P(Y = y| X = x)$ refers to the probability that the variable $Y$ will take on the value $y$, given that the variable $X$ is observed to have the value $x$. The joint and conditional probabilities for $X$ and $Y$ are related in the following way: $$P(X, Y) = P(Y|X) * P(X) = P(X|Y) * P(Y)$$ Rearranging the last two expressions in the above equation leads to the following formula, known as the Bayes theorem: $$P(Y|X) = \frac{P(X|Y) * P(Y)}{P(X)}$$
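As a quick numerical sketch of the theorem, consider a hypothetical binary class $Y$ and a binary attribute outcome $X$ (all the probability values below are made up for illustration):

```python
# Hypothetical numbers: a rare class and a class-conditional observation.
p_y = 0.01              # prior P(Y): 1% of records belong to the class
p_x_given_y = 0.90      # class-conditional probability P(X|Y)
p_x_given_not_y = 0.05  # P(X | not Y)

# Evidence P(X) via the law of total probability
p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)

# Bayes theorem: P(Y|X) = P(X|Y) * P(Y) / P(X)
p_y_given_x = p_x_given_y * p_y / p_x
print(round(p_y_given_x, 4))  # 0.1538
```

Even though $P(X|Y)$ is high, the posterior $P(Y|X)$ stays modest because the prior $P(Y)$ is small, which is exactly the trade-off the formula captures.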

## Using the Bayes theorem for classification

Before describing how the Bayes theorem can be used for classification, let us formalize the classification problem from a statistical perspective. Let $X$ denote the attribute set and $Y$ denote the class variable. If the class variable has a non-deterministic relationship with the attributes, then we can treat $X$ and $Y$ as random variables and capture their relationship probabilistically using $P(Y|X)$. This conditional probability is also known as the **posterior probability** for $Y$, as opposed to its **prior probability**, $P(Y)$.

During the training phase, we need to learn the posterior probabilities $P(Y|X)$ for every combination of $X$ and $Y$ based on information gathered from the training data. By knowing these probabilities, a test record $X'$ can be classified by finding the class $Y'$ that maximizes the posterior probability $P(Y'|X')$.

Estimating the posterior probabilities accurately for every possible combination of class label and attribute values is a difficult problem because it requires a very large training set, even for a moderate number of attributes. The Bayes theorem is useful because it allows us to express the posterior probability in terms of the prior probability $P(Y)$, the class-conditional probability $P(X|Y)$, and the evidence, $P(X)$:

$$P(Y|X) = \frac{P(X|Y) * P(Y)}{P(X)}$$

When comparing the posterior probabilities for different values of $Y$, the denominator term, $P(X)$, is always constant, and thus can be ignored. The prior probability $P(Y)$ can be easily estimated from the training set by computing the fraction of training records that belong to each class. To estimate the class-conditional probabilities $P(X|Y)$, there are two implementations of Bayesian classification methods: the naive Bayes classifier and the Bayesian belief network.
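Estimating the prior by class fractions can be sketched in a few lines; the labels below are a hypothetical training set:

```python
from collections import Counter

# Hypothetical class labels from a training set; the prior P(Y = y)
# is simply the fraction of records that belong to class y.
labels = ["yes", "no", "yes", "yes", "no", "yes", "no", "yes"]

counts = Counter(labels)
priors = {y: n / len(labels) for y, n in counts.items()}
print(priors)  # {'yes': 0.625, 'no': 0.375}
```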

## Naive Bayes Classifier

A naive Bayes classifier estimates the class-conditional probability by assuming that the attributes are conditionally independent, given the class label $y$. The conditional independence assumption can be formally stated as follows: $$P(X|Y = y) = \prod_{i=1}^{d} P(X_{i}| Y = y)$$ where each attribute set $X = \{X_{1}, X_{2}, \ldots, X_{d}\}$ consists of $d$ attributes.

## Conditional independence

Let $X$, $Y$, and $Z$ denote three sets of random variables. The variables in $X$ are said to be conditionally independent of $Y$, given $Z$, if the following condition holds: $$P(X|Y, Z) = P(X|Z)$$ An example of conditional independence is the relationship between a person's arm length and his or her reading skills: the two appear related, but once the person's age is known, the apparent relationship disappears, since both depend on age.
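The definition can be verified numerically. The sketch below builds a hypothetical joint distribution that factorizes as $P(Z) \, P(X|Z) \, P(Y|Z)$, so $X$ and $Y$ are conditionally independent given $Z$ by construction, and then checks $P(X|Y,Z) = P(X|Z)$ directly (all probability tables are made up):

```python
import itertools

# Hypothetical distribution in which X and Y are conditionally
# independent given Z: the joint factorizes as P(Z) * P(X|Z) * P(Y|Z).
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

joint = {
    (x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
    for x, y, z in itertools.product([0, 1], repeat=3)
}

def prob(pred):
    """Sum the joint probability over all outcomes matching a predicate."""
    return sum(v for k, v in joint.items() if pred(*k))

# Check P(X=1 | Y=1, Z=0) against P(X=1 | Z=0)
lhs = prob(lambda x, y, z: x == 1 and y == 1 and z == 0) / prob(lambda x, y, z: y == 1 and z == 0)
rhs = prob(lambda x, y, z: x == 1 and z == 0) / prob(lambda x, y, z: z == 0)
print(abs(lhs - rhs) < 1e-12)  # True
```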

## How a naive Bayes classifier works

With the conditional independence assumption, instead of computing the class-conditional probability for every combination of $X$, we only have to estimate the conditional probability of each $X_{i}$, given $Y$. The latter approach is more practical because it doesn't require a very large training set to obtain a good estimate for each class $Y$: $$P(Y|X) = \frac{P(Y) \prod_{i=1}^{d}P(X_{i}|Y)}{P(X)}$$

Since $P(X)$ is fixed for every $Y$, it is sufficient to choose the class that maximizes the numerator term, $P(Y) \prod_{i=1}^{d} P(X_{i}|Y)$.
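The decision rule above can be sketched end to end for categorical attributes. The training records below are hypothetical (two attributes per record), and the estimators are the simple fraction counts described later in this section:

```python
from collections import Counter, defaultdict

# Hypothetical training set: (attribute tuple, class label) pairs.
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("sunny", "cool"), "yes"),
    (("rainy", "hot"), "no"),
]

labels = [y for _, y in train]
class_counts = Counter(labels)
priors = {y: n / len(train) for y, n in class_counts.items()}

# cond[i][y][v] = number of class-y records whose i-th attribute equals v
cond = defaultdict(lambda: defaultdict(Counter))
for x, y in train:
    for i, v in enumerate(x):
        cond[i][y][v] += 1

def score(x, y):
    """Numerator of the posterior: P(Y=y) * prod_i P(X_i = x_i | Y = y)."""
    s = priors[y]
    for i, v in enumerate(x):
        s *= cond[i][y][v] / class_counts[y]
    return s

def predict(x):
    # P(X) is constant across classes, so maximizing the numerator suffices.
    return max(priors, key=lambda y: score(x, y))

print(predict(("rainy", "cool")))  # yes
```

Note that a zero count for any attribute value drives the whole product to zero; practical implementations smooth the estimates to avoid this, which is omitted here to keep the sketch minimal.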

## Estimating conditional probabilities for categorical attributes

For a categorical attribute $X_{i}$, the conditional probability $P(X_{i} = x_{i} | Y = y)$ is estimated according to the fraction of training instances in class $y$ that take on a particular attribute value $x_{i}$.
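This fraction-counting estimate is a one-liner in practice; the records below pair a hypothetical categorical attribute value with a class label:

```python
from collections import Counter

# Hypothetical training data: (attribute value x_i, class label y) pairs.
records = [("single", "no"), ("married", "no"), ("single", "yes"),
           ("married", "no"), ("divorced", "yes"), ("married", "no"),
           ("divorced", "no"), ("single", "yes"), ("married", "no"),
           ("single", "yes")]

def cond_prob(x_i, y):
    """P(X_i = x_i | Y = y): fraction of class-y records with value x_i."""
    in_class = [x for x, label in records if label == y]
    return Counter(in_class)[x_i] / len(in_class)

print(cond_prob("single", "yes"))  # 0.75
```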

## Estimating conditional probabilities for continuous attributes

There are two ways to estimate the class-conditional probabilities for continuous attributes in the naive Bayes classifier.

- We can discretize each continuous attribute and then replace the continuous attribute values with their corresponding discrete intervals.
- We can assume a certain form of probability distribution for the continuous variable and estimate the parameters of the distribution using the training data.
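The second option is commonly illustrated with a normal distribution: within each class, estimate the attribute's mean and standard deviation from the training records of that class, and use the resulting density in place of $P(X_{i}|Y)$. The attribute values below are hypothetical:

```python
import math
from statistics import mean, stdev

# Hypothetical continuous attribute values observed for one class.
incomes_yes = [125.0, 100.0, 70.0, 120.0]

mu = mean(incomes_yes)      # sample mean for the class
sigma = stdev(incomes_yes)  # sample standard deviation for the class

def gaussian_density(x, mu, sigma):
    """Normal density used as the class-conditional estimate P(X_i = x | Y = y)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(round(gaussian_density(110.0, mu, sigma), 4))  # 0.0155
```

Strictly speaking, the density is not a probability, but since it is compared across classes with a common attribute value, it serves the same role in the decision rule.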