Monday, July 14, 2014

Bayesian Naive Bayes for Classification with the Dirichlet Distribution

I have a classification task and was reading up on various approaches. In the specific case where all inputs are categorical, one can use “Bayesian Naïve Bayes” using the Dirichlet distribution.

Poking through the freely available text by Barber, I found a rather detailed discussion in chapters 9 and 10, as well as example MATLAB code for the book, so I took it upon myself to port it to R as a learning exercise.

I was not immediately familiar with the Dirichlet distribution, but in this case it appeals to the intuitive counting approach to discrete event probabilities.

In a nutshell, we use the training data to learn the posterior distribution, whose parameters turn out to be the prior pseudo-counts plus counts of how often a given event occurs, grouped by class, feature and feature state.
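
A minimal sketch of that counting step, in Python rather than R and not taken from the code linked below. It assumes a symmetric Dirichlet prior with the same pseudo-count for every state; the names `fit_counts` and `alpha` and the toy data are made up for illustration.

```python
def fit_counts(X, y, states, alpha=1.0):
    """Return Dirichlet posterior parameters keyed by (class, feature index).

    X: list of rows, each a list of categorical values
    y: list of class labels, one per row
    states: list giving the allowed states of each feature
    alpha: assumed symmetric prior pseudo-count per state (hypothetical default)
    """
    posterior = {}
    for c in sorted(set(y)):
        for j, feature_states in enumerate(states):
            # Start every state at the prior pseudo-count.
            posterior[(c, j)] = {s: alpha for s in feature_states}
    for row, c in zip(X, y):
        for j, value in enumerate(row):
            # Each training observation adds one to the matching state.
            posterior[(c, j)][value] += 1
    return posterior


# Toy data: two features with different numbers of states.
X = [["low", "yes"], ["low", "no"], ["high", "yes"]]
y = ["a", "a", "b"]
states = [["low", "med", "high"], ["yes", "no"]]

post = fit_counts(X, y, states)
print(post[("a", 0)])  # {'low': 3.0, 'med': 1.0, 'high': 1.0}
```

Because the counts are stored per feature, nothing here requires the features to share the same number of states.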

Prediction is a case of counting events in the test vector. The more this count differs from the per-class trained counts, the lower the probability the current candidate class is a match.
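
To make the prediction step concrete, here is a small Python sketch (again, not the R port or Barber's code). It scores a single test row with the standard Dirichlet-categorical posterior predictive, assuming a symmetric prior `alpha` and class priors taken from class frequencies; both choices, the function name `predict_proba` and the toy data are assumptions for illustration.

```python
import math


def predict_proba(x, X, y, states, alpha=1.0):
    """Class probabilities for one test row x under Dirichlet-smoothed naive Bayes."""
    log_scores = {}
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        # Class prior from class frequencies (an assumption of this sketch).
        log_p = math.log(len(rows) / len(y))
        for j, value in enumerate(x):
            count = sum(1 for row in rows if row[j] == value)
            # Posterior predictive for one feature:
            # (pseudo-count + observed count) / (total pseudo-counts + class size)
            log_p += math.log((alpha + count) / (alpha * len(states[j]) + len(rows)))
        log_scores[c] = log_p
    # Normalise in log space for numerical stability.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


X = [["low", "yes"], ["low", "no"], ["high", "yes"], ["high", "no"]]
y = ["a", "a", "b", "b"]
states = [["low", "med", "high"], ["yes", "no"]]

probs = predict_proba(["low", "yes"], X, y, states)
print(max(probs, key=probs.get))  # a
```

The test row shares its first feature value with both class "a" training rows and with neither class "b" row, so class "a" gets the higher probability.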

Anyway, there are three files. The first is a straightforward port of Barber’s code, but this wasn’t very R-like, and in particular only seemed to handle input features with the same number of states.

I developed my own version that expects everything to be represented as factors. It is all a bit rough and ready, but it appears to work, and there is a test/example script up here. As a bigger test I ran it on a sample car evaluation data set from here. The confusion matrix is as follows:

testY   acc good unacc vgood
  acc    83    3    29     0
  good   16    5     0     0
  unacc  17    0   346     0
  vgood  13    0     0     6
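
For a single summary number, overall accuracy can be read off the matrix as the diagonal total divided by the grand total. A quick Python sketch of that arithmetic, with the counts copied from the table above (the variable names are just for illustration):

```python
labels = ["acc", "good", "unacc", "vgood"]
confusion = [
    [83, 3, 29, 0],
    [16, 5, 0, 0],
    [17, 0, 346, 0],
    [13, 0, 0, 6],
]

# Correct predictions sit on the diagonal, whichever axis holds the true class.
correct = sum(confusion[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

print(f"accuracy = {correct}/{total} = {accuracy:.3f}")  # accuracy = 440/518 = 0.849
```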

That’s it for now. Comments/feedback appreciated. You can find me on Twitter here.

Links to files:

Everything in one directory (with data) here