The final task of this chapter will be to apply our newly gained skills to build a real spam filter!
Naive Bayes classifiers are a very popular choice for email filtering. Their naivety lends itself nicely to the analysis of text data, where each feature is a word (as in a bag-of-words representation), and where it would not be feasible to model the dependence of every word on every other word.
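To make the independence assumption concrete, here is a minimal sketch of a bag-of-words Naive Bayes classifier written in plain Python. It is only an illustration and not the implementation used in this chapter: the function names (`tokenize`, `train_naive_bayes`, `predict`), the whitespace tokenizer, the add-one smoothing, and the four toy messages are all assumptions made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train_naive_bayes(documents, labels):
    """Count how often each word occurs in each class (hypothetical helper)."""
    word_counts = {}                 # label -> Counter of word occurrences
    class_counts = Counter(labels)   # label -> number of training documents
    vocabulary = set()
    for text, label in zip(documents, labels):
        words = tokenize(text)
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return word_counts, class_counts, vocabulary


def predict(text, word_counts, class_counts, vocabulary):
    """Return the label with the highest (unnormalized) log-posterior."""
    n_docs = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = math.log(n_label / n_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # skip words never seen during training
            # Each word contributes independently of all others: this is
            # the "naive" assumption. Add-one smoothing avoids log(0).
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data invented for illustration only
toy_docs = [
    "win money now",
    "cheap money offer now",
    "meeting schedule for monday",
    "project meeting notes",
]
toy_labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(toy_docs, toy_labels)
print(predict("cheap money", *model))
print(predict("monday meeting", *model))
```

Because every word adds its own term to the score, the model only needs per-class word counts, rather than statistics for every pair of words. That is what keeps the approach tractable for large vocabularies.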
There are a bunch of good email datasets out there, such as the following:
- The Ling-Spam corpus: http://csmining.org/index.php/ling-spam-datasets.html
- The Hewlett-Packard spam database: https://archive.ics.uci.edu/ml/machine-learning-databases/spambase
- The Enron-Spam dataset: http://www.aueb.gr/users/ion/data/enron-spam
- The Apache SpamAssassin public corpus: http://csmining.org/index.php/spam-assassin-datasets.html
In this section, we will be using the Enron...