
Python Machine Learning By Example

From the preceding analysis, we can safely conclude that if we want to figure out whether a document is from the rec.autos newsgroup, the presence or absence of words such as car, doors, and bumper can be very useful features. The presence or absence of a word is a Boolean variable, and we can also look at the count of certain words. For instance, car occurs multiple times in the document. Maybe the more times such a word is found in a text, the more likely it is that the document has something to do with cars.
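To make the two kinds of features concrete, here is a minimal sketch using only the Python standard library. The sample text and the keyword list are made up for illustration and are not taken from the newsgroups data. The simple regular-expression tokenizer is also an assumption, not necessarily the approach used elsewhere in the book.

```python
import re
from collections import Counter

# Hypothetical sample text, made up for illustration (not from the dataset).
sample_text = "My car has four doors. The car needs a new bumper."

# Hypothetical keyword list based on the words discussed above.
keywords = ["car", "doors", "bumper", "engine"]

# Lowercase the text and split it into alphabetic tokens.
tokens = re.findall(r"[a-z]+", sample_text.lower())
token_counts = Counter(tokens)

# Boolean feature: is the word present at all?
presence = {word: word in token_counts for word in keywords}
# Count feature: how many times does the word occur?
counts = {word: token_counts[word] for word in keywords}

print(presence)
print(counts)
```

Here, car is present and counted twice, while engine is absent and has a count of zero.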
It seems that we are only interested in the occurrence of certain words, their count, or a related measure, and not in the order of the words. We can therefore view a text as an unordered collection of words. This is called the Bag of Words (BoW) model. It is a very basic model, but it works pretty well in practice. We can optionally define a more complex model that takes into account the order of words...
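The following sketch shows one way to see that a bag of words ignores word order. The helper name bag_of_words and the two example sentences are hypothetical, chosen only for this illustration. It uses collections.Counter from the standard library rather than any particular machine learning package.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a word -> count mapping, discarding word order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Two hypothetical texts with the same words in a different order.
text_a = "the car hit the bumper"
text_b = "the bumper hit the car"

# Both texts map to the same bag, since only the counts are kept.
print(bag_of_words(text_a) == bag_of_words(text_b))
```

The two sentences mean different things, yet they produce identical bags. That is the information we give up with this model, and it is why a more complex, order-aware model can sometimes be worth defining.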