Natural Language Understanding with Python

The preprocessing techniques covered in the previous sections apply to many types of text across many applications. Some applications also benefit from additional, more specialized preprocessing steps, which we will cover in the next sections.
Sometimes data includes specific words or tokens that have equivalent semantics. For example, a text corpus might include the names of US states, but for the purposes of the application, we only care that some state was mentioned – we don’t care which one. In that case, we can substitute a class token for the specific state name. Consider the interaction in Figure 5.10:
Figure 5.10 – Class token substitution
If we substitute the class token, <state_name>, for Texas, all of the other state names will be easier to recognize, because instead of having to learn 50 states, the system...
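As a rough illustration, class token substitution can be sketched with a regular expression. This is a minimal sketch under stated assumptions, not necessarily the approach the book goes on to use: the function name substitute_state_names, the short STATE_NAMES list, and the sample sentence are all hypothetical, and a real application would list all 50 states.

```python
import re

# Hypothetical, deliberately incomplete list of state names (assumption:
# a real application would include all 50).
STATE_NAMES = ["Texas", "Virginia", "West Virginia", "New York", "California"]

# Try longer names first, so that if one name were a prefix of another,
# the longer one would be preferred by the alternation.
_alternatives = "|".join(
    re.escape(name) for name in sorted(STATE_NAMES, key=len, reverse=True)
)

# \b anchors keep the pattern from matching inside a longer word.
_STATE_PATTERN = re.compile(r"\b(?:" + _alternatives + r")\b", flags=re.IGNORECASE)


def substitute_state_names(text, class_token="<state_name>"):
    """Replace every listed state name in text with the class token."""
    return _STATE_PATTERN.sub(class_token, text)


if __name__ == "__main__":
    # Hypothetical utterance, not taken from the book's figure.
    print(substitute_state_names("What is the weather forecast for Texas?"))
```

After substitution, every utterance that mentions a listed state has the same surface form at that position, so a downstream model only has to learn the single token <state_name> rather than each state name separately.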