
Python Natural Language Processing Cookbook

In this recipe, we will use the LDA (Latent Dirichlet Allocation) algorithm to discover topics that appear in the BBC dataset. This algorithm can be thought of as dimensionality reduction: instead of a representation where words are counted (such as how we represent documents using CountVectorizer or TfidfVectorizer; see Chapter 3, Representing Text: Capturing Semantics), we represent documents as sets of topics, each topic with a weight. The number of topics is much smaller than the number of words in the vocabulary. To learn more about how the LDA algorithm works, see https://highdemandskills.com/topic-modeling-intuitive/.
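To make the dimensionality-reduction view concrete, here is a minimal sketch that compares the two representations on a toy corpus. The four sentences and the choice of two topics are assumptions made purely for illustration; they are not the BBC dataset or the settings used in this recipe.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus invented for illustration (not the BBC dataset).
docs = [
    "the team won the football match",
    "the striker scored a late goal in the match",
    "the government passed a new budget",
    "parliament debated the budget and taxes",
]

# Word-count representation: one column per vocabulary word.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Topic representation: one column per topic.
# n_components=2 is an arbitrary choice for this toy example.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

print(counts.shape)      # (number of documents, vocabulary size)
print(doc_topics.shape)  # (number of documents, number of topics)
print(doc_topics.sum(axis=1))  # each document's topic weights sum to about 1
```

Each row of doc_topics holds one document's topic weights, so a document that started as a long vector of word counts is now described by just two numbers.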
We will use the sklearn and pandas packages. If you haven't installed them, do so using the following commands (note that sklearn is installed under the package name scikit-learn):
pip install scikit-learn
pip install pandas
We will use a dataframe to read in the data, then represent the documents using the CountVectorizer object, apply the LDA algorithm, and...