The Handbook of NLP with Gensim

Let’s start with BoW and TF-IDF. We learned how to prepare both in Chapter 2, Text Representation. BoW records how many times each word occurs in a document, while its variation, TF-IDF, weights those counts to reflect how important a word is to a document within a corpus.
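To make the difference between the two representations concrete, here is a minimal plain-Python sketch. It assumes a made-up three-document corpus and one common weighting, count times log(N / document frequency); the names docs, bow, and tfidf are hypothetical, and Gensim's own weighting and normalization may differ from this simplified form.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is already tokenized.
docs = [
    ["cat", "sat", "mat"],
    ["cat", "ate", "fish"],
    ["dog", "sat", "log"],
]

def bow(doc):
    """Bag-of-words: raw count of each token in one document."""
    return Counter(doc)

def tfidf(doc, corpus):
    """Weight each count by log(N / document frequency).

    This is one common TF-IDF variant, chosen for illustration only.
    """
    n_docs = len(corpus)
    weights = {}
    for token, count in bow(doc).items():
        # Number of documents that contain this token at least once.
        doc_freq = sum(1 for d in corpus if token in d)
        weights[token] = count * math.log(n_docs / doc_freq)
    return weights

weights = tfidf(docs[0], docs)
for token, weight in weights.items():
    print(token, round(weight, 3))
```

In the first document every word has a count of 1, so BoW treats them alike. Under this weighting, "mat" scores higher than "cat" or "sat" because it appears in only one document of the corpus.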
We will first use the Dictionary class to build and manage dictionaries of terms (words or tokens). It creates a mapping between the unique terms in a corpus and their integer IDs. This mapping is the vocabulary on which the BoW is built:
from gensim.corpora import Dictionary
gensim_dictionary = Dictionary()
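As a rough illustration of what such a term-to-ID mapping looks like, the following plain-Python sketch builds one by hand. The two tiny documents are made up, and the loop is only a conceptual stand-in, not Gensim's implementation; the Dictionary class exposes a comparable mapping through its token2id attribute, though the IDs it assigns may be ordered differently.

```python
# Hypothetical tokenized documents standing in for the real corpus.
docs = [
    ["human", "computer", "interface"],
    ["computer", "survey", "user"],
]

token2id = {}
for doc in docs:
    for token in doc:
        # Assign the next free integer ID the first time a token is seen.
        if token not in token2id:
            token2id[token] = len(token2id)

print(token2id)
print(len(token2id))  # number of unique tokens in the toy corpus
```

The length of the mapping is the vocabulary size, which is the same quantity that len() reports for a Gensim dictionary.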
Let’s examine the dictionary object, gensim_dictionary. How many unique words are in it? Let’s check its length to get the number of unique words:
len(gensim_dictionary)
We get the following output:
40360
So, there are 40,360 unique words!
Now, we will create the BoW by using the .doc2bow() function:
bow_corpus = [gensim_dictionary.doc2bow...
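To show the shape of the result, here is a small plain-Python sketch of the conversion that .doc2bow() performs conceptually: a list of tokens becomes a list of (token ID, count) pairs, and tokens missing from the vocabulary are skipped. The vocabulary, the helper name to_bow, and the sample sentence are all hypothetical; this is an illustration rather than Gensim's code.

```python
from collections import Counter

# Hypothetical vocabulary mapping tokens to integer IDs.
token2id = {"cat": 0, "sat": 1, "mat": 2}

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# "dog" is not in the vocabulary, so it does not appear in the output.
bow_vector = to_bow(["cat", "sat", "cat", "dog"], token2id)
print(bow_vector)  # [(0, 2), (1, 1)]
```

Applying this conversion to every document in a corpus yields one sparse vector per document, which is the structure that bow_corpus holds.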