
Python Natural Language Processing Cookbook
We can now train our own word2vec model on a corpus. For this task, we will use the top 20 Project Gutenberg books, which include The Adventures of Sherlock Holmes. We use multiple books because training a model on just one book produces suboptimal results; the more text we supply, the better the embeddings become.
You can download the dataset for this recipe from Kaggle: https://www.kaggle.com/currie32/project-gutenbergs-top-20-books. The dataset includes files in RTF format, so you will have to save them as text. We will use the same package, gensim, to train our custom model.
We will use the pickle package to save the model to disk. Note that pickle is part of Python's standard library, so no separate installation is required.
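As a quick illustration of how pickle serialization works, here is a minimal sketch that saves an object to disk and loads it back. The dictionary below is just a stand-in for a trained model; the file path is arbitrary:

```python
import os
import pickle
import tempfile

# A stand-in for a trained model object; pickle can serialize
# most Python objects, including gensim models.
data = {"holmes": [0.1, 0.2, 0.3]}

path = os.path.join(tempfile.gettempdir(), "model.pkl")

# Serialize the object to disk in binary mode.
with open(path, "wb") as f:
    pickle.dump(data, f)

# Deserialize it back into memory.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == data)  # True
```

Note that gensim models also provide their own `save` and `load` methods, which are preferable for very large models; pickle is shown here because the recipe uses it.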
We will read in all 20 books and use the text to create a word2vec model. Make sure all the books are located in one directory. Let's get started: