Chapter 3: Text Wrangling and Preprocessing | The Handbook of NLP with Gensim

The Handbook of NLP with Gensim

By: Chris Kuo

Overview of this book

Navigating NLP research and applying it in practice can be a formidable task, and The Handbook of NLP with Gensim is designed to make it easier. This book demystifies NLP and equips you with hands-on strategies spanning healthcare, e-commerce, finance, and more so that you can apply Gensim in real-world scenarios. You’ll begin by exploring the motives and techniques for extracting information from text, such as bag-of-words, TF-IDF, and word embeddings. The book then guides you through topic modeling with Latent Semantic Analysis (LSA) for dimensionality reduction and discovering latent semantic relationships in text data, Latent Dirichlet Allocation (LDA) for probabilistic topic modeling, and Ensemble LDA for improving the stability and accuracy of topic models. Next, you’ll learn text summarization techniques with Word2Vec and Doc2Vec, build the modeling pipeline, and optimize models by tuning their hyperparameters. As you get acquainted with practical applications in various industries, this book will inspire you to design innovative projects. Alongside topic modeling, you’ll also explore named entity handling and NER tools, modeling procedures, and tools for effective topic modeling applications. By the end of this book, you’ll have mastered the techniques essential to create applications with Gensim and integrate NLP into your business processes.

Preface

Why read this book?

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Conventions used

Get in touch

Share Your Thoughts

Download a free PDF copy of this book

Part 1: NLP Basics

Chapter 1: Introduction to NLP

Introduction to natural language processing

NLU + NLG = NLP

Gensim and its NLP modeling techniques

Topic modeling with BERTopic

Common NLP Python modules included in this book

Summary

Questions

References

Chapter 2: Text Representation

Technical requirements

What word embedding is

Simple encoding methods

What TF-IDF is

Shining applications of BoW and TF-IDF

Coding – BoW

Coding – Bag-of-N-grams

Coding – TF-IDF

Summary

Questions

References

Chapter 3: Text Wrangling and Preprocessing

Technical requirements

Key steps in NLP preprocessing

Coding with spaCy

Coding with NLTK

Coding with Gensim

Building a pipeline with spaCy

Summary

Questions

References

Part 2: Latent Semantic Analysis/Latent Semantic Indexing

Chapter 4: Latent Semantic Analysis with scikit-learn

Technical requirements

Understanding matrix operations

Understanding a transformation matrix

Understanding eigenvectors and eigenvalues

An introduction to SVD

Coding truncatedSVD with scikit-learn

Using TruncatedSVD for LSI with real data

Summary

Questions

Chapter 5: Cosine Similarity

Technical requirements

What is cosine similarity?

How cosine similarity is used in images

How to compute cosine similarity with scikit-learn

Summary

Questions

References

Chapter 6: Latent Semantic Indexing with Gensim

Technical requirements

Performing text preprocessing

Performing word embedding with BoW and TF-IDF

Modeling with Gensim

Using the coherence score to find the optimal number of topics

Saving the model for production

Using the model as an information retrieval tool

Summary

Questions

References

Part 3: Word2Vec and Doc2Vec

Chapter 7: Using Word2Vec

Technical requirements

Introduction to Word2Vec

Introduction to Skip-Gram (SG)

Introduction to CBOW

Using a pretrained model for semantic search

Adding and subtracting words/concepts

Visualizing Word2Vec with TensorBoard

Training your own Word2Vec model in CBOW and Skip-Gram

Visualizing your Word2Vec model with t-SNE

Comparing Word2Vec with Doc2Vec, GloVe, and fastText

Summary

Questions

References

Chapter 8: Doc2Vec with Gensim

Technical requirements

From Word2Vec to Doc2Vec

PV-DBOW

PV-DM

The real-world applications of Doc2Vec

Doc2Vec modeling with Gensim

Putting the model into production

Tips on building a good Doc2Vec model

Summary

Questions

References

Part 4: Topic Modeling with Latent Dirichlet Allocation

Chapter 9: Understanding Discrete Distributions

Technical requirements

The basics of discrete probability distributions

Bernoulli distributions

Binomial distributions

Multinomial distributions

Beta distributions

Dirichlet distributions

Summary

Questions

References

Chapter 10: Latent Dirichlet Allocation

What is generative modeling?

Understanding the idea behind LDA

Understanding the structure of LDA

Variational inference

Variational E-M

Variational E-M versus Gibbs sampling

Summary

Questions

References

Chapter 11: LDA Modeling

Technical requirements

Text preprocessing

Experimenting with LDA modeling

Building LDA models with a different number of topics

Determining the optimal number of topics

Using the model to score new documents

Summary

Questions

References

Chapter 12: LDA Visualization

Technical requirements

Designing an infographic

Data visualization with pyLDAvis

Summary

Questions

References

Chapter 13: The Ensemble LDA for Model Stability

Technical requirements

From LDA to Ensemble LDA

The process of Ensemble LDA

Understanding DBSCAN and CBDBSCAN

Building an Ensemble LDA model with Gensim

Summary

Questions

References

Part 5: Comparison and Applications

Chapter 14: LDA and BERTopic

Technical requirements

Understanding the Transformer model

Understanding BERT

Describing how BERTopic works

Building a BERTopic model

Reviewing the results of BERTopic

Visualizing the BERTopic model

Predicting new documents

Using the modular property of BERTopic

Comparing BERTopic with LDA

Summary

Questions

References

Chapter 15: Real-World Use Cases

Word2Vec for medical fraud detection

Comparing LDA/NMF/BERTopic on Twitter/X posts

Interpretable text classification from electronic health records

BERTopic for legal documents

Word2Vec for 10-K financial documents to the SEC

Summary

References

Assessments

Chapter 1 – Introduction to NLP

Chapter 2 – Text Representation

Chapter 3 – Text Wrangling and Preprocessing

Chapter 4 – Latent Semantic Analysis with scikit-learn

Chapter 5 – Cosine Similarity

Chapter 6 – Latent Semantic Indexing with Gensim

Chapter 7 – Using Word2Vec

Chapter 8 – Doc2Vec with Gensim

Chapter 9 – Understanding Discrete Distributions

Chapter 10 – Latent Dirichlet Allocation

Chapter 11 – LDA Modeling

Chapter 12 – LDA Visualization

Chapter 13 – The Ensemble LDA for Model Stability

Chapter 14 – LDA and BERTopic

Index

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts

Download a free PDF copy of this book
