Natural Language Processing with Java and LingPipe Cookbook

Introduction


This chapter introduces the LingPipe toolkit in the context of its competition and then dives straight into text classifiers. Text classifiers assign a category to text; for example, they assign a language to a sentence or tell us whether a tweet is positive, negative, or neutral in sentiment. This chapter covers how to use, evaluate, and create text classifiers based on language models. These are the simplest machine learning-based classifiers in the LingPipe API. What makes them simple is that they operate over characters only; later classifiers will have notions of words/tokens and more. Don't be fooled, however: character language models are ideal for language identification, and they were the basis of some of the world's earliest commercial sentiment systems.
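To make "operating over characters only" concrete, here is a minimal sketch of a character n-gram language-model classifier in plain Java. It is an illustration under stated assumptions, not LingPipe's implementation. The class name `CharLmClassifierSketch`, the add-one smoothing over a 65,536-character alphabet, the uniform treatment of categories, and the toy training sentences are all assumptions made for this example. Each category gets its own character model, and a text is assigned to the category whose model gives it the highest log probability.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Toy character n-gram language-model classifier; an illustration, not LingPipe code. */
final class CharLmClassifierSketch {
    // Assumption: any 16-bit Java char may follow a context, so smooth over 65,536 outcomes.
    private static final double ALPHABET_SIZE = 65536.0;

    private final int order; // n-gram length in characters, e.g. 3 for trigrams
    // category -> counts of both contexts (order-1 chars) and full n-grams (order chars)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    CharLmClassifierSketch(int order) {
        if (order < 1) {
            throw new IllegalArgumentException("order must be >= 1");
        }
        this.order = order;
    }

    /** Adds the character n-grams of one labeled text to the category's model. */
    void train(String category, String text) {
        Map<String, Integer> model = counts.computeIfAbsent(category, k -> new HashMap<>());
        String padded = pad(text);
        for (int i = order - 1; i < padded.length(); i++) {
            model.merge(padded.substring(i - order + 1, i), 1, Integer::sum);     // context
            model.merge(padded.substring(i - order + 1, i + 1), 1, Integer::sum); // context + next char
        }
    }

    /** Log probability of the text under one category's model, with add-one smoothing. */
    double logProb(String category, String text) {
        Map<String, Integer> model = counts.getOrDefault(category, Collections.emptyMap());
        String padded = pad(text);
        double sum = 0.0;
        for (int i = order - 1; i < padded.length(); i++) {
            int contextCount = model.getOrDefault(padded.substring(i - order + 1, i), 0);
            int ngramCount = model.getOrDefault(padded.substring(i - order + 1, i + 1), 0);
            sum += Math.log((ngramCount + 1.0) / (contextCount + ALPHABET_SIZE));
        }
        return sum;
    }

    /** Returns the category whose model gives the text the highest log probability. */
    String classify(String text) {
        if (counts.isEmpty()) {
            throw new IllegalStateException("no training data");
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : counts.keySet()) {
            double score = logProb(category, text);
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    /** Prefixes order-1 spaces so the first characters also have a full context. */
    private String pad(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < order - 1; i++) {
            sb.append(' ');
        }
        return sb.append(text).toString();
    }

    public static void main(String[] args) {
        CharLmClassifierSketch classifier = new CharLmClassifierSketch(3);
        classifier.train("en", "the quick brown fox jumps over the lazy dog");
        classifier.train("en", "this is a sentence in english");
        classifier.train("es", "el zorro salta sobre el perro perezoso");
        classifier.train("es", "esta es una frase en castellano");
        System.out.println(classifier.classify("the dog is in the house"));
        System.out.println(classifier.classify("el perro es una frase"));
    }
}
```

No tokenizer appears anywhere in the sketch, which is why this family of classifiers is a natural fit for language identification. A production model would use a better smoothing scheme and far more training data than four sentences.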

This chapter also covers crucial evaluation infrastructure, because almost everything we do turns out to be a classifier at some level of interpretation. So, do not skimp on cross validation or on the definitions of precision, recall, and F-measure.
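As a quick reference for those terms, the following sketch computes precision, recall, and the balanced F-measure (F1) from counts of true positives, false positives, and false negatives, and shows one simple way to pick the held-out items for a cross-validation fold. The class name `EvalSketch`, the convention of returning 0.0 when a denominator is zero, and the round-robin fold assignment are assumptions made for this illustration; they are not LingPipe's evaluation classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Evaluation arithmetic for one target category; a sketch, not LingPipe's evaluator classes. */
final class EvalSketch {
    private EvalSketch() {}

    /** Fraction of the system's positive guesses that were correct: TP / (TP + FP). */
    static double precision(int tp, int fp) {
        return (tp + fp == 0) ? 0.0 : (double) tp / (tp + fp); // assumption: 0.0 when undefined
    }

    /** Fraction of the truly positive items that the system found: TP / (TP + FN). */
    static double recall(int tp, int fn) {
        return (tp + fn == 0) ? 0.0 : (double) tp / (tp + fn); // assumption: 0.0 when undefined
    }

    /** Balanced F-measure: the harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        return (precision + recall == 0.0)
                ? 0.0
                : 2.0 * precision * recall / (precision + recall);
    }

    /** Indices held out for testing in the given fold; all other indices are used for training. */
    static List<Integer> testIndices(int numItems, int numFolds, int fold) {
        if (numFolds < 1 || fold < 0 || fold >= numFolds) {
            throw new IllegalArgumentException("need 0 <= fold < numFolds");
        }
        List<Integer> heldOut = new ArrayList<>();
        for (int i = fold; i < numItems; i += numFolds) { // round-robin assignment (assumption)
            heldOut.add(i);
        }
        return heldOut;
    }

    public static void main(String[] args) {
        int tp = 8, fp = 2, fn = 8; // made-up counts for illustration
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        System.out.println("precision=" + p + " recall=" + r + " f1=" + f1(p, r));
        System.out.println("fold 1 of 5 over 10 items tests on " + testIndices(10, 5, 1));
    }
}
```

With the made-up counts above, the system is right about most of what it labels positive but finds only half of the true positives, and F1 sits between the two, pulled toward the lower value.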

The best part is that you will learn how to programmatically access Twitter data to train up and evaluate your own classifiers. There is a boring bit concerning the mechanics of reading and writing LingPipe objects from/to disk, but other than that, this is a fun chapter. The goal of this chapter is to get you up and running quickly with the basic care and feeding of machine-learning techniques in the domain of natural language processing (NLP).

LingPipe is a Java toolkit for NLP-oriented applications. This book will show you how to solve common NLP problems with LingPipe in a problem/solution format that allows developers to quickly deploy solutions to common tasks.

LingPipe and its installation

LingPipe 1.0 was released in 2003 as a dual-licensed open source NLP Java library. At the time of writing this book, we are coming up on 2000 hits on Google Scholar and have thousands of commercial installs, ranging from universities to government agencies to Fortune 500 companies.

Current licensing is either AGPL (http://www.gnu.org/licenses/agpl-3.0.html) or our commercial license that offers more traditional features such as indemnification and non-sharing of code as well as support.

Projects similar to LingPipe

Nearly all NLP projects have awful acronyms, so we will lay bare our own. LingPipe is short for linguistic pipeline, which was the name of the CVS directory in which Bob Carpenter put the initial code.

LingPipe has lots of competition in the NLP space. The following are some of the more popular ones with a focus on Java:

  • NLTK: This is the dominant Python library for NLP.

  • OpenNLP: This is an Apache project built by a bunch of smart folks.

  • JavaNLP: This is a rebranding of Stanford NLP tools, again built by a bunch of smart folks.

  • ClearTK: This is a University of Colorado Boulder toolkit that wraps lots of popular machine learning frameworks.

  • DKPro: Technische Universität Darmstadt in Germany produced this UIMA-based project, which wraps many common components in a useful manner. UIMA is a common framework for NLP.

  • GATE: GATE is really more of a framework than competition. In fact, LingPipe components are part of their standard distribution. It has a nice graphical "hook the components up" capability.

  • Learning Based Java (LBJ): LBJ is a special-purpose programming language based on Java, and it is geared toward machine learning and NLP. It was developed at the Cognitive Computation Group of the University of Illinois at Urbana-Champaign.

  • Mallet: This name is short for MAchine Learning for LanguagE Toolkit. Apparently, reasonable acronym generation is in short supply these days. Smart folks built this too.

Here are some pure machine learning frameworks that have broader appeal but are not necessarily tailored for NLP tasks:

  • Vowpal Wabbit: This is very focused on scalability around Logistic Regression, Latent Dirichlet Allocation, and so on. Smart folks drive this.

  • Factorie: This is from UMass Amherst and is an alternative offering to Mallet. Initially, it focused primarily on graphical models, but now it also supports NLP tasks.

  • Support Vector Machine (SVM): SVMlight and LIBSVM are very popular SVM implementations. There is no SVM implementation in LingPipe, because logistic regression fills the same role.

So, why use LingPipe?

It is reasonable to ask why you should choose LingPipe given the outstanding free competition mentioned earlier. There are a few reasons:

  • Documentation: The class-level documentation in LingPipe is very thorough. If the work is based on academic work, that work is cited. Algorithms are laid out, the underlying math is explained, and explanations are precise. What the documentation lacks is a "how to get things done" perspective; however, this is covered in this book.

  • Enterprise/server optimized: LingPipe is designed from the ground up for server applications, not for command-line usage (though we will be using the command line extensively throughout the book).

  • Coded in the Java dialect: LingPipe is a native Java API that is designed according to standard Java class design principles (see Joshua Bloch's Effective Java, published by Addison-Wesley), such as consistency checks on construction, immutability, type safety, backward-compatible serializability, and thread safety.

  • Error handling: Considerable attention is paid to error handling through exceptions and configurable message streams for long-running processes.

  • Support: LingPipe has paid employees whose job is to answer your questions and make sure that LingPipe is doing its job. The rare bug is typically fixed in under 24 hours. They respond to questions very quickly and are very willing to help people.

  • Consulting: You can hire experts in LingPipe to build systems for you. Generally, they teach developers how to build NLP systems as a byproduct.

  • Consistency: The LingPipe API was designed by one person, Bob Carpenter, with an obsession for consistency. While it is not perfect, you will find a regularity and an eye for design that can be missing in academic efforts. Graduate students come and go, and the resulting contributions to university toolkits can be quite varied.

  • Open source: There are many commercial providers, but their software is a black box. The open source nature of LingPipe provides transparency and confidence that the code is doing what we ask it to do. When the documentation fails, it is a huge relief to have access to code to understand it better.
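The class design principles named under "Coded in the Java dialect" can be illustrated with a small value class. This is a hedged sketch of the general Effective Java style only. `ScoredCategorySketch` is a hypothetical class invented for this example and is not part of the LingPipe API.

```java
import java.util.Objects;

/** Hypothetical value class (not part of LingPipe) showing the design principles named above. */
final class ScoredCategorySketch {
    // Final fields: the state cannot change after construction, so instances
    // can be shared between threads without synchronization.
    private final String category;
    private final double score;

    /** Consistency checks on construction: reject bad arguments before any object exists. */
    ScoredCategorySketch(String category, double score) {
        this.category = Objects.requireNonNull(category, "category");
        if (Double.isNaN(score)) {
            throw new IllegalArgumentException("score must be a number");
        }
        this.score = score;
    }

    String category() {
        return category;
    }

    double score() {
        return score;
    }

    /** Returns a new object instead of mutating this one. */
    ScoredCategorySketch withScore(double newScore) {
        return new ScoredCategorySketch(category, newScore);
    }
}
```

Because invalid objects can never be constructed and valid ones never change, callers do not need defensive checks or locks when passing such objects around, which matters in long-running server applications.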

Downloading the book code and data

You will need to download the source code for this cookbook, along with supporting models and data, from http://alias-i.com/book.html. Untar and uncompress it using the following command:

tar -xvzf lingpipeCookbook.tgz

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Alternatively, your operating system might provide other ways of extracting the archive. All recipes assume that you are running the commands in the resulting cookbook directory.

Downloading LingPipe

Downloading LingPipe is not strictly necessary, but you will likely want to be able to look at the source and have a local copy of the Javadoc.

The download and installation instructions for LingPipe can be found at http://alias-i.com/lingpipe/web/install.html.

The examples in this chapter use command-line invocation, but it is assumed that the reader has sufficient development skills to map the examples to their preferred IDE, Ant, or other environment.