Natural Language Processing with Java and LingPipe Cookbook

The Jaccard distance


The Jaccard distance is a popular and efficient way of comparing strings. It operates at the token level: both strings are tokenized, and the number of distinct tokens they share is divided by the number of distinct tokens that occur in either string. That ratio is the Jaccard proximity; the distance is 1 minus the proximity. In the Eliminate near duplicates with the Jaccard distance recipe in Chapter 1, Simple Classifiers, we applied the distance to eliminate near-duplicate tweets. This recipe goes into a bit more detail and shows how it is computed.

A distance of 0 is a perfect match, that is, the strings share all their terms, and a distance of 1 is a perfect mismatch, that is, the strings have no terms in common. Remember that proximity and distance are complements that always sum to 1, so proximity also ranges from 0 to 1, but in the opposite direction. A proximity of 1 is a perfect match, and a proximity of 0 is a perfect mismatch:

proximity = count(common tokens) / count(distinct tokens in either string)
distance = 1 – proximity
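
To make the arithmetic concrete, here is a minimal sketch of the two formulas in plain Java. It does not use LingPipe: the class name JaccardSketch and its methods are hypothetical, and a lowercase-and-split-on-whitespace tokenizer stands in for whatever TokenizerFactory you would actually supply. Treating two empty strings as a perfect match is also an assumption made for this illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not part of LingPipe.
class JaccardSketch {

    // Stand-in tokenizer (assumption): lowercase, then split on whitespace.
    static Set<String> tokenSet(String text) {
        Set<String> tokens = new HashSet<>();
        for (String token : text.trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // proximity = |common tokens| / |distinct tokens in either string|
    static double proximity(String a, String b) {
        Set<String> tokensA = tokenSet(a);
        Set<String> tokensB = tokenSet(b);
        if (tokensA.isEmpty() && tokensB.isEmpty()) {
            return 1.0; // assumption: two empty strings count as a perfect match
        }
        Set<String> common = new HashSet<>(tokensA);
        common.retainAll(tokensB);   // set intersection
        Set<String> all = new HashSet<>(tokensA);
        all.addAll(tokensB);         // set union
        return (double) common.size() / all.size();
    }

    // distance = 1 - proximity
    static double distance(String a, String b) {
        return 1.0 - proximity(a, b);
    }

    public static void main(String[] args) {
        String s1 = "the quick brown fox";
        String s2 = "the quick red fox";
        // 3 shared tokens (the, quick, fox) out of 5 distinct tokens overall,
        // so the proximity is 3/5 and the distance is 2/5.
        System.out.println("proximity = " + proximity(s1, s2));
        System.out.println("distance  = " + distance(s1, s2));
    }
}
```

Because the computation works on token sets, repeated tokens count only once, and the result depends entirely on how the strings are tokenized.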

The tokens are generated by TokenizerFactory, which is passed in during...