
Noun chunks are known in linguistics as noun phrases. They represent nouns and any words that depend on and accompany them. For example, in the sentence The big red apple fell on the scared cat, the noun chunks are the big red apple and the scared cat. Extracting these noun chunks is instrumental to many other downstream NLP tasks, such as named entity recognition and processing entities and the relationships between them. In this recipe, we will explore how to extract noun chunks from a piece of text.
We will be using the spacy package, which has a function for extracting noun chunks, and the text from the sherlock_holmes_1.txt file as an example. In this section, we will use another spaCy language model, en_core_web_md. Follow the instructions in the Technical requirements section to learn how to download it.
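As a quick illustration of the concept, here is a minimal sketch (not part of the recipe code) that extracts the noun chunks from the example sentence in the introduction; it assumes the small en_core_web_sm model is installed, though any English model will work:

import spacy

# A minimal sketch; en_core_web_sm is an assumption, any installed
# English spaCy model will do
nlp = spacy.load('en_core_web_sm')
doc = nlp("The big red apple fell on the scared cat.")
for noun_chunk in doc.noun_chunks:
    print(noun_chunk.text)

This should print The big red apple and the scared cat, the two noun chunks mentioned previously.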
Use the following steps to get the noun chunks from a piece of text:
1. Import the spacy package and the read_text_file function from the code files of Chapter 1:

import spacy
from Chapter01.dividing_into_sentences import read_text_file
Important note
If you are importing functions from other chapters, run the code from the directory that precedes Chapter02 and use the python -m Chapter02.extract_noun_chunks command.
2. Read in the sherlock_holmes_1.txt file:

text = read_text_file("sherlock_holmes_1.txt")
3. Initialize the spacy engine and then use it to process the text:

nlp = spacy.load('en_core_web_md')
doc = nlp(text)
4. The noun chunks are contained in the doc.noun_chunks class variable. We can print out the chunks as follows:

for noun_chunk in doc.noun_chunks:
    print(noun_chunk.text)
This is the partial result. See this book's GitHub repository for the full printout, which can be found in the Chapter02/all_text_noun_chunks.txt file:

Sherlock Holmes
she
the woman
I
him
her
any other name
his eyes
she
the whole
…
The spaCy Doc object, as we saw in the previous recipe, contains information about the grammatical relationships between words in a sentence. Using this information, spaCy determines the noun phrases, or chunks, contained in the text.
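To see the grammatical information that noun chunk detection relies on, here is a small, hedged sketch (not from the recipe itself) that prints each token's dependency label and head word; the model choice is an assumption:

import spacy

# A sketch of inspecting the dependency parse that underlies noun chunks
nlp = spacy.load('en_core_web_sm')
doc = nlp("The big red apple fell on the scared cat.")
for token in doc:
    # token.dep_ is the dependency label; token.head is the word it attaches to
    print(token.text, token.dep_, token.head.text)

Determiners and adjectives such as The, big, and red attach to the noun apple, which is how spaCy groups them into a single noun chunk.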
In step 1, we import spacy and the read_text_file function from the Chapter01 module. In step 2, we read in the text from the sherlock_holmes_1.txt file.
In step 3, we initialize the spacy engine with a different model, en_core_web_md, which is larger and will most likely give better results. There is also the large model, en_core_web_lg, which is even larger; it will give better results, but the processing will be slower. After loading the engine, we run it on the text we loaded in step 2.
In step 4, we print out the noun chunks that appear in the text. As you can see, spaCy correctly picks up the pronouns, nouns, and noun phrases in the text.
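If you only want the multi-word phrases and not the pronoun chunks, one possible filter is sketched below; the pos_ check is an assumption about how you might want to filter, not something the recipe itself does:

# A sketch of filtering out pronoun chunks; assumes doc was created
# as in step 3 above
for noun_chunk in doc.noun_chunks:
    # Skip chunks whose root word is a pronoun, such as she or him
    if noun_chunk.root.pos_ != "PRON":
        print(noun_chunk.text)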
Noun chunks are spaCy Span objects and have all their properties. See the official documentation at https://spacy.io/api/span.
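For instance, every noun chunk carries the span label that marks it as a noun phrase. Here is a minimal sketch, assuming a doc object such as the one from the previous section:

for noun_chunk in doc.noun_chunks:
    # label_ holds the phrase label; for noun chunks this is "NP"
    print(noun_chunk.text, noun_chunk.label_)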
Let's explore some properties of noun chunks:
1. Import the spacy package:

import spacy
2. Load the spacy engine:

nlp = spacy.load('en_core_web_sm')
sentence = "All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind."
4. Process the sentence using the spacy engine:

doc = nlp(sentence)
5. Print out the noun chunks:

for noun_chunk in doc.noun_chunks:
    print(noun_chunk.text)
The result is as follows:

All emotions
his cold, precise but admirably balanced mind
6. To see where in the sentence each noun chunk begins and ends, print out its start and end properties:

for noun_chunk in doc.noun_chunks:
    print(noun_chunk.text, "\t", noun_chunk.start, "\t", noun_chunk.end)
The result will be as follows:
All emotions 	 0 	 2
his cold, precise but admirably balanced mind 	 11 	 19
7. To print out the sentence that each noun chunk belongs to, use the sent property:

for noun_chunk in doc.noun_chunks:
    print(noun_chunk.text, "\t", noun_chunk.sent)
Predictably, this results in the following:
All emotions 	 All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
his cold, precise but admirably balanced mind 	 All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.
8. To print out the root, or main, word of each noun chunk, use the root property:

for noun_chunk in doc.noun_chunks:
    print(noun_chunk.text, "\t", noun_chunk.root.text)
This will produce the following output:

All emotions 	 emotions
his cold, precise but admirably balanced mind 	 mind
9. Another very useful property of Span is similarity, which is the semantic similarity of different texts. Let's try it out. We will load another noun chunk, emotions, and process it using spacy:

other_span = "emotions"
other_doc = nlp(other_span)
10. We can now print out the similarity of each noun chunk in the sentence to the word emotions:

for noun_chunk in doc.noun_chunks:
    print(noun_chunk.text, noun_chunk.similarity(other_doc))
This is the result:
UserWarning: [W007] The model you're using has no word vectors loaded, so the result of the Span.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.
  print(noun_chunk.text, noun_chunk.similarity(other_doc))
All emotions 0.373233604751925
his cold, precise but admirably balanced mind 0.030945358271699138
11. To get more meaningful similarity scores, we need a different spacy model, which contains vector representations for words. Substitute this line for the line in step 2; the rest of the code will remain the same:

nlp = spacy.load('en_core_web_md')
Now the result will be as follows:

All emotions 0.8876554549427152
that one 0.37378867755652434
his cold, precise but admirably balanced mind 0.5102475977383759
The result shows that the similarity of All emotions to emotions is very high, at 0.89, while the similarity of his cold, precise but admirably balanced mind to emotions is lower, at 0.51. We can also see that the larger model detects another noun chunk, that one.
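Since noun chunks are Span objects, they can also be compared to each other directly, not just to a separately processed text. Here is a hedged sketch, assuming the en_core_web_md engine from the previous step and the sentence variable from step 3:

# A sketch comparing two noun chunks to each other; nlp and sentence
# are assumed from the steps above
doc = nlp(sentence)
noun_chunks = list(doc.noun_chunks)
# Span.similarity accepts another Span, so no extra nlp() call is needed
print(noun_chunks[0].similarity(noun_chunks[-1]))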
Important note
A larger spaCy model, such as en_core_web_md, takes up more space, but is more precise.
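If the medium model is not installed yet, it can be downloaded as described in the Technical requirements section; one way to do this from within Python is sketched below (the command-line equivalent is python -m spacy download en_core_web_md):

import spacy

# A sketch of downloading the medium model from within Python;
# spacy.cli.download wraps the spacy download command
spacy.cli.download("en_core_web_md")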
The topic of semantic similarity will be explored in more detail in Chapter 3, Representing Text: Capturing Semantics.