Python Natural Language Processing Cookbook

When we work with text, we frequently deal with compound sentences (sentences with two parts that are equally important) and complex sentences (sentences with one part depending on another). It is sometimes useful to split these composite sentences into their component clauses for easier processing down the line. This recipe uses the dependency parse from the previous recipe.
You will only need the spacy package in this recipe.
We will work with two sentences: "He eats cheese, but he won't eat ice cream" and "If it rains later, we won't be able to go to the park". Other sentences may turn out to be more complicated to deal with, and I leave it as an exercise for you to split such sentences. Follow these steps:
1. Import the spacy package:

    import spacy

2. Load the spacy engine:

    nlp = spacy.load('en_core_web_sm')

3. Set the sentence to He eats cheese, but he won't eat ice cream:

    sentence = "He eats cheese, but he won't eat ice cream."

4. Process the sentence with the spacy engine:

    doc = nlp(sentence)

5. Print the dependency parse information for each token, including its ancestors and children:

    for token in doc:
        ancestors = [t.text for t in token.ancestors]
        children = [t.text for t in token.children]
        print(token.text, "\t", token.i, "\t", token.pos_, "\t",
              token.dep_, "\t", ancestors, "\t", children)

6. Define a function that finds the root token of the sentence:

    def find_root_of_sentence(doc):
        root_token = None
        for token in doc:
            # spaCy marks the main verb with the ROOT dependency tag
            if (token.dep_ == "ROOT"):
                root_token = token
        return root_token

7. Find the root token of the example sentence:

    root_token = find_root_of_sentence(doc)

8. Define a function that finds the other verbs in the sentence:

    def find_other_verbs(doc, root_token):
        other_verbs = []
        for token in doc:
            ancestors = list(token.ancestors)
            # Keep verbs whose only ancestor is the root token
            if (token.pos_ == "VERB" and len(ancestors) == 1
                    and ancestors[0] == root_token):
                other_verbs.append(token)
        return other_verbs

9. Find the other verbs in the example sentence:

    other_verbs = find_other_verbs(doc, root_token)

10. We will use the following function to find the token spans for each verb:

    def get_clause_token_span_for_verb(verb, doc, all_verbs):
        first_token_index = len(doc)
        last_token_index = 0
        this_verb_children = list(verb.children)
        for child in this_verb_children:
            # Children that are verbs start their own clauses, so skip them
            if (child not in all_verbs):
                if (child.i < first_token_index):
                    first_token_index = child.i
                if (child.i > last_token_index):
                    last_token_index = child.i
        return (first_token_index, last_token_index)

11. Find the token spans for all the verbs, including the root:

    token_spans = []
    all_verbs = [root_token] + other_verbs
    for other_verb in all_verbs:
        (first_token_index, last_token_index) = \
            get_clause_token_span_for_verb(other_verb, doc, all_verbs)
        token_spans.append((first_token_index, last_token_index))

12. Create a Span object for each clause, and sort the sentence_clauses list at the end so that the clauses are in the order they appear in the sentence:

    sentence_clauses = []
    for token_span in token_spans:
        start = token_span[0]
        end = token_span[1]
        if (start < end):
            clause = doc[start:end]
            sentence_clauses.append(clause)
    sentence_clauses = sorted(sentence_clauses, key=lambda tup: tup[0])

13. Print the clauses for He eats cheese, but he won't eat ice cream:

    clauses_text = [clause.text for clause in sentence_clauses]
    print(clauses_text)
The result is as follows:
['He eats cheese,', "he won't eat ice cream"]
Important note
The code in this section will work for some cases, but not others; I encourage you to test it out on different cases and amend the code.
The way the code works is based on the way complex and compound sentences are structured. Each clause contains a verb, and one of the verbs is the main verb of the sentence (the root). The code looks for the root verb, always marked with the ROOT dependency tag in spaCy processing, and then looks for the other verbs in the sentence.
The code then uses the information about each verb's children to find the left and right boundaries of the clause. Using this information, the code then constructs the text of the clauses. A step-by-step explanation follows.
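The overall idea can also be tried without loading a model. The following is a minimal sketch, not the recipe's code: Tok is a hypothetical stand-in for a parsed token, and the head indices are a hand-written assumption about how the example sentence is parsed, not output captured from spaCy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tok:
    """Hypothetical stand-in for a parsed token (not a spaCy class)."""
    i: int      # position in the sentence
    text: str
    pos: str    # coarse part-of-speech tag
    head: int   # index of the parent token; the root points at itself


# Hand-written parse of the example sentence. The head indices are an
# assumption made for illustration, not output captured from spaCy.
WORDS = ["He", "eats", "cheese", ",", "but", "he", "wo", "n't",
         "eat", "ice", "cream", "."]
POS = ["PRON", "VERB", "NOUN", "PUNCT", "CCONJ", "PRON", "AUX", "PART",
       "VERB", "NOUN", "NOUN", "PUNCT"]
HEADS = [1, 1, 1, 1, 1, 8, 8, 8, 1, 10, 8, 8]
TOKENS = [Tok(i, w, p, h) for i, (w, p, h) in enumerate(zip(WORDS, POS, HEADS))]


def split_clauses(tokens):
    # The root is the token that is its own head in this toy encoding.
    root = next(t for t in tokens if t.head == t.i)
    # Other verbs: verbs attached directly to the root.
    verbs = [root] + [t for t in tokens
                      if t.pos == "VERB" and t.head == root.i and t != root]
    spans = []
    for verb in verbs:
        # Indices of the verb's children, skipping children that are verbs.
        kids = [t.i for t in tokens
                if t.head == verb.i and t != verb and t not in verbs]
        if kids:
            spans.append((min(kids), max(kids)))
    # As in the recipe, the end index is used as an exclusive slice bound.
    return [" ".join(t.text for t in tokens[start:end])
            for start, end in sorted(spans) if start < end]


print(split_clauses(TOKENS))
```

Because the sketch joins tokens with single spaces, the spacing of its output differs from what a spaCy Span's text attribute would give; only the clause boundaries are meant to be comparable.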
In step 1, we import the spacy package and in step 2, we load the spacy engine. In step 3, we set the sentence variable and in step 4, we process it using the spacy engine. In step 5, we print out the dependency parse information. It will help us determine how to split the sentence into clauses.
In step 6, we define the find_root_of_sentence function, which returns the token that has a dependency tag of ROOT. In step 7, we find the root of the sentence we are using as an example.
In step 8, we define the find_other_verbs function, which will find other verbs in the sentence. In this function, we look for tokens that have the VERB part of speech tag and have the root token as their only ancestor. In step 9, we apply this function.
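To see why "the root token is the only ancestor" picks out verbs attached directly to the root, it can help to compute ancestor chains by hand. This is a small sketch under stated assumptions: the ancestors helper below is hypothetical (it is not spaCy's Token.ancestors), and HEADS is an assumed, hand-written list of parent indices for the example sentence in which the root is its own head.

```python
def ancestors(index, heads):
    """Follow parent pointers from a token up to the root.

    Returns the indices of all ancestors, nearest first. In this toy
    encoding the root is the token whose head is itself.
    """
    chain = []
    while heads[index] != index:
        index = heads[index]
        chain.append(index)
    return chain


# Assumed parent indices for:
# He eats cheese , but he wo n't eat ice cream .
HEADS = [1, 1, 1, 1, 1, 8, 8, 8, 1, 10, 8, 8]
ROOT = 1  # "eats"

# Tokens whose only ancestor is the root, that is, direct dependents of the root.
direct_dependents = [i for i in range(len(HEADS))
                     if ancestors(i, HEADS) == [ROOT]]
print(direct_dependents)
print(ancestors(9, HEADS))  # "ice" sits deeper, so its chain is longer
```

In the recipe, the additional check on the VERB part of speech tag narrows these direct dependents down to the verbs that head the other clauses.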
In step 10, we define the get_clause_token_span_for_verb function, which will find the beginning and ending index for the verb. The function goes through all the verb's children; the leftmost child's index is the beginning index, while the rightmost child's index is the ending index for this verb's clause.
In step 11, we use the preceding function to find the clause indices for each verb. The token_spans variable contains a list of tuples, where the first tuple element is the beginning clause index and the second tuple element is the ending clause index.
In step 12, we create token Span objects for each clause in the sentence using the list of beginning and ending index pairs we created in step 11. We get the Span object by slicing the Doc object and then appending the resulting Span objects to a list. As a final step, we sort the list to make sure that the clauses in the list are in the same order as in the sentence.
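One detail worth checking when you amend the code is how the ending index behaves when slicing. The sketch below uses a plain Python list of words instead of a Doc, on the assumption that slicing a Doc follows the same start-inclusive, end-exclusive convention; the token_spans values are hypothetical and deliberately listed out of order.

```python
words = ["He", "eats", "cheese", ",", "but", "he", "wo", "n't",
         "eat", "ice", "cream", "."]

# Hypothetical (start, end) pairs, listed out of sentence order on purpose.
token_spans = [(5, 11), (0, 4)]

exclusive_clauses = []
inclusive_clauses = []
for start, end in sorted(token_spans):  # sorting restores sentence order
    if start < end:
        # Slice as written in the recipe: the token at `end` is left out.
        exclusive_clauses.append(words[start:end])
        # Variant that keeps the token at `end` as well.
        inclusive_clauses.append(words[start:end + 1])

print(exclusive_clauses)
print(inclusive_clauses)
```

With these assumed spans, the exclusive slice stops before the word at the ending index, while the inclusive variant keeps it. Which behavior you want depends on whether the rightmost child (here a conjunction or a final period) belongs in the clause text.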
In step 13, we print the clauses in our sentence. You will notice that the word but is missing, since its parent is the root verb eats, although it appears in the other clause. The exercise of including but is left to you.