Deep Learning for Natural Language Processing
Solution:
We begin by setting up the data pre-processing pipeline. For each author, we aggregate all of their known papers into a single long text. We assume that style does not change across the papers, so a single long text is equivalent to many short ones, yet it is much easier to deal with programmatically.
For each paper of each author, we perform the following steps:
import numpy as np
import os
from sklearn.model_selection import train_test_split
# Classes for A/B/Unknown
A = 0
B = 1
UNKNOWN = -1
def preprocess_text(file_path):
    with open(file_path, 'r') as f:
        lines = f.readlines()
    # Skip the header line, join the rest, lowercase, and strip the author names
    # so the model cannot simply learn to spot them
    text = ' '.join(lines[1:]).replace("\n", ' ').lower().replace('hamilton', '').replace('madison', '')
    # Collapse all runs of whitespace into single spaces
    text = ' '.join(text.split())
    return text
# Concatenate all the papers known to be written by A/B into a single long text
all_authorA, all_authorB = '',''
for x in os.listdir('./papers/A/'):
    all_authorA += preprocess_text('./papers/A/' + x)
for x in os.listdir('./papers/B/'):
    all_authorB += preprocess_text('./papers/B/' + x)
# Print lengths of the large texts
print("AuthorA text length: {}".format(len(all_authorA)))
print("AuthorB text length: {}".format(len(all_authorB)))
The output for this should be as follows:
The next step is to break each author's long text into many short sequences. As described above, we choose the sequence length empirically and use it throughout the model's lifecycle. We obtain our full dataset by labelling each sequence with its author.
To break the long texts into smaller sequences we use the Tokenizer class from the Keras framework. In particular, note that we set it up to tokenize by characters rather than words.
from keras.preprocessing.text import Tokenizer
# Hyperparameter - sequence length to use for the model
SEQ_LEN = 30
def make_subsequences(long_sequence, label, sequence_length=SEQ_LEN):
    len_sequences = len(long_sequence)
    # Slide a window of sequence_length characters over the long sequence, one step at a time
    X = np.zeros(((len_sequences - sequence_length) + 1, sequence_length))
    y = np.zeros((X.shape[0], 1))
    for i in range(X.shape[0]):
        X[i] = long_sequence[i:i + sequence_length]
        y[i] = label
    return X, y
# We use the Tokenizer class from Keras to convert the long texts into a sequence of characters (not words)
tokenizer = Tokenizer(char_level=True)
# Make sure to fit all characters in texts from both authors
tokenizer.fit_on_texts([all_authorA + all_authorB])
authorA_long_sequence = tokenizer.texts_to_sequences([all_authorA])[0]
authorB_long_sequence = tokenizer.texts_to_sequences([all_authorB])[0]
# Convert the long sequences into sequence and label pairs
X_authorA, y_authorA = make_subsequences(authorA_long_sequence, A)
X_authorB, y_authorB = make_subsequences(authorB_long_sequence, B)
# Print sizes of available data
print("Number of characters: {}".format(len(tokenizer.word_index)))
print('author A sequences: {}'.format(X_authorA.shape))
print('author B sequences: {}'.format(X_authorB.shape))
The output should be as follows:
For comparison, we also count the total and unique words across both authors' texts:
# Calculate the total number of words and the number of unique words in the combined text
word_tokenizer = Tokenizer()
word_tokenizer.fit_on_texts([all_authorA, all_authorB])
print("Total word count: ", len((all_authorA + ' ' + all_authorB).split(' ')))
print("Total number of unique words: ", len(word_tokenizer.word_index))
The output should be as follows:
We now proceed to create our training and validation sets.
# Stack the sequences and labels from both authors
X = np.vstack((X_authorA, X_authorB))
y = np.vstack((y_authorA, y_authorB))
# Break data into train and test sets
X_train, X_val, y_train, y_val = train_test_split(X,y, train_size=0.8)
# Data is to be fed into RNN - ensure that the actual data is of size [batch size, sequence length]
X_train = X_train.reshape(-1, SEQ_LEN)
X_val = X_val.reshape(-1, SEQ_LEN)
# Print the shapes of the train, validation and test sets
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_validate shape: {}".format(X_val.shape))
print("y_validate shape: {}".format(y_val.shape))
The output is as follows:
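Since train_test_split shuffles the data but does not stratify by default, it can be worth checking that both authors are roughly equally represented in each split. A minimal sketch of such an optional check:
# Optional sanity check: fraction of Author B (label 1) sequences in each split;
# values close to the overall proportion mean the split is reasonably balanced
print("Author B fraction in train: {:.3f}".format(y_train.mean()))
print("Author B fraction in validation: {:.3f}".format(y_val.mean()))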
Finally, we construct the model graph and perform the training procedure.
from keras.layers import SimpleRNN, Embedding, Dense
from keras.models import Sequential
from keras.optimizers import SGD, Adadelta, Adam
Embedding_size = 100
RNN_size = 256
model = Sequential()
model.add(Embedding(len(tokenizer.word_index)+1, Embedding_size, input_length=SEQ_LEN))
model.add(SimpleRNN(RNN_size, return_sequences=False))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics = ['accuracy'])
model.summary()
The output is as follows:
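As a sanity check on the summary, the per-layer parameter counts can be reproduced by hand. A minimal sketch, assuming the character vocabulary reported by the tokenizer earlier:
# Reproduce the parameter counts shown by model.summary()
vocab = len(tokenizer.word_index) + 1                    # +1 for the reserved index 0
embedding_params = vocab * Embedding_size                # one embedding vector per character
rnn_params = RNN_size * (Embedding_size + RNN_size + 1)  # input kernel + recurrent kernel + bias
dense_params = RNN_size + 1                              # weights + bias of the sigmoid unit
print(embedding_params, rnn_params, dense_params)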
Batch_size = 4096
Epochs = 20
model.fit(X_train, y_train, batch_size=Batch_size, epochs=Epochs, validation_data=(X_val, y_val))
The output is as follows:
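If the trained network is to be reused later without retraining, it can be saved to disk together with the fitted tokenizer. A minimal sketch, with arbitrarily chosen file names:
import pickle
# Persist the model and the character tokenizer for later reuse
model.save('author_rnn.h5')
with open('char_tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)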
We now do this for all the papers in the Unknown folder:
for x in os.listdir('./papers/Unknown/'):
    unknown = preprocess_text('./papers/Unknown/' + x)
    unknown_long_sequences = tokenizer.texts_to_sequences([unknown])[0]
    X_sequences, _ = make_subsequences(unknown_long_sequences, UNKNOWN)
    X_sequences = X_sequences.reshape((-1, SEQ_LEN))
    # Each subsequence casts a vote for the author it is predicted to belong to
    y = model.predict(X_sequences)
    y = y > 0.5
    votes_for_authorA = np.sum(y == 0)
    votes_for_authorB = np.sum(y == 1)
    print("Paper {} is predicted to have been written by {}, {} to {}".format(
        x.replace('paper_', '').replace('.txt', ''),
        ("Author A" if votes_for_authorA > votes_for_authorB else "Author B"),
        max(votes_for_authorA, votes_for_authorB), min(votes_for_authorA, votes_for_authorB)))
The output is as follows:
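An alternative to counting hard votes is to average the predicted probabilities over all subsequences of a paper, which also gives a rough sense of how confident the attribution is. A minimal sketch of this variant, reusing the model and the X_sequences of the last paper processed above:
def predict_author(model, X_sequences):
    # Average the per-subsequence probabilities; values near 0 point to Author A,
    # values near 1 to Author B
    mean_prob = float(model.predict(X_sequences).mean())
    return ("Author A" if mean_prob < 0.5 else "Author B"), mean_prob

author, mean_prob = predict_author(model, X_sequences)
print("Predicted: {} (mean probability of Author B: {:.3f})".format(author, mean_prob))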