
Machine Learning Automation with TPOT

Just over 25 years ago (1994), a question was asked in an episode of The Today Show – "What is the internet, anyway?" It's hard to imagine that a couple of decades ago, the general population had difficulty defining what the internet is and how it works. Little did they know that we would have intelligent systems managing themselves only a quarter of a century later, available to the masses.
The concept of machine learning was introduced much earlier in 1949 by Donald Hebb. He presented theories on neuron excitement and communication between neurons (A Brief History of Machine Learning – DATAVERSITY, Foote, K., March 26, 2019). He was the first to introduce the concept of artificial neurons, their activation, and their relationships through weights.
In the 1950s, Arthur Samuel developed a computer program for playing checkers. Memory was quite limited at that time, so he designed a scoring function that attempted to measure each side's probability of winning based on the positions on the board. The program chose its next move using a minimax strategy, which eventually evolved into the minimax algorithm (A Brief History of Machine Learning – DATAVERSITY, Foote, K., March 26, 2019). Samuel was also the first to come up with the term machine learning.
Frank Rosenblatt decided to combine Hebb's artificial brain cell model with the work of Arthur Samuel to create a perceptron. In 1957, a perceptron was planned as a machine, which led to building a Mark 1 perceptron machine, designed for image classification.
The idea seemed promising, to say the least, but the machine couldn't recognize useful visual patterns, which caused further research to stall – this period is known as the first AI winter. There wasn't much going on with the perceptron and neural network models until the 1990s.
The preceding couple of paragraphs tell us more than enough about the state of machine learning and deep learning at the end of the 20th century. Groups of individuals were making tremendous progress with neural networks, while the general population had difficulty understanding even what the internet is.
To make machine learning useful in the real world, scientists and researchers required two things: data and computing hardware.
The first was rapidly becoming more available due to the rise of the internet. The second was slowly moving into a phase of exponential growth – both in CPU performance and storage capacity.
Still, the state of machine learning in the late 1990s and early 2000s was nowhere near where it is today. Today's hardware has led to a significant increase in the use of machine-learning-powered systems in production applications. It is difficult to imagine a world where Netflix doesn't recommend movies, or Google doesn't automatically filter spam from regular email.
But, what is machine learning, anyway?
There are a lot of definitions of machine learning out there, some more and some less formal. Here are a couple worth mentioning:
Even though these definitions are expressed differently, they convey the same information. Machine learning aims to develop a system or an algorithm capable of learning from data without human intervention.
The goal of a data scientist isn't to instruct the algorithm on how to learn, but rather to provide an adequately sized and prepared dataset to the algorithm and briefly specify the relationships between the dataset variables. For example, suppose the goal is to produce a model capable of predicting housing prices. In that case, the dataset should provide observations on a lot of historical prices, measured through variables such as location, size, number of rooms, age, whether it has a balcony or a garage, and so on.
It's up to the machine learning algorithm to decide which features are important and which aren't, that is, which features have significant predictive power. The example in the previous paragraph illustrates a regression problem solved with supervised machine learning methods. We'll soon dive into both concepts, so don't worry if you don't quite understand them yet.
Further, we might want to build a model that can predict, with a decent amount of confidence, whether a customer is likely to churn (break the contract). Useful features would be the list of services the client is using, how long they have been using the service, whether the previous payments were made on time, and so on. This is another example of a supervised machine learning problem, but the target variable (churn) is categorical (yes or no) and not continuous, as was the case in the previous example. We call these types of problems classification machine learning problems.
Machine learning isn't limited to regression and classification. It is applied to many other areas, such as clustering and dimensionality reduction. These fall into the category of unsupervised machine learning techniques. These topics won't be discussed in this chapter.
But first, let's answer a question on the usability of machine learning models, and discuss who uses these models and in which circumstances.
In a single word – everywhere. But you'll have to continue reading to get a complete picture. Machine learning has been adopted in almost every industry in the last decade or two. The main reason is the advancements in hardware. Also, machine learning has become easier for the broader masses to use and understand.
It would be impossible to list every industry in which machine learning is used and to discuss further the specific problems it solves. The easier task would be to list the industries that can't benefit from machine learning, as there are far fewer of those.
We'll focus only on the better-known industries in this section.
Here's a list and explanation of the ten most common use cases of machine learning, both from the industry standpoint and as a general overview:
These are just a couple of examples of what machine learning can do – not an exhaustive list by any means. You are now familiar with a brief history of machine learning and know how machine learning can be applied to a wide array of tasks.
The next section will provide a brief refresher on supervised machine learning techniques, such as regression and classification.
The majority of practical machine learning problems are solved through supervised learning algorithms. Supervised learning refers to a situation where you have an input variable (a predictor), typically denoted with X, and an output variable (what you are trying to predict), typically denoted with y.
There's a reason why features (X) are denoted with a capital letter and the target variable (y) isn't. In math terms, X denotes a matrix of features, and matrices are typically denoted with capital letters. On the other hand, y is a vector, and lowercase letters are typically used to denote vectors.
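The matrix-versus-vector distinction is easy to see in code. The following is a minimal sketch using a small, made-up housing table (the column names and values are assumptions chosen only for illustration): X is two-dimensional, with one row per observation and one column per feature, while y is one-dimensional.

```python
import pandas as pd

# Hypothetical housing data, used only to illustrate shapes
df = pd.DataFrame({
    'LivingArea': [300, 356, 501, 407],
    'Rooms': [1, 2, 2, 2],
    'Price': [100, 120, 180, 152],
})

X = df[['LivingArea', 'Rooms']]  # feature matrix: one row per observation, one column per feature
y = df['Price']                  # target vector: one value per observation

print(X.shape)  # (4, 2)
print(y.shape)  # (4,)
```

Selecting columns with double brackets keeps the result two-dimensional even when there is only one feature, which is the input shape scikit-learn estimators expect.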
The goal of a supervised machine learning algorithm is to learn the function that can transform any input into the output. The most general math representation of a supervised learning algorithm is represented with the following formula:
Figure 1.1 – General supervised learning formula
We must apply one of two corrections to make this formula acceptable. The first one is to replace y with y-hat, as y generally denotes the true value, and y-hat denotes the prediction. The second correction we could make is to add the error term, as only then can we have the correct value of y on the other side. The error term represents the irreducible error – the type of error that can't be reduced by further training.
Here's how the first corrected formula looks:
Figure 1.2 – Corrected supervised learning formula (v1)
And here's the second one:
Figure 1.3 – Corrected supervised learning formula (v2)
It's more common to see the second one, but don't be confused by any of the formats – these formulas generally represent the same thing.
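Written out, and assuming that f denotes the learned function and ε the irreducible error term (both symbols are notational assumptions, not taken from the figures), the general formula and its two corrected versions described above would read roughly as follows:

```latex
\begin{align}
y       &= f(X)            && \text{general form} \\
\hat{y} &= f(X)            && \text{corrected, v1: the output is a prediction} \\
y       &= f(X) + \epsilon && \text{corrected, v2: true value with an error term}
\end{align}
```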
Supervised machine learning is called "supervised" because we have the labeled data at our disposal. You might have already picked up on this from the feature and target discussion. This means that we already have the correct answers, that is, we know which combinations of X yield the corresponding values of y.
The end goal is to make the best generalization from the data available. We want to produce the most unbiased model capable of generalizing on new, unseen data. The concepts of overfitting, underfitting, and the bias-variance trade-off are important to produce such a model, but they are not in the scope of this book.
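As we've already mentioned, supervised learning problems are grouped into two main categories: regression, where the target variable is continuous, and classification, where the target variable is categorical.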
Both regression and classification are explored in the following sections.
As briefly discussed in the previous sections, regression refers to a type of problem where the target variable is continuous. The target variable could represent a price, a weight, or a height, to name a few.
The most common type of regression is linear regression, a model where a linear relationship between variables is assumed. Linear regression further divides into a simple linear regression (only one feature), and multiple linear regression (multiple features).
Important note
Keep in mind that linear regression isn't the only type of regression. You can perform regression tasks with algorithms such as decision trees, random forests, support vector machines, gradient boosting, and artificial neural networks, but the same concepts still apply.
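To illustrate the note above, here is a minimal sketch showing that different scikit-learn regressors share the same fit/predict interface. The small dataset and the choice of a shallow decision tree (max_depth=2) are assumptions made purely for illustration, not recommendations:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Small made-up dataset: living area vs. price (000 USD)
df = pd.DataFrame({
    'LivingArea': [300, 356, 501, 407, 950, 782, 664, 456],
    'Price': [100, 120, 180, 152, 320, 260, 210, 150],
})
X, y = df[['LivingArea']], df['Price']

# Two different algorithms, one shared interface
models = {
    'linear': LinearRegression(),
    'tree': DecisionTreeRegressor(max_depth=2, random_state=42),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)                       # same call for every regressor
    predictions[name] = model.predict(X)  # same call for every regressor
    print(name, predictions[name].round(1))
```

Because the interface is uniform, swapping one algorithm for another is a one-line change, which is part of what makes automated model selection feasible.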
To make a quick recap of the regression concept, we'll declare a simple pandas.DataFrame object consisting of two columns, LivingArea and Price. The goal is to predict the price based only on the living area. We are using a simple linear regression model here because it keeps the data visualization simple, which in turn makes the regression concept easy to understand:

    import pandas as pd

    # Living area and the corresponding price (000 USD)
    df = pd.DataFrame({
        'LivingArea': [300, 356, 501, 407, 950, 782, 664, 456, 673,
                       821, 1024, 900, 512, 551, 510, 625, 718, 850],
        'Price': [100, 120, 180, 152, 320, 260, 210, 150, 245,
                  300, 390, 305, 175, 185, 160, 224, 280, 299]
    })

Next, we'll visualize the data points with the matplotlib library. By default, its plots don't look very appealing, so a couple of tweaks are made through matplotlib's rcParams settings:

    import matplotlib.pyplot as plt
    from matplotlib import rcParams

    # Larger figure, no top and right borders
    rcParams['figure.figsize'] = 14, 8
    rcParams['axes.spines.top'] = False
    rcParams['axes.spines.right'] = False

A scatter plot of the living area against the price is then drawn as follows:

    plt.scatter(df['LivingArea'], df['Price'], color='#7e7e7e', s=200)
    plt.title('Living area vs. Price (000 USD)', size=20)
    plt.xlabel('Living area', size=14)
    plt.ylabel('Price (000 USD)', size=14)
    plt.show()

The preceding code produces the following graph:
Figure 1.4 – Regression – Scatter plot of living area and price (000 USD)
Training the model is done with the scikit-learn library. The library contains many different algorithms and techniques we can apply to our data. The sklearn.linear_model module contains the LinearRegression class. We'll use it to train the model on the entire dataset, and then to make predictions on the entire dataset. That's not something you would usually do in a production environment, but it is essential here to get a further understanding of how the model works:

    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    # Double brackets keep the inputs two-dimensional
    model.fit(df[['LivingArea']], df[['Price']])
    preds = model.predict(df[['LivingArea']])
    df['Predicted'] = preds

The predictions can then be drawn as a line on top of the scatter plot:

    plt.scatter(df['LivingArea'], df['Price'], color='#7e7e7e', s=200, label='Data points')
    plt.plot(df['LivingArea'], df['Predicted'], color='#040404', label='Best fit line')
    plt.title('Living area vs. Price (000 USD)', size=20)
    plt.xlabel('Living area', size=14)
    plt.ylabel('Price (000 USD)', size=14)
    plt.legend()
    plt.show()

The preceding code produces the following graph:
Figure 1.5 – Regression – Scatter plot of living area and price (000 USD) with the line of best fit
The trained model can now predict the price for a new living area, for example, 1000:

    model.predict([[1000]])
    >>> array([[356.18038708]])

Both the R2 score and the RMSE are calculated as follows:

    import numpy as np
    from sklearn.metrics import r2_score, mean_squared_error

    # RMSE is the square root of the mean squared error
    rmse = lambda y, ypred: np.sqrt(mean_squared_error(y, ypred))

    model_r2 = r2_score(df['Price'], df['Predicted'])
    model_rmse = rmse(df['Price'], df['Predicted'])

    print(f'R2 score: {model_r2:.2f}')
    print(f'RMSE: {model_rmse:.2f}')
    >>> R2 score: 0.97
    >>> RMSE: 13.88

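To make the two metrics less of a black box, the following sketch computes them by hand from their standard definitions and checks the results against scikit-learn. The true values and predictions are made up for illustration and are unrelated to the housing model above:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical true values and predictions, for illustration only
y_true = np.array([100.0, 120.0, 180.0, 152.0, 320.0])
y_pred = np.array([105.0, 118.0, 170.0, 160.0, 315.0])

residuals = y_true - y_pred
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares

r2_manual = 1 - ss_res / ss_tot                  # share of variance explained
rmse_manual = np.sqrt(np.mean(residuals ** 2))   # typical error, in target units

# Both values should agree with the library implementations
assert np.isclose(r2_manual, r2_score(y_true, y_pred))
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred)))

print(f'R2 score: {r2_manual:.4f}')
print(f'RMSE: {rmse_manual:.4f}')
```

R2 is unitless, while RMSE is expressed in the same units as the target, which is why the two are often reported together.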
To conclude, we've built a simple but accurate model. Don't expect data in the real world to behave this nicely, and also don't expect to build such accurate models most of the time. The process of model selection and tuning is tedious and prone to human error, and that's where automation libraries such as TPOT come into play.
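As noted earlier, training and evaluating on the same data is not what you would do in practice. A more realistic check, sketched here under the assumption of a simple 75/25 holdout split (the split ratio and random seed are arbitrary choices), trains on one part of the data and scores on the rest:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'LivingArea': [300, 356, 501, 407, 950, 782, 664, 456, 673,
                   821, 1024, 900, 512, 551, 510, 625, 718, 850],
    'Price': [100, 120, 180, 152, 320, 260, 210, 150, 245,
              300, 390, 305, 175, 185, 160, 224, 280, 299]
})

# Hold out a quarter of the rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    df[['LivingArea']], df['Price'], test_size=0.25, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
test_preds = model.predict(X_test)

# Scores on unseen rows give a more honest view of generalization
test_r2 = r2_score(y_test, test_preds)
test_rmse = np.sqrt(mean_squared_error(y_test, test_preds))
print(f'Test R2 score: {test_r2:.2f}')
print(f'Test RMSE: {test_rmse:.2f}')
```

With only 18 rows, the test scores will vary noticeably with the random seed, so treat them as a rough indication rather than a stable estimate.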
We'll cover a classification refresher in the next section, again on a fairly simple example.
Classification in machine learning refers to a type of problem where the target variable is categorical. We could turn the example from the Regression section into a classification problem by converting the target variable into categories, such as Sold/Did not sell.
In a nutshell, classification algorithms help us in various scenarios, such as predicting customer attrition, whether a tumor is malignant or not, whether someone has a given disease or not, and so on. You get the point.
Classification tasks can be further divided into binary classification tasks and multi-class classification tasks. We'll explore binary classification tasks briefly in this section. The most basic classification algorithm is logistic regression, and we'll use it in this section to build a simple classifier.
Note
Keep in mind that you are not limited only to logistic regression for performing classification tasks. On the contrary – it's good practice to use a logistic regression model as a baseline, and to use more sophisticated algorithms in production. More sophisticated algorithms include decision trees, random forests, gradient boosting, and artificial neural networks.
The data is completely made up and arbitrary in this example. The first column denotes a measurement (Radius), and the second column denotes the classification (either 0 or 1). The dataset is constructed with the following Python code:

    import numpy as np
    import pandas as pd

    # One measurement and a binary class label per observation
    df = pd.DataFrame({
        'Radius': [0.3, 0.1, 1.7, 0.4, 1.9, 2.1, 0.25,
                   0.4, 2.0, 1.5, 0.6, 0.5, 1.8, 0.25],
        'Class': [0, 0, 1, 0, 1, 1, 0,
                  0, 1, 1, 0, 0, 1, 0]
    })

We'll use the matplotlib library once again for visualization purposes. Here's how to import it and make it a bit more visually appealing:

    import matplotlib.pyplot as plt
    from matplotlib import rcParams

    rcParams['figure.figsize'] = 14, 8
    rcParams['axes.spines.top'] = False
    rcParams['axes.spines.right'] = False

A scatter plot is a convenient way to visualize this data. We expect to see the data points on the left where the Class attribute is 0, and on the right where it's 1:

    plt.scatter(df['Radius'], df['Class'], color='#7e7e7e', s=200)
    plt.title('Radius classification', size=20)
    plt.xlabel('Radius (cm)', size=14)
    plt.ylabel('Class', size=14)
    plt.show()

The following graph is the output of the preceding code:
Figure 1.6 – Classification – Scatter plot between measurements and classes
The goal of a classification model isn't to produce a line of best fit, but instead to draw out the best possible separation between the classes.
The LogisticRegression class is available in the sklearn.linear_model package. We'll use it to train the model on the entire dataset, and then to make predictions on the entire dataset. Again, that's not something we will keep doing later on in the book, but it is essential to get insights into the inner workings of the model at this point:

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()
    model.fit(df[['Radius']], df['Class'])
    preds = model.predict(df[['Radius']])
    df['Predicted'] = preds

To visualize the decision boundary, we first generate evenly spaced radius values with the np.linspace function. We pass it three arguments: start, stop, and the number of elements, which we'll set to 1000. The model then predicts a class for every generated value:

    xs = np.linspace(0, df['Radius'].max() + 0.1, 1000)
    # Predict all values in one call; the column name matches the training data
    ys = model.predict(pd.DataFrame({'Radius': xs}))

    plt.scatter(df['Radius'], df['Class'], color='#7e7e7e', s=200, label='Data points')
    plt.plot(xs, ys, color='#040404', label='Decision boundary')
    plt.title('Radius classification', size=20)
    plt.xlabel('Radius (cm)', size=14)
    plt.ylabel('Class', size=14)
    plt.legend()
    plt.show()

The preceding code produces the following visualization:
Figure 1.7 – Classification – Scatter plot between measurements and classes and the decision boundary
Our classification model is basically a step function, which is understandable for this simple problem. Nothing more complex is needed to correctly classify every instance in our dataset. This won't always be the case, but more on that later.
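The position of that step can be recovered from the fitted model. The sketch below relies on the standard logistic regression relationship that the predicted probability of class 1 equals 0.5 exactly where the linear term is zero; the variable names w, b, and threshold are introduced here for illustration only:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    'Radius': [0.3, 0.1, 1.7, 0.4, 1.9, 2.1, 0.25,
               0.4, 2.0, 1.5, 0.6, 0.5, 1.8, 0.25],
    'Class': [0, 0, 1, 0, 1, 1, 0,
              0, 1, 1, 0, 0, 1, 0]
})

model = LogisticRegression()
model.fit(df[['Radius']], df['Class'])

# P(class 1) = sigmoid(w * x + b), which equals 0.5 where w * x + b = 0,
# so the predicted class flips at x = -b / w
w = model.coef_[0][0]
b = model.intercept_[0]
threshold = -b / w

just_below = pd.DataFrame({'Radius': [threshold - 0.01]})
just_above = pd.DataFrame({'Radius': [threshold + 0.01]})

print(f'Predicted class flips at radius: {threshold:.2f}')
print('Just below:', model.predict(just_below)[0])
print('Just above:', model.predict(just_above)[0])
```

The model's predict_proba method returns the underlying probabilities, which change smoothly with the radius even though the predicted class changes abruptly.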
Read the previous list as many times as necessary to completely understand the idea. The confusion matrix is an essential concept in classifier evaluation, and the later chapters in this book assume you know how to interpret it.
The confusion_matrix function is available in the sklearn.metrics package. Here's how to import it and obtain the results:

    from sklearn.metrics import confusion_matrix

    confusion_matrix(df['Class'], df['Predicted'])

Here are the results:
Figure 1.8 – Classification – Evaluation with a confusion matrix
The previous figure shows that our model was able to classify every instance correctly. As a rule of thumb, if the off-diagonal elements (those stretching from the bottom left to the top right) are zeros, it means the model is 100% accurate.
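For a less tidy case, the following sketch uses made-up labels with two deliberate mistakes to show how the four cells of a binary confusion matrix can be unpacked and turned into an accuracy score. The label values are assumptions for illustration only:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, with two deliberate mistakes
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 0, 1]

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy is the share of correct predictions (the main diagonal)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')  # TN=3, FP=1, FN=1, TP=3
print(f'Accuracy: {accuracy:.2f}')            # Accuracy: 0.75
```

Here the two off-diagonal cells (FP and FN) are non-zero, so the model is no longer 100% accurate.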
The confusion matrix interpretation concludes our brief refresher on supervised machine learning methods. Next, we will dive into the idea of automation, and discuss why we need it in machine learning.