Machine Learning with LightGBM and Python

In the previous section, we introduced a model as a construct to replace a set of instructions that typically comprise a program to perform a specific task. This section covers models and other core machine learning concepts in more detail.
More formally, a model is a mathematical or algorithmic representation of a specific process that performs a particular task. A machine learning model learns a particular task by being trained on a dataset using a training algorithm.
Note
An alternative term for training is fit. Historically, fit stems from the statistical field. A model is said to “fit the data” when trained. We’ll use both terms interchangeably throughout this book.
Many distinct types of models exist, all of which use different mathematical, statistical, or algorithmic techniques to model the training data. Examples of machine learning algorithms include linear regression, logistic regression, decision trees, support vector machines, and neural networks.
A distinction is made between the model type and a trained instance of that model: the majority of machine learning models can be trained to perform various tasks. For example, decision trees (a model type) can be trained to forecast sales, recognize heart disease, and predict football match results. However, each of these tasks requires a different instance of a decision tree that has been trained on a distinct dataset.
What a specific model does depends on the model’s parameters. Parameters are sometimes also called weights, although, strictly speaking, weights are one particular kind of model parameter.
A training algorithm is an algorithm for finding the most appropriate model parameters for a specific task.
We determine the quality of fit, or how well the model performs, using an objective function. This is a mathematical function that measures the difference between the predicted output and the actual output for a given input. The objective function quantifies the performance of a model. We may seek to minimize or maximize the objective function depending on the problem we are solving. The objective is often measured as an error we aim to minimize during training.
We can summarize the model training process as follows: a training algorithm uses data from a dataset to optimize a model’s parameters for a particular task, as measured through an objective function.
While a model is composed of parameters, the training algorithm has parameters of its own called hyperparameters. A hyperparameter is a controllable value that influences the training process or algorithm. For example, consider finding the minimum of a parabola function: we could start by guessing a value and then take small steps in the direction that minimizes the function output. The step size would have to be chosen well: if our steps are too small, it will take a prohibitively long time to find the minimum. If the step size is too large, we may overshoot and miss the minimum and then continue oscillating (jumping back and forth) around the minimum:
Figure 1.1 – Effect of using a step size that is too large (left) and too small (right)
In this example, the step size would be a hyperparameter of our minimization algorithm. The effect of the step size is illustrated in Figure 1.1.
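The effect of the step size can be sketched in a few lines of Python. The snippet below is a minimal illustration rather than a production optimizer: it assumes the parabola f(x) = x², uses its derivative 2x to pick the downhill direction, and the function name `minimize_parabola` and the three step sizes are chosen purely for illustration.

```python
def minimize_parabola(start, step_size, n_steps):
    """Walk downhill on f(x) = x**2 and return every position visited."""
    x = start
    path = [x]
    for _ in range(n_steps):
        gradient = 2 * x              # derivative of x**2 at the current position
        x = x - step_size * gradient  # step against the gradient (downhill)
        path.append(x)
    return path


# Illustrative step sizes (assumed values, not taken from the text)
path_small = minimize_parabola(start=4.0, step_size=0.001, n_steps=20)
path_good = minimize_parabola(start=4.0, step_size=0.3, n_steps=20)
path_large = minimize_parabola(start=4.0, step_size=0.95, n_steps=20)

print(f"too small : x = {path_small[-1]:.4f}")  # has barely moved away from 4.0
print(f"reasonable: x = {path_good[-1]:.8f}")   # very close to the minimum at 0
# With the large step, the position jumps across the minimum on every step
print(f"too large : first positions = {[round(x, 2) for x in path_large[:4]]}")
```

With the large step size, each update lands on the opposite side of the minimum, which is the back-and-forth oscillation described above.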
As explained previously, the machine learning model is trained using a dataset. Data is at the heart of the machine learning process, and data preparation is often the part of the process that takes up the most time.
Throughout this book, we’ll work with tabular datasets. Tabular datasets are very common in the real world and consist of rows and columns. Rows are often called samples, examples, or observations, and columns are usually called features, variables, or attributes.
Importantly, there is no restriction on the data type in a column. Features may be strings, numbers, Booleans, geospatial coordinates, or encoded formats such as audio, images, or video.
Datasets are also rarely perfectly defined. Data may be incomplete, noisy, incorrect, inconsistent, and contain various formats.
Therefore, data preparation and cleaning are essential parts of the machine learning process.
Data preparation concerns processing the data to make it suitable for machine learning and typically consists of the following steps:
In Chapter 6, Solving Real-World Data Science Problems with LightGBM, we’ll go through the entire data preparation process to show how the preceding steps are applied practically.
Note
A good adage to remember is “garbage in, garbage out”. A model learns from any data given to it, including any flaws or biases contained in the data. When we train the model on garbage data, it results in a garbage model.
One final concept to understand regarding datasets is the training, validation, and test datasets. We split our datasets into these three subsets after the data preparation step is done:
Another use of the validation set is to monitor whether the model is overfitting the data. Let’s discuss overfitting in more detail.
To understand overfitting, we must first define what we mean by model generalization. As stated previously, generalization is the model’s ability to accurately predict data it has not seen before. Compared to training accuracy, generalization accuracy is more significant as an estimate of model performance as this indicates how our model will perform in production. Generalization comes in two forms, interpolation and extrapolation:
Of the two types of generalization, extrapolation is much more challenging and may require a specific type of model to achieve. However, in both cases, a model can overfit the data, losing its ability to interpolate or extrapolate accurately.
Overfitting is a phenomenon where the model fits the training data too closely and loses its ability to generalize to unseen data. Instead of learning the underlying pattern in the data, the model has memorized the training data. More technically, the model fits the noise contained in the training data. The term noise stems from the concept of data containing signal and noise. Signal refers to the underlying pattern or information captured in the data we are trying to predict. In contrast, noise refers to random or irrelevant variations of data points that mask the signal.
For example, consider a dataset where we try to predict the rainfall for specific locations. The signal in the data would be the general trend of rainfall: rainfall increases in the winter or summer, or vice versa for other locations. The noise would be the slight variations in rainfall measurement for each month and location in our dataset.
The following graph illustrates the phenomenon of overfitting:
Figure 1.2 – Graph showing overfitting. The model has overfitted and predicted the training data perfectly but has lost the ability to generalize to the actual signal
The preceding figure shows the difference between signal and noise: each data point was sampled from the actual signal. The data follows the general pattern of the signal, with slight, random variations. We can see how the model has overfitted the data: the model has fit the training data perfectly but at the cost of generalization. We can also see that if we use the model to interpolate by predicting a value for 4, we get a result much higher than the actual signal (6.72 versus 6.2). Also shown is the model’s failure to extrapolate: the prediction for 12 is much lower than a forecast of the signal (7.98 versus 8.6).
In practice, all real-world datasets contain noise. As data scientists, we aim to prepare the data to remove as much noise as possible, making the signal easier to detect. Data cleaning, normalization, feature selection, and feature engineering are techniques for removing noise from the data, while regularization limits how closely the model can fit whatever noise remains.
Since all real-world data contains noise, overfitting is impossible to eliminate. The following conditions may lead to overfitting:
As the validation set is a part of the training data that remains unseen by the model, we use the validation set to monitor for overfitting. We can recognize the point of overfitting by looking at the training and generalization errors over time. At the point of overfitting, the validation error increases. In contrast, the training error continues to improve: the model is fitting noise in the training data and losing its ability to generalize.
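As a rough sketch of this monitoring, the snippet below scans a validation error curve and reports where it stops improving. The error values are made up for illustration, and `find_overfitting_point` is a hypothetical helper name, not a library function.

```python
def find_overfitting_point(validation_errors):
    """Return the index of the lowest validation error.

    Training beyond this iteration may still lower the training error,
    but the validation error no longer improves.
    """
    best_index = 0
    for i, error in enumerate(validation_errors):
        if error < validation_errors[best_index]:
            best_index = i
    return best_index


# Made-up error curves, one value per training iteration (illustrative only)
training_errors = [0.90, 0.62, 0.45, 0.33, 0.25, 0.18, 0.12]
validation_errors = [0.95, 0.70, 0.56, 0.50, 0.53, 0.60, 0.71]

best = find_overfitting_point(validation_errors)
print(f"Validation error is lowest at iteration {best}")
for i in range(best + 1, len(validation_errors)):
    # Past the best iteration, the two curves move in opposite directions
    print(i, training_errors[i], validation_errors[i])
```

In these illustrative curves, the training error keeps falling after the best iteration while the validation error rises, which is the signature of overfitting.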
Techniques that prevent overfitting usually aim to address the conditions that lead to overfitting we discussed previously. Here are some strategies to avoid overfitting:
It’s important to note that overfitting and the techniques to prevent overfitting are specific to our model. Our goal should always be to minimize overfitting to ensure generalization to unseen data. Some strategies, such as regularization, might not work for specific models, while others might be more effective. There are also more bespoke strategies for models, an example of which we’ll see when we discuss overfitting in decision trees.
The store sales example is an instance of supervised learning – we have a dataset consisting of features and are training the model to predict a target.
Supervised learning problems can be divided into two main categories: classification problems and regression problems.
With a classification problem, the label that needs to be predicted by the model is categorical or defines a class. Some examples of classes are spam or not spam, cat or dog, and diabetic or not diabetic. These are examples of binary classifications: there are only two classes.
Multi-class classification is also possible; for example, email may be classified as Important, Promotional, Clutter, or Spam; images of clouds could be classified as Cirro, Cumulo, Strato, or Nimbo.
With regression problems, the goal is to predict a continuous, numerical value. Examples include predicting revenue, sales, temperature, house prices, and crowd size.
A big part of the art of machine learning is correctly defining or transcribing a problem as a classification or regression problem (or perhaps unsupervised or reinforcement). Later chapters will cover multiple end-to-end case studies of both types of problems.
Let’s briefly discuss how we measure our model’s performance. Model performance refers to the ability of a machine learning model to make accurate predictions or generate meaningful outputs based on the given inputs. An evaluation metric quantifies how well a model generalizes to new, unseen data. High model performance indicates that the model has learned the underlying patterns in the data effectively and can make accurate predictions on data it has not seen before. We can measure the model’s performance relative to the known targets when working with supervised learning problems (either classification or regression problems).
Importantly, how we measure the model’s performance on classification tasks and regression tasks differs. scikit-learn has many built-in metrics functions ready for use with either a classification or regression problem (https://scikit-learn.org/stable/modules/model_evaluation.html). Let’s review the most common of these.
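Classification metrics can be defined in terms of positive and negative predictions made by the model. The following definitions can be used to calculate classification metrics:
True positive (TP): a positive instance that the model correctly predicts as positive
False positive (FP): a negative instance that the model incorrectly predicts as positive
True negative (TN): a negative instance that the model correctly predicts as negative
False negative (FN): a positive instance that the model incorrectly predicts as negative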
Given these definitions, the most common classification metrics are as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times TP}{2 \times TP + FP + FN}$$
The preceding classification metrics are the most common, but there are many more. Even though the F1 score is commonly used in classification problems (as it summarizes precision and recall), choosing the best metric is specific to the problem you are solving. Often, it might be the case that a specific metric is required, but other times, you must choose based on experience and your understanding of the data. We will look at examples of different metrics later in this book.
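To make the formulas concrete, here is a small sketch that computes all four metrics from raw counts. The counts are hypothetical and `classification_metrics` is an illustrative helper; in practice, scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` compute the same quantities from arrays of labels.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from prediction counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a binary classifier evaluated on 100 samples
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that this sketch does not guard against division by zero, which occurs when the model makes no positive predictions or the data contains no positive instances.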
The following are common regression metrics:
As with the classification metrics, choosing the most appropriate regression metric is specific to the problem you are trying to solve.
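As a small illustration, the sketch below computes three widely used regression metrics by hand: the mean absolute error (MAE), the mean squared error (MSE), and the root mean squared error (RMSE). The target and prediction values are hypothetical, and the helper name `regression_metrics` is an assumption made for this example.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, and RMSE for paired actual and predicted values."""
    errors = [pred - actual for actual, pred in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e ** 2 for e in errors) / len(errors)
    rmse = math.sqrt(mse)
    return mae, mse, rmse


# Hypothetical actual values and model predictions
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mae, mse, rmse = regression_metrics(y_true, y_pred)
print(f"MAE: {mae:.3f}, MSE: {mse:.3f}, RMSE: {rmse:.3f}")
```

Because the MSE squares each error, a few large mistakes influence it far more than they influence the MAE, which is one reason the choice of metric depends on the problem.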
Metrics versus objectives
We defined training a model as finding the most appropriate parameters to minimize an objective function. It’s important to note that the objective function and metrics used for a specific problem may differ. A good example is decision trees, where a measure of impurity (entropy) is used as the objective function when building a tree. However, we still calculate the metrics explained previously to determine the tree’s performance on the data.
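To illustrate what such an impurity measure looks like, here is a minimal sketch of entropy computed over a set of class labels. The function name and the example labels are assumptions for illustration; how a decision tree uses the measure while it is being built is not shown here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Each class contributes p * log2(1 / p), where p is its share of the labels
    return sum((n / total) * math.log2(total / n) for n in counts.values())


mixed = ["spam", "spam", "not spam", "not spam"]  # evenly mixed: most impure
pure = ["spam", "spam", "spam", "spam"]           # a single class: no impurity

print(f"mixed labels: {entropy(mixed):.2f} bits")
print(f"pure labels : {entropy(pure):.2f} bits")
```

A lower entropy means the labels are more homogeneous, which is a different quantity from accuracy or F1 and shows how an objective can differ from the metric used to report performance.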
With our understanding of basic metrics in place, we can conclude our introduction to machine learning concepts. Now, let’s review the terms and concepts we’ve discussed using an example.
Consider the following data of sales by month, in thousands:
Jan | Feb | Mar | Apr | May | Jun
4,140 | 4,850 | 7,340 | 6,890 | 8,270 | 10,060
Jul | Aug | Sept | Oct | Nov | Dec
8,110 | 11,670 | 10,450 | 11,540 | 13,400 | 14,420
Table 1.1 – Sample sales data, by month, in thousands
This problem is straightforward: there is only one feature, the month, and the target is the number of sales. Therefore, this is an example of a supervised regression problem.
Note
You might have noticed that this is an example of a time series problem: time is the primary variable. Time series can also be predicted using more advanced time series-specific algorithms such as ARIMA, but we’ll use a simple algorithm for illustration purposes in this section.
We can plot our data as a graph of sales per month to understand it better:
Figure 1.3 – Graph showing store sales by month
Here, we’re using a straight-line model, also known as simple linear regression, to model our sales data. The definition of a straight line is given by the following formula:
y = mx + c
Here, m is the line’s slope and c is the Y-intercept. In machine learning, the straight line is the model, and m and c are the model parameters.
To find the best parameters, we must measure how well our model fits the data for a particular set of parameters – that is, the error in our outputs. We will use the MAE as our metric:
$$\text{MAE} = \frac{\sum_{i=1}^{n} |\hat{y}_i - y_i|}{n}$$
Here, $\hat{y}_i$ is the predicted output and $y_i$ is the actual output for the i-th input, and n is the number of predictions. We calculate the MAE by making a prediction for each of our inputs and then calculating the MAE based on the formula.
Now, let’s fit our linear model to our data. Our process for fitting the line is iterative, and we start this process by guessing values for m and c and then iterating from there. For example, let’s consider m = 0.1, c = 4:
Figure 1.4 – Graph showing the prediction of a linear model with m = 0.1 and c = 4
With these parameters, we achieve an error of 4.61, or 4,610 in the units of Table 1.1.
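The calculation behind this first guess can be sketched in plain Python. The sales figures are the monthly values from Table 1.1 scaled down to match the c = 4 guess (4,140 becomes 4.14); the helper names `predict` and `mean_absolute_error_manual` are assumptions made for this illustration.

```python
months = list(range(1, 13))
sales = [4.14, 4.85, 7.34, 6.89, 8.27, 10.06,
         8.11, 11.67, 10.45, 11.54, 13.40, 14.42]


def predict(m, c, xs):
    """Evaluate the straight line y = m * x + c for every input in xs."""
    return [m * x + c for x in xs]


def mean_absolute_error_manual(y_pred, y_true):
    """Average absolute difference between predictions and actual values."""
    return sum(abs(p - a) for p, a in zip(y_pred, y_true)) / len(y_true)


first_guess = predict(m=0.1, c=4, xs=months)
mae = mean_absolute_error_manual(first_guess, sales)
print(f"MAE for m = 0.1, c = 4: {mae:.2f}")
```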
Our guess is far too low, but that’s okay; we can now update the parameters to attempt to improve the error. In reality, updating the model parameters is done algorithmically using a training algorithm such as gradient descent. We’ll discuss gradient descent in Chapter 2, Ensemble Learning – Bagging and Boosting.
In this example, we’ll use our understanding of straight lines and intuition to update the parameters for each iteration manually. Our line is too shallow, and the intercept is too low; therefore, we must increase both values. We can control the updates we make each iteration by choosing a step size. We must update the m and c values with each iteration by adding the step size. The results, for a step size of 0.1, are shown in Table 1.2.
Guess# | m | c | MAE
1 | 0.1 | 4 | 4.61
2 | 0.2 | 4.1 | 3.89
3 | 0.3 | 4.2 | 3.17
4 | 0.4 | 4.3 | 2.5
5 | 0.5 | 4.4 | 1.83
Table 1.2 – Stepwise guessing of the slope (m) and y-intercept (c) for a straight line to fit our data. The quality of fit is measured using the MAE
In our example, the step size is a hyperparameter of our training process.
We end up with an error of 1.83, which means, on average, our predictions are wrong by less than 2,000.
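The manual search can be reproduced with a short loop. This is only a sketch of the guessing procedure described above, not a real training algorithm: it assumes both parameters grow by the step size on every iteration, and the variable names are chosen for illustration.

```python
months = list(range(1, 13))
sales = [4.14, 4.85, 7.34, 6.89, 8.27, 10.06,
         8.11, 11.67, 10.45, 11.54, 13.40, 14.42]

step_size = 0.1   # hyperparameter: how far each guess moves the parameters
m, c = 0.1, 4.0   # initial guess for the slope and intercept
history = []      # (guess number, m, c, MAE) for every iteration

for guess in range(1, 6):
    predictions = [m * x + c for x in months]
    mae = sum(abs(p - y) for p, y in zip(predictions, sales)) / len(sales)
    history.append((guess, m, c, mae))
    # Move both parameters up by one step for the next guess
    m += step_size
    c += step_size

for guess, m_value, c_value, mae_value in history:
    print(f"{guess}  m={m_value:.1f}  c={c_value:.1f}  MAE={mae_value:.2f}")
```

A fixed update like this cannot tell when to stop or when to change direction, which is why real training algorithms such as gradient descent use the error itself to decide each update.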
Now, let’s see how we can solve this problem using scikit-learn.
Instead of manually modeling, we can use scikit-learn to build a linear regression model. As this is our first example, we’ll walk through the code line by line and explain what’s happening.
To start with, we must import the Python tools we are going to use:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
There are three sets of imports: we import numpy and pandas first. Importing NumPy and pandas is a widely used way to start all your data science notebooks. Also, note the short names np and pd, which are the standard conventions when working with numpy and pandas.
Next, we import a few standard plotting libraries we will use to plot some graphs: pyplot from matplotlib and seaborn. Matplotlib is a widely used plotting library that we access via the pyplot Python interface. Seaborn is another visualization tool built on top of Matplotlib, which makes it easier to draw professional-looking graphs.
Finally, we get to our scikit-learn imports. In Python code, the scikit-learn library is called sklearn. From its linear_model package, we import LinearRegression. scikit-learn implements a wide variety of predefined metrics, and here, we will be using mean_absolute_error.
Now, we are ready to set up our data:
months = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
sales = np.array([4.14, 4.85, 7.34, 6.89, 8.27, 10.06,
                  8.11, 11.67, 10.45, 11.54, 13.4, 14.42])
df = pd.DataFrame({"month": months, "sales": sales})
Here, we define a new numpy array for the months and the corresponding sales, and to make them easier to work with, we gather both arrays into a new pandas DataFrame.
With the data in place, we get to the interesting part of the code: modeling using scikit-learn. The code is straightforward:
model = LinearRegression()
model = model.fit(df[["month"]], df[["sales"]])
First, we create our model by constructing an instance of LinearRegression. We then fit our model using model.fit and passing in the month and sales data from our DataFrame. These two lines are all that’s required to fit a model, and as we’ll see in later chapters, even complicated models use the same recipe to instantiate and train a model.
We can now calculate our MAE by creating predictions for our data and passing the predictions and actual targets to the metric function:
predicted_sales = model.predict(df[["month"]])
# scikit-learn metrics take the actual targets first, then the predictions
mean_absolute_error(df[["sales"]], predicted_sales)
We get an error of 0.74, which is considerably lower than the 1.83 we reached through guesswork. We can also examine the model’s coefficient and intercept (m and c from earlier):
print(f"Gradient: {model.coef_}")
print(f"Intercept: {model.intercept_}")
scikit-learn has fitted a model with a coefficient of approximately 0.86 and an intercept of 3.68. We were in the right neighborhood with our guesses, but it might have taken us some time to get to the optimal values.
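As a cross-check, the same line can be recovered without a library. LinearRegression fits by ordinary least squares, and for a single feature the least-squares slope and intercept have a simple closed form. The sketch below applies that formula to the same data; the variable names are illustrative.

```python
months = list(range(1, 13))
sales = [4.14, 4.85, 7.34, 6.89, 8.27, 10.06,
         8.11, 11.67, 10.45, 11.54, 13.40, 14.42]

n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n

# Least-squares slope: co-variation of x and y divided by the variation of x
# (the 1/n factors of covariance and variance cancel, so plain sums suffice)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales))
s_xx = sum((x - mean_x) ** 2 for x in months)
slope = s_xy / s_xx
# The fitted line always passes through the point (mean_x, mean_y)
intercept = mean_y - slope * mean_x

predictions = [slope * x + intercept for x in months]
mae = sum(abs(p - y) for p, y in zip(predictions, sales)) / n

print(f"slope={slope:.3f}, intercept={intercept:.3f}, MAE={mae:.2f}")
```

One caveat: least squares minimizes the squared error rather than the MAE, so this is another case where the objective used for fitting differs from the metric used to report the result.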
That concludes our introduction to scikit-learn and the basics of modeling and machine learning. In our toy example, we did not split our data into separate datasets, optimize our model’s hyperparameters, or apply any techniques to ensure our model does not overfit. In the next section, we’ll look at classification and regression examples, where we’ll apply these and other best practices.