
Debugging Machine Learning Models with Python
By :

Machine learning contains multiple modeling types that may rely on output data, a variable type of model output, and learning from prerecorded data or experience. Although the examples in this book focus on supervised learning, we will review other types of modeling, including unsupervised learning, self-supervised learning, semi-supervised learning, reinforcement learning (RL), and generative machine learning to cover the six major categories of machine learning modeling (Figure 1.2). We will also talk about techniques in machine learning modeling and provide code examples that are not parallel to these categories, such as active learning, transfer learning, ensemble learning, and deep learning:
Figure 1.2 – Types of machine learning modeling
Self-supervised and semi-supervised learning are sometimes considered sub-categories of supervised learning. However, we will separate them here so that we can establish the differences between the usual supervised learning models you are familiar with and these two types of modeling.
Supervised learning is about identifying the relationship between inputs/features and the output for each data point. But what do input and output mean?
Imagine that we want to build a machine learning model to predict whether a person is likely to get breast cancer or not. The output of the model could be 1 for getting breast cancer and 0 for not getting breast cancer and the inputs could be the characteristics of the people, such as their age, weight, and whether they smoke or not. There could even be inputs that are measured using advanced technologies, such as the genetic information of each person. In this case, we want to use our machine learning model to predict which patient will get cancer in the future.
You can also design a machine learning model to estimate the price of houses in a city. Here, your model could use characteristics of houses, such as the number of bedrooms and size of the house, the neighborhood, and access to schools, to estimate house prices.
In both of these examples, we have models trying to identify patterns within input features, such as a high number of bedrooms but only one bathroom, and associate those with the output. Depending on the output variable type, your model can be categorized as a classification model, in which the output is categorical, such as getting or not getting cancer, or a regression model, in which the output is continuous, such as house prices.
The majority of our life, at least in childhood, has been spent using our five senses (eyesight, hearing, taste, touch, and smell) to collect information about our surroundings, food, and so on, without us trying to find supervised learning style relationships such as whether a banana is ripe or not based on its color and shape. Similarly, in unsupervised learning, we are not seeking to identify the relationship between the features (input) and the output. Instead, the goal is to identify relationships between data points, as in clustering, extract new features (that is, embeddings or representations), and, if needed, reduce the dimensionality (that is, the number of features) of our data without using any output for the data points.
The third category of machine learning modeling is called self-supervised learning. In this category, the goal is to identify the relationship between inputs and outputs, but the difference with supervised learning is the source of outputs. For example, if the goal of the supervised machine learning model is to translate from English to French, the inputs come from English words and sentences and the outputs come from French words and sentences. However, we can have a self-supervised learning model within English sentences to try to predict the next word or a missing word in a sentence. For example, let’s say we aim to recognize that “talking” is a good candidate to fill the gap in “Jack is ____ with Julie.” Self-supervised learning models have been used in recent years across different fields to identify new features. This is commonly called representation learning. We will talk about some examples of self-supervised learning in Chapter 14, Introduction to Recent Advancements in Machine Learning.
Semi-supervised learning can help us benefit from supervised learning without throwing out the data points that don’t have output values. Sometimes, we have data points for which we don’t have the output values and only their feature values are available. In such cases, semi-supervised learning helps us use data points with or without output. One simple process to do so is to group data points that are similar to each other and use known outputs of the data points in each group to assign output for other data points of the same group that don’t have output value.
In RL, a model is rewarded according to its experience in an environment (real or virtual). In other words, RL is about identifying relationships with piecewise example addition. In RL, data is not considered part of the model and is independent of the model itself. We will go through some details of RL in Chapter 14, Introduction to Recent Advancements in Machine Learning.
Generative machine learning modeling helps us develop models that can generate images, text, or any data point that is close to the probability distribution of data provided in the training process. ChatGPT is one of the most famous tools that’s built on top of a generative model to generate realistic and meaningful text in response to user questions and answers (https://openai.com/blog/chatgpt). We will go through more details about generative modeling and the available tools built on top of it in Chapter 14, Introduction to Recent Advancements in Machine Learning.
In this section, we provided a brief review of the basic components for building machine learning models and different types of modeling. But if you want to develop machine learning models for automation or discovery, for healthcare or any other application, with a low or high number of data points, on your laptop or the cloud, using a central processing unit (CPU) or graphics processing unit (GPU), you need to develop high-quality code that works as expected. Although this book is not a software debugging book, an overview of software debugging challenges and techniques could help you in developing your machine learning models.