Debugging Machine Learning Models with Python

By: Ali Madani

Overview of this book

Debugging Machine Learning Models with Python is a comprehensive guide that takes you through the entire spectrum of mastering machine learning, from foundational concepts to advanced techniques. It goes beyond the basics to arm you with the expertise essential for building reliable, high-performance models for industrial applications. Whether you’re a data scientist, analyst, machine learning engineer, or Python developer, this book will empower you to design modular systems for data preparation, accurately train and test models, and seamlessly integrate them into larger technologies. By bridging the gap between theory and practice, you’ll learn how to evaluate model performance, identify and address issues, and harness recent advancements in deep learning and generative modeling using PyTorch and scikit-learn. Your journey to developing high-quality models in practice will also encompass causal and human-in-the-loop modeling and machine learning explainability. With hands-on examples and clear explanations, you’ll develop the skills to deliver impactful solutions across domains such as healthcare, finance, and e-commerce.
Table of Contents (26 chapters)

  • Part 1: Debugging for Machine Learning Modeling
  • Part 2: Improving Machine Learning Models
  • Part 3: Low-Bug Machine Learning Development and Deployment
  • Part 4: Deep Learning Modeling
  • Part 5: Advanced Topics in Model Debugging

Flaws in data used for modeling

Data is one of the core components of machine learning modeling (Figure 1.1). Applications of machine learning across industries such as healthcare, finance, automotive, retail, and marketing depend on access to the data needed to train and test models. Because this data is fed into models for training (that is, identifying optimal model parameters) and testing, flaws in the data can cause problems in the resulting models, such as low performance in training (for example, high bias), low generalizability (for example, high variance), or socioeconomic biases. Here, we will discuss examples of flaws and properties of data that need to be considered when designing a machine learning model.

Data format and structure

There could be issues with how data is stored, read, and moved through different functions and classes in your code or pipeline. You might need to work with structured (tabular) data or with unstructured data such as videos and text documents. This data could be stored in relational databases such as MySQL, in NoSQL (that is, non-relational) databases, in data warehouses and data lakes, or locally in file formats such as CSV. Either way, the structure and format your code expects need to match those of the data it actually receives. For example, if your code expects a tab-separated file but the input file of the corresponding function is comma-separated, then all the columns could be lumped together into one. Luckily, most of the time, these kinds of issues result in errors in the code.
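The delimiter mismatch described above can be sketched with Python’s standard csv module. This is a minimal illustration, not a recommended pipeline: the helper read_rows, the column names, and the sample values are all made up for this example. Note that, depending on the reader, parsing with the wrong delimiter may succeed and simply return one lumped column, so an explicit column-count check is what turns the mismatch into an early error.

```python
import csv
import io


def read_rows(text, delimiter, expected_columns):
    """Parse delimited text and fail fast if any row has an unexpected width.

    Hypothetical helper for illustration only.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    for line_number, row in enumerate(rows, start=1):
        if len(row) != expected_columns:
            raise ValueError(
                f"Line {line_number}: expected {expected_columns} columns, "
                f"found {len(row)}. Is the delimiter {delimiter!r} correct?"
            )
    return rows


# Made-up comma-separated content standing in for an input file.
comma_text = "age,height,weight\n34,170,65\n51,182,80\n"

# Parsing with the wrong delimiter does not fail by itself:
# every line comes back as a single lumped field.
lumped = list(csv.reader(io.StringIO(comma_text), delimiter="\t"))

# With the column-count check, the right delimiter passes...
rows = read_rows(comma_text, ",", expected_columns=3)

# ...and the wrong delimiter is reported immediately.
try:
    read_rows(comma_text, "\t", expected_columns=3)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

The same idea carries over to other readers: after loading, compare the number (or names) of columns against what the downstream code expects before passing the data on.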

There could also be mismatches between the provided and expected data that cause no errors at all if the code does not defend against them and too little information is logged. For example, imagine a scikit-learn fit function that expects training data with 100 features, and you happen to have exactly 100 data points. In this case, the input is 100 by 100 either way, so your code will not return any errors regardless of whether features are in the rows or the columns of the input DataFrame. Your code therefore needs to check whether each row of the input DataFrame contains the values of one feature across all data points or the feature values of one data point. The following figure shows how switching features with data points, for example by transposing a DataFrame so that rows become columns, could provide the wrong input but result in no error. In this figure, we have considered four columns and rows for simplicity. Here, F and D are used as abbreviations for feature and data point, respectively:

Figure 1.4 – Simplified example showcasing how the transpose of a DataFrame can be used by mistake in a scikit-learn fit function that expects four features
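One way to defend against the silent transpose shown in the figure is to validate column labels rather than shapes. The following is a minimal sketch assuming pandas is available and that feature names are known in advance; the function check_orientation, the labels F1 to F4 and D1 to D4, and the numeric values are hypothetical and chosen only to mirror the figure.

```python
import pandas as pd

# Hypothetical feature names; in a real project these would come from a schema.
EXPECTED_FEATURES = ["F1", "F2", "F3", "F4"]


def check_orientation(df: pd.DataFrame, expected_features: list) -> pd.DataFrame:
    """Return df unchanged if its columns are exactly the expected features."""
    if list(df.columns) != list(expected_features):
        raise ValueError(
            f"Expected feature columns {list(expected_features)}, "
            f"got {list(df.columns)}. Is the DataFrame transposed?"
        )
    return df


# Four data points (rows D1..D4) described by four features (columns F1..F4).
data = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
    index=["D1", "D2", "D3", "D4"],
    columns=EXPECTED_FEATURES,
)
transposed = data.T

# A shape check cannot tell the two apart: both are 4 x 4.
shapes_match = data.shape == transposed.shape

# The label check accepts the correct orientation...
check_orientation(data, EXPECTED_FEATURES)

# ...and rejects the transposed one.
try:
    check_orientation(transposed, EXPECTED_FEATURES)
    transposed_rejected = False
except ValueError:
    transposed_rejected = True
```

This check only works if the DataFrame carries meaningful column labels. If the data arrives as a bare array, the orientation has to be fixed and documented at the point where the data is loaded.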

Data flaws are not restricted to structure and format issues. Some data characteristics need to be considered when you’re trying to build and improve a machine learning model.

Data quantity and quality

Machine learning is a concept more than half a century old, but the rise of excitement around it started in 2012. Although there were algorithmic advancements for image classification between 2010 and 2015, it was the availability of 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest, together with the necessary computing power, that played a crucial role in the development of the first high-performance image classification models, such as AlexNet (Krizhevsky et al., 2012) and VGG (Simonyan and Zisserman, 2014).

In addition to data quantity, the quality of the data also plays a very important role. In some applications, such as clinical cancer settings, a large quantity of high-quality data is not accessible. Quantity and quality can also trade off against each other, as we could have access to more data but of lower quality. We can choose to use only high-quality data, only low-quality data, or try to benefit from both if possible. Selecting the right approach is domain-specific and depends on the data and algorithm used for modeling.

Data biases

Machine learning models can have different kinds of biases, depending on the data we feed them. Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) is a famous example of a machine learning model with reported biases. COMPAS is designed to estimate the likelihood that a defendant will re-offend based on their responses to more than 100 survey questions, including questions such as whether one of the defendant’s parents was ever in prison. A summary of the responses results in a risk score. Although the tool has been successful in many examples, when its predictions have been wrong, the errors for white and black offenders were not the same. The company that developed COMPAS presented data that supports its algorithm’s findings. You can find articles and blog posts to read more about its current status, whether it is still used, and whether it still has biases.
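One simple way to look for the kind of disparity described above is to compare an error rate, such as the false positive rate, across groups. The sketch below is an illustration only: the function false_positive_rate_by_group is hypothetical, the groups "A" and "B" are anonymous placeholders, and the labels and predictions are invented toy values, not COMPAS data or results.

```python
from collections import defaultdict


def false_positive_rate_by_group(y_true, y_pred, groups):
    """Compute, per group, the share of true negatives predicted as positive."""
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for truth, prediction, group in zip(y_true, y_pred, groups):
        if truth == 0:
            negatives[group] += 1
            if prediction == 1:
                false_positives[group] += 1
    # Only groups with at least one true negative appear in the result.
    return {group: false_positives[group] / count for group, count in negatives.items()}


# Invented toy values: 1 means "re-offended" (y_true) or "predicted high risk" (y_pred).
y_true = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
groups = ["A"] * 6 + ["B"] * 6

rates = false_positive_rate_by_group(y_true, y_pred, groups)
```

A large gap between groups in such a metric is a signal to investigate, not a verdict: which fairness metric is appropriate depends on the application.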

These were some examples of issues in data and their consequences in the resulting machine learning models. But there are other problems in models that do not originate from data.
