ML project life cycle

In this section, we will learn about the typical life cycle of an ML project, from defining the problem to model development, and finally, to the operationalization of the model. Figure 1.1 shows the high-level steps almost every ML project goes through. Let’s go through all these steps in detail.

Figure 1.1 – Life cycle of a typical ML project

Just like the Software Development Life Cycle (SDLC), the Machine Learning Project/Development Lifecycle (MDLC) guides the end-to-end process of ML model development and operationalization. At a high level, the life cycle of a typical ML project in an enterprise setting remains somewhat consistent and includes eight key steps:

Define the ML use case: In the first step of any ML project, the ML team works with business stakeholders to assess the business needs around predictive analytics and identify a use case where ML can be used, along with success criteria, performance metrics, and possible datasets that can be used to build the models.
For example, if the sales/marketing department of an insurance company called ABC Insurance Inc. wants to better utilize its resources to target customers who are more likely to buy a certain product, they might approach the ML team to build a solution that can sift through all possible leads/customers and, based on the data points for each lead (age, prior purchase, length of policy history, income level, etc.), identify the customers who are most likely to buy a policy. Then the sales team can ask their customer representatives to prioritize reaching out to these customers instead of calling all possible customers blindly. This can significantly improve the outcome of outbound calls by the reps and improve the sales-related KPIs.
Once the use case is defined, the next step is to define a set of KPIs to measure the success of the solution. For this sales use case, this could be the customer sign-up rate, that is, the percentage of customers contacted by sales reps who sign up for a new insurance policy.
To measure the effectiveness of the ML solution, the sales team and the ML team might agree to measure the increase or decrease in customer sign-up rate once the ML model is live and iteratively improve on the model to optimize the sign-up rate.
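The sign-up rate KPI and its change after the model goes live can be sketched in a few lines. This is a minimal illustration, not ABC Insurance Inc.'s actual reporting logic: the function names `sign_up_rate` and `relative_lift` and all figures are hypothetical.

```python
def sign_up_rate(contacted: int, signed_up: int) -> float:
    """Fraction of contacted customers who signed up for a new policy."""
    if contacted <= 0:
        raise ValueError("contacted must be positive")
    return signed_up / contacted


def relative_lift(baseline_rate: float, new_rate: float) -> float:
    """Relative change in sign-up rate versus the baseline (0.25 means +25%)."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (new_rate - baseline_rate) / baseline_rate


# Hypothetical figures, for illustration only.
baseline = sign_up_rate(contacted=1000, signed_up=40)    # before the model
with_model = sign_up_rate(contacted=1000, signed_up=50)  # after the model
lift = relative_lift(baseline, with_model)
print(f"baseline={baseline:.1%} with_model={with_model:.1%} lift={lift:+.0%}")
```

Tracking the relative lift rather than the raw rate makes it easier to compare results across campaigns with different baseline sign-up rates.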
At this stage, there will also be a discussion about the possible datasets that can be utilized for the model training. These could include the following:
- Internal customer/product datasets being generated by marketing and sales teams, for example, customer metadata, such as their age, education profile, income level, prior purchase behavior, number and type of vehicles they own, etc.
- External datasets that can be acquired through third parties; for example, an external marketing consultancy might have collected data about the insurance purchase behavior of customers based on the car brand they own. This additional data can be used to predict how likely they are to purchase the insurance policy being sold by ABC Insurance Inc.
Explore/analyze data: The next step is to do a detailed analysis of the datasets. This is usually an iterative process in which the ML team works closely with the data and business SMEs to better understand the nuances of the available datasets, including the following:
- Data sources
- Data granularity
- Update frequency
- Description of individual data points and their business meaning
This is a key step where data scientists/ML engineers analyze the available data and decide what datasets might be relevant to the ML solution being considered, analyze the robustness of the data, and identify any gaps. Issues that the team might identify at this stage could relate to the cleanliness and completeness of data or problems with the timely availability of the data in production. For example, the age of the customer could be a great indicator of their purchase behavior, but if it’s an optional field in the customer profile, only a handful of customers might have provided their date of birth or age.
So, the team would need to figure out if they want to use the field and, if so, how to handle the samples where age is missing. They could also work with sales and marketing teams to make the field a required field whenever a new customer requests an insurance quote online and generates a lead in the system.
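A quick way to spot fields such as an optional age is to measure how often each field is missing. The sketch below assumes leads are available as a list of dictionaries; the records, the `missing_fraction` helper, and the 40% cut-off are all hypothetical choices for illustration.

```python
# Hypothetical lead records; None marks a value the customer did not provide.
leads = [
    {"lead_id": 1, "age": 34, "income_level": "medium"},
    {"lead_id": 2, "age": None, "income_level": "high"},
    {"lead_id": 3, "age": None, "income_level": None},
    {"lead_id": 4, "age": 51, "income_level": "low"},
]


def missing_fraction(records: list, field: str) -> float:
    """Share of records in which `field` is absent or None."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)


# Assumed cut-off: fields missing in more than 40% of records are flagged for review.
MAX_MISSING = 0.4
report = {f: missing_fraction(leads, f) for f in ("age", "income_level")}
flagged = [f for f, frac in report.items() if frac > MAX_MISSING]
print(report, flagged)
```

A flagged field is not necessarily dropped; the team might instead impute it or, as described above, ask for it to become a required field at lead creation.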
Select ML model type: Once the use case has been identified along with the datasets that can possibly be used to train the model, the next step is to consider the types of models that can be used to achieve the requirements. We won’t go too deep into the topic of general model selection here since entire books could be written on the topic, but in the next few chapters, you will see what different model types can be built for specific use cases in Vertex AI. At a very high level, the key considerations at this stage are as follows:
- Type of model: For example, for the insurance customer/lead ranking example, we could build a classification model that will predict whether a new customer is high/medium/low in terms of their likelihood to purchase a policy. Or a regression model could be built to output a sales probability number for each likely customer.
- Conventional ML versus deep learning: Does a conventional ML model satisfy our requirements, or do we need a deep learning model?
- Explainability requirements: Does the use case require an explanation for each prediction as to why the sample was classified a certain way?
- Single versus ensemble model: Do we need a single model to give us the final prediction, or do we need to employ a set of interconnected models that feed into each other? For example, a first model might assign a customer to a particular customer group, and the next model might use that grouping to identify the final likelihood of purchase.
- Separation of models: For example, sometimes we might build a single global model for the entire customer base, or we might need separate models for each region due to significant differences in products and user behavior in different regions.
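To make the difference between the two output styles in the lead-ranking example concrete, the sketch below maps a per-lead purchase probability to a high/medium/low tier and also ranks leads by the raw score. It does not train a model; the scores, the thresholds, and the `to_tier` helper are assumptions made for illustration.

```python
def to_tier(purchase_probability: float, high: float = 0.7, medium: float = 0.3) -> str:
    """Map a predicted purchase probability to a high/medium/low tier."""
    if not 0.0 <= purchase_probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if purchase_probability >= high:
        return "high"
    if purchase_probability >= medium:
        return "medium"
    return "low"


# Hypothetical model scores per lead.
scores = {"lead_a": 0.82, "lead_b": 0.45, "lead_c": 0.10}
tiers = {lead: to_tier(p) for lead, p in scores.items()}

# Ranking by raw score keeps the ordering within a tier, which tiers alone lose.
call_order = sorted(scores, key=scores.get, reverse=True)
print(tiers, call_order)
```

Tiers are easier for sales reps to act on, while a numeric score lets the team adjust thresholds later without retraining.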
Feature engineering: This process is usually the most time-consuming and involves several steps:
1. Data cleanup – Imputing missing values where possible, dropping fields with too many missing values
2. Data and feature augmentation – Joining datasets to bring in additional fields, and crossing existing features to generate new features
3. Feature analysis – Calculating feature correlation and analyzing collinearity, checking for data leakage in features
Again, since this is an extremely broad topic, we are not diving too deep into it and suggest you refer to other books on this topic.
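As a small taste of the cleanup and analysis steps listed above, the following sketch imputes missing values with the median and computes a Pearson correlation using only the standard library. Median imputation is just one of several possible strategies, and the columns and helper names are hypothetical.

```python
import math
from statistics import median


def impute_median(values: list) -> list:
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equal-length numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical columns for five leads.
age = [25, None, 40, 55, None]
policy_years = [1, 3, 8, 15, 4]
purchased = [0, 0, 1, 1, 0]

age_filled = impute_median(age)  # None entries become the median observed age
print(age_filled)
print(round(pearson(age_filled, policy_years), 3))
# A feature that correlates almost perfectly with the label deserves a leakage check.
print(round(pearson(policy_years, purchased), 3))
```

Two features that are very strongly correlated with each other hint at collinearity, while a feature that tracks the label suspiciously well may be leaking information that would not be available at prediction time.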
Iterate over the model design/build: The actual design and build of the ML model is an iterative process involving the following key steps:
1. Select model architecture
2. Split acquired data into train/validation/test subsets
3. Run model training experiments, tune hyperparameters
4. Evaluate trained models with the test dataset
5. Rank and select the best models
Figure 1.2 shows the typical ML model development life cycle:

Figure 1.2 – ML model development life cycle
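The split and ranking steps of the build loop can be sketched as follows. The 70/15/15 proportions, the experiment names, and the scores are hypothetical, and ranking on validation scores before a final check on the test subset is a common convention assumed here rather than something the steps above prescribe.

```python
import random


def split_dataset(rows: list, train: float = 0.7, validation: float = 0.15, seed: int = 42):
    """Shuffle rows and split them into train/validation/test subsets."""
    if not 0 < train + validation < 1:
        raise ValueError("train + validation must leave room for a test subset")
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


rows = list(range(100))  # stand-in for 100 labeled examples
train_set, val_set, test_set = split_dataset(rows)

# Hypothetical validation scores collected from training experiments.
experiments = [
    {"name": "logreg_c1", "val_auc": 0.81},
    {"name": "gbt_depth4", "val_auc": 0.86},
    {"name": "gbt_depth8", "val_auc": 0.84},
]
ranked = sorted(experiments, key=lambda e: e["val_auc"], reverse=True)
best = ranked[0]  # the candidate to confirm on test_set before reporting
print(len(train_set), len(val_set), len(test_set), best["name"])
```

Keeping the test subset out of hyperparameter tuning helps ensure the final reported metric is not overly optimistic.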

Consensus on results: Once a satisfactory model has been obtained, the ML team shares the results with the business stakeholders to ensure the results fully align with the business needs and performs additional optimizations and post-processing steps to make the model predictions usable by the business. To assure business stakeholders that the ML solution is aligned with the business goals and is accurate enough to drive value, ML teams can use one of several approaches:
- Evaluate using historical test datasets: ML teams can run historical data through the new ML models and evaluate the predictions against the ground truth values. For example, in the insurance use case discussed previously, the ML team can take last month’s data on customer leads and use the ML model to predict which customers are most likely to purchase a new insurance policy. Then they can compare the model’s predictions against the actual purchase history from the previous month and see how accurate the model’s predictions were. If the model’s output is close to the real purchase behavior of customers, then the model is working as desired, and this information can be presented to business stakeholders to convince them of the ML solution’s efficacy in driving additional revenue. On the contrary, if the model’s output significantly deviates from the customer’s behavior, the ML team needs to go back and work on improving the model. This usually is an iterative process and can take a number of iterations, depending on the complexity of the model.
- Evaluate with live data: In some scenarios, an organization might decide to conduct a small pilot in a production environment with real-time data to assess the performance of the new ML model. This is usually done in the following scenarios:
  - When there is no historical data available to conduct the evaluation, or when testing with historical data is not expected to be accurate; for example, during the onset of COVID, customer behavior patterns abruptly changed to the extent that testing with any historical data became nearly useless
  - When there is an existing model in production being used for critical real-time predictions, the sanity check for the new model needs to be performed not just in terms of its accuracy but also its subtle impact on downstream KPIs such as revenue per user session
In such cases, teams might deploy the model in production, divert a small number of prediction requests to the newer model, and periodically compare the overall impact on the KPIs. For example, on an e-commerce website, a new recommendation model might start recommending products that are comparatively cheaper than those recommended by the older model already live in production. In this scenario, the likelihood of a customer completing a purchase would go up, but at the same time, the revenue generated per user session would decrease, impacting overall revenue for the organization. So, although it might seem like the ML model is working as designed, it might not be considered a success by the business/sales stakeholders, and more discussions would be required to optimize it.
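Both evaluation approaches can be sketched briefly. For the historical backtest, the example uses precision among the top-k scored leads, which is one reasonable metric among many. For the live pilot, it uses a hash-based rule to divert a small share of sessions and then compares two KPIs. The helper names, the 5% share, and all data are hypothetical.

```python
import hashlib


def precision_at_k(scores: dict, purchasers: set, k: int) -> float:
    """Share of the k highest-scored leads that actually purchased."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(1 for lead in top_k if lead in purchasers) / k


def routed_to_candidate(session_id: str, percent: int = 5) -> bool:
    """Deterministically send roughly `percent`% of sessions to the candidate model."""
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def kpis(sessions: list) -> dict:
    """Conversion rate and revenue per session for a list of session records."""
    n = len(sessions)
    return {
        "conversion_rate": sum(1 for s in sessions if s["revenue"] > 0) / n,
        "revenue_per_session": sum(s["revenue"] for s in sessions) / n,
    }


# Backtest with hypothetical scores for last month's leads and known purchasers.
backtest = precision_at_k({"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.2}, {"a", "c"}, k=2)
print(backtest)

# Pilot with hypothetical sessions: the candidate converts more often but earns less.
current = kpis([{"revenue": 100.0}, {"revenue": 0.0}, {"revenue": 0.0}, {"revenue": 0.0}])
candidate = kpis([{"revenue": 30.0}, {"revenue": 30.0}, {"revenue": 0.0}, {"revenue": 0.0}])
print(current, candidate)
```

Hashing the session ID keeps each session on the same model for the whole pilot, and comparing several KPIs side by side surfaces trade-offs like the one in the recommendation example, where conversion improves while revenue per session drops.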
Operationalize model: Once the model has been approved for deployment in production, the ML team will work with their organization’s IT and data engineering teams to deploy the model so that other applications can start utilizing it to generate insights. Depending on the size of the organization, there can be significant overlap in the roles these teams play.
The actual deployment architecture would depend on the following:
- Prediction SLAs – Ranging from periodic batch jobs to solutions that require sub-second prediction performance.
- Compliance requirements – Can the user data be sent to third-party cloud providers, or does it need to always reside within an organization’s data centers?
- Infrastructure requirements – This depends on the size of the model and its compute requirements. Small models can be served from a shared compute node. Some large models might need a large GPU-connected node.
We will discuss this topic in detail in later chapters, but the following figure shows some key components you might consider as part of your deployment architecture.

Figure 1.3 – Key components of ML model training and deployment

Monitor and retrain: It might seem as if the ML team’s job is done once the model has been operationalized, but in real-world deployments, most models require periodic or sometimes constant monitoring to ensure the model is operating within the required performance thresholds. Model performance could become sub-optimal for several reasons:
- Data drift: The data being used to generate predictions could change significantly and impact the model’s performance. As we discussed before, during COVID, customer behavior changed significantly. Models that were trained on pre-COVID customer behavior data were not equipped to handle this sudden change in usage patterns. A change on the scale of the pandemic is rare but high-impact; there are plenty of other smaller changes in prediction input data that might impact your model’s performance adversely. The impact could range from a subtle drop in accuracy to a model generating erroneous responses. So, it is important to keep an eye on the key performance metrics of your ML solution.
- Change in prediction request volume: If your solution was designed to handle 100 requests per second but now is seeing periodic bursts in traffic of around 1,000 requests per second, your solution might not be able to keep up with the demand, or latency might go above acceptable levels. So, your solution also needs to have monitoring and certain levels of auto-scaling built in to handle such scenarios. For larger changes in traffic volumes, you might even need to completely rethink the serving architecture.
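Both kinds of change can be monitored with simple calculations. The sketch below uses the Population Stability Index (PSI) as one possible drift measure, which is a choice made here and not one named in the text, and a back-of-the-envelope replica estimate for traffic bursts. The bin edges, the 20% headroom, the sample data, and the per-replica capacity are hypothetical.

```python
import math


def psi(expected: list, actual: list, edges: list, eps: float = 1e-6) -> float:
    """Population Stability Index between two samples over shared bin edges."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1  # bin index = edges passed
        return [max(c / len(values), eps) for c in counts]  # eps avoids log(0)

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def replicas_needed(peak_rps: float, rps_per_replica: float, headroom: float = 0.2) -> int:
    """Serving replicas required to absorb peak traffic with spare headroom."""
    return math.ceil(peak_rps * (1 + headroom) / rps_per_replica)


# Hypothetical customer ages seen at training time versus in recent requests.
training_ages = [25, 30, 35, 40, 45, 50, 55, 60]
recent_ages = [22, 24, 26, 28, 30, 32, 34, 55]
drift = psi(training_ages, recent_ages, edges=[30, 45])
print(round(drift, 3))

# Burst from the example above: 1,000 requests per second, assuming 100 per replica.
print(replicas_needed(peak_rps=1000, rps_per_replica=100))
```

A PSI of zero means the two distributions match across the bins; a commonly cited rule of thumb treats values above roughly 0.25 as a significant shift, although the right alert threshold depends on the feature and the use case.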
In some scenarios, monitoring will reveal that your ML model no longer meets the required prediction accuracy and needs retraining. If the change in data patterns is expected, the ML team should design the solution to support automatic periodic retraining. For example, in the retail industry, product catalogs, pricing, and promotions constantly evolve, requiring regular retraining of the models. In other scenarios, the change might be gradual or unexpected, and when the monitoring system alerts the ML team of the model performance degradation, they need to decide whether to retrain the model with more recent data or even completely rebuild the model with new features.

Now that we have a good idea of the life cycle of an ML project, let’s learn about some of the common challenges faced by ML developers when creating and deploying ML solutions.

The Definitive Guide to Google Vertex AI

By : Jasmeet Bhatia, Kartik Chaudhary
