The dataset, along with the Databricks notebooks, is available in the GitHub repository. The dataset is unwieldy. Some of its columns are highly correlated with one another, which is another way of saying that some sensors are duplicates, and it also contains unused columns and extraneous data. For the sake of readability, the GitHub repository contains two notebooks. The first does all of the data manipulation and puts the data into a data table. The second does the machine learning. This recipe focuses on the data manipulation notebook. At the end of the recipe, we will talk about two other notebooks to show an example of MLflow.
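To make the idea of duplicate sensors concrete, here is a minimal sketch of one common way to find and drop highly correlated columns. It is not taken from the recipe's notebooks: it assumes the data fits in a pandas DataFrame, and the function name `drop_correlated_columns`, the 0.95 threshold, and the `sensor_*` column names are all hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd


def drop_correlated_columns(df, threshold=0.95):
    """Drop one column from every pair whose absolute correlation exceeds threshold.

    The threshold is an illustrative assumption, not a value from the recipe.
    """
    # Absolute Pearson correlation between the numeric columns only.
    corr = df.select_dtypes(include="number").corr().abs()

    # Keep only the upper triangle (above the diagonal) so each pair is
    # inspected once and a column is never compared with itself.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

    # A column is dropped if it correlates strongly with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Hypothetical sensor readings: sensor_b is a linear copy of sensor_a.
readings = pd.DataFrame(
    {
        "sensor_a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "sensor_b": [3.0, 5.0, 7.0, 9.0, 11.0],
        "sensor_c": [5.0, 1.0, 4.0, 2.0, 3.0],
    }
)
reduced, dropped = drop_correlated_columns(readings)
print(dropped)
```

Because only the upper triangle is checked, the first column of each correlated pair is kept and the later one is dropped, so the result depends on column order.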
You will also need an MLflow workspace for this recipe. To set one up, go into Databricks and create the workspace for this experiment. We will write the results of our experiment there.