As always, we start by importing the libraries we need for this project. Next, we load the data from our Spark data table into a pandas DataFrame. We then apply one-hot encoding, which converts categorical values, such as our example of Wine and No Wine, into numeric indicator values that machine learning algorithms can work with directly. In step 4, we split our feature columns and our one-hot encoded column into training and testing sets. In step 5, we create a decision tree classifier, train it on the X_train and y_train data, and then use the X_test data to generate predictions. In other words, we end up with a set of predictions called y_pred, which holds the model's predictions for the X_test set. In step 6, we evaluate the accuracy of the model and the area under the ROC curve (AUC).
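The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the recipe's actual code: the synthetic DataFrame stands in for the Spark table, and the column names (alcohol, acidity, label), the labelling rule, the 70/30 split, and the random seeds are all hypothetical. The AUC here is computed from predicted probabilities, which may differ from how the recipe computes it.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the Spark table. In a Spark notebook, calling
# .toPandas() on a Spark DataFrame would produce the pandas DataFrame instead.
rng = np.random.default_rng(42)
n_rows = 200
df = pd.DataFrame({
    "alcohol": rng.normal(10.0, 1.5, n_rows),
    "acidity": rng.normal(3.2, 0.3, n_rows),
})
# Made-up labelling rule, used only so the example has something to learn.
df["label"] = np.where(df["alcohol"] > 10.0, "Wine", "No Wine")

# One-hot encode the categorical column and keep one 0/1 indicator as the target.
encoded = pd.get_dummies(df["label"], dtype=int)
X = df[["alcohol", "acidity"]]
y = encoded["Wine"]

# Split the features and the encoded target into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train the decision tree, then predict on the held-out features.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Evaluate accuracy and the area under the ROC curve.
accuracy = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"accuracy={accuracy:.3f}, auc={auc:.3f}")
```

Note that y_pred contains one prediction per row of X_test, so it can be compared element by element with y_test.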
Decision tree classifiers are often used when the relationship between the features and the label is complex. A decision tree classifies a record by following a set of logical rules expressed as yes/no questions...
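To make the yes/no idea concrete, here is a small hand-written tree in plain Python. It is only a sketch: the Question class, the classify function, the feature names, and the thresholds are all hypothetical, and a trained classifier would learn its own questions from the data rather than have them written by hand.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Question:
    """One yes/no question: is row[feature] greater than threshold?"""
    feature: str
    threshold: float
    yes: Union["Question", str]  # next question, or a final label
    no: Union["Question", str]


def classify(node, row):
    """Follow yes/no answers from the root until a label (a string) is reached."""
    while isinstance(node, Question):
        node = node.yes if row[node.feature] > node.threshold else node.no
    return node


# Hypothetical rules, chosen only for illustration.
tree = Question(
    "alcohol", 10.0,
    yes=Question("acidity", 3.5, yes="No Wine", no="Wine"),
    no="No Wine",
)

print(classify(tree, {"alcohol": 11.0, "acidity": 3.0}))
```

Each record answers at most two questions here; deeper trees simply chain more questions before reaching a label.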