Hands-On Ensemble Learning with R

By: Tattar
Overview of this book

Ensemble techniques are used for combining two or more similar or dissimilar machine learning algorithms to create a stronger model. Such a model delivers superior prediction power and can give your datasets a boost in accuracy. Hands-On Ensemble Learning with R begins with the important statistical resampling methods. You will then walk through the central trilogy of ensemble techniques – bagging, random forest, and boosting – then you'll learn how they can be used to provide greater accuracy on large datasets using popular R packages. You will learn how to combine model predictions using different machine learning algorithms to build ensemble models. In addition to this, you will explore how to improve the performance of your ensemble models. By the end of this book, you will have learned how machine learning algorithms can be combined to reduce common problems and build simple efficient ensemble models with the help of real-world examples.
Table of Contents (15 chapters)

12. What's Next?
A. Bibliography
Index

An ensemble purview

The caret R package is central to ensemble machine learning in R. It provides a unified framework through which different statistical and machine learning models can be combined into an ensemble. For the version of the package installed on the author's machine, caret provides access to the following models:

> library(caret)
> names(getModelInfo())
  [1] "ada"                 "AdaBag"              "AdaBoost.M1" 
  [4] "adaboost"            "amdai"               "ANFIS" 
  [7] "avNNet"              "awnb"                "awtan"        
     
[229] "vbmpRadial"          "vglmAdjCat"          "vglmContRatio"
[232] "vglmCumulative"      "widekernelpls"       "WM" 
[235] "wsrf"                "xgbLinear"           "xgbTree" 
[238] "xyf"               

Depending on your requirements, you can choose any combination of these 238 models, and the package authors keep the list updated. Note that caret does not itself contain all of these models; it is a platform that facilitates the ensembling of methods implemented in other packages. Consequently, if you choose a model such as ANFIS, which is implemented in the frbs package, and that package is not installed on your machine, caret will display a message on the terminal, as shown in the following snippet:

Figure 7: Caret providing a message to install the required R package

Key in the number 1 to continue; the package will be installed and loaded, and the program will resume. It is good to know the host of options available for ensemble methods. A brief illustration of stacking analytical models is provided here, and the details will unfold later in the book.
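To avoid the interactive prompt altogether, the required packages can be checked and installed up front. The helper below is a minimal base-R sketch; `ensure_installed` is our own illustrative name, not a caret function:

```r
# Minimal sketch: install any packages from `pkgs` that are not already
# available, so caret never has to prompt interactively.
# `ensure_installed` is an illustrative helper, not part of caret.
ensure_installed <- function(pkgs) {
  missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1),
                          quietly = TRUE)]
  if (length(missing) > 0) install.packages(missing)
  invisible(missing)
}
# ensure_installed("frbs")  # frbs is the package backing the ANFIS model
```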

For the Hypothyroid dataset, the five models achieved a high average accuracy of about 98%. The Waveform dataset saw an average accuracy of approximately 88%, while the average for the German Credit data was 75%. We will try to increase the accuracy for this last dataset. The improvement will be attempted using three models: naïve Bayes, logistic regression, and a classification tree. First, we partition the data into three parts: train, test, and stack:

> load("../Data/GC2.RData")
> set.seed(12345)
> Train_Test_Stack <- sample(c("Train","Test","Stack"),nrow(GC2),replace = TRUE,prob = c(0.5,0.25,0.25))
> GC2_Train <- GC2[Train_Test_Stack=="Train",]
> GC2_Test <- GC2[Train_Test_Stack=="Test",]
> GC2_Stack <- GC2[Train_Test_Stack=="Stack",]

The dependent and independent variables are marked next in character vectors for programming convenience.

> # set label name and Exhogenous
> Endogenous <- 'good_bad'
> Exhogenous <- names(GC2_Train)[names(GC2_Train) != Endogenous]
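As a quick sanity check, sampling with `prob = c(0.5, 0.25, 0.25)` should send roughly half the observations to the training block and about a quarter each to the test and stack blocks. The snippet below is a standalone illustration with 1,000 rows rather than the GC2 data:

```r
# Standalone illustration: the three-way split should come out near 50/25/25.
set.seed(12345)
split <- sample(c("Train", "Test", "Stack"), 1000, replace = TRUE,
                prob = c(0.5, 0.25, 0.25))
round(prop.table(table(split)), 2)
```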

The model will be built on the training data first and accuracy will be assessed using the metric of Area Under Curve, the curve being the ROC. The control parameters will be set up first and the three models, naïve Bayes, classification tree, and logistic regression, will be created using the training dataset:

> # Creating a caret control object for the number of 
> # cross-validations to be performed
> myControl <- trainControl(method='cv', number=3, returnResamp='none')
> # train all the ensemble models with GC2_Train
> model_NB <- train(GC2_Train[,Exhogenous], GC2_Train[,Endogenous], 
+                    method='naive_bayes', trControl=myControl)
> model_rpart <- train(GC2_Train[,Exhogenous], GC2_Train[,Endogenous], 
+                      method='rpart', trControl=myControl)
> model_glm <- train(GC2_Train[,Exhogenous], GC2_Train[,Endogenous], 
+                        method='glm', trControl=myControl)

Predictions for the test and stack blocks are carried out next. We store the predicted probabilities as new columns in the test and stack data frames:

> # get predictions for each ensemble model for two last datasets
> # and add them back to themselves
> GC2_Test$NB_PROB <- predict(object=model_NB, GC2_Test[,Exhogenous],
+                              type="prob")[,1]
> GC2_Test$rf_PROB <- predict(object=model_rpart, GC2_Test[,Exhogenous],
+                             type="prob")[,1]
> GC2_Test$glm_PROB <- predict(object=model_glm, GC2_Test[,Exhogenous],
+                                  type="prob")[,1]
> GC2_Stack$NB_PROB <- predict(object=model_NB, GC2_Stack[,Exhogenous],
+                               type="prob")[,1]
> GC2_Stack$rf_PROB <- predict(object=model_rpart, GC2_Stack[,Exhogenous],
+                              type="prob")[,1]
> GC2_Stack$glm_PROB <- predict(object=model_glm, GC2_Stack[,Exhogenous],
+                                   type="prob")[,1]

The ROC curve is an important tool for model assessment: the higher the area under the curve (AUC), the better the model. Note that these measures, like any other, will not match those of the models fitted earlier, since the data has changed:

> # see how each individual model performed on its own
> library(pROC)
> AUC_NB <- roc(GC2_Test[,Endogenous], GC2_Test$NB_PROB )
> AUC_NB$auc
Area under the curve: 0.7543
> AUC_rf <- roc(GC2_Test[,Endogenous], GC2_Test$rf_PROB )
> AUC_rf$auc
Area under the curve: 0.6777
> AUC_glm <- roc(GC2_Test[,Endogenous], GC2_Test$glm_PROB )
> AUC_glm$auc
Area under the curve: 0.7446
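As an aside, the AUC that pROC reports admits a simple rank-based formula: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal base-R sketch, where `auc_rank` is our own illustrative name:

```r
# Rank-based AUC (Wilcoxon/Mann-Whitney form): the probability that a
# random positive outscores a random negative. `auc_rank` is illustrative.
auc_rank <- function(labels, scores) {
  r <- rank(scores)                      # mid-ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc_rank(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))  # one inversion: 0.75
```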

For the test dataset, the areas under the curve for naïve Bayes, the classification tree, and logistic regression are 0.7543, 0.6777, and 0.7446 respectively. If combining the predicted values in some way leads to an increase in accuracy, the purpose of the ensemble technique has been accomplished. We therefore take the predicted probabilities under the three models, already appended to the stacked data frame, and treat these three columns as new input variables. We then build a naïve Bayes model on the stacked data frame; this choice is arbitrary, and you can try any other model, not necessarily one of these three. The AUC on the test data can then be computed:

> # Stacking it together
> Exhogenous2 <- names(GC2_Stack)[names(GC2_Stack) != Endogenous]
> Stack_Model <- train(GC2_Stack[,Exhogenous2], GC2_Stack[,Endogenous], 
+                      method='naive_bayes', trControl=myControl)
> Stack_Prediction <- predict(object=Stack_Model,GC2_Test[,Exhogenous2],type="prob")[,1]
> Stack_AUC <- roc(GC2_Test[,Endogenous],Stack_Prediction)
> Stack_AUC$auc
Area under the curve: 0.7631

The AUC for the stacked model, 0.7631, is higher than that of any of the individual models, which is an improvement.
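The same three-block workflow can be sketched end to end in base R, independent of caret. The data here are synthetic and all names are illustrative, with glm standing in for both the base learners and the meta-learner:

```r
# Minimal base-R stacking sketch on synthetic data: two deliberately weak
# base models (each sees only one predictor) feed a glm meta-learner.
set.seed(1)
n <- 600
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(x1 + x2))
d <- data.frame(x1, x2, y)
idx <- sample(rep(c("Train", "Stack", "Test"), each = n / 3))
# Base models are fit on the Train block only
m1 <- glm(y ~ x1, binomial, d[idx == "Train", ])
m2 <- glm(y ~ x2, binomial, d[idx == "Train", ])
# Their predicted probabilities become the meta-learner's inputs
meta_feats <- function(rows) data.frame(
  p1 = predict(m1, rows, type = "response"),
  p2 = predict(m2, rows, type = "response"))
stack_df <- cbind(meta_feats(d[idx == "Stack", ]),
                  y = d[idx == "Stack", "y"])
meta <- glm(y ~ p1 + p2, binomial, stack_df)
test_prob <- predict(meta, meta_feats(d[idx == "Test", ]),
                     type = "response")
```

The Test block never touches either training stage, so `test_prob` gives an honest basis for assessing the stacked model.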

A host of questions should arise for the critical thinker. Why should this technique work? Will it lead to improvements in all possible cases? If yes, will simply adding new model predictions lead to further improvements? If not, how does one pick the base models so that we can be reasonably assured of an improvement? What are the restrictions on the choice of models? We will answer most of these questions throughout this book. In the next section, we will quickly look at some useful statistical tests that aid the assessment of model performance.
