
Machine Learning with Amazon SageMaker Cookbook
In this recipe, we will write a custom train script in R that allows us to inspect the input and configuration parameters set by Amazon SageMaker during the training process. The following diagram shows the train script inside the custom container, which makes use of the hyperparameters, input data, and configuration specified in the Estimator instance using the SageMaker Python SDK and the reticulate package:
Figure 2.79 – The R train script inside the custom container makes use of the input parameters, configuration, and data to train and output a model
There are several options when running a training job – use a built-in algorithm, use a custom train script with a custom Docker image, or use a custom train script with a prebuilt Docker image. In this recipe, we will focus on the second option, where we will prepare and test a bare-minimum training script in R that builds a linear model for a specific regression problem.
Once we have finished working on this recipe, we will have a better understanding of how SageMaker works behind the scenes. We will see where and how to load and use the configuration and arguments we have specified in the SageMaker Python SDK Estimator instance.
Important note
Later on, you will notice a few similarities between the Python and R recipes in this chapter. What is critical here is noticing and identifying both major and subtle differences in certain parts of the Python and R recipes. For example, when working with the serve script in this chapter, we will be dealing with two files in R (api.r and serve) instead of one in Python (serve). As we will see in the other recipes of this book, working on the R recipes will help us have a better understanding of the internals of SageMaker's capabilities, as there is a big chance that we will have to prepare custom solutions to solve certain requirements. As we get exposed to more machine learning requirements, we will find that there are packages in R for machine learning without direct counterparts in Python. That said, we must be familiar with how to get custom R algorithm code working in SageMaker. Stay tuned for more!
Make sure you have completed the Setting up the Python and R experimentation environments recipe.
The first set of steps in this recipe focuses on preparing the train script. Let's get started:
Inside the ml-r directory, double-click the train file to open it inside the Editor pane:

Figure 2.80 – Empty ml-r/train file

Here, we have an empty train file. In the lower-right-hand corner of the Editor pane, you can change the syntax highlighting setting to R.
Update the train script in order to import the required packages and libraries:

#!/usr/bin/Rscript
library("rjson")
Define the prepare_paths() function, which we will use to initialize the PATHS variable. This will help us manage the paths of the primary files and directories used in the script:

prepare_paths <- function() {
    keys <- c('hyperparameters', 'input', 'data', 'model')
    values <- c('input/config/hyperparameters.json',
                'input/config/inputdataconfig.json',
                'input/data/',
                'model/')

    paths <- as.list(values)
    names(paths) <- keys

    return(paths)
}

PATHS <- prepare_paths()
This function allows us to initialize the PATHS variable with a dictionary-like data structure (a named list) where we can look up the paths of the required files and directories.
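As a quick standalone illustration of the resulting structure, here is how a named list built this way behaves (the keys and values match the ones defined in the recipe):

```r
# Build the same named list that prepare_paths() returns
keys <- c('hyperparameters', 'input', 'data', 'model')
values <- c('input/config/hyperparameters.json',
            'input/config/inputdataconfig.json',
            'input/data/',
            'model/')

paths <- as.list(values)
names(paths) <- keys

# Entries can now be looked up by key, similar to a Python dictionary
print(paths[['hyperparameters']])  # "input/config/hyperparameters.json"
print(paths[['model']])            # "model/"
```

This is why the script can later refer to files by short keys instead of repeating long paths.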
Define the get_path() function, which makes use of the PATHS variable from the previous step:

get_path <- function(key) {
    output <- paste('/opt/ml/', PATHS[[key]], sep="")
    return(output)
}
When referring to the location of a specific file, such as hyperparameters.json, we will use get_path('hyperparameters') instead of the absolute path.
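The following self-contained sketch (re-stating the definitions from the previous steps so it can run on its own) shows what each key resolves to:

```r
# Same structure as the PATHS variable from the recipe
PATHS <- list(hyperparameters = 'input/config/hyperparameters.json',
              input = 'input/config/inputdataconfig.json',
              data = 'input/data/',
              model = 'model/')

# get_path() simply prepends the /opt/ml/ base directory
get_path <- function(key) {
    output <- paste('/opt/ml/', PATHS[[key]], sep="")
    return(output)
}

print(get_path('hyperparameters'))  # "/opt/ml/input/config/hyperparameters.json"
print(get_path('data'))             # "/opt/ml/input/data/"
print(get_path('model'))            # "/opt/ml/model/"
```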
Define the load_json() and print_json() functions after the get_path() function definition from the previous step. These functions will be used to load and print the contents of the JSON files we will work with later:

load_json <- function(target_file) {
    result <- fromJSON(file = target_file)
    return(result)
}

print_json <- function(target_json) {
    print(target_json)
}
Define the inspect_hyperparameters() and list_dir_contents() functions after the print_json() function definition:

inspect_hyperparameters <- function() {
    hyperparameters_json_path <- get_path('hyperparameters')
    print(hyperparameters_json_path)

    hyperparameters <- load_json(hyperparameters_json_path)
    print(hyperparameters)
}

list_dir_contents <- function(target_path) {
    print(list.files(target_path))
}
The inspect_hyperparameters() function inspects the contents of the hyperparameters.json file inside the /opt/ml/input/config directory. The list_dir_contents() function, on the other hand, displays the contents of a target directory.
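To see list.files() (the base R function that list_dir_contents() wraps) in action, here is a small sketch using a temporary directory instead of the SageMaker paths:

```r
# Create a throwaway directory with one file in it
target_path <- file.path(tempdir(), "demo_dir")
dir.create(target_path)
file.create(file.path(target_path, "sample.csv"))

# list.files() returns the file names in a directory as a character vector
contents <- list.files(target_path)
print(contents)  # "sample.csv"
```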
Define the inspect_input() function. It will help us inspect the contents of inputdataconfig.json inside the /opt/ml/input/config directory:

inspect_input <- function() {
    input_config_json_path <- get_path('input')
    print(input_config_json_path)

    input_config <- load_json(input_config_json_path)
    print_json(input_config)

    for (key in names(input_config)) {
        print(key)
        input_data_dir <- paste(get_path('data'), key, '/', sep="")
        print(input_data_dir)
        list_dir_contents(input_data_dir)
    }
}
This will be used to list the contents of the training input directory inside the main() function later.
Define the load_training_data() function:

load_training_data <- function(input_data_dir) {
    print('[load_training_data]')
    files <- list_dir_contents(input_data_dir)
    training_data_path <- paste0(input_data_dir, files[[1]])
    print(training_data_path)

    df <- read.csv(training_data_path, header=FALSE)
    colnames(df) <- c("y", "X")
    print(df)

    return(df)
}
This function can be divided into two parts – preparing the path pointing to the CSV file containing the training data, and reading the contents of the CSV file using the read.csv() function. The return value of this function is an R data frame (a two-dimensional table-like structure).
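To make those two parts concrete, here is a standalone sketch that mimics what load_training_data() does, using a temporary CSV file with made-up numbers (the real training file would contain the monthly_salary and management_experience_months values from the training channel):

```r
# Write a tiny headerless CSV, similar in shape to the training data
training_data_path <- tempfile(fileext = ".csv")
writeLines(c("640,10", "1000,20", "1520,35"), training_data_path)

# Read it the same way the recipe does: no header row, then name the columns
df <- read.csv(training_data_path, header=FALSE)
colnames(df) <- c("y", "X")

print(df)
print(class(df))  # "data.frame"
```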
Define the get_input_data_dir() function:

get_input_data_dir <- function() {
    print('[get_input_data_dir]')
    key <- 'train'
    input_data_dir <- paste0(get_path('data'), key, '/')
    return(input_data_dir)
}
Define the train_model() function:

train_model <- function(data) {
    model <- lm(y ~ X, data=data)
    print(summary(model))
    return(model)
}
This function makes use of the lm() function to fit and prepare linear models, which can then be used for regression tasks. It accepts a formula such as y ~ X as the first parameter value and the training dataset as the second parameter value.
Note
Formulas in R involve a tilde symbol (~) and one or more independent variables to the right of the tilde (~), such as X1 + X2 + X3. In the example in this recipe, we only have one variable on the right-hand side of the tilde (~), meaning that we will only have a single predictor variable for this model. On the left-hand side of the tilde (~) is the dependent variable that we are trying to predict using the predictor variable(s). That said, the y ~ X formula simply expresses a relationship between the predictor variable, X, and the y variable we are trying to predict. Since we are dealing with the same dataset as we did for the recipes in Chapter 1, Getting Started with Machine Learning Using Amazon SageMaker, the y variable here maps to monthly_salary, while X maps to management_experience_months.
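The following standalone sketch shows the y ~ X formula in action with lm(), using toy numbers (not the actual dataset from Chapter 1):

```r
# Hypothetical training data: X as the single predictor, y as the target.
# The points lie on (roughly) the line y = 437 + 31.5 * X.
data <- data.frame(X = c(10, 20, 30, 40, 50),
                   y = c(750, 1070, 1380, 1700, 2010))

# Fit a linear model using the same formula as the recipe
model <- lm(y ~ X, data=data)

# The fitted coefficients describe the line: intercept and slope
print(coef(model))                        # (Intercept) 437, X 31.5

# The fitted model can then be used for predictions
print(predict(model, data.frame(X = 25)))  # 1224.5
```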
Define the save_model() function:

save_model <- function(model) {
    print('[save_model]')
    filename <- paste0(get_path('model'), 'model')
    print(filename)

    saveRDS(model, file=filename)
    print('Model Saved!')
}
Here, we make use of the saveRDS() function, which accepts an R object and writes it to a file. In this case, we accept a trained model object and save it inside the /opt/ml/model directory.
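saveRDS() has a matching readRDS() function, which is how another R process (such as a serve script) can load the model back later. Here is a minimal round trip, using a temporary file instead of the /opt/ml/model directory:

```r
# Train a throwaway model on toy data where y = 2 * X exactly
model <- lm(y ~ X, data=data.frame(X=1:5, y=c(2, 4, 6, 8, 10)))

# Serialize the model object to a file, as save_model() does
filename <- tempfile()
saveRDS(model, file=filename)

# Load it back and confirm it still predicts
loaded <- readRDS(filename)
prediction <- predict(loaded, data.frame(X=6))
print(prediction)  # 12 for this toy data
```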
Define the main() function, as shown here. This function triggers the functions defined in the previous steps:

main <- function() {
    inspect_hyperparameters()
    inspect_input()

    input_data_dir <- get_input_data_dir()
    print(input_data_dir)

    data <- load_training_data(input_data_dir)
    model <- train_model(data)
    save_model(model)
}
This main() function can be divided into four parts – inspecting the hyperparameters and the input, loading the training data, training the model using the train_model() function, and saving the model using the save_model() function.
Finally, call the main() function at the end of the script:

main()
Tip
You can access a working copy of the train file in the Machine Learning with Amazon SageMaker Cookbook GitHub repository: https://github.com/PacktPublishing/Machine-Learning-with-Amazon-SageMaker-Cookbook/blob/master/Chapter02/ml-r/train.
Now that we are done with the train script, we will use the Terminal to perform the last set of steps in this recipe. This last set of steps focuses on installing a few script prerequisites.
Figure 2.81 – New Terminal
Here, we can see how to create a new Terminal tab. We simply click the plus (+) button and then choose New Terminal.
Run the following command in the Terminal to check the R version installed in the environment:

R --version

Running this command should return a similar set of results to what is shown here:
Figure 2.82 – Result of the R --version command in the Terminal
Here, we can see that our environment is using R version 3.4.4.
Install the rjson package:

sudo R -e "install.packages('rjson', repos='https://cloud.r-project.org')"
The rjson package provides the utilities for handling JSON data in R.
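As a quick illustration of what rjson gives us, here is a sketch that parses a JSON string shaped like a hypothetical hyperparameters.json file (the keys and values are made up; note that SageMaker passes hyperparameter values as strings):

```r
library("rjson")

# Parse a JSON string into an R named list
json_string <- '{"alpha": "0.5", "epochs": "10"}'
hyperparameters <- fromJSON(json_string)

print(hyperparameters[['alpha']])  # "0.5"
print(names(hyperparameters))      # "alpha" "epochs"
```

The train script uses the same fromJSON() function with the file argument instead, pointing at the JSON files under /opt/ml/input/config.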
Make the train script executable and then run the train script:

cd /home/ubuntu/environment/opt/ml-r
chmod +x train
./train
Running the previous lines of code will yield results similar to what is shown here:
Figure 2.83 – R train script output
Here, we can see the logs that were produced by the train script. Once the train script has been successfully executed, we expect the model files to be stored inside the /opt/ml/model directory.
At this point, we have finished preparing and testing the train script. Now, let's see how this works!
The train script in this recipe demonstrates how the input and output values are passed around between the SageMaker API and the custom container. It also performs a fairly straightforward set of steps to train a linear model using the training data provided.
When you are required to work on a more realistic example, the train script will do the following:

- Load the environment variables set by SageMaker automatically, such as TRAINING_JOB_NAME and TRAINING_JOB_ARN, using the Sys.getenv() function in R.
- Load the hyperparameters from the hyperparameters.json file using the fromJSON() function.
- Load the input data configuration from the inputdataconfig.json file using the fromJSON() function. This file contains the properties of each of the input data channels, such as the file type and usage of the file or pipe mode.
- Load the training data from the /opt/ml/input/data directory. Take note that there's a parent directory named after the input data channel in the path before the actual files themselves. An example of this would be /opt/ml/input/data/<channel name>/<filename>.
- Save the model in the /opt/ml/model directory:

saveRDS(model, file="/opt/ml/model/model.RDS")
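As a quick illustration of the first of those items, reading the environment variables with Sys.getenv() might look like the following sketch. Inside a real SageMaker training container, these variables are set automatically; outside of one, the unset default is returned instead:

```r
# Sys.getenv() returns the supplied default when a variable is not set,
# which is what happens when this runs outside a training container
job_name <- Sys.getenv("TRAINING_JOB_NAME", unset = "unknown")
job_arn  <- Sys.getenv("TRAINING_JOB_ARN",  unset = "unknown")

print(job_name)
print(job_arn)
```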
Now that we have finished preparing the train script in R, let's quickly discuss some possible solutions we can prepare using what we learned in this recipe.
It is important to note that we are free to use any algorithm in the train script to train our model. This level of flexibility gives us an edge once we need to work on more complex examples. Here is a quick example of what the train function may look like if the neuralnet R package is used in the train script:

train <- function(df.training_data, hidden_layers=4) {
    model <- neuralnet(
        label ~ .,
        df.training_data,
        hidden=c(hidden_layers, 1),
        linear.output=FALSE,
        threshold=0.02,
        stepmax=1000000,
        act.fct="logistic")

    return(model)
}
In this example, we allow the number of hidden layers to be set while we are configuring the Estimator object using the set_hyperparameters() function. The following example shows how to implement a train function to prepare a time series forecasting model in R:
train <- function(data) {
    model <- snaive(data)
    print(summary(model))
    return(model)
}
Here, we simply used the snaive() function from the forecast package to prepare the model. Of course, we are free to use other functions as well, such as ets() and auto.arima() from the forecast package.
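As a sketch of that idea (assuming the forecast package is installed in the container), swapping in ets() only changes the model-fitting line. The monthly series below is made up purely for illustration:

```r
library(forecast)

train <- function(data) {
    # ets() fits an exponential smoothing state space model to the series
    model <- ets(data)
    print(summary(model))
    return(model)
}

# Hypothetical monthly time series used only for illustration
series <- ts(c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119),
             frequency = 12)

model <- train(series)
print(forecast(model, h = 3))  # forecast the next three periods
```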