Sign In Start Free Trial
Account

Add to playlist

Create a Playlist

Modal Close icon
You need to login to use this feature.
  • Mastering Machine Learning with R
  • Toc
  • feedback
Mastering Machine Learning with R

Mastering Machine Learning with R

By : Lesmeister
1.3 (3)
close
Mastering Machine Learning with R

Mastering Machine Learning with R

1.3 (3)
By: Lesmeister

Overview of this book

Given the growing popularity of the R-zerocost statistical programming environment, there has never been a better time to start applying ML to your data. This book will teach you advanced techniques in ML ,using? the latest code in R 3.5. You will delve into various complex features of supervised learning, unsupervised learning, and reinforcement learning algorithms to design efficient and powerful ML models. This newly updated edition is packed with fresh examples covering a range of tasks from different domains. Mastering Machine Learning with R starts by showing you how to quickly manipulate data and prepare it for analysis. You will explore simple and complex models and understand how to compare them. You’ll also learn to use the latest library support, such as TensorFlow and Keras-R, for performing advanced computations. Additionally, you’ll explore complex topics, such as natural language processing (NLP), time series analysis, and clustering, which will further refine your skills in developing applications. Each chapter will help you implement advanced ML algorithms using real-world examples. You’ll even be introduced to reinforcement learning, along with its various use cases and models. In the concluding chapters, you’ll get a glimpse into how some of these blackbox models can be diagnosed and understood. By the end of this book, you’ll be equipped with the skills to deploy ML techniques in your own projects or at work.
Table of Contents (16 chapters)
close

Handling duplicate observations

The easiest way to get started is to use the base R duplicated() function to create a vector of logical values that match the data observations. These values will consist of either TRUE or FALSE where TRUE indicates a duplicate. Then, we'll create a table of those values and their counts and identify which of the rows are dupes:

dupes <- duplicated(gettysburg)

table(dupes)
dupes
FALSE TRUE
587 3

which(dupes == "TRUE")
[1] 588 589
If you want to see the actual rows and even put them into a tibble dataframe, the janitor package has the get_dupes() function. The code for that would be simply: df_dupes <- janitor::get_dupes(gettysburg).

To rid ourselves of these duplicate rows, we put the distinct() function for the dplyr package to good use, specifying .keep_all = TRUE to make sure we return all of the features into the new tibble. Note that .keep_all defaults to FALSE:

gettysburg <- dplyr::distinct(gettysburg, .keep_all = TRUE)

Notice that, in the Global Environment, the tibble is now a dimension of 587 observations of 26 variables/features.

With the duplicate observations out of the way, it's time to start drilling down into the data and understand its structure a little better by exploring the descriptive statistics of the quantitative features.

Descriptive statistics

Traditionally, we could use the base R summary() function to identify some basic statistics. Now, and recently I might add, I like to use the package sjmisc and its descr() function. It produces a more readable output, and you can assign that output to a dataframe. What works well is to create that dataframe, save it as a .csv, and explore it at your leisure. It automatically selects numeric features only. It also fits well with tidyverse so that you can incorporate dplyr functions such as group_by() and filter(). Here's an example in our case where we examine the descriptive stats for the infantry of the Confederate Army. The output will consist of the following:

  • var: feature name
  • type: integer
  • n: number of observations
  • NA.prc: percent of missing values
  • mean
  • sd: standard deviation
  • se: standard error
  • md: median
  • trimmed: trimmed mean
  • range
  • skew
gettysburg %>%
dplyr::filter(army == "Confederate" & type == "Infantry") %>%
sjmisc::descr() -> descr_stats

readr::write_csv(descr_stats, 'descr_stats.csv')

The following is abbreviated output from the preceding code saved to a spreadsheet:

In this one table, we can discern some rather interesting tidbits. In particular is the percent of missing values per feature. If you modify the precious code to examine the Union Army, you'll find that there're no missing values. The reason the usurpers from the South had missing values is based on a couple of factors; either shoddy staff work in compiling the numbers on July 3rd or the records were lost over the years. Note that, for the number of men captured, if you remove the missing value, all other values are zero, so we could just replace the missing value with it. The Rebels did not report troops as captured, but rather as missing, in contrast with the Union.

Once you feel comfortable with the descriptive statistics, move on to exploring the categorical features in the next section.

Exploring categorical variables

When it comes to an understanding of your categorical variables, there're many different ways to go about it. We can easily use the base R table() function on a feature. If you just want to see how many distinct levels are in a feature, then dplyr works well. In this example, we examine type, which has three unique levels:

dplyr::count(gettysburg, dplyr::n_distinct(type))

The output of the preceding code is as follows:

# A tibble: 1 x 2
`dplyr::n_distinct(type)` n
<int> <int>
3 587

Let's now look at a way to explore all of the categorical features utilizing tidyverse principles. Doing it this way always allows you to save the tibble and examine the results in depth as needed. Here is a way of putting all categorical features into a separate tibble:

gettysburg_cat <-
gettysburg[, sapply(gettysburg, class) == 'character']

Using dplyr, you can now summarize all of the features and the number of distinct levels in each:

gettysburg_cat %>%
dplyr::summarise_all(dplyr::funs(dplyr::n_distinct(.)))

The output of the preceding code is as follows:

# A tibble: 1 x 9
type state regiment_or_battery brigade division corps army july1_Commander Cdr_casualty
<int> <int> <int> <int> <int> <int> <int> <int> <int>
3 30 275 124 38 14 2 586 6

Notice that there're 586 distinct values to july1_Commander. This means that two of the unit Commanders have the same rank and last name. We can also surmise that this feature will be of no value to any further analysis, but we'll deal with that issue in a couple of sections ahead.

Suppose we're interested in the number of observations for each of the levels for the Cdr_casualty feature. Yes, we could use table(), but how about producing the output as a tibble as discussed before? Give this code a try:

gettysburg_cat %>% 
dplyr::group_by(Cdr_casualty) %>%
dplyr::summarize(num_rows = n())

The output of the preceding code is as follows:

# A tibble: 6 x 2
Cdr_casualty num_rows
<chr> <int>
1 captured 6
2 killed 29
3 mortally wounded 24
4 no 405
5 wounded 104
6 wounded-captured 19

Speaking of tables, let's look at a tibble-friendly way of producing one using two features. This code takes the idea of comparing commander casualties by army:

gettysburg_cat %>%
janitor::tabyl(army, Cdr_casualty)

The output of the preceding code is as follows:

army   captured killed mortally wounded   no  wounded  wounded-captured
Confederate 2 15 13 165 44 17
Union 4 14 11 240 60 2

Explore the data on your own and, once you're comfortable with the categorical variables, let's tackle the issue of missing values.

bookmark search playlist font-size

Change the font size

margin-width

Change margin width

day-mode

Change background colour

Close icon Search
Country selected

Close icon Your notes and bookmarks

Delete Bookmark

Modal Close icon
Are you sure you want to delete it?
Cancel
Yes, Delete