Sign In Start Free Trial
Account

Add to playlist

Create a Playlist

Modal Close icon
You need to login to use this feature.
  • Clojure for Data Science
  • Toc
  • feedback
Clojure for Data Science

Clojure for Data Science

By : Garner
5 (4)
close
Clojure for Data Science

Clojure for Data Science

5 (4)
By: Garner

Overview of this book

The term “data science” has been widely used to define this new profession that is expected to interpret vast datasets and translate them to improved decision-making and performance. Clojure is a powerful language that combines the interactivity of a scripting language with the speed of a compiled language. Together with its rich ecosystem of native libraries and an extremely simple and consistent functional approach to data manipulation, which maps closely to mathematical formula, it is an ideal, practical, and flexible language to meet a data scientist’s diverse needs. Taking you on a journey from simple summary statistics to sophisticated machine learning algorithms, this book shows how the Clojure programming language can be used to derive insights from data. Data scientists often forge a novel path, and you’ll see how to make use of Clojure’s Java interoperability capabilities to access libraries such as Mahout and Mllib for which Clojure wrappers don’t yet exist. Even seasoned Clojure developers will develop a deeper appreciation for their language’s flexibility! You’ll learn how to apply statistical thinking to your own data and use Clojure to explore, analyze, and visualize it in a technically and statistically robust way. You can also use Incanter for local data processing and ClojureScript to present interactive visualisations and understand how distributed platforms such as Hadoop sand Spark’s MapReduce and GraphX’s BSP solve the challenges of data analysis at scale, and how to explain algorithms using those programming models. Above all, by following the explanations in this book, you’ll learn not just how to be effective using the current state-of-the-art methods in data science, but why such methods work so that you can continue to be productive as the field evolves into the future.
Table of Contents (12 chapters)
close
11
Index

Descriptive statistics

Descriptive statistics are numbers that are used to summarize and describe data. In the next chapter, we'll turn our attention to a more sophisticated analysis, the so-called inferential statistics, but for now we'll limit ourselves to simply describing what we can observe about the data contained in the file.

To demonstrate what we mean, let's look at the Electorate column of the data. This column lists the total number of registered voters in each constituency:

(defn ex-1-6 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (count)))

;; 650

We've filtered the nil field from the dataset; the preceding code should return a list of 650 numbers corresponding to the electorate in each of the UK constituencies.

Descriptive statistics, also called summary statistics, are ways of measuring attributes of sequences of numbers. They help characterize the sequence and can act as a guide for further analysis. Let's start by calculating the two most basic statistics that we can from a sequence of numbers—its mean and its variance.

The mean

The most common way of measuring the average of a data set is with the mean. It's actually one of several ways of measuring the central tendency of the data. The mean, or more precisely, the arithmetic mean, is a straightforward calculation—simply add up the values and divide by the count—but in spite of this it has a somewhat intimidating mathematical notation:

The mean

where The mean is pronounced x-bar, the mathematical symbol often used to denote the mean.

To programmers coming to data science from fields outside mathematics or the sciences, this notation can be quite confusing and alienating. Others may be entirely comfortable with this notation, and they can safely skip the next section.

Interpreting mathematical notation

Although mathematical notation may appear obscure and upsetting, there are really only a handful of symbols that will occur frequently in the formulae in this book.

Σ is pronounced sigma and means sum. When you see it in mathematical notation it means that a sequence is being added up. The symbols above and below the sigma indicate the range over which we'll be summing. They're rather like a C-style for loop and in the earlier formula indicate we'll be summing from i=1 up to i=n. By convention n is the length of the sequence, and sequences in mathematical notation are one-indexed, not zero-indexed, so summing from 1 to n means that we're summing over the entire length of the sequence.

The expression immediately following the sigma is the sequence to be summed. In our preceding formula for the mean, xi immediately follows the sigma. Since i will represent each index from 1 up to n, xi represents each element in the sequence of xs.

Finally, Interpreting mathematical notation appears just before the sigma, indicating that the entire expression should be multiplied by 1 divided by n (also called the reciprocal of n). This can be simplified to just dividing by n.

Name

Mathematical symbol

Clojure equivalent

 

n

(count xs)

Sigma notation

Interpreting mathematical notation

(reduce + xs)

Pi notation

Interpreting mathematical notation

(reduce * xs)

Putting this all together, we get "add up the elements in the sequence from the first to the last and divide by the count". In Clojure, this can be written as:

(defn mean [xs]
  (/ (reduce + xs)
     (count xs)))

Where xs stands for "the sequence of xs". We can use our new mean function to calculate the mean of the UK electorate:

(defn ex-1-7 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (mean)))

;; 70149.94

In fact, Incanter already includes a function, mean, to calculate the mean of a sequence very efficiently in the incanter.stats namespace. In this chapter (and throughout the book), the incanter.stats namespace will be required as s wherever it's used.

The median

The median is another common descriptive statistic for measuring the central tendency of a sequence. If you ordered all the data from lowest to highest, the median is the middle value. If there is an even number of data points in the sequence, the median is usually defined as the mean of the middle two values.

The median is often represented in formulae by The median, pronounced x-tilde. It's one of the deficiencies of mathematical notation that there's no particularly standard way of expressing the formula for the median value, but nonetheless it's fairly straightforward in Clojure:

(defn median [xs]
  (let [n   (count xs)
        mid (int (/ n 2))]
    (if (odd? n)
      (nth (sort xs) mid)
      (->> (sort xs)
           (drop (dec mid))
           (take 2)
           (mean)))))

The median of the UK electorate is:

(defn ex-1-8 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (median)))

;; 70813.5

Incanter also has a function for calculating the median value as s/median.

bookmark search playlist font-size

Change the font size

margin-width

Change margin width

day-mode

Change background colour

Close icon Search
Country selected

Close icon Your notes and bookmarks

Delete Bookmark

Modal Close icon
Are you sure you want to delete it?
Cancel
Yes, Delete