Clojure for Data Science

By: Garner

Overview of this book

The term “data science” has been widely used to define this new profession that is expected to interpret vast datasets and translate them to improved decision-making and performance. Clojure is a powerful language that combines the interactivity of a scripting language with the speed of a compiled language. Together with its rich ecosystem of native libraries and an extremely simple and consistent functional approach to data manipulation, which maps closely to mathematical formulae, it is an ideal, practical, and flexible language to meet a data scientist’s diverse needs. Taking you on a journey from simple summary statistics to sophisticated machine learning algorithms, this book shows how the Clojure programming language can be used to derive insights from data. Data scientists often forge a novel path, and you’ll see how to make use of Clojure’s Java interoperability capabilities to access libraries such as Mahout and MLlib for which Clojure wrappers don’t yet exist. Even seasoned Clojure developers will develop a deeper appreciation for their language’s flexibility! You’ll learn how to apply statistical thinking to your own data and use Clojure to explore, analyze, and visualize it in a technically and statistically robust way. You can also use Incanter for local data processing and ClojureScript to present interactive visualisations, and understand how distributed platforms such as Hadoop and Spark’s MapReduce and GraphX’s BSP solve the challenges of data analysis at scale, and how to explain algorithms using those programming models. Above all, by following the explanations in this book, you’ll learn not just how to be effective using the current state-of-the-art methods in data science, but why such methods work so that you can continue to be productive as the field evolves into the future.

Poincaré's baker

There's a story that, while almost certainly apocryphal, lets us look in more detail at how the central limit theorem helps us reason about the way distributions are formed. It concerns the celebrated nineteenth-century French polymath Henri Poincaré who, so the story goes, weighed his bread every day for a year.

Baking was a regulated profession, and Poincaré discovered that, while the weights of the bread followed a normal distribution, the peak was at 950g rather than the advertised 1kg. He reported his baker to the authorities and so the baker was fined.

The next year, Poincaré continued to weigh his bread from the same baker. He found the mean value was now 1kg, but that the distribution was no longer symmetrical around the mean. The distribution was skewed to the right, consistent with the baker giving Poincaré only the heaviest of his loaves. Poincaré reported his baker to the authorities once more and his baker was fined a second time.

Whether the story is true or not needn't concern us here; it's provided simply to illustrate a key point—the distribution of a sequence of numbers can tell us something important about the process that generated it.

Generating distributions

To develop our intuition about the normal distribution and variance, let's model an honest and a dishonest baker using Incanter's distribution functions. We can model the honest baker as a normal distribution with a mean of 1,000g, corresponding to a fair loaf of 1kg. We'll assume that variation in the baking process results in a standard deviation of 30g.

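;; Namespace aliases assumed by the examples in this section
(require '[incanter.charts :as c]
         '[incanter.core :as i]
         '[incanter.distributions :as d])

;; Lazy, infinite sequence of loaf weights drawn from a normal distribution
(defn honest-baker [mean sd]
  (let [distribution (d/normal-distribution mean sd)]
    (repeatedly #(d/draw distribution))))

;; Plot 10,000 loaves baked with a mean of 1,000g and a standard deviation of 30g
(defn ex-1-18 []
  (-> (take 10000 (honest-baker 1000 30))
      (c/histogram :x-label "Honest baker"
                   :nbins 25)
      (i/view)))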

The preceding code will provide an output similar to the following histogram:

[Figure: histogram of loaf weights for the honest baker]
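
As a rough numerical check on the honest baker, the sketch below draws a comparable sample and confirms that its mean and standard deviation sit near 1,000g and 30g. It is an illustration under stated assumptions rather than the book's own code: it uses plain Clojure on the JVM, with `java.util.Random` standing in for `d/normal-distribution`, and the names `gaussian-loaves`, `mean`, `standard-deviation`, and `honest-sample`, as well as the seed 42, are hypothetical choices made for this example.

```clojure
(import 'java.util.Random)

;; Hypothetical stand-in for honest-baker: a lazy, infinite sequence of
;; weights drawn from a normal distribution via java.util.Random.
(defn gaussian-loaves [^Random rng mu sigma]
  (repeatedly #(+ mu (* sigma (.nextGaussian rng)))))

(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

;; Sample standard deviation (n - 1 in the denominator).
(defn standard-deviation [xs]
  (let [m  (mean xs)
        ss (reduce + (map #(let [d (- % m)] (* d d)) xs))]
    (Math/sqrt (/ ss (dec (count xs))))))

;; A fixed seed (an arbitrary choice) keeps the run repeatable.
(def honest-sample
  (doall (take 10000 (gaussian-loaves (Random. 42) 1000 30))))

(println "mean:" (mean honest-sample))
(println "sd:  " (standard-deviation honest-sample))
```

With 10,000 draws, both figures should land within a gram or so of the parameters passed in; the exact values depend on the seed.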

Now, let's model a baker who sells only the heaviest of his loaves. We partition the sequence into groups of thirteen (a "baker's dozen") and pick the maximum value:

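;; Lazy, infinite sequence holding only the heaviest loaf of every thirteen baked
(defn dishonest-baker [mean sd]
  (let [distribution (d/normal-distribution mean sd)]
    (->> (repeatedly #(d/draw distribution))
         (partition 13)
         (map (partial apply max)))))

;; The loaves are baked with a mean of 950g; only the selected ones are plotted
(defn ex-1-19 []
  (-> (take 10000 (dishonest-baker 950 30))
      (c/histogram :x-label "Dishonest baker"
                   :nbins 25)
      (i/view)))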

The preceding code will produce a histogram similar to the following:

[Figure: histogram of loaf weights for the dishonest baker]

It should be apparent that this histogram does not look quite like the others we have seen. The mean value is still close to 1kg, even though the loaves are baked with a mean of 950g, because only the heaviest loaf of each thirteen is sold. The spread of values around the mean, however, is no longer symmetrical. We say that this histogram indicates a skewed normal distribution.
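
One way to put a number on the asymmetry is the sample skewness, which is close to zero for a symmetric sample and positive when the right tail is longer. The sketch below is a minimal illustration under stated assumptions, not the book's own code: it uses `java.util.Random` in place of Incanter's distributions, and the helpers `loaves`, `heaviest-of`, `mean`, and `skewness`, along with the seeds, are hypothetical names chosen for this example. It first shows the partition-and-maximum step on a small deterministic sequence, then compares an honest and a dishonest sample.

```clojure
(import 'java.util.Random)

;; partition and (partial apply max) on a deterministic sequence:
;; two groups of thirteen, keeping the largest value of each.
(println (->> (range 1 27)
              (partition 13)
              (map (partial apply max))))   ; prints (13 26)

;; Hypothetical stand-in for drawing from a normal distribution.
(defn loaves [^Random rng mu sigma]
  (repeatedly #(+ mu (* sigma (.nextGaussian rng)))))

;; Same selection rule as dishonest-baker: the heaviest loaf of each group of n.
(defn heaviest-of [n xs]
  (map (partial apply max) (partition n xs)))

(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

;; Sample skewness: third central moment divided by the cubed
;; (population) standard deviation.
(defn skewness [xs]
  (let [m  (mean xs)
        ds (map #(- % m) xs)
        m2 (mean (map #(* % %) ds))
        m3 (mean (map #(* % % %) ds))]
    (/ m3 (Math/pow m2 1.5))))

;; Fixed seeds (arbitrary choices) keep the run repeatable.
(def honest
  (doall (take 10000 (loaves (Random. 1) 1000 30))))

(def dishonest
  (doall (take 10000 (heaviest-of 13 (loaves (Random. 2) 950 30)))))

(println "honest    mean:" (mean honest) "skewness:" (skewness honest))
(println "dishonest mean:" (mean dishonest) "skewness:" (skewness dishonest))
```

With samples of this size, the honest skewness should land near zero and the dishonest one clearly above it, while both means stay close to 1,000g. The exact figures depend on the seeds.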
