Learning Apache Spark 2

By: Abbasi

Overview of this book

Apache Spark has seen unprecedented growth in adoption over the last few years, mainly because of its speed, versatility, and real-time data processing capabilities. It has quickly become the tool of choice for many big data professionals looking to find quick insights from large volumes of data. This book introduces you to the Apache Spark framework and familiarizes you with all the latest features and capabilities introduced in Spark 2. Starting with a detailed introduction to Spark's architecture and the installation procedure, this book covers everything you need to know about the Spark framework in the most practical manner. You will learn how to perform basic ETL activities using Spark and work with different components of Spark, such as Spark SQL and the Dataset and DataFrame APIs, to manipulate your data. You will then perform machine learning using Spark MLlib, and carry out streaming analytics and graph processing using the Spark Streaming and GraphX modules, respectively. The book also places special emphasis on deploying your Spark models and operating them in cluster mode. Over the course of the book, you will come across implementations of different real-world use cases and examples, giving you the hands-on knowledge you need to use Apache Spark in the best possible manner.

The Past

Before diving into present-day Spark, it is worth understanding the problems Spark was intended to solve, especially around data movement. Without knowing this background, we will not be able to predict the future.

"You have to learn the past to predict the future."

Late 1990s: The world was a much simpler place to live in, with proprietary databases being the sole choice for consumers. Data was growing at quite an amazing pace, and some of the biggest databases boasted of maintaining datasets in excess of a terabyte.

Early 2000s: The dotcom bubble happened, which meant companies started going online, with the likes of Amazon and eBay leading the revolution. Some of the dotcom start-ups failed, while others succeeded. The commonality among these business models was a razor-sharp focus on page views, and everything became focused on the number of users. A lot of marketing budget was spent on getting people online, which meant more customer behavior data in the form of weblogs. Since the de facto storage was an MPP database, and the value of such weblogs was unknown, more often than not these weblogs were stuffed into archive storage or deleted.

2002: In search of a better search engine, Doug Cutting and Mike Cafarella started work on an open source project called Nutch, the objective of which was to build a web-scale crawler. Web scale was defined as billions of web pages; Doug and Mike were able to index hundreds of millions of pages, running on a handful of nodes, but the system had a knack of falling over.

2004-2006: Google published papers on the Google File System (GFS) (2003) and MapReduce (2004), demonstrating that the backbone of its search engine was resilient to failures and almost linearly scalable. Doug Cutting took particular interest in these developments, as he could see that the GFS and MapReduce papers directly addressed Nutch's shortcomings. He added a MapReduce implementation to Nutch, which ran on 20 nodes and was much easier to program. Of course, we are talking in comparative terms here.

2006-2008: Cutting went to work at Yahoo in 2006, which had lost the search crown to Google and was equally impressed by the GFS and MapReduce papers. The storage and processing parts of Nutch were spun out to form a separate project named Hadoop under the Apache Software Foundation, whereas the Nutch web crawler remained a separate project. Hadoop became a top-level Apache project in 2008. On February 19, 2008, Yahoo announced that its search index was being run on a 10,000-node Hadoop cluster (truly an amazing feat).

We haven't forgotten about the proprietary database vendors. The majority of them didn't expect Hadoop to change anything for them, as database vendors typically focused on relational data, which was smaller in volume but higher in value. I was once talking to the CTO of a major database vendor (who will remain unnamed) about this new and increasingly popular elephant (Hadoop, of course; thanks to Doug Cutting's son for choosing a sane name, because he could have chosen anything else, and you know how kids name things these days). The CTO was quite adamant that the real value was in the relational data, which was the bread and butter of his company, and that unstructured data, despite its huge volumes, had less business value. This was essentially an 80-20 rule for data: from a size perspective, unstructured data was four times the size of structured data, whereas that same structured data carried four times the value of unstructured data. I would say that the relational database vendors massively underestimated the value of unstructured data back then.

Anyway, back to Hadoop: after the announcement by Yahoo, a lot of companies wanted to get a piece of the action. They realised something big was about to happen in the data space. Lots of interesting use cases started to appear in the Hadoop space, and MapReduce, the de facto compute engine on Hadoop, wasn't able to meet all of those expectations.

The MapReduce Conundrum: The original Hadoop comprised primarily HDFS and MapReduce as its compute engine. The original use case of web-scale search meant that the architecture was aimed primarily at long-running batch jobs (typically single-pass jobs without iterations), such as indexing web pages. The core requirements of such a framework were scalability and fault tolerance, as you don't want to restart a job that has been running for three days and has completed 95% of its work. Furthermore, MapReduce was designed to target acyclic data flows.

A typical MapReduce program is composed of a Map() operation and, optionally, a Reduce() operation, and any workload had to be converted to the MapReduce paradigm before you could get the benefit of Hadoop. Not only that, the majority of other open source projects on Hadoop also used MapReduce as the way to perform computation; for example, Hive and Pig Latin both generated MapReduce jobs to operate on big data sets. The problem with the architecture of MapReduce was that the output of each step had to be stored in the distributed file system before the next step could begin. This meant that each iteration had to reload the data from disk, incurring a significant performance penalty. Furthermore, while typically designed for batch jobs, Hadoop has often been used for exploratory analysis through SQL-like interfaces such as Pig and Hive. Each query incurs significant latency due to the initial MapReduce job setup and the initial data read, which often means long wait times for users.
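
To make the paradigm concrete, here is a minimal word-count sketch, written against Spark's RDD API purely for illustration (the input and output paths are placeholders): the flatMap and map steps play the role of the Map() phase, emitting (word, 1) pairs, and reduceByKey plays the role of the Reduce() phase.

```scala
// Illustrative only: word count expressed as a map phase followed by a reduce phase.
// Paths are placeholders; swap in real input/output locations.
import org.apache.spark.{SparkConf, SparkContext}

object MapReduceStyleWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("wordcount").setMaster("local[*]"))

    sc.textFile("hdfs:///data/pages.txt")            // read the input, split by split
      .flatMap(line => line.split("\\s+"))           // "map" phase: break lines into words...
      .map(word => (word, 1))                        // ...and emit (word, 1) pairs
      .reduceByKey(_ + _)                            // "reduce" phase: sum the counts per word
      .saveAsTextFile("hdfs:///data/wordcounts")     // write the results back out

    sc.stop()
  }
}
```

In classic Hadoop MapReduce, the pairs emitted by the map phase are shuffled and written out between the two phases, and expressing anything more complex than this single map-plus-reduce pass means chaining several such jobs together, with each job re-reading its input from disk.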

Beginning of Spark: In June 2010, Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica published a paper in which they proposed a framework that could outperform Hadoop by 10 times in iterative machine learning jobs. That framework is now known as Spark. The paper aimed to solve two of the major inadequacies of the Hadoop/MapReduce framework:

  • Iterative jobs
  • Interactive analysis

The idea that you could plug the gaps in MapReduce from an iterative and interactive analysis point of view, while maintaining its scalability and resilience, meant that the platform could be used across a wide variety of use cases.
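
As a rough sketch of the iterative case (the file path and the computation itself are made up for illustration, not taken from this book), caching a dataset in memory lets each pass reuse the parsed data instead of re-reading it from distributed storage, which is precisely the per-iteration disk round trip that MapReduce imposes:

```scala
// Illustrative only: an iterative computation over the same dataset.
// The file path and the "threshold" logic are placeholders for a real iterative algorithm.
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("iterative").setMaster("local[*]"))

    val points = sc.textFile("hdfs:///data/points.csv")
      .map(_.split(",").map(_.toDouble))
      .cache()                                 // kept in memory after the first pass

    var threshold = 0.0
    for (_ <- 1 to 10) {                       // ten passes over the same data
      threshold = points.map(_.sum).mean()     // each pass reads from memory, not from disk
    }

    println(s"final threshold: $threshold")
    sc.stop()
  }
}
```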

This created huge interest in Spark, particularly among users who had become frustrated with the relatively slow response of MapReduce, especially for interactive query requests. In 2015, Spark became the most active open source project in big data, gaining tons of new features and improvements over the course of the project. The community grew by almost 300%, with attendance at Spark Summit increasing from just 1,100 in 2014 to almost 4,000 in 2015. The number of meetup groups grew by a factor of four, and the number of contributors to the project increased from just over 100 in 2013 to 600 in 2015.

Spark is today the hottest technology for big data analytics. Numerous benchmarks have confirmed that it is the fastest engine out there. Go to any big data conference, be it Strata + Hadoop World or Hadoop Summit, and Spark is considered to be the technology of the future.

Stack Overflow released the results of its 2016 developer survey (http://bit.ly/1MpdIlU), with responses from 56,033 engineers across 173 countries. Some of the findings related to Spark were pretty interesting: Spark was the leader in both Trending Tech and Top-Paying Tech.

Why are people so excited about Spark?

In addition to plugging MapReduce's deficiencies, Spark provides three major capabilities that make it really powerful:

  • A general engine with libraries for many data analysis tasks - it includes built-in libraries for streaming, SQL, machine learning, and graph processing
  • Access to diverse data sources - it can connect to Hadoop, Cassandra, traditional SQL databases, and cloud storage, including Amazon and OpenStack
  • Last but not least, a simple, unified API, which means users have to learn just one API to get the benefit of the entire framework stack (see the short sketch after this list)
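
To illustrate that last point, the short sketch below (the people.json file and its name and age columns are assumptions made for illustration) drives both SQL and the DataFrame API from the same SparkSession, the single entry point introduced in Spark 2:

```scala
// Illustrative only: one SparkSession serving both SQL and the DataFrame API.
// The people.json file and its name/age columns are assumptions.
import org.apache.spark.sql.SparkSession

object UnifiedApiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("unified-api")
      .master("local[*]")
      .getOrCreate()

    val people = spark.read.json("people.json")   // DataFrame from a data source
    people.createOrReplaceTempView("people")

    // The same data can be queried with SQL...
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()

    // ...or transformed with the DataFrame API
    people.groupBy("age").count().show()

    spark.stop()
  }
}
```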

We hope that this book gives you a solid foundation for understanding Spark as a framework, and helps you take the next step towards using it for your own implementations.
