Learning PySpark

By: Drabas, Lee

Overview of this book

Apache Spark is an open-source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book will show you how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark. You will get familiar with the modules available in PySpark. You will learn how to abstract data with RDDs and DataFrames and understand the streaming capabilities of PySpark. You will also get a thorough overview of the machine learning capabilities of PySpark using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. Finally, you will learn how to deploy your applications to the cloud using the spark-submit command. By the end of this book, you will have a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

1. Understanding Spark

What is Apache Spark?

Spark Jobs and APIs

Spark 2.0 architecture

Summary

2. Resilient Distributed Datasets

Internal workings of an RDD

Creating RDDs

Global versus local scope

Transformations

Actions

Summary

3. DataFrames

Python to RDD communications

Catalyst Optimizer refresh

Speeding up PySpark with DataFrames

Creating DataFrames

Simple DataFrame queries

Interoperating with RDDs

Querying with the DataFrame API

Querying with SQL

DataFrame scenario – on-time flight performance

Spark Dataset API

Summary

4. Prepare Data for Modeling

Checking for duplicates, missing observations, and outliers

Getting familiar with your data

Visualization

Summary

5. Introducing MLlib

Overview of the package

Loading and transforming the data

Getting to know your data

Creating the final dataset

Predicting infant survival

Summary

6. Introducing the ML Package

Overview of the package

Predicting the chances of infant survival with ML

Parameter hyper-tuning

Other features of PySpark ML in action

Summary

7. GraphFrames

Introducing GraphFrames

Installing GraphFrames

Preparing your flights dataset

Building the graph

Executing simple queries

Understanding vertex degrees

Determining the top transfer airports

Understanding motifs

Determining airport ranking using PageRank

Determining the most popular non-stop flights

Using Breadth-First Search

Visualizing flights using D3

Summary

8. TensorFrames

What is Deep Learning?

What is TensorFlow?

Introducing TensorFrames

TensorFrames – quick start

Summary

9. Polyglot Persistence with Blaze

Installing Blaze

Polyglot persistence

Abstracting data

Data operations

Summary

10. Structured Streaming

What is Spark Streaming?

Why do we need Spark Streaming?

What is the Spark Streaming application data flow?

Simple streaming application using DStreams

A quick primer on global aggregations

Introducing Structured Streaming

Summary

11. Packaging Spark Applications

The spark-submit command

Deploying the app programmatically

Databricks Jobs

Summary

Index

Preface

It is estimated that in 2013 the whole world produced around 4.4 zettabytes of data; that is, 4.4 billion terabytes! By 2020, we (as the human race) are expected to produce ten times that. With data growing literally by the second, and with a growing appetite for making sense of it, the groundwork for processing data at this scale was laid in 2004, when Google employees Jeffrey Dean and Sanjay Ghemawat published the seminal paper MapReduce: Simplified Data Processing on Large Clusters. Since then, technologies leveraging the concept have grown very quickly, with Apache Hadoop initially being the most popular. Hadoop ultimately gave rise to an ecosystem that included abstraction layers such as Pig, Hive, and Mahout, all leveraging this simple concept of map and reduce.
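The "map and reduce" idea mentioned above can be illustrated with a minimal, single-machine sketch in plain Python. This is not Hadoop or Spark code; the sample sentences and the helper names map_phase and reduce_phase are made up purely for illustration. In a real cluster, the map step would run in parallel on many machines and the reduce step would merge their partial results.

```python
from collections import Counter
from functools import reduce

# Hypothetical input: each string stands in for one record of a large dataset.
documents = [
    "spark makes big data simple",
    "big data needs big clusters",
]


def map_phase(line):
    """Map: turn one record into partial word counts, independently of the others."""
    return Counter(line.split())


def reduce_phase(left, right):
    """Reduce: merge two partial results into one by summing the counts."""
    return left + right


# Apply the map step to every record, then fold the partial results together.
word_counts = reduce(reduce_phase, map(map_phase, documents), Counter())

print(word_counts.most_common(2))
```

Because each record is mapped independently and the merge step is associative, the same two functions could in principle be applied to chunks of data spread across many machines, which is the property the frameworks named above build on.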

However, even though it is capable of chewing through petabytes of data daily, MapReduce is a fairly restricted programming framework. In addition, most tasks require reading from and writing to disk. Seeing these drawbacks, in 2009 Matei Zaharia started working on Spark as part of his PhD. Spark was first released in 2012. Even though Spark is based on the same MapReduce concept, its advanced ways of dealing with data and organizing tasks can make it up to 100x faster than Hadoop (for in-memory computations).

In this book, we will guide you through the latest incarnation of Apache Spark using Python. We will show you how to read structured and unstructured data, use some of the fundamental data types available in PySpark, build machine learning models, operate on graphs, read streaming data, and deploy your models in the cloud. Each chapter will tackle a different problem, and by the end of the book we hope you will be knowledgeable enough to solve other problems we did not have space to cover here.
