Chapter 6: Performance Tuning with Apache Spark | Data Engineering with Databricks Cookbook

Data Engineering with Databricks Cookbook

By: Pulkit Chadha

4.4 (7)

Overview of this book

Written by a Senior Solutions Architect at Databricks, Data Engineering with Databricks Cookbook shows you how to use Apache Spark, Delta Lake, and Databricks effectively for data engineering, starting with a comprehensive introduction to data ingestion and loading with Apache Spark. What makes this book unique is its recipe-based approach, which helps you put your knowledge to use straight away and tackle common problems. You’ll be introduced to various data manipulation and transformation solutions, find out how to manage and optimize Delta tables, and get to grips with ingesting and processing streaming data. The book also shows you how to resolve performance problems in Apache Spark apps and Delta Lake. Advanced recipes later in the book teach you how to use Databricks to implement DataOps and DevOps practices, as well as how to orchestrate and schedule data pipelines using Databricks Workflows. You’ll also go through the full process of setting up and configuring Unity Catalog for data governance. By the end of this book, you’ll be well-versed in building reliable and scalable data pipelines using modern data engineering technologies.

Preface

The evolving landscape of data engineering

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Conventions used

Sections

Get in touch

Share Your Thoughts

Download a free PDF copy of this book

Part 1 – Working with Apache Spark and Delta Lake

Chapter 1: Data Ingestion and Data Extraction with Apache Spark

Technical requirements

Reading CSV data with Apache Spark

Reading JSON data with Apache Spark

Reading Parquet data with Apache Spark

Parsing XML data with Apache Spark

Working with nested data structures in Apache Spark

Processing text data in Apache Spark

Writing data with Apache Spark

Chapter 2: Data Transformation and Data Manipulation with Apache Spark

Technical requirements

Applying basic transformations to data with Apache Spark

Filtering data with Apache Spark

Performing joins with Apache Spark

Performing aggregations with Apache Spark

Using window functions with Apache Spark

Writing custom UDFs in Apache Spark

Handling null values with Apache Spark

Chapter 3: Data Management with Delta Lake

Technical requirements

Creating a Delta Lake table

Reading a Delta Lake table

Updating data in a Delta Lake table

Merging data into Delta tables

Change data capture in Delta Lake

Optimizing Delta Lake tables

Versioning and time travel for Delta Lake tables

Managing Delta Lake tables

Chapter 4: Ingesting Streaming Data

Technical requirements

Configuring Spark Structured Streaming for real-time data processing

Reading data from real-time sources, such as Apache Kafka, with Apache Spark Structured Streaming

Defining transformations and filters on a Streaming DataFrame

Configuring checkpoints for Structured Streaming in Apache Spark

Configuring triggers for Structured Streaming in Apache Spark

Applying window aggregations to streaming data with Apache Spark Structured Streaming

Handling out-of-order and late-arriving events with watermarking in Apache Spark Structured Streaming

Chapter 5: Processing Streaming Data

Technical requirements

Writing the output of Apache Spark Structured Streaming to a sink such as Delta Lake

Idempotent stream writing with Delta Lake and Apache Spark Structured Streaming

Merging or applying Change Data Capture on Apache Spark Structured Streaming and Delta Lake

Joining streaming data with static data in Apache Spark Structured Streaming and Delta Lake

Joining streaming data with streaming data in Apache Spark Structured Streaming and Delta Lake

Monitoring real-time data processing with Apache Spark Structured Streaming

Chapter 6: Performance Tuning with Apache Spark

Technical requirements

Monitoring Spark jobs in the Spark UI

Using broadcast variables

Optimizing Spark jobs by minimizing data shuffling

Avoiding data skew

Caching and persistence

Partitioning and repartitioning

Optimizing join strategies

Chapter 7: Performance Tuning in Delta Lake

Technical requirements

Optimizing Delta Lake table partitioning for query performance

Organizing data with Z-ordering for efficient query execution

Skipping data for faster query execution

Reducing Delta Lake table size and I/O cost with compression

Part 2 – Data Engineering Capabilities within Databricks

Chapter 8: Orchestration and Scheduling Data Pipeline with Databricks Workflows

Technical requirements

Building Databricks workflows

Running and managing Databricks Workflows

Passing task and job parameters within a Databricks Workflow

Conditional branching in Databricks Workflows

Triggering jobs based on file arrival

Setting up workflow alerts and notifications

Troubleshooting and repairing failures in Databricks Workflows

Chapter 9: Building Data Pipelines with Delta Live Tables

Technical requirements

Creating a multi-hop medallion architecture data pipeline with Delta Live Tables in Databricks

Building a data pipeline with Delta Live Tables on Databricks

Implementing data quality and validation rules with Delta Live Tables in Databricks

Quarantining bad data with Delta Live Tables in Databricks

Monitoring Delta Live Tables pipelines

Deploying Delta Live Tables pipelines with Databricks Asset Bundles

Applying changes (CDC) to Delta tables with Delta Live Tables

Chapter 10: Data Governance with Unity Catalog

Technical requirements

Connecting to cloud object storage using Unity Catalog

Creating and managing catalogs, schemas, volumes, and tables using Unity Catalog

Defining and applying fine-grained access control policies using Unity Catalog

Tagging, commenting, and capturing metadata about data and AI assets using Databricks Unity Catalog

Filtering sensitive data with Unity Catalog

Using Unity Catalog's lineage data for debugging, root cause analysis, and impact assessment

Accessing and querying system tables using Unity Catalog

Chapter 11: Implementing DataOps and DevOps on Databricks

Technical requirements

Using Databricks Repos to store code in Git

Automating tasks by using the Databricks CLI

Using the Databricks VSCode extension for local development and testing

Using Databricks Asset Bundles (DABs)

Leveraging GitHub Actions with Databricks Asset Bundles (DABs)

Index

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts

Download a free PDF copy of this book

Customer Reviews

4.4 (7)

5 star: 85.7%
4 star
3 star
2 star
1 star: 14.3%

Optimizing Spark jobs by minimizing data shuffling
