Sign In Start Free Trial
Account

Add to playlist

Create a Playlist

Modal Close icon
You need to login to use this feature.
  • In-Memory Analytics with Apache Arrow
  • Toc
  • feedback
In-Memory Analytics with Apache Arrow

In-Memory Analytics with Apache Arrow

By : Matthew Topol
4.9 (15)
close
In-Memory Analytics with Apache Arrow

In-Memory Analytics with Apache Arrow

4.9 (15)
By: Matthew Topol

Overview of this book

Apache Arrow is designed to accelerate analytics and allow the exchange of data across big data systems easily. In-Memory Analytics with Apache Arrow begins with a quick overview of the Apache Arrow format, before moving on to helping you to understand Arrow’s versatility and benefits as you walk through a variety of real-world use cases. You'll cover key tasks such as enhancing data science workflows with Arrow, using Arrow and Apache Parquet with Apache Spark and Jupyter for better performance and hassle-free data translation, as well as working with Perspective, an open source interactive graphical and tabular analysis tool for browsers. As you advance, you'll explore the different data interchange and storage formats and become well-versed with the relationships between Arrow, Parquet, Feather, Protobuf, Flatbuffers, JSON, and CSV. In addition to understanding the basic structure of the Arrow Flight and Flight SQL protocols, you'll learn about Dremio’s usage of Apache Arrow to enhance SQL analytics and discover how Arrow can be used in web-based browser apps. Finally, you'll get to grips with the upcoming features of Arrow to help you stay ahead of the curve. By the end of this book, you will have all the building blocks to create useful, efficient, and powerful analytical services and utilities with Apache Arrow.
Table of Contents (16 chapters)
close
1
Section 1: Overview of What Arrow Is, its Capabilities, Benefits, and Goals
5
Section 2: Interoperability with Arrow: pandas, Parquet, Flight, and Datasets
11
Section 3: Real-World Examples, Use Cases, and Future Development

Would you download a library? Of course!

As mentioned before, the Arrow project contains a variety of libraries for multiple programming languages. These official libraries enable anyone to work with Arrow data without having to implement the Arrow format themselves, regardless of the platform and programming language they are utilizing. There are two primary types of libraries that exist so far: ones that are distinct implementations of the Arrow specification, and ones that are built on other implementations. As of the time of writing this book, there are currently implementations for Arrow in C++ [3], C# [4], Go [5], Java [6], JavaScript [7], Julia [8], and Rust [9], which are all distinct implementations.

On top of those, there are libraries for C (Glib) [10], MATLAB [11], Python [12], R [13], and Ruby[14], which are all built on top of the C++ library, which happens to have the most active development. As you might expect, the various implementations all have different stages as far as what features and aspects of the specification are implemented, and the documentation helpfully provides an implementation matrix showing what features are implemented in which libraries. The implementation matrix [15] is then updated as these aspects of the specification and features are implemented in a given library.

With so many different implementations, you might be concerned about interoperability between them. As a result, the various library versions are integration tested via automated continuous integration (CI) jobs in order to ensure this interoperability among them. Depending on the language and development, these libraries are tested on a very large variety of platforms, including but not limited to the following:

  • x86/x86-64
  • arm64
  • s390x (IBM Mainframes)
  • macOS
  • Windows 32 and 64 bit
  • Debian/Ubuntu/Red Hat/CentOS

These libraries are deployed with their various respective package managing methods to attempt to make it as easy as possible to acquire and download the libraries. As a result, there's been significant adoption of Arrow, whether you're a data scientist using pandas, numpy, or Dask, or you're performing calculations and analytics using Apache Spark or AirFlow. And, if you're looking to get the libraries so you can try them out for yourself, the Apache Software Foundation hosts various ways to download and acquire the libraries.

Some of the channels where the libraries are made available are as follows:

When developing something that will utilize the Arrow libraries, keep the terms that were mentioned a few pages ago in mind, as most of the libraries utilize similar terminology and naming for describing their Application Programming Interfaces (APIs).

bookmark search playlist download font-size

Change the font size

margin-width

Change margin width

day-mode

Change background colour

Close icon Search
Country selected

Close icon Your notes and bookmarks

Delete Bookmark

Modal Close icon
Are you sure you want to delete it?
Cancel
Yes, Delete