

In the previous example, we created a scanner and then read the entire dataset. This time, we're going to muck around with the builder first to give it a filter to use before it starts reading the data. We'll also use the Project
function to control which columns get read. Since we're using Parquet files, we can reduce the I/O and memory usage by reading only the columns we want rather than reading all of them; we just need to tell the scanner which columns those are.
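To make the idea concrete, here's a minimal sketch of a filtered, projected scan using the Python dataset bindings; the dataset path and column names are made up for illustration:

```python
import pyarrow.dataset as ds

# Hypothetical path and columns, purely for illustration.
dataset = ds.dataset("data/trips/", format="parquet")

scanner = dataset.scanner(
    columns=["vendor_id", "total_amount"],    # projection: read only these columns
    filter=ds.field("total_amount") > 100.0,  # filter applied during the scan
)
table = scanner.to_table()
```

Because the projection and filter are handed to the scanner up front, a Parquet-backed dataset can skip the other columns entirely instead of reading everything and throwing most of it away.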
In the previous section, we learned about the Arrow Compute API as a library for performing various operations and computations on Arrow-formatted data. It also includes objects and functionality for defining complex expressions that reference fields and call functions. These expression objects can then be used in conjunction with scanners to define simple or complex filters for our data, as in the sketch that follows. Before we dig into the scanner, let's take a quick detour to cover the...
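As a taste of what those expression objects look like, here's a small sketch using the Python bindings (the field names are hypothetical). Note that an expression is only a description of a computation; nothing is evaluated until a scanner applies it to actual data:

```python
import pyarrow.dataset as ds

# Compose a filter from field references, comparisons, and a function call.
# Hypothetical field names; nothing runs until a scan uses this expression.
expr = (ds.field("total_amount") > 100.0) & ds.field("vendor_id").isin([1, 2])

# Printing shows the expression tree the scanner will evaluate per batch.
print(expr)
```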