
Machine Learning with BigQuery ML

Google BigQuery is a highly scalable, serverless, distributed data warehouse technology built internally by Google in 2006 and then released for public use on GCP in 2010. Thanks to its architecture, it can store petabytes of data and query them with high performance and on-demand scaling. Due to its serverless nature, users who store and query data on BigQuery don't have to manage the underlying infrastructure and can focus on implementing the logic that brings the business value, saving time and resources.
BigQuery is currently used by many large enterprises that leverage it to make data-driven decisions, including Twitter, The Home Depot, and Dow Jones.
BigQuery has a distributed architecture running on thousands of nodes across Google's data centers. Your datasets are not stored in a unique server but are chunked and replicated across different regions to guarantee maximum performance and availability.
The storage and compute layers are fully decoupled in BigQuery. This means that the query engine runs on different servers from the servers where the data is stored. This feature enables BigQuery to provide great scalability both in terms of data volume and query execution. This decoupled paradigm is only possible thanks to Google's Petabit network, which moves data very quickly from one server to another, leveraging Google's proprietary fiber cables across the globe.
Now let's look deeper into how BigQuery manages storage and the compute engine.
Unlike traditional data warehouses, BigQuery stores data in columnar format rather than in row format. This approach speeds up analytical queries, which typically read only a small subset of columns across many rows, and enables better compression of the stored data.
All the data is stored in Google's proprietary distributed filesystem named Google File System (codename Colossus). The distribution of the data guarantees faster I/O performance and better availability of data in the case of failures. Google File System is based on two different server types: master servers, which manage the metadata of the filesystem, and chunk servers, which store the actual chunks of data.
In the following diagram, you can see how Google File System manages data:
Figure 1.8 – Google File System (Colossus) storage strategy
Now that we've learned how BigQuery handles large volumes of data, let's see how this data can be accessed by the compute layer.
Fully decoupled from storage, the compute layer is responsible for receiving query statements from BigQuery users and executing them in the fastest way. The query engine is based on Dremel, a technology developed by Google and then described in a paper published in 2010. This engine leverages a multi-level tree architecture: the root server receives incoming queries and routes them down the tree, intermediate servers (mixers) rewrite the queries and aggregate partial results, and leaf servers read the data from the storage layer and perform the actual scans, filters, and aggregations.
The execution process of a query on BigQuery based on the multi-level tree is represented in the following diagram:
Figure 1.9 – The BigQuery engine is a multi-level tree
Each node provides a number of processing units called BigQuery slots to execute the business logic of the query. A BigQuery slot can be considered a virtual CPU on a Dremel node. The calculation of the slots needed to perform a specific query is automatically managed by BigQuery depending on the complexity of the query and the impacted data volumes.
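As an illustration, BigQuery exposes the slot consumption of past jobs through its INFORMATION_SCHEMA views. The following sketch assumes your jobs run in the US multi-region and that you have permission to read job metadata; adjust the region qualifier as needed:

```sql
-- Inspect the slot usage of recent jobs in the current project.
SELECT
  job_id,
  total_slot_ms,           -- slot-milliseconds consumed by the job
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_slot_ms DESC
LIMIT 10;
```

Dividing `total_slot_ms` by the job's elapsed time gives a rough idea of how many slots the query used in parallel.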
Now that we've learned about the technical architecture underneath BigQuery, let's take a look at how this architecture translates into benefits for the enterprises that use it to become data-driven companies compared to other traditional on-premises data warehouses.
As we have mentioned, BigQuery has a serverless architecture that saves users from having to manage not only the provisioning and maintenance of the servers but also all the maintenance operations related to the upgrading and patching of the operative system and of the database software that supports the functioning of the data warehouse. Thanks to the serverless approach, the user can easily start ingesting data into BigQuery and using it without having to perform capacity planning or any hardware and software provisioning upfront. This is particularly important for prototyping and to enable a fail-fast approach that favors a culture of innovation and experimentation.
It doesn't matter if you need to store megabytes of data or petabytes; BigQuery can provide you with maximum flexibility and scalability in terms of both data storage and processing. Thanks to its multi-tenant architecture, a small or medium-sized business can leverage the same innovative capabilities as the largest enterprises, or they can start with a small use case to scale later, according to business needs. Traditional data warehouse technologies leverage the same servers to store and compute. For this reason, they are not particularly suitable for unbalanced use cases, such as when large volumes of data storage are needed but high computing performance is not required, or vice versa. Thanks to its decoupled architecture, as we've seen in the previous section, BigQuery is designed to independently scale storage and compute power according to the user's actual requirements, reducing the total cost of ownership of the solution.
Thanks to its resilient, distributed architecture, BigQuery is able to offer a Service Level Agreement (SLA) with a monthly uptime percentage greater than 99.99%. This very high availability standard is guaranteed by Google without any extra effort from BigQuery users, who don't need to take care of high availability or disaster recovery strategies.
The BigQuery engine offers the ability to query terabytes of data in seconds and petabytes in minutes. This kind of performance is very hard to achieve with a traditional on-premises data warehouse. Higher performance means getting insights faster, as well as processing large volumes of data that very often would be impossible to manage on-premises without huge hardware and software investments. To further improve performance, BigQuery offers the possibility of enabling BigQuery BI Engine. BigQuery BI Engine is an in-memory analytics layer that can be activated on top of BigQuery to execute queries faster, with sub-second latencies.
Traditional data warehouses are designed for long batch operations and are often unable to manage real-time workloads. BigQuery, however, provides a specific interface to ingest data in real time, making it immediately available for analysis. This feature opens up new possibilities to companies that want to accelerate their analytics and overcome the typical approach of data warehouses, which usually involves dealing with the business situation of the day before.
BigQuery stores files in a compressed and optimized format in Google File System but provides the option to load data in various formats that are typically used in data lake technologies. Users can load data into BigQuery using Avro, ORC, CSV, and JSON formats.
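For example, files staged in Cloud Storage can be loaded with a single SQL statement. The bucket, table, and dataset names below are hypothetical placeholders:

```sql
-- Load CSV files from Cloud Storage into a native BigQuery table.
-- The format option also accepts 'JSON', 'AVRO', 'ORC', and 'PARQUET'.
LOAD DATA INTO `my_project.my_dataset.sales`
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,          -- skip the header row
  uris = ['gs://my-bucket/sales/*.csv']
);
```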
BigQuery offers two SQL extensions that are not available in traditional data warehouses: BigQuery GIS, which enables geospatial analysis directly in SQL, and BigQuery ML, which enables training and running machine learning models with SQL statements.
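As a preview of what is to come in the following chapters, a BigQuery ML model can be created with a single SQL statement. The project, dataset, table, and column names here are hypothetical placeholders:

```sql
-- Train a linear regression model directly in SQL with BigQuery ML.
CREATE OR REPLACE MODEL `my_project.my_dataset.trip_duration_model`
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['trip_duration']   -- the column to predict
) AS
SELECT
  start_station_name,
  day_of_week,
  trip_duration
FROM `my_project.my_dataset.training_data`;
```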
By default, BigQuery automatically encrypts customer data before storing it in Google File System and decrypts it when it is read. BigQuery is also responsible for managing and rotating the encryption keys. To further improve security, BigQuery provides the option to use Customer-Managed Encryption Keys (CMEKs). In this case, the keys are managed directly by the customer in Cloud Key Management Service (Cloud KMS).
Unauthorized access to and use of data can be prevented by setting the right roles in Google Cloud Identity and Access Management (IAM).
One of the great benefits of using BigQuery is the native integration with a lot of other GCP services:
Figure 1.10 – BigQuery integration with other GCP services
As we can see in the preceding diagram, Google BigQuery can read data from Google Bigtable, Google Cloud Storage, Google Cloud SQL, and Google Drive using external tables. This feature can simplify data ingestion into BigQuery from other databases, which can also be performed using ETL/ELT tools such as Google Dataflow and Data Fusion. Once the datasets are stored in BigQuery, they can be accessed from other GCP components, such as Google Dataproc and Dataprep for data processing and preparation, and Data Studio, Looker, and Google Sheets for data visualization. BigQuery is also integrated with AI Platform Notebooks to allow data scientists and engineers to easily access data from their Jupyter environments.
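To give a concrete idea of how external tables work, the following sketch defines a table that reads CSV files in place from Cloud Storage, without loading them into BigQuery storage. The bucket and table names are hypothetical:

```sql
-- Query data where it lives: the files stay in Cloud Storage, and
-- BigQuery reads them on the fly at query time.
CREATE OR REPLACE EXTERNAL TABLE `my_project.my_dataset.ext_orders`
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/orders/*.csv'],
  skip_leading_rows = 1
);
```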
To summarize, bringing data into BigQuery opens up a wide range of options that can be used according to the user's needs.
Beyond the integration with other GCP services, Google's partners provide connectors and integrations with BigQuery, creating a rich data management ecosystem. Some examples include Informatica, Tableau, Fivetran, Talend, and Confluent.
If you want to start from scratch using BigQuery, you can leverage existing public tables that are available from the BigQuery Public Datasets Program. The program contains interesting datasets coming from different industries and different countries about different topics. Some examples that we'll use in the next chapter to train our ML models include transactions from bike sharing services, open data on New York City, and records of taxi trips.
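For example, the New York City bike sharing data can be explored with a simple query against the public dataset, paying only for the compute of the query itself:

```sql
-- Find the five busiest start stations in the NYC Citi Bike public dataset.
SELECT
  start_station_name,
  COUNT(*) AS num_trips
FROM `bigquery-public-data.new_york_citibike.citibike_trips`
GROUP BY start_station_name
ORDER BY num_trips DESC
LIMIT 5;
```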
There are different ways to interact with BigQuery, such as the web UI available in Google Cloud Console and the bq command line, available with the installation of the Google Cloud SDK, which can be used to automate jobs and commands by including them in scripts. For our purposes, we'll take a look at the BigQuery web UI in Google Cloud Console, which will be used in the coming chapters to develop with BigQuery ML.
In the following screenshot, you can see how the BigQuery UI appears in the GCP console:
Figure 1.11 – BigQuery web and graphical user interface available in the GCP console
In the left column, the main BigQuery features are available, and the datasets visible to the user are listed. In this case, there is only one dataset.
The remaining part of the screen is occupied by the development canvas with the results and outcomes at the bottom. We'll learn how to use the BigQuery web UI in Chapter 2, Setting Up Your GCP and BigQuery Environment, when we'll create a GCP project and start using BigQuery.
BigQuery structures, such as tables, views, and ML models, are organized in datasets. Each dataset is a container for different structures and can be used to control access to underlying data structures. A dataset is directly linked to a GCP project, which hosts it and is billed for its storage, and to a geographic location (a region or multi-region), which is fixed at creation time.
In the following diagram, you can see an example of a hierarchy composed of projects, datasets, tables, and BigQuery ML models:
Figure 1.12 – BigQuery hierarchy: datasets, tables, views, and ML models
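Creating a dataset is a one-line DDL operation (in BigQuery DDL, a dataset is called a schema). The project and dataset names below are placeholders:

```sql
-- Create a dataset in the US multi-region; the location cannot be
-- changed after creation.
CREATE SCHEMA `my_project.my_dataset`
OPTIONS (
  location = 'US',
  description = 'Dataset for BigQuery ML experiments'
);
```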
A dataset hosted in one GCP project, Project A, can also be queried by users linked to another GCP project, Project B, if the right permissions are set. In this case, the storage costs are charged to Project A, which hosts the dataset structures, while the compute costs are billed on the billing account related to Project B. This is exactly what will happen when we use BigQuery public datasets in future chapters for hands-on exercises.
Tip
Remember, your queries can only include tables that reside in the same region. In BigQuery, you cannot query tables that are stored in different geographic locations. If you want to execute queries on tables located in different regions, you need to export and import the data into a dataset in the same region, passing through Google Cloud Storage.
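The export step of this region-to-region move can be expressed in SQL. The bucket, project, and table names below are hypothetical; the exported files can then be loaded into a dataset in the target region:

```sql
-- Export a table to a Cloud Storage bucket located in the target region.
-- The wildcard in the URI lets BigQuery split the output into multiple files.
EXPORT DATA OPTIONS (
  uri = 'gs://my-eu-bucket/export/table_*.csv',
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT * FROM `my_project.us_dataset.my_table`;
```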
Now that we've learned the main characteristics of BigQuery, let's focus more specifically on the core of this book: BigQuery ML.