Another option for managing ephemeral clusters is Cloud Composer. We learned about Airflow in the previous chapter, where we used it to orchestrate BigQuery data loading. As we've already seen, Airflow has many operators, and one of them is, of course, Dataproc.
You should prefer this approach over a workflow template when your jobs are complex, for example, a pipeline with many branches, backfilling logic, or dependencies on other services, since workflow templates can't handle these complexities.
In this section, we will use Airflow to create a Dataproc cluster, submit a PySpark job, and delete the cluster when it finishes.
Check the full code in the GitHub repository:
Link to be updated
To use the Dataproc operators in Airflow, we first need to import them, like this:

from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
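Putting the pieces together, here is a minimal DAG sketch that chains the three steps. The project ID, region, bucket path, machine types, and cluster sizing below are placeholder assumptions, not values from the book; replace them with your own:

from datetime import datetime

from airflow import models
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.utils.trigger_rule import TriggerRule

PROJECT_ID = "your-project-id"           # assumption: replace with your project ID
REGION = "us-central1"                   # assumption: replace with your region
CLUSTER_NAME = "ephemeral-spark-cluster"

# A small cluster configuration; the machine types and sizes are illustrative only.
CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
}

# The PySpark job definition; the GCS path to the script is a placeholder.
PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": "gs://your-bucket/pyspark_job.py"},
}

with models.DAG(
    "dataproc_ephemeral_cluster",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Step 1: spin up the ephemeral Dataproc cluster.
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config=CLUSTER_CONFIG,
    )

    # Step 2: submit the PySpark job to the cluster we just created.
    submit_pyspark = DataprocSubmitJobOperator(
        task_id="submit_pyspark_job",
        project_id=PROJECT_ID,
        region=REGION,
        job=PYSPARK_JOB,
    )

    # Step 3: tear the cluster down when the job is done.
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule=TriggerRule.ALL_DONE,
    )

    create_cluster >> submit_pyspark >> delete_cluster

Note that the delete task uses the ALL_DONE trigger rule, so the cluster is torn down even if the PySpark job fails. This is what keeps the cluster truly ephemeral and avoids paying for idle VMs.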