Data Engineer's Lunch #76: Airflow and Google Dataproc
1. Version 1.0
Airflow and Google Dataproc
In Data Engineer's Lunch #76, Arpan Patel will cover how to
connect Airflow and Google Dataproc with a demo using an Airflow
DAG to create a Dataproc cluster, submit an Apache Spark job to
Dataproc, and destroy the Dataproc cluster upon completion.
Arpan Patel
Engineer @ Anant
2. Google Dataproc
● Fully managed and highly scalable service for running
Apache Spark, Apache Flink, Presto, and 30+ open source
tools and frameworks
○ Lets you take advantage of open source data tools
for batch processing, querying, streaming, and
machine learning
● Dataproc clusters are quick to start, scale, and shut down,
with each of these operations taking 90 seconds or less,
on average
● Built-in integration with other Google Cloud Platform
services, such as BigQuery, Cloud Storage, Cloud
Bigtable, Cloud Logging, and Cloud Monitoring
● Can easily interact with clusters and Spark or Hadoop
jobs through the Google Cloud console, the Cloud SDK, or
the Dataproc REST API
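To make the last point concrete, here is a hedged sketch of a minimal request body for the Dataproc REST API's `projects.regions.clusters.create` method; the project ID, cluster name, and machine types are illustrative placeholders, not values from the demo.

```python
import json

# Minimal request body for the Dataproc REST API's clusters.create method.
# REST field names are camelCase. Project, cluster name, and machine types
# below are illustrative placeholders.
cluster_request = {
    "clusterName": "demo-cluster",        # placeholder cluster name
    "projectId": "my-gcp-project",        # placeholder project ID
    "config": {
        "masterConfig": {
            "numInstances": 1,
            "machineTypeUri": "n1-standard-2",
        },
        "workerConfig": {
            "numInstances": 2,
            "machineTypeUri": "n1-standard-2",
        },
    },
}

print(json.dumps(cluster_request, indent=2))
```

The same shapes also appear (in snake_case form) in the Airflow operator parameters covered on the next slide.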
4. Google Dataproc + DataStax Astra
● Cluster Properties
○ dataproc:dataproc.conscrypt.provider.enable=false
● Job Properties
○ spark.jars.packages → com.datastax.spark:spark-cassandra-connector_2.12:3.1.0
● Mapping DAG params to the GCP REST API
○ Convert camelCase field names to snake_case. For example, masterConfig -> master_config
○ To create the Dataproc cluster on GKE instead of GCE, swap cluster_config for
virtual_cluster_config
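The camelCase-to-snake_case mapping above is mechanical; a minimal sketch (the helper name is ours, not part of Airflow or the GCP API):

```python
import re

def camel_to_snake(name: str) -> str:
    """Convert a camelCase REST API field name to the snake_case form
    used by Airflow's Dataproc operator parameters."""
    # Insert an underscore before each uppercase letter, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

print(camel_to_snake("masterConfig"))         # master_config
print(camel_to_snake("virtualClusterConfig"))  # virtual_cluster_config
```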
5. Demo
● Open repo on Gitpod
● Set GCP Connection and Variables
● Run the DAG, which will:
○ Spin up Dataproc Cluster on GCE
○ Submit Dataproc Spark Job to read from DataStax Astra
○ Destroy Cluster
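The three demo steps map onto operators from Airflow's Google provider: DataprocCreateClusterOperator, DataprocSubmitJobOperator, and DataprocDeleteClusterOperator. Below is a hedged sketch of the config dicts those operators would take, folding in the Astra-related properties from the previous slide; the project, region, bucket, jar, and job class names are placeholders, not the demo repo's actual values.

```python
# Config dicts for the demo DAG's Dataproc operators (snake_case form).
# Project, region, cluster, bucket, jar, and class names are placeholders.
PROJECT_ID = "my-gcp-project"
REGION = "us-central1"
CLUSTER_NAME = "airflow-demo-cluster"

CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    # Cluster property from the previous slide, needed so the Spark
    # Cassandra Connector can reach DataStax Astra over SSL.
    "software_config": {
        "properties": {"dataproc:dataproc.conscrypt.provider.enable": "false"}
    },
}

SPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "spark_job": {
        "main_class": "com.example.AstraReadJob",           # hypothetical class
        "jar_file_uris": ["gs://my-bucket/astra-read-job.jar"],  # placeholder
        # Job property from the previous slide.
        "properties": {
            "spark.jars.packages":
                "com.datastax.spark:spark-cassandra-connector_2.12:3.1.0"
        },
    },
}
```

In the DAG these would be passed as `DataprocCreateClusterOperator(cluster_config=CLUSTER_CONFIG, ...)` and `DataprocSubmitJobOperator(job=SPARK_JOB, ...)`, with `DataprocDeleteClusterOperator` wired as the final task (typically with `trigger_rule="all_done"` so the cluster is destroyed even if the Spark job fails).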
6. Strategy: Scalable Fast Data
Architecture: Cassandra, Spark, Kafka
Engineering: Node, Python, JVM, CLR
Operations: Cloud, Container
Rescue: Downtime!! I need help.
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037