Speaker: Gokul Gunasekaran, Cask
Big Data Applications Meetup, 06/15/2016
Palo Alto, CA
More info here: http://www.meetup.com/BigDataApps/
About the talk:
Cask Hydrator is an extension to the open source Cask Data Application Platform (CDAP) that simplifies developing and operating realtime and batch data pipelines on Hadoop. Hydrator's web-based drag-and-drop UI allows users to quickly build Hadoop-scalable, distro-agnostic data pipelines without writing any code.
Powered by CDAP (http://cdap.io), Hydrator provides ease of operability through metadata, lineage, metrics, and log collection in a single location. In this talk, we will build data pipelines with real-life applications that pull in data from multiple sources, train and use a machine learning model to classify data using Spark MLlib, and write data to different sinks. We will also delve under the covers to see how these pipelines are transformed into a series of MapReduce/Spark jobs, and touch upon some interesting challenges we had to tackle while developing Hydrator.
Building Data Pipelines with Cask Hydrator, by Gokul Gunasekaran from Cask
1. Building Data Pipelines with Cask Hydrator
Gokul Gunasekaran
Software Engineer, Cask Data
June 15, 2016
Cask, CDAP, Cask Hydrator and Cask Tracker are trademarks or registered trademarks of Cask Data. Apache Spark, Spark, the Spark logo, Apache Hadoop, Hadoop and the Hadoop logo are trademarks or registered trademarks of the Apache Software Foundation. All other trademarks and registered trademarks are the property of their respective owners.
2. cask.co
INGEST any data from any source, in real-time and batch
BUILD drag-and-drop ETL/ELT pipelines that run on Hadoop
EGRESS any data to any destination, in real-time and batch
Data Pipeline: provides the ability to automate complex workflows that involve fetching data, performing non-trivial transformations, and deriving and serving insights from the data
3. Web Analytics and Reporting Use Case
✦ Hadoop ETL pipeline(s) stitched together using hard-to-maintain, brittle scripts
✦ Not many developers with expertise in Hadoop components (HDFS, MapReduce, Spark, YARN, HBase, Kafka)
✦ Hard to debug and validate, resulting in frequent failures in the production environment
Challenge: fetch web access logs from S3 every hour, load them into the Hadoop cluster for backup, and perform analytics to enable realtime reporting of the number of successful/failed responses and client browser info
4. Demo: Load Log Files from S3 into HDFS and Perform Aggregations/Analysis
• Start with web access logs stored in Amazon S3
• Store the raw logs in HDFS as Avro files
• Parse the access log lines into individual fields
• Find the distribution of status codes
• Find the most commonly used client browser
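The parse and aggregate steps above boil down to simple record-level logic. A minimal plain-Java sketch of what the demo's parse and status-code-distribution stages compute (the regex assumes Combined Log Format; this illustrates the logic only, not the Hydrator plugins themselves):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AccessLogDemo {
    // Combined Log Format, e.g.:
    // 127.0.0.1 - - [15/Jun/2016:10:00:00 -0700] "GET / HTTP/1.1" 200 2326 "-" "Mozilla/5.0"
    private static final Pattern LOG_LINE = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    /** Returns a status-code -> count distribution for the given log lines. */
    static Map<String, Integer> statusDistribution(String[] lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LOG_LINE.matcher(line);
            if (m.matches()) {
                counts.merge(m.group(4), 1, Integer::sum); // capture group 4 = status code
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "1.2.3.4 - - [15/Jun/2016:10:00:00 -0700] \"GET /index.html HTTP/1.1\" 200 2326 \"-\" \"Mozilla/5.0\"",
            "1.2.3.5 - - [15/Jun/2016:10:00:01 -0700] \"GET /missing HTTP/1.1\" 404 153 \"-\" \"curl/7.43.0\"",
            "1.2.3.4 - - [15/Jun/2016:10:00:02 -0700] \"GET /logo.png HTTP/1.1\" 200 4820 \"-\" \"Mozilla/5.0\""
        };
        System.out.println(statusDistribution(lines)); // prints {200=2, 404=1}
    }
}
```

In the demo itself, this logic lives in a parse transform followed by a Group By Aggregation stage, so it scales out as a MapReduce or Spark job instead of a single JVM loop.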
6. Hydrator Studio
✦ Drag-and-drop GUI for visual data pipeline creation
✦ Rich library of pre-built sources, transforms, and sinks for data ingestion and ETL use cases
✦ Separation of pipeline creation from execution framework (MapReduce, Spark, Spark Streaming, etc.)
✦ Hadoop-native and Hadoop distro agnostic
7. Hydrator Data Pipeline
✦ Captures metadata, audit, and lineage info, visualized using Cask Tracker
✦ Post-run notifications, centralized metrics, and log collection for ease of operability
✦ Simple Java API to build your own sources, transforms, and sinks with class loader isolation
✦ SparkML-based plugins and Python transforms for data scientists
8. Out-of-the-Box Integrations
✦ ElasticSearch, Cassandra, Kafka, SFTP, JMS, and many more sources and sinks
✦ De-duplicate, Group By Aggregation, Row Denormalizer, and other transforms
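The record-level semantics of two of the transforms above, De-duplicate and Group By Aggregation, are easy to state precisely. A minimal plain-Java sketch, assuming records are string arrays keyed by field index (an illustration of the semantics, not the plugins' actual configuration model):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TransformSemantics {
    /** De-duplicate: keep the first record seen for each key, preserving arrival order. */
    static List<String[]> deduplicate(List<String[]> records, int keyField) {
        Map<String, String[]> firstSeen = new LinkedHashMap<>();
        for (String[] r : records) {
            firstSeen.putIfAbsent(r[keyField], r);
        }
        return new ArrayList<>(firstSeen.values());
    }

    /** Group By Aggregation (count): number of records per group key. */
    static Map<String, Long> countByKey(List<String[]> records, int keyField) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String[] r : records) {
            counts.merge(r[keyField], 1L, Long::sum);
        }
        return counts;
    }
}
```

At scale the same semantics are executed as a shuffle: records with the same key are routed to the same reducer or Spark partition before the keep-first or count step runs.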
9. Custom Plugins
✦ Implement your own batch (or realtime) source, transform, and sink plugins using a simple Java API
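A real Hydrator transform plugin extends the Transform class from CDAP's cdap-etl-api module and receives records through an Emitter. The sketch below mirrors that shape with simplified local stand-in types so it runs standalone; the Emitter and Transform here are stand-ins, not the actual CDAP classes, and UpperCaseTransform is a made-up example:

```java
import java.util.ArrayList;
import java.util.List;

public class CustomPluginSketch {
    /** Stand-in for the CDAP Emitter: the channel a plugin writes output records to. */
    interface Emitter<T> {
        void emit(T value);
    }

    /** Stand-in for the CDAP Transform API: one input record in, zero or more records out. */
    abstract static class Transform<IN, OUT> {
        abstract void transform(IN input, Emitter<OUT> emitter);
    }

    /** Example transform: upper-cases each record and drops blank ones. */
    static class UpperCaseTransform extends Transform<String, String> {
        @Override
        void transform(String input, Emitter<String> emitter) {
            if (!input.trim().isEmpty()) {
                emitter.emit(input.toUpperCase());
            }
        }
    }

    /** Drives a transform over a batch of inputs, collecting everything it emits. */
    static List<String> run(Transform<String, String> t, List<String> inputs) {
        List<String> out = new ArrayList<>();
        for (String in : inputs) {
            t.transform(in, out::add);
        }
        return out;
    }
}
```

In an actual plugin the class is additionally annotated with the CDAP plugin annotations and packaged as a separate artifact, which is where the class loader isolation mentioned earlier comes in.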
11. Pipeline Implementation
[Diagram: Logical Pipeline, via the Planner, becomes a Physical Workflow of MR/Spark Executions, running on CDAP]
✦ The Planner converts the logical pipeline to a physical execution plan
✦ It optimizes and bundles functions into one or more MR/Spark jobs
✦ CDAP is the runtime environment where all the components of the data pipeline are executed
✦ CDAP provides centralized log and metrics collection, transactions, lineage, and audit information
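One way to picture the bundling step: consecutive map-side stages can be fused into a single job, while a shuffle boundary (such as an aggregation) forces a new one. A deliberately simplified sketch of that rule, not CDAP's actual planner, with hypothetical stage names:

```java
import java.util.ArrayList;
import java.util.List;

public class PlannerSketch {
    /**
     * Splits a linear logical pipeline into physical jobs: stages accumulate into
     * the current job until a shuffle stage (e.g. a group-by aggregation) is hit,
     * which starts a new MR/Spark job.
     */
    static List<List<String>> plan(List<String> stages, List<String> shuffleStages) {
        List<List<String>> jobs = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String stage : stages) {
            if (shuffleStages.contains(stage) && !current.isEmpty()) {
                jobs.add(current); // close the fused job before the shuffle boundary
                current = new ArrayList<>();
            }
            current.add(stage);
        }
        if (!current.isEmpty()) {
            jobs.add(current);
        }
        return jobs;
    }
}
```

Under this rule the demo pipeline (S3 source, parse, group-by, sink) becomes two physical jobs: the source and parse stages fuse into one, and the aggregation plus sink form the second.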
12. Upcoming Capabilities
✦ Join across multiple data sources (CDAP-5588)
✦ Pipeline preview
✦ Macro substitutions
✦ Pre-actions in pipelines, similar to post-run notifications
✦ Spark Streaming support for realtime pipelines
14. Self-Service Data Ingestion and ETL for Data Lakes
✦ Built for production on CDAP
✦ Rich drag-and-drop user interface
✦ Open source and highly extensible