Gobblin provides a data ingestion and lifecycle management platform for LinkedIn's Hadoop clusters. It supports ingesting data from various sources into HDFS, and provides additional capabilities like replication, retention, optimization and compliance. Gobblin treats each dataset independently and orchestrates operators like ingestion, replication and retention through shared metadata. This allows for flexible and extensible management of LinkedIn's large and growing volume of datasets and data flows through their entire lifecycle.
3. Gobblin for Data Ingest
Data sources:
- Streaming events
- OLTP snapshots
- OLTP changelogs
- Cloud services
Protocols and systems:
- Kafka
- JDBC
- REST
- SOAP
- HDFS
- SFTP
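Each of the ingest paths above is driven by a job configuration. As an illustration only, a Gobblin pull file for a Kafka-to-HDFS job might look like the sketch below; the class names, broker address, and topic are assumptions for the example, not values taken from this deck:

```properties
# Illustrative sketch of a Gobblin pull file (Kafka -> HDFS).
# All concrete values below are assumed for the example.
job.name=KafkaToHdfsExample
job.group=Ingestion

# Source side: pull records from Kafka.
source.class=org.apache.gobblin.source.extractor.extract.kafka.KafkaSimpleSource
kafka.brokers=localhost:9092
topic.whitelist=PageViewEvent

# Writer/publisher side: land the records on HDFS.
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
```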
4. A Peek at Our Support List: Beyond Data Ingest
1. Monitoring: When and how often is the data made available?
2. Replication: Can you also copy this data onto these other Hadoop clusters?
3. Retention: Can you retain data for a period of time and then purge it on an ongoing basis?
4. Optimization: Can you provide certain datasets in a more optimal format, like ORC?
5. Compaction: Can you guarantee that the data doesn't have duplicates?
6. Compliance: Can you purge some rows for compliance reasons? Can this be done continuously?
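To make the retention capability concrete, here is a minimal sketch (in Java, not Gobblin's actual retention API) of a time-based policy that selects dataset versions older than a retention window for purging:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hedged sketch of a time-based retention policy: keep each dataset version
// only for a fixed window, purging anything older on every run.
public class TimeBasedRetention {
    // Given dataset versions and their creation timestamps, return the
    // versions whose age exceeds the retention window as of `now`.
    public static List<String> versionsToPurge(Map<String, Instant> versions,
                                               Duration retention,
                                               Instant now) {
        Instant cutoff = now.minus(retention);
        List<String> purge = new ArrayList<>();
        for (Map.Entry<String, Instant> e : versions.entrySet()) {
            if (e.getValue().isBefore(cutoff)) {
                purge.add(e.getKey());
            }
        }
        return purge;
    }
}
```

Running this on an ongoing basis (e.g. once per day) gives the continuous purge behavior the question asks about.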
7. Data Lifecycle Management
Managing the flow of systems' data and metadata throughout its life cycle:
- from creation and receipt
- through distribution and maintenance
- to deletion.
8. Hadoop Data Lifecycle Management at LinkedIn
Data and metadata:
- 10K+ datasets
- Dataset auto-discovery
- Ownership across many teams
Systems:
- Multiple loosely coupled systems
- Ownership across multiple teams
- Systems and data evolve independently over time
12. Metadata
- Ubiquitous
- Heterogeneous
- Common:
  - Associated with a dataset URI
  - Can be represented as a collection of K/V pairs
Metadata in Gobblin:
- Input: dataset configuration
- Output: metrics and tracking events
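The "collection of K/V pairs associated with a dataset URI" model can be sketched in a few lines of Java (an illustration only, not Gobblin's actual metadata classes; the example keys are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: metadata as key/value pairs keyed off a dataset URI,
// a shape that covers both input configuration and output metrics.
public class DatasetMetadata {
    private final String datasetUri;                 // identifies the dataset
    private final Map<String, String> pairs = new HashMap<>();

    public DatasetMetadata(String datasetUri) {
        this.datasetUri = datasetUri;
    }

    public String datasetUri() {
        return datasetUri;
    }

    // Fluent setter, so configuration reads as a sequence of K/V pairs.
    public DatasetMetadata put(String key, String value) {
        pairs.put(key, value);
        return this;
    }

    public String get(String key) {
        return pairs.get(key);
    }
}
```

Because every operator sees the same URI-plus-pairs shape, any of them can consume or emit metadata without knowing which system produced it.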
13. Orchestration
Dataset operators are independent actors:
- Ingest is unaware of replication, and vice versa
- Interaction happens through shared state:
  - Ingest lands a dataset in a data directory
  - Replication copies all datasets in the directory
  - Retention runs over all datasets in the directory
Datasets and metadata are the common language.
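The shared-state interaction above can be sketched as three operators that never call each other and communicate only through the contents of a data directory (a toy model, not Gobblin's implementation; the method and dataset names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of shared-state orchestration: ingest, replication, and
// retention only see the data directory, never each other.
public class SharedStateOrchestration {
    // Shared state: dataset URIs currently landed in the data directory.
    private final List<String> dataDirectory = new ArrayList<>();

    // Ingest lands a dataset; it knows nothing about replication or retention.
    public void ingest(String datasetUri) {
        dataDirectory.add(datasetUri);
    }

    // Replication copies whatever datasets it discovers in the directory.
    public List<String> replicate() {
        return new ArrayList<>(dataDirectory);
    }

    // Retention purges datasets matching a policy (here: a URI prefix),
    // returning how many were removed.
    public int applyRetention(String purgePrefix) {
        int before = dataDirectory.size();
        dataDirectory.removeIf(uri -> uri.startsWith(purgePrefix));
        return before - dataDirectory.size();
    }

    public List<String> listDatasets() {
        return new ArrayList<>(dataDirectory);
    }
}
```

Adding a new operator (say, format optimization) would mean adding another method that scans the same directory, with no changes to the existing ones.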
14. How About Falcon?
- Top-down approach
- Tight coupling: centralized repository for feeds (datasets) and processes
- Not designed for multi-tenancy
- Lack of dataset auto-discovery
- Lack of policies
- Inflexible flows
15. Conclusion
- Data lifecycle management: it's more than just ingest
- Loosely coupled systems: flexible processing is a must for growth
- Dataset-centric processing: think about datasets, not jobs