Gobblin provides a data ingestion and lifecycle management platform for LinkedIn's Hadoop clusters. It supports ingesting data from various sources into HDFS, and provides additional capabilities like replication, retention, optimization and compliance. Gobblin treats each dataset independently and orchestrates operators like ingestion, replication and retention through shared metadata. This allows for flexible and extensible management of LinkedIn's large and growing volume of datasets and data flows through their entire lifecycle.
3. Gobblin for Data Ingest
Data sources:
- Streaming events
- OLTP snapshots
- OLTP changelogs
- Cloud services
Protocols and systems:
- Kafka
- JDBC
- REST
- SOAP
- HDFS
- SFTP
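Each of the ingest paths above is driven by a job configuration. As an illustration only, a Gobblin pull file for a Kafka-to-HDFS job might look like the sketch below; the class names, broker address, and topic are assumptions for the example, not values taken from this deck:

```properties
# Illustrative sketch of a Gobblin pull file (Kafka -> HDFS).
# All concrete values below are assumed for the example.
job.name=KafkaToHdfsExample
job.group=Ingestion

# Source side: pull records from Kafka.
source.class=org.apache.gobblin.source.extractor.extract.kafka.KafkaSimpleSource
kafka.brokers=localhost:9092
topic.whitelist=PageViewEvent

# Writer/publisher side: land the records on HDFS.
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
```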
4. A Peek at Our Support List: Beyond Data Ingest
1. Monitoring: When and how often is the data made available?
2. Replication: Can you also copy this data onto these other Hadoop clusters?
3. Retention: Can you retain data for a period of time and then purge it on an ongoing basis?
4. Optimization: Can you provide certain datasets in a more optimal format, like ORC?
5. Compaction: Can you guarantee that the data doesn't have duplicates?
6. Compliance: Can you purge some rows for compliance reasons? Can this be done continuously?
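To make the retention capability concrete, here is a minimal sketch (in Java, not Gobblin's actual retention API) of a time-based policy that selects dataset versions older than a retention window for purging:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hedged sketch of a time-based retention policy: keep each dataset version
// only for a fixed window, purging anything older on every run.
public class TimeBasedRetention {
    // Given dataset versions and their creation timestamps, return the
    // versions whose age exceeds the retention window as of `now`.
    public static List<String> versionsToPurge(Map<String, Instant> versions,
                                               Duration retention,
                                               Instant now) {
        Instant cutoff = now.minus(retention);
        List<String> purge = new ArrayList<>();
        for (Map.Entry<String, Instant> e : versions.entrySet()) {
            if (e.getValue().isBefore(cutoff)) {
                purge.add(e.getKey());
            }
        }
        return purge;
    }
}
```

Running this on an ongoing basis (e.g. once per day) gives the continuous purge behavior the question asks about.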
7. Data Lifecycle Management
Managing the flow of systems' data and metadata throughout its life cycle:
- from creation and receipt
- through distribution and maintenance
- to deletion.
8. Hadoop Data Lifecycle Management at LinkedIn
Data and metadata:
- 10K+ datasets
- Dataset auto-discovery
- Ownership across many teams
Systems:
- Multiple loosely coupled systems
- Ownership across multiple teams
- Systems and data evolve independently over time
12. Metadata
- Ubiquitous
- Heterogeneous
- Common:
  - Associated with a dataset URI
  - Can be represented as a collection of K/V pairs
Metadata in Gobblin:
- Input: dataset configuration
- Output: metrics and tracking events
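The "collection of K/V pairs associated with a dataset URI" model can be sketched in a few lines of Java (an illustration only, not Gobblin's actual metadata classes; the example keys are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: metadata as key/value pairs keyed off a dataset URI,
// a shape that covers both input configuration and output metrics.
public class DatasetMetadata {
    private final String datasetUri;                 // identifies the dataset
    private final Map<String, String> pairs = new HashMap<>();

    public DatasetMetadata(String datasetUri) {
        this.datasetUri = datasetUri;
    }

    public String datasetUri() {
        return datasetUri;
    }

    // Fluent setter, so configuration reads as a sequence of K/V pairs.
    public DatasetMetadata put(String key, String value) {
        pairs.put(key, value);
        return this;
    }

    public String get(String key) {
        return pairs.get(key);
    }
}
```

Because every operator sees the same URI-plus-pairs shape, any of them can consume or emit metadata without knowing which system produced it.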
13. Orchestration
Dataset operators are independent actors:
- Ingest is unaware of replication, and vice versa
- Interaction happens through shared state:
  - Ingest lands a dataset in a data directory
  - Replication copies all datasets in the directory
  - Retention runs over all datasets in the directory
Datasets and metadata are the common language.
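The shared-state interaction above can be sketched as three operators that never call each other and communicate only through the contents of a data directory (a toy model, not Gobblin's implementation; the method and dataset names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of shared-state orchestration: ingest, replication, and
// retention only see the data directory, never each other.
public class SharedStateOrchestration {
    // Shared state: dataset URIs currently landed in the data directory.
    private final List<String> dataDirectory = new ArrayList<>();

    // Ingest lands a dataset; it knows nothing about replication or retention.
    public void ingest(String datasetUri) {
        dataDirectory.add(datasetUri);
    }

    // Replication copies whatever datasets it discovers in the directory.
    public List<String> replicate() {
        return new ArrayList<>(dataDirectory);
    }

    // Retention purges datasets matching a policy (here: a URI prefix),
    // returning how many were removed.
    public int applyRetention(String purgePrefix) {
        int before = dataDirectory.size();
        dataDirectory.removeIf(uri -> uri.startsWith(purgePrefix));
        return before - dataDirectory.size();
    }

    public List<String> listDatasets() {
        return new ArrayList<>(dataDirectory);
    }
}
```

Adding a new operator (say, format optimization) would mean adding another method that scans the same directory, with no changes to the existing ones.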
14. How About Falcon?
- Top-down approach
- Tight coupling: centralized repository for feeds (datasets) and processes
- Not designed for multi-tenancy
- Lack of dataset auto-discovery
- Lack of policies
- Inflexible flows
15. Conclusion
- Data lifecycle management: it's more than just ingest
- Loosely coupled systems: flexible processing is a must for growth
- Dataset-centric processing: think about datasets, not jobs