Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetup @ LinkedIn Apr 2017

This talk describes the motivations behind Apache Gobblin (incubating), its architecture, and the latest innovations in supporting both batch and streaming data pipelines, as well as the future roadmap.


  1. The Data Driven Network. Kapil Surlaker, Director of Engineering. Bridging Batch and Streaming Data Integration with Gobblin. Shirshanka Das, Gobblin team. Big Data Meetup, 26th Apr 2017. github.com/linkedin/gobblin | @ApacheGobblin | gitter.im/gobblin
  2. Data Integration: key requirements. Source and sink diversity. Batch + streaming. Data quality. So, we built Gobblin.
  3. Simplifying Data Integration @LinkedIn: SFTP, JDBC, REST, Azure Storage, and more. Hundreds of TB per day, thousands of datasets, ~30 different source systems, 80%+ of data ingest. Open source @ github.com/linkedin/gobblin. Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, CERN, NerdWallet and many more… Apache incubation under way.
  4. Other Open Source Systems in this Space: Sqoop, Flume, Falcon, Nifi, Kafka Connect; Flink, Spark, Samza, Apex. Similar in pieces, dissimilar in aggregate. Most are tied to a specific execution model (batch / stream). Most are tied to a specific implementation or ecosystem (Kafka, Hadoop, etc.).
  5. Gobblin: Under the Hood
  6. Gobblin: The Logical Pipeline
  7. WorkUnit: a logical unit of work, typically bounded but not necessarily. Examples: Kafka topic LoginEvent, partition 10, offsets 10-200; HDFS folder /data/Login, file part-0.avro; Hive dataset Tracking.Login, date-partition=mm-dd-yy-hh.
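     A minimal sketch of how the Kafka example above could be described as a WorkUnit. This assumes the pre-Apache gobblin.* package names from this talk's era; the kafka.* property keys are illustrative placeholders, not Gobblin's actual Kafka configuration names.

        import gobblin.source.workunit.WorkUnit;

        public class WorkUnitSketch {
          public static void main(String[] args) {
            // Describe "Kafka Topic: LoginEvent, Partition: 10, Offsets: 10-200".
            // The kafka.* keys are illustrative placeholders.
            WorkUnit workUnit = WorkUnit.createEmpty();
            workUnit.setProp("kafka.topic", "LoginEvent");
            workUnit.setProp("kafka.partition", 10);
            workUnit.setProp("kafka.start.offset", 10L);
            // Bounded WorkUnit: it has an end offset. An unbounded (streaming)
            // WorkUnit would leave the end open.
            workUnit.setProp("kafka.end.offset", 200L);
            System.out.println(workUnit.getProp("kafka.topic"));
          }
        }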
  8. Source: a provider of WorkUnits (typically backed by a system like Kafka, HDFS, etc.).
  9. Task: a unit of execution that operates on a WorkUnit. Extracts records from the source and writes them to the destination. Ends when the WorkUnit is exhausted of records. (Assigned to a thread in a thread pool, a mapper in Map-Reduce, etc.)
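     To make the Task's role concrete, here is a schematic version of the record loop it runs over one WorkUnit. The Extractor, Converter, and Writer interfaces below are simplified stand-ins, not the real Gobblin signatures.

        import java.io.IOException;
        import java.util.List;

        // Simplified stand-ins for the Gobblin interfaces (schematic, not the real API).
        interface Extractor<S, D> { S getSchema(); D readRecord() throws IOException; }
        interface Converter<S, D> { List<D> convertRecord(S schema, D record); }
        interface Writer<D> { void write(D record) throws IOException; void commit() throws IOException; }

        public class TaskSketch {
          // The loop a Task runs over one WorkUnit: extract, convert (1:N), write.
          static <S, D> void runTask(Extractor<S, D> extractor, Converter<S, D> converter,
                                     Writer<D> writer) throws IOException {
            S schema = extractor.getSchema();
            D record;
            // The Task ends when the WorkUnit is exhausted (readRecord returns null).
            while ((record = extractor.readRecord()) != null) {
              for (D out : converter.convertRecord(schema, record)) {
                writer.write(out);
              }
            }
            writer.commit();
          }
        }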
  10. Extractor: a provider of records given a WorkUnit. Connects to the data source and deserializes records.
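     A toy Extractor that serves records from an in-memory list, standing in for a real connection to Kafka or HDFS. The interface shape (getSchema, readRecord, watermark and count accessors) is assumed from the pre-Apache gobblin.* sources and may differ in detail.

        import gobblin.source.extractor.DataRecordException;
        import gobblin.source.extractor.Extractor;

        import java.io.IOException;
        import java.util.Iterator;
        import java.util.List;

        public class InMemoryExtractor implements Extractor<String, String> {
          private final List<String> records;
          private final Iterator<String> it;

          public InMemoryExtractor(List<String> records) {
            this.records = records;
            this.it = records.iterator();
          }

          @Override
          public String getSchema() { return "plain-text"; }  // the deserialized schema

          @Override
          public String readRecord(String reuse) throws DataRecordException, IOException {
            // Returning null signals that the WorkUnit is exhausted.
            return it.hasNext() ? it.next() : null;
          }

          @Override
          public long getExpectedRecordCount() { return records.size(); }

          @Override
          public long getHighWatermark() { return records.size(); }

          @Override
          public void close() throws IOException { /* release the source connection */ }
        }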
  11. Converter: a 1:N mapper of input records to output records. Multiple converters can be chained (e.g. Avro <-> JSON, schema projection, encryption).
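     A toy 1:N converter: it fans one comma-separated input record out into N output records, which is the shape this slide describes. The base-class signature is assumed from the pre-Apache gobblin.* sources; a real converter is wired in via converter.classes in the pull file, as the specs later in this deck show.

        import gobblin.configuration.WorkUnitState;
        import gobblin.converter.Converter;
        import gobblin.converter.DataConversionException;
        import gobblin.converter.SchemaConversionException;

        import java.util.Arrays;

        public class SplitConverter extends Converter<String, String, String, String> {

          @Override
          public String convertSchema(String inputSchema, WorkUnitState workUnit)
              throws SchemaConversionException {
            return inputSchema;  // the schema passes through unchanged
          }

          @Override
          public Iterable<String> convertRecord(String outputSchema, String inputRecord,
              WorkUnitState workUnit) throws DataConversionException {
            // 1:N - one input record becomes several output records.
            return Arrays.asList(inputRecord.split(","));
          }
        }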
  12. Quality Checker: checks whether the quality of the output is satisfactory. Row-level (e.g. a time value check). Task-level (e.g. an audit check, schema compatibility).
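     A sketch of what the row-level "time value check" might look like, written as a plain predicate rather than against Gobblin's quality-checker classes (which the transcript does not spell out): reject records whose event timestamp is in the future or implausibly old.

        import java.time.Duration;
        import java.time.Instant;

        public class TimeValueCheck {
          private static final Duration MAX_AGE = Duration.ofDays(1);  // illustrative bound

          // Row-level check: the record's event time must be recent and not in the future.
          static boolean passes(long eventTimeEpochMillis) {
            Instant eventTime = Instant.ofEpochMilli(eventTimeEpochMillis);
            Instant now = Instant.now();
            return !eventTime.isAfter(now)
                && Duration.between(eventTime, now).compareTo(MAX_AGE) <= 0;
          }

          public static void main(String[] args) {
            System.out.println(passes(System.currentTimeMillis()));  // true: fresh record
            System.out.println(passes(0L));                          // false: epoch 0 is too old
          }
        }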
  13. Writer: writes to the destination. Owns the connection to the destination and the serialization of records. Sync / async. e.g. FsWriter, KafkaWriter, CouchbaseWriter.
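     The sync / async distinction on this slide is the key design choice for a writer. A sketch of both shapes, using a stand-in API rather than Gobblin's writer interface: the async form returns a future per record so the task can keep extracting while I/O completes, which is how a KafkaWriter-style destination sustains throughput.

        import java.util.concurrent.CompletableFuture;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;

        public class WriterSketch {
          private final ExecutorService ioPool = Executors.newFixedThreadPool(4);

          // Synchronous: the caller blocks until the record is on the wire.
          void writeSync(byte[] serializedRecord) {
            send(serializedRecord);
          }

          // Asynchronous: the caller gets a future and continues processing records.
          CompletableFuture<Void> writeAsync(byte[] serializedRecord) {
            return CompletableFuture.runAsync(() -> send(serializedRecord), ioPool);
          }

          private void send(byte[] record) { /* connection to the destination */ }
        }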
  14. Publisher: finalizes / commits the data. Used for destinations that support atomicity (e.g. moving a tmp staging directory to the final output directory on HDFS).
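     The HDFS example on this slide is the classic staging-directory commit. A minimal sketch using the standard Hadoop FileSystem API (the paths are illustrative): because rename is a metadata-only operation on HDFS, readers see either all of the published output or none of it.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        import java.io.IOException;

        public class StagingPublishSketch {
          public static void publish(Configuration conf) throws IOException {
            FileSystem fs = FileSystem.get(conf);
            Path staging = new Path("/data/_staging/Login/2017-04-26");  // illustrative
            Path output  = new Path("/data/Login/2017-04-26");           // illustrative
            // Writers have already landed files under the staging directory;
            // a single rename makes them visible atomically.
            if (!fs.rename(staging, output)) {
              throw new IOException("Publish failed: could not rename " + staging);
            }
          }
        }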
  15. Gobblin: The Logical Pipeline
  16. Gobblin: The Stateful Logical Pipeline. A State Store (HDFS, S3, MySQL, ZK, …) makes the pipeline stateful: each run loads its config and the previous watermarks, and saves new watermarks when it finishes.
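     A sketch of the load / save watermark cycle this slide depicts. The StateStore interface here is a hypothetical stand-in for Gobblin's state store, just to show the checkpointing contract: resume from the last committed watermark, process, commit a new one.

        import java.util.Optional;

        // Hypothetical stand-in for the state store (HDFS, S3, MySQL, ZK, ...).
        interface StateStore {
          Optional<Long> loadWatermark(String pipeline);
          void saveWatermark(String pipeline, long watermark);
        }

        public class WatermarkCycleSketch {
          static void runOnce(StateStore store, String pipeline) {
            // Resume from the previous watermark, or from the beginning.
            long from = store.loadWatermark(pipeline).orElse(0L);
            long newWatermark = process(from);            // e.g. pull offsets [from, new)
            store.saveWatermark(pipeline, newWatermark);  // checkpoint for the next run
          }

          private static long process(long fromOffset) { return fromOffset + 100; }
        }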
  17. Gobblin: Pipeline Specification
  18. Gobblin: Pipeline Specification (pipeline name and description; source + configuration):
      job.name=PullFromWikipedia
      job.group=Wikipedia
      job.description=A getting started example for Gobblin
      source.class=gobblin.example.wikipedia.WikipediaSource
      source.page.titles=LinkedIn,Wikipedia:Sandbox
      source.revisions.cnt=5
      wikipedia.api.rooturl=https://en.wikipedia.org/w/api.php
      wikipedia.avro.schema={"namespace": "example.wikipedia.avro" ,…"null"]}]}
      gobblin.wikipediaSource.maxRevisionsPerPage=10
      converter.classes=gobblin.example.wikipedia.WikipediaConverter
  19. Gobblin: Pipeline Specification (converter; writer + configuration):
      converter.classes=gobblin.example.wikipedia.WikipediaConverter
      extract.namespace=gobblin.example.wikipedia
      writer.destination.type=HDFS
      writer.output.format=AVRO
      writer.partitioner.class=gobblin.example.wikipedia.WikipediaPartitioner
  20. Gobblin: Pipeline Specification (publisher):
      data.publisher.type=gobblin.publisher.BaseDataPublisher
  21. Gobblin: Pipeline Deployment. One spec, multiple environments: Bare Metal / AWS / Azure / VM. Standalone: a single instance on one box (small / medium / large). AWS (EC2): elastic cluster. Hadoop (YARN / MR): static cluster. Standalone Cluster.
  22. Execution Model: Batch versus Streaming. Batch: determine work, acquire slots, run, checkpoint, repeat. + Cost-efficient, deterministic, repeatable. - Higher latency. - Setup and checkpoint costs dominate when "micro-batching".
  23. Execution Model: Batch versus Streaming. Streaming: determine work streams, run continuously, checkpoint periodically. + Low latency. - Higher cost, because it is harder to provision accurately. - More sophistication needed to deal with change.
  24. Execution Model Scorecard: the use cases JDBC <-> HDFS, Kafka -> HDFS, HDFS -> Kafka, and Kafka <-> Kinesis, each scored as better served by batch or streaming execution.
  25. Can we run in both models using the same system?
  26. Gobblin: The Logical Pipeline
  27. Pipeline Stages: Start. Batch: determine work. Streaming: determine work, with unbounded WorkUnits.
  28. Pipeline Stages: Run. Batch: acquire slots, run. Streaming: run continuously, checkpoint periodically, shut down gracefully. (In streaming mode, a Watermark Manager persists checkpoints to State Storage, with notify / ack / shutdown signals coordinating the pipeline.)
  29. Pipeline Stages: End. Batch: checkpoint, commit. Streaming: do nothing (NoOpPublisher).
  30. Enabling streaming mode: task.executionMode = streaming. Available in Standalone (single instance), AWS, Hadoop (YARN / MR), and Standalone Cluster deployments.
  31. A Streaming Pipeline Spec: Kafka 2 Kafka (pipeline name and description):
      # A sample pull file that copies an input Kafka topic and
      # produces to an output Kafka topic with sampling
      job.name=Kafka2KafkaStreaming
      job.group=Kafka
      job.description=This is a job that runs forever, copies an input Kafka topic to an output Kafka topic
      job.lock.enabled=false
  32. A Streaming Pipeline Spec: Kafka 2 Kafka (source, configuration):
      source.class=gobblin.source….KafkaSimpleStreamingSource
      gobblin.streaming.kafka.topic.key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
      gobblin.streaming.kafka.topic.value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
      gobblin.streaming.kafka.topic.singleton=test
      kafka.brokers=localhost:9092
  33. A Streaming Pipeline Spec: Kafka 2 Kafka (converter, configuration):
      # Sample 10% of the records
      converter.classes=gobblin.converter.SamplingConverter
      converter.sample.ratio=0.10
  34. A Streaming Pipeline Spec: Kafka 2 Kafka (writer, configuration; publisher):
      writer.builder.class=gobblin.kafka.writer.KafkaDataWriterBuilder
      writer.kafka.topic=test_copied
      writer.kafka.producerConfig.bootstrap.servers=localhost:9092
      writer.kafka.producerConfig.value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
      data.publisher.type=gobblin.publisher.NoopPublisher
  35. A Streaming Pipeline Spec: Kafka 2 Kafka (execution mode; watermark storage configuration):
      task.executionMode=STREAMING
      # Configure watermark storage for streaming
      #streaming.watermarkStateStore.type=zk
      #streaming.watermarkStateStore.config.state.store.zk.connectString=localhost:2181
      # Configure watermark commit settings for streaming
      #streaming.watermark.commitIntervalMillis=2000
  36. Gobblin Streaming: Cluster view. A cluster of processes: a Cluster Master plus workers, reading from a stream source and writing to a sink (Kafka, HDFS, …). Apache Helix handles work-unit assignment, fault tolerance, and reassignment.
  37. Active Workstreams in Gobblin. Gobblin as a Service: a global orchestrator with a REST API for submitting logical flow specifications; logical flow specifications compile down to physical pipeline specs. Global Throttling: throttling to ensure Gobblin respects quotas globally (e.g. API calls, network bandwidth, the Hadoop NameNode, etc.); generic, so it can be used outside Gobblin. Metadata driven: integration with a Metadata Service (c.f. WhereHows); policy-driven replication, permissions, encryption, etc.
  38. Roadmap. Final LinkedIn Gobblin 0.10.0 release. Apache Incubator code donation and release. More streaming runtimes: integration with Apache Samza and LinkedIn Brooklin. GDPR compliance: data purge for Hadoop and other systems. Security improvements: credential storage, secure specs.
  39. Gobblin Team @ LinkedIn
