2013 march 26_thug_etl_cdc_talking_points
- 1. Data Integration in 2013:
A working session
Adam Muise
March 26 2013
Note: This deck is purposely sparse. Want value?
Join the conversation in the Toronto Hadoop User
Group:
http://www.meetup.com/TorontoHUG/
© Hortonworks Inc. 2012
- 2. Proposed Agenda
• Introductions
• Discuss common Data Integration Patterns
• Round-table of User Group Member CDC/ETL Use Cases
• New Data Integration Solutions: A change from the Old Guard:
– Hadoop and the Data Lake
– Streaming (+ Hadoop)
– Data Lake Governance / Management (InfoTrellis)
– Databus (LinkedIn)
- 4. General Data Integration Patterns
• Enterprise Application Integration*
– Metadata lookup
– Validation
– Extra-app communication
• Enterprise Service Bus (SOA, Message Bus/Hub)*
• Federation*
– Bridging multiple databases with a query layer
– E.g.: Composite
• Extract Transform Load (ETL)*
– Collection
– Aggregation
– Format/Schema transformation
• Data Lake
– Landing Zone for multiple datasets in one store
– Mixed schema, often raw structured/unstructured data
– E.g.: Hadoop
* Source: Data Integration Blueprint and Modeling: Techniques for a Scalable and Sustainable Architecture, Anthony David Giordano, 2010, IBM Press.
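The three ETL steps listed above (collection, aggregation, format/schema transformation) can be sketched in a few lines of Python. This is only an illustrative sketch, not something from the deck: the two CSV sources, the field names, and the target schema (`customer_id`, `total_cents`) are all made-up assumptions.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw inputs from two sources (illustrative data only).
SOURCE_A = "customer,amount\nacme,10.50\nglobex,3.25\n"
SOURCE_B = "customer,amount\nacme,4.50\n"


def extract(*sources):
    """Collection: read rows from every source into one stream of dicts."""
    for text in sources:
        yield from csv.DictReader(io.StringIO(text))


def transform(rows):
    """Aggregation: total the amount per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += float(row["amount"])
    return totals


def load(totals):
    """Format/schema transformation: emit records in the (assumed) target schema."""
    return [
        {"customer_id": name.upper(), "total_cents": round(total * 100)}
        for name, total in sorted(totals.items())
    ]


result = load(transform(extract(SOURCE_A, SOURCE_B)))
print(result)
```

In a data lake setup the raw sources would typically be landed unchanged first, with a transformation like this applied later on read.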
- 7. New Data Integration Solutions
Fresh ideas for new and old problems…
- 8. Hadoop: The Data Lake
[Diagram: Hadoop data lake workflow. Stage labels: Publish Event, Signal Data, Transformation, Model/Apply Metadata, Transform & Aggregate, Publish, Exchange, Explore, Visualize, Extract & Load, Report, Analyze]
- 11. DataBus (LinkedIn)
Databus is a low-latency change capture system that has become an
integral part of LinkedIn’s data processing pipeline. Databus addresses a
fundamental requirement: to reliably capture, flow, and process primary
data changes. Databus provides the following features:
1. Isolation between sources and consumers
2. Guaranteed in-order, at-least-once delivery with high availability
3. Consumption from an arbitrary point in time in the change stream, including full bootstrap of the entire data set
4. Partitioned consumption
5. Source consistency preservation
https://github.com/linkedin/databus/wiki
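To make the listed guarantees concrete, here is a toy in-memory model of a change stream in Python. It is not the Databus API and does not reflect how Databus is implemented; `ChangeLog`, `Consumer`, `partition_of`, and the `crash_after` parameter are hypothetical names invented for this sketch. It shows in-order reads from an arbitrary sequence number, replay from the start as a stand-in for bootstrap, partitioned consumption, and why at-least-once delivery calls for idempotent consumers.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    scn: int   # sequence number, assigned in commit order
    key: str
    value: str


def partition_of(key, num_partitions):
    """Stable key-to-partition mapping (crc32 is deterministic across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class ChangeLog:
    """Toy change stream: the source only appends; consumers read on their own."""

    def __init__(self):
        self._events = []

    def append(self, key, value):
        change = Change(len(self._events) + 1, key, value)
        self._events.append(change)
        return change

    def read_from(self, scn, partition=None, num_partitions=1):
        """Yield changes with sequence number > scn, in commit order."""
        for change in self._events:
            if change.scn <= scn:
                continue
            if partition is not None and partition_of(change.key, num_partitions) != partition:
                continue
            yield change


class Consumer:
    """Applies changes as idempotent upserts and checkpoints after each batch."""

    def __init__(self, log):
        self.log = log
        self.checkpoint = 0    # last sequence number known to be applied
        self.deliveries = 0    # counts every delivery, including redeliveries
        self.state = {}

    def poll(self, crash_after=None):
        applied, last = 0, self.checkpoint
        for change in self.log.read_from(self.checkpoint):
            if crash_after is not None and applied == crash_after:
                return False   # simulated crash: checkpoint is not advanced
            self.state[change.key] = change.value
            self.deliveries += 1
            last = change.scn
            applied += 1
        self.checkpoint = last
        return True


log = ChangeLog()
for key, value in [("a", "1"), ("b", "2"), ("a", "3")]:
    log.append(key, value)

consumer = Consumer(log)
consumer.poll(crash_after=2)   # applies two changes, then "crashes"
consumer.poll()                # all three changes are delivered again, in order
print(consumer.state, consumer.deliveries)
```

Because the checkpoint only advances after a full batch, a crash causes redelivery rather than loss; the upsert keeps the final state correct despite the duplicates.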