© Hortonworks Inc. 2013
ETL 2.0
Reference Architecture
George Vetticaden - Hortonworks: Solutions Engineer
George Trujillo - Hortonworks: Master Principal Big Data Specialist
George Vetticaden
•  Solutions Engineer – Big Data at Hortonworks
•  Chief Architect and Co-Founder of eScreeningz
§  Enterprise Architect vFabric Cloud App Platform – VMware
§  Specialties:
§  Big Data and Cloud Computing
§  Hadoop
§  Cloud Application Platforms (PaaS) – Cloud Foundry, Heroku
§  Infrastructure as a Service (IaaS) Platforms – vCloud Director, AWS
§  Virtualization – vSphere, vCenter
§  J2EE
§  Hibernate, Spring
§  ESB and Middleware Integration
§  SOA Architecture
George Trujillo
•  Master Principal Big Data Specialist - Hortonworks
•  Tier One Big Data, Oracle and BCA Specialist - VMware
•  20+ years as an Oracle DBA: DW, BI, RAC, Streams, Data Guard, Performance, Backup/Recovery
§  Oracle Double ACE
§  Sun Microsystems Ambassador for Application Middleware
§  Oracle Fusion Council & Oracle Beta Leadership Council
§  Two terms on the Independent Oracle Users Group Board of Directors
§  Recognized as one of the “Oracles of Oracle” by IOUG
§  MySQL Certified DBA
§  VMware Certified Instructor (VCI)
Challenges with a Traditional ETL Platform
•  High complexity, or outright inability, when dealing with loosely structured data
•  Data discarded due to cost and/or performance constraints
•  Significant time spent understanding source data and defining destination data structures
•  High latency between data generation and data availability
•  No visibility into transactional data
•  Doesn't scale linearly; high license costs
•  EDW used as an ETL tool, with hundreds of transient staging tables
Hadoop Based ETL Platform
•  Support for any type of data: structured or unstructured
•  Linearly scalable on commodity hardware
•  Massively parallel storage and compute
•  Store raw transactional data
•  Store 7+ years of data with no archiving
•  Data lineage: store intermediate stages of data
•  Becomes a powerful analytics platform
•  Provides data for use with minimal delay and latency
•  Enables real-time capture of source data
•  The data warehouse can focus less on storage and transformation, and more on analytics
Key Capability in Hadoop: Late binding
With traditional ETL, sources (web logs and click streams, machine-generated data, OLTP) feed an ETL server that stores transformed data in a data mart/EDW for client apps. Structure must be agreed upon far in advance and is difficult to change.
With Hortonworks HDP (Hadoop core plus data services and operational services), the same sources are captured in full and transformations are applied dynamically on the way to the data mart/EDW and client apps. With Hadoop, capture all data and structure it as business needs evolve.
Organize Tiers and Process with Metadata
•  Raw Tier: extract & load (WebHDFS, Flume, Sqoop)
•  Work Tier: standardize, cleanse, transform (MapReduce, Pig)
•  Gold/Storage Tier: transform, integrate, store (MapReduce, Pig)
•  Access Tier: conform, summarize, access (HiveQL, Pig)
HCatalog provides unified metadata access to Pig, Hive & MapReduce.
•  Organize data based on source/derived relationships
•  Allows for fault recovery and rebuild processes
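One way to make the tier/source/partition organization concrete is a simple path convention. The layout and function below are illustrative assumptions for the sketch, not a Hortonworks-prescribed standard.

```python
# Illustrative sketch: one possible HDFS path convention for the four
# logical tiers (raw, work, gold, access), keyed by source and date
# partition so source/derived relationships stay visible in the layout.
TIERS = ("raw", "work", "gold", "access")

def tier_path(tier, source, ds):
    """Build an HDFS-style path for a data set partitioned by date (ds)."""
    if tier not in TIERS:
        raise ValueError("unknown tier: %s" % tier)
    return "/data/%s/%s/ds=%s" % (tier, source, ds)

# The same data set keeps its identity as it moves tier to tier:
print(tier_path("raw", "weblogs", "2013-06-18"))   # /data/raw/weblogs/ds=2013-06-18
print(tier_path("gold", "weblogs", "2013-06-18"))  # /data/gold/weblogs/ds=2013-06-18
```

Because the raw partition is retained, any derived tier can be rebuilt by re-running the transformation over the raw path.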
ETL Reference Architecture
The architecture moves data through these stages:
•  Extract & Load
•  Model / Apply Metadata
•  Transform & Aggregate
•  Explore: Visualize, Report, Analyze
•  Publish & Exchange
•  Publish Event / Signal Data Transformation
ETL Reference Architecture
The same stages, with HCatalog as the shared metadata layer across all of them:
•  Extract & Load
•  Organize/Model: Create Metadata
•  Transform & Aggregate
•  Explore: Visualize, Report, Analyze
•  Publish & Exchange
•  Publish Event / Signal Data Transformation
Metadata Services with HCatalog
Apache HCatalog provides flexible metadata services across tools, plus external access. Without it, raw Hadoop data is inconsistent, unknown, and tied to tool-specific access; with it, you get table access, aligned metadata, and a REST API.
•  Consistency of metadata and data models across tools (MapReduce, Pig, HBase, and Hive)
•  Accessibility: share data as tables in and out of HDFS
•  Availability: enables flexible, thin-client access via a REST API
Shared table and schema management opens the platform.
• Best Practice: Use HCatalog to manage metadata
– Schema/structure when needed, via tables and partitions
– Late binding at work: multiple/changing bindings supported
– Abstracts the location of data, so it can scale and be maintained easily over time
– Abstracts the data file format (e.g. compression type, HL7 v2, HL7 v3)
• Cope with changes in source data seamlessly
– Heterogeneous schemas across partitions as the source system evolves; consumers of the data are unaffected
– E.g.: partition '2012-01-01' of table X has a 30-field schema in HL7 v2 format; partition '2013-01-01' has 35 fields in HL7 v3 format
• RESTful API via WebHCat
Step 2 – HCatalog, Metadata
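The heterogeneous-partition behavior can be simulated outside Hadoop: a consumer binds only the fields it needs at read time, so partitions with older, smaller schemas keep working as the source evolves. The field names and two-partition layout below are hypothetical.

```python
# Sketch of late binding across partitions with different schemas: the
# reader applies its own field list at read time and defaults anything
# an older partition lacks, so consumers are unaffected by evolution.
partition_2012 = [{"id": 1, "msg": "hello"}]                # 2-field schema
partition_2013 = [{"id": 2, "msg": "world", "lang": "en"}]  # 3-field schema

def read(partitions, fields, default=None):
    """Yield records projected onto `fields`, defaulting missing ones."""
    for part in partitions:
        for rec in part:
            yield {f: rec.get(f, default) for f in fields}

rows = list(read([partition_2012, partition_2013], ["id", "msg", "lang"]))
print(rows[0])  # {'id': 1, 'msg': 'hello', 'lang': None}
print(rows[1])  # {'id': 2, 'msg': 'world', 'lang': 'en'}
```

HCatalog plays the role of `read` here: it records each partition's schema so every tool sees a consistent projected view.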
Sample Tweet data as JSON
{
  "user": {
    "name": "George Vetticaden - Name",
    "id": 10000000,
    "userlocation": "Chicago",
    "screenname": "gvetticadenScreenName",
    "geoenabled": false
  },
  "tweetmessage": "hello world",
  "createddate": "2013-06-18T11:47:10",
  "geolocation": {
    "latitude": 1000.0,
    "longitude": 10000.0
  }
}
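Raw JSON like this lands untyped; a consumer can bind structure at read time. A minimal Python sketch using the sample above:

```python
import json

# The sample tweet from the slide, stored as a raw string (as it would
# land in HDFS before any schema is applied).
tweet_json = """{
 "user": {"name": "George Vetticaden - Name", "id": 10000000,
          "userlocation": "Chicago", "screenname": "gvetticadenScreenName",
          "geoenabled": false},
 "tweetmessage": "hello world",
 "createddate": "2013-06-18T11:47:10",
 "geolocation": {"latitude": 1000.0, "longitude": 10000.0}
}"""

tweet = json.loads(tweet_json)
# Navigate the nested structure the same way the Hive struct schema does.
print(tweet["user"]["userlocation"])  # Chicago
print(tweet["tweetmessage"])          # hello world
```

The Hive/HCat schema on the next slide formalizes exactly this navigation: `user` and `geolocation` become structs, the rest top-level columns.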
Hive/HCat Schema for the Twitter Data
create external table tweet (
  user struct <
    userlocation: string,
    id: bigint,
    name: string,
    screenname: string,
    geoenabled: string
  >,
  geolocation struct <
    latitude: float,
    longitude: float
  >,
  tweetmessage string,
  createddate string
)
ROW FORMAT SERDE 'org.apache.hcatalog.data.JsonSerDe'
LOCATION '/user/kibana/twitter/landing';
Pig Example
Count how many times users tweeted a URL:

raw = load '/user/kibana/twitter/landing' as (url, user, tweetmessage);
botless = filter raw by myudfs.NotABot(user);
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group), COUNT(botless);
store cntd into '/data/counted/20120530';

Using HCatalog (no need to know the file location or declare the schema):

raw = load 'tweet' using HCatLoader();
botless = filter raw by myudfs.NotABot(user) and ds == '20120530';  -- partition filter
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(group), COUNT(botless);
store cntd into 'counted' using HCatStorer();
Step 3&4 – Transform, Aggregate, Explore
• MapReduce
– For programmers, when control matters
• Hive
– HiveQL (SQL-like) for ad-hoc querying and data exploration
• Pig
– Declarative data crunching and preprocessing (the T in ELT)
– User Defined Functions (UDFs) for extensibility and portability, e.g. custom UDFs that call industry-specific data format parsers (SWIFT, X12, NACHA, HL7, HIPAA, etc.)
• HCatalog
– Consistent metadata and data sharing across all tools
Common Processing Patterns
Common ETL Processing Patterns
• Long-term data retention
• Staging for Data Exploration
• Data Cleansing
• Data Enrichment
Important Dimensions to Consider
• Compression
• Buffering
• Data Format Containers
• Logical Processing Tiers (Raw, Work, Gold, Access)
Compression in Hadoop is Important
•  The biggest performance bottleneck in Hadoop is read/write I/O
•  Compression formats supported in HDP include gzip, bzip2, LZO, LZ4 and Snappy
•  The type of compression to use depends on a number of factors:
– The size of the data
– Whether faster compression/decompression or better compression effectiveness matters more (a space/time trade-off): faster compression/decompression speeds usually come at the expense of smaller space savings
– Whether compressed files need to be splittable for parallel MapReduce processing of a large file
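The space/time trade-off can be demonstrated with Python's built-in codecs standing in for Hadoop's (zlib for gzip-style DEFLATE, bz2 for bzip2). Exact ratios and timings depend on the data and codec settings; the repetitive sample data below is purely illustrative.

```python
import bz2
import time
import zlib

# Repetitive, log-like sample data: compresses well under any codec.
data = b"2013-06-18T11:47:10 GET /index.html 200 weblog-record\n" * 20000

for name, compress in (("zlib (gzip-style)", lambda d: zlib.compress(d, 6)),
                       ("bzip2", lambda d: bz2.compress(d, 9))):
    t0 = time.time()
    out = compress(data)
    # Report compression ratio and wall-clock time for each codec.
    print("%-18s ratio %.1fx, %.3fs" % (name, len(data) / len(out),
                                        time.time() - t0))
```

On data like this, bzip2 typically trades slower compression for a tighter result than zlib, which is the trade-off the bullets above describe.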
Suitcase Pattern: Buffering and Compression
• Suitcase Pattern
– Before we travel, we take our clothes off the rack and pack them (easier to store)
– We then unpack them when we arrive and put them back on the rack (easier to process)
– Consider event data “traveling” over the network to Hadoop: we want to compress it before it makes the trip, but in a way that facilitates how we intend to process it once it arrives
• Suitcase Pattern Implementation
– In Hadoop, generally speaking, records of several thousand to several hundred thousand bytes are the sizes that matter
– Buffering records during collection allows us to compress a whole block of records as a single record sent over the network to Hadoop, resulting in lower network and file I/O
– Buffering records during collection also helps us handle bursts
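A sketch of the suitcase pattern: buffer individual records during collection, then compress the whole batch as one block before it travels. zlib stands in for a Hadoop codec here, and the batch size and record format are illustrative.

```python
import zlib

buffered, BATCH = [], 1000  # buffer records, ship as one compressed block

def collect(record):
    """Buffer one record; when the batch is full, pack the suitcase."""
    buffered.append(record)
    if len(buffered) >= BATCH:
        return flush()

def flush():
    """Compress the whole batch as a single block for the trip."""
    block = zlib.compress(b"\n".join(buffered))  # one network/HDFS record
    del buffered[:]
    return block

shipped = None
for i in range(1000):
    shipped = collect(b"event,%d,2013-06-18T11:47:10" % i) or shipped

# On arrival, "unpack the suitcase" back into individual records.
records = zlib.decompress(shipped).split(b"\n")
print(len(records))  # 1000
```

One compressed block per thousand records means far fewer (and smaller) writes than shipping each record individually.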
Time Series: The Key to MapReduce
• Event data has a natural temporal ordering
– Observations close together in time are more closely related than observations further apart
– Time-series analysis of events often makes use of the one-way ordering of time
• Batching by time is a composite pattern
– Batches of records from a single event source (compressed and written as a single physical record in HDFS) are organized by time
– Physical records in HDFS are organized into files by time
– Metadata can be associated with both to support queries with time-range predicates
– A sequence of files can be indexed by highest timestamp inside HCatalog, so MapReduce can avoid opening files outside the range
– A sequence of physical records in a file can be partitioned by highest timestamp (record-level metadata inside a SequenceFile), so Mappers can avoid decompressing batches outside the range
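The highest-timestamp index can be sketched as: keep a (max_timestamp, compressed batch) pair per batch, and skip decompressing any batch whose maximum falls before the queried range. This is a conceptual sketch, not the SequenceFile metadata mechanism itself; hour-granularity timestamps are an illustrative simplification.

```python
import zlib

# Each batch: (highest timestamp in batch, compressed block of records).
batches = []
for hour in range(3):
    recs = [b"%02d:%02d event" % (hour, m) for m in range(60)]
    batches.append((hour, zlib.compress(b"\n".join(recs))))

def scan(batches, since):
    """Return records at or after `since`, pruning batches via metadata."""
    opened, hits = 0, []
    for max_ts, block in batches:
        if max_ts < since:
            continue  # pruned using the timestamp index: never decompressed
        opened += 1
        hits.extend(zlib.decompress(block).split(b"\n"))
    return opened, hits

opened, hits = scan(batches, since=2)
print(opened, len(hits))  # 1 60
```

Only one of the three batches is decompressed; the other two are eliminated by comparing the query range against batch-level metadata.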
Different Data Format Containers
•  Sequence File: a persistent data structure for binary key-value pairs. Row-oriented: the fields of each row are stored together as the contents of a single sequence-file record.
– Splittable; compressible at block and record level
– Works well as a container for small files: HDFS and MapReduce are optimized for large files, so packing small files into a sequence file makes storing and processing them more efficient
•  Avro File: similar to sequence files (splittable, compressible, row-oriented) but with support for schema evolution and bindings in multiple languages; the schema is stored in the file itself.
– Splittable; compressible at block and record level
– Ideally suited for data sets with constantly changing attributes/schema
•  RC File: similar to sequence and Avro files, but column-oriented.
– Provides faster access to a subset of columns without a full scan across all columns
•  Optimized RC File (ORC): an optimized RC file format supporting SQL-like types, with more efficient serialization/deserialization.
– Provides faster access in next-generation MapReduce (see HIVE-3874)
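The row- versus column-oriented distinction can be illustrated with plain Python lists. This is a conceptual sketch of the access pattern, not the actual RC/ORC on-disk layout, and the sample records are made up.

```python
# Row-oriented (sequence/Avro style): each record stored whole, in order.
rows = [("u1", "Chicago", 10), ("u2", "Austin", 20), ("u3", "Boise", 30)]

# Column-oriented (RC/ORC style): each field stored contiguously, so a
# query touching one field reads only that field's data.
columns = {name: [r[i] for r in rows]
           for i, name in enumerate(("user", "city", "score"))}

# Projecting one column from columnar storage reads just that list...
print(columns["score"])          # [10, 20, 30]
# ...while the row layout forces touching every full record:
print([r[2] for r in rows])      # [10, 20, 30]
```

Both layouts answer the same query; the columnar one simply avoids reading the `user` and `city` bytes, which is where RC/ORC gets its speedup.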
Best Practices for Processing Patterns
•  Long-term data retention (Raw → Gold; Avro or Sequence; gzip/bzip2): conversion of all raw data into sequence/Avro files with block compression, a usable but compressed data format. This can also involve aggregating smaller ingested files into large sequence or Avro files.
•  Staging for data exploration (Raw → Access; RC or ORC; LZO): conversion of a subset of raw input into normalized, access-optimized structures such as RC files.
•  Data cleansing (Raw → Work; text/raw format; no compression): common ETL cleansing operations (e.g. discarding bad data, scrubbing, sanitizing).
•  Data enrichment (Raw → Work; Sequence; LZO or none): aggregations or calculation of fields based on analysis of data within Hadoop, or on information pulled from other sources ingested into Hadoop.
The Question You Are Dying to Ask
What tooling do I have to orchestrate these ETL flows?
Falcon: One-stop Shop for Data Lifecycle
Apache Falcon addresses the data management needs and orchestrates the underlying tools:
•  Provides (data management needs): multi-cluster management, replication, scheduling, data reprocessing, dependency management
•  Orchestrates (tools): Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs
Falcon provides a single interface to orchestrate the data lifecycle. Sophisticated DLM is easily added to Hadoop applications.
Falcon Usage at a Glance
>  Falcon provides the key services data processing applications need.
>  Complex data processing logic is handled by Falcon instead of hard-coded in apps.
>  Faster development and higher quality for ETL, reporting and other data processing apps on Hadoop.
The Falcon Data Lifecycle Management Service sits beneath data management products (Herd, Continiuum) and data processing applications or customer management software, which drive it via spec files or REST APIs. It covers: data import and replication; scheduling and coordination; data lifecycle policies; multi-cluster management; SLA management.
Falcon Example: Multi-Cluster Failover
>  Falcon manages workflow, replication, or both.
>  Enables business continuity without requiring full data reprocessing.
>  Failover clusters require less storage and CPU.
On the primary Hadoop cluster, data moves through staged, cleansed, conformed, and presented stages. Replication carries the staged and presented data to the failover Hadoop cluster, where BI and analytics continue against the presented data.
Example – Data Lifecycle Management
• User creates entities using the DSL
– A cluster for primary, a cluster for secondary (BCP)
– A data set
– Submits them to Falcon via the RESTful API
• Falcon orchestrates these into scheduled workflows
– Maintains the dependencies and relationships between entities
– Instruments workflows for dependencies, retry logic, table/partition registration, notifications, etc.
– Creates scheduled recurring workflows for:
– Copying data from source to target(s)
– Purging expired data on source and target(s)
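The retention side of this, expressed in the feed definition as `<retention limit="days(2)" action="delete"/>`, amounts to comparing each date-named partition against a cutoff and purging the expired ones. The helper below is an illustrative sketch of that comparison, not Falcon's implementation; partition names and the function are made up.

```python
from datetime import date, timedelta

def expired(partitions, today, retention_days):
    """Return date-named partitions strictly older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return [p for p in partitions if date.fromisoformat(p) < cutoff]

parts = ["2013-06-15", "2013-06-16", "2013-06-17", "2013-06-18"]
print(expired(parts, today=date(2013, 6, 18), retention_days=2))
# ['2013-06-15']
```

A scheduled purge workflow would delete the returned partitions on both source and target clusters, which is what the retention element automates.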
<cluster colo="colo-1" description="test cluster" name="cluster-primary"
         xmlns="uri:ivory:cluster:0.1">
  <interfaces>
    <interface type="readonly" endpoint="hftp://localhost:50070" version="1.1.1"/>
    <interface type="write" endpoint="hdfs://localhost:54310" version="1.1.1"/>
    <interface type="execute" endpoint="localhost:54311" version="1.1.1"/>
    <interface type="workflow" endpoint="http://localhost:11000/oozie/" version="3.3.0"/>
    <interface type="messaging" endpoint="tcp://localhost:61616?daemon=true" version="5.1.6"/>
  </interfaces>
</cluster>
<feed description="TestHourlySummary" name="TestHourlySummary" xmlns="uri:ivory:feed:0.1">
  <partitions/>
  <groups>bi</groups>
  <frequency>hours(1)</frequency>
  <late-arrival cut-off="hours(4)"/>
  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-BCP" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/projects/test/TestHourlySummary/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    <location type="stats" path="/none"/>
    <location type="meta" path="/none"/>
  </locations>
  <ACL owner="venkatesh" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
Thanks/Questions…
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 

Dernier (20)

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 

A Reference Architecture for ETL 2.0

  • 1. © Hortonworks Inc. 2013 ETL 2.0 Reference Architecture Page 1 George Vetticaden - Hortonworks: Solutions Engineer George Trujillo - Hortonworks: Master Principal Big Data Specialist
  • 2. © Hortonworks Inc. 2013 George Vetticaden •  Solutions Engineer – Big Data at Hortonworks •  Chief Architect and Co-Founder of eScreeningz §  Enterprise Architect vFabric Cloud App Platform – VMware §  Specialties: §  Big Data and Cloud Computing §  Hadoop §  Cloud Application Platforms (PaaS) – Cloud Foundry, Heroku §  Infrastructure as a Service Platforms – vCloud Director, AWS §  Virtualization – vSphere, vCenter §  J2EE §  Hibernate, Spring §  ESB and Middleware Integration §  SOA Architecture
  • 3. © Hortonworks Inc. 2013 George Trujillo •  Master Principal Big Data Specialist - Hortonworks •  Tier One Big Data, Oracle and BCA Specialist - VMware •  20+ years Oracle DBA: DW, BI, RAC, Streams, Data Guard, Perf, B/R §  Oracle Double ACE §  Sun Microsystems' Ambassador for Application Middleware §  Oracle Fusion Council & Oracle Beta Leadership Council §  Two terms Independent Oracle Users Group Board of Directors §  Recognized as one of the “Oracles of Oracle” by IOUG §  MySQL Certified DBA §  VMware Certified Instructor (VCI) Sun Ambassador
  • 4. © Hortonworks Inc. 2013 Challenges with a Traditional ETL Platform Page 4 -Incapable of, or highly complex when, dealing with loosely structured data -Data discarded due to cost and/or performance -A lot of time spent understanding source and defining destination data structures -High latency between data generation and availability -No visibility into transactional data -Doesn’t scale linearly -License costs are high -EDW used as an ETL tool with 100s of transient staging tables
  • 5. © Hortonworks Inc. 2013 Hadoop Based ETL Platform Page 5 -Support for any type of data: structured/unstructured -Linearly scalable on commodity hardware -Massively parallel storage and compute -Store raw transactional data -Store 7+ years of data with no archiving -Data Lineage: Store intermediate stages of data -Becomes a powerful analytics platform -Provides data for use with minimum delay and latency -Enables real-time capture of source data -Data warehouse can focus less on storage & transformation and more on analytics
  • 6. © Hortonworks Inc. 2013 Key Capability in Hadoop: Late binding Page 6 [Diagram: with traditional ETL, web logs/click streams, machine-generated data, and OLTP feed an ETL server that stores transformed data in a data mart/EDW for client apps; with Hortonworks HDP, the same sources land in the Hadoop core (data services, operational services) and transformations are applied dynamically.] With traditional ETL, structure must be agreed upon far in advance and is difficult to change. With Hadoop, capture all data and structure it as business needs evolve.
  • 7. © Hortonworks Inc. 2013 Organize Tiers and Process with Metadata Page 7 Raw Tier Extract & Load WebHDFS Flume Sqoop Work Tier Standardize, Cleanse, Transform MapReduce Pig Gold/Storage Tier Transform, Integrate, Storage MapReduce Pig Access Tier Conform, Summarize, Access HiveQL Pig HCat provides unified metadata access to Pig, Hive & MapReduce •  Organize data based on source/derived relationships •  Allows for fault and rebuild process
  • 8. © Hortonworks Inc. 2013 ETL Reference Architecture Page 8 Model/Apply Metadata Extract & Load Publish Exchange Explore Visualize Report Analyze Publish Event Signal Data Transformation Transform & Aggregate
  • 9. © Hortonworks Inc. 2012: DO NOT SHARE. CONTAINS HORTONWORKS CONFIDENTIAL & PROPRIETARY INFORMATION ETL Reference Architecture Page 9 Organize/Model Create Metadata Extract & Load Publish Exchange Explore Visualize Report Analyze Publish Event Signal Data Transformation Transform & Aggregate
  • 11. © Hortonworks Inc. 2013 HCatalog Table access Aligned metadata REST API •  Raw Hadoop data •  Inconsistent, unknown •  Tool-specific access Apache HCatalog provides flexible metadata services across tools and external access Metadata Services with HCatalog •  Consistency of metadata and data models across tools (MapReduce, Pig, HBase, and Hive) •  Accessibility: share data as tables in and out of HDFS •  Availability: enables flexible, thin-client access via REST API Shared table and schema management opens the platform Page 11
  • 12. © Hortonworks Inc. 2013 • Best Practice: Use HCatalog to manage metadata – Schema/structure when needed via tables and partitions – Late binding at work: Multiple/changing bindings supported – Abstract Location of data, scale and maintain over time easily – Abstract format of data file (e.g.: compression type, HL7 v2, HL7 v3) • Cope with change of source data seamlessly – Heterogeneous schemas across partitions within HCatalog as source system evolves, consumers of data unaffected – E.g.: Partition ‘2012-01-01’ of Table X has schema with 30 fields and HL7 v2 format. Partition ‘2013-01-01’ has 35 fields with HL7 v3 format • RESTful API via WebHCat Page 12 Step 2 – HCatalog, Metadata
  • 13. © Hortonworks Inc. 2013 Sample Tweet data as JSON { "user":{ "name":"George Vetticaden - Name", "id":10000000, "userlocation":"Chicago", "screenname":"gvetticadenScreenName", "geoenabled":false }, "tweetmessage":"hello world", "createddate":"2013-06-18T11:47:10", "geolocation":{ "latitude":1000.0, "longitude":10000.0 } }
  • 14. © Hortonworks Inc. 2013 Hive/HCat Schema for the Twitter Data create external table tweet ( user struct < userlocation:string, id:bigint, name:string, screenname:string, geoenabled:string >, geoLocation struct < latitude:float, longitude:float >, tweetmessage string, createddate string ) ROW FORMAT SERDE 'org.apache.hcatalog.data.JsonSerDe' location "/user/kibana/twitter/landing"
  • 15. © Hortonworks Inc. 2013 Pig Example Page 15 Count how many times users tweeted a URL: raw = load '/user/kibana/twitter/landing' as (user, url, tweetmessage); botless = filter raw by myudfs.NotABot(user); grpd = group botless by (url, user); cntd = foreach grpd generate flatten(group), COUNT(botless); store cntd into '/data/counted/20120530'; Using HCatalog: raw = load 'tweet' using HCatLoader(); botless = filter raw by myudfs.NotABot(user) and ds == '20120530'; grpd = group botless by (url, user); cntd = foreach grpd generate flatten(group), COUNT(botless); store cntd into 'counted' using HCatStorer(); No need to know file location No need to declare schema Partition filter
  • 16. © Hortonworks Inc. 2012 ETL Reference Architecture Page 16 Organize/Model Create Metadata Extract & Load Publish Exchange Explore Visualize Report Analyze Publish Event Signal Data Transformation Transform & Aggregate
  • 17. © Hortonworks Inc. 2013 Step 3&4 – Transform, Aggregate, Explore • MapReduce – For programmers – When control matters • Hive – HiveQL (SQL-like) to ad-hoc query and explore data • Pig – Pig for declarative data crunching and preprocessing (the T in ELT) – User Defined Functions (UDF) for extensibility and portability. Ex: Custom UDF for calling industry-specific data format parsers (SWIFT, X12, NACHA, HL7, HIPAA, etc.) • HCatalog – Consistent metadata, consistent data sharing across all tools Page 17
  • 19. © Hortonworks Inc. 2013 Common ETL Processing Patterns • Long-term data retention • Staging for Data Exploration • Data Cleansing • Data Enrichment Page 19
  • 20. © Hortonworks Inc. 2013 Important Dimensions to Consider • Compression • Buffering • Data Format Containers • Logical Processing Tiers (Raw, Work, Gold, Access) Page 20
  • 21. © Hortonworks Inc. 2013 Compression in Hadoop is Important •  Biggest performance bottleneck in Hadoop: read/write I/O •  Compression formats supported in HDP include gzip, bzip2, LZO, LZ4 and Snappy •  The type of compression to use depends on a number of factors, such as: – Size of the data – Is faster compression/decompression or compression effectiveness more important (space/time trade-off)? Faster compression/decompression speeds usually come at the expense of smaller space savings. – Do compressed files need to be split-able for parallel MapReduce processing of a large file? Page 21
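The space/time trade-off the slide calls out can be measured directly outside Hadoop. A minimal sketch using Python's standard-library gzip and bzip2 codecs; the log payload below is invented for illustration, not from the slides:

```python
import bz2
import gzip
import time

# Hypothetical payload standing in for a block of buffered log records;
# repetitive text compresses well, like real machine-generated data.
payload = b"2013-06-18T11:47:10 INFO user=gvetticaden action=tweet\n" * 5000

results = {}
for name, codec in (("gzip", gzip), ("bzip2", bz2)):
    start = time.perf_counter()
    compressed = codec.compress(payload)
    elapsed = time.perf_counter() - start
    results[name] = (len(compressed), elapsed)
    # Round-trip to confirm no data is lost.
    assert codec.decompress(compressed) == payload

for name, (size, elapsed) in results.items():
    print(f"{name}: {len(payload)} -> {size} bytes in {elapsed:.4f}s")
```

On repetitive machine-generated data, bzip2 typically shrinks the payload further than gzip but takes noticeably longer; that is exactly the trade-off to weigh, alongside whether the resulting files must also be split-able.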
  • 22. © Hortonworks Inc. 2013 Suitcase Pattern: Buffering and Compression • Suitcase Pattern – Before we travel, we take our clothes off the rack and pack them (easier to store) – We then unpack them when we arrive and put them back on the rack (easier to process) – Consider event data “traveling” over the network to Hadoop – we want to compress it before it makes the trip, but in a way that facilitates how we intend to process it once it arrives • Suitcase Pattern Implementation – In Hadoop, generally speaking, a record of several thousand bytes to several hundred thousand bytes is deemed important – Buffering records during collection allows us to compress the whole block of records as a single record to be sent over the network to Hadoop – resulting in lower network and file I/O – Buffering records during collection also helps us handle bursts
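The suitcase idea above can be sketched in a few lines. This is an illustrative toy only: the RecordBuffer class, its batch size, and the newline-delimited JSON encoding are assumptions made here, not part of the slides:

```python
import gzip
import json

class RecordBuffer:
    """Buffer individual event records, then compress the whole batch
    as one payload before it 'travels' to Hadoop (suitcase pattern)."""

    def __init__(self, max_records=1000):
        self.max_records = max_records
        self.records = []

    def add(self, record):
        """Add a record; return a packed payload once the buffer fills."""
        self.records.append(record)
        if len(self.records) >= self.max_records:
            return self.flush()
        return None

    def flush(self):
        if not self.records:
            return None
        # Newline-delimited JSON compressed as a single block: one network
        # send and one physical record instead of thousands of tiny ones.
        blob = "\n".join(json.dumps(r) for r in self.records).encode("utf-8")
        self.records = []
        return gzip.compress(blob)

def unpack(payload):
    """Unpack the suitcase on arrival: one compressed blob -> records."""
    text = gzip.decompress(payload).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]
```

Buffering before compression also gives the burst-handling benefit the slide mentions: producers keep appending cheaply while the expensive compress-and-send work happens once per batch.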
  • 23. © Hortonworks Inc. 2013 Time Series: The Key to MapReduce • Event data has a natural temporal ordering – Observations close together in time will be more closely related than observations further apart – Time series analysis of events often makes use of the one-way ordering of time • Batching by time is a composite pattern – Batches of records from a single event source (compressed and written as a single physical record in HDFS) are organized by time – Physical records in HDFS are organized into files by time – Metadata can be associated with both to support queries with time-range predicates – A sequence of files can be indexed based on the highest timestamp inside of HCatalog so MapReduce can avoid opening the file – A sequence of physical records in a file can be partitioned based on the highest timestamp (record-level metadata inside a SequenceFile) so Mappers can avoid de-compressing the batch
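A toy sketch of this batching-plus-index idea; the hourly bucket size and (timestamp, value) record shape are invented here for illustration, but the pruning logic mirrors the highest-timestamp metadata described above:

```python
from collections import defaultdict

def batch_by_hour(events):
    """Group (timestamp, value) events into hourly batches and record
    each batch's highest timestamp as metadata, so a time-range query
    can skip whole batches without opening them."""
    batches = defaultdict(list)
    for ts, value in sorted(events):  # natural temporal ordering
        batches[ts - ts % 3600].append((ts, value))
    # Batch-level metadata: highest timestamp inside each batch.
    index = {bucket: max(t for t, _ in recs) for bucket, recs in batches.items()}
    return dict(batches), index

def candidate_batches(index, start, end):
    """Prune with the index: only batches whose time span can overlap
    [start, end] need to be opened or decompressed at all."""
    return [b for b, max_ts in index.items() if max_ts >= start and b <= end]
```

In the real pattern the index lives in HCatalog (per file) or in record-level SequenceFile metadata (per batch); the lookup shown here is the same check, just in memory.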
  • 24. © Hortonworks Inc. 2013 Different Data Format Containers Page 24 Data Format Description Key Advantages Sequence File Persistent data structure for binary key-value pairs. Row-oriented: fields in each row are stored together as the contents of a single sequence-file record •  Split-able •  Compress-able at block and row level •  Works well as a container for small files. HDFS and MapReduce are optimized for large files, so packing files into a Sequence file makes storing and processing the smaller files more efficient Avro File Similar to sequence files (split-able, compressible, row-oriented) except they support schema evolution and bindings in multiple languages. Schema stored in the file itself •  Split-able •  Compress-able at block and row level •  Ideally suited for unstructured data sets with constantly changing attributes/schema RC File Similar to sequence and Avro files but column-oriented •  Provides faster access to a subset of columns without doing a full table scan across all columns Optimized RC File Optimized RC (ORC) file format supporting SQL-like types with more efficient serialization/deserialization •  Provides faster access in Next Generation MR •  HIVE-3874
  • 25. © Hortonworks Inc. 2013 Best Practices for Processing Patterns Page 25
  • Long-term data retention
    – Tier path: Raw → Gold; data format: Avro, Sequence; compression: gzip/bzip2
    – Conversion of all raw data into sequence/Avro files with block compression, a usable but compressed data format. This can also involve the aggregation of smaller files from ingestion into large sequence or Avro files.
  • Staging for data exploration
    – Tier path: Raw → Access; data format: RC, ORC; compression: LZO
    – Conversion of a subset of raw input normalized tables into an access-optimized data structure such as RC file.
  • Data cleansing
    – Tier path: Raw → Work; data format: Txt (raw format); compression: none
    – Common ETL cleansing operations (e.g., discarding bad data, scrubbing, sanitizing).
  • Data enrichment
    – Tier path: Raw → Work; data format: Sequence; compression: LZO or none
    – Aggregations or calculation of fields based on analysis of data within Hadoop, or on information pulled from other sources ingested into Hadoop.
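The data cleansing pattern above can be sketched as a simple record filter. The three-field line format is invented for the example; in practice this logic would run as a MapReduce, Pig, or Hive job moving data from the Raw tier to the Work tier.

```python
# Hedged sketch of the Raw -> Work cleansing step: malformed rows are
# quarantined rather than silently dropped, and surviving rows are scrubbed
# (here: lowercased user, size parsed to an int).

def cleanse(raw_lines):
    good, bad = [], []
    for line in raw_lines:
        parts = line.strip().split(",")
        # Discard bad data: wrong field count or a non-numeric size field.
        if len(parts) != 3 or not parts[2].isdigit():
            bad.append(line)  # quarantine for later inspection
            continue
        user, url, size = parts
        good.append((user.lower(), url, int(size)))
    return good, bad

raw = ["Alice,/a,120", "broken record", "Bob,/b,notanumber", "Carol,/a,45"]
work, quarantine = cleanse(raw)
print(len(work), len(quarantine))  # 2 2
```

Keeping the quarantined records around (rather than deleting them) fits the "store raw transactional data" theme of the platform: the raw tier stays intact while the work tier holds only clean rows.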
  • 26. © Hortonworks Inc. 2013 The Question You Are Dying to Ask: What tooling do I have to orchestrate these ETL flows? Page 26
  • 27. © Hortonworks Inc. 2013 Falcon: One-stop Shop for Data Lifecycle
  • Apache Falcon provides (data management needs):
    – Multi-cluster management
    – Replication
    – Scheduling
    – Data reprocessing
    – Dependency management
  • Apache Falcon orchestrates (tools):
    – Oozie
    – Sqoop
    – DistCp
    – Flume
    – MapReduce
    – Hive and Pig jobs
  • Falcon provides a single interface to orchestrate the data lifecycle. Sophisticated DLM is easily added to Hadoop applications.
  • 28. © Hortonworks Inc. 2013 Falcon Usage At A Glance
  >  Falcon provides the key services data processing applications need.
  >  Complex data processing logic is handled by Falcon instead of hard-coded in apps.
  >  Faster development and higher quality for ETL, reporting and other data processing apps on Hadoop.
  • Diagram: Hortonworks data management products (Herd, Continiuum), data processing applications, and customer management software drive the Falcon Data Lifecycle Management Service via spec files or REST APIs; Falcon in turn provides data import and replication, scheduling and coordination, data lifecycle policies, multi-cluster management, and SLA management.
  • 29. © Hortonworks Inc. 2013 Falcon Example: Multi-Cluster Failover
  >  Falcon manages workflow, replication, or both.
  >  Enables business continuity without requiring full data reprocessing.
  >  Failover clusters require less storage and CPU.
  • Diagram: the primary Hadoop cluster holds staged, cleansed, conformed, and presented data feeding BI and analytics; replication copies the staged and presented data sets to a failover Hadoop cluster.
  • 30. © Hortonworks Inc. 2013 Example – Data Lifecycle Management
  • User creates entities using DSL
    – Cluster for Primary, Cluster for Secondary (BCP)
    – Data Set
    – Submits to Falcon (RESTful API)
  • Falcon orchestrates these into scheduled workflows
    – Maintains the dependencies and relationships between entities
    – Instruments workflows for dependencies, retry logic, table/partition registration, notifications, etc.
    – Creates a scheduled recurring workflow for:
      – Copying data from source to target(s)
      – Purging expired data on source and target(s)

  <cluster colo="colo-1" description="test cluster" name="cluster-primary" xmlns="uri:ivory:cluster:0.1">
    <interfaces>
      <interface type="readonly" endpoint="hftp://localhost:50070" version="1.1.1"/>
      <interface type="write" endpoint="hdfs://localhost:54310" version="1.1.1"/>
      <interface type="execute" endpoint="localhost:54311" version="1.1.1"/>
      <interface type="workflow" endpoint="http://localhost:11000/oozie/" version="3.3.0"/>
      <interface type="messaging" endpoint="tcp://localhost:61616?daemon=true" version="5.1.6"/>
    </interfaces>
  </cluster>

  <feed description="TestHourlySummary" name="TestHourlySummary" xmlns="uri:ivory:feed:0.1">
    <partitions/>
    <groups>bi</groups>
    <frequency>hours(1)</frequency>
    <late-arrival cut-off="hours(4)"/>
    <clusters>
      <cluster name="cluster-primary" type="source">
        <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
        <retention limit="days(2)" action="delete"/>
      </cluster>
      <cluster name="cluster-BCP" type="target">
        <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
        <retention limit="days(2)" action="delete"/>
      </cluster>
    </clusters>
    <locations>
      <location type="data" path="/projects/test/TestHourlySummary/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
      <location type="stats" path="/none"/>
      <location type="meta" path="/none"/>
    </locations>
    <ACL owner="venkatesh" group="users" permission="0755"/>
    <schema location="/none" provider="none"/>
  </feed>
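Before submitting a feed spec like the one on this slide over Falcon's REST API, it is worth sanity-checking it locally. The sketch below parses a trimmed copy of the TestHourlySummary feed and confirms each cluster carries a validity window and a retention policy; the check itself is our own illustration, not a Falcon-supplied tool.

```python
# Local sanity check for a Falcon feed entity: every <cluster> must declare
# <validity> and <retention> before the spec is submitted. Uses only the
# standard library. Note that child elements inherit the feed's XML
# namespace, so tags must be namespace-qualified when searching.
import xml.etree.ElementTree as ET

FEED = """
<feed description="TestHourlySummary" name="TestHourlySummary" xmlns="uri:ivory:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <cluster name="cluster-primary" type="source">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="cluster-BCP" type="target">
      <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
  </clusters>
</feed>
"""

NS = "{uri:ivory:feed:0.1}"
root = ET.fromstring(FEED)
clusters = list(root.iter(NS + "cluster"))
for cluster in clusters:
    assert cluster.find(NS + "validity") is not None, "missing validity"
    assert cluster.find(NS + "retention") is not None, "missing retention"
    print(cluster.get("name"), cluster.get("type"),
          cluster.find(NS + "retention").get("limit"))
```

A check like this catches a missing retention block (which would otherwise leave expired data unpurged on that cluster) before the entity ever reaches the Falcon server.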
  • 31. © Hortonworks Inc. 2013 Thanks/Questions… Page 31