SlideShare a Scribd company logo
1 of 11
Download to read offline
ETL Practices
for Better or Worse
opportunities and challenges
along with the rise of machines
Agenda
1. Bring the data together and then transform
2. Data modeling for pragmatism
3. Let the datasets talk to each other
4. Beef up sandbox and backfill
5. Performance tuning
6. Examples, questions and comments
ELT and RIT
● ELT =
○ Extract from Source
○ Load to HDFS/MPP
○ Transform and Integrate inside Hadoop/MPP
● RIT =
○ Replicate message/log to Queue (JMS/Kafka/CDC)
○ Stream from Queue and Ingest to HDFS/MPP
○ Transform and Integrate inside Hadoop/MPP
Why ELT and RIT (instead of ETL)
● Store related raw data together for better
leverage
● Big queryable staging area is quite useful
● Reduce workload impact against source
systems
● Write data cleansing and business logic in
similar languages/scripting used by BI
Is Data Modeling still Important
● Do we still need to model the data in the era
of NoSQL and Big Data?
● Shall we de-normalize/pre-join everything?
● Shall we use hierarchical JSON/XML and/or
Key-Value pair for everything?
● Balance and trade-off: analytics, reusability,
metadata-driven, size vs. easy-to-query
Is Data Modeling still Important
● Cluster all attributes and children objects to
the tree structure (thinking in NoSQL way)
● Can we live without JOIN operator?
● Is mutable/updatable dataset still useful?
● Why not snapshot everything? Why SCD2?
● Is relational-model outdated?
● Model it in source or fix it in report?
● All or none: Index, Hash, and Full Scan?
Integration Brings the True Value
● Like the idea of SOA, be careful with DQ
● Data producers are loosely-coupled for the
sake of scalability
● Integration and cross-reference is deferred
to DW/BI layer, yet someone has to do it
● How can unique identifier help here?
● Replicate dim/ref/lkp and Federate tx/event
Profiling Prototype Deploy Backfill
● Profiling data to understand data
● Prototype with real data
● Enrich and harden the derived data in
sandbox before deploying to production
● Be ready to backfill the data because it will
happen (easier to produce or to consume?)
Self-service is Cool, but
● Strong automation tools must be built first
● Software can monitor and throttle
● If a user’s job gets killed, there needs to be
enough info/clue to explain why and how to
(try to) fix it
● Education and knowledge sharing is
essential, is wiki page/runbook good enough
Performance Matters
● Good instrumentation/logging will pay off big
time in performance tuning
● Can we run complex OLAP reports on top of
operational metadata?
Where needs the tuning the most?
● Swapping (spill to disk) can be a big issue
● Detect and kill the bad jobs early
Examples
1. data exploration in MPP/Hadoop instead of
in source systems (don’t let brain wait)
2. web click stream backend transaction
3. replicate/synchronize rollup hierarchy
(mapping lookup) to multiple data systems;
then produce near-real-time aggregation in
each system; finally federate the aggregates

More Related Content

What's hot

SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
DataWorks Summit
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
Michael Stack
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Spark Summit
 

What's hot (20)

SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
SDM (Standardized Data Management) - A Dynamic Adaptive Ingestion Frameworks ...
 
In Search of Database Nirvana: Challenges of Delivering HTAP
In Search of Database Nirvana: Challenges of Delivering HTAPIn Search of Database Nirvana: Challenges of Delivering HTAP
In Search of Database Nirvana: Challenges of Delivering HTAP
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine Overview
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
 
Hadoop data ingestion
Hadoop data ingestionHadoop data ingestion
Hadoop data ingestion
 
Powering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta LakePowering Interactive BI Analytics with Presto and Delta Lake
Powering Interactive BI Analytics with Presto and Delta Lake
 
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
 
Real-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache PinotReal-time Analytics with Trino and Apache Pinot
Real-time Analytics with Trino and Apache Pinot
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on Spark
 
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3GoHigh Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At Scale
 
Architecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationArchitecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & Manipulation
 
The hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaThe hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at Helixa
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 

Viewers also liked

How To Buy Data Warehouse
How To Buy Data WarehouseHow To Buy Data Warehouse
How To Buy Data Warehouse
Eric Sun
 

Viewers also liked (9)

How To Buy Data Warehouse
How To Buy Data WarehouseHow To Buy Data Warehouse
How To Buy Data Warehouse
 
Bigger Faster Easier: LinkedIn Hadoop Summit 2015
Bigger Faster Easier: LinkedIn Hadoop Summit 2015Bigger Faster Easier: LinkedIn Hadoop Summit 2015
Bigger Faster Easier: LinkedIn Hadoop Summit 2015
 
Airflow at WePay
Airflow at WePayAirflow at WePay
Airflow at WePay
 
Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop
Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop
Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop
 
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemStrata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
 
Insights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured StreamingInsights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured Streaming
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Hbase hive pig
Hbase hive pigHbase hive pig
Hbase hive pig
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
 

Similar to ETL Practices for Better or Worse

ETL with WSO2 Enterprise Middleware Platform
ETL with WSO2 Enterprise Middleware Platform ETL with WSO2 Enterprise Middleware Platform
ETL with WSO2 Enterprise Middleware Platform
WSO2
 
Ajith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETLAjith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETL
Ajith Kumar Pampatti
 
Chandan's_Resume
Chandan's_ResumeChandan's_Resume
Chandan's_Resume
Chandan Das
 
Resume_Informatica&IDQ_4+years_of_exp
Resume_Informatica&IDQ_4+years_of_expResume_Informatica&IDQ_4+years_of_exp
Resume_Informatica&IDQ_4+years_of_exp
rajarao marisa
 
Rapid Data Analytics @ Netflix
Rapid Data Analytics @ NetflixRapid Data Analytics @ Netflix
Rapid Data Analytics @ Netflix
Data Con LA
 
Resume_RaghavMahajan_ETL_Developer
Resume_RaghavMahajan_ETL_DeveloperResume_RaghavMahajan_ETL_Developer
Resume_RaghavMahajan_ETL_Developer
Raghav Mahajan
 
Resume_kallesh_latest
Resume_kallesh_latestResume_kallesh_latest
Resume_kallesh_latest
Kallesha CB
 

Similar to ETL Practices for Better or Worse (20)

ETL with WSO2 Enterprise Middleware Platform
ETL with WSO2 Enterprise Middleware Platform ETL with WSO2 Enterprise Middleware Platform
ETL with WSO2 Enterprise Middleware Platform
 
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013Transforming Data Architecture Complexity at Sears - StampedeCon 2013
Transforming Data Architecture Complexity at Sears - StampedeCon 2013
 
Resume_Sita_Ramadas_akkineni
Resume_Sita_Ramadas_akkineniResume_Sita_Ramadas_akkineni
Resume_Sita_Ramadas_akkineni
 
Ajith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETLAjith_kumar_4.3 Years_Informatica_ETL
Ajith_kumar_4.3 Years_Informatica_ETL
 
Chandan's_Resume
Chandan's_ResumeChandan's_Resume
Chandan's_Resume
 
Big data analytics beyond beer and diapers
Big data analytics   beyond beer and diapersBig data analytics   beyond beer and diapers
Big data analytics beyond beer and diapers
 
Resume_Informatica&IDQ_4+years_of_exp
Resume_Informatica&IDQ_4+years_of_expResume_Informatica&IDQ_4+years_of_exp
Resume_Informatica&IDQ_4+years_of_exp
 
Dwh faqs
Dwh faqsDwh faqs
Dwh faqs
 
Rapid Data Analytics @ Netflix
Rapid Data Analytics @ NetflixRapid Data Analytics @ Netflix
Rapid Data Analytics @ Netflix
 
Rapid Data Analytics @ Netflix
Rapid Data Analytics @ NetflixRapid Data Analytics @ Netflix
Rapid Data Analytics @ Netflix
 
Hadoop vs Java Batch Processing JSR 352
Hadoop vs Java Batch Processing JSR 352Hadoop vs Java Batch Processing JSR 352
Hadoop vs Java Batch Processing JSR 352
 
Resume_gmail
Resume_gmailResume_gmail
Resume_gmail
 
Experimentation Platform on Hadoop
Experimentation Platform on HadoopExperimentation Platform on Hadoop
Experimentation Platform on Hadoop
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
 
Resume_RaghavMahajan_ETL_Developer
Resume_RaghavMahajan_ETL_DeveloperResume_RaghavMahajan_ETL_Developer
Resume_RaghavMahajan_ETL_Developer
 
#1 Calicut MuleSoft Meetup - Introduction to Enterprise Integration and MuleSoft
#1 Calicut MuleSoft Meetup - Introduction to Enterprise Integration and MuleSoft#1 Calicut MuleSoft Meetup - Introduction to Enterprise Integration and MuleSoft
#1 Calicut MuleSoft Meetup - Introduction to Enterprise Integration and MuleSoft
 
Arun Kondra
Arun KondraArun Kondra
Arun Kondra
 
Resume quaish abuzer
Resume quaish abuzerResume quaish abuzer
Resume quaish abuzer
 
Resume_kallesh_latest
Resume_kallesh_latestResume_kallesh_latest
Resume_kallesh_latest
 
Richa_Profile
Richa_ProfileRicha_Profile
Richa_Profile
 

Recently uploaded

Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 

Recently uploaded (20)

Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 

ETL Practices for Better or Worse

  • 1. ETL Practices for Better or Worse opportunities and challenges along with the rise of machines
  • 2. Agenda 1. Bring the data together and then transform 2. Data modeling for pragmatism 3. Let the datasets talk to each other 4. Beef up sandbox and backfill 5. Performance tuning 6. Examples, questions and comments
  • 3. ELT and RIT ● ELT = ○ Extract from Source ○ Load to HDFS/MPP ○ Transform and Integrate inside Hadoop/MPP ● RIT = ○ Replicate message/log to Queue (JMS/Kafka/CDC) ○ Stream from Queue and Ingest to HDFS/MPP ○ Transform and Integrate inside Hadoop/MPP
  • 4. Why ELT and RIT (instead of ETL) ● Store related raw data together for better leverage ● Big queryable staging area is quite useful ● Reduce workload impact against source systems ● Write data cleansing and business logic in similar languages/scripting used by BI
  • 5. Is Data Modeling still Important ● Do we still need to model the data in the era of NoSQL and Big Data? ● Shall we de-normalize/pre-join everything? ● Shall we use hierarchical JSON/XML and/or Key-Value pair for everything? ● Balance and trade-off: analytics, reusability, metadata-driven, size vs. easy-to-query
  • 6. Is Data Modeling still Important ● Cluster all attributes and children objects to the tree structure (thinking in NoSQL way) ● Can we live without JOIN operator? ● Is mutable/updatable dataset still useful? ● Why not snapshot everything? Why SCD2? ● Is relational-model outdated? ● Model it in source or fix it in report? ● All or none: Index, Hash, and Full Scan?
  • 7. Integration Brings the True Value ● Like the idea of SOA, be careful with DQ ● Data producers are loosely-coupled for the sake of scalability ● Integration and cross-reference is deferred to DW/BI layer, yet someone has to do it ● How can unique identifier help here? ● Replicate dim/ref/lkp and Federate tx/event
  • 8. Profiling Prototype Deploy Backfill ● Profiling data to understand data ● Prototype with real data ● Enrich and harden the derived data in sandbox before deploying to production ● Be ready to backfill the data because it will happen (easier to produce or to consume?)
  • 9. Self-service is Cool, but ● Strong automation tools must be built first ● Software can monitor and throttle ● If a user’s job gets killed, there needs to be enough info/clue to explain why and how to (try to) fix it ● Education and knowledge sharing is essential, is wiki page/runbook good enough
  • 10. Performance Matters ● Good instrumentation/logging will pay off big time in performance tuning ● Can we run complex OLAP reports on top of operational metadata? Where needs the tuning the most? ● Swapping (spill to disk) can be a big issue ● Detect and kill the bad jobs early
  • 11. Examples 1. data exploration in MPP/Hadoop instead of in source systems (don’t let brain wait) 2. web click stream backend transaction 3. replicate/synchronize rollup hierarchy (mapping lookup) to multiple data systems; then produce near-real-time aggregation in each system; finally federate the aggregates