ETL – Everything you need to know


Introduction to ETL, ETL vs. data pipelines, and what it looks like when we process big data. The challenges, complications, and things we should consider when architecting a big data system. Stream processing vs. batch processing, and how we can combine both using the Lambda architecture. Learn more: aka.ms/data-guide, aka.ms/stream-processing, aka.ms/building-blocks, aka.ms/start-with-the-cloud

Adi Polak
Microsoft
ETL – Everything you need to know

2. Adi Polak, Microsoft: ETL – Everything you need to know

3. About me – Adi Polak @adipolak https://medium.com/@adipolak https://dev.to/adipolak

4. Use cases: ad tech, automotive, cyber security … everything with data!

5. Agenda
• ETL
• Data Pipelines
• Data Challenges
• Big Data
• Stream vs Batch processing
• Architecture
• Tools
• Learn More!

6. ETL: Extract, Transform, Load
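The three steps named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's demo: the CSV sample, the `payments` table, and the cleaning rules are all assumptions made up for the example, and an in-memory SQLite database stands in for the target store.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file, API, or database.
RAW_CSV = """user_id,country,amount
1,us,10.50
2,DE,7.25
3,us,not-a-number
"""

def extract(raw_text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: normalize country codes, parse amounts, drop unparseable rows."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        cleaned.append((int(row["user_id"]), row["country"].upper(), amount))
    return cleaned

def load(rows, connection):
    """Load: write the cleaned rows into the target table (assumed name: payments)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS payments (user_id INTEGER, country TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    connection.commit()

connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
loaded = connection.execute(
    "SELECT user_id, country, amount FROM payments ORDER BY user_id"
).fetchall()
```

Keeping each step as its own function makes it possible to test the transform rules without touching the source or the target.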

7. Data Pipeline != ETL

8.

9. ETL

10. ETL

11. Data Pipeline vs. ETL

12. Most complex data pipeline demo

13. Data Challenges – Collection of data

14. Data Challenges – Collection of data
• Unstructured: audio, video, images. Meaningless without adding some structure.
• Semi-structured: JSON, XML, sensor data, social media, device data, web logs. Flexible data model structure.
• Structured: CSV, columnar storage (Parquet, ORC). Strict data model structure.
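To make the difference between a flexible and a strict data model concrete, the sketch below maps semi-structured JSON events onto a fixed set of CSV columns. The device events and the column list are hypothetical, chosen only for illustration.

```python
import csv
import io
import json

# Hypothetical device events; the second one carries extra fields the first lacks.
SEMI_STRUCTURED = """
[{"device": "sensor-1", "reading": {"temp_c": 21.5}},
 {"device": "sensor-2", "reading": {"temp_c": 19.0, "humidity": 40}, "tags": ["lab"]}]
"""

# Assumed fixed schema for the structured output.
COLUMNS = ["device", "temp_c", "humidity"]

def flatten(event):
    """Map one flexible JSON event onto the fixed column set; missing values become None."""
    reading = event.get("reading", {})
    return {
        "device": event.get("device"),
        "temp_c": reading.get("temp_c"),
        "humidity": reading.get("humidity"),
    }

rows = [flatten(event) for event in json.loads(SEMI_STRUCTURED)]

# Write the rows as CSV, a structured format with one strict header.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Note that the `tags` field has no column in the assumed schema and is dropped: a strict model forces a decision about every field, while the JSON source could simply carry it along.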

15. Data Challenges – Duplication of data
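One common way to handle duplicated records is to derive a comparison key and keep the first record per key. The sketch below assumes, purely for illustration, that an email address identifies a customer and that case and surrounding whitespace do not matter; the records are hypothetical.

```python
def normalize_key(record):
    """Build a comparison key; here, a case- and whitespace-insensitive email."""
    return record["email"].strip().lower()

def deduplicate(records):
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = normalize_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Hypothetical records: the same customer arrives from two sources.
records = [
    {"email": "ana@example.com", "source": "crm"},
    {"email": " Ana@Example.com ", "source": "web"},
    {"email": "bo@example.com", "source": "crm"},
]
unique_records = deduplicate(records)
```

Choosing the key is the hard part in practice: too strict and duplicates slip through, too loose and distinct records are merged.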

16. Data Challenges – Inconsistency of data
• Schema
• Data representation
• Data value
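The slide only names the three kinds of inconsistency; one possible reading is sketched below, with schema meaning different field names, representation meaning different date layouts, and value meaning different spellings of the same thing. The alias tables, the day-first interpretation of `01/03/2020`, and the sample records are all assumptions.

```python
from datetime import datetime

# Assumed mappings; a real system would maintain these per source.
FIELD_ALIASES = {"cust_id": "customer_id", "customerId": "customer_id", "signup": "signup_date"}
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y"]  # assumes slash dates are day-first
COUNTRY_VALUES = {"usa": "US", "united states": "US", "us": "US"}

def unify_schema(record):
    """Schema: rename source-specific field names to one canonical name."""
    return {FIELD_ALIASES.get(name, name): value for name, value in record.items()}

def parse_date(text):
    """Representation: accept several date layouts, return an ISO 8601 date string."""
    for date_format in DATE_FORMATS:
        try:
            return datetime.strptime(text, date_format).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

def normalize(record):
    """Apply all three fixes and return a new, canonical record."""
    record = unify_schema(record)
    record["signup_date"] = parse_date(record["signup_date"])
    # Value: map different spellings of the same thing to one value.
    country = record["country"]
    record["country"] = COUNTRY_VALUES.get(country.strip().lower(), country)
    return record

# Hypothetical records describing the same customer in two source systems.
source_a = {"cust_id": 7, "signup": "2020-03-01", "country": "USA"}
source_b = {"customerId": 7, "signup_date": "01/03/2020", "country": "United States"}
normalized_a = normalize(source_a)
normalized_b = normalize(source_b)
```

After normalization the two records compare equal, which is what downstream joins and deduplication rely on.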

17. Data Challenges – Variety of data

18. What about Big Data?

19.

20. Big Data Processing Pipelines (diagram; recoverable labels: Visualize, Azure Machine Learning)

21. Why is processing Big Data challenging?
• Variety
• Velocity
• Volume

22. Data Engineers
• Analytics tools
• Knowledge of SQL
• Data Warehousing
• Data Architecture
• Coding Skills
• Machine Learning *

23. Streaming vs. Batch Processing

24. Batch processing

25. NRT / Stream processing
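The contrast between the two processing styles can be shown on the same computation. In this sketch the click events are hypothetical; the batch function sees the whole data set at once, while the streaming counter keeps running state and produces a result after every event.

```python
# Hypothetical click events: (page, count) pairs.
EVENTS = [("home", 1), ("cart", 1), ("home", 1), ("home", 1)]

def batch_count(events):
    """Batch: the full data set is available; compute the result in one pass."""
    totals = {}
    for page, count in events:
        totals[page] = totals.get(page, 0) + count
    return totals

class StreamingCounter:
    """Stream: events arrive one at a time; keep running state and update it."""

    def __init__(self):
        self.totals = {}

    def on_event(self, page, count):
        self.totals[page] = self.totals.get(page, 0) + count
        return dict(self.totals)  # a result is available after every event

batch_result = batch_count(EVENTS)

counter = StreamingCounter()
snapshots = [counter.on_event(page, count) for page, count in EVENTS]
```

Once every event has been seen, the last streaming snapshot matches the batch result; the difference is how early a (partial) answer becomes available.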

26. Why not both?

27. Lambda Architecture

28. BATCH LAYER, SPEED LAYER, SERVICE LAYER
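A toy sketch of how the three layers might fit together, assuming a simple page-count workload. The function and class names are hypothetical, the data is invented, and the third layer (called the service layer on the slide, often called the serving layer elsewhere) is reduced to a single query function.

```python
def build_batch_view(master_events):
    """Batch layer: recompute totals from the full event history."""
    view = {}
    for page, count in master_events:
        view[page] = view.get(page, 0) + count
    return view

class SpeedLayer:
    """Speed layer: incrementally track only events newer than the last batch run."""

    def __init__(self):
        self.realtime_view = {}

    def on_event(self, page, count):
        self.realtime_view[page] = self.realtime_view.get(page, 0) + count

    def reset(self):
        """Called once a new batch view has absorbed these events."""
        self.realtime_view = {}

def query(batch_view, realtime_view, page):
    """Service layer: answer queries by merging the batch and real-time views."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

# Hypothetical data: history already processed by the batch layer...
history = [("home", 1), ("home", 1), ("cart", 1)]
batch_view = build_batch_view(history)

# ...and events that arrived after the last batch run.
speed = SpeedLayer()
speed.on_event("home", 1)
speed.on_event("checkout", 1)

home_total = query(batch_view, speed.realtime_view, "home")
checkout_total = query(batch_view, speed.realtime_view, "checkout")
```

The batch view is complete but stale, the real-time view is fresh but partial, and a query combines the two; when the next batch run finishes, the speed layer's state can be discarded.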

29. Tools
• Batch layer: Apache Spark, Azure Batch
• Speed layer: Spark Streaming, Kafka Streams, Azure Stream Analytics
• Service layer: Prometheus, Grafana, Power BI
• Many more…

30.

31. What did we learn today:
• ETL vs Data Pipelines
• Big Data and Data challenges
• Data Engineers
• Streaming vs Batch processing
• Architectures
• Tools

32. Learn more:
• aka.ms/data-guide
• aka.ms/stream-processing
• aka.ms/building-blocks
• aka.ms/start-with-the-cloud

Editor's notes

Simple data pipeline
Show command line example: ps -A | grep java
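The same idea as the shell pipe in this note can be sketched with Python generators: each stage consumes the output of the previous one. The process listing below is invented sample text standing in for real `ps -A` output, and the stage names are hypothetical.

```python
# Hypothetical process listing standing in for the output of `ps -A`.
SAMPLE_PS_OUTPUT = [
    "  PID TTY          TIME CMD",
    "    1 ?        00:00:02 systemd",
    "  812 ?        00:01:15 java",
    " 1044 pts/0    00:00:00 bash",
    " 2210 ?        00:00:41 java",
]

def list_processes(lines):
    """First stage: emit lines one at a time, like `ps -A` writing to a pipe."""
    for line in lines:
        yield line

def grep(pattern, lines):
    """Second stage: pass through only lines containing the pattern, like `grep`."""
    for line in lines:
        if pattern in line:
            yield line

# Chain the stages: the output of one stage is the input of the next.
java_processes = list(grep("java", list_processes(SAMPLE_PS_OUTPUT)))
```

Data flows from one stage to the next, but nothing is reshaped into a target schema or loaded into a store, which is one way to read the deck's "Data Pipeline != ETL" slide: ETL is a particular kind of pipeline, not the only kind.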
Relational databases (RDBMS) work with structured data. Non-relational databases (NoSQL) work with semi-structured data. Relational and non-relational are data models, describing how data is organized. Structured, semi-structured, and unstructured are data types.
Variety: it can be structured, semi-structured, or unstructured.
Velocity: it can be streaming, near real-time, or batch.
Volume: it can be 1 GB or 1 ZB.
• Apache Hadoop-based analytics: Hadoop is an open source platform for distributed storage and distributed processing of data sets. These services provide data storage, processing, data access, security, governance, and operations. You need a good grasp of tools such as MapReduce, Hadoop, and HBase.
• Knowledge of SQL: you need strong knowledge of SQL and relational database management systems to manage data.
• Data warehousing: learning how to construct and use a data warehouse is a must. Data warehousing helps you aggregate data from one or more sources so it can be compared and analysed for better business decisions.
• Data architecture: knowledge of building complex database systems for companies. The term also refers to processes that address data at rest, data in motion, data sets, and how they relate to data-dependent applications and processes.
• Coding skills: you need good coding skills in languages such as Python, Java, and Perl.
• Machine learning: machine learning is usually considered part of data science, but some understanding of how to put data to use through statistical analysis and data modelling is a huge advantage. Knowing machine learning can therefore be the cherry on the cake.
• Operating systems: extensive knowledge of operating systems such as Linux, UNIX, and Solaris can be very helpful, since most of the tools are based on these systems.