Data Engineering
Engineering Data into Information
training@itversity.com
Agenda
• Introduction to Data Engineering
• Role of Big Data in Data Engineering
• Key Skills related to Data Engineering
• Overview of Data Engineering Certifications
• Free Content and ITVersity Paid Resources
Staying in touch
• Join our Meetup group - https://www.meetup.com/itversityin/
• Enroll for our labs - https://labs.itversity.com/plans
• Subscribe to our YouTube Channel for Videos - http://youtube.com/itversityin/?sub_confirmation=1
• Access Content via our GitHub - https://github.com/dgadiraju/itversity-books
• Lab and Content Support using Slack
Reach out to dgadiraju@itversity.com for enquiries related to corporate
training and data engineering services
Introduction
• About me - https://www.linkedin.com/in/durga0gadiraju/
• 13+ years of industry experience in building large-scale, data-driven applications
• IT Versity, LLC is a Dallas-based startup specializing in low-cost, quality training in emerging technologies such as Big Data and Cloud
• We provide training using the following platforms:
• https://labs.itversity.com - low cost big data lab to learn technologies.
• http://discuss.itversity.com - support while learning
• http://www.itversity.com - website for content
• https://youtube.com/itversityin
• https://github.com/dgadiraju
[Figure: architecture diagram showing clients, switches, firewalls, web/app servers and databases, with data sources labeled Files, Databases, Big Data Clusters and External Apps]
Data Integration
Batch or Real Time
• For batch, get data by querying the source databases
• Batch tools: Informatica, Ab Initio, etc.
• For real time, get data from web server logs or database logs
• Real-time tools: GoldenGate to get data from database logs, Kafka to get data from web server logs
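As a minimal sketch of the batch approach above, the following Python snippet pulls rows from a relational database by querying it. It uses the standard-library `sqlite3` module with an in-memory database as a stand-in for a real source system. The `orders` table, its columns and the `extract_batch` helper are hypothetical names chosen for illustration.

```python
import sqlite3


def extract_batch(conn, since_id):
    """Return rows with order_id greater than since_id (hypothetical incremental batch pull)."""
    cursor = conn.execute(
        "SELECT order_id, status FROM orders WHERE order_id > ? ORDER BY order_id",
        (since_id,),
    )
    return cursor.fetchall()


# In-memory SQLite database standing in for a source system (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (order_id, status) VALUES (?, ?)",
    [(1, "COMPLETE"), (2, "PENDING"), (3, "COMPLETE")],
)
conn.commit()

# A batch run picks up everything after the last order_id it already processed.
batch = extract_batch(conn, since_id=1)
print(batch)
```

Tracking the last processed key is only one way to schedule batch pulls. A real-time pipeline would instead read changes from database or web server logs as they are written.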
Job Roles – Skills and Technologies
Roles: Data Engineer, BI Developer, Application Developer, DevOps Engineer, Solutions Architect
Responsibilities
• Data Engineer: Data ingestion and processing
• BI Developer: Reporting and visualization
• Application Developer: Developing applications
• DevOps Engineer: Maintaining infrastructure such as Big Data clusters
Data Engineering
• Get data from different sources
• Design Data Marts for reporting
• Process data by applying transformation rules
• Row level transformations
• Aggregations
• Sorting
• Ranking
• And more
• Port data back to Data Marts for reporting
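The transformation types listed above can be sketched in plain Python. This is only an illustration on a handful of in-memory records, not how it would be done at scale. The `orders` records and their field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical order records; field names are assumptions for illustration.
orders = [
    {"customer": "alice", "amount": "120.50"},
    {"customer": "bob", "amount": "80.00"},
    {"customer": "alice", "amount": "30.25"},
    {"customer": "carol", "amount": "200.00"},
]

# Row-level transformation: convert the amount from text to a number.
typed = [{"customer": o["customer"], "amount": float(o["amount"])} for o in orders]

# Aggregation: total amount per customer.
totals = defaultdict(float)
for row in typed:
    totals[row["customer"]] += row["amount"]

# Sorting: customers by total amount, highest first.
sorted_totals = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Ranking: assign positions 1, 2, 3, ... in sorted order.
ranked = [
    (rank, customer, total)
    for rank, (customer, total) in enumerate(sorted_totals, start=1)
]

for rank, customer, total in ranked:
    print(rank, customer, total)
```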
Data Engineering was traditionally performed using tools (e.g., Informatica)
Now it is transitioning to programming languages (e.g., Python) and the Cloud
What are the limitations of conventional tools
or programming languages?
Limitations of conventional approach
• Scalability is a major challenge
• Hardware Cost
• Licensing
Big Data eco system tools and technologies
solve the problem of scalability
Data Engineering
Technologies (Big Data eco system highlighted): HDFS, Hive, Pig, Sqoop, Impala, Tez, EMR, Spark, Ganglia, HBase, Zookeeper, Map Reduce, YARN, Kafka, Flume, Storm, Flink, Datameer, AWS s3, Azure Blob, Airflow, NiFi
Let us understand this vast array of tools
Big Data eco system – High level categories
All the technologies in the previous slide can be grouped into these categories:
• File system
• Data ingestion
• Data processing
• Batch
• Real time
• Streaming
• Visualization
• Support
Big Data eco system – High level categories
[Figure: blocks labeled File System, Data Ingestion, Data Processing, Visualization, Insights and Support]
Big Data eco system – File System
File systems supporting Big Data are typically distributed file systems. However, cloud-based storage is also becoming popular, as the pay-as-you-go model can cut operational costs significantly.
• HDFS – Hadoop Distributed File System
• AWS S3 – Amazon’s cloud-based storage
• Azure Blob – Microsoft Azure’s cloud-based storage
• NoSQL file systems
Big Data eco system – Data Ingestion
Data ingestion can be done either in real time or in batches. Data can be pulled from relational databases or streamed from web logs.
• NiFi – a UI-based data ingestion tool
• Kafka – a queue-based technology from which data can be consumed by any downstream technology, Big Data systems being one such category
• There are many other tools, and at times we may have to build custom solutions to meet our requirements
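To illustrate the queue-based idea behind Kafka, here is a minimal sketch using Python's standard-library `queue` module. It is an in-process stand-in only and does not use the Kafka client API. The `topic` variable, the `produce` and `consume` helpers and the log lines are hypothetical.

```python
import queue

# In-process queue standing in for a Kafka topic (assumption: this is NOT the Kafka client API).
topic = queue.Queue()


def produce(lines):
    """Publish each web-log line to the queue, as a producer would publish to a topic."""
    for line in lines:
        topic.put(line)


def consume():
    """Drain the queue and return the messages in arrival order."""
    messages = []
    while True:
        try:
            messages.append(topic.get_nowait())
        except queue.Empty:
            return messages


# Hypothetical web server log lines.
produce(["GET /home 200", "GET /cart 500", "POST /order 201"])
received = consume()
print(received)
```

The point of the queue is decoupling: the producer does not need to know which technology consumes the data downstream.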
Big Data eco system – Data Processing
Data processing is categorized into
• Batch
• Map Reduce – I/O driven
• Spark – Memory driven
• Real time (real time operations)
• NoSQL – HBase/MongoDB/Cassandra
• Ad hoc querying – Impala/Presto/Spark SQL
• Streaming (near real time data processing)
• Spark Streaming
• Flink
• Storm
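Streaming engines commonly group in-flight events into time windows before applying rules. The sketch below shows a fixed (tumbling) window count in plain Python over a small list. It is not the API of Spark Streaming, Flink or Storm, and a real engine would apply the same idea incrementally as events arrive. The event tuples and the `tumbling_window_counts` helper are hypothetical.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed windows and count keys per window."""
    windows = {}
    for timestamp, key in events:
        # Every event falls into exactly one window, identified by its start time.
        window_start = (timestamp // window_seconds) * window_seconds
        windows.setdefault(window_start, Counter())[key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


# Hypothetical (epoch_seconds, page) click events.
events = [(0, "home"), (3, "cart"), (7, "home"), (12, "home"), (14, "cart"), (19, "cart")]
result = tumbling_window_counts(events, window_seconds=10)
print(result)
```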
Big Data eco system – Data Processing
• Amazon Recommendation engine
• LinkedIn endorsements
Big Data eco system – Visualization
Once the data is processed we need to visualize the data using
standard reporting tools or custom applications.
• Datameer
• D3.js
• Tableau
• QlikView
• and many more
Big Data eco system – Support
There are a number of tools used to support the clusters
• Ambari/Cloudera Manager/Ganglia – Used to set up and maintain the tools
• Zookeeper – Load balancing and failover
• Kerberos – Security
• Knox/Ranger
Big Data eco system – High level categories and skills mapping
• File System: HDFS, s3, Azure Blob, Other
• Data Ingestion: NiFi, Kafka, Custom
• Data Processing: Batch, Real Time, Streaming
• Visualization: Datameer, BI Tools, Custom
• Insights
• Support: DevOps, Hadoop
Job Roles – Skills and Technologies
Roles: Data Engineer, BI Developer, Application Developer, DevOps Engineer, Solutions Architect
Data Engineer
• Responsibilities: Data ingestion and processing
• Skills: Basic programming, Data Warehousing, ETL, Data integration
• Technologies (Big Data): Scala/Python, Spark, NiFi, Kafka, Spark Streaming/Storm/Flink etc.
BI Developer
• Responsibilities: Reporting and visualization
• Skills: Reporting, Domain knowledge, Data Warehousing, BI
• Technologies (Big Data): BI tools such as Tableau, Data Modeling, visualization frameworks of R, Python etc.
Application Developer
• Responsibilities: Developing applications
• Skills: Advanced programming, Application frameworks, Databases
• Technologies (Big Data): Java/Python, MVC, Micro Services, NoSQL etc.
DevOps Engineer
• Responsibilities: Maintaining infrastructure such as Big Data clusters
• Skills: System Administration, DevOps, Cloud-based technologies
• Technologies (Big Data): Puppet/Chef/Ansible, Cloudera/Hortonworks/MapR etc., AWS
Big Data eco system – High level categories and skills mapping
• File System: HDFS, s3, Azure Blob, Other
• Data Ingestion: Sqoop, Flume, Kafka, Custom
• Data Processing: Batch, Real Time, Streaming
• Visualization: Datameer, BI Tools, Custom
• Insights
• Support: DevOps, Hadoop
Data Engineering – On Prem vs. Cloud
• Lately, most clients are moving away from On-Premise to Cloud
• On-Premise clusters are typically built using MapR, Hortonworks or Cloudera
• MapR and Hortonworks are close to extinct, and Cloudera is striving to survive by offering a Cloud-based service
Data Engineering – Distributions – Challenges
Here are some of the challenges distributions are facing.
• Storage, Catalog (Metadata) and Processing are coupled.
• Clusters are typically underutilized.
• Even when clusters are set up in the Cloud using distributions, they are typically underutilized.
• Adding new nodes is a tedious process.
• Demo using ITVersity labs
Data Engineering – Cloud based services
• The following are among the most popular cloud-based Data Engineering services:
• Databricks
• AWS Analytics
• Google Dataproc
• Storage, Catalog and Processing are decoupled.
Data Engineering – Core Skills
Technologies (Big Data eco system highlighted): HDFS, Hive, Pig, Sqoop, Tez, EMR, Spark, Ganglia, NoSQL, Impala, Zookeeper, Map Reduce, YARN, Kafka, Flume, Storm, Flink, Datameer, AWS s3, Azure Blob, Airflow, NiFi, Databricks, Dataproc
Data Engineering vs. Data Science
• Data Science and Data Engineering are two different fields
• Data Science can be practiced even in Excel on smaller volumes of data
• When it comes to larger volumes of data, Data Science teams work closely with Data Engineers to
• Ingest data from different sources
• Process data – Data Cleansing, Standardization, Aggregations etc.
• Data can be passed to data science algorithms after it has been processed
• Data science algorithms can be applied using Big Data modules such as Mahout, Spark MLlib etc.
• Data Scientists should be cognizant of Data Engineering, but need not be hands-on. Data Engineers are the ones who work on the Big Data eco system. In smaller organizations, however, one person may have to master both roles.
Data Engineering – roles and responsibilities
• Environment – Linux
• Ad hoc querying and reporting – SQL
• Data ingestion – NiFi and Kafka
• Performing ETL
• Conventional tools such as Informatica
• Programming languages such as Python or Scala
• Spark – heavy volumes of data
• Validations – SQL or Shell Scripting
• Big Data on Cloud – AWS EMR, Databricks, GCP Dataproc
• Visualization – Tableau
Data Engineering – Required Skills
• Linux Fundamentals
• Database Essentials
• Basics of Programming (Python and Scala)
• Big Data eco system tools and technologies
• Building applications at scale
• Data Ingestion
• Streaming Data Pipelines
• Visualization
• Big Data on Cloud
Linux Fundamentals
• Overview of Operating System
• Logging into Linux (including passwordless)
• Basic Linux commands
• Editors such as vi/vim
• Regular expressions
• Processing information using awk/sed
• Basics of shell scripting
• Troubleshooting issues
Database Essentials
• Overview of Relational Databases
• Normalization
• Creating tables and manipulating data
• Basic SQL
• Analytical Functions
• Relating RDBMS with NoSQL
• Writing queries in MongoDB
Basics of Programming (Python and Scala)
• Data Types
• Basic programming constructs
• Pre-defined functions (string manipulation)
• User defined functions (including lambda functions)
• Collections
• Basic I/O operations
• Database operations
• Externalizing properties
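As one possible illustration of externalizing properties, the sketch below keeps environment-specific settings outside the application logic and loads them with Python's standard-library `configparser`. The section names, keys and paths are hypothetical. The settings are held in an inline string only to keep the example self-contained; normally they would live in a separate file.

```python
import configparser

# Properties would normally live in a separate file such as application.properties
# (hypothetical name); an inline string keeps this sketch self-contained.
PROPERTIES_TEXT = """
[dev]
input.base.dir = /tmp/retail_db
output.base.dir = /tmp/output

[prod]
input.base.dir = /data/retail_db
output.base.dir = /data/output
"""


def load_properties(text, environment):
    """Return the settings for one environment as a plain dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser[environment])


# The same code runs against dev or prod just by switching the environment name.
props = load_properties(PROPERTIES_TEXT, "dev")
print(props["input.base.dir"])
```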
Big Data eco system tools and technologies
• File Systems Overview
• Processing Engines Overview
• HDFS commands
• YARN
• Hive
• Sqoop
• Flume
• Distributions
Databases in Big Data
• Hive Overview
• Creating databases, tables and loading data
• Queries in Hive
• Hive based engines
• File formats
• Integration of Spark SQL with Scala/Python – Overview
Building applications at scale
• Overview of Spark
• Reading data from file systems
• Processing data using Core Spark API
• Processing data using Datasets and/or DataFrames
• Processing data using Spark SQL
• Saving data to file systems
• Development life cycle
• Execution life cycle
• Troubleshooting and performance tuning
Apache Spark
• Apache Spark is an in-memory distributed processing framework that runs on top of file systems such as HDFS, s3, Azure Blob etc.
• There are a number of APIs to process the data. They are called Transformations and Actions, and are also known as Core Spark.
• Tightly integrated with programming languages such as Scala, Python, Java and R
• To be proficient in Spark, you need to learn one of these programming languages, preferably Scala or Python
• You often hear about YARN and Mesos. They are just frameworks to run the jobs; developers do not need to worry about them.
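The split between Transformations and Actions can be illustrated with a toy class in plain Python. In Spark, transformations are lazy and an action triggers the actual computation. The sketch below mimics only that behavior on a single machine. `LazyDataset` and its methods are hypothetical and are not the Spark API.

```python
class LazyDataset:
    """Toy stand-in for a distributed dataset; NOT the Spark API."""

    def __init__(self, source):
        self._source = source  # zero-argument callable returning a fresh iterator

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self._source()))

    def filter(self, predicate):
        return LazyDataset(lambda: (x for x in self._source() if predicate(x)))

    # Actions: evaluate the whole chain and return a result.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())


calls = []


def double(x):
    calls.append(x)  # record when the function actually runs
    return x * 2


data = LazyDataset(lambda: iter([1, 2, 3, 4]))
pipeline = data.map(double).filter(lambda x: x > 4)

# Building the pipeline has not invoked double() at all.
calls_before_action = len(calls)

# The action runs the chain end to end.
result = pipeline.collect()
print(calls_before_action, result)
```

Deferring work until an action is requested lets an engine see the whole chain before running it.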
Data Ingestion
• Copying data between RDBMS and HDFS using Sqoop
• Copying data between RDBMS and Hive using Sqoop
• Real time data ingestion using Flume
• Data Ingestion using Kafka
• Copying data between RDBMS and HDFS using Spark JDBC
Streaming Data Pipelines
• Integrating data from Flume to Kafka
• Getting a golden copy of data into HDFS using Flume
• Integration of Kafka and Spark Streaming
• Apply analytics rules on in-flight data using Spark Streaming APIs
Visualization
• Overview of BI and Visualization tools
• Setting up Tableau Desktop
• Connecting to different data sources
• Creating reports
• Creating dashboards
Big Data on Cloud
• Overview of Cloud
• Understanding AWS (Amazon Web Services)
• Setting up EC2 instances
• AWS CLI (Command Line Interface)
• Creating an AWS EMR cluster using both the web console and the CLI
• Step execution
• Running Spark Jobs
• Deploying applications using Azkaban
Pre-requisites
• CS or IT graduate or experienced IT professional
• Basic programming and database skills
• Laptop with 4 GB RAM and 64 bit operating system
• High speed internet
Targeted Audience
• CS or IT freshers who want to become Data Engineers with Big Data skills or who aspire to Data Science
• IT professionals who want to transition to Data Engineer roles
• Mainframe Professionals
• Test Engineers
• ETL or Data Warehouse professionals
• BI professionals
• Application Developers who want to gain proficiency in Big Data
Certifications
We cover the curriculum for the following certifications:
• CCA 175 Spark and Hadoop Developer
• Databricks/O’Reilly Certified Spark Developer
Please RSVP to the next live session on Meetup:
https://www.meetup.com/itversityin/events/271739702/
Free Content and ITVersity Paid Resources
• Free Content
• Videos on YouTube
• Content on GitHub
• Vagrant Box on Vagrant Apps
• Paid Resources
• Labs at nominal cost
• Live Support via Slack or forums
Our Success Stories
• Thousands trained on a vast array of skills
• Hundreds got certified – please visit http://discuss.itversity.com/c/certifications/success-stories
• Many successfully transitioned to Data Engineer roles
• Many Data Scientists added the necessary Data Engineering skills
We believe that training related to open source should also be open source
• https://labs.itversity.com - low cost big data lab to learn technologies.
• http://discuss.itversity.com - support while learning
• http://www.itversity.com - website for content
• https://youtube.com/itversityin
• https://github.com/dgadiraju
Q&A

Introduction to Data Engineering

  • 1. Data Engineering: Engineering Data into Information (training@itversity.com)
  • 2. Agenda
      • Introduction to Data Engineering
      • Role of Big Data in Data Engineering
      • Key Skills related to Data Engineering
      • Overview of Data Engineering Certifications
      • Free Content and ITVersity Paid Resources
  • 3. Staying in touch
      • Join our Meetup group - https://www.meetup.com/itversityin/
      • Enroll for our labs - https://labs.itversity.com/plans
      • Subscribe to our YouTube Channel for Videos - http://youtube.com/itversityin/?sub_confirmation=1
      • Access Content via our GitHub - https://github.com/dgadiraju/itversity-books
      • Lab and Content Support using Slack
      Reach out to dgadiraju@itversity.com for enquiries related to corporate training and data engineering services
  • 4. Introduction
      • About me - https://www.linkedin.com/in/durga0gadiraju/
      • 13+ years of rich industry experience in building large-scale, data-driven applications
      • IT Versity, LLC is a Dallas-based startup specializing in low-cost, quality training in emerging technologies such as Big Data, Cloud, etc.
      • We provide training using the following platforms
          • https://labs.itversity.com - low-cost big data lab to learn technologies
          • http://discuss.itversity.com - support while learning
          • http://www.itversity.com - website for content
          • https://youtube.com/itversityin
          • https://github.com/dgadiraju
  • 5. [Diagram labels: Client (x6), Switch (x2), Firewall (x2), Web/App Server (x3), Database]
  • 6. Data Integration: Batch or Real Time
      [Diagram labels: Web/App Server (x3), Database, Files, Databases, Big Data Clusters, External Apps]
      • For batch, get data from databases by querying the database
      • Batch tools: Informatica, Ab Initio, etc.
      • For real time, get data from web server logs or database logs
      • Real-time tools: GoldenGate to get data from database logs, Kafka to get data from web server logs
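
The batch path on slide 6 ("get data from databases by querying") can be sketched roughly as follows. This is only an illustration using Python's built-in sqlite3 module with an in-memory database; the `orders` table, its columns, and the `extract_batch` helper are hypothetical names that do not come from the slides, and a production pipeline would more likely use a tool such as Informatica or Sqoop.

```python
import sqlite3


def extract_batch(conn, since_id, batch_size=1000):
    """Pull rows whose id is greater than since_id, oldest first."""
    cur = conn.execute(
        "SELECT id, customer, amount FROM orders "
        "WHERE id > ? ORDER BY id LIMIT ?",
        (since_id, batch_size),
    )
    return cur.fetchall()


def demo():
    # Hypothetical source table, created in memory for the example.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer, amount) VALUES (?, ?)",
        [("alice", 20.0), ("bob", 35.5), ("carol", 12.25)],
    )
    first = extract_batch(conn, since_id=0, batch_size=2)
    # Remember the highest id seen so the next run only fetches newer rows.
    watermark = first[-1][0]
    second = extract_batch(conn, since_id=watermark, batch_size=2)
    conn.close()
    return first, second


first, second = demo()
```

Tracking a watermark (the last id already extracted) is one common way to keep each batch run incremental rather than re-reading the whole table.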
  • 7. Job Roles – Skills and Technologies
      Responsibilities by role:
      • Data Engineer: Data ingestion and processing
      • BI Developer: Reporting and Visualization
      • Application Developer: Developing applications
      • DevOps Engineer: Maintaining infrastructure such as Big Data clusters
      Also shown: Solutions Architect
  • 8. Data Engineering
      • Get data from different sources
      • Design Data Marts for reporting
      • Process data by applying transformation rules
          • Row level transformations
          • Aggregations
          • Sorting
          • Ranking
          • And more
      • Port data back to Data Marts for reporting
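
The transformation rules named on slide 8 (row level transformations, aggregations, sorting, ranking) can be illustrated with a small plain-Python sketch. The sample `orders` records and the helper names are assumptions made for the example, not part of the original deck; at scale the same steps would typically run in SQL or Spark.

```python
from collections import defaultdict

# Hypothetical raw records, as they might arrive from a source system.
orders = [
    {"customer": "alice", "amount": "20.00"},
    {"customer": "bob", "amount": "35.50"},
    {"customer": "alice", "amount": "12.25"},
    {"customer": "carol", "amount": "35.50"},
]


def to_typed(row):
    # Row level transformation: normalize the name and parse the amount.
    return {"customer": row["customer"].upper(), "amount": float(row["amount"])}


def total_by_customer(rows):
    # Aggregation: sum the amounts per customer.
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


def rank_desc(totals):
    # Sorting: highest total first, then by name for a stable order.
    ordered = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    # Ranking: dense rank, so customers with equal totals share a rank.
    ranked, rank, previous = [], 0, None
    for name, total in ordered:
        if total != previous:
            rank += 1
            previous = total
        ranked.append((rank, name, total))
    return ranked


typed = [to_typed(row) for row in orders]
totals = total_by_customer(typed)
ranking = rank_desc(totals)
```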
  • 9. Data Engineering was performed by tools (e.g., Informatica)
  • 10. Now it is being transitioned to programming languages and the Cloud (e.g., Python)
  • 11. What are the limitations of conventional tools or programming languages?
  • 12. Limitations of conventional approach
      • Scalability is a major challenge
      • Hardware Cost
      • Licensing
  • 13. Big Data ecosystem tools and technologies solve the problem of scalability
  • 14. Data Engineering – Technologies (Big Data ecosystem highlighted): HDFS, Hive, Pig, Sqoop, Impala, Tez, EMR, Spark, Ganglia, HBase, ZooKeeper, MapReduce, YARN, Kafka, Flume, Storm, Flink, Datameer, AWS S3, Azure Blob, Airflow, NiFi
  • 15. Let us understand this vast array of tools
  • 16. Big Data ecosystem – High level categories
      All the technologies in the previous slide can be categorized into these:
      • File system
      • Data ingestion
      • Data processing
          • Batch
          • Real time
          • Streaming
      • Visualization
      • Support
  • 17. Big Data ecosystem – High level categories
      [Diagram labels: File System, Data Ingestion, Data Processing, Visualization, Insights, Support]
  • 18. Big Data ecosystem – File System
      File systems supporting Big Data should typically be distributed file systems. However, cloud-based storage is also becoming quite popular, as it can cut operational costs significantly with the pay-as-you-go model.
      • HDFS – Hadoop Distributed File System
      • AWS S3 – Amazon's cloud-based storage
      • Azure Blob – Microsoft Azure's cloud-based storage
      • NoSQL file systems
  • 19. Big Data ecosystem – Data Ingestion
      Data ingestion can be done either in real time or in batches. Data can be pulled from relational databases or streamed from web logs.
      • NiFi – a UI-based data ingestion tool
      • Kafka – a queue-based technology from which data can be consumed by any technology; Big Data is one such category
      • There are many other tools, and at times we might have to customize as per our requirements
  • 20. Big Data ecosystem – Data Processing
      Data processing is categorized into
      • Batch
          • MapReduce – I/O driven
          • Spark – Memory driven
      • Real time (real time operations)
          • NoSQL – HBase/MongoDB/Cassandra
          • Ad hoc querying – Impala/Presto/Spark SQL
      • Streaming (near real time data processing)
          • Spark Streaming
          • Flink
          • Storm
  • 21. Big Data ecosystem – Data Processing
      • Amazon Recommendation engine
      • LinkedIn endorsements
  • 22. Big Data ecosystem – Visualization
      Once the data is processed, we need to visualize it using standard reporting tools or custom applications.
      • Datameer
      • D3.js
      • Tableau
      • QlikView
      • and many more
  • 23. Big Data ecosystem – Support
      There are a bunch of tools which are used to support the clusters
      • Ambari/Cloudera Manager/Ganglia – Used to set up, monitor and maintain the tools
      • ZooKeeper – Coordination between services, including failover
      • Kerberos – Security
      • Knox/Ranger
  • 24. Big Data ecosystem – High level categories and skills mapping
      • File System: HDFS, S3, Azure Blob, Other
      • Data Ingestion: NiFi, Kafka, Custom
      • Data Processing: Batch, Real Time, Streaming
      • Visualization: Datameer, BI Tools, Custom
      • Insights
      • Support: DevOps, Hadoop
  • 25. Job Roles – Skills and Technologies
      • Data Engineer
          • Responsibilities: Data ingestion and processing
          • Skills: Basic programming, Data Warehousing, ETL, Data integration
          • Technologies (Big Data): Scala/Python, Spark, NiFi, Kafka, Spark Streaming/Storm/Flink, etc.
      • BI Developer
          • Responsibilities: Reporting and Visualization
          • Skills: Reporting, Domain knowledge, Data Warehousing, BI
          • Technologies (Big Data): BI tools such as Tableau, Data Modeling, Visualization frameworks of R, Python, etc.
      • Application Developer
          • Responsibilities: Developing applications
          • Skills: Advanced programming, Application frameworks, Databases
          • Technologies (Big Data): Java/Python, MVC, Micro Services, NoSQL, etc.
      • DevOps Engineer
          • Responsibilities: Maintaining infrastructure such as Big Data clusters
          • Skills: System Administration, DevOps, Cloud based technologies
          • Technologies (Big Data): Puppet/Chef/Ansible, Cloudera/Hortonworks/MapR, etc., AWS
      Also shown: Solutions Architect
  • 26. Big Data ecosystem – High level categories and skills mapping
      • File System: HDFS, S3, Azure Blob, Other
      • Data Ingestion: Sqoop, Flume, Kafka, Custom
      • Data Processing: Batch, Real Time, Streaming
      • Visualization: Datameer, BI Tools, Custom
      • Insights
      • Support: DevOps, Hadoop
  • 27. Data Engineering – On Prem vs. Cloud
      • Lately, most clients are moving away from On-Premise to Cloud
      • On-Premise clusters are typically built using MapR, Hortonworks or Cloudera
      • MapR and Hortonworks are close to extinct, and Cloudera is striving to survive by coming up with a Cloud based service
  • 28. Data Engineering – Distributions – Challenges
      Here are some of the challenges distributions are facing.
      • Storage, Catalog (Metadata) and Processing are coupled
      • Clusters are typically underutilized
      • Even though we can set up clusters in the Cloud using distributions, they are typically underutilized
      • Adding new nodes is a tedious process
      • Demo using ITVersity labs
  • 29. Data Engineering – Cloud based services
      • The following are the most popular cloud based Data Engineering services
          • Databricks
          • AWS Analytics
          • Google Dataproc
      • Storage, Catalog and Processing are decoupled
  • 30. Data Engineering – Core Skills
      Technologies (Big Data ecosystem highlighted): HDFS, Hive, Pig, Sqoop, Tez, EMR, Spark, Ganglia, NoSQL, Impala, ZooKeeper, MapReduce, YARN, Kafka, Flume, Storm, Flink, Datameer, AWS S3, Azure Blob, Airflow, NiFi, Databricks, Dataproc
  • 31. Data Engineering vs. Data Science
      • Data Science and Data Engineering are 2 different fields
      • Data Science can be implemented even using Excel on smaller volumes of data
      • When it comes to larger volumes of data, Data Science teams work closely with Data Engineers to
          • Ingest data from different sources
          • Process data – Data Cleansing, Standardization, Aggregations, etc.
      • Data can be ported to data science algorithms after it has been processed
      • Data science algorithms can be applied by using Big Data modules such as Mahout, Spark MLlib, etc.
      • Data Scientists should be cognizant of Data Engineering, but need not be hands-on. Data Engineers are the ones who work on the Big Data ecosystem. But in smaller organizations, a Data Scientist/Data Engineer has to be a master of both.
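
As a rough illustration of the "Data Cleansing, Standardization, Aggregations" step on slide 31, the plain-Python sketch below prepares a few records before they would be handed to a data science workflow. The field names, the `COUNTRY_ALIASES` mapping, and the sample records are all hypothetical and chosen only for the example.

```python
from collections import Counter

# Assumed alias table for standardization; not from the slides.
COUNTRY_ALIASES = {"USA": "US", "UNITED STATES": "US", "U.S.": "US"}


def cleanse(records):
    """Drop unusable or duplicate records and standardize the rest."""
    seen = set()
    cleaned = []
    for rec in records:
        email = (rec.get("email") or "").strip().lower()
        if "@" not in email:
            continue  # Cleansing: discard rows without a usable key.
        if email in seen:
            continue  # Cleansing: drop duplicates of the same email.
        seen.add(email)
        country = (rec.get("country") or "").strip().upper()
        # Standardization: map known aliases to a single country code.
        country = COUNTRY_ALIASES.get(country, country)
        cleaned.append({"email": email, "country": country})
    return cleaned


raw = [
    {"email": " Ann@Example.com ", "country": "usa"},
    {"email": "ann@example.com", "country": "US"},
    {"email": None, "country": "IN"},
    {"email": "raj@example.com", "country": " United States "},
]

cleaned = cleanse(raw)
# Aggregation: number of clean records per country.
per_country = Counter(rec["country"] for rec in cleaned)
```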
  • 32. Data Engineering – roles and responsibilities
      • Environment – Linux
      • Ad hoc querying and reporting – SQL
      • Data ingestion – NiFi and Kafka
      • Performing ETL
          • Conventional tools such as Informatica
          • Programming languages such as Python or Scala
          • Spark – heavy volumes of data
      • Validations – SQL or Shell Scripting
      • Big Data on Cloud – AWS EMR, Databricks, GCP Dataproc
      • Visualization – Tableau
  • 33. Data Engineering – Required Skills
      • Linux Fundamentals
      • Database Essentials
      • Basics of Programming (Python and Scala)
      • Big Data ecosystem tools and technologies
      • Building applications at scale
      • Data Ingestion
      • Streaming Data Pipelines
      • Visualization
      • Big Data on Cloud
  • 34. Linux Fundamentals
      • Overview of Operating System
      • Logging into Linux (including passwordless)
      • Basic Linux commands
      • Editors such as vi/vim
      • Regular expressions
      • Processing information using awk/sed
      • Basics of shell scripting
      • Troubleshooting issues
  • 35. Database Essentials
      • Overview of Relational Databases
      • Normalization
      • Creating tables and manipulating data
      • Basic SQL
      • Analytical Functions
      • Relating RDBMS with NoSQL
      • Writing queries in MongoDB
  • 36. Basics of Programming (Python and Scala)
      • Data Types
      • Basic programming constructs
      • Pre-defined functions (string manipulation)
      • User defined functions (including lambda functions)
      • Collections
      • Basic I/O operations
      • Database operations
      • Externalizing properties
  • 37. Big Data ecosystem tools and technologies
      • File Systems Overview
      • Processing Engines Overview
      • HDFS commands
      • YARN
      • Hive
      • Sqoop
      • Flume
      • Distributions
  • 38. Databases in Big Data
      • Hive Overview
      • Creating databases, tables and loading data
      • Queries in Hive
      • Hive based engines
      • File formats
      • Integration of Spark SQL with Scala/Python – Overview
  • 39. Building applications at scale
      • Overview of Spark
      • Reading data from file systems
      • Processing data using Core Spark API
      • Processing data using Data Sets and/or Data Frames
      • Processing data using Spark SQL
      • Saving data to file systems
      • Development life cycle
      • Execution life cycle
      • Troubleshooting and performance tuning
  • 40. Apache Spark
      • Apache Spark is an in-memory distributed processing framework that runs on top of file systems such as HDFS, S3, Azure Blob, etc.
      • There are a bunch of APIs to process the data. They are called Transformations and Actions, and are also known as Core Spark.
      • Tightly integrated with programming languages such as Scala, Python, Java, R
      • To be proficient in Spark, you need to learn one of the programming languages, preferably Scala or Python
      • You often hear about YARN and Mesos; they are just frameworks to run the jobs. Developers do not need to worry about them.
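
Slide 40 distinguishes Transformations from Actions. The toy class below is not Spark's API; it is a single-process, plain-Python sketch (with the hypothetical name `LazyDataset`) meant only to show the general idea that transformations describe work lazily while actions trigger it.

```python
class LazyDataset:
    """Toy stand-in for a distributed dataset; NOT Spark's API."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Apply the recorded steps to each item, in order.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger execution and return a result to the caller.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def traced_square(x):
    calls.append(x)  # Record each invocation to show when work happens.
    return x * x


ds = LazyDataset(range(1, 6)).map(traced_square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)  # Nothing has been computed yet.
result = ds.collect()             # The action runs the whole pipeline.
```

In this sketch `traced_square` is not called until `collect()` runs, which mirrors, in a very simplified way, why chaining transformations is cheap and why an action is needed to get results.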
  • 41. Data Ingestion
      • Copying data between RDBMS and HDFS using Sqoop
      • Copying data between RDBMS and Hive using Sqoop
      • Real time data ingestion using Flume
      • Data Ingestion using Kafka
      • Copying data between RDBMS and HDFS using Spark JDBC
  • 42. Streaming Data Pipelines
      • Integrating data from Flume to Kafka
      • Getting a golden copy of data using Flume to HDFS
      • Integration of Kafka and Spark Streaming
      • Applying analytics rules on in-flight data using Spark Streaming APIs
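
To make "analytics rules on in-flight data" from slide 42 more concrete, here is a minimal plain-Python sketch of a tumbling-window count over an ordered event stream. It does not use Spark Streaming or Kafka; the event tuples, the 10-second window, and the "two or more hits" rule are assumptions chosen for illustration.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, Counter) per window; events must be time-ordered."""
    current_start = None
    counts = Counter()
    for ts, key in events:
        start = ts - (ts % window_seconds)
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The window is complete: emit it and start a fresh one.
            yield current_start, counts
            current_start, counts = start, Counter()
        counts[key] += 1
    if current_start is not None:
        yield current_start, counts


# Hypothetical (timestamp_in_seconds, page) events, e.g. parsed from web logs.
events = [
    (0, "/home"), (3, "/cart"), (9, "/home"),
    (10, "/home"), (14, "/checkout"),
    (21, "/home"),
]

windows = list(tumbling_window_counts(iter(events), window_seconds=10))

# Example rule: flag any page with two or more hits inside one window.
alerts = [
    (start, page)
    for start, counts in windows
    for page, hits in counts.items()
    if hits >= 2
]
```

Because the function consumes an iterator and emits each window as soon as it closes, the same logic would work on an unbounded stream, which is the property streaming engines provide at scale.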
  • 43. Visualization
      • Overview of BI and Visualization tools
      • Setting up Tableau Desktop
      • Connecting to different data sources
      • Creating reports
      • Creating dashboards
  • 44. Big Data on Cloud
      • Overview of Cloud
      • Understanding AWS (Amazon Web Services)
      • Setting up EC2 instances
      • AWS CLI (Command Line Interface)
      • Creating AWS EMR cluster using both web console as well as CLI
      • Step execution
      • Running Spark Jobs
      • Deploying applications using Azkaban
  • 45. Pre-requisites
      • CS or IT graduate or experienced IT professional
      • Basic programming and database skills
      • Laptop with 4 GB RAM and 64 bit operating system
      • High speed internet
  • 46. Targeted Audience
      • CS or IT freshers who want to become Data Engineers with Big Data skills or who aspire to Data Science
      • IT professionals who want to transition to Data Engineer roles
          • Mainframes Professionals
          • Test Engineers
          • ETL or Data Warehouse professionals
          • BI professionals
      • Application Developers who want to gain proficiency in Big Data
  • 47. Certifications
      We cover the curriculum for the following certifications:
      • CCA 175 Spark and Hadoop Developer
      • Databricks/O'Reilly Certified Spark Developer
      Please RSVP to the next live session on Meetup: https://www.meetup.com/itversityin/events/271739702/
  • 48. Free Content and ITVersity Paid Resources
      • Free Content
          • Videos on YouTube
          • Content on GitHub
          • Vagrant Box on Vagrant Apps
      • Paid Resources
          • Labs at nominal cost
          • Live Support via Slack or forums
  • 49. Our Success Stories
      • Thousands trained on a vast array of skills
      • Hundreds got certified – please visit http://discuss.itversity.com/c/certifications/success-stories
      • Many successfully transitioned to Data Engineer roles
      • Many Data Scientists added necessary Data Engineering skills
  • 50. We believe that training related to open source should also be open source
      • https://labs.itversity.com - low-cost big data lab to learn technologies
      • http://discuss.itversity.com - support while learning
      • http://www.itversity.com - website for content
      • https://youtube.com/itversityin
      • https://github.com/dgadiraju