In this session, I will give an introduction to Data Engineering and Big Data, covering up-to-date trends.
* Introduction to Data Engineering
* Role of Big Data in Data Engineering
* Key Skills related to Data Engineering
* Overview of Data Engineering Certifications
* Free Content and ITVersity Paid Resources
Don't worry if you miss the live session - you can use the link below to go through the video afterwards.
https://youtu.be/dj565kgP1Ss
* Upcoming Live Session - Overview of Big Data Certifications (Spark Based) - https://www.meetup.com/itversityin/events/271739702/
Relevant Playlists:
* Apache Spark using Python for Certifications - https://www.youtube.com/playlist?list=PLf0swTFhTI8rMmW7GZv1-z4iu_-TAv3bi
* Free Data Engineering Bootcamp - https://www.youtube.com/playlist?list=PLf0swTFhTI8pBe2Vr2neQV7shh9Rus8rl
* Join our Meetup group - https://www.meetup.com/itversityin/
* Enroll for our labs - https://labs.itversity.com/plans
* Subscribe to our YouTube Channel for Videos - http://youtube.com/itversityin/?sub_confirmation=1
* Access Content via our GitHub - https://github.com/dgadiraju/itversity-books
* Lab and Content Support using Slack
2. Agenda
• Introduction to Data Engineering
• Role of Big Data in Data Engineering
• Key Skills related to Data Engineering
• Overview of Data Engineering Certifications
• Free Content and ITVersity Paid Resources
training@itversity.com
3. Staying in touch
• Join our Meetup group - https://www.meetup.com/itversityin/
• Enroll for our labs - https://labs.itversity.com/plans
• Subscribe to our YouTube Channel for Videos - http://youtube.com/itversityin/?sub_confirmation=1
• Access Content via our GitHub - https://github.com/dgadiraju/itversity-books
• Lab and Content Support using Slack
Reach out to dgadiraju@itversity.com for enquiries related to corporate
training and data engineering services
4. Introduction
• About me - https://www.linkedin.com/in/durga0gadiraju/
• 13+ years of rich industry experience in building large-scale data-driven applications
• ITVersity, LLC is a Dallas-based startup specializing in low-cost, quality training in emerging technologies such as Big Data, Cloud, etc.
• We provide training using the following platforms
• https://labs.itversity.com - low-cost Big Data lab to learn technologies
• http://discuss.itversity.com - support while learning
• http://www.itversity.com - website for content
• https://youtube.com/itversityin
• https://github.com/dgadiraju
6. Data Integration – Batch or Real Time
[Diagram: Web/App Servers, Databases, Files, and External Apps feed data into Big Data Clusters through a Data Integration layer, in batch or real time]
• For batch ingestion, data is typically pulled by querying the source databases
• Batch tools: Informatica, Ab Initio, etc.
• For real-time ingestion, data comes from web server logs or database logs
• Real-time tools: GoldenGate to get data from database logs, Kafka to get data from web server logs
7. Job Roles – Skills and Technologies
[Diagram: Data Engineer, BI Developer, Application Developer, and DevOps Engineer roles, with Solutions Architect spanning all of them]
• Data Engineer – Responsibilities: data ingestion and processing
• BI Developer – Responsibilities: reporting and visualization
• Application Developer – Responsibilities: developing applications
• DevOps Engineer – Responsibilities: maintaining infrastructure such as Big Data clusters
8. Data Engineering
• Get data from different sources
• Design Data Marts for reporting
• Process data by applying transformation rules
• Row level transformations
• Aggregations
• Sorting
• Ranking
• And more
• Port data back to Data Marts for reporting
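The transformation categories above can be sketched in plain Python. This is only a conceptual illustration (in practice these steps run in Spark, Hive, or an ETL tool), and the order records used here are made up:

```python
# Pure-Python sketch of row-level transformations, aggregation,
# sorting, and ranking (sample data is illustrative only)
from itertools import groupby
from operator import itemgetter

orders = [
    {"region": "east", "amount": 120.0},
    {"region": "west", "amount": 80.0},
    {"region": "east", "amount": 50.0},
]

# Row-level transformation: derive a new column per record
for o in orders:
    o["amount_with_tax"] = round(o["amount"] * 1.1, 2)

# Sorting, then aggregation: total amount per region
orders.sort(key=itemgetter("region"))
totals = {
    region: sum(o["amount"] for o in group)
    for region, group in groupby(orders, key=itemgetter("region"))
}

# Ranking: regions ordered by total, highest first
ranking = sorted(totals, key=totals.get, reverse=True)
print(totals)   # {'east': 170.0, 'west': 80.0}
print(ranking)  # ['east', 'west']
```

The same four steps map one-to-one onto Spark DataFrame operations (withColumn, groupBy/agg, orderBy, and window ranks).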
16. Big Data eco system – High level categories
All the technologies on the previous slide can be categorized into the following:
• File system
• Data ingestion
• Data processing
• Batch
• Real time
• Streaming
• Visualization
• Support
17. Big Data eco system – High level categories
[Diagram: File System → Data Ingestion → Data Processing → Visualization → Insights, with Support underpinning all categories]
18. Big Data eco system – File System
File systems supporting Big Data are typically distributed file systems. However, cloud-based storage is also becoming quite popular, as it can cut down operational costs significantly with the pay-as-you-go model.
• HDFS – Hadoop Distributed File System
• AWS S3 – Amazon’s cloud based storage
• Azure Blob – Microsoft Azure’s cloud based storage
• NoSQL file systems
19. Big Data eco system – Data Ingestion
Data ingestion can be done either in real time or in batches. Data can be pulled from relational databases or streamed from web logs.
• NiFi – a UI-based data ingestion tool
• Kafka – a queue-based technology from which data can be consumed by any downstream technology, including Big Data systems
• There are many other tools, and at times we might have to build custom ingestion as per our requirements
20. Big Data eco system – Data Processing
Data processing is categorized into
• Batch
• Map Reduce – I/O driven
• Spark – Memory driven
• Real time (real time operations)
• NoSQL – HBase/MongoDB/Cassandra
• Ad hoc querying – Impala/Presto/Spark SQL
• Streaming (near real time data processing)
• Spark Streaming
• Flink
• Storm
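The batch model above can be illustrated with a MapReduce-style word count in plain Python. This is a conceptual sketch only (real jobs would run on Hadoop MapReduce or Spark, distributed across a cluster), and the input lines are made up:

```python
# MapReduce-style word count in plain Python:
# map emits (word, 1) pairs, shuffle groups by key, reduce sums counts
from collections import defaultdict

lines = ["big data big clusters", "data pipelines"]

# Map phase: emit (word, 1) for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts per word
counts = {word: sum(vals) for word, vals in grouped.items()}
print(counts)  # {'big': 2, 'data': 2, 'clusters': 1, 'pipelines': 1}
```

On a real cluster, the shuffle phase is what moves data over the network between the map and reduce stages; that I/O is exactly what Spark's in-memory model tries to minimize.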
21. Big Data eco system – Data Processing
• Amazon Recommendation engine
• LinkedIn endorsements
22. Big Data eco system – Visualization
Once the data is processed, we need to visualize it using standard reporting tools or custom applications.
• Datameer
• d3js
• Tableau
• Qlikview
• and many more
23. Big Data eco system – Support
There are a bunch of tools used to support the clusters:
• Ambari/Cloudera Manager/Ganglia – Used to set up, monitor, and maintain the clusters
• Zookeeper – Coordination, load balancing, and failover
• Kerberos – Security (authentication)
• Knox/Ranger – Perimeter security and fine-grained authorization
24. Big Data eco system – High level categories and skills mapping
[Diagram: File System (HDFS, S3, Azure Blob, other) → Data Ingestion (NiFi, Kafka, custom) → Data Processing (batch, real time, streaming) → Visualization (Datameer, BI tools, custom) → Insights; Support (DevOps, Hadoop) underpins all categories]
25. Job Roles – Skills and Technologies
[Diagram: Data Engineer, BI Developer, Application Developer, and DevOps Engineer roles, with Solutions Architect spanning all of them]
• Data Engineer
  • Responsibilities: data ingestion and processing
  • Skills: basic programming, Data Warehousing, ETL, Data Integration
  • Technologies (Big Data): Scala/Python, Spark, NiFi, Kafka, Spark Streaming/Storm/Flink, etc.
• BI Developer
  • Responsibilities: reporting and visualization
  • Skills: reporting, domain knowledge, Data Warehousing, BI
  • Technologies (Big Data): BI tools such as Tableau, data modeling, visualization frameworks of R, Python, etc.
• Application Developer
  • Responsibilities: developing applications
  • Skills: advanced programming, application frameworks, databases
  • Technologies (Big Data): Java/Python, MVC, microservices, NoSQL, etc.
• DevOps Engineer
  • Responsibilities: maintaining infrastructure such as Big Data clusters
  • Skills: system administration, DevOps, cloud-based technologies
  • Technologies (Big Data): Puppet/Chef/Ansible, Cloudera/Hortonworks/MapR, etc., AWS
26. Big Data eco system – High level categories and skills mapping
[Diagram: File System (HDFS, S3, Azure Blob, other) → Data Ingestion (Sqoop, Flume, Kafka, custom) → Data Processing (batch, real time, streaming) → Visualization (Datameer, BI tools, custom) → Insights; Support (DevOps, Hadoop) underpins all categories]
27. Data Engineering – On Prem vs. Cloud
• Lately, most clients are moving away from on-premises deployments to the Cloud
• On-premises clusters are typically built using MapR, Hortonworks, or Cloudera
• MapR and Hortonworks are close to extinct, and Cloudera is striving to survive by coming up with a cloud-based service
28. Data Engineering – Distributions – Challenges
Here are some of the challenges distributions are facing.
• Storage, Catalog (Metadata), and Processing are coupled.
• Clusters are typically underutilized.
• Even when clusters are set up in the Cloud using distributions, they are still typically underutilized.
• Adding new nodes is a tedious process.
• Demo using ITVersity labs
29. Data Engineering – Cloud based services
• The following are the most popular cloud-based Data Engineering services:
• Databricks
• AWS Analytics
• Google Dataproc
• Storage, Catalog and Processing are decoupled.
30. Data Engineering – Core Skills
[Word cloud of technologies, with the Big Data eco system highlighted: HDFS, Hive, Pig, Sqoop, Tez, EMR, Spark, Ganglia, NoSQL, Impala, Zookeeper, Map Reduce, YARN, Kafka, Flume, Storm, Flink, Datameer, AWS S3, Azure Blob, Airflow, NiFi, Databricks, Dataproc]
31. Data Engineering vs. Data Science
• Data Science and Data Engineering are two different fields
• Data Science can be practiced even using Excel on smaller volumes of data
• When it comes to larger volumes of data, Data Science teams work closely with Data Engineers to
• Ingest data from different sources
• Process data – data cleansing, standardization, aggregations, etc.
• Data can be ported to data science algorithms after processing
• Data science algorithms can be applied using Big Data modules such as Mahout, Spark MLlib, etc.
• Data Scientists should be cognizant of Data Engineering, but need not be hands-on. Data Engineers are the ones who work on the Big Data eco system. But in smaller organizations, the Data Scientist/Data Engineer has to master both.
32. Data Engineering – roles and responsibilities
• Environment – Linux
• Ad hoc querying and reporting – SQL
• Data ingestion – NiFi and Kafka
• Performing ETL
• Conventional tools such as Informatica
• Programming languages such as Python or Scala
• Spark – heavy volumes of data
• Validations – SQL or Shell Scripting
• Big Data on Cloud – AWS EMR, Databricks, GCP Dataproc
• Visualization – Tableau
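The validation responsibility listed above can be sketched as a simple shell check. The file names and the line-count comparison are illustrative only; a real validation would compare row counts between the source database and the data loaded into HDFS or Hive:

```shell
# Create sample "source" and "target" extracts (illustrative data)
printf 'a\nb\nc\n' > /tmp/source_extract.txt
printf 'a\nb\nc\n' > /tmp/target_load.txt

# Compare record counts between source and target
src=$(wc -l < /tmp/source_extract.txt)
tgt=$(wc -l < /tmp/target_load.txt)
if [ "$src" -eq "$tgt" ]; then
  echo "counts match: $src rows"
else
  echo "MISMATCH: source=$src target=$tgt"
fi
```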
33. Data Engineering – Required Skills
• Linux Fundamentals
• Database Essentials
• Basics of Programming (Python and Scala)
• Big Data eco system tools and technologies
• Building applications at scale
• Data Ingestion
• Streaming Data Pipelines
• Visualization
• Big Data on Cloud
34. Linux Fundamentals
• Overview of Operating System
• Logging into Linux (including passwordless login)
• Basic Linux commands
• Editors such as vi/vim
• Regular expressions
• Processing information using awk/sed
• Basics of shell scripting
• Troubleshooting issues
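A tiny awk/sed illustration of the text-processing bullets above. The sample file, its path, and the field layout are made up for the example:

```shell
# Sample colon-delimited file (illustrative data)
printf 'alice:admin\nbob:dev\n' > /tmp/demo_users.txt

# awk: print the second colon-delimited field of each record
awk -F: '{print $2}' /tmp/demo_users.txt

# sed: expand the abbreviation "dev" to "developer"
sed 's/:dev$/:developer/' /tmp/demo_users.txt
```

The same pattern scales to real log processing: awk selects and reshapes fields, sed rewrites values in place.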
35. Database Essentials
• Overview of Relational Databases
• Normalization
• Creating tables and manipulating data
• Basic SQL
• Analytical Functions
• Relating RDBMS with NoSQL
• Writing queries in MongoDB
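The analytical-functions bullet above can be demonstrated with a window function. SQLite is used here only so the example is self-contained; the table name and data are made up, and the same RANK() OVER syntax works in Hive, Spark SQL, and most RDBMSs:

```python
# Analytical (window) function example: rank employees by salary
# within each department (requires SQLite 3.25+ for window functions)
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("anu", "eng", 120), ("ben", "eng", 100), ("carl", "hr", 90)],
)
rows = conn.execute(
    """
    SELECT name, dept, salary,
           RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk
    FROM employees
    """
).fetchall()
for r in rows:
    print(r)
```

Unlike GROUP BY, the window function keeps every input row while adding the per-partition rank as an extra column.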
36. Basics of Programming (Python and Scala)
• Data Types
• Basic programming constructs
• Pre-defined functions (string manipulation)
• User defined functions (including lambda functions)
• Collections
• Basic I/O operations
• Database operations
• Externalizing properties
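A quick tour of several of the constructs listed above, in Python (Scala has direct equivalents for each; the function names and values here are illustrative):

```python
# Pre-defined functions: string manipulation
order_id = " ORD-001 "
print(order_id.strip().lower())   # ord-001

# User-defined functions, including lambda functions
def to_usd(amount_inr, rate=0.012):
    """Convert an INR amount to USD at an assumed rate."""
    return round(amount_inr * rate, 2)

double = lambda x: x * 2

# Collections: a list transformed with a comprehension
amounts = [1000, 2500, 400]
converted = [to_usd(a) for a in amounts]
print(converted)                  # [12.0, 30.0, 4.8]
print(double(21))                 # 42
```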
37. Big Data eco system tools and technologies
• File Systems Overview
• Processing Engines Overview
• HDFS commands
• YARN
• Hive
• Sqoop
• Flume
• Distributions
38. Databases in Big Data
• Hive Overview
• Creating databases, tables and loading data
• Queries in Hive
• Hive based engines
• File formats
• Integration of Spark SQL with Scala/Python – Overview
39. Building applications at scale
• Overview of Spark
• Reading data from file systems
• Processing data using Core Spark API
• Processing data using Data Sets and/or Data Frames
• Processing data using Spark SQL
• Saving data to file systems
• Development life cycle
• Execution life cycle
• Troubleshooting and performance tuning
40. Apache Spark
• Apache Spark is an in-memory distributed processing framework on top of file systems such as HDFS, S3, Azure Blob, etc.
• There is a bunch of APIs to process the data, called Transformations and Actions. Together they are known as Core Spark.
• Tightly integrated with programming languages such as Scala, Python, Java, and R
• To be proficient in Spark, you need to learn one of the programming languages – preferably Scala or Python
• You often hear about YARN and Mesos – they are just resource management frameworks that run the jobs. Developers typically do not need to worry about them.
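The Transformations-versus-Actions distinction above can be sketched with Python generators. This is an analogy, not the Spark API itself: like Spark transformations, generator pipelines are lazy and nothing executes until a terminal operation (the "action") consumes them:

```python
# Lazy pipeline: each step describes work but does not run it
data = range(1, 6)

mapped = (x * x for x in data)            # "transformation": nothing runs yet
filtered = (x for x in mapped if x > 5)   # still lazy

# "Action": consuming the pipeline triggers the whole computation
result = sum(filtered)
print(result)  # 9 + 16 + 25 = 50
```

Spark works the same way: map and filter only build a plan, and the plan runs when an action such as count or collect is called.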
41. Data Ingestion
• Copying data between RDBMS and HDFS using Sqoop
• Copying data between RDBMS and Hive using Sqoop
• Real time data ingestion using Flume
• Data Ingestion using Kafka
• Copying data between RDBMS and HDFS using Spark JDBC
42. Streaming Data Pipelines
• Integrating data from Flume to Kafka
• Getting a golden copy of data into HDFS using Flume
• Integration of Kafka and Spark Streaming
• Applying analytics rules on in-flight data using Spark Streaming APIs
43. Visualization
• Overview of BI and Visualization tools
• Setting up Tableau Desktop
• Connecting to different data sources
• Creating reports
• Creating dashboards
44. Big Data on Cloud
• Overview of Cloud
• Understanding AWS (Amazon Web Services)
• Setting up EC2 instances
• AWS CLI (Command Line Interface)
• Creating an AWS EMR cluster using both the web console and the CLI
• Step execution
• Running Spark Jobs
• Deploying applications using Azkaban
45. Pre-requisites
• CS or IT graduate or experienced IT professional
• Basic programming and database skills
• Laptop with 4 GB RAM and 64 bit operating system
• High speed internet
46. Targeted Audience
• CS or IT freshers who want to become Data Engineers with Big Data skills, or who aspire to Data Science
• IT professionals who want to transition to Data Engineer roles
• Mainframes Professionals
• Test Engineers
• ETL or Data Warehouse professionals
• BI professionals
• Application Developers who want to gain proficiency in Big Data
47. Certifications
We cover the curriculum for the following certifications:
• CCA 175 Spark and Hadoop Developer
• Databricks/O'Reilly Certified Spark Developer
Please RSVP to the next live session on Meetup:
https://www.meetup.com/itversityin/events/271739702/
48. Free Content and ITVersity Paid Resources
• Free Content
• Videos on YouTube
• Content on GitHub
• Vagrant Box on Vagrant Apps
• Paid Resources
• Labs at nominal cost
• Live Support via Slack or forums
49. Our Success Stories
• Thousands trained on a vast array of skills
• Hundreds got certified – please visit http://discuss.itversity.com/c/certifications/success-stories
• Many successfully transitioned to Data Engineer roles
• Many Data Scientists added the necessary Data Engineering skills
50. We believe that training related to open source should also be open source
• https://labs.itversity.com - low-cost Big Data lab to learn technologies
• http://discuss.itversity.com - support while learning
• http://www.itversity.com - website for content
• https://youtube.com/itversityin
• https://github.com/dgadiraju