In this session, I will give an introduction to Data Engineering and Big Data, covering up-to-date trends.
* Introduction to Data Engineering
* Role of Big Data in Data Engineering
* Key Skills related to Data Engineering
* Overview of Data Engineering Certifications
* Free Content and ITVersity Paid Resources
Don't worry if you miss the live session - you can use the link below to go through the video afterwards.
https://youtu.be/dj565kgP1Ss
* Upcoming Live Session - Overview of Big Data Certifications (Spark Based) - https://www.meetup.com/itversityin/events/271739702/
Relevant Playlists:
* Apache Spark using Python for Certifications - https://www.youtube.com/playlist?list=PLf0swTFhTI8rMmW7GZv1-z4iu_-TAv3bi
* Free Data Engineering Bootcamp - https://www.youtube.com/playlist?list=PLf0swTFhTI8pBe2Vr2neQV7shh9Rus8rl
* Join our Meetup group - https://www.meetup.com/itversityin/
* Enroll for our labs - https://labs.itversity.com/plans
* Subscribe to our YouTube Channel for Videos - http://youtube.com/itversityin/?sub_confirmation=1
* Access Content via our GitHub - https://github.com/dgadiraju/itversity-books
* Lab and Content Support using Slack
2. Agenda
• Introduction to Data Engineering
• Role of Big Data in Data Engineering
• Key Skills related to Data Engineering
• Overview of Data Engineering Certifications
• Free Content and ITVersity Paid Resources
training@itversity.com
3. Staying in touch
• Join our Meetup group - https://www.meetup.com/itversityin/
• Enroll for our labs - https://labs.itversity.com/plans
• Subscribe to our YouTube Channel for Videos - http://youtube.com/itversityin/?sub_confirmation=1
• Access Content via our GitHub - https://github.com/dgadiraju/itversity-books
• Lab and Content Support using Slack
Reach out to dgadiraju@itversity.com for enquiries related to corporate
training and data engineering services
4. Introduction
• About me - https://www.linkedin.com/in/durga0gadiraju/
• 13+ years of rich industry experience in building large-scale data-driven applications
• ITVersity, LLC is a Dallas-based startup specializing in low-cost, quality training in emerging technologies such as Big Data, Cloud, etc.
• We provide training using the following platforms
• https://labs.itversity.com - low-cost Big Data lab to learn technologies
• http://discuss.itversity.com - support while learning
• http://www.itversity.com - website for content
• https://youtube.com/itversityin
• https://github.com/dgadiraju
6. Data Integration – Batch or Real Time
[Diagram: Web/App Servers, Databases, Files, and External Apps feed data into Big Data Clusters through a Data Integration layer, in batch or real time]
• For batch ingestion, data is typically pulled by querying the source databases
• Batch tools: Informatica, Ab Initio, etc.
• For real-time ingestion, data comes from web server logs or database logs
• Real-time tools: GoldenGate to get data from database logs, Kafka to get data from web server logs
7. Job Roles – Skills and Technologies
[Diagram: Data Engineer, BI Developer, Application Developer, and DevOps Engineer roles, with Solutions Architect spanning all of them]
• Data Engineer – Responsibilities: data ingestion and processing
• BI Developer – Responsibilities: reporting and visualization
• Application Developer – Responsibilities: developing applications
• DevOps Engineer – Responsibilities: maintaining infrastructure such as Big Data clusters
8. Data Engineering
• Get data from different sources
• Design Data Marts for reporting
• Process data by applying transformation rules
• Row level transformations
• Aggregations
• Sorting
• Ranking
• And more
• Port data back to Data Marts for reporting
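The transformation categories above can be sketched in plain Python. This is only a conceptual illustration (in practice these steps run in Spark, Hive, or an ETL tool), and the order records used here are made up:

```python
# Pure-Python sketch of row-level transformations, aggregation,
# sorting, and ranking (sample data is illustrative only)
from itertools import groupby
from operator import itemgetter

orders = [
    {"region": "east", "amount": 120.0},
    {"region": "west", "amount": 80.0},
    {"region": "east", "amount": 50.0},
]

# Row-level transformation: derive a new column per record
for o in orders:
    o["amount_with_tax"] = round(o["amount"] * 1.1, 2)

# Sorting, then aggregation: total amount per region
orders.sort(key=itemgetter("region"))
totals = {
    region: sum(o["amount"] for o in group)
    for region, group in groupby(orders, key=itemgetter("region"))
}

# Ranking: regions ordered by total, highest first
ranking = sorted(totals, key=totals.get, reverse=True)
print(totals)   # {'east': 170.0, 'west': 80.0}
print(ranking)  # ['east', 'west']
```

The same four steps map one-to-one onto Spark DataFrame operations (withColumn, groupBy/agg, orderBy, and window ranks).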
16. Big Data eco system – High level categories
All the technologies on the previous slide can be categorized into the following:
• File system
• Data ingestion
• Data processing
• Batch
• Real time
• Streaming
• Visualization
• Support
17. Big Data eco system – High level categories
[Diagram: File System → Data Ingestion → Data Processing → Visualization → Insights, with Support underpinning all categories]
18. Big Data eco system – File System
File systems supporting Big Data are typically distributed file systems. However, cloud-based storage is also becoming quite popular, as it can cut down operational costs significantly with the pay-as-you-go model.
• HDFS – Hadoop Distributed File System
• AWS S3 – Amazon’s cloud based storage
• Azure Blob – Microsoft Azure’s cloud based storage
• NoSQL file systems
19. Big Data eco system – Data Ingestion
Data ingestion can be done either in real time or in batches. Data can be pulled from relational databases or streamed from web logs.
• NiFi – a UI-based data ingestion tool
• Kafka – a queue-based technology from which data can be consumed by any downstream technology, including Big Data systems
• There are many other tools, and at times we might have to build custom ingestion as per our requirements
20. Big Data eco system – Data Processing
Data processing is categorized into
• Batch
• Map Reduce – I/O driven
• Spark – Memory driven
• Real time (real time operations)
• NoSQL – HBase/MongoDB/Cassandra
• Ad hoc querying – Impala/Presto/Spark SQL
• Streaming (near real time data processing)
• Spark Streaming
• Flink
• Storm
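The batch model above can be illustrated with a MapReduce-style word count in plain Python. This is a conceptual sketch only (real jobs would run on Hadoop MapReduce or Spark, distributed across a cluster), and the input lines are made up:

```python
# MapReduce-style word count in plain Python:
# map emits (word, 1) pairs, shuffle groups by key, reduce sums counts
from collections import defaultdict

lines = ["big data big clusters", "data pipelines"]

# Map phase: emit (word, 1) for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts per word
counts = {word: sum(vals) for word, vals in grouped.items()}
print(counts)  # {'big': 2, 'data': 2, 'clusters': 1, 'pipelines': 1}
```

On a real cluster, the shuffle phase is what moves data over the network between the map and reduce stages; that I/O is exactly what Spark's in-memory model tries to minimize.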
21. Big Data eco system – Data Processing
• Amazon Recommendation engine
• LinkedIn endorsements
22. Big Data eco system – Visualization
Once the data is processed, we need to visualize it using standard reporting tools or custom applications.
• Datameer
• d3js
• Tableau
• Qlikview
• and many more
23. Big Data eco system – Support
There are a bunch of tools used to support the clusters:
• Ambari/Cloudera Manager/Ganglia – Used to set up, monitor, and maintain the clusters
• Zookeeper – Coordination, load balancing, and failover
• Kerberos – Security (authentication)
• Knox/Ranger – Perimeter security and fine-grained authorization
24. Big Data eco system – High level categories and skills mapping
[Diagram: File System (HDFS, S3, Azure Blob, other) → Data Ingestion (NiFi, Kafka, custom) → Data Processing (batch, real time, streaming) → Visualization (Datameer, BI tools, custom) → Insights; Support (DevOps, Hadoop) underpins all categories]
25. Job Roles – Skills and Technologies
[Diagram: Data Engineer, BI Developer, Application Developer, and DevOps Engineer roles, with Solutions Architect spanning all of them]
• Data Engineer
  • Responsibilities: data ingestion and processing
  • Skills: basic programming, Data Warehousing, ETL, Data Integration
  • Technologies (Big Data): Scala/Python, Spark, NiFi, Kafka, Spark Streaming/Storm/Flink, etc.
• BI Developer
  • Responsibilities: reporting and visualization
  • Skills: reporting, domain knowledge, Data Warehousing, BI
  • Technologies (Big Data): BI tools such as Tableau, data modeling, visualization frameworks of R, Python, etc.
• Application Developer
  • Responsibilities: developing applications
  • Skills: advanced programming, application frameworks, databases
  • Technologies (Big Data): Java/Python, MVC, microservices, NoSQL, etc.
• DevOps Engineer
  • Responsibilities: maintaining infrastructure such as Big Data clusters
  • Skills: system administration, DevOps, cloud-based technologies
  • Technologies (Big Data): Puppet/Chef/Ansible, Cloudera/Hortonworks/MapR, etc., AWS
26. Big Data eco system – High level categories and skills mapping
[Diagram: File System (HDFS, S3, Azure Blob, other) → Data Ingestion (Sqoop, Flume, Kafka, custom) → Data Processing (batch, real time, streaming) → Visualization (Datameer, BI tools, custom) → Insights; Support (DevOps, Hadoop) underpins all categories]
27. Data Engineering – On Prem vs. Cloud
• Lately, most clients are moving away from on-premises deployments to the Cloud
• On-premises clusters are typically built using MapR, Hortonworks, or Cloudera
• MapR and Hortonworks are close to extinct, and Cloudera is striving to survive by coming up with a cloud-based service
28. Data Engineering – Distributions – Challenges
Here are some of the challenges distributions are facing.
• Storage, Catalog (Metadata), and Processing are coupled.
• Clusters are typically underutilized.
• Even when clusters are set up in the Cloud using distributions, they are still typically underutilized.
• Adding new nodes is a tedious process.
• Demo using ITVersity labs
29. Data Engineering – Cloud based services
• The following are the most popular cloud-based Data Engineering services:
• Databricks
• AWS Analytics
• Google Dataproc
• Storage, Catalog and Processing are decoupled.
30. Data Engineering – Core Skills
[Word cloud of technologies, with the Big Data eco system highlighted: HDFS, Hive, Pig, Sqoop, Tez, EMR, Spark, Ganglia, NoSQL, Impala, Zookeeper, Map Reduce, YARN, Kafka, Flume, Storm, Flink, Datameer, AWS S3, Azure Blob, Airflow, NiFi, Databricks, Dataproc]
31. Data Engineering vs. Data Science
• Data Science and Data Engineering are two different fields
• Data Science can be practiced even using Excel on smaller volumes of data
• When it comes to larger volumes of data, Data Science teams work closely with Data Engineers to
• Ingest data from different sources
• Process data – data cleansing, standardization, aggregations, etc.
• Data can be ported to data science algorithms after processing
• Data science algorithms can be applied using Big Data modules such as Mahout, Spark MLlib, etc.
• Data Scientists should be cognizant of Data Engineering, but need not be hands-on. Data Engineers are the ones who work on the Big Data eco system. But in smaller organizations, the Data Scientist/Data Engineer has to master both.
32. Data Engineering – roles and responsibilities
• Environment – Linux
• Ad hoc querying and reporting – SQL
• Data ingestion – NiFi and Kafka
• Performing ETL
• Conventional tools such as Informatica
• Programming languages such as Python or Scala
• Spark – heavy volumes of data
• Validations – SQL or Shell Scripting
• Big Data on Cloud – AWS EMR, Databricks, GCP Dataproc
• Visualization – Tableau
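The validation responsibility listed above can be sketched as a simple shell check. The file names and the line-count comparison are illustrative only; a real validation would compare row counts between the source database and the data loaded into HDFS or Hive:

```shell
# Create sample "source" and "target" extracts (illustrative data)
printf 'a\nb\nc\n' > /tmp/source_extract.txt
printf 'a\nb\nc\n' > /tmp/target_load.txt

# Compare record counts between source and target
src=$(wc -l < /tmp/source_extract.txt)
tgt=$(wc -l < /tmp/target_load.txt)
if [ "$src" -eq "$tgt" ]; then
  echo "counts match: $src rows"
else
  echo "MISMATCH: source=$src target=$tgt"
fi
```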
33. Data Engineering – Required Skills
• Linux Fundamentals
• Database Essentials
• Basics of Programming (Python and Scala)
• Big Data eco system tools and technologies
• Building applications at scale
• Data Ingestion
• Streaming Data Pipelines
• Visualization
• Big Data on Cloud
34. Linux Fundamentals
• Overview of Operating System
• Logging into Linux (including passwordless login)
• Basic Linux commands
• Editors such as vi/vim
• Regular expressions
• Processing information using awk/sed
• Basics of shell scripting
• Troubleshooting issues
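A tiny awk/sed illustration of the text-processing bullets above. The sample file, its path, and the field layout are made up for the example:

```shell
# Sample colon-delimited file (illustrative data)
printf 'alice:admin\nbob:dev\n' > /tmp/demo_users.txt

# awk: print the second colon-delimited field of each record
awk -F: '{print $2}' /tmp/demo_users.txt

# sed: expand the abbreviation "dev" to "developer"
sed 's/:dev$/:developer/' /tmp/demo_users.txt
```

The same pattern scales to real log processing: awk selects and reshapes fields, sed rewrites values in place.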
35. Database Essentials
• Overview of Relational Databases
• Normalization
• Creating tables and manipulating data
• Basic SQL
• Analytical Functions
• Relating RDBMS with NoSQL
• Writing queries in MongoDB
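The analytical-functions bullet above can be demonstrated with a window function. SQLite is used here only so the example is self-contained; the table name and data are made up, and the same RANK() OVER syntax works in Hive, Spark SQL, and most RDBMSs:

```python
# Analytical (window) function example: rank employees by salary
# within each department (requires SQLite 3.25+ for window functions)
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("anu", "eng", 120), ("ben", "eng", 100), ("carl", "hr", 90)],
)
rows = conn.execute(
    """
    SELECT name, dept, salary,
           RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk
    FROM employees
    """
).fetchall()
for r in rows:
    print(r)
```

Unlike GROUP BY, the window function keeps every input row while adding the per-partition rank as an extra column.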
36. Basics of Programming (Python and Scala)
• Data Types
• Basic programming constructs
• Pre-defined functions (string manipulation)
• User defined functions (including lambda functions)
• Collections
• Basic I/O operations
• Database operations
• Externalizing properties
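A quick tour of several of the constructs listed above, in Python (Scala has direct equivalents for each; the function names and values here are illustrative):

```python
# Pre-defined functions: string manipulation
order_id = " ORD-001 "
print(order_id.strip().lower())   # ord-001

# User-defined functions, including lambda functions
def to_usd(amount_inr, rate=0.012):
    """Convert an INR amount to USD at an assumed rate."""
    return round(amount_inr * rate, 2)

double = lambda x: x * 2

# Collections: a list transformed with a comprehension
amounts = [1000, 2500, 400]
converted = [to_usd(a) for a in amounts]
print(converted)                  # [12.0, 30.0, 4.8]
print(double(21))                 # 42
```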
37. Big Data eco system tools and technologies
• File Systems Overview
• Processing Engines Overview
• HDFS commands
• YARN
• Hive
• Sqoop
• Flume
• Distributions
38. Databases in Big Data
• Hive Overview
• Creating databases, tables and loading data
• Queries in Hive
• Hive based engines
• File formats
• Integration of Spark SQL with Scala/Python – Overview
39. Building applications at scale
• Overview of Spark
• Reading data from file systems
• Processing data using Core Spark API
• Processing data using Data Sets and/or Data Frames
• Processing data using Spark SQL
• Saving data to file systems
• Development life cycle
• Execution life cycle
• Troubleshooting and performance tuning
40. Apache Spark
• Apache Spark is an in-memory distributed processing framework on top of file systems such as HDFS, S3, Azure Blob, etc.
• There is a bunch of APIs to process the data, called Transformations and Actions. Together they are known as Core Spark.
• Tightly integrated with programming languages such as Scala, Python, Java, and R
• To be proficient in Spark, you need to learn one of the programming languages – preferably Scala or Python
• You often hear about YARN and Mesos – they are just resource management frameworks that run the jobs. Developers typically do not need to worry about them.
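The Transformations-versus-Actions distinction above can be sketched with Python generators. This is an analogy, not the Spark API itself: like Spark transformations, generator pipelines are lazy and nothing executes until a terminal operation (the "action") consumes them:

```python
# Lazy pipeline: each step describes work but does not run it
data = range(1, 6)

mapped = (x * x for x in data)            # "transformation": nothing runs yet
filtered = (x for x in mapped if x > 5)   # still lazy

# "Action": consuming the pipeline triggers the whole computation
result = sum(filtered)
print(result)  # 9 + 16 + 25 = 50
```

Spark works the same way: map and filter only build a plan, and the plan runs when an action such as count or collect is called.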
41. Data Ingestion
• Copying data between RDBMS and HDFS using Sqoop
• Copying data between RDBMS and Hive using Sqoop
• Real time data ingestion using Flume
• Data Ingestion using Kafka
• Copying data between RDBMS and HDFS using Spark JDBC
42. Streaming Data Pipelines
• Integrating data from Flume to Kafka
• Getting a golden copy of data into HDFS using Flume
• Integration of Kafka and Spark Streaming
• Applying analytics rules on in-flight data using Spark Streaming APIs
43. Visualization
• Overview of BI and Visualization tools
• Setting up Tableau Desktop
• Connecting to different data sources
• Creating reports
• Creating dashboards
44. Big Data on Cloud
• Overview of Cloud
• Understanding AWS (Amazon Web Services)
• Setting up EC2 instances
• AWS CLI (Command Line Interface)
• Creating an AWS EMR cluster using both the web console and the CLI
• Step execution
• Running Spark Jobs
• Deploying applications using Azkaban
45. Pre-requisites
• CS or IT graduate or experienced IT professional
• Basic programming and database skills
• Laptop with 4 GB RAM and 64 bit operating system
• High speed internet
46. Targeted Audience
• CS or IT freshers who want to become Data Engineers with Big Data skills, or who aspire to Data Science
• IT professionals who want to transition to Data Engineer roles
• Mainframes Professionals
• Test Engineers
• ETL or Data Warehouse professionals
• BI professionals
• Application Developers who want to gain proficiency in Big Data
47. Certifications
We cover the curriculum for the following certifications:
• CCA 175 Spark and Hadoop Developer
• Databricks/O'Reilly Certified Spark Developer
Please RSVP to the next live session on Meetup:
https://www.meetup.com/itversityin/events/271739702/
48. Free Content and ITVersity Paid Resources
• Free Content
• Videos on YouTube
• Content on GitHub
• Vagrant Box on Vagrant Apps
• Paid Resources
• Labs at nominal cost
• Live Support via Slack or forums
49. Our Success Stories
• Thousands trained on a vast array of skills
• Hundreds got certified – please visit http://discuss.itversity.com/c/certifications/success-stories
• Many successfully transitioned to Data Engineer roles
• Many Data Scientists added the necessary Data Engineering skills
50. We believe that training related to open source should also be open source
• https://labs.itversity.com - low-cost Big Data lab to learn technologies
• http://discuss.itversity.com - support while learning
• http://www.itversity.com - website for content
• https://youtube.com/itversityin
• https://github.com/dgadiraju