Hadoop Real Time Processing Systems
Objective
Apache Storm is a free and open-source distributed real-time computation system.
Storm makes it easy to reliably process unbounded streams of data, doing for
real-time processing what Hadoop did for batch processing. The main purpose of
this course is to provide the knowledge and skills needed for real-time analytics
on a wide variety of streaming data.
Apache Spark is an open-source data analytics cluster computing framework.
Spark is not tied to the two-stage MapReduce paradigm, and promises
performance up to 100 times faster than Hadoop MapReduce for certain
applications. Spark provides primitives for in-memory cluster computing that
allow user programs to load data into a cluster's memory and query it repeatedly,
making it well suited to machine learning algorithms.
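The "load into memory and query repeatedly" idea can be sketched without a Spark installation. The following is a plain-Scala analogy of Spark's `cache()` behaviour, not Spark's actual API (real code would call something like `sc.textFile(...).cache()` on a `SparkContext`):

```scala
// Hedged plain-Scala analogy of rdd.cache(): the dataset is read from its
// source once, then several queries reuse the in-memory copy.
object CacheAnalogy {
  var loads = 0                                  // counts how often the "source" is read
  def loadDataset(): Seq[Int] = { loads += 1; (1 to 100).toSeq }

  def main(args: Array[String]): Unit = {
    lazy val cached = loadDataset()              // "cached RDD": materialized on first use
    val q1 = cached.sum                          // first query triggers the load
    val q2 = cached.count(_ % 2 == 0)            // second query reuses the cached copy
    println(s"sum=$q1 evens=$q2 loads=$loads")   // loads stays at 1
  }
}
```

Running several queries while `loads` stays at 1 is exactly the win over MapReduce, which would re-read its input from disk for each job.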
The participants will start by learning the what and why of Storm and how Storm
is used in real-time analytics. They will then install Storm on their
systems and work with Spouts and Bolts. After that they will be introduced to
Spark, a successor to MapReduce, using Scala. The participants will learn:
1. Hadoop Gen 2 installation
2. Introduction to YARN and how it works
3. Where to use Storm for real-time analytics
4. Setting up an Apache Storm cluster on your system
5. The Storm technology stack and groupings
6. Implementing Spouts and Bolts
7. Multiple real-world projects using Storm
8. Concepts and features of RDDs
9. Transformations and actions
10. How Spark works in a cluster
Note: The course will be 40% theoretical discussion and 60% hands-on work.
Duration: 30 hours
Audience
This course is designed for anyone who:
1. Wants to architect a project using Spark.
2. Is an ETL or data warehousing developer looking for an alternative approach to
data analysis and storage.
3. Is a data engineer.
Pre-Requisites
1. Basic knowledge of Java.
2. Basic understanding of Hadoop and its working.
Course Outline
1. Hadoop & YARN Overview
• Anatomy of Hadoop Cluster, Installing and Configuring Plain Hadoop
• What is Big Data Analytics
• Batch vs. Real-Time Processing
• Limitations of Hadoop
• Storm for Real Time Analytics
2. Storm Basics
• Installation of Storm
• Components of Storm
• Properties of Storm
3. Storm Technology Stack and Groupings
• Storm Running Modes
• Creating First Storm Topology
• Topologies in Storm
4. Spouts and Bolts
• Reliable vs Unreliable Messages
• Getting Data
• Bolt Lifecycle
• Bolt Structure
• Reliable vs Unreliable Bolts
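The spout/bolt pipeline covered in this module can be sketched in plain Scala. This is an analogy of what a word-count topology computes, not Storm's API: real Storm code would extend `BaseRichSpout` and `BaseBasicBolt` and wire the components together with a `TopologyBuilder`.

```scala
// Hedged plain-Scala sketch of a Storm word-count topology:
// a spout emits raw tuples, and each bolt is one processing step.
object MiniTopology {
  // "spout": the data source, emitting raw tuples (here, sentences)
  def spout: Iterator[String] = Iterator("the cow", "the dog")

  // split "bolt": tokenizes each incoming sentence into words
  def splitBolt(sentence: String): Seq[String] = sentence.split(" ").toSeq

  // count "bolt": aggregates word frequencies from the upstream bolt
  def countBolt(words: Iterator[String]): Map[String, Int] =
    words.toSeq.groupBy(identity).map { case (w, ws) => w -> ws.size }

  def main(args: Array[String]): Unit = {
    val counts = countBolt(spout.flatMap(splitBolt))
    println(counts)   // counts: the -> 2, cow -> 1, dog -> 1
  }
}
```

The difference in real Storm is that spouts and bolts run as long-lived tasks on a cluster and the stream never ends; the groupings module covers how tuples are routed between them.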
5. Spark Basics
• Batch Analytics
• Real Time Analytics Options
• Streaming Data – Storm
• In Memory Data – Spark
• Modes of Spark
6. Spark Installation
• Spark Installation
• Overview of Spark on a cluster
• Spark Standalone Cluster
7. Working with RDD
• RDDs
• Transformations in RDD
• Actions in RDD
• Loading Data in RDD
• Saving Data through RDD
• Key-Value Pair RDD
• MapReduce and Pair RDD Operations
• Scala and Hadoop Integration (Hands-on)
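The core RDD contract in this module can be previewed with plain Scala collections. This is a hedged analogy, not Spark's API (a real RDD comes from a `SparkContext`): transformations such as `map` and `filter` lazily describe work, actions such as `sum` force evaluation, and pair operations reduce values by key.

```scala
// Hedged plain-Scala sketch of lazy transformations, eager actions,
// and a reduceByKey-style pair operation.
object RddSketch {
  def main(args: Array[String]): Unit = {
    val data = (1 to 10).view                 // like sc.parallelize(1 to 10)
    val doubled = data.map(_ * 2)             // transformation: nothing runs yet
    val total = doubled.sum                   // action: triggers the computation
    println(total)                            // 110

    // Pair-"RDD" style reduceByKey, sketched with groupBy + sum:
    val pairs = Seq(("a", 1), ("b", 1), ("a", 1))
    val byKey = pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
    println(byKey)                            // counts: a -> 2, b -> 1
  }
}
```

In real Spark the same shape appears as `rdd.map(...).reduceByKey(_ + _)`, with the work distributed across the cluster rather than run on one local collection.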
8. Spark integration with Hive
9. Spark Streaming
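Spark Streaming's central idea, chopping an unbounded stream into small batches and running the same batch logic on each, can be sketched in plain Scala. This is an analogy only; real code would use a `StreamingContext` and DStreams with a configured batch interval.

```scala
// Hedged plain-Scala sketch of the micro-batch model behind Spark Streaming.
object MicroBatch {
  def main(args: Array[String]): Unit = {
    val stream = (1 to 9).iterator            // stand-in for incoming events
    val batches = stream.grouped(3)           // "batch interval" of 3 events
    val perBatchSums = batches.map(_.sum).toSeq
    println(perBatchSums)                     // List(6, 15, 24)
  }
}
```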