Presentation on the struggles with traditional architectures and an overview of the Lambda Architecture, using Spark to process and analyze massive amounts of both batch and streaming data
2. Agenda
• Struggles in Traditional Architectures
• What is the Lambda Architecture?
• Spark: Unified Development Framework
• Demonstration: Spark Batch & Streaming jobs in Talend
3. Traditional Architecture (diagram)
Historical data: DBMS / EDW, Hadoop, cloud dataset
New data: web logs, Internet of Things, social media
4. Situation
I need fast access to historical data on the fly, combined with real-time data from the stream, for analysis
6. Lambda Architecture
A data processing architecture designed
to handle massive quantities of data by
taking advantage of both batch and
stream-processing methods
https://en.wikipedia.org/wiki/Lambda_architecture
14. Data Fabric (diagram)
Products on one platform: Application Integration, Cloud Integration, Data Integration, Big Data Integration, Master Data Management, and Self-Service (discovery & cleansing for business users)
• STUDIO: comprehensive Eclipse-based user interface
• REPOSITORY: consolidated metadata & project information
• DEPLOYMENT: web-based deployment & scheduling
• EXECUTION: same container for batch processing, message routing & services
• MONITORING: single web-based monitoring console
15. Real Time Big Data Integration and Unlimited Scale
Visually develop jobs that run 100% on Spark
• 5X faster in independent benchmarks
• 10X developer productivity gained over hand-coding Spark
• 100X faster with in-memory processing
900 components, including 100+ new Spark components
• HDFS, RDBMS, NoSQL, cloud storage, transformation, messaging, in-memory analytics & machine learning recommendations, and much more
• In-memory data caching & “windowed” computations
• Click to enable Spark Streaming for real-time data processing
1st Data Integration Platform on Spark
Benefits: make decisions faster; tremendous developer productivity.
16. Talend Demonstration
1. Talend Studio User Interface
2. Building a Spark Job
3. Building a Real-time Recommendation pipeline
4. Introduction to the Talend Real-time Big Data
Sandbox
17. For More Information
- Download the Talend Sandbox!
http://www.talend.com/products/real-time-big-data
- Check the Apache Spark Project
http://spark.apache.org/
- Find out more about the Lambda Architecture
http://lambda-architecture.net/
Editor's Notes
Title Slide
Historical Side:
Data from RDBMS
Data from Hadoop
Data from a cloud environment (e.g., Salesforce)
New Side:
Apache Web Logs
Sensor Data: Internet of Things
Social Media Data: Facebook, Twitter, etc…
Batch layer: manages the master dataset, an immutable, append-only set of raw data, using a distributed processing system.
Speed layer: processes data in a streaming fashion with low latency; real-time views are built from the most recent data.
Serving layer: stores the results from the batch and speed layers and responds to queries in a low-latency, ad-hoc way.
Robust and fault tolerant—The batch layer handles failover when machines go down using replication and restarting computation tasks on other machines. The serving layer uses replication under the hood to ensure availability when servers go down.
Scalable—Both the batch layer and serving layers are easily scalable. They can both be implemented as fully distributed systems, whereupon scaling them is as easy as just adding new machines.
Extensible—Adding a new view is as easy as adding a new function of the master dataset. Since the master dataset can contain arbitrary data, new types of data can be easily added. If you want to tweak a view, you don’t have to worry about supporting multiple versions of the view in the application. Rather you can simply recompute the entire view from scratch.
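The three layers above can be sketched in plain Python (not Spark; all names and data here are illustrative): a batch layer that recomputes a view from the immutable master dataset, a speed layer that folds recent events into a real-time view, and a serving layer that merges both to answer queries.

```python
# Illustrative sketch of the Lambda Architecture layers (plain Python,
# not Spark); all names and data structures here are hypothetical.

from collections import Counter

# Master dataset: immutable, append-only raw events.
master_dataset = [
    {"user": "alice", "action": "click"},
    {"user": "bob", "action": "click"},
    {"user": "alice", "action": "view"},
]

def batch_layer(events):
    """Recompute the batch view from scratch over the full dataset."""
    return Counter(e["user"] for e in events if e["action"] == "click")

# Speed layer: incrementally folds in events that arrived after the
# last batch run, keeping a real-time view of only the recent data.
realtime_view = Counter()

def speed_layer(event):
    if event["action"] == "click":
        realtime_view[event["user"]] += 1

def serving_layer(user, batch_view):
    """Answer queries by merging the batch and real-time views."""
    return batch_view[user] + realtime_view[user]

batch_view = batch_layer(master_dataset)            # periodic, high-latency
speed_layer({"user": "alice", "action": "click"})   # low-latency update
print(serving_layer("alice", batch_view))           # merged views -> 2
```

Note how "tweaking a view" is just changing `batch_layer` and recomputing: no view versioning is needed, which is the extensibility property described above.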
• Developed in 2009 at UC Berkeley AMPLab, open sourced in 2010, and became a top-level Apache project in February 2014
• Fast, distributed, scalable, and fault-tolerant cluster compute system
• Enables low-latency, complex analytics
• Empowers users to iterate through the data by utilizing the in-memory cache
• Logistic regression runs up to 100x faster than Hadoop MapReduce in memory
• We’re able to train exact models without doing any approximation.
• Can be set up as:
- Standalone
- On YARN
- In MapReduce
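The iterative, in-memory style of computation these notes describe is expressed in Spark as chained transformations followed by an action. A plain-Python analogue of a classic word-count pipeline (real Spark code would start from a `SparkContext` and run distributed; this is only a sketch of the API's shape, with illustrative names):

```python
# Plain-Python analogue of a Spark RDD pipeline (illustrative only --
# real Spark would be lazy and distributed; this runs eagerly, locally).

lines = ["spark is fast", "spark is distributed", "hadoop is batch"]

# Transformations, mirroring flatMap -> map -> reduceByKey:
words = [w for line in lines for w in line.split()]   # flatMap
pairs = [(w, 1) for w in words]                       # map

def reduce_by_key(pairs):
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value
    return counts

counts = reduce_by_key(pairs)                         # reduceByKey
print(counts["spark"])  # action: materialize the result -> 2
```

In real Spark the intermediate results can be cached in memory (`.cache()`), which is what makes repeated iteration over the same data fast.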
• An extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
• Spark Streaming receives streaming input and divides the data into batches, which are then processed by the Spark engine.
• As a result, developers can maintain the same Java/Scala code in the batch and speed layers.
DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source or the processed data stream generated by transforming the input stream. Internally, it is represented by a continuous sequence of RDDs, Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval.
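The micro-batch model behind DStreams can be illustrated in plain Python: a continuous stream is cut into fixed intervals, each interval becomes a small batch (an RDD in real Spark), and the same function a batch job would use is applied to each one. All names and the interval size here are illustrative.

```python
# Illustrative micro-batching in plain Python: how Spark Streaming's
# DStream slices a live stream into per-interval batches (RDDs) and
# applies the same code used for batch processing to each slice.

from collections import Counter

def count_words(batch):
    """The same logic a batch job would run over historical data."""
    return Counter(w for line in batch for w in line.split())

stream = ["a b", "b c", "c d", "d e"]   # stand-in for a live source
interval = 2                            # events per micro-batch

# Divide the stream into micro-batches and process each with the
# batch-layer function -- one codebase for both layers.
micro_batches = [stream[i:i + interval]
                 for i in range(0, len(stream), interval)]
results = [count_words(b) for b in micro_batches]
print(results[0]["b"])  # counts within the first interval only -> 2
```

Each element of `results` corresponds to one interval's RDD; "windowed" computations would simply aggregate over several consecutive entries.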
Delivering value through an engineered suite of products that work together in the same environment, while also extending your services into the cloud.