Lambda Architecture is a useful framework for thinking about the design of big data applications. This framework was initially developed at Twitter. In this presentation you will learn, through concrete examples, how to build and deploy scalable, fault-tolerant applications, with a focus on Big Data and Hadoop.
This presentation was delivered at the OOP conference, Munich, Feb 2016
First of all, since we will be talking about Big Data applications… let’s look at some use cases that are very common…
High level
New apps, big data or not, must be “fault tolerant”…
and the Lambda Architecture has been built for that… but at which level?
Hardware. Commodity hardware: we know that it will fail.
So we compensate for that using software: HDFS/MapR-FS, HBase/MapR-DB, Zookeeper, … you have infrastructure to support failure.
What about the developer? The human being is becoming the weakest link.
So infrastructure using Hadoop/MapR/distributed software is fault tolerant,
but we still need to deal with HUMAN ERROR… since some of us are making mistakes.
the goal is to “recover from it”
WE ALL MAKE MISTAKES… look at these big names
Facebook apologises after crash: Social network site went down for the third time in a month due to a 'configuration issue'
Storm is a realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
Cascalog: a fully-featured data processing and querying library for Clojure or Java.
ElephantDB: a distributed database specialized in exporting key/value data from Hadoop.
As you can guess, in application development
when we talk about architecture it is all about LAYERS
So we can see that in this case the application generates “EVENTS”.
Everything we do generates events: a credit card payment, a commit to Git, a web page click, a tweet, ….
The events are used to manipulate the “data”, but we can also use the events as the main data.
The events you generate are immutable: they have “happened”, and they are time-based.
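The idea above can be sketched in a few lines of plain Python: an event is an immutable, timestamped fact. The `Event` class and its fields below are purely illustrative, not from any specific framework.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# A hypothetical event record: immutable (frozen) and time-based.
@dataclass(frozen=True)
class Event:
    kind: str          # e.g. "credit_card_payment", "git_commit", "page_click"
    payload: str
    timestamp: datetime

event = Event("page_click", "/home", datetime.now(timezone.utc))

# Events are facts: once created, they cannot be mutated.
try:
    event.payload = "/checkout"
except Exception as e:
    print(type(e).__name__)  # FrozenInstanceError
```

Because events can never be rewritten, the full event log remains a trustworthy source from which everything else can be derived.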
Data Locality
Resilient Distributed Datasets, or RDDs, are the primary abstraction in Spark. They are collections of objects distributed across the nodes of a cluster, and data operations are performed on RDDs.
Once created, RDDs are immutable.
You can also persist, or cache, RDDs in memory or on disk.
Spark RDDs are fault-tolerant. If a given node or task fails, the RDD can be reconstructed automatically on the remaining nodes and the job will complete.
There are two types of data operations you can perform on an RDD, transformations and actions.
A transformation will return an RDD. Since RDDs are immutable, the transformation will return a new RDD.
An action will return a value.
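The transformation/action distinction can be sketched with a plain Python list standing in for an RDD (an analogy only; no Spark is required to run it):

```python
from functools import reduce

data = [1, 2, 3, 4]

# "Transformations" (like rdd.map or rdd.filter) return a NEW collection;
# the original is never mutated.
doubled = list(map(lambda x: x * 2, data))          # [2, 4, 6, 8]
evens = list(filter(lambda x: x % 2 == 0, data))    # [2, 4]

# "Actions" (like rdd.reduce or rdd.count) return a plain value.
total = reduce(lambda a, b: a + b, data)            # 10
count = len(data)                                   # 4

print(doubled, evens, total, count)
```

In actual Spark the same chain would be written against an RDD (e.g. `rdd.map(...)`, `rdd.filter(...)`, `rdd.reduce(...)`), with one important difference the list analogy cannot show: Spark transformations are lazy and only execute when an action is called.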
● Socket
● Kafka
● Flume
● HDFS
● MQ (ZeroMQ...)
● Twitter
● ...
● Or a custom implementation of Receiver
Store all events as raw data
Create Intermediate Views
Errors are fixed using re-computation
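The three points above can be sketched as a tiny batch layer in plain Python: raw events are kept forever, and the view is always rebuilt from scratch, so fixing a bug in the view logic and recomputing repairs the view. The event shape and function name are illustrative assumptions.

```python
# Store all events as raw, immutable data (here, a simple in-memory list).
raw_events = [
    {"user": "alice", "amount": 10},
    {"user": "bob",   "amount": 5},
    {"user": "alice", "amount": 7},
]

def compute_view(events):
    """Intermediate view: total amount per user, rebuilt from scratch.

    If this logic were wrong, we would fix it and simply recompute
    over the raw events; no stored state needs to be patched.
    """
    view = {}
    for e in events:
        view[e["user"]] = view.get(e["user"], 0) + e["amount"]
    return view

view = compute_view(raw_events)
print(view)  # {'alice': 17, 'bob': 5}
```

This is why the raw event log, not the derived view, is the system of record: any view is disposable and re-derivable.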
Based on Scalable and Reliable Storage
Distributed File System
Optimized formats (Parquet, Avro, Protobuf, …)
NoSQL Engines
HBase, MapR-DB, Elasticsearch, Cassandra, MongoDB, …
Distributed Processing
Spark
Drill (SQL)