This document discusses using Apache Spark and Cassandra for IoT applications. Cassandra is a distributed database that is highly available, horizontally scalable, and supports multiple datacenters with no single point of failure. It is well-suited for storing time series sensor data. Spark can be used for both batch and stream processing of data in Cassandra. The Spark Cassandra Connector allows Cassandra tables to be accessed as Spark RDDs. Real-time sensor data can be ingested using Spark Streaming and stored in Cassandra. Common use cases with this architecture include real-time analytics on streaming data and batch analytics on historical sensor data.
3. Cassandra for IoT
• Distributed database
• Highly available
• Horizontally and linearly scalable
• Multi-datacenter support
• No single point of failure
• Chooses availability over strong consistency
[Figure: ring of four nodes (Node 1 to Node 4) with token ranges 1-25, 26-50, 51-75, and 76-0]
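The token ranges on this slide (1-25, 26-50, 51-75, 76-0) suggest how a partition key is routed to a node. The following is a minimal sketch of that idea, not Cassandra's actual implementation: the pairing of Node 1 to Node 4 with the ranges in order, the ring size of 100 tokens, and the MD5-based `token_for_key` helper are all assumptions made for illustration. Cassandra itself uses a far larger token space and its own partitioner.

```python
import bisect
import hashlib

# Assumed pairing of nodes to the token ranges shown on the slide:
# each entry is (last token of the range, owning node).
RING = [(25, "Node 1"), (50, "Node 2"), (75, "Node 3"), (99, "Node 4")]
RING_SIZE = 100  # assumption: tokens run 0..99, so "76-0" wraps around


def owner_of_token(token: int) -> str:
    """Return the node whose range contains the given token."""
    if token == 0:
        # Token 0 closes the wrap-around range 76-0.
        return "Node 4"
    upper_bounds = [upper for upper, _ in RING]
    index = bisect.bisect_left(upper_bounds, token)
    return RING[index][1]


def token_for_key(partition_key: str) -> int:
    """Hash a partition key onto the toy ring (MD5 is only a stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def owner_of_key(partition_key: str) -> str:
    """Route a partition key to the node that owns its token."""
    return owner_of_token(token_for_key(partition_key))
```

Because every node can compute the owner of a key from the ring alone, no central coordinator is needed, which fits the "no single point of failure" bullet.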
4. Great for Time Series Data
CREATE TABLE sensors (
    sensorId uuid,
    time timeuuid,
    metricName text,
    metricValue double,
    -- sensorId is the partition key, time is the clustering column
    PRIMARY KEY (sensorId, time)
);
[Figure: one row per sensor id with cells t1 to t11, stored sequentially on disk]
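To illustrate why a layout keyed by `(sensorId, time)` suits time series, here is a small in-memory sketch. It is only a toy model, not how Cassandra stores data: the `partitions` dictionary, the `insert` and `read_range` helpers, and the use of plain integers in place of `timeuuid` values are assumptions made for the example.

```python
import bisect
from collections import defaultdict

# Toy model of the table: partition key (sensorId) -> rows kept sorted by
# the clustering column (time). Plain integers stand in for timeuuid values.
partitions = defaultdict(list)


def insert(sensor_id, time, metric_name, metric_value):
    """Insert a row, keeping the partition ordered by time."""
    bisect.insort(partitions[sensor_id], (time, metric_name, metric_value))


def read_range(sensor_id, start, end):
    """Return rows with start <= time < end as one contiguous slice."""
    rows = partitions[sensor_id]
    lo = bisect.bisect_left(rows, (start,))
    hi = bisect.bisect_left(rows, (end,))
    return rows[lo:hi]


insert("sensor-a", 3, "temperature", 21.5)
insert("sensor-a", 1, "temperature", 20.0)
insert("sensor-a", 2, "temperature", 20.7)
insert("sensor-b", 1, "temperature", 18.2)

recent = read_range("sensor-a", 2, 4)
```

All readings of one sensor sit next to each other in time order, so a time-range query for a single sensor becomes one sequential read rather than many scattered lookups.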
6. What Is Apache Spark
• Open source since 2010, Apache project since 2013
• Data processing framework
  • Batch processing
  • Stream processing
7. Why Use Spark
• Fast
  • Up to 100 times faster than Hadoop MapReduce for in-memory workloads
  • Heavy use of in-memory processing
  • Scales linearly by adding nodes
• Easy
  • Scala, Java, and Python APIs
  • Clean code (e.g. with lambdas in Java 8)
  • Rich API: map, reduce, filter, groupBy, sort, union, join, reduceByKey, groupByKey, sample, take, first, count
• Fault-tolerant
  • Easily reproducible
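As a rough idea of what a chain of operations such as filter, map, and reduceByKey computes, the sketch below mimics it with plain Python built-ins. It does not use Spark: `reduce_by_key` is a hypothetical helper written for this example, and the `readings` list is made-up sample data.

```python
# Hypothetical sample readings: (sensor id, temperature).
readings = [("s1", 20.0), ("s2", 18.5), ("s1", 22.0), ("s2", 19.5), ("s1", 21.0)]


def reduce_by_key(pairs, combine):
    """Merge all values that share a key with the given two-argument function."""
    merged = {}
    for key, value in pairs:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged


# filter -> map -> reduce by key, written with Python built-ins.
valid = filter(lambda kv: kv[1] > 0, readings)
with_count = map(lambda kv: (kv[0], (kv[1], 1)), valid)
sums = reduce_by_key(with_count, lambda a, b: (a[0] + b[0], a[1] + b[1]))
averages = {sensor: total / count for sensor, (total, count) in sums.items()}
```

Carrying a `(sum, count)` pair through the reduction lets the average be computed in a single pass, and the combining function can be applied in any grouping, which is what makes this style easy to distribute.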
8. Easily Reproducible?
• RDDs (Resilient Distributed Datasets)
  • Read-only description of a collection of objects
  • Partitioned for distribution
  • Defined through transformations
  • Can be rebuilt automatically on failure
• Operations
  • Transformations (map, filter, reduceByKey, ...) -> new RDD
  • Actions (count, collect, reduce, save)
• Only actions start processing!
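The points on this slide (transformations only describe work, actions start it, and lost results can be rebuilt) can be sketched in a few lines of plain Python. `LazyDataset` is a hypothetical toy class, not Spark's RDD; it only assumes that a dataset is a re-readable source plus a recorded list of transformations.

```python
class LazyDataset:
    """Toy stand-in for an RDD: a source plus a recorded chain of steps."""

    def __init__(self, source, lineage=()):
        self._source = source      # function that re-reads the input data
        self._lineage = lineage    # recorded transformations, nothing computed yet

    # Transformations only extend the lineage and return a new dataset.
    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def _compute(self):
        """Replay the lineage from the source; also usable after a failure."""
        items = iter(self._source())
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions trigger the actual computation.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []  # records when the user function really runs


def to_fahrenheit(celsius):
    calls.append(celsius)
    return celsius * 9 / 5 + 32


celsius = LazyDataset(lambda: [20, 25, 30])
hot = celsius.map(to_fahrenheit).filter(lambda f: f > 70)
calls_before_action = len(calls)   # nothing has run yet
result = hot.collect()             # the action replays the lineage
```

Since each dataset is just a description of how to derive it, recomputing a lost result means replaying the same lineage, which is the sense in which the data is "easily reproducible".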
14. Stream Processing With Spark Streaming
• Real-time processing using micro-batches
• Supported sources: files, TCP, MQTT, Kafka, Twitter, ...
• Data as a discretized stream (DStream)
• Same programming model as for batches
• All operations of Spark Core, SQL, and MLlib
• Stateful operations and sliding windows
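Micro-batching and sliding windows can be shown without a cluster. The sketch below is a plain-Python approximation, not the Spark Streaming API: `micro_batches`, `sliding_window`, the one-second batch interval, and the sample `events` are all assumptions chosen for illustration.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    last = max(batches) if batches else -1
    # Emit empty batches too, so every interval is represented.
    return [batches.get(i, []) for i in range(last + 1)]


def sliding_window(batches, window_length, slide):
    """Yield the values of the last `window_length` batches, every `slide` batches."""
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_length)
        yield [value for batch in batches[start:end] for value in batch]


# Hypothetical events: (seconds since start, reading).
events = [(0.2, 1), (0.9, 2), (1.5, 3), (3.1, 4), (3.8, 5)]
batches = micro_batches(events, batch_interval=1)
window_sums = [sum(w) for w in sliding_window(batches, window_length=2, slide=1)]
```

Each batch is an ordinary collection, which is why the same operations used for batch jobs can be applied to a stream one interval at a time.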
17. Use Cases for Spark and Cassandra in IoT: Ingestion
• Spark Streaming
  • Continuous data streams
  • MQTT, Kafka, ZeroMQ, ...
  • Easy to make reliable
• Spark Core
  • Existing data
  • SQL databases, CSV, JSON, ...
• Use the same programming model or even the same code!
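The idea of reusing the same code for existing data and for continuous streams can be sketched as follows. This is plain Python rather than Spark, under the assumption that both a batch and a micro-batch are simply iterables of records; `max_per_sensor` and the sample data are hypothetical.

```python
def max_per_sensor(records):
    """Shared logic: highest value per sensor in any iterable of records."""
    result = {}
    for sensor_id, value in records:
        result[sensor_id] = max(value, result.get(sensor_id, value))
    return result


# Batch: existing data, e.g. rows already loaded from a CSV file or a table.
historical = [("s1", 20.0), ("s2", 18.0), ("s1", 23.5)]
batch_result = max_per_sensor(historical)

# Stream: the same function applied to each incoming micro-batch.
incoming_batches = [[("s1", 21.0)], [("s2", 19.5), ("s1", 22.0)]]
stream_results = [max_per_sensor(batch) for batch in incoming_batches]
```

Keeping the business logic in a function that knows nothing about where its records come from is what allows one code path to serve both ingestion routes.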
18. Use Cases for Spark and Cassandra in IoT: Analyses
• Real-time analysis
  • React to events
  • Join with existing data
  • Apply ML models to events
• Batch analysis
  • Scheduled jobs
  • Analytics on historical data
  • Train ML models
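One way the two kinds of analysis can fit together is to train a simple model in a batch job and apply it to each incoming event. The sketch below assumes a deliberately simple "model" (a per-sensor mean and standard deviation) instead of a real ML library; `train`, `handle_event`, the threshold of 3.0, and all sample values are hypothetical.

```python
from statistics import mean, pstdev

# Existing data, e.g. sensor metadata and past readings (hypothetical values).
locations = {"s1": "boiler room", "s2": "warehouse"}
history = {"s1": [20.0, 21.0, 19.0, 20.0], "s2": [5.0, 5.0, 6.0, 4.0]}


def train(history):
    """Batch step: learn a (mean, standard deviation) baseline per sensor."""
    return {sensor: (mean(values), pstdev(values)) for sensor, values in history.items()}


def handle_event(model, sensor_id, value, threshold=3.0):
    """Real-time step: score one event and join it with existing data."""
    baseline, spread = model[sensor_id]
    is_anomaly = abs(value - baseline) > threshold * spread
    return {"sensor": sensor_id, "location": locations[sensor_id], "anomaly": is_anomaly}


model = train(history)                    # scheduled batch job
normal = handle_event(model, "s1", 20.5)  # reacting to incoming events
alert = handle_event(model, "s1", 35.0)
```

The batch job can be rerun on a schedule to refresh `model`, while the event handler stays cheap enough to run on every reading.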