Introduction To Big Data Pipelining (Berlin Meetup, 31 March 2016, v0.3)
1. Simon Ambridge
Data Pipelines With Spark & DSE
An Introduction To Building Agile, Flexible and Scalable Big Data and Data Science Pipelines
2. Certified Apache Cassandra and DataStax enthusiast who enjoys
explaining that the traditional approaches to data management just
don’t cut it anymore in the new always on, no single point of failure,
high volume, high velocity, real time distributed data management
world.
Previously 25 years implementing Oracle relational data management
solutions. Certified in Exadata, Oracle Cloud, Oracle Essbase, Oracle
Linux and OBIEE
simon.ambridge@datastax.com
@stratman1958
Simon Ambridge
Pre-Sales Solution Engineer, Datastax UK
3. Big Data Pipelining: Why Analytics?
• To be able to react to customers faster and with more accuracy
• To reduce the business risk through more accurate understanding of the
market
• Optimise return on marketing investment via better targeted campaigns
• Faster time to market with the right products at the right time
• Improve efficiency – commerce, plant and people
A recent survey found that more than half of respondents wanted real-time analytics:
85% wanted analytics to handle ‘real-time’ data changing at <1s intervals
4. Big Data Pipelining: Classification
Big Data Pipelines can mean different things to different people
Big, Static Data
Repeated analysis on a static but massive dataset
• Typically an element of research – e.g. genomics, clinical trial, demographic data
• Typically repetitive and iterative, shared amongst data scientists for analysis
Fast, Streaming Data
Real-time analytics on streaming data
• Typically an industrialised process – e.g. sensors, tick data, bioinformatics, transactional data, real-time personalisation
• Happening in real time, and usually cannot be dropped or lost
6. Static Analytics: Traditional Approach
Typical traditional ‘static’ data analysis model:
Data → Sampling → Modeling → Tuning → Interpret → Reporting → Results
Repeated iterations, with re-sampling, at each stage
Run/debug cycle can be slow
8. Static Analytics: Scale Up Challenges
• Sampling and analysis often run on a single machine
• CPU and memory limitations – finite resources on a single machine
• Offers limited sampling of a large dataset because of data size limitations
• Multiple iterations over large datasets are frequently not an ideal approach
9. Static Analytics: Big Data Problems
• Data really is getting big – and getting bigger!
• The number of data sources is exploding!
• More data is arriving faster!
Scaling up is becoming impractical – physical limits
• The validity of the analysis becomes obsolete, faster
• Analysis too slow to get any real ROI from the data
10. Big Data Analytics: Big Data Needs
We need scalable infrastructure + distributed technologies
• Data volumes can be scaled – we can distribute the data
across multiple low-cost machines or cloud instances
• Faster processing – distributed smaller datasets
• More complex processing – distributed across multiple
machines
• No single point of failure
11. Big Data Analytics: DSE Delivers
Building a distributed data processing framework can be a complex task!
It needs to be:
• Scalable
• Fast, with in-memory processing
• Able to handle real-time or streaming data feeds
• Able to handle high throughput and low latency
• Ideally able to handle ad-hoc queries
• Ideally replicated across multiple data centers for resiliency
12. DataStax Enterprise: Standard Edition
• Certified Cassandra – delivers trusted, tested and certified versions of Cassandra ready for production environments.
• Expert Support – answers and assistance from the Cassandra experts for all production needs.
• Enterprise Security – supplies full protection for sensitive data.
• Automatic Management Services – automates key maintenance functions to keep the database running smoothly.
• OpsCenter – provides advanced management and monitoring functionality for production applications.
13. DataStax Enterprise: Max Edition
• Advanced Analytics – provides the ability to run real-time and batch analytic operations on Cassandra data, as well as integrate DSE with external Hadoop deployments.
• Enterprise Search – supplies built-in enterprise and distributed search capabilities on Cassandra data.
• In-Memory Option – delivers all the benefits of Cassandra to in-memory computing.
• Workload Isolation – allows analytics and search functions to run separately from transactional workloads, with no need to ETL data to different systems.
14. Intro To Cassandra: THE Cloud Database
What is Apache Cassandra?
• Originally started at Facebook in 2008
• Top level Apache project since 2010
• Open source distributed database
• Clusters can handle large amounts of data (PBs)
• Performant at high velocity
• Extremely resilient:
• Across multiple data centres
• No single point of failure
• Continuous Availability, disaster avoidance
• Enterprise Cassandra platform from Datastax
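To make “no single point of failure” concrete, here is a toy sketch in plain Python (illustrative names only, not a real driver API) of Cassandra-style placement: each partition key is hashed onto a token ring, and the row is stored on a replication factor’s worth of nodes, so any single node can fail without losing data. Cassandra itself uses Murmur3 hashing and vnodes; md5 and a four-node list stand in here for brevity.

```python
import hashlib

# Hypothetical four-node cluster with replication factor 3.
NODES = ["node-a", "node-b", "node-c", "node-d"]
RF = 3

def token(key: str) -> int:
    """Hash a partition key onto the ring (Cassandra uses Murmur3; md5 here)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def replicas(key: str, rf: int = RF):
    """Primary node plus the next rf-1 neighbours clockwise on the ring."""
    start = token(key) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(rf)]

# Every key lands on 3 distinct nodes: one can fail, two copies remain.
owners = replicas("sensor-42")
```

The same hash always yields the same owners, which is how any coordinator node can route a read or write without a central master.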
15. Intro To Spark: THE Analytics Engine
What is Apache Spark?
• Started at UC Berkeley in 2009
• Apache Project since 2010
• Distributed in-memory processing
• Rich Scala, Java and Python APIs
• Fast - 10x-100x faster than Hadoop MapReduce
• 2x-5x less code than R
• Batch and streaming analytics
• Interactive shell (REPL)
• Tightly integrated with DSE
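To make the “2x-5x less code” point concrete, the classic word count can be sketched as the flatMap / map / reduceByKey chain a Spark job expresses – here in plain local Python rather than distributed Spark, with illustrative data:

```python
lines = ["big data pipelines", "big fast data"]

# flatMap: split every line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word (Spark does this per partition and
# then shuffles; a single dict plays that role locally)
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n
```

In Spark/Scala the same pipeline is a one-liner over an RDD; the point is that each stage is a pure transformation the engine can distribute.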
16. Spark: Daytona Gray Sort Contest
October 2014
Daytona Gray benchmark tests how fast a system can sort 100 TB of data
(1 trillion records)
• Previous world record held by Hadoop MapReduce cluster of 2100 nodes, in 72
minutes
• Spark completed the benchmark in 23 minutes on just 206 EC2 nodes. All the
sorting took place on disk (HDFS), without using Spark’s in-memory cache (3X
faster using 10X fewer machines)
• Spark also sorted 1 PB (10 trillion records) on 190 machines in under 4 hours,
beating the previous Hadoop MapReduce result of 16 hours on 3800 machines (4X
faster using 20X fewer machines)
17. DataStax Enterprise: Analytics Integration
Loose integration – separate Cassandra and Spark clusters, with ETL between them:
• Data separate from processing
• Millisecond response times
Tight integration – Spark running alongside Cassandra in DSE:
• Apache Cassandra for Distributed Persistent Storage
• Integrated Apache Spark for Distributed Real-Time Analytics
• Data locality – analytics nodes close to data, no ETL required
• Microsecond response times
“Latency when transferring data is unavoidable. The trick is to reduce the latency to as close to zero as possible…”
18. Intro To Parquet: Quick History
What is Parquet?
• Started at Twitter and Cloudera in 2013
• Databases traditionally store information in rows and are optimized for
working with one record at a time
• Columnar storage systems are optimised to store data by column
• A compressed, efficient columnar data representation
• Compression schemes can be specified on a per-column level
• Allows complex data to be encoded efficiently
• Netflix - 7 PB of warehoused data in Parquet format
• Not as compressed as ORC (Hortonworks) but faster read/analysis
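The columnar idea can be sketched in a few lines of plain Python (illustrative only – this is not Parquet’s actual encoding): pivot rows into one array per column and compress each column independently. As in Parquet, a highly repetitive column collapses to almost nothing.

```python
import json
import zlib

# 1000 toy rows; every "status" value is identical.
rows = [{"id": i, "status": "OK", "temp": 20.0} for i in range(1000)]

# Pivot row-oriented records into one array per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Per-column compression, as Parquet allows: each column gets its own codec
# run, so the all-"OK" status column shrinks to a handful of bytes.
compressed = {key: zlib.compress(json.dumps(values).encode())
              for key, values in columns.items()}

# Round-trip one column to show the representation is lossless.
restored = json.loads(zlib.decompress(compressed["id"]))
```

The varied `id` column compresses far less than the constant `status` column, which is exactly why per-column compression schemes pay off for analytical scans.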
19. Intro To Akka: Distributed Apps
What is Akka?
• Open source toolkit first released in 2009
• Simplifies the construction of highly concurrent and distributed Java apps
• Makes it easier to build concurrent, fault-tolerant and scalable applications
• Based on the ‘actor’ model
• Highly performant event-driven programming
• Hierarchical - each actor is created and supervised by its parent
• Process failures treated as events handled by an actor's supervisor
• Language bindings exist for both Java and Scala
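The actor model itself fits in a short sketch. Akka is a JVM toolkit, so the plain-Python version below is purely illustrative: an actor owns a mailbox and processes one message at a time on its own thread, so its state never needs locks.

```python
import queue
import threading

class CounterActor:
    """Toy actor: a mailbox plus a single thread draining it in order."""

    def __init__(self):
        self.mailbox = queue.Queue()
        self.count = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            msg = self.mailbox.get()   # block until a message arrives
            if msg == "stop":
                break
            self.count += msg          # state touched only by this thread

    def tell(self, msg):
        """Fire-and-forget send, like Akka's tell()."""
        self.mailbox.put(msg)

    def join(self):
        self.tell("stop")
        self._thread.join()

actor = CounterActor()
for n in [1, 2, 3]:
    actor.tell(n)
actor.join()
```

Akka adds the hierarchy on top of this core: parents create child actors and handle their failures as messages, which the sketch omits.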
20. Big Data Pipelining: Static Datasets
Valid data pipeline analysis methods must be:
• Auditable
• Reproducible – essential for any science, so too for Data Science
• Documented – important to understand the how and why
• Controlled
• Suitable for version control
• Collaborative
• Easily accessible
21. Intro To Notebooks: Features
What are Notebooks?
• Drive your data analysis from the browser
• Increasingly popular
• Highly interactive
• Tight integration with Apache Spark
• Handy tools for analysts:
• Reproducible visual analysis
• Code in Scala, CQL, SparkSQL
• Charting – pie, bar, line etc
• Extensible with custom libraries
23. Big Data Pipelining: Static Datasets
Example architecture & requirements
1. Optimised source data format
2. Distributed in-memory analytics
3. Interactive and flexible data analysis tool
4. Persistent data store
5. Visualisation tools
24. Big Data Pipelining: Pipeline Flow
Example: Genome research platform (SHAR3)
ADAM → Notebook → Notebook → Datastore → Visualisation
25. Big Data Pipelining: Pipeline Scalability
• Add more (physical or virtual) nodes as
required to add capacity
• Container tools can ease configuration
management
• Scale out quickly
26. Big Data Pipelining: Pipeline Process Flow
1. Source data
2. Interactive, flexible and reproducible analysis
3. Persistent data storage
4. Visualise and analyse
27. Analytics: Static Data Pipeline Process
• No longer an iterative process constrained by hardware limitations
• Now a more scalable, resilient, dynamic, interactive process, easily shareable
The new model for large-scale static data analysis: Load → Analyse → Share
29. Big Data Pipelining: Real-Time Analytics
What problem are we trying to solve?
• Capture, prepare, and process fast streaming data
• Needs a different approach from traditional batch processing
• Has to be at the speed of now – cannot wait even for seconds
• Immediate insight & instant decisions – offers huge commercial and engineering advantages
30. Big Data Analytics: Streams
Data tidal waves!Netflix
• Ingests Petabytes of data per day
• Over 1 TRILLION transactions per day (>10 m per second) into DSE
Data streams?
Data deluge?
31. Big Data Pipelining: Real-Time Use Cases
Social media
• Commercial value - trending products, sentiment analysis
• Reaction time is critical as the value of data quickly diminishes over time
e.g. Twitter, Facebook
Sensor data (IoT)
• Critical safety and monitoring
• Missed data could have significant safety implications
• Utility billing, engineering management e.g. power plant, vehicles
Examples of use cases for streaming analytics…
32. Big Data Pipelining: Real-Time Use Cases
Transactional data
• Missed data could have huge financial implications e.g. market data
• Credit card transactions, fraud detection– if it’s not now, its too late
User Experience
• Personalising the user experience
• Commercial benefit to customise the user experience
• Netflix, Spotify, eBay, mobile apps etc.
Examples of use cases for streaming analytics…
33. Big Data Pipelining: Real-Time architecture
Analytics in real-time at scale demand fast processing, with low latencies
Common solution is to use in-memory distributed architecture
Increasingly using a technology stack comprising Kafka, Spark and Cassandra
• Scalable
• Distributed
• Resilient
Streaming analytics architecture - what do we need?
34. Intro To Kafka: Quick History
What is Apache Kafka?
• Originally developed by LinkedIn
• Open sourced since 2011
• Top level Apache project since 2012
• Enterprise support from Confluent
• Fast? – a single Kafka broker handles hundreds of MB/s of reads and writes from thousands of clients
• Scalable? – can be elastically and transparently expanded without downtime; data streams are distributed over a cluster of machines
• Durable? – messages are persisted on disk and replicated within the cluster to prevent data loss
• Powerful? – each broker can handle TBs of messages without performance impact
• Distributed? – modern cluster-centric design, strong durability and fault-tolerance
35. Intro To Kafka: Architecture
How Does Kafka Work?
Producers send messages to the Kafka cluster, which in turn serves them up to consumers
• Kafka maintains feeds of messages in categories called topics
• Processes that publish messages to Kafka are called producers
• Processes that subscribe to topics and process the feed are called consumers
• A Kafka cluster is comprised of one or more servers, each called a broker
• Java API, other languages supported
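The topic/producer/consumer model above can be sketched as a toy in-memory stand-in (plain Python, illustrative only – not the Kafka API): a topic is a set of append-only partition logs, producers append, and a message’s position in its partition is its offset.

```python
class Topic:
    """Toy Kafka-style topic: append-only logs, one per partition."""

    def __init__(self, partitions=2):
        self.partitions = [[] for _ in range(partitions)]

    def produce(self, key: str, value: bytes):
        """Messages with the same key land in the same partition, in order."""
        part = hash(key) % len(self.partitions)
        self.partitions[part].append(value)
        return part, len(self.partitions[part]) - 1  # (partition, offset)

    def consume(self, partition: int, offset: int):
        """Read everything from a given offset; the broker keeps no cursor."""
        return self.partitions[partition][offset:]

topic = Topic()
part, off = topic.produce("sensor-1", b"temp=21")
topic.produce("sensor-1", b"temp=22")
backlog = topic.consume(part, 0)
```

Because consumption is just “read from offset N”, many independent consumers can replay the same log at their own pace, which is the property the next slide builds on.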
36. Intro To Kafka: Streaming Flow
How Does Kafka Work With Spark?
• Publish-subscribe messaging system implemented as a replicated commit log
• Messages are simply byte arrays, so can store any object in any format
• Each topic partition is stored as a log (an ordered set of messages)
• Each message in a partition is assigned a unique offset
• Consumers are responsible for tracking their location in each topic log
Spark consumes messages as a stream, in micro batches, saved as RDDs
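The micro-batch idea is simple enough to sketch (plain Python, illustrative only): drain a slice of the partition log, process it as one small dataset, and record the new offset – much as Spark Streaming turns a Kafka partition into a sequence of RDDs.

```python
def micro_batches(log, batch_size):
    """Yield (batch, next_offset) pairs from an append-only message log.

    Each yielded batch plays the role of one micro-batch RDD; the returned
    offset is what a consumer would persist to resume after a failure.
    """
    offset = 0
    while offset < len(log):
        batch = log[offset:offset + batch_size]
        offset += len(batch)
        yield batch, offset

log = [f"msg-{i}" for i in range(7)]
batches = list(micro_batches(log, batch_size=3))
```

Restarting from the last saved offset is what makes the stream replayable: nothing is dropped if a batch fails, it is simply re-read from the log.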
37. DataStax Enterprise: Streaming Schematic
Sensor Network → Signal Aggregation (Collection) → Messaging Queue (Sensor Data Queue Management, Brokers) → Data Processing & Storage
38. DataStax Enterprise: Streaming Analytics
Data Processing & Storage feeds:
• Real-time Analytics – personalisation, actionable insight, monitoring
• Near real-time, batch Analytics – Analytics / BI
40. Real-Time Analytics: DSE Multi-DC
Workload Management and Separation With DSE
Mixed load OLTP and Analytics platform – 100% uptime, global scale
• OLTP Datacentre – app, web, social media, IoT, personalisation & persistence
• Analytics Datacentre – real-time analytics, Analytics / BI (via JDBC/ODBC), actionable insight, monitoring
• Replication between datacentres gives separation of OLTP from Analytics
41. Lambda & Big Data: DSE & Hadoop
• High Velocity Ingestion Layer – social media, IoT, web/app, OLTP feeds
• OLTP Layer – 100% uptime, global scale; scalable, fault-tolerant, fast
• Real-Time Analytic/Integration Layer
• Data Stores (Active & Legacy) – Oracle, IBM, SAP
• Batch Analytics and Analytics / BI (via JDBC/ODBC)
42. Big Data Use Case: DSE & SAP
• High Velocity Ingestion Layer – social media, IoT, web/app, OLTP feeds
• OLTP Layer – 100% uptime, global scale; scalable, fault-tolerant, fast
• Real-Time Analytic/Integration Layer
• Hot Data Storage / Query
• Data Stores (Active & Legacy) – Oracle, IBM, SAP
• Analytics / BI via SAP HANA Smart Data Access (JDBC/ODBC)