Simon Ambridge
Data Pipelines With Spark & DSE
An Introduction To Building Agile, Flexible and Scalable Big Data and Data Science Pipelines
Certified Apache Cassandra and DataStax enthusiast who enjoys
explaining that the traditional approaches to data management just
don’t cut it anymore in the new always on, no single point of failure,
high volume, high velocity, real time distributed data management
world.
Previously 25 years implementing Oracle relational data management
solutions. Certified in Exadata, Oracle Cloud, Oracle Essbase, Oracle
Linux and OBIEE
simon.ambridge@datastax.com
@stratman1958
Simon Ambridge
Pre-Sales Solution Engineer, DataStax UK
Big Data Pipelining: Why Analytics?
• To be able to react to customers faster and with more accuracy
• To reduce the business risk through more accurate understanding of the
market
• Optimise return on marketing investment via better targeted campaigns
• Faster time to market with the right products at the right time
• Improve efficiency – commerce, plant and people
Recent survey found that more than half of respondents wanted:
85% wanted analytics to handle ‘real-time’ data changing at <1s intervals
Big, Static Data
Fast, Streaming Data
Big Data Pipelining: Classification
Big Data Pipelines can mean different things to different people
Repeated analysis on a static but massive dataset
• Typically an element of research – e.g. genomics, clinical
trial, demographic data
• Something that is typically repetitive, iterative, shared
amongst data scientists for analysis
Real-time analytics on streaming data
• Typically an industrialised process – e.g. sensors, tick
data, bioinformatics, transactional data, real-time
personalisation
• Something that is happening in real-time that usually
cannot be dropped or lost
Static Datasets
All You Can Eat?
Really.
Static Analytics: Traditional Approach
Repeated iterations, at each stage
Run/debug cycle can be slow
Sampling Modeling Interpret Tuning Reporting
Re-sample
Typical traditional ‘static’ data analysis model
Data
Results
Analytics: Traditional Scaling
DATA
DATA
DATA
Small datasets, small servers
Large datasets, large servers
Big datasets, big servers
Static Analytics: Scale Up Challenges
• Sampling and analysis often run on a single machine
• CPU and memory limitations – finite resources on a single machine
• Offers limited sampling of a large dataset because of data size limitations
• Multiple iterations over large datasets are frequently not an ideal approach
Static Analytics: Big Data Problems
• Data really is getting Big!
• Data is getting bigger!
• The number of data sources is exploding!
• More data is arriving faster!
Scaling up is becoming impractical – physical limits
• The validity of the analysis becomes obsolete, faster
• Analysis too slow to get any real ROI from the data
Big Data Analytics: Big Data Needs
We need scalable infrastructure + distributed technologies
• Data volumes can be scaled – we can distribute the data
across multiple low-cost machines or cloud instances
• Faster processing – distributed smaller datasets
• More complex processing – distributed across multiple
machines
• No single point of failure
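As a rough illustration of distributing data across several low-cost machines with no single point of failure, the sketch below places each key on a hash ring and keeps copies on the following nodes. The node names, the replication factor of 3 and the MD5-based token scheme are assumptions made for this example; it is not Cassandra's actual partitioner.

```python
import hashlib
from bisect import bisect_right

class Ring:
    """Toy hash ring: each key is stored on `rf` distinct nodes."""

    def __init__(self, nodes, rf=3):
        self.rf = min(rf, len(nodes))
        # One token per node, derived from the node name (hypothetical scheme).
        self.tokens = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def replicas(self, key):
        """Return the owner of `key` plus the next rf-1 nodes clockwise."""
        hashes = [token for token, _ in self.tokens]
        start = bisect_right(hashes, self._hash(key)) % len(self.tokens)
        return [self.tokens[(start + i) % len(self.tokens)][1]
                for i in range(self.rf)]

ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"], rf=3)
owners = ring.replicas("customer:42")
print(owners)  # three distinct nodes hold a copy of this key
```

Adding a node to the list changes only which keys sit next to its token, which is the property that lets this kind of layout scale out rather than up.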
Big Data Analytics: DSE Delivers
Building a distributed data processing framework can be a complex task!
It needs to be:
• Scalable
• Have fast in-memory processing
• Able to handle real-time or streaming data feeds
• Able to handle high throughput and low-latency
• Ideally be able to handle ad-hoc queries
• Ideally be replicated across multiple data centers for resiliency
DataStax Enterprise: Standard Edition
DataStax
Enterprise
• Certified Cassandra – delivers trusted, tested and
certified versions of Cassandra ready for
production environments.
• Expert Support – answers and assistance from
the Cassandra experts for all production needs.
• Enterprise Security – supplies full protection for
sensitive data.
• Automatic Management Services – automates key
maintenance functions to keep the database
running smoothly.
• OpsCenter – provides advanced management and
monitoring functionality for production
applications.
DataStax
Enterprise
• Advanced Analytics – provides ability to run real-
time and batch analytic operations on Cassandra
data, as well as integrate DSE with external
Hadoop deployments.
• Enterprise Search – supplies built-in enterprise
and distributed search capabilities on Cassandra
data.
• In-Memory Option – delivers all the benefits of
Cassandra to in-memory computing.
• Workload Isolation – allows for analytics and
search functions to run separately from
transactional workloads, with no need to ETL
data to different systems.
DataStax Enterprise: Max Edition
Intro To Cassandra: THE Cloud Database
What is Apache Cassandra?
• Originally developed at Facebook, open sourced in 2008
• Top level Apache project since 2010
• Open source distributed database
• Clusters can handle large amounts of data (PBs)
• Performant at high velocity
• Extremely resilient:
• Across multiple data centres
• No single point of failure
• Continuous Availability, disaster avoidance
• Enterprise Cassandra platform from DataStax
Intro To Spark: THE Analytics Engine
What is Apache Spark?
• Started at UC Berkeley in 2009
• Open sourced in 2010, Apache project since 2013
• Distributed in-memory processing
• Rich Scala, Java and Python APIs
• Fast - 10x-100x faster than Hadoop MapReduce
• 2x-5x less code than R
• Batch and streaming analytics
• Interactive shell (REPL)
• Tightly integrated with DSE
Spark: Daytona Gray Sort Contest
October 2014
Daytona Gray benchmark tests how fast a system can sort 100 TB of data
(1 trillion records)
• Previous world record held by Hadoop MapReduce cluster of 2100 nodes, in 72
minutes
• Spark completed the benchmark in 23 minutes on just 206 EC2 nodes. All the
sorting took place on disk (HDFS), without using Spark’s in-memory cache (3X
faster using 10X fewer machines)
• Spark also sorted 1 PB (10 trillion records) on 190 machines in under 4 hours.
This beats previously reported results based on Hadoop MapReduce of 16 hours on 3800
machines (4X faster using 20X fewer machines)
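The "3X faster using 10X fewer machines" and "4X faster using 20X fewer machines" summaries can be checked directly from the figures quoted above. The short calculation below is only that arithmetic; the variable names are made up for the example.

```python
# Figures quoted on the slide above.
hadoop_100tb = {"minutes": 72, "nodes": 2100}
spark_100tb = {"minutes": 23, "nodes": 206}
hadoop_1pb = {"hours": 16, "nodes": 3800}
spark_1pb = {"hours": 4, "nodes": 190}  # "under 4 hours", so 4 is an upper bound

speedup_100tb = hadoop_100tb["minutes"] / spark_100tb["minutes"]
node_ratio_100tb = hadoop_100tb["nodes"] / spark_100tb["nodes"]
speedup_1pb = hadoop_1pb["hours"] / spark_1pb["hours"]
node_ratio_1pb = hadoop_1pb["nodes"] / spark_1pb["nodes"]

print(f"100 TB: {speedup_100tb:.1f}x faster on {node_ratio_100tb:.1f}x fewer nodes")
# 100 TB: 3.1x faster on 10.2x fewer nodes
print(f"1 PB: at least {speedup_1pb:.1f}x faster on {node_ratio_1pb:.1f}x fewer nodes")
# 1 PB: at least 4.0x faster on 20.0x fewer nodes
```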
DataStax Enterprise: Analytics Integration
Cassandra Cluster
Spark Cluster
ETL
Spark Cluster
• Tight integration
• Data locality
• Microsecond response times
X
• Apache Cassandra for Distributed Persistent Storage
• Integrated Apache Spark for Distributed Real-Time Analytics
• Analytics nodes close to data - no ETL required
X
• Loose integration
• Data separate from processing
• Millisecond response times
“Latency when transferring data is unavoidable. The trick is to reduce the latency to as close to zero as possible…”
Intro To Parquet: Quick History
What is Parquet?
• Started at Twitter and Cloudera in 2013
• Databases traditionally store information in rows and are optimized for
working with one record at a time
• Columnar storage systems optimised to store data by column
• A compressed, efficient columnar data representation
• Compression schemes can be specified on a per-column level
• Allows complex data to be encoded efficiently
• Netflix - 7 PB of warehoused data in Parquet format
• Not as compressed as ORC (Hortonworks) but faster read/analysis
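To make the row-versus-column distinction concrete, the sketch below pivots a few hypothetical records into columns and applies a different encoding to each column. It only illustrates the idea of columnar layout with per-column compression; it is not Parquet's on-disk format, and the records and helper names are invented for the example.

```python
from itertools import groupby

# Hypothetical records, stored row by row as a traditional database would.
rows = [
    {"country": "UK", "device": "mobile", "clicks": 3},
    {"country": "UK", "device": "mobile", "clicks": 5},
    {"country": "UK", "device": "desktop", "clicks": 2},
    {"country": "DE", "device": "desktop", "clicks": 7},
]

def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    return {name: [r[name] for r in records] for name in records[0]}

def run_length_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

columns = to_columns(rows)

# A per-column choice of encoding: run-length for the repetitive columns,
# plain values for the numeric one.
encoded = {
    "country": run_length_encode(columns["country"]),
    "device": run_length_encode(columns["device"]),
    "clicks": columns["clicks"],
}

print(encoded["country"])      # [('UK', 3), ('DE', 1)]
print(sum(columns["clicks"]))  # 17, computed by reading only one column
```

An aggregate over `clicks` touches a single list here, whereas the row layout would require visiting every record.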
Intro To Akka: Distributed Apps
What is Akka?
• Open source toolkit first released in 2009
• Simplifies the construction of highly concurrent and distributed Java apps
• Makes it easier to build concurrent, fault-tolerant and scalable applications
• Based on the ‘actor’ model
• Highly performant event-driven programming
• Hierarchical - each actor is created and supervised by its parent
• Process failures treated as events handled by an actor's supervisor
• Language bindings exist for both Java and Scala
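The actor and supervision ideas above can be sketched without Akka itself. The example below is a single-threaded Python toy, not Akka's API: the `Actor`, `Counter` and `Supervisor` classes and the `child_failed` event name are all hypothetical. It shows a parent creating its child, a failure arriving at the parent as an ordinary message, and the parent restarting the child.

```python
from collections import deque

class Actor:
    """Minimal actor: private state, a mailbox, one message handled at a time."""

    def __init__(self, name, supervisor=None):
        self.name = name
        self.supervisor = supervisor
        self.mailbox = deque()

    def tell(self, message):
        self.mailbox.append(message)

    def receive(self, message):
        raise NotImplementedError

    def process_one(self):
        """Handle the next message; report a failure to the supervisor as an event."""
        if not self.mailbox:
            return False
        message = self.mailbox.popleft()
        try:
            self.receive(message)
        except Exception as error:
            if self.supervisor is None:
                raise
            self.supervisor.tell(("child_failed", self, error))
        return True

class Counter(Actor):
    def __init__(self, name, supervisor=None):
        super().__init__(name, supervisor)
        self.total = 0

    def receive(self, message):
        self.total += int(message)  # raises ValueError on bad input

class Supervisor(Actor):
    """Parent actor: creates its child and restarts it when it fails."""

    def __init__(self, name):
        super().__init__(name)
        self.child = Counter("counter", supervisor=self)
        self.restarts = 0

    def receive(self, message):
        kind, child, _error = message
        if kind == "child_failed":
            pending = child.mailbox  # keep unprocessed messages across the restart
            self.child = Counter(child.name, supervisor=self)
            self.child.mailbox = pending
            self.restarts += 1

supervisor = Supervisor("root")
for msg in ["1", "2", "oops", "4"]:
    supervisor.child.tell(msg)

# Single-threaded loop standing in for a real dispatcher; failure events
# are handled before the child is given its next message.
while supervisor.process_one() or supervisor.child.process_one():
    pass

print(supervisor.restarts, supervisor.child.total)  # 1 4
```

The restarted child starts from fresh state, so only the message delivered after the restart is counted; whether to restart, resume or stop is the supervisor's decision in this model.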
Big Data Pipelining: Static Datasets
Valid data pipeline analysis methods must be:
• Auditable
• Reproducible – essential for any science, so too for Data Science
• Documented – important to understand the how and why
• Controlled
• Suitable for version control
• Collaborative
• Easily accessible
Intro To Notebooks: Features
What are Notebooks?
• Drive your data analysis from the browser
• Increasingly popular
• Highly interactive
• Tight integration with Apache Spark
• Handy tools for analysts:
• Reproducible visual analysis
• Code in Scala, CQL, SparkSQL
• Charting – pie, bar, line etc
• Extensible with custom libraries
Big Data Pipelining: Static Datasets
Example architecture & requirements
1. Optimised source data format
2. Distributed in-memory analytics
3. Interactive and flexible data analysis tool
4. Persistent data store
5. Visualisation tools
Big Data Pipelining: Pipeline Flow
ADAM Notebook Notebook Datastore Visualisation
Example: Genome research platform (SHAR3)
Big Data Pipelining: Pipeline Scalability
• Add more (physical or virtual) nodes as
required to add capacity
• Container tools can ease configuration
management
• Scale out quickly
Big Data Pipelining: Pipeline Process Flow
1. Source data
2. Interactive, flexible and reproducible analysis
3. Persistent data storage
4. Visualise and analyse
Analytics: Static Data Pipeline Process
• No longer an iterative process constrained by hardware limitations
• Now a more scalable, resilient, dynamic, interactive process, easily shareable
The new model for large-scale static data analysis
Load
Analyse
Share
Real-Time Datasets
If it’s Not “Now”, Then It’s Probably Already Too Late
Big Data Pipelining: Real-Time Analytics
• Capture, prepare, and process fast streaming data
• Needs different approach from traditional batch processing
• Has to be at the speed of now – cannot wait even for seconds
• Immediate insight & instant decisions - offers huge commercial
and engineering advantages
What problem are we trying to solve?
Big Data Analytics: Streams
Data tidal waves!
Netflix
• Ingests Petabytes of data per day
• Over 1 TRILLION transactions per day (>10 million per second) into DSE
Data streams?
Data deluge?
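The per-second figure quoted for Netflix follows from the daily total. A quick check, assuming the load is spread evenly across the day (real traffic will peak higher):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

transactions_per_day = 1_000_000_000_000  # "over 1 trillion", so a lower bound
per_second = transactions_per_day / SECONDS_PER_DAY

print(f"{per_second / 1e6:.1f} million transactions per second")
# 11.6 million transactions per second
```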
Big Data Pipelining: Real-Time Use Cases
Social media
• Commercial value - trending products, sentiment analysis
• Reaction time is critical as the value of data quickly diminishes over time
e.g. Twitter, Facebook
Sensor data (IoT)
• Critical safety and monitoring
• Missed data could have significant safety implications
• Utility billing, engineering management e.g. power plant, vehicles
Examples of use cases for streaming analytics…
Big Data Pipelining: Real-Time Use Cases
Transactional data
• Missed data could have huge financial implications e.g. market data
• Credit card transactions, fraud detection – if it’s not now, it’s too late
User Experience
• Personalising the user experience
• Commercial benefit to customise the user experience
• Netflix, Spotify, eBay, mobile apps etc.
Examples of use cases for streaming analytics…
Big Data Pipelining: Real-Time architecture
Analytics in real-time at scale demands fast processing, with low latencies
Common solution is to use in-memory distributed architecture
Increasingly using a technology stack comprising Kafka, Spark and Cassandra
• Scalable
• Distributed
• Resilient
Streaming analytics architecture - what do we need?
Intro To Kafka: Quick History
What is Apache Kafka?
• Originally developed by LinkedIn
• Open sourced in 2011
• Top level Apache project since 2012
• Enterprise support from Confluent
• Fast? A single Kafka broker handles hundreds of MB/s of reads and writes from thousands
of clients
• Scalable? Can be elastically and transparently expanded without downtime. Data
streams are distributed over a cluster of machines
• Durable? Messages are persisted on disk and replicated within the cluster to prevent data loss
• Powerful? Each broker can handle TBs of messages without performance impact
• Distributed? Modern cluster-centric design, strong durability and fault-tolerance
Intro To Kafka: Architecture
How Does Kafka Work?
Producers send messages to the Kafka cluster, which in
turn serves them up to consumers
• Kafka maintains feeds of messages in categories called topics
• Processes that publish messages to Kafka are called producers
• Processes that subscribe to topics and process the feed are called consumers
• A Kafka cluster comprises one or more servers, each of which is called a broker
• Java API, other languages supported
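The roles of producer, topic, broker and consumer can be sketched with a small in-memory model. This is a toy written for illustration, not the Kafka client API: the `Broker`, `Producer` and `Consumer` classes and the `sensor-readings` topic name are assumptions.

```python
from collections import defaultdict

class Broker:
    """Toy in-memory broker: one append-only list of messages per topic."""

    def __init__(self):
        self.topics = defaultdict(list)

    def append(self, topic, message):
        self.topics[topic].append(message)

    def read(self, topic, start):
        """Return every message in `topic` from position `start` onwards."""
        return self.topics[topic][start:]

class Producer:
    def __init__(self, broker):
        self.broker = broker

    def send(self, topic, message):
        self.broker.append(topic, message)

class Consumer:
    """Subscribes to one topic and remembers how far it has read."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.position = 0

    def poll(self):
        messages = self.broker.read(self.topic, self.position)
        self.position += len(messages)
        return messages

broker = Broker()
producer = Producer(broker)
billing = Consumer(broker, "sensor-readings")
monitoring = Consumer(broker, "sensor-readings")

producer.send("sensor-readings", b"temp=21.5")
producer.send("sensor-readings", b"temp=21.7")

print(billing.poll())     # both messages
print(monitoring.poll())  # the same two messages: consumers read independently
print(billing.poll())     # [] until the producer sends more
```

Because the broker keeps the messages rather than handing them to one reader, any number of consumers can process the same feed at their own pace.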
Intro To Kafka: Streaming Flow
How Does Kafka Work With Spark?
• Publish-subscribe messaging system implemented as a replicated commit
log
• Messages are simply byte arrays so can store any object in any format
• Each topic partition stored as a log (an ordered set of messages)
• Each message in a partition is assigned a unique offset
• Consumers are responsible for tracking their position (offset) in each topic log
Spark consumes messages as a stream, in micro batches, saved as RDDs
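The offset and micro-batch ideas above can be sketched as follows. This is a simplified stand-in, not Kafka or Spark Streaming code: `PartitionLog` and `micro_batches` are hypothetical names, and batches are cut by message count here, whereas Spark Streaming cuts them by time interval.

```python
class PartitionLog:
    """One topic partition: an ordered log where each message gets the next offset."""

    def __init__(self):
        self._messages = []

    def append(self, payload: bytes) -> int:
        self._messages.append(payload)
        return len(self._messages) - 1  # offset assigned to this message

    def fetch(self, offset: int, max_messages: int):
        return self._messages[offset:offset + max_messages]

def micro_batches(log, start_offset=0, batch_size=3):
    """Yield (first_offset, messages); the consumer, not the log, tracks the offset."""
    offset = start_offset
    while True:
        batch = log.fetch(offset, batch_size)
        if not batch:
            return
        yield offset, batch
        offset += len(batch)

log = PartitionLog()
for reading in range(7):
    log.append(f"reading-{reading}".encode("utf-8"))  # payloads are plain bytes

for first_offset, batch in micro_batches(log):
    # Each batch stands in for one RDD in a micro-batch stream.
    print(first_offset, [m.decode("utf-8") for m in batch])
```

Since the consumer owns the offset, it can resume from where it stopped or replay from an earlier offset by passing a different `start_offset`.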
DataStax Enterprise: Streaming Schematic
Sensor Network
Signal
Aggregation
Messaging Queue
Sensor Data Queue
Management
Broker
Broker
Collection
Data Processing
& Storage
DataStax Enterprise: Streaming Analytics
Real-time
Analytics
Data Processing
& Storage
Near real-time,
batch Analytics
Analytics / BI
!$£€!
Personalisation
Actionable insight Monitoring
DataStax Enterprise: Multi-DC Uses
DC: EUROPE, DC: USA
Real-time active-active geo-replication across physical datacentres
OLTP: Cassandra
Analytics: Cassandra + Spark
Replication
Workload separation via virtual datacentres
Real-Time Analytics: DSE Multi-DC
Workload Management and Separation With DSE
Analytics / BI
Analytics
Datacentre
OLTP
Datacentre
100% Uptime, Global Scale
OLTP
Real-Time Analytics
Mixed Load OLTP and Analytics Platform
Replication
Replication
JDBC
ODBC
Separation of OLTP
from Analytics
Social Media
IoT
Personalisation & Persistence
Personalisation
!$£€!
Actionable insight
Monitoring
App, Web
OLTP
Feed
OLTP Layer
100% Uptime, Global Scale
High Velocity
Ingestion Layer
Lambda & Big Data: DSE & Hadoop
Data Stores -
Active & Legacy
Batch Analytics Analytics / BI
• Scalable
• Fault-tolerant
• Fast
JDBC
ODBC
Real-Time
Analytic/Integration
Layer
Social Media IoT
Web, App
Oracle
IBM
SAP
OLTP
Feed
OLTP
Feed
OLTP
Feed
Big Data Use Case: DSE & SAP
Data Stores -
Active & Legacy
Hot Data
Storage / Query
Analytics / BI
SAP/Hana Smart Data Access
OLTP Layer
100% Uptime, Global Scale
High Velocity
Ingestion Layer
Social Media IoT
• Scalable
• Fault-tolerant
• Fast
Oracle
IBM
SAP
Real-Time
Analytic/Integration
Layer
Web, App
JDBC
ODBC
Thank you!

Contenu connexe

Tendances

Apache Kafka in the Insurance Industry
Apache Kafka in the Insurance IndustryApache Kafka in the Insurance Industry
Apache Kafka in the Insurance IndustryKai Wähner
 
Streaming IBM i to Kafka for Next-Gen Use Cases
Streaming IBM i to Kafka for Next-Gen Use CasesStreaming IBM i to Kafka for Next-Gen Use Cases
Streaming IBM i to Kafka for Next-Gen Use CasesPrecisely
 
Connecting Apache Kafka to Cash
Connecting Apache Kafka to CashConnecting Apache Kafka to Cash
Connecting Apache Kafka to Cashconfluent
 
JUG Tirana - Introduction to data streaming
JUG Tirana - Introduction to data streamingJUG Tirana - Introduction to data streaming
JUG Tirana - Introduction to data streamingNicolas Fränkel
 
IoT Architectures for Apache Kafka and Event Streaming - Industry 4.0, Digita...
IoT Architectures for Apache Kafka and Event Streaming - Industry 4.0, Digita...IoT Architectures for Apache Kafka and Event Streaming - Industry 4.0, Digita...
IoT Architectures for Apache Kafka and Event Streaming - Industry 4.0, Digita...Kai Wähner
 
WJAX 2013 Slides online: Big Data beyond Apache Hadoop - How to integrate ALL...
WJAX 2013 Slides online: Big Data beyond Apache Hadoop - How to integrate ALL...WJAX 2013 Slides online: Big Data beyond Apache Hadoop - How to integrate ALL...
WJAX 2013 Slides online: Big Data beyond Apache Hadoop - How to integrate ALL...Kai Wähner
 
Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning wi...
Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning wi...Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning wi...
Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning wi...Kai Wähner
 
Apache Kafka in the Transportation and Logistics
Apache Kafka in the Transportation and LogisticsApache Kafka in the Transportation and Logistics
Apache Kafka in the Transportation and LogisticsKai Wähner
 
Apache Kafka and Blockchain - Comparison and a Kafka-native Implementation
Apache Kafka and Blockchain - Comparison and a Kafka-native ImplementationApache Kafka and Blockchain - Comparison and a Kafka-native Implementation
Apache Kafka and Blockchain - Comparison and a Kafka-native ImplementationKai Wähner
 
Unleashing Apache Kafka and TensorFlow in Hybrid Cloud Architectures
Unleashing Apache Kafka and TensorFlow in Hybrid Cloud ArchitecturesUnleashing Apache Kafka and TensorFlow in Hybrid Cloud Architectures
Unleashing Apache Kafka and TensorFlow in Hybrid Cloud ArchitecturesKai Wähner
 
IBM Cloud Pak for Integration with Confluent Platform powered by Apache Kafka
IBM Cloud Pak for Integration with Confluent Platform powered by Apache KafkaIBM Cloud Pak for Integration with Confluent Platform powered by Apache Kafka
IBM Cloud Pak for Integration with Confluent Platform powered by Apache KafkaKai Wähner
 
Mainframe Integration, Offloading and Replacement with Apache Kafka
Mainframe Integration, Offloading and Replacement with Apache KafkaMainframe Integration, Offloading and Replacement with Apache Kafka
Mainframe Integration, Offloading and Replacement with Apache KafkaKai Wähner
 
Apache Kafka in the Airline, Aviation and Travel Industry
Apache Kafka in the Airline, Aviation and Travel IndustryApache Kafka in the Airline, Aviation and Travel Industry
Apache Kafka in the Airline, Aviation and Travel IndustryKai Wähner
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKai Wähner
 
Apache Kafka in Financial Services - Use Cases and Architectures
Apache Kafka in Financial Services - Use Cases and ArchitecturesApache Kafka in Financial Services - Use Cases and Architectures
Apache Kafka in Financial Services - Use Cases and ArchitecturesKai Wähner
 
Event Streaming CTO Roundtable for Cloud-native Kafka Architectures
Event Streaming CTO Roundtable for Cloud-native Kafka ArchitecturesEvent Streaming CTO Roundtable for Cloud-native Kafka Architectures
Event Streaming CTO Roundtable for Cloud-native Kafka ArchitecturesKai Wähner
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaKai Wähner
 
Simplified Machine Learning Architecture with an Event Streaming Platform (Ap...
Simplified Machine Learning Architecture with an Event Streaming Platform (Ap...Simplified Machine Learning Architecture with an Event Streaming Platform (Ap...
Simplified Machine Learning Architecture with an Event Streaming Platform (Ap...Kai Wähner
 
Understanding the TCO and ROI of Apache Kafka & Confluent
Understanding the TCO and ROI of Apache Kafka & ConfluentUnderstanding the TCO and ROI of Apache Kafka & Confluent
Understanding the TCO and ROI of Apache Kafka & Confluentconfluent
 
Hybrid & Global Kafka Architecture
Hybrid & Global Kafka ArchitectureHybrid & Global Kafka Architecture
Hybrid & Global Kafka Architectureconfluent
 

Tendances (20)

Apache Kafka in the Insurance Industry
Apache Kafka in the Insurance IndustryApache Kafka in the Insurance Industry
Apache Kafka in the Insurance Industry
 
Streaming IBM i to Kafka for Next-Gen Use Cases
Streaming IBM i to Kafka for Next-Gen Use CasesStreaming IBM i to Kafka for Next-Gen Use Cases
Streaming IBM i to Kafka for Next-Gen Use Cases
 
Connecting Apache Kafka to Cash
Connecting Apache Kafka to CashConnecting Apache Kafka to Cash
Connecting Apache Kafka to Cash
 
JUG Tirana - Introduction to data streaming
JUG Tirana - Introduction to data streamingJUG Tirana - Introduction to data streaming
JUG Tirana - Introduction to data streaming
 
IoT Architectures for Apache Kafka and Event Streaming - Industry 4.0, Digita...
IoT Architectures for Apache Kafka and Event Streaming - Industry 4.0, Digita...IoT Architectures for Apache Kafka and Event Streaming - Industry 4.0, Digita...
IoT Architectures for Apache Kafka and Event Streaming - Industry 4.0, Digita...
 
WJAX 2013 Slides online: Big Data beyond Apache Hadoop - How to integrate ALL...
WJAX 2013 Slides online: Big Data beyond Apache Hadoop - How to integrate ALL...WJAX 2013 Slides online: Big Data beyond Apache Hadoop - How to integrate ALL...
WJAX 2013 Slides online: Big Data beyond Apache Hadoop - How to integrate ALL...
 
Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning wi...
Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning wi...Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning wi...
Apache Kafka, Tiered Storage and TensorFlow for Streaming Machine Learning wi...
 
Apache Kafka in the Transportation and Logistics
Apache Kafka in the Transportation and LogisticsApache Kafka in the Transportation and Logistics
Apache Kafka in the Transportation and Logistics
 
Apache Kafka and Blockchain - Comparison and a Kafka-native Implementation
Apache Kafka and Blockchain - Comparison and a Kafka-native ImplementationApache Kafka and Blockchain - Comparison and a Kafka-native Implementation
Apache Kafka and Blockchain - Comparison and a Kafka-native Implementation
 
Unleashing Apache Kafka and TensorFlow in Hybrid Cloud Architectures
Unleashing Apache Kafka and TensorFlow in Hybrid Cloud ArchitecturesUnleashing Apache Kafka and TensorFlow in Hybrid Cloud Architectures
Unleashing Apache Kafka and TensorFlow in Hybrid Cloud Architectures
 
IBM Cloud Pak for Integration with Confluent Platform powered by Apache Kafka
IBM Cloud Pak for Integration with Confluent Platform powered by Apache KafkaIBM Cloud Pak for Integration with Confluent Platform powered by Apache Kafka
IBM Cloud Pak for Integration with Confluent Platform powered by Apache Kafka
 
Mainframe Integration, Offloading and Replacement with Apache Kafka
Mainframe Integration, Offloading and Replacement with Apache KafkaMainframe Integration, Offloading and Replacement with Apache Kafka
Mainframe Integration, Offloading and Replacement with Apache Kafka
 
Apache Kafka in the Airline, Aviation and Travel Industry
Apache Kafka in the Airline, Aviation and Travel IndustryApache Kafka in the Airline, Aviation and Travel Industry
Apache Kafka in the Airline, Aviation and Travel Industry
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid Cloud
 
Apache Kafka in Financial Services - Use Cases and Architectures
Apache Kafka in Financial Services - Use Cases and ArchitecturesApache Kafka in Financial Services - Use Cases and Architectures
Apache Kafka in Financial Services - Use Cases and Architectures
 
Event Streaming CTO Roundtable for Cloud-native Kafka Architectures
Event Streaming CTO Roundtable for Cloud-native Kafka ArchitecturesEvent Streaming CTO Roundtable for Cloud-native Kafka Architectures
Event Streaming CTO Roundtable for Cloud-native Kafka Architectures
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
 
Simplified Machine Learning Architecture with an Event Streaming Platform (Ap...
Simplified Machine Learning Architecture with an Event Streaming Platform (Ap...Simplified Machine Learning Architecture with an Event Streaming Platform (Ap...
Simplified Machine Learning Architecture with an Event Streaming Platform (Ap...
 
Understanding the TCO and ROI of Apache Kafka & Confluent
Understanding the TCO and ROI of Apache Kafka & ConfluentUnderstanding the TCO and ROI of Apache Kafka & Confluent
Understanding the TCO and ROI of Apache Kafka & Confluent
 
Hybrid & Global Kafka Architecture
Hybrid & Global Kafka ArchitectureHybrid & Global Kafka Architecture
Hybrid & Global Kafka Architecture
 

En vedette

Building Scalable Big Data Pipelines
Building Scalable Big Data PipelinesBuilding Scalable Big Data Pipelines
Building Scalable Big Data PipelinesChristian Gügi
 
Big Data Architectures
Big Data ArchitecturesBig Data Architectures
Big Data ArchitecturesGuido Schmutz
 
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason PohlBuilding a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason PohlSpark Summit
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Spark Summit
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkDataWorks Summit
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 

En vedette (7)

Building Scalable Big Data Pipelines
Building Scalable Big Data PipelinesBuilding Scalable Big Data Pipelines
Building Scalable Big Data Pipelines
 
Big Data Architectures
Big Data ArchitecturesBig Data Architectures
Big Data Architectures
 
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason PohlBuilding a Just in Time Data Warehouse by Dan Morris and Jason Pohl
Building a Just in Time Data Warehouse by Dan Morris and Jason Pohl
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 

Similaire à 20160331 sa introduction to big data pipelining berlin meetup 0.3

Data Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseData Pipelines with Spark & DataStax Enterprise
Data Pipelines with Spark & DataStax EnterpriseDataStax
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...Institute of Contemporary Sciences
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMichael Hiskey
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Sa introduction to big data pipelining with cassandra &amp; spark   west mins...Sa introduction to big data pipelining with cassandra &amp; spark   west mins...
Sa introduction to big data pipelining with cassandra &amp; spark west mins...Simon Ambridge
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Fwdays
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amirydatastack
 
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupDataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup
DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetupVictor Coustenoble
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Anubhav Kale
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraLow-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraCaserta
 
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...DataStax
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptxbetalab
 
Real Time Big Data Processing on AWS
Real Time Big Data Processing on AWSReal Time Big Data Processing on AWS
Real Time Big Data Processing on AWSCaserta
 
Cray Urika-XA Advanced Analytics Platform
Cray Urika-XA Advanced Analytics PlatformCray Urika-XA Advanced Analytics Platform
Cray Urika-XA Advanced Analytics Platforminside-BigData.com
 
Data platform architecture
Data platform architectureData platform architecture
Data platform architectureSudheer Kondla
 
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersAdaryl "Bob" Wakefield, MBA
 


20160331 sa introduction to big data pipelining berlin meetup 0.3

  • 1. Simon Ambridge Data Pipelines With Spark & DSE An Introduction To Building Agile, Flexible and Scalable Big Data and Data Science Pipelines
  • 2. Certified Apache Cassandra and DataStax enthusiast who enjoys explaining that the traditional approaches to data management just don’t cut it anymore in the new always on, no single point of failure, high volume, high velocity, real time distributed data management world. Previously 25 years implementing Oracle relational data management solutions. Certified in Exadata, Oracle Cloud, Oracle Essbase, Oracle Linux and OBIEE simon.ambridge@datastax.com @stratman1958 Simon Ambridge Pre-Sales Solution Engineer, Datastax UK
  • 3. Big Data Pipelining: Why Analytics? • To be able to react to customers faster and with more accuracy • To reduce the business risk through more accurate understanding of the market • Optimise return on marketing investment via better targeted campaigns • Faster time to market with the right products at the right time • Improve efficiency – commerce, plant and people Recent survey found that more than half of respondents wanted: 85% wanted analytics to handle ‘real-time’ data changing at <1s intervals
  • 4. Big, Static Data Fast, Streaming Data Big Data Pipelining: Classification Big Data Pipelines can mean different things to different people Repeated analysis on a static but massive dataset • Typically an element of research – e.g. genomics, clinical trial, demographic data • Something that is typically repetitive, iterative, shared amongst data scientists for analysis Real-time analytics on streaming data • Typically an industrialised process – e.g. sensors, tick data, bioinformatics, transactional data, real-time personalisation • Something that is happening in real-time that usually cannot be dropped or lost
  • 5. Static Datasets All You Can Eat? Really.
  • 6. Static Analytics: Traditional Approach Repeated iterations, at each stage Run/debug cycle can be slow Sampling Modeling Interpret Tuning Reporting Re-sample Typical traditional ‘static’ data analysis model Data Results
  • 7. Analytics: Traditional Scaling DATA DATA DATA Small datasets, small servers Large datasets, large servers Big datasets, big servers
  • 8. Static Analytics: Scale Up Challenges • Sampling and analysis often run on a single machine • CPU and memory limitations – finite resources on a single machine • Offers limited sampling of a large dataset because of data size limitations • Multiple iterations over large datasets is frequently not an ideal approach
  • 9. Static Analytics: Big Data Problems • Data really is getting Big! • Data is getting bigger! • The number of data sources is exploding! • More data is arriving faster! Scaling up is becoming impractical – physical limits • The validity of the analysis becomes obsolete, faster • Analysis too slow to get any real ROI from the data
  • 10. Big Data Analytics: Big Data Needs We need scalable infrastructure + distributed technologies • Data volumes can be scaled – we can distribute the data across multiple low-cost machines or cloud instances • Faster processing – distributed smaller datasets • More complex processing – distributed across multiple machines • No single point of failure
  • 11. Big Data Analytics: DSE Delivers Building a distributed data processing framework can be a complex task! It needs to be: • Scalable • Have fast in-memory processing • Able to handle real-time or streaming data feeds • Able to handle high throughput and low-latency • Ideally be able to handle ad-hoc queries • Ideally be replicated across multiple data centers for resiliency
  • 12. DataStax Enterprise: Standard Edition DataStax Enterprise • Certified Cassandra – delivers trusted, tested and certified versions of Cassandra ready for production environments. • Expert Support – answers and assistance from the Cassandra experts for all production needs. • Enterprise Security – supplies full protection for sensitive data. • Automatic Management Services – automates key maintenance functions to keep the database running smoothly. • OpsCenter – provides advanced management and monitoring functionality for production applications.
  • 13. DataStax Enterprise • Advanced Analytics – provides ability to run real-time and batch analytic operations on Cassandra data, as well as integrate DSE with external Hadoop deployments. • Enterprise Search – supplies built-in enterprise and distributed search capabilities on Cassandra data. • In-Memory Option – delivers all the benefits of Cassandra to in-memory computing. • Workload Isolation – allows for analytics and search functions to run separately from transactional workloads, with no need to ETL data to different systems. DataStax Enterprise: Max Edition
  • 14. Intro To Cassandra: THE Cloud Database What is Apache Cassandra? • Originally started at Facebook in 2008 • Top level Apache project since 2010 • Open source distributed database • Clusters can handle large amounts of data (PBs) • Performant at high velocity • Extremely resilient: • Across multiple data centres • No single point of failure • Continuous Availability, disaster avoidance • Enterprise Cassandra platform from DataStax
  • 15. Intro To Spark: THE Analytics Engine What is Apache Spark? • Started at UC Berkeley in 2009 • Open sourced in 2010, Apache project since 2013 • Distributed in-memory processing • Rich Scala, Java and Python APIs • Fast - 10x-100x faster than Hadoop MapReduce • 2x-5x less code than R • Batch and streaming analytics • Interactive shell (REPL) • Tightly integrated with DSE
  • 16. Spark: Daytona Gray Sort Contest October 2014 Daytona Gray benchmark tests how fast a system can sort 100 TB of data (1 trillion records) • Previous world record held by Hadoop MapReduce cluster of 2100 nodes, in 72 minutes • Spark completed the benchmark in 23 minutes on just 206 EC2 nodes. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache (3X faster using 10X fewer machines) • Spark also sorted 1 PB (10 trillion records) on 190 machines in under 4 hours. This beats previous results based on Hadoop MapReduce, which took 16 hours on 3800 machines (4X faster using 20X fewer machines)
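The ratios quoted on slide 16 can be checked with a few lines of arithmetic. This is a minimal sketch in Python that uses only the figures from the slide; the helper name `speedup` is illustrative and not part of the original deck.

```python
def speedup(old_minutes, old_nodes, new_minutes, new_nodes):
    """Return (time ratio, node ratio) of an older result versus a newer one."""
    return old_minutes / new_minutes, old_nodes / new_nodes

# 100 TB sort: Hadoop MapReduce (72 min, 2100 nodes) vs Spark (23 min, 206 nodes)
tb_time, tb_nodes = speedup(72, 2100, 23, 206)

# 1 PB sort: Hadoop MapReduce (16 h, 3800 machines) vs Spark (taken as 4 h, 190 machines)
pb_time, pb_nodes = speedup(16 * 60, 3800, 4 * 60, 190)

print(f"100 TB: {tb_time:.1f}x faster on {tb_nodes:.1f}x fewer nodes")
print(f"1 PB:   {pb_time:.1f}x faster on {pb_nodes:.1f}x fewer machines")
```

Because the slide gives the 1 PB run as "under 4 hours", the 4x figure is a lower bound on the time ratio.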
  • 17. DataStax Enterprise: Analytics Integration Cassandra Cluster Spark Cluster ETL Spark Cluster • Tight integration • Data locality • Microsecond response times X • Apache Cassandra for Distributed Persistent Storage • Integrated Apache Spark for Distributed Real-Time Analytics • Analytics nodes close to data - no ETL required X • Loose integration • Data separate from processing • Millisecond response times “Latency when transferring data is unavoidable. The trick is to reduce the latency to as close to zero as possible…”
  • 18. Intro To Parquet: Quick History What is Parquet? • Started at Twitter and Cloudera in 2013 • Databases traditionally store information in rows and are optimized for working with one record at a time • Columnar storage systems optimised to store data by column • A compressed, efficient columnar data representation • Compression schemes can be specified on a per-column level • Allows complex data to be encoded efficiently • Netflix - 7 PB of warehoused data in Parquet format • Not as compressed as ORC (Hortonworks) but faster read/analysis
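The difference between row and column layout described on slide 18 can be illustrated with a toy model. The sketch below is plain Python, not the Parquet format or any Parquet library; the sample records and the helper names `to_columns` and `run_length_encode` are invented for illustration. Run-length encoding stands in for a per-column compression scheme.

```python
from itertools import groupby

# Hypothetical sample records, stored row by row (one dict per record)
rows = [
    {"user": "a", "country": "UK", "plan": "free"},
    {"user": "b", "country": "UK", "plan": "free"},
    {"user": "c", "country": "UK", "plan": "paid"},
    {"user": "d", "country": "DE", "plan": "paid"},
]

def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}

def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

columns = to_columns(rows)

# The encoding is chosen per column: repetitive columns are run-length
# encoded, while the all-distinct "user" column is left as plain values.
encoded = {
    "user": columns["user"],
    "country": run_length_encode(columns["country"]),
    "plan": run_length_encode(columns["plan"]),
}
print(encoded)
```

Storing each column contiguously puts similar values next to each other, which is what makes a simple scheme like this effective, and a query that needs only `country` never has to touch `user` or `plan`.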
  • 19. Intro To Akka: Distributed Apps What is Akka? • Open source toolkit first released in 2009 • Simplifies the construction of highly concurrent and distributed Java apps • Makes it easier to build concurrent, fault-tolerant and scalable applications • Based on the ‘actor’ model • Highly performant event-driven programming • Hierarchical - each actor is created and supervised by its parent • Process failures treated as events handled by an actor's supervisor • Language bindings exist for both Java and Scala
  • 20. Big Data Pipelining: Static Datasets Valid data pipeline analysis methods must be: Auditable • Reproducible – essential for any science, so too for Data Science • Documented – important to understand the how and why • Controlled • Suitable for version control • Collaborative • Easily accessible
  • 21. Intro To Notebooks: Features What are Notebooks? • Drive your data analysis from the browser • Increasingly popular • Highly interactive • Tight integration with Apache Spark • Handy tools for analysts: • Reproducible visual analysis • Code in Scala, CQL, SparkSQL • Charting – pie, bar, line etc • Extensible with custom libraries
  • 23. Big Data Pipelining: Static Datasets Example architecture & requirements 1. Optimised source data format 2. Distributed in-memory analytics 3. Interactive and flexible data analysis tool 4. Persistent data store 5. Visualisation tools
  • 24. Big Data Pipelining: Pipeline Flow ADAM Notebook Notebook Datastore Visualisation Example: Genome research platform (SHAR3)
  • 25. Big Data Pipelining: Pipeline Scalability • Add more (physical or virtual) nodes as required to add capacity • Container tools can ease configuration management • Scale out quickly
  • 26. Big Data Pipelining: Pipeline Process Flow 3. Persistent data storage 2. Interactive, flexible and reproducible analysis 1. Source data 4. Visualise and analyse
  • 27. Analytics: Static Data Pipeline Process • No longer an iterative process constrained by hardware limitations • Now a more scalable, resilient, dynamic, interactive process, easily shareable Analyse The new model for large-scale static data analysis Share X Load
  • 28. Real-Time Datasets If it’s Not “Now”, Then It’s Probably Already Too Late
  • 29. Big Data Pipelining: Real-Time Analytics • Capture, prepare, and process fast streaming data • Needs different approach from traditional batch processing • Has to be at the speed of now – cannot wait even for seconds • Immediate insight & instant decisions - offers huge commercial and engineering advantages What problem are we trying to solve?
  • 30. Big Data Analytics: Streams Data tidal waves! Netflix • Ingests Petabytes of data per day • Over 1 TRILLION transactions per day (>10 million per second) into DSE Data streams? Data deluge?
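The per-second rate on slide 30 follows directly from the daily total. A quick check in Python, assuming the load is spread evenly over 24 hours (real traffic will have peaks well above this average):

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figure from the slide: over 1 trillion transactions per day
transactions_per_day = 1_000_000_000_000

per_second = transactions_per_day / SECONDS_PER_DAY
print(f"{per_second / 1e6:.1f} million transactions per second on average")
```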
  • 31. Big Data Pipelining: Real-Time Use Cases Social media • Commercial value - trending products, sentiment analysis • Reaction time is critical as the value of data quickly diminishes over time e.g. Twitter, Facebook Sensor data (IoT) • Critical safety and monitoring • Missed data could have significant safety implications • Utility billing, engineering management e.g. power plant, vehicles Examples of use cases for streaming analytics…
  • 32. Big Data Pipelining: Real-Time Use Cases Transactional data • Missed data could have huge financial implications e.g. market data • Credit card transactions, fraud detection – if it’s not now, it’s too late User Experience • Personalising the user experience • Commercial benefit to customise the user experience • Netflix, Spotify, eBay, mobile apps etc. Examples of use cases for streaming analytics…
  • 33. Big Data Pipelining: Real-Time architecture Analytics in real-time at scale demand fast processing, with low latencies Common solution is to use in-memory distributed architecture Increasingly using a technology stack comprising Kafka, Spark and Cassandra • Scalable • Distributed • Resilient Streaming analytics architecture - what do we need?
  • 34. Intro To Kafka: Quick History What is Apache Kafka? • Originally developed by LinkedIn • Open sourced since 2011 • Top level project since 2012 • Enterprise support from Confluent • Fast? - single Kafka broker handles hundreds of MB/s of reads/writes from thousands of clients • Scalable? - can be elastically and transparently expanded without downtime. Data streams are distributed over a cluster of machines • Durable? - messages persisted on disk, replicated within the cluster, prevent data loss • Powerful? - each broker can handle TBs of messages without performance impact. • Distributed? - modern cluster-centric design, strong durability and fault-tolerance
  • 35. Intro To Kafka: Architecture How Does Kafka Work? Producers send messages to the Kafka cluster, which in turn serves them up to consumers • Kafka maintains feeds of messages in categories called topics • Processes that publish messages to Kafka are called producers • Processes that subscribe to topics and process the feed are called consumers • A Kafka cluster comprises one or more servers, each called a broker • Java API, other languages supported
  • 36. Intro To Kafka: Streaming Flow How Does Kafka Work With Spark? • Publish-subscribe messaging system implemented as a replicated commit log • Messages are simply byte arrays so can store any object in any format • Each topic partition stored as a log (an ordered set of messages) • Each message in a partition is assigned a unique offset • Consumers are responsible for tracking their position in each topic log Spark consumes messages as a stream, in micro batches, saved as RDDs
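The log-and-offset model on slide 36 can be sketched without any Kafka dependency. The toy below is plain Python and does not use the Kafka or Spark APIs; the class names `PartitionLog` and `Consumer` are hypothetical. It shows byte-array messages appended to an ordered log, each receiving an offset, and a consumer that tracks its own position while reading in small batches. The fixed batch size is a simplification, since Spark Streaming forms micro batches by time interval rather than by message count.

```python
class PartitionLog:
    """Toy stand-in for one topic partition: an append-only list of byte messages."""

    def __init__(self):
        self._messages = []

    def append(self, payload: bytes) -> int:
        """Store a message and return its offset (its position in the log)."""
        self._messages.append(payload)
        return len(self._messages) - 1

    def read(self, offset: int, max_messages: int):
        """Return up to max_messages starting at offset; reading never removes data."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Toy consumer that remembers its own position in the log."""

    def __init__(self, log: PartitionLog):
        self.log = log
        self.offset = 0

    def poll(self, batch_size: int):
        """Fetch the next batch and advance this consumer's offset past it."""
        batch = self.log.read(self.offset, batch_size)
        self.offset += len(batch)
        return batch


log = PartitionLog()
offsets = [log.append(f"reading-{i}".encode()) for i in range(5)]

consumer = Consumer(log)
batches = []
while True:
    batch = consumer.poll(batch_size=2)
    if not batch:
        break
    batches.append(batch)

# A second consumer starts from offset 0 and sees every message again,
# because the position belongs to the consumer, not to the log.
replay = Consumer(log).poll(batch_size=5)
```

Keeping the position on the consumer side is what lets several independent consumers, or a restarted one, read the same partition without coordinating with each other.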
  • 37. DataStax Enterprise: Streaming Schematic Sensor Network Signal Aggregation Messaging Queue Sensor Data Queue Management Broker Broker Collection Data Processing & Storage
  • 38. DataStax Enterprise: Streaming Analytics Real-time Analytics Data Processing & Storage Near real-time, batch Analytics Analytics / BI !$£€! Personalisation Actionable insight Monitoring
  • 39. DataStax Enterprise: Multi-DC Uses DC: EUROPE DC: USA Real-time active-active geo-replication across physical datacentres OLTP: Cassandra Analytics: Cassandra + Spark Replication Workload separation via virtual datacentres
  • 40. Real-Time Analytics: DSE Multi-DC Workload Management and Separation With DSE Analytics / BI Analytics Datacentre OLTP Datacentre 100% Uptime, Global Scale OLTP Real-Time Analytics Mixed Load OLTP and Analytics Platform Replication Replication JDBC ODBC Separation of OLTP from Analytics Social Media IoT Personalisation & Persistence Personalisation !$£€! Actionable insight Monitoring App, Web
  • 41. OLTP Feed OLTP Layer 100% Uptime, Global Scale High Velocity Ingestion Layer Lambda & Big Data: DSE & Hadoop Data Stores - Active & Legacy Batch Analytics Analytics / BI • Scalable • Fault-tolerant • Fast JDBC ODBC Real-Time Analytic/Integration Layer Social Media IoT Web, App Oracle IBM SAP OLTP Feed
  • 42. OLTP Feed OLTP Feed Big Data Use Case: DSE & SAP Data Stores - Active & Legacy Hot Data Storage / Query Analytics / BI SAP/Hana Smart Data Access OLTP Layer 100% Uptime, Global Scale High Velocity Ingestion Layer Social Media IoT • Scalable • Fault-tolerant • Fast Oracle IBM SAP Real-Time Analytic/Integration Layer Web, App JDBC ODBC