Simon Ambridge
Data Pipelines With Spark & DSE
An Introduction To Building Agile, Flexible and Scalable Big Data and Data Science Pipelines
Version 0.8
Certified Apache Cassandra and DataStax enthusiast who enjoys
explaining that the traditional approaches to data management just
don’t cut it anymore in the new always on, no single point of failure,
high volume, high velocity, real time distributed data management
world.
Previously 25 years implementing Oracle relational data management
solutions. Certified in Exadata, Oracle Cloud, Oracle Essbase, Oracle
Linux and OBIEE
simon.ambridge@datastax.com
@stratman1958
Simon Ambridge
Pre-Sales Solution Engineer, DataStax UK
Introduction To Big Data Pipelines
Big, Static Data
Fast, Streaming Data
Big Data Pipelining: Classification
Big Data Pipelines can mean different things to different people
Repeated analysis on a static but massive dataset
• An element of research – e.g. genomics, clinical trial,
demographic data
• Typically repetitive, iterative, shared amongst data
scientists for analysis
Real-time analytics on streaming data
• Industrialised or commercial processes – sensors, tick
data, bioinformatics, transactional data, real-time
personalisation
• Happening in real-time, data cannot be dropped or lost
Static Datasets
All You Can Eat?
Really.
Static Data Analytics : Traditional Tools
Repeated iterations, at each stage
Run/debug cycle can be slow
Sampling Modeling Interpret Tuning Reporting
Re-sample
Typical traditional ‘static’ data analysis model
Data
Results
Static Data Analytics : Scale Up Challenges
Sampling and analysis often run on a single machine
• CPU and memory limitations
Limited sampling of a large dataset because of data size limitations
• Multiple iterations over large datasets are frequently not an ideal
approach
Static Data Analytics : Traditional Scaling
DATA (GB)
DATA (MB)
DATA (TB)
Small datasets, small servers
Large datasets, large servers
Static Data Analytics: Big Data Problems
Data is getting Really Big!
• Data volumes are getting larger!
• The number of data sources is exploding!
• More data is arriving faster!
Scaling up is becoming impractical
• Physical limits
• Data limits
• The validity of the analysis becomes obsolete, faster
Static Data Analytics : Big Data Needs
We need scalable infrastructure + distributed technologies
• Data volumes can be scaled
• Distribute the data across multiple low-cost machines
• Faster processing
• More complex processing
• No single point of failure
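As a rough illustration of "distribute the data across multiple machines" with "no single point of failure", the sketch below places keys on a hashed ring and copies each key to several nodes. It is a simplified model and not Cassandra's actual implementation; the node names, the `Ring` class and the use of MD5 are assumptions made for the example.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used only to spread keys)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical hashed ring: each key is stored on several consecutive nodes."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        # Sort nodes by token so lookups can walk the ring in order.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, key: str):
        """Return the nodes that hold a copy of `key`."""
        start = bisect.bisect_right(self.tokens, token(key)) % len(self.entries)
        return [
            self.entries[(start + i) % len(self.entries)][1]
            for i in range(self.replication_factor)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("sensor-42")
print(owners)

# If the first owner fails, the remaining replicas can still serve the key.
survivors = [n for n in owners if n != owners[0]]
print(survivors)
```

Adding a node to the list spreads the keys over more machines, which is the scale-out idea the slide describes.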
Static Data Analytics : DSE Delivers
Building a distributed data processing framework can be a complex task!
It needs to be:
• Scalable
• Fast in-memory processing
• Replicated for resiliency
• Batch and real-time data feeds
• Ad-hoc queries
DataStax delivers an integrated analytics platform
Cassandra: THE Web, IoT & Cloud Database
What is Apache Cassandra?
• Very fast
• Extremely resilient
• Across multiple data centres
• No single point of failure
• Continuous Availability, Disaster Avoidance
• Linear scale
• Easy to operate
Enterprise Cassandra platform from DataStax
DataStax
Enterprise
DataStax Enterprise: Editions
DataStax Enterprise Standard
• DSE Standard is DataStax’s entry level
commercial database offering
• Represents the minimum recommended to
deploy Cassandra in a production environment
DataStax Enterprise Max
• DSE Max is DataStax’s advanced commercial
database offering
• Designed for production Cassandra
environments that have mixed workload
requirements
Spark: THE Analytics Engine
What is Apache Spark?
• Distributed in-memory analytic processing
• Batch and streaming analytics
• Fast - up to 10x-100x faster than Hadoop MapReduce, depending on the workload
• Rich Scala, Java and Python APIs
Tightly integrated with DSE
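The idea behind distributed in-memory processing can be sketched in plain Python: split the data into partitions, process each partition independently, then merge the partial results. This is only a single-process illustration and does not use the Spark API; the function names and sample lines are assumptions for the example.

```python
from collections import Counter
from functools import reduce


def partition(records, n):
    """Split records into n partitions (round-robin)."""
    return [records[i::n] for i in range(n)]


def count_words(lines):
    """Per-partition work: could run on a separate machine for each partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


lines = [
    "spark reads from cassandra",
    "kafka writes spark reads",
    "cassandra stores results",
]

# "Map" step: one partial result per partition, all held in memory.
partials = [count_words(p) for p in partition(lines, 2)]

# "Reduce" step: merge the partial counts into one result.
totals = reduce(lambda a, b: a + b, partials, Counter())
print(totals)
```

In a real cluster the per-partition step runs in parallel on different nodes, which is where the speed-up comes from.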
Spark: Daytona Gray Sort Contest
Daytona GraySort benchmark - tests how fast a system can sort 100 TB
of data (1 trillion records)
• Previous world record held by Hadoop MapReduce cluster of 2100
nodes, in 72 minutes
• 2014: Spark completed the benchmark in 23 minutes on just 206 EC2
nodes = 3X faster using 10X fewer machines
• Spark sorted 1 PB (10 trillion records) on 190 machines in < 4 hours.
Previous Hadoop MapReduce time of 16 hours on 3800 machines = 4X
faster using 20X fewer machines
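The "3X faster using 10X fewer machines" and "4X faster using 20X fewer machines" figures follow directly from the numbers quoted above. A small check, using only those numbers (the 1 PB run is quoted as under 4 hours, so 4 hours is taken here as an upper bound):

```python
def speedup(old_minutes, new_minutes):
    """How many times faster the new run was."""
    return old_minutes / new_minutes


def machine_ratio(old_nodes, new_nodes):
    """How many times fewer machines the new run used."""
    return old_nodes / new_nodes


# 100 TB sort: 72 minutes on 2100 nodes vs 23 minutes on 206 nodes.
sort_100tb = (speedup(72, 23), machine_ratio(2100, 206))

# 1 PB sort: 16 hours on 3800 machines vs (at most) 4 hours on 190 machines.
sort_1pb = (speedup(16 * 60, 4 * 60), machine_ratio(3800, 190))

print(f"100 TB: {sort_100tb[0]:.1f}x faster, {sort_100tb[1]:.1f}x fewer machines")
print(f"1 PB:   {sort_1pb[0]:.1f}x faster, {sort_1pb[1]:.1f}x fewer machines")
```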
DataStax Enterprise: Analytics Integration
Cassandra Cluster
Spark Cluster
ETL
Spark Cluster
• Tight integration
• Data locality
• Microsecond response times
• Apache Cassandra for Distributed Persistent Storage
• Integrated Apache Spark for Distributed Real-Time Analytics
• Analytics nodes close to data - no ETL required
• Loose integration
• Data separate from processing
• Millisecond response times
“Latency when transferring data is unavoidable. The trick is to reduce
the latency to as close to zero as possible…”
Static Data Analytics : Requirements
Valid data pipeline analysis methods must be:
Auditable
• Reproducible
• Documented
Controlled
• Version control
Collaborative
• Accessible
Notebooks: Features
What are Notebooks?
• Drive your data analysis from the browser
• Highly interactive
• Tight integration with Apache Spark
• Handy tools for analysts:
• Reproducible visual analysis
• Code in Scala, CQL, SparkSQL, Python
• Charting – pie, bar, line etc
• Extensible with custom libraries
Example: Spark Notebook
Cells
Markdown
Output
Controls
Static Data Analytics : Approach
Example architecture & requirements
1. Optimised source data format
2. Distributed in-memory analytics
3. Interactive and flexible data analysis tool
4. Persistent data store
5. Visualisation tools
Static Data Analytics : Example
ADAM
Notebook Persistent Storage
OLTP Database Visualisation
Genome research platform - ADST (Agile Data Science Toolkit)
Static Data Analytics : Pipeline Process Flow
3. Persistent data storage
2. Interactive, flexible and reproducible analysis
1. Source data
4. Visualise and analyse
Static Data Analytics : Pipeline Scalability
• Add more (physical or virtual) nodes as
required to add capacity
• Container tools ease configuration
management and deployment
• Scale out quickly
Static Data Analytics : Now
• No longer an iterative process constrained by hardware limitations
• Now a more scalable, resilient, dynamic, interactive process, easily shareable
Analyse
The new model for large-scale static data analytics
Share
X
Load
SCALE & DISTRIBUTE PROCESSING
Real-Time Datasets
If it’s Not “Now”, Then It’s Probably Already Too Late
Big Data Pipelining: Why Real-Time?
• React to customers faster and with more accuracy
• Reduce risk through more accurate understanding of the market
• Optimise return on marketing investment
• Faster time to market
• Improve efficiency
In a highly connected world
In most cases, ‘real-time’ means data changing at intervals of less than 1 s
Big Data Pipelining: Real-Time Analytics
• Capture, prepare, and process fast streaming data
• Different approach from traditional batch processing
• The speed of now – cannot wait
• Immediate insight, instant decisions
What problem are we trying to solve?
Big Data Pipelining: Real-Time Use Cases
Sensor data (IoT)
Transactional data
User Experience
Social media
Use cases for streaming analytics
Big Data Analytics: Streams
Data tidal waves!
Netflix
• Ingests Petabytes of data per day
• Over 1 TRILLION transactions per day (>10 m per second) into DSE
Data streams?
Data torrent?
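The per-second figure quoted for Netflix can be checked from the daily total. This is plain arithmetic on the numbers above, assuming the load is spread evenly over the day:

```python
SECONDS_PER_DAY = 24 * 60 * 60

transactions_per_day = 1_000_000_000_000  # "over 1 trillion", taken as a lower bound
per_second = transactions_per_day / SECONDS_PER_DAY

print(f"{per_second / 1e6:.1f} million transactions per second")
```

Real traffic is bursty, so peak rates would be higher than this average.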
Big Data Pipelining: Real-Time architecture
Analytics in real-time, at scale
Fast processing, distributed, in-memory
Increasingly using a technology stack comprising Kafka, Spark and Cassandra
• Scalable
• Distributed
• Resilient
Streaming analytics architecture - what do we need?
Kafka: Architecture
How Does Kafka Work?
Kafka decouples producers and consumers in data pipelines
‘Producers’ send messages to the Kafka cluster, which in turn serves them up to
‘Consumers’
• Kafka maintains feeds of messages in categories called topics
• A Kafka cluster comprises one or more servers, each called a broker
Producer
Producer
Producer
Consumer
Consumer
Consumer
Kafka
Cluster
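The decoupling described above can be sketched with a toy in-memory broker: producers only append to a topic, and each consumer keeps its own read position, so neither side needs to know about the other. This is not the Kafka client API; the `Broker` and `Consumer` classes and the topic names are assumptions for illustration.

```python
from collections import defaultdict


class Broker:
    """Toy stand-in for a message broker: one append-only list per topic."""

    def __init__(self):
        self.topics = defaultdict(list)

    def send(self, topic, message: bytes):
        self.topics[topic].append(message)

    def read(self, topic, offset):
        """Return messages at or after `offset`; reading does not remove them."""
        return self.topics[topic][offset:]


class Consumer:
    """Each consumer tracks its own position, independently of the producers."""

    def __init__(self, broker, topic):
        self.broker, self.topic, self.offset = broker, topic, 0

    def poll(self):
        messages = self.broker.read(self.topic, self.offset)
        self.offset += len(messages)
        return messages


broker = Broker()
broker.send("temperature", b"21.5")
broker.send("rainfall", b"0.2")
broker.send("temperature", b"22.1")

dashboard = Consumer(broker, "temperature")
archiver = Consumer(broker, "temperature")

first = dashboard.poll()    # both temperature readings
second = archiver.poll()    # the same readings, read independently
third = dashboard.poll()    # nothing new since the last poll
print(first, second, third)
```

Because messages stay in the log after being read, a new consumer can be added later without any change to the producers.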
Kafka: Streaming With Spark
Kafka writes, Spark reads
• Topics can have multiple partitions
• Each topic partition stored as a log (an ordered set of messages)
• Messages are simply byte arrays, so can store any object in any format
• Each message in a partition is assigned a unique offset
Spark consumes messages as a stream, in micro-batches, saved as RDDs
1 2 3 4 5 6 7 8
Partition 0
1 2 3 4 5 6 7 8
Partition 1
1 2 3 4 5 6
Partition 0
Temperature Topic
Rainfall Topic
Temperature Consumer
Rainfall Consumer
Temperature Consumer
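The partition, offset and micro-batch ideas above can be sketched as follows. Each partition is an ordered list, the offset is a message's index in that list, and a reader takes the log in small slices. This is a simplified model, not Kafka or Spark Streaming code; `PartitionedTopic`, `micro_batches` and the CRC32 routing are assumptions for the example (Kafka uses its own partitioner).

```python
import zlib


class PartitionedTopic:
    """Hypothetical topic made of several ordered partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key: str, value: bytes):
        """Route by key so one key always lands in the same partition."""
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)


def micro_batches(log, batch_size, start_offset=0):
    """Yield (first_offset, messages) slices of one partition log, in order."""
    for first in range(start_offset, len(log), batch_size):
        yield first, log[first:first + batch_size]


topic = PartitionedTopic(num_partitions=2)

# Messages are plain bytes; here five readings from one (made-up) station.
placements = [topic.append("station-7", f"{20 + i}".encode()) for i in range(5)]
partition_id = placements[0][0]

batches = list(micro_batches(topic.partitions[partition_id], batch_size=2))
print(placements)
print(batches)
```

A consumer that remembers the last offset it processed can resume from `start_offset` after a restart without losing or re-reading messages.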
DataStax Enterprise: Streaming Schematic
Sensor
Network
Signal
Aggregation
Services
Messaging Queue
Sensor Data Queue
Management
Broker
Broker
Collection
Service
Data Storage
OLTP Persistence Layer
Streaming Data
Ingest
DataStax Enterprise: Streaming Analytics
Real-time
Analytics
Persistent Storage
OLTP Database
!$£€!
Personalisation
Actionable insight Monitoring
Web / Analytics / BI
DataStax Enterprise: Multi-DC Uses
DC: EUROPE
DC: USA
Real-time active-active geo-replication
across physical datacentres
OLTP: Cassandra
Analytics: Cassandra + Spark
Replication
Workload separation via virtual datacentres
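In Cassandra, replication per datacentre is declared on the keyspace using `NetworkTopologyStrategy`. The sketch below only builds the CQL text; it does not connect to a cluster. The keyspace name `sensor_data`, the datacentre names `OLTP` and `Analytics`, and the replica counts are assumptions for illustration and must match the names configured in a real cluster.

```python
def create_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CREATE KEYSPACE statement with per-datacentre replica counts."""
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += [f"'{dc}': {count}" for dc, count in replicas_per_dc.items()]
    body = ", ".join(options)
    return f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = {{{body}}};"


# Hypothetical layout: three replicas for OLTP traffic, two for analytics.
statement = create_keyspace_cql("sensor_data", {"OLTP": 3, "Analytics": 2})
print(statement)
```

With a layout like this, writes arriving in the OLTP datacentre are replicated to the analytics datacentre, so Spark jobs can read the same data without competing with transactional traffic.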
Real-Time Analytics: DSE Multi-DC
Workload Management and Separation With DSE
Analytics / BI
Analytics
Datacentre
OLTP
Datacentre
100% Uptime, Global Scale
OLTP
Real-Time Analytics
Mixed Load OLTP and Analytics Platform
Replication
Replication
JDBC
ODBC
Separation of OLTP
from Analytics
Social Media
IoT
Personalisation & Persistence
Personalisation
!$£€!
Actionable insight
Monitoring
App, Web
DSE & Analytics : Summary
Static, Massive Data
Scalable Data Pipelines
1. Optimised data storage formats
2. Scalable, distributed technologies
3. Flexible and interactive analysis tools
4. Resilient, persistent Storage
Real-Time Streaming Data
Scalable Data Pipelines
1. Scalable, distributed technologies
2. De-coupled Producers and Consumers
3. Real-Time analytics
4. Resilient, persistent Storage
Spark
Mesos
Akka
Cassandra
Kafka
Thank you!

Contenu connexe

Tendances

Webinar | Introducing DataStax Enterprise 4.6
Webinar | Introducing DataStax Enterprise 4.6Webinar | Introducing DataStax Enterprise 4.6
Webinar | Introducing DataStax Enterprise 4.6DataStax
 
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd KnownCassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd KnownDataStax
 
Transforms Document Management at Scale with Distributed Database Solution wi...
Transforms Document Management at Scale with Distributed Database Solution wi...Transforms Document Management at Scale with Distributed Database Solution wi...
Transforms Document Management at Scale with Distributed Database Solution wi...DataStax Academy
 
From PoCs to Production
From PoCs to ProductionFrom PoCs to Production
From PoCs to ProductionDataStax
 
Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop...
Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop...Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop...
Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop...DataStax
 
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayCassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayDataStax Academy
 
Cisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStackCisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStackDataStax Academy
 
Making Every Drop Count: How i20 Addresses the Water Crisis with the IoT and ...
Making Every Drop Count: How i20 Addresses the Water Crisis with the IoT and ...Making Every Drop Count: How i20 Addresses the Water Crisis with the IoT and ...
Making Every Drop Count: How i20 Addresses the Water Crisis with the IoT and ...DataStax
 
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...DataStax
 
Proofpoint: Fraud Detection and Security on Social Media
Proofpoint: Fraud Detection and Security on Social MediaProofpoint: Fraud Detection and Security on Social Media
Proofpoint: Fraud Detection and Security on Social MediaDataStax Academy
 
Cold Storage That Isn't Glacial (Joshua Hollander, Protectwise) | Cassandra S...
Cold Storage That Isn't Glacial (Joshua Hollander, Protectwise) | Cassandra S...Cold Storage That Isn't Glacial (Joshua Hollander, Protectwise) | Cassandra S...
Cold Storage That Isn't Glacial (Joshua Hollander, Protectwise) | Cassandra S...DataStax
 
Oracle to Cassandra Core Concepts Guid Part 1: A new hope
Oracle to Cassandra Core Concepts Guid Part 1: A new hopeOracle to Cassandra Core Concepts Guid Part 1: A new hope
Oracle to Cassandra Core Concepts Guid Part 1: A new hopeDataStax
 
Building a Digital Bank
Building a Digital BankBuilding a Digital Bank
Building a Digital BankDataStax
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...DataStax
 
Battery Ventures: Simulating and Visualizing Large Scale Cassandra Deployments
Battery Ventures: Simulating and Visualizing Large Scale Cassandra DeploymentsBattery Ventures: Simulating and Visualizing Large Scale Cassandra Deployments
Battery Ventures: Simulating and Visualizing Large Scale Cassandra DeploymentsDataStax Academy
 
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Data Con LA
 
The Last Pickle: Distributed Tracing from Application to Database
The Last Pickle: Distributed Tracing from Application to DatabaseThe Last Pickle: Distributed Tracing from Application to Database
The Last Pickle: Distributed Tracing from Application to DatabaseDataStax Academy
 
Webinar: DataStax Training - Everything you need to become a Cassandra Rockstar
Webinar: DataStax Training - Everything you need to become a Cassandra RockstarWebinar: DataStax Training - Everything you need to become a Cassandra Rockstar
Webinar: DataStax Training - Everything you need to become a Cassandra RockstarDataStax
 
Feeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaFeeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaDataStax Academy
 
Backup multi-cloud solution based on named pipes
Backup multi-cloud solution based on named pipesBackup multi-cloud solution based on named pipes
Backup multi-cloud solution based on named pipesLeandro Totino Pereira
 

Tendances (20)

Webinar | Introducing DataStax Enterprise 4.6
Webinar | Introducing DataStax Enterprise 4.6Webinar | Introducing DataStax Enterprise 4.6
Webinar | Introducing DataStax Enterprise 4.6
 
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd KnownCassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
 
Transforms Document Management at Scale with Distributed Database Solution wi...
Transforms Document Management at Scale with Distributed Database Solution wi...Transforms Document Management at Scale with Distributed Database Solution wi...
Transforms Document Management at Scale with Distributed Database Solution wi...
 
From PoCs to Production
From PoCs to ProductionFrom PoCs to Production
From PoCs to Production
 
Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop...
Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop...Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop...
Webinar | How Clear Capital Delivers Always-on Appraisals on 122 Million Prop...
 
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at EbayCassandra Summit 2014: Apache Cassandra Best Practices at Ebay
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
 
Cisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStackCisco: Cassandra adoption on Cisco UCS & OpenStack
Cisco: Cassandra adoption on Cisco UCS & OpenStack
 
Making Every Drop Count: How i20 Addresses the Water Crisis with the IoT and ...
Making Every Drop Count: How i20 Addresses the Water Crisis with the IoT and ...Making Every Drop Count: How i20 Addresses the Water Crisis with the IoT and ...
Making Every Drop Count: How i20 Addresses the Water Crisis with the IoT and ...
 
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
Cassandra on Google Cloud Platform (Ravi Madasu, Google / Ben Lackey, DataSta...
 
Proofpoint: Fraud Detection and Security on Social Media
Proofpoint: Fraud Detection and Security on Social MediaProofpoint: Fraud Detection and Security on Social Media
Proofpoint: Fraud Detection and Security on Social Media
 
Cold Storage That Isn't Glacial (Joshua Hollander, Protectwise) | Cassandra S...
Cold Storage That Isn't Glacial (Joshua Hollander, Protectwise) | Cassandra S...Cold Storage That Isn't Glacial (Joshua Hollander, Protectwise) | Cassandra S...
Cold Storage That Isn't Glacial (Joshua Hollander, Protectwise) | Cassandra S...
 
Oracle to Cassandra Core Concepts Guid Part 1: A new hope
Oracle to Cassandra Core Concepts Guid Part 1: A new hopeOracle to Cassandra Core Concepts Guid Part 1: A new hope
Oracle to Cassandra Core Concepts Guid Part 1: A new hope
 
Building a Digital Bank
Building a Digital BankBuilding a Digital Bank
Building a Digital Bank
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
 
Battery Ventures: Simulating and Visualizing Large Scale Cassandra Deployments
Battery Ventures: Simulating and Visualizing Large Scale Cassandra DeploymentsBattery Ventures: Simulating and Visualizing Large Scale Cassandra Deployments
Battery Ventures: Simulating and Visualizing Large Scale Cassandra Deployments
 
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
 
The Last Pickle: Distributed Tracing from Application to Database
The Last Pickle: Distributed Tracing from Application to DatabaseThe Last Pickle: Distributed Tracing from Application to Database
The Last Pickle: Distributed Tracing from Application to Database
 
Webinar: DataStax Training - Everything you need to become a Cassandra Rockstar
Webinar: DataStax Training - Everything you need to become a Cassandra RockstarWebinar: DataStax Training - Everything you need to become a Cassandra Rockstar
Webinar: DataStax Training - Everything you need to become a Cassandra Rockstar
 
Feeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and KafkaFeeding Cassandra with Spark-Streaming and Kafka
Feeding Cassandra with Spark-Streaming and Kafka
 
Backup multi-cloud solution based on named pipes
Backup multi-cloud solution based on named pipesBackup multi-cloud solution based on named pipes
Backup multi-cloud solution based on named pipes
 

En vedette

Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
Graph Data Modeling in DataStax Enterprise
Graph Data Modeling in DataStax EnterpriseGraph Data Modeling in DataStax Enterprise
Graph Data Modeling in DataStax EnterpriseArtem Chebotko
 
Performance Pipeline Key Recommendations
Performance Pipeline Key RecommendationsPerformance Pipeline Key Recommendations
Performance Pipeline Key RecommendationsIllinois workNet
 
Pipeline Performance Analysis: Cohort Model
Pipeline Performance Analysis: Cohort Model Pipeline Performance Analysis: Cohort Model
Pipeline Performance Analysis: Cohort Model Illinois workNet
 
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...Amazon Web Services
 
Big Data-Driven Applications with Cassandra and Spark
Big Data-Driven Applications  with Cassandra and SparkBig Data-Driven Applications  with Cassandra and Spark
Big Data-Driven Applications with Cassandra and SparkArtem Chebotko
 
DataStax | Graph Data Modeling in DataStax Enterprise (Artem Chebotko) | Cass...
DataStax | Graph Data Modeling in DataStax Enterprise (Artem Chebotko) | Cass...DataStax | Graph Data Modeling in DataStax Enterprise (Artem Chebotko) | Cass...
DataStax | Graph Data Modeling in DataStax Enterprise (Artem Chebotko) | Cass...DataStax
 
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...DataStax
 
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...DataStax
 
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...DataStax
 
Spark Workflow Management
Spark Workflow ManagementSpark Workflow Management
Spark Workflow ManagementRomi Kuntsman
 
Building Scalable Big Data Pipelines
Building Scalable Big Data PipelinesBuilding Scalable Big Data Pipelines
Building Scalable Big Data PipelinesChristian Gügi
 
Data integration with Apache Kafka
Data integration with Apache KafkaData integration with Apache Kafka
Data integration with Apache Kafkaconfluent
 
Cassandra: One (is the loneliest number)
Cassandra: One (is the loneliest number)Cassandra: One (is the loneliest number)
Cassandra: One (is the loneliest number)DataStax Academy
 
Successful Software Development with Apache Cassandra
Successful Software Development with Apache CassandraSuccessful Software Development with Apache Cassandra
Successful Software Development with Apache CassandraDataStax Academy
 
Getting Started with Graph Databases
Getting Started with Graph DatabasesGetting Started with Graph Databases
Getting Started with Graph DatabasesDataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 

En vedette (20)

Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Graph Data Modeling in DataStax Enterprise
Graph Data Modeling in DataStax EnterpriseGraph Data Modeling in DataStax Enterprise
Graph Data Modeling in DataStax Enterprise
 
Performance Pipeline Key Recommendations
Performance Pipeline Key RecommendationsPerformance Pipeline Key Recommendations
Performance Pipeline Key Recommendations
 
Pipeline Performance Analysis: Cohort Model
Pipeline Performance Analysis: Cohort Model Pipeline Performance Analysis: Cohort Model
Pipeline Performance Analysis: Cohort Model
 
Kafka aws
Kafka awsKafka aws
Kafka aws
 
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
Big Data Integration & Analytics Data Flows with AWS Data Pipeline (BDT207) |...
 
Big Data-Driven Applications with Cassandra and Spark
Big Data-Driven Applications  with Cassandra and SparkBig Data-Driven Applications  with Cassandra and Spark
Big Data-Driven Applications with Cassandra and Spark
 
DataStax | Graph Data Modeling in DataStax Enterprise (Artem Chebotko) | Cass...
DataStax | Graph Data Modeling in DataStax Enterprise (Artem Chebotko) | Cass...DataStax | Graph Data Modeling in DataStax Enterprise (Artem Chebotko) | Cass...
DataStax | Graph Data Modeling in DataStax Enterprise (Artem Chebotko) | Cass...
 
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...
DataStax | Network Analysis Adventure with DSE Graph, DataStax Studio, and Ti...
 
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
DataStax | Building a Spark Streaming App with DSE File System (Rocco Varela)...
 
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...
 
Spark Workflow Management
Spark Workflow ManagementSpark Workflow Management
Spark Workflow Management
 
Building Scalable Big Data Pipelines
Building Scalable Big Data PipelinesBuilding Scalable Big Data Pipelines
Building Scalable Big Data Pipelines
 
Data integration with Apache Kafka
Data integration with Apache KafkaData integration with Apache Kafka
Data integration with Apache Kafka
 
Cassandra: One (is the loneliest number)
Cassandra: One (is the loneliest number)Cassandra: One (is the loneliest number)
Cassandra: One (is the loneliest number)
 
Successful Software Development with Apache Cassandra
Successful Software Development with Apache CassandraSuccessful Software Development with Apache Cassandra
Successful Software Development with Apache Cassandra
 
Getting Started with Graph Databases
Getting Started with Graph DatabasesGetting Started with Graph Databases
Getting Started with Graph Databases
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 

Similaire à Data Pipelines with Spark & DataStax Enterprise

20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3Simon Ambridge
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Anubhav Kale
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Fwdays
 
Sa introduction to big data pipelining with cassandra &amp; spark west mins...
Data Pipelines with Spark & DataStax Enterprise

  • 1. Simon Ambridge Data Pipelines With Spark & DSE An Introduction To Building Agile, Flexible and Scalable Big Data and Data Science Pipelines Version 0.8
  • 2. Certified Apache Cassandra and DataStax enthusiast who enjoys explaining that the traditional approaches to data management just don’t cut it anymore in the new always-on, no single point of failure, high volume, high velocity, real-time distributed data management world. Previously 25 years implementing Oracle relational data management solutions. Certified in Exadata, Oracle Cloud, Oracle Essbase, Oracle Linux and OBIEE. simon.ambridge@datastax.com @stratman1958 Simon Ambridge, Pre-Sales Solution Engineer, DataStax UK
  • 3. Introduction To Big Data Pipelines
  • 4. Big, Static Data Fast, Streaming Data Big Data Pipelining: Classification Big Data Pipelines can mean different things to different people Repeated analysis on a static but massive dataset • An element of research – e.g. genomics, clinical trial, demographic data • Typically repetitive, iterative, shared amongst data scientists for analysis Real-time analytics on streaming data • Industrialised or commercial processes – sensors, tick data, bioinformatics, transactional data, real-time personalisation • Happening in real-time, data cannot be dropped or lost
  • 5. Static Datasets All You Can Eat? Really.
  • 6. Static Data Analytics : Traditional Tools Repeated iterations, at each stage Run/debug cycle can be slow Sampling Modeling Interpret Tuning Reporting Re-sample Typical traditional ‘static’ data analysis model Data Results
  • 7. Static Data Analytics : Scale Up Challenges Sampling and analysis often run on a single machine • CPU and memory limitations Limited sampling of a large dataset because of data size limitations • Multiple iterations over large datasets are frequently not an ideal approach
  • 8. Static Data Analytics : Traditional Scaling DATA (GB) DATA (MB) DATA (TB) Small datasets, small servers Large datasets, large servers
  • 9. Static Data Analytics: Big Data Problems Data is getting Really Big! • Data volumes are getting larger! • The number of data sources is exploding! • More data is arriving faster! Scaling up is becoming impractical • Physical limits • Data limits • The validity of the analysis becomes obsolete, faster
  • 10. Static Data Analytics : Big Data Needs We need scalable infrastructure + distributed technologies • Data volumes can be scaled • Distribute the data across multiple low-cost machines • Faster processing • More complex processing • No single point of failure
  • 11. Static Data Analytics : DSE Delivers Building a distributed data processing framework can be a complex task! It needs to be: • Scalable • Fast in-memory processing • Replicated for resiliency • Batch and real-time data feeds • Ad-hoc queries DataStax delivers an integrated analytics platform
  • 12. Cassandra: THE Web, IoT & Cloud Database What is Apache Cassandra? • Very fast • Extremely resilient • Across multiple data centres • No single point of failure • Continuous Availability, Disaster Avoidance • Linear scale • Easy to operate Enterprise Cassandra platform from DataStax
  • 13. DataStax Enterprise DataStax Enterprise: Editions DataStax Enterprise Standard • DSE Standard is DataStax’s entry level commercial database offering • Represents the minimum recommended to deploy Cassandra in a production environment DataStax Enterprise Max • DSE Max is DataStax’s advanced commercial database offering • Designed for production Cassandra environments that have mixed workload requirements
  • 14. Spark: THE Analytics Engine What is Apache Spark? • Distributed in-memory analytic processing • Batch and streaming analytics • Fast - 10x-100x faster than Hadoop MapReduce • Rich Scala, Java and Python APIs Tightly integrated with DSE
  • 15. Spark: Daytona Gray Sort Contest Daytona Gray benchmark - tests how fast a system can sort 100 TB of data (1 trillion records) • Previous world record held by Hadoop MapReduce cluster of 2100 nodes, in 72 minutes • 2014: Spark completed the benchmark in 23 minutes on just 206 EC2 nodes = 3X faster using 10X fewer machines • Spark sorted 1 PB (10 trillion records) on 190 machines in < 4 hours. Previous Hadoop MapReduce time of 16 hours on 3800 machines = 4X faster using 20X fewer machines
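The "3X faster using 10X fewer machines" and "4X faster using 20X fewer machines" claims follow directly from the figures quoted on the slide. A small worked check, using only those numbers (the helper names are illustrative, and the 1 PB Spark run is taken as exactly 4 hours, so its speedup is a lower bound):

```python
def speedup(old_minutes, new_minutes):
    """How many times faster the new run is than the old one."""
    return old_minutes / new_minutes


def machine_ratio(old_nodes, new_nodes):
    """How many times fewer machines the new run used."""
    return old_nodes / new_nodes


# 100 TB sort: Hadoop MapReduce 72 min on 2100 nodes vs Spark 23 min on 206 nodes.
sort_100tb = (speedup(72, 23), machine_ratio(2100, 206))

# 1 PB sort: Hadoop MapReduce 16 h on 3800 machines vs Spark < 4 h on 190 machines.
sort_1pb = (speedup(16 * 60, 4 * 60), machine_ratio(3800, 190))
```

The 100 TB figures work out to roughly 3.1 and 10.2, which the slide rounds to 3X and 10X; the 1 PB figures give at least 4X on exactly 20X fewer machines.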
  • 16. DataStax Enterprise: Analytics Integration Cassandra Cluster Spark Cluster ETL Spark Cluster • Tight integration • Data locality • Microsecond response times X • Apache Cassandra for Distributed Persistent Storage • Integrated Apache Spark for Distributed Real-Time Analytics • Analytics nodes close to data - no ETL required X • Loose integration • Data separate from processing • Millisecond response times “Latency when transferring data is unavoidable. The trick is to reduce the latency to as close to zero as possible…”
  • 17. Static Data Analytics : Requirements Valid data pipeline analysis methods must be: Auditable • Reproducible • Documented Controlled • Version control Collaborative • Accessible
  • 18. Notebooks: Features What are Notebooks? • Drive your data analysis from the browser • Highly interactive • Tight integration with Apache Spark • Handy tools for analysts: • Reproducible visual analysis • Code in Scala, CQL, SparkSQL, Python • Charting – pie, bar, line etc • Extensible with custom libraries
  • 20. Static Data Analytics : Approach Example architecture & requirements 1. Optimised source data format 2. Distributed in-memory analytics 3. Interactive and flexible data analysis tool 4. Persistent data store 5. Visualisation tools
  • 21. Static Data Analytics : Example ADAM Notebook Persistent Storage OLTP Database Visualisation Genome research platform - ADST (Agile Data Science Toolkit)
  • 22. Static Data Analytics : Pipeline Process Flow 3. Persistent data storage 2. Interactive, flexible and reproducible analysis 1. Source data 4. Visualise and analyse
  • 23. Static Data Analytics : Pipeline Scalability • Add more (physical or virtual) nodes as required to add capacity • Container tools ease configuration management and deployment • Scale out quickly
  • 24. Static Data Analytics : Now • No longer an iterative process constrained by hardware limitations • Now a more scalable, resilient, dynamic, interactive process, easily shareable Analyse The new model for large-scale static data analytics Share X Load SCALE & DISTRIBUTE PROCESSING
  • 25. Real-Time Datasets If it’s Not “Now”, Then It’s Probably Already Too Late
  • 26. Big Data Pipelining: Why Real-Time? • React to customers faster and with more accuracy • Reduce risk through more accurate understanding of the market • Optimise return on marketing investment • Faster time to market • Improve efficiency In a highly connected world In most cases, ‘real-time’ means data changing at <1s intervals
  • 27. Big Data Pipelining: Real-Time Analytics • Capture, prepare, and process fast streaming data • Different approach from traditional batch processing • The speed of now – cannot wait • Immediate insight, instant decisions What problem are we trying to solve?
  • 28. Big Data Pipelining: Real-Time Use Cases Sensor data (IoT) Transactional data User Experience Social media Use cases for streaming analytics
  • 29. Big Data Analytics: Streams Data tidal waves! Netflix • Ingests Petabytes of data per day • Over 1 TRILLION transactions per day (>10 million per second) into DSE Data streams? Data torrent?
  • 30. Big Data Pipelining: Real-Time architecture Analytics in real-time, at scale Fast processing, distributed, in-memory Increasingly using a technology stack comprising Kafka, Spark and Cassandra • Scalable • Distributed • Resilient Streaming analytics architecture - what do we need?
  • 31. Kafka: Architecture How Does Kafka Work? Kafka “De-couples” producers and consumers in data pipelines ‘Producers’ send messages to the Kafka cluster, which in turn serves them up to ‘Consumers’ • Kafka maintains feeds of messages in categories called topics • A Kafka cluster comprises one or more servers, each called a broker Producer Producer Producer Consumer Consumer Consumer Kafka Cluster
  • 32. Kafka: Streaming With Spark Kafka writes, Spark reads • Topics can have multiple partitions • Each topic partition stored as a log (an ordered set of messages) • Messages are simply byte arrays, so can store any object in any format • Each message in a partition is assigned a unique offset Spark consumes messages as a stream, in micro batches, saved as RDDs [Diagram: a Temperature Topic with Partition 0 and Partition 1 and a Rainfall Topic with Partition 0, each shown as an ordered log of numbered messages, read by Temperature and Rainfall Consumers]
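The partition, offset and micro-batch ideas on this slide can be modelled in memory without any Kafka or Spark dependency. The sketch below is an illustration only: `Topic` and `MicroBatchConsumer` are hypothetical names, not Kafka or Spark APIs, offsets are zero-based here, and each micro batch is a plain Python list standing in for an RDD.

```python
class Topic:
    """A topic made of partitions, each an append-only ordered log of byte messages."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, payload):
        """Append a byte-array message and return its offset within the partition."""
        log = self.partitions[partition]
        log.append(payload)
        return len(log) - 1

    def read(self, partition, start_offset, max_messages):
        """Return up to `max_messages` messages starting at `start_offset`."""
        log = self.partitions[partition]
        return log[start_offset:start_offset + max_messages]


class MicroBatchConsumer:
    """Reads a topic in small batches, remembering the next offset per partition."""

    def __init__(self, topic, batch_size):
        self.topic = topic
        self.batch_size = batch_size
        self.next_offset = [0] * len(topic.partitions)

    def poll(self):
        """Return one micro batch of (partition, offset, payload) tuples."""
        batch = []
        for p in range(len(self.topic.partitions)):
            start = self.next_offset[p]
            messages = self.topic.read(p, start, self.batch_size)
            batch.extend((p, start + i, m) for i, m in enumerate(messages))
            # Advance past what was read so the next poll does not repeat it.
            self.next_offset[p] = start + len(messages)
        return batch


# A producer writes temperature readings; the consumer reads at its own pace.
temperature = Topic("temperature", num_partitions=2)
for reading in [b"21.5", b"21.7", b"22.0"]:
    temperature.append(0, reading)
temperature.append(1, b"19.9")

consumer = MicroBatchConsumer(temperature, batch_size=2)
first_batch = consumer.poll()
second_batch = consumer.poll()
```

Because the consumer tracks its own offsets, the producer never needs to know who is reading or how fast, which is the de-coupling the Kafka slides describe.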
  • 33. DataStax Enterprise: Streaming Schematic Sensor Network Signal Aggregation Services Messaging Queue Sensor Data Queue Management Broker Broker Collection Service Data Storage OLTP Persistence Layer Streaming Data Ingest
  • 34. DataStax Enterprise: Streaming Analytics Real-time Analytics Persistent Storage OLTP Database !$£€! Personalisation Actionable insight Monitoring Web / Analytics / BI
  • 35. DataStax Enterprise: Multi-DC Uses DC: EUROPE DC: USA Real-time active-active geo-replication across physical datacentres OLTP: Cassandra Analytics: Cassandra + Spark Replication Replication Workload separation via virtual datacentres
  • 36. Real-Time Analytics: DSE Multi-DC Workload Management and Separation With DSE Analytics / BI Analytics Datacentre OLTP Datacentre 100% Uptime, Global Scale OLTP Real-Time Analytics Mixed Load OLTP and Analytics Platform Replication Replication JDBC ODBC Separation of OLTP from Analytics Social Media IoT Personalisation & Persistence Personalisation !$£€! Actionable insight Monitoring App, Web
  • 37. DSE & Analytics : Summary Static, Massive Data Scalable Data Pipelines 1. Optimised data storage formats 2. Scalable, distributed technologies 3. Flexible and interactive analysis tools 4. Resilient, persistent Storage Real-Time Streaming Data Scalable Data Pipelines 1. Scalable, distributed technologies 2. De-coupled Producers and Consumers 3. Real-Time analytics 4. Resilient, persistent Storage Spark Mesos Akka Cassandra Kafka