Re-envisioning the Lambda Architecture:
Web Services & Real-time Analytics w/
Storm and Cassandra
Brian O’Neill, CTO
boneill@healthmarketscience.com
@boneill42
Talk Breakdown
29%
20%
31%
20%
Topics
(1) Motivation
(2) Polyglot Persistence
(3) Analytics
(4) Lambda Architecture
Health Market Science - Then
What we were.
Health Market Science - Now
Intersecting Big Data
w/ Healthcare
We’re fixing healthcare!
Data Pipelines
I/O
The Input
From government,
state boards, etc.
From the internet,
social data,
networks / graphs
From third-parties,
medical claims
From customers,
expenses,
sales data,
beneficiary information,
quality scores
Data
Pipeline
The Output
Script
Claims
Expense
Sanction
Address
Contact
(phone, fax, etc.)
Drug
Representative
Division
Expense Manager™
Provider Verification™
MarketView™
Customer
Feed(s)
Customer
Master
Provider MasterFile™
Credentials
“Agile MDM”
1 billion claims
per year
Organization
Practitioner
Referrals
Sounds easy
Except...
Incomplete Capture
No foreign keys
Differing schemas
Changing schemas
Conflicting information
Ad-hoc Analysis (is hard)
Point-In-Time Retrieval
Why?
?’s
Our MDM Pipeline
- Data Stewardship
- Data Scientists
- Business Analysts
Ingestion
- Semantic Tagging
- Standardization
- Data Mapping
Incorporation
- Consolidation
- Enumeration
- Association
Insight
- Search
- Reports
- Analytics
Feeds
(multiple formats,
changing over time)
API / FTP Web Interface
Dimensions
Logic
Rules
Our first “Pipeline”
+
Sweet!
Dirt Simple
Lightning Fast
Highly Available
Scalable
Multi-Datacenter (DR)
Not Sweet.
How do we query the data?
NoSQL Indexes?
Do such things exist?
Rev. 1 – Wide Rows!
AOP
Triggers!
Data model to support your queries.
9 7 32 74 99 12 42
$3.50 $7.00 $8.75 $1.00 $4.20 $3.17 $8.88
ONC : PA : 19460
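One way to read the figure above, as a rough sketch: it assumes `ONC : PA : 19460` is a composite row key and that each number is a column name with a dollar amount as its value (the slide does not label them). Plain Java collections stand in for Cassandra, and the class, method names and the second lookup key are hypothetical.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of a wide row: row key -> sorted columns (column name -> value).
// Assumption: the figure's numbers are column names and the amounts are values.
class WideRowSketch {
    private final Map<String, SortedMap<Integer, BigDecimal>> rows = new HashMap<>();

    // Write one column into the row identified by the composite key.
    void put(String rowKey, int column, String amount) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, new BigDecimal(amount));
    }

    // The one query this model supports well: fetch a whole row by its key.
    SortedMap<Integer, BigDecimal> row(String rowKey) {
        return rows.getOrDefault(rowKey, new TreeMap<>());
    }

    // Loads the values shown on the slide under a single row key.
    static WideRowSketch example() {
        WideRowSketch store = new WideRowSketch();
        int[] columns = {9, 7, 32, 74, 99, 12, 42};
        String[] amounts = {"3.50", "7.00", "8.75", "1.00", "4.20", "3.17", "8.88"};
        for (int i = 0; i < columns.length; i++) {
            store.put("ONC:PA:19460", columns[i], amounts[i]);
        }
        return store;
    }

    public static void main(String[] args) {
        WideRowSketch store = example();
        // Columns come back sorted by column name.
        System.out.println(store.row("ONC:PA:19460"));
        // A key the model was not written for (hypothetical) finds nothing.
        System.out.println(store.row("ONC:NJ:08540"));
    }
}
```

The lookup is cheap only for keys that were anticipated at write time, which is the limitation the next slide runs into.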
D’Oh! What about ad hoc?
Transformation
Rev 2 – Elastic Search!
AOP
Triggers!
D’Oh!
What if ES fails?
What about schema / type information?
Rev 3 - Apache Storm!
Anatomy of a Storm Cluster
• Nimbus
– Master Node
• Zookeeper
– Cluster Coordination
• Supervisors
– Worker Nodes
Storm Primitives
• Streams
– Unbounded sequence of tuples
• Spouts
– Stream Sources
• Bolts
– Unit of Computation
• Topologies
– Combination of n Spouts and m Bolts
– Defines the overall “Computation”
Storm Spouts
• Represents a source (stream) of data
– Queues (JMS, Kafka, Kestrel, etc.)
– Twitter Firehose
– Sensor Data
• Emits “Tuples” (Events) based on source
– Primary Storm data structure
– Set of Key-Value pairs
Storm Bolts
• Receive Tuples from Spouts or other Bolts
• Operate on, or React to Data
– Functions/Filters/Joins/Aggregations
– Database writes/lookups
• Optionally emit additional Tuples
Storm Topologies
• Data flow between spouts and bolts
• Routing of Tuples between spouts/bolts
– Stream “Groupings”
• Parallelism of Components
• Long-Lived
Storm Topologies
Persistent Word Count
http://github.com/hmsonline/storm-cassandra
NEXT LEVEL : TRIDENT
Trident
• Part of Storm
• Provides a higher-level abstraction for stream
processing
– Constructs for state management and batching
• Adds additional primitives that abstract away
common topological patterns
Trident State
Sequences writes by batch
• Spouts
– Transactional
• Batch contents never change
– Opaque
• Batch contents can change
• State
– Transactional
• Store batch number with counts to maintain sequencing of
writes
– Opaque
• Store previous value in order to overwrite the current value
when contents of a batch change
State Management
Transactional
Last Batch  Value
15          1000
(+59)
16          1059
replay == incorporated already?
(because batch composition is the same)
Opaque
Last Batch  Previous  Current
15          980       1000
(+59)
16          1000      1059
Last Batch  Previous  Current
15          980       1000
(+72)
16          1000      1072
replay == re-incorporate
Batch composition changes! (not guaranteed)
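The two update rules can be sketched in plain Java using the numbers from the slide. This is a toy model with hypothetical class and method names, not Trident's own state classes: the transactional rule skips a batch it has already seen, while the opaque rule re-applies a replayed batch on top of the previous value.

```java
// Toy versions of the two Trident state update rules, using the slide's numbers.
class TridentStateSketch {

    // Transactional: store the last batch id next to the value.
    static final class TxValue {
        final long lastBatch;
        final long value;

        TxValue(long lastBatch, long value) {
            this.lastBatch = lastBatch;
            this.value = value;
        }
    }

    // Opaque: additionally keep the value as it was before the last batch.
    static final class OpaqueValue {
        final long lastBatch;
        final long previous;
        final long current;

        OpaqueValue(long lastBatch, long previous, long current) {
            this.lastBatch = lastBatch;
            this.previous = previous;
            this.current = current;
        }
    }

    // A replayed batch has identical contents, so it is skipped if already applied.
    static TxValue applyTransactional(TxValue state, long batchId, long delta) {
        if (batchId == state.lastBatch) {
            return state;
        }
        return new TxValue(batchId, state.value + delta);
    }

    // A replayed batch may differ, so it is re-applied on top of the previous value.
    static OpaqueValue applyOpaque(OpaqueValue state, long batchId, long delta) {
        if (batchId == state.lastBatch) {
            return new OpaqueValue(batchId, state.previous, state.previous + delta);
        }
        return new OpaqueValue(batchId, state.current, state.current + delta);
    }

    public static void main(String[] args) {
        TxValue tx = applyTransactional(new TxValue(15, 1000), 16, 59);
        tx = applyTransactional(tx, 16, 59); // replay of batch 16: no change
        System.out.println(tx.lastBatch + " " + tx.value); // prints "16 1059"

        OpaqueValue op = applyOpaque(new OpaqueValue(15, 980, 1000), 16, 59);
        op = applyOpaque(op, 16, 72); // replay of batch 16 with different contents
        System.out.println(op.lastBatch + " " + op.previous + " " + op.current); // prints "16 1000 1072"
    }
}
```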
BACK TO OUR REGULARLY
SCHEDULED TALK
Polyglot Persistence
“The Right Tool for the Job”
Oracle is a registered trademark
of Oracle Corporation and/or its
affiliates. Other names may be
trademarks of their respective
owners.
Back to the Pipeline
Kafka
DW
Storm
C* ES Titan SQL
MDM Topology*
*Notional
Design Principles
• What we got:
– At-least-once processing
– Simple data flows
• What we needed to account for:
– Replays
Idempotent Operations!
Immutable Data!
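Why replays push toward idempotent operations and immutable data, as a minimal sketch: under at-least-once delivery the same event can arrive twice, so an increment double-counts while a write keyed by the event's own id does not. The `Event` shape and the claim ids are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Contrast between a counter update and an idempotent, keyed write under replay.
class ReplaySketch {

    // Hypothetical event shape: a unique id plus an amount.
    static final class Event {
        final String id;
        final int amount;

        Event(String id, int amount) {
            this.id = id;
            this.amount = amount;
        }
    }

    // Not idempotent: every delivery changes the total.
    static int sumByIncrement(Event[] deliveries) {
        int total = 0;
        for (Event e : deliveries) {
            total += e.amount;
        }
        return total;
    }

    // Idempotent: each event is stored as an immutable fact under its own id,
    // so a redelivery overwrites the same entry with the same value.
    static int sumByKeyedWrite(Event[] deliveries) {
        Map<String, Integer> facts = new HashMap<>();
        for (Event e : deliveries) {
            facts.put(e.id, e.amount);
        }
        int total = 0;
        for (int amount : facts.values()) {
            total += amount;
        }
        return total;
    }

    // Two events, the second of which is delivered twice.
    static Event[] deliveriesWithReplay() {
        Event a = new Event("claim-1", 40);
        Event b = new Event("claim-2", 60);
        return new Event[] {a, b, b};
    }

    public static void main(String[] args) {
        System.out.println(sumByIncrement(deliveriesWithReplay()));  // prints 160
        System.out.println(sumByKeyedWrite(deliveriesWithReplay())); // prints 100
    }
}
```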
Cassandra State (v0.4.0)
git@github.com:hmsonline/storm-cassandra.git
{tuple} → <mapper> → (ks, cf, row, k:v[])
Storm Cassandra
Trident Elastic Search (v0.3.1)
git@github.com:hmsonline/trident-elasticsearch.git
{tuple} → <mapper> → (idx, docid, k:v[])
Storm Elastic Search
Storm Graph (v0.1.2)
Coming soon to...
git@github.com:hmsonline/storm-graph.git
for (tuple : batch)
<processor> (graph, tuple)
Storm JDBI (v0.1.14)
INTERNAL ONLY (so far)
Worth releasing?
{tuple} → <mapper> → (JDBC Statement)
All good!
But...
What was the average amount paid for a
medical claim associated with procedure
X by zip code over the last five years?
Hadoop (<2)? Batch?
Yuck. ‘Nuff Said.
http://www.slideshare.net/prash1784/introduction-to-hadoop-and-pig-15036186
Alternatives?
Let’s Pre-Compute It!
stream
    .groupBy(new Fields("procedure", "zip"))
    // "average" is an illustrative name for the output field
    .aggregate(new Fields("amount"), new Average(), new Fields("average"))
D’Oh!
GroupBy’s.
They set data in motion!
Lesson Learned
https://github.com/nathanmarz/storm/wiki/Trident-API-Overview
If possible, avoid
re-partitioning
operations!
(e.g. LOG.error!)
Why so hard?
D’Oh!
19 != 9
What we don’t want:
LOCKS!
What’s the alternative?
CONSENSUS!
Cassandra 2.0!
http://www.slideshare.net/planetcassandra/nyc-jonathan-ellis-keynote-cassandra-12-20
http://www.cs.cornell.edu/courses/CS6452/2012sp/papers/paxos-complex.pdf
Conditional Updates
“The alert reader will notice here that
Paxos gives us the ability to agree on
exactly one proposal. After one has been
accepted, it will be returned to future
leaders in the promise, and the new leader
will have to re-propose it again.”
http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
UPDATE <table> SET value = 9 WHERE word = 'fox' IF value = 6
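The conditional update above is a compare-and-set: the write is applied only if the stored value still matches the expected one. A minimal in-memory sketch of those semantics follows. A `ConcurrentHashMap` stands in for the Cassandra table (no Paxos involved), and the class and method names are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for "UPDATE ... SET value = 9 WHERE word = 'fox' IF value = 6".
class ConditionalUpdateSketch {
    private final ConcurrentMap<String, Integer> table = new ConcurrentHashMap<>();

    void insert(String word, int value) {
        table.put(word, value);
    }

    Integer read(String word) {
        return table.get(word);
    }

    // Applies the write only if the stored value still equals the expected one.
    // The boolean plays the role of the [applied] column in the CQL result.
    boolean updateIf(String word, int expected, int newValue) {
        return table.replace(word, expected, newValue);
    }

    public static void main(String[] args) {
        ConditionalUpdateSketch store = new ConditionalUpdateSketch();
        store.insert("fox", 6);
        System.out.println(store.updateIf("fox", 6, 9));  // prints true
        System.out.println(store.updateIf("fox", 6, 12)); // prints false (value is now 9)
        System.out.println(store.read("fox"));            // prints 9
    }
}
```

A writer that loses the race learns about it from the returned flag and can re-read and retry, which avoids holding a lock.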
Love CQL
Conditional Updates
+
Batch Statements
+
Collections
=
BADASS DATA MODELS
Announcing : Storm Cassandra CQL!
git@github.com:hmsonline/storm-cassandra-cql.git
{tuple} → <mapper> → (CQL Statement)
Trident Batching =? CQL Batching
Incremental State!
• Collapse aggregation into the state object.
– This allows the state object to aggregate with current state
in a loop until success.
• Uses Trident Batching to perform in-memory
aggregation for the batch.
for (tuple : batch)
    state.aggregate(tuple);
while (failed?) {
    persisted_state = read(state)
    aggregate(in_memory_state, persisted_state)
    failed? = conditionally_update(state)
}
Partition 1
In-Memory Aggregation by Key!
Key Value
fox 6
brown 3
Partition 2
Key Value
fox 3
lazy 72
C*
No More GroupBy!
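The loop and the two-partition picture above can be sketched together. Each partition first aggregates its batch in memory, then folds its totals into the shared store with a read, aggregate, conditional-update retry loop. This is a sketch under assumptions, not the storm-cassandra-cql implementation: a `ConcurrentMap` stands in for C*, the names are hypothetical, and the individual tuples behind partition 1's totals are made up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of the incremental-state loop; a ConcurrentMap stands in for Cassandra.
class IncrementalStateSketch {

    // In-memory aggregation for one partition's batch: one total per key.
    static Map<String, Integer> aggregate(String[] keys, int[] values) {
        Map<String, Integer> totals = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            totals.merge(keys[i], values[i], Integer::sum);
        }
        return totals;
    }

    // Read, aggregate with the persisted value, conditionally update; retry on failure.
    static void commit(Map<String, Integer> inMemory, ConcurrentMap<String, Integer> store) {
        for (Map.Entry<String, Integer> entry : inMemory.entrySet()) {
            String key = entry.getKey();
            boolean applied = false;
            while (!applied) {
                Integer persisted = store.get(key);
                if (persisted == null) {
                    // Comparable to INSERT ... IF NOT EXISTS.
                    applied = store.putIfAbsent(key, entry.getValue()) == null;
                } else {
                    // Comparable to UPDATE ... IF value = <persisted>.
                    applied = store.replace(key, persisted, persisted + entry.getValue());
                }
            }
        }
    }

    // Two partitions commit concurrently, with the per-key totals shown on the slide.
    static ConcurrentMap<String, Integer> run() {
        ConcurrentMap<String, Integer> store = new ConcurrentHashMap<>();
        Map<String, Integer> p1 = aggregate(new String[] {"fox", "brown", "fox"}, new int[] {4, 3, 2});
        Map<String, Integer> p2 = aggregate(new String[] {"fox", "lazy"}, new int[] {3, 72});
        Thread t1 = new Thread(() -> commit(p1, store));
        Thread t2 = new Thread(() -> commit(p2, store));
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return store;
    }

    public static void main(String[] args) {
        // Whatever the interleaving, fox ends at 9, brown at 3, lazy at 72.
        System.out.println(run());
    }
}
```

Both partitions hold a partial total for `fox`, yet no tuple had to be moved between partitions to combine them; the store resolves the overlap.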
To protect against replays
Use partition + batch identifier(s) in
your conditional update!
“BatchId + partitionIndex consistently represents the
same data as long as:
1. Any repartitioning you do is deterministic (so
partitionBy is, but shuffle is not)
2. You're using a spout that replays the exact same
batch each time (which is true of transactional spouts
but not of opaque transactional spouts)”
- Nathan Marz
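One possible shape for that guard, as a sketch only: record a `batchId:partitionIndex` token alongside each key and refuse to apply the same token twice. Keeping the tokens in a per-key set is an assumption of this sketch (the slides do not show the actual data model), `synchronized` stands in for the atomicity of a conditional update, and all names are hypothetical. It is only sound under the two conditions in the quote above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Replay guard keyed by (batch id, partition index); in-memory stand-in for the store.
class ReplayGuardSketch {
    private final Map<String, Integer> values = new HashMap<>();
    private final Map<String, Set<String>> applied = new HashMap<>();

    // Adds delta to the key unless this (batch, partition) pair was already applied to it.
    // Returns false for a replay, leaving the stored value untouched.
    synchronized boolean addOnce(String key, int delta, long batchId, int partitionIndex) {
        String token = batchId + ":" + partitionIndex;
        Set<String> seen = applied.computeIfAbsent(key, k -> new HashSet<>());
        if (!seen.add(token)) {
            return false;
        }
        values.merge(key, delta, Integer::sum);
        return true;
    }

    synchronized int get(String key) {
        return values.getOrDefault(key, 0);
    }

    public static void main(String[] args) {
        ReplayGuardSketch store = new ReplayGuardSketch();
        System.out.println(store.addOnce("fox", 6, 16, 0)); // prints true
        System.out.println(store.addOnce("fox", 3, 16, 1)); // prints true
        System.out.println(store.addOnce("fox", 3, 16, 1)); // prints false (replay)
        System.out.println(store.get("fox"));               // prints 9
    }
}
```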
Hyper-Cubes!
Our Terminology
• A cube comprises:
– Dimensions (e.g. procedure, zip, time slice)
– A function (e.g. count, sum)
– Function fields (e.g. amount paid)
– Granularity
• A metric comprises:
– Coordinates (e.g. vasectomy, 19460, 879123 – hour since epoch)
– A value (e.g. 500 procedures)
• A perspective comprises:
– A range of coordinates (*, 19460, January)
– An interval (day)
Complex Event Processing
• For each event,
–Find relevant cubes
–Adjust metrics for event
–Group and aggregate metrics
–Use conditional updates to incorporate
metric into persistent state
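The terminology and per-event steps above can be sketched for a single cube with dimensions (procedure, zip, hour), function sum, and function field amount paid. This is an illustration with hypothetical names and made-up events; persistence through conditional updates is left out, and amounts are held in cents to keep the arithmetic exact.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of one cube: dimensions (procedure, zip, hour), function sum, field amount paid.
class CubeSketch {
    static final long MILLIS_PER_HOUR = 3_600_000L;

    // Metric values keyed by their coordinates.
    private final Map<String, Long> metrics = new HashMap<>();

    // Coordinates of an event within the cube, at hourly granularity.
    // Assumes "|" never appears in a procedure or zip value.
    static String coordinates(String procedure, String zip, long epochMillis) {
        return procedure + "|" + zip + "|" + (epochMillis / MILLIS_PER_HOUR);
    }

    // Adjust the metric at the event's coordinates.
    void onEvent(String procedure, String zip, long epochMillis, long amountCents) {
        metrics.merge(coordinates(procedure, zip, epochMillis), amountCents, Long::sum);
    }

    // A perspective over the cube: null acts as the "*" wildcard; hours are inclusive.
    long sum(String procedure, String zip, long fromHour, long toHour) {
        long total = 0;
        for (Map.Entry<String, Long> e : metrics.entrySet()) {
            String[] c = e.getKey().split("\\|");
            long hour = Long.parseLong(c[2]);
            boolean matches = (procedure == null || procedure.equals(c[0]))
                    && (zip == null || zip.equals(c[1]))
                    && hour >= fromHour && hour <= toHour;
            if (matches) {
                total += e.getValue();
            }
        }
        return total;
    }

    // Made-up events around the hour used as an example on the slide.
    static CubeSketch example() {
        CubeSketch cube = new CubeSketch();
        long hour = 879123L * MILLIS_PER_HOUR;
        cube.onEvent("vasectomy", "19460", hour + 1_000, 50_000);
        cube.onEvent("vasectomy", "19460", hour + 2_000, 30_000);
        cube.onEvent("vasectomy", "19087", hour + 3_000, 20_000);
        cube.onEvent("appendectomy", "19460", hour + MILLIS_PER_HOUR, 70_000);
        return cube;
    }

    public static void main(String[] args) {
        CubeSketch cube = example();
        System.out.println(cube.sum(null, "19460", 879123, 879123));     // prints 80000
        System.out.println(cube.sum("vasectomy", null, 879123, 879124)); // prints 100000
    }
}
```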
The Lambda Architecture
http://architects.dzone.com/articles/nathan-marzs-lamda
Let’s Challenge This a Bit
because “additional tools and techniques” cost
money and time.
• Questions:
– Can we solve the problem with a single tool and a
single approach?
– Can we re-use logic across layers?
– Or better yet, can we collapse layers?
A Traditional Interpretation
Speed Layer
(Storm)
Batch Layer
(Hadoop)
Data
Stream
Serving Layer
HBase
Impala
D’Oh! Two pipelines!
Integrating Web Services
• We need a web service that receives an event
and provides,
– an immediate acknowledgement
– a high likelihood that the data is integrated very soon
– a guarantee that the data will be integrated eventually
• We need an architecture that provides for,
– Code / Logic and approach re-use
– Fault-Tolerance
Grand Finale
The Idea : Embedding State!
Kafka
DropWizard
C*
IncrementalCqlState
aggregate(tuple)
“Batch” Layer
(Storm)
Client
The Sequence of Events
The Wins
• Reuse Aggregations and State Code!
• To re-compute (or backfill) a dimension,
simply re-queue!
• Storm is the “safety” net
– If a DW host fails during aggregation, Storm will fill
in the gaps for all ACK’d events.
• Is there an opportunity to reuse more?
– BatchingStrategy & PartitionStrategy?
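A toy walk-through of the idea, under loud assumptions: a list stands in for Kafka, an in-memory object for C*, and events carry a unique id so that applying one twice is a no-op. The slides do not say how double-counting is avoided between the web tier and Storm, so the id set here is this sketch's own device, and every name is hypothetical. The point it shows is that both paths call the same `aggregate` code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Web tier and "batch" layer sharing one piece of aggregation/state code.
class EmbeddedStateSketch {

    // Shared state code, used by both paths.
    static final class State {
        private final Map<String, Integer> totals = new HashMap<>();
        private final Set<String> appliedEventIds = new HashSet<>();

        // Idempotent by event id (an assumption of this sketch).
        synchronized void aggregate(String eventId, String key, int amount) {
            if (appliedEventIds.add(eventId)) {
                totals.merge(key, amount, Integer::sum);
            }
        }

        synchronized int total(String key) {
            return totals.getOrDefault(key, 0);
        }
    }

    private final List<String[]> queue = new ArrayList<>(); // stand-in for Kafka
    final State state = new State();                        // stand-in for C*

    // Web service path: queue the event, then try to aggregate in-process.
    // The flag simulates the host dying after the event was queued and acknowledged.
    boolean handle(String eventId, String key, int amount, boolean hostFails) {
        queue.add(new String[] {eventId, key, Integer.toString(amount)});
        if (!hostFails) {
            state.aggregate(eventId, key, amount);
        }
        return true; // acknowledgement to the client
    }

    // "Batch" layer path: replay everything queued; already-applied events are no-ops.
    void replayQueue() {
        for (String[] e : queue) {
            state.aggregate(e[0], e[1], Integer.parseInt(e[2]));
        }
    }

    public static void main(String[] args) {
        EmbeddedStateSketch service = new EmbeddedStateSketch();
        service.handle("e1", "19460", 10, false);
        service.handle("e2", "19460", 5, true);           // acknowledged but not aggregated
        System.out.println(service.state.total("19460")); // prints 10
        service.replayQueue();                            // the safety net fills the gap
        System.out.println(service.state.total("19460")); // prints 15
    }
}
```

Re-queuing old events through `replayQueue` is also how a backfill would look in this model.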
In the end, all good. =)
Plug
The Book
Shout out:
Taylor Goetz
Thanks!
Brought to you by
Brian O’Neill, CTO
boneill@healthmarketscience.com
@boneill42
12 years together
APPENDIX
CassandraCqlState
public void commit(Long txid) {
    BatchStatement batch = new BatchStatement(Type.LOGGED);
    batch.addAll(this.statements);
    clientFactory.getSession().execute(batch);
}

public void addStatement(Statement statement) {
    this.statements.add(statement);
}

public ResultSet execute(Statement statement) {
    return clientFactory.getSession().execute(statement);
}
CassandraCqlStateUpdater
public void updateState(CassandraCqlState state,
                        List<TridentTuple> tuples,
                        TridentCollector collector) {
    for (TridentTuple tuple : tuples) {
        Statement statement = this.mapper.map(tuple);
        state.addStatement(statement);
    }
}
ExampleMapper
public Statement map(List<String> keys, Number value) {
    Insert statement =
        QueryBuilder.insertInto(KEYSPACE_NAME, TABLE_NAME);
    statement.value(KEY_NAME, keys.get(0));
    statement.value(VALUE_NAME, value);
    return statement;
}

public Statement retrieve(List<String> keys) {
    Select statement = QueryBuilder.select()
        .column(KEY_NAME).column(VALUE_NAME)
        .from(KEYSPACE_NAME, TABLE_NAME)
        .where(QueryBuilder.eq(KEY_NAME, keys.get(0)));
    return statement;
}

Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 

Recently uploaded (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & Real-time Analytics w/ Storm and Cassandra

  • 1. Re-envisioning the Lambda Architecture: Web Services & Real-time Analytics w/ Storm and Cassandra Brian O’Neill, CTO boneill@healthmarketscience.com @boneill42
  • 2. Talk Breakdown 29% 20% 31% 20% Topics (1) Motivation (2) Polyglot Persistence (3) Analytics (4) Lambda Architecture
  • 3. Health Market Science - Then What we were.
  • 4. Health Market Science - Now Intersecting Big Data w/ Healthcare We’re fixing healthcare!
  • 6. The Input From government, state boards, etc. From the internet, social data, networks / graphs From third-parties, medical claims From customers, expenses, sales data, beneficiary information, quality scores Data Pipeline
  • 7. The Output Script Claims Expense Sanction Address Contact (phone, fax, etc.) Drug Representative Division Expense Manager™ Provider Verification™ MarketView™ Customer Feed(s) Customer Master Provider MasterFile™ Credentials “Agile MDM” 1 billion claims per year Organization Practitioner Referrals
  • 8. Sounds easy Except... Incomplete Capture No foreign keys Differing schemas Changing schemas Conflicting information Ad-hoc Analysis (is hard) Point-In-Time Retrieval
  • 10. Our MDM Pipeline - Data Stewardship - Data Scientists - Business Analysts Ingestion - Semantic Tagging - Standardization - Data Mapping Incorporation - Consolidation - Enumeration - Association Insight - Search - Reports - Analytics Feeds (multiple formats, changing over time) API / FTP Web Interface Dimensions Logic Rules
  • 12. Sweet! Dirt Simple Lightning Fast Highly Available Scalable Multi-Datacenter (DR)
  • 13. Not Sweet. How do we query the data? NoSQL Indexes? Do such things exist?
  • 14. Rev. 1 – Wide Rows! AOP Triggers! Data model to support your queries. 9 7 32 74 99 12 42 $3.50 $7.00 $8.75 $1.00 $4.20 $3.17 $8.88 ONC : PA : 19460 D’Oh! What about ad hoc?
  • 15. Transformation Rev 2 – Elastic Search! AOP Triggers! D’Oh! What if ES fails? What about schema / type information?
  • 16. Rev 3 - Apache Storm!
  • 17. Anatomy of a Storm Cluster • Nimbus – Master Node • Zookeeper – Cluster Coordination • Supervisors – Worker Nodes
  • 18. Storm Primitives • Streams – Unbounded sequence of tuples • Spouts – Stream Sources • Bolts – Unit of Computation • Topologies – Combination of n Spouts and m Bolts – Defines the overall “Computation”
  • 19. Storm Spouts • Represents a source (stream) of data – Queues (JMS, Kafka, Kestrel, etc.) – Twitter Firehose – Sensor Data • Emits “Tuples” (Events) based on source – Primary Storm data structure – Set of Key-Value pairs
  • 20. Storm Bolts • Receive Tuples from Spouts or other Bolts • Operate on, or React to Data – Functions/Filters/Joins/Aggregations – Database writes/lookups • Optionally emit additional Tuples
  • 21. Storm Topologies • Data flow between spouts and bolts • Routing of Tuples between spouts/bolts – Stream “Groupings” • Parallelism of Components • Long-Lived
  • 24. NEXT LEVEL : TRIDENT
  • 25. Trident • Part of Storm • Provides a higher-level abstraction for stream processing – Constructs for state management and batching • Adds additional primitives that abstract away common topological patterns
  • 26. Trident State Sequences writes by batch • Spouts – Transactional • Batch contents never change – Opaque • Batch contents can change • State – Transactional • Store batch number with counts to maintain sequencing of writes – Opaque • Store previous value in order to overwrite the current value when contents of a batch change
  • 27. State Management Last Batch Value 15 1000 (+59) Last Batch Value 16 1059 Transactional Last Batch Previous Current 15 980 1000 (+59) Opaque replay == incorporated already? (because batch composition is the same) Last Batch Previous Current 16 1000 1059 Last Batch Previous Current 15 980 1000 (+72) Last Batch Previous Current 16 1000 1072 replay == re-incorporate Batch composition changes! (not guaranteed)
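The bookkeeping on slides 26 and 27 can be sketched in plain Java. This is a minimal sketch, not Trident's implementation: the class and method names (`TridentStateSketch`, `TransactionalValue`, `OpaqueValue`, `apply`) are hypothetical, and each object stands in for one stored row. The transactional variant skips a replayed batch because its contents cannot have changed; the opaque variant recomputes from the previous value because they may have.

```java
class TridentStateSketch {

    /** Transactional state: the value plus the id of the last batch applied. */
    static final class TransactionalValue {
        long lastBatch;
        long value;

        TransactionalValue(long lastBatch, long value) {
            this.lastBatch = lastBatch;
            this.value = value;
        }

        void apply(long batchId, long delta) {
            if (batchId == lastBatch) {
                return; // replay of an identical batch: already incorporated
            }
            value += delta;
            lastBatch = batchId;
        }
    }

    /** Opaque state: also keeps the value as it was before the last batch. */
    static final class OpaqueValue {
        long lastBatch;
        long previous;
        long current;

        OpaqueValue(long lastBatch, long previous, long current) {
            this.lastBatch = lastBatch;
            this.previous = previous;
            this.current = current;
        }

        void apply(long batchId, long delta) {
            if (batchId == lastBatch) {
                // Replay: batch contents may differ, so overwrite from previous.
                current = previous + delta;
            } else {
                previous = current;
                current = current + delta;
                lastBatch = batchId;
            }
        }
    }

    public static void main(String[] args) {
        TransactionalValue t = new TransactionalValue(15, 1000);
        t.apply(16, 59);
        t.apply(16, 59); // replay is ignored
        System.out.println("transactional: " + t.value);

        OpaqueValue o = new OpaqueValue(15, 980, 1000);
        o.apply(16, 59);
        o.apply(16, 72); // replay with different contents
        System.out.println("opaque: " + o.previous + " -> " + o.current);
    }
}
```

The numbers in `main` are the ones from slide 27: batch 16 adds 59 to 1000, and an opaque replay of batch 16 carrying 72 overwrites the result from the stored previous value.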
  • 28. BACK TO OUR REGULARLY SCHEDULED TALK
  • 29. Polyglot Persistence “The Right Tool for the Job” Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.
  • 30. Back to the Pipeline Kafka DW Storm C* ES Titan SQL
  • 32. Design Principles • What we got: – At-least-once processing – Simple data flows • What we needed to account for: – Replays Idempotent Operations! Immutable Data!
  • 33. Cassandra State (v0.4.0) git@github.com:hmsonline/storm-cassandra.git {tuple} -> <mapper> -> (ks, cf, row, k:v[]) Storm Cassandra
  • 34. Trident Elastic Search (v0.3.1) git@github.com:hmsonline/trident-elasticsearch.git {tuple} -> <mapper> -> (idx, docid, k:v[]) Storm Elastic Search
  • 35. Storm Graph (v0.1.2) Coming soon to... git@github.com:hmsonline/storm-graph.git for (tuple : batch) <processor> (graph, tuple)
  • 36. Storm JDBI (v0.1.14) INTERNAL ONLY (so far) Worth releasing? {tuple} -> <mapper> -> (JDBC Statement)
  • 38. But... What was the average amount paid for a medical claim associated with procedure X by zip code over the last five years?
  • 39. Hadoop (<2)? Batch? Yuck. ‘Nuff Said. http://www.slideshare.net/prash1784/introduction-to-hadoop-and-pig-15036186
  • 41. Let’s Pre-Compute It! stream .groupBy(new Field("procedure")) .groupBy(new Field("zip")) .aggregate(new Field("amount"), new Average()) D’Oh! GroupBy’s. They set data in motion!
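To make concrete what slide 41 pre-computes, here is a minimal sketch that uses plain Java streams rather than Trident. The `Claim` class and the method name are assumptions for illustration; the slides do not define them. It groups claims by (procedure, zip) and averages the amount paid, which is the single-process equivalent of the grouped aggregation on the slide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class PrecomputedAverageSketch {

    /** Hypothetical claim shape; not taken from the slides. */
    static final class Claim {
        final String procedure;
        final String zip;
        final double amount;

        Claim(String procedure, String zip, double amount) {
            this.procedure = procedure;
            this.zip = zip;
            this.amount = amount;
        }
    }

    /** Average amount paid, keyed by the pair (procedure, zip). */
    static Map<List<String>, Double> averageByProcedureAndZip(List<Claim> claims) {
        return claims.stream().collect(Collectors.groupingBy(
                c -> Arrays.asList(c.procedure, c.zip),
                Collectors.averagingDouble(c -> c.amount)));
    }

    public static void main(String[] args) {
        List<Claim> claims = Arrays.asList(
                new Claim("X", "19460", 100.0),
                new Claim("X", "19460", 300.0),
                new Claim("X", "19103", 50.0));
        System.out.println(averageByProcedureAndZip(claims));
    }
}
```

In a single JVM the grouping is free. In a distributed topology the same grouping forces tuples with equal keys onto the same worker, which is the "data in motion" cost the slide complains about.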
  • 43. Why so hard? D’Oh! 19 != 9 What we don’t want: LOCKS! What’s the alternative? CONSENSUS!
  • 45. Conditional Updates “The alert reader will notice here that Paxos gives us the ability to agree on exactly one proposal. After one has been accepted, it will be returned to future leaders in the promise, and the new leader will have to re-propose it again.” http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0 UPDATE <table> SET value = 9 WHERE word = 'fox' IF value = 6
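The effect of the conditional update on slide 45 can be imitated in memory. The sketch below is not a Cassandra client: a `ConcurrentHashMap` stands in for the table, and `ConcurrentMap.replace(key, expected, newValue)` plays the role of the `IF value = ...` clause. The class and method names are hypothetical. It replays the counting race from slide 43: both writers read 6, but only the first conditional write succeeds, so the second must re-read and retry.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class ConditionalUpdateSketch {
    // In-memory stand-in for a table keyed by word.
    final ConcurrentMap<String, Integer> table = new ConcurrentHashMap<>();

    /** Writes newValue only if the stored value still equals expected. */
    boolean updateIf(String word, int expected, int newValue) {
        return table.replace(word, expected, newValue);
    }

    /** Read, add, conditionally write; loop again after a stale read. */
    int add(String word, int delta) {
        while (true) {
            int current = table.get(word);
            if (updateIf(word, current, current + delta)) {
                return current + delta;
            }
        }
    }

    public static void main(String[] args) {
        ConditionalUpdateSketch s = new ConditionalUpdateSketch();
        s.table.put("fox", 6);

        int readByA = s.table.get("fox"); // A reads 6
        int readByB = s.table.get("fox"); // B reads 6

        boolean bWrote = s.updateIf("fox", readByB, readByB + 10); // succeeds
        boolean aWrote = s.updateIf("fox", readByA, readByA + 3);  // rejected: stale read

        if (!aWrote) {
            s.add("fox", 3); // A re-reads and retries
        }
        System.out.println(bWrote + " " + aWrote + " " + s.table.get("fox"));
    }
}
```

With plain writes the last writer would win and one increment would be lost; with the condition in place, no increment is dropped and no lock is held.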
  • 46. Love CQL Conditional Updates + Batch Statements + Collections = BADASS DATA MODELS
  • 47. Announcing : Storm Cassandra CQL! git@github.com:hmsonline/storm-cassandra-cql.git {tuple} -> <mapper> -> (CQL Statement) Trident Batching =? CQL Batching
  • 48. Incremental State! • Collapse aggregation into the state object. – This allows the state object to aggregate with current state in a loop until success. • Uses Trident Batching to perform in-memory aggregation for the batch. for (tuple : batch) state.aggregate(tuple); while (failed?) { persisted_state = read(state) aggregate(in_memory_state, persisted_state) failed? = conditionally_update(state) }
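The loop on slide 48 can be sketched as follows. This is a sketch under stated assumptions, not the code in storm-cassandra-cql: `IncrementalStateSketch` is a hypothetical class, a shared `ConcurrentMap` stands in for Cassandra, and `putIfAbsent` / `replace` stand in for `IF NOT EXISTS` and `IF value = ...`. Each partition first aggregates its batch in memory by key, then merges each key into the persisted value in a read / aggregate / conditionally-update loop until the write succeeds.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class IncrementalStateSketch {
    private final ConcurrentMap<String, Long> store;            // stand-in for C*
    private final Map<String, Long> inMemory = new HashMap<>(); // per-batch aggregate

    IncrementalStateSketch(ConcurrentMap<String, Long> store) {
        this.store = store;
    }

    /** In-memory aggregation by key for the current batch. */
    void aggregate(String key, long amount) {
        inMemory.merge(key, amount, Long::sum);
    }

    /** Merge the batch into persisted state, retrying each key until it succeeds. */
    void commit() {
        for (Map.Entry<String, Long> e : inMemory.entrySet()) {
            boolean applied = false;
            while (!applied) {
                Long persisted = store.get(e.getKey());
                if (persisted == null) {
                    applied = store.putIfAbsent(e.getKey(), e.getValue()) == null;
                } else {
                    applied = store.replace(e.getKey(), persisted, persisted + e.getValue());
                }
            }
        }
        inMemory.clear();
    }

    public static void main(String[] args) {
        ConcurrentMap<String, Long> cassandra = new ConcurrentHashMap<>();
        IncrementalStateSketch partition1 = new IncrementalStateSketch(cassandra);
        IncrementalStateSketch partition2 = new IncrementalStateSketch(cassandra);

        partition1.aggregate("fox", 4);
        partition1.aggregate("fox", 2);
        partition1.aggregate("brown", 3);
        partition2.aggregate("fox", 3);

        partition1.commit();
        partition2.commit();
        System.out.println(cassandra);
    }
}
```

Because each partition collapses its own tuples before touching the store, the same key can live in several partitions at once and no groupBy is needed to bring them together.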
  • 49. Partition 1 In-Memory Aggregation by Key! Key Value fox 6 brown 3 Partition 2 Key Value fox 3 lazy 72 C* No More GroupBy!
  • 50. To protect against replays Use partition + batch identifier(s) in your conditional update! “BatchId + partitionIndex consistently represents the same data as long as: 1. Any repartitioning you do is deterministic (so partitionBy is, but shuffle is not) 2. You're using a spout that replays the exact same batch each time (which is true of transactional spouts but not of opaque transactional spouts)” - Nathan Marz
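One way to read slide 50 is that the identifier of every incorporated (batch, partition) pair is stored next to the value and checked inside the conditional update. The sketch below assumes exactly that; the storage layout, the `Cell` class, and the `apply` method are hypothetical, and an `AtomicReference` with `compareAndSet` stands in for the conditional write. It relies on the guarantee quoted on the slide: the same batchId + partitionIndex always carries the same data.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

class ReplaySafeCounterSketch {

    /** Immutable snapshot: the value and the ids already folded into it. */
    static final class Cell {
        final long value;
        final Set<String> applied;

        Cell(long value, Set<String> applied) {
            this.value = value;
            this.applied = Collections.unmodifiableSet(applied);
        }
    }

    private final AtomicReference<Cell> cell =
            new AtomicReference<>(new Cell(0, new HashSet<>()));

    /** Returns false when this batch + partition was already incorporated. */
    boolean apply(long batchId, int partitionIndex, long delta) {
        String id = batchId + ":" + partitionIndex;
        while (true) {
            Cell current = cell.get();
            if (current.applied.contains(id)) {
                return false; // replay: skip
            }
            Set<String> applied = new HashSet<>(current.applied);
            applied.add(id);
            Cell next = new Cell(current.value + delta, applied);
            if (cell.compareAndSet(current, next)) {
                return true;
            }
            // Another writer got in first: re-read and try again.
        }
    }

    long value() {
        return cell.get().value;
    }

    public static void main(String[] args) {
        ReplaySafeCounterSketch fox = new ReplaySafeCounterSketch();
        fox.apply(16, 0, 6);
        fox.apply(16, 1, 3);
        fox.apply(16, 0, 6); // replayed partition, ignored
        System.out.println(fox.value());
    }
}
```

A real table would need to bound the set of remembered identifiers, for example by keeping only the latest batch per partition; that trade-off is outside what the slide states.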
  • 52. Our Terminology • A cube comprises: – Dimensions (e.g. procedure, zip, time slice) – A function (e.g. count, sum) – Function fields (e.g. amount paid) – Granularity • A metric comprises: – Coordinates (e.g. vasectomy, 19460, 879123 – hour since epoch) – A value (e.g. 500 procedures) • A perspective comprises: – A range of coordinates (*, 19460, January) – An interval (day)
  • 53. Complex Event Processing • For each event, – Find relevant cubes – Adjust metrics for event – Group and aggregate metrics – Use conditional updates to incorporate metric into persistent state
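The terminology on slide 52 and the per-event steps on slide 53 might map onto code roughly as below. Everything here is an assumption made for illustration: the `CubeSketch` class, the map-based event, and the dimension names are not from the talk. The cube has dimensions (procedure, zip, hour since epoch), the function is a sum, and the function field is the amount. Incorporating an event computes its coordinates and adjusts the in-memory metric at those coordinates; persisting the metric with a conditional update is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CubeSketch {
    final List<String> dimensions;  // e.g. procedure, zip, hour
    final String functionField;     // e.g. amount
    final Map<List<String>, Double> metrics = new HashMap<>(); // coordinates -> sum

    CubeSketch(List<String> dimensions, String functionField) {
        this.dimensions = dimensions;
        this.functionField = functionField;
    }

    /** Time-slice granularity: whole hours since the Unix epoch. */
    static long hourSinceEpoch(long epochMillis) {
        return epochMillis / 3_600_000L;
    }

    /** Hypothetical event: a flat map of field name to string value. */
    static Map<String, String> event(String procedure, String zip,
                                     long epochMillis, double amount) {
        Map<String, String> e = new HashMap<>();
        e.put("procedure", procedure);
        e.put("zip", zip);
        e.put("hour", Long.toString(hourSinceEpoch(epochMillis)));
        e.put("amount", Double.toString(amount));
        return e;
    }

    /** Coordinates are the event's values for each of the cube's dimensions. */
    List<String> coordinatesFor(Map<String, String> event) {
        List<String> coordinates = new ArrayList<>();
        for (String dimension : dimensions) {
            coordinates.add(event.get(dimension));
        }
        return coordinates;
    }

    /** Adjust the metric at the event's coordinates (function: sum). */
    void incorporate(Map<String, String> event) {
        double amount = Double.parseDouble(event.get(functionField));
        metrics.merge(coordinatesFor(event), amount, Double::sum);
    }

    public static void main(String[] args) {
        List<String> dims = new ArrayList<>();
        dims.add("procedure");
        dims.add("zip");
        dims.add("hour");
        CubeSketch cube = new CubeSketch(dims, "amount");
        long t = 879123L * 3_600_000L;
        cube.incorporate(event("vasectomy", "19460", t, 500.0));
        cube.incorporate(event("vasectomy", "19460", t + 60_000L, 250.0));
        System.out.println(cube.metrics);
    }
}
```

A perspective in the slide's sense would then be a query over these coordinates, such as every procedure in zip 19460 for a range of hours, rolled up to a coarser interval.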
  • 55. Let’s Challenge This a Bit because “additional tools and techniques” cost money and time. • Questions: – Can we solve the problem with a single tool and a single approach? – Can we re-use logic across layers? – Or better yet, can we collapse layers?
  • 56. A Traditional Interpretation Speed Layer (Storm) Batch Layer (Hadoop) Data Stream Serving Layer HBase Impala D’Oh! Two pipelines!
  • 57. Integrating Web Services • We need a web service that receives an event and provides, – an immediate acknowledgement – a high likelihood that the data is integrated very soon – a guarantee that the data will be integrated eventually • We need an architecture that provides for, – Code / Logic and approach re-use – Fault-Tolerance
  • 59. The Idea : Embedding State! Kafka DropWizard C* IncrementalCqlState aggregate(tuple) “Batch” Layer (Storm) Client
  • 60. The Sequence of Events
  • 61. The Wins • Reuse Aggregations and State Code! • To re-compute (or backfill) a dimension, simply re-queue! • Storm is the “safety” net – If a DW host fails during aggregation, Storm will fill in the gaps for all ACK’d events. • Is there an opportunity to reuse more? – BatchingStrategy & PartitionStrategy?
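The "safety net" claim can be illustrated with a small sketch. It assumes, as the sequence diagram in the editor's notes suggests, that the persisted row keeps the identifiers of the events already incorporated. The names (`EmbeddedStateSketch`, `PersistedState`, `flush`) are hypothetical, and the read-modify-write here is not concurrent-safe; a real version would wrap it in a conditional update. The same `flush` logic serves both the web service path and the Storm path, so an event that the web service acknowledged but never persisted is filled in when Storm dequeues it, and an event that was persisted is not counted twice.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class EmbeddedStateSketch {

    static final class Event {
        final String id;
        final long amount;

        Event(String id, long amount) {
            this.id = id;
            this.amount = amount;
        }
    }

    /** Stand-in for the row in C*: a total plus the ids folded into it. */
    static final class PersistedState {
        long total;
        final Set<String> eventIds = new HashSet<>();
    }

    /** Shared by both paths: incorporate only events not yet recorded. */
    static void flush(PersistedState persisted, List<Event> events) {
        for (Event e : events) {
            if (persisted.eventIds.add(e.id)) {
                persisted.total += e.amount;
            }
        }
    }

    public static void main(String[] args) {
        PersistedState row = new PersistedState();
        Event e1 = new Event("e1", 10);
        Event e2 = new Event("e2", 5);

        // Web service path: both events were ACK'd and queued,
        // but the host fails after persisting only e1.
        flush(row, Arrays.asList(e1));

        // Storm path: every ACK'd event is dequeued and offered again.
        flush(row, Arrays.asList(e1, e2));

        System.out.println(row.total);
    }
}
```

Re-queueing old events to backfill a new dimension works the same way: events whose identifiers are already recorded for that row are skipped, and the rest are aggregated.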
  • 62. In the end, all good. =)
  • 64. Thanks! Brought to you by Brian O’Neill, CTO boneill@healthmarketscience.com @boneill42 12 years together
  • 66. CassandraCqlState
        public void commit(Long txid) {
            BatchStatement batch = new BatchStatement(Type.LOGGED);
            batch.addAll(this.statements);
            clientFactory.getSession().execute(batch);
        }

        public void addStatement(Statement statement) {
            this.statements.add(statement);
        }

        public ResultSet execute(Statement statement) {
            return clientFactory.getSession().execute(statement);
        }
  • 67. CassandraCqlStateUpdater
        public void updateState(CassandraCqlState state, List<TridentTuple> tuples, TridentCollector collector) {
            for (TridentTuple tuple : tuples) {
                Statement statement = this.mapper.map(tuple);
                state.addStatement(statement);
            }
        }
  • 68. ExampleMapper
        public Statement map(List<String> keys, Number value) {
            Insert statement = QueryBuilder.insertInto(KEYSPACE_NAME, TABLE_NAME);
            statement.value(KEY_NAME, keys.get(0));
            statement.value(VALUE_NAME, value);
            return statement;
        }

        public Statement retrieve(List<String> keys) {
            Select statement = QueryBuilder.select()
                    .column(KEY_NAME).column(VALUE_NAME)
                    .from(KEYSPACE_NAME, TABLE_NAME)
                    .where(QueryBuilder.eq(KEY_NAME, keys.get(0)));
            return statement;
        }

Editor's Notes

  1. Tuple: set of key-value pairs (values can be serialized objects)
  2. title Distributed Counting participant A participant B participant Storage note over Storage {"fox" : 6} end note note over A count("fox", batch)=3 end note A->Storage: read("fox") note over B count("fox", batch)=10 end note Storage->A: 6 B->Storage: read("fox") Storage->B: 6 note over A add(6, 3) = 9 end note note over B add(6, 10) = 16 end note B->Storage: write(16) A->Storage: write(9) note over Storage {"fox":9} end note
  3. title Distributed Counting participant Client participant DropWizard participant Kafka participant State(1) participant C* participant Storm participant State(2) Client->DropWizard: POST(event) DropWizard->State(1): aggregate(new Tuple(event)) DropWizard->Kafka: queue(event) DropWizard->Client: 200(ACK) note over State(1) duration (30 sec.) end note State(1)->C*: state, events = read(key) note over State(1) state = aggregate (state, in_memory_state) events = join (events, in_memory_events) end note State(1)->C*: write(state, events) Kafka->Storm: dequeue(event) Storm->State(2): persisted_state, events = read(key) note over State(2) if (!contains?(event)) ... end note State(2)->C*: if !contains(ids) write(state)