1• 800.593.4467 • info@healthmarketscience.com




                                                    The Big Data Quadfecta


Brian O’Neill, Lead Architect, Health Market Science (@boneill42, bone@alumni.brown.edu)
Taylor Goetz, Development Lead, Health Market Science (@ptgoetz, ptgoetz@gmail.com)




Quadfecta?
1. Quadfecta
   • A legendary beirut/beer pong shot that lands on the tops of four cups simultaneously. Considered the rarest shot in the game, topping even the trifecta, 2-cup knockover-and-sink, and simultaneous 6-cup game-ending double bounce-in.

   • Kafka
   • Storm
   • Elastic Search
   • Cassandra




3 V’s
 • Volume
 • Variety
 • Velocity




      The Use Case




                                                           Our Mission




                                                 Prescriber eligibility and remediation
                                                 Eliminate fraud, waste and abuse
                                                 Insights into the healthcare space




The Business

Master Data Solutions: Health Care Provider & Facilities (Variety/Velocity)
 • >2000 sources
 • 6 million unique HCPs
 • 10+ years history
 Data Challenges
 • Constant change in real-world data
 • Conflicting & partial info
 • Frequent changes to source structure
 • Authoritative sources vs. crowdsource
 • Predicting source quality

Business Solutions
 • CompleteView, Expense Manager, CompleteSpend
 • Prescriber Eligibility/Remediation
 • Analytics (Influencer Networks)

Medical Claims Data: Medical Procedures & Diagnosis (Volume/Velocity)
 • ~1B claims annually
 • +5B records annually
 • 5+ years history
 Data Challenges
 • Sources have incomplete capture
 • Overlapping source data
 • Statistical projections & biases
 • Social media type relationships




Our Solutions
[Figure: solution stack, top to bottom]
 Business Needs: Sales & Marketing, Compliance, Business Systems, Finance & Legal
 Solutions: Provider Data; Data Assessment, Integration & Enrichment Services; Compliance; Market Intelligence
 Advanced Technology: Storm, Master Data Management
 HMS Authoritative Sources: PDC, Medical Claims, Federal, State, Web, Derived




                                                          Datacenter
                                                 ¾ Petabytes of raw storage
                                                 Virtualized (VMware)
                                                 On a SAN


                                                 Should we go physical???




Under the Hood
[Figure: layered architecture. Callout: "I’m happy"]
 Interfacing: User Interface, Web Services
 Visualization: Analytics, Dashboard / Reports
 Structured Storage: Indexing, Relational
 Flexible Storage: NoSQL, Graph(s)
 Distributed Processing: Match, Consolidate, Standardize, Validate
 Data Sources: Customer, Web, Government




Master Data Management
Facts (f ∈ F) asserted by each source type over time:
 Harvested:   f_address ∈ F @ t0
 Private:     f_sanction ∈ F @ t1, f_sanction ∈ F @ t4
 Government:  f_license ∈ F @ t5

Schema Change!




      The Design




                                                 System of Record




                                                 Flexibility (Variety)
                                                 Scalability (Velocity + Volume)




                                                      Design Principles
                                                 Patterns
                                                  Idempotent Operations
                                                      Elegantly handle replay
                                                  Immutable data
                                                      Assertions of facts over time



                                                 Anti-Patterns
                                                  Transactions / Locking
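
The two patterns above can be sketched in a few lines of plain Java. This is only an illustration, not the HMS implementation: `Fact` and `FactStore` are hypothetical names, and an in-memory set stands in for the real store. Facts are never updated in place, and asserting the same fact twice (a replay) changes nothing.

```java
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical model: an immutable assertion that an attribute had a value at a time.
final class Fact {
    final String entityId;
    final String attribute;
    final String value;
    final long assertedAt;

    Fact(String entityId, String attribute, String value, long assertedAt) {
        this.entityId = entityId;
        this.attribute = attribute;
        this.value = value;
        this.assertedAt = assertedAt;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Fact)) return false;
        Fact f = (Fact) o;
        return assertedAt == f.assertedAt && entityId.equals(f.entityId)
                && attribute.equals(f.attribute) && value.equals(f.value);
    }

    @Override
    public int hashCode() {
        return Objects.hash(entityId, attribute, value, assertedAt);
    }
}

// Hypothetical append-only store: no updates, no deletes, no locks.
final class FactStore {
    private final Set<Fact> facts = new LinkedHashSet<>();

    // Idempotent: returns false and changes nothing when the fact was already asserted.
    boolean assertFact(Fact fact) {
        return facts.add(fact);
    }

    int size() {
        return facts.size();
    }

    // The "current" value is derived from history: the most recently asserted fact wins.
    String current(String entityId, String attribute) {
        Fact latest = null;
        for (Fact f : facts) {
            boolean matches = f.entityId.equals(entityId) && f.attribute.equals(attribute);
            if (matches && (latest == null || f.assertedAt > latest.assertedAt)) {
                latest = f;
            }
        }
        return latest == null ? null : latest.value;
    }
}
```

Because writes only ever add facts, a replayed message needs no special handling and no transaction is required to keep the history consistent.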




State / Counting
Exactly-once semantics for state
 Create small batches
 Order batches

Batch sizes: Batch 1 = 4, Batch 2 = 6, Batch 3 = 13, Batch 3’ (replay of Batch 3) = 13

Arrival order and running total (Σ):
 Batch   Total
 1       4
 3       4 (wait)
 2       10 (+6)
 3       23 (+13)
 3’      23 (+0)
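
The table can be reproduced with a small counter that only commits batches in order. This is a sketch under assumptions: `OrderedBatchCounter` is a hypothetical name, batch ids are assumed to start at 1, and the state lives in memory rather than in a database.

```java
// Hypothetical counter that applies each batch exactly once, in batch-id order.
final class OrderedBatchCounter {
    private long total = 0;
    private long lastCommitted = 0; // assumption: batch ids start at 1

    // Returns true when the batch is (or already was) applied, false when it must wait.
    boolean commit(long batchId, long count) {
        if (batchId <= lastCommitted) {
            return true;              // replayed batch: already counted, adds +0
        }
        if (batchId != lastCommitted + 1) {
            return false;             // an earlier batch is still missing: wait and retry
        }
        total += count;
        lastCommitted = batchId;
        return true;
    }

    long total() {
        return total;
    }
}

final class OrderedBatchCounterDemo {
    public static void main(String[] args) {
        OrderedBatchCounter counter = new OrderedBatchCounter();
        counter.commit(1, 4);   // total 4
        counter.commit(3, 13);  // rejected, total stays 4 (wait)
        counter.commit(2, 6);   // total 10 (+6)
        counter.commit(3, 13);  // total 23 (+13)
        counter.commit(3, 13);  // replay, total 23 (+0)
        System.out.println(counter.total()); // prints 23
    }
}
```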




                                                     What we did wrong…




                                                 Could not react to transactional changes
                                                 Needed extra logic to track what changed
                                                 Took too long




                                                  What we did wrong… (II)




                                                 AOP-based triggers
                                                  Worked well initially.
                                                  Business processes captured as side-effects.




                                                      What we did right.
                                                 REST APIs for Loose Coupling


                                                 See Virgil:
                                                  https://github.com/hmsonline/virgil


                                                 But really… watch out for Intravert
                                                  https://github.com/zznate/intravert-ug




                                                                 Kafka
                                                 •   Millions of Messages
                                                 •   Replay Enabled
                                                 •   No transactions / Lightning Fast
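
What "Replay Enabled" means can be sketched with an append-only log where the consumer, not the queue, owns its read position. This is an in-memory stand-in with hypothetical names (`ReplayableLog`), not the Kafka client API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a replayable queue.
final class ReplayableLog {
    private final List<String> messages = new ArrayList<>();

    // Appends a message and returns its offset in the log.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Reads everything from the given offset onward; reading never removes messages.
    List<String> readFrom(long offset) {
        return new ArrayList<>(messages.subList((int) offset, messages.size()));
    }
}

final class ReplayDemo {
    public static void main(String[] args) {
        ReplayableLog log = new ReplayableLog();
        log.append("claim-1");
        long checkpoint = log.append("claim-2");
        log.append("claim-3");

        // Normal consumption from the start.
        System.out.println(log.readFrom(0));          // prints [claim-1, claim-2, claim-3]
        // After a downstream failure, rewind to a saved offset and read again.
        System.out.println(log.readFrom(checkpoint)); // prints [claim-2, claim-3]
    }
}
```

Since reads do not consume messages, recovering from a failure is a matter of resetting an offset, which is why no transactions are needed on the queue side.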




                                                           Elastic Search
                                                 •   Edit Distance / Soundex
                                                 •   Native Scalability
                                                 •   Fuzzy Search
                                                 •   Geospatial
                                                 •   Facets
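
For readers unfamiliar with edit distance, here is a minimal sketch of the classic Levenshtein computation. It illustrates the concept only; it is not how Elastic Search implements fuzzy matching, and `EditDistance` is a hypothetical helper name.

```java
// Levenshtein distance: the minimum number of single-character inserts,
// deletes and substitutions needed to turn one string into another.
final class EditDistance {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(levenshtein("Goetz", "Goets"));    // prints 1
    }
}
```

A small distance between two provider names is one signal that they may refer to the same person, which is the matching problem described earlier in the deck.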




                                                                 Storm
                                                 •   Guaranteed once semantics
                                                 •   Well-designed processing abstraction
                                                 •   Beats BYODP
                                                 •   Momentum




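The System
[Figure: system diagram with components REST API, Kafka Queue(s), Offset, C* and Elastic Search (ES1, ES2); failure points labeled A, B and C.]
Callouts on the figure:
 • "NP. Replication Factor > 1."
 • "NP. Rewind!"
 • "NP. We can route around it."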




    ?
                             What comes after Quadfecta?




                                                   Real-Time Integration
                                                 Real-time CRUD via Web Services
                                                  DRPC
                                                  “Real-time” Queue



                                                            Not quite sure?




      The Storm/C* Bridge




                                                 Anatomy of a Storm Cluster

                                                  Nimbus
                                                   Master Node
                                                  Zookeeper
                                                   Cluster Coordination
                                                  Supervisors
                                                   Worker Nodes




                                                          Storm Primitives
                                                 Streams
                                                   Unbounded sequence of tuples
                                                 Spouts
                                                   Stream Sources
                                                 Bolts
                                                   Unit of Computation
                                                 Topologies
                                                   Combination of n Spouts and m Bolts
                                                   Defines the overall “Computation”
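
How the four primitives fit together can be sketched with a toy, single-threaded model. Everything here is an assumption made for illustration: `MiniSpout`, `MiniBolt`, `MiniTopology` and `SplitBolt` are hypothetical names, not Storm classes, and a real stream is unbounded while this spout signals exhaustion with null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a spout: a source of tuples (null means "no more" in this toy).
interface MiniSpout {
    Map<String, Object> nextTuple();
}

// Hypothetical stand-in for a bolt: consumes one tuple and emits zero or more tuples.
interface MiniBolt {
    List<Map<String, Object>> execute(Map<String, Object> tuple);
}

// Example bolt: splits the "sentence" field into one tuple per word.
final class SplitBolt implements MiniBolt {
    @Override
    public List<Map<String, Object>> execute(Map<String, Object> tuple) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (String word : String.valueOf(tuple.get("sentence")).split(" ")) {
            out.add(Map.<String, Object>of("word", word));
        }
        return out;
    }
}

// The "topology" wires a spout to a chain of bolts and defines the overall computation.
final class MiniTopology {
    static List<Map<String, Object>> run(MiniSpout spout, List<MiniBolt> bolts) {
        List<Map<String, Object>> results = new ArrayList<>();
        Map<String, Object> tuple;
        while ((tuple = spout.nextTuple()) != null) {
            List<Map<String, Object>> current = List.of(tuple);
            for (MiniBolt bolt : bolts) {
                List<Map<String, Object>> next = new ArrayList<>();
                for (Map<String, Object> t : current) {
                    next.addAll(bolt.execute(t));
                }
                current = next;
            }
            results.addAll(current);
        }
        return results;
    }
}
```

In Storm itself the spouts and bolts run in parallel across worker nodes; the toy keeps only the data-flow shape.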




                                                         Storm Spouts
                                                 Represents a source (stream) of data
                                                  Queues (JMS, Kafka, Kestrel, etc.)
                                                  Twitter Firehose
                                                  Sensor Data
                                                 Emits “Tuples” (Events) based on
                                                 source
                                                  Primary Storm data structure
                                                  Set of Key-Value pairs




                                                          Storm Bolts
                                                 Receive Tuples from Spouts or
                                                 other Bolts
                                                 Operate on, or React to Data
                                                  Functions/Filters/Joins/Aggregations
                                                  Database writes/lookups
                                                 Optionally emit additional Tuples




                                                      Storm Topologies
                                                 Data flow between spouts and bolts
                                                 Routing of Tuples between
                                                 spouts/bolts
                                                  Stream “Groupings”
                                                 Parallelism of Components
                                                 Long-Lived
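
A stream grouping decides which parallel task of a bolt receives each tuple. The sketch below shows the idea behind a fields grouping: tuples with the same field value always land on the same task, which is what makes per-key counting possible. `Groupings` is a hypothetical helper, and the hash-modulo rule is an assumption; Storm's actual partitioning is an implementation detail.

```java
// Hypothetical illustration of a fields grouping across numTasks parallel bolt tasks.
final class Groupings {
    // Same field value -> same task index, always in the range [0, numTasks).
    static int fieldsGrouping(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int tasks = 4; // parallelism of the receiving bolt
        for (String word : new String[] {"kafka", "storm", "kafka", "cassandra", "storm"}) {
            System.out.println(word + " -> task " + fieldsGrouping(word, tasks));
        }
    }
}
```

Every occurrence of "kafka" prints the same task index, so a counting bolt never has to coordinate with its siblings.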




                            Storm Topologies




                                                   Storm and Cassandra
                                                 Use Cases:
                                                  Write Storm Tuple data to C*
                                                   Computation Results
                                                   Pre-compute indices

                                                  Read data from C* and emit Storm
                                                  Tuples
                                                   Dynamic Lookups
                                                              http://github.com/hmsonline/storm-cassandra



                                                      Storm Cassandra Bolt
                                                             Types
                                                                           CassandraBolt



                                                                             Cassandra
                                                                             LookupBolt
                                                                                                          C*
                                                 CassandraBolt
                                                   Writes data to Cassandra
                                                   Available in Batching and Non-Batching

                                                 CassandraLookupBolt
                                                   Reads data from Cassandra
                                                                        http://github.com/hmsonline/storm-cassandra




                                                 Storm-Cassandra Project
                                                 Provides generic Bolts for
                                                 writing/reading Storm Tuples
                                                 to/from C*
                                                                         Tuple
                                                           Tuple        Mapper        Rows




                                                            Tuples
                                                                       Columns
                                                                       Mapper         Columns
                                                                                                 C*
                                                               http://github.com/hmsonline/storm-cassandra




                                                 Storm-Cassandra Project
                                                 TupleMapper Interface
                                                  Tells the CassandraBolt how to write a
                                                  tuple to an arbitrary data model


                                                 Given a Storm Tuple:
                                                  Map to Column Family
                                                  Map to Row Key
                                                  Map to Columns
                                                               http://github.com/hmsonline/storm-cassandra
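
The three mapping steps can be sketched as follows. This is a simplified, hypothetical interface written for illustration (`SimpleTupleMapper`, with a tuple modeled as a plain map); the actual storm-cassandra interface and its signatures may differ, so check the project source before relying on it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified mapper: one method per step listed on the slide.
interface SimpleTupleMapper {
    String mapToColumnFamily(Map<String, Object> tuple);
    String mapToRowKey(Map<String, Object> tuple);
    Map<String, String> mapToColumns(Map<String, Object> tuple);
}

// Example: store word counts with the word as row key and a single "count" column.
final class WordCountMapper implements SimpleTupleMapper {
    @Override
    public String mapToColumnFamily(Map<String, Object> tuple) {
        return "word_counts"; // assumed column family name
    }

    @Override
    public String mapToRowKey(Map<String, Object> tuple) {
        return String.valueOf(tuple.get("word"));
    }

    @Override
    public Map<String, String> mapToColumns(Map<String, Object> tuple) {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("count", String.valueOf(tuple.get("count")));
        return columns;
    }
}
```

Keeping the mapping in its own object is what lets one generic bolt write to an arbitrary data model.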




                                                 Storm-Cassandra Project
                                                 ColumnsMapper Interface
                                                  Tells the CassandraLookupBolt how
                                                  to transform a C* row into a Storm
                                                  Tuple


                                                 Given a C* Row Key and list of
                                                 Columns:
                                                  Return a list of Storm Tuples
                                                                http://github.com/hmsonline/storm-cassandra
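
The read direction can be sketched the same way. Again this is a hypothetical, simplified interface (`SimpleColumnsMapper`), not the project's actual signature: given a row key and its columns, it returns the tuples to emit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified mapper from a C* row to Storm-style tuples (lists of values).
interface SimpleColumnsMapper {
    List<List<Object>> mapToValues(String rowKey, Map<String, String> columns);
}

// Example: emit one (rowKey, columnName, columnValue) tuple per column in the row.
final class OneTuplePerColumn implements SimpleColumnsMapper {
    @Override
    public List<List<Object>> mapToValues(String rowKey, Map<String, String> columns) {
        List<List<Object>> tuples = new ArrayList<>();
        for (Map.Entry<String, String> column : columns.entrySet()) {
            List<Object> tuple = new ArrayList<>();
            tuple.add(rowKey);
            tuple.add(column.getKey());
            tuple.add(column.getValue());
            tuples.add(tuple);
        }
        return tuples;
    }
}
```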




                                                 Storm-Cassandra Project
                                                 Current State:
                                                   Version 0.4.0
                                                   Uses Astyanax Client
                                                   Several out-of-the-box *Mapper
                                                   Implementations:
                                                     Basic Key-Value Columns
                                                     Value-less Columns
                                                     Counter Columns
                                                     Lookup by row key
                                                     Lookup by range query
                                                   Composite Key/Column Support
                                                   Trident support
                                                                  http://github.com/hmsonline/storm-cassandra




                                                 Storm-Cassandra Project
                                                 Future Plans:
                                                  Switch to CQL
                                                  Enhanced Trident Support




                                                             http://github.com/hmsonline/storm-cassandra




                                                 Persistent Word Count




                                                        http://github.com/hmsonline/storm-cassandra




                            DRPC




                            “Reach” Computation




                                        MDM Topology*
                                        *Notional




                            Load Topology




                                                   Shameless Shoutouts
                                                 HMS (https://github.com/hmsonline/)
                                                  storm-cassandra
                                                  storm-elastic-search
                                                  storm-jdbi (coming soon)


                                                 ptgoetz (https://github.com/ptgoetz)
                                                  storm-jms
                                                  storm-signals




      Next Level : Trident




                                                               Trident
                                                 Provides a higher-level abstraction for
                                                 stream processing
                                                  Constructs for state management and
                                                  Batching
                                                 Adds additional primitives that abstract
                                                 away common topological patterns
                                                 Deprecates transactional topologies
                                                 Distributed with Storm




                                                 Sample Trident Operations
                                                 Partition Local
                                                  Functions ( execute(x) -> x + y )
                                                  Filters ( isKeep(x) -> 0, x )
                                                  PartitionAggregate
                                                    Combiner ( pairwise combining )
                                                    Reducer ( iterative accumulation )
                                                    Aggregator ( byoa )
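
The shape of each operation can be sketched in plain Java. These are not Trident's interfaces; `PartitionLocalOps` and its method names are hypothetical, and tuples are modeled as simple lists of values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

final class PartitionLocalOps {
    // Function: keeps the input fields x and appends new fields y (here, the word length).
    static List<Object> appendLength(List<Object> x) {
        List<Object> out = new ArrayList<>(x);
        out.add(String.valueOf(x.get(0)).length());
        return out;
    }

    // Filter: each tuple is either kept unchanged or dropped (0 or x).
    static boolean isKeep(List<Object> x) {
        return !String.valueOf(x.get(0)).isEmpty();
    }

    // Combiner: partial results are merged pairwise, so partitions can be pre-aggregated.
    static <T> T combine(List<T> partials, T zero, BinaryOperator<T> merge) {
        T acc = zero;
        for (T partial : partials) {
            acc = merge.apply(acc, partial);
        }
        return acc;
    }

    // Reducer: a single accumulator is updated one tuple at a time.
    static <A, T> A reduce(List<T> tuples, A init, BiFunction<A, T, A> step) {
        A acc = init;
        for (T tuple : tuples) {
            acc = step.apply(acc, tuple);
        }
        return acc;
    }
}
```

For example, a count can be expressed either way: as a combiner that sums per-partition counts with `Long::sum`, or as a reducer that adds one per tuple.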




A sample topology

TridentTopology topology = new TridentTopology();

// Split each sentence into words, group by word, and keep a persistent count per word.
TridentState wordCounts =
    topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(
            MemcachedState.opaque(serverLocations),
            new Count(),
            new Fields("count"))
        .parallelismHint(6);





                                                                                                         https://github.com/nathanmarz/storm/wiki/Trident-state




                                                                     Trident State
                                                 Sequenced writes by batch/transaction id.

                                                   Spouts
                                                      Transactional
                                                           Batch contents never change
                                                      Opaque
                                                           Batch contents can change

                                                   State
                                                      Transactional
                                                           Store tx_id with counts to maintain sequencing of writes.
                                                      Opaque
                                                           Store previous value in order to overwrite the current value when
                                                           contents of a batch change.
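
The two state strategies can be sketched for a single counter. This is a simplified in-memory illustration with hypothetical class names (`TransactionalCount`, `OpaqueCount`); it assumes batches are committed in transaction-id order, as the sequencing rule above requires.

```java
// Transactional state: stores the last tx_id next to the count.
// Safe only when a replayed batch has exactly the same contents.
final class TransactionalCount {
    long count = 0;
    long txId = -1; // -1 means nothing has been applied yet

    void apply(long batchTxId, long delta) {
        if (batchTxId == txId) {
            return;           // same batch replayed: already counted, skip
        }
        count += delta;
        txId = batchTxId;
    }
}

// Opaque state: also stores the previous value, so a replayed batch
// with different contents can overwrite what the first attempt wrote.
final class OpaqueCount {
    long value = 0;
    long prevValue = 0;
    long txId = -1;

    void apply(long batchTxId, long delta) {
        if (batchTxId == txId) {
            value = prevValue + delta; // redo this batch on top of the previous value
        } else {
            prevValue = value;         // new batch: remember the value before it
            value += delta;
            txId = batchTxId;
        }
    }
}
```

With the opaque variant, applying batch 2 with +3 and then replaying batch 2 with +4 leaves the count reflecting only the +4, because the replay starts again from the stored previous value.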

Security, Economics, Technology and the Sustainability ParadoxSecurity, Economics, Technology and the Sustainability Paradox
Security, Economics, Technology and the Sustainability Paradox
 
Findhealth
FindhealthFindhealth
Findhealth
 
iHT2 Health IT Summit in Phoenix 2013 – Ken Maddock, VP Facility Support Serv...
iHT2 Health IT Summit in Phoenix 2013 – Ken Maddock, VP Facility Support Serv...iHT2 Health IT Summit in Phoenix 2013 – Ken Maddock, VP Facility Support Serv...
iHT2 Health IT Summit in Phoenix 2013 – Ken Maddock, VP Facility Support Serv...
 
Jonathan Izant AAAS Annual Meeting 2012-02-18
Jonathan Izant AAAS Annual Meeting 2012-02-18Jonathan Izant AAAS Annual Meeting 2012-02-18
Jonathan Izant AAAS Annual Meeting 2012-02-18
 
Prediktiv analys och kundlojalitet
Prediktiv analys och kundlojalitetPrediktiv analys och kundlojalitet
Prediktiv analys och kundlojalitet
 
Healthcare Mgt By The Numbers
Healthcare Mgt By The NumbersHealthcare Mgt By The Numbers
Healthcare Mgt By The Numbers
 
CPR Capabilities Presentation
CPR Capabilities PresentationCPR Capabilities Presentation
CPR Capabilities Presentation
 
Health IT & Voice of Patient
Health IT & Voice of PatientHealth IT & Voice of Patient
Health IT & Voice of Patient
 
Marketing for Medical Device Startups
Marketing for Medical Device StartupsMarketing for Medical Device Startups
Marketing for Medical Device Startups
 
Data Driven Clinical Quality and Decision Support
Data Driven Clinical Quality and Decision SupportData Driven Clinical Quality and Decision Support
Data Driven Clinical Quality and Decision Support
 
Medical Affairs Excellence Services Summary
Medical Affairs Excellence Services SummaryMedical Affairs Excellence Services Summary
Medical Affairs Excellence Services Summary
 
Healthcare Reform & Physician Loyalty: What Can CRM Do To Support ACOs?
Healthcare Reform & Physician Loyalty: What Can CRM Do To Support ACOs?Healthcare Reform & Physician Loyalty: What Can CRM Do To Support ACOs?
Healthcare Reform & Physician Loyalty: What Can CRM Do To Support ACOs?
 
Current State of HIE
Current State of HIECurrent State of HIE
Current State of HIE
 
Improving Outcomes And Efficiencies Of Managing Patients In The Community - A...
Improving Outcomes And Efficiencies Of Managing Patients In The Community - A...Improving Outcomes And Efficiencies Of Managing Patients In The Community - A...
Improving Outcomes And Efficiencies Of Managing Patients In The Community - A...
 
2011 Patient Access Study Final Toc
2011 Patient Access Study   Final   Toc2011 Patient Access Study   Final   Toc
2011 Patient Access Study Final Toc
 
CapSite 2011 U.S. Patient Access Study
CapSite 2011 U.S. Patient Access StudyCapSite 2011 U.S. Patient Access Study
CapSite 2011 U.S. Patient Access Study
 
Portfolio (pcorey)
Portfolio (pcorey)Portfolio (pcorey)
Portfolio (pcorey)
 

More from DataStax Academy

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftDataStax Academy
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraDataStax Academy
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsDataStax Academy
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingDataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackDataStax Academy
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache CassandraDataStax Academy
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready CassandraDataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonDataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraDataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseDataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraDataStax Academy
 

More from DataStax Academy (20)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 

The Big Data Quadfecta

  • 1. The Big Data Quadfecta. Brian O’Neill, Lead Architect, Health Market Science (@boneill42, bone@alumni.brown.edu); Taylor Goetz, Development Lead, Health Market Science (@ptgoetz, ptgoetz@gmail.com). Contact: 1• 800.593.4467 • info@healthmarketscience.com
  • 2. Quadfecta? 1. Quadfecta: A legendary beirut/beer pong shot that lands on the tops of four cups simultaneously. Considered the rarest shot in the game, topping even the trifecta, 2-cup knockover-and-sink, and simultaneous 6-cup game-ending double bounce-in. • Kafka • Storm • Elastic Search • Cassandra
  • 3. 3 V’s: Volume, Variety, Velocity
  • 4. The Use Case
  • 5. Our Mission: Prescriber eligibility and remediation. Eliminate fraud, waste and abuse. Insights into the healthcare space.
  • 6. The Business
      Master Data Solutions Business: Health Care Provider & Facilities Solutions (Variety/Velocity)
        • >2000 sources • 6 Million unique HCPs • 10+ years history
        Data Challenges: • Constant change in real world data • Conflicting & partial info • Frequent changes to source structure • Authoritative sources vs. crowdsource • Predicting source quality
      Medical Claims Data Business: Medical Procedures & Diagnosis (Volume/Velocity)
        • ~1B claims annually • +5B records annually • 5+ years history
        Data Challenges: • Sources have incomplete capture • Overlapping source data • Statistical projections & biases • Social media type relationships (Influencer Networks)
      CompleteView, Expense Manager, CompleteSpend, Prescriber Eligibility/Remediation, Analytics
  • 7. Our Solutions. Business Needs: Sales & Marketing, Compliance, Business Systems, Finance & Legal. Solutions: Provider Data Integration & Enrichment; Data Assessment, Compliance Services; Market Intelligence. Advanced Technology: Storm, Master Data Management. HMS Authoritative Sources: Medical Claims, Federal, State, Web Derived, PDC.
  • 8. Datacenter: ¾ Petabytes of raw storage. Virtualized (VMware). On a SAN. Should we go physical???
  • 9. Under the Hood (diagram labels): User Interface, Web Services, Interfacing, “I’m happy”, Analytics, Dashboard / Reports, Visualization, Match, Customer, Consolidate, Indexing, Relational, Web, Structured Storage, Standardize, Government, Validate, NoSQL, Graph(s), Data Sources, Distributed Processing, Flexible Storage
  • 10. Master Data Management. Harvested: f_address ∈ F@t0. Government: f_license ∈ F@t5. Private: f_sanction ∈ F@t1, f_sanction ∈ F@t4. Schema Change!
  • 11. The Design
  • 12. System of Record: Flexibility (Variety), Scalability (Velocity + Volume)
  • 13.
  • 14. Design Principles. Patterns: Idempotent Operations (elegantly handle replay); Immutable data (assertions of facts over time). Anti-Patterns: Transactions / Locking.
  • 15. State / Counting: Exactly-once semantics for state. Create small batches. Order batches. Batch sums (Σ): Batch 1 = 4, Batch 2 = 6, Batch 3 = 13, Batch 3’ = 13. Running total as batches arrive: batch 1 → 4; batch 3 → 4 (wait); batch 2 → 10 (+6); batch 3 → 23 (+13); batch 3’ → 23 (+0).
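
The counting table on this slide can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `OrderedBatchCounter` and its `offer` method are hypothetical, batch ids are assumed to start at 1, and a batch that arrives too early is simply refused so the caller can retry it later.

```java
/** Hypothetical sketch: exactly-once counting by applying batches strictly in order. */
public class OrderedBatchCounter {
    private long total = 0;
    private int lastApplied = 0; // id of the last batch folded into the total

    /**
     * Offers the sum of one batch and returns the running total.
     * Only the next batch in sequence changes the total; a replay of an
     * already applied batch adds nothing, and an early batch has to wait.
     */
    public long offer(int batchId, long sum) {
        if (batchId == lastApplied + 1) {
            total += sum;
            lastApplied = batchId;
        }
        return total;
    }

    public static void main(String[] args) {
        OrderedBatchCounter counter = new OrderedBatchCounter();
        System.out.println(counter.offer(1, 4));  // 4
        System.out.println(counter.offer(3, 13)); // 4  (batch 2 not seen yet: wait)
        System.out.println(counter.offer(2, 6));  // 10 (+6)
        System.out.println(counter.offer(3, 13)); // 23 (+13)
        System.out.println(counter.offer(3, 13)); // 23 (+0, replay of batch 3)
    }
}
```

Tracking only the id of the last applied batch is enough here because batches are small and ordered; the replay in the last call is harmless, which is the idempotence the design principles ask for.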
  • 16. What we did wrong… Could not react to transactional changes. Needed extra logic to track what changed. Took too long.
  • 17. What we did wrong… (II): AOP-based triggers. Worked well initially. Business processes captured as side-effects.
  • 18. What we did right. REST APIs for Loose Coupling. See Virgil: https://github.com/hmsonline/virgil But really… watch out for Intravert: https://github.com/zznate/intravert-ug
  • 19. Kafka • Millions of Messages • Replay Enabled • No transactions / Lightning Fast
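
"Replay Enabled" can be pictured as an append-only log in which each consumer keeps its own offset and rewinds by reading from an earlier one. The sketch below is a plain-Java model of that idea only; `ReplayableLog`, `append`, and `readFrom` are hypothetical names and are not part of the Kafka client API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of an append-only, offset-addressed message log. */
public class ReplayableLog {
    private final List<String> messages = new ArrayList<>();

    /** Appends a message and returns the offset it was stored at. */
    public long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    /** Returns up to max messages starting at offset; the log itself is never modified. */
    public List<String> readFrom(long offset, int max) {
        int from = (int) Math.max(0, Math.min(offset, messages.size()));
        int to = Math.min(from + Math.max(0, max), messages.size());
        return new ArrayList<>(messages.subList(from, to));
    }

    public static void main(String[] args) {
        ReplayableLog log = new ReplayableLog();
        log.append("a");
        log.append("b");
        log.append("c");
        long offset = 0;                                  // the consumer owns its offset
        offset += log.readFrom(offset, 2).size();         // consumes a, b
        offset += log.readFrom(offset, 2).size();         // consumes c
        System.out.println(log.readFrom(0, 3));           // rewind: [a, b, c]
    }
}
```

Because reading never removes anything, a downstream failure is handled by resetting the offset and processing again, which is why the consumers need idempotent operations.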
  • 20. Elastic Search • Edit Distance / Soundex • Native Scalability • Fuzzy Search • Geospatial • Facets
  • 21. Storm • Guaranteed once semantics • Well-designed processing abstraction • Beats BYODP • Momentum
  • 22. The System (diagram): REST API, Kafka Queue(s) with Offset, C*, Elastic Search (ES1, ES2). Failure points A, B, C with callouts: “NP. Rewind!”, “NP. We can route around it.”, “NP. Replication Factor > 1.”
  • 23. What comes after Quadfecta?
  • 24. Real-Time Integration: Real-time CRUD via Web Services, DRPC, “Real-time” Queue. Not quite sure?
  • 25. The Storm/C* Bridge
  • 26. Anatomy of a Storm Cluster: Nimbus (Master Node); Zookeeper (Cluster Coordination); Supervisors (Worker Nodes)
  • 27. Storm Primitives: Streams (unbounded sequence of tuples); Spouts (stream sources); Bolts (unit of computation); Topologies (combination of n Spouts and n Bolts; defines the overall “Computation”)
  • 28. Storm Spouts: Represents a source (stream) of data: Queues (JMS, Kafka, Kestrel, etc.), Twitter Firehose, Sensor Data. Emits “Tuples” (Events) based on source: primary Storm data structure; set of Key-Value pairs.
  • 29. Storm Bolts: Receive Tuples from Spouts or other Bolts. Operate on, or react to, data: Functions/Filters/Joins/Aggregations; Database writes/lookups. Optionally emit additional Tuples.
  • 30. Storm Topologies: Data flow between spouts and bolts. Routing of Tuples between spouts/bolts (Stream “Groupings”). Parallelism of Components. Long-Lived.
  • 31. Storm Topologies
  • 32. Storm and Cassandra. Use Cases: Write Storm Tuple data to C* (computation results, pre-compute indices); Read data from C* and emit Storm Tuples (dynamic lookups). http://github.com/hmsonline/storm-cassandra
  • 33. Storm Cassandra Bolt Types. CassandraBolt: writes data to Cassandra; available in Batching and Non-Batching. CassandraLookupBolt: reads data from Cassandra. http://github.com/hmsonline/storm-cassandra
  • 34. Storm-Cassandra Project: Provides generic Bolts for writing/reading Storm Tuples to/from C*. Diagram labels: Tuple, TupleMapper, Rows; Tuples, ColumnsMapper, Columns; C*. http://github.com/hmsonline/storm-cassandra
  • 35. Storm-Cassandra Project. TupleMapper Interface: Tells the CassandraBolt how to write a tuple to an arbitrary data model. Given a Storm Tuple: Map to Column Family; Map to Row Key; Map to Columns. http://github.com/hmsonline/storm-cassandra
  • 36. Storm-Cassandra Project. ColumnsMapper Interface: Tells the CassandraLookupBolt how to transform a C* row into a Storm Tuple. Given a C* Row Key and list of Columns: Return a list of Storm Tuples. http://github.com/hmsonline/storm-cassandra
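
The two mapper roles described on slides 35 and 36 can be sketched without any Storm or Cassandra dependency. Everything below is an assumption made for illustration: a tuple is modelled as a `Map<String, Object>`, a row as a `Map<String, String>` of columns, and the interface and method names are stand-ins that follow the slide wording rather than the actual storm-cassandra signatures.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-ins for the mapper roles; not the storm-cassandra API. */
public class MapperSketch {

    /** Write side: given a tuple, decide column family, row key, and columns. */
    public interface TupleMapper {
        String mapToColumnFamily(Map<String, Object> tuple);
        String mapToRowKey(Map<String, Object> tuple);
        Map<String, String> mapToColumns(Map<String, Object> tuple);
    }

    /** Read side: given a row key and its columns, produce a list of tuples. */
    public interface ColumnsMapper {
        List<Map<String, Object>> mapToTuples(String rowKey, Map<String, String> columns);
    }

    /** Uses one tuple field as the row key and writes every other field as a column. */
    public static TupleMapper keyValueMapper(final String columnFamily, final String keyField) {
        return new TupleMapper() {
            public String mapToColumnFamily(Map<String, Object> tuple) {
                return columnFamily;
            }
            public String mapToRowKey(Map<String, Object> tuple) {
                return String.valueOf(tuple.get(keyField));
            }
            public Map<String, String> mapToColumns(Map<String, Object> tuple) {
                Map<String, String> columns = new LinkedHashMap<>();
                for (Map.Entry<String, Object> field : tuple.entrySet()) {
                    if (!field.getKey().equals(keyField)) {
                        columns.put(field.getKey(), String.valueOf(field.getValue()));
                    }
                }
                return columns;
            }
        };
    }

    /** Emits one tuple (rowKey, column, value) per column of the row. */
    public static ColumnsMapper oneTuplePerColumn() {
        return (rowKey, columns) -> {
            List<Map<String, Object>> tuples = new ArrayList<>();
            for (Map.Entry<String, String> column : columns.entrySet()) {
                Map<String, Object> tuple = new LinkedHashMap<>();
                tuple.put("rowKey", rowKey);
                tuple.put("column", column.getKey());
                tuple.put("value", column.getValue());
                tuples.add(tuple);
            }
            return tuples;
        };
    }
}
```

Keeping the data-model decisions in small mapper objects is what lets a generic bolt stay unaware of any particular column family layout.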
  • 37. Storm-Cassandra Project. Current State: Version 0.4.0. Uses Astyanax Client. Several out-of-the-box *Mapper Implementations: Basic Key-Value Columns, Value-less Columns, Counter Columns, Lookup by row key, Lookup by range query. Composite Key/Column Support. Trident support. http://github.com/hmsonline/storm-cassandra
  • 38. Storm-Cassandra Project. Future Plans: Switch to CQL. Enhanced Trident Support. http://github.com/hmsonline/storm-cassandra
  • 39. Persistent Word Count http://github.com/hmsonline/storm-cassandra
  • 40. DRPC
  • 41. “Reach” Computation
  • 42. *Notional MDM Topology*
  • 43. Load Topology
  • 44. Shameless Shoutouts. HMS (https://github.com/hmsonline/): storm-cassandra, storm-elastic-search, storm-jdbi (coming soon). ptgoetz (https://github.com/ptgoetz): storm-jms, storm-signals.
  • 45. Next Level: Trident
  • 46. Trident: Provides a higher-level abstraction for stream processing. Constructs for state management and batching. Adds additional primitives that abstract away common topological patterns. Deprecates transactional topologies. Distributed with Storm.
  • 47. Sample Trident Operations. Partition Local: Functions ( execute(x) → x + y ), Filters ( isKeep(x) → 0,x ). PartitionAggregate: Combiner ( pairwise combining ), Reducer ( iterative accumulation ), Aggregator ( byoa ).
  • 48. A sample topology
      TridentTopology topology = new TridentTopology();
      TridentState wordCounts =
          topology.newStream("spout1", spout)
              .each(new Fields("sentence"), new Split(), new Fields("word"))
              .groupBy(new Fields("word"))
              .persistentAggregate(
                  MemcachedState.opaque(serverLocations),
                  new Count(),
                  new Fields("count"))
              .parallelismHint(6);
      https://github.com/nathanmarz/storm/wiki/Trident-state
  • 49. Trident State: Sequenced writes by batch/transaction id. Spouts: Transactional (batch contents never change); Opaque (batch contents can change). State: Transactional (store tx_id with counts to maintain sequencing of writes); Opaque (store previous value in order to overwrite the current value when contents of a batch change).
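
The two state flavours on the last slide differ only in what is stored next to the count. The following plain-Java sketch shows the update rule for each under stated assumptions: the class and field names are hypothetical, a single counter stands in for the whole state, and batches are assumed to be applied in transaction-id order as the slide requires.

```java
/** Hypothetical sketch of transactional vs. opaque state updates for one counter. */
public class TridentStateSketch {

    /** Transactional state: the count plus the id of the batch that last updated it. */
    public static final class TxValue {
        public long count = 0;
        public long txid = -1; // -1 means no batch applied yet
    }

    /** Opaque state: current value, previous value, and the id of the last batch. */
    public static final class OpaqueValue {
        public long curr = 0;
        public long prev = 0;
        public long txid = -1;
    }

    /** A replayed batch has identical contents, so seeing the same txid means skip. */
    public static void applyTransactional(TxValue state, long txid, long delta) {
        if (state.txid == txid) {
            return;
        }
        state.count += delta;
        state.txid = txid;
    }

    /**
     * A replayed batch may have different contents, so on the same txid the
     * current value is recomputed from the stored previous value.
     */
    public static void applyOpaque(OpaqueValue state, long txid, long delta) {
        if (state.txid == txid) {
            state.curr = state.prev + delta;
        } else {
            state.prev = state.curr;
            state.curr = state.curr + delta;
            state.txid = txid;
        }
    }

    public static void main(String[] args) {
        OpaqueValue v = new OpaqueValue();
        applyOpaque(v, 1, 4);
        applyOpaque(v, 2, 6);
        applyOpaque(v, 2, 5);       // batch 2 replayed with different contents
        System.out.println(v.curr); // 9
    }
}
```

The opaque variant costs one extra stored value per key, and in exchange it tolerates spouts that cannot guarantee identical batch contents on replay.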

Editor's Notes

  1. Storm: real-time, distributed computation system. Comparable to complex event processing systems. Originated in the Twitter analytics space.
  2. Tuple: set of key-value pairs (values can be serialized objects)
  3. Useful for pre-computing queries in real-time to optimize lookups (avoid expensive C* queries).