SlideShare une entreprise Scribd logo
1  sur  49
1• 800.593.4467 • info@healthmarketscience.com




                                                    The Big Data Quadfecta


                                                 Brian O’Neill                           Taylor Goetz
                                                 Lead Architect, Health Market Science   Development Lead, Health Market Scienc
                                                 @boneill42, bone@alumni.brown.edu       @ptgoetz, ptgoetz@gmail.com
1• 800.593.4467 • info@healthmarketscience.com




                                                                    Quadfecta?
                                                 1.       Quadfecta
                                                      •    A legendary beirut/beer pong shot that lands
                                                           on the tops of four cups simultaneously.
                                                           Considered the rarest shot in the
                                                           game, topping even the trifecta, 2-cup
                                                           knockover-and-sink, and simultaneous 6-cup
                                                           game-ending double bounce-in.

                                                           •   Kafka
                                                           •   Storm
                                                           •   Elastic Search
                                                           •   Cassandra
1• 800.593.4467 • info@healthmarketscience.com




Volume
                                       3 V’s




Variety
Velocity
1• 800.593.4467 • info@healthmarketscience.com




      The Use Case
1• 800.593.4467 • info@healthmarketscience.com




                                                           Our Mission




                                                 Prescriber eligibility and remediation
                                                 Eliminate fraud, waste and abuse
                                                 Insights into the healthcare space
1• 800.593.4467 • info@healthmarketscience.com




                                                                           The Business
                                                 Master Data Solutions
                                                                                       Business
                                                                                                            Medical Claims Data
                                                 Health Care Provider & Facilities
                                                                                       Solutions            Medical Procedures & Diagnosis

                                                  Variety/Velocity                                          Volume/Velocity
                                                 • >l2000 of sources                                        • ~1B claims annually
                                                 • 6 Million unique HCPs                                    • +5B records annually
                                                 • 10+ years history                                        • 5+ years history
                                                                                     CompleteView, Ex
                                                 Data Challenges                                            Data Challenges
                                                                                          pense
                                                 • Constant change in real                                  • Sources have
                                                                                     Manager, Complet
                                                   world data                                                 incomplete capture
                                                                                         eSpend
                                                 • Conflicting & partial info                               • Overlapping source data
                                                 • Frequent changes to                   Prescriber         • Statistical projections &
                                                   source structure                  Eligibility/Remdiati     biases
                                                                                               on
                                                 • Authoritative sources vs.                                • Social media type
                                                   crowdsource                                                relationships
                                                                                          Analtyics
                                                 • Predicting source quality
                                                                                         (Influencer
                                                                                         Networks)
1• 800.593.4467 • info@healthmarketscience.com




                                                                        Our Solutions
                                                   Business
                                                    Needs
                                                                 Sales & Marketing        Compliance          Business Systems        Finance & Legal



                                                                                                01010011

                                                   Solutions
                                                                  Provider Data      Data Assessment, Integration &      Compliance         Market
                                                                                          Enrichment Services                             Intelligence




                                                   Advanced                                                           S orm
                                                                                                                       t
                                                  Technology
                                                                                                                                Master Data Management



                                                     HMS
                                                 Authoritative
                                                   Sources                                 Medical Claims      Federal        State    Web     Derived
                                                                 PDC
1• 800.593.4467 • info@healthmarketscience.com




                                                          Datacenter
                                                 ¾ Petabytes of raw storage
                                                 Virtualized (VMware)
                                                 On a SAN


                                                 Should we go physical???
1• 800.593.4467 • info@healthmarketscience.com




                                                                  Under the Hood
                                                                                  User Interface         Web Services
                                                                                                                  Interfacing
                                                 I’m happy
                                                                        Analytics                  Dashboard / Reports
                                                                                                                Visualization
                                                                          Match
                                                   Customer           Consolidate              Indexing       Relational

                                                     Web                                                  Structured Storage
                                                                       Standardize
                                                 Government              Validate              NoSQL           Graph(s)
                                                   Data Sources    Distributed Processing                    Flexible Storage
1• 800.593.4467 • info@healthmarketscience.com




                                                    Master Data
                                                    Management
                                                 Harvested         faddress Î F@t0




                                                 Government                               flicense Î F@t5




                                                 Private   fsanction Î F@t1          fsanction Î F@t4

                                                                     Schema Change!
1• 800.593.4467 • info@healthmarketscience.com




      The Design
1• 800.593.4467 • info@healthmarketscience.com




                                                 System of Record




                                                 Flexibility (Variety)
                                                 Scalability (Velocity + Volume)
1• 800.593.4467 • info@healthmarketscience.com
1• 800.593.4467 • info@healthmarketscience.com




                                                      Design Principles
                                                 Patterns
                                                  Idempotent Operations
                                                      Elegantly handle replay
                                                  Immutable data
                                                      Assertions of facts over time



                                                 Anti-Patterns
                                                  Transactions / Locking
1• 800.593.4467 • info@healthmarketscience.com




                                                          State / Counting
                                                 Exactly-once semantics for state
                                                  Create small batches
                                                  Order batches          Batch   Total
                                                                         1       4
                                                   Batch 1        4
                                                                         3       4 (wait)
                                                                         2       10 (+6)
                                                              å
                                                   Batch 3        13
                                                                         3       23 (+13)
                                                                         3’      23 (+0)
                                                   Batch 2        6


                                                   Batch 3’       13
1• 800.593.4467 • info@healthmarketscience.com




                                                     What we did wrong…




                                                 Could not react to transactional changes
                                                 Needed extra logic to track what changed
                                                 Took too long
1• 800.593.4467 • info@healthmarketscience.com




                                                  What we did wrong… (II)




                                                 AOP-based triggers
                                                  Worked well initially.
                                                  Business Processes captured as side-
                                                  effects.
1• 800.593.4467 • info@healthmarketscience.com




                                                      What we did right.
                                                 REST APIs for Loose Coupling


                                                 See Virgil:
                                                  https://github.com/hmsonline/virgil


                                                 But really… watch out for Intravert
                                                  https://github.com/zznate/intravert-ug
1• 800.593.4467 • info@healthmarketscience.com




                                                                 Kafka
                                                 •   Millions of Messages
                                                 •   Replay Enabled
                                                 •   No transactions / Lightning Fast
1• 800.593.4467 • info@healthmarketscience.com




                                                           Elastic Search
                                                 •   Edit Distance / Soundex
                                                 •   Native Scalability
                                                 •   Fuzzy Search
                                                 •   Geospatial
                                                 •   Facets
1• 800.593.4467 • info@healthmarketscience.com




                                                                 Storm
                                                 •   Guaranteed once semantics
                                                 •   Well-designed processing abstraction
                                                 •   Beats BYODP
                                                 •   Momentum
1• 800.593.4467 • info@healthmarketscience.com




                                                 The System              NP. Rewind!                 NP. We can
                                                                                                   route around it.
                                                                                              C*             ES
                                                                                                              2

                                                                          Kafka          C*           ES1


                                                 REST API


                                                                                         A
                                                                       NP. Replication
                                                                         Factor > 1.

                                                                                    C*
                                                                                                               Elastic
                                                                                                               Search
                                                             Kafka
                                                            Queue(s)

                                                                          C                    B
                                                 Offset
1• 800.593.4467 • info@healthmarketscience.com




    ?
                             What comes after Quadfecta?
1• 800.593.4467 • info@healthmarketscience.com




                                                   Real-Time Integration
                                                 Real-time CRUD via Web Services
                                                  DRPC
                                                  “Real-time” Queue



                                                            Not quite sure?
1• 800.593.4467 • info@healthmarketscience.com




      The Storm/C* Bridge
1• 800.593.4467 • info@healthmarketscience.com




                                                 Anatomy of a Storm Cluster

                                                  Nimbus
                                                   Master Node
                                                  Zookeeper
                                                   Cluster Coordination
                                                  Supervisors
                                                   Worker Nodes
1• 800.593.4467 • info@healthmarketscience.com




                                                          Storm Primatives
                                                 Streams
                                                   Unbounded sequence of tuples
                                                 Spouts
                                                   Stream Sources
                                                 Bolts
                                                   Unit of Computation
                                                 Topologies
                                                   Combination of n Spouts and n Bolts
                                                   Defines the overall “Computation”
1• 800.593.4467 • info@healthmarketscience.com




                                                         Storm Spouts
                                                 Represents a source (stream) of data
                                                  Queues (JMS, Kafka, Kestrel, etc.)
                                                  Twitter Firehose
                                                  Sensor Data
                                                 Emits “Tuples” (Events) based on
                                                 source
                                                  Primary Storm data structure
                                                  Set of Key-Value pairs
1• 800.593.4467 • info@healthmarketscience.com




                                                          Storm Bolts
                                                 Receive Tuples from Spouts or
                                                 other Bolts
                                                 Operate on, or React to Data
                                                  Functions/Filters/Joins/Aggregations
                                                  Database writes/lookups
                                                 Optionally emit additional Tuples
1• 800.593.4467 • info@healthmarketscience.com




                                                      Storm Topologies
                                                 Data flow between spouts and bolts
                                                 Routing of Tuples between
                                                 spouts/bolts
                                                  Stream “Groupings”
                                                 Parallelism of Components
                                                 Long-Lived
1• 800.593.4467 • info@healthmarketscience.com




                            Storm Topologies
1• 800.593.4467 • info@healthmarketscience.com




                                                   Storm and Cassandra
                                                 Use Cases:
                                                  Write Storm Tuple data to C*
                                                   Computation Results
                                                   Pre-compute indices

                                                  Read data from C* and emit Storm
                                                  Tuples
                                                   Dynamic Lookups
                                                              http://github.com/hmsonline/storm-cassandra
1• 800.593.4467 • info@healthmarketscience.com



                                                      Storm Cassandra Bolt
                                                             Types
                                                                           CassandraBolt



                                                                             Cassandra
                                                                             LookupBolt
                                                                                                          C*
                                                 CassandraBolt
                                                   Writes data to Cassandra
                                                   Available in Batching and Non-Batching

                                                 CassandraLookupBolt
                                                   Reads data from Cassandra
                                                                        http://github.com/hmsonline/storm-cassandra
1• 800.593.4467 • info@healthmarketscience.com




                                                 Storm-Cassandra Project
                                                 Provides generic Bolts for
                                                 writing/reading Storm Tuples
                                                 to/from C*
                                                                         Tuple
                                                           Tuple        Mapper        Rows




                                                            Tuples
                                                                       Columns
                                                                       Mapper         Columns
                                                                                                 C*
                                                               http://github.com/hmsonline/storm-cassandra
1• 800.593.4467 • info@healthmarketscience.com




                                                 Storm-Cassandra Project
                                                 TupleMapper Interface
                                                  Tells the CassandraBolt how to write a
                                                  tuple to an arbitrary data model


                                                 Given a Storm Tuple:
                                                  Map to Column Family
                                                  Map to Row Key
                                                  Map to Columns
                                                               http://github.com/hmsonline/storm-cassandra
1• 800.593.4467 • info@healthmarketscience.com




                                                 Storm-Cassandra Project
                                                 ColumnsMapper Interface
                                                  Tells the CassandraLookupBolt how
                                                  to transform a C* row into a Storm
                                                  Tuple


                                                 Given a C* Row Key and list of
                                                 Columns:
                                                  Return a list of Storm Tuples
                                                                http://github.com/hmsonline/storm-cassandra
1• 800.593.4467 • info@healthmarketscience.com




                                                 Storm-Cassandra Project
                                                 Current State:
                                                   Version 0.4.0
                                                   Uses Astyanax Client
                                                   Several out-of-the-box *Mapper
                                                   Implementations:
                                                     Basic Key-Value Columns
                                                     Value-less Columns
                                                     Counter Columns
                                                     Lookup by row key
                                                     Lookup by range query
                                                   Composite Key/Column Support
                                                   Trident support
                                                                  http://github.com/hmsonline/storm-cassandra
1• 800.593.4467 • info@healthmarketscience.com




                                                 Storm-Cassandra Project
                                                 Future Plans:
                                                  Switch to CQL
                                                  Enhanced Trident Support




                                                             http://github.com/hmsonline/storm-cassandra
1• 800.593.4467 • info@healthmarketscience.com




                                                 Persistent Word Count




                                                        http://github.com/hmsonline/storm-cassandra
1• 800.593.4467 • info@healthmarketscience.com




                            DRPC
1• 800.593.4467 • info@healthmarketscience.com




                            “Reach” Computation
1• 800.593.4467 • info@healthmarketscience.com




*Notional
                                        MDM Topology*
1• 800.593.4467 • info@healthmarketscience.com




                            Load Topology
1• 800.593.4467 • info@healthmarketscience.com




                                                   Shameless Shoutouts
                                                 HMS (https://github.com/hmsonline/)
                                                  storm-cassandra
                                                  storm-elastic-search
                                                  storm-jdbi (coming soon)


                                                 ptgoetz (https://github.com/ptgoetz)
                                                  storm-jms
                                                  storm-signals
1• 800.593.4467 • info@healthmarketscience.com




      Next Level : Trident
1• 800.593.4467 • info@healthmarketscience.com




                                                               Trident
                                                 Provides a higher-level abstraction for
                                                 stream processing
                                                  Constructs for state management and
                                                  Batching
                                                 Adds additional primitives that abstract
                                                 away common topological patterns
                                                 Deprecates transactional topologies
                                                 Distributes with Storm
1• 800.593.4467 • info@healthmarketscience.com




                                                 Sample Trident Operations
                                                 Partition Local
                                                  Functions ( execute(x)  x + y )
                                                  Filters          ( isKeep(x)  0,x )
                                                  PartitionAggregate
                                                    Combiner ( pairwise combining )
                                                    Reducer ( iterative accumulation )
                                                    Aggregator     ( byoa )
1• 800.593.4467 • info@healthmarketscience.com




                                                                             A sample topology
                                                 TridentTopology topology = new TridentTopology();


                                                 TridentState wordCounts =


                                                    topology.newStream("spout1", spout)


                                                     .each(new Fields("sentence"),


                                                                                          new Split(),


                                                                                          new Fields("word"))


                                                     .groupBy(new Fields("word"))


                                                     .persistentAggregate(


                                                                    MemcachedState.opaque(serverLocations),


                                                                                       new Count(),


                                                                                       new Fields("count"))


                                                     .parallelismHint(6);




                                                                                                         https://github.com/nathanmarz/storm/wiki/Trident-state
1• 800.593.4467 • info@healthmarketscience.com




                                                                     Trident State
                                                 Sequenced writes by batch/transaction id.

                                                   Spouts
                                                      Transactional
                                                           Batch contents never change
                                                      Opaque
                                                           Batch contents can change

                                                   State
                                                      Transactional
                                                           Store tx_id with counts to maintain sequencing of writes.
                                                      Opaque
                                                           Store previous value in order to overwrite the current value when
                                                           contents of a batch change.

Contenu connexe

Similaire à Hms nyc* talk

The Trial Site & Beyond! Stuart Redding, The MRN Ltd.
The Trial Site & Beyond!  Stuart Redding, The MRN Ltd.The Trial Site & Beyond!  Stuart Redding, The MRN Ltd.
The Trial Site & Beyond! Stuart Redding, The MRN Ltd.stur1
 
New information products to support the global system
New information products to support the global systemNew information products to support the global system
New information products to support the global systemLuigi Guarino
 
ACO = HIE + Analytics - a Healthcare IT Presentation
ACO = HIE + Analytics - a Healthcare IT PresentationACO = HIE + Analytics - a Healthcare IT Presentation
ACO = HIE + Analytics - a Healthcare IT PresentationPerficient, Inc.
 
Jeremy Chapman - Australia - Tuesday 29 - Who Guiding Principles and quest fo...
Jeremy Chapman - Australia - Tuesday 29 - Who Guiding Principles and quest fo...Jeremy Chapman - Australia - Tuesday 29 - Who Guiding Principles and quest fo...
Jeremy Chapman - Australia - Tuesday 29 - Who Guiding Principles and quest fo...incucai_isodp
 
Security, Economics, Technology and the Sustainability Paradox
Security, Economics, Technology and the Sustainability ParadoxSecurity, Economics, Technology and the Sustainability Paradox
Security, Economics, Technology and the Sustainability ParadoxRory King
 
Improving Outcomes And Efficiencies Of Managing Patients In The Community - A...
Improving Outcomes And Efficiencies Of Managing Patients In The Community - A...Improving Outcomes And Efficiencies Of Managing Patients In The Community - A...
Improving Outcomes And Efficiencies Of Managing Patients In The Community - A...healthcareisi
 
20110719 mcguinness deborah_ontologies_for_the_real_world_microsoft_faculty_s...
20110719 mcguinness deborah_ontologies_for_the_real_world_microsoft_faculty_s...20110719 mcguinness deborah_ontologies_for_the_real_world_microsoft_faculty_s...
20110719 mcguinness deborah_ontologies_for_the_real_world_microsoft_faculty_s...Deborah McGuinness
 
Data validation in the Digital Age
Data validation in the Digital AgeData validation in the Digital Age
Data validation in the Digital AgeJ T "Tom" Johnson
 
Geoscape 2010 Census (Versailles 101008)
Geoscape   2010 Census (Versailles 101008)Geoscape   2010 Census (Versailles 101008)
Geoscape 2010 Census (Versailles 101008)Dan Austin
 
6-005-1430-Keeppanasseril
6-005-1430-Keeppanasseril6-005-1430-Keeppanasseril
6-005-1430-Keeppanasserilmed20su
 
CPR Capabilities Presentation
CPR Capabilities PresentationCPR Capabilities Presentation
CPR Capabilities PresentationCPRComm
 
Research data management for medical data with pyradigm
Research data management for medical data with pyradigmResearch data management for medical data with pyradigm
Research data management for medical data with pyradigmPradeep Redddy Raamana
 
Axx discovery doc
Axx discovery docAxx discovery doc
Axx discovery docfoliopages
 
1 PSUT Big Data Class, introduction
1 PSUT Big Data Class,  introduction1 PSUT Big Data Class,  introduction
1 PSUT Big Data Class, introductionAkram Al-Kouz
 
EmPower PRSA, 2012 - Analytics & Influencers
EmPower PRSA, 2012 - Analytics & InfluencersEmPower PRSA, 2012 - Analytics & Influencers
EmPower PRSA, 2012 - Analytics & Influencersjoerhoton
 

Similaire à Hms nyc* talk (20)

107 satellife
107 satellife107 satellife
107 satellife
 
The Trial Site & Beyond! Stuart Redding, The MRN Ltd.
The Trial Site & Beyond!  Stuart Redding, The MRN Ltd.The Trial Site & Beyond!  Stuart Redding, The MRN Ltd.
The Trial Site & Beyond! Stuart Redding, The MRN Ltd.
 
New information products to support the global system
New information products to support the global systemNew information products to support the global system
New information products to support the global system
 
Gutwiser final 2013 berkeley
Gutwiser final 2013 berkeleyGutwiser final 2013 berkeley
Gutwiser final 2013 berkeley
 
Social media marketing and engagement
Social media marketing and engagementSocial media marketing and engagement
Social media marketing and engagement
 
ACO = HIE + Analytics - a Healthcare IT Presentation
ACO = HIE + Analytics - a Healthcare IT PresentationACO = HIE + Analytics - a Healthcare IT Presentation
ACO = HIE + Analytics - a Healthcare IT Presentation
 
Jeremy Chapman - Australia - Tuesday 29 - Who Guiding Principles and quest fo...
Jeremy Chapman - Australia - Tuesday 29 - Who Guiding Principles and quest fo...Jeremy Chapman - Australia - Tuesday 29 - Who Guiding Principles and quest fo...
Jeremy Chapman - Australia - Tuesday 29 - Who Guiding Principles and quest fo...
 
Dr. Eliot Siegel: Watson and Deep QA Software in Pursuit of Personalized Medi...
Dr. Eliot Siegel: Watson and Deep QA Software in Pursuit of Personalized Medi...Dr. Eliot Siegel: Watson and Deep QA Software in Pursuit of Personalized Medi...
Dr. Eliot Siegel: Watson and Deep QA Software in Pursuit of Personalized Medi...
 
Security, Economics, Technology and the Sustainability Paradox
Security, Economics, Technology and the Sustainability ParadoxSecurity, Economics, Technology and the Sustainability Paradox
Security, Economics, Technology and the Sustainability Paradox
 
Improving Outcomes And Efficiencies Of Managing Patients In The Community - A...
Improving Outcomes And Efficiencies Of Managing Patients In The Community - A...Improving Outcomes And Efficiencies Of Managing Patients In The Community - A...
Improving Outcomes And Efficiencies Of Managing Patients In The Community - A...
 
20110719 mcguinness deborah_ontologies_for_the_real_world_microsoft_faculty_s...
20110719 mcguinness deborah_ontologies_for_the_real_world_microsoft_faculty_s...20110719 mcguinness deborah_ontologies_for_the_real_world_microsoft_faculty_s...
20110719 mcguinness deborah_ontologies_for_the_real_world_microsoft_faculty_s...
 
Data validation in the Digital Age
Data validation in the Digital AgeData validation in the Digital Age
Data validation in the Digital Age
 
Geoscape 2010 Census (Versailles 101008)
Geoscape   2010 Census (Versailles 101008)Geoscape   2010 Census (Versailles 101008)
Geoscape 2010 Census (Versailles 101008)
 
6-005-1430-Keeppanasseril
6-005-1430-Keeppanasseril6-005-1430-Keeppanasseril
6-005-1430-Keeppanasseril
 
CPR Capabilities Presentation
CPR Capabilities PresentationCPR Capabilities Presentation
CPR Capabilities Presentation
 
Research data management for medical data with pyradigm
Research data management for medical data with pyradigmResearch data management for medical data with pyradigm
Research data management for medical data with pyradigm
 
Marketing for Medical Device Startups
Marketing for Medical Device StartupsMarketing for Medical Device Startups
Marketing for Medical Device Startups
 
Axx discovery doc
Axx discovery docAxx discovery doc
Axx discovery doc
 
1 PSUT Big Data Class, introduction
1 PSUT Big Data Class,  introduction1 PSUT Big Data Class,  introduction
1 PSUT Big Data Class, introduction
 
EmPower PRSA, 2012 - Analytics & Influencers
EmPower PRSA, 2012 - Analytics & InfluencersEmPower PRSA, 2012 - Analytics & Influencers
EmPower PRSA, 2012 - Analytics & Influencers
 

Plus de Brian O'Neill

Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Brian O'Neill
 
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizardPhily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizardBrian O'Neill
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Brian O'Neill
 
The Art of Platform Development
The Art of Platform DevelopmentThe Art of Platform Development
The Art of Platform DevelopmentBrian O'Neill
 
Collaborative software development
Collaborative software developmentCollaborative software development
Collaborative software developmentBrian O'Neill
 
Ruby on Big Data @ Philly Ruby Group
Ruby on Big Data @ Philly Ruby GroupRuby on Big Data @ Philly Ruby Group
Ruby on Big Data @ Philly Ruby GroupBrian O'Neill
 
Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)Brian O'Neill
 

Plus de Brian O'Neill (8)

Spark - Philly JUG
Spark  - Philly JUGSpark  - Philly JUG
Spark - Philly JUG
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
 
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizardPhily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
 
The Art of Platform Development
The Art of Platform DevelopmentThe Art of Platform Development
The Art of Platform Development
 
Collaborative software development
Collaborative software developmentCollaborative software development
Collaborative software development
 
Ruby on Big Data @ Philly Ruby Group
Ruby on Big Data @ Philly Ruby GroupRuby on Big Data @ Philly Ruby Group
Ruby on Big Data @ Philly Ruby Group
 
Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)
 

Hms nyc* talk

  • 1. 1• 800.593.4467 • info@healthmarketscience.com The Big Data Quadfecta Brian O’Neill Taylor Goetz Lead Architect, Health Market Science Development Lead, Health Market Scienc @boneill42, bone@alumni.brown.edu @ptgoetz, ptgoetz@gmail.com
  • 2. 1• 800.593.4467 • info@healthmarketscience.com Quadfecta? 1. Quadfecta • A legendary beirut/beer pong shot that lands on the tops of four cups simultaneously. Considered the rarest shot in the game, topping even the trifecta, 2-cup knockover-and-sink, and simultaneous 6-cup game-ending double bounce-in. • Kafka • Storm • Elastic Search • Cassandra
  • 3. 1• 800.593.4467 • info@healthmarketscience.com Volume 3 V’s Variety Velocity
  • 4. 1• 800.593.4467 • info@healthmarketscience.com The Use Case
  • 5. 1• 800.593.4467 • info@healthmarketscience.com Our Mission Prescriber eligibility and remediation Eliminate fraud, waste and abuse Insights into the healthcare space
  • 6. 1• 800.593.4467 • info@healthmarketscience.com The Business Master Data Solutions Business Medical Claims Data Health Care Provider & Facilities Solutions Medical Procedures & Diagnosis Variety/Velocity Volume/Velocity • >l2000 of sources • ~1B claims annually • 6 Million unique HCPs • +5B records annually • 10+ years history • 5+ years history CompleteView, Ex Data Challenges Data Challenges pense • Constant change in real • Sources have Manager, Complet world data incomplete capture eSpend • Conflicting & partial info • Overlapping source data • Frequent changes to Prescriber • Statistical projections & source structure Eligibility/Remdiati biases on • Authoritative sources vs. • Social media type crowdsource relationships Analtyics • Predicting source quality (Influencer Networks)
  • 7. 1• 800.593.4467 • info@healthmarketscience.com Our Solutions Business Needs Sales & Marketing Compliance Business Systems Finance & Legal 01010011 Solutions Provider Data Data Assessment, Integration & Compliance Market Enrichment Services Intelligence Advanced S orm t Technology Master Data Management HMS Authoritative Sources Medical Claims Federal State Web Derived PDC
  • 8. 1• 800.593.4467 • info@healthmarketscience.com Datacenter ¾ Petabytes of raw storage Virtualized (VMware) On a SAN Should we go physical???
  • 9. 1• 800.593.4467 • info@healthmarketscience.com Under the Hood User Interface Web Services Interfacing I’m happy Analytics Dashboard / Reports Visualization Match Customer Consolidate Indexing Relational Web Structured Storage Standardize Government Validate NoSQL Graph(s) Data Sources Distributed Processing Flexible Storage
  • 10. 1• 800.593.4467 • info@healthmarketscience.com Master Data Management Harvested faddress Î F@t0 Government flicense Î F@t5 Private fsanction Î F@t1 fsanction Î F@t4 Schema Change!
  • 11. 1• 800.593.4467 • info@healthmarketscience.com The Design
  • 12. 1• 800.593.4467 • info@healthmarketscience.com System of Record Flexibility (Variety) Scalability (Velocity + Volume)
  • 13. 1• 800.593.4467 • info@healthmarketscience.com
  • 14. 1• 800.593.4467 • info@healthmarketscience.com Design Principles Patterns Idempotent Operations Elegantly handle replay Immutable data Assertions of facts over time Anti-Patterns Transactions / Locking
  • 15. 1• 800.593.4467 • info@healthmarketscience.com State / Counting Exactly-once semantics for state Create small batches Order batches Batch Total 1 4 Batch 1 4 3 4 (wait) 2 10 (+6) å Batch 3 13 3 23 (+13) 3’ 23 (+0) Batch 2 6 Batch 3’ 13
  • 16. 1• 800.593.4467 • info@healthmarketscience.com What we did wrong… Could not react to transactional changes Needed extra logic to track what changed Took too long
  • 17. 1• 800.593.4467 • info@healthmarketscience.com What we did wrong… (II) AOP-based triggers Worked well initially. Business Processes captured as side- effects.
  • 18. 1• 800.593.4467 • info@healthmarketscience.com What we did right. REST APIs for Loose Coupling See Virgil: https://github.com/hmsonline/virgil But really… watch out for Intravert https://github.com/zznate/intravert-ug
  • 19. 1• 800.593.4467 • info@healthmarketscience.com Kafka • Millions of Messages • Replay Enabled • No transactions / Lightning Fast
  • 20. 1• 800.593.4467 • info@healthmarketscience.com Elastic Search • Edit Distance / Soundex • Native Scalability • Fuzzy Search • Geospatial • Facets
  • 21. 1• 800.593.4467 • info@healthmarketscience.com Storm • Guaranteed once semantics • Well-designed processing abstraction • Beats BYODP • Momentum
  • 22. 1• 800.593.4467 • info@healthmarketscience.com The System NP. Rewind! NP. We can route around it. C* ES 2 Kafka C* ES1 REST API A NP. Replication Factor > 1. C* Elastic Search Kafka Queue(s) C B Offset
  • 23. 1• 800.593.4467 • info@healthmarketscience.com ? What comes after Quadfecta?
  • 24. 1• 800.593.4467 • info@healthmarketscience.com Real-Time Integration Real-time CRUD via Web Services DRPC “Real-time” Queue Not quite sure?
  • 25. 1• 800.593.4467 • info@healthmarketscience.com The Storm/C* Bridge
  • 26. 1• 800.593.4467 • info@healthmarketscience.com Anatomy of a Storm Cluster Nimbus Master Node Zookeeper Cluster Coordination Supervisors Worker Nodes
  • 27. 1• 800.593.4467 • info@healthmarketscience.com Storm Primatives Streams Unbounded sequence of tuples Spouts Stream Sources Bolts Unit of Computation Topologies Combination of n Spouts and n Bolts Defines the overall “Computation”
  • 28. 1• 800.593.4467 • info@healthmarketscience.com Storm Spouts Represents a source (stream) of data Queues (JMS, Kafka, Kestrel, etc.) Twitter Firehose Sensor Data Emits “Tuples” (Events) based on source Primary Storm data structure Set of Key-Value pairs
  • 29. 1• 800.593.4467 • info@healthmarketscience.com Storm Bolts Receive Tuples from Spouts or other Bolts Operate on, or React to Data Functions/Filters/Joins/Aggregations Database writes/lookups Optionally emit additional Tuples
  • 30. 1• 800.593.4467 • info@healthmarketscience.com Storm Topologies Data flow between spouts and bolts Routing of Tuples between spouts/bolts Stream “Groupings” Parallelism of Components Long-Lived
  • 31. 1• 800.593.4467 • info@healthmarketscience.com Storm Topologies
  • 32. 1• 800.593.4467 • info@healthmarketscience.com Storm and Cassandra Use Cases: Write Storm Tuple data to C* Computation Results Pre-compute indices Read data from C* and emit Storm Tuples Dynamic Lookups http://github.com/hmsonline/storm-cassandra
  • 33. 1• 800.593.4467 • info@healthmarketscience.com Storm Cassandra Bolt Types CassandraBolt Cassandra LookupBolt C* CassandraBolt Writes data to Cassandra Available in Batching and Non-Batching CassandraLookupBolt Reads data from Cassandra http://github.com/hmsonline/storm-cassandra
  • 34. 1• 800.593.4467 • info@healthmarketscience.com Storm-Cassandra Project Provides generic Bolts for writing/reading Storm Tuples to/from C* Tuple Tuple Mapper Rows Tuples Columns Mapper Columns C* http://github.com/hmsonline/storm-cassandra
  • 35. 1• 800.593.4467 • info@healthmarketscience.com Storm-Cassandra Project TupleMapper Interface Tells the CassandraBolt how to write a tuple to an arbitrary data model Given a Storm Tuple: Map to Column Family Map to Row Key Map to Columns http://github.com/hmsonline/storm-cassandra
  • 36. 1• 800.593.4467 • info@healthmarketscience.com Storm-Cassandra Project ColumnsMapper Interface Tells the CassandraLookupBolt how to transform a C* row into a Storm Tuple Given a C* Row Key and list of Columns: Return a list of Storm Tuples http://github.com/hmsonline/storm-cassandra
  • 37. 1• 800.593.4467 • info@healthmarketscience.com Storm-Cassandra Project Current State: Version 0.4.0 Uses Astyanax Client Several out-of-the-box *Mapper Implementations: Basic Key-Value Columns Value-less Columns Counter Columns Lookup by row key Lookup by range query Composite Key/Column Support Trident support http://github.com/hmsonline/storm-cassandra
  • 38. 1• 800.593.4467 • info@healthmarketscience.com Storm-Cassandra Project Future Plans: Switch to CQL Enhanced Trident Support http://github.com/hmsonline/storm-cassandra
  • 39. 1• 800.593.4467 • info@healthmarketscience.com Persistent Word Count http://github.com/hmsonline/storm-cassandra
  • 40. 1• 800.593.4467 • info@healthmarketscience.com DRPC
  • 41. 1• 800.593.4467 • info@healthmarketscience.com “Reach” Computation
  • 42. 1• 800.593.4467 • info@healthmarketscience.com *Notional MDM Topology*
  • 43. 1• 800.593.4467 • info@healthmarketscience.com Load Topology
  • 44. 1• 800.593.4467 • info@healthmarketscience.com Shameless Shoutouts HMS (https://github.com/hmsonline/) storm-cassandra storm-elastic-search storm-jdbi (coming soon) ptgoetz (https://github.com/ptgoetz) storm-jms storm-signals
  • 45. 1• 800.593.4467 • info@healthmarketscience.com Next Level : Trident
  • 46. 1• 800.593.4467 • info@healthmarketscience.com Trident Provides a higher-level abstraction for stream processing Constructs for state management and Batching Adds additional primitives that abstract away common topological patterns Deprecates transactional topologies Distributes with Storm
  • 47. 1• 800.593.4467 • info@healthmarketscience.com Sample Trident Operations Partition Local Functions ( execute(x)  x + y ) Filters ( isKeep(x)  0,x ) PartitionAggregate Combiner ( pairwise combining ) Reducer ( iterative accumulation ) Aggregator ( byoa )
  • 48. 1• 800.593.4467 • info@healthmarketscience.com A sample topology TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate( MemcachedState.opaque(serverLocations), new Count(), new Fields("count")) .parallelismHint(6); https://github.com/nathanmarz/storm/wiki/Trident-state
  • 49. 1• 800.593.4467 • info@healthmarketscience.com Trident State Sequenced writes by batch/transaction id. Spouts Transactional Batch contents never change Opaque Batch contents can change State Transactional Store tx_id with counts to maintain sequencing of writes. Opaque Store previous value in order to overwrite the current value when contents of a batch change.

Notes de l'éditeur

  1. Storm:realtime, distributed computation systemComparable to complex event processing systemOriginated in the twitter analytics space.
  2. Tuple: set of key-value pairs (values can be serialized objects)
  3. Useful for pre-computing queries in real-time to optimize lookups (avoid expensive C* queries).