Big Data Technologies and Techniques
Ryan Brush
Distinguished Engineer, Cerner Corporation
@ryanbrush
Relational Databases are Awesome
Atomic, transactional updates
   Guaranteed consistency

Relational Databases are Awesome
             Declarative queries
Easy to reason about
     Long track record of success
Relational Databases are Awesome
          …so use them!
Relational Databases are Awesome
          …so use them!

        But…
Those advantages have a cost
Global, atomic state means global,
atomic coordination

      Coordination does not scale linearly
The costs of coordination
     Remember the
     network effect?
The costs of coordination

channels = n(n - 1) / 2

2 nodes = 1 channel
5 nodes = 10 channels
12 nodes = 66 channels
25 nodes = 300 channels
So we'd better be able to scale
The costs of coordination
  Databases have optimized this in
  many clever ways, but a limit on
  scalability still exists
Let’s look at some ways to scale
Bulk processing billions of records
Bulk processing billions of records
 Data aggregation and storage
Bulk processing billions of records
 Data aggregation and storage
    Real-time processing of updates
Bulk processing billions of records
 Data aggregation and storage
    Real-time processing of updates
 Serving data for: Online Apps
                   Analytics
Let’s start with scalability of
bulk processing
Quiz: which one is scalable?
Quiz: which one is scalable?
    1000-node Hadoop cluster where
    jobs depend on a common process
Quiz: which one is scalable?
    1000-node Hadoop cluster where
    jobs depend on a common process
    1000 Windows ME machines running
    independent Excel macros
Quiz: which one is scalable?
    1000-node Hadoop cluster where
    jobs depend on a common process
    1000 Windows ME machines running
    independent Excel macros
Independence → Parallelizable
Independence → Parallelizable

Parallelizable → Scalable
“Shared Nothing” architectures are the
most scalable…
“Shared Nothing” architectures are the
most scalable…
     …but most real-world problems require
     us to share something…
“Shared Nothing” architectures are the
most scalable…
     …but most real-world problems require
     us to share something…
  …so our designs usually have a parallel
  part and a serial part
The key is to make sure the vast majority
of our work in the cloud is independent and
parallelizable.
Amdahl’s Law
S(N) = 1 / ((1 - P) + P / N)

S: speed improvement
P: ratio of the problem that can be parallelized
N: number of processors
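To make the formula concrete, here is a small Java helper (illustrative, not from the slides) that evaluates Amdahl's Law for a few values of P and N.

```java
// Hypothetical helper, not part of the original talk: Amdahl's Law speedup.
public final class Amdahl {
    /** Speedup S(N) = 1 / ((1 - P) + P / N). */
    public static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // Even with 95% of the work parallelizable, 1000 processors
        // only yield roughly a 19.6x speedup -- the serial 5% dominates.
        System.out.println(speedup(0.95, 1000));   // ~19.63
        System.out.println(speedup(0.99, 1000));   // ~90.99
    }
}
```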
MapReduce Primer
Input Data (Split 1 … Split N) → Map Phase (Mapper 1 … Mapper N) → Shuffle → Reduce Phase (Reducer 1 … Reducer N)
MapReduce Example: Word Count
Books → Map Phase (count words per book) → Shuffle → Reduce Phase (sum words A-C, D-E, …, W-Z)
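The flow above corresponds closely to Hadoop's classic word-count job. A minimal sketch in Java (class name and path arguments are illustrative, not from the slides):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word routed to this reducer.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combine locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```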
Notice there is still a serial part of the problem: the output of the reducers must be combined
Notice there is still a serial part of the problem: the output of the reducers must be combined
   …but this is much smaller, and can be
   handled by a single process
Also notice that the network is a shared
resource when processing big data
Also notice that the network is a shared
resource when processing big data
 So rather than moving data to computation,
 we move computation to data.
MapReduce Data Locality
Input Data (Split 1 … Split N) → Map Phase (Mapper 1 … Mapper N) → Shuffle → Reduce Phase (Reducer 1 … Reducer N)
Each split and its mapper run on the same physical machine.
Data locality is only guaranteed in the Map phase
Data locality is only guaranteed in the Map phase
 So the most data-intensive work should be
 done in the map, with smaller sets sent to
 the reducer
Data locality is only guaranteed in the Map phase
 So the most data-intensive work should be
 done in the map, with smaller sets sent to the
 reducer
Some Map/Reduce jobs have no reducer at
all!
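A rough sketch of such a map-only job (illustrative class and path arguments, not from the slides): setting the reducer count to zero skips the shuffle entirely and writes mapper output straight to HDFS.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only pass-through");
    job.setJarByClass(MapOnlyJob.class);
    job.setMapperClass(Mapper.class);      // identity mapper; a real job would filter or transform here
    job.setNumReduceTasks(0);              // no shuffle, no reduce -- mapper output goes straight to HDFS
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```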
MapReduce Gone Wrong
Books → Map Phase (count words per book) → Shuffle → Reduce Phase (sum words A-C … W-Z), with every reducer calling out to an external Word Addition Service
Even if our Word Addition Service is
scalable, we’d need to scale it to the size of
the largest Map/Reduce job that will ever
use it
So for data processing, prefer embedded
libraries over remote services
So for data processing, prefer embedded
libraries over remote services
Use remote services for configuration, to
prime caches, etc. – just not for every data
element!
Joining a billion records
Word counts are great, but many real-world
problems require bringing together multiple
datasets.

 So how do we “join” with MapReduce?
Map-Side Joins
When joining one big input to a small one, simply copy the small data set to each mapper.
Data Set 1 (Split 1 … Split 3) → Map Phase (Mapper 1 … Mapper 3, each holding a copy of Data Set 2) → Shuffle → Reduce Phase (Reducer 1, Reducer 2, …)
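A hedged sketch of a map-side join in Java (names and file layout are illustrative, not from the slides): the small data set is shipped to every mapper via the distributed cache and loaded into memory in setup(), so the join happens entirely in the map phase.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: joins each record of the big input against an
// in-memory copy of the small data set.
public class MapSideJoinMapper extends Mapper<Object, Text, Text, Text> {
  private final Map<String, String> smallTable = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // Assumes the driver added the file with
    // job.addCacheFile(new URI("/data/small_dataset.txt")), so a local copy
    // named small_dataset.txt is available in the task's working directory.
    try (BufferedReader reader = new BufferedReader(new FileReader("small_dataset.txt"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\t", 2);
        smallTable.put(parts[0], parts[1]);
      }
    }
  }

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t", 2);
    String matched = smallTable.get(parts[0]);     // join on the shared key
    if (matched != null) {
      context.write(new Text(parts[0]), new Text(parts[1] + "\t" + matched));
    }
  }
}
```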
Merge in Reducer
Route common items to the same reducer.
Data Set 1 (Split 1 … Split 3) and Data Set 2 (Split 1 … Split 3) → Map Phase (group by key) → Shuffle → Reduce Phase (Reducer 1 … Reducer N)
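A hedged sketch of the reduce-side version (the tagging scheme is illustrative, not from the slides): the mappers tag each record with its source data set, and the reducer merges the records that share a key.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer: values arrive tagged "1|..." or "2|..." by the mappers,
// so records from both data sets that share a key meet in one reduce call.
public class JoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> left = new ArrayList<>();
    List<String> right = new ArrayList<>();
    for (Text value : values) {
      String tagged = value.toString();
      if (tagged.startsWith("1|")) {
        left.add(tagged.substring(2));
      } else {
        right.add(tagged.substring(2));
      }
    }
    for (String l : left) {                        // emit the joined pairs for this key
      for (String r : right) {
        context.write(key, new Text(l + "\t" + r));
      }
    }
  }
}
```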
Higher-Level Constructs
MapReduce is a primitive operation for
higher-level constructs
Hive, Pig, Cascading, and Crunch all compile into MapReduce.
Use one!
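For a flavor of these higher-level APIs, here is an approximate Crunch word count; treat it as a sketch rather than a definitive example of the library's API.

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Split lines into words, then count occurrences -- Crunch plans the
    // underlying MapReduce jobs for us.
    PTable<String, Long> counts = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings()).count();

    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();
  }
}
```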
MapReduce and MPP Databases
MapReduce                                              | MPP Databases
Data in a distributed filesystem                       | Data in sharded relational databases
Oriented towards unstructured or semi-structured data  | Oriented towards structured data
Java or Domain-Specific Languages (e.g., Pig and Hive) | SQL
Poor support for iterative operations                  | Good support for iterative operations
Arbitrarily complex programs running next to data      | SQL and User-Defined Functions running next to data
Poor interactive query support                         | Good interactive query support
MapReduce and MPP Databases …are complementary!

Map/Reduce to clean, normalize, reconcile
and codify data to load into an MPP system
for interactive analysis
Bulk processing of billions of records
 Data aggregation and storage
Hadoop Distributed Filesystem
  Scales to many petabytes
Hadoop Distributed Filesystem
  Scales to many petabytes
  Splits all files into blocks and spreads
  them across data nodes
Hadoop Distributed Filesystem
  Scales to many petabytes
  Splits all files into blocks and spreads
  them across data nodes
  The name node keeps track of what
  blocks belong to what file
Hadoop Distributed Filesystem
  Scales to many petabytes
  Splits all files into blocks and spreads
  them across data nodes
  The name node keeps track of what
  blocks belong to what file
  All blocks written in triplicate
Hadoop Distributed Filesystem
  Scales to many petabytes
  Splits all files into blocks and spreads
  them across data nodes
  The name node keeps track of what
  blocks belong to what file
  All blocks written in triplicate
  Write and append only –
  no random updates!
HDFS Writes

The client asks the Name Node to look up target Data Nodes, then writes each block to Data Node 1, which replicates it to Data Node 2, and so on through Data Node N.
HDFS Reads
The client asks the Name Node for block locations, then reads the blocks directly from the Data Nodes that hold them.
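A minimal sketch of the HDFS client API behind those two flows (paths are illustrative, not from the slides):

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // The Configuration picks up cluster settings (core-site.xml / hdfs-site.xml)
    // from the classpath; the path is illustrative.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/data/example/events.txt");

    // Write: the client streams blocks through a pipeline of data nodes.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read: the name node supplies block locations; data comes from data nodes.
    try (FSDataInputStream in = fs.open(path)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}
```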
HDFS Shortcomings
 No random reads
 No random writes
 Doesn’t deal well with many small files
HDFS Shortcomings
 No random reads
 No random writes
 Doesn’t deal well with many small files


             Enter HBase
“Random Access To Your Planet-Size Data”
HBase
 Emulates random I/O with a
 Write Ahead Log (WAL)
 Periodically flushes log to sorted files
HBase
 Emulates random I/O with a
 Write Ahead Log (WAL)
 Periodically flushes log to sorted files
 Files accessible as tables, split across
 many regions, hosted by region servers
HBase
 Emulates random I/O with a
 Write Ahead Log (WAL)
 Periodically flushes log to sorted files
 Files accessible as tables, split across
 many regions, hosted by region servers
 Preserves scalability, data locality, and
 Map/Reduce features of Hadoop
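A small sketch using a later HBase client API (the table and column family names are illustrative, not from the slides) showing the random reads and writes HBase layers on top of HDFS:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("patients"))) {

      // Random write: lands in the region server's write-ahead log and memstore,
      // then is flushed to sorted files in the background.
      Put put = new Put(Bytes.toBytes("patient-123"));
      put.addColumn(Bytes.toBytes("demographics"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
      table.put(put);

      // Random read: the region server merges the memstore and the sorted files.
      Result result = table.get(new Get(Bytes.toBytes("patient-123")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("demographics"), Bytes.toBytes("name"))));
    }
  }
}
```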
Use HBase when:
 You have noisy, semi-structured data
Use HBase when:
 You have noisy, semi-structured data
 You want to apply massively parallel
 processing to your problem
Use HBase when:
 You have noisy, semi-structured data
 You want to apply massively parallel
 processing to your problem
 To handle huge write loads
Use HBase when:
 You have noisy, semi-structured data
 You want to apply massively parallel
 processing to your problem
 To handle huge write loads
 As a scalable key/value store
But there are drawbacks:
  Limited schema support
  Limited atomicity guarantees
  No built-in secondary indexes

HBase is a great tool for many jobs,
but not every job
The data store should align
with the needs of the application
So a pattern is emerging:
Collection (Millennium, CCDs, Claims, HL7) → Aggregation (Hadoop with HBase) → Processing (MapReduce jobs) → Storage (MPP, Relational, Document Store, HBase)
But we have a potential bottleneck
Collection (Millennium, CCDs, Claims, HL7) → Aggregation (Hadoop with HBase) → Processing (MapReduce jobs) → Storage (MPP, Relational, Document Store, HBase)
Direct inserts are designed for online
updates, not massively parallel data loads
So shift the work into MapReduce and pre-build files for bulk import:

Oracle Loader for Hadoop
HBase HFile Import
Bulk Loads for MPP
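A hedged sketch of the HBase HFile approach (the table and path names are illustrative, not from the slides): a MapReduce job is configured to write HBase's native HFile format directly, and the resulting files are then bulk-imported into the live table instead of issuing a Put per record.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "prebuild hfiles");
    job.setJarByClass(BulkLoadExample.class);
    // A real job would set a mapper here that emits (ImmutableBytesWritable, Put) pairs.
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);

    TableName tableName = TableName.valueOf("patients");   // illustrative table name
    try (Connection connection = ConnectionFactory.createConnection(conf)) {
      // Configures the partitioner and output format so each reducer writes
      // one HFile per existing region of the target table.
      HFileOutputFormat2.configureIncrementalLoad(job,
          connection.getTable(tableName),
          connection.getRegionLocator(tableName));
    }

    FileInputFormat.addInputPath(job, new Path("/data/raw"));
    FileOutputFormat.setOutputPath(job, new Path("/data/hfiles"));
    job.waitForCompletion(true);
    // The generated HFiles are then moved into the live table with HBase's
    // completebulkload tool, avoiding per-row inserts entirely.
  }
}
```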
And we’re missing an important piece:
Collection (Millennium, CCDs, Claims, HL7) → Aggregation (Hadoop with HBase) → Processing (MapReduce jobs) → Storage (MPP, Relational, Document Store, HBase)
And we’re missing an important piece:
Collection (Millennium, CCDs, Claims, HL7) → Aggregation (Hadoop with HBase) → Processing (Realtime Processing and batch MapReduce jobs) → Storage (MPP, Relational, Document Store, HBase)
How do we make it fast?

Speed Layer
Batch Layer
http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems
How do we make it fast?
Speed Layer: hours of data, low latency (seconds to process), incremental updates; data moves to computation
Batch Layer: years of data, high latency (minutes or hours to process), bulk loads; computation moves to data
How do we make it fast?
Speed Layer: Complex Event Processing with Storm
Batch Layer: MapReduce with Hadoop
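A rough sketch of a speed-layer topology using Storm's 0.x-era API (the spout, counts, and topology name are illustrative, not from the slides): incoming events are routed by key to bolts that maintain incremental counts.

```java
import java.util.HashMap;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;

public class SpeedLayerTopology {

  // Keeps a running count per word -- the incremental-update side of the system.
  public static class RollingCountBolt extends BaseBasicBolt {
    private final Map<String, Long> counts = new HashMap<String, Long>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String word = tuple.getStringByField("word");
      Long current = counts.get(word);
      counts.put(word, current == null ? 1L : current + 1L);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      // Terminal bolt in this sketch; nothing is emitted downstream.
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("words", new TestWordSpout(), 2);
    builder.setBolt("counts", new RollingCountBolt(), 4)
        .fieldsGrouping("words", new Fields("word"));   // same word -> same bolt instance

    // Run locally for illustration; a real deployment would use StormSubmitter.
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("speed-layer", new Config(), builder.createTopology());
    Thread.sleep(10000);
    cluster.shutdown();
  }
}
```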
And now, the challenge…
Process all data overnight
Quickly create new data models
   Fast iteration cycles mean fast innovation

    Process all data overnight
             Simple correction of any bugs
Much easier to understand and work with
Questions?
