Low-Latency “OLAP” with Hadoop and HBase
      Andrei Dragomir | Software Engineer




© 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
Synopsis

  §  What we are trying to solve
  §  Description of our system
  §  How it works
  §  Minimizing latency

In a nutshell

  Low-latency OLAP system
  Input data is stored in Hadoop DFS (e.g. log files) or in
  HBase tables
  The processing loop of the system takes a cube
  description and processes it (pre-aggregations)
  using Hadoop Map/Reduce.
  The output is written to a statistics HBase table.
  To get the data, users query a server, which scans
  the HBase table, applies the filters, roll-ups or
  drill-downs, and returns the result.
Vocabulary

  Date        Country  City     OS     Browser  Sales
  2012-05-12  USA      NY       Win    FF       $ 0.0
  2012-05-12  USA      NY       Win    FF       $ 10.0
  2012-05-13  USA      SF       OSX    Chrome   $ 25.0
  2012-05-13  Canada   Ontario  Linux  Chrome   $ 0.0
  2012-05-14  USA      Chicago  OSX    Safari   $ 15.0
  ...         ...      ...      ...    ...      ...

  Summary:
    5 visits over 3 days
    2 countries: USA: 4, Canada: 1
    4 cities: NY: 2, SF: 1
    3 OS: Win: 2, OSX: 2
    3 browsers: FF: 2, Chrome: 2
    Sales: $50.0 in 3 sales

  §    We want to get (mostly) numeric data: metrics
  §    These metrics have a set of labels (dimensions)
  §    We want to view the metrics by any combination of
        dimensions
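
The relationship between metrics and dimensions can be sketched in a few lines of Python. This is only an illustration, not the system's implementation: the row layout and the `roll_up` helper are assumptions made for the example, and the five rows are copied from the table above.

```python
from collections import defaultdict

# The five sample visits from the Vocabulary table (dimensions plus one metric).
VISITS = [
    {"date": "2012-05-12", "country": "USA", "city": "NY", "os": "Win", "browser": "FF", "sales": 0.0},
    {"date": "2012-05-12", "country": "USA", "city": "NY", "os": "Win", "browser": "FF", "sales": 10.0},
    {"date": "2012-05-13", "country": "USA", "city": "SF", "os": "OSX", "browser": "Chrome", "sales": 25.0},
    {"date": "2012-05-13", "country": "Canada", "city": "Ontario", "os": "Linux", "browser": "Chrome", "sales": 0.0},
    {"date": "2012-05-14", "country": "USA", "city": "Chicago", "os": "OSX", "browser": "Safari", "sales": 15.0},
]


def roll_up(rows, dimensions):
    """Group rows by any combination of dimensions (hypothetical helper).

    Returns {tuple of dimension values: (visits, sales)}.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key][0] += 1             # metric: visits, one per input row
        totals[key][1] += row["sales"]  # metric: sales
    return {key: tuple(value) for key, value in totals.items()}


by_country = roll_up(VISITS, ["country"])
by_os_browser = roll_up(VISITS, ["os", "browser"])
```

Any list of dimension names works as the grouping key, which is the "any combination of dimensions" requirement in miniature; an empty list yields the grand total.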
OLAP Queries

  §    Rolling up to country level

         SELECT COUNT(visits), SUM(sales)
         GROUP BY country

         Country  visits  sales
         USA      4       $50
         Canada   1       0

  §    “Slicing” by browser

         SELECT COUNT(visits), SUM(sales)
         WHERE browser = 'FF'
         GROUP BY country

         Country  visits  sales
         USA      2       $10
         Canada   0       0

  §    Top browsers by sales

         SELECT SUM(sales), COUNT(visits)
         GROUP BY browser
         ORDER BY sales DESC

         Browser  sales  visits
         Chrome   $25    2
         Safari   $15    1
         FF       $10    2

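
The three queries can be tried against the sample data with Python's built-in `sqlite3` module. This is a sketch for experimentation only: the `visits` table name, its schema, and the column aliases are assumptions (the slide omits the FROM clause), and the real system answers these queries from HBase rather than from a SQL engine.

```python
import sqlite3

# Sample rows from the Vocabulary table: date, country, city, os, browser, sales.
ROWS = [
    ("2012-05-12", "USA", "NY", "Win", "FF", 0.0),
    ("2012-05-12", "USA", "NY", "Win", "FF", 10.0),
    ("2012-05-13", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-13", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-14", "USA", "Chicago", "OSX", "Safari", 15.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits "
    "(date TEXT, country TEXT, city TEXT, os TEXT, browser TEXT, sales REAL)"
)
conn.executemany("INSERT INTO visits VALUES (?, ?, ?, ?, ?, ?)", ROWS)

# Roll-up to country level.
by_country = conn.execute(
    "SELECT country, COUNT(*), SUM(sales) FROM visits "
    "GROUP BY country ORDER BY country"
).fetchall()

# Slice by browser: the filter is applied before grouping.
ff_by_country = conn.execute(
    "SELECT country, COUNT(*), SUM(sales) FROM visits "
    "WHERE browser = ? GROUP BY country",
    ("FF",),
).fetchall()

# Top browsers by sales.
top_browsers = conn.execute(
    "SELECT browser, SUM(sales) AS total_sales, COUNT(*) AS n_visits "
    "FROM visits GROUP BY browser ORDER BY total_sales DESC"
).fetchall()
```

A plain WHERE filter drops groups with no matching rows, so Canada does not appear in `ff_by_country`, whereas the slide lists it with zeros.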
Looking inside – physical diagram




Looking inside – logical diagram




Simplifying assumptions: pre-aggregation

  §  In most cases...
       §  Data needs to be summarized – hard to draw 1B data points
       §  You don’t need to look at all dimensions at the same time – hard to correlate
       §  Not all queries are used with the same frequency

A timeless CS problem: Optimize...

  Time                          Space
  §  Pre-aggregation            §  Runtime aggregation
  §  Fast                       §  Flexible
  §  Efficient reads – O(1)     §  I/O, CPU intensive
  §  Inflexible                 §  Slow – always need to look at all the data
  §  Processing latency         §  Low throughput
  §  Combinatorial explosion

Solution?

  §  Just do both!
  §  Can tune: pre-aggregate more, or rely on
      runtime aggregation
  §  Trade-off: ingestion + processing speed vs. query speed
  §  Works just like normal queries +
      materialized views

Solution?

  §  Process: pre-aggregate all the report
       definitions, create an indexed HBase table.
  §  Query: use the indexes to get the data
       fast. Perform extra aggregation and filtering
       at runtime if needed.
  §  Platform strengths
       §  Parallelism in M/R
       §  Fast access and natural key ordering in
             HBase

Minimal HBase details

  §    Data is stored in tables
  §    Each row has a key,
        and any number of
        columns (long & wide)
  §    Ordered by row keys:
        clustered indexes
        built-in
  §    Sparse tables. NULLs
        are free.

  Row Key  Columns...
  u1       v1   v2   v3
  u2       v    X    ...
  u3       v    x    ...
  u4       x    v2   ...
  u5       ...  v3   ...
  u6       ...  v5   ...
  u7       ...  ...  ...
  u8       ...  ...  ...

Minimal HBase details

  §    Operations use row
        key: get(), put()
  §    Can scan a range of
        rows: [start, end)
  §    We can use the row
        key as a built-in
        indexing
        mechanism

       Row key  Column
       aaa      v1
       aab      v2
    ←  aac      v3
    ←  aad      v4
    ←  aae      v5
    ←  aaf      v6
       aba      ...
       abb      ...

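
The half-open range scan can be imitated with a sorted list and binary search. The `SortedTable` class below is a toy stand-in written for this example, not the HBase client API; it only mirrors the three operations named on the slide. HBase compares row keys as raw bytes, while this sketch compares Python strings, which behaves the same way for ASCII keys.

```python
from bisect import bisect_left


class SortedTable:
    """Toy model of a table whose rows are kept ordered by row key."""

    def __init__(self):
        self._keys = []    # row keys, always sorted
        self._values = {}  # row key -> value

    def put(self, key, value):
        if key not in self._values:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


table = SortedTable()
for key, value in [("aab", "v2"), ("aaf", "v6"), ("aaa", "v1"), ("aba", "v7"),
                   ("aad", "v4"), ("aac", "v3"), ("abb", "v8"), ("aae", "v5")]:
    table.put(key, value)

# The arrows in the slide's table mark this range: aac up to, but excluding, aba.
marked = table.scan("aac", "aba")
```

Because the keys are ordered, a scan only touches the rows between the two bounds, which is what makes the row key usable as an index.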
SaasBase vs. SQL Views Comparison




Reports configuration


  §    List of Dimensions (with custom classes,
        arguments, etc.)
  §    List of Metrics (with custom classes, arguments,
        etc.)
  §    List of Reports, each containing
        §    Dimensions (subset)
        §    Metrics (subset)
        §    Sorting, etc.
  §    The reports configuration is used in the
        entire system: import, process, query
Solution?

  Input data:

  Date        Country  City  Sales
  2012-05-12  USA      NY    3
  2012-05-12  USA      NY    10
  2012-05-13  USA      SF    25
  2012-05-13  CAN      ON    0
  2012-05-14  USA      CH    15

  Reports configuration:

  visits_by_city: {
    dimensions: [country, city],
    metrics: [visits]
  },
  daily_sales: {
    dimensions: [year, month, day, country],
    metrics: [sales]
  }

  Statistics HBase Output Table:

  ROWKEY                      VALUE
  daily_sales/2012+05+12+USA  $13
  daily_sales/2012+05+13+CAN  $0
  daily_sales/2012+05+13+USA  $25
  daily_sales/2012+05+14+USA  $15
  visits_by_city/CAN+ON       1
  visits_by_city/USA+CH       1
  visits_by_city/USA+NY       2
  visits_by_city/USA+SF       1

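
The mapping from input rows and report definitions to row keys can be reproduced with a short Python sketch. It assumes one metric per report (as in the two reports on the slide) and stores the value directly under the row key; the helper names are hypothetical, and the real system does this work in a Map/Reduce job rather than in a single loop.

```python
from collections import defaultdict

# Input rows from the slide: date, country, city, sales.
ROWS = [
    ("2012-05-12", "USA", "NY", 3),
    ("2012-05-12", "USA", "NY", 10),
    ("2012-05-13", "USA", "SF", 25),
    ("2012-05-13", "CAN", "ON", 0),
    ("2012-05-14", "USA", "CH", 15),
]

# Report definitions from the slide.
REPORTS = {
    "visits_by_city": {"dimensions": ["country", "city"], "metrics": ["visits"]},
    "daily_sales": {"dimensions": ["year", "month", "day", "country"], "metrics": ["sales"]},
}


def dimension_values(row):
    date, country, city, _sales = row
    year, month, day = date.split("-")
    return {"year": year, "month": month, "day": day, "country": country, "city": city}


def metric_values(row):
    return {"visits": 1, "sales": row[3]}  # every input row counts as one visit


def pre_aggregate(rows, reports):
    """Return {row key: aggregated value}, ordered by row key."""
    table = defaultdict(int)
    for row in rows:
        dims, metrics = dimension_values(row), metric_values(row)
        for name, report in reports.items():
            key = name + "/" + "+".join(dims[d] for d in report["dimensions"])
            for metric in report["metrics"]:  # assumption: one metric per report
                table[key] += metrics[metric]
    return dict(sorted(table.items()))


stats = pre_aggregate(ROWS, REPORTS)
```

Sorting the keys reproduces the order of the output table above: the report name acts as the first level of the key, followed by the dimension values.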
HBase natural order: hierarchical filtering




Sorting

  §    Add the metrics that you want to sort by to the
        row key...
  §    In a way that preserves the ordering
  §    ORDER BY metric DESC == Long.MAX_VALUE - metric

  2012+05+USA+0000000000+
  2012+05+USA+4294961296+SF = 1000 visits
  2012+05+USA+4294961396+NY = 900 visits
  ...
  2012+05+USA+9999999999+

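
The descending-order trick can be sketched as follows. The sketch makes two assumptions that the slide does not spell out: the inverted metric is rendered as a zero-padded decimal string, and the field is 19 digits wide (the length of Java's `Long.MAX_VALUE`) rather than the 10 digits shown in the example keys. The function names are hypothetical.

```python
MAX_VALUE = 2**63 - 1        # value of Java's Long.MAX_VALUE
WIDTH = len(str(MAX_VALUE))  # 19 digits, so every field has the same length


def desc_sort_key(metric):
    """Encode a non-negative metric so that larger values sort first."""
    if not 0 <= metric <= MAX_VALUE:
        raise ValueError("metric out of range")
    # Zero-padding keeps lexicographic order equal to numeric order.
    return str(MAX_VALUE - metric).zfill(WIDTH)


def row_key(prefix, metric, label):
    return f"{prefix}+{desc_sort_key(metric)}+{label}"


keys = sorted(
    row_key("2012+05+USA", visits, city)
    for city, visits in [("NY", 900), ("SF", 1000), ("CH", 5)]
)
cities_by_visits_desc = [key.rsplit("+", 1)[1] for key in keys]
```

Scanning the keys in their natural order then returns the cities from most to least visits, with no sorting at query time.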
Minimizing Latency




Minimizing Import Latency


  §    Only import the minimal set of changes
  §    Map/Reduce input filters:
        §    c.a.s.a.i.FileCache – checks if a file was already
              processed
        §    c.a.s.a.i.FileDateFilter – checks if a date in
              the file path falls within a specified interval
        §    e.g. process files from 3 days ago up until now,
              once
        §    HBase scan (from import table) start and stop row
  §    Minimize map-task overhead – stitch input splits
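
A rough Python sketch of the two file filters combined. The source does not show how `FileCache` and `FileDateFilter` are implemented, so everything here is an assumption for illustration: the path layout with an ISO date in it, the in-memory set of processed paths, and the `select_new_files` helper.

```python
import re
from datetime import date, timedelta

# Assumption: input paths contain a date such as /logs/2012-05-12/part-0.log
DATE_IN_PATH = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def path_date(path):
    """Return the date embedded in the path, or None if there is none."""
    match = DATE_IN_PATH.search(path)
    if match is None:
        return None
    return date(int(match.group(1)), int(match.group(2)), int(match.group(3)))


def select_new_files(paths, processed, today, days_back=3):
    """Keep files dated within [today - days_back, today] that were not seen before."""
    start = today - timedelta(days=days_back)
    selected = []
    for path in paths:
        day = path_date(path)
        if path in processed or day is None or not (start <= day <= today):
            continue
        selected.append(path)
        processed.add(path)  # remember it so each file is imported once
    return selected


paths = [
    "/logs/2012-05-10/a.log",
    "/logs/2012-05-12/b.log",
    "/logs/2012-05-14/c.log",
    "/logs/readme.txt",
]
seen = set()
first_run = select_new_files(paths, seen, today=date(2012, 5, 14))
second_run = select_new_files(paths, seen, today=date(2012, 5, 14))
```

The second run selects nothing, which is the "once" in "process files from 3 days ago up until now, once".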
Minimizing Import Latency


  §    Minimize map-task overhead – stitch input splits
        §    400,000 files -> 400,000 map tasks, and a slow
              reduce-copy phase
        §    o.a.h.m.i.CombineFileInputFormat – make 2GB
              splits
        §    c.a.s.a.m.i.FixedMappersTableInputFormat –
              stitches multiple HBase regions into the same
              map task

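
The effect of stitching many small files into 2GB splits can be estimated with a greedy packing sketch. This is not how `CombineFileInputFormat` is implemented (it also takes data locality into account); `stitch_splits` is a hypothetical helper, and the 1 MiB file size is an assumption chosen to make the arithmetic easy.

```python
SPLIT_SIZE = 2 * 1024**3  # 2GB per split, as on the slide
MIB = 1024**2


def stitch_splits(files, max_split=SPLIT_SIZE):
    """Greedily pack (path, size) pairs into splits of at most max_split bytes.

    A single file larger than max_split still gets a split of its own.
    """
    splits, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > max_split:
            splits.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        splits.append(current)
    return splits


# 400,000 small files, assumed here to be 1 MiB each.
files = [(f"part-{i:06d}", MIB) for i in range(400_000)]
splits = stitch_splits(files)
```

Under these assumptions the job runs 196 map tasks instead of 400,000, since each split holds 2,048 files.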
Minimizing Import Latency


  §    If warehousing in HBase, use
        o.a.h.h.m.HFileOutputFormat
        §    ~100 times faster than using the API
        §    No shuffle step! You must use a global order partitioner
  §    Problem: data grows over time
  §    Solution: estimate output partitions based on input data
        size, and make partitions (regions) using this heuristic
  §    c.a.s.a.m.FileSizeDatePartitioner – inject input file
        sizes and dates and rebalance regions based on these,
        and a fixed size (2GB)

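
One possible reading of the size-based heuristic, sketched in Python. The source gives no details of `FileSizeDatePartitioner`, so this is an assumption-heavy illustration: input files are ordered by date, and a new region starts whenever the accumulated input size would exceed the fixed 2GB. Both helper names are hypothetical.

```python
import math

REGION_SIZE = 2 * 1024**3  # fixed region size from the slide (2GB)
GIB = 1024**3


def estimate_regions(file_sizes, region_size=REGION_SIZE):
    """Estimate the number of output regions from the total input size."""
    return max(1, math.ceil(sum(file_sizes) / region_size))


def region_start_dates(files, region_size=REGION_SIZE):
    """files: (date string, size) pairs. Return the dates at which a new region starts."""
    boundaries, accumulated = [], 0
    for day, size in sorted(files):
        if accumulated and accumulated + size > region_size:
            boundaries.append(day)  # this date opens the next region
            accumulated = 0
        accumulated += size
    return boundaries


# Assumed input: one 1 GiB file per day for five days.
files = [(f"2012-05-{day}", GIB) for day in range(12, 17)]
n_regions = estimate_regions(size for _, size in files)
boundaries = region_start_dates(files)
```

With this input the estimate is three regions, split at two date boundaries, so the partitions stay roughly balanced as new days of data arrive.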
Minimizing Processing Latency


  §    Processing involves reading the input (files, tables,
        events), pre-aggregating it (reducing cardinality) and
        generating tables that can be queried in real-time
  §    Processing does GROUP BY, COUNT/SUM/AVG, ORDER
        BY
  §    Minimize each M/R step: read, map, partition, combine,
        copy, sort, reduce, write
  §    Read
        §    Filter input data (incremental processing) – differentiate
              between OPEN and CLOSED data
        §    HBase Scan options: caching, batching, etc
        §    Ensure HBase table regions are distributed in the cluster
Minimizing Processing Latency


  §    c.a.s.a.m.j.SuperProcessor	
  
        §    One shot M/R job: for all data, for all reports, emit the
              pre-aggregated values in 1 map() call
        §    no allocations
        §    Simple and tight
        §    no system calls (avoid context switches)
        §    no String <> byte[] transformations
        §    minimize Map > Combine > Reduce I/O
        §    NO ALLOCATIONS



Minimizing Query Latency


  §    c.a.s.a.m.t.ReportHandler	
  
        §    Simple Thrift server
  §    Data is already processed and pre-aggregated
  §    Query time does HAVING/WHERE (filters), extra
        GROUP BY (roll-ups)
  §    Calculate an optimal set of HBase scan()s	
  
        §    single / multiple scans
        §    start / stop rows (prefixes, index positions)
  §    Perform extra roll-ups / sorting
  §    Assorted sundries: paging, display-time ser/des, etc

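
The query-time steps (turn a key prefix into start and stop rows, scan, then roll up) can be sketched in Python over the statistics rows shown earlier. The sorted list stands in for the HBase table, and `prefix_range`, `scan`, and `sales_by_country` are hypothetical helpers rather than the `ReportHandler` API.

```python
from bisect import bisect_left
from collections import defaultdict

# Pre-aggregated rows, kept sorted by row key (a stand-in for the HBase table).
STATS = sorted({
    "daily_sales/2012+05+12+USA": 13,
    "daily_sales/2012+05+13+CAN": 0,
    "daily_sales/2012+05+13+USA": 25,
    "daily_sales/2012+05+14+USA": 15,
    "visits_by_city/CAN+ON": 1,
    "visits_by_city/USA+CH": 1,
    "visits_by_city/USA+NY": 2,
    "visits_by_city/USA+SF": 1,
}.items())
KEYS = [key for key, _ in STATS]


def prefix_range(prefix):
    """Return [start, stop) covering every key that begins with prefix.

    Simplification: assumes the last character of the prefix can be incremented.
    """
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)


def scan(start, stop):
    return STATS[bisect_left(KEYS, start):bisect_left(KEYS, stop)]


def sales_by_country(prefix):
    """Extra roll-up at query time: drop the date levels, keep the country."""
    totals = defaultdict(int)
    for key, value in scan(*prefix_range(prefix)):
        totals[key.rsplit("+", 1)[1]] += value
    return dict(totals)


one_day = scan(*prefix_range("daily_sales/2012+05+13"))
month_by_country = sales_by_country("daily_sales/2012+05")
```

A single prefix scan answers the daily query directly, while the monthly roll-up reads only the `daily_sales` rows for that month and sums them per country.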
Flexible


  §    Report configuration – the core of the system
  §    c.a.s.a.e.Dimension, c.a.s.a.e.Metric
        §    Can override ser/des, aggregate functions (for metrics)
        §    Can override behavior (only add 1 if X...)
        §    Emergent patterns are rolled-up in the reporting core
  §    The entire processing loop can be written outside of
        M/R for realtime
        §    Storm?
  §    Applied in 4 use-cases right now, easy to extend
  §    Some programming required
Thank you


                                 adragomi@adobe.com / @adragomir
                                          http://hstack.org


       Our team: Adrian Muraru, Andrei Dulvac, Bogdan Dragu,
     Bogdan Drutu, Cosmin Lehene, Raluca Podiuc, Tudor Scurtu

© 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
© 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
Break!
Break takes place in the Community Showcase (Hall 2)
Sessions will resume at 3:35pm




                                                       Page 40

Contenu connexe

Tendances

Troubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the BeastTroubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the BeastDataWorks Summit
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcachedJurriaan Persyn
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®confluent
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
Thousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/OThousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/OGeorge Cao
 
Impala presentation
Impala presentationImpala presentation
Impala presentationtrihug
 
Cloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeCloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeDatabricks
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking VN
 
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBayReal-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBayAltinity Ltd
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path ForwardAlluxio, Inc.
 
Big Query - Utilizing Google Data Warehouse for Media Analytics
Big Query - Utilizing Google Data Warehouse for Media AnalyticsBig Query - Utilizing Google Data Warehouse for Media Analytics
Big Query - Utilizing Google Data Warehouse for Media Analyticshafeeznazri
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 

Tendances (20)

Troubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the BeastTroubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the Beast
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Low Latency OLAP with Hadoop and HBase

  • 1. Low-Latency “OLAP” with Hadoop and HBase Andrei Dragomir | Software Engineer © 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
  • 2. Synopsis
      § What are we trying to solve
      § Description of our system
      § How it works
      § Minimizing latency
  • 3. In a nutshell: a low-latency OLAP system. Hadoop DFS stores the input data (i.e. log files or HBase tables). The processing loop of the system takes a cube description and processes it (pre-aggregations) using Hadoop Map/Reduce. The output is written to a statistics HBase table. To get the data, users query a server, which scans the HBase table, applies the filters, roll-ups or drill-downs, and returns the result.
  • 10. Vocabulary

      Date        Country  City     OS     Browser  Sales
      2012-05-12  USA      NY       Win    FF       $ 0.0
      2012-05-12  USA      NY       Win    FF       $ 10.0
      2012-05-13  USA      SF       OSX    Chrome   $ 25.0
      2012-05-13  Canada   Ontario  Linux  Chrome   $ 0.0
      2012-05-14  USA      Chicago  OSX    Safari   $ 15.0
      ...         ...      ...      ...    ...      ...

      Summary per column:
      Date:     5 visits, 3 days
      Country:  2 countries (USA: 4, Canada: 1)
      City:     4 cities (including NY: 2, SF: 1)
      OS:       3 OS (including Win: 2, OSX: 2)
      Browser:  3 browsers (including FF: 2, Chrome: 2)
      Sales:    $50.0, 3 sales

      § We want to get (mostly) numeric data: metrics
      § These metrics have a set of labels (dimensions)
      § We want to view the metrics by any combination of dimensions
  • 13. OLAP Queries
      § Rolling up to country level
            SELECT COUNT(visits), SUM(sales)
            GROUP BY country

            Country  visits  sales
            USA      4       $50
            Canada   1       0

      § “Slicing” by browser
            SELECT COUNT(visits), SUM(sales)
            GROUP BY country
            HAVING browser = “FF”

            Country  visits  sales
            USA      2       $10
            Canada   0       0

      § Top browsers by sales
            SELECT SUM(sales), COUNT(visits)
            GROUP BY browser
            ORDER BY sales

            Browser  sales  visits
            Chrome   $25    2
            Safari   $15    1
            FF       $10    2
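      The three queries on this slide can be reproduced over the five sample rows from the Vocabulary slide. The following is only a minimal in-memory sketch: the `aggregate` helper and its parameters are hypothetical names chosen for illustration and are not part of the system described in the talk.

```python
from collections import defaultdict

# Sample rows from the Vocabulary slide: (date, country, city, os, browser, sales).
ROWS = [
    ("2012-05-12", "USA", "NY", "Win", "FF", 0.0),
    ("2012-05-12", "USA", "NY", "Win", "FF", 10.0),
    ("2012-05-13", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-13", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-14", "USA", "Chicago", "OSX", "Safari", 15.0),
]
COLUMNS = ("date", "country", "city", "os", "browser", "sales")


def aggregate(rows, group_by, where=None):
    """Return {group value: (visits, sales)} for rows passing the optional filter."""
    index = COLUMNS.index(group_by)
    totals = defaultdict(lambda: [0, 0.0])
    for row in rows:
        record = dict(zip(COLUMNS, row))
        if where is not None and not where(record):
            continue
        totals[row[index]][0] += 1          # COUNT(visits): one row is one visit
        totals[row[index]][1] += row[-1]    # SUM(sales)
    return {key: tuple(value) for key, value in totals.items()}


# Roll-up to country level.
by_country = aggregate(ROWS, "country")
# Slice by browser, then roll up to country level.
ff_by_country = aggregate(ROWS, "country", where=lambda r: r["browser"] == "FF")
# Top browsers by sales, highest first.
top_browsers = sorted(aggregate(ROWS, "browser").items(),
                      key=lambda item: item[1][1], reverse=True)
```

      Unlike the table on the slide, this sketch omits groups with no matching rows, so Canada does not appear in `ff_by_country` instead of showing as 0.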
  • 14. Looking inside: physical diagram
  • 15. Looking inside: logical diagram
  • 16. Simplifying assumptions: pre-aggregation
      § In most cases...
          § Data needs to be summarized: it is hard to draw 1B data points
          § You don’t need to look at all dimensions at the same time: they are hard to correlate
          § Not all queries are used with the same frequency
  • 17. A timeless CS problem: Optimize... Time | Space
      § Time: pre-aggregation
          § Fast
          § Efficient reads: O(1)
          § Inflexible
          § Processing latency
          § Combinatorial explosion
      § Space: runtime aggregation
          § Flexible
          § I/O, CPU intensive
          § Slow: always need to look at all the data
          § Low throughput
  • 18. Solution ?
      § Just do both!
      § Can tune: pre-aggregate more, or rely on runtime aggregation
      § Ingestion + processing speed vs. query speed
      § Works just like normal queries + materialized views
  • 19. Solution ?
      § Process: pre-aggregate all the report definitions, create an indexed HBase table.
      § Query: use the indexes to get the data fast. Perform extra aggregation and filtering at runtime if needed.
      § Platform strengths
          § Parallelism in M/R
          § Fast access and natural key ordering in HBase
  • 20. Minimal HBase details
      § Data is stored in tables
      § Each row has a key and any number of columns (long & wide)
      § Ordered by row keys: clustered indexes built-in
      § Sparse tables. NULLs are free.
      (Illustration: a table with row keys u1 to u8 and sparsely populated columns.)
  • 21. Minimal HBase details
      § Operations use the row key: get(), put()
      § Can scan a range of rows: [start, end)
      § We can use the row key as a built-in indexing mechanism
      (Illustration: a table with row keys aaa, aab, aac, aad, aae, aaf, aba, abb; arrows mark a scanned range of consecutive rows.)
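      The row-key operations listed on this slide can be illustrated with a toy in-memory table. This is a sketch under stated assumptions, not the HBase client API: `SortedTable` is a hypothetical class, and the value stored under `aba` is invented for the example.

```python
import bisect


class SortedTable:
    """Toy in-memory stand-in for a table ordered by row key (not the HBase API)."""

    def __init__(self):
        self._keys = []    # row keys, kept sorted
        self._rows = {}    # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

    def scan(self, start, end):
        """Yield (key, value) for start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


table = SortedTable()
# Rows are inserted out of order on purpose; the table keeps them sorted by key.
for key, value in [("aba", "v7"), ("aaa", "v1"), ("aac", "v3"), ("aab", "v2"),
                   ("aad", "v4"), ("aaf", "v6"), ("aae", "v5")]:
    table.put(key, value)

scanned = list(table.scan("aac", "aaf"))      # half-open range: aaf is excluded
prefix_rows = list(table.scan("aa", "ab"))    # every key that starts with "aa"
```

      Because keys are sorted, a shared prefix always maps to one contiguous range, which is what lets the row key act as an index.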
  • 22. SaasBase vs. SQL Views Comparison
  • 23. Reports configuration
      § List of Dimensions (with custom classes, arguments, etc.)
      § List of Metrics (with custom classes, arguments, etc.)
      § List of Reports, each containing
          § Dimensions (subset)
          § Metrics (subset)
          § Sorting, etc.
      § The reports configuration is used in the entire system: import, process, query
  • 26. Solution ?

      Input data:

      Date        Country  City  Sales
      2012-05-12  USA      NY    3
      2012-05-12  USA      NY    10
      2012-05-13  USA      SF    25
      2012-05-13  CAN      ON    0
      2012-05-14  USA      CH    15

      Report definitions:

      visits_by_city: {
        dimensions: [country, city],
        metrics: [visits]
      },
      daily_sales: {
        dimensions: [year, month, day, country],
        metrics: [sales]
      }

      Statistics HBase output table:

      ROWKEY                      VALUE
      daily_sales/2012+05+12+USA  $13
      daily_sales/2012+05+13+CAN  $0
      daily_sales/2012+05+13+USA  $25
      daily_sales/2012+05+14+USA  $15
      visits_by_city/CAN+ON       1
      visits_by_city/USA+CH       1
      visits_by_city/USA+NY       2
      visits_by_city/USA+SF       1
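      The mapping from input rows and report definitions to row keys in the statistics table can be sketched in a few lines. This is a single-process illustration of the idea, assuming one metric per report and a `report/dim+dim+...` key layout read off the slide; the function names are hypothetical, and the real system runs this step as a Hadoop Map/Reduce job.

```python
from collections import defaultdict

# Input rows from the slide: (date, country, city, sales).
ROWS = [
    ("2012-05-12", "USA", "NY", 3),
    ("2012-05-12", "USA", "NY", 10),
    ("2012-05-13", "USA", "SF", 25),
    ("2012-05-13", "CAN", "ON", 0),
    ("2012-05-14", "USA", "CH", 15),
]

# Report definitions from the slide.
REPORTS = {
    "visits_by_city": {"dimensions": ["country", "city"], "metrics": ["visits"]},
    "daily_sales": {"dimensions": ["year", "month", "day", "country"],
                    "metrics": ["sales"]},
}


def to_record(row):
    """Expand a raw row into named dimension and metric values."""
    date, country, city, sales = row
    year, month, day = date.split("-")
    return {"year": year, "month": month, "day": day, "country": country,
            "city": city, "sales": sales, "visits": 1}


def pre_aggregate(rows, reports):
    """Return {row key: value}, summing each report's metric per dimension tuple."""
    table = defaultdict(int)
    for row in rows:
        record = to_record(row)
        for name, report in reports.items():
            key = name + "/" + "+".join(record[d] for d in report["dimensions"])
            table[key] += record[report["metrics"][0]]   # one metric per report here
    return dict(sorted(table.items()))    # sorted by row key, as HBase would store it


stats = pre_aggregate(ROWS, REPORTS)
```

      Sorting the resulting keys reproduces the order of the output table on the slide, with all rows of one report stored next to each other.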
  • 27. HBase natural order: hierarchical filtering
  • 29. Sorting
      § Add the metrics that you want to sort by to the row key...
      § In a way that preserves the ordering
      § ORDER BY metric DESC == Long.MAX_VALUE - metric

            2012+05+USA+0000000000+
            2012+05+USA+4294961296+SF = 1000 visits
            2012+05+USA+4294961396+NY = 900 visits
            . . .
            2012+05+USA+9999999999+
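      The descending-order trick on this slide can be sketched as follows. The sketch assumes the inverted metric is stored as a zero-padded decimal string of fixed width (19 digits for Long.MAX_VALUE, whereas the keys on the slide show 10 digits); `desc_sort_key` and the Chicago count are hypothetical and used only for illustration.

```python
LONG_MAX = 2**63 - 1          # Java's Long.MAX_VALUE
WIDTH = len(str(LONG_MAX))    # 19 digits, so every encoded value has equal length


def desc_sort_key(prefix, metric, label):
    """Build a row key whose lexicographic order is descending by metric."""
    if not 0 <= metric <= LONG_MAX:
        raise ValueError("metric out of range")
    inverted = str(LONG_MAX - metric).zfill(WIDTH)
    return f"{prefix}+{inverted}+{label}"


visits = {"NY": 900, "SF": 1000, "CH": 15}
# Plain string sorting stands in for the natural row-key order of the table.
keys = sorted(desc_sort_key("2012+05+USA", count, city)
              for city, count in visits.items())
ranking = [key.rsplit("+", 1)[1] for key in keys]
```

      The fixed width matters: without zero padding, string comparison and numeric comparison would disagree for values with different digit counts.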
  • 30. Minimizing Latency
  • 31. Minimizing Import Latency
      § Only import the minimal set of changes
      § Map/Reduce input filters:
          § c.a.s.a.i.FileCache: checks whether a file has already been processed
          § c.a.s.a.i.FileDateFilter: checks whether a date in the file path falls within a specified interval
              § e.g. process files from 3 days ago up until now, once
      § HBase scan (from import table): start and stop row
      § Minimize map-task overhead: stitch input splits
  • 32. Minimizing Import Latency
      § Minimize map-task overhead: stitch input splits
          § 400000 files -> 400000 map tasks, slow reduce-copy phase
          § o.a.h.m.i.CombineFileInputFormat: make 2GB splits
          § c.a.s.a.m.i.FixedMappersTableInputFormat: stitches multiple HBase regions in the same map task
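      The effect of stitching many small files into 2GB splits can be estimated with a simple greedy packer. This is only a sketch of the idea: the real CombineFileInputFormat also takes data locality into account, and the `combine_splits` function, the file names and the 1 MB file size are assumptions made for the example.

```python
GB = 1024 ** 3


def combine_splits(file_sizes, max_split_bytes=2 * GB):
    """Greedily pack (path, size) pairs into splits of at most max_split_bytes.

    A file larger than the limit gets a split of its own.
    """
    splits, current, current_size = [], [], 0
    for path, size in file_sizes:
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        splits.append(current)
    return splits


# Hypothetical input: 400000 log files of 1 MB each.
files = [(f"logs/part-{i:06d}", 1024 ** 2) for i in range(400_000)]
splits = combine_splits(files)
```

      Under these assumed sizes the job needs 196 map tasks instead of 400000, one per combined split.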
  • 33. Minimizing Import Latency
      § If warehousing in HBase, use o.a.h.h.m.HFileOutputFormat
          § ~100 times faster than using the API
          § No shuffle step! You must use a global order partitioner
      § Problem: data grows over time
      § Solution: estimate output partitions based on input data size, and make partitions (regions) using this heuristic
          § c.a.s.a.m.FileSizeDatePartitioner: inject input file sizes and dates, and rebalance regions based on these and a fixed size (2GB)
  • 34. Minimizing Processing Latency
      § Processing involves reading the input (files, tables, events), pre-aggregating it (reducing cardinality) and generating tables that can be queried in real time
      § Processing does GROUP BY, COUNT/SUM/AVG, ORDER BY
      § Minimize each M/R step: read, map, partition, combine, copy, sort, reduce, write
      § Read
          § Filter input data (incremental processing): differentiate between OPEN and CLOSED data
          § HBase Scan options: caching, batching, etc.
          § Ensure HBase table regions are distributed in the cluster
  • 35. Minimizing Processing Latency
      § c.a.s.a.m.j.SuperProcessor
      § One-shot M/R job: for all data, for all reports, emit the pre-aggregated values in 1 map() call
      § no allocations
      § Simple and tight
      § no system calls (avoid context switches)
      § no String <> byte[] transformations
      § minimize Map > Combine > Reduce I/O
      § NO ALLOCATIONS
  • 36. Minimizing Query Latency
      § c.a.s.a.m.t.ReportHandler
      § Simple Thrift server
      § Data is already processed and pre-aggregated
      § Query time does HAVING/WHERE (filters), extra GROUP BY (roll-ups)
      § Calculate an optimal set of HBase scan()s
          § single / multiple scans
          § start / stop rows (prefixes, index positions)
      § Perform extra roll-ups / sorting
      § Assorted sundries: paging, display-time ser/des, etc.
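      The query path described on this slide (scan a key range, filter, then roll up further) can be sketched over the pre-aggregated `daily_sales` rows from the statistics table example. This is an in-memory illustration, not the ReportHandler implementation: `prefix_scan`, `query` and their parameters are hypothetical, and a sorted Python list stands in for the HBase table.

```python
import bisect
from collections import defaultdict

# Pre-aggregated rows as in the statistics table example, kept sorted by row key.
STATS = sorted({
    "daily_sales/2012+05+12+USA": 13,
    "daily_sales/2012+05+13+CAN": 0,
    "daily_sales/2012+05+13+USA": 25,
    "daily_sales/2012+05+14+USA": 15,
}.items())
KEYS = [key for key, _ in STATS]
DIMENSIONS = ["year", "month", "day", "country"]


def prefix_scan(prefix):
    """Yield rows whose key starts with prefix, reading one contiguous range."""
    start = bisect.bisect_left(KEYS, prefix)
    for key, value in STATS[start:]:
        if not key.startswith(prefix):
            break
        yield key, value


def query(report, prefix_values, group_by, where=None):
    """Scan one report by key prefix, filter, then roll up to group_by.

    prefix_values must cover only a strict leading subset of DIMENSIONS.
    """
    prefix = report + "/" + "".join(value + "+" for value in prefix_values)
    totals = defaultdict(int)
    for key, value in prefix_scan(prefix):
        record = dict(zip(DIMENSIONS, key.split("/", 1)[1].split("+")))
        if where is not None and not where(record):
            continue                       # runtime WHERE/HAVING-style filter
        totals[record[group_by]] += value  # runtime GROUP BY (roll-up)
    return dict(totals)


monthly_by_country = query("daily_sales", ["2012", "05"], "country")
usa_by_day = query("daily_sales", ["2012", "05"], "day",
                   where=lambda r: r["country"] == "USA")
```

      The prefix narrows the scan using the leading dimensions of the key, while dimensions further to the right (country here) have to be filtered at runtime.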
  • 37. Flexible
      § Report configuration: the core of the system
      § c.a.s.a.e.Dimension, c.a.s.a.e.Metric
          § Can override ser/des, aggregate functions (for metrics)
          § Can override behavior (only add 1 if X...)
      § Emergent patterns are rolled up in the reporting core
      § The entire processing loop can be written outside of M/R for realtime
          § Storm ?
      § Applied in 4 use cases right now, easy to extend
      § Some programming required
  • 38. Thank you. adragomi@adobe.com / @adragomir, http://hstack.org. Our team: Adrian Muraru, Andrei Dulvac, Bogdan Dragu, Bogdan Drutu, Cosmin Lehene, Raluca Podiuc, Tudor Scurtu
  • 40. Break! Break takes place in the Community Showcase (Hall 2). Sessions will resume at 3:35pm.