Low Latency “OLAP” with HBase - HBaseCon 2012
1.
Low Latency “OLAP” with HBase
Cosmin Lehene | Adobe
© 2012 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
2.
What we needed … and built
OLAP Semantics
Low Latency Ingestion
High Throughput
Real-time Query API
Not hardcoded to web analytics or x-, y-, z- analytics, but extensible
3.
Building Blocks
Dimensions, Metrics
Aggregations
Roll-up, drill-down, slicing and dicing, sorting
4.
OLAP 101 - Queries example
Date        Country  City     OS       Browser  Sale
2012-05-21  USA      NY       Windows  FF        0.0
2012-05-21  USA      NY       Windows  FF       10.0
2012-05-22  USA      SF       OSX      Chrome   25.0
2012-05-22  Canada   Ontario  Linux    Chrome    0.0
2012-05-23  USA      Chicago  OSX      Safari   15.0
Totals: 5 visits, 3 days; 50.0 in sales (3 sales)
2 countries: USA: 4, Canada: 1
4 cities: NY: 2, SF: 1
3 OS-es: Win: 2, OSX: 2
3 browsers: FF: 2, Chrome: 2
5.
OLAP 101 - Queries example
Rolling up to country level:
SELECT COUNT(visits), SUM(sales)
GROUP BY country

Country  visits  sales
USA      4       $50
Canada   1       0

“Slicing” by browser:
SELECT COUNT(visits), SUM(sales)
GROUP BY country
HAVING browser = “FF”

Country  visits  sales
USA      2       $10
Canada   0       0

Top browsers by sales:
SELECT SUM(sales), COUNT(visits)
GROUP BY browser
ORDER BY sales

Browser  sales  visits
Chrome   $25    2
Safari   $15    1
FF       $10    2
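The three queries above can be reproduced on the five sample visits with a few lines of Python. This is only an illustrative sketch: the `aggregate` helper and the tuple layout are assumptions made for the example, not part of SaasBase. One difference from the slide is that a group with no matching rows (Canada in the FF slice) is simply absent instead of being reported as 0.

```python
from collections import defaultdict

# Sample visits from the "OLAP 101" table:
# (date, country, city, os, browser, sale)
COLUMNS = ("date", "country", "city", "os", "browser", "sale")
VISITS = [
    ("2012-05-21", "USA", "NY", "Windows", "FF", 0.0),
    ("2012-05-21", "USA", "NY", "Windows", "FF", 10.0),
    ("2012-05-22", "USA", "SF", "OSX", "Chrome", 25.0),
    ("2012-05-22", "Canada", "Ontario", "Linux", "Chrome", 0.0),
    ("2012-05-23", "USA", "Chicago", "OSX", "Safari", 15.0),
]


def aggregate(rows, group_by, where=None):
    """Return {group value: [visit count, sales sum]} for one dimension."""
    idx = COLUMNS.index(group_by)
    totals = defaultdict(lambda: [0, 0.0])
    for row in rows:
        if where is not None and not where(row):
            continue  # the row is filtered out by the slice
        totals[row[idx]][0] += 1
        totals[row[idx]][1] += row[-1]
    return dict(totals)


# Roll-up to country level
rollup = aggregate(VISITS, "country")
# Slice: only visits made with the FF browser
ff_slice = aggregate(VISITS, "country", where=lambda r: r[4] == "FF")
# Top browsers by sales, highest first
top_browsers = sorted(
    aggregate(VISITS, "browser").items(),
    key=lambda item: item[1][1],
    reverse=True,
)
```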
6.
OLAP - Runtime Aggregation vs. Pre-aggregation
Aggregate at runtime
  Most flexible
  Fast (scatter-gather)
  Space efficient
  But:
    I/O, CPU intensive
    Slow for larger data
    Low throughput
Pre-aggregate
  Fast
  Efficient: O(1)
  High throughput
  But:
    More effort to process (latency)
    Combinatorial explosion (space)
    No flexibility
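The trade-off can be made concrete with a small sketch. It assumes the five sample visits from the OLAP 101 table, and all function names are hypothetical. A runtime query scans every row, while a pre-aggregated view answers with a single dictionary lookup. The cost is one view per dimension group, which is where the combinatorial explosion comes from: 5 dimensions already give 31 non-empty groups.

```python
from itertools import combinations

DIMENSIONS = ("date", "country", "city", "os", "browser")

ROWS = [
    dict(zip(DIMENSIONS + ("sale",), values))
    for values in [
        ("2012-05-21", "USA", "NY", "Windows", "FF", 0.0),
        ("2012-05-21", "USA", "NY", "Windows", "FF", 10.0),
        ("2012-05-22", "USA", "SF", "OSX", "Chrome", 25.0),
        ("2012-05-22", "Canada", "Ontario", "Linux", "Chrome", 0.0),
        ("2012-05-23", "USA", "Chicago", "OSX", "Safari", 15.0),
    ]
]


def runtime_sum(rows, group, key):
    """Aggregate at query time: scans every row, O(n) per query."""
    return sum(r["sale"] for r in rows if tuple(r[d] for d in group) == key)


def build_view(rows, group):
    """Pre-aggregate once: {dimension values: total sales} for one group."""
    view = {}
    for r in rows:
        key = tuple(r[d] for d in group)
        view[key] = view.get(key, 0.0) + r["sale"]
    return view


def dimension_groups(dimensions):
    """Every non-empty subset of dimensions is a candidate view."""
    return [
        group
        for size in range(1, len(dimensions) + 1)
        for group in combinations(dimensions, size)
    ]


# Pre-aggregating everything: one view per dimension group.
VIEWS = {group: build_view(ROWS, group) for group in dimension_groups(DIMENSIONS)}

# Same answer either way; the view lookup does not touch the raw rows.
usa_runtime = runtime_sum(ROWS, ("country",), ("USA",))
usa_lookup = VIEWS[("country",)][("USA",)]
```

In practice only some of these groups are worth materializing, which is the view selection problem mentioned on the next slide.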
7.
Pre-aggregation
Data needs to be summarized
  Can’t visualize 1B data points (no, not even with Retina display)
  Difficult to comprehend correlations among more than 3 dimensions
Not all dimension groups are relevant
  Index on a needed basis (view selection problem)
Runtime aggregation == TeraSort for every query?
  Pre-aggregate to reduce cardinality
8.
SaasBase
We tune the pre-aggregation level against runtime post-aggregation:
(ingestion speed + space) vs. (query speed)
Think materialized views from RDBMS
9.
SaasBase Domain Model Mapping
10.
SaasBase - Domain Model Mapping
11.
SaasBase - Ingestion, Processing, Indexing, Querying
12.
SaasBase - Ingestion, Processing, Indexing, Querying
13.
Ingestion
14.
Ingestion throughput vs. latency
Historical data (large batches)
  Optimize for throughput
Increments (latest data, smaller)
  Optimize for latency
15.
Large, granular input strategies
Slow listing in HDFS
  Archive processed files
Filtering input
  FileDateFilter (log name patterns: log-YYYY-MM-dd-HH.log)
  TableInputFormat start/stop row
  File Index in HBase (track processed/new files)
Map tasks overhead - stitching input splits
  400K files => 400K map tasks => overhead, slow reduce copy
  CombineFileInputFormat - 2GB splits => 500 splits for 1TB
  FixedMappersTableInputFormat (e.g. 5-region splits)
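Two of these strategies can be sketched in plain Python. The functions below are hypothetical stand-ins, not the actual FileDateFilter or CombineFileInputFormat code. The first selects log files by the date embedded in their name. The second greedily stitches small files into splits of bounded size, assuming 400K equally sized files that add up to 1TB (decimal units), which reproduces the 500 splits quoted on the slide.

```python
import re
from datetime import datetime

# File names follow the slide's pattern: log-YYYY-MM-dd-HH.log
LOG_NAME = re.compile(r"^log-(\d{4}-\d{2}-\d{2}-\d{2})\.log$")


def in_date_range(filename, start, end):
    """True if the file's embedded hour falls in [start, end)."""
    match = LOG_NAME.match(filename)
    if match is None:
        return False  # not a log file we know how to date
    stamp = datetime.strptime(match.group(1), "%Y-%m-%d-%H")
    return start <= stamp < end


def combine_splits(file_sizes, max_split_size):
    """Greedily pack consecutive files into splits of at most max_split_size.

    Returns the size of each resulting split.
    """
    splits = []
    current = 0
    for size in file_sizes:
        if current and current + size > max_split_size:
            splits.append(current)
            current = 0
        current += size
    if current:
        splits.append(current)
    return splits


# Assumed workload: 400K files of 2,500 KB each = 1 TB (decimal units).
file_sizes_kb = [2_500] * 400_000
splits = combine_splits(file_sizes_kb, max_split_size=2_000_000)  # 2 GB in KB
```

One map task per split instead of one per file cuts the task count from 400K to 500 in this example.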
16.
Ingestion - Bulk Import
HFileOutputFormat (HFOF)
100s X faster than HBase API
No need to recover from failed jobs
No unnecessary load on machines
* No shuffle - global reduce order required!
  e.g. first reduce key needs to be in the first region, last one in the last region
Watch for uneven partitions
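The global-order requirement means every row key has to be routed to the reducer that writes the region containing it. A minimal sketch of that routing follows, assuming the region start keys are known up front; the function names and the date-shaped keys are illustrative only. Counting keys per partition is a cheap way to spot the uneven partitions the slide warns about.

```python
from bisect import bisect_right
from collections import Counter


def region_for(row_key, region_start_keys):
    """Index of the region (and reducer) that must receive row_key.

    region_start_keys holds the sorted start keys of every region except
    the first; a key equal to a start key belongs to the region it opens.
    """
    return bisect_right(region_start_keys, row_key)


def partition_sizes(row_keys, region_start_keys):
    """Number of keys routed to each partition, to detect skew."""
    return Counter(region_for(key, region_start_keys) for key in row_keys)


# Assumed layout: three regions split at the start of 2011 and 2012.
STARTS = ["2011-01-01", "2012-01-01"]
sizes = partition_sizes(["2010-12-04", "2012-05-21", "2012-05-22"], STARTS)
```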
17.
HFOF - FileSizeDatePartitioner
1 partition (reduce) / day for initial import
Uneven reduce (partitions) due to data growth over time
  Reduce k: 2010-12-04 = 500MB
  Reduce n: 2012-05-22 = 5GB => slow and will result in a 5GB region
Balance reduce buckets based on input file sizes and the reduce key
Generate sub-partitions based on predefined size (e.g. 1GB)
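The balancing idea can be sketched as follows. This is not the FileSizeDatePartitioner itself, only an assumed reading of the slide: each day gets as many reducers as its input size requires at roughly 1GB each, and days are laid out in key order so the global reduce order is preserved. How keys are divided among the sub-partitions within one day is not described on the slide and is left out here.

```python
def reducer_layout(bytes_per_day, target_bytes=10**9):
    """Map each day to a contiguous range of reducer indices.

    A day gets ceil(size / target_bytes) reducers (at least one), and days
    are assigned in sorted order so reducer i only sees keys that sort
    before those of reducer i + 1.
    """
    layout = {}
    next_index = 0
    for day in sorted(bytes_per_day):
        count = max(1, -(-bytes_per_day[day] // target_bytes))  # integer ceil
        layout[day] = range(next_index, next_index + count)
        next_index += count
    return layout


# The two days from the slide: 500MB in 2010, 5GB in 2012.
layout = reducer_layout({"2012-05-22": 5 * 10**9, "2010-12-04": 500 * 10**6})
```

With the slide's numbers, the 2010 day keeps a single reducer while the 2012 day is spread over five, so no reducer handles much more than 1GB.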
18.
Processing
19.
Processing
Processing involves reading the input (files, tables, events), pre-aggregating it (reducing cardinality) and generating tables that can be queried in real time
1 year: 1B events => 100B data points indexed
Query => scan 365 data points (e.g. daily page views)
Processing could be either MR or real-time (e.g. Storm)
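The daily page views example can be sketched in a few lines. The event shape and function names are assumptions for illustration: raw events are reduced to one row per day at processing time, so a query for a date range only scans the pre-aggregated rows (365 of them for a year), however many events went in.

```python
from bisect import bisect_left
from collections import Counter


def preaggregate_daily(events):
    """Reduce (iso_timestamp, page) events to sorted (day, views) rows."""
    counts = Counter(timestamp[:10] for timestamp, _page in events)
    return sorted(counts.items())


def scan(table, start_day, stop_day):
    """Range scan over the sorted rows, start inclusive, stop exclusive."""
    days = [day for day, _views in table]
    return table[bisect_left(days, start_day):bisect_left(days, stop_day)]


events = [
    ("2012-05-21T10:00:00", "/a"),
    ("2012-05-21T11:30:00", "/b"),
    ("2012-05-22T09:15:00", "/a"),
]
table = preaggregate_daily(events)
one_day = scan(table, "2012-05-22", "2012-05-23")
```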
20.
Processing for OLAP semantics
GROUP BY (process, query)
COUNT, SUM, AVG, etc. (process, query)
SORT (process, query)
HAVING (mostly query, can define pre-process constraints)
21.
SaasBase vs. SQL Views Comparison
22.
reports.json entities definition
23.
Processing Performance
read, map, partition, combine, copy, sort, reduce, write
Read:
  Scan.setCaching() (I/O ~ buffer)
  Scan.setBatch() (avoid timeouts for abnormal input, e.g. 1M hits/visit)
  Even region distribution across cluster (distributes CPU, I/O)
Map:
  No unnecessary transformations: Bytes.toString(bytes) + Bytes.toBytes(string) (CPU)
  Avoid GC: new X() (CPU, Memory)
  Avoid system calls (context switching)
  Stripping unnecessary data (I/O)
24.
Processing Performance
Hot (in memory) vs. Cold (on disk, on network) data
Minimize I/O from disk/network
Single shot MR job: SuperProcessor
  Emit all groups from one map() call
Incremental processing
Data format
  YYYY-MM-DD prefixed rowkey (HH:mm for more granularity)
25.
Indexing
26.
HBase natural order: hierarchical representation
27.
Indexing - Why
Example: top 10 cities
~50K [country, city] combinations per day
Top 10 cities for 1 year => 365 (days) X 50K ~= 15M data points scanned
If you add gender => 30M
If you add Device, OS, Browser …
Might compress well, but think about the environment
How much energy would you spend for just top 10 cities?
* Image from: http://my.neutralexistence.com/images/Green-Earth.jpg
28.
Indexing with HBase
“10” < “2”
GROUP BY year, month, country, city ORDER BY visits DESC LIMIT 10
Lexicographic sorting
2012/05/USA/0000000000/
2012/05/USA/4294961296/San Francisco = 1000 visits*
2012/05/USA/4294961396/New York = 900 visits*
. . .
2012/05/USA/9999999999/
scan “t” startrow => “2012/05/USA/”, limit => 10
* Padding numbers for lexicographic sorting: 1000 -> Long.MAX_VALUE - 1000 = 4294961296
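The padding trick can be sketched without HBase by keeping row keys in a sorted list. Everything here is an assumption for illustration: the `MAX_COUNT` ceiling of 9999999999 matches the 10-digit width of the keys above but is not the constant used on the slide, and the helper names are hypothetical. Storing `MAX_COUNT - visits`, zero-padded to a fixed width, makes the largest counts sort first, so a top-N query becomes a prefix scan with a limit.

```python
from bisect import bisect_left

# Assumption: a 10-digit ceiling, chosen to match the key width above.
MAX_COUNT = 9_999_999_999


def index_key(year, month, country, visits, city):
    """Row key whose lexicographic order is 'most visits first'."""
    inverted = MAX_COUNT - visits
    return f"{year:04d}/{month:02d}/{country}/{inverted:010d}/{city}"


def build_index(visits_by_city, year, month, country):
    """Sorted (row key, visits) pairs, standing in for an HBase table."""
    return sorted(
        (index_key(year, month, country, visits, city), visits)
        for city, visits in visits_by_city.items()
    )


def top_n(index, prefix, n):
    """Scan from the prefix and stop after n rows or when the prefix ends."""
    keys = [key for key, _visits in index]
    start = bisect_left(keys, prefix)
    rows = []
    for key, visits in index[start:start + n]:
        if not key.startswith(prefix):
            break
        rows.append((key.rsplit("/", 1)[1], visits))
    return rows


INDEX = build_index(
    {"San Francisco": 1000, "New York": 900, "Chicago": 10, "Boston": 2},
    2012, 5, "USA",
)
top_two = top_n(INDEX, "2012/05/USA/", 2)
```

Without the fixed-width padding, the counts 10 and 2 would compare as strings, and "10" sorts before "2".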
29.
Query Engine
Always reads indexed, compact data
Query parsing
Scan strategy
  Single vs. multiple scans
  Start/stop rows (prefixes, index positions, etc.)
  Index selection (volatile indexes with incremental processing)
Deserialization
Post-aggregation, sorting, fuzzy-sorting etc.
Paging
Custom dimension/metric class loading
30.
Conclusions
OLAP semantics on a simple data model
Data as first class citizen
Domain Specific “Language” for Dimensions, Metrics, Aggregations
Tunable performance, resource allocation
Framework for vertical analytics systems
31.
Thank you!
Cosmin Lehene @clehene
http://hstack.org
Credits: Andrei Dragomir, Adrian Muraru, Andrei Dulvac, Raluca Podiuc, Tudor Scurtu, Bogdan Dragu, Bogdan Drutu
33.
OLAP 101 - Rollup
Country  Visits  Sale
USA      4       $50
Canada   1       $0
Rollup: SELECT COUNT(visits), SUM(sales) GROUP BY country
34.
OLAP 101 - Slicing
Date        Country  City     OS       Browser  Sale
2012-03-02  USA      NY       Windows  FF        0.0
2012-03-02  USA      NY       Windows  FF       10.0
2012-03-03  USA      SF       OSX      Chrome   25.0
2012-03-03  Canada   Ontario  Linux    Chrome    0.0
2012-03-04  USA      Chicago  OSX      Safari   15.0
Totals: 5 visits, 3 days; 50.0 in sales (3 sales)
2 countries: USA: 4, Canada: 1
4 cities: NY: 2, SF: 1
3 OS-es: Win: 2, OSX: 2
3 browsers: FF: 2, Chrome: 2
Filter or Segment or Slice (WHERE or HAVING)
35.
OLAP 101 - Sorting, TOP n
Input columns: Date, Country, City, OS, Browser, Sale
SELECT SUM(sales) as total GROUP BY browser ORDER BY total

Browser  total
Chrome   $25
Safari   $15
Firefox  $10
Editor's notes
How many HBase users?
Data as first class citizen
Check contrast on projector
Just like speed vs. space in general CS/algo. Queries always hit indexes.
Dimensions: read, transform, serialize, deserialize data attributes. Metrics: read/transform/aggregate/serialize. Constraints: ingestion filtering. Report: instrument dimension groups + metrics with aggregations, sorting.
QUERY ENGINE -> INDEX (always realtime)
Initial import/process and NEW reports (not covered) on historical data
18K regions, upgrade to 0.92
Diagram: HARD TO DIGEST (TOO MUCH INFO, TOO CONDENSED)
Process = aggregate, generate indexes (natural). Query = uses indexes, can do extra aggregation.
LEFT: report definition, NOT a QUERY. LIKE A VIEW - CREATED - THEN QUERIED
Inconsistent
Rowkey = dimensions group -> metrics (right)
GO BACK to EXPLAIN
>100K/sec/thread. REALTIME
Data analysts work with familiar concepts