SlideShare une entreprise Scribd logo
1  sur  12
Télécharger pour lire hors ligne
Youtube Data
                  Warehouse
Biswapesh Chattopadhyay
biswapesh@google.com
XLDB 2011




                                 Google Confidential and Proprietary
YTDW - Motivation & History
● Consolidated warehouse of Youtube data
● Videos, playbacks, summarized logs, etc.
● Very large (X PB uncompressed, Trillion row tables)
● High volume ETL (XXX TB processed / day)
● 100% Google Tech Stack:
  ○ Query: Oracle -> MySQL -> ColumnIO
  ○ ETL: Python -> Sawzall + Tenzing + Python
  ○ Reporting: Microstrategy -> ABI
● Key technologies: Sawzall, Tenzing, Dremel, ABI




                                               Google Confidential and Proprietary
YTDW - Overall Architecture



  Logs
                                        Dremel     ABI
                                        Service   Reports

               ETL
             (Sawzall,
 MySQL       Tenzing,      YTDW
  DB        Python, C++   (ColumnIO /
                MR)          GFS)
                                        Tenzing
                                        Service

 Bigtable




                                                     Google Confidential and Proprietary
YTDW - About Sawzall
●   Scripting language on Google MR framework
●   Sawzall vs MR-Saw
●   Built-in security for accessing sensitive logs data
●   Strong support for aggregation and complex
    computations
●   Read/write various formats
●   Procedural language
●   Open sourced!
●   YTDW Usage:
    ○ ETL of Youtube logs
    ○ Complex one-off logs analysis
                                                Google Confidential and Proprietary
YTDW - About Tenzing
● SQL on MR - Think HIVE, HadoopSQL
● Key strengths:
   ○ Strong SQL support
   ○ Highly scalable - built on Google MR
   ○ Read / write many formats
● Weaknesses:
   ○ Not ideal for complex procedural code
   ○ Higher latency than Dremel
   ○ Limited support for nested-repeated structures
● YTDW Usage:
   ○ ETL for non-logs data, denormalizations
   ○ Medium complexity analysis on YTDW dataGoogle Confidential and Proprietary
YTDW - About Dremel
● Current use in YTDW:
  ○ Reporting query engine
  ○ Interactive simple logs analysis
● Key Strengths
  ○ Very low latency
  ○ SQL support
  ○ Strong nested-relational support
  ○ Access to logs
● Limitations
  ○ More complex SQL constructs (joins, setops, ...)
  ○ Limited library of functions
  ○ Doesn't scale as much as MR
                                              Google Confidential and Proprietary
YTDW - Techonology
     Comparison

              Sawzall   Tenzing   Dremel

 Latency       High     Medium     Low

Scalability    High      High     Medium

   SQL         None      High     Medium

  Power        High     Medium     Low

                                   Google Confidential and Proprietary
YTDW Future: Query Engines
 ● Adding MR capabilities to Dremel
   ○ Scalable reliable shuffle
   ○ Materializing large result sets
   ○ Read / write multiple data formats
 ● Easier / more powerful analysis in Dremel
   ○ User defined scalar and table values functions
   ○ More SQL features:
      ■ Better support for joins
      ■ Analytic functions, set operators, etc.
 ● Long term for Dremel:
   ○ Completely replace Tenzing MR backend
   ○ Extend BigQuery service capabilities
                                            Google Confidential and Proprietary
YTDW - About ABI
●   Complete reporting and dashboarding solution
●   Built on Google stack
●   Tight integration with Dremel and ColumnIO
●   Google Visualizations, some Flash
●   Current use in YTDW:
    ○ Most reports and dashboards




                                            Google Confidential and Proprietary
YTDW - Misc Technologies
● Python
  ○ Glue code - drivers, wrappers, etc.
  ○ Simple small scale extracts
● Scheduler
  ○ In-house scheduling framework
  ○ Built internally by YTDW engineers
● C++ MapReduce
  ○ Used sparingly for complex cases not possible
     using Sawzall /Tenzing
● Query Rewriter
  ○ Sits between ABI and Dremel
  ○ Rewrites queries to be faster / cheaper
                                          Google Confidential and Proprietary
YTDW - Performance
Performance improvement strategies:
   ● Data model:
       ○ De-normalized aggregated materialized views
       ○ Range partitioning
   ● Query rewrite layer:
       ○ Use the right aggregated materialized view
       ○ Prune partitions based on data knowledge
   ● Reporting front end
     ○ Aggressive result caching (memcache)


                                              Google Confidential and Proprietary
YTDW - Q & A




               Google Confidential and Proprietary

Contenu connexe

Tendances

Steam Learn: An introduction to Redis
Steam Learn: An introduction to RedisSteam Learn: An introduction to Redis
Steam Learn: An introduction to Redisinovia
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Icebergkbajda
 
Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Ian Pointer
 
SysAdmin Working from Home? Tips to Automate MySQL, MariaDB, Postgres & MongoDB
SysAdmin Working from Home? Tips to Automate MySQL, MariaDB, Postgres & MongoDBSysAdmin Working from Home? Tips to Automate MySQL, MariaDB, Postgres & MongoDB
SysAdmin Working from Home? Tips to Automate MySQL, MariaDB, Postgres & MongoDBSeveralnines
 
«NoSQL Databases and Polyglot Persistence»
«NoSQL Databases and Polyglot Persistence»«NoSQL Databases and Polyglot Persistence»
«NoSQL Databases and Polyglot Persistence»Olga Lavrentieva
 
Google BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewGoogle BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewDoiT International
 
TechEvent Time Seriesd Databases
TechEvent Time Seriesd DatabasesTechEvent Time Seriesd Databases
TechEvent Time Seriesd DatabasesTrivadis
 
Webinar slides: How to Migrate from Oracle DB to MariaDB
Webinar slides: How to Migrate from Oracle DB to MariaDBWebinar slides: How to Migrate from Oracle DB to MariaDB
Webinar slides: How to Migrate from Oracle DB to MariaDBSeveralnines
 
Apache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack WorldApache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack WorldJihoon Son
 
Presto Summit 2018 - 10 - Qubole
Presto Summit 2018  - 10 - QubolePresto Summit 2018  - 10 - Qubole
Presto Summit 2018 - 10 - Qubolekbajda
 
Presto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 BostonPresto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 Bostonkbajda
 
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Modern Data Stack France
 
20160811 s301 e_prabhat
20160811 s301 e_prabhat20160811 s301 e_prabhat
20160811 s301 e_prabhatKumar Prabhat
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudShubham Tagra
 
Data- How Does It Work-
Data- How Does It Work-Data- How Does It Work-
Data- How Does It Work-Boyang Niu
 
Working with the Moodle Database: The Basics
Working with the Moodle Database: The BasicsWorking with the Moodle Database: The Basics
Working with the Moodle Database: The BasicsSeveralnines
 
How QBerg scaled to store data longer, query it faster
How QBerg scaled to store data longer, query it fasterHow QBerg scaled to store data longer, query it faster
How QBerg scaled to store data longer, query it fasterMariaDB plc
 

Tendances (20)

Steam Learn: An introduction to Redis
Steam Learn: An introduction to RedisSteam Learn: An introduction to Redis
Steam Learn: An introduction to Redis
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916Introduction To Spark - Durham LUG 20150916
Introduction To Spark - Durham LUG 20150916
 
NoSQL Databases
NoSQL DatabasesNoSQL Databases
NoSQL Databases
 
SysAdmin Working from Home? Tips to Automate MySQL, MariaDB, Postgres & MongoDB
SysAdmin Working from Home? Tips to Automate MySQL, MariaDB, Postgres & MongoDBSysAdmin Working from Home? Tips to Automate MySQL, MariaDB, Postgres & MongoDB
SysAdmin Working from Home? Tips to Automate MySQL, MariaDB, Postgres & MongoDB
 
NoSQL Databases
NoSQL DatabasesNoSQL Databases
NoSQL Databases
 
«NoSQL Databases and Polyglot Persistence»
«NoSQL Databases and Polyglot Persistence»«NoSQL Databases and Polyglot Persistence»
«NoSQL Databases and Polyglot Persistence»
 
Google BigQuery 101 & What’s New
Google BigQuery 101 & What’s NewGoogle BigQuery 101 & What’s New
Google BigQuery 101 & What’s New
 
TechEvent Time Seriesd Databases
TechEvent Time Seriesd DatabasesTechEvent Time Seriesd Databases
TechEvent Time Seriesd Databases
 
Webinar slides: How to Migrate from Oracle DB to MariaDB
Webinar slides: How to Migrate from Oracle DB to MariaDBWebinar slides: How to Migrate from Oracle DB to MariaDB
Webinar slides: How to Migrate from Oracle DB to MariaDB
 
Apache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack WorldApache Tajo on Swift: Bringing SQL to the OpenStack World
Apache Tajo on Swift: Bringing SQL to the OpenStack World
 
Presto Summit 2018 - 10 - Qubole
Presto Summit 2018  - 10 - QubolePresto Summit 2018  - 10 - Qubole
Presto Summit 2018 - 10 - Qubole
 
RubiX
RubiXRubiX
RubiX
 
Presto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 BostonPresto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 Boston
 
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
Datalab 101 (Hadoop, Spark, ElasticSearch) par Jonathan Winandy - Paris Spark...
 
20160811 s301 e_prabhat
20160811 s301 e_prabhat20160811 s301 e_prabhat
20160811 s301 e_prabhat
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
 
Data- How Does It Work-
Data- How Does It Work-Data- How Does It Work-
Data- How Does It Work-
 
Working with the Moodle Database: The Basics
Working with the Moodle Database: The BasicsWorking with the Moodle Database: The Basics
Working with the Moodle Database: The Basics
 
How QBerg scaled to store data longer, query it faster
How QBerg scaled to store data longer, query it fasterHow QBerg scaled to store data longer, query it faster
How QBerg scaled to store data longer, query it faster
 

En vedette

Home care in wa state with audio take 3 show
Home care in wa state with audio take 3 showHome care in wa state with audio take 3 show
Home care in wa state with audio take 3 showKatheryn Howell
 
Home Care In W Atate With Audio Take 3 Show
Home  Care In  W  Atate With Audio Take 3 ShowHome  Care In  W  Atate With Audio Take 3 Show
Home Care In W Atate With Audio Take 3 ShowKatheryn Howell
 
Cell organelles
Cell organellesCell organelles
Cell organellesTownview
 
Xldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalyticsXldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalyticsliqiang xu
 
Xldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocksXldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocksliqiang xu
 
Xldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastnerXldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastnerliqiang xu
 

En vedette (7)

Cellresp.
Cellresp.Cellresp.
Cellresp.
 
Home care in wa state with audio take 3 show
Home care in wa state with audio take 3 showHome care in wa state with audio take 3 show
Home care in wa state with audio take 3 show
 
Home Care In W Atate With Audio Take 3 Show
Home  Care In  W  Atate With Audio Take 3 ShowHome  Care In  W  Atate With Audio Take 3 Show
Home Care In W Atate With Audio Take 3 Show
 
Cell organelles
Cell organellesCell organelles
Cell organelles
 
Xldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalyticsXldb2011 tue 0940_facebook_realtimeanalytics
Xldb2011 tue 0940_facebook_realtimeanalytics
 
Xldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocksXldb2011 wed 1415_andrew_lamb-buildingblocks
Xldb2011 wed 1415_andrew_lamb-buildingblocks
 
Xldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastnerXldb2011 tue 1055_tom_fastner
Xldb2011 tue 1055_tom_fastner
 

Similaire à Xldb2011 tue 1120_youtube_datawarehouse

Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | EnglishOmid Vahdaty
 
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdf
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdfDataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdf
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdfMiguel Angel Fajardo
 
MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017Severalnines
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
Introducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQLIntroducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQLMariaDB plc
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned Omid Vahdaty
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dan Lynn
 
SlamData Overview 9-1-2014
SlamData Overview 9-1-2014SlamData Overview 9-1-2014
SlamData Overview 9-1-2014carrjc2
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSKimmo Kantojärvi
 
Red Hat Gluster Storage - Direction, Roadmap and Use-Cases
Red Hat Gluster Storage - Direction, Roadmap and Use-CasesRed Hat Gluster Storage - Direction, Roadmap and Use-Cases
Red Hat Gluster Storage - Direction, Roadmap and Use-CasesRed_Hat_Storage
 
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with SchlumbergerGet Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumbergerinside-BigData.com
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1Ruslan Meshenberg
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the CloudAmihay Zer-Kavod
 
Big data on google platform dev fest presentation
Big data on google platform   dev fest presentationBig data on google platform   dev fest presentation
Big data on google platform dev fest presentationPrzemysław Pastuszka
 
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...Pôle Systematic Paris-Region
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Eric Sun
 

Similaire à Xldb2011 tue 1120_youtube_datawarehouse (20)

Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdf
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdfDataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdf
DataEng Mad - 03.03.2020 - Tibero 30-min Presentation.pdf
 
MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017MySQL Cluster (NDB) - Best Practices Percona Live 2017
MySQL Cluster (NDB) - Best Practices Percona Live 2017
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Introducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQLIntroducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQL
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 
SlamData Overview 9-1-2014
SlamData Overview 9-1-2014SlamData Overview 9-1-2014
SlamData Overview 9-1-2014
 
Running MySQL in AWS
Running MySQL in AWSRunning MySQL in AWS
Running MySQL in AWS
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
 
BigData Hadoop
BigData Hadoop BigData Hadoop
BigData Hadoop
 
Red Hat Gluster Storage - Direction, Roadmap and Use-Cases
Red Hat Gluster Storage - Direction, Roadmap and Use-CasesRed Hat Gluster Storage - Direction, Roadmap and Use-Cases
Red Hat Gluster Storage - Direction, Roadmap and Use-Cases
 
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with SchlumbergerGet Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the Cloud
 
Big data on google platform dev fest presentation
Big data on google platform   dev fest presentationBig data on google platform   dev fest presentation
Big data on google platform dev fest presentation
 
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 

Plus de liqiang xu

浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909liqiang xu
 
Csrf攻击原理及防御措施
Csrf攻击原理及防御措施Csrf攻击原理及防御措施
Csrf攻击原理及防御措施liqiang xu
 
Xldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inXldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inliqiang xu
 
Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)liqiang xu
 
大话Php之性能
大话Php之性能大话Php之性能
大话Php之性能liqiang xu
 
Nginx internals
Nginx internalsNginx internals
Nginx internalsliqiang xu
 
1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)liqiang xu
 
1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)liqiang xu
 

Plus de liqiang xu (9)

浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909浅谈灰度发布在贴吧的应用 支付宝 20130909
浅谈灰度发布在贴吧的应用 支付宝 20130909
 
Csrf攻击原理及防御措施
Csrf攻击原理及防御措施Csrf攻击原理及防御措施
Csrf攻击原理及防御措施
 
Hdfs comics
Hdfs comicsHdfs comics
Hdfs comics
 
Xldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inXldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_in
 
Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)Selenium私房菜(新手入门教程)
Selenium私房菜(新手入门教程)
 
大话Php之性能
大话Php之性能大话Php之性能
大话Php之性能
 
Nginx internals
Nginx internalsNginx internals
Nginx internals
 
1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)1.4亿在线背后的故事(2)
1.4亿在线背后的故事(2)
 
1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)1.4亿在线背后的故事(1)
1.4亿在线背后的故事(1)
 

Dernier

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 

Dernier (20)

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 

Xldb2011 tue 1120_youtube_datawarehouse

  • 1. Youtube Data Warehouse Biswapesh Chattopadhyay biswapesh@google.com XLDB 2011 Google Confidential and Proprietary
  • 2. YTDW - Motivation & History ● Consolidated warehouse of Youtube data ● Videos, playbacks, summarized logs, etc. ● Very large (X PB uncompressed, Trillion row tables) ● High volume ETL (XXX TB processed / day) ● 100% Google Tech Stack: ○ Query: Oracle -> MySQL -> ColumnIO ○ ETL: Python -> Sawzall + Tenzing + Python ○ Reporting: Microstrategy -> ABI ● Key technologies: Sawzall, Tenzing, Dremel, ABI Google Confidential and Proprietary
  • 3. YTDW - Overall Architecture Logs Dremel ABI Service Reports ETL (Sawzall, MySQL Tenzing, YTDW DB Python, C++ (ColumnIO / MR) GFS) Tenzing Service Bigtable Google Confidential and Proprietary
  • 4. YTDW - About Sawzall ● Scripting language on Google MR framework ● Sawzall vs MR-Saw ● Built-in security for accessing sensitive logs data ● Strong support for aggregation and complex computations ● Read/write various formats ● Procedural language ● Open sourced! ● YTDW Usage: ○ ETL of Youtube logs ○ Complex one-off logs analysis Google Confidential and Proprietary
  • 5. YTDW - About Tenzing ● SQL on MR - Think HIVE, HadoopSQL ● Key strengths: ○ Strong SQL support ○ Highly scalable - built on Google MR ○ Read / write many formats ● Weaknesses: ○ Not ideal for complex procedural code ○ Higher latency than Dremel ○ Limited support for nested-repeated structures ● YTDW Usage: ○ ETL for non-logs data, denormalizations ○ Medium complexity analysis on YTDW dataGoogle Confidential and Proprietary
  • 6. YTDW - About Dremel ● Current use in YTDW: ○ Reporting query engine ○ Interactive simple logs analysis ● Key Strengths ○ Very low latency ○ SQL support ○ Strong nested-relational support ○ Access to logs ● Limitations ○ More complex SQL constructs (joins, setops, ...) ○ Limited library of functions ○ Doesn't scale as much as MR Google Confidential and Proprietary
  • 7. YTDW - Techonology Comparison Sawzall Tenzing Dremel Latency High Medium Low Scalability High High Medium SQL None High Medium Power High Medium Low Google Confidential and Proprietary
  • 8. YTDW Future: Query Engines ● Adding MR capabilities to Dremel ○ Scalable reliable shuffle ○ Materializing large result sets ○ Read / write multiple data formats ● Easier / more powerful analysis in Dremel ○ User defined scalar and table values functions ○ More SQL features: ■ Better support for joins ■ Analytic functions, set operators, etc. ● Long term for Dremel: ○ Completely replace Tenzing MR backend ○ Extend BigQuery service capabilities Google Confidential and Proprietary
  • 9. YTDW - About ABI ● Complete reporting and dashboarding solution ● Built on Google stack ● Tight integration with Dremel and ColumnIO ● Google Visualizations, some Flash ● Current use in YTDW: ○ Most reports and dashboards Google Confidential and Proprietary
  • 10. YTDW - Misc Technologies ● Python ○ Glue code - drivers, wrappers, etc. ○ Simple small scale extracts ● Scheduler ○ In-house scheduling framework ○ Built internally by YTDW engineers ● C++ MapReduce ○ Used sparingly for complex cases not possible using Sawzall /Tenzing ● Query Rewriter ○ Sits between ABI and Dremel ○ Rewrites queries to be faster / cheaper Google Confidential and Proprietary
  • 11. YTDW - Performance Performance improvement strategies: ● Data model: ○ De-normalized aggregated materialized views ○ Range partitioning ● Query rewrite layer: ○ Use the right aggregated materialized view ○ Prune partitions based on data knowledge ● Reporting front end ○ Aggressive result caching (memcache) Google Confidential and Proprietary
  • 12. YTDW - Q & A Google Confidential and Proprietary