



Hadoop in a Mission Critical Environment
November 2011
Jim Haas, Director Data Warehouse ETL
Introduction




2
About me

    » Worked at AT&T for 16 years building large scale
      Financial Systems: Billers, GL, Data Warehouse,
      Cost Modeling Systems for Network Services,
      Business and Consumer Business Units
    » Agency.com: several key clients – USSB, DIRECTV,
      GMACCM, Keyspan
    » CNET/CBSi – Director of Data Warehouse ETL
      systems. Have worked at CNET/CBSi since 2005
      re-architecting Data Warehouse systems.




3
Who is CBS Interactive?




4
Top 20 Web Property
        Global Unique User Ranking (000)               US Unique User Ranking (000)

     1  Google Sites      1,066,695                     1  Google Sites      184,582
     2  Microsoft Sites     914,237                     2  Microsoft Sites   178,014
     3  Facebook            769,655                     3  Yahoo! Sites      177,123
     4  Yahoo! Sites        701,378                     4  Facebook          163,021
     5  Wikimedia Sites     454,529                     5  AOL, Inc.         105,861
     6  Amazon Sites        319,548                     6  Amazon Sites      103,709
     7  Apple Inc.          264,537                     7  Ask Network        91,994
     8  Tencent Inc.        245,220                     8  Turner Digital     89,981
     9  CBS Interactive     242,571                     9  Glam Media         88,303
    10  Ask Network         240,805                    10  Wikimedia Sites    83,836
                                                       11  CBS Interactive    83,463




        Source: Global ranking based on comScore Worldwide MediaMetrix for the month of September 2011. US ranking based on
        comScore US MediaMetrix for the month of September 2011.
5
Data Warehouse at CBSi



     [Diagram: Data Warehouse, shown with Web Events/Click-stream,
      Internal Systems/Content Management, and External Data Sources]




6
Intro – Business functions we support

    » Web site/media metrics (BI)
    » Website re-design A/B testing
    » Financial billers (download, clicks, partners, ads)
    » Ad event tracking
    » Data feeds for sites
    » External reporting
    » Custom event tracking servers
      – clicks, page views, downloads,
      – streaming video events, ad events, etc.


7
Intro - Data Warehouse Back-End

    » Architect
    » Design
    » Code
    » Operations
    » Early adoption admin
    » Hardware recommendations
    » Technology, Clusters, Database recommendations




8
Intro - Data Warehouse Back-End

    » Some interesting facts:
    - Run over 800 jobs a day
    - Peak days exceed 500 million events per day with current
      processing; next quarter that will reach 1 billion per day
    - Events can spike to 30,000 per second
    - Build/maintain over 150 dimensions
    - Build 10 fact tables
    - Make detailed data available with >24 months retention, up from
      2 months previously
    - Integrated core DW plus 15 data marts
    - Build/maintain > 600 database tables
      - Facts: 10 main tables, 175 fields
      - Dims: 150 tables, 755 fields
      - Summary: 200 tables, 3,432 fields
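
As a rough check on the volumes listed above, the daily totals can be converted to per-second rates and compared with the 30,000 per second spike. This is a back-of-the-envelope sketch: it assumes events arrive evenly over the day, which the slides do not claim, and the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(events_per_day: int) -> float:
    """Average events per second if the day's events arrived evenly."""
    return events_per_day / SECONDS_PER_DAY


def peak_to_average(peak_per_second: float, events_per_day: int) -> float:
    """How many times larger the spike rate is than the daily average."""
    return peak_per_second / average_rate(events_per_day)


current = average_rate(500_000_000)       # current peak-day volume
projected = average_rate(1_000_000_000)   # next quarter's expected volume
spike_ratio = peak_to_average(30_000, 500_000_000)

print(f"current average:   {current:,.0f} events/s")
print(f"projected average: {projected:,.0f} events/s")
print(f"spike vs. average: {spike_ratio:.1f}x")
```

Under that even-arrival assumption, a 30,000 per second spike is roughly five times the average rate of a 500 million event day, which is why sizing for the daily average alone would not be enough.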


9
Intro – Problem Domain




10
Intro – Problem Domain

     » Growth curve, data size delta over time
       – Database : from 3 to 300 TB in 3 years
       – Cluster : from 1 TB to ~1 PB in 3 years
       – Events : from 50 to 150 Billion per year

     » Special events that cause us angst:
       –   Tiger Woods, iPhone launches, March Madness,
       –   Football season, Cyber Monday,
       –   ISP/broadband slowdowns (video QoS),
       –   Kate's dress, Osama, E3, Comdex,
       –   Tom Brady injured, etc.

     » Old systems were bleeding; it was really difficult to support
       new volumes, requirements, uses
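
The growth figures on this slide can be turned into an approximate year-over-year multiplier. The sketch below assumes smooth compound growth over the three years and treats 1 PB as 1,000 TB; both are simplifications, not statements from the slides.

```python
def annual_growth_factor(start: float, end: float, years: float) -> float:
    """Constant yearly multiplier that takes `start` to `end` in `years`."""
    return (end / start) ** (1 / years)


# Database: 3 TB -> 300 TB in 3 years
db_factor = annual_growth_factor(3, 300, 3)

# Cluster: 1 TB -> ~1 PB (taken here as 1,000 TB) in 3 years
cluster_factor = annual_growth_factor(1, 1000, 3)

print(f"database grew about {db_factor:.1f}x per year")
print(f"cluster grew about {cluster_factor:.1f}x per year")
```

In other words, the database roughly quadrupled to quintupled each year and the cluster grew about tenfold each year, if the growth was even.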
11
Intro – other logistical problems

     » Re-Architecture is too big for a waterfall approach,
       must be phased
     » Other surprise/evolved goals/intermediate objectives:
       – colo moves
       – new business functions (tracking all streaming video for
         CBS)
       – Swapping in a new database for Data Warehouse, etc.

     » Oh yeah, don’t plan on taking downtime




12
Re-Architecture Goals

     » Fix I/O-bound processing
     » Get more CPU horsepower
     » Move away from proprietary systems (inaccessibility)
     » Position for more agile change
     » Adapt to a changing organization
     » Deal with Legacy code
     » Do all of the above economically




13
Strategy




14
Re-Architecture Strategy

     » Build
     » Buy
     » Open Source
     » Service




15
Re-Architecture Strategy

     » POC/Proof of concept
     » Rules of engagement




16
Re-Architecture Strategy

     » General Tactics
       –   Code/ re-write
       –   Divide and conquer
       –   Moving parts (more or less)
       –   Paint the ship while it’s moving
       –   Do the hard stuff first




17
Re-Architecture Strategy

     » General Strategy
       – Faster, better, cheaper?




18
ETL Tactics

     » ETL or ELT?
     » System Functions
       1. Parsers – most complex/time consuming
       2. History file creation/DB loads – reliable
       3. Lookups – shared memory
       4. Big dimensions – type 2 dimension > 20 billion rows
       5. Sessionize – complex reducer
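
To make the "Sessionize" step more concrete, here is a minimal sketch of reducer-style sessionization. It is not the team's implementation: the 30-minute inactivity timeout, the `(user_id, timestamp)` record shape, and the function name are all assumptions, and the input is assumed to arrive already sorted by user and then by timestamp.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, Tuple

SESSION_TIMEOUT_SECONDS = 30 * 60  # assumed inactivity cutoff


def sessionize(
    events: Iterable[Tuple[str, int]],
    timeout: int = SESSION_TIMEOUT_SECONDS,
) -> Iterator[Tuple[str, int, int, int]]:
    """Yield (user_id, session_start, session_end, event_count).

    `events` must be sorted by user_id, then by timestamp (seconds).
    A new session starts when the gap since the previous event
    for the same user exceeds `timeout`.
    """
    for user_id, group in groupby(events, key=itemgetter(0)):
        start = last = None
        count = 0
        for _, ts in group:
            if last is not None and ts - last > timeout:
                # Gap too large: close the current session.
                yield (user_id, start, last, count)
                start, count = ts, 0
            if start is None:
                start = ts
            last = ts
            count += 1
        if count:
            yield (user_id, start, last, count)


# Hypothetical sample input: user "a" has two sessions, user "b" has one.
sample = [("a", 0), ("a", 100), ("a", 5000), ("b", 10)]
sessions = list(sessionize(sample))
for session in sessions:
    print(session)
```

The reducer is "complex" in the sense that it carries state across records for the same key, which is why the ordering of the input matters.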




19
Business/Other tactics

     » Disaggregate system to allow re-architecting pieces
     » Build bridges
     » Begin with easiest SLA
     » Start with most challenging data
     » Plan for live soft launches in parallel
     » Go for high resource (CPU/IO) elements first




20
Testing tactics

     » Data centric testing
     » Properly abstracted controls
     » Tools
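
One way to read "data centric testing" is to compare the output of the old and new pipelines on the same input while both run in parallel. The sketch below is a hypothetical illustration of that idea, not the team's framework: it assumes each pipeline's output has been reduced to per-key counts, and the `reconcile` name and tolerance parameter are invented for the example.

```python
from typing import Dict, Tuple


def reconcile(
    legacy: Dict[str, int],
    candidate: Dict[str, int],
    tolerance: float = 0.0,
) -> Dict[str, Tuple[int, int]]:
    """Return {key: (legacy_count, candidate_count)} for keys that differ.

    A key is reported when the absolute difference exceeds
    `tolerance` times the larger of the two counts. Keys missing
    from one side are treated as a count of zero.
    """
    mismatches = {}
    for key in sorted(set(legacy) | set(candidate)):
        old = legacy.get(key, 0)
        new = candidate.get(key, 0)
        allowed = tolerance * max(old, new)
        if abs(old - new) > allowed:
            mismatches[key] = (old, new)
    return mismatches


# Hypothetical per-site page view counts from the two pipelines.
legacy_counts = {"site-a": 100, "site-b": 50}
new_counts = {"site-a": 100, "site-b": 49, "site-c": 1}

exact = reconcile(legacy_counts, new_counts)
loose = reconcile(legacy_counts, new_counts, tolerance=0.05)
print(exact)
print(loose)
```

A check of this kind tests the data itself, independent of how either pipeline is implemented.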




21
Release Tactics

     » 16 releases in 24 months
     » Allow parallel operation/soft launch capability
     » Put bridges in place




22
Hadoop Skunk Works

     » We do planning, purchasing, setup, admin, and control until
       stable
     » Plan for turnover to central admin
     » Plan for multi-tenancy




23
Hadoop Tactics/ Order of sophistication

     » Hadoop Streaming
     » Mappers
     » Parallelized Collector
     » Lookups
     » Sessionize/Complex Reducer
     » Hadoop Ecosystem
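
The first steps in this progression (Streaming, mappers, lookups) can be illustrated with a small mapper. With Hadoop Streaming, a mapper is any program that reads lines from standard input and writes tab-separated key/value lines to standard output. The sketch below is illustrative only: the three-field input layout, the site codes, and the in-memory lookup table are assumptions, not details from the slides.

```python
from typing import Dict, Iterable, Iterator


def map_lines(lines: Iterable[str], site_lookup: Dict[str, str]) -> Iterator[str]:
    """Turn raw 'timestamp<TAB>user_id<TAB>site_code' lines into
    'user_id<TAB>timestamp<TAB>site_name' lines keyed by user.

    Malformed lines are skipped; unknown site codes map to 'UNKNOWN'.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip records that do not parse
        timestamp, user_id, site_code = fields
        site_name = site_lookup.get(site_code, "UNKNOWN")
        yield f"{user_id}\t{timestamp}\t{site_name}"


# In-memory demo. Under Hadoop Streaming, `lines` would be sys.stdin
# and each yielded string would be printed to stdout.
sample = [
    "1320000000\tu1\ts1\n",
    "malformed line\n",
    "1320000060\tu2\ts9\n",
]
lookup = {"s1": "site-one"}  # hypothetical lookup table
mapped = list(map_lines(sample, lookup))
for record in mapped:
    print(record)
```

Keeping the mapping logic in a plain function over an iterable of lines makes it testable without a cluster.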




24
Hadoop Ecosystem

     » M/R Streaming
     » HDFS
     » CDH2 GA
     » Hive
     » CDH3
     » Other groups: Pig, HBase, ZooKeeper
     » SCM




25
Partner Strategy

     » Internal System Administration
     » Internal Platform Infrastructure Group
     » External – Cloudera
     » Internal Hadoop Admin – turnover control
     » Management Support/$




26
Re-Architecture Strategy

     » Mitigating risk
       –   Try not to bleed
       –   Try not to do it all ourselves
       –   Get a good application/job management tool
       –   Get/build a good test framework
       –   Do lots of testing
       –   Get support when needed




27
Re-Architecture Strategy

     » Dealing with the real world
       – New business requirements (e.g. tracking streaming video)
       – Colo moves




28
Results




29
CBSi Hadoop Cluster

     [Diagram components: Hadoop cluster (1.0 PB, CDH2/3); ETL Client;
      Web Tracking Servers; External Data Sources; Internal BU Data
      Sources; Other internal systems; Data Warehouse Database (250 TB);
      Reporting]


30
Goals Achieved

     » Meeting SLAs more reliably - Run time reduction
       – cut 8 hours from nightly batch, can process orders of
         magnitude more volume

     » Relative cost reduction
     » Fault tolerance
     » Easily scalable/upgradeable
     » Economically scalable/upgradeable
     » Reliable components
     » More manageable/maintainable
     » Less reliant on proprietary systems
31
In Summary

     » Hadoop is robust for mission critical processing
     » Fault tolerance is a reality
     » We’ve had excellent experience with stability of the
       architecture
     » Scalability is practically automatic
     » We’ve learned to plan ahead with scaling to avoid
       running at too high a percentage of space
       utilization
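
The point about planning ahead for space utilization can be expressed as a simple projection. This is a sketch under stated assumptions: linear daily growth, an 80% utilization ceiling, and the sample usage and growth numbers are all hypothetical; only the 1.0 PB capacity comes from the slides.

```python
import math


def days_until_threshold(
    capacity_tb: float,
    used_tb: float,
    daily_growth_tb: float,
    threshold: float = 0.8,  # assumed utilization ceiling
) -> int:
    """Days until usage reaches `threshold` of capacity, assuming
    usage grows by a constant `daily_growth_tb` each day."""
    limit = capacity_tb * threshold
    if used_tb >= limit:
        return 0  # already at or above the ceiling
    if daily_growth_tb <= 0:
        raise ValueError("daily_growth_tb must be positive")
    return math.ceil((limit - used_tb) / daily_growth_tb)


# Hypothetical: a 1,000 TB cluster, 600 TB used, growing 2 TB per day.
runway = days_until_threshold(1000, 600, 2)
print(f"{runway} days until 80% utilization")
```

A projection like this gives a lead time for ordering and installing nodes before the cluster gets uncomfortably full.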




32
In Summary

     » The Team
       – Jim Haas, Dan Lescohier, Michael Sun, Ron Mahoney,
         Batu Ulug, Slavomir Krysiak, Richard Zhang

     » Management Support
       – Steph Lone, Guy Bayes


     » The ETL package: Lumberjack
       – lumberjack@cbsinteractive.com




33

 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

Hadoop in a Mission Critical Environment

  • 1. Thursdays 9:00 ET/PT Hadoop in a Mission Critical Environment November 2011 Jim Haas, Director Data Warehouse ETL
  • 3. About me » Worked at AT&T for 16 years building large scale Financial Systems: Billers, GL, Data Warehouse, Cost Modeling Systems for Network Services, Business and Consumer Business Units » Agency.com: several key clients – USSB, DIRECTV, GMACCM, Keyspan » CNET/CBSi – Director of Data Warehouse ETL systems. Have worked at CNET/CBSi since 2005 re-architecting Data Warehouse systems.
  • 4. Who is CBS Interactive?
  • 5. Top 20 Web Property

        Rank  Global Unique User Ranking (000)     US Unique User Ranking (000)
         1    Google Sites       1,066,695         Google Sites        184,582
         2    Microsoft Sites      914,237         Microsoft Sites     178,014
         3    Facebook             769,655         Yahoo! Sites        177,123
         4    Yahoo! Sites         701,378         Facebook            163,021
         5    Wikimedia Sites      454,529         AOL, Inc.           105,861
         6    Amazon Sites         319,548         Amazon Sites        103,709
         7    Apple Inc.           264,537         Ask Network          91,994
         8    Tencent Inc.         245,220         Turner Digital       89,981
         9    CBS Interactive      242,571         Glam Media           88,303
        10    Ask Network          240,805         Wikimedia Sites      83,836
        11                                         CBS Interactive      83,463

        Source: Global ranking based on comScore Worldwide MediaMetrix for the month of September 2011. US ranking based on comScore US MediaMetrix for the month of September 2011.
  • 6. Data Warehouse at CBSi [Diagram labels: Data Warehouse; External Data Sources; Web Events/Click-stream; Internal Systems/Content Management]
  • 7. Intro – Business functions we support » Web site/media metrics (BI) » Website re-design A/B testing » Financial billers (download, clicks, partners, ads) » Ad event tracking » Data feeds for sites » External reporting » Custom event tracking servers – clicks, page views, downloads, streaming video events, ad events, etc.
  • 8. Intro - Data Warehouse Back-End » Architect » Design » Code » Operations » Early adoption admin » Hardware recommendations » Technology, Clusters, Database recommendations
  • 9. Intro - Data Warehouse Back-End » Some interesting facts: - Run over 800 jobs a day - Peak days are over 500 million events per day with current processing; next quarter will be 1 billion per day - Events can spike at 30,000 per second - Build/maintain over 150 dimensions - Build 10 fact tables - Make detailed data available for >24 months retention, up from 2 months previously - Integrated core DW plus 15 data marts - Build/maintain > 600 database tables - Facts: 10 main tables, 175 fields - Dims: 150 tables, 755 fields - Summary: 200 tables, 3432 fields
  • 10. Intro – Problem Domain
  • 11. Intro – Problem Domain » Growth curve, data size delta over time – Database: from 3 to 300 TB in 3 years – Cluster: from 1 TB to ~1 PB in 3 years – Events: from 50 to 150 billion per year » Special events that cause us angst: – Tiger Woods, iPhone launches, March Madness, – Football season, Cyber Monday, – ISP/broadband slowdowns (video QoS), – Kate's dress, Osama, E3, Comdex, – Tom Brady injured, etc. » Old systems were bleeding; real difficult to support new volumes, requirements, uses
  • 12. Intro – other logistical problems » Re-architecture is too big for a waterfall approach, must be phased » Other surprise/evolved goals/intermediate objectives: – colo moves – new business functions (tracking all streaming video for CBS) – swapping in a new database for Data Warehouse, etc. » Oh yeah, don’t plan on taking down time
  • 13. Re-Architecture Goals » Fix I/O-bound processing » Get more CPU horsepower » Move away from proprietary systems (inaccessibility) » Position for more agile change » Adapt to a changing organization » Deal with legacy code » Do all of the above economically
  • 15. Re-Architecture Strategy » Build » Buy » Open Source » Service
  • 16. Re-Architecture Strategy » POC/Proof of concept » Rules of engagement
  • 17. Re-Architecture Strategy » General Tactics – Code/re-write – Divide and conquer – Moving parts (more or less) – Paint the ship while it’s moving – Do the hard stuff first
  • 18. Re-Architecture Strategy » General Strategy – Faster, better, cheaper?
  • 19. ETL Tactics » ETL or ELT? » System Functions 1. Parsers – most complex/time consuming 2. History file creation/DB loads – reliable 3. Lookups – shared memory 4. Big dimensions – type 2 dimension > 20 billion rows 5. Sessionize – complex reducer
  • 20. Business/Other tactics » Disaggregate system to allow re-architecting pieces » Build bridges » Begin with easiest SLA » Start with most challenging data » Plan for live soft launches in parallel » Go for high resource (CPU/IO) elements first
  • 21. Testing tactics » Data centric testing » Properly abstracted controls » Tools
  • 22. Release Tactics » 16 releases in 24 months » Allow parallel operation/soft launch capability » Put bridges in place
  • 23. Hadoop Skunk Works » We do planning, purchasing, setup, admin, control till stable » Plan for turnover to central admin » Plan for multi-tenancy
  • 24. Hadoop Tactics/Order of sophistication » Hadoop Streaming » Mappers » Parallelized Collector » Lookups » Sessionize/Complex Reducer » Hadoop Ecosystem
  • 25. Hadoop Ecosystem » M/R Streaming » HDFS » CDH2 GA » Hive » CDH3 » Other groups: Pig, HBase, ZooKeeper » SCM
  • 26. Partner Strategy » Internal System Administration » Internal Platform Infrastructure Group » External – Cloudera » Internal Hadoop Admin – turnover control » Management Support/$
  • 27. Re-Architecture Strategy » Mitigating risk – Try not to bleed – Try not to do it all ourselves – Get a good application/job management tool – Get/build a good test framework – Do lots of testing – Get support when needed
  • 28. Re-Architecture Strategy » Dealing with the real world – New business requirements (e.g. tracking streaming video) – Colo moves
  • 30. CBSi Hadoop Cluster [Diagram labels: External Data Sources; Other internal systems; Web Tracking Servers; Internal BU Data Sources; ETL Client; Hadoop (1.0 PB, CDH2/3); Data Warehouse Database (250 TB); Reporting]
  • 31. Goals Achieved » Meeting SLAs more reliably - Run time reduction – cut 8 hours from nightly batch, can process orders of magnitude more volume » Relative cost reduction » Fault tolerance » Easily scalable/upgradeable » Economically scalable/upgradeable » Reliable components » More manageable/maintainable » Less reliant on proprietary systems
  • 32. In Summary » Hadoop is robust for mission critical processing » Fault tolerance is a reality » We’ve had excellent experience with stability of the architecture » Scalability is practically automatic » We’ve learned to plan ahead with scaling to avoid running at too high a percentage of space utilization
  • 33. In Summary » The Team – Jim Haas, Dan Lescohier, Michael Sun, Ron Mahoney, Batu Ulug, Slavomir Krysiak, Richard Zhang » Management Support – Steph Lone, Guy Bayes » The ETL package: Lumberjack – lumberjack@cbsinteractive.com
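
The slides name the sessionize step only as a "complex reducer" and do not show how Lumberjack implements it. As a rough illustration of what such a reducer does, here is a minimal Python sketch in the style of a Hadoop Streaming reducer. The record layout (`user_id<TAB>epoch_seconds`), the 30-minute inactivity timeout, and the names `parse_line` and `sessionize` are all assumptions made for this example, not details from the talk.

```python
from itertools import groupby
from operator import itemgetter

# Assumed inactivity timeout; the slides do not state the real rule.
SESSION_GAP_SECONDS = 30 * 60


def parse_line(line):
    """Parse one 'user_id<TAB>epoch_seconds' record (hypothetical layout)."""
    user_id, ts = line.rstrip("\n").split("\t")
    return user_id, int(ts)


def sessionize(records, gap=SESSION_GAP_SECONDS):
    """Yield (user_id, session_no, start_ts, end_ts, event_count) tuples.

    `records` must arrive grouped by user_id, which is how a streaming
    reducer sees its input after the shuffle/sort phase.
    """
    for user_id, group in groupby(records, key=itemgetter(0)):
        # Sorting per user in memory keeps the sketch short; a production
        # job would more likely rely on a secondary sort by timestamp.
        timestamps = sorted(ts for _, ts in group)
        session_no, start, prev, count = 1, timestamps[0], timestamps[0], 0
        for ts in timestamps:
            if ts - prev > gap:
                # Gap exceeded: close the current session and open a new one.
                yield user_id, session_no, start, prev, count
                session_no, start, count = session_no + 1, ts, 0
            prev = ts
            count += 1
        yield user_id, session_no, start, prev, count
```

Used as a streaming reducer, the function would be fed `parse_line(line) for line in sys.stdin` and each yielded tuple written back out as a tab-separated line. Any boundary rule beyond the simple timeout (campaign changes, midnight cut-offs, and so on) would go inside the inner loop.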

Editor's Notes

  1. CBSi has over 300 web sites. Notables: Gamespot, CNET, CBSSports, ZOL, PCHome, TV.com
  2. CBS is a top premium content company on the web
  3. 50,000-foot view of the DW at CBSi. Simply, the CBSi DW is the sum of all: CBSi clickstream/event data; almost all internal systems (in functional/organizational data marts); plus external data (mobile, geo-location data, etc.). Cover high-level DW functions: collect, cleanse, categorize, transform, store, feed.
  4. In summary, CBSi needs the DW for: metrics for informed design; data for sites; data for billers/deals. These are not mutually exclusive; they can overlap.
  5. We focus on operations/where the rubber meets the road. Some interesting facts: run over 800 jobs a day; peak days are over 500 million events per day with current processing; events can spike at 30,000 per second; build/maintain over 150 dimensions; build 10 fact tables; make detailed data available for >24 months retention, up from 2 months previously; integrated core DW plus 15 data marts; build/maintain > 600 database tables (facts: 10 main tables, 175 fields; dims: 150 tables, 755 fields; summary: 200 tables, 3432 fields).
  6. We focus on operations/where the rubber meets the road. Some interesting facts: run over 800 jobs a day; peak days are over 500 million events per day with current processing; events can spike at 30,000 per second; build/maintain over 150 dimensions; build 10 fact tables; make detailed data available for >24 months retention, up from 2 months previously; integrated core DW plus 15 data marts; build/maintain > 600 database tables (facts: 12 main tables, 175 fields; dims: 150 tables, 755 fields; summary: 200 tables, 3432 fields).
  7. 2008: CNET/CBS merge. 2009/2010: video tracking. 2011: only 3 quarters. 2012: ad events will be processed.
  8. From a cluster/framework perspective: the goal is to get a framework/infrastructure that deals with as many of these as possible. Some of the goals obviously cannot be solved by the cluster framework alone.
  9. Make a distinction between ETL and framework. We have a predilection for building; CNET has a history of build and open source for solutions. But we have purchased some technology for obvious reasons: databases, reporting, job mgt., etc. We had already begun building our own ETL. We ruled out using a service such as Amazon EC2 for a few strategic reasons: we wanted data inside our walls; we wanted control over performance; we perceived running in-house as more cost effective.
  10. We’ve done several POCs in the last 4 years (job mgt., database). Due to conditions, we modified our approach after doing paper evaluations of available cluster solutions. We decided to only do a POC of Hadoop, skunk works style, focused on what we really needed: parallelism; HDFS; scalability; extensibility; rationalizing the costs/benefits; stability and frameworks.
  11. Re-architecture also involved recoding. Section off architectural blocks of the system and attack them in a meaningful way. Good engineering says fewer pieces, yet we decided to make more parts, but made them simpler and modular; we disaggregated processes. No down time, so it had to be easy to swap in pieces. Start with a lower SLA/less critical set of processes.
  12. Better, faster, cheaper. We concentrated on writing code that was better. Mostly we relied on the framework for faster and cheaper.
  13. We had previously decided on ETL; we did not like, nor have much luck with, ELT. This was the general order in which system pieces were changed over (the general order of attack). Parsers: were resource intensive, probably most broken. History/DB load: we wanted to begin storing history on Hadoop to lessen the retention of data in the database, i.e. keep the longer tail of history out of the DB, where it's cheaper; this allowed us to eliminate some of the larger DB backup processes. Lookups: we needed lookups in Hadoop so that we could eliminate passing data between the old system and the new. Big dimensions: URL and title dimensions were quite technically challenging, super-sized dims. Sessionize: heart of the system; we could eliminate most data passing between the old cluster and the new once this piece was in. However, this piece was the most complex, risky and significant, so we built up to it.
  14. Disaggregate: use an application mgt. system; we abstracted all operational control that was feasible to this top level. Bridges: old cluster to new, offload database to cluster, etc. SLA: not as important at that time that it be done quickly. Tried to build/deploy all pieces as soft launch/parallel wherever possible, then flip a switch to the re-architected piece once we see everything flows/works well in the live environment. Started with China: significant volumes, challenging data (Unicode), simpler application flow. Go for high resource usage processes that suffered from lengthy processing.
  15. Lots of data checking: since we have ~400 fact fields in 10 fact tables, and peaks over 500 M recs/day, we needed to really check data thoroughly. Control layer not so hard; it was easy since we focused on control abstraction. Wrote a couple of tools to do mass data checking: resultant database compares, sampling to the extreme; in cluster data compare, could do brute force old and new with summaries. There was lots of data archaeology to determine good versus bad differences; 99% of differences during testing were good, i.e. the result of better code.
  16. Chunk it up, but not too much. Play it safe: try to see that everything works at scale before its active release and deploy. Trial bridges/interfaces to the new system before they need to be used.
  17. For speed we did lots ourselves from the beginning; we planned on turning infrastructure/admin over to a central group if all went well. Talked about sharing, but did not concentrate on it or begin acting on it initially, then began in earnest once our ‘experiment’ was deemed successful.
  18. Get our data pipelines to run as mappers. Upgrade our harvesting to a simpler yet reliable and faster model. Go for some more difficult but very beneficial pieces; once done we could really drop flow through the old cluster. Once we are adept at Hadoop, and are confident in it, move the heart of the system. Expand our Hadoop frameworks: Hive, Pig (other groups), ZooKeeper, etc. Also perhaps it’s a bit of wanting to not be on the bleeding edge and finding stability.
  19. General order of adoption: use M/R and HDFS, obviously beginning with them concurrently; in other words, the basics. Decided to go to CDH2 and stick with what's GA for stability/reliability reasons. Went to Hive once we felt the cluster could handle it; significant data stores there to use.
  20. Partnering in the loose sense, both internal and external. Needed sys admins' help to spec, purchase, install, and admin Linux boxes. Needed our PI group to do our custom builds of CDH/packages for compatibility and software infrastructure management. Really Cloudera since we started: using resources, builds, online training, consulting, etc. Need internal Hadoop admins so we can go full force and get on with building apps/systems. Needed mgt. support for obvious reasons.