A Plan for Large Scale Data Analytics
  Utilizing Aster nCluster to support processing in excess of 100 Billion
  rows per month




Will Duckworth, Director Software Engineering (wduckworth@comscore.com)
Agenda

 comScore
  – Introduction and Technology
 MM360 Initiative
 The Challenge
 Our Analysis
 Plans




                     © comScore, Inc. Proprietary and Confidential.   2
comScore, Inc.


 Founded in 1999

 Publicly traded on NASDAQ (SCOR)

 Acquired MediaMetrix in 2002, M:Metrics in 2007, Certifica in 2009, and ARSgroup
  in 2010

 Corporate headquarters: Reston, VA

  – Offices in Chicago, NYC, San Francisco, Seattle, Toronto, London, Tokyo, and Paris

  – 500+ full-time employees

 Experienced senior leadership team with a unique record of innovation in the
  market research industry

 More than 1200 clients across many industries


Advising Hundreds of Leading Businesses
 (partial list)

Internet, Agencies, Telecom, Financial, Retail, Travel, CPG, Pharma, Technology




Powerful Platform: Massive Database and Cost
Effective Technology Infrastructure


 Database and Computational Infrastructure
 Continuous Operation
  – 24/7
  – 99.99% Uptime
 Massive Operational Scale
  – Largest Windows Data Warehouse in the World
  – 1,000 TB of storage
  – 1,100 Servers
  – 30 TB per month
 Patents
  – 3 Issued
  – 24 Pending
 Cost Effective
  – System Capex < $7M/Year
 Highly Scalable, Distributed Processing Architecture
 Proprietary Technology with Strong IP Protection

Sophisticated Technology to Keep Up With Internet Advancements
Even for us this is getting big…

[Chart: New Rows per Day (panel vs. non-panel). Y-axis: millions of rows, 0 to 12,000. X-axis: 6/24/2009 to 3/24/2010. Series: beacon, panel.]

Where we come from …

 Our skill set came from a need to measure Win32
 We chose technologies and built a core team around our
 mandate to have accurate consumer Internet
 measurement
 – All Intel Based
 – 2/3 Microsoft OS, 1/3 Linux OS
 – C++
 Now very much a “best tool for the job” organization




MM360 Initiative




Internet = “The Most Measurable Medium”



                                                                       100% Accurate count of
                                                                         server requests, but…


                                                                        How many real users?
                                                                        What kind of users are they?
                                                                        Which request is a valid
                                                                         Page View?
                                                                        How long did the users spend
                                                                         on my site?




Basic Problem with Servers: No Unique User ID

                     Web Analytics Approximation
            Unique User = Cookie ID (if Cookies can be set)
                     or IP Address + User Agent



Sounds Simple, But Major Problems:
 Cookies are deleted frequently, and the same person can be counted
  multiple times

 IP Addresses change frequently causing inflation of user counts

 In any case, servers identify a machine (or a browser), which can
  represent multiple persons or a fraction of the usage of a single person
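The approximation above can be sketched in a few lines of Python. This is only an illustration of the rule as stated on the slide, not comScore's implementation; the function name, field names, and sample requests are hypothetical. It shows how one person on one machine can end up counted several times.

```python
from typing import Optional


def approximate_user_key(cookie_id: Optional[str], ip_address: str, user_agent: str) -> str:
    """Return the key a server would treat as one 'unique user'.

    Rule from the slide: use the cookie ID if a cookie could be set,
    otherwise fall back to IP address + user agent.
    """
    if cookie_id:
        return f"cookie:{cookie_id}"
    return f"ipua:{ip_address}|{user_agent}"


# Hypothetical requests, all made by the same person on the same machine.
requests = [
    {"cookie_id": "abc123", "ip": "203.0.113.7", "ua": "Browser/1.0"},
    {"cookie_id": "def456", "ip": "203.0.113.7", "ua": "Browser/1.0"},  # cookie deleted, new one issued
    {"cookie_id": None, "ip": "203.0.113.7", "ua": "Browser/1.0"},      # cookies blocked
    {"cookie_id": None, "ip": "198.51.100.9", "ua": "Browser/1.0"},     # IP address changed
]

unique_users = {
    approximate_user_key(r["cookie_id"], r["ip"], r["ua"]) for r in requests
}
print(len(unique_users))  # one real person, counted as 4 "unique users"
```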



Media Metrix 360: Key Benefits for Participating Sites

 Comprehensive coverage: 100% of activity
 – New “Universe Report” covers mobile and public machines
 – Census-adjusted metrics in current Media Metrix reports (Home and
   Work)
 – Coverage Calculation for beaconing sites

 Improved coverage of At-Work population
 Harmonization / Reconciliation of panel vs. server
 More granularity
 More timely reporting
 Transparency

The Challenge




Goals

 Be able to scale to support an initial monthly volume of 160 Billion
 records
 – Store 3 months of data online
 Be able to add incrementally to the environment to support growth
 Support advanced analytics
 – 150 analysts
 Support end user access to record level data, preferably through a
 SQL interface
 Support the storage of row level data
 Have yesterday's data available today
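As a rough back-of-the-envelope check on these goals, the volume figures can be turned into rows held online and an average load rate. The 30-day month and the assumption of an evenly spread load are simplifications of ours, not figures from the deck.

```python
MONTHLY_ROWS = 160_000_000_000  # initial monthly volume from the goals
MONTHS_ONLINE = 3               # months of data to keep online
DAYS_PER_MONTH = 30             # assumption: a 30-day month
SECONDS_PER_DAY = 86_400

rows_online = MONTHLY_ROWS * MONTHS_ONLINE
rows_per_day = MONTHLY_ROWS / DAYS_PER_MONTH
rows_per_second = rows_per_day / SECONDS_PER_DAY  # assumption: load spread evenly over the day

print(f"Rows online:     {rows_online:,}")
print(f"Rows per day:    {rows_per_day:,.0f}")
print(f"Rows per second: {rows_per_second:,.0f}")
```

Under these assumptions the environment holds about 480 billion rows and must ingest a sustained average of roughly 62 thousand rows per second; peak rates would be higher.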



Existing Internal Systems

 NGUA
  – Ability to run specific queries for a given time period very quickly because all processing is
    parallelized
  – Currently holding 560+ days of data; 800B+ rows.
  – All traffic for a machine for a month – 1 minute run time (140k records)
  – All traffic for pizzahut.com for a month – 4 minutes run time (1.9 million records)
  – All traffic from google.com where toys is in the URL – 1 hour 15 minutes (400k records)

■ Fusion
  – Primary System used for processing and providing the data behind the majority of
    comScore’s products and analysis
  – Runs on 32 servers
  – For one month we read over 8TB of compressed log files with over 40B rows
  – Produces 1.3 B rows and 120 GB of output for load into a DW
  – Can turn around the processing in less than 8 hours

 Both systems leverage the same core concepts of locality to data and distributed
  processing
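The Fusion figures above imply a minimum sustained throughput, sketched below. Because the slide gives "over" 40B rows and "less than" 8 hours, the results are lower bounds; the even split across servers is our simplifying assumption.

```python
INPUT_ROWS = 40_000_000_000   # "over 40B rows" read per month
OUTPUT_ROWS = 1_300_000_000   # "1.3 B rows" produced
SERVERS = 32
MAX_HOURS = 8                 # "less than 8 hours" turnaround

seconds = MAX_HOURS * 3600
cluster_rows_per_second = INPUT_ROWS / seconds
# Assumption: work is spread evenly across all servers.
server_rows_per_second = cluster_rows_per_second / SERVERS
reduction_ratio = INPUT_ROWS / OUTPUT_ROWS

print(f"Cluster: at least {cluster_rows_per_second:,.0f} rows/s")
print(f"Per server: at least {server_rows_per_second:,.0f} rows/s")
print(f"Input-to-output row reduction: about {reduction_ratio:.1f}:1")
```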
Aster Data nCluster

 Current Aster environments
 – Dev: 1 Queen; 3 Workers; 650+GB total storage
 – Prod: 1 Queen; 4 Loaders; 10 Workers; 32TB total storage

 Plans
 – Building new Prod environment 1 Queen, 70 workers and 10
   Loaders / Staging servers
 – 350TB total storage
 – 432 Cores




Aster Data nCluster

 Table design is key with data of this size
 – What is the end user going to do 80% of the time?
 On the web, no matter how clean you think your data set
 is, there are still going to be issues
 – 6 Sigma on 10 billion records a day is still nearly 35,000
   “bad rows”
 – Staging Servers
 Looking at using Aster-Hadoop Data Connector for
 integration with in-house Hadoop environment
 – Aster Data for the analysts
 – Hadoop for the developers
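The "bad rows" figure above can be checked with a short calculation. It assumes the conventional Six Sigma defect rate of 3.4 defects per million opportunities, which the slide does not state explicitly.

```python
ROWS_PER_DAY = 10_000_000_000  # 10 billion records a day, from the slide
DEFECTS_PER_MILLION = 3.4      # assumption: conventional Six Sigma defect rate

bad_rows_per_day = ROWS_PER_DAY * DEFECTS_PER_MILLION / 1_000_000
print(f"Expected bad rows per day: {bad_rows_per_day:,.0f}")
```

The result is about 34,000 rows a day, in line with the slide's "nearly 35,000", which is why staging servers that catch and isolate bad rows matter even for very clean data.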

Critical Cost Drivers to factor into the Analysis

 Data Centers
 – Power is the big issue at data centers today. Allocations
   for power and space are based on the number of circuits, and
   the cost per circuit is expected to rise
 Servers
 – Even high-end servers have reached relative commodity
   prices if you stick to the 2U footprint and standard
   components





Contenu connexe

Tendances

Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venka...
Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venka...Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venka...
Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venka...Yahoo Developer Network
 
Data infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedInData infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedInHari Shankar Sreekumar
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopSavvycom Savvycom
 
EMC Big Data Solutions Overview
EMC Big Data Solutions OverviewEMC Big Data Solutions Overview
EMC Big Data Solutions Overviewwalshe1
 
ROI of Big Data Analytics Native on Hadoop
ROI of Big Data Analytics Native on HadoopROI of Big Data Analytics Native on Hadoop
ROI of Big Data Analytics Native on HadoopDataWorks Summit
 
Architecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataArchitecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataRichard McDougall
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big dataYukti Kaura
 
Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...
Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...
Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...Cloudera, Inc.
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in detailsMahmoud Yassin
 
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study VMworld
 
High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data Geoffrey Fox
 
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...Dataconomy Media
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureRoman Nikitchenko
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsKamalika Dutta
 
Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台Etu Solution
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
 

Tendances (19)

Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venka...
Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venka...Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venka...
Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venka...
 
Data infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedInData infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedIn
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & Hadoop
 
EMC Big Data Solutions Overview
EMC Big Data Solutions OverviewEMC Big Data Solutions Overview
EMC Big Data Solutions Overview
 
ROI of Big Data Analytics Native on Hadoop
ROI of Big Data Analytics Native on HadoopROI of Big Data Analytics Native on Hadoop
ROI of Big Data Analytics Native on Hadoop
 
Architecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataArchitecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big Data
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 
Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...
Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...
Realizing the Promise of Big Data with Hadoop - Cloudera Summer Webinar Serie...
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in details
 
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
 
High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data
 
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
 
Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台Track B-1 建構新世代的智慧數據平台
Track B-1 建構新世代的智慧數據平台
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
10g db grid
10g db grid10g db grid
10g db grid
 
MapR & Skytree:
MapR & Skytree: MapR & Skytree:
MapR & Skytree:
 

En vedette

The Power of_Like - How Social Marketing Works
The Power of_Like - How Social Marketing WorksThe Power of_Like - How Social Marketing Works
The Power of_Like - How Social Marketing WorksBoris Loukanov
 
The power of like. (ComScore, Facebook 2011)
The power of like.  (ComScore, Facebook 2011)The power of like.  (ComScore, Facebook 2011)
The power of like. (ComScore, Facebook 2011)Retelur Marketing
 
BigData @ comScore
BigData @ comScoreBigData @ comScore
BigData @ comScoreeaiti
 
How to Succeed in Hadoop: comScore’s Deceptively Simple Secrets to Deploying ...
How to Succeed in Hadoop: comScore’s Deceptively Simple Secrets to Deploying ...How to Succeed in Hadoop: comScore’s Deceptively Simple Secrets to Deploying ...
How to Succeed in Hadoop: comScore’s Deceptively Simple Secrets to Deploying ...MapR Technologies
 
Uber Analytics Test
Uber Analytics TestUber Analytics Test
Uber Analytics TestCoursetake
 
Facebook and Myspace App Platforms: A Brief Update
Facebook and Myspace App Platforms: A Brief UpdateFacebook and Myspace App Platforms: A Brief Update
Facebook and Myspace App Platforms: A Brief UpdateO'Reilly Media
 

En vedette (6)

The Power of_Like - How Social Marketing Works
The Power of_Like - How Social Marketing WorksThe Power of_Like - How Social Marketing Works
The Power of_Like - How Social Marketing Works
 
The power of like. (ComScore, Facebook 2011)
The power of like.  (ComScore, Facebook 2011)The power of like.  (ComScore, Facebook 2011)
The power of like. (ComScore, Facebook 2011)
 
BigData @ comScore
BigData @ comScoreBigData @ comScore
BigData @ comScore
 
How to Succeed in Hadoop: comScore’s Deceptively Simple Secrets to Deploying ...
How to Succeed in Hadoop: comScore’s Deceptively Simple Secrets to Deploying ...How to Succeed in Hadoop: comScore’s Deceptively Simple Secrets to Deploying ...
How to Succeed in Hadoop: comScore’s Deceptively Simple Secrets to Deploying ...
 
Uber Analytics Test
Uber Analytics TestUber Analytics Test
Uber Analytics Test
 
Facebook and Myspace App Platforms: A Brief Update
Facebook and Myspace App Platforms: A Brief UpdateFacebook and Myspace App Platforms: A Brief Update
Facebook and Myspace App Platforms: A Brief Update
 

Similaire à comScore

Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon Kinesis
Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon KinesisDay 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon Kinesis
Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon KinesisAmazon Web Services
 
The Evolution of Data Architecture
The Evolution of Data ArchitectureThe Evolution of Data Architecture
The Evolution of Data ArchitectureWei-Chiu Chuang
 
Accelerating Cloud Services - Intel
Accelerating Cloud Services - IntelAccelerating Cloud Services - Intel
Accelerating Cloud Services - IntelAmazon Web Services
 
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc..."An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...Maya Lumbroso
 
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc..."An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...Dataconomy Media
 
Industrial Internet of Things: Protocols an Standards
Industrial Internet of Things: Protocols an StandardsIndustrial Internet of Things: Protocols an Standards
Industrial Internet of Things: Protocols an StandardsJavier Povedano
 
Building Web Applications on AWS - AWS Summit 2012 - NYC
Building Web Applications on AWS - AWS Summit 2012 - NYCBuilding Web Applications on AWS - AWS Summit 2012 - NYC
Building Web Applications on AWS - AWS Summit 2012 - NYCAmazon Web Services
 
Kaushal Amin & Big 5 IT trends in the world
Kaushal Amin & Big 5 IT trends in the worldKaushal Amin & Big 5 IT trends in the world
Kaushal Amin & Big 5 IT trends in the worldQuang PM
 
Technology Trends and Big Data in 2013-2014
Technology Trends and Big Data in 2013-2014Technology Trends and Big Data in 2013-2014
Technology Trends and Big Data in 2013-2014KMS Technology
 
Why is DDS the Right Technology for the Industrial Internet?
Why is DDS the Right Technology for the Industrial Internet?Why is DDS the Right Technology for the Industrial Internet?
Why is DDS the Right Technology for the Industrial Internet?Real-Time Innovations (RTI)
 
Wicsa2011 cloud tutorial
Wicsa2011 cloud tutorialWicsa2011 cloud tutorial
Wicsa2011 cloud tutorialAnna Liu
 
A non-technical introduction to Cloud Computing
A non-technical introduction to Cloud ComputingA non-technical introduction to Cloud Computing
A non-technical introduction to Cloud ComputingWilliam Pourmajidi
 
What would you do with a million cores - HPC on AWS
What would you do with a million cores - HPC on AWSWhat would you do with a million cores - HPC on AWS
What would you do with a million cores - HPC on AWSAmazon Web Services
 
Fog Computing is the Future of the Industrial Internet of Things
Fog Computing is the Future of the Industrial Internet of ThingsFog Computing is the Future of the Industrial Internet of Things
Fog Computing is the Future of the Industrial Internet of ThingsReal-Time Innovations (RTI)
 
Excellent slides on the new z13s announced on 16th Feb 2016
Excellent slides on the new z13s announced on 16th Feb 2016Excellent slides on the new z13s announced on 16th Feb 2016
Excellent slides on the new z13s announced on 16th Feb 2016Luigi Tommaseo
 
AWS Webcast - Introduction to Amazon Kinesis
AWS Webcast - Introduction to Amazon KinesisAWS Webcast - Introduction to Amazon Kinesis
AWS Webcast - Introduction to Amazon KinesisAmazon Web Services
 
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Amazon Web Services
 
Windows Azure PaaS - Webinar Common Sense
Windows Azure PaaS - Webinar Common SenseWindows Azure PaaS - Webinar Common Sense
Windows Azure PaaS - Webinar Common SenseCommon Sense
 
Kb12012011 amitava cloud_computing
Kb12012011 amitava cloud_computingKb12012011 amitava cloud_computing
Kb12012011 amitava cloud_computingAmitava Kumar
 
Behind the Wizard’s Curtain: Scalability and Security at Zuora (Subscribed13)
Behind the Wizard’s Curtain:  Scalability and Security at Zuora (Subscribed13)Behind the Wizard’s Curtain:  Scalability and Security at Zuora (Subscribed13)
Behind the Wizard’s Curtain: Scalability and Security at Zuora (Subscribed13)Zuora, Inc.
 

Similaire à comScore (20)

Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon Kinesis
Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon KinesisDay 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon Kinesis
Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon Kinesis
 
The Evolution of Data Architecture
The Evolution of Data ArchitectureThe Evolution of Data Architecture
The Evolution of Data Architecture
 
Accelerating Cloud Services - Intel
Accelerating Cloud Services - IntelAccelerating Cloud Services - Intel
Accelerating Cloud Services - Intel
 
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc..."An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
 
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc..."An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
"An introduction to Kx Technology - a Big Data solution", Kyra Coyne, Data Sc...
 
Industrial Internet of Things: Protocols an Standards
Industrial Internet of Things: Protocols an StandardsIndustrial Internet of Things: Protocols an Standards
Industrial Internet of Things: Protocols an Standards
 
Building Web Applications on AWS - AWS Summit 2012 - NYC
Building Web Applications on AWS - AWS Summit 2012 - NYCBuilding Web Applications on AWS - AWS Summit 2012 - NYC
Building Web Applications on AWS - AWS Summit 2012 - NYC
 
Kaushal Amin & Big 5 IT trends in the world
Kaushal Amin & Big 5 IT trends in the worldKaushal Amin & Big 5 IT trends in the world
Kaushal Amin & Big 5 IT trends in the world
 
Technology Trends and Big Data in 2013-2014
Technology Trends and Big Data in 2013-2014Technology Trends and Big Data in 2013-2014
Technology Trends and Big Data in 2013-2014
 
Why is DDS the Right Technology for the Industrial Internet?
Why is DDS the Right Technology for the Industrial Internet?Why is DDS the Right Technology for the Industrial Internet?
Why is DDS the Right Technology for the Industrial Internet?
 
Wicsa2011 cloud tutorial
Wicsa2011 cloud tutorialWicsa2011 cloud tutorial
Wicsa2011 cloud tutorial
 
A non-technical introduction to Cloud Computing
A non-technical introduction to Cloud ComputingA non-technical introduction to Cloud Computing
A non-technical introduction to Cloud Computing
 
What would you do with a million cores - HPC on AWS
What would you do with a million cores - HPC on AWSWhat would you do with a million cores - HPC on AWS
What would you do with a million cores - HPC on AWS
 
Fog Computing is the Future of the Industrial Internet of Things
Fog Computing is the Future of the Industrial Internet of ThingsFog Computing is the Future of the Industrial Internet of Things
Fog Computing is the Future of the Industrial Internet of Things
 
Excellent slides on the new z13s announced on 16th Feb 2016
Excellent slides on the new z13s announced on 16th Feb 2016Excellent slides on the new z13s announced on 16th Feb 2016
Excellent slides on the new z13s announced on 16th Feb 2016
 
AWS Webcast - Introduction to Amazon Kinesis
AWS Webcast - Introduction to Amazon KinesisAWS Webcast - Introduction to Amazon Kinesis
AWS Webcast - Introduction to Amazon Kinesis
 
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe...
 
Windows Azure PaaS - Webinar Common Sense
Windows Azure PaaS - Webinar Common SenseWindows Azure PaaS - Webinar Common Sense
Windows Azure PaaS - Webinar Common Sense
 
Kb12012011 amitava cloud_computing
Kb12012011 amitava cloud_computingKb12012011 amitava cloud_computing
Kb12012011 amitava cloud_computing
 
Behind the Wizard’s Curtain: Scalability and Security at Zuora (Subscribed13)
Behind the Wizard’s Curtain:  Scalability and Security at Zuora (Subscribed13)Behind the Wizard’s Curtain:  Scalability and Security at Zuora (Subscribed13)
Behind the Wizard’s Curtain: Scalability and Security at Zuora (Subscribed13)
 

Plus de Teradata Aster

Razorfish Multi-Channel Marketing: Better Customer Segmentation and Targeting
Razorfish Multi-Channel Marketing: Better Customer Segmentation and TargetingRazorfish Multi-Channel Marketing: Better Customer Segmentation and Targeting
Razorfish Multi-Channel Marketing: Better Customer Segmentation and TargetingTeradata Aster
 
Big Data Decision-Making
Big Data Decision-MakingBig Data Decision-Making
Big Data Decision-MakingTeradata Aster
 
Using Data to Manage in Today’s Chaotic Environment
Using Data to Manage in Today’s Chaotic EnvironmentUsing Data to Manage in Today’s Chaotic Environment
Using Data to Manage in Today’s Chaotic EnvironmentTeradata Aster
 
Big Analytics 2012 Event Survey Data
Big Analytics 2012 Event Survey DataBig Analytics 2012 Event Survey Data
Big Analytics 2012 Event Survey DataTeradata Aster
 
What Makes A Great Data Scientist?
What Makes A Great Data Scientist?What Makes A Great Data Scientist?
What Makes A Great Data Scientist?Teradata Aster
 
Practical Applications of Visual Analytics
Practical Applications of Visual AnalyticsPractical Applications of Visual Analytics
Practical Applications of Visual AnalyticsTeradata Aster
 
Trust and Influence in the Complex Network of Social Media
Trust and Influence in the Complex Network of Social MediaTrust and Influence in the Complex Network of Social Media
Trust and Influence in the Complex Network of Social MediaTeradata Aster
 
Turning Big Data to Business Advantage
Turning Big Data to Business AdvantageTurning Big Data to Business Advantage
Turning Big Data to Business AdvantageTeradata Aster
 
Big Brands Meet Big Data – The Newest Innovator’s Dilemma
Big Brands Meet Big Data – The Newest Innovator’s DilemmaBig Brands Meet Big Data – The Newest Innovator’s Dilemma
Big Brands Meet Big Data – The Newest Innovator’s DilemmaTeradata Aster
 
Simplifying Big Data Analytics for the Business
Simplifying Big Data Analytics for the BusinessSimplifying Big Data Analytics for the Business
Simplifying Big Data Analytics for the BusinessTeradata Aster
 
Evaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics PlatformsEvaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics PlatformsTeradata Aster
 
Keynote: Cross Industry Lessons from Moneyball Analytics
Keynote: Cross Industry Lessons from Moneyball AnalyticsKeynote: Cross Industry Lessons from Moneyball Analytics
Keynote: Cross Industry Lessons from Moneyball AnalyticsTeradata Aster
 
Technology Strategies for Big Data Analytics,
Technology Strategies for Big Data Analytics, Technology Strategies for Big Data Analytics,
Technology Strategies for Big Data Analytics, Teradata Aster
 
Hadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondHadoop - Now, Next and Beyond
Hadoop - Now, Next and BeyondTeradata Aster
 
From Data Science to Business Value - Analytics Applied
From Data Science to Business Value - Analytics AppliedFrom Data Science to Business Value - Analytics Applied
From Data Science to Business Value - Analytics AppliedTeradata Aster
 
Solving the Education Crisis with Big Data
Solving the Education Crisis with Big DataSolving the Education Crisis with Big Data
Solving the Education Crisis with Big DataTeradata Aster
 
Using SQL-MapReduce for Advanced Analytics
Using SQL-MapReduce for Advanced AnalyticsUsing SQL-MapReduce for Advanced Analytics
Using SQL-MapReduce for Advanced AnalyticsTeradata Aster
 
SAS aster data big data dc presentation public
SAS aster data big data dc presentation publicSAS aster data big data dc presentation public
SAS aster data big data dc presentation publicTeradata Aster
 
Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...
Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...
Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...Teradata Aster
 
20100506 aster data big data summit - microstrategy (shareable)
20100506   aster data big data summit - microstrategy (shareable)20100506   aster data big data summit - microstrategy (shareable)
20100506 aster data big data summit - microstrategy (shareable)Teradata Aster
 

comScore

  • 1. A Plan for Large Scale Data Analytics: Utilizing Aster nCluster to support processing in excess of 100 Billion rows per month. Will Duckworth, Director Software Engineering (wduckworth@comscore.com)
  • 2. Agenda
      ■ comScore
        – Introduction and Technology
      ■ MM360 Initiative
      ■ The Challenge
      ■ Our Analysis
      ■ Plans
  • 3. comScore, Inc.
      ■ Founded in 1999
      ■ Publicly traded on NASDAQ (SCOR)
      ■ Acquired Media Metrix in 2002, M:Metrics in 2007, Certifica in 2009, and ARSgroup in 2010
      ■ Corporate headquarters: Reston, VA
        – Offices in Chicago, NYC, San Francisco, Seattle, Toronto, London, Tokyo, and Paris
        – 500+ full-time employees
      ■ Experienced senior leadership team with a unique record of innovation in the market research industry
      ■ More than 1200 clients across many industries
  • 4. Advising Hundreds of Leading Businesses (partial list): Internet, Agencies, Telecom, Financial, Retail, Travel, CPG, Pharma, Technology
  • 5. Powerful Platform: Massive Database and Cost Effective Technology Infrastructure
      Database and Computational Infrastructure
      ■ Continuous Operation
        – 24/7
        – 99.99% Uptime
      ■ Massive Operational Scale: Largest Windows Data Warehouse in the World
        – 1,000 TB of storage
        – 1,100 Servers
        – 30 TB per month
      ■ Patents: Proprietary Technology with Strong IP Protection
        – 3 Issued
        – 24 Pending
      ■ Cost Effective: System Capex < $7M/Year
      ■ Highly Scalable, Distributed Processing Architecture
      Sophisticated Technology to Keep Up With Internet Advancements
  • 6. Even for us this is getting big…
      [Chart: New Rows per Day (panel vs. non-panel), in millions, y-axis 0 to 12,000, 6/24/2009 through 3/24/2010; series: beacon, panel]
  • 7. Where we come from …
      ■ Our skill set came from a need to measure Win32
      ■ We chose technologies and built a core team around our mandate to have accurate consumer Internet measurement
        – All Intel Based
        – 2/3 Microsoft OS, 1/3 Linux OS
        – C++
      ■ Now very much a “best tool for the job” organization
  • 8. MM360 Initiative
  • 9. Internet = “The Most Measurable Medium”
      100% Accurate count of server requests, but…
      ■ How many real users?
      ■ What kind of users are they?
      ■ Which request is a valid Page View?
      ■ How long did the users spend on my site?
  • 10. Basic Problem with Servers: No Unique User ID
      Web Analytics Approximation: Unique User = Cookie ID (if Cookies can be set) or IP Address + User Agent
      Sounds Simple, But Major Problems:
      ■ Cookies are deleted frequently, and the same person can be counted multiple times
      ■ IP Addresses change frequently, causing inflation of user counts
      ■ In any case, servers identify a machine (or a browser), which can represent multiple persons or a fraction of the usage of a single person
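The approximation on slide 10 can be illustrated with a minimal Python sketch. The names `visitor_key` and `count_unique_visitors`, the tuple layout, and the sample requests are hypothetical and are not taken from comScore's systems; the sketch only shows how cookie deletion and changing IP addresses can inflate a server-side user count.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical request record: (cookie_id or None, ip_address, user_agent).
Request = Tuple[Optional[str], str, str]


def visitor_key(cookie_id: Optional[str], ip: str, user_agent: str) -> str:
    """Use the cookie ID when one was set, else fall back to IP + User Agent."""
    if cookie_id:
        return f"cookie:{cookie_id}"
    return f"ip+ua:{ip}|{user_agent}"


def count_unique_visitors(requests: Iterable[Request]) -> int:
    """Count distinct keys, which is what a server-side tool would call users."""
    return len({visitor_key(*request) for request in requests})


# Illustrative data: one person on one machine over a month.
one_person = [
    ("abc", "10.0.0.1", "Firefox"),  # first visit, cookie set
    ("abc", "10.0.0.1", "Firefox"),  # repeat visit, same cookie
    ("xyz", "10.0.0.1", "Firefox"),  # cookie deleted, new cookie issued
    (None, "10.0.0.1", "Firefox"),   # cookies blocked, fall back to IP + UA
    (None, "10.0.0.7", "Firefox"),   # IP address changed
]

print(count_unique_visitors(one_person))  # one real person, several "users"
```

In this made-up example a single person produces four distinct keys, which is the inflation the slide describes.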
  • 11. Media Metrix 360: Key Benefits for Participating Sites
      ■ Comprehensive coverage: 100% of activity
        – New “Universe Report” covers mobile and public machines
        – Census-adjusted metrics in current Media Metrix reports (Home and Work)
        – Coverage Calculation for beaconing sites
      ■ Improved coverage of At-Work population
      ■ Harmonization / Reconciliation of panel vs. server
      ■ More granularity
      ■ More timely reporting
      ■ Transparency
  • 12. The Challenge
  • 13. Goals
      ■ Be able to scale to support an initial monthly volume of 160 Billion records
        – Store 3 months of data online
      ■ Be able to add incrementally to the environment to support growth
      ■ Support advanced analytics
        – 150 analysts
      ■ Support end user access to record level data, preferably through a SQL interface
      ■ Support the storage of row level data
      ■ Have yesterday's data available today
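As a rough sizing sketch of the first goal, the arithmetic below derives the number of rows that would be online and the average daily intake. The 30-day month and the function names are assumptions made for illustration; only the 160 Billion monthly volume and the 3-month retention come from the slide.

```python
MONTHLY_RECORDS = 160_000_000_000  # initial monthly volume from the Goals slide
MONTHS_ONLINE = 3                  # months of data kept online (from the slide)
DAYS_PER_MONTH = 30                # assumption: a 30-day month


def rows_online(monthly: int = MONTHLY_RECORDS, months: int = MONTHS_ONLINE) -> int:
    """Total rows that must be queryable at once."""
    return monthly * months


def average_rows_per_day(monthly: int = MONTHLY_RECORDS,
                         days: int = DAYS_PER_MONTH) -> float:
    """Average daily intake if the monthly volume arrives evenly."""
    return monthly / days


print(f"rows online:  {rows_online():,}")
print(f"rows per day: {average_rows_per_day():,.0f}")
```

Under these assumptions the environment would hold 480 billion rows online and take in a little over 5 billion rows on an average day.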
  • 14. Existing Internal Systems
      ■ NGUA
        – Ability to run specific queries for a given time period very quickly because all processing is parallelized
        – Currently holding 560+ days of data; 800B+ rows
        – All traffic for a machine for a month: 1 minute run time (140k records)
        – All traffic for pizzahut.com for a month: 4 minutes run time (1.9 million records)
        – All traffic from google.com where toys is in the URL: 1 hour 15 minutes (400k records)
      ■ Fusion
        – Primary System used for processing and providing the data behind the majority of comScore’s products and analysis
        – Runs on 32 servers
        – For one month we read over 8TB of compressed log files with over 40B rows
        – Produces 1.3 B rows and 120 GB of output for load into a DW
        – Can turn around the processing in less than 8 hours
      ■ Both systems leverage the same core concepts of locality to data and distributed processing
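The Fusion figures above imply a minimum sustained throughput, sketched below. Because the slide says "over 40B rows" and "less than 8 hours", the results are lower bounds. The even spread of work across the 32 servers is an assumption, and the names are illustrative.

```python
ROWS_READ = 40_000_000_000  # "over 40B rows" per month, so a lower bound
SERVERS = 32                # servers Fusion runs on (from the slide)
MAX_HOURS = 8               # "less than 8 hours", so an upper bound on time


def min_rows_per_second(rows: int, hours: float) -> float:
    """Lowest sustained rate that finishes `rows` within `hours`."""
    return rows / (hours * 3600)


cluster_rate = min_rows_per_second(ROWS_READ, MAX_HOURS)
# Assumption: the work is spread evenly across all servers.
per_server_rate = cluster_rate / SERVERS

print(f"cluster:    at least {cluster_rate:,.0f} rows/s")
print(f"per server: at least {per_server_rate:,.0f} rows/s")
```

That works out to roughly 1.4 million rows per second across the cluster, or about 43 thousand rows per second per server.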
  • 15. Aster Data nCluster
      ■ Current Aster environments
        – Dev: 1 Queen; 3 Workers; 650+GB total storage
        – Prod: 1 Queen; 4 Loaders; 10 Workers; 32TB total storage
      ■ Plans
        – Building new Prod environment: 1 Queen, 70 Workers, and 10 Loaders / Staging servers
        – 350TB total storage
        – 432 Cores
  • 16. Aster Data nCluster
      ■ Table design is key with data of this size
        – What is the end user going to do 80% of the time?
      ■ On the web, no matter how clean you think your data set is, there are still going to be issues
        – 6 Sigma on 10 billion records a day is still nearly 35,000 “bad rows”
        – Staging Servers
      ■ Looking at using Aster-Hadoop Data Connector for integration with in-house Hadoop environment
        – Aster Data for the analysts
        – Hadoop for the developers
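The "6 Sigma" figure on slide 16 can be checked with a short calculation. It assumes the conventional Six Sigma defect rate of 3.4 defects per million opportunities, which the slide does not state explicitly; the function name is illustrative.

```python
SIX_SIGMA_DPMO = 3.4  # conventional Six Sigma rate: defects per million opportunities


def expected_bad_rows(rows_per_day: int, dpmo: float = SIX_SIGMA_DPMO) -> int:
    """Expected number of defective rows per day at the given defect rate."""
    return round(rows_per_day * dpmo / 1_000_000)


print(expected_bad_rows(10_000_000_000))
```

At 10 billion records a day this gives about 34,000 rows, in line with the slide's "nearly 35,000".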
  • 17. Critical Cost Drivers to factor into the Analysis
      ■ Data Centers
        – Power is the big issue at data centers today. All allocations for power and space are based on the number of circuits, and the cost per circuit is expected to rise
      ■ Servers
        – Even high-end servers have reached relative commodity prices if you stick to the 2U footprint and standard components