Big Data Ad Analytics
Nena Marín, Ph.D., Principal Scientist & Director of Research
Outline
Background
Big Data Challenge: mySQL vs myNoSQL
Data Intensive Computing: online vs offline
Big Data Solution: schema changes, analytics approach changes
Results
Conclusions
Big Data Problem: Overwhelming Amounts of Data & Growing
Data exceeding 100 GB/TB/PBs, unstructured data/content
Structured data: warehouse for a tri-state Electric and Gas Utility (Oracle OWB + Cognos); 346,000 electric and 306,000 natural gas customers (2007)
Semi- & unstructured data (Adometry, 2010-now): impression & click-stream data; scale: ~3B impressions per day; Page & Ad Tag customers; Ad Server log-file customers; growth rates: 2-9% per month
Unstructured content: email (Adometry: Cross-Channel AA); sensor data (HALO Project: 8.5B particles, 2009-2010)
Ad Analytics: X Insurance
X Insurance: over 4 Billion Impressions per month
Three Clients' Volume
Big Data Analysis
Off-line (batch):
Ad Analytics Warehouse
HALO project: terascale astronomical dataset on the 240-compute-node Longhorn cluster (ref1); Hierarchical Density Shaving clustering algorithm; dataflow clustering algorithm on Hadoop (128 compute cores, Spur cluster @ UT/TACC)
On-line (real-time):
Netflix Recommender: 100 million ratings in 16.31 minutes, effectively 9.6 microseconds per rating (ref2)
REF1: http://www.computer.org/portal/web/csdl/doi/10.1109/ICDMW.2010.26
REF2: http://kdd09.crowdvine.com/talks/4963 (KDD 2009)
Outline
Background
Big Data Challenge: mySQL vs myNoSQL
Data Intensive Computing: online vs offline
Big Data Solution: schema changes, analytics approach changes
Results
Conclusions
Big Data Solution Questions
How will we add new nodes (grow)?
Any single points of failure?
Do the writes scale as well?
How much administration will the system require?
Implementation learning curve?
If it's open source, is there a healthy community?
How much time and effort would we have to expend to deploy and integrate it?
Does it work with technology we already know: integration tools, presentation tools, analytics (data mining) tools?
My two cents
30 billion rows per month puts you in VLDB territory, so you need partitioning. The low-cardinality dimensions also suggest that bitmap indexes would be a performance win.
Column store: most aggregates are by columns
Agility to update schema: add columns
QuickLZ compression on partitions and tables
Columnar Database: Review Table

CREATE TABLE review ( /* column list elided on the slide */ )
WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=quicklz, OIDS=FALSE)
DISTRIBUTED BY (id)
PARTITION BY RANGE(productid)
  SUBPARTITION BY RANGE(submissiontime)
  SUBPARTITION BY LIST(status)
(
  PARTITION clnt100002 START (100002) END (100003) EVERY (1)
    WITH (appendonly=true, compresstype=quicklz, orientation=column)
  (
    START ('2011-05-01 00:00:00'::timestamp without time zone)
    END ('2011-08-01 00:00:00'::timestamp without time zone)
    EVERY ('1 day'::interval)
    WITH (appendonly=true, compresstype=quicklz, orientation=column)
    (
      SUBPARTITION subm VALUES ('submitted')
        WITH (appendonly=true, compresstype=quicklz, orientation=column),
      SUBPARTITION appr VALUES ('approved')
        WITH (appendonly=true, compresstype=quicklz, orientation=column),
      DEFAULT SUBPARTITION other
        WITH (appendonly=true, compresstype=quicklz, orientation=column)
    )
  )
);
Why Partitioning?
Because the table is partitioned by date, a query with WHERE submissiontime >= ... AND submissiontime <= ... scans only the partitions inside the date range; partitions outside it are skipped and don't impact performance, as sketched in the query below.
If a partition is no longer needed, you can create a new table with the content of the partition and drop the partition.
Partitions can be recreated with the "compression" and "append only" options to save disk space and I/O bandwidth.
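A minimal sketch of a pruned query against the review table above, using its daily range partitions on submissiontime (table and column names follow the DDL; the date window is illustrative):

-- Only the daily partitions covering June 2011 are scanned;
-- every partition outside the range is eliminated before execution.
SELECT productid, count(*) AS reviews
FROM review
WHERE submissiontime >= '2011-06-01 00:00:00'
  AND submissiontime <  '2011-07-01 00:00:00'
GROUP BY productid;

-- EXPLAIN shows the pruning: partitions outside the window
-- simply do not appear in the plan.
EXPLAIN SELECT count(*)
FROM review
WHERE submissiontime >= '2011-06-01' AND submissiontime < '2011-07-01';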
Outline
Background
Big Data Challenge: mySQL vs myNoSQL
Data Intensive Computing: online vs offline
Big Data Solution: schema changes, analytics approach changes
Results
Conclusions
Question 1: Top rated products (filtered by brand or category)

MDX:
WITH SET [TCat] AS TopCount([Product].[Subcategory].[Subcategory], 10, [Measures].[Rating])
MEMBER [Product].[Subcategory].[Other] AS Aggregate([Product].[Subcategory].[Subcategory] - TCat)
SELECT { [Measures].[Rating] } ON COLUMNS, TCat + [Other] ON ROWS
FROM [DW_PRODUCTS]

SQL query:
select top X rating, count(id) from fact where brand = x and category = y group by rating order by count(id) desc
(a PostgreSQL/Greenplum-flavored variant is sketched below)
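The inline SQL above uses SQL Server's TOP; in Greenplum/PostgreSQL the same slice is expressed with LIMIT. A sketch that ranks the products themselves, assuming a review fact with brand, category, productid, and rating columns (names illustrative):

-- Top 10 products by average rating within a brand/category filter.
SELECT productid,
       avg(rating) AS avg_rating,
       count(*)    AS num_ratings
FROM review
WHERE brand = 'x'
  AND category = 'y'
GROUP BY productid
ORDER BY avg_rating DESC, num_ratings DESC
LIMIT 10;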
Recommender Systems: We Know What You Ought To Be Watching This Summer
Ratings Data Matrix: Sparse Matrix Representation
rowID, colID, Rating, colID, Rating, …
User1, 1, 1, 2, 5
User2, 3, 3, 4, 4, 5, 3
Co-clustering: raw data (users × movies ratings matrix), after row clustering, after column clustering; iterate until convergence to K by L coclusters.
From the Training: K by L coclusters, average ratings per cluster, average ratings per user, average ratings per movie. Global average rating is 3.68. (The SQL sketch below shows how these reduce to warehouse aggregates.)
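If the cluster assignments are materialized in the warehouse, these training averages reduce to plain aggregates. A sketch assuming illustrative tables ratings(userid, movieid, rating), user_cluster(userid, k), and movie_cluster(movieid, l):

-- Average rating per K by L cocluster.
CREATE TABLE cocluster_avg AS
SELECT uc.k, mc.l, avg(r.rating) AS cocluster_avg
FROM ratings r
JOIN user_cluster  uc ON uc.userid  = r.userid
JOIN movie_cluster mc ON mc.movieid = r.movieid
GROUP BY uc.k, mc.l;

-- Per-user, per-movie, and global fallback averages.
CREATE TABLE user_avg   AS SELECT userid,  avg(rating) AS user_avg  FROM ratings GROUP BY userid;
CREATE TABLE movie_avg  AS SELECT movieid, avg(rating) AS movie_avg FROM ratings GROUP BY movieid;
CREATE TABLE global_avg AS SELECT avg(rating) AS global_avg FROM ratings;  -- 3.68 on this data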
Prediction Algorithm
Case 1: known User, known Movie → Rating = Cluster average
Case 2: known User, unknown Movie → Rating = User average
Case 3: unknown User, known Movie → Rating = Movie average
Case 4: unknown User, unknown Movie → Rating = Global average
(A SQL sketch of this fallback chain follows below.)
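With the averages materialized as in the training sketch above (cocluster_avg, user_avg, movie_avg, and global_avg are assumed, illustrative tables), the four cases collapse into one COALESCE chain; to_score holds the (userid, movieid) pairs to predict:

-- Case 1: both known        -> cocluster average
-- Case 2: user known only   -> user average
-- Case 3: movie known only  -> movie average
-- Case 4: neither known     -> global average
SELECT q.userid,
       q.movieid,
       COALESCE(cc.cocluster_avg,
                ua.user_avg,
                ma.movie_avg,
                g.global_avg) AS predicted_rating
FROM to_score q
LEFT JOIN user_cluster  uc ON uc.userid  = q.userid
LEFT JOIN movie_cluster mc ON mc.movieid = q.movieid
LEFT JOIN cocluster_avg cc ON cc.k = uc.k AND cc.l = mc.l
LEFT JOIN user_avg      ua ON ua.userid  = q.userid
LEFT JOIN movie_avg     ma ON ma.movieid = q.movieid
CROSS JOIN global_avg g;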
DataRush-based Recommender System: Dataflow Prediction Application Graph; Dataflow Training Application Graph
Results: Scalability Across Cores
Results
Question 1: Top rated products
Build a recommender data mining model based on co-clustering customers & ratings.
Training runtime for 100,480,507 ratings: 16.31 minutes.
Apply the recommender in real time; effective prediction runtime: 9.738615 μs per rating.
Question 2: Fastest Rising Products
Method 1: store the co-clustering model in the DW; identify when products move from one cluster to another.
Method 2: product ratings distribution: bin ratings; establish a distribution baseline for each product (μ, σ); when μ, σ change beyond a certain threshold, identify the movers/shakers & biggest losers (sketched below).
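A sketch of Method 2, assuming the review fact plus a product_baseline(productid, mu, sigma) table captured from an earlier period; the one-week window and the 2σ threshold are arbitrary illustrative choices:

-- Flag products whose current mean rating drifts beyond 2 sigma
-- from the stored baseline: movers/shakers up, biggest losers down.
SELECT b.productid,
       b.mu                 AS baseline_mean,
       avg(r.rating)        AS current_mean,
       avg(r.rating) - b.mu AS drift
FROM review r
JOIN product_baseline b ON b.productid = r.productid
WHERE r.submissiontime >= date_trunc('week', now())
GROUP BY b.productid, b.mu, b.sigma
HAVING abs(avg(r.rating) - b.mu) > 2 * b.sigma
ORDER BY drift DESC;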
Question 3: Category Level Stats
Category-level statistics (including roll-ups) such as average rating, content volume, etc.; Average = Sum(X)/n
Greenplum OLAP grouping extensions: CUBE, ROLLUP, GROUPING SETS

SELECT productcategory, productid, sum(rating)/count(*)
FROM review
GROUP BY ROLLUP(productcategory, productid)
ORDER BY 1, 2, 3;
Question 4: Top Contributors
Top contributors: score = content submissions + helpfulness votes.
Tag content: approve positive reviews; reject negative or inappropriate/price content.
Snippets: highlight sentences in reviews.
Keep score, sentiment, reason codes, snippets, product flaws, intelligence data, and the <Key> in the DW.
Move content to a readily available <Key, Value> store.
Query top-score contributors, then use the <Key> to pull content in real time (see the sketch below).
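A sketch of the DW half of that split, assuming an illustrative contributor table holding per-contributor counts and the <Key> that points into the content store:

-- Rank contributors by score inside the warehouse; only the top
-- rows' keys are then used to fetch the full content from the
-- <Key, Value> store in real time.
SELECT contributorid,
       submissions + helpful_votes AS score,
       content_key
FROM contributor
ORDER BY score DESC
LIMIT 100;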
Brisk by DataStax (buy local)
Cassandra
First developed by Facebook
SuperColumns can turn a simple key-value architecture into one that handles sorted lists, based on an index specified by the user
Can scale from one node to several thousand nodes clustered in different data centers
Can be tuned for more consistency or availability
Smooth node replacement if one goes down
Outline
Background
Big Data Challenge: mySQL vs myNoSQL
Data Intensive Computing: online vs offline
Big Data Solution: schema changes, analytics approach changes
Results
Conclusions
Greenplum Bulk Loader
gpload -f load_reviewer.yml -q -l ./gpload_reviewer.log
Load Times
Path to Conversion
Attribution Credit
X Insurance: Historical Data Load Times
Weekly reports: 20 minutes
Attribution reports (pre-aggregated by 4 dimensions and deployed to the GUI)
Reach & frequency reports
Campaign optimization: ANN + non-linear optimization; allocate budget across different sites + placements to maximize conversions
Attribution by Creative Size
INSIGHT: top-ranked creative sizes show a high propensity to convert, a high number of conversions, low cost, and high revenue
Overlap Report
Actionable: cookie-sync with "Turn" cookies; use as a block list to prevent reaching the same cookies twice
Outline
Background
Big Data Challenge: mySQL vs myNoSQL
Data Intensive Computing: online vs offline
Big Data Solution: schema changes, analytics approach changes
Results
Conclusions
Common Problems (ALL)
Data quality: discovery stats before load & after load; establish baselines and use them for validation.
Performance: growth rates and loading windows (low-space triggers); latency of online queries; latency of offline queries.
Agility of schema to change.
Under-estimated value of metadata design: integration, self-documentation, self-governance.
Common Problems (Internet Advertising)
Bad data: well-defined customer data requirements; context: campaign, site, placement; IP: invalids, GEO & demographics; cost; revenue.
Common cookie across the different data sources: have resigned in some cases to IP & user agent (browser & language).
Only aggregate data.
Black-out periods and the agility to roll over new quarters.
Greenplum: Lessons Learned
Adding new nodes: expanded the cluster from 4 to 8 nodes; the redistribution tool failed (duplicate rowids in multiple nodes), so we had to re-load.
Single point of failure: the GPMASTER node is a single point of failure; all slave nodes are mirrored and failed segments can be recovered.
Read scale: network bandwidth. Write scale: network bandwidth, hard-disk space (dead in the water at 80% use on GPMASTER), IT resources.
Full product support: have been down for two weeks at a time.
Open source: healthy community.
Technology works well with others: PostgreSQL, pgAdmin, Pentaho, Talend Studio, etc.
Learning curve: NONE; everyone knew SQL.
Deployment: had several initial install issues, but deploying new clients is automated using Python and SQL.
Data: append-only, columnar orientation; load balance via DISTRIBUTED BY; no indexing, but PARTITION BY; agility of schema to change.
Content: evaluating Cassandra & Brisk; <Key, Value> store for content; score, sentiment, reason codes + <Key> to the DW.
Real-time: leverage data mining models (PMML + store in the DW); use stored models to identify changes in patterns; recommendations.
Q & A

48 GB RAM

Mapping Customer Data to AA Schema