1. Big Data Ad Analytics. Nena Marín, Ph.D., Principal Scientist & Director of Research
2. Outline: Background; Big Data Challenge: MySQL vs. NoSQL; Data Intensive Computing: online vs. offline; Big Data Solution; Schema changes; Analytics approach changes; Results; Conclusions
3. Big Data Problem: Overwhelming Amounts of Data & Growing
Data exceeding 100 GB/TB/PBs, both unstructured data/content and structured data.
Structured data: warehouse for a tri-state electric and gas utility service (Oracle OWB + Cognos); 346,000 electric and 306,000 natural gas customers (2007).
Semi- & unstructured data (Adometry, 2010-now): impression & click-stream data; scale: ~3B impressions per day; Page & Ad Tag customers; Ad Server Log File customers; growth rates: 2-9% per month.
Unstructured content: email (Adometry: Cross-Channel AA); sensor data (HALO project: 8.5B particles, 2009-2010).
7. Big Data Analysis
Off-line (batch): Ad Analytics Warehouse; HALO project: Terascale Astronomical Dataset on 240 compute nodes of the Longhorn cluster (ref1); Hierarchical Density Shaving clustering algorithm; Dataflow clustering algorithm on Hadoop (128 compute cores, Spur cluster @ UT/TACC).
On-line (realtime): Netflix Recommender: 100 million ratings in 16.31 minutes; effective = 9.6 microseconds per rating (ref2).
REF1: http://www.computer.org/portal/web/csdl/doi/10.1109/ICDMW.2010.26
REF2: http://kdd09.crowdvine.com/talks/4963 (KDD 2009)
8. Outline: Background; Big Data Challenge: MySQL vs. NoSQL; Data Intensive Computing: online vs. offline; Big Data Solution; Schema changes; Analytics approach changes; Results; Conclusions
9. Big Data Solution Questions
How will we add new nodes (grow)?
Any single points of failure?
Do the writes scale as well?
How much administration will the system require?
Implementation learning curve?
If it's open source, is there a healthy community?
How much time and effort would we have to expend to deploy and integrate it?
Does it use technology we know we can work with? Integration tools, presentation tools, analytics (data mining) tools.
11. My two cents
30 billion rows per month puts you in VLDB territory, so you need partitioning.
The low-cardinality dimensions would also suggest that bitmap indexes would be a performance win.
Column store: most aggregates are by columns.
Agility to update schema: add columns.
QuickLZ compression on partitions and tables.
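Bitmap indexes pay off on low-cardinality columns because each distinct value is stored as one bitmask over all rows, so a multi-predicate filter reduces to cheap bitwise ANDs. A toy sketch of the idea in Python (illustrative only; real bitmap indexes are compressed, this is not Greenplum's implementation):

```python
# Toy bitmap index: one integer bitmask per distinct value of a column.

def build_bitmap_index(rows, column):
    """Map each distinct value to a bitmask with bit i set if row i matches."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index

rows = [
    {"status": "approved", "channel": "email"},
    {"status": "submitted", "channel": "web"},
    {"status": "approved", "channel": "web"},
    {"status": "rejected", "channel": "email"},
]
status_idx = build_bitmap_index(rows, "status")
channel_idx = build_bitmap_index(rows, "channel")

# WHERE status = 'approved' AND channel = 'web' -> one bitwise AND
match = status_idx["approved"] & channel_idx["web"]
hits = [i for i in range(len(rows)) if match >> i & 1]
print(hits)  # [2]
```

With only a handful of distinct values per column, the whole index is a few machine words per value, which is why low cardinality is the sweet spot.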
12. Columnar Database Review Table
WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=quicklz, OIDS=FALSE)
DISTRIBUTED BY (id)
PARTITION BY RANGE (productid)
  SUBPARTITION BY RANGE (submissiontime)
  SUBPARTITION BY LIST (status)
(
  PARTITION clnt100002 START (100002) END (100003) EVERY (1)
    WITH (appendonly=true, compresstype=quicklz, orientation=column)
  (
    START ('2011-05-01 00:00:00'::timestamp without time zone)
    END ('2011-08-01 00:00:00'::timestamp without time zone)
    EVERY ('1 day'::interval)
    WITH (appendonly=true, compresstype=quicklz, orientation=column)
    (
      SUBPARTITION subm VALUES ('submitted') WITH (appendonly=true, compresstype=quicklz, orientation=column),
      SUBPARTITION appr VALUES ('approved') WITH (appendonly=true, compresstype=quicklz, orientation=column),
      DEFAULT SUBPARTITION other WITH (appendonly=true, compresstype=quicklz, orientation=column)
    )
  )
)
13. Why Partitioning?
Because the table is partitioned by date, a query with WHERE submissiondate >= ... AND submissiondate <= ... will not scan the partitions outside the date range, so they do not impact performance.
If a partition is no longer needed, we can create a new table with the content of the partition and drop the partition.
We can recreate partitions with the "compression" and "append only" options to save disk space and IO bandwidth.
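The pruning behaviour described above can be sketched in a few lines: the planner keeps each partition's date bounds as metadata and scans only the partitions that intersect the query's range. A minimal sketch of the idea (hypothetical names; not Greenplum's planner):

```python
from datetime import date

# Each daily partition carries its date range as metadata.
partitions = [
    {"name": f"p_2011_05_{d:02d}", "start": date(2011, 5, d), "end": date(2011, 5, d)}
    for d in range(1, 32)
]

def prune(parts, lo, hi):
    """Return only the partitions whose date range intersects [lo, hi]."""
    return [p for p in parts if p["end"] >= lo and p["start"] <= hi]

# WHERE submissiontime >= '2011-05-10' AND submissiontime <= '2011-05-12'
scanned = prune(partitions, date(2011, 5, 10), date(2011, 5, 12))
print(len(scanned), "of", len(partitions), "partitions scanned")  # 3 of 31
```

A three-day query touches 3 of 31 daily partitions; the other 28 never hit disk, which is the whole win.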
15. Outline: Background; Big Data Challenge: MySQL vs. NoSQL; Data Intensive Computing: online vs. offline; Big Data Solution; Schema changes; Analytics approach changes; Results; Conclusions
16. Question 1: Top rated products (filtered by brand or category)
MDX:
WITH SET [TCat] AS TopCount([Product].[Subcategory].[Subcategory], 10, [Measures].[Rating])
MEMBER [Product].[Subcategory].[Other] AS Aggregate([Product].[Subcategory].[Subcategory] - TCat)
SELECT { [Measures].[Rating] } ON COLUMNS, TCat + [Other] ON ROWS
FROM [DW_PRODUCTS]
SQL query:
select top X rating, count(id) from fact where brand = x and category = y group by rating order by count(id) desc
19. [Figure: raw data, after row clustering, after column clustering; iterate until convergence to K by L co-clusters of users (rows) and movies (columns) by ratings]
20. From the training: K by L co-cluster average ratings, average ratings per user, and average ratings per movie. Global average rating is 3.68.
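These trained averages are typically combined into a prediction by starting from the co-cluster average and adding the user's and movie's offsets from their cluster means, as in co-clustering recommenders such as George & Merugu's. The exact formula used here is an assumption, and every number below is invented for illustration:

```python
# Co-clustering prediction sketch:
#   r_hat = cocluster_avg + (user_avg - row_cluster_avg)
#                         + (movie_avg - col_cluster_avg)
# All averages are illustrative, not from the Netflix model.

def predict(cocluster_avg, user_avg, row_cluster_avg, movie_avg, col_cluster_avg):
    return cocluster_avg + (user_avg - row_cluster_avg) + (movie_avg - col_cluster_avg)

r = predict(
    cocluster_avg=3.9,     # average rating in the user's x movie's co-cluster
    user_avg=4.2,          # this user's average rating
    row_cluster_avg=4.0,   # average over the user's row cluster
    movie_avg=3.5,         # this movie's average rating
    col_cluster_avg=3.7,   # average over the movie's column cluster
)
print(round(r, 2))  # 3.9 + 0.2 - 0.2 = 3.9
```

Prediction is just three lookups and two additions, which is what makes the real-time microsecond-scale serving on the next slide plausible.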
32. Question 1: Top rated products
Build a recommender data mining model based on co-clustering customers & ratings.
Training runtime for 100,480,507 ratings: 16.31 minutes.
Apply the recommender in real time; effective prediction runtime: 9.738615 μs per rating.
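The "effective" figure appears to be total runtime divided by the rating count: 100,480,507 ratings in 16.31 minutes works out to roughly 9.74 μs per rating, consistent with the quoted value:

```python
# Effective per-rating time = total runtime / number of ratings.
ratings = 100_480_507
minutes = 16.31
us_per_rating = minutes * 60 / ratings * 1e6  # seconds -> microseconds
print(round(us_per_rating, 2))  # ~9.74
```

The small gap from the quoted 9.738615 μs comes only from the rounding of 16.31 minutes.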
33. Question 2: Fastest Rising Products
Method 1: store the co-clustering model in the DW; identify when products move from one cluster to another.
Method 2: product ratings distribution. Bin the ratings and establish a distribution baseline for each product: μ, σ. When μ, σ change beyond a certain threshold, identify the movers/shakers & biggest losers.
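Method 2 can be sketched as a simple baseline-vs-current comparison: keep μ and σ per product, then flag a product when the new window's mean drifts beyond a threshold. The threshold and data below are invented for illustration:

```python
import statistics

def find_movers(baselines, current_ratings, threshold=0.5):
    """Flag products whose current mean rating drifts more than `threshold`
    from the stored baseline mean (a sketch of Method 2, not production code)."""
    movers = []
    for product, ratings in current_ratings.items():
        mu, sigma = baselines[product]          # stored baseline (mu, sigma)
        delta = statistics.mean(ratings) - mu
        if abs(delta) > threshold:
            movers.append((product, round(delta, 2)))
    return movers

baselines = {"A": (3.2, 0.8), "B": (4.1, 0.5)}           # per-product mu, sigma
current = {"A": [4.0, 4.5, 4.2], "B": [4.0, 4.2, 4.1]}   # latest window
print(find_movers(baselines, current))  # product A rose ~1.0; B is stable
```

A symmetric threshold catches both the movers/shakers (positive drift) and the biggest losers (negative drift); σ is carried along so the cutoff could instead be expressed in standard deviations.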
34. Question 3: Category Level Stats
Category-level statistics (including roll-ups) such as avg rating, content volume, etc. Average = Sum(X)/n.
Greenplum OLAP grouping extensions: CUBE, ROLLUP, GROUPING SETS.
SELECT productcategory, productid, sum(rating)/count(*)
FROM review
GROUP BY ROLLUP(productcategory, productid)
ORDER BY 1,2,3;
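ROLLUP emits one aggregate row per grouping level: (category, product), (category), and the grand total. To make the levels explicit, the same result can be sketched in plain Python on toy data:

```python
from collections import defaultdict

reviews = [  # (category, product, rating) -- toy data
    ("tv", "tv-1", 4), ("tv", "tv-1", 5), ("tv", "tv-2", 3),
    ("audio", "amp-1", 5),
]

def rollup(rows):
    """Average rating at the (category, product), (category,), and () levels,
    mirroring GROUP BY ROLLUP(productcategory, productid)."""
    sums = defaultdict(lambda: [0, 0])  # key -> [sum, count]
    for cat, prod, r in rows:
        for key in [(cat, prod), (cat,), ()]:   # every rollup level
            sums[key][0] += r
            sums[key][1] += 1
    return {k: s / n for k, (s, n) in sums.items()}

avgs = rollup(reviews)
print(avgs[("tv", "tv-1")], avgs[("tv",)], avgs[()])  # 4.5 4.0 4.25
```

In the warehouse the database does this in one pass over the fact table; the dict keys here correspond exactly to the extra grouping rows ROLLUP appends to the result set.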
35. Question 4: Top Contributors
Top contributors: score = content submissions + helpfulness votes.
Tag content: approve positive reviews; reject negative or inappropriate/price content.
Snippets: highlight sentences in reviews.
Keep score, sentiment, reason codes, snippets, product flaws, intelligence data, and <Key> in the DW.
Move content to a readily available <Key, Value> store.
Query top-score contributors, then use <Key> to pull content in real time.
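The split described here keeps scores and keys in the warehouse for fast ranking, while full content lives in a key-value store and is fetched only for the winners. A toy sketch with two dicts standing in for the DW and the KV store (all names and data invented):

```python
# DW rows hold (contributor, score, content_key); the KV store holds the blobs.
dw_rows = [
    ("alice", 120, "k1"), ("bob", 95, "k2"), ("carol", 240, "k3"),
]
kv_store = {"k1": "review text 1", "k2": "review text 2", "k3": "review text 3"}

def top_contributors(n):
    """Rank in the 'DW', then pull content from the 'KV store' by key."""
    ranked = sorted(dw_rows, key=lambda r: r[1], reverse=True)[:n]
    return [(user, score, kv_store[key]) for user, score, key in ranked]

print(top_contributors(2))  # carol (240) first, then alice (120)
```

Only the top-n keys ever touch the content store, so the expensive blobs stay out of the warehouse scan path entirely.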
37. Cassandra
First developed at Facebook.
SuperColumns can turn a simple key-value architecture into one that handles sorted lists, based on an index specified by the user.
Can scale from one node to several thousand nodes clustered in different data centers.
Can be tuned for more consistency or availability.
Smooth node replacement if one goes down.
38. Outline: Background; Big Data Challenge: MySQL vs. NoSQL; Data Intensive Computing: online vs. offline; Big Data Solution; Schema changes; Analytics approach changes; Results; Conclusions
44. X Insurance: Historical Data
Load times; weekly reports: 20 minutes.
Attribution reports (pre-aggregated by 4 dimensions and deployed to the GUI).
Reach & frequency reports.
Campaign optimization: ANN + non-linear optimization to allocate budget across different sites + placements to maximize conversions.
45. Attribute by Creative Size
INSIGHT: top-ranked creative sizes show a high propensity to convert, a high number of conversions, low cost, and high revenue.
46. Overlap Report
Actionable: cookie-sync with "Turn" cookies; use as a block list to prevent reaching the same cookies twice.
48. Outline: Background; Big Data Challenge: MySQL vs. NoSQL; Data Intensive Computing: online vs. offline; Big Data Solution; Schema changes; Analytics approach changes; Results; Conclusions
49. Common Problems (ALL)
Data quality: discovery stats before load & after load; establish baselines and use them for validation.
Performance: growth rates and loading windows (low-space triggers); latency of online queries; latency of offline queries.
Agility of schema to change.
Under-estimating the value of metadata design: integration, self-documentation, self-governance.
50. Common Problems (Internet Advertising)
Bad data: well-defined customer data requirements; context (campaign, site, placement); IP: invalids, geo & demographics; cost; revenue.
Common cookie across the different data sources: have resigned in some cases to IP & user agent (browser & language).
Only aggregate data.
Black-out periods and agility to roll over new quarters.
51. Greenplum: Lessons Learned
Adding new nodes: expanded the cluster from 4 to 8 nodes; the redistribution tool failed (duplicate rowids on multiple nodes) and we had to re-load.
Single point of failure: the GPMASTER node is a single point of failure; all slave nodes are mirrored, and failed segments can be recovered.
Read scale: network bandwidth.
Write scale: network bandwidth; hard disk space (dead in the water at 80% use on GPMaster); IT resources.
Full product support: have been down for two weeks at a time.
Open source: healthy community.
Technology works well with others: PostgreSQL, pgAdmin, Pentaho, Talend Studio, etc.
Learning curve: none; everyone knew SQL.
Deployment: had several initial install issues, but deploying new clients is automated using Python and SQL.
52. Data
Append-only, columnar orientation.
Load balance via DISTRIBUTED BY; no indexing, but partitioning.
Agility of schema to change.
Content: evaluating Cassandra & Brisk as a <Key, Value> store for content; score, sentiment, reason codes, + <Key> to the DW.
Real-time: leverage data mining models (PMML + store in DW); use stored models to identify changes in patterns; recommendations.