1. Big Data Ad Analytics. Nena Marín, Ph.D., Principal Scientist & Director of Research
2. Outline: Background; Big Data Challenge: MySQL vs. NoSQL; Data Intensive Computing: online vs. offline; Big Data Solution; Schema changes; Analytics approach changes; Results; Conclusions
3. Big Data Problem: Overwhelming Amounts of Data & Growing
Data exceeding 100 GB/TB/PBs, both unstructured data/content and structured data.
Structured data: warehouse for a tri-state electric and gas utility service (Oracle OWB + Cognos); 346,000 electric and 306,000 natural gas customers (2007).
Semi- & unstructured data (Adometry, 2010-now): impression & click-stream data; scale: ~3B impressions per day; Page & Ad Tag customers; Ad Server Log File customers; growth rates: 2-9% per month.
Unstructured content: email (Adometry: Cross-Channel AA); sensor data (HALO project: 8.5B particles, 2009-2010).
7. Big Data Analysis
Off-line (batch): Ad Analytics Warehouse; HALO project: Terascale Astronomical Dataset on 240 compute nodes of the Longhorn cluster (ref1); Hierarchical Density Shaving clustering algorithm; Dataflow clustering algorithm on Hadoop (128 compute cores, Spur cluster @ UT/TACC).
On-line (realtime): Netflix Recommender: 100 million ratings in 16.31 minutes; effective = 9.6 microseconds per rating (ref2).
REF1: http://www.computer.org/portal/web/csdl/doi/10.1109/ICDMW.2010.26
REF2: http://kdd09.crowdvine.com/talks/4963 (KDD 2009)
8. Outline: Background; Big Data Challenge: MySQL vs. NoSQL; Data Intensive Computing: online vs. offline; Big Data Solution; Schema changes; Analytics approach changes; Results; Conclusions
9. Big Data Solution Questions
How will we add new nodes (grow)?
Any single points of failure?
Do the writes scale as well?
How much administration will the system require?
Implementation learning curve?
If it's open source, is there a healthy community?
How much time and effort would we have to expend to deploy and integrate it?
Does it use technology we know we can work with? Integration tools, presentation tools, analytics (data mining) tools.
11. My two cents
30 billion rows per month puts you in VLDB territory, so you need partitioning.
The low-cardinality dimensions would also suggest that bitmap indexes would be a performance win.
Column store: most aggregates are by columns.
Agility to update schema: add columns.
QuickLZ compression on partitions and tables.
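Bitmap indexes pay off on low-cardinality columns because each distinct value is stored as one bitmask over all rows, so a multi-predicate filter reduces to cheap bitwise ANDs. A toy sketch of the idea in Python (illustrative only; real bitmap indexes are compressed, this is not Greenplum's implementation):

```python
# Toy bitmap index: one integer bitmask per distinct value of a column.

def build_bitmap_index(rows, column):
    """Map each distinct value to a bitmask with bit i set if row i matches."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index

rows = [
    {"status": "approved", "channel": "email"},
    {"status": "submitted", "channel": "web"},
    {"status": "approved", "channel": "web"},
    {"status": "rejected", "channel": "email"},
]
status_idx = build_bitmap_index(rows, "status")
channel_idx = build_bitmap_index(rows, "channel")

# WHERE status = 'approved' AND channel = 'web' -> one bitwise AND
match = status_idx["approved"] & channel_idx["web"]
hits = [i for i in range(len(rows)) if match >> i & 1]
print(hits)  # [2]
```

With only a handful of distinct values per column, the whole index is a few machine words per value, which is why low cardinality is the sweet spot.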
12. Columnar Database Review Table
WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=quicklz, OIDS=FALSE)
DISTRIBUTED BY (id)
PARTITION BY RANGE (productid)
  SUBPARTITION BY RANGE (submissiontime)
  SUBPARTITION BY LIST (status)
(
  PARTITION clnt100002 START (100002) END (100003) EVERY (1)
    WITH (appendonly=true, compresstype=quicklz, orientation=column)
  (
    START ('2011-05-01 00:00:00'::timestamp without time zone)
    END ('2011-08-01 00:00:00'::timestamp without time zone)
    EVERY ('1 day'::interval)
    WITH (appendonly=true, compresstype=quicklz, orientation=column)
    (
      SUBPARTITION subm VALUES ('submitted') WITH (appendonly=true, compresstype=quicklz, orientation=column),
      SUBPARTITION appr VALUES ('approved') WITH (appendonly=true, compresstype=quicklz, orientation=column),
      DEFAULT SUBPARTITION other WITH (appendonly=true, compresstype=quicklz, orientation=column)
    )
  )
)
13. Why Partitioning?
Because the table is partitioned by date, a query with WHERE submissiondate >= ... AND submissiondate <= ... will not scan the partitions outside the date range, so they do not impact performance.
If a partition is no longer needed, we can create a new table with the content of the partition and drop the partition.
We can recreate partitions with the "compression" and "append only" options to save disk space and IO bandwidth.
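The pruning behaviour described above can be sketched in a few lines: the planner keeps each partition's date bounds as metadata and scans only the partitions that intersect the query's range. A minimal sketch of the idea (hypothetical names; not Greenplum's planner):

```python
from datetime import date

# Each daily partition carries its date range as metadata.
partitions = [
    {"name": f"p_2011_05_{d:02d}", "start": date(2011, 5, d), "end": date(2011, 5, d)}
    for d in range(1, 32)
]

def prune(parts, lo, hi):
    """Return only the partitions whose date range intersects [lo, hi]."""
    return [p for p in parts if p["end"] >= lo and p["start"] <= hi]

# WHERE submissiontime >= '2011-05-10' AND submissiontime <= '2011-05-12'
scanned = prune(partitions, date(2011, 5, 10), date(2011, 5, 12))
print(len(scanned), "of", len(partitions), "partitions scanned")  # 3 of 31
```

A three-day query touches 3 of 31 daily partitions; the other 28 never hit disk, which is the whole win.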
15. Outline: Background; Big Data Challenge: MySQL vs. NoSQL; Data Intensive Computing: online vs. offline; Big Data Solution; Schema changes; Analytics approach changes; Results; Conclusions
16. Question 1: Top rated products (filtered by brand or category)
MDX:
WITH SET [TCat] AS TopCount([Product].[Subcategory].[Subcategory], 10, [Measures].[Rating])
MEMBER [Product].[Subcategory].[Other] AS Aggregate([Product].[Subcategory].[Subcategory] - TCat)
SELECT { [Measures].[Rating] } ON COLUMNS, TCat + [Other] ON ROWS
FROM [DW_PRODUCTS]
SQL query:
select top X rating, count(id) from fact where brand = x and category = y group by rating order by count(id) desc
19. [Figure: raw data, after row clustering, after column clustering; iterate until convergence to K by L co-clusters of users (rows) and movies (columns) by ratings]
20. From the training: K by L co-cluster average ratings, average ratings per user, and average ratings per movie. Global average rating is 3.68.
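These trained averages are typically combined into a prediction by starting from the co-cluster average and adding the user's and movie's offsets from their cluster means, as in co-clustering recommenders such as George & Merugu's. The exact formula used here is an assumption, and every number below is invented for illustration:

```python
# Co-clustering prediction sketch:
#   r_hat = cocluster_avg + (user_avg - row_cluster_avg)
#                         + (movie_avg - col_cluster_avg)
# All averages are illustrative, not from the Netflix model.

def predict(cocluster_avg, user_avg, row_cluster_avg, movie_avg, col_cluster_avg):
    return cocluster_avg + (user_avg - row_cluster_avg) + (movie_avg - col_cluster_avg)

r = predict(
    cocluster_avg=3.9,     # average rating in the user's x movie's co-cluster
    user_avg=4.2,          # this user's average rating
    row_cluster_avg=4.0,   # average over the user's row cluster
    movie_avg=3.5,         # this movie's average rating
    col_cluster_avg=3.7,   # average over the movie's column cluster
)
print(round(r, 2))  # 3.9 + 0.2 - 0.2 = 3.9
```

Prediction is just three lookups and two additions, which is what makes the real-time microsecond-scale serving on the next slide plausible.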
32. Question 1: Top rated products
Build a recommender data mining model based on co-clustering customers & ratings.
Training runtime for 100,480,507 ratings: 16.31 minutes.
Apply the recommender in real time; effective prediction runtime: 9.738615 μs per rating.
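The "effective" figure appears to be total runtime divided by the rating count: 100,480,507 ratings in 16.31 minutes works out to roughly 9.74 μs per rating, consistent with the quoted value:

```python
# Effective per-rating time = total runtime / number of ratings.
ratings = 100_480_507
minutes = 16.31
us_per_rating = minutes * 60 / ratings * 1e6  # seconds -> microseconds
print(round(us_per_rating, 2))  # ~9.74
```

The small gap from the quoted 9.738615 μs comes only from the rounding of 16.31 minutes.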
33. Question 2: Fastest Rising Products
Method 1: store the co-clustering model in the DW; identify when products move from one cluster to another.
Method 2: product ratings distribution. Bin the ratings and establish a distribution baseline for each product: μ, σ. When μ, σ change beyond a certain threshold, identify the movers/shakers & biggest losers.
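Method 2 can be sketched as a simple baseline-vs-current comparison: keep μ and σ per product, then flag a product when the new window's mean drifts beyond a threshold. The threshold and data below are invented for illustration:

```python
import statistics

def find_movers(baselines, current_ratings, threshold=0.5):
    """Flag products whose current mean rating drifts more than `threshold`
    from the stored baseline mean (a sketch of Method 2, not production code)."""
    movers = []
    for product, ratings in current_ratings.items():
        mu, sigma = baselines[product]          # stored baseline (mu, sigma)
        delta = statistics.mean(ratings) - mu
        if abs(delta) > threshold:
            movers.append((product, round(delta, 2)))
    return movers

baselines = {"A": (3.2, 0.8), "B": (4.1, 0.5)}           # per-product mu, sigma
current = {"A": [4.0, 4.5, 4.2], "B": [4.0, 4.2, 4.1]}   # latest window
print(find_movers(baselines, current))  # product A rose ~1.0; B is stable
```

A symmetric threshold catches both the movers/shakers (positive drift) and the biggest losers (negative drift); σ is carried along so the cutoff could instead be expressed in standard deviations.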
34. Question 3: Category Level Stats
Category-level statistics (including roll-ups) such as avg rating, content volume, etc. Average = Sum(X)/n.
Greenplum OLAP grouping extensions: CUBE, ROLLUP, GROUPING SETS.
SELECT productcategory, productid, sum(rating)/count(*)
FROM review
GROUP BY ROLLUP(productcategory, productid)
ORDER BY 1,2,3;
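ROLLUP emits one aggregate row per grouping level: (category, product), (category), and the grand total. To make the levels explicit, the same result can be sketched in plain Python on toy data:

```python
from collections import defaultdict

reviews = [  # (category, product, rating) -- toy data
    ("tv", "tv-1", 4), ("tv", "tv-1", 5), ("tv", "tv-2", 3),
    ("audio", "amp-1", 5),
]

def rollup(rows):
    """Average rating at the (category, product), (category,), and () levels,
    mirroring GROUP BY ROLLUP(productcategory, productid)."""
    sums = defaultdict(lambda: [0, 0])  # key -> [sum, count]
    for cat, prod, r in rows:
        for key in [(cat, prod), (cat,), ()]:   # every rollup level
            sums[key][0] += r
            sums[key][1] += 1
    return {k: s / n for k, (s, n) in sums.items()}

avgs = rollup(reviews)
print(avgs[("tv", "tv-1")], avgs[("tv",)], avgs[()])  # 4.5 4.0 4.25
```

In the warehouse the database does this in one pass over the fact table; the dict keys here correspond exactly to the extra grouping rows ROLLUP appends to the result set.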
35. Question 4: Top Contributors
Top contributors: score = content submissions + helpfulness votes.
Tag content: approve positive reviews; reject negative or inappropriate/price content.
Snippets: highlight sentences in reviews.
Keep score, sentiment, reason codes, snippets, product flaws, intelligence data, and <Key> in the DW.
Move content to a readily available <Key, Value> store.
Query top-score contributors, then use <Key> to pull content in real time.
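The split described here keeps scores and keys in the warehouse for fast ranking, while full content lives in a key-value store and is fetched only for the winners. A toy sketch with two dicts standing in for the DW and the KV store (all names and data invented):

```python
# DW rows hold (contributor, score, content_key); the KV store holds the blobs.
dw_rows = [
    ("alice", 120, "k1"), ("bob", 95, "k2"), ("carol", 240, "k3"),
]
kv_store = {"k1": "review text 1", "k2": "review text 2", "k3": "review text 3"}

def top_contributors(n):
    """Rank in the 'DW', then pull content from the 'KV store' by key."""
    ranked = sorted(dw_rows, key=lambda r: r[1], reverse=True)[:n]
    return [(user, score, kv_store[key]) for user, score, key in ranked]

print(top_contributors(2))  # carol (240) first, then alice (120)
```

Only the top-n keys ever touch the content store, so the expensive blobs stay out of the warehouse scan path entirely.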
37. Cassandra
First developed at Facebook.
SuperColumns can turn a simple key-value architecture into one that handles sorted lists, based on an index specified by the user.
Can scale from one node to several thousand nodes clustered in different data centers.
Can be tuned for more consistency or availability.
Smooth node replacement if one goes down.
38. Outline: Background; Big Data Challenge: MySQL vs. NoSQL; Data Intensive Computing: online vs. offline; Big Data Solution; Schema changes; Analytics approach changes; Results; Conclusions
44. X Insurance: Historical Data
Load times; weekly reports: 20 minutes.
Attribution reports (pre-aggregated by 4 dimensions and deployed to the GUI).
Reach & frequency reports.
Campaign optimization: ANN + non-linear optimization to allocate budget across different sites + placements to maximize conversions.
45. Attribute by Creative Size
INSIGHT: top-ranked creative sizes show a high propensity to convert, a high number of conversions, low cost, and high revenue.
46. Overlap Report
Actionable: cookie-sync with "Turn" cookies; use as a block list to prevent reaching the same cookies twice.
48. Outline: Background; Big Data Challenge: MySQL vs. NoSQL; Data Intensive Computing: online vs. offline; Big Data Solution; Schema changes; Analytics approach changes; Results; Conclusions
49. Common Problems (ALL)
Data quality: discovery stats before load & after load; establish baselines and use them for validation.
Performance: growth rates and loading windows (low-space triggers); latency of online queries; latency of offline queries.
Agility of schema to change.
Under-estimating the value of metadata design: integration, self-documentation, self-governance.
50. Common Problems (Internet Advertising)
Bad data: well-defined customer data requirements; context (campaign, site, placement); IP: invalids, geo & demographics; cost; revenue.
Common cookie across the different data sources: have resigned in some cases to IP & user agent (browser & language).
Only aggregate data.
Black-out periods and agility to roll over new quarters.
51. Greenplum: Lessons Learned
Adding new nodes: expanded the cluster from 4 to 8 nodes; the redistribution tool failed (duplicate rowids on multiple nodes) and we had to re-load.
Single point of failure: the GPMASTER node is a single point of failure; all slave nodes are mirrored, and failed segments can be recovered.
Read scale: network bandwidth.
Write scale: network bandwidth; hard disk space (dead in the water at 80% use on GPMaster); IT resources.
Full product support: have been down for two weeks at a time.
Open source: healthy community.
Technology works well with others: PostgreSQL, pgAdmin, Pentaho, Talend Studio, etc.
Learning curve: none; everyone knew SQL.
Deployment: had several initial install issues, but deploying new clients is automated using Python and SQL.
52. Data
Append-only, columnar orientation.
Load balance via DISTRIBUTED BY; no indexing, but partitioning.
Agility of schema to change.
Content: evaluating Cassandra & Brisk as a <Key, Value> store for content; score, sentiment, reason codes, + <Key> to the DW.
Real-time: leverage data mining models (PMML + store in DW); use stored models to identify changes in patterns; recommendations.