Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kaloferov, Edmunds)

Presentation at Spark Summit 2015



  1. Building a Data Warehouse for Business Analytics using Spark SQL
      Blagoy Kaloferov, Software Engineer, 06/15/2015
      Copyright Edmunds.com, Inc. (the “Company”). Edmunds and the Edmunds.com logo are registered trademarks of the Company. This document contains proprietary and/or confidential information of the Company. No part of this document or the information it contains may be used, or disclosed to any person or entity, for any purpose other than advancing the best interests of the Company, and any such disclosure requires the express approval of the Company.
  2. Today’s talk
      About me: Blagoy Kaloferov, Big Data Software Engineer
      About my company: Edmunds.com is a car buying platform with 18M+ unique visitors each month
  3. “It’s about time!”
      Agenda
      1.  Introduction and Architecture
      2.  Building Next Gen DWH
      3.  Automating Ad Revenue using Spark SQL
      4.  Conclusion
  4. Business Analytics at Edmunds.com
      Divisions:
      o  DWH Engineers
      o  Business Analysts
      o  Statistics team
      Two major groups:
      o  Map Reduce / Spark developers
      o  Analysts with advanced SQL skills
  5. Proposition
      Spark SQL:
      •  Simplified ETL and enhanced Visualization tools
      •  Allows anyone in BA to quickly build new Data marts
      •  Enabled a scalable POC to Production process for our projects
  6. Architecture for Analytics (diagram)
      Data Ingestions / ETL
      Raw Data: Clickstream, Inventory, Dealer, Lead, Transaction
      Hadoop Cluster: HDFS aggregates, Map Reduce jobs
      DWH Developers, Business Analyst
  7. Architecture for Analytics (diagram, extended)
      Data Ingestions / ETL, Reporting, Ad Hoc / Dashboards
      Raw Data: Clickstream, Inventory, Dealer, Lead, Transaction
      Hadoop Cluster: HDFS aggregates, Map Reduce jobs
      Database: Redshift
      Business Intelligence Tools: Platfora, Tableau
      DWH Developers, Business Analysts
  8. Architecture for Analytics (diagram): as slide 7, plus Spark, Spark SQL, Databricks
  9. Architecture for Analytics (diagram): as slide 7, plus Spark, Spark SQL
  10. Architecture for Analytics (diagram): as slide 7, plus Spark, Spark SQL, ETL
  11. Exposing S3 data via Spark SQL
      Our approach:
      o  Spark SQL tables similar to our existing Redshift tables
      o  The best fit for us is Hive tables pointing to S3 delimited data
      o  Exposed hundreds of Spark SQL tables
  12. Adding Latest Table Partitions
      •  S3 datasets have thousands of directories: ( location /year/month/day/hour )
      •  Every new S3 directory for each dataset has to be registered
      Diagram labels: S3 dataset (repeated), 2015/05/31/01, 2015/05/31/02, Spark SQL tables, Spark SQL table partitions
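
The registration step described on this slide can be sketched as generated HiveQL. This is a minimal illustration, not the speaker's utility: the partition column names (`year`, `month`, `day`, `hour`), the table name `clickstream`, and the bucket `s3://example-bucket` are all assumptions, and a real job would presumably list S3 to find which directories actually exist instead of enumerating hours.

```python
from datetime import datetime, timedelta


def partition_ddl(table, base_location, hour):
    """Build one HiveQL statement that registers a single hourly directory."""
    # Assumed partition columns: year, month, day, hour.
    spec = (f"year='{hour:%Y}', month='{hour:%m}', "
            f"day='{hour:%d}', hour='{hour:%H}'")
    location = f"{base_location.rstrip('/')}/{hour:%Y/%m/%d/%H}"
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({spec}) LOCATION '{location}'")


def hourly_partitions(table, base_location, start, end):
    """Yield one statement per hour in the inclusive range [start, end]."""
    hour = start
    while hour <= end:
        yield partition_ddl(table, base_location, hour)
        hour += timedelta(hours=1)


# Hypothetical table and bucket; the two hours match the slide's example paths.
statements = list(hourly_partitions(
    "clickstream", "s3://example-bucket/clickstream",
    datetime(2015, 5, 31, 1), datetime(2015, 5, 31, 2)))

for statement in statements:
    print(statement)
```

`ADD IF NOT EXISTS` makes the statements safe to re-run, which suits a scheduled job that may see the same directory more than once.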
  13. Adding Latest Table Partitions
      Utilities for Spark SQL tables and S3:
      •  Register valid partitions of Spark SQL tables with the Spark Hive Metastore
      •  Create a Last_X_Days copy of any Spark SQL table in memory
      Scheduled jobs:
      •  Register the latest available directories for all Spark SQL tables programmatically
      •  Update Last_3_Days of core datasets in memory
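
One way a Last_X_Days in-memory copy could be expressed is a generated `CACHE TABLE ... AS SELECT` statement filtered on the day partitions. The sketch below only builds the SQL string; the table name, the `_last_N_days` naming scheme, the partition columns, and the choice to count today as one of the N days are assumptions, not details from the talk.

```python
from datetime import date, timedelta


def last_n_days_sql(table, n, today):
    """Build a Spark SQL statement caching the last n daily partitions."""
    # Assumption: "last n days" includes `today` itself.
    days = [today - timedelta(days=i) for i in range(n)]
    predicates = " OR ".join(
        f"(year='{d:%Y}' AND month='{d:%m}' AND day='{d:%d}')" for d in days)
    return (f"CACHE TABLE {table}_last_{n}_days AS "
            f"SELECT * FROM {table} WHERE {predicates}")


# Hypothetical table name and run date.
sql = last_n_days_sql("clickstream", 3, date(2015, 5, 31))
print(sql)
```

Filtering on partition columns lets the engine read only the matching directories instead of scanning the whole dataset.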
  14. S3 and Spark SQL potential
      o  Now that all S3 data is easily accessible, there are a lot of opportunities!
      o  Anyone can ETL on prefixed aggregates and create new Data Marts
      Diagram labels: Spark Cluster, Spark SQL tables, Last_3_days Tables, Utilities and UDFs, Business Intelligence Tools, Faster Pipeline, Better Insights
  15. Platfora Dashboards Pipeline Optimization
      o  Platfora is a Visualization Analytics Tool
      o  Provides more than 200 dashboards for BA
      o  Uses MapReduce to load aggregates
      Diagram labels: Source Dataset, Build / Update Lens, Dashboards, HDFS, S3, Joined Datasets, MapReduce jobs
  16. Platfora Dashboards Pipeline Optimization
      Limitations:
      o  We cannot optimize the Platfora Map Reduce jobs
      o  Defined Data Marts are not available elsewhere
      Diagram labels: Source Dataset, Build / Update Lens, Dashboards, HDFS, S3, Joined Datasets, MapReduce jobs
      Join on lead_id, inventory_id, visitor_id, dealer_id …
  17. Platfora Dealer Leads dataset Use Case
      o  Dealer Leads: Lead Submitter insights dataset
      o  More than 40 Visual Dashboards use Dealer Leads
      Diagram labels: Lead Submitter Data, Region Info, Transaction Data, Vehicle Info, Dealer Info, join, Dealer Leads Joined Dataset, Lead Categorization, Lead Submitter Insights
  18. Optimizing Dealer Leads Dataset
      Dealer Leads Platfora Dataset stats:
      o  300+ attributes
      o  Usually takes 2-3 hours to build lens
      o  Scheduled to build daily
  19. Optimizing Dealer Leads Dataset
      How do we optimize it?
  20. Optimizing Dealer Leads Dataset
      How do we optimize it?
      1. Have Spark SQL do the work!
         o  All required datasets are exposed as Spark SQL tables
         o  Add new useful attributes
      2. Make the ETL easy enough for anyone in Business Analytics to do it themselves
         o  Provide utilities and UDFs so that aggregated data can be exposed to Visualization tools
  21. Dealer Leads Data Mart using Spark SQL (Demo)
      o  Expose all original 300+ attributes
      o  Enhance: join with site_traffic
      Diagram labels: Dealer Leads Dataset, Traffic Data, aggregate_traffic_spark_sql
      Lead submitter journey: entry page, page views, device …
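
The demo itself is not reproduced in the transcript. As a rough illustration of the kind of enrichment the slide names (joining leads with site traffic to get page views and device per lead), here is a small join-and-aggregate query. It runs on Python's built-in `sqlite3` as a stand-in so the example is self-contained; it is not Spark SQL, and every table and column name except `site_traffic`, `lead_id`, `visitor_id`, and `dealer_id` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dealer_leads (lead_id INTEGER, visitor_id TEXT, dealer_id TEXT);
CREATE TABLE site_traffic (visitor_id TEXT, page TEXT, device TEXT);
INSERT INTO dealer_leads VALUES (1, 'v1', 'd9'), (2, 'v2', 'd9');
INSERT INTO site_traffic VALUES
  ('v1', '/home', 'mobile'),
  ('v1', '/inventory', 'mobile'),
  ('v2', '/reviews', 'desktop');
""")

# One output row per lead, enriched with traffic attributes of its visitor.
query = """
SELECT l.lead_id,
       l.dealer_id,
       COUNT(*)      AS page_views,
       MIN(t.device) AS device
FROM dealer_leads l
JOIN site_traffic t ON t.visitor_id = l.visitor_id
GROUP BY l.lead_id, l.dealer_id
ORDER BY l.lead_id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Doing the join once upstream is what lets the dashboard tool load a ready-made data mart instead of joining aggregates itself.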
  22. Dealer Leads Using Spark SQL: results
      o  Spark SQL aggregation in 10 minutes
      o  Adds dimension attributes that were not available before
      o  Platfora does not need to join aggregates
      o  Significantly reduced latency
      o  Dashboard refreshed every 2 hours instead of once per day
      Diagram labels: Spark SQL, Dealer Leads Lens, 10 minutes, 10 minutes, Dashboards
  23. ETL and Visualization takeaway
      o  Now anyone in BA can perform and support ETL on their own
      o  New Data marts can be exported to RDBMS
      Diagram labels: S3, New Data Marts Using Spark SQL, Redshift, Platfora, Tableau, Spark Cluster, Spark SQL tables, Last N days Tables, Utilities, Spark SQL connector, ETL load
  24. POC with Spark SQL vision
      Usual POC process:
      o  Business Analyst Project Prototype in SQL
      o  Not scalable. Takes Ad Hoc resources from RDBMS
      o  SQL to Map Reduce
      o  Transition between two very different frameworks
      o  Map Reduce does not always fit complicated business logic
      o  Supported only by Developers
  25. POC with Spark SQL vision
      New POC process using Spark:
      o  A developer and a BA can work together on the same platform and collaborate using Spark
      o  It's scalable
      o  No need to switch frameworks when productionalizing
  26. Ad Revenue Billing Use Case
      Introduction: OEM Advertising on website
      Definitions: Impression, CPM, Line Item, Order
  27. Ad Revenue Billing Use Case
      Introduction: OEM Advertising on website
      Ad Revenue is computed at the end of the month using OEM-provided impression data
  28. Ad Revenue End of Month billing
      Impressions served * CPM != actual revenue
      o  There are billing adjustment rules!
      o  Each OEM has a set of unique rules that determine the actual revenue.
      o  Adjusting revenue numbers requires manual user inputs from the OEM’s Account Manager
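
For reference, the unadjusted figure the slide calls "Impressions served * CPM" works out as follows. CPM is the price per thousand impressions, so the impression count is divided by 1,000 first; the numbers are made up for illustration.

```python
def naive_revenue(impressions, cpm):
    """Revenue if every served impression were billable.

    CPM is the price per 1,000 impressions.
    """
    return impressions / 1000 * cpm


# Made-up example: 250,000 impressions at a 12.00 CPM.
revenue = naive_revenue(250_000, 12.0)
print(revenue)
```

This is only the starting point: the OEM-specific adjustment rules change the impression count that is actually billable.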
  29. Line Item adjustments’ examples (box representing each example)
      •  Line Item groupings
      ORIGINAL Line Item | CPM | impressions | attributes
  30. Line Item adjustments’ examples (box representing each example)
      •  Line Item groupings
      ORIGINAL Line Item | CPM | impressions | attributes
      SUPPORT Line Item | CPM | impressions
  31. Line Item adjustments’ examples (box representing each example)
      •  Line Item groupings
      Combine data:
      MERGED | CPM | NEW_impressions | attributes
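
The grouping step on slides 29 to 31 can be sketched as below. The slides do not say how the data is combined, so this assumes that impressions are summed and that the merged row keeps the CPM and attributes of the first line item in the group; the grouping map stands in for the manual inputs mentioned later in the talk, and all values are made up.

```python
def merge_line_items(line_items, groupings):
    """Combine line items that share a group into one merged row.

    Assumption: impressions are summed, and the CPM and attributes of the
    first item seen in each group are kept.
    """
    merged = {}
    for item in line_items:
        # Ungrouped line items pass through under their own name.
        key = groupings.get(item["line_item"], item["line_item"])
        if key not in merged:
            merged[key] = {**item, "line_item": key, "impressions": 0}
        merged[key]["impressions"] += item["impressions"]
    return list(merged.values())


items = [
    {"line_item": "ORIGINAL", "cpm": 10.0, "impressions": 80_000,
     "attributes": "homepage"},
    {"line_item": "SUPPORT", "cpm": 0.0, "impressions": 20_000},
]
merged = merge_line_items(items, {"ORIGINAL": "MERGED", "SUPPORT": "MERGED"})
print(merged)
```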
  32. Line Item adjustments’ examples (box representing each example)
      •  Line Item groupings
      •  Capping / Adjustments
      MERGED | CPM | NEW_impressions | attributes
      Line Item | impressions | Contract
      impressions_served > Contract ?
  33. Line Item adjustments’ examples (box representing each example)
      •  Line Item groupings
      •  Capping / Adjustments
      MERGED | CPM | NEW_impressions | attributes
      impressions_served > Contract ? Cap impression!
      Line Item | CAPPED_impressions | Contract
  34. Line Item adjustments’ examples (box representing each example)
      •  Line Item groupings
      •  Capping / Adjustments
      MERGED | CPM | NEW_impressions | attributes
      Line Item | impressions | Contract
      impressions_served > (X% * Contract) ?
  35. Line Item adjustments’ examples (box representing each example)
      •  Line Item groupings
      •  Capping / Adjustments
      MERGED | CPM | NEW_impressions | attributes
      impressions_served > (X% * Contract) ? Adjust impression!
      Line Item | ADJUSTED_impressions | Contract
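
The two rule checks on slides 32 to 35 might look like this in code. The capping rule follows the slide directly. The slides do not state what "Adjust impression!" does once the X% threshold is exceeded, so limiting the billable count to that threshold is purely an assumption made for the sake of a runnable example.

```python
def cap_impressions(served, contract):
    """Bill at most the contracted number of impressions."""
    return min(served, contract)


def adjust_impressions(served, contract, x_pct):
    """Apply the 'impressions_served > X% * Contract' check.

    Assumption: when the threshold is exceeded, the billable count is
    limited to the threshold. The real adjustment is OEM-specific.
    """
    threshold = contract * x_pct // 100
    if served > threshold:
        return threshold
    return served


# Made-up numbers: 120,000 served against a 100,000 contract.
capped = cap_impressions(120_000, 100_000)
adjusted = adjust_impressions(120_000, 100_000, 110)
print(capped, adjusted)
```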
  36. Can we automate ad revenue calculation?
      Each Line Item: 1_day_impr | CPM | attributes
      Process: Billing Engine
      1d, 7d, MTD, QTD, YTD: adjusted_impr | CPM
      Vision: Impressions served * CPM = actual revenue
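
The 1d, 7d, MTD, QTD, and YTD figures can be derived from daily values by summing over a window that ends on the reporting date. The sketch below assumes calendar quarters and a 7-day window that includes the reporting date; the talk does not specify either, and the daily numbers are made up.

```python
from datetime import date, timedelta


def window_start(kind, as_of):
    """First day of the reporting window that ends on `as_of`."""
    if kind == "1d":
        return as_of
    if kind == "7d":
        return as_of - timedelta(days=6)
    if kind == "MTD":
        return as_of.replace(day=1)
    if kind == "QTD":  # assumption: calendar quarters
        return date(as_of.year, 3 * ((as_of.month - 1) // 3) + 1, 1)
    if kind == "YTD":
        return date(as_of.year, 1, 1)
    raise ValueError(f"unknown window: {kind}")


def rollups(daily, as_of):
    """Sum daily adjusted impressions (date -> count) per window."""
    return {
        kind: sum(count for day, count in daily.items()
                  if window_start(kind, as_of) <= day <= as_of)
        for kind in ("1d", "7d", "MTD", "QTD", "YTD")
    }


# Made-up data: 1,000 adjusted impressions every day of 2015 so far.
start, as_of = date(2015, 1, 1), date(2015, 5, 31)
daily = {start + timedelta(days=i): 1000
         for i in range((as_of - start).days + 1)}
totals = rollups(daily, as_of)
print(totals)
```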
  37. Can we automate ad revenue calculation?
      Automation Challenges:
      o  Many rules, user-defined inputs, and the logic changes
      o  Need for a scalable unified platform
      o  Need for tight collaboration between OEM team, Business Analysts and DWH developers
  38. Billing Rules Modeling Project
      How do we develop it?
  39. Billing Rules Modeling Project
      How do we develop it?
      o  Spark + Spark SQL approach
      o  BA + Developers + OEM Account Team collaboration
      o  Goal is an Ad Performance Dashboard
      Diagram labels: OEM, Ad Revenue, Adjusted Billing
  40. Project Architecture in Spark
      o  Each Line Item = Spark SQL Row: 1_day_impr | CPM | attributes
      o  Sum / Transform / Join Rows
      o  Processing separated in phases where inputs / outputs are Spark SQL tables
  41. Project Architecture in Spark
      Phase 1: Base Line Items (Spark SQL)
      Phase 2: Merged Line Items (Spark SQL)
      Phase 3: Adjusted Line Items (Spark SQL)
      Ad Performance Tableau Dashboard
  42. Project Architecture in Spark (diagram as slide 41)
      Labels added: Business Analysts; DFP / Other Impressions; Line Item Dimensions; Spark SQL join / aggregate
  43. Project Architecture in Spark (diagram as slide 42)
      Labels added: BA + Developer; OEM Account Managers; Manual Groupings; Other Inputs; Spark SQL Line Item Merging Engine
  44. Project Architecture in Spark (diagram as slide 43)
      Labels added: Spark SQL Billing Rules Engine; Business Analysts and OEM Account Managers next to the Ad Performance Tableau Dashboard
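
The phased layout in these slides, where each phase reads one table and writes the next, can be sketched with plain Python lists standing in for Spark SQL tables. The table names follow the slides; the phase bodies are placeholders, and the `CONTRACTS` lookup and sample rows are hypothetical.

```python
# Hypothetical contracted impressions per line item.
CONTRACTS = {"A": 100}


def aggregate_base(rows):
    """Phase 1: aggregate raw impression rows per line item."""
    totals = {}
    for row in rows:
        totals[row["line_item"]] = (
            totals.get(row["line_item"], 0) + row["impressions"])
    return [{"line_item": k, "impressions": v} for k, v in totals.items()]


def merge_items(rows):
    """Phase 2: placeholder for the line item merging engine."""
    return rows


def apply_rules(rows):
    """Phase 3: placeholder billing rule, capping at the contract."""
    return [{**r, "impressions": min(r["impressions"],
                                     CONTRACTS[r["line_item"]])}
            for r in rows]


# Each phase names its input table, its output table, and its transform.
PHASES = [
    ("impressions", "base_line_items", aggregate_base),
    ("base_line_items", "merged_line_items", merge_items),
    ("merged_line_items", "adjusted_line_items", apply_rules),
]


def run(tables):
    """Run every phase in order, storing each output as a named table."""
    for source, target, step in PHASES:
        tables[target] = step(tables[source])
    return tables


tables = run({"impressions": [
    {"line_item": "A", "impressions": 70},
    {"line_item": "A", "impressions": 50},
]})
print(tables["adjusted_line_items"])
```

Keeping every intermediate result as a named table means each phase's output can be inspected or queried on its own, which fits the collaboration between analysts and developers that the talk describes.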
  45. Billing Rules Modeling Achievements
      o  Increased accuracy of revenue forecasts for BA
      o  Cost savings by not having a dedicated team doing manual adjustments
      o  Monitor ad delivery rate for orders
      o  Allows us to detect abnormalities in ad serving
      o  Collaboration between BA and DWH Developers
  46. Thank you! Questions?
      Blagoy Kaloferov, bkaloferov@edmunds.com
