Joining The Club
Accelerating Big Data with Apache Spark
Dollar Shave Club
Outline
•  Background on DSC
•  Engineering at DSC
•  Growth of Data Team
•  Show & Tell: Machine Learning Pipeline
A David and Goliath Story
Introduction
[Image slides: growth of new members; "Goliath"]
Engineering at DSC
•  Frontend
   •  Ember.js web apps
   •  iOS and Android apps
   •  HTML email
•  Backend
   •  Ruby on Rails web backends
   •  Internal services (Ruby, Node.js, Golang, Python, Elixir)
   •  Data and analytics (Python, SQL, Spark)
•  QA
   •  CircleCI, SauceLabs, Jenkins
   •  TestUnit, Selenium
•  IT
   •  Office and warehouse IT
Engineering at DSC
highscalability.com
Data Engineering at DSC
A David and Big Data Story
Big Data
What is the barrier to entry?
•  Requires a different set of capabilities
•  Investing resources without an obvious ROI
•  Knowing where to start
Good Foundations
Data Engineering
•  Machine learning pipeline
•  Models served in production
•  Exploratory Analysis
•  Customer segmentation (clustering)
•  Hypothesis testing
•  Data mining
•  NLP (topic modeling)
Data Engineering
•  Maxwell + Kafka + Spark Streaming (sketch below)
   •  Streaming data replication
   •  Streaming metrics directly from the data layer
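A minimal sketch of this streaming path, assuming the Spark 1.x/2.x DStream Kafka connector (the spark-streaming-kafka package) and Maxwell's default topic; the broker address and the per-table change counts are illustrative, not DSC's actual jobs:

import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="maxwell-metrics")
ssc = StreamingContext(sc, batchDuration=60)  # one-minute micro-batches

# Maxwell publishes one JSON change event per row change to the "maxwell" topic.
stream = KafkaUtils.createDirectStream(
    ssc, ["maxwell"], {"metadata.broker.list": "kafka:9092"}
)

events = stream.map(lambda kv: json.loads(kv[1]))
counts = (events
          .map(lambda e: ((e["table"], e["type"]), 1))
          .reduceByKey(lambda a, b: a + b))
counts.pprint()  # e.g. (('orders', 'insert'), 1273)

ssc.start()
ssc.awaitTermination()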
Anatomy of a Machine Learning Pipeline
Box Manager Email
Problem: Order the product tiles in the “Box Manager Email” to maximize profit
Constraints:
•  Every customer sees some ordered set of products
•  Do not show products already added to the box
Result: +25% revenue per email open
Strategy
For each product, model the behavior which best distinguishes someone who
buys that product from someone who buys other products; rank a product by
the strength of the indicative behavior, when present, and rank a product
randomly otherwise.
Model
•  Logistic regression (sketch below)
   •  Learns the “tipping point” between success and failure
   •  Success = “buys product X”
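A minimal sketch of that model with Spark ML's LogisticRegression; the toy feature vectors and hyperparameters are illustrative only:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# label = 1.0 means "bought product X"; features are behavioral counts.
train = spark.createDataFrame([
    (1.0, Vectors.dense([3.0, 0.0, 1.0])),
    (0.0, Vectors.dense([0.0, 2.0, 0.0])),
    (1.0, Vectors.dense([4.0, 1.0, 2.0])),
    (0.0, Vectors.dense([0.0, 5.0, 0.0])),
], ["label", "features"])

model = LogisticRegression(maxIter=50, regParam=0.01).fit(train)

# The predicted probability of success is what ranks product X for a customer.
model.transform(train).select("label", "probability", "prediction").show()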
Design
•  Extract data from data warehouse (Redshift)
•  Join that data with hand-curated metadata (knowledge base)
•  Aggregate and pivot events by customer and discretized time
•  Generate a training set of feature vectors
•  Select features to include in the final model
•  Train and productionize the final model
def performExtraction(
    extractorClass, exportName, join_table=None, join_key_col=None,
    start_col=None, include_start_col=True, event_start_date=None
):
    # Build the extractor for one event source and unload it from Redshift.
    customer_id_col = extractorClass.customer_id_col
    timestamp_col = extractorClass.timestamp_col
    extr_args = extractorArgs(
        customer_id_col, timestamp_col, join_table, join_key_col,
        start_col, include_start_col, event_start_date
    )
    extractor = extractorClass(**extr_args)
    export_path = redshiftExportPath(exportName)
    return extractor.exportFromRedshift(export_path)  # writes to Parquet
Extract
def exportFromRedshift(self, path):
    # Unload to Parquet, then re-read the Parquet files and cache them.
    export = self.exportDataFrame()
    writeParquetWithRetry(export, path)
    return sqlContext.read.parquet(path) \
        .persist(StorageLevel.MEMORY_AND_DISK)

def exportDataFrame(self):
    # Run the extraction query through the spark-redshift connector.
    query = self.generateQuery()
    return sqlContext.read \
        .format("com.databricks.spark.redshift") \
        .option("url", urlOption) \
        .option("query", query) \
        .option("tempdir", tempdir) \
        .load()
Extract
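writeParquetWithRetry is called above but not shown in the deck. A plausible sketch, assuming it does nothing more than retry DataFrame.write.parquet around transient S3/EMR failures:

import time

def writeParquetWithRetry(df, path, attempts=3, wait_seconds=30):
    # Retry the Parquet write a few times before giving up (behavior assumed).
    for attempt in range(1, attempts + 1):
        try:
            df.write.mode("overwrite").parquet(path)
            return
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(wait_seconds * attempt)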
Domain Knowledge is Critical
The way that an expert organizes and represents facts in their domain.
•  Guides feature extraction
•  Prevents overfitting
•  Vastly superior to unsupervised feature extraction (e.g., PCA)
Aggregate (Shard, Compress, Join) and Pivot!
This dance is hard to choreograph:
•  8,736 columns
•  2.6 million rows
The DataFrames API is not optimized for extremely wide datasets
def generateQuery(self):
    return """
        {0}
        FROM {1}
        GROUP BY customer_id, {2}, {3}, {4}
    """.format(
        self.selectClause(), self._tempTableName,
        self.bucketingExpr(), self.timestampCol, self.startDateExpr
    )

def perform(self):
    self.preprocessedDataFrame().registerTempTable(self._tempTableName)
    return sqlContext.sql(self.generateQuery())
Aggregate (Shard, Compress, Join) and Pivot!
(0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0)
( 18, (6,16), (1,2) )
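Reading the pair above: the dense vector of 18 bucketed event counts compresses to (size, indices, values), i.e. a 1 at position 6 and a 2 at position 16. A quick check with Spark's SparseVector (the deck does not say which sparse representation was used):

from pyspark.ml.linalg import Vectors

dense  = Vectors.dense([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0])
sparse = Vectors.sparse(18, [6, 16], [1.0, 2.0])  # (size, indices, values)

assert dense.toArray().tolist() == sparse.toArray().tolist()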
def perform(self):
    # Pivot each customer's events into one wide row of per-bucket counts.
    keyedMonthlyEvents = self.dataFrame.map(self.keyRow())
    pivotRDD = keyedMonthlyEvents \
        .combineByKey(
            self.initPivot(),
            self.pivotEvent(),
            self.combineDicts()
        ) \
        .map(self.convertToRow()) \
        .persist(StorageLevel.MEMORY_AND_DISK)
    return sqlContext.createDataFrame(pivotRDD, self.pivotedSchema())
Aggregate (Shard, Compress, Join) and Pivot!
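The three combiner factories passed to combineByKey above (initPivot, pivotEvent, combineDicts) are not shown in the deck. A plausible standalone sketch of what they return, accumulating per-customer counts keyed by (time bucket, event type); the field names are assumed:

def init_pivot(event):
    # createCombiner: start the per-customer accumulator from its first event.
    return {(event["bucket"], event["event_type"]): event["count"]}

def pivot_event(acc, event):
    # mergeValue: fold one more event into the accumulator.
    key = (event["bucket"], event["event_type"])
    acc[key] = acc.get(key, 0) + event["count"]
    return acc

def combine_dicts(left, right):
    # mergeCombiners: merge accumulators built on different partitions.
    for key, count in right.items():
        left[key] = left.get(key, 0) + count
    return left

# keyedMonthlyEvents.combineByKey(init_pivot, pivot_event, combine_dicts)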
Aggregate (Compress, Shard, Join) and Pivot!
Featurize
•  "Explode" each customer's history into several "windows" of time
•  Define one or more prediction targets
•  Standardize each historical feature
•  Persist on S3 as text files of compressed sparse vectors (sketch below)
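A minimal sketch of the last two steps (standardize, then persist as sparse text on S3), assuming the RDD-based MLlib utilities; the toy rows and bucket path are illustrative:

from pyspark import SparkContext
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

sc = SparkContext.getOrCreate()

# (prediction target, feature vector) pairs produced by the windowing step
rows = sc.parallelize([
    (1.0, Vectors.sparse(18, [6, 16], [1.0, 2.0])),
    (0.0, Vectors.sparse(18, [2], [3.0])),
])

# Scale by standard deviation only, so the vectors stay sparse.
scaler = StandardScaler(withMean=False, withStd=True).fit(rows.values())
labeled = (rows.keys()
           .zip(scaler.transform(rows.values()))
           .map(lambda lv: LabeledPoint(lv[0], lv[1])))

# LIBSVM is one text format for compressed sparse vectors.
MLUtils.saveAsLibSVMFile(labeled, "s3://example-bucket/training-set/")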
Select Features
1.  Randomly select a set of new features to test
2.  Derive training set for new features + previously selected features
3.  Train model
4.  Calculate the p-value for each feature
5.  Retain significant features
6.  Repeat
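A hedged sketch of one pass through this loop. Spark ML's logistic regression does not expose per-coefficient p-values, so this sketch scores candidates with statsmodels on a driver-side sample; the approach and every name in it are assumptions, not DSC's code:

import random
import statsmodels.api as sm

def select_round(X, y, selected, candidates, n_new=20, alpha=0.05):
    # X: numpy feature matrix, y: 0/1 targets, selected/candidates: column indices.
    trial = random.sample(candidates, min(n_new, len(candidates)))   # step 1
    cols = selected + trial                                          # step 2
    design = sm.add_constant(X[:, cols])
    fit = sm.Logit(y, design).fit(disp=0)                            # step 3
    pvalues = fit.pvalues[1:]                                        # step 4 (skip intercept)
    return [c for c, p in zip(cols, pvalues) if p < alpha]           # step 5; repeat with result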
Production Model
•  Spark ML makes parameter tuning easy (sketch below)
•  Reusable modules!
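A minimal sketch of the parameter-tuning point with ParamGridBuilder and CrossValidator; the grid values are illustrative, and `train` stands for the featurized (label, features) DataFrame:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

lr = LogisticRegression()
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),  # area under ROC
                    numFolds=3)

bestModel = cv.fit(train).bestModel  # picks the best (regParam, elasticNetParam)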
brett.bevers@dollarshaveclub.com
http://app.jobvite.com/m?33KSgiwI
