SlideShare une entreprise Scribd logo
1  sur  50
Joining The Club
A c c e l e r a t i n g B i g D a t a w i t h A p a c h e S p a r k
Dollar Shave Club
Background on DSC
Engineering at DSC
Growth of Data Team
Show & Tell: Machine Learning Pipeline
Outline
A David and Goliath Story
Introduction
of new
members
Goliath
Engineering at DSC
u  Frontend
u  Ember.js web apps
u  iOS and Android apps
u  HTML email
u  Backend
u  Ruby on Rails web backends
u  Internal services (Ruby, Node.js, Golang, Python, Elixir)
u  Data and analytics (Python, SQL, Spark)
u  QA
u  CircleCI, SauceLabs, Jenkins
u  TestUnit, Selenium
u  IT
u  Office and warehouse IT
Engineering at DSC
highscalability.com
Data Engineering at DSC
A David and Big Data Story
Big Data
What is the barrier to entry?
Big Data
What is the barrier to entry?
u  Requires a different set of capabilities
Big Data
What is the barrier to entry?
u  Requires a different set of capabilities
u  Investing resources without an obvious ROI
Big Data
What is the barrier to entry?
u  Requires a different set of capabilities
u  Investing resources without an obvious ROI
Knowing where to start
Good Foundations
Data Engineering
u  Machine learning pipeline
u  Models served in production
u  Exploratory Analysis
u  Customer segmentation (clustering)
u  Hypothesis testing
u  Data mining
u  NLP (topic modeling)
Data Engineering
u  Maxwell + Kafka + Spark Streaming
u  Streaming data replication
u  Streaming metrics directly from the data layer
Anatomy of a Machine Learning Pipeline
Box Manager Email
Box Manager Email
Problem: Order the product tiles in “Box Manager Email” to maximize profit
Constraints:
u  Every customer sees some ordered set of products
u  Do not show products already added to box
Box Manager Email
Problem: Order the product tiles in “Box Manager Email” to maximize profit
Constraints:
u  Every customer sees some ordered set of products
u  Do not show products already added to box
+25% revenue per email open
Strategy
For each product, model the behavior which best distinguishes someone who
buys that product from someone who buys other products; rank a product by
the strength of the indicative behavior, when present, and rank a product
randomly otherwise
	
Model
u  Logistic Regression
u  Learns the “tipping point” between
success and failure
u  Success = “buys product X”
Design
u  Extract data from data warehouse (Redshift)
u  Join that data with hand-curated metadata (knowledge base)
u  Aggregate and pivot events by customer and discretized time
u  Generate a training set of feature vectors
u  Select features to include in the final model
u  Train and productionize the final model
def performExtraction(
extractorClass, exportName, join_table=None, join_key_col=None,
start_col=None, include_start_col=True, event_start_date=None
):
customer_id_col = extractorClass.customer_id_col
timestamp_col = extractorClass.timestamp_col
extr_agrs = extractorArgs(
customer_id_col, timestamp_col, join_table, join_key_col,
start_col, include_start_col, event_start_date
)
extractor = extractorClass(**extr_agrs)
export_path = redshiftExportPath(exportName)
return extractor.exportFromRedshift(export_path) # writes to Parquet
Extract
def exportFromRedshift(self, path):
export = self.exportDataFrame()
writeParquetWithRetry(export, path)
return sqlContext.read.parquet(path)
.persist(StorageLevel.MEMORY_AND_DISK)
def exportDataFrame(self):
query = self.generateQuery()
return sqlContext.read
.format("com.databricks.spark.redshift")
.option("url", urlOption)
.option("query", query)
.option("tempdir", tempdir)
.load()
Extract
Domain Knowledge is Critical
The way that an expert organizes and represents facts in their domain.
u  Guides feature extraction
u  Prevents overfitting
u  Vastly superior to unsupervised feature extraction (e.g., PCA)
Aggregate (Shard, Compress, Join) and Pivot!
This dance is hard to choreograph
Aggregate (Shard, Compress, Join) and Pivot!
This dance is hard to choreograph
u  8,736 columns
u  2.6 million rows
Dataframes API is not optimized for extremely wide datasets
def generateQuery(self):
return """
{0}
FROM {1}
GROUP BY customer_id, {2}, {3}, {4}
""".format(
self.selectClause(), self._tempTableName,
self.bucketingExpr(), self.timestampCol, self.startDateExpr
)
def perform(self):
self.preprocessedDataFrame().registerTempTable(self._tempTableName)
return sqlContext.sql(self.generateQuery())
Aggregate (Shard, Compress, Join) and Pivot!
Aggregate (Shard, Compress, Join) and Pivot!
Aggregate (Shard, Compress, Join) and Pivot!
(0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0)	
(	18,	(6,16),	(1,2)	)
def perform(self):
keyedMonthlyEvents = self.dataFrame.map(self.keyRow())
pivotRDD = keyedMonthlyEvents
.combineByKey(
self.initPivot(),
self.pivotEvent(),
self.combineDicts()
)
.map(self.convertToRow())
.persist(StorageLevel.MEMORY_AND_DISK)
return sqlContext.createDataFrame(pivotRDD, self.pivotedSchema())
Aggregate (Shard, Compress, Join) and Pivot!
Aggregate (Compress, Shard, Join) and Pivot!
Featurize
u  "Explode" each customer's history into several "windows" of time.
Featurize
u  "Explode" each customer's history into several "windows" of time.
u  Define one or more prediction targets
Featurize
u  "Explode" each customer's history into several "windows" of time.
u  Define one or more prediction targets
u  Standardize each historical feature
Featurize
u  "Explode" each customer's history into several "windows" of time.
u  Define one or more prediction targets
u  Standardize each historical feature
u  Persist on S3 as text files of compressed sparse vectors
Select Features
Select Features
1.  Randomly select a set of new features to test
Select Features
1.  Randomly select a set of new features to test
2.  Derive training set for new features + previously selected features
Select Features
1.  Randomly select a set of new features to test
2.  Derive training set for new features + previously selected features
3.  Train model
Select Features
1.  Randomly select a set of new features to test
2.  Derive training set for new features + previously selected features
3.  Train model
4.  Calculate the p-value for each feature
Select Features
1.  Randomly select a set of new features to test
2.  Derive training set for new features + previously selected features
3.  Train model
4.  Calculate the p-value for each feature
Select Features
1.  Randomly select a set of new features to test
2.  Derive training set for new features + previously selected features
3.  Train model
4.  Calculate the p-value for each feature
5.  Retain significant features
6.  Repeat
Production Model
u  Spark ML makes parameter tuning easy
u  Reusable modules!
brett.bevers@dollarshaveclub.com
h+p://app.jobvite.com/m?33KSgiwI

Contenu connexe

Tendances

GeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony FoxGeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony FoxDatabricks
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David SzakallasDatabricks
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Spark Summit
 
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellDatabricks
 
Advanced goldengate training ⅰ
Advanced goldengate training ⅰAdvanced goldengate training ⅰ
Advanced goldengate training ⅰoggers
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Building Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterBuilding Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterSri Ambati
 
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryFrustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryIlya Ganelin
 
OrientDB vs Neo4j - Comparison of query/speed/functionality
OrientDB vs Neo4j - Comparison of query/speed/functionalityOrientDB vs Neo4j - Comparison of query/speed/functionality
OrientDB vs Neo4j - Comparison of query/speed/functionalityCurtis Mosters
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and DatasetKazuaki Ishizaki
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Lars Albertsson
 
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...Herman Wu
 
Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Databricks
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Bryan Yang
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengDatabricks
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiGrowth Intelligence
 
Introduction to Spark SQL & Catalyst
Introduction to Spark SQL & CatalystIntroduction to Spark SQL & Catalyst
Introduction to Spark SQL & CatalystTakuya UESHIN
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming JobsDatabricks
 
Spark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSpark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSigmoid
 

Tendances (20)

GeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony FoxGeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony Fox
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
 
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
 
Advanced goldengate training ⅰ
Advanced goldengate training ⅰAdvanced goldengate training ⅰ
Advanced goldengate training ⅰ
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Building Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterBuilding Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling Water
 
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series LibraryFrustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
 
OrientDB vs Neo4j - Comparison of query/speed/functionality
OrientDB vs Neo4j - Comparison of query/speed/functionalityOrientDB vs Neo4j - Comparison of query/speed/functionality
OrientDB vs Neo4j - Comparison of query/speed/functionality
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
 
Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
 
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
 
Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...Ge aviation spark application experience porting analytics into py spark ml p...
Ge aviation spark application experience porting analytics into py spark ml p...
 
Spark etl
Spark etlSpark etl
Spark etl
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
 
Introduction to Spark SQL & Catalyst
Introduction to Spark SQL & CatalystIntroduction to Spark SQL & Catalyst
Introduction to Spark SQL & Catalyst
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 
Spark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSpark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. Jyotiska
 

En vedette

En vedette (8)

Data-Driven @ Netflix
Data-Driven @ NetflixData-Driven @ Netflix
Data-Driven @ Netflix
 
Case Study Netflix
Case Study NetflixCase Study Netflix
Case Study Netflix
 
Netflix Case Study
Netflix Case StudyNetflix Case Study
Netflix Case Study
 
Netflix marketing plan
Netflix marketing plan Netflix marketing plan
Netflix marketing plan
 
Netflix case study
Netflix case studyNetflix case study
Netflix case study
 
Netflix Business Model & Strategy
Netflix Business Model & StrategyNetflix Business Model & Strategy
Netflix Business Model & Strategy
 
Netflix Case Study
Netflix Case StudyNetflix Case Study
Netflix Case Study
 
Culture
CultureCulture
Culture
 

Similaire à Joining the Club: Using Spark to Accelerate Big Data at Dollar Shave Club

Agile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDBAgile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDBRussell Jurney
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Agile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science MeetupAgile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science MeetupRussell Jurney
 
How to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on SnowflakeHow to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on SnowflakeAtScale
 
10 ways to make your code rock
10 ways to make your code rock10 ways to make your code rock
10 ways to make your code rockmartincronje
 
Social media analytics using Azure Technologies
Social media analytics using Azure TechnologiesSocial media analytics using Azure Technologies
Social media analytics using Azure TechnologiesKoray Kocabas
 
Dsug 05 02-15 - ScalDI - lightweight DI in Scala
Dsug 05 02-15 - ScalDI - lightweight DI in ScalaDsug 05 02-15 - ScalDI - lightweight DI in Scala
Dsug 05 02-15 - ScalDI - lightweight DI in ScalaUgo Matrangolo
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerDataWorks Summit
 
Simplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseSimplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseFeatureByte
 
Big Data for Small Businesses & Startups
Big Data for Small Businesses & StartupsBig Data for Small Businesses & Startups
Big Data for Small Businesses & StartupsFujio Turner
 
Tips for Building your First XPages Java Application
Tips for Building your First XPages Java ApplicationTips for Building your First XPages Java Application
Tips for Building your First XPages Java ApplicationTeamstudio
 
Windows Store app using XAML and C#: Enterprise Product Development
Windows Store app using XAML and C#: Enterprise Product Development Windows Store app using XAML and C#: Enterprise Product Development
Windows Store app using XAML and C#: Enterprise Product Development Mahmoud Hamed Mahmoud
 
Thu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayThu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayAjay Shriwastava
 

Similaire à Joining the Club: Using Spark to Accelerate Big Data at Dollar Shave Club (20)

Agile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDBAgile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDB
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Agile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science MeetupAgile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science Meetup
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
How to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on SnowflakeHow to Realize an Additional 270% ROI on Snowflake
How to Realize an Additional 270% ROI on Snowflake
 
10 ways to make your code rock
10 ways to make your code rock10 ways to make your code rock
10 ways to make your code rock
 
Coding Naked 2023
Coding Naked 2023Coding Naked 2023
Coding Naked 2023
 
Social media analytics using Azure Technologies
Social media analytics using Azure TechnologiesSocial media analytics using Azure Technologies
Social media analytics using Azure Technologies
 
Dsug 05 02-15 - ScalDI - lightweight DI in Scala
Dsug 05 02-15 - ScalDI - lightweight DI in ScalaDsug 05 02-15 - ScalDI - lightweight DI in Scala
Dsug 05 02-15 - ScalDI - lightweight DI in Scala
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
 
Simplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data WarehouseSimplify Feature Engineering in Your Data Warehouse
Simplify Feature Engineering in Your Data Warehouse
 
Big Data for Small Businesses & Startups
Big Data for Small Businesses & StartupsBig Data for Small Businesses & Startups
Big Data for Small Businesses & Startups
 
Tips for Building your First XPages Java Application
Tips for Building your First XPages Java ApplicationTips for Building your First XPages Java Application
Tips for Building your First XPages Java Application
 
Windows Store app using XAML and C#: Enterprise Product Development
Windows Store app using XAML and C#: Enterprise Product Development Windows Store app using XAML and C#: Enterprise Product Development
Windows Store app using XAML and C#: Enterprise Product Development
 
Mstr meetup
Mstr meetupMstr meetup
Mstr meetup
 
Thu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayThu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjay
 
Ajax-Tutorial
Ajax-TutorialAjax-Tutorial
Ajax-Tutorial
 

Plus de Data Con LA

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA
 

Plus de Data Con LA (20)

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup Showcase
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learning
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentation
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWS
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data Science
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
 

Dernier

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 

Dernier (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Joining the Club: Using Spark to Accelerate Big Data at Dollar Shave Club

  • 1. Joining The Club A c c e l e r a t i n g B i g D a t a w i t h A p a c h e S p a r k Dollar Shave Club
  • 2. Background on DSC Engineering at DSC Growth of Data Team Show & Tell: Machine Learning Pipeline Outline
  • 3. A David and Goliath Story Introduction
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10. Engineering at DSC u  Frontend u  Ember.js web apps u  iOS and Android apps u  HTML email u  Backend u  Ruby on Rails web backends u  Internal services (Ruby, Node.js, Golang, Python, Elixir) u  Data and analytics (Python, SQL, Spark) u  QA u  CircleCI, SauceLabs, Jenkins u  TestUnit, Selenium u  IT u  Office and warehouse IT
  • 12. Data Engineering at DSC A David and Big Data Story
  • 13. Big Data What is the barrier to entry?
  • 14. Big Data What is the barrier to entry? u  Requires a different set of capabilities
  • 15. Big Data What is the barrier to entry? u  Requires a different set of capabilities u  Investing resources without an obvious ROI
  • 16. Big Data What is the barrier to entry? u  Requires a different set of capabilities u  Investing resources without an obvious ROI Knowing where to start
  • 18. Data Engineering u  Machine learning pipeline u  Models served in production u  Exploratory Analysis u  Customer segmentation (clustering) u  Hypothesis testing u  Data mining u  NLP (topic modeling)
  • 19. Data Engineering u  Maxwell + Kafka + Spark Streaming u  Streaming data replication u  Streaming metrics directly from the data layer
  • 20. Anatomy of a Machine Learning Pipeline
  • 22. Box Manager Email Problem: Order the product tiles in “Box Manager Email” to maximize profit Constraints: u  Every customer sees some ordered set of products u  Do not show products already added to box
  • 23. Box Manager Email Problem: Order the product tiles in “Box Manager Email” to maximize profit Constraints: u  Every customer sees some ordered set of products u  Do not show products already added to box +25% revenue per email open
  • 24. Strategy For each product, model the behavior which best distinguishes someone who buys that product from someone who buys other products; rank a product by the strength of the indicative behavior, when present, and rank a product randomly otherwise Model u  Logistic Regression u  Learns the “tipping point” between success and failure u  Success = “buys product X”
  • 25. Design u  Extract data from data warehouse (Redshift) u  Join that data with hand-curated metadata (knowledge base) u  Aggregate and pivot events by customer and discretized time u  Generate a training set of feature vectors u  Select features to include in the final model u  Train and productionize the final model
  • 26. def performExtraction( extractorClass, exportName, join_table=None, join_key_col=None, start_col=None, include_start_col=True, event_start_date=None ): customer_id_col = extractorClass.customer_id_col timestamp_col = extractorClass.timestamp_col extr_agrs = extractorArgs( customer_id_col, timestamp_col, join_table, join_key_col, start_col, include_start_col, event_start_date ) extractor = extractorClass(**extr_agrs) export_path = redshiftExportPath(exportName) return extractor.exportFromRedshift(export_path) # writes to Parquet Extract
  • 27. def exportFromRedshift(self, path): export = self.exportDataFrame() writeParquetWithRetry(export, path) return sqlContext.read.parquet(path) .persist(StorageLevel.MEMORY_AND_DISK) def exportDataFrame(self): query = self.generateQuery() return sqlContext.read .format("com.databricks.spark.redshift") .option("url", urlOption) .option("query", query) .option("tempdir", tempdir) .load() Extract
  • 28.
  • 29. Domain Knowledge is Critical The way that an expert organizes and represents facts in their domain. u  Guides feature extraction u  Prevents overfitting u  Vastly superior to unsupervised feature extraction (e.g., PCA)
  • 30.
  • 31. Aggregate (Shard, Compress, Join) and Pivot! This dance is hard to choreograph
  • 32. Aggregate (Shard, Compress, Join) and Pivot! This dance is hard to choreograph u  8,736 columns u  2.6 million rows Dataframes API is not optimized for extremely wide datasets
  • 33. def generateQuery(self): return """ {0} FROM {1} GROUP BY customer_id, {2}, {3}, {4} """.format( self.selectClause(), self._tempTableName, self.bucketingExpr(), self.timestampCol, self.startDateExpr ) def perform(self): self.preprocessedDataFrame().registerTempTable(self._tempTableName) return sqlContext.sql(self.generateQuery()) Aggregate (Shard, Compress, Join) and Pivot!
  • 34. Aggregate (Shard, Compress, Join) and Pivot!
  • 35. Aggregate (Shard, Compress, Join) and Pivot! (0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,0) ( 18, (6,16), (1,2) )
  • 36. def perform(self): keyedMonthlyEvents = self.dataFrame.map(self.keyRow()) pivotRDD = keyedMonthlyEvents .combineByKey( self.initPivot(), self.pivotEvent(), self.combineDicts() ) .map(self.convertToRow()) .persist(StorageLevel.MEMORY_AND_DISK) return sqlContext.createDataFrame(pivotRDD, self.pivotedSchema()) Aggregate (Shard, Compress, Join) and Pivot!
  • 37. Aggregate (Compress, Shard, Join) and Pivot!
  • 38. Featurize u  "Explode" each customer's history into several "windows" of time.
  • 39. Featurize u  "Explode" each customer's history into several "windows" of time. u  Define one or more prediction targets
  • 40. Featurize u  "Explode" each customer's history into several "windows" of time. u  Define one or more prediction targets u  Standardize each historical feature
  • 41. Featurize u  "Explode" each customer's history into several "windows" of time. u  Define one or more prediction targets u  Standardize each historical feature u  Persist on S3 as text files of compressed sparse vectors
  • 43. Select Features 1.  Randomly select a set of new features to test
  • 44. Select Features 1.  Randomly select a set of new features to test 2.  Derive training set for new features + previously selected features
  • 45. Select Features 1.  Randomly select a set of new features to test 2.  Derive training set for new features + previously selected features 3.  Train model
  • 46. Select Features 1.  Randomly select a set of new features to test 2.  Derive training set for new features + previously selected features 3.  Train model 4.  Calculate the p-value for each feature
  • 47. Select Features 1.  Randomly select a set of new features to test 2.  Derive training set for new features + previously selected features 3.  Train model 4.  Calculate the p-value for each feature
  • 48. Select Features 1.  Randomly select a set of new features to test 2.  Derive training set for new features + previously selected features 3.  Train model 4.  Calculate the p-value for each feature 5.  Retain significant features 6.  Repeat
  • 49. Production Model u  Spark ML makes parameter tuning easy u  Reusable modules!