Abstract:
Data engineering at Dollar Shave Club has grown significantly over the last year. In that time, it has expanded in scope from conventional web analytics and business intelligence to include real-time, big-data, and machine-learning applications. We bootstrapped a dedicated data engineering team in parallel with developing a new category of capabilities, and the business value we delivered early on allowed us to forge new roles for our data products and services in developing and carrying out business strategy. This progress was made possible, in large part, by adopting Apache Spark as an application framework. This talk describes what we have been able to accomplish using Spark at Dollar Shave Club.
Bio:
Brett Bevers, Ph.D. Brett is a backend engineer and leads the data engineering team at Dollar Shave Club. More importantly, he is an ex-academic who is driven to understand and tackle hard problems. His latest challenge has been to develop tools powerful enough to support data-driven decision making in high value projects.
10. Engineering at DSC
• Frontend
  • Ember.js web apps
  • iOS and Android apps
  • HTML email
• Backend
  • Ruby on Rails web backends
  • Internal services (Ruby, Node.js, Golang, Python, Elixir)
  • Data and analytics (Python, SQL, Spark)
• QA
  • CircleCI, SauceLabs, Jenkins
  • TestUnit, Selenium
• IT
  • Office and warehouse IT
16. Big Data
What is the barrier to entry?
• Requires a different set of capabilities
• Investing resources without an obvious ROI
• Knowing where to start
18. Data Engineering
• Machine learning pipeline
  • Models served in production
• Exploratory analysis
  • Customer segmentation (clustering)
  • Hypothesis testing
  • Data mining
  • NLP (topic modeling)
19. Data Engineering
• Maxwell + Kafka + Spark Streaming
  • Streaming data replication
  • Streaming metrics directly from the data layer
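The replication path above can be sketched in miniature. Maxwell emits one JSON document per row change; the per-record logic below is pure Python with illustrative table and field names (not DSC's actual schema), and in the real pipeline it would run inside a Spark Streaming map over a Kafka direct stream:

```python
import json

# Decode a Maxwell change-capture event. Maxwell's format is one JSON
# document per row change: {"database": ..., "table": ..., "type":
# "insert"/"update"/"delete", "data": {...changed row...}}.
def parse_event(raw):
    event = json.loads(raw)
    return event["table"], event["type"], event["data"]

def is_new_subscription(event):
    table, change_type, _row = event
    return table == "subscriptions" and change_type == "insert"

# A micro-batch of raw Kafka messages, as Maxwell would produce them.
batch = [
    '{"database": "dsc", "table": "subscriptions",'
    ' "type": "insert", "data": {"customer_id": 1}}',
    '{"database": "dsc", "table": "orders",'
    ' "type": "update", "data": {"id": 7}}',
]

events = [parse_event(raw) for raw in batch]
# A streaming metric computed directly from the data layer:
new_subscriptions = sum(1 for e in events if is_new_subscription(e))
```

The same filter-and-count, applied per micro-batch, is how a metric can be derived straight from the database's change stream without touching the application code.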
23. Box Manager Email
Problem: Order the product tiles in the "Box Manager Email" to maximize profit
Constraints:
• Every customer sees some ordered set of products
• Do not show products already added to the box
Result: +25% revenue per email open
24. Strategy
For each product, model the behavior that best distinguishes someone who buys that product from someone who buys other products. Rank a product by the strength of the indicative behavior when that behavior is present; otherwise, rank it randomly.
Model:
• Logistic regression
• Learns the "tipping point" between success and failure
• Success = "buys product X"
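The "tipping point" idea can be illustrated with a toy one-feature logistic regression fit by gradient descent. The data here are fabricated (the production models were trained on customer-behavior features): success becomes likely once the feature crosses a threshold near 2.0, and the fitted model recovers that boundary.

```python
import numpy as np

rng = np.random.default_rng(0)

# One behavioral feature; "success" becomes likely past a tipping
# point near 2.0, blurred by noise.
x = rng.uniform(0, 4, size=500)
y = (x + rng.normal(0, 0.3, size=500) > 2.0).astype(float)

# Fit weight and bias by gradient descent on the logistic loss.
w, b = 0.0, 0.0
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w -= 0.5 * np.mean((p - y) * x)
    b -= 0.5 * np.mean(p - y)

# The decision boundary: where the predicted probability crosses 0.5.
tipping_point = -b / w
```

The learned boundary sits close to the true threshold of 2.0, which is the sense in which logistic regression "learns the tipping point" between buyers of product X and buyers of other products.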
25. Design
• Extract data from the data warehouse (Redshift)
• Join that data with hand-curated metadata (knowledge base)
• Aggregate and pivot events by customer and discretized time
• Generate a training set of feature vectors
• Select features to include in the final model
• Train and productionize the final model
29. Domain Knowledge is Critical
The way that an expert organizes and represents facts in their domain:
• Guides feature extraction
• Prevents overfitting
• Vastly superior to unsupervised feature extraction (e.g., PCA)
32. Aggregate (Shard, Compress, Join) and Pivot!
This dance is hard to choreograph:
• 8,736 columns
• 2.6 million rows
The DataFrames API is not optimized for extremely wide datasets.
33. Aggregate (Shard, Compress, Join) and Pivot!

def generateQuery(self):
    # Build the aggregation query: self.selectClause() supplies the
    # SELECT list; events are grouped by customer and discretized time.
    return """
        {0}
        FROM {1}
        GROUP BY customer_id, {2}, {3}, {4}
    """.format(
        self.selectClause(), self._tempTableName,
        self.bucketingExpr(), self.timestampCol, self.startDateExpr
    )

def perform(self):
    # Register the preprocessed DataFrame so the generated SQL can
    # reference it by name, then run the query via the SQLContext.
    self.preprocessedDataFrame().registerTempTable(self._tempTableName)
    return sqlContext.sql(self.generateQuery())
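In miniature, the aggregate-and-pivot step looks like this (pure Python, with illustrative event names): roll events up by customer and time bucket, then give each (event type, bucket) pair its own column. With hundreds of event types and many time buckets, this is exactly what produces a dataset thousands of columns wide.

```python
from collections import defaultdict

# Raw events: (customer_id, time_bucket, event_type). Fabricated data.
events = [
    ("c1", 0, "page_view"), ("c1", 0, "page_view"), ("c1", 1, "purchase"),
    ("c2", 0, "purchase"), ("c2", 1, "page_view"),
]

# Aggregate: count events per (customer, (event_type, bucket)).
counts = defaultdict(int)
for customer, bucket, event_type in events:
    counts[(customer, (event_type, bucket))] += 1

# Pivot: one column per (event_type, bucket) pair, one row per customer.
columns = sorted({key for _, key in counts})
customers = sorted({c for c, _ in counts})
wide = {
    c: [counts.get((c, col), 0) for col in columns]
    for c in customers
}
```

Two event types over two buckets already yields four columns; the 8,736-column dataset above is the same construction at full scale.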
41. Featurize
• "Explode" each customer's history into several "windows" of time
• Define one or more prediction targets
• Standardize each historical feature
• Persist on S3 as text files of compressed sparse vectors
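A sketch of those featurization steps on fabricated data: explode each history into trailing windows, standardize each feature column, and serialize rows as index:value sparse text (the exact on-S3 format is an assumption here).

```python
import numpy as np

# Per-customer event counts in 6 consecutive time buckets (oldest first).
history = {
    "c1": np.array([0, 1, 0, 2, 0, 3]),
    "c2": np.array([1, 0, 0, 0, 4, 0]),
}

# "Explode": one training example per trailing window of each history.
window = 3
rows = []
for customer, series in history.items():
    for end in range(window, len(series) + 1):
        rows.append(series[end - window:end])
X = np.array(rows, dtype=float)

# Standardize each feature (column) to zero mean, unit variance.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / np.where(sigma == 0, 1.0, sigma)

# Serialize as sparse text: space-separated index:value, zeros omitted.
def to_sparse_text(vec):
    return " ".join("%d:%.3f" % (i, v) for i, v in enumerate(vec) if v != 0)

lines = [to_sparse_text(row) for row in X_std]
```

Each 6-bucket history yields four 3-bucket windows, so two customers produce eight examples; in production these text files are what get persisted to S3 for training.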
48. Select Features
1. Randomly select a set of new features to test
2. Derive training set for new features + previously selected features
3. Train model
4. Calculate the p-value for each feature
5. Retain significant features
6. Repeat
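One round of this selection loop can be sketched with ordinary NumPy. The data are fabricated (feature 0 drives the label, feature 1 is noise), and the Wald-test mechanics are an assumption about how the p-values were computed: fit a logistic model by Newton's method, take standard errors from the inverse Hessian, and retain coefficients whose two-sided p-value clears a significance threshold.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 2))
# Only feature 0 matters; feature 1 is pure noise.
logits = 1.5 * X[:, 0]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(float)

# Train: logistic regression by Newton's method (no intercept, for brevity).
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    w = p * (1 - p)
    hessian = X.T @ (X * w[:, None])
    beta += np.linalg.solve(hessian, X.T @ (y - p))

# Wald test: standard errors from the inverse Hessian at the fit,
# two-sided p-values via the normal CDF (Phi computed with erf).
p_hat = 1 / (1 + np.exp(-X @ beta))
w = p_hat * (1 - p_hat)
cov = np.linalg.inv(X.T @ (X * w[:, None]))
se = np.sqrt(np.diag(cov))
z = beta / se
p_values = [2 * (1 - 0.5 * (1 + math.erf(abs(zi) / math.sqrt(2)))) for zi in z]

# Retain significant features; in the full loop, repeat with new candidates.
selected = [i for i, pv in enumerate(p_values) if pv < 0.05]
```

The informative feature comes back with a vanishingly small p-value and survives; repeating with fresh random candidates plus the survivors is the loop described in steps 1-6.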