Feature engineering (writing code to map raw input data into a set of signals that will be fed into a machine learning algorithm) is the dark art of data science. Although the process of crafting new features is tedious and failure-prone, a diverse set of high-quality features informed by domain experts is the key to a successful model. Recently, academic researchers have begun to focus on the problem of feature engineering, and have started to publish research that addresses the relative lack of tools designed to support the feature engineering process. In this talk, I will review some of my favorite papers and present some efforts to convert these ideas into tools that leverage the principles of reactive application design in order to make feature engineering (dare I say it) fun.
Discuss traffic congestion and the problem of induced demand.
Discuss scheduling and resource management (e.g., you're only allowed to drive your Ferrari between midnight and 6 AM).
As a tool for interactively working with "unstructured" data sources and building machine learning models, Spark is unparalleled.
Data scientists know how to structure data in a way that maximizes the number of questions that can be answered by a single MR job.
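One way to sketch this idea: if the data is kept in a "long" layout (one row per entity, metric, and value), a single generic group-by job can answer many different questions just by varying the grouping key. The event log and field names below are hypothetical, purely for illustration.

```python
from collections import defaultdict

# Hypothetical event log in "long" form: one row per (user, metric, value).
events = [
    {"user": "a", "metric": "clicks", "value": 3},
    {"user": "a", "metric": "purchases", "value": 1},
    {"user": "b", "metric": "clicks", "value": 5},
]

def aggregate(rows, key):
    """One generic 'MR job': sum values grouped by an arbitrary key."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["value"]
    return dict(totals)

# Same job, same data layout, two different questions:
by_user = aggregate(events, "user")      # totals per user
by_metric = aggregate(events, "metric")  # totals per metric
```

The point is that the structure of the data, not the job logic, determines how many questions one pass over the data can answer.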
SQL vs. MapReduce vs. Spark
Briefly, we want to model data in a way that allows our data processing engine to take advantage of it for the problem we’re trying to solve.
The general awesomeness of the dimensional data model for reporting and exploratory analytics: all of the visual SQL interfaces expect it and are optimized for it, and all of the query engines try to accommodate it as best they can. Denormalization, but not aggregation.
Aggregation as a special kind of denormalization.
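This distinction can be shown in a few lines of SQL (here via sqlite3, with a toy two-table schema invented for the example): denormalization materializes the join, while an aggregate table goes one step further and pre-computes the group-by as well.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer TEXT);
    CREATE TABLE lineitems (order_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 'acme'), (2, 'acme'), (3, 'globex');
    INSERT INTO lineitems VALUES (1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5);

    -- Denormalization: materialize the join, one wide row per line item.
    CREATE TABLE order_facts AS
      SELECT o.order_id, o.customer, l.amount
      FROM orders o JOIN lineitems l ON o.order_id = l.order_id;

    -- Aggregation as a special kind of denormalization:
    -- pre-compute the rollup, losing the row-level detail.
    CREATE TABLE customer_totals AS
      SELECT customer, SUM(amount) AS total
      FROM order_facts GROUP BY customer;
""")
totals = dict(con.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"))
```

The denormalized `order_facts` table can still answer any question the normalized tables could; `customer_totals` can only answer the one it was built for.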
7 of these queries operate on three core tables: customers, orders, and lineitems
Exhibit for Spark here.
Aligning our data product strategy with our company's product strategy.