Feature engineering (writing code to map raw input data into a set of signals that will be fed into a machine learning algorithm) is the dark art of data science. Although the process of crafting new features is tedious and failure-prone, a diverse set of high-quality features informed by domain experts is the key to a successful model. Recently, academic researchers have begun to focus on the problem of feature engineering, and have started to publish research that addresses the relative lack of tools designed to support the feature engineering process. In this talk, I will review some of my favorite papers and present some efforts to convert these ideas into tools that leverage the principles of reactive application design in order to make feature engineering (dare I say it) fun.
Discuss traffic congestion and the problem of induced demand.
Discuss scheduling and resource management (e.g., you're only allowed to drive your Ferrari between midnight and 6 AM).
As a tool for interactively working with "unstructured" data sources and building machine learning models, Spark is unparalleled.
Data scientists know how to structure data in a way that maximizes the number of questions that can be answered by a single MR job.
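One way to sketch this idea: if the data is kept in a "long" layout (one row per entity, metric, and value), a single generic group-by job can answer many different questions just by varying the grouping key. The event log and field names below are hypothetical, purely for illustration.

```python
from collections import defaultdict

# Hypothetical event log in "long" form: one row per (user, metric, value).
events = [
    {"user": "a", "metric": "clicks", "value": 3},
    {"user": "a", "metric": "purchases", "value": 1},
    {"user": "b", "metric": "clicks", "value": 5},
]

def aggregate(rows, key):
    """One generic 'MR job': sum values grouped by an arbitrary key."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["value"]
    return dict(totals)

# Same job, same data layout, two different questions:
by_user = aggregate(events, "user")      # totals per user
by_metric = aggregate(events, "metric")  # totals per metric
```

The point is that the structure of the data, not the job logic, determines how many questions one pass over the data can answer.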
SQL vs. MapReduce vs. Spark
Briefly, we want to model data in a way that allows our data processing engine to take advantage of it for the problem we’re trying to solve.
The general awesomeness of the dimensional data model for reporting and exploratory analytics: all of the visual SQL interfaces expect it and are optimized for it, and all of the query engines try to accommodate it as best they can. Denormalization, but not aggregation.
Aggregation as a special kind of denormalization.
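This distinction can be shown in a few lines of SQL (here via sqlite3, with a toy two-table schema invented for the example): denormalization materializes the join, while an aggregate table goes one step further and pre-computes the group-by as well.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer TEXT);
    CREATE TABLE lineitems (order_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 'acme'), (2, 'acme'), (3, 'globex');
    INSERT INTO lineitems VALUES (1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5);

    -- Denormalization: materialize the join, one wide row per line item.
    CREATE TABLE order_facts AS
      SELECT o.order_id, o.customer, l.amount
      FROM orders o JOIN lineitems l ON o.order_id = l.order_id;

    -- Aggregation as a special kind of denormalization:
    -- pre-compute the rollup, losing the row-level detail.
    CREATE TABLE customer_totals AS
      SELECT customer, SUM(amount) AS total
      FROM order_facts GROUP BY customer;
""")
totals = dict(con.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"))
```

The denormalized `order_facts` table can still answer any question the normalized tables could; `customer_totals` can only answer the one it was built for.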
7 of these queries operate on three core tables: customers, orders, and lineitems
Exhibit for Spark here.
Aligning our data product strategy with our company's product strategy.