1© Cloudera, Inc. All rights reserved.
Engines, Models, and Algorithms for Building Data Products
Josh Wills | Senior Director of Data Science
Data Science in Action
About Me
Think Like A Data Scientist:
SQL, MapReduce, and Spark
The Life of a Data Processing Job
1. Read and deserialize input data.
2. Project/filter input records.
3. Shuffle: serialize the data, send it over the network, deserialize it.
4. Apply aggregation logic.
5. Serialize output data.
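The five stages of a data processing job can be sketched in a few lines of plain Python. This is only an illustration, not how any particular engine implements them: the record fields (`user`, `clicks`), the JSON wire format, and the `run_job` helper are assumptions made up for the example.

```python
import json
from collections import defaultdict


def run_job(raw_lines, num_partitions=2):
    """Toy single-process walk through the five stages (illustrative only)."""
    # 1. Read and deserialize input data.
    records = [json.loads(line) for line in raw_lines]

    # 2. Project/filter input records: keep two fields, drop zero-click rows.
    projected = [(r["user"], r["clicks"]) for r in records if r["clicks"] > 0]

    # 3. Shuffle: serialize each pair, route it to a partition by key,
    #    and deserialize it on the receiving side.
    partitions = defaultdict(list)
    for key, value in projected:
        wire = json.dumps([key, value])
        partitions[sum(key.encode("utf-8")) % num_partitions].append(wire)

    # 4. Apply aggregation logic (sum of clicks per user).
    totals = {}
    for wires in partitions.values():
        for wire in wires:
            key, value = json.loads(wire)
            totals[key] = totals.get(key, 0) + value

    # 5. Serialize output data.
    return [json.dumps({"user": k, "clicks": v}) for k, v in sorted(totals.items())]
```

Even in this toy version, every record is serialized or deserialized four times (input, shuffle out, shuffle in, output), which is where most of the work goes.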
Handling the Cost of Serialization
• Most of the time in any data processing job is spent serializing/deserializing data
• Two ways we can pay this cost
  • CPU (Compressed data)
  • I/O (Uncompressed data)
• The different engines have different strategies for handling this cost
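The CPU versus I/O trade-off in serialization can be seen with the standard library alone. The sketch below assumes JSON as the serialization format and zlib as the codec, both chosen only for illustration; `serialized_sizes` is a hypothetical helper.

```python
import json
import zlib


def serialized_sizes(records, level=6):
    """Return (uncompressed_bytes, compressed_bytes) for the same records."""
    raw = json.dumps(records).encode("utf-8")  # bytes to move if we pay in I/O
    packed = zlib.compress(raw, level)         # fewer bytes, paid for in CPU
    # Reading the compressed form back costs CPU again.
    assert json.loads(zlib.decompress(packed)) == records
    return len(raw), len(packed)
```

Calling it on repetitive data, for example `serialized_sizes([{"query": "data science"}] * 1000)`, shows the compressed form is much smaller, at the price of the extra compress and decompress calls.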
The Traditional RDBMS Approach
The Cost of The Traditional RDBMS Approach
Query Scheduling and Exploratory Data Analysis
The Spark Approach
The Cost of the Spark Approach
(Shameless Plug)
The MapReduce Approach
MapReduce In The Hands of a Data Scientist
Example: Hive Multi-Insert
Some Speculative Thoughts on the Future
Data Modeling for Data Scientists
Data Engines and Data Modeling
• Data modeling is the process by which we organize our data in order to meet our business requirements
• Historically, this has been a relational-centric design exercise
• Given the properties of these different engines and our goals, how should we organize our data?
SQL Engines and the Dimensional Data Model
NoSQL Data Models: Normalization and Aggregation
Motivating Example: Spelling Correction
Event Series Analytics
A Simple Star Schema for Spell Correction
Designing the Spell Correction Data Product
• What parameters does this model need…
  • during the analysis phase?
  • during deployment?
• Some Candidates
  • Lag time between events
  • Similarity of queries
  • What else?
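The two candidate parameters named for the spell correction model, lag time between events and similarity of queries, could be computed per session roughly as follows. This is a sketch under assumptions: the `(timestamp_seconds, query)` event layout and the `candidate_features` name are invented here, and `difflib.SequenceMatcher` stands in for whatever similarity measure the real system used.

```python
from difflib import SequenceMatcher


def candidate_features(events):
    """events: one session's (timestamp_seconds, query) pairs, sorted by time."""
    features = []
    for (t0, q0), (t1, q1) in zip(events, events[1:]):
        features.append({
            "from": q0,
            "to": q1,
            # Lag time between consecutive events.
            "lag_seconds": t1 - t0,
            # Similarity of queries, in [0, 1]; 1.0 means identical strings.
            "similarity": SequenceMatcher(None, q0, q1).ratio(),
        })
    return features
```

A short lag combined with a high similarity between consecutive queries is the kind of signal that suggests the second query corrects the first.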
A Supernova Schema for Search
Spell Correction in SQL
Exhibit: http://github.com/jwills/exhibit
An Illustrative Example
Build A Supernova Table
The Resulting Table Schema
A Simple Aggregation
A More Interesting Aggregation
Exhibit ETL Engine
• Crunch-based MapReduce job that uses a YAML file to execute a series of computations over a given Hive table
• Each run may generate multiple output tables
• Each output table is composed of one or more aggregations
  • SQL/JavaScript Code
  • Key Fields
  • Types: SUM, QUANTILE, etc.
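The idea of a declarative spec that produces output tables from key fields and typed aggregations can be sketched in plain Python. Note that this is not Exhibit's actual YAML format or API: the `SPEC` layout, the field names, and the `run` and `quantile` helpers are all hypothetical, and per-record SQL/JavaScript code is left out entirely.

```python
from collections import defaultdict

# Hypothetical spec: one output table, its key fields, and its aggregations.
SPEC = {
    "clicks_by_country": {
        "keys": ["country"],
        "aggregations": [
            {"name": "total_clicks", "field": "clicks", "type": "SUM"},
            {"name": "median_lag", "field": "lag", "type": "QUANTILE", "q": 0.5},
        ],
    },
}


def quantile(values, q):
    """Simple quantile: index into the sorted values (no interpolation)."""
    ordered = sorted(values)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]


def run(spec, rows):
    """Produce one dict per output table, keyed by the tuple of key fields."""
    outputs = {}
    for table, conf in spec.items():
        groups = defaultdict(list)
        for row in rows:
            groups[tuple(row[k] for k in conf["keys"])].append(row)
        result = {}
        for key, members in groups.items():
            out = {}
            for agg in conf["aggregations"]:
                values = [m[agg["field"]] for m in members]
                if agg["type"] == "SUM":
                    out[agg["name"]] = sum(values)
                elif agg["type"] == "QUANTILE":
                    out[agg["name"]] = quantile(values, agg["q"])
                else:
                    raise ValueError("unsupported type: " + agg["type"])
            result[key] = out
        outputs[table] = result
    return outputs
```

Adding a second entry to `SPEC` would yield a second output table from the same pass over `rows`, which mirrors the "each run may generate multiple output tables" point.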
FAQ: How Do You Handle Updates?
• Your first supernova should be used to support batch reporting and model building operations
• Apache HBase is a supernova data model with real-time update capabilities
  • BigTable was originally used to handle real-time updates to DocJoins for web indexing
FAQ: What Data Belongs Inside of the Supernova?
Pushing The Limits: The TPC-H Benchmark
TPC-H: Hive vs. Impala (Floratou et al., 2014)
Query   Hive MR (seconds)   MR jobs   Impala (seconds)
1       228                 1         24
3       462                 3         115
4       232                 2         147
6       85                  1         6
10      502                 3         50
12      241                 2         40
13      228                 2         127
18      859                 4         189
TOTAL   2837                18        698
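The totals and per-query ratios in the Hive vs. Impala comparison can be checked with a little arithmetic. The numbers below are copied from the table; the `RESULTS` layout and the `speedups` helper are just one hypothetical way to hold them.

```python
# query -> (Hive MR seconds, number of MR jobs, Impala seconds)
RESULTS = {
    1: (228, 1, 24),
    3: (462, 3, 115),
    4: (232, 2, 147),
    6: (85, 1, 6),
    10: (502, 3, 50),
    12: (241, 2, 40),
    13: (228, 2, 127),
    18: (859, 4, 189),
}


def speedups(results):
    """Hive-to-Impala runtime ratio per query (higher means Impala is faster)."""
    return {q: hive / impala for q, (hive, _jobs, impala) in results.items()}


hive_total = sum(hive for hive, _, _ in RESULTS.values())
jobs_total = sum(jobs for _, jobs, _ in RESULTS.values())
impala_total = sum(impala for _, _, impala in RESULTS.values())
```

The sums reproduce the TOTAL row (2837 seconds over 18 MR jobs versus 698 seconds), an overall ratio of roughly 4x, with the per-query ratio ranging from under 2x (queries 4 and 13) to about 14x (query 6).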
Data Modeling on Spark
GraphX
• Graph algorithms on Spark
• Developed at AMPLab, now part of core project
• Created by Joseph Gonzalez, one of the founders of GraphLab/Dato
• Uses custom partitioned VertexRDD and EdgeRDD structures to represent graph data and apply graph algorithms like PageRank, connected components, etc.
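To make the PageRank mention concrete, here is a small single-machine version over an edge list. It is plain Python and does not use GraphX or its VertexRDD/EdgeRDD structures; the `pagerank` function, its parameters, and the normalized formulation are assumptions for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (src, dst) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    n = len(vertices)
    out_degree = {v: 0 for v in vertices}
    for src, _dst in edges:
        out_degree[src] += 1

    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices with no out-edges is spread evenly.
        dangling = sum(rank[v] for v in vertices if out_degree[v] == 0)
        nxt = {v: (1 - damping) / n + damping * dangling / n for v in vertices}
        # Each edge passes an equal share of its source's rank to its target.
        for src, dst in edges:
            nxt[dst] += damping * rank[src] / out_degree[src]
        rank = nxt
    return rank
```

The per-edge loop is the part a distributed engine parallelizes: messages flow along edges and are summed at each destination vertex.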
Spark for Time Series Data
• Spark Time Series Project
• Utilities for munging time series data and applying statistical tests
• Similar to MATLAB/Pandas
• EWMA
• Autoregressive models
• Missing data imputation
• https://github.com/cloudera/spark-timeseries
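As a reminder of what EWMA (exponentially weighted moving average) computes, here is a minimal pure-Python version. It is not the spark-timeseries API; the `ewma` function and its `alpha` parameter are assumptions for this sketch.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average.

    s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed
```

A larger `alpha` tracks recent observations more closely, while a smaller one smooths more aggressively.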
Executing A Data Product Strategy
Aligning Data Strategy with Product Strategy
The Data Products Factory
• Google has built a factory for creating data products
• Spell correction engine started as a one-off, resource-intensive data analysis job
• But it was generalized to power multiple data products
  • Oneboxes
  • Search recommendations
  • Personalized search
Public Transit for Questions
Building Public Transit Is Expensive…
…but it beats the alternative
• Our data infrastructure is evolving into a major metropolitan area
  • More data producers and consumers
  • Diverse needs
• Major metropolitan areas without significant mass transit systems:
  • Los Angeles
  • Atlanta
  • Jakarta
Going To Real Time
Real-Time Systems and Counting Stuff
• Lots of approaches to building systems that create real-time data cubes
  • Summingbird from Twitter
  • Kylin from eBay
  • Pinot from LinkedIn
• Lots of volume and velocity, but not so much variety
• Excellent for reporting, less useful for machine learning
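A real-time data cube is, at its core, a set of counters updated as each event arrives. The toy class below illustrates that idea only; it is not how Summingbird, Kylin, or Pinot work internally, and the `CountCube` name and its methods are hypothetical.

```python
from collections import Counter
from itertools import combinations


class CountCube:
    """Keeps a running count for every combination of dimension values."""

    def __init__(self, dimensions):
        self.dimensions = tuple(dimensions)
        self.counts = Counter()

    def update(self, event):
        # Increment one cell per subset of dimensions (including the empty
        # subset, which holds the grand total).
        for size in range(len(self.dimensions) + 1):
            for dims in combinations(self.dimensions, size):
                self.counts[tuple((d, event[d]) for d in dims)] += 1

    def count(self, **filters):
        # Build the key in the same dimension order used by update().
        key = tuple((d, filters[d]) for d in self.dimensions if d in filters)
        return self.counts[key]
```

Every query is a single lookup, which is what makes this shape good for reporting; the counters hold no per-event detail, which is part of why it is less useful as input for machine learning.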
The Operational/Analytical Impedance Mismatch
NoSQL Data Models Revisited
Exhibit-Style Machine Learning
• The challenge in taking machine learning models to production is the feature engineering, not the model itself
  • Ensuring all data is available
  • Verifying that features are calculated the same way in both environments
• One data model with multiple backend implementations
  • Offline vs. Online
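One way to read "one data model with multiple backend implementations" is that the feature logic is written once and both environments call it. The sketch below assumes a per-user record holding a list of event timestamps; `features`, `OfflineBackend`, and `OnlineBackend` are hypothetical names, not part of Exhibit.

```python
def features(user_record):
    """Single feature definition shared by the offline and online paths."""
    events = user_record["events"]  # sorted event timestamps, in seconds
    return {
        "n_events": len(events),
        "span_seconds": events[-1] - events[0] if events else 0,
    }


class OfflineBackend:
    """Batch path: compute features for every record, e.g. for training."""

    def __init__(self, table):
        self.table = table  # user_id -> record

    def compute_all(self):
        return {uid: features(rec) for uid, rec in self.table.items()}


class OnlineBackend:
    """Serving path: compute features for one record at request time."""

    def __init__(self, store):
        self.store = store  # any mapping with key lookup

    def compute(self, user_id):
        return features(self.store[user_id])
```

Because both backends call the same `features` function, a check such as `OfflineBackend(t).compute_all()[uid] == OnlineBackend(t).compute(uid)` holds by construction, which is the "calculated the same way in both environments" property.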
Thanks!
jwills@cloudera.com
@josh_wills

Similaire à Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering by Josh Wills of Cloudera

Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
 
Cloudera Altus: Big Data in the Cloud Made Easy
Cloudera Altus: Big Data in the Cloud Made EasyCloudera Altus: Big Data in the Cloud Made Easy
Cloudera Altus: Big Data in the Cloud Made EasyCloudera, Inc.
 
Spark and Deep Learning Frameworks at Scale 7.19.18
Spark and Deep Learning Frameworks at Scale 7.19.18Spark and Deep Learning Frameworks at Scale 7.19.18
Spark and Deep Learning Frameworks at Scale 7.19.18Cloudera, Inc.
 
Building Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball ApproachBuilding Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball Approachjoshwills
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesCloudera, Inc.
 
The Vision & Challenge of Applied Machine Learning
The Vision & Challenge of Applied Machine LearningThe Vision & Challenge of Applied Machine Learning
The Vision & Challenge of Applied Machine LearningCloudera, Inc.
 
Unlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerUnlock Hadoop Success with Cloudera Navigator Optimizer
Unlock Hadoop Success with Cloudera Navigator OptimizerCloudera, Inc.
 
Dagster - DataOps and MLOps for Machine Learning Engineers.pdf
Dagster - DataOps and MLOps for Machine Learning Engineers.pdfDagster - DataOps and MLOps for Machine Learning Engineers.pdf
Dagster - DataOps and MLOps for Machine Learning Engineers.pdfHong Ong
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Uri Laserson
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduCloudera, Inc.
 
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaEnterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaNeo4j
 
Machine Learning Model Deployment: Strategy to Implementation
Machine Learning Model Deployment: Strategy to ImplementationMachine Learning Model Deployment: Strategy to Implementation
Machine Learning Model Deployment: Strategy to ImplementationDataWorks Summit
 
Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18Cloudera, Inc.
 
From Insight to Action: Using Data Science to Transform Your Organization
From Insight to Action: Using Data Science to Transform Your OrganizationFrom Insight to Action: Using Data Science to Transform Your Organization
From Insight to Action: Using Data Science to Transform Your OrganizationCloudera, Inc.
 
Serverless patterns
Serverless patternsServerless patterns
Serverless patternsJesse Butler
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Cloudera, Inc.
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartchCloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Hadoop and Manufacturing
Hadoop and ManufacturingHadoop and Manufacturing
Hadoop and ManufacturingCloudera, Inc.
 

Big Data Day LA 2015 - Brainwashed: Building an IDE for Feature Engineering by Josh Wills of Cloudera

  • 1. © Cloudera, Inc. All rights reserved. Engines, Models, and Algorithms for Building Data Products. Josh Wills | Senior Director of Data Science. Data Science in Action
  • 2. About Me
  • 3. Think Like A Data Scientist: SQL, MapReduce, and Spark
  • 4. The Life of a Data Processing Job
      1. Read and deserialize input data.
      2. Project/filter input records.
      3. Shuffle: serialize it, send over the network, deserialize it.
      4. Apply aggregation logic.
      5. Serialize output data.
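The five stages on slide 4 can be sketched in plain Python. This is only an illustration under stated assumptions: JSON stands in for the serialization format, in-memory lists stand in for the network, and the record fields (`user`, `bytes`) and the function name `run_job` are hypothetical, not taken from the talk.

```python
import json
from collections import defaultdict


def run_job(raw_lines, num_partitions=2):
    # 1. Read and deserialize input data.
    records = [json.loads(line) for line in raw_lines]

    # 2. Project/filter input records (keep two fields, drop empty rows).
    projected = [(r["user"], r["bytes"]) for r in records if r["bytes"] > 0]

    # 3. Shuffle: serialize each pair, "send" it to the partition that owns
    #    its key, and deserialize it on the receiving side.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in projected:
        # Simple deterministic partitioner (hypothetical choice).
        target = sum(key.encode("utf-8")) % num_partitions
        partitions[target].append(json.dumps([key, value]))

    # 4. Apply aggregation logic (sum of bytes per user).
    totals = defaultdict(int)
    for partition in partitions:
        for payload in partition:
            key, value = json.loads(payload)
            totals[key] += value

    # 5. Serialize output data.
    return [json.dumps({"user": k, "total_bytes": v})
            for k, v in sorted(totals.items())]


raw = [
    '{"user": "a", "bytes": 10}',
    '{"user": "b", "bytes": 0}',
    '{"user": "a", "bytes": 5}',
    '{"user": "c", "bytes": 7}',
]
output = run_job(raw)
for line in output:
    print(line)
```

Note that serialization or deserialization happens in stages 1, 3, and 5, which is the cost the next slide focuses on.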
  • 5. Handling the Cost of Serialization
      • Most of the time in any data processing job is spent serializing/deserializing data
      • Two ways we can pay this cost
          • CPU (Compressed data)
          • I/O (Uncompressed data)
      • The different engines have different strategies for handling this cost
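The CPU versus I/O trade-off on slide 5 can be made concrete with a small measurement. The sketch below assumes JSON records and `zlib` as the codec; both are stand-ins chosen for illustration, and the synthetic records are hypothetical. Timings and ratios will vary by machine and data.

```python
import json
import time
import zlib

# Synthetic, repetitive records (hypothetical data for illustration only).
records = [{"id": i, "query": "data science", "clicks": i % 7}
           for i in range(10000)]
raw = json.dumps(records).encode("utf-8")

# Paying with CPU: spend time compressing so fewer bytes hit disk or network.
start = time.perf_counter()
compressed = zlib.compress(raw, 6)
compress_seconds = time.perf_counter() - start

# The reader pays CPU again to get the original bytes back.
restored = zlib.decompress(compressed)

print("uncompressed bytes:", len(raw))
print("compressed bytes:  ", len(compressed))
print("compression ratio: ", round(len(raw) / len(compressed), 1))
print("compress seconds:  ", round(compress_seconds, 4))
```

Uncompressed data skips the CPU work but moves every byte through I/O; compressed data moves fewer bytes at the price of the compress and decompress steps.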
  • 6. The Traditional RDBMS Approach
  • 7. The Cost of The Traditional RDBMS Approach
  • 8. Query Scheduling and Exploratory Data Analysis
  • 9. The Spark Approach
  • 10. The Cost of the Spark Approach
  • 11. (Shameless Plug)
  • 12. The MapReduce Approach
  • 13. MapReduce In The Hands of a Data Scientist
  • 14. Example: Hive Multi-Insert
  • 15. Some Speculative Thoughts on the Future
  • 16. Data Modeling for Data Scientists
  • 17. Data Engines and Data Modeling
      • Data modeling is the process by which we organize our data in order to meet our business requirements
      • Historically, this has been a relational-centric design exercise
      • Given the properties of these different engines and our goals, how should we organize our data?
  • 18. SQL Engines and the Dimensional Data Model
  • 19. NoSQL Data Models: Normalization and Aggregation
  • 20. Motivating Example: Spelling Correction
  • 21. Event Series Analytics
  • 22. A Simple Star Schema for Spell Correction
  • 23. Designing the Spell Correction Data Product
      • What parameters does this model need…
          • during the analysis phase?
          • during deployment?
      • Some Candidates
          • Lag time between events
          • Similarity of queries
          • What else?
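The two candidate parameters on slide 23, lag time between events and similarity of queries, can be computed from a per-user event series. This is a minimal sketch, not the method from the talk: it assumes events are `(timestamp_seconds, query)` tuples, uses `difflib.SequenceMatcher` as a stand-in similarity measure, and the thresholds (30 seconds, 0.8) are hypothetical.

```python
from difflib import SequenceMatcher


def candidate_features(events):
    """Compute features for each consecutive pair of queries.

    events: list of (timestamp_seconds, query) for one user, sorted by time.
    """
    features = []
    for (t0, q0), (t1, q1) in zip(events, events[1:]):
        features.append({
            "from": q0,
            "to": q1,
            "lag_seconds": t1 - t0,
            "similarity": SequenceMatcher(None, q0, q1).ratio(),
        })
    return features


session = [
    (0, "machne learning"),
    (4, "machine learning"),
    (300, "pizza near me"),
]
pairs = candidate_features(session)

# Hypothetical rule: a quick rewrite to a very similar query looks like
# the user correcting their own spelling.
likely = [p for p in pairs
          if p["lag_seconds"] <= 30 and p["similarity"] >= 0.8]
for p in likely:
    print(p["from"], "->", p["to"])
```

The same two features are needed both when analyzing historical logs and when scoring live queries, which is why the slide asks about the analysis and deployment phases separately.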
  • 24. A Supernova Schema for Search
  • 25. Spell Correction in SQL
  • 26. Exhibit: http://github.com/jwills/exhibit
  • 27. An Illustrative Example
  • 28. Build A Supernova Table
  • 29. The Resulting Table Schema
  • 30. A Simple Aggregation
  • 31. A More Interesting Aggregation
  • 32. Exhibit ETL Engine
      • Crunch-based MapReduce job that uses a YAML file to execute a series of computations over a given Hive table
      • Each run may generate multiple output tables
      • Each output table is composed of one or more aggregations
          • SQL/JavaScript Code
          • Key Fields
          • Types: SUM, QUANTILE, etc.
  • 33. FAQ: How Do You Handle Updates?
      • Your first supernova should be used to support batch reporting and model building operations
      • Apache HBase is a supernova data model with real-time update capabilities
      • BigTable was originally used to handle real-time updates to DocJoins for web indexing
  • 34. FAQ: What Data Belongs Inside of the Supernova?
  • 35. Pushing The Limits: The TPC-H Benchmark
  • 36. TPC-H: Hive vs. Impala (Floratou et al., 2014)
      Query | Hive MR (seconds, with # of MR jobs) | Impala (seconds)
      1     | 228 (1 MR)                           | 24
      3     | 462 (3 MR)                           | 115
      4     | 232 (2 MR)                           | 147
      6     | 85 (1 MR)                            | 6
      10    | 502 (3 MR)                           | 50
      12    | 241 (2 MR)                           | 40
      13    | 228 (2 MR)                           | 127
      18    | 859 (4 MR)                           | 189
      TOTAL | 2837 (18 MR)                         | 698
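The totals on slide 36 can be checked, and per-query ratios derived, directly from the rows. The numbers below are transcribed from the slide; the ratio calculation is an added illustration, not a figure reported in the talk.

```python
# (query, hive_seconds, hive_mr_jobs, impala_seconds), transcribed from slide 36.
rows = [
    (1, 228, 1, 24),
    (3, 462, 3, 115),
    (4, 232, 2, 147),
    (6, 85, 1, 6),
    (10, 502, 3, 50),
    (12, 241, 2, 40),
    (13, 228, 2, 127),
    (18, 859, 4, 189),
]

hive_total = sum(hive for _, hive, _, _ in rows)
mr_total = sum(jobs for _, _, jobs, _ in rows)
impala_total = sum(impala for _, _, _, impala in rows)

# Ratio of Hive-on-MapReduce time to Impala time, per query and overall.
ratios = {query: hive / impala for query, hive, _, impala in rows}
overall = hive_total / impala_total

print("Hive total:", hive_total, "seconds over", mr_total, "MR jobs")
print("Impala total:", impala_total, "seconds")
for query, ratio in ratios.items():
    print("query", query, "ratio", round(ratio, 1))
print("overall ratio", round(overall, 2))
```

The sums match the TOTAL row of the slide (2837 seconds over 18 MR jobs versus 698 seconds).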
  • 37. Data Modeling on Spark
  • 38. GraphX
      • Graph algorithms on Spark
      • Developed at AMPLab, now part of core project
      • Created by Joseph Gonzalez, one of the founders of Graphlab/Dato
      • Uses custom partitioned VertexRDD and EdgeRDD structures to represent graph data and apply graph algorithms like PageRank, connected components, etc.
  • 39. Spark for Time Series Data
      • Spark Time Series Project
      • Utilities for munging time series data and applying statistical tests
      • Similar to MATLAB/Pandas
      • EWMA
      • Autoregressive models
      • Missing data imputation
      • https://github.com/cloudera/spark-timeseries
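For readers unfamiliar with EWMA (exponentially weighted moving average), one of the utilities listed on slide 39, here is a minimal plain-Python sketch of the standard recurrence s_0 = x_0, s_t = alpha * x_t + (1 - alpha) * s_{t-1}. It does not use or reproduce the spark-timeseries API; the function name `ewma` and its signature are assumptions made for this example.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average of a sequence.

    alpha in (0, 1] controls how quickly older observations are forgotten.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for value in values:
        if not smoothed:
            # Seed the average with the first observation.
            smoothed.append(value)
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


series = [1.0, 2.0, 3.0]
print(ewma(series, 0.5))
```

A larger `alpha` tracks recent values more closely; a smaller one produces a smoother series.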
  • 40. Executing A Data Product Strategy
  • 41. Aligning Data Strategy with Product Strategy
  • 42. The Data Products Factory
      • Google has built a factory for creating data products
      • Spell correction engine started as a one-off, resource-intensive data analysis job
      • But it was generalized to power multiple data products
          • Oneboxes
          • Search recommendations
          • Personalized search
  • 43. Public Transit for Questions
  • 44. Building Public Transit Is Expensive…
  • 45. …but it beats the alternative
      • Our data infrastructure is evolving into a major metropolitan area
          • More data producers and consumers
          • Diverse needs
      • Major metropolitan areas without significant mass transit systems:
          • Los Angeles
          • Atlanta
          • Jakarta
  • 46. Going To Real Time
  • 47. Real-Time Systems and Counting Stuff
      • Lots of approaches to building systems that create real-time data cubes
          • Summingbird from Twitter
          • Kylin from eBay
          • Pinot from LinkedIn
      • Lots of volume and velocity, but not so much variety
      • Excellent for reporting, less useful for machine learning
  • 48. The Operational/Analytical Impedance Mismatch
  • 49. NoSQL Data Models Revisited
  • 50. Exhibit-Style Machine Learning
      • The challenge in taking machine learning models to production is the feature engineering, not the model itself
          • Ensuring all data is available
          • Verifying that features are calculated the same way in both environments
      • One data model with multiple backend implementations
          • Offline vs. Online
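The idea on slide 50, one feature definition shared by an offline and an online backend, can be sketched as follows. This is an assumption-laden illustration rather than how Exhibit works: the names `FEATURES`, `offline_features`, and `OnlineFeatures`, and the event shape (`ts`, `query`), are all hypothetical.

```python
def lag_seconds(prev_event, event):
    # Single definition of the feature, shared by both backends.
    return event["ts"] - prev_event["ts"]


FEATURES = {"lag_seconds": lag_seconds}


def offline_features(history):
    """Batch backend: compute features over a stored, time-ordered event log."""
    return [
        {name: fn(prev, cur) for name, fn in FEATURES.items()}
        for prev, cur in zip(history, history[1:])
    ]


class OnlineFeatures:
    """Serving backend: keep only the last event and compute on arrival."""

    def __init__(self):
        self.last = None

    def update(self, event):
        row = None
        if self.last is not None:
            row = {name: fn(self.last, event) for name, fn in FEATURES.items()}
        self.last = event
        return row


log = [
    {"ts": 0, "query": "machne learning"},
    {"ts": 4, "query": "machine learning"},
    {"ts": 300, "query": "pizza near me"},
]

batch_rows = offline_features(log)

online = OnlineFeatures()
stream_rows = [row for row in (online.update(e) for e in log) if row is not None]

# Both environments must agree, because both call the same feature functions.
print(batch_rows == stream_rows)
```

Because both backends call the functions registered in `FEATURES`, a change to a feature definition reaches the offline and online paths together, which addresses the "calculated the same way in both environments" bullet.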
  • 51. Thanks! jwills@cloudera.com @josh_wills

Editor's Notes

  1. Discuss traffic congestion and the problem of induced demand.
  2. Discuss scheduling and resource management (i.e., you’re only allowed to drive your Ferrari between midnight and 6 AM.)
  3. Spark, as a tool for interactively working with “unstructured” data sources and building machine learning models, is unparalleled.
  4. Data scientists know how to structure data in a way that maximizes the number of questions that can be answered by a single MR job.
  5. SQL vs. MapReduce vs. Spark
  6. Briefly, we want to model data in a way that allows our data processing engine to take advantage of it for the problem we’re trying to solve.
  7. The general awesomeness of the dimensional data model for reporting and exploratory analytics. How all of the visual SQL interfaces expect it and are optimized for it, and how all of the engines try to accommodate it as best they can. Denormalization – but not aggregation.
  8. Aggregation as a special kind of denormalization.
  9. 7 of these queries operate on three core tables: customers, orders, and lineitems
  10. 7 of these queries operate on three core tables: customers, orders, and lineitems
  11. Exhibit for Spark here.
  12. Aligning our data product strategy w/our company product strategy
  13. Aggregation as a special kind of denormalization.