Data Transformation and Feature Engineering
Charles Parker
Allston Trading
Full Disclosure
• Oregon State University (Structured output spaces)

• Music recognition

• Real-time strategy game-playing

• Kodak Research Labs

• Media classification (audio, video)

• Document Classification

• Performance Evaluation

• BigML

• Allston Trading (applying machine learning to market data)
Data Transformation
• But it’s “machine learning”!

• Your data sucks (or at least I hope it does) . . .

• Data is broken

• Data is incomplete

• . . . but you know about it!

• Make the problem easier

• Make the answer more obvious

• Don’t waste time modeling the obvious

• Until you find the right algorithm for it
Your Data Sucks I: Broken Features
• Suppose you have a market data feature called
trade imbalance = (buy - sell) / total volume that
you calculate every five minutes

• Now suppose there are no trades over five minutes

• What to do?

• Point or feature removal

• Easy default
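
A minimal sketch of the "point removal" default, assuming total volume is simply buy plus sell (the slide does not define it) and using hypothetical volume figures. An undefined window is marked as NaN rather than dividing by zero, then dropped.

```python
import math

def trade_imbalance(buy, sell):
    """Return (buy - sell) / total volume, or NaN when nothing traded."""
    total = buy + sell  # assumption: total volume = buy + sell
    if total == 0:
        return math.nan  # undefined: mark it instead of dividing by zero
    return (buy - sell) / total

# Hypothetical five-minute windows as (buy volume, sell volume) pairs.
windows = [(120, 80), (0, 0), (30, 90)]
features = [trade_imbalance(b, s) for b, s in windows]

# Point removal: keep only the windows where the feature is defined.
clean = [f for f in features if not math.isnan(f)]
print(features)
print(clean)
```

Removing the whole feature instead would mean dropping the column for every window, which is only sensible if the empty windows are common.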
Your Data Sucks II: Missing Values
• Suppose you’re building a model
to predict the presence or
absence of cancer

• Each feature is a medical test

• Some are simple (height,
weight, temperature)

• Some are complex (blood
counts, CAT scan)

• Some patients have had all of
these done, some have not. 

• Does the presence or absence of
a CAT scan tell you something?
Should it be a feature?
Height  Weight  Blood Test  Cancer?
179     80      (missing)   No
160     60      2,4         No
150     65      4,5         Yes
155     70      (missing)   No
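
One way to let a model use "was the test done?" is to add an indicator column next to the filled-in value. This is a sketch, not a recommendation for real clinical data: the field names and the fill value of 0.0 are hypothetical, and reading the table's "2,4" as the decimal 2.4 is an assumption.

```python
# Rows mirror the table above; None marks a blood test that was never run.
patients = [
    {"height": 179, "weight": 80, "blood_test": None},
    {"height": 160, "weight": 60, "blood_test": 2.4},
    {"height": 150, "weight": 65, "blood_test": 4.5},
    {"height": 155, "weight": 70, "blood_test": None},
]

def add_missing_indicator(row, field, fill=0.0):
    """Copy the row, adding '<field>_present' and filling the gap."""
    out = dict(row)
    out[field + "_present"] = 0 if row[field] is None else 1
    out[field] = fill if row[field] is None else row[field]
    return out

encoded = [add_missing_indicator(p, "blood_test") for p in patients]
for row in encoded:
    print(row)
```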
Simplifying Your Problem
• What about the class
variable?

• It’s just another feature, so it
can be engineered

• Change the problem

• Do you need so many
classes?

• Do you need to do a
regression?
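
Both questions can be tried cheaply by rewriting the target column. The sketch below uses hypothetical data (star ratings and price moves) to show merging classes and turning a regression target into a yes/no target.

```python
def merge_classes(label, mapping):
    """Collapse a fine-grained label into a coarser one."""
    return mapping[label]

def regression_to_class(value, threshold):
    """Replace a numeric target with a yes/no question about a threshold."""
    return value >= threshold

# Hypothetical: five star ratings collapsed into three classes.
STAR_TO_SENTIMENT = {1: "negative", 2: "negative", 3: "neutral",
                     4: "positive", 5: "positive"}
stars = [5, 1, 3, 4]
sentiments = [merge_classes(s, STAR_TO_SENTIMENT) for s in stars]

# Hypothetical price moves: perhaps only the direction matters.
returns = [0.8, -1.2, 0.1]
went_up = [regression_to_class(r, 0.0) for r in returns]

print(sentiments)
print(went_up)
```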
Feature Engineering: What?
• Your data may be too “raw”
for learning

• Multimedia Data

• Raw text data

• Something must be done to
make the data “learnable”

• Compute edge histograms,
SIFT features

• Do word counts, latent
topic modeling
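
As a small illustration of making raw text "learnable", here is a word-count sketch. The tokenizer (lowercase, letters and apostrophes only) and the two example sentences are assumptions chosen for brevity.

```python
from collections import Counter
import re

def word_counts(text):
    """Lowercase the text and count runs of letters/apostrophes."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

docs = ["The cat sat on the mat.", "The dog chased the cat."]

# Shared vocabulary, sorted so every document uses the same column order.
vocab = sorted(set().union(*(word_counts(d) for d in docs)))

# One fixed-length count vector per document.
vectors = [[word_counts(d)[w] for w in vocab] for d in docs]

print(vocab)
print(vectors)
```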
An Instructive Example
• Build a model to determine if two geo-coordinates
are walking distance from one another
Lat. 1      Long. 1       Lat. 2      Long. 2       Can Walk?
48.871507   2.354350      48.872111   2.354933      Yes
48.872111   2.354933      44.597422   -123.248367   No
48.872232   2.354211      48.872111   2.354933      Yes
44.597422   -123.248367   48.872232   2.354211      No
• Whether two points are
walking distance from
each other is not an
obvious function of the
latitude and longitude

• But it is an obvious
function of the distance
between the two points

• Unfortunately, that
function is quite
complicated

• Fortunately, you know it
already!
Lat. 1      Long. 1       Lat. 2      Long. 2       Distance (km)   Can Walk?
48.871507   2.354350      48.872111   2.354933      2               Yes
48.872111   2.354933      44.597422   -123.248367   9059            No
48.872232   2.354211      48.872111   2.354933      5               Yes
44.597422   -123.248367   48.872232   2.354211      9056            No
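
The "complicated function you already know" can be sketched with the haversine formula. The slides do not say which distance formula was used, so haversine, the mean Earth radius, and the 3 km walkability cutoff are all assumptions here; the great-circle figures it computes need not match the values shown in the table.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed spherical Earth)
WALKABLE_KM = 3.0         # hypothetical cutoff for "walking distance"

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def can_walk(lat1, lon1, lat2, lon2):
    """With the distance feature in hand, the 'model' is one threshold."""
    return haversine_km(lat1, lon1, lat2, lon2) <= WALKABLE_KM

near = haversine_km(48.871507, 2.354350, 48.872111, 2.354933)
far = haversine_km(48.872111, 2.354933, 44.597422, -123.248367)
print(near, far)
```

Once the distance column exists, a single split on it separates the classes, which is the point of the example.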
Feature Engineering
• One of the core (maybe the core)
competencies of a machine learning engineer

• Requires domain understanding

• Requires algorithm understanding

• If you do it really well, you eliminate the need
for machine learning entirely

• Gives you another path to success; you can
often substitute domain knowledge for
modeling expertise

• But what if you don’t have specific domain
knowledge?
Techniques I: Discretization
• Construct meaningful bins for a
continuous feature (two or more)

• Body temperature

• Credit score

• The new features are categorical
features, each category of which
has nice semantics

• Don’t make the algorithm waste
effort modeling things that you
already know about
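
A sketch of discretizing body temperature. The cut points (in degrees Celsius) and the bin names are hypothetical; in practice they come from domain knowledge.

```python
import bisect

# Hypothetical cut points in degrees Celsius and the bins they separate.
EDGES = [35.0, 37.5, 39.0]
LABELS = ["hypothermia", "normal", "fever", "high fever"]

def discretize(value, edges=EDGES, labels=LABELS):
    """Map a continuous value to the label of the bin it falls into."""
    # bisect_right counts how many edges are <= value, which is the bin index.
    return labels[bisect.bisect_right(edges, value)]

readings = [36.8, 38.2, 34.0, 40.1]
print([discretize(r) for r in readings])
```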
Techniques II: Delta
• Sometimes, the difference between two features is
the important bit

• As it was in the distance example

• Also holds a lot in the time domain

• Example: Hiss in speech recognition

• Struggling? Just differentiate! (In all seriousness,
this sometimes works)
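
"Just differentiate" can be as simple as a first difference. The sketch below uses a made-up signal and shows that a constant offset (a crude stand-in for steady background noise, which is an assumption about what the hiss example means) disappears from the delta features.

```python
def deltas(series):
    """First differences: each value minus the one before it."""
    return [b - a for a, b in zip(series, series[1:])]

signal = [3, 5, 5, 9, 4]
shifted = [x + 10 for x in signal]  # same signal with a constant offset

print(deltas(signal))
print(deltas(shifted))  # the offset cancels out
```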
Techniques III: Windowing
• If points are distributed in time,
previous points in the same
window are often very informative

• Weather

• Stock prices

• Add this to a 1-d sequence of
points to get an instant machine
learning problem!

• Sensor data

• User behavior

• Maybe add some delta features?
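
Turning a 1-d sequence into a supervised problem can be sketched as follows: each row holds the previous `width` points as features and the next point as the target. The temperature values and the window width are hypothetical.

```python
def make_windows(series, width):
    """Pair each point with the `width` points that precede it."""
    rows = []
    for i in range(width, len(series)):
        rows.append((series[i - width:i], series[i]))
    return rows

temps = [10, 12, 11, 13, 15, 14]
rows = make_windows(temps, 3)
for features, target in rows:
    print(features, "->", target)
```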
Techniques IV: Standardization
• Constrain each feature to have a mean of zero and standard deviation of one
(subtract the mean and divide by the standard deviation).

• Good for domains with heterogeneous but Gaussian-distributed data sources

• Demographic data

• Medical testing

• Note that this isn’t in general effective for decision trees!

• Transformation is order preserving

• Decision tree splits rely only on ordering!

• Good for things like k-NN
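
A minimal standardization sketch using the population standard deviation (an assumption; the sample standard deviation is equally common) and assuming the feature is not constant. It also checks the order-preservation point: the ranking of the values is unchanged, which is why a tree's splits are unaffected.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumes the feature is not constant
    return [(v - mean) / std for v in values]

def rank_order(values):
    """Indices of the values from smallest to largest."""
    return sorted(range(len(values)), key=values.__getitem__)

heights = [179, 160, 150, 155]
z = standardize(heights)

print(z)
print(rank_order(heights) == rank_order(z))  # ordering is preserved
```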
Techniques V: Normalization
• Force each feature vector to have unit norm (e.g., [0, 1, 0] -> [0, 1, 0]
and [1, 1, 1] -> [0.58, 0.58, 0.58])

• Nice for sparse feature spaces like text

• Helps us tell the difference between documents and dictionaries

• We’ll come back to the idea of sparsity

• Note that this will affect decision trees

• Does not necessarily preserve order (co-dependency between
features)

• A lesson against over-generalization of technique!
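
A sketch of unit-norm scaling, assuming the Euclidean (L2) norm, which matches the example values on the slide. The second half uses two made-up vectors to show the order of a single feature flipping after normalization, because each value is divided by a norm that depends on the other features.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]

print(l2_normalize([0, 1, 0]))
print(l2_normalize([1, 1, 1]))

# Order is not preserved: feature 0 is smaller in `a` than in `b` before
# normalization, but larger afterwards.
a, b = [3, 4], [4, 100]
na, nb = l2_normalize(a), l2_normalize(b)
print(a[0] < b[0], na[0] < nb[0])
```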
What Do We Really Want?
• This is nice, but what ever happened to “machine
learning”?

• Construct a feature space in which “learning is
easy”, whatever that means

• The space must preserve “important aspects of the
data”, whatever that means

• Are there general ways of posing this problem?
(Spoiler Alert: Yes)
Aside I: Projection
• A projection is a mapping
from one feature space
to another

• We want a function f(x)
that projects a point x
into a space where a
good classifier is obvious

• The axes (features) in
your new space are
called your new basis
A Hack Projection: Distance to Cluster
• Do clustering on your data

• For each point, compute the
distance to each cluster centroid

• These distances are your new
features

• The new space can be either
higher or lower dimensional than
your original space

• For highly clustered data, this
can be a fairly powerful feature
space
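
The steps above can be sketched with a bare-bones k-means. The clustering algorithm is not named on the slide, so k-means, the naive "first k points" initialization, and the toy points are all assumptions.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=10):
    """Tiny k-means; centroids start at the first k points (naive init)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(c) / len(group) for c in zip(*group)]
    return centroids

def cluster_distance_features(point, centroids):
    """New feature vector: distance from the point to each centroid."""
    return [dist(point, c) for c in centroids]

points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids = kmeans(points, k=2)
features = cluster_distance_features((0, 0), centroids)
print(centroids)
print(features)
```

With k centroids the new space has k dimensions, regardless of how many features the original points had.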
Principal Component Analysis
• Find the axis through
the data with the
highest variance

• Repeat for the next
orthogonal axis and so
on, until you run out of
data or dimensions

• Each axis is a feature
PCA is Nice!
• Generally quite fast (matrix decomposition)

• Features are linear combinations of originals (which
means you can project test data into the space)

• Features are linearly uncorrelated (great for some
algorithms)

• Data can often be “explained” with just the first few
components (so this can be “dimensionality
reduction”)
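
To make "find the axis with the highest variance" concrete, here is a sketch that finds only the first principal axis by power iteration on the covariance matrix. This is a deliberately simple stand-in, not the matrix decomposition a real PCA library would use, and the data points are made up. Because the axis is just a set of linear weights, a new test point can be projected with the same means and weights.

```python
import math

def center(data):
    """Subtract the per-feature mean; return the centered rows and the means."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data], means

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Power iteration: repeatedly multiply by cov and renormalize."""
    d = len(cov)
    v = [1.0] + [0.5] * (d - 1)  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(row, means, axis):
    """Coordinate of a (possibly new) point along the axis."""
    return sum((x - m) * a for x, m, a in zip(row, means, axis))

# Made-up points lying roughly along the line y = x.
data = [(1, 1.0), (2, 2.1), (3, 2.9), (4, 4.2), (5, 4.8)]
centered, means = center(data)
axis = first_component(covariance(centered))
scores = [project(row, means, axis) for row in data]
new_score = project((6, 6.1), means, axis)  # projecting unseen test data
print(axis)
print(scores, new_score)
```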
Spectral Embeddings
• Two of the seminal ones
are Isomap and LLE

• Generally, compute the
nearest neighbor matrix
and use this to create the
embedding

• Pro: Pretty spectacular
results

• Con: No projection matrix
Combination Methods
• Large Margin Nearest Neighbor, Xing’s Method

• Create an objective function that preserves neighbor
relationships

• Neighbor distances (unsupervised)

• Closest points of the same class (supervised)

• Clever search for a projection matrix that satisfies this
objective (usually an elaborate sort of gradient descent)

• I’ve had some success with these
Aside II: Sparsity
• Machine learning is essentially compression, and
constantly plays at the edges of this idea

• Minimum description length

• Bayesian information criterion

• L1 and L2 regularization

• Sparse representations are easily compressed

• So does that mean they’re more powerful?
Sparsity I: Text Data
• Text data is inherently sparse

• The fact that we choose a small number of words to
use gives a document its semantics

• Text features are incredibly powerful in the grand
scheme of feature spaces

• One or two words allow us to do accurate
classification

• But those one or two words must be sparse
Sparsity II: EigenFaces
• Here are the first few
components of PCA applied to
a collection of face images

• A small number of these
explain a huge part of a huge
number of faces

• First components are like stop
words, last few (sparse)
components make recognition
easy
Sparsity III: The Fourier Transform
• Very complex waveform

• Turns out to be easily
expressible as a
combination of a few
(i.e., sparse) constant
frequency signals

• Such representations
make accurate speech
recognition possible
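
A small sketch of the sparsity claim, using a naive discrete Fourier transform on a synthetic signal (two sine waves at made-up frequencies, not real speech). Of the 32 frequency bins examined, only two carry any meaningful energy.

```python
import cmath
import math

def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

N = 64
# Synthetic waveform: two sine waves, 3 and 10 cycles per window.
signal = [math.sin(2 * math.pi * 3 * t / N)
          + 0.5 * math.sin(2 * math.pi * 10 * t / N)
          for t in range(N)]

spectrum = dft(signal)

# Look at the first half of the spectrum (the rest mirrors it for real input)
# and keep the bins whose magnitude is clearly above rounding noise.
significant = [k for k, c in enumerate(spectrum[:N // 2]) if abs(c) > 1.0]
print(significant)
```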
Sparse Coding
• Iterate

• Choose a basis

• Evaluate that basis based on how well you can use
it to reconstruct the input, and how sparse it is

• Take some sort of gradient step to improve that
evaluation

• Andrew Ng’s efficient sparse coding algorithms and
Hinton’s deep autoencoders are both flavors of this
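
The "evaluate that basis" step can be sketched as a single score: reconstruction error plus a sparsity penalty. Using squared error and an L1 penalty is an assumption (a common choice, but the slide does not specify one), and the basis, input, and weight are made up. The gradient step itself is not shown.

```python
def reconstruct(basis, code):
    """Weighted sum of basis vectors (atoms), one weight per atom."""
    dim = len(basis[0])
    return [sum(c * atom[i] for c, atom in zip(code, basis))
            for i in range(dim)]

def sparse_coding_loss(x, basis, code, sparsity_weight=0.1):
    """Squared reconstruction error plus an L1 penalty on the code."""
    recon = reconstruct(basis, code)
    error = sum((a - b) ** 2 for a, b in zip(x, recon))
    penalty = sum(abs(c) for c in code)
    return error + sparsity_weight * penalty

# Hypothetical overcomplete basis: three atoms for a 2-d input.
# (Real methods usually also constrain the length of each atom.)
basis = [[1, 0], [0, 1], [1, 1]]
x = [2, 2]

dense = sparse_coding_loss(x, basis, [2, 2, 0])   # two active atoms
sparse = sparse_coding_loss(x, basis, [0, 0, 2])  # one active atom
print(dense, sparse)
```

Both codes reconstruct the input exactly, so the sparser one gets the lower (better) score.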
The New Basis
• Text: Topics

• Audio: Frequency
Transform

• Visual: Pen Strokes
Another Hack: Totally Random Trees
• Train a bunch of decision trees

• With no objective!

• Each leaf is a feature

• Ta-da! Sparse basis

• This actually works
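
A sketch of the idea, under several assumptions the slide leaves open: every tree is a complete binary tree of fixed depth, each split picks a random feature and a random threshold in a known range, and no labels are used anywhere. Each point is encoded as a one-hot vector of the leaf it lands in, per tree.

```python
import random

def build_random_tree(n_features, depth, low, high, rng):
    """Complete binary tree stored as a list of (feature, threshold) splits."""
    n_internal = 2 ** depth - 1
    return [(rng.randrange(n_features), rng.uniform(low, high))
            for _ in range(n_internal)]

def leaf_index(tree, depth, point):
    """Walk from the root to a leaf; return the leaf's position (0-based)."""
    node = 0
    for _ in range(depth):
        feature, threshold = tree[node]
        # Heap layout: children of node n are 2n+1 (left) and 2n+2 (right).
        node = 2 * node + (1 if point[feature] <= threshold else 2)
    return node - (2 ** depth - 1)

def encode(forest, depth, point):
    """Concatenate one one-hot leaf vector per tree: a sparse binary code."""
    n_leaves = 2 ** depth
    vector = []
    for tree in forest:
        one_hot = [0] * n_leaves
        one_hot[leaf_index(tree, depth, point)] = 1
        vector.extend(one_hot)
    return vector

rng = random.Random(0)  # fixed seed so the sketch is repeatable
DEPTH = 3
forest = [build_random_tree(2, DEPTH, 0.0, 1.0, rng) for _ in range(4)]
code = encode(forest, DEPTH, (0.2, 0.9))
print(code)
```

With 4 trees of depth 3 the code has 32 entries, exactly 4 of which are 1.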
And More and More
• There are a ton of variations on these themes

• Dimensionality Reduction

• Metric Learning

• “Coding” or “Encoding”

• Nice canonical implementations can be found at:
http://lvdmaaten.github.io/drtoolbox/
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...kumargunjan9515
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...HyderabadDolls
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...nirzagarg
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...HyderabadDolls
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxronsairoathenadugay
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 

Dernier (20)

Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 

L5. Data Transformation and Feature Engineering

  • 1. Data Transformation and Feature Engineering Charles Parker Allston Trading
  • 2. Full Disclosure • Oregon State University (Structured output spaces) • Music recognition • Real-time strategy game-playing • Kodak Research Labs • Media classification (audio, video) • Document Classification • Performance Evaluation • BigML • Allston Trading (applying machine learning to market data)
  • 3. Data Transformation • But it’s “machine learning”! • Your data sucks (or at least I hope it does) . . . • Data is broken • Data is incomplete • . . . but you know about it! • Make the problem easier • Make the answer more obvious • Don’t waste time modeling the obvious • Until you find the right algorithm for it
  • 4. Your Data Sucks I: Broken Features • Suppose you have a market data feature called trade imbalance = (buy - sell) / total volume that you calculate every five minutes • Now suppose there are no trades over five minutes • What to do? • Point or feature removal • Easy default
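
The two remedies named on this slide can be sketched as follows. This is a minimal illustration, not the author's code: it assumes total volume is buy plus sell, and the volumes in `bars` are made up.

```python
def trade_imbalance(buy_volume, sell_volume, default=None):
    """Return (buy - sell) / total volume for one five-minute bar.

    Assumption: total volume is buy + sell. With no trades the ratio is
    undefined, so `default` is returned instead of dividing by zero.
    """
    total = buy_volume + sell_volume
    if total == 0:
        return default
    return (buy_volume - sell_volume) / total


# Hypothetical (buy, sell) volumes for three bars; the middle bar has no trades.
bars = [(120, 80), (0, 0), (30, 70)]

# Option 1: point removal. Mark the broken bar as None, then drop it.
raw = [trade_imbalance(b, s) for b, s in bars]
kept = [x for x in raw if x is not None]

# Option 2: easy default. Treat "no trades" as a perfectly balanced bar.
defaulted = [trade_imbalance(b, s, default=0.0) for b, s in bars]
```

Whether 0.0 is a sensible default depends on the domain; a model may also benefit from a separate flag recording that the bar was empty.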
  • 5. Your Data Sucks II: Missing Values • Suppose you’re building a model to predict the presence or absence of cancer • Each feature is a medical test • Some are simple (height, weight, temperature) • Some are complex (blood counts, CAT scan) • Some patients have had all of these done, some have not. • Does the presence or absence of a CAT scan tell you something? Should it be a feature?
      Height   Weight   Blood Test   Cancer?
      179      80       -            No
      160      60       2,4          No
      150      65       4,5          Yes
      155      70       -            No
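
One way to turn "was the test done?" into a feature is an explicit missingness indicator. The sketch below is an assumption-laden illustration: the field names are hypothetical, and the blood test values are read as decimals (2.4 and 4.5), which the slide does not confirm.

```python
# Rows modeled on the slide's table; None marks a test that was not performed.
patients = [
    {"height": 179, "weight": 80, "blood_test": None},
    {"height": 160, "weight": 60, "blood_test": 2.4},
    {"height": 150, "weight": 65, "blood_test": 4.5},
    {"height": 155, "weight": 70, "blood_test": None},
]


def add_missing_indicator(rows, feature, fill_value):
    """Add `<feature>_missing` (0 or 1) and fill the gaps with `fill_value`."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        missing = row.get(feature) is None
        new_row[feature + "_missing"] = 1 if missing else 0
        if missing:
            new_row[feature] = fill_value
        out.append(new_row)
    return out


# The fill value 0.0 is arbitrary; the indicator carries the real signal.
encoded = add_missing_indicator(patients, "blood_test", fill_value=0.0)
indicators = [row["blood_test_missing"] for row in encoded]
```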
  • 6. Simplifying Your Problem • What about the class variable? • It’s just another feature, so it can be engineered • Change the problem • Do you need so many classes? • Do you need to do a regression?
  • 7. Feature Engineering: What? • Your data may be too “raw” for learning • Multimedia Data • Raw text data • Something must be done to make the data “learnable” • Compute edge histograms, SIFT features • Do word counts, latent topic modeling
  • 8. An Instructive Example • Build a model to determine if two geo-coordinates are walking distance from one another
      Lat. 1      Long. 1       Lat. 2      Long. 2       Can Walk?
      48.871507   2.354350      48.872111   2.354933      Yes
      48.872111   2.354933      44.597422   -123.248367   No
      48.872232   2.354211      48.872111   2.354933      Yes
      44.597422   -123.248367   48.872232   2.354211      No
  • 9. Whether two points are walking distance from each other is not an obvious function of the latitude and longitude • But it is an obvious function of the distance between the two points • Unfortunately, computing that distance from the coordinates is quite complicated • Fortunately, you know how to do it already!
  • 10. An Instructive Example • Build a model to determine if two geo-coordinates are walking distance from one another
      Lat. 1      Long. 1       Lat. 2      Long. 2       Distance (km)   Can Walk?
      48.871507   2.354350      48.872111   2.354933      2               Yes
      48.872111   2.354933      44.597422   -123.248367   9059            No
      48.872232   2.354211      48.872111   2.354933      5               Yes
      44.597422   -123.248367   48.872232   2.354211      9056            No
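
The slides do not say how the distance column was computed. A common choice is the haversine great-circle formula, sketched here under that assumption; its output need not match the illustrative figures in the table, and the walking threshold `WALKABLE_KM` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation
WALKABLE_KM = 2.0         # hypothetical threshold, not from the slides


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def can_walk(lat1, lon1, lat2, lon2):
    """Once distance is a feature, the classifier is a single threshold."""
    return haversine_km(lat1, lon1, lat2, lon2) <= WALKABLE_KM


near = haversine_km(48.871507, 2.354350, 48.872111, 2.354933)
far = haversine_km(48.872111, 2.354933, 44.597422, -123.248367)
```

With the engineered feature in place, the learning problem collapses to picking one threshold, which is the point of the example.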
  • 11. Feature Engineering • One of the core (maybe the core) competencies of a machine learning engineer • Requires domain understanding • Requires algorithm understanding • If you do it really well, you eliminate the need for machine learning entirely • Gives you another path to success; you can often substitute domain knowledge for modeling expertise • But what if you don’t have specific domain knowledge?
  • 12. Techniques I: Discretization • Construct meaningful bins for a continuous feature (two or more) • Body temperature • Credit score • The new features are categorical features, each category of which has nice semantics • Don’t make the algorithm waste effort modeling things that you already know about
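
A minimal sketch of binning a continuous feature, using body temperature as on the slide. The bin edges and labels are illustrative assumptions, not clinical guidance.

```python
from bisect import bisect_right

# Hypothetical bin edges in degrees Celsius and one label per resulting bin.
TEMP_EDGES = [36.0, 37.5]
TEMP_LABELS = ["low", "normal", "fever"]


def discretize(value, edges, labels):
    """Map a continuous value to the label of the bin it falls in.

    `edges` must be sorted and `labels` needs len(edges) + 1 entries.
    A value equal to an edge goes into the bin above that edge.
    """
    return labels[bisect_right(edges, value)]


readings = [35.1, 36.8, 37.5, 38.2]
binned = [discretize(t, TEMP_EDGES, TEMP_LABELS) for t in readings]
```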
  • 13. Techniques II: Delta • Sometimes, the difference between two features is the important bit • As it was in the distance example • Also holds a lot in the time domain • Example: Hiss in speech recognition • Struggling? Just differentiate! (In all seriousness, this sometimes works)
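
In the time domain, "differentiate" usually means taking first differences of consecutive samples. A small sketch with made-up values:

```python
def deltas(series):
    """First differences: element i is series[i + 1] - series[i]."""
    return [b - a for a, b in zip(series, series[1:])]


# Hypothetical price series; the deltas describe movement rather than level.
prices = [10, 12, 11, 15]
price_deltas = deltas(prices)
```

The result has one element fewer than the input, so the first sample has no delta feature unless a default is chosen for it.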
  • 14. Techniques III: Windowing • If points are distributed in time, previous points in the same window are often very informative • Weather • Stock prices • Add this to a 1-d sequence of points to get an instant machine learning problem! • Sensor data • User behavior • Maybe add some delta features?
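
Windowing a 1-d sequence into a supervised problem can be sketched like this: each row holds the previous `width` points as features and the next point as the target. The function name and sample values are illustrative.

```python
def make_windows(series, width):
    """Return (features, target) pairs built from a sliding window."""
    rows = []
    for end in range(width, len(series)):
        features = series[end - width:end]  # the `width` points before the target
        target = series[end]
        rows.append((features, target))
    return rows


# Hypothetical sensor readings.
readings = [1, 2, 3, 4, 5]
examples = make_windows(readings, width=2)
```

Delta features can be appended to each window by differencing the values inside it.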
  • 15. Techniques IV: Standardization • Constrain each feature to have a mean of zero and standard deviation of one (subtract the mean and divide by the standard deviation). • Good for domains with heterogeneous but Gaussian-distributed data sources • Demographic data • Medical testing • Note that this isn’t in general effective for decision trees! • Transformation is order preserving • Decision tree splits rely only on ordering! • Good for things like k-NN
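
A short sketch of standardizing a single feature, which also checks the order-preservation claim. It uses the population standard deviation; whether to use the population or sample version is a choice the slide leaves open.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mean) / std for v in values]


def ordering(values):
    """Indices of the values sorted ascending; this is all a tree split sees."""
    return sorted(range(len(values)), key=lambda i: values[i])


heights_cm = [179, 160, 150, 155]
z_scores = standardize(heights_cm)
same_order = ordering(z_scores) == ordering(heights_cm)
```

Because the ordering is unchanged, a decision tree finds equivalent splits before and after, while a distance-based method such as k-NN sees a rescaled feature.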
  • 16. Techniques V: Normalization • Force each feature vector to have unit norm (e.g., [0, 1, 0] -> [0, 1, 0] and [1, 1, 1] -> [0.58, 0.58, 0.58]) • Nice for sparse feature spaces like text • Helps us tell the difference between documents and dictionaries • We’ll come back to the idea of sparsity • Note that this will affect decision trees • Does not necessarily preserve order (co-dependency between features) • A lesson against over-generalization of technique!
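
Unit-norm scaling works per row rather than per feature, which is why it can reorder values within a feature. The sketch below assumes the L2 (Euclidean) norm, under which each entry of [1, 1, 1] becomes 1/√3 ≈ 0.577.

```python
import math


def l2_normalize(vector):
    """Scale a feature vector to unit Euclidean length; zero vectors pass through."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


a = [3, 4]  # first feature is 3
b = [1, 0]  # first feature is 1, so a > b on that feature

a_unit = l2_normalize(a)
b_unit = l2_normalize(b)

# After normalization the comparison on the first feature flips.
order_flipped = (a[0] > b[0]) and (a_unit[0] < b_unit[0])
```

A decision tree that would split on the first feature therefore sees different orderings before and after normalization.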
  • 17. What Do We Really Want? • This is nice, but what ever happened to “machine learning”? • Construct a feature space in which “learning is easy”, whatever that means • The space must preserve “important aspects of the data”, whatever that means • Are there general ways of posing this problem? (Spoiler Alert: Yes)
  • 18. Aside I: Projection • A projection is a one-to-one mapping from one feature space to another • We want a function f(x) that projects a point x into a space where a good classifier is obvious • The axes (features) in your new space are called your new basis
  • 19. A Hack Projection: Distance to Cluster • Do clustering on your data • For each point, compute the distance to each cluster centroid • These distances are your new features • The new space can be either higher or lower dimensional than your original space • For highly clustered data, this can be a fairly powerful feature space
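
The feature construction step can be sketched as below. The clustering itself is not shown: the centroids are assumed to come from any clustering algorithm (k-means, for instance), and the coordinates are made up. `math.dist` requires Python 3.8 or later.

```python
import math

# Hypothetical centroids, assumed to be the output of a prior clustering step.
centroids = [(3.0, 4.0), (0.0, 1.0), (-2.0, 0.0)]


def distance_features(point, centroids):
    """New feature vector: Euclidean distance from `point` to each centroid."""
    return [math.dist(point, c) for c in centroids]


# A 2-d point becomes a 3-d point here because there are three centroids.
features = distance_features((0.0, 0.0), centroids)
```

The number of centroids sets the dimensionality of the new space, so it can be larger or smaller than the original.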
  • 20. Principal Components Analysis • Find the axis through the data with the highest variance • Repeat for the next orthogonal axis and so on, until you run out of data or dimensions • Each axis is a feature
  • 21. PCA is Nice! • Generally quite fast (matrix decomposition) • Features are linear combinations of originals (which means you can project test data into the space) • Features are linearly independent (great for some algorithms) • Data can often be “explained” with just the first few components (so this can be “dimensionality reduction”)
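
To make "the axis with the highest variance" concrete, here is a dependency-free sketch that finds only the first principal component by power iteration on the covariance matrix. Production code would normally use a matrix decomposition from a numerical library instead; the data points are invented.

```python
import math


def first_principal_component(points, iterations=200):
    """Return (means, axis): the feature means and the unit-length top component."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Arbitrary start vector; it must not be orthogonal to the top component.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # degenerate data with no variance
        v = [x / norm for x in w]
    return means, v


def project(point, means, axis):
    """Coordinate of `point` along the component; works for unseen test points too."""
    return sum((x - m) * a for x, m, a in zip(point, means, axis))


points = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.0)]
means, axis = first_principal_component(points)
reduced = [project(p, means, axis) for p in points]
```

Because the new feature is a linear combination of the originals, `project` can be applied to test data with the stored `means` and `axis`.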
  • 22. Spectral Embeddings • Two of the seminal ones are Isomap and LLE • Generally, compute the nearest neighbor matrix and use this to create the embedding • Pro: Pretty spectacular results • Con: No projection matrix
  • 23. Combination Methods • Large Margin Nearest Neighbor, Xing’s Method • Create an objective function that preserves neighbor relationships • Neighbor distances (unsupervised) • Closest points of the same class (supervised) • Clever search for a projection matrix that satisfies this objective (usually an elaborate sort of gradient descent) • I’ve had some success with these
  • 24. Aside II: Sparsity • Machine learning is essentially compression, and constantly plays at the edges of this idea • Minimum description length • Bayesian information criterion • L1 and L2 regularization • Sparse representations are easily compressed • So does that mean they’re more powerful?
  • 25. Sparsity I: Text Data • Text data is inherently sparse • The fact that we choose a small number of words to use gives a document its semantics • Text features are incredibly powerful in the grand scheme of feature spaces • One or two words allow us to do accurate classification • But those one or two words must be sparse
  • 26. Sparsity II: EigenFaces • Here are the first few components of PCA applied to a collection of face images • A small number of these explain a huge part of a huge number of faces • First components are like stop words, last few (sparse) components make recognition easy
  • 27. Sparsity III: The Fourier Transform • Very complex waveform • Turns out to be easily expressible as a combination of a few (i.e., sparse) constant frequency signals • Such representations make accurate speech recognition possible
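
The sparsity claim can be illustrated with a naive discrete Fourier transform. The signal below is synthetic (two sinusoids chosen for the example), and the O(N²) transform is written out for clarity; real code would use an FFT routine.

```python
import cmath
import math

N = 64  # number of samples

# Synthetic waveform: two constant-frequency components, at bins 3 and 7.
signal = [math.sin(2 * math.pi * 3 * n / N) + 0.5 * math.sin(2 * math.pi * 7 * n / N)
          for n in range(N)]


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    size = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / size) for n in range(size))
            for k in range(size)]


spectrum = dft(signal)

# All 64 time-domain samples matter, but below the Nyquist bin only
# two frequency coefficients are non-negligible.
active_bins = [k for k in range(N // 2) if abs(spectrum[k]) > 1.0]
```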
  • 28. Sparse Coding • Iterate • Choose a basis • Evaluate that basis based on how well you can use it to reconstruct the input, and how sparse it is • Take some sort of gradient step to improve that evaluation • Andrew Ng’s efficient sparse coding algorithms and Hinton’s deep autoencoders are both flavors of this
  • 29. The New Basis • Text: Topics • Audio: Frequency Transform • Visual: Pen Strokes
  • 30. Another Hack: Totally Random Trees • Train a bunch of decision trees • With no objective! • Each leaf is a feature • Ta-da! Sparse basis • This actually works
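
A simplified sketch of the idea: every split picks a random feature and a random threshold, with no objective, and the leaf a point lands in becomes a one-hot feature. The fixed-depth complete trees, function names, and feature ranges are assumptions made for brevity; library implementations grow their trees differently.

```python
import random


def build_random_tree(lows, highs, depth, rng):
    """Complete binary tree in heap order; each node is (feature, threshold)."""
    nodes = []
    for _ in range(2 ** depth - 1):
        feature = rng.randrange(len(lows))
        nodes.append((feature, rng.uniform(lows[feature], highs[feature])))
    return nodes


def leaf_index(tree, point, depth):
    """Walk from the root to a leaf and return its index in [0, 2**depth)."""
    node = 0
    for _ in range(depth):
        feature, threshold = tree[node]
        node = 2 * node + (1 if point[feature] <= threshold else 2)
    return node - (2 ** depth - 1)


def encode(trees, point, depth):
    """Concatenate one one-hot block per tree: exactly one 1 per tree."""
    leaves = 2 ** depth
    code = [0] * (len(trees) * leaves)
    for t, tree in enumerate(trees):
        code[t * leaves + leaf_index(tree, point, depth)] = 1
    return code


rng = random.Random(0)
DEPTH = 3
lows, highs = [0.0, 0.0], [1.0, 1.0]  # hypothetical feature ranges
trees = [build_random_tree(lows, highs, DEPTH, rng) for _ in range(5)]
code = encode(trees, (0.2, 0.9), DEPTH)
```

Here a 2-d point becomes a 40-dimensional binary vector with five non-zero entries, which is the sparse basis the slide refers to.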
  • 31. And More and More • There are a ton of variations on these themes • Dimensionality Reduction • Metric Learning • “Coding” or “Encoding” • Nice canonical implementations can be found at: http://lvdmaaten.github.io/drtoolbox/