Valencian Summer School 2015
Day 1
Lecture 5
Data Transformation and Feature Engineering
Charles Parker (Allston Trading)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
2. Full Disclosure
• Oregon State University (Structured output spaces)
• Music recognition
• Real-time strategy game-playing
• Kodak Research Labs
• Media classification (audio, video)
• Document Classification
• Performance Evaluation
• BigML
• Allston Trading (applying machine learning to market data)
3. Data Transformation
• But it’s “machine learning”!
• Your data sucks (or at least I hope it does) . . .
• Data is broken
• Data is incomplete
• . . . but you know about it!
• Make the problem easier
• Make the answer more obvious
• Don’t waste time modeling the obvious
• Until you find the right algorithm for it
4. Your Data Sucks I: Broken Features
• Suppose you have a market data feature called
trade imbalance = (buy - sell) / total volume that
you calculate every five minutes
• Now suppose there are no trades over five minutes
• What to do?
• Point or feature removal
• Easy default
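A minimal sketch (with made-up volume values, not from the talk) of guarding the trade-imbalance feature against the empty-window case:

```python
def trade_imbalance(buy_volume, sell_volume):
    # trade imbalance = (buy - sell) / total volume, computed per five-minute window
    total = buy_volume + sell_volume
    if total == 0:
        # No trades in this window: return a missing value instead of dividing by zero,
        # leaving the choice of point/feature removal or an easy default to the modeler.
        return None
    return (buy_volume - sell_volume) / total

print(trade_imbalance(120, 80))  # 0.2
print(trade_imbalance(0, 0))     # None
```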
5. Your Data Sucks II: Missing Values
• Suppose you’re building a model
to predict the presence or
absence of cancer
• Each feature is a medical test
• Some are simple (height,
weight, temperature)
• Some are complex (blood
counts, CAT scan)
• Some patients have had all of
these done, some have not.
• Does the presence or absence of
a CAT scan tell you something?
Should it be a feature?
Height   Weight   Blood Test   Cancer?
179      80       (missing)    No
160      60       2,4          No
150      65       4,5          Yes
155      70       (missing)    No
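A hedged sketch of turning "was this test performed?" into its own feature, using a small pandas frame that mirrors the table above (the decimal-comma blood-test values are read as 2.4 and 4.5):

```python
import numpy as np
import pandas as pd

# Patient table from the slide; NaN marks a test that was never run.
df = pd.DataFrame({
    "height": [179, 160, 150, 155],
    "weight": [80, 60, 65, 70],
    "blood_test": [np.nan, 2.4, 4.5, np.nan],
    "cancer": ["No", "No", "Yes", "No"],
})

# The presence or absence of a test may itself be informative, so encode it explicitly.
df["blood_test_done"] = df["blood_test"].notna().astype(int)
print(df)
```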
6. Simplifying Your Problem
• What about the class
variable?
• It’s just another feature, so it
can be engineered
• Change the problem
• Do you need so many
classes?
• Do you need to do a
regression?
7. Feature Engineering: What?
• Your data may be too “raw”
for learning
• Multimedia Data
• Raw text data
• Something must be done to
make the data “learnable”
• Compute edge histograms,
SIFT features
• Do word counts, latent
topic modeling
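For the raw-text case, a word-count sketch with scikit-learn (the two toy documents are mine):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

# Turn raw text into a sparse matrix of word counts:
# one row per document, one column per vocabulary word.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(counts.toarray())
```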
8. An Instructive Example
• Build a model to determine if two geo-coordinates
are walking distance from one another
Lat. 1      Long. 1       Lat. 2      Long. 2       Can Walk?
48.871507   2.354350      48.872111   2.354933      Yes
48.872111   2.354933      44.597422   -123.248367   No
48.872232   2.354211      48.872111   2.354933      Yes
44.597422   -123.248367   48.872232   2.354211      No
9. • Whether two points are
walking distance from
each other is not an
obvious function of the
latitude and longitude
• But it is an obvious
function of the distance
between the two points
• Unfortunately, that
function is quite
complicated
• Fortunately, you know it
already!
10. An Instructive Example
• Build a model to determine if two geo-coordinates
are walking distance from one another
Lat. 1      Long. 1       Lat. 2      Long. 2       Distance (km)   Can Walk?
48.871507   2.354350      48.872111   2.354933      2               Yes
48.872111   2.354933      44.597422   -123.248367   9059            No
48.872232   2.354211      48.872111   2.354933      5               Yes
44.597422   -123.248367   48.872232   2.354211      9056            No
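One standard way to compute the engineered distance column is the haversine formula; this sketch is not from the talk, and its kilometre values will not match the rounded figures in the table exactly:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometres.
    r = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Nearby points in Paris: well under a kilometre, clearly walkable.
print(haversine_km(48.871507, 2.354350, 48.872111, 2.354933))
# Paris to Corvallis, Oregon: thousands of kilometres, clearly not.
print(haversine_km(48.872111, 2.354933, 44.597422, -123.248367))
```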
11. Feature Engineering
• One of the core (maybe the core)
competencies of a machine learning engineer
• Requires domain understanding
• Requires algorithm understanding
• If you do it really well, you eliminate the need
for machine learning entirely
• Gives you another path to success; you can
often substitute domain knowledge for
modeling expertise
• But what if you don’t have specific domain
knowledge?
12. Techniques I: Discretization
• Construct two or more meaningful
bins for a continuous feature
• Body temperature
• Credit score
• The new features are categorical
features, each category of which
has nice semantics
• Don’t make the algorithm waste
effort modeling things that you
already know about
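A small illustrative binning function for the body-temperature example (the thresholds are mine, chosen only for illustration):

```python
def temperature_bin(temp_c):
    # Map a continuous body temperature (Celsius) into categories with clear semantics.
    if temp_c < 35.0:
        return "hypothermic"
    elif temp_c < 37.5:
        return "normal"
    elif temp_c < 39.5:
        return "fever"
    return "high fever"

print([temperature_bin(t) for t in (34.2, 36.8, 38.1, 40.0)])
# ['hypothermic', 'normal', 'fever', 'high fever']
```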
13. Techniques II: Delta
• Sometimes, the difference between two features is
the important bit
• As it was in the distance example
• Also holds a lot in the time domain
• Example: Hiss in speech recognition
• Struggling? Just differentiate! (In all seriousness,
this sometimes works)
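A minimal sketch of the "just differentiate" idea on a hypothetical 1-d signal:

```python
import numpy as np

# Hypothetical signal sampled over time; the raw values drift slowly.
signal = np.array([10.0, 10.2, 10.1, 14.8, 15.0, 15.1])

# First-order delta: the change from one sample to the next.
delta = np.diff(signal)
print(delta)  # the jump between the third and fourth samples stands out
```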
14. Techniques III: Windowing
• If points are distributed in time,
previous points in the same
window are often very informative
• Weather
• Stock prices
• Add this to a 1-d sequence of
points to get an instant machine
learning problem!
• Sensor data
• User behavior
• Maybe add some delta features?
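A sketch of windowing a 1-d series into a supervised learning table, with a delta feature thrown in (the column names and readings are mine):

```python
import pandas as pd

# Hypothetical daily sensor readings.
series = pd.Series([21.0, 22.5, 22.0, 19.5, 18.0, 18.5, 20.0], name="reading")

# Previous points in the window become features for predicting the current point.
frame = pd.DataFrame({
    "lag_3": series.shift(3),
    "lag_2": series.shift(2),
    "lag_1": series.shift(1),
    "delta_1": series.shift(1) - series.shift(2),  # a delta feature on top
    "target": series,
}).dropna()
print(frame)
```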
15. Techniques IV: Standardization
• Constrain each feature to have a mean of zero and standard deviation of one
(subtract the mean and divide by the standard deviation).
• Good for domains with heterogeneous but Gaussian-distributed data sources
• Demographic data
• Medical testing
• Note that this isn’t in general effective for decision trees!
• Transformation is order preserving
• Decision tree splits rely only on ordering!
• Good for things like k-NN
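The transformation itself is one line; a sketch with made-up heterogeneous features:

```python
import numpy as np

# Two features on very different scales, e.g. height (cm) and an income-like value.
X = np.array([[170.0, 65000.0],
              [180.0, 48000.0],
              [165.0, 52000.0]])

# Subtract the mean and divide by the standard deviation, per feature (column).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))  # approximately zero for each feature
print(X_std.std(axis=0))   # one for each feature
```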
16. Techniques V: Normalization
• Force each feature vector to have unit norm (e.g., [0, 1, 0] -> [0, 1, 0]
and [1, 1, 1] -> [0.57, 0.57, 0.57])
• Nice for sparse feature spaces like text
• Helps us tell the difference between documents and dictionaries
• We’ll come back to the idea of sparsity
• Note that this will affect decision trees
• Does not necessarily preserve order (co-dependency between
features)
• A lesson against over-generalization of technique!
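A sketch reproducing the slide's example vectors with per-row (per-feature-vector) unit-norm scaling:

```python
import numpy as np

# Two hypothetical word-count vectors: a focused document and a "dictionary-like" one.
docs = np.array([[0.0, 1.0, 0.0],
                 [1.0, 1.0, 1.0]])

# Scale each row to unit Euclidean length.
unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
print(unit)  # [[0, 1, 0], [0.577..., 0.577..., 0.577...]]
```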
17. What Do We Really Want?
• This is nice, but whatever happened to “machine
learning”?
• Construct a feature space in which “learning is
easy”, whatever that means
• The space must preserve “important aspects of the
data”, whatever that means
• Are there general ways of posing this problem?
(Spoiler Alert: Yes)
18. Aside I: Projection
• A projection is a one-to-one
mapping from one
feature space to another
• We want a function f(x)
that projects a point x
into a space where a
good classifier is obvious
• The axes (features) in
your new space are
called your new basis
19. A Hack Projection: Distance to Cluster
• Do clustering on your data
• For each point, compute the
distance to each cluster centroid
• These distances are your new
features
• The new space can be either
higher or lower dimensional than
your original space
• For highly clustered data, this
can be a fairly powerful feature
space
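A sketch of the distance-to-cluster projection with scikit-learn (the cluster count and data are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical two-blob data set.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Cluster, then use the distance from each point to every centroid as the new features.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
X_new = kmeans.transform(X)  # shape (100, 4): one distance per centroid
print(X_new[:3])
```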
20. Principal Components Analysis
• Find the axis through
the data with the
highest variance
• Repeat for the next
orthogonal axis and so
on, until you run out of
data or dimensions
• Each axis is a feature
21. PCA is Nice!
• Generally quite fast (matrix decomposition)
• Features are linear combinations of originals (which
means you can project test data into the space)
• Features are linearly independent (great for some
algorithms)
• Data can often be “explained” with just the first few
components (so this can be “dimensionality
reduction”)
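A PCA sketch showing the projection of held-out data and the explained-variance view mentioned above (the data here is random noise, so the components explain little; structured data behaves much better):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 10))
X_test = rng.normal(size=(20, 10))

pca = PCA(n_components=3).fit(X_train)  # axes of highest variance, in order
Z_train = pca.transform(X_train)        # linear combinations of the original features
Z_test = pca.transform(X_test)          # the same matrix projects unseen test data
print(pca.explained_variance_ratio_)    # how much variance each component "explains"
```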
22. Spectral Embeddings
• Two of the seminal ones
are Isomap and LLE
• Generally, compute the
nearest neighbor matrix
and use this to create the
embedding
• Pro: Pretty spectacular
results
• Con: No projection matrix
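A sketch of both embeddings via scikit-learn on a synthetic manifold (the data set and neighbour counts are arbitrary):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding

# A classic synthetic manifold: a 2-d sheet rolled up in 3-d.
X, _ = make_swiss_roll(n_samples=500, random_state=0)

# Both methods start from a nearest-neighbour graph and embed it in low dimension.
iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)
print(iso.shape, lle.shape)  # (500, 2) (500, 2)
```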
23. Combination Methods
• Large Margin Nearest Neighbor, Xing’s Method
• Create an objective function that preserves neighbor
relationships
• Neighbor distances (unsupervised)
• Closest points of the same class (supervised)
• Clever search for a projection matrix that satisfies this
objective (usually an elaborate sort of gradient descent)
• I’ve had some success with these
24. Aside II: Sparsity
• Machine learning is essentially compression, and
constantly plays at the edges of this idea
• Minimum description length
• Bayesian information criterion
• L1 and L2 regularization
• Sparse representations are easily compressed
• So does that mean they’re more powerful?
25. Sparsity I: Text Data
• Text data is inherently sparse
• The fact that we choose a small number of words to
use gives a document its semantics
• Text features are incredibly powerful in the grand
scheme of feature spaces
• One or two words allow us to do accurate
classification
• But those one or two words must be sparse
26. Sparsity II: EigenFaces
• Here are the first few
components of PCA applied to
a collection of face images
• A small number of these
explain a huge part of a huge
number of faces
• First components are like stop
words, last few (sparse)
components make recognition
easy
27. Sparsity III: The Fourier Transform
• Very complex waveform
• Turns out to be easily
expressible as a
combination of a few
(i.e., sparse) constant
frequency signals
• Such representations
make accurate speech
recognition possible
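A small numpy sketch of that sparsity claim: a waveform that looks complex in the time domain collapses to two spikes in the frequency domain:

```python
import numpy as np

# A sum of two constant-frequency sinusoids, sampled for one second at 1 kHz.
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=t[1] - t[0])
print(freqs[spectrum > 100])  # only two frequencies stand out: [ 5. 40.]
```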
28. Sparse Coding
• Iterate
• Choose a basis
• Evaluate that basis based on how well you can use
it to reconstruct the input, and how sparse it is
• Take some sort of gradient step to improve that
evaluation
• Andrew Ng’s efficient sparse coding algorithms and
Hinton’s deep autoencoders are both flavors of this
29. The New Basis
• Text: Topics
• Audio: Frequency
Transform
• Visual: Pen Strokes
30. Another Hack: Totally Random Trees
• Train a bunch of decision trees
• With no objective!
• Each leaf is a feature
• Ta-da! Sparse basis
• This actually works
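This trick is available off the shelf; a sketch with scikit-learn's RandomTreesEmbedding (data and parameters are arbitrary):

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))

# Fit completely random trees (no objective), then one-hot encode leaf membership:
# each leaf becomes a feature, giving a high-dimensional sparse basis.
embedder = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
X_sparse = embedder.fit_transform(X)
print(X_sparse.shape)  # (100, total number of leaves across all trees)
```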
31. And More and More
• There are a ton of variations on these themes
• Dimensionality Reduction
• Metric Learning
• “Coding” or “Encoding”
• Nice canonical implementations can be found at:
http://lvdmaaten.github.io/drtoolbox/