Valencian Summer School 2015
Day 1
Lecture 5
Data Transformation and Feature Engineering
Charles Parker (Allston Trading)
https://bigml.com/events/valencian-summer-school-in-machine-learning-2015
2. Full Disclosure
• Oregon State University (Structured output spaces)
• Music recognition
• Real-time strategy game-playing
• Kodak Research Labs
• Media classification (audio, video)
• Document Classification
• Performance Evaluation
• BigML
• Allston Trading (applying machine learning to market data)
3. Data Transformation
• But it’s “machine learning”!
• Your data sucks (or at least I hope it does) . . .
• Data is broken
• Data is incomplete
• . . . but you know about it!
• Make the problem easier
• Make the answer more obvious
• Don’t waste time modeling the obvious
• Until you find the right algorithm for it
4. Your Data Sucks I: Broken Features
• Suppose you have a market data feature called
trade imbalance = (buy - sell) / total volume that
you calculate every five minutes
• Now suppose there are no trades over five minutes
• What to do?
• Point or feature removal
• Easy default
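A minimal sketch (with made-up volume values, not from the talk) of guarding the trade-imbalance feature against the empty-window case:

```python
def trade_imbalance(buy_volume, sell_volume):
    # trade imbalance = (buy - sell) / total volume, computed per five-minute window
    total = buy_volume + sell_volume
    if total == 0:
        # No trades in this window: return a missing value instead of dividing by zero,
        # leaving the choice of point/feature removal or an easy default to the modeler.
        return None
    return (buy_volume - sell_volume) / total

print(trade_imbalance(120, 80))  # 0.2
print(trade_imbalance(0, 0))     # None
```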
5. Your Data Sucks II: Missing Values
• Suppose you’re building a model
to predict the presence or
absence of cancer
• Each feature is a medical test
• Some are simple (height,
weight, temperature)
• Some are complex (blood
counts, CAT scan)
• Some patients have had all of
these done, some have not.
• Does the presence or absence of
a CAT scan tell you something?
Should it be a feature?
Height   Weight   Blood Test   Cancer?
179      80       (missing)    No
160      60       2,4          No
150      65       4,5          Yes
155      70       (missing)    No
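A hedged sketch of turning "was this test performed?" into its own feature, using a small pandas frame that mirrors the table above (the decimal-comma blood-test values are read as 2.4 and 4.5):

```python
import numpy as np
import pandas as pd

# Patient table from the slide; NaN marks a test that was never run.
df = pd.DataFrame({
    "height": [179, 160, 150, 155],
    "weight": [80, 60, 65, 70],
    "blood_test": [np.nan, 2.4, 4.5, np.nan],
    "cancer": ["No", "No", "Yes", "No"],
})

# The presence or absence of a test may itself be informative, so encode it explicitly.
df["blood_test_done"] = df["blood_test"].notna().astype(int)
print(df)
```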
6. Simplifying Your Problem
• What about the class
variable?
• It’s just another feature, so it
can be engineered
• Change the problem
• Do you need so many
classes?
• Do you need to do a
regression?
7. Feature Engineering: What?
• Your data may be too “raw”
for learning
• Multimedia Data
• Raw text data
• Something must be done to
make the data “learnable”
• Compute edge histograms,
SIFT features
• Do word counts, latent
topic modeling
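For the raw-text case, a word-count sketch with scikit-learn (the two toy documents are mine):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

# Turn raw text into a sparse matrix of word counts:
# one row per document, one column per vocabulary word.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(counts.toarray())
```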
8. An Instructive Example
• Build a model to determine if two geo-coordinates
are walking distance from one another
Lat. 1      Long. 1       Lat. 2      Long. 2       Can Walk?
48.871507   2.354350      48.872111   2.354933      Yes
48.872111   2.354933      44.597422   -123.248367   No
48.872232   2.354211      48.872111   2.354933      Yes
44.597422   -123.248367   48.872232   2.354211      No
9. • Whether two points are
walking distance from
each other is not an
obvious function of the
latitude and longitude
• But it is an obvious
function of the distance
between the two points
• Unfortunately, that
function is quite
complicated
• Fortunately, you know it
already!
10. An Instructive Example
• Build a model to determine if two geo-coordinates
are walking distance from one another
Lat. 1      Long. 1       Lat. 2      Long. 2       Distance (km)   Can Walk?
48.871507   2.354350      48.872111   2.354933      2               Yes
48.872111   2.354933      44.597422   -123.248367   9059            No
48.872232   2.354211      48.872111   2.354933      5               Yes
44.597422   -123.248367   48.872232   2.354211      9056            No
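One standard way to compute the engineered distance column is the haversine formula; this sketch is not from the talk, and its kilometre values will not match the rounded figures in the table exactly:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometres.
    r = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Nearby points in Paris: well under a kilometre, clearly walkable.
print(haversine_km(48.871507, 2.354350, 48.872111, 2.354933))
# Paris to Corvallis, Oregon: thousands of kilometres, clearly not.
print(haversine_km(48.872111, 2.354933, 44.597422, -123.248367))
```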
11. Feature Engineering
• One of the core (maybe the core)
competencies of a machine learning engineer
• Requires domain understanding
• Requires algorithm understanding
• If you do it really well, you eliminate the need
for machine learning entirely
• Gives you another path to success; you can
often substitute domain knowledge for
modeling expertise
• But what if you don’t have specific domain
knowledge?
12. Techniques I: Discretization
• Construct two or more meaningful
bins for a continuous feature
• Body temperature
• Credit score
• The new features are categorical
features, each category of which
has nice semantics
• Don’t make the algorithm waste
effort modeling things that you
already know about
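A small illustrative binning function for the body-temperature example (the thresholds are mine, chosen only for illustration):

```python
def temperature_bin(temp_c):
    # Map a continuous body temperature (Celsius) into categories with clear semantics.
    if temp_c < 35.0:
        return "hypothermic"
    elif temp_c < 37.5:
        return "normal"
    elif temp_c < 39.5:
        return "fever"
    return "high fever"

print([temperature_bin(t) for t in (34.2, 36.8, 38.1, 40.0)])
# ['hypothermic', 'normal', 'fever', 'high fever']
```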
13. Techniques II: Delta
• Sometimes, the difference between two features is
the important bit
• As it was in the distance example
• Also holds a lot in the time domain
• Example: Hiss in speech recognition
• Struggling? Just differentiate! (In all seriousness,
this sometimes works)
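A minimal sketch of the "just differentiate" idea on a hypothetical 1-d signal:

```python
import numpy as np

# Hypothetical signal sampled over time; the raw values drift slowly.
signal = np.array([10.0, 10.2, 10.1, 14.8, 15.0, 15.1])

# First-order delta: the change from one sample to the next.
delta = np.diff(signal)
print(delta)  # the jump between the third and fourth samples stands out
```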
14. Techniques III: Windowing
• If points are distributed in time,
previous points in the same
window are often very informative
• Weather
• Stock prices
• Add this to a 1-d sequence of
points to get an instant machine
learning problem!
• Sensor data
• User behavior
• Maybe add some delta features?
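A sketch of windowing a 1-d series into a supervised learning table, with a delta feature thrown in (the column names and readings are mine):

```python
import pandas as pd

# Hypothetical daily sensor readings.
series = pd.Series([21.0, 22.5, 22.0, 19.5, 18.0, 18.5, 20.0], name="reading")

# Previous points in the window become features for predicting the current point.
frame = pd.DataFrame({
    "lag_3": series.shift(3),
    "lag_2": series.shift(2),
    "lag_1": series.shift(1),
    "delta_1": series.shift(1) - series.shift(2),  # a delta feature on top
    "target": series,
}).dropna()
print(frame)
```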
15. Techniques IV: Standardization
• Constrain each feature to have a mean of zero and standard deviation of one
(subtract the mean and divide by the standard deviation).
• Good for domains with heterogeneous but Gaussian-distributed data sources
• Demographic data
• Medical testing
• Note that this isn’t in general effective for decision trees!
• Transformation is order preserving
• Decision tree splits rely only on ordering!
• Good for things like k-NN
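The transformation itself is one line; a sketch with made-up heterogeneous features:

```python
import numpy as np

# Two features on very different scales, e.g. height (cm) and an income-like value.
X = np.array([[170.0, 65000.0],
              [180.0, 48000.0],
              [165.0, 52000.0]])

# Subtract the mean and divide by the standard deviation, per feature (column).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))  # approximately zero for each feature
print(X_std.std(axis=0))   # one for each feature
```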
16. Techniques V: Normalization
• Force each feature vector to have unit norm (e.g., [0, 1, 0] -> [0, 1, 0]
and [1, 1, 1] -> [0.57, 0.57, 0.57])
• Nice for sparse feature spaces like text
• Helps us tell the difference between documents and dictionaries
• We’ll come back to the idea of sparsity
• Note that this will affect decision trees
• Does not necessarily preserve order (co-dependency between
features)
• A lesson against over-generalization of technique!
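A sketch reproducing the slide's example vectors with per-row (per-feature-vector) unit-norm scaling:

```python
import numpy as np

# Two hypothetical word-count vectors: a focused document and a "dictionary-like" one.
docs = np.array([[0.0, 1.0, 0.0],
                 [1.0, 1.0, 1.0]])

# Scale each row to unit Euclidean length.
unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
print(unit)  # [[0, 1, 0], [0.577..., 0.577..., 0.577...]]
```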
17. What Do We Really Want?
• This is nice, but whatever happened to “machine
learning”?
• Construct a feature space in which “learning is
easy”, whatever that means
• The space must preserve “important aspects of the
data”, whatever that means
• Are there general ways of posing this problem?
(Spoiler Alert: Yes)
18. Aside I: Projection
• A projection is a one-to-one
mapping from one
feature space to another
• We want a function f(x)
that projects a point x
into a space where a
good classifier is obvious
• The axes (features) in
your new space are
called your new basis
19. A Hack Projection: Distance to Cluster
• Do clustering on your data
• For each point, compute the
distance to each cluster centroid
• These distances are your new
features
• The new space can be either
higher or lower dimensional than
your original space
• For highly clustered data, this
can be a fairly powerful feature
space
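A sketch of the distance-to-cluster projection with scikit-learn (the cluster count and data are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical two-blob data set.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Cluster, then use the distance from each point to every centroid as the new features.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
X_new = kmeans.transform(X)  # shape (100, 4): one distance per centroid
print(X_new[:3])
```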
20. Principal Components Analysis
• Find the axis through
the data with the
highest variance
• Repeat for the next
orthogonal axis and so
on, until you run out of
data or dimensions
• Each axis is a feature
21. PCA is Nice!
• Generally quite fast (matrix decomposition)
• Features are linear combinations of originals (which
means you can project test data into the space)
• Features are linearly independent (great for some
algorithms)
• Data can often be “explained” with just the first few
components (so this can be “dimensionality
reduction”)
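A PCA sketch showing the projection of held-out data and the explained-variance view mentioned above (the data here is random noise, so the components explain little; structured data behaves much better):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 10))
X_test = rng.normal(size=(20, 10))

pca = PCA(n_components=3).fit(X_train)  # axes of highest variance, in order
Z_train = pca.transform(X_train)        # linear combinations of the original features
Z_test = pca.transform(X_test)          # the same matrix projects unseen test data
print(pca.explained_variance_ratio_)    # how much variance each component "explains"
```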
22. Spectral Embeddings
• Two of the seminal ones
are Isomap and LLE
• Generally, compute the
nearest neighbor matrix
and use this to create the
embedding
• Pro: Pretty spectacular
results
• Con: No projection matrix
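A sketch of both embeddings via scikit-learn on a synthetic manifold (the data set and neighbour counts are arbitrary):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding

# A classic synthetic manifold: a 2-d sheet rolled up in 3-d.
X, _ = make_swiss_roll(n_samples=500, random_state=0)

# Both methods start from a nearest-neighbour graph and embed it in low dimension.
iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)
print(iso.shape, lle.shape)  # (500, 2) (500, 2)
```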
23. Combination Methods
• Large Margin Nearest Neighbor, Xing’s Method
• Create an objective function that preserves neighbor
relationships
• Neighbor distances (unsupervised)
• Closest points of the same class (supervised)
• Clever search for a projection matrix that satisfies this
objective (usually an elaborate sort of gradient descent)
• I’ve had some success with these
24. Aside II: Sparsity
• Machine learning is essentially compression, and
constantly plays at the edges of this idea
• Minimum description length
• Bayesian information criterion
• L1 and L2 regularization
• Sparse representations are easily compressed
• So does that mean they’re more powerful?
25. Sparsity I: Text Data
• Text data is inherently sparse
• The fact that we choose a small number of words to
use gives a document its semantics
• Text features are incredibly powerful in the grand
scheme of feature spaces
• One or two words allow us to do accurate
classification
• But those one or two words must be sparse
26. Sparsity II: EigenFaces
• Here are the first few
components of PCA applied to
a collection of face images
• A small number of these
explain a huge part of a huge
number of faces
• First components are like stop
words, last few (sparse)
components make recognition
easy
27. Sparsity III: The Fourier Transform
• Very complex waveform
• Turns out to be easily
expressible as a
combination of a few
(i.e., sparse) constant
frequency signals
• Such representations
make accurate speech
recognition possible
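A small numpy sketch of that sparsity claim: a waveform that looks complex in the time domain collapses to two spikes in the frequency domain:

```python
import numpy as np

# A sum of two constant-frequency sinusoids, sampled for one second at 1 kHz.
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=t[1] - t[0])
print(freqs[spectrum > 100])  # only two frequencies stand out: [ 5. 40.]
```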
28. Sparse Coding
• Iterate
• Choose a basis
• Evaluate that basis based on how well you can use
it to reconstruct the input, and how sparse it is
• Take some sort of gradient step to improve that
evaluation
• Andrew Ng’s efficient sparse coding algorithms and
Hinton’s deep autoencoders are both flavors of this
29. The New Basis
• Text: Topics
• Audio: Frequency
Transform
• Visual: Pen Strokes
30. Another Hack: Totally Random Trees
• Train a bunch of decision trees
• With no objective!
• Each leaf is a feature
• Ta-da! Sparse basis
• This actually works
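This trick is available off the shelf; a sketch with scikit-learn's RandomTreesEmbedding (data and parameters are arbitrary):

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))

# Fit completely random trees (no objective), then one-hot encode leaf membership:
# each leaf becomes a feature, giving a high-dimensional sparse basis.
embedder = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
X_sparse = embedder.fit_transform(X)
print(X_sparse.shape)  # (100, total number of leaves across all trees)
```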
31. And More and More
• There are a ton of variations on these themes
• Dimensionality Reduction
• Metric Learning
• “Coding” or “Encoding”
• Nice canonical implementations can be found at:
http://lvdmaaten.github.io/drtoolbox/