Understanding Feature Space in Machine Learning

Alice Zheng, Dato
September 9, 2015
1
2
My journey so far
Applied machine learning
(Data science)
Build ML tools
Shortage of experts
and good tools.
3
Why machine learning?
Model data.
Make predictions.
Build intelligent
applications.
4
The machine learning pipeline
Example raw data: “I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …”
Raw data → Features → Models → Predictions → Deploy in production
Feature = numeric representation of raw data
6
Representing natural text
Raw text: “It is a puppy and it is extremely cute.”
What’s important? Phrases? Specific words? Ordering? Subject, object, verb?
Classify: puppy or not?
Bag of words: {“it”: 2, “is”: 2, “a”: 1, “puppy”: 1, “and”: 1, “extremely”: 1, “cute”: 1}
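A minimal sketch of the bag-of-words step, assuming a simple lowercase, letters-only tokenizer (the slides do not specify one). The function name `bag_of_words` is hypothetical.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each word."""
    # Assumption: a token is a run of letters or apostrophes; punctuation is dropped.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


counts = bag_of_words("It is a puppy and it is extremely cute.")
print(dict(counts))
```

Word order is discarded: only the counts survive, which is exactly what makes this a "bag".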
7
Representing natural text
Raw text: “It is a puppy and it is extremely cute.”
Classify: puppy or not?
Bag of words (sparse vector representation):
it 2
they 0
I 0
am 0
how 0
puppy 1
and 1
cat 0
aardvark 0
cute 1
extremely 1
… …
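One way to picture the sparse vector: fix a vocabulary, give every word a position, and store only the nonzero counts. This is a sketch under assumptions: the short `VOCABULARY` list and the helper names `to_dense` and `to_sparse` are hypothetical, and a real vocabulary would cover every word in the corpus.

```python
from collections import Counter

# Hypothetical vocabulary for illustration; position in the list = vector index.
VOCABULARY = ["it", "they", "i", "am", "how", "puppy", "and",
              "cat", "aardvark", "cute", "extremely"]


def to_dense(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(text.lower().replace(".", "").split())
    return [counts[word] for word in vocabulary]


def to_sparse(dense):
    """Keep only the nonzero entries as {index: count}."""
    return {index: value for index, value in enumerate(dense) if value != 0}


dense = to_dense("It is a puppy and it is extremely cute.", VOCABULARY)
sparse = to_sparse(dense)
print(dense)
print(sparse)
```

With a realistic vocabulary of many thousands of words, almost every entry is zero, so the index-to-count form is far smaller than the dense list.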
8
Representing images
Image source: “Recognizing and learning object categories,” Li Fei-Fei, Rob Fergus, Antonio Torralba, ICCV 2005-2009.
Raw image: millions of RGB triplets, one for each pixel
Classify: person or animal?
Raw image → bag of visual words
9
Representing images
Classify: person or animal?
Raw image → deep learning features (dense vector representation):
[3.29, -15, -5.24, 48.3, 1.36, 47.1, -1.92, 36.5, 2.83, 95.4, -19, -89, 5.09, 37.8]
10
Feature space in machine learning
• Raw data → high-dimensional vectors
• Collection of data points → point cloud in feature space
• Model = geometric summary of point cloud
• Feature engineering = creating features of the appropriate
granularity for the task
Crudely speaking, mathematicians fall into two
categories: the algebraists, who find it easiest to
reduce all problems to sets of numbers and
variables, and the geometers, who understand the
world through shapes.
-- Masha Gessen, “Perfect Rigor”
12
Algebra vs. Geometry
Algebra: $a^2 + b^2 = c^2$
Geometry: [Figure: triangle with sides labeled a, b, c]
Pythagorean theorem (Euclidean space)
13
Visualizing a sphere in 2D
$x^2 + y^2 = 1$
Pythagorean theorem: $a^2 + b^2 = c^2$
[Figure: circle of radius 1 on x and y axes, with a triangle labeled a, b, c]
14
Visualizing a sphere in 3D
$x^2 + y^2 + z^2 = 1$
[Figure: sphere of radius 1 on x, y and z axes]
15
Visualizing a sphere in 4D
$x^2 + y^2 + z^2 + t^2 = 1$
[Figure: x, y and z axes, each marked at 1]
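The same algebraic test works in any dimension, which is the point of switching from pictures to equations. A small sketch, assuming points are given as tuples of coordinates; the name `on_unit_sphere` and the tolerance are hypothetical choices.

```python
import math


def on_unit_sphere(point, tol=1e-9):
    """True if the squared coordinates sum to 1, in any number of dimensions."""
    return math.isclose(sum(c * c for c in point), 1.0, abs_tol=tol)


print(on_unit_sphere((0.6, 0.8)))            # 2D: x^2 + y^2
print(on_unit_sphere((0.0, 0.0, 1.0)))       # 3D: x^2 + y^2 + z^2
print(on_unit_sphere((0.5, 0.5, 0.5, 0.5)))  # 4D: x^2 + y^2 + z^2 + t^2
print(on_unit_sphere((1.0, 1.0)))            # not on the circle
```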
16
Why are we looking at spheres?
[Figure: shapes joined by equals signs]
Poincaré Conjecture:
All physical objects without holes
are “equivalent” to a sphere.
17
The power of higher dimensions
• A sphere in 4D can model the birth and death process of
physical objects
• Point clouds = approximate geometric shapes
• High dimensional features can model many things
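One reading of the “birth and death” remark, offered as an assumption rather than the speaker's stated construction: treat the fourth coordinate t as time and slice the 4D sphere $x^2 + y^2 + z^2 + t^2 = 1$ at fixed t. Each slice is an ordinary 3D sphere of radius $\sqrt{1 - t^2}$. The function name `slice_radius` is hypothetical.

```python
import math


def slice_radius(t):
    """Radius of the 3D sphere obtained by fixing t on the unit 4D sphere.

    Returns None when the slice is empty (|t| > 1).
    """
    if abs(t) > 1:
        return None
    return math.sqrt(1 - t * t)


# The slice appears as a point at t = -1, grows to radius 1 at t = 0,
# then shrinks back to a point at t = 1.
for t in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(t, slice_radius(t))
```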
Visualizing Feature Space
19
The challenge of high-dimensional geometry
• Feature space can have hundreds to millions of
dimensions
• In high dimensions, our geometric imagination is limited
- Algebra comes to our aid
20
Visualizing bag-of-words
[Figure: axes “puppy” and “cute”, each marked at 1, with one document plotted as a point]
Document: “I have a puppy and it is extremely cute”
it 1
they 0
I 1
am 0
how 0
puppy 1
and 1
cat 0
aardvark 0
zebra 0
cute 1
extremely 1
… …
21
Visualizing bag-of-words
[Figure: axes “puppy”, “cute” and “extremely”, each marked at 1, with three documents plotted as points]
“I have a puppy and it is extremely cute”
“I have an extremely cute cat”
“I have a cute puppy”
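The coordinates of these three documents can be worked out by counting only the three axis words. A minimal sketch, assuming whitespace tokenization; `AXES` and `coordinates` are hypothetical names.

```python
AXES = ("puppy", "cute", "extremely")


def coordinates(text, axes=AXES):
    """Count each axis word in the text; the counts are the point's coordinates."""
    words = text.lower().split()
    return tuple(words.count(axis) for axis in axes)


documents = [
    "I have a puppy and it is extremely cute",
    "I have an extremely cute cat",
    "I have a cute puppy",
]
points = [coordinates(doc) for doc in documents]
for doc, point in zip(documents, points):
    print(point, doc)
```

Every other word in the vocabulary would add one more axis, so the full feature space has as many dimensions as there are distinct words.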
22
Document point cloud
[Figure: documents plotted as points; axes: word 1, word 2]
23
What is a model?
• Model = mathematical “summary” of data
• What’s a summary?
- A geometric shape
24
Classification model
Decide between two classes
[Figure: axes Feature 1 and Feature 2]
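As one concrete geometric shape, a linear classifier is a line (in 2D) or hyperplane that splits feature space in two. The sketch below uses hand-picked weights, not learned ones; `WEIGHTS`, `BIAS`, and `classify` are hypothetical names for illustration.

```python
def classify(point, weights, bias):
    """Return 1 if the point lies on or above the line w.x + b = 0, else 0."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score >= 0 else 0


# Assumed boundary for illustration: feature_1 + feature_2 = 3.
WEIGHTS, BIAS = (1.0, 1.0), -3.0

print(classify((2.5, 2.0), WEIGHTS, BIAS))
print(classify((0.5, 1.0), WEIGHTS, BIAS))
```

Training a classifier amounts to choosing the weights and bias so that the boundary separates the two classes of points as well as possible.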
25
Clustering model
Group data points tightly
[Figure: axes Feature 1 and Feature 2]
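A clustering model summarizes the point cloud by a few centers. The sketch below is a bare-bones k-means loop, chosen here as one familiar example of clustering; the data points and starting centroids are made up for illustration.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical 2D data: two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, labels = kmeans(points, centroids=[(0, 0), (10, 10)])
print(labels)
print(centroids)
```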
26
Regression model
Fit the target values
[Figure: axes Feature (horizontal) and Target (vertical)]
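For regression, the simplest geometric summary is a straight line through the points. A minimal sketch of an ordinary least-squares fit with one feature, using made-up data; `fit_line` is a hypothetical name.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for target = slope * feature + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on target = 2 * feature + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```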
Visualizing Feature Engineering
28
When does bag-of-words fail?
[Figure: axes “puppy”, “cat” and “have”, with four documents plotted as points]
“I have a puppy”
“I have a cat”
“I have a kitten”
“I have a dog and I have a pen”
Task: find a surface that separates
documents about dogs vs. cats
Problem: the word “have” adds fluff
instead of information
29
Improving on bag-of-words
• Idea: “normalize” word counts so that popular words
are discounted
• Term frequency (tf) = number of times a term
appears in a document
• Inverse document frequency of a word (idf) =
$\log \frac{N}{\text{number of documents containing the word}}$
• N = total number of documents
• Tf-idf count = tf × idf
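The definitions above can be sketched on a tiny corpus of four short sentences. Assumptions: whitespace tokenization, the natural logarithm (the slides do not name a base), and idf taken as log(N / number of documents containing the word); the names `idf` and `tf_idf` are hypothetical.

```python
import math

documents = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]
tokenized = [doc.lower().split() for doc in documents]
N = len(tokenized)  # total number of documents


def idf(word):
    """log(N / number of documents containing the word).

    Raises ZeroDivisionError for a word that appears in no document.
    """
    document_frequency = sum(1 for tokens in tokenized if word in tokens)
    return math.log(N / document_frequency)


def tf_idf(word, tokens):
    """Term frequency in this document times the word's idf."""
    return tokens.count(word) * idf(word)


print(idf("puppy"), idf("cat"), idf("have"))
print(tf_idf("have", tokenized[3]))  # "have" appears twice, but its idf is 0
```

A word that appears in every document gets idf = log 1 = 0, so its tf-idf count is 0 no matter how often it occurs.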
30
From BOW to tf-idf
[Figure: axes “puppy”, “cat” and “have”, with four documents plotted as points]
“I have a puppy”
“I have a cat”
“I have a kitten”
“I have a dog and I have a pen”
idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0
31
From BOW to tf-idf
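[Figure: axes “puppy”, “cat” and “have”; “I have a puppy” sits at log 4 on the puppy axis and “I have a cat” at log 4 on the cat axis, with a decision surface between them; “I have a dog and I have a pen” and “I have a kitten” share one point]
tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0
Tf-idf flattens uninformative dimensions in the BOW point cloud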
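The flattening can be checked by computing each document's coordinates on the three plotted axes, before and after tf-idf. This is a sketch under the same assumptions as the slides' numbers (four documents, idf = log(N / documents containing the word)), plus whitespace tokenization and the natural logarithm; the helper names are hypothetical.

```python
import math

AXES = ("puppy", "cat", "have")
documents = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]
tokenized = [doc.lower().split() for doc in documents]


def idf(word):
    """log(N / number of documents containing the word); every axis word occurs."""
    containing = sum(1 for tokens in tokenized if word in tokens)
    return math.log(len(tokenized) / containing)


def bow_vector(tokens):
    """Raw counts on the three axes."""
    return tuple(tokens.count(axis) for axis in AXES)


def tfidf_vector(tokens):
    """Counts scaled by idf on the three axes."""
    return tuple(tokens.count(axis) * idf(axis) for axis in AXES)


bow = [bow_vector(tokens) for tokens in tokenized]
tfidf = [tfidf_vector(tokens) for tokens in tokenized]
for doc, before, after in zip(documents, bow, tfidf):
    print(doc, before, after)
```

In the raw counts every document has a nonzero “have” coordinate; after tf-idf that coordinate is 0 for all of them, so only the “puppy” and “cat” axes carry any spread.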
32
Entry points of feature engineering
• Start from data and task
- What’s the best text representation for classification?
• Start from modeling method
- What kind of features does k-means assume?
- What does linear regression assume about the data?
33
That’s not all, folks!
• There’s a lot more to feature engineering:
- Feature normalization
- Feature transformations
- “Regularizing” models
- Learning the right features
• Dato is hiring! jobs@dato.com
alicez@dato.com @RainyData


Editor's notes

  1. Features sit between raw data and model. They can make or break an application.