SlideShare une entreprise Scribd logo
1  sur  33
Understanding
Feature Space in
Machine Learning
Alice Zheng, Dato
September 9, 2015
1
2
My journey so far
Applied machine learning
(Data science)
Build ML tools
Shortage of experts
and good tools.
3
Why machine learning?
Model data.
Make predictions.
Build intelligent
applications.
4
The machine learning pipeline
I fell in love the instant I laid
my eyes on that puppy. His
big eyes and playful tail, his
soft furry paws, …
Raw data
Features
Models
Predictions
Deploy in
production
Feature = numeric representation of raw data
6
Representing natural text
It is a puppy and it is
extremely cute.
What’s important?
Phrases? Specific
words? Ordering?
Subject, object, verb?
Classify:
puppy or not?
Raw Text
{“it”:2,
“is”:2,
“a”:1,
“puppy”:1,
“and”:1,
“extremely”:1,
“cute”:1 }
Bag of Words
7
Representing natural text
It is a puppy and it is
extremely cute.
Classify:
puppy or not?
Raw Text Bag of Words
it 2
they 0
I 1
am 0
how 0
puppy 1
and 1
cat 0
aardvark 0
cute 1
extremely 1
… …
Sparse vector
representation
8
Representing images
Image source: “Recognizing and learning object categories,”
Li Fei-Fei, Rob Fergus, Anthony Torralba, ICCV 2005—2009.
Raw image:
millions of RGB triplets,
one for each pixel
Classify:
person or animal?
Raw Image Bag of Visual Words
9
Representing images
Classify:
person or animal?
Raw Image Deep learning features
3.29
-15
-5.24
48.3
1.36
47.1
-
1.92
36.5
2.83
95.4
-19
-89
5.09
37.8
Dense vector
representation
10
Feature space in machine learning
• Raw data  high dimensional vectors
• Collection of data points  point cloud in feature space
• Model = geometric summary of point cloud
• Feature engineering = creating features of the appropriate
granularity for the task
Crudely speaking, mathematicians fall into two
categories: the algebraists, who find it easiest to
reduce all problems to sets of numbers and
variables, and the geometers, who understand the
world through shapes.
-- Masha Gessen, “Perfect Rigor”
12
Algebra vs. Geometry
a
b
c
a2 + b2 = c2
Algebra Geometry
Pythagorean
Theorem
(Euclidean space)
13
Visualizing a sphere in 2D
x2 + y2 = 1
a
b
c
Pythagorean theorem:
a2 + b2 = c2
x
y
1
1
14
Visualizing a sphere in 3D
x2 + y2 + z2 = 1
x
y
z
1
1
1
15
Visualizing a sphere in 4D
x2 + y2 + z2 + t2 = 1
x
y
z
1
1
1
16
Why are we looking at spheres?
= =
= =
Poincaré Conjecture:
All physical objects without holes
is “equivalent” to a sphere.
17
The power of higher dimensions
• A sphere in 4D can model the birth and death process of
physical objects
• Point clouds = approximate geometric shapes
• High dimensional features can model many things
Visualizing Feature Space
19
The challenge of high dimension geometry
• Feature space can have hundreds to millions of
dimensions
• In high dimensions, our geometric imagination is limited
- Algebra comes to our aid
20
Visualizing bag-of-words
puppy
cute
1
1
I have a puppy and
it is extremely cute
I have a puppy and
it is extremely cute
it 1
they 0
I 1
am 0
how 0
puppy 1
and 1
cat 0
aardvark 0
zebra 0
cute 1
extremely 1
… …
21
Visualizing bag-of-words
puppy
cute
1
1
1
extremely
I have a puppy and
it is extremely cute
I have an extremely
cute cat
I have a cute
puppy
22
Document point cloud
word 1
word 2
23
What is a model?
• Model = mathematical “summary” of data
• What’s a summary?
- A geometric shape
24
Classification model
Feature 2
Feature 1
Decide between two classes
25
Clustering model
Feature 2
Feature 1
Group data points tightly
26
Regression model
Target
Feature
Fit the target values
Visualizing Feature Engineering
28
When does bag-of-words fail?
puppy
cat
2
1
1
have
I have a puppy
I have a cat
I have a kitten
Task: find a surface that separates
documents about dogs vs. cats
Problem: the word “have” adds fluff
instead of information
I have a dog
and I have a pen
1
29
Improving on bag-of-words
• Idea: “normalize” word counts so that popular words
are discounted
• Term frequency (tf) = Number of times a terms
appears in a document
• Inverse document frequency of word (idf) =
• N = total number of documents
• Tf-idf count = tf x idf
30
From BOW to tf-idf
puppy
cat
2
1
1
have
I have a puppy
I have a cat
I have a kitten
idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0
I have a dog
and I have a pen
1
31
From BOW to tf-idf
puppy
cat1
have
tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0
I have a dog
and I have a pen,
I have a kitten
1
log 4
log 4
I have a cat
I have a puppy
Decision surface
Tf-idf flattens
uninformative
dimensions in the
BOW point cloud
32
Entry points of feature engineering
• Start from data and task
- What’s the best text representation for classification?
• Start from modeling method
- What kind of features does k-means assume?
- What does linear regression assume about the data?
33
That’s not all, folks!
• There’s a lot more to feature engineering:
- Feature normalization
- Feature transformations
- “Regularizing” models
- Learning the right features
• Dato is hiring! jobs@dato.com
alicez@dato.com @RainyData

Contenu connexe

Tendances

Tendances (20)

Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science
 
Forward checking
Forward checkingForward checking
Forward checking
 
Rate distortion theory
Rate distortion theoryRate distortion theory
Rate distortion theory
 
Linear models for classification
Linear models for classificationLinear models for classification
Linear models for classification
 
Forward and Backward chaining in AI
Forward and Backward chaining in AIForward and Backward chaining in AI
Forward and Backward chaining in AI
 
Randomized algorithms ver 1.0
Randomized algorithms ver 1.0Randomized algorithms ver 1.0
Randomized algorithms ver 1.0
 
Hamiltonian cycle in data structure 2
Hamiltonian cycle in data structure  2Hamiltonian cycle in data structure  2
Hamiltonian cycle in data structure 2
 
AI Lecture 3 (solving problems by searching)
AI Lecture 3 (solving problems by searching)AI Lecture 3 (solving problems by searching)
AI Lecture 3 (solving problems by searching)
 
Intelligent web applications
Intelligent web applicationsIntelligent web applications
Intelligent web applications
 
Intelligent agent
Intelligent agentIntelligent agent
Intelligent agent
 
K Nearest Neighbors
K Nearest NeighborsK Nearest Neighbors
K Nearest Neighbors
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
Word embedding
Word embedding Word embedding
Word embedding
 
AI - Introduction to Bellman Equations
AI - Introduction to Bellman EquationsAI - Introduction to Bellman Equations
AI - Introduction to Bellman Equations
 
Genetic Algorithm in Artificial Intelligence
Genetic Algorithm in Artificial IntelligenceGenetic Algorithm in Artificial Intelligence
Genetic Algorithm in Artificial Intelligence
 
Matrix chain multiplication
Matrix chain multiplicationMatrix chain multiplication
Matrix chain multiplication
 
Classification Algorithm.
Classification Algorithm.Classification Algorithm.
Classification Algorithm.
 
Multi Head, Multi Tape Turing Machine
Multi Head, Multi Tape Turing MachineMulti Head, Multi Tape Turing Machine
Multi Head, Multi Tape Turing Machine
 
Greedy algorithm
Greedy algorithmGreedy algorithm
Greedy algorithm
 
Planning in AI(Partial order planning)
Planning in AI(Partial order planning)Planning in AI(Partial order planning)
Planning in AI(Partial order planning)
 

En vedette

2017 10-10 (netflix ml platform meetup) learning item and user representation...
2017 10-10 (netflix ml platform meetup) learning item and user representation...2017 10-10 (netflix ml platform meetup) learning item and user representation...
2017 10-10 (netflix ml platform meetup) learning item and user representation...
Ed Chi
 

En vedette (8)

The How and Why of Feature Engineering
The How and Why of Feature EngineeringThe How and Why of Feature Engineering
The How and Why of Feature Engineering
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
 
Horovod - Distributed TensorFlow Made Easy
Horovod - Distributed TensorFlow Made EasyHorovod - Distributed TensorFlow Made Easy
Horovod - Distributed TensorFlow Made Easy
 
Transfer Learning and Fine Tuning for Cross Domain Image Classification with ...
Transfer Learning and Fine Tuning for Cross Domain Image Classification with ...Transfer Learning and Fine Tuning for Cross Domain Image Classification with ...
Transfer Learning and Fine Tuning for Cross Domain Image Classification with ...
 
Lessons from 2MM machine learning models
Lessons from 2MM machine learning modelsLessons from 2MM machine learning models
Lessons from 2MM machine learning models
 
Large-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at FacebookLarge-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at Facebook
 
Parameter Server Approach for Online Learning at Twitter
Parameter Server Approach for Online Learning at TwitterParameter Server Approach for Online Learning at Twitter
Parameter Server Approach for Online Learning at Twitter
 
2017 10-10 (netflix ml platform meetup) learning item and user representation...
2017 10-10 (netflix ml platform meetup) learning item and user representation...2017 10-10 (netflix ml platform meetup) learning item and user representation...
2017 10-10 (netflix ml platform meetup) learning item and user representation...
 

Similaire à Understanding Feature Space in Machine Learning

Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
Infrastructures et recommandations pour les Humanités Numériques - Big Data e...Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
Patrice Bellot - Aix-Marseille Université / CNRS (LIS, INS2I)
 
Peter Norvig - NYC Machine Learning 2013
Peter Norvig - NYC Machine Learning 2013Peter Norvig - NYC Machine Learning 2013
Peter Norvig - NYC Machine Learning 2013
Michael Scovetta
 
DL Classe 0 - You can do it
DL Classe 0 - You can do itDL Classe 0 - You can do it
DL Classe 0 - You can do it
Gregory Renard
 
Edutalk f2013
Edutalk f2013Edutalk f2013
Edutalk f2013
Mel Chua
 
Using binary classifiers
Using binary classifiersUsing binary classifiers
Using binary classifiers
butest
 

Similaire à Understanding Feature Space in Machine Learning (20)

Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle
Understanding Feature Space in Machine Learning - Data Science Pop-up SeattleUnderstanding Feature Space in Machine Learning - Data Science Pop-up Seattle
Understanding Feature Space in Machine Learning - Data Science Pop-up Seattle
 
Maths in the PYP - A Journey through the Arts
Maths in the PYP - A Journey through the ArtsMaths in the PYP - A Journey through the Arts
Maths in the PYP - A Journey through the Arts
 
Introduction to LLMs, Prompt Engineering fundamentals,
Introduction to LLMs, Prompt Engineering fundamentals,Introduction to LLMs, Prompt Engineering fundamentals,
Introduction to LLMs, Prompt Engineering fundamentals,
 
[D2 COMMUNITY] Spark User Group - 머신러닝 인공지능 기법
[D2 COMMUNITY] Spark User Group - 머신러닝 인공지능 기법[D2 COMMUNITY] Spark User Group - 머신러닝 인공지능 기법
[D2 COMMUNITY] Spark User Group - 머신러닝 인공지능 기법
 
CO Quadratic Inequalties.pptx
CO Quadratic Inequalties.pptxCO Quadratic Inequalties.pptx
CO Quadratic Inequalties.pptx
 
Latent dirichlet allocation_and_topic_modeling
Latent dirichlet allocation_and_topic_modelingLatent dirichlet allocation_and_topic_modeling
Latent dirichlet allocation_and_topic_modeling
 
Ml3
Ml3Ml3
Ml3
 
Overview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature EngineeringOverview of Machine Learning and Feature Engineering
Overview of Machine Learning and Feature Engineering
 
Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
Infrastructures et recommandations pour les Humanités Numériques - Big Data e...Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
 
Introduction to Search Systems - ScaleConf Colombia 2017
Introduction to Search Systems - ScaleConf Colombia 2017Introduction to Search Systems - ScaleConf Colombia 2017
Introduction to Search Systems - ScaleConf Colombia 2017
 
CSCE181 Big ideas in NLP
CSCE181 Big ideas in NLPCSCE181 Big ideas in NLP
CSCE181 Big ideas in NLP
 
Peter Norvig - NYC Machine Learning 2013
Peter Norvig - NYC Machine Learning 2013Peter Norvig - NYC Machine Learning 2013
Peter Norvig - NYC Machine Learning 2013
 
syntherella feedback synthesizer
syntherella feedback synthesizersyntherella feedback synthesizer
syntherella feedback synthesizer
 
Deep Learning Class #0 - You Can Do It
Deep Learning Class #0 - You Can Do ItDeep Learning Class #0 - You Can Do It
Deep Learning Class #0 - You Can Do It
 
DL Classe 0 - You can do it
DL Classe 0 - You can do itDL Classe 0 - You can do it
DL Classe 0 - You can do it
 
Word2vec ultimate beginner
Word2vec ultimate beginnerWord2vec ultimate beginner
Word2vec ultimate beginner
 
Edutalk f2013
Edutalk f2013Edutalk f2013
Edutalk f2013
 
Collegeteaching102
Collegeteaching102Collegeteaching102
Collegeteaching102
 
Using binary classifiers
Using binary classifiersUsing binary classifiers
Using binary classifiers
 
Translation to QL Part 1
Translation to QL Part 1Translation to QL Part 1
Translation to QL Part 1
 

Dernier

Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Areesha Ahmad
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Sérgio Sacani
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
ssuser79fe74
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
 

Dernier (20)

GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 o
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
IDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicineIDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicine
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 

Understanding Feature Space in Machine Learning

  • 1. Understanding Feature Space in Machine Learning Alice Zheng, Dato September 9, 2015 1
  • 2. 2 My journey so far Applied machine learning (Data science) Build ML tools Shortage of experts and good tools.
  • 3. 3 Why machine learning? Model data. Make predictions. Build intelligent applications.
  • 4. 4 The machine learning pipeline I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, … Raw data Features Models Predictions Deploy in production
  • 5. Feature = numeric representation of raw data
  • 6. 6 Representing natural text It is a puppy and it is extremely cute. What’s important? Phrases? Specific words? Ordering? Subject, object, verb? Classify: puppy or not? Raw Text {“it”:2, “is”:2, “a”:1, “puppy”:1, “and”:1, “extremely”:1, “cute”:1 } Bag of Words
  • 7. 7 Representing natural text It is a puppy and it is extremely cute. Classify: puppy or not? Raw Text Bag of Words it 2 they 0 I 1 am 0 how 0 puppy 1 and 1 cat 0 aardvark 0 cute 1 extremely 1 … … Sparse vector representation
  • 8. 8 Representing images Image source: “Recognizing and learning object categories,” Li Fei-Fei, Rob Fergus, Anthony Torralba, ICCV 2005—2009. Raw image: millions of RGB triplets, one for each pixel Classify: person or animal? Raw Image Bag of Visual Words
  • 9. 9 Representing images Classify: person or animal? Raw Image Deep learning features 3.29 -15 -5.24 48.3 1.36 47.1 - 1.92 36.5 2.83 95.4 -19 -89 5.09 37.8 Dense vector representation
  • 10. 10 Feature space in machine learning • Raw data  high dimensional vectors • Collection of data points  point cloud in feature space • Model = geometric summary of point cloud • Feature engineering = creating features of the appropriate granularity for the task
  • 11. Crudely speaking, mathematicians fall into two categories: the algebraists, who find it easiest to reduce all problems to sets of numbers and variables, and the geometers, who understand the world through shapes. -- Masha Gessen, “Perfect Rigor”
  • 12. 12 Algebra vs. Geometry a b c a2 + b2 = c2 Algebra Geometry Pythagorean Theorem (Euclidean space)
  • 13. 13 Visualizing a sphere in 2D x2 + y2 = 1 a b c Pythagorean theorem: a2 + b2 = c2 x y 1 1
  • 14. 14 Visualizing a sphere in 3D x2 + y2 + z2 = 1 x y z 1 1 1
  • 15. 15 Visualizing a sphere in 4D x2 + y2 + z2 + t2 = 1 x y z 1 1 1
  • 16. 16 Why are we looking at spheres? = = = = Poincaré Conjecture: All physical objects without holes is “equivalent” to a sphere.
  • 17. 17 The power of higher dimensions • A sphere in 4D can model the birth and death process of physical objects • Point clouds = approximate geometric shapes • High dimensional features can model many things
  • 19. 19 The challenge of high dimension geometry • Feature space can have hundreds to millions of dimensions • In high dimensions, our geometric imagination is limited - Algebra comes to our aid
  • 20. 20 Visualizing bag-of-words puppy cute 1 1 I have a puppy and it is extremely cute I have a puppy and it is extremely cute it 1 they 0 I 1 am 0 how 0 puppy 1 and 1 cat 0 aardvark 0 zebra 0 cute 1 extremely 1 … …
  • 21. 21 Visualizing bag-of-words puppy cute 1 1 1 extremely I have a puppy and it is extremely cute I have an extremely cute cat I have a cute puppy
  • 23. 23 What is a model? • Model = mathematical “summary” of data • What’s a summary? - A geometric shape
  • 24. 24 Classification model Feature 2 Feature 1 Decide between two classes
  • 25. 25 Clustering model Feature 2 Feature 1 Group data points tightly
  • 28. 28 When does bag-of-words fail? puppy cat 2 1 1 have I have a puppy I have a cat I have a kitten Task: find a surface that separates documents about dogs vs. cats Problem: the word “have” adds fluff instead of information I have a dog and I have a pen 1
  • 29. 29 Improving on bag-of-words • Idea: “normalize” word counts so that popular words are discounted • Term frequency (tf) = Number of times a terms appears in a document • Inverse document frequency of word (idf) = • N = total number of documents • Tf-idf count = tf x idf
  • 30. 30 From BOW to tf-idf puppy cat 2 1 1 have I have a puppy I have a cat I have a kitten idf(puppy) = log 4 idf(cat) = log 4 idf(have) = log 1 = 0 I have a dog and I have a pen 1
  • 31. 31 From BOW to tf-idf puppy cat1 have tfidf(puppy) = log 4 tfidf(cat) = log 4 tfidf(have) = 0 I have a dog and I have a pen, I have a kitten 1 log 4 log 4 I have a cat I have a puppy Decision surface Tf-idf flattens uninformative dimensions in the BOW point cloud
  • 32. 32 Entry points of feature engineering • Start from data and task - What’s the best text representation for classification? • Start from modeling method - What kind of features does k-means assume? - What does linear regression assume about the data?
  • 33. 33 That’s not all, folks! • There’s a lot more to feature engineering: - Feature normalization - Feature transformations - “Regularizing” models - Learning the right features • Dato is hiring! jobs@dato.com alicez@dato.com @RainyData

Notes de l'éditeur

  1. Features sit between raw data and model. They can make or break an application.