Data-driven modeling
                                           APAM E4990


                                           Jake Hofman

                                          Columbia University


                                         January 30, 2012




Jake Hofman   (Columbia University)        Data-driven modeling   January 30, 2012   1 / 23
Outline



      1 Digit recognition



      2 Image classification



      3 Acquiring image data




Digit recognition

          Classification is a supervised learning task in which we aim to
             predict the correct label for an example given its features




                                               ↓
                                  0 5 4 1 4 9
              e.g. determine which digit {0, 1, ..., 9} is depicted in each
                                         image
Digit recognition

          Determine which digit {0, 1, ..., 9} is depicted in each image
Images as arrays

               Grayscale images ↔ 2-d arrays of M × N pixel intensities




          Represent each image as a “vector of pixels”, flattening the 2-d
                          array of pixels to a 1-d vector
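
A minimal sketch of the flattening step, assuming NumPy and a made-up 3 × 4 array standing in for a real grayscale image:

```python
import numpy as np

# A made-up 3 x 4 "grayscale image": each entry is a pixel intensity.
image = np.arange(12, dtype=float).reshape(3, 4)

# Flatten row by row into a vector of M * N = 12 pixels.
pixels = image.reshape(-1)

# Stack several flattened images into an (n_images, M * N) feature matrix.
images = [image, image + 1.0]
X = np.array([img.reshape(-1) for img in images])

print(pixels.shape, X.shape)
```

Each row of `X` is then one example, and each column is the intensity of one fixed pixel position.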

k-nearest neighbors classification
         k-nearest neighbors: memorize training examples, predict labels
                   using labels of the k closest training points




                          Intuition: nearby points have similar labels
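
The idea can be sketched in a few lines of NumPy. The slides do not say which distance is used, so Euclidean distance and a simple majority vote are assumptions here, and `knn_predict` is a hypothetical helper name:

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every memorized training example (assumed metric).
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k closest training points.
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy training set: two points near the origin (label 0), two near (1, 1) (label 1).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))
```

There is no training step beyond storing `X_train` and `y_train`; all the work happens at prediction time.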

k-nearest neighbors classification




              Small k gives a complex boundary; large k results in coarse
                                     averaging


k-nearest neighbors classification




                Evaluate performance on a held-out test set to assess
                                generalization error
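
As a rough illustration of held-out evaluation, the sketch below splits synthetic two-class data into training and test sets and reports test accuracy for several values of k. The data, the 70/30 split, and the Euclidean metric are assumptions for illustration, not the lecture's setup:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Majority-vote k-NN for integer labels, using squared Euclidean distance."""
    preds = []
    for x in X_test:
        dists = ((X_train - x) ** 2).sum(axis=1)
        nearest = np.argsort(dists)[:k]
        preds.append(np.bincount(y_train[nearest]).argmax())
    return np.array(preds)

# Synthetic two-class data (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(4.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Shuffle, then hold out 30% of the examples as a test set.
order = rng.permutation(len(X))
split = int(0.7 * len(X))
train, test = order[:split], order[split:]

accuracy = {}
for k in (1, 3, 5, 7, 9):
    preds = knn_predict(X[train], y[train], X[test], k)
    accuracy[k] = float((preds == y[test]).mean())
    print("k = %d, accuracy = %.4f" % (k, accuracy[k]))
```

The test examples are never used when predicting, so the reported accuracy estimates how the classifier behaves on unseen data.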


Digit recognition


       Simple digit classifier with k=1 nearest neighbors
          ./classify_digits.py
Image classification


                      Determine if an image is a landscape or headshot




                                            ↓                 ↓
                                       ’landscape’        ’headshot’

              Represent each image with a binned RGB intensity histogram




Images as arrays
          Color images ↔ 3-d arrays of M × N × 3 RGB pixel intensities




       import matplotlib.image as mpimg
       I = mpimg.imread('chairs.jpg')
Intensity histograms

        Disregard all spatial information, simply count pixels by intensities
               (e.g. lots of pixels with bright green and dark blue)
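
One possible reading of a "binned RGB intensity histogram" is a separate histogram per color channel, concatenated into a single feature vector. The sketch below takes that reading as an assumption (a joint 3-d histogram over RGB triples would be another), and also assumes 8-bit intensities in 0..255. `rgb_histogram` is a hypothetical helper name:

```python
import numpy as np

def rgb_histogram(image, bins=16):
    """Concatenate per-channel intensity histograms into one feature vector."""
    features = []
    for channel in range(3):
        # Count pixels per intensity bin, ignoring where they occur in the image.
        counts, _ = np.histogram(image[:, :, channel], bins=bins, range=(0, 256))
        features.append(counts)
    hist = np.concatenate(features).astype(float)
    # Normalize so that images of different sizes are comparable.
    return hist / hist.sum()

# Synthetic 4 x 5 RGB image with 8-bit intensities (illustrative only).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 5, 3))

h = rgb_histogram(image, bins=16)
print(h.shape)
```

With 16 bins per channel every image, whatever its size, is summarized by a vector of 48 numbers.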




Intensity histograms

                                How many bins for pixel intensities?




         Too many bins give a noisy, overly complex representation of
                                  the data,
           while too few bins result in an overly simple one
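
To get a feel for the trade-off, this sketch histograms one synthetic 32 × 32 channel (an assumption, not lecture data) with 2, 16, and 256 bins and prints how many pixels land in each bin on average:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 32 x 32 single-channel image with 8-bit intensities (illustrative only).
channel = rng.integers(0, 256, size=(32, 32))

for bins in (2, 16, 256):
    counts, _ = np.histogram(channel, bins=bins, range=(0, 256))
    empty = int((counts == 0).sum())
    print("bins = %3d, mean pixels per bin = %6.1f, empty bins = %d"
          % (bins, counts.mean(), empty))
```

With 1024 pixels, 256 bins leave only about 4 pixels per bin, so individual counts are dominated by noise, while 2 bins can only say whether the image is mostly dark or mostly bright.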


Image classification

       Classify
           ./classify_flickr.py 16 9 flickr_headshot flickr_landscape




       Change in performance on test set with number of neighbors
       k = 1, accuracy = 0.7125
       k = 3, accuracy = 0.7425
       k = 5, accuracy = 0.7725
       k = 7, accuracy = 0.7650
       k = 9, accuracy = 0.7500
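
Reading off the best setting from these numbers is a one-liner. The accuracies below are copied from the slide:

```python
# Test-set accuracies copied from the slide, keyed by number of neighbors k.
accuracy = {1: 0.7125, 3: 0.7425, 5: 0.7725, 7: 0.7650, 9: 0.7500}

# Pick the k with the highest accuracy.
best_k = max(accuracy, key=accuracy.get)
print("best k = %d, accuracy = %.4f" % (best_k, accuracy[best_k]))
```

Note that choosing k by test-set accuracy makes that accuracy an optimistic estimate. A common remedy is to select k on a separate validation set and report test accuracy only once, for the chosen k.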


Simple screen scraping




       One-liner to download ESL digit data
          wget -Nr --level=1 --no-parent http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/zip.digits




Simple screen scraping


       One-liner to scrape images from a webpage
          wget -O - http://bit.ly/zxy0jN |
            tr ''\''"=' '\n' |
            egrep '^http.*(png|jpg|gif)' |
            xargs wget



              • get page source
              • translate quotes and = to newlines
              • match urls with image extensions
              • download qualifying images
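
The four steps above can also be sketched in Python using only the standard library. The function names are hypothetical, and the regular expression mirrors the `egrep` pattern rather than being a robust HTML parser:

```python
import os
import re
from urllib.parse import urlparse
from urllib.request import urlopen, urlretrieve

IMAGE_URL = re.compile(r"^http.*(png|jpg|gif)")

def extract_image_urls(html):
    """Return the image URLs in a page's source, mirroring the tr and egrep steps."""
    # Turn quotes and '=' into newlines so each URL starts its own line.
    lines = re.sub(r"['\"=]", "\n", html).splitlines()
    # Keep lines that start with http and contain an image extension.
    return [line for line in lines if IMAGE_URL.search(line)]

def scrape_images(page_url):
    """Fetch a page and download every image it links to (needs network access)."""
    html = urlopen(page_url).read().decode("utf-8", errors="replace")
    for url in extract_image_urls(html):
        # Save each image under the last component of its URL path.
        urlretrieve(url, os.path.basename(urlparse(url).path))

# Offline check on a small HTML snippet.
html = '<img src="http://example.com/a.png"><a href=\'http://example.com/b.html\'>x</a>'
print(extract_image_urls(html))
```

Like the shell one-liner, this only finds absolute URLs that appear inside quotes or after `=`. Relative image paths are missed.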



“cat flickr xargs wget”?




Flickr API




YQL: SELECT * FROM Internet¹

                                   http://developer.yahoo.com/yql

              ¹ http://oreillynet.com/pub/e/1369
YQL: Console




                       http://developer.yahoo.com/yql/console


YQL + Python
       Python function for public YQL queries
        import json
        from urllib.parse import urlencode    # Python 3 locations of
        from urllib.request import urlopen    # urlencode and urlopen

        # Public YQL endpoint. Not shown on the slide, so this value is an
        # assumption; Yahoo has since retired the YQL service.
        YQL_PUBLIC = 'http://query.yahooapis.com/v1/public/yql'

        def yql_public(query, env=False):
            # build dictionary of GET parameters
            params = {'q': query, 'format': 'json'}
            if env:
                params['env'] = env

            # escape query
            query_str = urlencode(params)

            # fetch results
            url = '%s?%s' % (YQL_PUBLIC, query_str)
            result = urlopen(url)

            # parse json and return
            return json.load(result)['query']['results']
YQL + Python + Flickr


       Fetch info for “interestingness” photos
          ./simpleyql.py 'select * from flickr.photos.interestingness(20) where api_key="..."'



       Download thumbnails for photos tagged with “vivid”
          ./download_flickr.py vivid 500 <api_key>
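
The source of `download_flickr.py` is not shown on the slides. As a rough sketch of the same task, the code below builds a tag search against the Flickr REST API directly (not through YQL) and pulls thumbnail URLs out of the JSON response. The helper names and the sample response are hypothetical. `flickr.photos.search`, `extras=url_t`, and `nojsoncallback=1` are taken from the Flickr API as I understand it and should be checked against its documentation:

```python
import json
from urllib.parse import urlencode

FLICKR_REST = "https://api.flickr.com/services/rest/"

def search_url(tag, count, api_key):
    """Build a flickr.photos.search request for photos with the given tag."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "tags": tag,
        "per_page": count,
        "extras": "url_t",       # ask for the thumbnail URL of each photo
        "format": "json",
        "nojsoncallback": 1,     # plain JSON instead of a JSONP wrapper
    }
    return FLICKR_REST + "?" + urlencode(params)

def thumbnail_urls(response_text):
    """Extract thumbnail URLs from the JSON text of a search response."""
    photos = json.loads(response_text)["photos"]["photo"]
    return [p["url_t"] for p in photos if "url_t" in p]

# Made-up response with the expected shape, for an offline check.
sample = ('{"photos": {"photo": [{"id": "1", "url_t": "https://example.com/1_t.jpg"},'
          ' {"id": "2"}]}, "stat": "ok"}')

print(search_url("vivid", 500, "<api_key>"))
print(thumbnail_urls(sample))
```

Fetching the search URL and then downloading each thumbnail would need a real API key and network access, so those steps are left out here.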




