SlideShare une entreprise Scribd logo
1  sur  79
Télécharger pour lire hors ligne
Data-driven modeling
                                           APAM E4990


                                           Jake Hofman

                                          Columbia University


                                         January 23, 2012




Jake Hofman   (Columbia University)        Data-driven modeling   January 23, 2012   1 / 19
Data-dependent products



              • Effective/practical systems that learn from experience impact
                our daily lives, e.g.:
                    •   Recommendation systems
                    •   Spam detection
                    •   Optical character recognition
                    •   Face recognition
                    •   Fraud detection
                    •   Machine translation
                    •   ...




Jake Hofman    (Columbia University)       Data-driven modeling   January 23, 2012   2 / 19
Learning by example




Jake Hofman   (Columbia University)   Data-driven modeling   January 23, 2012   3 / 19
Learning by example




              • How did you solve this problem?
              • Can you make this process explicit (e.g. write code to do so)?


Jake Hofman    (Columbia University)   Data-driven modeling        January 23, 2012   3 / 19
Learning by example




              • We learn quickly from few, relatively unstructured examples ...
                but we don’t understand how we accomplish this
              • We’d like to develop algorithms that enable machines to learn
                by example from large data sets

Jake Hofman    (Columbia University)   Data-driven modeling         January 23, 2012   4 / 19
Got data?
              • Web service APIs expose lots of data




Jake Hofman    (Columbia University)   Data-driven modeling   January 23, 2012   5 / 19
Got data?


              • Many free, public data sets available online




Jake Hofman    (Columbia University)   Data-driven modeling    January 23, 2012   6 / 19
Black-boxified?




Jake Hofman   (Columbia University)   Data-driven modeling   January 23, 2012   7 / 19
Black-boxified?




Jake Hofman   (Columbia University)   Data-driven modeling   January 23, 2012   7 / 19
Roadmap?




                                      Step 1: Have data
                                      Step 2: ???
                                      Step 3: Profit




Jake Hofman   (Columbia University)      Data-driven modeling   January 23, 2012   8 / 19
Roadmap, take two


              1    Get data




Jake Hofman       (Columbia University)   Data-driven modeling   January 23, 2012   9 / 19
Roadmap, take two


              1    Get data
              2    Visualize/perform sanity checks
              3    Clean/filter observations
              4    Choose features to represent data




Jake Hofman       (Columbia University)   Data-driven modeling   January 23, 2012   9 / 19
Roadmap, take two


              1    Get data
              2    Visualize/perform sanity checks
              3    Clean/filter observations
              4    Choose features to represent data
              5    Specify model
              6    Specify loss function




Jake Hofman       (Columbia University)    Data-driven modeling   January 23, 2012   9 / 19
Roadmap, take two


              1    Get data
              2    Visualize/perform sanity checks
              3    Clean/filter observations
              4    Choose features to represent data
              5    Specify model
              6    Specify loss function
              7    Develop algorithm to minimize loss




Jake Hofman       (Columbia University)    Data-driven modeling   January 23, 2012   9 / 19
Roadmap, take two


              1    Get data
              2    Visualize/perform sanity checks
              3    Clean/filter observations
              4    Choose features to represent data
              5    Specify model
              6    Specify loss function
              7    Develop algorithm to minimize loss
              8    Choose performance measure
              9    “Train” to minimize loss
          10       “Test” to evaluate generalization



Jake Hofman       (Columbia University)    Data-driven modeling   January 23, 2012   9 / 19
Topics


        • Supervised                                    • Unsupervised
            • k-nearest neighbors                           • K-means
            • Naive Bayes                                   • Mixture models
            • Linear regression                             • Principal components
            • Logistic regression                             analysis
            • Support vector machines                       • Topic models
            • Collaborative filtering
            • Matrix factorization


              • Data representation: feature space, selection, normalization
              • Model assessment: complexity control, cross-validation, ROC
                curve, Bayesian Occam’s razor
              • Large-scale learning


Jake Hofman    (Columbia University)   Data-driven modeling              January 23, 2012   10 / 19
Everything old is new again1


              • Many fields ...
                    • Statistics
                    • Pattern recognition
                    • Data mining
                    • Machine learning
              • ... similar goals
                    • Extract and recognize patterns in data
                    • Interpret or explain observations
                    • Test validity of hypotheses
                    • Efficiently search the space of hypotheses
                    • Design efficient algorithms enabling machines to learn from
                      data




              1
                  http://cbcl.mit.edu/publications/theses/thesis-rifkin.pdf
Jake Hofman       (Columbia University)   Data-driven modeling        January 23, 2012   11 / 19
Statistics vs. machine learning2


                                                  Glossary

                          Machine learning                    Statistics


                          network, graphs                     model


                          weights                             parameters


                          learning                            fitting


                          generalization                      test set performance


                          supervised learning                 regression/classification


                          unsupervised learning               density estimation, clustering


                          large grant = $1,000,000            large grant= $50,000


                          nice place to have a meeting:       nice place to have a meeting:
                          Snowbird, Utah, French Alps         Las Vegas in August




                                                          1
         2
            http:
        //anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/
Jake Hofman (Columbia University) Data-driven modeling        January 23, 2012                 12 / 19
Philosophy

              • We would like models that:
                 • Provide predictive and explanatory power
                 • Are complex enough to describe observed phenomena
                 • Are simple enough to generalize to future observations




Jake Hofman    (Columbia University)   Data-driven modeling          January 23, 2012   13 / 19
Example: Netflix Prize




Jake Hofman   (Columbia University)   Data-driven modeling   January 23, 2012   14 / 19
Example: Netflix Prize




Jake Hofman   (Columbia University)   Data-driven modeling   January 23, 2012   15 / 19
Shipping = Feature




Jake Hofman   (Columbia University)   Data-driven modeling   January 23, 2012   16 / 19
References




Jake Hofman   (Columbia University)   Data-driven modeling   January 23, 2012   17 / 19
Disclaimer


       You may be bored if you already know how to ...
              • Acquire data from APIs
              • Clean/explore/visualize data
              • Classify and cluster various types of data (e.g., images, text)
              • Code in Python, R, SciPy/NumPy, etc.
              • Scale solutions to large data sets (e.g. Hadoop, SGD)
              • Script with unix tools on the command line, e.g.

                $ sed -e ’s/<[^>]*>//g’ < page.html > page.txt




Jake Hofman    (Columbia University)   Data-driven modeling         January 23, 2012   18 / 19
Themes




                                      Data jeopardy


         Regardless of scale, it’s difficult to find the right questions to ask
                                    of the data




Jake Hofman   (Columbia University)    Data-driven modeling      January 23, 2012   19 / 19
Themes




                                      Data hacking


        Cleaning and normalizing data is a substantial amount of the work
                        (and likely impacts results)




Jake Hofman   (Columbia University)    Data-driven modeling   January 23, 2012   19 / 19
Themes




                                      Data hacking


               The ability to iterate quickly, asking and answering many
                                 questions, is crucial




Jake Hofman   (Columbia University)    Data-driven modeling      January 23, 2012   19 / 19
Themes




                                      Data hacking


                  Hacks happen: sed/awk/grep are useful, and scale




Jake Hofman   (Columbia University)    Data-driven modeling     January 23, 2012   19 / 19
Themes




                                      “Data science”


              Simple methods (e.g., linear models) work surprisingly well,
                            especially with lots of data




Jake Hofman   (Columbia University)    Data-driven modeling       January 23, 2012   19 / 19
Themes




                                      “Data science”


              It’s easy to cover your tracks—things are often much more
                             complicated than they appear




Jake Hofman   (Columbia University)    Data-driven modeling    January 23, 2012   19 / 19
Predicting consumer activity with Web search
with Sharad Goel, S´bastien Lahaie, David Pennock, Duncan Watts
                   e
                                                  "Right Round"
                                                                                        c

                                10



                                20
                         Rank




                                30


                                              Billboard
                                40            Search

                                     Mar−09   Apr−09 May−09 Jun−09             Jul−09   Aug−09
                                                          Week




Jake Hofman   (Yahoo! Research)                Learning from Online Activity                 November 30, 2011   6 / 71
Search predictions
Motivation



                                                                             "Right Round"
                                                                                                            c

                                                           10



   Does collective search activity                         20




                                                    Rank
   provide useful predictive signal
   about real-world outcomes?                              30


                                                                         Billboard
                                                           40            Search

                                                                Mar−09   Apr−09 May−09 Jun−09      Jul−09   Aug−09
                                                                                     Week




Jake Hofman   (Yahoo! Research)   Learning from Online Activity                             November 30, 2011   7 / 71
Search predictions
Motivation




        Past work mainly focuses on predicting the present1 and ignores
              baseline models trained on publicly available data

                                    8                                                                Actual
                                    7                                                                Search
              Flu Level (Percent)




                                    6                                                                Autoregressive
                                    5
                                    4
                                    3
                                    2
                                    1

                                        2004        2005      2006            2007         2008   2009           2010
                                                                       Date




          1
              Varian, 2009
Jake Hofman                     (Yahoo! Research)          Learning from Online Activity             November 30, 2011   8 / 71
Search predictions
Motivation




                            We predict future sales for movies, video games, and music

                                   "Transformers 2"                                    "Tom Clancy's HAWX"                                          "Right Round"
                                                            a                                                         b                                                              c

                                                                                                                                  10
      Search Volume




                                                                 Search Volume


                                                                                                                                  20




                                                                                                                           Rank
                                                                                                                                  30


                                                                                                                                                Billboard
                                                                                                                                  40            Search

                      −30    −20     −10   0    10     20   30                   −30   −20   −10   0     10      20   30               Mar−09   Apr−09 May−09      Jun−09   Jul−09   Aug−09
                              Time to Release (Days)                                    Time to Release (Days)                                              Week




Jake Hofman                  (Yahoo! Research)                                   Learning from Online Activity                                           November 30, 2011                    9 / 71
Search predictions
Search models




      For movies and video games, predict opening weekend box office
      and first month sales, respectively:

                           log(revenue) = β0 + β1 log(search) + 

      For music, predict following week’s Billboard Hot 100 rank:

                   billboardt+1 = β0 + β1 searcht + β2 searcht−1 + 




Jake Hofman   (Yahoo! Research)      Learning from Online Activity   November 30, 2011   10 / 71
Search predictions
Search volume




Jake Hofman   (Yahoo! Research)   Learning from Online Activity   November 30, 2011   11 / 71
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01
Data-driven modeling: Lecture 01

Contenu connexe

Tendances

The Importance of Data Visualization
The Importance of Data VisualizationThe Importance of Data Visualization
The Importance of Data VisualizationCenterline Digital
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingankur bhalla
 
Tda presentation
Tda presentationTda presentation
Tda presentationHJ van Veen
 
Unit4: Knowledge Representation
Unit4: Knowledge RepresentationUnit4: Knowledge Representation
Unit4: Knowledge RepresentationTekendra Nath Yogi
 
DATA VISUALIZATION.pptx
DATA VISUALIZATION.pptxDATA VISUALIZATION.pptx
DATA VISUALIZATION.pptxPraneethBhai1
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kambererror007
 
Data Visualization - What can you see? #baai17
Data Visualization - What can you see? #baai17Data Visualization - What can you see? #baai17
Data Visualization - What can you see? #baai17Eugene O'Loughlin
 
Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPTANUSUYA T K
 
Using SHAP to Understand Black Box Models
Using SHAP to Understand Black Box ModelsUsing SHAP to Understand Black Box Models
Using SHAP to Understand Black Box ModelsJonathan Bechtel
 
What is cluster analysis
What is cluster analysisWhat is cluster analysis
What is cluster analysisPrabhat gangwar
 
Decision Trees
Decision TreesDecision Trees
Decision TreesStudent
 

Tendances (20)

The Importance of Data Visualization
The Importance of Data VisualizationThe Importance of Data Visualization
The Importance of Data Visualization
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Tda presentation
Tda presentationTda presentation
Tda presentation
 
Data Visualization
Data VisualizationData Visualization
Data Visualization
 
Data analytics vs. Data analysis
Data analytics vs. Data analysisData analytics vs. Data analysis
Data analytics vs. Data analysis
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Unit4: Knowledge Representation
Unit4: Knowledge RepresentationUnit4: Knowledge Representation
Unit4: Knowledge Representation
 
DATA VISUALIZATION.pptx
DATA VISUALIZATION.pptxDATA VISUALIZATION.pptx
DATA VISUALIZATION.pptx
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
 
Data Visualization - What can you see? #baai17
Data Visualization - What can you see? #baai17Data Visualization - What can you see? #baai17
Data Visualization - What can you see? #baai17
 
Data preprocessing PPT
Data preprocessing PPTData preprocessing PPT
Data preprocessing PPT
 
Using SHAP to Understand Black Box Models
Using SHAP to Understand Black Box ModelsUsing SHAP to Understand Black Box Models
Using SHAP to Understand Black Box Models
 
Clustering
ClusteringClustering
Clustering
 
What is cluster analysis
What is cluster analysisWhat is cluster analysis
What is cluster analysis
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 

En vedette

Paris e suas igrejas
Paris e suas igrejasParis e suas igrejas
Paris e suas igrejasfilipj2000
 
Summarizing ppoint
Summarizing ppointSummarizing ppoint
Summarizing ppointSusan Isbell
 
Lb oso polar-polar bear
Lb oso polar-polar bearLb oso polar-polar bear
Lb oso polar-polar bearfilipj2000
 
Jim sterne terametric_twitter-webinar_120910
Jim sterne terametric_twitter-webinar_120910Jim sterne terametric_twitter-webinar_120910
Jim sterne terametric_twitter-webinar_120910Terametric
 
Defrag: Applying Twitter Analytics in Real Time
Defrag: Applying Twitter Analytics in Real TimeDefrag: Applying Twitter Analytics in Real Time
Defrag: Applying Twitter Analytics in Real TimeTerametric
 
Hieu qua san xuat bap lai tren dat lua Dong bang song Cuu Long-TS. Ho Cao Viet
Hieu qua san  xuat bap lai tren dat lua Dong bang song Cuu Long-TS. Ho Cao VietHieu qua san  xuat bap lai tren dat lua Dong bang song Cuu Long-TS. Ho Cao Viet
Hieu qua san xuat bap lai tren dat lua Dong bang song Cuu Long-TS. Ho Cao VietHo Cao Viet
 
Visualizing geolocated Internet measurements
Visualizing geolocated Internet measurementsVisualizing geolocated Internet measurements
Visualizing geolocated Internet measurementsClaudio Squarcella
 
7강 기업교육론 20110413
7강 기업교육론 201104137강 기업교육론 20110413
7강 기업교육론 20110413조현경
 
Stroke symposiuma tpaper91411
Stroke symposiuma tpaper91411Stroke symposiuma tpaper91411
Stroke symposiuma tpaper91411Pat Maher
 
10강 기업교육론 20110504
10강 기업교육론 2011050410강 기업교육론 20110504
10강 기업교육론 20110504조현경
 
기업교육론 6장 학생발표자료
기업교육론 6장 학생발표자료기업교육론 6장 학생발표자료
기업교육론 6장 학생발표자료조현경
 
Learning from Web Activity
Learning from Web ActivityLearning from Web Activity
Learning from Web Activityjakehofman
 
Maherprofessional c vbio216
Maherprofessional c vbio216Maherprofessional c vbio216
Maherprofessional c vbio216Pat Maher
 

En vedette (20)

Creativity
CreativityCreativity
Creativity
 
移动产品界面适配设计
移动产品界面适配设计移动产品界面适配设计
移动产品界面适配设计
 
Niver helen
Niver helenNiver helen
Niver helen
 
Maine prevention savings 11 22-10
Maine prevention savings 11 22-10Maine prevention savings 11 22-10
Maine prevention savings 11 22-10
 
Paris e suas igrejas
Paris e suas igrejasParis e suas igrejas
Paris e suas igrejas
 
Tecnicas
TecnicasTecnicas
Tecnicas
 
Baile
Baile Baile
Baile
 
Summarizing ppoint
Summarizing ppointSummarizing ppoint
Summarizing ppoint
 
Lb oso polar-polar bear
Lb oso polar-polar bearLb oso polar-polar bear
Lb oso polar-polar bear
 
Jim sterne terametric_twitter-webinar_120910
Jim sterne terametric_twitter-webinar_120910Jim sterne terametric_twitter-webinar_120910
Jim sterne terametric_twitter-webinar_120910
 
Defrag: Applying Twitter Analytics in Real Time
Defrag: Applying Twitter Analytics in Real TimeDefrag: Applying Twitter Analytics in Real Time
Defrag: Applying Twitter Analytics in Real Time
 
Hieu qua san xuat bap lai tren dat lua Dong bang song Cuu Long-TS. Ho Cao Viet
Hieu qua san  xuat bap lai tren dat lua Dong bang song Cuu Long-TS. Ho Cao VietHieu qua san  xuat bap lai tren dat lua Dong bang song Cuu Long-TS. Ho Cao Viet
Hieu qua san xuat bap lai tren dat lua Dong bang song Cuu Long-TS. Ho Cao Viet
 
Visualizing geolocated Internet measurements
Visualizing geolocated Internet measurementsVisualizing geolocated Internet measurements
Visualizing geolocated Internet measurements
 
7강 기업교육론 20110413
7강 기업교육론 201104137강 기업교육론 20110413
7강 기업교육론 20110413
 
Stroke symposiuma tpaper91411
Stroke symposiuma tpaper91411Stroke symposiuma tpaper91411
Stroke symposiuma tpaper91411
 
10강 기업교육론 20110504
10강 기업교육론 2011050410강 기업교육론 20110504
10강 기업교육론 20110504
 
Spectra-Profile
Spectra-ProfileSpectra-Profile
Spectra-Profile
 
기업교육론 6장 학생발표자료
기업교육론 6장 학생발표자료기업교육론 6장 학생발표자료
기업교육론 6장 학생발표자료
 
Learning from Web Activity
Learning from Web ActivityLearning from Web Activity
Learning from Web Activity
 
Maherprofessional c vbio216
Maherprofessional c vbio216Maherprofessional c vbio216
Maherprofessional c vbio216
 

Similaire à Data-driven modeling: Lecture 01

Data-driven modeling: Lecture 02
Data-driven modeling: Lecture 02Data-driven modeling: Lecture 02
Data-driven modeling: Lecture 02jakehofman
 
Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09jakehofman
 
Computational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to CountingComputational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to Countingjakehofman
 
Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10jakehofman
 
Machine Learning: Learning with data
Machine Learning: Learning with dataMachine Learning: Learning with data
Machine Learning: Learning with dataONE Talks
 
One talk Machine Learning
One talk Machine LearningOne talk Machine Learning
One talk Machine LearningONE Talks
 
DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLDutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLBigML, Inc
 
A data science observatory based on RAMP - rapid analytics and model prototyping
A data science observatory based on RAMP - rapid analytics and model prototypingA data science observatory based on RAMP - rapid analytics and model prototyping
A data science observatory based on RAMP - rapid analytics and model prototypingAkin Osman Kazakci
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine LearningLarge Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learningjaumebp
 
MACHINE LEARNING LIFE CYCLE
MACHINE LEARNING LIFE CYCLEMACHINE LEARNING LIFE CYCLE
MACHINE LEARNING LIFE CYCLEBhimsen Joshi
 
Dm sei-tutorial-v7
Dm sei-tutorial-v7Dm sei-tutorial-v7
Dm sei-tutorial-v7CS, NcState
 
Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...Robert Williams
 
Machine-Learned Ranking using Distributed Parallel Genetic Programming
Machine-Learned Ranking using Distributed Parallel Genetic ProgrammingMachine-Learned Ranking using Distributed Parallel Genetic Programming
Machine-Learned Ranking using Distributed Parallel Genetic Programmingrusho1234
 
Connections b/w active learning and model extraction
Connections b/w active learning and model extractionConnections b/w active learning and model extraction
Connections b/w active learning and model extractionAnmol Dwivedi
 
Algorithmic Fairness: A Brief Introduction
Algorithmic Fairness: A Brief IntroductionAlgorithmic Fairness: A Brief Introduction
Algorithmic Fairness: A Brief IntroductionAnthonyMelson
 

Similaire à Data-driven modeling: Lecture 01 (20)

Data-driven modeling: Lecture 02
Data-driven modeling: Lecture 02Data-driven modeling: Lecture 02
Data-driven modeling: Lecture 02
 
Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09Data-driven modeling: Lecture 09
Data-driven modeling: Lecture 09
 
Computational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to CountingComputational Social Science, Lecture 02: An Introduction to Counting
Computational Social Science, Lecture 02: An Introduction to Counting
 
Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10Data-driven modeling: Lecture 10
Data-driven modeling: Lecture 10
 
Machine Learning: Learning with data
Machine Learning: Learning with dataMachine Learning: Learning with data
Machine Learning: Learning with data
 
One talk Machine Learning
One talk Machine LearningOne talk Machine Learning
One talk Machine Learning
 
DutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in MLDutchMLSchool 2022 - History and Developments in ML
DutchMLSchool 2022 - History and Developments in ML
 
A data science observatory based on RAMP - rapid analytics and model prototyping
A data science observatory based on RAMP - rapid analytics and model prototypingA data science observatory based on RAMP - rapid analytics and model prototyping
A data science observatory based on RAMP - rapid analytics and model prototyping
 
10 best practices in operational analytics
10 best practices in operational analytics 10 best practices in operational analytics
10 best practices in operational analytics
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine LearningLarge Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learning
 
Lecture-2 Applied ML .pptx
Lecture-2 Applied ML .pptxLecture-2 Applied ML .pptx
Lecture-2 Applied ML .pptx
 
MACHINE LEARNING LIFE CYCLE
MACHINE LEARNING LIFE CYCLEMACHINE LEARNING LIFE CYCLE
MACHINE LEARNING LIFE CYCLE
 
Dm sei-tutorial-v7
Dm sei-tutorial-v7Dm sei-tutorial-v7
Dm sei-tutorial-v7
 
Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...
 
Machine-Learned Ranking using Distributed Parallel Genetic Programming
Machine-Learned Ranking using Distributed Parallel Genetic ProgrammingMachine-Learned Ranking using Distributed Parallel Genetic Programming
Machine-Learned Ranking using Distributed Parallel Genetic Programming
 
Connections b/w active learning and model extraction
Connections b/w active learning and model extractionConnections b/w active learning and model extraction
Connections b/w active learning and model extraction
 
Algorithmic Fairness: A Brief Introduction
Algorithmic Fairness: A Brief IntroductionAlgorithmic Fairness: A Brief Introduction
Algorithmic Fairness: A Brief Introduction
 
ML.ppt
ML.pptML.ppt
ML.ppt
 
ML.ppt
ML.pptML.ppt
ML.ppt
 
ML.ppt
ML.pptML.ppt
ML.ppt
 

Plus de jakehofman

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2jakehofman
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1jakehofman
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networksjakehofman
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classificationjakehofman
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationjakehofman
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1jakehofman
 
Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at Scalejakehofman
 
Modeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in RModeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in Rjakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Countingjakehofman
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overviewjakehofman
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systemsjakehofman
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayesjakehofman
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scalejakehofman
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Countingjakehofman
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studiesjakehofman
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Sciencejakehofman
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classificationjakehofman
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regressionjakehofman
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experimentsjakehofman
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wranglingjakehofman
 

Plus de jakehofman (20)

Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
Modeling Social Data, Lecture 12: Causality & Experiments, Part 2
 
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
Modeling Social Data, Lecture 11: Causality and Experiments, Part 1
 
Modeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: NetworksModeling Social Data, Lecture 10: Networks
Modeling Social Data, Lecture 10: Networks
 
Modeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: ClassificationModeling Social Data, Lecture 8: Classification
Modeling Social Data, Lecture 8: Classification
 
Modeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalizationModeling Social Data, Lecture 7: Model complexity and generalization
Modeling Social Data, Lecture 7: Model complexity and generalization
 
Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1Modeling Social Data, Lecture 6: Regression, Part 1
Modeling Social Data, Lecture 6: Regression, Part 1
 
Modeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at ScaleModeling Social Data, Lecture 4: Counting at Scale
Modeling Social Data, Lecture 4: Counting at Scale
 
Modeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in RModeling Social Data, Lecture 3: Data manipulation in R
Modeling Social Data, Lecture 3: Data manipulation in R
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: OverviewModeling Social Data, Lecture 1: Overview
Modeling Social Data, Lecture 1: Overview
 
Modeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation SystemsModeling Social Data, Lecture 8: Recommendation Systems
Modeling Social Data, Lecture 8: Recommendation Systems
 
Modeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive BayesModeling Social Data, Lecture 6: Classification with Naive Bayes
Modeling Social Data, Lecture 6: Classification with Naive Bayes
 
Modeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at ScaleModeling Social Data, Lecture 3: Counting at Scale
Modeling Social Data, Lecture 3: Counting at Scale
 
Modeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to CountingModeling Social Data, Lecture 2: Introduction to Counting
Modeling Social Data, Lecture 2: Introduction to Counting
 
Modeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case StudiesModeling Social Data, Lecture 1: Case Studies
Modeling Social Data, Lecture 1: Case Studies
 
NYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social ScienceNYC Data Science Meetup: Computational Social Science
NYC Data Science Meetup: Computational Social Science
 
Computational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: ClassificationComputational Social Science, Lecture 13: Classification
Computational Social Science, Lecture 13: Classification
 
Computational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: RegressionComputational Social Science, Lecture 11: Regression
Computational Social Science, Lecture 11: Regression
 
Computational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online ExperimentsComputational Social Science, Lecture 10: Online Experiments
Computational Social Science, Lecture 10: Online Experiments
 
Computational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data WranglingComputational Social Science, Lecture 09: Data Wrangling
Computational Social Science, Lecture 09: Data Wrangling
 

Dernier

How to Analyse Profit of a Sales Order in Odoo 17
How to Analyse Profit of a Sales Order in Odoo 17How to Analyse Profit of a Sales Order in Odoo 17
How to Analyse Profit of a Sales Order in Odoo 17Celine George
 
Championnat de France de Tennis de table/
Championnat de France de Tennis de table/Championnat de France de Tennis de table/
Championnat de France de Tennis de table/siemaillard
 
Capitol Tech Univ Doctoral Presentation -May 2024
Capitol Tech Univ Doctoral Presentation -May 2024Capitol Tech Univ Doctoral Presentation -May 2024
Capitol Tech Univ Doctoral Presentation -May 2024CapitolTechU
 
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...Denish Jangid
 
Essential Safety precautions during monsoon season
Essential Safety precautions during monsoon seasonEssential Safety precautions during monsoon season
Essential Safety precautions during monsoon seasonMayur Khatri
 
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文中 央社
 
How to Manage Notification Preferences in the Odoo 17
How to Manage Notification Preferences in the Odoo 17How to Manage Notification Preferences in the Odoo 17
How to Manage Notification Preferences in the Odoo 17Celine George
 
Removal Strategy _ FEFO _ Working with Perishable Products in Odoo 17
Removal Strategy _ FEFO _ Working with Perishable Products in Odoo 17Removal Strategy _ FEFO _ Working with Perishable Products in Odoo 17
Removal Strategy _ FEFO _ Working with Perishable Products in Odoo 17Celine George
 
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdfDanh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdfQucHHunhnh
 
Envelope of Discrepancy in Orthodontics: Enhancing Precision in Treatment
 Envelope of Discrepancy in Orthodontics: Enhancing Precision in Treatment Envelope of Discrepancy in Orthodontics: Enhancing Precision in Treatment
Envelope of Discrepancy in Orthodontics: Enhancing Precision in Treatmentsaipooja36
 
factors influencing drug absorption-final-2.pptx
factors influencing drug absorption-final-2.pptxfactors influencing drug absorption-final-2.pptx
factors influencing drug absorption-final-2.pptxSanjay Shekar
 
Post Exam Fun(da) Intra UEM General Quiz 2024 - Prelims q&a.pdf
Post Exam Fun(da) Intra UEM General Quiz 2024 - Prelims q&a.pdfPost Exam Fun(da) Intra UEM General Quiz 2024 - Prelims q&a.pdf
Post Exam Fun(da) Intra UEM General Quiz 2024 - Prelims q&a.pdfPragya - UEM Kolkata Quiz Club
 
IATP How-to Foreign Travel May 2024.pdff
IATP How-to Foreign Travel May 2024.pdffIATP How-to Foreign Travel May 2024.pdff
IATP How-to Foreign Travel May 2024.pdff17thcssbs2
 
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...Nguyen Thanh Tu Collection
 
Morse OER Some Benefits and Challenges.pptx
Morse OER Some Benefits and Challenges.pptxMorse OER Some Benefits and Challenges.pptx
Morse OER Some Benefits and Challenges.pptxjmorse8
 
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽中 央社
 
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General QuizPragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General QuizPragya - UEM Kolkata Quiz Club
 
philosophy and it's principles based on the life
philosophy and it's principles based on the lifephilosophy and it's principles based on the life
philosophy and it's principles based on the lifeNitinDeodare
 
MichaelStarkes_UncutGemsProjectSummary.pdf
MichaelStarkes_UncutGemsProjectSummary.pdfMichaelStarkes_UncutGemsProjectSummary.pdf
MichaelStarkes_UncutGemsProjectSummary.pdfmstarkes24
 

Dernier (20)

How to Analyse Profit of a Sales Order in Odoo 17
How to Analyse Profit of a Sales Order in Odoo 17How to Analyse Profit of a Sales Order in Odoo 17
How to Analyse Profit of a Sales Order in Odoo 17
 
Championnat de France de Tennis de table/
Championnat de France de Tennis de table/Championnat de France de Tennis de table/
Championnat de France de Tennis de table/
 
Capitol Tech Univ Doctoral Presentation -May 2024
Capitol Tech Univ Doctoral Presentation -May 2024Capitol Tech Univ Doctoral Presentation -May 2024
Capitol Tech Univ Doctoral Presentation -May 2024
 
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...
 
Essential Safety precautions during monsoon season
Essential Safety precautions during monsoon seasonEssential Safety precautions during monsoon season
Essential Safety precautions during monsoon season
 
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
 
How to Manage Notification Preferences in the Odoo 17
How to Manage Notification Preferences in the Odoo 17How to Manage Notification Preferences in the Odoo 17
How to Manage Notification Preferences in the Odoo 17
 
Removal Strategy _ FEFO _ Working with Perishable Products in Odoo 17
Removal Strategy _ FEFO _ Working with Perishable Products in Odoo 17Removal Strategy _ FEFO _ Working with Perishable Products in Odoo 17
Removal Strategy _ FEFO _ Working with Perishable Products in Odoo 17
 
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdfDanh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
Danh sách HSG Bộ môn cấp trường - Cấp THPT.pdf
 
Envelope of Discrepancy in Orthodontics: Enhancing Precision in Treatment
 Envelope of Discrepancy in Orthodontics: Enhancing Precision in Treatment Envelope of Discrepancy in Orthodontics: Enhancing Precision in Treatment
Envelope of Discrepancy in Orthodontics: Enhancing Precision in Treatment
 
Word Stress rules esl .pptx
Word Stress rules esl               .pptxWord Stress rules esl               .pptx
Word Stress rules esl .pptx
 
factors influencing drug absorption-final-2.pptx
factors influencing drug absorption-final-2.pptxfactors influencing drug absorption-final-2.pptx
factors influencing drug absorption-final-2.pptx
 
Post Exam Fun(da) Intra UEM General Quiz 2024 - Prelims q&a.pdf
Post Exam Fun(da) Intra UEM General Quiz 2024 - Prelims q&a.pdfPost Exam Fun(da) Intra UEM General Quiz 2024 - Prelims q&a.pdf
Post Exam Fun(da) Intra UEM General Quiz 2024 - Prelims q&a.pdf
 
IATP How-to Foreign Travel May 2024.pdff
IATP How-to Foreign Travel May 2024.pdffIATP How-to Foreign Travel May 2024.pdff
IATP How-to Foreign Travel May 2024.pdff
 
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
 
Morse OER Some Benefits and Challenges.pptx
Morse OER Some Benefits and Challenges.pptxMorse OER Some Benefits and Challenges.pptx
Morse OER Some Benefits and Challenges.pptx
 
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
 
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General QuizPragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
Pragya Champions Chalice 2024 Prelims & Finals Q/A set, General Quiz
 
philosophy and it's principles based on the life
philosophy and it's principles based on the lifephilosophy and it's principles based on the life
philosophy and it's principles based on the life
 
MichaelStarkes_UncutGemsProjectSummary.pdf
MichaelStarkes_UncutGemsProjectSummary.pdfMichaelStarkes_UncutGemsProjectSummary.pdf
MichaelStarkes_UncutGemsProjectSummary.pdf
 

Data-driven modeling: Lecture 01

  • 1. Data-driven modeling APAM E4990 Jake Hofman Columbia University January 23, 2012 Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 1 / 19
  • 2. Data-dependent products • Effective/practical systems that learn from experience impact our daily lives, e.g.: • Recommendation systems • Spam detection • Optical character recognition • Face recognition • Fraud detection • Machine translation • ... Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 2 / 19
  • 3. Learning by example Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 3 / 19
  • 4. Learning by example • How did you solve this problem? • Can you make this process explicit (e.g. write code to do so)? Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 3 / 19
  • 5. Learning by example • We learn quickly from few, relatively unstructured examples ... but we don’t understand how we accomplish this • We’d like to develop algorithms that enable machines to learn by example from large data sets Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 4 / 19
  • 6. Got data? • Web service APIs expose lots of data Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 5 / 19
  • 7. Got data? • Many free, public data sets available online Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 6 / 19
  • 8. Black-boxified? Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 7 / 19
  • 9. Black-boxified? Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 7 / 19
  • 10. Roadmap? Step 1: Have data Step 2: ??? Step 3: Profit Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 8 / 19
  • 11. Roadmap, take two 1 Get data Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 9 / 19
  • 12. Roadmap, take two 1 Get data 2 Visualize/perform sanity checks 3 Clean/filter observations 4 Choose features to represent data Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 9 / 19
  • 13. Roadmap, take two 1 Get data 2 Visualize/perform sanity checks 3 Clean/filter observations 4 Choose features to represent data 5 Specify model 6 Specify loss function Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 9 / 19
  • 14. Roadmap, take two 1 Get data 2 Visualize/perform sanity checks 3 Clean/filter observations 4 Choose features to represent data 5 Specify model 6 Specify loss function 7 Develop algorithm to minimize loss Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 9 / 19
  • 15. Roadmap, take two 1 Get data 2 Visualize/perform sanity checks 3 Clean/filter observations 4 Choose features to represent data 5 Specify model 6 Specify loss function 7 Develop algorithm to minimize loss 8 Choose performance measure 9 “Train” to minimize loss 10 “Test” to evaluate generalization Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 9 / 19
  • 16. Topics • Supervised • Unsupervised • k-nearest neighbors • K-means • Naive Bayes • Mixture models • Linear regression • Principal components • Logistic regression analysis • Support vector machines • Topic models • Collaborative filtering • Matrix factorization • Data representation: feature space, selection, normalization • Model assessment: complexity control, cross-validation, ROC curve, Bayesian Occam’s razor • Large-scale learning Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 10 / 19
  • 17. Everything old is new again1 • Many fields ... • Statistics • Pattern recognition • Data mining • Machine learning • ... similar goals • Extract and recognize patterns in data • Interpret or explain observations • Test validity of hypotheses • Efficiently search the space of hypotheses • Design efficient algorithms enabling machines to learn from data 1 http://cbcl.mit.edu/publications/theses/thesis-rifkin.pdf Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 11 / 19
  • 18. Statistics vs. machine learning2 Glossary Machine learning Statistics network, graphs model weights parameters learning fitting generalization test set performance supervised learning regression/classification unsupervised learning density estimation, clustering large grant = $1,000,000 large grant= $50,000 nice place to have a meeting: nice place to have a meeting: Snowbird, Utah, French Alps Las Vegas in August 1 2 http: //anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/ Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 12 / 19
  • 19. Philosophy • We would like models that: • Provide predictive and explanatory power • Are complex enough to describe observed phenomena • Are simple enough to generalize to future observations Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 13 / 19
  • 20. Example: Netflix Prize Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 14 / 19
  • 21. Example: Netflix Prize Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 15 / 19
  • 22. Shipping = Feature Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 16 / 19
  • 23. References Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 17 / 19
  • 24. Disclaimer You may be bored if you already know how to ... • Acquire data from APIs • Clean/explore/visualize data • Classify and cluster various types of data (e.g., images, text) • Code in Python, R, SciPy/NumPy, etc. • Scale solutions to large data sets (e.g. Hadoop, SGD) • Script with unix tools on the command line, e.g. $ sed -e ’s/<[^>]*>//g’ < page.html > page.txt Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 18 / 19
  • 25. Themes Data jeopardy Regardless of scale, it’s difficult to find the right questions to ask of the data Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
  • 26. Themes Data hacking Cleaning and normalizing data is a substantial amount of the work (and likely impacts results) Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
  • 27. Themes Data hacking The ability to iterate quickly, asking and answering many questions, is crucial Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
  • 28. Themes Data hacking Hacks happen: sed/awk/grep are useful, and scale Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
  • 29. Themes “Data science” Simple methods (e.g., linear models) work surprisingly well, especially with lots of data Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
  • 30. Themes “Data science” It’s easy to cover your tracks—things are often much more complicated than they appear Jake Hofman (Columbia University) Data-driven modeling January 23, 2012 19 / 19
  • 31.
  • 32. Predicting consumer activity with Web search with Sharad Goel, S´bastien Lahaie, David Pennock, Duncan Watts e "Right Round" c 10 20 Rank 30 Billboard 40 Search Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09 Week Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 6 / 71
  • 33. Search predictions Motivation "Right Round" c 10 Does collective search activity 20 Rank provide useful predictive signal about real-world outcomes? 30 Billboard 40 Search Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09 Week Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 7 / 71
  • 34. Search predictions Motivation Past work mainly focuses on predicting the present1 and ignores baseline models trained on publicly available data 8 Actual 7 Search Flu Level (Percent) 6 Autoregressive 5 4 3 2 1 2004 2005 2006 2007 2008 2009 2010 Date 1 Varian, 2009 Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 8 / 71
  • 35. Search predictions Motivation We predict future sales for movies, video games, and music "Transformers 2" "Tom Clancy's HAWX" "Right Round" a b c 10 Search Volume Search Volume 20 Rank 30 Billboard 40 Search −30 −20 −10 0 10 20 30 −30 −20 −10 0 10 20 30 Mar−09 Apr−09 May−09 Jun−09 Jul−09 Aug−09 Time to Release (Days) Time to Release (Days) Week Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 9 / 71
  • 36. Search predictions Search models For movies and video games, predict opening weekend box office and first month sales, respectively: log(revenue) = β0 + β1 log(search) + For music, predict following week’s Billboard Hot 100 rank: billboardt+1 = β0 + β1 searcht + β2 searcht−1 + Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 10 / 71
  • 37. Search predictions Search volume Jake Hofman (Yahoo! Research) Learning from Online Activity November 30, 2011 11 / 71