Some Key Questions
about your Data

Brian Mac Namee
Brendan Tierney
Damian Gordon
The Data
   Data is a key consideration in many research projects.
    Not every project is concerned with large datasets,
    but for those that are, it is important to consider
    several questions.
Overview
   How suitable is the data?
   What is the type of the data?
   Where will you get it from?
   What size is the dataset?
   What format is it in?
   How much cleaning is required?
   What is the quality of the data?
   How do you deal with missing data?
   How will you evaluate your analysis?
   etc.
Suitability: Dataset
   Determining the suitability of the data is a vital
    consideration. It is not sufficient to simply locate a
    dataset that is thematically linked to your research
    question; it must be appropriate for exploring the
    questions that you want to ask.
   For example, if you want to do credit card fraud
    detection, a dataset that contains credit card
    transactions, or that was used in another credit card
    fraud project, will not necessarily be suitable for
    your project.
Suitability: Labelling
   Is the data already labelled?

   This is very important for supervised learning
    problems.
   To take the credit card fraud example again, you
    can probably get as many credit card transactions
    as you like but you probably won't be able to get
    them marked up as fraudulent and non-fraudulent.
Suitability: Labelling
   The same thing goes for a lot of text analytics
    problems - can you get people to label thousands of
    documents as being interesting or non-interesting to
    them so that you can train a predictive model?
   The availability of labelled data is a key
    consideration for any supervised learning problem.
   The areas of semi-supervised learning and active
    learning try to address this problem and have some
    very interesting open research questions.
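One common active learning strategy is uncertainty sampling: ask people to label only the items the current model is least sure about. The sketch below assumes a hypothetical model that has already produced a probability of "interesting" for each unlabelled document; the document names and scores are invented for illustration.

```python
from typing import Dict, List


def most_uncertain(probabilities: Dict[str, float], budget: int) -> List[str]:
    """Return the `budget` items whose predicted probability is closest to 0.5.

    `probabilities` maps an unlabelled item to a (hypothetical) model's
    predicted probability that the item belongs to the positive class.
    """
    ranked = sorted(probabilities, key=lambda item: abs(probabilities[item] - 0.5))
    return ranked[:budget]


if __name__ == "__main__":
    # Invented scores for four unlabelled documents.
    scores = {"doc1": 0.95, "doc2": 0.52, "doc3": 0.10, "doc4": 0.45}
    print(most_uncertain(scores, budget=2))
```

The labelling budget is spent on the borderline documents rather than on ones the model already classifies with confidence.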
Suitability: Labelling
   Two important considerations:

       The Curse of Dimensionality: As the dimensionality
        increases, the volume of the space increases so fast that
        the available data becomes sparse. To obtain a
        statistically sound result, the amount of data you need
        often grows exponentially with the dimensionality.

       The No Free Lunch Theorem: Classifier performance
        depends greatly on the characteristics of the data to be
        classified. There is no single classifier that works best on
        all given problems.
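The sparsity argument can be made concrete with a little arithmetic. The sketch below assumes each feature is cut into 10 equal bins and that a hypothetical dataset of 10,000 points is spread evenly; both numbers are illustrative choices, not recommendations.

```python
def cells_in_grid(bins_per_axis: int, dimensions: int) -> int:
    """Number of cells when every axis is cut into `bins_per_axis` equal bins."""
    return bins_per_axis ** dimensions


def expected_points_per_cell(n_points: int, bins_per_axis: int, dimensions: int) -> float:
    """Average number of points per cell if the points were spread uniformly."""
    return n_points / cells_in_grid(bins_per_axis, dimensions)


if __name__ == "__main__":
    n_points = 10_000  # hypothetical dataset size
    for d in (1, 2, 3, 5, 10):
        cells = cells_in_grid(10, d)
        density = expected_points_per_cell(n_points, 10, d)
        print(f"{d:>2} dimensions: {cells:>14,} cells, {density:.6f} points per cell")
```

The number of cells grows as a power of the dimensionality, so keeping the same number of points per cell requires exponentially more data.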
Suitability: Labelling
   Also remember that, for labelling, you might be aiming
    for one of three goals:

       Binary classification: assigning each data item to one
        of two categories.

       Multiclass classification: assigning each data item to
        exactly one of more than two categories.

       Multi-label classification: assigning each data item any
        number of labels from a set of target labels.
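The three goals differ mainly in the shape of the target each item carries. The sketch below uses invented labels to show one plausible representation of each, plus a small helper (a hypothetical name, `to_indicator_matrix`) that turns multi-label targets into the 0/1 indicator rows many learning libraries expect.

```python
from typing import List, Set, Tuple

# Invented targets for four data items.
binary_targets: List[int] = [0, 1, 0, 0]  # one of two categories
multiclass_targets: List[str] = ["sport", "politics", "tech", "sport"]  # exactly one of many
multilabel_targets: List[Set[str]] = [{"sport"}, {"politics", "tech"}, set(), {"tech"}]


def to_indicator_matrix(targets: List[Set[str]]) -> Tuple[List[str], List[List[int]]]:
    """Convert per-item label sets into sorted label names and 0/1 rows."""
    labels = sorted(set().union(*targets))
    rows = [[1 if label in item else 0 for label in labels] for item in targets]
    return labels, rows


if __name__ == "__main__":
    labels, rows = to_indicator_matrix(multilabel_targets)
    print(labels)
    for row in rows:
        print(row)
```

In the multi-label case a row may contain several ones or none at all, which is what separates it from the multiclass case.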
Types of Data
   Federated data
   High dimensional data
   Descriptive data
   Longitudinal data
   Streaming data
   Web (scraped) data
   Numeric vs. categorical vs. text data
   etc.
Locating Datasets
   http://researchmethodsdataanalysis.blogsp

   e.g.
   http://www.kdnuggets.com/datasets/
   http://www.google.com/publicdata/directory
   http://opendata.ie/
   http://lib.stat.cmu.edu/datasets/
Size of the Dataset
   What is a reasonable size of a dataset?

   Obviously this varies a lot from problem to problem, but
    in general we would recommend at least 10
    features (columns) in the dataset, and we'd like to
    see thousands of instances.
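A quick check of a candidate dataset against this rule of thumb might look like the sketch below. It assumes the data is CSV text with a header row, and it reads "thousands of instances" as at least 1,000, which is an interpretation rather than a figure from the slides.

```python
import csv
import io
from typing import Tuple

MIN_FEATURES = 10      # rule of thumb from the text
MIN_INSTANCES = 1000   # assumed reading of "thousands of instances"


def dataset_shape(csv_text: str) -> Tuple[int, int]:
    """Return (instances, features) for CSV text whose first row is a header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    n_rows = sum(1 for row in reader if row)  # skip blank lines
    return n_rows, len(header)


def meets_size_guideline(n_instances: int, n_features: int) -> bool:
    """True when both thresholds of the rule of thumb are satisfied."""
    return n_instances >= MIN_INSTANCES and n_features >= MIN_FEATURES


if __name__ == "__main__":
    # Tiny invented dataset: 12 columns, 3 rows.
    header = ",".join(f"f{i}" for i in range(12))
    body = "\n".join(",".join("0" for _ in range(12)) for _ in range(3))
    instances, features = dataset_shape(header + "\n" + body + "\n")
    print(instances, features, meets_size_guideline(instances, features))
```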
Format of the Data
   TXT (Text file)
   MIME (Multipurpose Internet Mail Extensions)
   XML (Extensible Markup Language)
   CSV (Comma-Separated Values)
   ASCII (American Standard Code for Information
    Interchange)
   etc.
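The format mostly determines how much work it takes to get records into a common in-memory shape. As a minimal sketch, the same two invented transactions are read below from CSV and from XML using only the Python standard library; the field names and the XML layout are assumptions made for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

CSV_TEXT = "id,amount\n1,9.99\n2,120.00\n"
XML_TEXT = """<transactions>
  <transaction><id>1</id><amount>9.99</amount></transaction>
  <transaction><id>2</id><amount>120.00</amount></transaction>
</transactions>"""


def rows_from_csv(text: str) -> List[Dict[str, str]]:
    """Read CSV text with a header row into a list of dictionaries."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def rows_from_xml(text: str) -> List[Dict[str, Optional[str]]]:
    """Read one dictionary per child element of the XML root."""
    root = ET.fromstring(text)
    return [{field.tag: field.text for field in record} for record in root]


if __name__ == "__main__":
    print(rows_from_csv(CSV_TEXT))
    print(rows_from_xml(XML_TEXT))
```

Both readers return every value as a string, so type conversion remains a separate cleaning step whatever the source format.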
Cleaning of Data
   Parsing
   Correcting
   Standardizing
   Matching
   Consolidating
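The five steps can be sketched as a small pipeline. Everything below is illustrative: the records are invented, the "correction" handles a single assumed typo pattern, and matching on e-mail address is only one possible matching rule.

```python
from typing import Dict, List

RAW_LINES = [
    "  Alice Smith ; ALICE@EXAMPLE.COM ; Dublin",
    "alice smith;alice@example.com;Dublin 2",
    "Bob Jones;bob@example,com;Cork",
]


def parse(line: str) -> Dict[str, str]:
    """Parsing: split a raw line into named fields."""
    name, email, city = [part.strip() for part in line.split(";")]
    return {"name": name, "email": email, "city": city}


def correct(record: Dict[str, str]) -> Dict[str, str]:
    """Correcting: repair an assumed typo (comma typed instead of dot)."""
    fixed = dict(record)
    fixed["email"] = record["email"].replace(",", ".")
    return fixed


def standardize(record: Dict[str, str]) -> Dict[str, str]:
    """Standardizing: put names and e-mail addresses into one convention."""
    return {
        "name": record["name"].title(),
        "email": record["email"].lower(),
        "city": record["city"],
    }


def match_key(record: Dict[str, str]) -> str:
    """Matching: records sharing an e-mail address are treated as one entity."""
    return record["email"]


def consolidate(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Consolidating: keep the first record seen for each matched entity."""
    merged: Dict[str, Dict[str, str]] = {}
    for record in records:
        merged.setdefault(match_key(record), record)
    return list(merged.values())


def clean(lines: List[str]) -> List[Dict[str, str]]:
    return consolidate([standardize(correct(parse(line))) for line in lines])


if __name__ == "__main__":
    for record in clean(RAW_LINES):
        print(record)
```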
Quality of the Data
   Frequency counts
   Descriptive statistics (mean, standard deviation,
    median)
   Normality (skewness, kurtosis, frequency
    histograms, normal probability plots)
   Associations (correlations, scatter plots)
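These checks can all be computed with a few lines of standard-library Python. The sketch below uses population moments for skewness and excess kurtosis, which is one of several conventions, and the function names are chosen for this example.

```python
import statistics
from collections import Counter
from math import sqrt
from typing import Dict, Hashable, Sequence


def frequency_counts(values: Sequence[Hashable]) -> Counter:
    """How often each distinct value occurs."""
    return Counter(values)


def describe(values: Sequence[float]) -> Dict[str, float]:
    """Mean, median and sample standard deviation."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def skewness(values: Sequence[float]) -> float:
    """Third standardized moment; 0 for a symmetric distribution."""
    m, s = statistics.fmean(values), statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def excess_kurtosis(values: Sequence[float]) -> float:
    """Fourth standardized moment minus 3; 0 for a normal distribution."""
    m, s = statistics.fmean(values), statistics.pstdev(values)
    return sum(((v - m) / s) ** 4 for v in values) / len(values) - 3.0


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


if __name__ == "__main__":
    values = [1, 2, 3, 4, 5]
    print(describe(values), skewness(values), excess_kurtosis(values))
    print(pearson([1, 2, 3], [2, 4, 6]))
```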
Missing Data?
   Imputation
   Partial imputation
   Partial deletion
   Full analysis

   Also consider database nullology
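Two of these options are easy to compare side by side. The sketch below shows listwise deletion (one form of partial deletion) and mean imputation (one form of imputation) on invented rows, with `None` standing in for a missing value; other imputation methods would slot in the same way.

```python
from statistics import fmean
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]

ROWS: List[Row] = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 48000.0},
    {"age": 29, "income": None},
    {"age": 41, "income": 61000.0},
]


def listwise_deletion(rows: List[Row]) -> List[Row]:
    """Drop every row that has at least one missing value."""
    return [row for row in rows if all(value is not None for value in row.values())]


def mean_imputation(rows: List[Row]) -> List[Row]:
    """Replace each missing value with the mean of the observed values in its column."""
    columns = list(rows[0].keys())
    means = {c: fmean(row[c] for row in rows if row[c] is not None) for c in columns}
    return [{c: (means[c] if row[c] is None else row[c]) for c in columns} for row in rows]


if __name__ == "__main__":
    print(len(listwise_deletion(ROWS)), "complete rows remain after deletion")
    for row in mean_imputation(ROWS):
        print(row)
```

Deletion keeps only observed values but shrinks the dataset, while imputation keeps every row at the cost of introducing estimated values.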
Evaluating the Analysis
   How confident are you in the outcomes of your
    analysis?

   Area under the Curve
   Misclassification Error
   Confusion Matrix
   N-fold Cross Validation
   Test predictions using real-world data
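The first four measures can be sketched for a binary classifier in plain Python. The labels and scores in the usage example are invented, the area under the curve is computed as the fraction of positive/negative pairs ranked correctly (equivalent to the area under the ROC curve), and the fold assignment is a simple round-robin chosen for brevity.

```python
from typing import Dict, List, Sequence, Tuple


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Counts of true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def misclassification_error(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of items whose predicted label differs from the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative item")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_items: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) for each of k round-robin folds."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    everything = set(range(n_items))
    return [(sorted(everything - set(test)), test) for test in folds]


if __name__ == "__main__":
    y_true = [1, 0, 1, 0]
    y_pred = [1, 0, 0, 0]
    scores = [0.9, 0.1, 0.4, 0.6]
    print(confusion_matrix(y_true, y_pred))
    print(misclassification_error(y_true, y_pred))
    print(auc(y_true, scores))
    print(k_fold_indices(6, 3))
```

In practice the data would usually be shuffled before being split into folds, and each fold's test items are scored by a model trained only on that fold's training items.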
The Data
   Other questions?
