Some Key Questions
about Your Data

                  Brian Mac Namee
Brendan Tierney
            Damian Gordon
The Data
   Not every project is concerned with large datasets, but
    where the data is the key consideration in your research
    it is important to consider several questions.
Overview
   How suitable is the data?
   What is the type of the data?
   Where will you get it from?
   What size is the dataset?
   What format is it in?
   How much cleaning is required?
   What is the quality of the data?
   How do you deal with missing data?
   How will you evaluate your analysis?
   etc.
Suitability: Dataset
   Determining the suitability of the data is a vital
    consideration. It is not sufficient to simply locate a
    dataset that is thematically linked to your research
    question; it must be appropriate for exploring the
    questions that you want to ask.
   For example, a dataset that contains Credit Card
    transactions, or that was used in another Credit Card
    Fraud project, will not necessarily be suitable for
    your own Credit Card Fraud detection project.
Suitability: Labelling
   Is the data already labelled?

   This is very important for supervised learning
    problems.
   To take the credit card fraud example again, you
    can probably get as many credit card transactions
    as you like but you probably won't be able to get
    them marked up as fraudulent and non-fraudulent.
Suitability: Labelling
   The same thing goes for a lot of text analytics
    problems - can you get people to label thousands of
    documents as being interesting or non-interesting to
    them so that you can train a predictive model?
   The availability of labelled data is a key
    consideration for any supervised learning problem.
   The areas of semi-supervised learning and active
    learning try to address this problem and have some
    very interesting open research questions.
Suitability: Labelling
   Two important considerations:

       The Curse of Dimensionality – When the dimensionality
        increases, the volume of the space increases so fast that
        the available data becomes sparse. In order to obtain a
        statistically sound result, the amount of data you need
        often grows exponentially with the dimensionality.

       The No Free Lunch Theorem - Classifier performance
        depends greatly on the characteristics of the data to be
        classified. There is no single classifier that works best on
        all given problems.
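
   The sparsity argument can be made concrete with a small sketch.
    It assumes, purely for illustration, that each feature is split
    into 10 equal bins and that 10,000 instances are available; the
    function names are hypothetical and not part of the slides.

```python
def cells_needed(bins_per_feature: int, n_features: int) -> int:
    """Number of grid cells when every feature is split into equal bins."""
    return bins_per_feature ** n_features


def expected_points_per_cell(n_points: int, bins_per_feature: int, n_features: int) -> float:
    """Average number of instances per cell if points were spread evenly."""
    return n_points / cells_needed(bins_per_feature, n_features)


if __name__ == "__main__":
    # With a fixed budget of 10,000 instances, the grid empties out quickly
    # as the number of features grows.
    for d in (1, 2, 5, 10):
        print(d, cells_needed(10, d), expected_points_per_cell(10_000, 10, d))
```

    The cell count grows as 10 to the power of the number of features,
    so a dataset that covers two features densely leaves almost every
    cell empty at ten features.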
Suitability: Labelling
   Also remember that, for labelling, you might be aiming
    for one of three goals:

       Binary classification: assigning each data item to one
        of two categories.

       Multiclass classification: assigning each data item to
        exactly one of more than two categories.

       Multi-label classification: assigning each data item any
        number of target labels at once.
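
   The three goals differ mainly in what a label looks like for one
    data item. The sketch below uses made-up labels and a hypothetical
    helper, to_indicator_matrix, to show one common way of encoding
    multi-label targets as a 0/1 matrix.

```python
# Binary: every item gets one of exactly two categories.
binary_labels = ["fraud", "not_fraud", "not_fraud"]

# Multiclass: every item gets exactly one of several categories.
multiclass_labels = ["sports", "politics", "science"]

# Multi-label: every item gets a set of labels, which may be empty.
multilabel_labels = [{"sports", "politics"}, set(), {"science"}]


def to_indicator_matrix(label_sets, classes):
    """Encode label sets as rows of 0/1 flags, one column per class."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_sets]


if __name__ == "__main__":
    classes = ["politics", "science", "sports"]
    for row in to_indicator_matrix(multilabel_labels, classes):
        print(row)
```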
Types of Data
   Federated data
   High dimensional data
   Descriptive data
   Longitudinal data
   Streaming data
   Web (scraped) data
   Numeric vs. categorical vs. text data
   etc.
Locating Datasets
   http://researchmethodsdataanalysis.blogsp

   e.g.
   http://www.kdnuggets.com/datasets/
   http://www.google.com/publicdata/directory
   http://opendata.ie/
   http://lib.stat.cmu.edu/datasets/
Size of the Dataset
   What is a reasonable size for a dataset?

   Obviously it varies a lot from problem to problem, but
    in general we would recommend at least 10
    features (columns) in the dataset, and we’d like to
    see thousands of instances.
Format of the Data
   TXT (Text file)
   MIME (Multipurpose Internet Mail Extensions)
   XML (Extensible Markup Language)
   CSV (Comma-Separated Values)
   ASCII (American Standard Code for Information
    Interchange)
   etc.
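
   The format mostly determines how the data is read in, not what it
    contains. As a minimal sketch, the same two made-up transaction
    records are parsed below from CSV and from XML using only the
    Python standard library; the field names and XML layout are
    assumptions chosen for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: the same records in two formats.
csv_text = "id,amount\n1,9.99\n2,120.00\n"
xml_text = (
    "<transactions>"
    "<t id='1' amount='9.99'/>"
    "<t id='2' amount='120.00'/>"
    "</transactions>"
)

# CSV: the header row supplies the field names.
csv_rows = [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]

# XML: here each record is an element whose attributes are the fields.
xml_rows = [dict(element.attrib) for element in ET.fromstring(xml_text)]

if __name__ == "__main__":
    print(csv_rows)
    print(xml_rows)
```

    Both readers return every value as a string, so converting types
    is still a separate step.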
Cleaning of Data
   Parsing
   Correcting
   Standardizing
   Matching
   Consolidating
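
   A minimal sketch of three of these steps (standardizing, matching
    and consolidating) on made-up contact records. The field names
    and the rule of matching on name plus phone number are
    assumptions; real projects usually need fuzzier matching.

```python
def standardize(record):
    """Normalize whitespace and case in the name; keep only digits in the phone."""
    name = " ".join(record["name"].split()).lower()
    phone = "".join(ch for ch in record["phone"] if ch.isdigit())
    return {"name": name, "phone": phone}


def consolidate(records):
    """Match records on their standardized (name, phone) and keep one per match."""
    merged = {}
    for record in records:
        clean = standardize(record)
        merged.setdefault((clean["name"], clean["phone"]), clean)
    return list(merged.values())


if __name__ == "__main__":
    raw = [
        {"name": " Ann  Byrne ", "phone": "01-555 0100"},
        {"name": "ann byrne", "phone": "015550100"},
        {"name": "Joe Ryan", "phone": "015550199"},
    ]
    print(consolidate(raw))
```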
Quality of the Data
   Frequency counts
   Descriptive statistics (mean, standard deviation,
    median)
   Normality (skewness, kurtosis, frequency
    histograms, normal probability plots)
   Associations (correlations, scatter plots)
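
   These checks can be sketched with the Python standard library
    alone. The sample values are made up, and the skewness and
    correlation helpers are simple textbook formulas (population
    skewness and Pearson's r) written out for illustration.

```python
from collections import Counter
from statistics import mean, median, pstdev


def skewness(xs):
    """Population skewness: the mean cubed z-score."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


if __name__ == "__main__":
    card_types = ["visa", "mc", "visa", "visa", "amex"]
    amounts = [10.0, 12.0, 11.0, 13.0, 250.0]  # one extreme value
    print(Counter(card_types))                 # frequency counts
    print(mean(amounts), median(amounts), pstdev(amounts))
    print(skewness(amounts))                   # positive: long right tail
    print(pearson([1, 2, 3], [2, 4, 6]))
```

    A large gap between mean and median, as in the amounts above, is
    an early hint that a column is far from normally distributed.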
Missing Data?
   Imputation
   Partial imputation
   Partial deletion
   Full analysis

   Also consider database nullology
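
   Two of the simplest strategies can be sketched as follows: deleting
    every incomplete row, and imputing a column's missing values with
    the mean of its observed values. The records and function names
    are hypothetical, and mean imputation is only one of many
    imputation methods.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]


def drop_incomplete(rows):
    """Deletion: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows, column):
    """Imputation: fill missing values in one column with the observed mean."""
    fill = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


if __name__ == "__main__":
    print(len(drop_incomplete(rows)))   # rows that survive deletion
    print(impute_mean(rows, "age"))     # all rows kept, age filled in
```

    Deletion here discards half of the rows, while imputation keeps
    them all but reduces the variance of the filled column.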
Evaluating the Analysis
   How confident are you in the outcomes of your
    analysis?

   Area under the Curve
   Misclassification Error
   Confusion Matrix
   N-fold Cross Validation
   Test predictions using the real-world
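
   Three of these measures can be sketched in a few lines of plain
    Python. The labels and predictions are made up, and the fold
    assignment (every n-th item goes to the same fold) is one simple
    choice among several; in practice the data is usually shuffled
    first.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def misclassification_error(actual, predicted):
    """Fraction of items whose predicted label differs from the actual one."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


def n_fold_indices(n_items, n_folds):
    """Yield (train, test) index lists; each item is tested exactly once."""
    for k in range(n_folds):
        test = list(range(k, n_items, n_folds))
        held_out = set(test)
        train = [i for i in range(n_items) if i not in held_out]
        yield train, test


if __name__ == "__main__":
    actual = [1, 0, 1, 1, 0, 0]
    predicted = [1, 0, 0, 1, 1, 0]
    print(confusion_matrix(actual, predicted))
    print(misclassification_error(actual, predicted))
    for train, test in n_fold_indices(len(actual), 3):
        print(train, test)
```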
The Data
   Other questions?
