2. The Data
Data is often the key consideration in your research.
Although not all projects will necessarily be
concerned with large datasets, for those that are it
is important to consider several questions.
3. Overview
How suitable is the data?
What is the type of the data?
Where will you get it from?
What size is the dataset?
What format is it in?
How much cleaning is required?
What is the quality of the data?
How do you deal with missing data?
How will you evaluate your analysis?
etc.
4. Suitability: Dataset
Determining the suitability of the data is a vital
consideration. It is not sufficient simply to locate a
dataset that is thematically linked to your research
question; it must also be appropriate for exploring
the questions that you want to ask.
For example, just because you want to do Credit
Card Fraud detection and you have a dataset that
contains Credit Card transactions or was used in
another Credit Card Fraud project, does not mean
that it will be suitable for your project.
5. Suitability: Labelling
Is the data already labelled?
This is very important for supervised learning
problems.
To take the credit card fraud example again, you
can probably get as many credit card transactions
as you like but you probably won't be able to get
them marked up as fraudulent and non-fraudulent.
6. Suitability: Labelling
The same thing goes for a lot of text analytics
problems - can you get people to label thousands of
documents as being interesting or non-interesting to
them so that you can train a predictive model?
The availability of labelled data is a key
consideration for any supervised learning problem.
The areas of semi-supervised learning and active
learning try to address this problem and have some
very interesting open research questions.
7. Suitability: Labelling
Two important considerations:
The Curse of Dimensionality – When the dimensionality
increases, the volume of the space increases so fast that
the available data becomes sparse. In order to obtain a
statistically sound result, the amount of data you need
often grows exponentially with the dimensionality.
The No Free Lunch Theorem - Classifier performance
depends greatly on the characteristics of the data to be
classified. There is no single classifier that works best on
all given problems.
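The curse of dimensionality can be made concrete with a small sketch (not from the slides): if each feature is split into a fixed number of bins, the number of cells you would need data in grows exponentially with the number of features.

```python
# Illustrative sketch: the number of grid cells -- and hence the data needed
# to cover the space -- explodes as dimensionality increases.
def cells_needed(bins_per_dim: int, n_dims: int) -> int:
    """Number of cells when each of n_dims features is split into bins_per_dim bins."""
    return bins_per_dim ** n_dims

# With just 10 bins per feature:
for d in (1, 2, 5, 10):
    print(d, cells_needed(10, d))
# 1 dimension needs 10 cells; 10 dimensions already need 10,000,000,000.
```

Even a modest 10-feature dataset would need billions of instances to place one point in every cell, which is why high-dimensional data is almost always sparse.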
8. Suitability: Labelling
Also remember for labelling, you might be aiming
for one of three goals:
Binary classifications – classifying each data item to one
of two categories.
Multiclass classifications - classifying each data item to
more than two categories.
Multi-label classifications - classifying each data item to
multiple target labels.
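The three labelling goals above imply different label encodings. A toy sketch (the values here are hypothetical, just to show the shapes):

```python
# Binary: each item belongs to one of two categories (e.g. fraud / non-fraud).
binary_labels = [0, 1, 1, 0]

# Multiclass: each item belongs to exactly one of more than two categories.
multiclass_labels = [0, 2, 1, 2]

# Multi-label: each item may carry several labels at once,
# often encoded as one indicator per possible label.
multilabel_labels = [[1, 0, 1],
                     [0, 1, 0],
                     [1, 1, 0],
                     [0, 0, 1]]

print(len(binary_labels), len(multiclass_labels), len(multilabel_labels))
```

Note that in the multi-label case a row can have several 1s, whereas multiclass assigns exactly one category per item.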
9. Types of Data
Federated data
High dimensional data
Descriptive data
Longitudinal data
Streaming data
Web (scraped) data
Numeric vs. categorical vs. text data
etc.
11. Size of the Dataset
What is a reasonable size of a dataset?
Obviously this varies a lot from problem to
problem, but in general we would recommend at
least 10 features (columns) in the dataset, and we'd
like to see thousands of instances.
12. Format of the Data
TXT (Text file)
MIME (Multipurpose Internet Mail Extensions)
XML (Extensible Markup Language)
CSV (Comma-Separated Values)
ASCII (American Standard Code for Information
Interchange)
etc.
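Of the formats above, CSV is the one you will meet most often. A minimal sketch of reading it with only the standard library (the field names here are hypothetical):

```python
import csv
import io

# In-memory stand-in for a CSV file of credit card transactions.
raw = "amount,merchant,is_fraud\n12.50,shop_a,0\n980.00,shop_b,1\n"

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[1]["amount"])  # note: CSV values arrive as strings, not numbers
```

The fact that every CSV value arrives as a string is itself a reason the cleaning step on the next slide matters.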
13. Cleaning of Data
Parsing
Correcting
Standardizing
Matching
Consolidating
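The first three cleaning steps above can be sketched as a small pipeline on a made-up record (the field layout is hypothetical):

```python
def parse(raw: str) -> dict:
    """Parsing: split a raw 'name;amount' string into named fields."""
    name, amount = raw.split(";")
    return {"name": name, "amount": amount}

def correct(rec: dict) -> dict:
    """Correcting: fix an obviously bad value (a negative amount here)."""
    rec["amount"] = abs(float(rec["amount"]))
    return rec

def standardize(rec: dict) -> dict:
    """Standardizing: normalise casing and whitespace so records can be
    matched and consolidated later."""
    rec["name"] = rec["name"].strip().lower()
    return rec

record = standardize(correct(parse("  Alice ;-42.0")))
print(record)  # {'name': 'alice', 'amount': 42.0}
```

Matching and consolidating then operate across records, using the standardized fields as join keys.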
14. Quality of the Data
Frequency counts
Descriptive statistics (mean, standard deviation,
median)
Normality (skewness, kurtosis, frequency
histograms, normal probability plots)
Associations (correlations, scatter plots)
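The first two checks above are quick to run with the standard library alone. A sketch on a toy numeric column (the values are made up, with one deliberate outlier):

```python
import statistics
from collections import Counter

values = [4.0, 5.0, 5.0, 6.0, 7.0, 50.0]  # note the suspicious 50.0

print(Counter(values))            # frequency counts
print(statistics.mean(values))    # mean -- dragged upward by the outlier
print(statistics.median(values))  # median -- robust to it
print(statistics.stdev(values))   # standard deviation -- inflated by it
```

The gap between mean and median is itself a crude quality signal: here it suggests a skewed distribution or a data-entry error worth inspecting.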
15. Missing Data?
Imputation
Partial imputation
Partial deletion
Full analysis
Also consider database nullology
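Mean imputation versus partial deletion can be contrasted in a few lines (toy column; `None` marks a missing value):

```python
import statistics

column = [3.0, None, 5.0, None, 7.0]

observed = [v for v in column if v is not None]
mean = statistics.mean(observed)  # mean of the observed values only

imputed = [mean if v is None else v for v in column]  # (mean) imputation
deleted = observed                                    # partial deletion

print(imputed)  # [3.0, 5.0, 5.0, 5.0, 7.0]
print(deleted)  # [3.0, 5.0, 7.0]
```

Imputation keeps the dataset size but shrinks its variance; deletion keeps only genuine values but loses rows, which matters when missingness is not random.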
16. Evaluating the Analysis
How confident are you in the outcomes of your
analysis?
Area under the Curve
Misclassification Error
Confusion Matrix
N-fold Cross Validation
Test predictions using real-world data
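Two of the measures above, the confusion matrix and the misclassification error, can be computed by hand on made-up binary predictions:

```python
# Hypothetical true labels and model predictions for eight instances.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

confusion_matrix = [[tn, fp],
                    [fn, tp]]
error = (fp + fn) / len(y_true)  # misclassification error

print(confusion_matrix, error)  # [[3, 1], [1, 3]] 0.25
```

In practice you would compute these on held-out folds (N-fold cross validation) rather than on the training data, so the error estimate reflects unseen instances.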