2. The Data
Data is often the key consideration in your research.
Although not all projects will necessarily be
concerned with large datasets, for those that are it
is important to consider several questions.
3. Overview
How suitable is the data?
What is the type of the data?
Where will you get it from?
What size is the dataset?
What format is it in?
How much cleaning is required?
What is the quality of the data?
How do you deal with missing data?
How will you evaluate your analysis?
etc.
4. Suitability: Dataset
Determining the suitability of the data is a vital
consideration. It is not sufficient simply to locate a
dataset that is thematically linked to your research
question; it must also be appropriate for exploring
the questions that you want to ask.
For example, just because you want to do Credit
Card Fraud detection and you have a dataset that
contains Credit Card transactions or was used in
another Credit Card Fraud project, does not mean
that it will be suitable for your project.
5. Suitability: Labelling
Is the data already labelled?
This is very important for supervised learning
problems.
To take the credit card fraud example again, you
can probably get as many credit card transactions
as you like but you probably won't be able to get
them marked up as fraudulent and non-fraudulent.
6. Suitability: Labelling
The same thing goes for a lot of text analytics
problems - can you get people to label thousands of
documents as being interesting or non-interesting to
them so that you can train a predictive model?
The availability of labelled data is a key
consideration for any supervised learning problem.
The areas of semi-supervised learning and active
learning try to address this problem and have some
very interesting open research questions.
7. Suitability: Labelling
Two important considerations:
The Curse of Dimensionality – When the dimensionality
increases, the volume of the space increases so fast that
the available data becomes sparse. In order to obtain a
statistically sound result, the amount of data you need
often grows exponentially with the dimensionality.
The No Free Lunch Theorem - Classifier performance
depends greatly on the characteristics of the data to be
classified. There is no single classifier that works best on
all given problems.
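The curse of dimensionality can be made concrete with a small sketch (not from the slides): if each feature is split into a fixed number of bins, the number of cells you would need data in grows exponentially with the number of features.

```python
# Illustrative sketch: the number of grid cells -- and hence the data needed
# to cover the space -- explodes as dimensionality increases.
def cells_needed(bins_per_dim: int, n_dims: int) -> int:
    """Number of cells when each of n_dims features is split into bins_per_dim bins."""
    return bins_per_dim ** n_dims

# With just 10 bins per feature:
for d in (1, 2, 5, 10):
    print(d, cells_needed(10, d))
# 1 dimension needs 10 cells; 10 dimensions already need 10,000,000,000.
```

Even a modest 10-feature dataset would need billions of instances to place one point in every cell, which is why high-dimensional data is almost always sparse.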
8. Suitability: Labelling
Also remember for labelling, you might be aiming
for one of three goals:
Binary classifications – classifying each data item to one
of two categories.
Multiclass classifications - classifying each data item to
more than two categories.
Multi-label classifications - classifying each data item to
multiple target labels.
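The three labelling goals above imply different label encodings. A toy sketch (the values here are hypothetical, just to show the shapes):

```python
# Binary: each item belongs to one of two categories (e.g. fraud / non-fraud).
binary_labels = [0, 1, 1, 0]

# Multiclass: each item belongs to exactly one of more than two categories.
multiclass_labels = [0, 2, 1, 2]

# Multi-label: each item may carry several labels at once,
# often encoded as one indicator per possible label.
multilabel_labels = [[1, 0, 1],
                     [0, 1, 0],
                     [1, 1, 0],
                     [0, 0, 1]]

print(len(binary_labels), len(multiclass_labels), len(multilabel_labels))
```

Note that in the multi-label case a row can have several 1s, whereas multiclass assigns exactly one category per item.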
9. Types of Data
Federated data
High dimensional data
Descriptive data
Longitudinal data
Streaming data
Web (scraped) data
Numeric vs. categorical vs. text data
etc.
11. Size of the Dataset
What is a reasonable size of a dataset?
Obviously this varies a lot from problem to
problem, but in general we would recommend at
least 10 features (columns) in the dataset, and we'd
like to see thousands of instances.
12. Format of the Data
TXT (Text file)
MIME (Multipurpose Internet Mail Extensions)
XML (Extensible Markup Language)
CSV (Comma-Separated Values)
ASCII (American Standard Code for Information
Interchange)
etc.
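Of the formats above, CSV is the one you will meet most often. A minimal sketch of reading it with only the standard library (the field names here are hypothetical):

```python
import csv
import io

# In-memory stand-in for a CSV file of credit card transactions.
raw = "amount,merchant,is_fraud\n12.50,shop_a,0\n980.00,shop_b,1\n"

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[1]["amount"])  # note: CSV values arrive as strings, not numbers
```

The fact that every CSV value arrives as a string is itself a reason the cleaning step on the next slide matters.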
13. Cleaning of Data
Parsing
Correcting
Standardizing
Matching
Consolidating
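The first three cleaning steps above can be sketched as a small pipeline on a made-up record (the field layout is hypothetical):

```python
def parse(raw: str) -> dict:
    """Parsing: split a raw 'name;amount' string into named fields."""
    name, amount = raw.split(";")
    return {"name": name, "amount": amount}

def correct(rec: dict) -> dict:
    """Correcting: fix an obviously bad value (a negative amount here)."""
    rec["amount"] = abs(float(rec["amount"]))
    return rec

def standardize(rec: dict) -> dict:
    """Standardizing: normalise casing and whitespace so records can be
    matched and consolidated later."""
    rec["name"] = rec["name"].strip().lower()
    return rec

record = standardize(correct(parse("  Alice ;-42.0")))
print(record)  # {'name': 'alice', 'amount': 42.0}
```

Matching and consolidating then operate across records, using the standardized fields as join keys.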
14. Quality of the Data
Frequency counts
Descriptive statistics (mean, standard deviation,
median)
Normality (skewness, kurtosis, frequency
histograms, normal probability plots)
Associations (correlations, scatter plots)
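The first two checks above are quick to run with the standard library alone. A sketch on a toy numeric column (the values are made up, with one deliberate outlier):

```python
import statistics
from collections import Counter

values = [4.0, 5.0, 5.0, 6.0, 7.0, 50.0]  # note the suspicious 50.0

print(Counter(values))            # frequency counts
print(statistics.mean(values))    # mean -- dragged upward by the outlier
print(statistics.median(values))  # median -- robust to it
print(statistics.stdev(values))   # standard deviation -- inflated by it
```

The gap between mean and median is itself a crude quality signal: here it suggests a skewed distribution or a data-entry error worth inspecting.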
15. Missing Data?
Imputation
Partial imputation
Partial deletion
Full analysis
Also consider database nullology
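Mean imputation versus partial deletion can be contrasted in a few lines (toy column; `None` marks a missing value):

```python
import statistics

column = [3.0, None, 5.0, None, 7.0]

observed = [v for v in column if v is not None]
mean = statistics.mean(observed)  # mean of the observed values only

imputed = [mean if v is None else v for v in column]  # (mean) imputation
deleted = observed                                    # partial deletion

print(imputed)  # [3.0, 5.0, 5.0, 5.0, 7.0]
print(deleted)  # [3.0, 5.0, 7.0]
```

Imputation keeps the dataset size but shrinks its variance; deletion keeps only genuine values but loses rows, which matters when missingness is not random.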
16. Evaluating the Analysis
How confident are you in the outcomes of your
analysis?
Area under the Curve
Misclassification Error
Confusion Matrix
N-fold Cross Validation
Test predictions using real-world data
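Two of the measures above, the confusion matrix and the misclassification error, can be computed by hand on made-up binary predictions:

```python
# Hypothetical true labels and model predictions for eight instances.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

confusion_matrix = [[tn, fp],
                    [fn, tp]]
error = (fp + fn) / len(y_true)  # misclassification error

print(confusion_matrix, error)  # [[3, 1], [1, 3]] 0.25
```

In practice you would compute these on held-out folds (N-fold cross validation) rather than on the training data, so the error estimate reflects unseen instances.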