Z Score, T Score, Percentile Rank and Box Plot Graph
Lecture 2: Preliminaries (Understanding and Preprocessing data)
1. Machine Learning for Language Technology 2015
Preliminaries
Understanding and Preprocessing Data
Marina Santini
santinim@stp.lingfil.uu.se
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Autumn 2015
Lecture 2: Preliminaries 1
2. Acknowledgements
• Weka Slides (teaching material*), Wikipedia,
MathIsFun and other websites.
* http://www.cs.waikato.ac.nz/ml/weka/book.html
3. Outline
– Raw Data and Feature Representation:
• Concepts, instances, attributes
– Digression 1: Pills of Statistics
• Sampling, mean, variance, standard deviation,
normalization, standardization, etc.
– Digression 2: Data Visualization
• how to read a histogram, scatter plot, etc.
5. What is data?
• Data is a collection of facts, such as numbers,
words, measurements, observations or even
just descriptions of things.
• Data can be qualitative or quantitative.
– Qualitative data is descriptive information (it
describes something)
– Quantitative data is numeric information
(numbers).
6. Singular or Plural?
• The singular form of "data" is "datum".
– Ex: "that datum is very high"
• The plural form of "datum" is "data".
• "Data" is plural when it refers to many individual data points (each one a datum).
– Ex: "the data are available"
• But "data" can also refer to a collection of facts. In this case it
is uncountable and takes a singular verb.
– Ex: "the data is available"
http://www.theguardian.com/news/datablog/2010/jul/16/data-plural-singular
7. Qualitative Data
• Categorical values
– Nominal (ex: eye colour)
– Ordinal (ex: street numbers)
8. Quantitative Data
• Quantitative data can be discrete or
continuous.
• Discrete data is counted; continuous data is
measured.
– Discrete data can only take certain values (like
whole numbers)
– Continuous data can take any value (within a
range)
9. Concepts, Instances, and Attributes
Components of the input:
Concepts: kinds of things that can be learned
Instances: the individual, independent examples of
a concept
Attributes: measuring aspects of an instance
10. The importance of feature selection
and representation
Binary data is a special type of categorical
data. Binary data takes only two values.
12. Missing Data/Values
Types: unknown, unrecorded, irrelevant, etc.
Reasons:
collation of different datasets
measurement not possible
etc.
Missing data may have significance in itself (e.g.
missing test in a medical examination)
Most ML schemes assume that missing data have no
special significance. So… be careful and make your
own decisions.
13. Inaccurate values
Typographical errors in nominal attributes: values need
to be checked for consistency
Typographical and measurement errors in numeric
attributes: outliers need to be identified
14. Noise
• Noise is any unwanted anomaly in the data.
• In ML the presence of noise may cause
difficulties in learning the classes and produce
unreliable classifiers.
• Noise can be caused by:
– imprecisions in recording input attributes
– errors in labelling
– etc.
15. Getting to know the data
Simple visualization tools are very useful
Nominal attributes: histograms
Numeric attributes: graphs
Too much data to inspect? Take a sample!
17. Weka Software Package
http://www.cs.waikato.ac.nz/ml/weka/
Weka (Waikato Environment for Knowledge Analysis) is
developed at the University of Waikato in New Zealand.
A collection of state-of-the-art machine learning
algorithms and data preprocessing tools.
It is open source. It is written in Java.
Contains implementations of learning algorithms that you
can apply to your datasets.
18. Weka input data formats
• General formats:
• Weka:
– ARFF: Attribute-Relation File Format.
– It is an ASCII file that describes a list of instances
sharing a set of attributes.
20. Sparse data
In some applications most attribute values in a
dataset are zero
E.g.: word counts in a text categorization problem
ARFF supports sparse data
This also works for nominal attributes (where the
first value corresponds to “zero”)
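Dense representation:
0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"
0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"
Equivalent sparse representation (pairs of attribute index and value, indices starting at 0):
{1 26, 6 63, 10 "class A"}
{3 42, 10 "class B"}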
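The mapping from a dense row to its sparse form can be sketched in a few lines of Python. This is an illustration only, not Weka code: the function name `to_sparse` is hypothetical, indices are assumed to start at 0 as in the rows above, and the special case of nominal attributes whose first value counts as "zero" is not handled.

```python
def to_sparse(row):
    """Return a sparse, ARFF-style string for one dense row (illustrative sketch)."""
    parts = []
    for index, value in enumerate(row):
        if value == 0:
            continue  # zero values are implied, so they are left out
        # Quote string values (such as the class label); keep numbers as they are.
        text = f'"{value}"' if isinstance(value, str) else str(value)
        parts.append(f"{index} {text}")
    return "{" + ", ".join(parts) + "}"


dense_rows = [
    [0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"],
    [0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"],
]
sparse_rows = [to_sparse(row) for row in dense_rows]
for line in sparse_rows:
    print(line)
```

Only the non-zero positions are stored, which is why the format pays off when most word counts are zero.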
22. Population and Sample
• Population and Sample
– Population: The whole group of "things" we want to study
• Ex: All students born between 1980 and 2000
– Sample: A selection taken from a larger group (the "population") so that you
can examine it to find out something about the larger group.
• Ex: 100 randomly chosen students born between 1980 and 2000
In other words:
the "population" is the entire pool from which a statistical sample is drawn.
The information obtained from the sample allows statisticians to develop
hypotheses about the larger population.
Researchers gather information from a sample because of the difficulty of
studying the entire population.
23. Sampling
• Sampling is a science in itself and there are
different methods to sample a population
– Ex: random sampling, stratified sampling, multi-
stage sampling, quota sampling, etc.
• The main concern: the sample should be
representative of the population.
25. Normal Distribution
• A normal distribution is an arrangement of a
data set in which most values cluster in the
middle of the range and the rest taper off
symmetrically toward either extreme.
31. Mean
• The mean is the average of the numbers: a calculated
"central" value of a set of numbers. To calculate: Just
add up all the numbers, then divide by how many
numbers there are.
Ex: what is the mean of 2, 7 and 9?
• Add the numbers: 2 + 7 + 9 = 18
• Divide by how many numbers (i.e. we added 3
numbers): 18 ÷ 3 = 6
• The Mean is 6
32. Median
• The Median is the middle number (in a sorted
list of numbers). To find the Median, place
the numbers you are given in value order and
find the middle number. (If there are two
middle numbers, you average them.)
• Find the Median of {13, 23, 11, 16, 15, 10, 26}.
• Put them in order: {10, 11, 13, 15, 16, 23, 26}
• The middle number is 15, so the median is 15.
33. Mode
• The Mode is the number which appears most
often in a set of numbers.
• In {6, 3, 9, 6, 6, 5, 9, 3} the Mode is 6 (it occurs
most often).
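The three measures of central tendency (mean, median, mode) can be checked with Python's standard `statistics` module. This is a small sketch that reuses the example numbers from the slides; the variable names are chosen here for illustration.

```python
from statistics import mean, median, mode

# Mean: add up the numbers and divide by how many there are.
mean_value = mean([2, 7, 9])

# Median: middle number of the sorted list (sorting is done internally).
median_value = median([13, 23, 11, 16, 15, 10, 26])

# Mode: the value that occurs most often.
mode_value = mode([6, 3, 9, 6, 6, 5, 9, 3])

print(mean_value, median_value, mode_value)
```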
35. The mean of a frequency table
• In a frequency table, the mean is calculated
by:
– multiplying each score by its frequency, adding up
the products and dividing by the sum of the frequencies
36. Mean: Formula
• Formula: x̄ = Σ fx / Σ f
• The x with the bar on top (x̄) means "mean of x"
• Σ (sigma) means "sum up"
• Σ fx means "sum up all the frequencies times the
matching scores"
• Σ f means "sum up all the frequencies"
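As a sketch of how the frequency-table mean works in practice, the snippet below divides the sum of score times frequency by the sum of the frequencies. The table values are hypothetical and are not the ones used in the quiz; the function name `frequency_mean` is likewise an assumption.

```python
def frequency_mean(table):
    """Mean of a frequency table given as {score: frequency}."""
    total_fx = sum(score * freq for score, freq in table.items())  # sum of f * x
    total_f = sum(table.values())                                  # sum of f
    return total_fx / total_f


# Hypothetical table: score 1 occurs twice, score 2 five times, score 3 three times.
table = {1: 2, 2: 5, 3: 3}
freq_mean = frequency_mean(table)
print(freq_mean)
```

Here the products add up to 2 + 10 + 9 = 21 and the frequencies to 10, so the mean is 21 / 10.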
37. Quiz: The mean of a frequency table
• Calculate the mean of the following frequency table using
the mean formula:
Answers (only one is correct)
• 2.05
• 5.2
• 3.7
39. Measures of Dispersion
• Dispersion is a general term for different
statistics that describe how values are
distributed around the centre
40. Measures of Dispersion
• range
• quartiles
• interquartile range
• percentiles
• mean deviation
• variance
• standard deviation
• etc.
41. Range
• The range is the difference between the
lowest and highest values.
– Example: In {4, 6, 9, 3, 7} the lowest value is 3,
and the highest is 9. So the range is 9-3 = 6.
42. Quartiles
• Quartiles are the values that divide a list of numbers into
quarters.
– First put the list of numbers in order
– Then cut the list into four equal parts
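– The Quartiles are at the "cuts"
• Example: 1, 3, 3, 4, 5, 6, 6, 7, 8, 8 (the numbers must be in
order)
• Cut the list into quarters by taking the median of the whole list and the median of each half. The result is:
• Quartile 1 (Q1) = 3 (median of the lower half 1, 3, 3, 4, 5)
• Quartile 2 (Q2), which is also the Median = 5.5 (halfway between 5 and 6)
• Quartile 3 (Q3) = 7 (median of the upper half 6, 6, 7, 8, 8)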
43. Interquartile Range
• The "Interquartile Range" (IQR) is from Q1 to Q3.
• To calculate it just subtract Quartile 1 from
Quartile 3: IQR = Q3 − Q1
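Quartiles and the interquartile range can be sketched in Python as follows. The sketch assumes the "median of each half" convention (for an odd number of values the middle value is left out of both halves); other conventions, including the ones built into statistical libraries, can give slightly different quartiles. The function name `quartiles` is hypothetical.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    data = sorted(values)              # the numbers must be in order
    half = len(data) // 2
    lower = data[:half]
    # With an odd count, skip the middle value when building the upper half.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)


q1, q2, q3 = quartiles([1, 3, 3, 4, 5, 6, 6, 7, 8, 8])
iqr = q3 - q1                          # interquartile range
print(q1, q2, q3, iqr)
```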
44. Percentiles
• A percentile is the value below which a given
percentage of the data falls (the data needs to be
in order)
• Example: You are the 4th tallest person in a group of 20;
80% of people are shorter than you: That means you are
at the 80th percentile.
• That is, if your height is 1.85m then "1.85m" is the 80th
percentile height in that group.
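The height example can be reproduced with a short sketch. The 20 heights below are hypothetical (in centimetres), chosen so that 16 people are shorter than 185 cm, and `percentile_rank` is an illustrative helper that uses the "percentage of values strictly below" definition given above.

```python
def percentile_rank(values, x):
    """Percentage of values that fall strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


# Hypothetical group of 20 people: 16 shorter than 185 cm, then 185 and 3 taller.
heights_cm = list(range(160, 176)) + [185, 188, 190, 193]
rank = percentile_rank(heights_cm, 185)
print(rank)
```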
45. Mean Deviation
• It is the mean of the distances of each value
from their mean.
• Three steps:
– 1. Find the mean of all values
– 2. Find the distance of each value from that mean
(subtract the mean from each value and take the
absolute value, i.e. ignore minus signs)
– 3. Then find the mean of those distances
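The three steps can be sketched directly in Python. The data values are illustrative and not taken from the slides; the function name `mean_deviation` is an assumption.

```python
def mean_deviation(values):
    """Mean of the absolute distances of each value from the mean."""
    centre = sum(values) / len(values)              # step 1: the mean
    distances = [abs(v - centre) for v in values]   # step 2: absolute distances
    return sum(distances) / len(distances)          # step 3: mean of the distances


# Illustrative values: the mean is 9 and the distances add up to 30.
md = mean_deviation([3, 6, 6, 7, 8, 11, 15, 16])
print(md)
```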
46. Variance: σ²
• The Variance is the average of the squared
differences from the mean.
• To calculate the variance follow these steps:
– Work out the mean.
– Then for each number: subtract the Mean and
square the result (the squared difference).
– Then work out the average of those squared
differences.
47. Example: Compute the Variance
For the following dataset find the variance: {600,
470, 170, 430, 300}.
Mean = (600+470+170+430+300)/5 = 1970/5 = 394
For each number subtract the mean:
600-394=206; 470-394=76; 170-394=-224; 430-394=36; 300-394=-94
Take each difference, square it, and then average the
results: (42436 + 5776 + 50176 + 1296 + 8836)/5 = 108520/5.
The variance is 21,704.
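The same calculation can be followed step by step in Python and cross-checked against `statistics.pvariance`, which computes the population variance. This is a sketch; the variable names are chosen here for illustration.

```python
from statistics import pvariance

data = [600, 470, 170, 430, 300]

mean_value = sum(data) / len(data)                       # work out the mean
squared_diffs = [(x - mean_value) ** 2 for x in data]    # squared differences
variance = sum(squared_diffs) / len(data)                # average of those

print(mean_value, variance)
print(pvariance(data))  # library cross-check (population variance)
```

Note that dividing by the number of values gives the population variance; a sample variance would divide by one less.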
48. Standard Deviation: σ
• The Standard Deviation is one of the most
reliable measures of how spread out numbers
are.
• The formula is easy: it is the square root of
the variance.
49. Standard Deviation Formula
(population)
σ = √( (1/N) · Σ (xi − μ)² ), with the sum running from i = 1 to N
• μ = the mean
• xi = the individual value of a
dataset
• (xi − μ)² = for each value subtract
the mean and square the result
• N = the total number of values in
the dataset
• i=1 = start at this value (here the
first number of the dataset)
• Σ = add up all the values
• 1/N = divide by total number of
values in the dataset
• √ = take the square root of the whole
calculation
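A minimal sketch of the population formula in Python, applied to the dataset from the variance example and compared with `statistics.pstdev`. The function name `population_std` is hypothetical.

```python
import math
from statistics import pstdev


def population_std(values):
    """Square root of the mean squared difference from the mean."""
    n = len(values)                                    # N
    mu = sum(values) / n                               # the mean
    squared = sum((x - mu) ** 2 for x in values)       # sum of squared differences
    return math.sqrt(squared / n)                      # divide by N, take the root


data = [600, 470, 170, 430, 300]   # dataset from the variance example
sigma = population_std(data)
print(round(sigma, 2))
print(round(pstdev(data), 2))      # library cross-check
```

For this dataset the result is the square root of 21,704, roughly 147.32.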
51. Standard Deviation is the most reliable
measure of dispersion
• Depending on the situation, not all measures of
dispersion are equally reliable.
• For example, the range can sometimes be misleading when
there are extremely high or low values.
– Example: In {8, 11, 5, 9, 7, 6, 3616}: the lowest value is 5,
and the highest is 3616. So the range is 3616-5 = 3611.
• However: The single value of 3616 makes the range
large, but most values are around 10.
• So we may be better off using other measures such as
Standard Deviation = 1262.65
53. Standard Deviation vs Variance
• A useful property of the standard deviation is that, unlike
the variance, it is expressed in the same units as the data.
• In other words: the standard deviation is expressed in the same units
as the mean, whereas the variance is expressed in squared
units. So the standard deviation is more intuitive…
• Note that a normal distribution with mean = 10 and
standard deviation = 3 is exactly the same thing as a normal
distribution with mean = 10 and variance = 9.
• Watch out and be clear about which one you are using!
54. Quiz: Standard Deviation
68% of the frequency values of the word “and” in a
corpus of email (assume emails have equal length) are
between 51 and 64. Assuming this data is normally
distributed, what are the mean and standard
deviation?
1. Mean = 57; S.D. = 6.5
2. Mean = 57.5 ; S.D. = 6.5
3. Mean = 57.5; S.D. = 13
55. These notions will be revisited later...
• … when dealing with statistical inference and
other statistical methods.
• Standard Deviation Calculator:
http://www.mathsisfun.com/data/standard-deviation-calculator.html
57. Normalization
• To normalize data means to fit the data within
unity, so all the data will take on a value
between 0 and 1. Many formulas are
available:
• Ex:
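One widely used normalization formula is min-max scaling, x' = (x − min) / (max − min). It is assumed here purely for illustration, since several formulas are available. The sketch below applies it to the dataset from the variance example; the function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so the denominator would be zero.
        raise ValueError("min-max scaling is undefined when all values are equal")
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([600, 470, 170, 430, 300])
print(normalized)
```

Because the extremes define the scale, a single outlier squeezes all other values into a narrow band, which is worth keeping in mind when choosing between normalization and standardization.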
58. Standardization
• Standardization converts all variables to a
common scale and reflects how many
standard deviations from the mean a
data point falls
• The number of standard deviations from the
mean is also called the "Standard Score",
"sigma" or "z-score".
59. How to standardize
• Formula: z = (x − μ) / σ
• z is the "z-score" (Standard Score)
• x is the value to be standardised
• μ is the mean
• σ is the standard deviation
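Standardization can be sketched as follows: subtract the mean from each value and divide by the standard deviation. The sketch assumes the population standard deviation (`statistics.pstdev`) and reuses the dataset from the variance example; the function name `z_scores` is hypothetical, and the input must not consist of identical values (σ would be zero).

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return the z-score (x - mu) / sigma of every value."""
    mu = mean(values)          # the mean
    sigma = pstdev(values)     # population standard deviation (assumed here)
    return [(x - mu) / sigma for x in values]


data = [600, 470, 170, 430, 300]
standardized = z_scores(data)
print([round(z, 2) for z in standardized])
```

After standardization the values have mean 0 and standard deviation 1, so variables measured in different units become comparable.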
60. Why Standardize?
• It can help us make decisions about our data.
63. Outline
• Bar chart
• Histogram
• Pie chart
• Line chart
• Scatter plot
• Dot plot
• Box plot
64. Axes and Coordinates
• The left-right (horizontal) direction is commonly called X or abscissa.
• The up-down (vertical) direction is commonly called Y or ordinate.
• The coordinates are always written in a certain order: the horizontal
distance first, then the vertical distance.
Revision: Read this web page carefully:
https://www.mathsisfun.com/data/cartesian-coordinates.html
65. Bar Chart
• A Bar Chart (also called Bar Graph) is a
graphical display of data using bars of
different heights.
• Bar charts are used to graph categorical data.
Example:
66. Histogram
• With continuous data, histograms are used.
• Histograms are similar to bar charts, but a histogram
groups numbers into ranges.
67. Pie Chart
• It is a special chart that uses "slices" to show
relative sizes of data.
• Pie charts have been criticized because it is difficult to compare the sizes of slices by eye.
68. Line Chart
• A line chart is a graph that shows information
that is connected in some way (such as change
over time).
69. Scatter plot
• A scatter plot has points that show the
relationship between two sets of data.
• Example: each dot shows one person's weight
versus their height.
70. Line of best fit
• Draw a "Line of Best Fit" (also called a "Trend
Line") on the scatter plot to predict values that
might not be on the plot
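Instead of drawing the line by eye, it can be computed. The sketch below uses ordinary least squares, which is one common way to fit such a line and is an assumption here, not necessarily the method intended on the slide. The height and weight values are hypothetical, and `best_fit_line` is an illustrative name.

```python
def best_fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


heights_cm = [150, 160, 170, 180, 190]   # hypothetical data
weights_kg = [52, 58, 67, 74, 84]        # hypothetical data
slope, intercept = best_fit_line(heights_cm, weights_kg)

# Predict a value that is not on the plot: the weight for a height of 175 cm.
predicted = slope * 175 + intercept
print(slope, intercept, predicted)
```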
71. Correlations
• Scatter plots are useful to detect correlations
between the sets of data.
– Correlation is Positive when the values increase together
– Correlation is Negative when one value decreases as the other increases
More on scatter plots: https://www.mathsisfun.com/data/scatter-xy-plots.html
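The direction and strength of a correlation can also be expressed as a number. The sketch below computes the Pearson correlation coefficient, a standard measure chosen here as an assumption since the slides do not name one; it ranges from -1 (perfect negative) to +1 (perfect positive). The data and the name `pearson_r` are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])     # values increase together
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # one decreases as the other increases
print(r_up, r_down)
```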
72. Quiz: Scatter Plot
• The correlation seen in the graph at the right
would be best described as:
1. high positive correlation
2. low positive correlation
3. high negative correlation
4. low negative correlation
73. Dot Plot
• A dot plot is a graphical display of data using dots.
• It is an alternative to the bar chart, in which dots are
used to depict the quantitative values (e.g. counts)
associated with categorical variables.
74. Box Plot
• Box plots are useful to highlight outliers,
median and the interquartile range.
• aka box-and-whisker plots
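The numbers behind a box plot can be computed without drawing anything. The sketch below assumes two conventions that the slides do not specify: quartiles are taken as the medians of the lower and upper halves, and a value counts as an outlier when it lies more than 1.5 times the interquartile range beyond Q1 or Q3. The function name `box_plot_summary` is hypothetical; the data is the range example from the standard deviation slide.

```python
from statistics import median


def box_plot_summary(values, k=1.5):
    """Median, quartiles, IQR and outliers as a box plot would show them."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # With an odd count, the middle value belongs to neither half.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    q1, q2, q3 = median(lower), median(data), median(upper)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr   # assumed 1.5 * IQR rule
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr, "outliers": outliers}


summary = box_plot_summary([8, 11, 5, 9, 7, 6, 3616])
print(summary)
```

The extreme value 3616 is flagged as an outlier, while the median and the interquartile range stay close to the bulk of the data.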