1. STATISTICAL HYDROLOGY
MAL1303/MKAG1273
Summarizing Data
Dr. Shamsuddin Shahid
Associate Professor
Department of Hydraulics and Hydrology
Faculty of Civil Engineering
Room No.: M46-332;
Phone: 07-5531624; Mobile: 0182051586
Email: sshahid@utm.my
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
2. Office Location: C09-Room No. 219
11/23/2015 Shamsuddin Shahid, FKA, UTM11/23/2015
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
3. What is Statistics
Statistics is the science of Learning from Data.
Statistics is the science of collecting data and analyzing
(modeling) data for the purpose of decision making and
scientific discovery when the available information is both
limited and variable.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
4. Descriptive Statistics and Inferential Statistics
Descriptive statistics refers to statistical techniques used to
summarize and describe a data set. Measures of central tendency,
such as mean and median, and dispersion, such as range and
standard deviation, are the main descriptive statistics. Displays of
data, such as histograms and box-plots, are also considered
techniques of descriptive statistics.
Inferential statistics, or statistical induction, means the use of
statistics to make inferences concerning some unknown aspect.
Some examples of inferential statistics methods include hypothesis
testing, linear regression, and prediction.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
5. Statistical Hydrology
In statistical hydrology, we generally use four-step process in learning
from Data:
(1) Defining the problem
(2) Collecting the data
(3) Summarizing the data
(4) Analyzing data, interpreting the analyses, and communicating results.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
6. Defining the Problem
Defining the problem or the question to be addressed
• What is condition of water quality of the river?
• Is there a relationship between nitrogen fertilizer use and
groundwater quality?
• What effect of rainfall on river discharge in a particular river
basin?
• What is the main responsible factor of water quality
deterioration in a river?
• Is there any change in rainfall in last decade?
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
7. Collecting the Data
Proposing a study or experiment to collect meaningful data to answer
the problem.
• What variables have to measure?
• How many data/samples have to collect?
• How the samples should be selected?
• Etc..
The most crucial element in data collection is the manner in
which the sample is selected.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
8. Population
The population is the total set of measurements which could
hypothetically be taken from the entity being studied. It is the set of all
measurements of interest. For example, trace element composition of all
stream sediments in a area, groundwater depth as any point of a
catchment, etc.
The Statistical Sample
A sample is any subset of measurements selected from the population.
There may be confusion about sample in water resources and sample in
statistical. Water sample usually means one bottle of water. In statistics,
sample is a data that is actually available for analysis.
Population and Samples
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
9. Population and Samples
Sample is a subset of population.
Sample should be collected in such as way that it represent the whole
population
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
10. Sampling Size
• Generalization of sample size is not possible to made.
• It depends on the degree of intricacy of the problem being
addressed.
• Many times, sample size is possible to know in advance.
• Many cases, hydrologists collect reasonable number of
sample and analyze to see how accurately it represent the
population. According, they go for next set of sample
collection.
• However, many cases samples can not be collected according
to need.
• If the sample size is not adequate, different methods (non-
parametric) methods are used to analyze the data.
• Whatever it may be, sample size should never be less than 6.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
11. Sample Technique
In hydrology, one of the main decision about sampling
population is where to sample. Sampling techniques can be
classified as below:
1. Random Sampling
2. Stratified Sampling
3. Uniform Sampling
4. Regular Sampling
5. Clustered Sampling
6. Traverse Sampling
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
12. A random sample of n measurements
selected from a population containing
N measurements (N > n). A sample of
n measurements selected from a
population is said to be a random
sample if every different sample of size
n from the population has an equal
probability of being selected.
Sample data selected in a nonrandom
fashion are frequently distorted by a
selection bias. A selection bias exists
whenever there is a systematic
tendency to overrepresent or
underrepresent some part of the
population.
Random Sampling
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
13. The whole population is divided
into a number of groups or strata
according to a property and each
group is sampled differently. This
type of sampling is called Stratified
Sampling.
Let, we want to study
transmissivity of groundwater in a
catchment. Transmissivity of
groundwater heavy depends on
local geology. Therefore, if we
divided the catchment according
to geological units to study the
impact of geology on
transmissivity.
Stratified Sampling
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
14. Systematic Sampling
Uniform Sampling: Planned by
randomization within grid squares.
Regular or gridded: Planned on
rectangular or triangular grid.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
15. 1. Clustered: It is focused on
patchy distribution.
2. Traverse: Often forced by access
and exposure constraints or
logistics.
Systematic Sampling
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
16. Sample Technique
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
17. Data Quality
Garbage in – garbage out.
Hydrologist must be confident about data quality before
processing of data.
Selection of data analysis technique depends on quality of data.
Use of any measurement device must be accompanied by
awareness of precision and accuracy.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
18. Data Quality
Precision - A measurement is precise if repeated measurements of the
same entity are similar.
Accuracy – A measurement is accurate if it is close to the true value. In
water resources, the true value is usually unknown, although there are
standard that can be used for calibrating analytical equipment.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
19. Data Type
There are a variety of categories of data which may be encountered in
Hydrology. It is important to know the data type before selecting the
appropriate data analysis technique.
Ratio scale data
Ordinary measurements such as amount of rainfall, depth of
groundwater level, etc. This is the best quality and most versatile data
type.
Interval scale data
Interval scale data differ from ratio scale data in that the zero point is
not a fundamental termination of the scale. The classical example of
interval scale data is temperature measured in Centigrade.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
20. Data Type
Ordinal Scale Data
This category is of considerable lower quality that the Ratio or
Interval Scale Data. Only purpose of the scale is to place observations
in relative order. Consequently, it is not valid to apply addition or
subtraction, as well as division, to ordinal scale data. Non-parametric
methods are used to analysis the Ordinal Scale Data
Nominal or Categorical Data
Information is sometimes prescribed in the form of names. Such as
flood sometimes recorded as normal flood, severe flood, very severe
flood, etc. Sometimes, drought is recoded according to it’s
occurrence, such as drought occur in 1974, 1981, 1993, 1990, 2008, etc.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
21. Data Type
Directional Data
Data that is expressed in angle. Example of directional data: direction
of cyclone, direction of surface runoff, etc. Directional data require
special methods of analysis as the numerical values cycle around
through 360 degree.
Closed Data
There are lower and upper limits of this type of data. These are data
in the form of percentage, parts per million (ppm), etc. Such data
require cautious treatment, especially in bivariate and multivariate
methods, because variables are fundamentally interdependent.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
22. Continuous and Discrete Types of Variable
Discrete Variable
When observations on a quantitative random variable can assume only a
countable number of values, the variable is called a discrete random
variable. It can have on only a countable number of values.
Example: Number of days in a year with rainfall > 20 mm, number of
flood in a year, etc.
Continuous Variable
When observations on a quantitative random variable can assume any one
of the uncountable number of values, the variable is called a continuous
random variable. A continuous variable can take on any value over some
range.
Example: Daily rainfall, River Discharge, Groundwater Depth, etc.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
23. Summarization of Data
To analyze any collection of water resources data appropriately,
the first consideration must be the characteristics of the data
themselves. Therefore, first we need to know the common
characteristics of water resources data we want to study.
Summarization of data means making a summary of the data in
the forms which convey their important characteristics.
As for example, if we want to analyze rainfall data of a station,
first we need to know:
1.What is the average rainfall of that station?
2.How the rainfall vary in the station?
3.How the rainfall distributed over the year?
4.How often heavy rainfall occur?
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
24. Summarization of Data
Summarization of data means making a summary of the data
in the forms which convey their important characteristics such
as:
• A measure of the center of the data,
• A measure of spread or variability,
• A measure of the symmetry of the data distribution,
• Percentile of the data (estimates of extremes)
• Etc…
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
25. The Mean and Median are the two most commonly-
used measures use to describe center of the data.
Beside that Mode, Geometric Mean and Trimmed
Mean are also sometimes used to summarize some water
resources data.
We need to understand:
• What are the properties of these measures?
• When should one be employed over the other?
Measurement of the center of the data
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
26. The Mean - classical measurement of data
Mean of a set of measurements is defined to be the sum of the
measurements divided by the total number of measurements.
The sample mean is denoted by the symbol, X (x-bar) and
The population mean is denoted by (mu’).
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
27. Computation of Mean - Example
Discharge in a river in every consecutive three days in the month of
January is measured as below. What is mean value of river discharge in
January?
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
28. Measurement of the center of the data: Mean
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
29. Influence of an observation on the overall mean
is the distance between the observation and the
mean excluding the observation.
Therefore, influence of observation 4.0 on
overall mean is (4.0 – 3.3) = 0.7
The Mean
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
30. Influence of observation on the Mean
Mean Discharge = 3.4 cumec
• All observations do
not have same
influence on mean.
• Observations closer to
mean has less
influence on mean
compared to the
observations away
from the mean.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
31. Mean discharge in the month of January is 6.1 cumec
If we remove the extreme value 34.8 (outlier) from the
observation, the mean discharge = 3.2 cumec
Influence of observation on the Mean
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
32. The Mean: Summary
1. It is the arithmetic average of the measurements in a data
set.
2. There is only one mean for a data set.
3. Its value is influenced by extreme measurements; trimming
can help to reduce the degree of influence.
4. Means of subsets can be combined to determine the mean
of the complete data set.
5. It is applicable to quantitative data only.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
33. The Median
The median of a set of measurements is defined to be the middle value
when the measurements are arranged from lowest to highest.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
34. The Median
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
35. The Median
No change in Median
The median is only
minimally affected by
the magnitude of a single
observation. Therefore,
Median is often known
as resistant measure of
central value.
This resistance to the
effect of a change in
value or presence of
outlying observations is
often a desirable
property in water
resources data analysis.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
36. 1. It is the central value; 50% of the measurements lie
above it and 50% fall below it.
2. There is only one median for a data set.
3. It is less influenced by extreme measurements.
4. Medians of subsets cannot be combined to determine
the median of the complete data set.
5. For grouped data, its value is rather stable even when
the data are organized into different categories.
6. It is applicable to quantitative data only.
The Median: Summary
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
37. When a summary value is desired that is not strongly
influenced by a few extreme observations, the median is
preferable to the mean.
One such example is the chemical concentration in water one
might expect to find over many streams in a given region. Using
the median, one sample with unusually high concentration has
no greater effect on the estimate than one with low
concentration. The mean concentration may be pulled towards
the outlier, and be higher than concentrations found in most of
the samples.
Preference of Mean or Median
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
38. The mode is the most frequently observed value.
It is far more applicable for
• Discrete Data
• Grouped Data
• Data which are recorded only as falling into a finite
number of categories
It is very easy to obtain
It is less applicable for continuous data
When applied on continuous data, it gives a poor measure
of ceter
Its value often depends on the arbitrary grouping of those
data
Measurement of the center of the data: Mode
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
39. Let consider, in a location rainfall greater than 100 mm in a day considered
as a heavy rainfall day. Data show the number of heavy rainfall days in
different years over the time period 2000-2010. How frequently heavy
rainfall occur that location?
Measurement of the center of the data: Mode
Mean = 3.1 days
Median = 3 days
Mode = 4 days
It is more meaningful if
you say that heavy
rainfall usually happens
4 times in a year.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
40. Measurement of the center of the data: Mode
Mode is not influenced by extreme measurements.
Heavy rainfall days in different years over the time period 2000-2010
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
41. Measurement of the center of the data: Mode
Heavy rainfall days in different years over the time period 2000-2010
There can be more than one mode for a data set. This type of data
set is called multi-modal data.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
42. The Mode: Summary
1. It is the most frequent or probable measurement in the data
set.
2. There can be more than one mode for a data set.
3. It is not influenced by extreme measurements.
4. Modes of subsets cannot be combined to determine the
mode of the complete data set.
5. For grouped data its value can change depending on the
categories used.
6. It is applicable for both qualitative and quantitative data.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
43. Geometric Mean
The geometric mean, is a type of mean, which indicates the central
tendency
• Geometric mean applies only to positive numbers.
• It is less influenced by extreme values
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
44. Trimmed Mean
Trimmed mean drops the highest and lowest extreme values and
averages the rest. For example, a 5% trimmed mean drops the highest 5%
and the lowest 5% of the measurements and averages the rest. Similarly,
a 25% trimmed mean drops the highest and the lowest 25% of the
measurements and averages the rest.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
45. Trimmed Mean
14 samples are collected to monitor the Nitrate concentration in
groundwater as below:
20% trimmed mean of the data:
Lower limit of window = 14*o.20 = 2.8 3
Lower limit of window = 14*o.80 = 11.2 11
Trimmed mean = 2.8
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
46. Measurement of Spread
1. Range
2. pth percentile
3. Interquartile Range
4. Deviation
5. Variance
6. Standard deviation
7. Variation of coefficient
The degree of variation of data values can be diagnostic in water
resources. For example, variation in annual rainfall tells us how
rainfall is reliable in an area.
It is also important to obtain a measure of the variability of values in
a population in order to assess the number of observations we need
to draw useful conclusions and to assess the reliability of our
estimates of parameters.
Measures used to show the data variability are:
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
47. Measurement of Spread: Range
The range is defined as the difference between the largest and the
smallest measurements of the set.
Range is the simplest but least useful measure of data variation.
For grouped data, because we do not
know the individual measurements,
the range is taken to be the
difference between the upper limit
of the last interval and the lower
limit of the first interval.
Range = 4.8 – 2.5
= 2.3
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
48. Measurement of Spread: p-th percentile
The p-th percentile of a set of n measurements (arranged in order of
magnitude) is that value that has at most p% of the measurements
below it and at most (100 - p)% above it.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
49. Measurement of Spread: Interquartile Range
The interquartile range (IQR) of a set of measurements is defined to be
the difference between the upper and lower quartiles; that is,
IQR = 75th percentile - 25th percentile
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
50. Measurement of Spread: Deviation
Deviation defines how a particular measurement deviate from the mean
of the set of measurements.
The deviations of the measurements are computed by using the formula
=
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
51. Measurement of Spread: Variance
A more easily interpreted function of the deviations involves the
sum of the squared deviations of the measurements from their
mean. This measure is called the variance.
The variance of a set of n measurements X1, X2, … Xn with mean is
the sum of the squared deviations divided by n – 1:
We have special symbols to denote the sample and population
variances. The symbol s2 represents the sample variance, and the
corresponding population variance is denoted by the symbol 2.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
52. Measurement of Spread: Standard Deviation
The standard deviation of a set of measurements is defined to be the
square root of the variance.
One reason for defining the standard deviation is that it yields a
measure of variability having the same units of measurement as the
original data, whereas the units for variance are the square of the
measurement units.
s denotes the sample standard deviation and denotes the
corresponding population standard deviation.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
53. Measurement of Spread: Group Data
We need a simple modification of the formula to calculate the
sample variance of grouped data are available.
If Xi and fi denote the midpoint and frequency, respectively, for
the i-th class interval, then variance of group data:
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
54. • Means and variances are ways to describe a distribution of
data.
• But all the details of the data can not be understood with only
mean and variance.
• A histogram is one of the most effective ways to depict a
frequency distribution
• To know details of the data distribution we need to draw
frequency histogram and the relative frequency histogram
• Frequency is the number of times a variable takes on a
particular group
• Note that any variable has a frequency distribution
Distribution of Data
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
55. Total annual rainfall (in millimeter) in a station situated in the sub-
tropical region is given below.
940 2118 1684 1415 1111 1303 1526 1667
1182 1301 1995 1692 2293 1903 870 1678
1647 1236 1180 1515 1650 1614 1384 1845
1315 1879 2326 1539 2110 2038 1976 1117
1710 2066 1295 1814 1265 1419 1751 1582
1446 1468 1810 1471 2102 1895 2474 1645
1769 2119
Distribution of Data
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
56. 940 2118 1684 1415 1111 1303 1526 1667
1182 1301 1995 1692 2293 1903 870 1678
1647 1236 1180 1515 1650 1614 1384 1845
1315 1879 2326 1539 2110 2038 1976 1117
1710 2066 1295 1814 1265 1419 1751 1582
1446 1468 1810 1471 2102 1895 2474 1645
1769 2119
Distribution of Data
Minimum Rainfall = 870 mm
Maximum Rainfall = 2474 mm
Range = 1604
Group Interval = Range / (number of group -1)
= 1604 / (9 – 1)
200
Groups: 801 – 1000; 1001-1200; 1201 – 1400; 1401 – 1600;
1601 – 1800; 1801 – 2000; 2001 – 2200; 2201 – 2400;
2401 - 2600
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
57. Distribution of Data: Histogram
• A histogram is used to graphically summarize the distribution of a data set
• A histogram divides the range of values in a data set into intervals
• Over each interval is placed a bar whose height represents the frequency of
data values in the interval.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
58. Distribution of Data
Number of group is 17
Because of the arbitrariness in the choice of number of intervals, starting
value, and length of intervals, histograms can be made to take on different
shapes for the same set of data, especially for small data sets. Histograms
are most useful for describing data sets when the number of data points is
fairly large, say 50 or more.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
59. Distribution of Data
Number of
group is 5
When the number of data points is relatively small and the number of intervals
is large, the histogram fluctuates too much—that is, responds to a very few
data Values. This results in a graph that is not a realistic depiction of the
histogram for the whole population. When the number of class intervals is too
small, most of the patterns or trends in the data are not displayed.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
60. • Frequencies can be absolute (when the frequency
provided is the actual count of the occurrences) or relative
(when they are normalized by dividing the absolute
frequency by the total number of observations [0, 1])
• Relative frequencies are particularly useful if you want to
compare distributions drawn from two different sources
(i.e. while the numbers of observations of each source may
be different)
Absolute and Relative Frequency Distribution
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
61. Distribution of Data
We can compare two different samples (or
populations) by examining their relative
frequency histograms even if the samples
(populations) are of different sizes, because
we use proportions rather than frequencies
in a relative frequency histogram.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
62. Distribution of Data
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
63. Distribution of Data
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
64. Distribution of Data
This rule applies to data sets
with roughly a “mound-
shaped’’ histogram—that is, a
histogram that has a single
peak, is symmetrical, and
tapers off gradually in the
tails.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
65. Distribution of Data
M0 – Sample Mean
Md – Sample Median
µ - Population mean
TM – Trimmed mean
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
66. Coefficient of variation measures the
variability in the values in a population
relative to the magnitude of the
population mean. In a process or
population with mean and standard
deviation , the coefficient of variation
is defined as,
Coefficient of Variation
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
67. Coefficient of Variation shows how much the hydrological variables vary.
For example, if we calculate Coefficient of Variation over annual rainfall
time series, it weill indicate the reliability of rainfall. If the value is very
high, it means it varies very wide from year to year and less reliable.
Coefficient of Variation
In both cases annual average rainfall is 580 mm. However, rainfall in second
area is more reliable compared to first one.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
68. Measures of Skewness and Kurtosis
• A fundamental task in many statistical analyses is to characterize
the location and variability of a data set (Measures of central
tendency vs. measures of dispersion)
• Both measures tell us nothing about the shape of the distribution
• A further characterization of the data includes skewness and
kurtosis
• The histogram is an effective graphical technique for showing both
the skewness and kurtosis of a data set
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
69. The term skewness refers to the lack of symmetry.
Skewness measures the degree of asymmetry exhibited by
the data
If skewness equals zero, the histogram is symmetric about
the mean
3
1
3
)(
ns
xx
skewness
n
i
i
Skewness
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
70. Consider the two distributions in the figure just below. Within
each graph, the bars on the right side of the distribution taper
differently than the bars on the left side. These tapering sides
are called tails, and they provide a visual means for determining
which of the two kinds of skewness a distribution has:
1. negative skew: The left tail is longer; the mass of the
distribution is concentrated on the right of the figure. The
distribution is said to be left-skewed, left-tailed, or skewed to the
left.
2. positive skew: The right tail is longer; the mass of the
distribution is concentrated on the left of the figure. The
distribution is said to be right-skewed, right-tailed, or skewed to
the right.
Skewness
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
71. Skewness
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
72. Skewness in a data series may be observed not only
graphically but by simple inspection of the values.
For instance, consider the numeric sequence (49, 50,
51), whose values are evenly distributed around a
central value of (50). We can transform this sequence
into a negatively skewed distribution by adding a value
far below the mean, as in e.g. (40, 49, 50, 51).
Similarly, we can make the sequence positively
skewed by adding a value far above the mean, as in
e.g. (49, 50, 51, 60).
Skewness
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
73. Skewness
Positive skewness
There are more observations below the mean than above it When the
mean is greater than the median
Negative skewness
There are a small number of low observations and a large number of high
ones When the median is greater than the mean
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
74. Skewness of a distribution cannot be determined simply by
inspection.
If Mean > Median, the skew is positive.
If Mean < Median, the skew is negative.
If Mean = Median, the skew is zero.
Skewness
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
75. • One measure of absolute skewness is difference between
mean and mode. A measure of such would not be true
meaningful because it depends of the units of
measurement.
• The simplest measure of skewness is the Pearson’s
coefficient of skewness:
Skewness
deviationStandard
Mode-Mean
skewnessoftcoefficiensPearson'
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
76. • Skewness coefficient varies between -3.o to +3.0.
• There is no acceptable range of skewness to measure the
distribution of data.
• Some people says that rule of thumb is -1 to +1 being
acceptable (-2 to +2 is often used too) for normal
distribution.
Skewness
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
77. Kurtosis measures how peaked the histogram is
The kurtosis of a normal distribution is 0 (zero)
Kurtosis characterizes the relative peakedness or flatness of a
distribution compared to the normal distribution
3
)(
4
4
ns
xx
kurtosis
n
i
i
Kurtosis
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
78. • Platykurtic– When the kurtosis < 0, the frequencies throughout the
curve are closer to be equal (i.e., the curve is more flat and wide).
Thus, negative kurtosis indicates a relatively flat distribution
• Leptokurtic– When the kurtosis > 0, there are high frequencies in
only a small part of the curve (i.e, the curve is more peaked). Thus,
positive kurtosis indicates a relatively peaked distribution
• Kurtosis is based on the size of a distribution's tails. Negative
kurtosis (platykurtic) – distributions with short tails. Positive
kurtosis (leptokurtic) – distributions with relatively long tails
leptokurticplatykurtic
Kurtosis
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
79. Coefficient of Kurtosis is the most important measure of kurtosis
which is based on the second and fourth moments :
Kurtosis
2
2
4
2
N
xxf
2
2
)(
N
xxf
4
4
)(
Where,
Second Momentum
Fourth Momentum
• If 2 -3 > 0, the distribution is leptokurtic.
• If , If 2 -3 < 0 the distribution is platykurtic.
• If , 2 -3 = 0 the distribution is mesokurtic (normal).
A kurtosis value of
+/-1 is considered
very good for most
uses, but +/-2 is
also usually
acceptable.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
80. Chebyshev’s theorem
According to Chebyshev’s theorem,
At least of the measurements will fall within
[Mean – (k-1)*SD] to [Mean + (k-1)*SD], where K = 2
Empirical rule
Give a set of n measurements possessing a mound-shaped histogram,
then
the interval X s contains approximately 68% of the measurements
the interval X 2s contains approximately 95% of the measurements
the interval X 3s contains approximately 99.7% of the measurements.
Chebyshev’s Rule
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
81. Empirical rule
Give a set of n measurements possessing a mound-shaped histogram, then
the interval X s contains approximately 68% of the measurements
the interval X 2s contains approximately 95% of the measurements
the interval X 3s contains approximately 99.7% of the measurements.
Chebyshev’s Rule
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
82. Outlier
An outlier is an observation that lies an abnormal distance from other
values in a random sample from a population.
Outlier or an outlying observation, is one that appears to deviate
markedly from other members of the sample in which it occurs.
Outliers can have many anomalous causes:
• A physical apparatus for taking measurements may have suffered a
transient malfunction.
• There may have been an error in data transmission or transcription.
• Outliers arise due to changes in system behaviour, fraudulent
behaviour, human error, instrument error
• simply through natural deviations in populations.
• A sample may have been contaminated with elements from outside
the population being examined.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
83. Identification of Outliers
There is no rigid mathematical definition of what constitutes an outlier.
Determining whether or not an observation is an outlier is ultimately a
subjective exercise.
Type 1 - Determine the outliers with no prior knowledge of the data. This is
essentially a learning approach. The approach processes the data as a
static distribution, pinpoints the most remote points, and flags them as
potential outliers.
Type 2 – Using model-based methods which assume that the data are from
a normal distribution, and identify observations which are deemed
"unlikely" based on mean and standard deviation.
• Chauvenet's criterion
• Grubbs' test
• Dixon's Q test
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
84. Chauvenet's criterion
• A value is measured experimentally in several trials as 9, 10, 10, 10,
11, and 50.
• The mean is 16.7 and the standard deviation 16.34.
• Value 50 differs from 16.7 by 33.3, slightly more than two standard
deviations.
• The probability of taking data more than two standard deviations from
the mean is roughly 0.05.
• Six measurements were taken, so the statistic value (data size
multiplied by the probability) is 0.05×6 = 0.3.
• Because 0.3 < 0.5, according to Chauvenet's criterion, the measured
value of 50 should be discarded (leaving a new mean of 10, with
standard deviation 0.7).
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
85. Grubbs' test detects one outlier at a time.
Gcalculated > Gtable then reject the questionable point.
Grubbs' test
Example: 9, 10, 10, 10, 11, and 50
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
86. Grubbs' test
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
87. To apply a Q test, arrange the data in order of increasing values
and calculate Q as defined:
Where gap is the absolute difference between the outlier in
question and the closest number to it.
If Qcalculated > Qtable then reject the questionable point.
Dixon's Q test, or simply the Q test
Example: 9, 10, 10, 10, 11, and 50
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
88. Dixon's Q test
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
89. Common Characteristics of Water Resources Data
1. A lower bound of zero. No negative values are possible.
2. Presence of 'outliers‘ regularly occur, specially outliers on the high
side are more common in water resources.
3. Non-normal distribution of data
4. Positive skewness is common.
5. Data reported only as below or above some threshold (censored
data). Examples include concentrations below one or more
detection limits, annual flood above a level, etc.
6. Seasonal patterns. Values tend to be higher or lower in certain
seasons of the year.
7. Positive autocorrelation. Consecutive observations tend to be
strongly correlated with each other. High values tend to follow
high values and low values tend to follow low values.
8. Dependence on other uncontrolled variables. Water discharge
from a well highly depends on hydraulic conductivity, sediment
grain size, or some other variable.
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)
90. STATISTICAL HYDROLOGY
MAL1303/MKAG1273
Summarizing Data
Dr. Shamsuddin Shahid
Associate Professor
Department of Hydraulics and Hydrology
Faculty of Civil Engineering
Room No.: M46-332;
Phone: 07-5531624; Mobile: 0182051586
Email: sshahid@utm.my
11/23/2015 Shamsuddin Shahid, FKA, UTM
You created this PDF from an application that is not licensed to print to novaPDF printer (http://www.novapdf.com)