Unit-1
Statistics
Definition1 :-
Statistics is the study of the collection, analysis, interpretation, presentation, and
organization of data.
In applying statistics to a scientific, industrial, or societal problem, it is necessary to
begin with a population or process to be studied. Populations can be diverse
topics such as "all persons living in a country" or "every atom composing a
crystal". Statistics deals with all aspects of data, including the planning of data collection
in terms of the design of surveys and experiments.
Definition2 :-
Statistics is a science of facts and figures and nothing beyond that. It is the
measurement of data and the expression of that data in numerical form.
Uses of statistics:
1. It is more quantitative than qualitative
2. Statistical method deals with two fundamental principles
3. Statistical unit
4. Statistical data must be manipulated
5. Presentation of statistical data with the help of line-diagram
1. It is more quantitative than qualitative:
Social statistics that present the data of an area must be numerical in nature, so
that we can measure the tendency of a project.
A percentage is quickly understood by everyone who hears it, so numerical data
is easy to record and easy to understand.
2. Statistical method deals with two fundamental principles:
Fundamental regularity based on mathematical probability
It speaks about the capacity of the researcher
Fundamental regularity based on mathematical probability:
It states that every social phenomenon is influenced by a large number of
variables, which are correlated and interrelated, and statistics is used to study
this correlation. Therefore the theory of probability, linear programs and
shadow prices are used to find out the reality.
It speaks about the capacity of the researcher:
To substantiate findings and conclusions, statistical terms are necessary, and
they save the researcher/scholar from danger and challenges. It is the data, facts and
figures that show the capacity of the researcher. The skills and resources
used by the researcher must be applied in the research findings.
3. Statistical Units:
A statistical unit has four characteristics:
Appropriateness
Clarity
Measurability
Comparability
4. Statistical data must be manipulated:
Statistical data must be manipulated, divided and totaled to formulate
conclusions.
5. Presentation of statistical data with the help of line-diagram:
Statistical data is presented with the help of line diagrams, graphs, charts,
histograms, frequency distributions, pie diagrams, etc.
Limitations of statistics:
Statistics is indispensable to almost all sciences - social, physical and natural. It is very often
used in most of the spheres of human activity. In spite of the wide scope of the subject it has
certain limitations. Some important limitations of statistics are the following:
1. Statistics does not study qualitative phenomena:
Statistics deals with facts and figures. So the quality aspect of a variable or the subjective
phenomenon falls out of the scope of statistics. For example, qualities like beauty, honesty,
intelligence etc. cannot be numerically expressed. So these characteristics cannot be examined
statistically. This limits the scope of the subject.
2. Statistical laws are not exact:
Statistical laws are not exact, as laws are in the natural sciences. These laws are true only on average.
They hold good under certain conditions. They cannot be universally applied. So statistics has
less practical utility.
3. Statistics does not study individuals:
Statistics deals with aggregate of facts. Single or isolated figures are not statistics. This is
considered to be a major handicap of statistics.
4. Statistics can be misused:
Statistics is mostly a tool of analysis. Statistical techniques are used to analyze and interpret the
collected information in an enquiry. As it is, statistics does not prove or disprove anything. It is
just a means to an end. Statements supported by statistics are more appealing and are commonly
believed. For this reason, statistics is often misused. Statistical methods rightly used are beneficial, but if
misused they become harmful. Statistical methods in inexpert hands will lead to
inaccurate results. Here the fault does not lie with the subject of statistics but with the person
who makes wrong use of it.
Frequency Distribution
Frequency:- Frequency is how often something occurs.
Example: Sam played football on
Saturday Morning, Saturday Afternoon, Thursday Afternoon
The frequency was 2 on Saturday, 1 on Thursday and 3 for the whole week.
Frequency Distribution
By counting frequencies we can make a Frequency Distribution table.
Example: Goals
Sam's team has scored the following numbers
of goals in recent games:
2, 3, 1, 2, 1, 3, 2, 3, 4, 5, 4, 2, 2, 3
Sam put the numbers in order, then added up:
how often 1 occurs (2 times),
how often 2 occurs (5 times),
etc,
and wrote them down as a Frequency
Distribution table.
Goals Frequency
1 2
2 5
3 4
4 2
5 1
From the table we can see interesting things such as
getting 2 goals happens most often
only once did they get 5 goals
Frequency Distribution:- values and their frequency (how often each value occurs).
Example: Newspapers
These are the numbers of newspapers sold at a local shop over the last 10 days:
22, 20, 18, 23, 20, 25, 22, 20, 18, 20
Let us count how many of each number there is:
Papers Sold Frequency
18 2
19 0
20 4
21 0
22 2
23 1
24 0
25 1
It is also possible to group the values. Here they are grouped in 5s:
Papers Sold Frequency
15-19 2
20-24 7
25-29 1
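The counting in the newspaper example can be sketched in Python. This is only an illustration: the variable names and the choice of `collections.Counter` are assumptions, not part of the original notes.

```python
from collections import Counter

# Newspapers sold over the last 10 days (data from the example above).
papers_sold = [22, 20, 18, 23, 20, 25, 22, 20, 18, 20]

# Ungrouped frequency distribution: value -> number of occurrences.
frequency = Counter(papers_sold)
for value in range(min(papers_sold), max(papers_sold) + 1):
    print(value, frequency[value])  # Counter gives 0 for values never seen

# Grouped distribution: classes of width 5 starting at 15 (15-19, 20-24, 25-29).
width, start = 5, 15
grouped = Counter((value - start) // width for value in papers_sold)
for k in sorted(grouped):
    low = start + k * width
    print(f"{low}-{low + width - 1}", grouped[k])
```

The grouped counts come out as 2, 7 and 1, matching the table above.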
Frequency Curve
A frequency curve is the smooth curve that corresponds to the limiting case of a histogram
computed for a frequency distribution of a continuous variable as the number of data points
becomes very large.
Measures of Central Tendency
Introduction
A measure of central tendency is a single value that attempts to describe a set of
data by identifying the central position within that set of data. Measures of
central tendency are sometimes called measures of central location. They are also
classed as summary statistics. The mean (often called the average) is most likely
the measure of central tendency that you are most familiar with, but there are
others, such as the median and the mode.
The mean, median and mode are all valid measures of central tendency.
Mean(Arithmetic)
The mean (or average) is the most popular and well known measure of central
tendency. It can be used with both discrete and continuous data, although its use
is most often with continuous data. The mean is equal to the sum of all the values
in the data set divided by the number of values in the data set. So, if we have n
values in a data set and they have values $x_1, x_2, x_3, \dots, x_n$, the sample mean,
usually denoted by $\bar{x}$ (pronounced "x bar"), is:
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$
This formula is usually written in a slightly different manner using the Greek
capital letter $\Sigma$, pronounced "sigma", which means "sum of...":
$$\bar{x} = \frac{\sum x}{n}$$
When not to use the mean
The mean has one main disadvantage: it is particularly susceptible to the
influence of outliers. These are values that are unusual compared to the rest of
the data set by being especially small or large in numerical value. For example,
consider the wages of staff at a factory below:
Staff 1 2 3 4 5 6 7 8 9 10
Salary 15k 18k 16k 14k 15k 15k 12k 17k 90k 95k
The mean salary for these ten staff is $30.7k. However, inspecting the raw data
suggests that this mean value might not be the best way to accurately reflect the
typical salary of a worker, as most workers have salaries in the $12k to $18k range.
The mean is being skewed by the two large salaries. Therefore, in this situation,
we would like to have a better measure of central tendency.
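A quick way to see the pull of the two large salaries is to compute the mean with and without them. The sketch below uses the salary table above; the variable names are illustrative assumptions.

```python
# Salaries in thousands of dollars, from the staff table above.
salaries_k = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

# Mean of all ten salaries: sum of the values divided by their count.
mean_all = sum(salaries_k) / len(salaries_k)

# Mean of the eight salaries that remain once the two outliers (90k, 95k) are set aside.
typical = [s for s in salaries_k if s < 50]
mean_typical = sum(typical) / len(typical)

print(mean_all)      # 30.7
print(mean_typical)  # 15.25
```

Two values out of ten are enough to double the mean, which is why it misrepresents the typical worker here.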
Median
The median is the middle score for a set of data that has been arranged in order
of magnitude.
If the number of values is even, the average of the two middle values is taken.
When the data contain outliers, the median describes the typical value better than the mean.
Example:-
In order to calculate the median, suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92
We first need to rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
Our median mark is the middle mark, the sixth of the eleven values: in this case, 56.
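The procedure (sort, then take the middle value or the average of the two middle values) can be sketched as a small Python function. The function name is an illustrative choice; Python's standard `statistics.median` does the same job.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
print(median(marks))  # 56
```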
Mode
The mode is the most frequent score in our data set.
What will happen to the measures of central tendency if we add the same amount
to all data values, or multiply each data value by the same amount?
Data set                        Values                                   Mean   Mode   Median
Original data set               6, 7, 8, 10, 12, 14, 14, 15, 16, 20      12.2   14     13
Add 3 to each data value        9, 10, 11, 13, 15, 17, 17, 18, 19, 23    15.2   17     16
Multiply each data value by 2   12, 14, 16, 20, 24, 28, 28, 30, 32, 40   24.4   28     26
When added: Since all values are shifted by the same amount, the measures of central tendency
are all shifted by the same amount. If you add 3 to each data value, you will add 3 to the mean,
mode and median.
When multiplied: Since all values are multiplied by the same factor, the measures
of central tendency are scaled by the same factor. If you multiply each data value by 2, you will
multiply the mean, mode and median by 2.
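The shift and scale behaviour can be checked numerically. The sketch below uses Python's standard `statistics` module; the helper name `summary` is an assumption made for illustration.

```python
from statistics import mean, median, mode

data = [6, 7, 8, 10, 12, 14, 14, 15, 16, 20]

def summary(values):
    """Return (mean, mode, median) for a list of numbers."""
    return mean(values), mode(values), median(values)

original = summary(data)
shifted = summary([x + 3 for x in data])  # add 3 to each value
scaled = summary([x * 2 for x in data])   # multiply each value by 2

print(original)
print(shifted)
print(scaled)
```

Each entry of `shifted` is 3 more than the matching entry of `original`, and each entry of `scaled` is twice it.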
Example :-1
Find the mean, median and mode for the following data: 5, 15, 10, 15, 5, 10, 10, 20,
25, 15.
Answer:-
(You will need to organize the data.)
5, 5, 10, 10, 10, 15, 15, 15, 20, 25
Mean:
$$\frac{\text{Sum of data}}{\text{Number of data}} = \frac{130}{10} = 13$$
Median: 5, 5, 10, 10, 10, 15, 15, 15, 20, 25
Listing the data in order is the easiest way to find the median.
The numbers 10 and 15 both fall in the middle.
Average these two numbers to get the median.
$$\frac{10 + 15}{2} = 12.5$$
Mode: Two numbers appear most often: 10 and 15.
There are three 10's and three 15's.
In this example there are two answers for the mode.
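Example 1 can be reproduced with Python's standard `statistics` module. `multimode` (available from Python 3.8) returns every value that ties for the highest frequency, which suits a data set with two modes; the variable names are illustrative.

```python
from statistics import mean, median, multimode

data = [5, 15, 10, 15, 5, 10, 10, 20, 25, 15]

data_mean = mean(data)                # 130 / 10
data_median = median(data)            # average of the two middle values, 10 and 15
data_modes = sorted(multimode(data))  # all values tied for most frequent

print(data_mean, data_median, data_modes)  # 13 12.5 [10, 15]
```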
Example :- 2
For what value of x will 8 and x have the same mean (average) as 27 and 5?
Answer:-
First, find the mean of 27 and 5:
$$\frac{27 + 5}{2} = 16$$
Now, find the x value, knowing that the
average of x and 8 must be 16:
$$\frac{x + 8}{2} = 16$$
$$\Rightarrow 32 = x + 8 \quad \text{(cross multiply)}$$
$$\Rightarrow x = 32 - 8 = 24$$
Example :- 3
On his first 5 biology tests, Bob received the following scores: 72, 86, 92, 63, and
77. What test score must Bob earn on his sixth test so that his average (mean
score) for all six tests will be 80? Show how you arrived at your answer.
Answer:-
Possible solution:
Set up an equation to represent the situation. Remember to use all 6 test
scores:
$$\frac{72 + 86 + 92 + 63 + 77 + x}{6} = 80$$
Cross multiply and solve:
$$(80)(6) = 390 + x \Rightarrow 480 = 390 + x \Rightarrow x = 480 - 390 = 90$$
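The same rearrangement works for any "what score is needed" question: the missing value is the target mean times the total count, minus the sum already obtained. A minimal sketch, with a hypothetical function name:

```python
def required_score(previous_scores, target_mean):
    """Score needed on one more test so that the overall mean equals target_mean."""
    total_tests = len(previous_scores) + 1
    return target_mean * total_tests - sum(previous_scores)

print(required_score([72, 86, 92, 63, 77], 80))  # 90
```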
Example:- 4
The mean (average) weight of three dogs is 38 pounds. One of the dogs, Sparky,
weighs 46 pounds. The other two dogs, Eddie and Sandy, have the same
weight. Find Eddie's weight.
Answer:-
Let x = Eddie's weight (they weigh the same, so they are both represented by "x".)
Let x = Sandy's weight
Average: sum of the data divided by the number of data.
$$\frac{x + x + 46}{3} = 38$$
Cross multiply and solve:
$$2x + 46 = 114 \Rightarrow 2x = 68 \Rightarrow x = 34$$
Eddie weighs 34 pounds.
Solution:-
Cost Number of items in the group Cumulative frequency
10-20 4 4
20-30 5 9
30-40 3 12
40-50 6 18
50-60 3 21
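Here $N = 21 \Rightarrow \frac{N}{2} = 10.5$
The median class is 30-40, the first class whose cumulative frequency (12) reaches 10.5.
From the formula,
$$\text{Median} = L + \left(\frac{\frac{N}{2} - cf}{f}\right) \times i$$
where $L$ is the lower limit of the median class, $cf$ the cumulative frequency of the classes before it,
$f$ the frequency of the median class and $i$ the class width:
$L = 30$, $i = 10$, $cf = 9$, $f = 3$
$$\text{Median} = 30 + \frac{10.5 - 9}{3} \times 10 = 30 + 5 = 35$$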
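The grouped-median formula can be sketched as a short function. It assumes the classes are given in ascending order as (lower limit, upper limit, frequency) triples, and it takes $f$ to be the frequency of the median class itself (3 for the 30-40 class), not its cumulative frequency; the function name is illustrative.

```python
def grouped_median(classes):
    """Median of grouped data given as (lower, upper, frequency) triples in ascending order."""
    total = sum(f for _, _, f in classes)
    if total == 0:
        raise ValueError("distribution has no observations")
    half = total / 2
    cumulative = 0  # cumulative frequency of the classes before the current one (cf)
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= half:
            # L + ((N/2 - cf) / f) * i, with i the class width
            return lower + (half - cumulative) / f * (upper - lower)
        cumulative += f

costs = [(10, 20, 4), (20, 30, 5), (30, 40, 3), (40, 50, 6), (50, 60, 3)]
print(grouped_median(costs))  # 35.0
```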
Question:- Find the Mode of the following distribution:
Class Interval 0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
Frequency 5 9 8 12 28 20 12 11
Solution:-
Maximum frequency = 28, modal class = 40-50
From the formula,
$$\text{Mode} = a + \frac{C(f_i - f_{i-1})}{2f_i - f_{i-1} - f_{i+1}}$$
$a = 40$, $C = 10$, $f_i = 28$, $f_{i-1} = 12$, $f_{i+1} = 20$
$$\text{Mode} = 40 + \frac{10(28 - 12)}{(2 \times 28) - 12 - 20} \approx 40 + 6.667 = 46.667$$
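A minimal sketch of the grouped-mode formula follows. It assumes equal class widths and, as a labeled assumption not stated in the notes, treats a missing neighbouring class (when the modal class is first or last) as having frequency 0. The function name is hypothetical.

```python
def grouped_mode(lower_limits, width, freqs):
    """Mode of grouped data: a + C(f_i - f_{i-1}) / (2 f_i - f_{i-1} - f_{i+1})."""
    i = max(range(len(freqs)), key=lambda k: freqs[k])      # index of the modal class
    f_i = freqs[i]
    f_prev = freqs[i - 1] if i > 0 else 0                   # assumption: 0 if no class before
    f_next = freqs[i + 1] if i + 1 < len(freqs) else 0      # assumption: 0 if no class after
    return lower_limits[i] + width * (f_i - f_prev) / (2 * f_i - f_prev - f_next)

lower_limits = [0, 10, 20, 30, 40, 50, 60, 70]
freqs = [5, 9, 8, 12, 28, 20, 12, 11]
print(round(grouped_mode(lower_limits, 10, freqs), 3))  # 46.667
```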
FAQs - Measures of Central Tendency
What is the best measure of central tendency?
There can often be a "best" measure of central tendency with regards to the data
you are analyzing, but there is no one "best" measure of central tendency. This is
because whether you use the median, mean or mode will depend on the type of
data you have, such as nominal or continuous
data; whether your data has outliers and/or is skewed; and what you are trying to
show from your data.
In a strongly skewed distribution, what is the best indicator of central tendency?
It is usually inappropriate to use the mean in situations where your data is
skewed. You would normally choose the median or mode, with the median
usually preferred. This is discussed above under the heading
"When not to use the mean".
Does all data have a median, mode and mean?
Yes and no. All continuous data has a median, mode and mean. However, strictly
speaking, ordinal data has a median and mode only, and nominal data has only a
mode. However, a consensus has not been reached among statisticians about
whether the mean can be used with ordinal data, and you can often see a mean
reported for Likert data in research.
When is the mean the best measure of central tendency?
The mean is usually the best measure of central tendency to use when your data
distribution is continuous and symmetrical, such as when your data is normally
distributed. However, it all depends on what you are trying to show from your
data.
When is the mode the best measure of central tendency?
The mode is the least used of the measures of central tendency, and it is the only
one that can be used when dealing with nominal data. For this reason, the mode will be the best
measure of central tendency (as it is the only one appropriate to use) when
dealing with nominal data. The mean and/or median are usually preferred when
dealing with all other types of data, but this does not mean the mode is never used with
these data types.
When is the median the best measure of central tendency?
The median is usually preferred to other measures of central tendency when your
data set is skewed (i.e., forms a skewed distribution) or you are dealing with
ordinal data. However, the mode can also be appropriate in these situations, but
is not as commonly used as the median.
What is the most appropriate measure of central tendency when the data has
outliers?
The median is usually preferred in these situations because the value of the mean
can be distorted by the outliers. However, it will depend on how influential the
outliers are. If they do not significantly distort the mean, using the mean as the
measure of central tendency will usually be preferred.
In a normally distributed data set, which is greatest: mode, median or mean?
If the data set is perfectly normal, the mean, median and mode are equal to each
other (i.e., the same value).
For any data set, which measures of central tendency have only one value?
The median and mean can only have one value for a given data set. The mode can
have more than one value.
MERITS AND DEMERITS OF MEAN, MEDIAN AND MODE
MEAN
The arithmetic mean (or simply "mean") of a sample is the sum of the sampled
values divided by the number of items in the sample.
MERITS OF ARITHMETIC MEAN
1. The arithmetic mean is rigidly defined by an algebraic formula.
2. It is easy to calculate and simple to understand.
3. It is based on all observations and can be regarded as
representative of the given data.
4. It is capable of being treated mathematically and hence it is widely used in
statistical analysis.
5. The arithmetic mean can be computed even if the detailed distribution is not
known, provided the sum of the observations and the number of observations are
known.
6. It is least affected by fluctuations of sampling.
DEMERITS OF ARITHMETIC MEAN
1. It can be determined neither by inspection nor by graphical location.
2. The arithmetic mean cannot be computed for qualitative data such as data on
intelligence, honesty and smoking habits.
3. It is strongly affected by extreme observations and hence does not
adequately represent data containing some extreme points.
4. The arithmetic mean cannot be computed when class intervals have open ends.
MEDIAN
The median is that value of the series which divides the group into two equal
parts, one part comprising all values greater than the median value and the other
part comprising all the values smaller than the median value.
MERITS OF MEDIAN
(1) Simplicity:- It is a very simple measure of the central tendency of the series. In the
case of a simple statistical series, just a glance at the data is enough to locate the
median value.
(2) Free from the effect of extreme values: - Unlike the arithmetic mean, the median
value is not distorted by the extreme values of the series.
(3) Certainty: - Certainty is another merit of the median. The median value is
always a certain specific value in the series.
(4) Real value: - The median value is a real value and is a better representative value of
the series than the arithmetic mean, the value of which may not exist
in the series at all.
(5) Graphic presentation: - Besides the algebraic approach, the median value can also be
estimated through the graphic presentation of data.
(6) Possible even when data is incomplete: - The median can be estimated even in the
case of certain incomplete series. It is enough if one knows the number of items
and the middle item of the series.
DEMERITS OF MEDIAN
Following are the various demerits of median:
(1) Lack of representative character: - The median fails to be a representative
measure for series whose values are wide apart from
each other. Also, the median is of limited representative character as it is not based
on all the items in the series.
(2) Unrealistic:- When the median is located somewhere between the two middle
values, it remains only an approximate measure, not a precise value.
(3) Lack of algebraic treatment: - The arithmetic mean is capable of further algebraic
treatment, but the median is not. For example, multiplying the median by the
number of items in the series will not give us the sum total of the values of the
series.
However, the median is quite a simple method of finding an average of a series. It is
quite a commonly used measure in the case of series that relate to
qualitative observations, such as the health of students.
MODE
The value of the variable which occurs most frequently in a distribution is called
the mode.
MERITS OF MODE
Following are the various merits of mode:
(1) Simple and popular: - The mode is a very simple measure of central tendency.
Sometimes, just a glance at the series is enough to locate the modal value. Because of its
simplicity, it is a very popular measure of central tendency.
(2) Less effect of marginal values: - Compared to the mean, the mode is less affected by
marginal values in the series. The mode is determined only by the value with the highest
frequency.
(3) Graphic presentation:- The mode can be located graphically, with the help of a
histogram.
(4) Best representative: - The mode is the value which occurs most frequently in the
series. Accordingly, the mode is the best representative value of the series.
(5) No need of knowing all the items or frequencies: - The calculation of the mode
does not require knowledge of all the items and frequencies of a distribution. In
a simple series, it is enough if one knows the items with the highest frequencies in the
distribution.
DEMERITS OF MODE
Following are the various demerits of mode:
(1) Uncertain and vague: - The mode is an uncertain and vague measure of central
tendency.
(2) Not capable of algebraic treatment: - Unlike the mean, the mode is not capable of
further algebraic treatment.
(3) Difficult: - When the frequencies of all items are identical, it is difficult to identify
the modal value.
(4) Complex procedure of grouping:- Calculation of the mode involves a cumbersome
procedure of grouping the data. If the extent of grouping changes, there will be a
change in the modal value.
(5) Ignores extreme marginal frequencies:- It ignores extreme marginal
frequencies. To that extent the modal value is not a representative value of all the
items in a series. Besides, one can question the representative character of the
modal value as its calculation does not involve all items of the series.
Dispersion
In statistics, dispersion (also called variability, scatter, or spread) denotes how
stretched or squeezed a distribution is (theoretical or that underlying a statistical
sample). Common examples of measures of statistical dispersion are the variance,
standard deviation and interquartile range.
Dispersion is contrasted with location or central tendency, and together they are
the most used properties of distributions.
Measures of dispersion
The set of constants which would in a concise way explain the "variability" or
"scatter" in data is called "measures of dispersion or variability".
The average for two groups of the same number of measurements may be equal,
but one group may be more variable than the other.
e.g. the set of five values 5, 6, 7, 8, 9 has mean 7, while the other set of five values
1, 6, 4, 10, 14 also has the same mean 7. The second set has more variability than
the first.
Usually four measures of dispersion or variability are defined.
Range:-
The range is the difference between the two extreme values.
In a frequency distribution, $R = (\text{largest } x \text{ value}) - (\text{smallest } x \text{ value})$
Example: In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9.
So the range is 9 - 3 = 6.
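As a small illustration, the range can be computed for the example set and for the two sets with equal means mentioned earlier (5, 6, 7, 8, 9 and 1, 6, 4, 10, 14). The function name is an illustrative choice.

```python
def value_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)

print(value_range([4, 6, 9, 3, 7]))     # 6
print(value_range([5, 6, 7, 8, 9]))     # 4
print(value_range([1, 6, 4, 10, 14]))   # 13
```

Both five-value sets have mean 7, but their ranges (4 against 13) show how differently they are spread.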
Quartile deviation:-
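The median bisects the distribution. If the distribution is divided into four equal parts,
quartiles are obtained. The first quartile is $Q_1$ and the third quartile is $Q_3$.
$$Q_1 = l + \frac{\frac{N}{4} - f_{Q_1}}{f} \times C \qquad Q_3 = l + \frac{\frac{3N}{4} - f_{Q_3}}{f} \times C$$
Where $l$ = lower limit of the quartile class
$f$ = frequency of the quartile class
$f_{Q_1}$, $f_{Q_3}$ = cumulative frequency of the classes before the respective quartile class
$C$ = common factor (the class width)
Quartile Deviation is defined as $Q.D. = \frac{1}{2}(Q_3 - Q_1)$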
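The quartile formulas can be sketched for grouped data as follows. This assumes the subtracted term is the cumulative frequency of the classes before the quartile class and that $C$ is the class width. The data reuse the cost distribution from the median solution earlier in these notes, and the function name is hypothetical.

```python
def grouped_quantile(classes, fraction):
    """Value below which `fraction` of the observations fall, for (lower, upper, frequency) classes."""
    total = sum(f for _, _, f in classes)
    target = fraction * total          # N/4 for Q1, 3N/4 for Q3
    cumulative = 0                     # cumulative frequency before the current class
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("distribution has no observations")

costs = [(10, 20, 4), (20, 30, 5), (30, 40, 3), (40, 50, 6), (50, 60, 3)]
q1 = grouped_quantile(costs, 0.25)
q3 = grouped_quantile(costs, 0.75)
quartile_deviation = (q3 - q1) / 2

print(q1, q3, quartile_deviation)  # 22.5 46.25 11.875
```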
Average Deviation:- If the chosen average is A, then the average deviation about A is
$$A.D.(A) = \frac{1}{n} \sum |x_i - A| \quad \text{for discrete data}$$
$$A.D.(A) = \frac{1}{N} \sum f_i |x_i - A| \quad \text{for a frequency distribution}$$
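A minimal sketch of the average deviation, dividing by the number of observations (or by the total frequency when frequencies are supplied). The function name and the optional `freqs` parameter are illustrative assumptions; the data are the two five-value sets with mean 7 from the dispersion section.

```python
def average_deviation(values, about, freqs=None):
    """Mean of |x - A|; each value is weighted by its frequency if freqs is given."""
    if freqs is None:
        freqs = [1] * len(values)
    total = sum(freqs)
    return sum(f * abs(x - about) for x, f in zip(values, freqs)) / total

print(average_deviation([5, 6, 7, 8, 9], 7))     # 1.2
print(average_deviation([1, 6, 4, 10, 14], 7))   # 4.0
```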
Standard deviation:-
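$$\sigma = \sqrt{\frac{1}{n} \sum (x_i - \bar{x})^2} \quad \text{for discrete data}$$
$$\sigma = \sqrt{\frac{1}{N} \sum f_i (x_i - \bar{x})^2} \quad \text{for a frequency distribution}$$
The square of the standard deviation, $\sigma^2$, is defined as the variance ($V$):
$$V = \sigma^2 = \frac{1}{N} \sum f_i (x_i - \bar{x})^2$$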
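The formulas above divide by the number of observations ($n$ or $N$), so the sketch below does the same; a version dividing by $n - 1$ is also in common use for samples and gives slightly larger values. The function names and the optional `freqs` parameter are assumptions for illustration.

```python
from math import sqrt

def variance(values, freqs=None):
    """Variance with divisor N: (1/N) * sum of f_i * (x_i - mean)^2."""
    if freqs is None:
        freqs = [1] * len(values)
    total = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / total
    return sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / total

def standard_deviation(values, freqs=None):
    """Square root of the variance."""
    return sqrt(variance(values, freqs))

print(variance([5, 6, 7, 8, 9]))              # 2.0
print(standard_deviation([1, 6, 4, 10, 14]))  # larger spread, same mean of 7
```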
Coefficient of variation
In probability theory and statistics, the coefficient of variation (CV) is a
standardized measure of dispersion of a probability distribution or frequency
distribution. It is defined as the ratio of the standard deviation $\sigma$ to the mean $\mu$. It
is also known as unitized risk or the variation coefficient. The absolute value of
the CV is sometimes known as relative standard deviation (RSD), which is
expressed as a percentage.
Definition
The coefficient of variation (CV) is defined as the ratio of the standard deviation $\sigma$
to the mean $\mu$:
$$C_v = \frac{\sigma}{\mu}$$
It shows the extent of variability in relation to the mean of the population.
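A short sketch of the coefficient of variation using Python's standard `statistics` module. `pstdev` is the standard deviation with divisor $n$, which matches the formulas in these notes; the function name is illustrative, and the data are the two five-value sets with mean 7 from the dispersion section.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Ratio of the standard deviation (divisor n) to the mean."""
    return pstdev(values) / mean(values)

cv_first = coefficient_of_variation([5, 6, 7, 8, 9])
cv_second = coefficient_of_variation([1, 6, 4, 10, 14])

# Same mean, but the second set varies more relative to it.
print(cv_first < cv_second)  # True
```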
Example:- The owner of a restaurant is interested in how much people spend at
the restaurant. He examines 10 randomly selected receipts for parties of four and
writes down the following data: 44, 50, 38, 96, 42, 47, 40, 39, 46, 50
Find the mean, standard deviation and variance.
Solution:-
The mean is calculated by adding the values and dividing by 10.
Mean = $\bar{x} = 49.2$
The following table is used to find the standard deviation
x       x - 49.2    (x - 49.2)^2
44      -5.2        27.04
50      0.8         0.64
38      -11.2       125.44
96      46.8        2190.24
42      -7.2        51.84
47      -2.2        4.84
40      -9.2        84.64
39      -10.2       104.04
46      -3.2        10.24
50      0.8         0.64
Total               2599.6
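The arithmetic in the table can be re-checked with a few lines of Python: adding the ten squared deviations listed gives 2599.6. The variance and standard deviation below use the divisor $n$, as in the formulas of these notes; that choice, like the variable names, is an assumption of this sketch.

```python
receipts = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

n = len(receipts)
mean = sum(receipts) / n                              # 49.2
squared_deviations = [(x - mean) ** 2 for x in receipts]
total = sum(squared_deviations)                       # about 2599.6

variance = total / n                                  # divisor n, per the notes' formula
standard_deviation = variance ** 0.5

print(round(mean, 1), round(total, 1), round(variance, 2), round(standard_deviation, 2))
```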
Standard Deviation= 𝜎