Human resources section2b-textbook_on_public_health_and_community_medicine
38 Introduction to Biostatistics
Seema R. Patrikar

The word statistics is derived from the Latin word ‘status’, which means state. In the early days the administration of the state required the collection of information regarding the population for the purpose of war. Around 2000 years ago, India had such a system of collecting administrative statistics. In the Mauryan regime a system of registration of vital events (births and deaths) existed. The Ain-i-Akbari is a collection of information gathered from various surveys conducted during the reign of Emperor Akbar.

The birth of statistics occurred in the mid-17th century. A commoner named John Graunt began reviewing a weekly church publication, issued by the local parish clerk, that listed the number of births, christenings and deaths in each parish. These so-called Bills of Mortality also listed the causes of death. Graunt, who was a shopkeeper, organized these data, which were published as Natural and Political Observations made upon the Bills of Mortality. The seventeenth-century contribution of the theory of probability laid the foundation of modern statistical methods.

Statistics has become increasingly important with passing time. Statistical methods are fruitfully applied to any problem of decision making where past information is available or can be made available. They help to weigh the evidence and draw conclusions. Statistics finds application in almost all fields of science; we hardly find any science that does not make use of it.

Definition of Statistics
Different authors have defined statistics differently. The best definition of statistics is given by Croxton and Cowden, according to whom statistics may be defined as the science which deals with the collection, presentation, analysis and interpretation of numerical data.

Definition of Biostatistics
Biostatistics may be defined as the application of statistical methods to medical, biological and public health related problems. It is the scientific treatment given to medical data derived from a group of individuals or patients.

Role of Statistics in Clinical Medicine
The main theory of statistics lies in the term variability. No two individuals are the same. For example, the blood pressure of a person may vary from time to time as well as from person to person. We can also have instrumental variability as well as observer variability. Methods of statistical inference provide largely objective means for drawing conclusions from the data about the issue under study. Medical science is full of uncertainties and statistics deals with uncertainties. Statistical methods try to quantify the uncertainties present in medical science and help the researcher to arrive at a scientific judgment about a hypothesis. It has been argued that decision making is an integral part of a physician’s work. Frequently, decision making is probability based.

Role of Statistics in Public Health and Community Medicine
Statistics finds extensive use in Public Health and Community Medicine. Statistical methods are foundations for public health administrators to understand what is happening to the population under their care at community level as well as individual level. If reliable information regarding the disease is available, the public health administrator is in a position to:
● Assess community needs
● Understand socio-economic determinants of health
● Plan experiments in health research
● Analyse their results
● Study diagnosis and prognosis of the disease for taking effective action
● Scientifically test the efficacy of new medicines and methods of treatment.

Statistics in public health is critical for calling attention to problems, identifying risk factors, and suggesting solutions, and ultimately for taking credit for our successes. The most important application of statistics in sociology is in the field of demography.

Statistics helps in developing sound methods of collecting data so as to draw valid inferences regarding the hypothesis. It helps us present the data in numerical form after simplifying the complex data by way of classification, tabulation and graphical presentation. Statistics can be used for comparison as well as to study the relationship between two or more factors. Such a relationship further helps to predict one factor from the other. Statistics helps researchers come to valid conclusions in answering their research questions.

Despite the wide importance of the subject it is looked upon with suspicion. “Lies, damned lies, and statistics” is part of a phrase attributed to Benjamin Disraeli and popularized in the United States by Mark Twain: “There are three kinds of lies: lies, damned lies, and statistics.” The semi-ironic statement refers to the persuasive power of numbers, and describes how even accurate statistics can be used to bolster inaccurate arguments. It is human psychology that when facts are supported by figures, they are easily believed. If wrong figures are used they are bound to give wrong conclusions; hence, when statistical theories are applied, the figures used must be free of all types of bias and must have been properly collected and scientifically analysed.

Broad Categories of Statistics
Statistics can broadly be split into two categories: Descriptive Statistics and Inferential Statistics. Descriptive statistics deals with the meaningful presentation of data such that its characteristics can be effectively observed. It encompasses the tabular, graphical or pictorial display of data, condensation of large data into tables, and preparation of summary measures to give a concise description of complex information and to exhibit patterns that may be found in data sets. Inferential statistics, however, refers to decisions. Medical research doesn’t stop at just describing the characteristics of a disease or situation. It tries to determine whether characteristics of a situation are unusual or if they have happened by chance. Because of this desire to generalize, the first step is to statistically analyse the
data.

In order to begin our analysis as to why statistics is necessary, we must begin by addressing the nature of science and experimentation. The characteristic method used by a researcher starting an experiment is to study a relatively small collection of subjects, as complete population based studies are time consuming, laborious, costly and resource intensive. The researcher draws a subset of the population, called a “sample”, and studies this sample in depth. But the conclusions drawn after analyzing the sample are not restricted to the sample; they are extrapolated to the population, i.e. people in general. Thus Statistics is the mathematical method by which the uncertainty inherent in the scientific method is rigorously quantified.

Summary
In recent times, the use of Statistics as a tool to describe various phenomena is increasing in biological sciences and health related fields, so much so that irrespective of the sphere of investigation, a research worker has to plan his/her experiments in such a manner that the kind of conclusions which he/she intends to draw should become technically valid. Statistics comes to this aid at the stages of planning of the experiment, collection of data, analysis and interpretation of measures computed during the analysis. Biostatistics is defined as the application of statistical methods to medical, biological and public health related problems. Statistics is broadly categorized into descriptive statistics and inferential statistics. Descriptive statistics describes the data in meaningful tables or graphs so that the hidden pattern is brought out. Condensing the complex data into a simple format and describing it with summary measures is part of descriptive statistics. Inferential statistics, on the other hand, deals with drawing inferences and taking decisions by studying a subset or sample from the population.

Study Exercises
Short Notes : (1) Differentiate between descriptive and inferential statistics (2) Describe briefly various scales of measurement.

MCQs & Exercises
1. An 85 year old man is rushed to the emergency department by ambulance during an episode of chest pain. The preliminary assessment of the condition of the man is performed by a nurse, who reports that the patient’s pain seems to be ‘severe’. The characterization of pain as ‘severe’ is (a) Dichotomous (b) Nominal (c) Quantitative (d) Qualitative
2. If we ask the patient attending OPD to evaluate his pain on a scale of 0 (no pain) to 5 (the worst pain), then this commonly applied scale is a (a) Dichotomous (b) Ratio scale (c) Continuous (d) Nominal
3. For each of the following variables indicate whether it is quantitative or qualitative and specify the measurement scale for each variable : (a) Blood Pressure (mmHg) (b) Cholesterol (mmol/l) (c) Diabetes (Yes/No) (d) Body Mass Index (Kg/m2) (e) Age (years) (f) Sex (female/male) (g) Employment (paid work/retired/housewife) (h) Smoking Status (smokers/non-smokers/ex-smokers) (i) Exercise (hours per week) (j) Drink alcohol (units per week) (k) Level of pain (mild/moderate/severe)
Answers : (1) d; (2) b; (3) (a) Quantitative continuous; (b) Quantitative continuous; (c) Qualitative dichotomous; (d) Quantitative continuous; (e) Quantitative continuous; (f) Qualitative dichotomous; (g) Qualitative nominal; (h) Qualitative nominal; (i) Quantitative discrete; (j) Quantitative discrete; (k) Qualitative ordinal.
39 Descriptive Statistics: Displaying the Data
Seema R. Patrikar

The observations made on the subjects one after the other are called raw data. Raw data are often little more than a jumble of numbers and hence very difficult to handle. Data are collected by researchers so that they can answer the research question they started with. Raw data become useful only when they are arranged and organized in a manner that lets us extract information from the data and communicate it to others. In other words, data should be processed and subjected to further analysis. This is possible through data depiction, data summarization and data transformation.

The first step in handling the data, after they have been collected, is to ‘reduce’ and summarise them so that they become understandable; only then can meaningful conclusions be drawn from them. Data can be displayed in either tabular form or graphical form. Tables are used to categorize and summarize data while graphs are used to provide an overall visual representation. To develop graphs and diagrams, we first need to condense the data in a table.

Understanding as to how the Data have been Recorded
Before we start summarizing or further analyzing the data, we should be very clear as to the ‘scale’ on which they have been recorded (i.e. qualitative or quantitative; and whether continuous, discrete, ordinal, polychotomous or dichotomous). The details have already been covered earlier in the chapter on variables and scales of measurement (section on epidemiology) and the readers should quickly revise that chapter before proceeding.

Ordered Data
When the data are organized in order of magnitude from the smallest value to the largest value, the result is called an ordered array. For example, consider the ages (in years) of 11 subjects undergoing a tobacco cessation programme: 16, 27, 34, 41, 38, 53, 65, 52, 20, 26, 68. When we arrange these ages in increasing order of magnitude we get the ordered array: 16, 20, 26, 27, 34, 38, 41, 52, 53, 65, 68. After observing the ordered array we can quickly determine that the youngest person is 16 years old and the oldest 68 years. We can also easily state that almost 55% of the subjects (6 of 11) are below 40 years of age, and that the midway person is aged 38 years.

Grouped Data - Frequency Table
Besides arranging the data in an ordered array, grouping of data is yet another useful way of summarizing them. We classify the data in appropriate groups which are called “classes”. The basic purpose behind classification or grouping is to help comparison and also to accommodate a large number of observations into a few classes only, by condensation, so that similarities and dissimilarities can be easily brought out. It also highlights important features and pinpoints the most significant ones at a glance.

Table 1 shows a set of raw data obtained from a cross-sectional survey of a random sample of 100 children under one year of age for malnutrition status. Information regarding age and sex of the child was also collected. We will use these data to illustrate the construction of various tables. If we show the distribution of children as per age, the result is called a simple table, as only one variable is considered.

Table - 1 : Raw data on malnutrition status (malnourished and normal) for 100 children below one year of age
Child   Sex   Age (months)   Malnutrition Status
1       f     6              Normal
2       m     4              Malnourished
3       m     2              Malnourished
4       m     5              Normal
5       m     3              Normal
6       f     1              Normal
7       m     5              Normal
8       f     8              Normal
9       f     7              Normal
10      f     9              Normal
11      f     10             Normal
12      f     2              Normal
13      m     4              Malnourished
14      f     6              Normal
15      m     8              Normal
16      f     1              Malnourished
17      f     2              Normal
18      m     11             Normal
19      m     12             Normal
20      m     11             Malnourished
21      m     10             Normal
22      f     9              Normal
23      f     5              Normal
24      f     6              Normal
25      m     4              Normal
26      f     7              Normal
27      f     11             Normal
28      f     12             Normal
29      m     10             Malnourished
30      m     4              Normal
31      m     6              Normal
32      m     8              Normal
33      m     12             Malnourished
34      m     1              Malnourished
35      m     1              Normal
36      f     3              Normal
37      m     5              Normal
38      f     6              Normal
39      f     8              Normal
40      f     9              Normal
41      f     10             Malnourished
42      m     1              Normal
43      f     12             Malnourished
44      f     2              Malnourished
45      m     1              Normal
46      m     6              Normal
47      m     4              Normal
48      f     9              Normal
49      f     4              Normal
50      m     9              Normal
51      m     7              Normal
52      m     6              Normal
53      m     4              Normal
54      f     2              Normal
55      m     5              Normal
56      m     3              Normal
57      f     1              Normal
58      m     5              Normal
Table - 1 (continued)
Child   Sex   Age (months)   Malnutrition Status
60      m     7              Malnourished
61      m     9              Normal
62      m     10             Normal
63      f     2              Normal
64      f     4              Normal
65      f     6              Normal
66      f     8              Normal
67      m     1              Normal
68      m     2              Normal
69      m     11             Normal
70      f     12             Normal
71      m     11             Normal
72      m     10             Malnourished
73      f     9              Normal
74      f     5              Normal
75      f     6              Normal
76      m     4              Normal
77      m     7              Normal
78      m     11             Normal
79      f     12             Normal
80      f     10             Normal
81      m     4              Normal
82      m     6              Malnourished
83      m     8              Normal
84      m     12             Normal
85      m     1              Normal
86      m     1              Normal
87      m     3              Normal
88      f     5              Normal
89      m     6              Normal
90      f     8              Normal
91      f     9              Normal
92      f     10             Normal
93      f     1              Normal
94      m     12             Normal
95      m     2              Normal
96      f     1              Normal
97      m     6              Normal
98      f     4              Malnourished
99      f     9              Malnourished
100     m     4              Normal

Steps in Making a Summary Table for the Data
To group a set of observations we select a set of contiguous, non-overlapping intervals such that each value in the set of observations can be placed in one and only one of the intervals. These intervals are usually referred to as class intervals. For example, the above data can be grouped into the age groups 1-4, 5-8 and 9-12. The class interval 1-4 includes the values 1, 2, 3 and 4. The smallest value, 1, is called its lower class limit whereas the highest value, 4, is called its upper class limit. The middle value of 1-4, i.e. 2.5, is called the midpoint or class mark. The number of subjects falling in the class interval 1-4 is called its class frequency. Such a presentation of data in class intervals along with frequencies is called a frequency distribution. When both limits are included in the range of values of the interval, the class intervals are known as inclusive type class intervals (e.g. 1-4, 5-8, 9-12, etc.), whereas when the lower boundary is included but the upper limit is excluded from the range of values, the class intervals are known as exclusive type class intervals (e.g. 1-5, 5-9, 9-13, etc.). The exclusive type of class interval is suitable for continuous variables. Tables can be formed for qualitative variables also.

Tables 2 and 3 display tabulation for a quantitative and a qualitative variable respectively.

Table - 2 : Age distribution of the 100 children
Age group (months)   Number of children
1-4                  36
5-8                  33
9-12                 31
Total                100

Table - 3 : Distribution of malnourishment in 100 children
Malnourishment Status   Number of children
Malnourished            17
Normal                  83
Total                   100

Tabulation which takes only one variable for classification is called a one way table. When two variables are involved the table is referred to as a cross tabulation or two way table. For example, Table - 4 displays the age and sex distribution of the children and Table - 5 displays the distribution of malnourishment status and sex of the children.

Table - 4 : Age and sex distribution of 100 children
Age group (months)   Female   Male   Total
1-4                  14       22     36
5-8                  15       18     33
9-12                 16       15     31
Total                45       55     100
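The grouping steps described above can be sketched in a few lines of Python. This is only an illustrative sketch: it uses the first 16 children of Table 1 rather than the full sample, and the names `children`, `age_group`, `one_way` and `two_way` are assumptions introduced here, not part of the text.

```python
from collections import Counter

# First 16 children of Table 1: (sex, age in months, malnutrition status).
children = [
    ("f", 6, "Normal"), ("m", 4, "Malnourished"), ("m", 2, "Malnourished"),
    ("m", 5, "Normal"), ("m", 3, "Normal"), ("f", 1, "Normal"),
    ("m", 5, "Normal"), ("f", 8, "Normal"), ("f", 7, "Normal"),
    ("f", 9, "Normal"), ("f", 10, "Normal"), ("f", 2, "Normal"),
    ("m", 4, "Malnourished"), ("f", 6, "Normal"), ("m", 8, "Normal"),
    ("f", 1, "Malnourished"),
]

# Inclusive type class intervals: both limits belong to the interval.
CLASS_INTERVALS = [(1, 4), (5, 8), (9, 12)]


def age_group(age):
    """Return the label of the one class interval that contains `age`."""
    for lower, upper in CLASS_INTERVALS:
        if lower <= age <= upper:
            return f"{lower}-{upper}"
    raise ValueError(f"age {age} falls outside every class interval")


# One way table: class frequency of each age group.
one_way = Counter(age_group(age) for _, age, _ in children)

# Two way table (cross tabulation): age group by sex.
two_way = Counter((age_group(age), sex) for sex, age, _ in children)

for lower, upper in CLASS_INTERVALS:
    label = f"{lower}-{upper}"
    print(label, one_way[label], two_way[(label, "f")], two_way[(label, "m")])
```

Because the intervals are contiguous and non-overlapping, every age falls into exactly one class, which is why a single pass over the data is enough to build both tables.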
Table - 5 : Malnourishment status and sex distribution of children
Malnourishment Status   Female   Male   Total
Malnourished            6        11     17
Normal                  39       44     83
Total                   45       55     100

How to Decide on the Number of Class Intervals?
When data are to be grouped it is required to decide upon the number of class intervals to be made. Too few class intervals would result in losing information. On the other hand, too many class intervals would not bring out the hidden pattern. The thumb rule is that we should not have fewer than 5 class intervals and no more than 15 class intervals. To be specific, experts have suggested a formula for the approximate number of class intervals (k) as follows:
k = 1 + 3.322 log10(N), rounded to the nearest integer, where N is the number of values or observations under consideration.
For example, if N = 25 we have k = 1 + 3.322 log10(25) = 5.64, i.e. approximately 6 class intervals.
Having decided the number of class intervals, the next step is to decide the width of the class interval. The width of the class interval is taken as:
Width = (Maximum observed value - Minimum observed value) / Number of class intervals (k) = Range / k
The class limits should preferably be rounded figures, and the class intervals should be non-overlapping and must include the range of the observed data. As far as possible the percentages and totals should be calculated column wise.

Graphical Presentation of Data
The tabular presentation discussed above shows the distribution of subjects in various groups or classes. This tabular representation of the frequency distribution is useful for further analysis and conclusion. But it is difficult for a layman to understand a complex distribution of data in tabular form. Graphical presentation of data is better understood and appreciated. Graphical representation brings out the hidden pattern and trends of complex data sets.
Thus the reason for displaying data graphically is two fold:
1) Investigators can have a better look at the information collected and the distribution of data and,
2) To communicate this information to others quickly.
We shall discuss in detail some of the commonly used graphical presentations.

Bar Charts : Bar charts are used for qualitative variables. The variable studied is plotted in the form of bars along the X-axis (horizontal) and the height of each bar is equal to the percentage or frequency, which is plotted along the Y-axis (vertical). The width of the bars is kept constant for all the categories and the space between the bars also remains constant throughout. The number of subjects, along with percentages in brackets, may be written on the top of each bar. When we draw a bar chart with only one variable or a single group it is called a simple bar chart, and when two variables or two groups are considered it is called a multiple bar chart. In a multiple bar chart the two bars representing two variables are drawn adjacent to each other and equal width of the bars is maintained. The third type of bar chart is the component bar chart, wherein we have two qualitative variables which are further segregated into different categories or components. In this the total height of the bar corresponding to one variable is further sub-divided into different components or categories of the other variable. For example, consider the following data (Table - 6), which show the findings of a hypothetical research work intended to describe the pattern of blood groups among patients of essential hypertension.

Table - 6 : Distribution of blood group of patients of essential hypertension
Blood Group   Number of patients (frequency)   Percentage
A             232                              42.81
B             201                              37.08
AB            76                               14.02
O             33                               6.09
Total         542                              100.00

A simple bar chart in respect of the above data on blood groups among patients of essential hypertension is represented in Fig. - 1. Similarly, a multiple bar chart of the data in Table - 5, on the distribution of malnourishment status among males and females, is shown in Fig. - 2. The same information can also be depicted in the form of a component bar chart as in Fig. - 3.

Fig. - 1 : Distribution of blood groups of patients with essential hypertension [simple bar chart; X-axis: Blood Groups (A, B, AB, O); Y-axis: Frequency]
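The rule for the number of class intervals and the class width can be sketched as follows. This is a minimal sketch, assuming the commonly quoted constant 3.322 (that is, 1/log10 2, as in Sturges' rule); the function names are hypothetical and the ages are those of the 11 subjects in the ordered-array example.

```python
import math


def number_of_class_intervals(n):
    """Approximate number of class intervals for n observations,
    rounded to the nearest integer (constant 3.322 is an assumption)."""
    return round(1 + 3.322 * math.log10(n))


def class_width(values, k):
    """Width of each class interval: range of the data divided by k."""
    return (max(values) - min(values)) / k


# Ages (years) of the 11 subjects in the tobacco cessation example.
ages = [16, 27, 34, 41, 38, 53, 65, 52, 20, 26, 68]

k = number_of_class_intervals(len(ages))
width = class_width(ages, k)
print(k, width)
```

For a sample this small the formula yields fewer classes than the thumb-rule minimum of 5, so in practice the result is best treated as a starting point to be adjusted, with the width rounded to a convenient figure.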
Fig. - 2 : Multiple Bar Chart showing the distribution of malnourishment status in males and females [X-axis: Malnourished, Normal; Y-axis: Number; bars: Females 6 and Males 11 (Malnourished), Females 39 and Males 44 (Normal)]

Fig. - 3 : Component Bar Chart showing the distribution of malnourishment status in males and females [X-axis: Female, Male; Y-axis: Number; components: Malnourished, Normal]

Pie Chart : Another interesting method of displaying categorical (qualitative) data is a pie diagram, also called a circular diagram. A pie diagram is essentially a circle divided into sectors, in which the angle at the center for each category is equal to its proportion multiplied by 360 (or, more easily, its percentage multiplied by 360 and divided by 100). A pie diagram is best when the total number of categories is between 2 and 6. If there are more than 6 categories, try to reduce them by “clubbing”, otherwise the diagram becomes too overcrowded.
A pie diagram in respect of the data on blood groups among patients of essential hypertension is presented below, after calculating the angles for the individual categories, as in Fig. - 4 a, b.

Fig. - 4a : Distribution of patients according to blood group [pie diagram; A : 43 %, B : 37 %, AB : 14 %, O : 6 %]

Fig. - 4b
Blood group A = (42.81 / 100) X 360 = 154 degrees
Blood group B = (37.08 / 100) X 360 = 134 degrees
Blood group AB = (14.02 / 100) X 360 = 50 degrees
Blood group O = (6.09 / 100) X 360 = 22 degrees

Frequency Curve and Polygon : To construct a frequency curve or frequency polygon we plot the variable along the X-axis and the frequencies along the Y-axis. Observed values of the variable, or the midpoints of the class intervals, are plotted against the corresponding frequency of that class interval. Then we construct a smooth freehand curve passing through these points. Such a curve is known as a frequency curve. If, instead of joining the midpoints by a smooth curve, we join the consecutive points by straight lines, the result is called a frequency polygon. Conventionally, we consider one imaginary value immediately preceding the first value and one succeeding the last value and plot them with frequency = 0. An example is given in Table - 7 and Fig. - 5.

Table - 7 : Distribution of subjects as per age groups
Age     Midpoints   Number of subjects
20-25   22.5        2
25-30   27.5        3
30-35   32.5        6
35-40   37.5        14
40-45   42.5        7
45-50   47.5        5
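The angle calculation for the pie diagram of the blood group data can be sketched as below. The frequencies are those of Table 6; the variable names are assumptions made for illustration, and the angles are computed from the frequencies directly (frequency / total × 360), which is equivalent to percentage × 360 / 100.

```python
# Frequencies of blood groups among the 542 patients (Table 6).
frequencies = {"A": 232, "B": 201, "AB": 76, "O": 33}

total = sum(frequencies.values())

# Angle at the centre for each category: proportion multiplied by 360.
angles = {group: count / total * 360 for group, count in frequencies.items()}

# Angles rounded to whole degrees for drawing with a protractor.
rounded_angles = {group: round(angle) for group, angle in angles.items()}

for group, angle in rounded_angles.items():
    print(f"Blood group {group}: {angle} degrees")
```

Working from the raw frequencies rather than from percentages that have already been rounded avoids compounding two rounding steps, so the rounded angles are more likely to add up to 360.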
Fig. - 5 : Distribution of subjects in different age groups [frequency polygon; X-axis: Age groups (15 to 55); Y-axis: Number of subjects (0 to 16)]

Stem-and-leaf plots : This presentation is used for quantitative data. To construct a stem-and-leaf plot, we divide each value into a stem component and a leaf component. The digit in the tens place becomes the stem component and the digit in the units place becomes the leaf component. The plot is of much utility in quickly assessing whether the data follow a “normal” distribution or not, by seeing whether the stem-and-leaf plot shows a bell shape or not. For example, consider a sample of 10 values of age in years : 21, 42, 05, 11, 30, 50, 28, 27, 24, 52. Here, 21 has a stem component of 2 and a leaf component of 1. Similarly, the second value 42 has a stem component of 4 and a leaf component of 2, and so on. The stem values are listed in numerical order (ascending or descending) to form a vertical axis. A vertical line is drawn to outline a stem. If the stem value already exists then the leaf is placed on the right side of the vertical line (Fig. - 6).
The value of each leaf is plotted in its appropriate location on the other side of the vertical line as in Fig. - 7.

Fig. - 6 (stems only)
0 |
1 |
2 |
3 |
4 |
5 |

Fig. - 7 (stems and leaves)
0 | 5
1 | 1
2 | 1 4 7 8
3 | 0
4 | 2
5 | 0 2

To describe the central location, spread and shape of the stem plot we rotate the stem plot by 90 degrees, just to explain it more clearly, as in Fig. - 8.

Fig. - 8 [rotated stem plot, annotated: “Rough estimate of the centre or middle observation i.e. median value (27.5)” and “Spread of the data”]

Roughly we can say that the spread of the data is from 5 to 52 and the median value is between 27 and 28. Regarding the shape of the distribution, though it is difficult to make firm statements about shape when n is small, we can always determine (Fig. - 9) :
● Whether data are more or less symmetrical or are extremely skewed
● Whether there is a central cluster or mound
● Whether there are any outliers

Fig. - 9 [shape of the rotated stem plot]

For the given example we notice the mound (heap) in the middle of the distribution. There are no outliers.

Histogram : The stem-and-leaf plot is a good way to explore distributions. A more traditional approach is to use a histogram. A histogram is used for quantitative continuous data where, on the X-axis, we plot the exclusive type class intervals and on the Y-axis we plot the frequencies. The difference between a bar chart and a histogram is that, since the histogram represents quantitative data measured on a continuous scale, there are no gaps between the bars. Consider an example of the data on serum cholesterol of 10 subjects (Table - 8 & Fig. - 10).

Table - 8 : Distribution of the subjects
Serum cholesterol (mg/dl)   No of subjects   Percentage
175 – 200                   3                30
200 – 225                   3                30
225 – 250                   2                20
250 – 275                   1                10
275 – 300                   1                10
Total                       10               100%
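The construction of the stem-and-leaf plot for the ten ages discussed earlier can be sketched as below. It is a minimal sketch that assumes two-digit, non-negative values (tens digit as stem, units digit as leaf); the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict
from statistics import median

# The sample of 10 ages (years) used in the stem-and-leaf example.
ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted list of leaves (units digits)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(ages)

# Print every stem in ascending order, including stems that have no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")

print("spread:", min(ages), "to", max(ages))
print("median:", median(ages))
```

Sorting the values before splitting them keeps the leaves of each stem in ascending order, which makes the centre and spread easy to read off the plot.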
Fig. - 10 : Distribution of subjects according to Serum Cholesterol Levels [histogram; X-axis: Serum Cholesterol Levels (mg/dl), classes 175-200, 200-225, 225-250, 250-275, 275-300; Y-axis labelled “% of subjects”]

Box-and-Whisker plot : A box-and-whisker plot reveals maximum information to the audience. A box-and-whisker plot can be useful for handling many data values. It allows people to explore data and to draw informal conclusions when two or more variables are present. It shows only certain statistics rather than all the data. Five-number summary is another name for the visual representation of the box-and-whisker plot. The five-number summary consists of the median, the quartiles (lower quartile and upper quartile), and the smallest and greatest values in the distribution. Thus a box-and-whisker plot displays the center, the spread, and the overall range of the distribution (Fig. - 11).

Fig. - 11 [box-and-whisker plot, labelled from top to bottom: Largest value, Upper Quartile (Q3), Median Quartile (Q2), Lower Quartile (Q1), Smallest value]

Line chart : A line chart is used for quantitative data. It is an excellent method of displaying the changes that occur in disease frequency over time. It thus helps in assessing “temporal trends” and helps in displaying data on epidemics or localised outbreaks in the form of an epidemic curve. In a line diagram, the rates of disease are plotted along the vertical (Y) axis. However, in localised outbreaks with a well demarcated population that has been at risk (such as sudden outbreaks of food poisoning), the actual numbers can be plotted on the Y-axis during quick investigations. The unit of time, as applicable to the disease in question, is plotted along the horizontal (X) axis. This unit of time would be hours in food poisoning, days (i.e. as per dates of the month) for cholera, weeks for typhoid, malaria or Hepatitis-A, months for Hepatitis-B and years (or even decades) for IHD or Lung Cancer.

Scatter Diagram : A scatter diagram gives a quick visual display of the association between two variables, both of which are measured on a numerical continuous or numerical discrete scale. An example of a scatter plot between age (in months) and body weight (in kg) of infants is given in Fig. - 12.

Fig. - 12 : Scatter Diagram of the association between Age and Body Weight of infants [X-axis: Age in months (0 to 14); Y-axis: Body Weight (Kgs.) (0 to 12)]

The scatter diagram in the above figure shows at once that weight and age are associated: as age increases, weight increases. Be careful to record the dependent variable along the vertical (Y) axis and the independent variable along the horizontal (X) axis. In this example weight is dependent on age (as age increases weight is likely to increase) but age is not dependent on weight (if weight increases, age will not necessarily increase). Thus, weight is the dependent variable, and has been plotted on the Y axis, while age is the independent variable, plotted along the X axis.

Summary
Raw information, which is just a jumble of numbers collected by the researcher, needs to be presented and displayed in a manner that makes sense and allows further processing. Data presented in an eye-catching way can highlight particular figures and situations, draw attention to specific information, highlight hidden patterns and important information, and simplify complex information. Raw information can be presented either in tables, i.e. tabular presentation, or in graphs and charts, i.e. graphical presentation. A table consists of rows and columns. The data are condensed in homogenous groups called class intervals, and the number of individuals falling in each class interval, called the frequency, is displayed. A table is incomplete without a title. A clear title describing the data completely in concise form is written. Graphical presentation is used when data need to be displayed in charts and graphs. A chart or diagram should have
a clear title describing the data depicted. The X-axis and the Y-axis should be properly defined along with the scale. A legend is necessary in case of more than one variable or group. An optional footnote giving the source of information may be present. The appropriate graphical presentation should be chosen depending on whether the data are quantitative or qualitative. While dealing with quantitative data, histograms, line charts, polygons, stem-and-leaf and box-and-whisker plots should be used, whereas bar charts, pictograms and pie charts should be used when dealing with qualitative data.

Study Exercises
Long Question : Discuss the art of effective presentation in the field of health, in respect of data and information, so as to convince the makers of decision.
Short Notes : (1) Discuss the need for graphical presentation of data (2) Differentiate between inclusive and exclusive type of class intervals (3) Box and Whisker Plot (4) Scatter diagram

MCQs
1. Which of the following is used for representing qualitative data (a) Histogram (b) Polygon (c) Pie chart (d) Line chart
2. The scatter plot is used to display (a) Causality (b) Correlation (c) Power (d) Type II error
3. Five summary plot consists of Quartiles and (a) Median (b) Mode (c) Mean (d) Range
4. The appropriate method of displaying the changes that occur in disease frequency over time (a) Line chart (b) Bar chart (c) Histogram (d) Stem and leaf.
5. Box and whisker plot is also known as (a) Magical box (b) Four summary plot (c) Five summary plot (d) None of the above
6. The type of diagram useful to detect linear relationship between two variables is (a) Histogram (b) Line Chart (c) Scatter Plot (d) Bar Chart
7. The following table shows the age distribution of cases of a certain disease reported during a year in a particular state. Which graphical presentation is appropriate to describe this data? (a) Pie chart (b) Line chart (c) Histogram (d) Pictogram

Age     Number of cases
5-14    5
15-24   10
25-34   120
35-44   22
45-54   13
55-64   5

8. Which graphical presentation is best to describe the following data? (a) Multiple bar chart (b) Pie chart (c) Histogram (d) Box plot

Year      Exports (crores of rupees)   Imports (crores of rupees)
1960-61   610.3                        624.65
1961-62   955.39                       742.78
1962-63   660.65                       578.36
1963-64   585.25                       527.98

9. Of the 140 children, 20 lived in owner occupied houses, 70 lived in council houses and 50 lived in private rented accommodation. Type of accommodation is a categorical variable. Appropriate graphical presentation will be (a) Line chart (b) Simple Bar chart (c) Histogram (d) Frequency Polygon
10. A study was conducted to assess the awareness of phimosis in young infants and children up to 5 years of age. The awareness level with respect to the family income is as tabulated below. Which graphical presentation is best to describe the following data?

          <2000   2000 – 5000   5000 – 8000   >8000
Aware     50      62            77            70
Unaware   50      28            23            30

(a) Stem & Leaf (b) Pie Chart (c) Multiple Bar Chart (d) Component Bar Chart
11. Following is the frequency distribution of the serum levels of total cholesterol reported in a sample of 71 subjects. Which graphical presentation is best to describe the following data?

Serum cholesterol level   Frequency
<130                      2
130-150                   7
150-170                   18
170-190                   20
190-210                   15
210-230                   7
>230                      2

(a) Stem & Leaf (b) Pie (c) Histogram (d) Component Bar Chart
12. Information from the Sports Committee member on representation in different games at the state level by gender is as given below. Which graphical presentation is best to describe the following data

Different Games   Females   Males
Long Jump         4         6
High Jump         2         4
Shot Put          9         11
Running           15        10
Swimming          5         4
10. (a) Box plot (b) Histogram (c) Multiple Bar Chart (d) Pie Statistical Exercise
chart 1. Following is the population data in a locality, present the
13. Which graphical presentation is best to describe the data in tabular form as well as using appropriate graphs.
following data
S. No. Age S. No. Age S. No. Age
Grade of malnutrition Frequency
1 11 11 8 21 16
Normal 60
2 15 12 12 22 17
Grade I 30
3 6 13 22 23 19
Grade II 7
4 17 14 24 24 8
Grade III 2
5 18 15 16 25 9
Grade IV 1
6 7 16 19 26 10
(a) Box Plot (b) Component Bar Chart (c) Histogram (d) Pie 7 25 17 20 27 24
chart 8 32 18 9 28 31
Answers : (1) c; (2) b; (3) a; (4) a; (5) c; (6) c; (7) c; (8) a; (9) b;
(10) d; (11) c; (12) c; (13) d. 9 12 19 21 29 32
10 34 20 31 30 37
40 Summarising the Data: Measures of Central Tendency and Variability
Seema R. Patrikar
The huge raw information gathered by the researcher is organized and condensed in a table or graphical display. Compiling and presenting the data in tabular or graphical form will not give complete information of the data collected. We need to “summarise” the entire data in one figure, looking at which we can get overall idea of the data. Thus, the data set should be meaningfully described using summary measures. Summary measures provide description of data in terms of concentration of data and variability existing in data. Having described our data set we use these summary figures to draw certain conclusions about the reference population from which the sample data has been drawn. Thus data is described by two summary measures namely, measure of central tendency and measure of variability. Before we discuss the various measures in detail, we should understand the distribution of the data set.
Measures of Central Tendency
This gives the centrality measure of the data set i.e. where the observations are concentrated. There are numerous measures of central tendency. These are : Mean; Median; Mode; Geometric Mean; Harmonic Mean.
Mean (Arithmetic Mean) or Average
This is the most appropriate measure for data following normal distribution but not for skewed distributions. It is calculated by summing all the observations and then dividing by the number of observations. It is generally denoted by $\bar{x}$. It is calculated as follows.

$$\text{Mean } (\bar{x}) = \frac{\text{Sum of the values of all observations}}{\text{Total number of observations, that is, the total number of subjects (denoted by } n)}$$

Mathematically,

$$\bar{x} = \frac{\sum_i x_i}{n}$$

It is the simplest of the centrality measures but is influenced by extreme values and hence at times may give fallacious results. It depends on all values of the data set but is affected by the fluctuations of sampling.
Example : The serum cholesterol level (mg/dl) of 10 subjects were found to be as follows: 192 242 203 212 175 284 256 218 182 228
We observe that the above data set is of quantitative type.
To calculate mean the first step is to sum all the values. Thus,

$$\sum_i x_i = 192 + 242 + 203 + \dots + 228 = 2192$$

The second step is to divide this sum by the total number of observations (n), which is 10 in our example. Thus,

$$\bar{x} = \frac{\sum_i x_i}{n} = \frac{2192}{10} = 219.2$$

Thus the average value of serum cholesterol among the 10 subjects studied = 219.2 mg/dl. This summary value of mean describes our entire data in one value.
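The two steps of the worked example (sum the values, then divide by n) can be checked with a few lines of Python. This is only an illustrative sketch; the function name `arithmetic_mean` is an assumption made for this example and does not come from the text.

```python
# Serum cholesterol levels (mg/dl) of the 10 subjects in the worked example
cholesterol = [192, 242, 203, 212, 175, 284, 256, 218, 182, 228]


def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)


total = sum(cholesterol)             # step 1: sum of all values = 2192
mean = arithmetic_mean(cholesterol)  # step 2: 2192 / 10 = 219.2

print(total, round(mean, 1))
```

Replacing the largest value (284) by a much larger one and re-running the sketch shows how strongly a single extreme observation pulls the mean.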
Calculation of mean from grouped data : For calculating the mean from “grouped data” we should first find out the midpoint (class mark) of each class interval, which we denote by x. (The midpoint is calculated by adding the upper limit and the lower limit of the respective class interval and dividing by 2). The next step is to multiply the midpoints by the frequency of that class interval. Summing all these products and then dividing by the total sample size yields the mean value for grouped data.
Consider the following example on 10 subjects on serum cholesterol level (mg/dl), put in class intervals (Table - 1).

Table - 1
Serum cholesterol level (mg/dl)   Midpoint (x)   No. of subjects (f)   x*f
(a)                               (b)            (c)                   (b x c)
175-199                           187            3                     561
200-224                           212            3                     636
225-249                           237            2                     474
250-274                           262            1                     262
275-299                           287            1                     287
Total                                            10 = ∑f               2220 = ∑fx

The mean, then, is calculated as

$$\bar{x} = \frac{\sum f x}{\sum f} = \frac{2220}{10} = 222$$

Median
When the data is skewed, another measure of central tendency called median is used. Median is a positional measure which is the middlemost observation after all the values are arranged in ascending or descending order. In other words median is that value which divides the entire data set into 2 equal parts, when the data set is ordered in an ascending (or descending) fashion. When there is an odd number of observations we have a single middle value which is the median value. When an even number of observations is present there are two middle values and the median is calculated by taking the mean of these two middle observations. Thus,

$$\text{Median} = \begin{cases} \left(\dfrac{n+1}{2}\right)\text{th observation}, & \text{when } n \text{ is odd} \\ \text{mean of the } \left(\dfrac{n}{2}\right)\text{th and } \left(\dfrac{n}{2}+1\right)\text{th observations}, & \text{when } n \text{ is even} \end{cases}$$

Let us work on our example of serum cholesterol considered in calculation of mean for ungrouped data. In the first step, we will order the data set in an ascending order as follows :
175, 182, 192, 203, 212, 218, 228, 242, 256, 284
Since n is 10 (even) we have two middle most observations as 212 and 218 (i.e. the 5th and 6th value)

$$\text{Therefore, median} = \frac{212 + 218}{2} = 215$$

Like mean, median is also very easy to calculate. In fact if the observations are few, median can be calculated by just inspection. Unlike mean, median can be calculated if the extreme observation is missing. It is less affected by fluctuations of sampling than mean.
Mode
Mode is the most common value that repeats itself in the data set. Though mode is easy to calculate, at times it may be impossible to calculate mode if we do not have any value repeating itself in the data set. At the other end it may so happen that we come across two or more values repeating themselves the same number of times. In such cases the distribution is said to be bimodal or multimodal.
Geometric Mean
Geometric mean is defined as the nth root of the product of observations.
Mathematically,

$$\text{Geometric Mean} = \sqrt[n]{x_1 \times x_2 \times x_3 \times \dots \times x_n}$$

Thus if there are 3 observations in the data set, the first step would be to calculate the product of all the three observations. The second step would be to take cube root of this product. Similarly the geometric mean of 4 values would be the 4th root of the product of the four observations.
The merits of geometric mean are that it is based on all the observations. It is also not much affected by the fluctuations of sampling. The disadvantage is that it is not easy to calculate and finds limited use in medical research.
Harmonic Mean
Harmonic mean of a set of values is the reciprocal of the arithmetic mean of the reciprocals of the values. Mathematically,

$$\text{Harmonic mean} = \frac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \dots + \dfrac{1}{x_n}}$$

Thus if there are four values in the data set as 2, 4, 6 and 8, the harmonic mean is

$$\frac{4}{\dfrac{1}{2} + \dfrac{1}{4} + \dfrac{1}{6} + \dfrac{1}{8}} = 3.84$$

Though harmonic mean is based on all the values, it is not easy to understand and calculate. Like geometric mean this also finds limited use in medical research.
Relationship between the Three Measures of Mean, Median and Mode
1. For symmetric curve
Mean = Median = Mode
2. For moderately skewed (asymmetric) curve
Mean - Mode ≈ 3 (Mean - Median)
3. For positively skewed curve
Mean > Median > Mode
4. For negatively skewed curve
Mean < Median < Mode
Choice of Central Tendency
We observe that each measure of central tendency discussed above has some merits and demerits. No one average is good for all types of research. The choice should depend on the type of information collected and the research question the investigator is trying to answer. If the collected data is of quantitative nature and symmetric or approximately symmetric, generally the measure used is arithmetic mean. But if the values in the series are such that only one or two observations are very big or very small compared to other observations, arithmetic mean gives fallacious conclusions. In such cases (skewed data) median or mode would give better results. In social and psychological studies which deal with scored observations or data which are not capable of direct quantitative measurement like socio-economic status, intelligence or pain score etc., median or mode is a better measure than mean. However, ‘mode’ is generally not used since it is not amenable to statistical analysis.
Measures of Relative Position (Quantiles)
Quantiles are the values that divide a set of numerical data arranged in increasing order into an equal number of parts. Quartiles divide the numerical data arranged in increasing order into four equal parts of 25% each. Thus there are 3 quartiles Q1, Q2 and Q3 respectively. Deciles are values which divide the arranged data into ten equal parts of 10% each. Thus we have 9 deciles which divide the data in ten equal parts. Percentiles are the values that divide the arranged data into hundred equal parts of 1% each. Thus there are 99 percentiles. The 50th percentile, 5th decile and 2nd quartile are equal to median.
Measures of Variability
Knowledge of central tendency alone is not sufficient for complete understanding of distribution. For example if we have three series having the same mean, then it alone does not throw light on the composition of the data, hence to supplement it we need a measure which will tell us regarding the spread of the data. In contrast to measures of central tendency which describe the center of the data set, measures of variability describe the variability or spread of the observations from the center of the data. Various measures of dispersion are as follows.
●● Range
●● Interquartile range
●● Mean deviation
●● Standard deviation
●● Coefficient of variation
Range
One of the simplest measures of variability is range. Range is the difference between the two extremes i.e. the difference between the maximum and minimum observation.
Range = maximum observation - minimum observation
One of the drawbacks of range is that it uses only extreme observations and ignores the rest. This variability measure is easy to calculate but it is affected by the fluctuations of sampling. It gives a rough idea of the dispersion of the data.
Interquartile Range
As in the case of range difference in extreme observations is found, similarly interquartile range is calculated by taking difference in the values of the two extreme quartiles. Thus
Interquartile range = Q3 - Q1
Quartiles divide the total number of observations into 4 equal parts of 25% each. Thus there are three quartiles (Q1, Q2 and Q3) which divide the total observations in four equal parts. The second quartile Q2 is equivalent to the middle value i.e. median. The interquartile range gives the middle 50% values of the data set. Though interquartile range is easy to calculate it suffers from the same defects as that of range.
Mean Deviation
Mean deviation is the mean of the absolute differences from a constant ‘A’ which can be taken as mean, median, mode or any constant observation from the data. The formula for mean deviation is given as follows:

$$\text{Mean deviation} = \frac{\sum_i |x_i - A|}{n}$$

where A may be mean, median, mode or a constant; $x_i$ is the value of individual observations; n is the total number of observations; and, ∑ is a sign indicating “sum of”. The main drawback of this measure is that it ignores the algebraic signs and hence to overcome this drawback we have another measure of variability called Variance.
Standard Deviation
Variance is the average of the squared deviations of each of the individual values from the mean ($\bar{x}$). It is mathematically given as follows:

$$\text{Variance} = \frac{\sum_i (x_i - \bar{x})^2}{n}$$

Most often we use the square root of the variance, called Standard Deviation, to describe the data. Variance squares the units; standard deviation, by taking the square root, brings the measure back to the same units as the original data and hence is the best measure of variability. It is given as follows:

$$\text{Standard Deviation (SD)} = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n}}$$

The larger the standard deviation the larger is the spread of the distribution.
Note: When n is less than 30, the denominator in variance and standard deviation formula changes to (n-1).
Let us demonstrate its calculations using our hypothetical data set on serum cholesterol (Table - 2).

$$\text{Standard Deviation (SD)} = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}} \quad \text{; since } n < 30$$

$$= \sqrt{\frac{739.84 + 519.84 + \dots + 77.44}{10 - 1}}$$

$$SD = \sqrt{\frac{10543.6}{9}} = 34.227$$
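The measures described above can be cross-checked numerically. The following Python sketch recomputes the median, harmonic mean, range, mean deviation and standard deviation for the serum cholesterol example. The helper names (`geometric_mean`, `harmonic_mean`, `mean_deviation`) are assumptions introduced for illustration only; `statistics.median`, `statistics.multimode` and `math.prod` are standard-library functions (Python 3.8 or later).

```python
import math
import statistics

# Serum cholesterol levels (mg/dl) of the 10 subjects in the worked example
cholesterol = [192, 242, 203, 212, 175, 284, 256, 218, 182, 228]
n = len(cholesterol)

# Median: n = 10 is even, so it is the mean of the 5th and 6th ordered values.
median = statistics.median(cholesterol)            # (212 + 218) / 2 = 215.0

# Mode: every value occurs once here, so there is no single mode;
# multimode() then returns all ten values.
modes = statistics.multimode(cholesterol)


def geometric_mean(values):
    """nth root of the product of n positive observations."""
    return math.prod(values) ** (1 / len(values))


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1 / v for v in values)


hm = harmonic_mean([2, 4, 6, 8])                   # 4 / (25/24) = 3.84

# Range: maximum observation minus minimum observation.
data_range = max(cholesterol) - min(cholesterol)   # 284 - 175 = 109


def mean_deviation(values, a):
    """Mean of the absolute deviations from a constant a."""
    return sum(abs(v - a) for v in values) / len(values)


mean = sum(cholesterol) / n                        # 219.2
md = mean_deviation(cholesterol, mean)             # 266.4 / 10 = 26.64

# Standard deviation with the (n - 1) denominator, since n < 30.
squared_deviations = sum((v - mean) ** 2 for v in cholesterol)  # about 10543.6
sd = math.sqrt(squared_deviations / (n - 1))       # about 34.227

print(median, round(hm, 2), data_range, round(md, 2), round(sd, 3))
```

The hand-written standard deviation agrees with `statistics.stdev`, which also divides by n - 1.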
Table - 2
Sr. No   Serum cholesterol (xi)   (xi - x̄), x̄ = 219.2      (xi - x̄)²
1        192                      192 - 219.2 = -27.2       (-27.2)² = 739.84
2        242                      242 - 219.2 = 22.8        (22.8)² = 519.84
3        203                      -16.2                     262.44
4        212                      -7.2                      51.84
5        175                      -44.2                     1953.64
6        284                      64.8                      4199.04
7        256                      36.8                      1354.24
8        218                      -1.2                      1.44
9        182                      -37.2                     1383.84
10       228                      8.8                       77.44
Total    2192                                               10543.6

Calculation of Standard deviation in a grouped data : For grouped data the calculation of standard deviation slightly changes. It is given by following formula.

$$SD = \sqrt{\frac{\sum_i f_i (x_i - \bar{x})^2}{n}} \quad \text{; replace } n \text{ by } n-1 \text{ if observations are less than 30}$$

where $f_i$ is the frequency (i.e. number of subjects in that group), $x_i$ is the midpoint of the group, $n = \sum f_i$ and $\bar{x}$ is the overall mean. Suppose the data on serum cholesterol was grouped, as we had demonstrated earlier in this chapter for calculation of the mean for grouped data. We had calculated the mean as 222. Now in the same table, make more columns as in Table - 3.

Table - 3
Serum cholesterol level (mg/dl)   Midpoint (x)   No. of subjects (f)   (x - x̄)            (x - x̄)²        f*(x - x̄)²
175-199                           187            3                     (187-222) = -35     (-35)² = 1225   3*1225 = 3675
200-224                           212            3                     (212-222) = -10     100             300
225-249                           237            2                     15                  225             450
250-274                           262            1                     40                  1600            1600
275-299                           287            1                     65                  4225            4225
Total                                            10 = ∑f                                   7375            10250

Thus,

$$SD = \sqrt{\frac{10250}{n - 1}} = \sqrt{\frac{10250}{9}} = 33.74$$

Coefficient of Variation
Besides the measures of variability discussed above, we have one more important measure called the coefficient of variation which compares the variability in two data sets. It measures the variability relative to the mean and is calculated as follows:

$$\text{Coefficient of Variation (CV)} = \frac{SD}{\bar{x}} \times 100$$

If the coefficient of variation is greater for one data set it suggests that the data set is more variable than the other data set.
Thus, any information that is collected by the researcher needs to be described by measures of central tendency and measures of variability. Both the measures together describe the data. Measures of central tendency alone will not give any idea about the data set without measure of variability. Descriptive Statistics is critical because it often suggests possible hypothesis for future investigation.
Summary
Raw information is organized and condensed by using tabular and graphical presentations, but compiling and presenting the data in tabular or graphical form will not give complete information of the data collected. We need to “summarise” the entire data in one figure, looking at which we can get overall idea of the data. Thus, the data set should be meaningfully described using summary measures. Summary measures provide description of data in terms of concentration of data and variability existing in data. Having described our data set we use these summary figures to draw certain conclusions about the reference population from which the sample data has been drawn. Thus data is described by two summary measures namely, measures of central tendency and measures of variability. Measures of central tendency describe the centrality of the data set. In other words central tendency tells us where the data is concentrated. If the researcher is dealing with quantitative data, mean is the best centrality measure whereas in qualitative data median and mode describe the data appropriately. Measures of variability give the spread or the dispersion of the data. In other words they describe the scatter of the individual observations from the central value. The simplest of the variability measures is range which is the difference between the two extreme observations. Various measures of dispersion are mean deviation, variance and standard deviation. Standard deviation is the most commonly used variability measure to describe quantitative data and is expressed in the same units as the data. When commenting on the variability while dealing with two or more groups or techniques, a special measure of variability called coefficient of variation is used. The group in which coefficient of variation is more is said to be more variable than the other. Both measures of central tendency and measures of variability together describe the data set and often suggest possible hypothesis for future investigation.
Study Exercises
Short Notes : (1) Measures of central tendency (2) Measures of Variation
MCQs
1. Which of the Statistical average takes into account all the numbers equally? (a) Mean (b) Median (c) Mode (d) None of the above
2. Which of the following is a measure of Spread (a) Variance, (b) Mean (c) p value (d) Mode
3. Which of the following is a measure of location (a) Variance (b) Mode (c) p value (d) Median
4. Which among the following is not a measure of variability: (a) Standard deviation (b) Range (c) Median (d) Coefficient of Variation
5. For a positively skewed curve which measure of central tendency is largest (a) Mean (b) Mode (c) Median (d) All are equal
6. Most common value that repeats itself in the data set is (a) Mean (b) Mode (c) Median (d) All of the above.
7. Variance is square of (a) p value (b) Mean deviation (c) Standard deviation (d) Coefficient of variation.
8. Percentiles divides the data into _____ equal parts (a) 100 (b) 50 (c) 10 (d) 25
9. 10 babies are born in a hospital on same day. All weigh 2.8 Kg each; What would be the standard deviation (a) 0.28 (b) 1 (c) 2.8 (d) 0
10. To compare the variability in two populations we use this measure (a) Range (b) Coefficient of Variation (c) Median (d) Standard deviation
Answers : (1) a; (2) a; (3) d; (4) c; (5) a; (6) b; (7) c; (8) a; (9) d; (10) b.
Statistical Exercises
1. A researcher who wanted to know the weights in Kg of children of second standard collected the following information on 15 students: 10, 20, 11, 12, 12, 13, 11, 14, 13, 13, 15, 11, 16, 17, 18. What type of data is it? Calculate mean, median and mode from the above data. Calculate mean deviation and standard deviation. (Answer : Mean = 13.7, Median = 13, Mode = 11 & 13, Mean deviation = 2.34, Standard deviation = 2.9)
2. If the height (cm) of the same students is 95, 110, 98, 100, 102, 102, 99, 103, 104, 103, 106, 99, 108, 108, 109. What type of data is it? What is the scale of measurement? Calculate mean, median and mode from the above data. Calculate mean deviation and standard deviation. Between height and weight which is more variable and why? (Answer : Mean = 103.1, Median = 103, Mode = 99, 102, 103 & 108, Mean deviation = 3.55, Standard deviation = 4.4, Coefficient of variation of weight = 21.17, Coefficient of variation of height = 4.27 hence weight is more variable)
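The grouped-data calculations of this chapter (Tables 1 and 3) and the coefficient of variation can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the variable and function names are invented for this example, and the standard deviations and means passed to `coefficient_of_variation` are the rounded answers quoted in the Statistical Exercises above.

```python
import math

# (midpoint x, frequency f) for each class interval of serum cholesterol
classes = [(187, 3), (212, 3), (237, 2), (262, 1), (287, 1)]

n = sum(f for _, f in classes)                        # 10 subjects
grouped_mean = sum(x * f for x, f in classes) / n     # 2220 / 10 = 222.0

# Sum of f * (x - mean)^2, the last column of Table 3
weighted_squares = sum(f * (x - grouped_mean) ** 2 for x, f in classes)  # 10250

# n < 30, so the denominator is n - 1
grouped_sd = math.sqrt(weighted_squares / (n - 1))    # about 33.75


def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100


# Rounded answers quoted in the Statistical Exercises
cv_weight = coefficient_of_variation(2.9, 13.7)       # about 21.17
cv_height = coefficient_of_variation(4.4, 103.1)      # about 4.27

print(grouped_mean, round(grouped_sd, 2), round(cv_weight, 2), round(cv_height, 2))
```

Because the coefficient of variation is unit free, it allows weight (Kg) and height (cm) to be compared directly, which the standard deviations alone do not.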
41 Introducing Inferential Statistics : Gaussian Distribution and Central Limit Theorem
Seema R. Patrikar
The Gaussian Distribution or Normal Curve
If we draw a smooth curve passing through the mid points of the bars of histogram and if the curve is bell shaped curve then the data is said to be roughly following a normal distribution. Many different types of data distributions are encountered in medicine. The Gaussian or “normal” distribution is among the most important. Its importance stems from the fact that the characteristics of this theoretical distribution underlie many aspects of both descriptive and inferential statistics (Fig. - 1).
Fig. - 1
Gaussian distribution is one of the important distributions in statistics. Most of the data relating to social and physical sciences conform to the distribution for sufficiently large observations by virtue of central limit theorem.
Normal distribution was first discovered by mathematician De-Moivre. Karl Gauss and Pierre-Simon Laplace used this distribution to describe error of measurement. Normal distribution is also called as ‘Gaussian distribution’.
A normal curve is determined entirely by the mean and the standard deviation. Hence it is possible to have various normal curves with different standard deviations but same mean (Fig. - 2a) and various normal curves with different means but same standard deviation (Fig. - 2b).
Fig. - 2a : Normal curves with same mean but different standard deviations
Fig. - 2b : Normal curves with same standard deviation but different means
The normal curve possesses many important properties and is of extreme importance in the theory of errors. The normal distribution is defined by following characteristics:
●● It is a bell shaped symmetric (about the mean) curve.
●● The curve on either side of the mean is mirror image of the other side.
●● The mean, median and mode coincide.
●● Highest frequency (frequency means the number of observations for a particular value or in a particular class interval) is in the middle around the mean and lowest at both the extremes and frequency is decreasing smoothly on either side of the mean.
●● The total area under the curve is equal to 1 or 100%.
●● The most important relationship in the normal curve is the area relationship.
The proportional area enclosed between mean and multiples of SD is constant.
Mean ± 1 SD -------> 68% of the total area
Mean ± 2 SD -------> 95% of the total area
Mean ± 3 SD -------> 99.7% of the total area
Fig. 3 shows the area enclosed by 1, 2 and 3 SD from mean.
Fig. - 3 : Area under the normal curve between Mean-1SD and Mean+1SD (68%), Mean-2SD and Mean+2SD (95%), Mean-3SD and Mean+3SD (99.7%)
If these criteria are not met, then the distribution is not a Gaussian or normal distribution.
Standard Normal Variate (SNV)
As already specified, a normal frequency curve can be described completely with the mean and standard deviation values. Even the same set of data would provide different values for the mean and SD, depending on the choice of measurement. For example, the same person's height can be expressed as 66 inches or 167.6 cms. An infant’s birth weight can be recorded as 2500 gms or 5.5 pounds. Because the units of measurement differ, so do the numbers, although the true height and weight are the same. To eliminate the effect produced by the choice of units of measurement the data can be put in the unit free form or the data can be normalized. The first step to transform the original variable to normalized variable is to calculate the mean and SD. The normalized values are then calculated by subtracting mean from individual values and dividing by SD. These normalized values are also called the z values.

$$z = \frac{x - \mu}{\sigma}$$

(where x is the individual observation, µ = mean and σ = standard deviation)
The z values always have a mean of 0 and a standard deviation of 1, and when x follows a normal distribution, z also follows a normal distribution. The z values are often called the ‘Standard Normal Variate’.
Central Limit Theorem (CLT)
The CLT is responsible for the following remarkable result:
The distribution of an average tends to be Normal, even when the distribution from which the average is computed is non-Normal.
Furthermore, this normal distribution will have the same mean as the parent distribution, and variance equal to the variance of the parent distribution divided by the sample size (σ²/n).
The central limit theorem states that given a distribution with a mean μ and variance σ², the sampling distribution of the mean approaches a normal distribution with a mean (μ) and a variance σ²/N as N, the sample size, increases. The amazing and counter-intuitive thing about the central limit theorem is that no matter what the shape of the original distribution, the sampling distribution of the mean approaches a normal distribution. Furthermore, for most distributions, a normal distribution is approached very quickly as N increases. Thus, the Central Limit theorem is the foundation for many statistical procedures.
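Both ideas on this page, the unit-free z values and the behaviour of sample averages, can be illustrated with a small Python sketch. It is a simulation under stated assumptions, not a proof: the heights are hypothetical values chosen for illustration, the parent distribution is taken to be uniform on 0 to 1 (mean 0.5, variance 1/12), and the names `z_scores` and `sample_means` are invented for this example.

```python
import random
import statistics


def z_scores(values):
    """Subtract the mean from each value and divide by the standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical heights of five persons, recorded in inches and in centimetres
heights_in = [60, 62, 66, 68, 70]
heights_cm = [h * 2.54 for h in heights_in]

z_in = z_scores(heights_in)
z_cm = z_scores(heights_cm)   # same z values: the choice of unit drops out


def sample_means(n, repeats=5000, seed=0):
    """Averages of `repeats` random samples of size n from a uniform(0, 1) parent."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.random() for _ in range(n)])
        for _ in range(repeats)
    ]


parent_variance = 1 / 12      # variance of the uniform(0, 1) distribution
means_16 = sample_means(16)

# The averages centre on the parent mean (0.5) and their variance is
# close to the parent variance divided by the sample size.
print(round(statistics.fmean(means_16), 3))
print(round(statistics.pvariance(means_16), 5), round(parent_variance / 16, 5))
```

Plotting a histogram of `means_16` (or of `sample_means(2)`, `sample_means(32)` and so on) gives a picture of how the averages pile up around 0.5 as the sample size grows.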
To understand the concept of central limit theorem in detail let us consider the following example (Fig. - 4a - g).
Fig. - 4a : Non Normal distribution of x
The uniform distribution shown in the figure is obviously non-Normal. Call that the parent distribution.
Fig. - 4b : Distribution of x̄ when n=2
To compute an average, x̄, two samples are drawn, at random, from the parent distribution and averaged. Then another sample of two is drawn and another value of x̄ computed. This process is repeated, over and over, and averages of two are computed. The distribution of averages of two is shown in the figure.
Fig. - 4c : Distribution of x̄ when n=3
Repeatedly taking three from the parent distribution, and computing the averages, produces the distribution curve shown in the figure.
Fig. - 4d : Distribution of x̄ when n=4
Repeatedly taking four from the parent distribution, and computing the averages, produces the probability density shown in the figure.
Fig. - 4e : Distribution of x̄ when n=8
Repeatedly taking eight from the parent distribution, and computing the averages, produces the distribution curve shown in the figure.
Fig. - 4f : Distribution of x̄ when n=16
Repeatedly taking sixteen from the parent distribution, and computing the averages, produces the distribution curve shown in the figure.
Fig. - 4g : Distribution of x̄ when n=32
Repeatedly taking thirty-two from the parent distribution, and computing the averages, produces the probability density shown in the figure.
Thus we notice that when the sample size approaches a couple dozen, the distribution of the average is very nearly Normal, even though the parent distribution looks anything but Normal.
Summary
Normal distribution was first discovered by mathematician De-Moivre. Karl Gauss and Pierre-Simon Laplace used this distribution to describe error of measurement. Normal distribution is also called as ‘Gaussian distribution’. When the midpoints of the histograms are joined by smooth curve and if the curve resembles a bell shaped curve, the data is said to be approximately normal. The normal distribution is defined by certain characteristics. It is a bell shaped symmetric (about the mean) curve. The curve on either side of the mean is mirror image of the other side. The mean, median and mode coincide. Highest frequency is in the middle around the mean and lowest at both the extremes and frequency is decreasing smoothly on either side of the mean. The total area under the curve is equal to 1 or 100%. The most important relationship in the normal curve is the area relationship. The proportional area enclosed