38    Introduction to Biostatistics

Seema R. Patrikar

The word statistics has its roots in the Latin word ‘status’, which means state. In the early days the administration of the state required the collection of information regarding the population for the purpose of war. Around 2000 years ago, in India, we had this system of collecting administrative statistics. In the Mauryan regime the system of registration of vital events of births and deaths existed. Ain-i-Akbari is a collection of information gathered on various surveys conducted during the reign of Emperor Akbar.

The birth of statistics occurred in the mid-17th century. A commoner named John Graunt began reviewing a weekly church publication issued by the local parish clerk that listed the number of births, christenings, and deaths in each parish. These so-called Bills of Mortality also listed the causes of death. Graunt, who was a shopkeeper, organized this data, which was published as Natural and Political Observations made upon the Bills of Mortality. The seventeenth century contribution of the theory of probability laid the foundation of modern statistical methods.

Statistics has become increasingly important with passing time. Statistical methods are fruitfully applied to any problem of decision making where past information is available or can be made available. They help to weigh the evidence and draw conclusions. Statistics finds its application in almost all the fields of science. We hardly find any science that does not make use of statistics.

Definition of Statistics
Different authors have defined statistics differently. The best definition of statistics is given by Croxton and Cowden, according to whom statistics may be defined as the science which deals with collection, presentation, analysis and interpretation of numerical data.

Definition of Biostatistics
Biostatistics may be defined as the application of statistical methods to medical, biological and public health related problems. It is the scientific treatment given to the medical data derived from a group of individuals or patients.

Role of Statistics in Clinical Medicine
The main theory of statistics lies in the term variability. No two individuals are the same. For example, blood pressure may vary from time to time within a person as well as from person to person. We can also have instrumental variability as well as observer variability. Methods of statistical inference provide largely objective means for drawing conclusions from the data about the issue under study. Medical science is full of uncertainties and statistics deals with uncertainties. Statistical methods try to quantify the uncertainties present in medical science. They help the researcher to arrive at a scientific judgment about a hypothesis. It has been argued that decision making is an integral part of a physician’s work. Frequently, decision making is probability based.

Role of Statistics in Public Health and Community Medicine
Statistics finds an extensive use in Public Health and Community Medicine. Statistical methods are foundations for public health administrators to understand what is happening to the population under their care at community level as well as individual level. If reliable information regarding the disease is available, the public health administrator is in a position to:
● Assess community needs
● Understand socio-economic determinants of health
● Plan experiments in health research
● Analyse their results
● Study diagnosis and prognosis of the disease for taking effective action
● Scientifically test the efficacy of new medicines and methods of treatment.
Statistics in public health is critical for calling attention to problems, identifying risk factors, and suggesting solutions, and ultimately for taking credit for our successes. The most important application of statistics in sociology is in the field of demography.

Statistics helps in developing sound methods of collecting data so as to draw valid inferences regarding the hypothesis. It helps us present the data in numerical form after simplifying the complex data by way of classification, tabulation and graphical presentation. Statistics can be used for comparison as well as to study the relationship between two or more factors. The use of such a relationship further helps to predict one factor from the other. Statistics helps the researcher come to valid conclusions in answering their research questions.

Despite the wide importance of the subject it is looked upon with suspicion. “Lies, damned lies, and statistics” is part of a phrase attributed to Benjamin Disraeli and popularized in the United States by Mark Twain: “There are three kinds of lies: lies, damned lies, and statistics.” The semi-ironic statement refers to the persuasive power of numbers, and describes how even accurate statistics can be used to bolster inaccurate arguments. It is human psychology that when facts are supported by figures, they are easily believed. If wrong figures are used they are bound to give wrong conclusions; hence, when statistical theories are applied, the figures that are used should be free of all types of biases and should have been properly collected and scientifically analysed.

Broad Categories of Statistics
Statistics can broadly be split into two categories: Descriptive Statistics and Inferential Statistics. Descriptive statistics deals with the meaningful presentation of data such that its characteristics can be effectively observed. It encompasses the tabular, graphical or pictorial display of data, condensation of large data into tables, preparation of summary measures to give a concise description of complex information and also to exhibit patterns that may be found in data sets. Inferential statistics, however, refers to decisions. Medical research doesn’t stop at just describing the characteristics of a disease or situation. It tries to determine whether characteristics of a situation are unusual or if they have happened by chance. Because of this desire to generalize, the first step is to statistically analyse the
data.

In order to begin our analysis as to why statistics is necessary we must begin by addressing the nature of science and experimentation. The characteristic method used by a researcher when he/she starts his/her experiment is to study a relatively small collection of subjects, as complete population based studies are time consuming, laborious, costly and resource intensive. The researcher draws a subset of the population called a “sample” and studies this sample in depth. But the conclusions drawn after analyzing the sample are not restricted to the sample but are extrapolated to the population i.e. people in general. Thus Statistics is the mathematical method by which the uncertainty inherent in the scientific method is rigorously quantified.

Summary
In recent times, use of Statistics as a tool to describe various phenomena is increasing in biological sciences and health related fields, so much so that irrespective of the sphere of investigation, a research worker has to plan his/her experiments in such a manner that the kind of conclusions which he/she intends to draw should become technically valid. Statistics comes to this aid at the stages of planning of the experiment, collection of data, analysis and interpretation of measures computed during the analysis. Biostatistics is defined as the application of statistical methods to medical, biological and public health related problems. Statistics is broadly categorized into descriptive statistics and inferential statistics. Descriptive statistics describes the data in meaningful tables or graphs so that the hidden pattern is brought out. Condensing the complex data into a simple format and describing it with summary measures is part of descriptive statistics. Inferential statistics, on the other hand, deals with drawing inferences and taking decisions by studying a subset or sample from the population.

Study Exercises
Short Notes : (1) Differentiate between descriptive and inferential statistics (2) Describe briefly various scales of measurement.

MCQs & Exercises
1. An 85 year old man is rushed to the emergency department by ambulance during an episode of chest pain. The preliminary assessment of the condition of the man is performed by a nurse, who reports that the patient’s pain seems to be ‘severe’. The characterization of pain as ‘severe’ is (a) Dichotomous (b) Nominal (c) Quantitative (d) Qualitative
2. If we ask the patient attending OPD to evaluate his pain on a scale of 0 (no pain) to 5 (the worst pain), then this commonly applied scale is a (a) Dichotomous (b) Ratio scale (c) Continuous (d) Nominal
3. For each of the following variables indicate whether it is quantitative or qualitative and specify the measurement scale for each variable : (a) Blood Pressure (mmHg) (b) Cholesterol (mmol/l) (c) Diabetes (Yes/No) (d) Body Mass Index (kg/m²) (e) Age (years) (f) Sex (female/male) (g) Employment (paid work/retired/housewife) (h) Smoking Status (smokers/non-smokers/ex-smokers) (i) Exercise (hours per week) (j) Drink alcohol (units per week) (k) Level of pain (mild/moderate/severe)

Answers : (1) d; (2) b; (3) (a) Quantitative continuous; (b) Quantitative continuous; (c) Qualitative dichotomous; (d) Quantitative continuous; (e) Quantitative continuous; (f) Qualitative dichotomous; (g) Qualitative nominal; (h) Qualitative nominal; (i) Quantitative discrete; (j) Quantitative discrete; (k) Qualitative ordinal.




39    Descriptive Statistics: Displaying the Data

Seema R. Patrikar

The observations made on the subjects one after the other are called raw data. Raw data are often little more than a jumble of numbers and hence very difficult to handle. Data are collected by researchers so that they can give solutions to the research question that they started with. Raw data become useful only when they are arranged and organized in a manner that we can extract information from the data and communicate it to others. In other words data should be processed and subjected to further analysis. This is possible through data depiction, data summarization and data transformation.

The first step in handling the data, after it has been collected, is to ‘reduce’ and summarise it, so that it can become understandable; then only meaningful conclusions can be drawn from it. Data can be displayed in either tabular form or graphical form. Tables are used to categorize and summarize data while graphs are used to provide an overall visual representation. To develop graphs and diagrams, we need to first of all condense the data in a table.

Understanding as to how the Data have been Recorded
Before we start summarizing or further analyzing the data, we should be very clear as to which ‘scale’ it has been recorded on (i.e. qualitative or quantitative; and, whether continuous, discrete, ordinal, polychotomous or dichotomous). The details have already been covered earlier in the chapter on variables and scales of measurement (section on epidemiology) and the
readers should quickly revise that chapter before proceeding.

Ordered Data
When the data are organized in order of magnitude from the smallest value to the largest value it is called an ordered array. For example consider the ages of 11 subjects undergoing a tobacco cessation programme (in years): 16, 27, 34, 41, 38, 53, 65, 52, 20, 26, 68. When we arrange these ages in increasing order of magnitude we get the ordered array as follows: 16, 20, 26, 27, 34, 38, 41, 52, 53, 65, 68. After observing the ordered array we can quickly determine that the youngest person is 16 years old and the oldest 68 years. Also we can easily state that almost 55% of the subjects (6 of 11) are below 40 years of age, and that the midway person is aged 38 years.

Grouped Data - Frequency Table
Besides arranging the data in an ordered array, grouping of data is yet another useful way of summarizing them. We classify the data in appropriate groups which are called “classes”. The basic purpose behind classification or grouping is to help comparison and also to accommodate a large number of observations into a few classes only, by condensation, so that similarities and dissimilarities can be easily brought out. It also highlights important features and pinpoints the most significant ones at a glance.

Table 1 shows a set of raw data obtained from a cross-sectional survey of a random sample of 100 children under one year of age for malnutrition status. Information regarding age and sex of the child was also collected. We will use this data to illustrate the construction of various tables. If we show the distribution of children as per age then it is called a simple table, as only one variable is considered.

Table - 1 : Raw data on malnutrition status (malnourished and normal) for 100 children below one year of age

 Child   Sex   Age (months)   Malnutrition Status
   1      f         6         Normal
   2      m         4         Malnourished
   3      m         2         Malnourished
   4      m         5         Normal
   5      m         3         Normal
   6      f         1         Normal
   7      m         5         Normal
   8      f         8         Normal
   9      f         7         Normal
  10      f         9         Normal
  11      f        10         Normal
  12      f         2         Normal
  13      m         4         Malnourished
  14      f         6         Normal
  15      m         8         Normal
  16      f         1         Malnourished
  17      f         2         Normal
  18      m        11         Normal
  19      m        12         Normal
  20      m        11         Malnourished
  21      m        10         Normal
  22      f         9         Normal
  23      f         5         Normal
  24      f         6         Normal
  25      m         4         Normal
  26      f         7         Normal
  27      f        11         Normal
  28      f        12         Normal
  29      m        10         Malnourished
  30      m         4         Normal
  31      m         6         Normal
  32      m         8         Normal
  33      m        12         Malnourished
  34      m         1         Malnourished
  35      m         1         Normal
  36      f         3         Normal
  37      m         5         Normal
  38      f         6         Normal
  39      f         8         Normal
  40      f         9         Normal
  41      f        10         Malnourished
  42      m         1         Normal
  43      f        12         Malnourished
  44      f         2         Malnourished
  45      m         1         Normal
  46      m         6         Normal
  47      m         4         Normal
  48      f         9         Normal
  49      f         4         Normal
  50      m         9         Normal
  51      m         7         Normal
  52      m         6         Normal
  53      m         4         Normal
  54      f         2         Normal
  55      m         5         Normal
  56      m         3         Normal
  57      f         1         Normal
  58      m         5         Normal
  60      m         7         Malnourished
  61      m         9         Normal
  62      m        10         Normal
  63      f         2         Normal
  64      f         4         Normal
  65      f         6         Normal
  66      f         8         Normal
  67      m         1         Normal
  68      m         2         Normal
  69      m        11         Normal
  70      f        12         Normal
  71      m        11         Normal
  72      m        10         Malnourished
  73      f         9         Normal
  74      f         5         Normal
  75      f         6         Normal
  76      m         4         Normal
  77      m         7         Normal
  78      m        11         Normal
  79      f        12         Normal
  80      f        10         Normal
  81      m         4         Normal
  82      m         6         Malnourished
  83      m         8         Normal
  84      m        12         Normal
  85      m         1         Normal
  86      m         1         Normal
  87      m         3         Normal
  88      f         5         Normal
  89      m         6         Normal
  90      f         8         Normal
  91      f         9         Normal
  92      f        10         Normal
  93      f         1         Normal
  94      m        12         Normal
  95      m         2         Normal
  96      f         1         Normal
  97      m         6         Normal
  98      f         4         Malnourished
  99      f         9         Malnourished
 100      m         4         Normal

Steps in Making a Summary Table for the Data
To group a set of observations we select a set of contiguous, non-overlapping intervals such that each value in the set of observations can be placed in one and only one of the intervals. These intervals are usually referred to as class intervals. For example the above data can be grouped into the age groups 1-4, 5-8 and 9-12. The class interval 1-4 includes the values 1, 2, 3 and 4. The smallest value 1 is called its lower class limit whereas the highest value 4 is called its upper class limit. The middle value of 1-4 i.e. 2.5 is called the midpoint or class mark. The number of subjects falling in the class interval 1-4 is called its class frequency. Such presentation of data in class intervals along with frequency is called a frequency distribution. When both the limits are included in the range of values of the interval, the class intervals are known as inclusive type of class intervals (e.g. 1-4, 5-8, 9-12, etc.) whereas when the lower boundary is included but the upper limit is excluded from the range of values, such class intervals are known as exclusive type of class intervals (e.g. 1-5, 5-9, 9-13 etc.). The exclusive type of class interval is suitable for continuous variables. Tables can be formed for qualitative variables also.

Table - 2 and 3 display tabulation for quantitative as well as qualitative variables.

Table - 2 : Age distribution of the 100 children

 Age group (months)    Number of children
 1-4                           36
 5-8                           33
 9-12                          31
 Total                        100

Table - 3 : Distribution of malnourishment in 100 children

 Malnourishment Status    Number of children
 Malnourished                     17
 Normal                           83
 Total                           100

Such type of tabulation which takes only one variable for classification is called a one way table. When two variables are involved the table is referred to as a cross tabulation or two way table. For example Table - 4 displays the age and sex distribution of the children and Table - 5 displays the distribution of malnourishment status and sex of children.

Table - 4 : Age and sex distribution of 100 children

 Age group (months)    Female    Male    Total
 1-4                     14       22       36
 5-8                     15       18       33
 9-12                    16       15       31
 Total                   45       55      100
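The ordered array described under "Ordered Data" can be reproduced in a few lines. The following is only a minimal Python sketch of that worked example, using the 11 ages quoted in the text; the variable names are illustrative and not part of the original chapter.

```python
from statistics import median

# Ages (in years) of the 11 subjects in the tobacco cessation example.
ages = [16, 27, 34, 41, 38, 53, 65, 52, 20, 26, 68]

ordered = sorted(ages)                      # the ordered array
youngest, oldest = ordered[0], ordered[-1]  # smallest and largest value
midway = median(ordered)                    # middle (6th) value of the 11 ages

# Share of subjects below 40 years of age, as a percentage.
below_40 = sum(1 for age in ordered if age < 40)
share_below_40 = round(100 * below_40 / len(ordered), 1)

print(ordered)
print(youngest, oldest, midway, share_below_40)
```

Sorting once is enough to read off the youngest, oldest and midway subject, which is the point the text makes about ordered arrays.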




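The construction of one way and two way tables from raw records can be sketched as follows. This is an illustrative Python example, not the author's method: it uses only the first 16 children of Table - 1, so its counts are for that subset and do not reproduce the totals for all 100 children. The helper name age_group is an assumption made for this sketch.

```python
from collections import Counter

# First 16 records of Table - 1: (child, sex, age in months, status).
records = [
    (1, "f", 6, "Normal"), (2, "m", 4, "Malnourished"),
    (3, "m", 2, "Malnourished"), (4, "m", 5, "Normal"),
    (5, "m", 3, "Normal"), (6, "f", 1, "Normal"),
    (7, "m", 5, "Normal"), (8, "f", 8, "Normal"),
    (9, "f", 7, "Normal"), (10, "f", 9, "Normal"),
    (11, "f", 10, "Normal"), (12, "f", 2, "Normal"),
    (13, "m", 4, "Malnourished"), (14, "f", 6, "Normal"),
    (15, "m", 8, "Normal"), (16, "f", 1, "Malnourished"),
]


def age_group(age_months):
    """Return the inclusive class interval (1-4, 5-8 or 9-12) for an age."""
    if 1 <= age_months <= 4:
        return "1-4"
    if 5 <= age_months <= 8:
        return "5-8"
    if 9 <= age_months <= 12:
        return "9-12"
    raise ValueError(f"age outside 1-12 months: {age_months}")


# One way table: class frequency of each age group.
one_way = Counter(age_group(age) for _, _, age, _ in records)

# Two way table: malnourishment status cross-classified by sex.
two_way = Counter((status, sex) for _, sex, _, status in records)

for group in ("1-4", "5-8", "9-12"):
    print(group, one_way[group])
for status in ("Malnourished", "Normal"):
    print(status, two_way[(status, "f")], two_way[(status, "m")])
```

Because the class intervals are inclusive and non-overlapping, every age falls in exactly one group, which is the condition stated in the steps for making a summary table.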
Table - 5 : Malnourishment status and sex distribution of children

 Malnourishment Status    Female    Male    Total
 Malnourished                6       11       17
 Normal                     39       44       83
 Total                      45       55      100

How to Decide on the Number of Class Intervals?
When data are to be grouped it is required to decide upon the number of class intervals to be made. Too few class intervals would result in losing information. On the other hand too many class intervals would not bring out the hidden pattern. The thumb rule is that we should not have fewer than 5 class intervals and no more than 15 class intervals. To be specific, experts have suggested a formula for the approximate number of class intervals (K) as follows:

K = 1 + 3.322 log10 N, rounded to the nearest integer, where N is the number of values or observations under consideration.

For example if N = 25 we have K = 1 + 3.322 log10 25 = 5.64, i.e. approximately 6 class intervals.

Having decided the number of class intervals the next step is to decide the width of the class interval. The width of the class interval is taken as :

Width = (Maximum observed value - Minimum observed value) / Number of class intervals (K) = Range / K

The class limits should preferably be rounded figures and the class intervals should be non-overlapping and must include the range of the observed data. As far as possible the percentages and totals should be calculated column wise.

Graphical Presentation of Data
A tabular presentation discussed above shows the distribution of subjects in various groups or classes. This tabular representation of the frequency distribution is useful for further analysis and conclusion. But it is difficult for a layman to understand a complex distribution of data in tabular form. Graphical presentation of data is better understood and appreciated by humans. Graphical representation brings out the hidden pattern and trends of complex data sets.

Thus the reason for displaying data graphically is two fold:
1) Investigators can have a better look at the information collected and the distribution of data and,
2) To communicate this information to others quickly
We shall discuss in detail some of the commonly used graphical presentations.

Bar Charts : Bar charts are used for qualitative variables. The variable studied is plotted in the form of bars along the X-axis (horizontal) and the height of each bar is equal to the percentage or frequency, which is plotted along the Y-axis (vertical). The width of the bars is kept constant for all the categories and the space between the bars also remains constant throughout. The number of subjects along with percentages in brackets may be written on the top of each bar. When we draw bar charts with only one variable or a single group it is called a simple bar chart and when two variables or two groups are considered it is called a multiple bar chart. In a multiple bar chart the two bars representing two variables are drawn adjacent to each other and equal width of the bars is maintained. The third type of bar chart is the component bar chart wherein we have two qualitative variables which are further segregated into different categories or components. In this the total height of the bar corresponding to one variable is further sub-divided into different components or categories of the other variable. For example consider the following data (Table - 6) which shows the findings of a hypothetical research work intended to describe the pattern of blood groups among patients of essential hypertension.

Table - 6 : Distribution of blood group of patients of essential hypertension

 Blood Group    Number of patients (frequency)    Percentage
 A                        232                        42.81
 B                        201                        37.08
 AB                        76                        14.02
 O                         33                         6.09
 Total                    542                       100.00

A simple bar chart in respect of the above data on blood groups among patients of essential hypertension is represented as in Fig. - 1.
Similarly a multiple bar chart of the data represented in Table - 5 of the distribution of the malnourishment status among males and females is shown in Fig. - 2.
The same information can also be depicted in the form of a component bar chart as in Fig. - 3.

Fig. - 1 : Distribution of blood groups of patients with essential hypertension
[Simple bar chart. X-axis: Blood Groups (A, B, AB, O); Y-axis: Frequency (0 to 250)]
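The rule for the number of class intervals and the width formula can be checked numerically. The sketch below is a minimal Python illustration that assumes the coefficient 3.322 (so that the rule equals 1 + log2 N); the function names are hypothetical and chosen only for this example. For N = 25 the expression evaluates to about 5.64, which rounds to 6.

```python
import math


def number_of_classes(n):
    """Approximate number of class intervals: 1 + 3.322 * log10(n), rounded."""
    return round(1 + 3.322 * math.log10(n))


def class_width(minimum, maximum, k):
    """Width of a class interval: range of the data divided by k."""
    return (maximum - minimum) / k


# Example: 100 observations whose values range from 1 to 12 (ages in months).
k = number_of_classes(100)
width = class_width(1, 12, k)

print(number_of_classes(25), k, width)
```

The formula gives only a starting point. In practice the class limits are then rounded to convenient figures, as with the age groups 1-4, 5-8 and 9-12 used for the children's data.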




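The percentages in Table - 6, and the angles used when the same blood-group data are drawn as a pie chart (Fig. - 4b), follow directly from the frequencies. The following Python sketch recomputes them under the assumption that angles are rounded to whole degrees. Computed this way the share of group A comes to 42.80, which is 0.01 lower than the printed 42.81, and the share of group B comes to 37.08, as used in Fig. - 4b.

```python
# Frequencies of blood groups among patients of essential hypertension (Table - 6).
frequencies = {"A": 232, "B": 201, "AB": 76, "O": 33}
total = sum(frequencies.values())

# Percentage of patients in each blood group, to two decimal places.
percentages = {
    group: round(100 * count / total, 2)
    for group, count in frequencies.items()
}

# Angle of each pie-chart sector in degrees: share of the total times 360.
angles = {
    group: round(360 * count / total)
    for group, count in frequencies.items()
}

print(total)
print(percentages)
print(angles)
```

Working from the raw frequencies rather than from already rounded percentages avoids small rounding differences in the sector angles.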
 Fig. - 2 : Multiple Bar Chart showing the distribution of
 malnourishment status in males and females
 [Figure: Y-axis "Number" (0 to 50); X-axis categories Malnourished
 and Normal; legend Females, Males; bar labels 6 and 11
 (Malnourished), 39 and 44 (Normal)]

 Fig. - 3 : Component Bar Chart showing the distribution
 of malnourishment status in males and females
 [Figure: Y-axis "Number" (0 to 60); X-axis categories Female and
 Male; legend Malnourished, Normal]

Pie Chart : Another interesting method of displaying categorical
(qualitative) data is a pie diagram, also called a circular
diagram. A pie diagram is essentially a circle in which the
angle at the centre for each category is equal to its proportion
multiplied by 360 (or, more easily, its percentage multiplied
by 360 and divided by 100). A pie diagram is best when the total
number of categories is between 2 and 6. If there are more than
6 categories, try to reduce them by “clubbing”, otherwise the
diagram becomes too overcrowded.
A pie diagram in respect of the data on blood groups among
patients of essential hypertension is presented below after
calculating the angles for the individual categories as in
Fig. - 4 a, b.

 Fig. - 4a : Distribution of patients according to blood group
 [Figure: pie chart; A : 43 %, B : 37 %, AB : 14 %, O : 6 %]

 Fig. - 4b
 Blood group A  = (42.81 / 100) X 360 = 154 degrees
 Blood group B  = (37.08 / 100) X 360 = 134 degrees
 Blood group AB = (14.02 / 100) X 360 = 50 degrees
 Blood group O  = (6.09 / 100) X 360 = 22 degrees

Frequency Curve and Polygon : To construct a frequency curve
and frequency polygon we plot the variable along the X-axis
and the frequencies along the Y-axis. Observed values of the
variable or the midpoints of the class intervals are plotted along
with the corresponding frequency of that class interval. Then
we construct a smooth freehand curve passing through these
points. Such a curve is known as a frequency curve. If instead of
joining the midpoints by a smooth curve, we join the consecutive
points by straight lines, it is called a frequency polygon.
Conventionally, we consider one imaginary value immediately
preceding the first value and one succeeding the last value and
plot them with frequency = 0. An example is given in Table - 7
and Fig. - 5.

 Table - 7 : Distribution of subjects as per age groups
 Age        Midpoints     Number of subjects
 20-25        22.5                2
 25-30        27.5                3
 30-35        32.5                6
 35-40        37.5               14
 40-45        42.5                7
 45-50        47.5                5
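
The angle rule for a pie diagram (percentage multiplied by 360 and divided by 100) can be checked with a few lines of Python. This is only an illustrative sketch; the function name `pie_angles` is hypothetical, and the percentages are those printed in Fig. - 4b. The unrounded angle for blood group B comes out near 133.5 degrees, whereas the figure prints 134; the four printed angles add up to exactly 360.

```python
def pie_angles(percentages):
    """Return the central angle (degrees) for each category of a pie diagram."""
    # angle = percentage / 100 * 360, as described in the text
    return {name: pct / 100 * 360 for name, pct in percentages.items()}


# Percentages of patients by blood group, as printed in Fig. - 4b
blood_groups = {"A": 42.81, "B": 37.08, "AB": 14.02, "O": 6.09}
angles = pie_angles(blood_groups)

for name, angle in angles.items():
    print(f"Blood group {name}: {angle:.1f} degrees")
```

Because the percentages add up to 100, the unrounded angles add up to 360 degrees.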




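
The convention of closing a frequency polygon with one imaginary class on either side, each with frequency 0, can be sketched as follows for the Table - 7 data. The helper name `polygon_points` is hypothetical, and the sketch assumes that all class intervals have the same width.

```python
def polygon_points(midpoints, frequencies):
    """Return the (x, y) vertices of a frequency polygon.

    One imaginary midpoint is added before the first class and one after
    the last class, both with frequency 0 (assumes equal class widths).
    """
    width = midpoints[1] - midpoints[0]
    xs = [midpoints[0] - width] + list(midpoints) + [midpoints[-1] + width]
    ys = [0] + list(frequencies) + [0]
    return list(zip(xs, ys))


# Midpoints and numbers of subjects from Table - 7
midpoints = [22.5, 27.5, 32.5, 37.5, 42.5, 47.5]
subjects = [2, 3, 6, 14, 7, 5]
points = polygon_points(midpoints, subjects)

for x, y in points:
    print(x, y)
```

Joining these points in order with straight lines gives the polygon; the two added points (17.5 and 52.5) bring it down to the X-axis at both ends.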
 Fig. - 5 : Distribution of subjects in different age groups
 [Figure: frequency polygon; Y-axis "Number of subjects" (0 to 16);
 X-axis "Age groups" (15 to 55)]

Stem-and-leaf plots : This presentation is used for quantitative
type of data. To construct a stem-and-leaf plot, we divide each
value into a stem component and a leaf component. The digit
in the tens place becomes the stem component and the digit in
the units place becomes the leaf component. It is of much utility
in quickly assessing whether the data is following a “normal”
distribution or not, by seeing whether the stem and leaf is
showing a bell shape or not. For example consider a sample of
10 values of age in years : 21, 42, 05, 11, 30, 50, 28, 27, 24,
52. Here, 21 has a stem component of 2 and leaf component
of 1. Similarly the second value 42 has a stem component of 4
and leaf component of 2 and so on. The stem values are listed
in numerical order (ascending or descending) to form a vertical
axis. A vertical line is drawn to outline a stem. If the stem
value already exists then the leaf is placed on the right side of
vertical line (Fig. - 6).
The value of each of the leaf is plotted in its appropriate location
on the other side of vertical line as in Fig. - 7.

  Fig. - 6          Fig. - 7
  0 |               0 | 5
  1 |               1 | 1
  2 |               2 | 1 4 7 8
  3 |               3 | 0
  4 |               4 | 2
  5 |               5 | 0 2

To describe the central location, spread and shape of the stem
plot we rotate the stem plot by 90 degrees just to explain it
more clearly as in Fig. - 8.

  Fig. - 8
  [Figure: stem plot rotated by 90 degrees, annotated "Rough estimate
  of the centre or middle observation i.e. median value (27.5)" and
  "Spread of the data"]

Roughly we can say that the spread of data is from 5 to 52
and the median value is between 27 and 28. Regarding the
shape of the distribution, though it will be difficult to make
firm statements about shape when n is small, we can always
determine (Fig. - 9) :
●● Whether data are more or less symmetrical or are extremely
    skewed
●● Whether there is a central cluster or mound
●● Whether there are any outliers

  Fig. - 9
  [Figure]

For the given example we notice the mound (heap) in the
middle of the distribution. There are no outliers.
Histogram : The stem-and-leaf is a good way to explore
distributions. A more traditional approach is to use a histogram.
A histogram is used for quantitative continuous type of data
where, on the X-axis, we plot the quantitative exclusive type
of class intervals and on the Y-axis we plot the frequencies.
The difference between bar charts and histogram is that since
histogram is the best representation for quantitative data
measured on continuous scale, there are no gaps between the
bars. Consider an example of the data on serum cholesterol of
10 subjects (Table - 8 & Fig. - 10)

  Table - 8 : Distribution of the subjects
  Serum cholesterol (mg/dl)     No of subjects     Percentage
  175 – 200                            3               30
  200 – 225                            3               30
  225 – 250                            2               20
  250 – 275                            1               10
  275 – 300                            1               10
  Total                               10              100%
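
The stem-and-leaf construction described in this section (tens digit as stem, units digit as leaf) can be sketched in Python for the ten ages of the example. The function name `stem_and_leaf` is hypothetical, and the sketch assumes non-negative whole numbers below 100.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted list of leaves (units digits)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    # Keep stems that have no leaves, so that gaps in the data stay visible.
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}


# The ten ages (in years) used in the example
ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
plot = stem_and_leaf(ages)

for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

The printed rows correspond to the layout of Fig. - 7, with the stem 2 carrying the leaves 1, 4, 7 and 8.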




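
Counting observations into exclusive class intervals, as needed for a histogram, can be sketched as below. Two assumptions are made here and should be checked against the source data: the ten readings are taken to be the serum cholesterol values listed in the worked example on the arithmetic mean, and an exclusive interval is taken to include its lower limit but not its upper limit. The function name `bin_counts` is hypothetical.

```python
def bin_counts(values, lower, width, n_classes):
    """Count values falling in each class [lower + k*width, lower + (k+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - lower) // width)
        if 0 <= index < n_classes:  # values outside all classes are ignored
            counts[index] += 1
    return counts


# Assumed raw readings (mg/dl) behind Table - 8
cholesterol = [192, 242, 203, 212, 175, 284, 256, 218, 182, 228]
counts = bin_counts(cholesterol, lower=175, width=25, n_classes=5)
percentages = [100 * c / len(cholesterol) for c in counts]

for k, (c, p) in enumerate(zip(counts, percentages)):
    low = 175 + 25 * k
    print(f"{low} - {low + 25}: {c} subjects ({p:.0f}%)")
```

Under these assumptions the counts agree with the frequencies shown in Table - 8.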
  Fig. - 10 : Distribution of subjects according to Serum
  Cholesterol Levels
  [Figure: histogram; Y-axis "% of subjects" (0 to 3.5); X-axis
  "Serum Cholesterol Levels (mg/dl)" (175-200, 200-225, 225-250,
  250-275, 275-300)]

Box-and-Whisker plot : A box-and-whisker plot conveys a great
deal of information to the audience and can be useful for
handling many data values. It allows people to explore data and
to draw informal conclusions when two or more variables are
present. It shows only certain statistics rather than all the
data. The box-and-whisker plot is a visual representation of the
five-number summary. The five-number summary consists of the
median, the quartiles (lower quartile and upper quartile), and
the smallest and greatest values in the distribution. Thus a
box-and-whisker plot displays the center, the spread, and the
overall range of distribution (Fig. - 11)

  Fig. - 11
  [Figure: box-and-whisker plot labelled, from top to bottom,
  Largest Value, Upper Quartile (Q3), Median (Q2), Lower Quartile
  (Q1), Smallest Value]

Line chart : Line chart is used for quantitative data. It is
an excellent method of displaying the changes that occur
in disease frequency over time. It thus helps in assessing
“temporal trends” and helps in displaying data on epidemics or
localised outbreaks in the form of an epidemic curve. In a line
diagram, the rates of disease are plotted along the vertical (Y)
axis. However, in localised outbreaks, with a well demarcated
population that has been at risk (as in sudden outbreaks of food
poisoning) the actual numbers can be plotted on the Y-axis, during
quick investigations. The unit of time, as applicable to the
disease in question, is plotted along the X-axis (horizontal).
This unit of time would be hours in food poisoning, days
(i.e., as per dates of the month) for cholera, weeks for typhoid,
malaria or Hepatitis-A, months for Hepatitis-B and years (or
even decades) for IHD or Lung Cancer.
Scatter Diagram : A scatter diagram gives a quick visual
display of the association between two variables, both of which
are measured on numerical continuous or numerical discrete
scale. An example of scatter plot between age (in months) and
body weight (in kg) of infants is given in Fig. - 12.

  Fig. - 12 : Scatter Diagram of the association between Age
  and Body Weight of infants
  [Figure: scatter plot; Y-axis "Body Weight (Kgs.)" (0 to 12);
  X-axis "Age in months" (0 to 14)]

The scatter diagram in the above figure shows at a glance
that weight and age are associated: as age increases, weight
increases. Be careful to record the dependent variable along the
vertical (Y) axis and the independent variable along the
horizontal (X) axis. In this example weight is dependent on
age (as age increases weight is likely to increase) but age
is not dependent on weight (if weight increases, age will
not necessarily increase). Thus, weight is the dependent
variable, and has been plotted on the Y axis while age is the
independent variable, plotted along the X axis.
Summary
Raw information, which is just a jumble of numbers, collected by
the researcher needs to be presented and displayed in a manner
that it makes sense and can be further processed. Data presented
in an eye-catching way can highlight particular figures and
situations, draw attention to specific information, highlight
hidden patterns and important information and simplify complex
information. Raw information can be presented either in tables
i.e. tabular presentation or in graphs and charts i.e. graphical
presentation. A table consists of rows and columns. The data
is condensed in homogenous groups called class intervals and
the number of individuals falling in each class interval, called
frequency, is displayed. A table is incomplete without a title.
A clear title describing the data completely in concise form is
written. Graphical presentation is used when data needs to be
displayed in charts and graphs. A chart or diagram should have



a clear title describing the data depicted. The X-axis and the
Y-axis should be properly defined along with the scale. Legend
in case of more than one variable or group is necessary. An
optional footnote giving the source of information may be
present. Appropriate graphical presentation should be depicted
depending on whether data is quantitative or qualitative.
While dealing with quantitative data histograms, line chart,
polygon, stem and leaf and box and whisker plots should be
used whereas bar charts, pictograms and pie charts should be
used when dealing with qualitative data.

Study Exercises
Long Question : Discuss the art of effective presentation in
the field of health, in respect of data and information, so as to
convince the decision makers.
Short Notes: (1) Discuss the need for graphical presentation of
data (2) Differentiate between inclusive and exclusive type of
class intervals (3) Box and Whisker Plot (4) Scatter diagram
MCQs
1.	 Which of the following is used for representing qualitative
    data (a) Histogram (b) Polygon (c) Pie chart (d) Line chart
2.	 The scatter plot is used to display (a) Causality
    (b) Correlation (c) Power (d) Type II error
3.	 Five summary plot consists of Quartiles and (a) Median (b)
    Mode (c) Mean (d) Range
4.	 The appropriate method of displaying the changes that
    occur in disease frequency over time (a) Line chart (b) Bar
    chart (c) Histogram (d) Stem and leaf.
5.	 Box and whisker plot is also known as (a) Magical box
    (b) Four summary plot (c) Five summary plot (d) None of
    the above
6.	 The type of diagram useful to detect linear relationship
    between two variables is (a) Histogram (b) Line Chart
    (c) Scatter Plot (d) Bar Chart
7.	 The following table shows the age distribution of cases of a
    certain disease reported during a year in a particular state.
    Which graphical presentation is appropriate to describe
    this data? (a) Pie chart (b) Line chart (c) Histogram
    (d) Pictogram

    Age          Number of cases
    5-14                5
    15-24              10
    25-34             120
    35-44              22
    45-54              13
    55-64               5

8.	 Which graphical presentation is best to describe the
    following data? (a) Multiple bar chart (b) Pie chart
    (c) Histogram (d) Box plot

    Year        Exports (crores of rupees)   Imports (crores of rupees)
    1960-61             610.3                        624.65
    1961-62             955.39                       742.78
    1962-63             660.65                       578.36
    1963-64             585.25                       527.98

9.	 Of the 140 children, 20 lived in owner occupied houses,
    70 lived in council houses and 50 lived in private rented
    accommodation. Type of accommodation is a categorical
    variable. Appropriate graphical presentation will be
    (a) Line chart (b) Simple Bar chart (c) Histogram
    (d) Frequency Polygon
10.	 A study was conducted to assess the awareness of phimosis
    in young infants and children up to 5 years of age. The
    awareness level with respect to the family income is as
    tabulated below. Which graphical presentation is best to
    describe the following data?

    Family income   <2000   2000 – 5000   5000 – 8000   >8000
    Aware             50        62            77          70
    Unaware           50        28            23          30

    (a) Stem & Leaf (b) Pie Chart (c) Multiple Bar Chart
    (d) Component Bar Chart
11.	 Following is the frequency distribution of the serum levels
    of total cholesterol reported in a sample of 71 subjects.
    Which graphical presentation is best to describe the
    following data?

    Serum cholesterol level     Frequency
    <130                            2
    130-150                         7
    150-170                        18
    170-190                        20
    190-210                        15
    210-230                         7
    >230                            2

    (a) Stem & Leaf (b) Pie (c) Histogram
    (d) Component Bar Chart
12.	 Information from the Sports Committee member on
    representation in different games at the state level by
    gender is as given below. Which graphical presentation is
    best to describe the following data

    Different Games     Females     Males
    Long Jump              4          6
    High Jump              2          4
    Shot Put               9         11
    Running               15         10
    Swimming               5          4




    (a) Box plot (b) Histogram (c) Multiple Bar Chart (d) Pie
    chart
13.	 Which graphical presentation is best to describe the
    following data

    Grade of malnutrition     Frequency
    Normal                       60
    Grade I                      30
    Grade II                      7
    Grade III                     2
    Grade IV                      1

    (a) Box Plot (b) Component Bar Chart (c) Histogram (d) Pie
    chart
Answers : (1) c; (2) b; (3) a; (4) a; (5) c; (6) c; (7) c; (8) a; (9) b;
(10) d; (11) c; (12) c; (13) d.

Statistical Exercise
1.	 Following is the population data in a locality, present the
    data in tabular form as well as using appropriate graphs.

    S. No.   Age     S. No.   Age     S. No.   Age
      1      11        11      8        21     16
      2      15        12     12        22     17
      3       6        13     22        23     19
      4      17        14     24        24      8
      5      18        15     16        25      9
      6       7        16     19        26     10
      7      25        17     20        27     24
      8      32        18      9        28     31
      9      12        19     21        29     32
     10      34        20     31        30     37




    40         Summarising the Data: Measures of Central Tendency
               and Variability

                                              Seema R. Patrikar

The huge raw information gathered by the researcher is
organized and condensed in a table or graphical display.
Compiling and presenting the data in tabular or graphical form
will not give complete information of the data collected. We
need to “summarise” the entire data in one figure, looking at
which we can get an overall idea of the data. Thus, the data set
should be meaningfully described using summary measures.
Summary measures provide description of data in terms of
concentration of data and variability existing in data. Having
described our data set we use these summary figures to draw
certain conclusions about the reference population from which
the sample data has been drawn. Thus data is described by two
summary measures namely, measure of central tendency and
measure of variability. Before we discuss in detail the various
measures we should understand the distribution of the data
set.

Measures of Central Tendency
This gives the centrality measure of the data set i.e. where the
observations are concentrated. There are numerous measures
of central tendency. These are : Mean; Median; Mode; Geometric
Mean; Harmonic Mean.

Mean (Arithmetic Mean) or Average
This is the most appropriate measure for data following normal
distribution but not for skewed distributions. It is calculated by
summing all the observations and then dividing by the number of
observations. It is generally denoted by $\bar{x}$. It is calculated as
follows.

$$\text{Mean } (\bar{x}) = \frac{\text{Sum of the values of all observations}}{\text{Total number of observations, that is, the total number of subjects (denoted by } n)}$$

Mathematically,

$$\bar{x} = \frac{\sum_i x_i}{n}$$

It is the simplest of the centrality measures but is influenced by
extreme values and hence at times may give fallacious results.
It depends on all values of the data set but is affected by the
fluctuations of sampling.
Example : The serum cholesterol levels (mg/dl) of 10 subjects
were found to be as follows: 192 242 203 212 175 284 256
218 182 228
We observe that the above data set is of quantitative type.
To calculate the mean the first step is to sum all the values. Thus,

$$\sum_i x_i = 192 + 242 + 203 + \cdots + 228 = 2192$$

The second step is to divide this sum by the total number of
observations (n), which is 10 in our example. Thus,

$$\bar{x} = \frac{\sum_i x_i}{n} = \frac{2192}{10} = 219.2$$

Thus the average value of Serum cholesterol among the 10
subjects studied = 219.2 mg/dl. This summary value of mean
describes our entire data in one value.
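
The two steps of the worked example (sum the observations, then divide by their number) can be reproduced with a short Python sketch. The function name `arithmetic_mean` is only illustrative; the data are the ten serum cholesterol readings listed in the example.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)


# Serum cholesterol (mg/dl) of the 10 subjects in the example
cholesterol = [192, 242, 203, 212, 175, 284, 256, 218, 182, 228]

total = sum(cholesterol)             # step 1: sum of all values
mean = arithmetic_mean(cholesterol)  # step 2: divide by n = 10

print(f"Sum = {total}, n = {len(cholesterol)}, mean = {mean:.1f} mg/dl")
```

For everyday use, Python's standard library also offers `statistics.mean`, which gives the same result for this data.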


Calculation of mean from grouped data : For calculating                     the observations are less, median can be calculated by just
the mean from a “grouped data” we should first find out the                 inspection. Unlike mean, median can be calculated if the extreme
midpoint (class mark) of each class interval which we denote                observation is missing. It is less affected by fluctuations of
by x. (Mid point is calculated by adding the upper limit and                sampling than mean.
the lower limit of the respective class intervals and dividing by           Mode
2). The next step is to multiply the midpoints by the frequency
of that class interval. Summing all these multiplications and               Mode is the most common value that repeats itself in the
then dividing by total sample size yields us the mean value for             data set. Though mode is easy to calculate, at times it may
grouped data.                                                               be impossible to calculate mode if we do not have any value
                                                                            repeating itself in the data set. At other end it may so happen
Consider the following example on 10 subjects on serum                      that we come across two or more values repeating themselves
cholesterol level (mg/dl), put in class interval (Table - 1).               same number of times. In such cases the distribution is said
                                                                            to be bimodal or multimodal.
 Table - 1
                                                                            Geometric Mean
 Serum cholesterol         Midpoint         No. of             x*f
                                                                            Geometric mean is defined as the nth root of the product of
   level (mg/dl)             (x)          subjects (f)
                                                                            observations.
           (a)                (b)             (c )            (bxc)         Mathematically,
 175-199                      187               3              561                                            n
                                                                                                                  x1 x2 x3......... xn
                                                                                                                    *   *         *


 200-224                      212               3              636                         Geometric Mean =

 225-249                      237               2              474          Thus if there are 3 observations in the data set, the first step
                                                                            would be to calculate the product of all the three observations.
 250-274                      262               1              262          The second step would be to take cube root of this product.
 275-299                      287               1              287          Similarly the geometric mean of 4 values would be the 4th root
 Total                                    10 = ∑f         2220 = ∑f         of the product of the four observations.
                                                          x                 The merits of geometric mean are that it is based on all the
                                                                            observations. It is also not much affected by the fluctuations of
The mean, then is calculated as                                             sampling. The disadvantage is that it is not easy to calculate
                                                                            and finds limited use in medical research.
                                                                            Harmonic Mean
Median                                                                      Harmonic mean of a set of values is the reciprocal of the arithmetic
When the data is skewed, another measure of central tendency                mean of the reciprocals of the values. Mathematically,
called median is used. Median is a locative measure which is                                                            n
                                                                                            Harmonic mean =
the middlemost observation after all the values are arranged                                                     1 1          1
in ascending or descending order. In other words median  is                                                     x1 + x2 +.... xn
that -value which divides the entire data set into 2 equal parts,           Thus if there are four values in the data set as 2, 4, 6 and 8,
when the data set is ordered in an ascending (or descending)                the harmonic mean is
fashion. In case when there is odd number of observations we
                                                                                                       4
have a single most middle value which is the median value. In                                                = 3.84
                                                                                                  1 1 1    1
case when even number of observations is present there are                                        2 +4 +6 +8
two middle values and the median is calculated by taking the
mean of these two middle observations. Thus,                                Though harmonic mean is based on all the values, it is not
                                                                            easy to understand and calculate. Like geometric mean this

           {
                 n+1                                  ; when n is odd       also finds limited use in medical research.
Median =
                   2
                 mean of   n th &
                           2        ( n + 1) th obs
                                      2
                                                      ; when n is even      Relationship between the Three Measures of Mean,
                                                                            Median and Mode
                                                                            1.	   For symmetric curve
Let us work on our example of serum cholesterol considered in
                                                                            	     Mean = Median = Mode
calculation of mean for ungrouped data. In the first step, we
                                                                            2.	   For symmetric curve
will order the data set in an ascending order as follows :
                                                                            	     Mean – Mode ≈ 3 (Mean – Median)
175, 182, 192, 203, 212, 218, 228, 242, 256, 284                            3.	   For positively skewed curve
Since n is 10 (even) we have two middle most observations as                	     Mean > Median > Mode
212 and 218 (i.e. the 5th and 6th value)                                    4.	   For negatively skewed curve
                     212 + 218                                              	     Mean < Median < Mode
Therefore, median = --------------- = 215                                   Choice of Central Tendency
                           2                                                We observe that each central tendency discussed above have
Like mean, median is also very easy to calculate. In fact if                some merits and demerits. No one average is good for all types


                                                                        • 228 •
of research. The choice should depend on the type of information collected and the research question the investigator is trying to answer. If the collected data is quantitative and symmetric or approximately symmetric, the measure generally used is the arithmetic mean. But if one or two observations in the series are very big or very small compared to the others, the arithmetic mean gives fallacious conclusions. In such cases (skewed data) the median or mode gives better results. In social and psychological studies which deal with scored observations, or with data not capable of direct quantitative measurement such as socio-economic status, intelligence or pain score, the median or mode is a better measure than the mean. However, the ‘mode’ is generally not used since it is not amenable to statistical analysis.
Measures of Relative Position (Quantiles)
Quantiles are the values that divide a set of numerical data arranged in increasing order into equal parts. Quartiles divide the arranged data into four equal parts of 25% each; thus there are 3 quartiles, Q1, Q2 and Q3. Deciles divide the arranged data into ten equal parts of 10% each; thus there are 9 deciles. Percentiles divide the arranged data into hundred equal parts of 1% each; thus there are 99 percentiles. The 50th percentile, 5th decile and 2nd quartile are all equal to the median.
Measures of Variability
Knowledge of central tendency alone is not sufficient for a complete understanding of a distribution. For example, three series may have the same mean, yet the mean alone throws no light on the composition of the data. To supplement it we need a measure which tells us about the spread of the data. In contrast to measures of central tendency, which describe the center of the data set, measures of variability describe the variability or spread of the observations around the center of the data. The various measures of dispersion are as follows.
●● Range
●● Interquartile range
●● Mean deviation
●● Standard deviation
●● Coefficient of variation
Range
One of the simplest measures of variability is the range. Range is the difference between the two extremes, i.e. the difference between the maximum and minimum observations.
   Range = maximum observation - minimum observation
One of the drawbacks of the range is that it uses only the extreme observations and ignores the rest. This measure is easy to calculate but is affected by fluctuations of sampling. It gives only a rough idea of the dispersion of the data.
Interquartile Range
Just as the range is the difference between the extreme observations, the interquartile range is calculated by taking the difference between the two extreme quartiles.
                 Interquartile range = Q3 - Q1
Quartiles divide the total number of observations into 4 equal parts of 25% each. Thus there are three quartiles (Q1, Q2 and Q3). The second quartile Q2 is equivalent to the middle value, i.e. the median. The interquartile range covers the middle 50% of the values of the data set. Though the interquartile range is easy to calculate, it suffers from the same defects as the range.
Mean Deviation
Mean deviation is the mean of the absolute differences from a constant ‘A’, which can be taken as the mean, median, mode or any constant observation from the data. The formula for mean deviation is given as follows:
$$\text{Mean deviation} = \frac{\sum |x_i - A|}{n}$$
where A may be the mean, median, mode or a constant; $x_i$ is the value of an individual observation; n is the total number of observations; and ∑ is a sign indicating “sum of”. The main drawback of this measure is that it ignores the algebraic signs of the deviations; to overcome this drawback we have another measure of variability called the Variance.
Standard Deviation
Variance is the average of the squared deviations of each individual value from the mean ($\bar{x}$). It is mathematically given as follows:
$$\text{Variance} = \frac{\sum (x_i - \bar{x})^2}{n}$$
Most often we use the square root of the variance, called the Standard Deviation, to describe the data. Variance is expressed in squared units; taking the square root brings the measure back to the same units as the original observations, and hence the standard deviation is the best measure of variability. It is given as follows:
$$\text{Standard Deviation (SD)} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$$
The larger the standard deviation, the larger is the spread of the distribution.
Note: When n is less than 30, the denominator in the variance and standard deviation formulae changes to (n-1).
Let us demonstrate the calculation using our hypothetical data set on serum cholesterol (Table - 2).
$$\text{Standard Deviation (SD)} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \quad ; \text{ since } n < 30$$
$$= \sqrt{\frac{739.84 + 519.84 + \dots + 77.44}{10 - 1}}$$
$$\text{Thus SD} = \sqrt{\frac{10543.6}{9}} = 34.227$$
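The measures of variability described in this section can be checked numerically. The following is a minimal sketch using only Python's standard library, applied to the ten serum cholesterol values of Table - 2. The variable names are illustrative, and the values of Q1 and Q3 depend on the interpolation convention (here the default "exclusive" method of `statistics.quantiles`), which the text does not specify.

```python
import statistics

# Serum cholesterol (mg/dl) of the 10 subjects (Table - 2).
cholesterol = [192, 242, 203, 212, 175, 284, 256, 218, 182, 228]
n = len(cholesterol)
mean = sum(cholesterol) / n

# Range = maximum observation - minimum observation.
data_range = max(cholesterol) - min(cholesterol)

# Quartiles and interquartile range (Q3 - Q1).
# Assumption: the default "exclusive" method of statistics.quantiles;
# other conventions give slightly different Q1 and Q3.
q1, q2, q3 = statistics.quantiles(cholesterol, n=4)
iqr = q3 - q1

# Mean deviation about the mean: average of the absolute deviations.
mean_deviation = sum(abs(x - mean) for x in cholesterol) / n

# Sum of squared deviations from the mean.
sum_sq = sum((x - mean) ** 2 for x in cholesterol)

# n < 30, so the denominator is (n - 1), as in the worked example.
variance = sum_sq / (n - 1)
sd = variance ** 0.5

print("Range:", data_range)
print("Q1, Q2, Q3:", q1, q2, q3, "IQR:", iqr)
print("Mean deviation:", round(mean_deviation, 2))
print("Sum of squared deviations:", round(sum_sq, 1))
print("SD:", round(sd, 3))
```

The second quartile returned here coincides with the median of 215 found earlier, and the standard deviation agrees with `statistics.stdev`, which also uses the (n - 1) denominator.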



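The measures of central tendency discussed earlier in this chapter (grouped mean, median, mode, geometric mean and harmonic mean) can be reproduced in the same way. This is an illustrative sketch using Python's standard `statistics` module; the grouped-data lists simply restate Table - 1, and the four values 2, 4, 6 and 8 are those of the harmonic mean example.

```python
import statistics

# Ungrouped serum cholesterol values (mg/dl) from the worked example.
cholesterol = [175, 182, 192, 203, 212, 218, 228, 242, 256, 284]

# Median: n is even, so it is the mean of the 5th and 6th ordered values.
median = statistics.median(cholesterol)

# Mode(s): multimode returns every value with the highest frequency.
# Here no value repeats, so all ten values are returned (no useful mode).
modes = statistics.multimode(cholesterol)

# Grouped mean from Table - 1: sum(f * x) / sum(f).
midpoints = [187, 212, 237, 262, 287]
frequencies = [3, 3, 2, 1, 1]
grouped_mean = (
    sum(f * x for f, x in zip(frequencies, midpoints)) / sum(frequencies)
)

# Geometric and harmonic means of the values 2, 4, 6 and 8.
values = [2, 4, 6, 8]
geometric_mean = statistics.geometric_mean(values)
harmonic_mean = statistics.harmonic_mean(values)

print("Median:", median)
print("Number of modal values:", len(modes))
print("Grouped mean:", grouped_mean)
print("Geometric mean:", round(geometric_mean, 2))
print("Harmonic mean:", round(harmonic_mean, 2))
```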
 Table - 2
  Sr.     Serum cholesterol     $(x_i - \bar{x})$              $(x_i - \bar{x})^2$
  No           ($x_i$)          ($\bar{x}$ = 219.2)
   1             192            192-219.2 = -27.2        (-27.2)² = 739.84
   2             242            242-219.2 = 22.8         (22.8)² = 519.84
   3             203            -16.2                    262.44
   4             212            -7.2                     51.84
   5             175            -44.2                    1953.64
   6             284            64.8                     4199.04
   7             256            36.8                     1354.24
   8             218            -1.2                     1.44
   9             182            -37.2                    1383.84
  10             228            8.8                      77.44
  Total         2192                                     10543.6

Calculation of Standard deviation in a grouped data : For grouped data the calculation of the standard deviation changes slightly. It is given by the following formula.
$$\text{SD} = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{n}} \quad ; \text{ replace } n \text{ by } n-1 \text{ if observations are less than 30}$$
where $f_i$ is the frequency (i.e. number of subjects in that group), $x_i$ is the midpoint of the group, n = ∑f is the total number of subjects and $\bar{x}$ is the overall mean. Suppose the data on serum cholesterol was grouped, as we had demonstrated earlier in this chapter for calculation of the mean for grouped data. We had calculated the mean as 222. Now in the same table, make more columns as in Table - 3.

 Table - 3
 Serum cholesterol   Midpoint (x)   No. of subjects (f)   $(x - \bar{x})$       $(x - \bar{x})^2$    $f \times (x - \bar{x})^2$
   level (mg/dl)
      175-199            187                3             (187-222) = -35   (-35)² = 1225    3*1225 = 3675
      200-224            212                3             (212-222) = -10        100              300
      225-249            237                2                   15               225              450
      250-274            262                1                   40              1600             1600
      275-299            287                1                   65              4225             4225
       Total                             10 = ∑f                                7375            10250

Thus,
$$\text{SD} = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{10250}{9}} \approx 33.75$$
Coefficient of Variation
Besides the measures of variability discussed above, we have one more important measure called the coefficient of variation, which compares the variability in two data sets. It measures the variability relative to the mean and is calculated as follows:
$$\text{Coefficient of Variation (CV)} = \frac{\text{SD}}{\text{Mean}} \times 100$$
If the coefficient of variation is greater for one data set, it suggests that that data set is more variable than the other.
Thus, any information that is collected by the researcher needs to be described by measures of central tendency and measures of variability. Both the measures together describe the data. Measures of central tendency alone, without a measure of variability, do not give an adequate idea of the data set. Descriptive statistics is critical because it often suggests possible hypotheses for future investigation.
Summary
Raw information is organized and condensed by using tabular and graphical presentations, but compiling and presenting the data in tabular or graphical form will not give complete information about the data collected. We need to “summarise” the entire data in one figure, looking at which we can get an overall idea of the data. Thus, the data set should be meaningfully described using summary measures. Summary measures describe the data in terms of its concentration and its variability. Having described our data set, we use these summary figures to draw conclusions about the reference population from which the sample data has been drawn. Thus data is described by two kinds of summary measures, namely measures of central tendency and measures of variability. Measures of central tendency describe the centrality of the data set. In other words, central tendency tells us where the data is concentrated. If the researcher is dealing with quantitative data, the mean is the best centrality measure, whereas for qualitative data the median and mode describe the data appropriately. Measures of variability give the spread or dispersion of the data. In other words, they describe the scatter of the individual observations around the central value. The simplest measure of variability is the range, which is the difference between the two extreme observations. Other measures of dispersion are mean deviation, variance and standard deviation. Standard deviation is the most commonly used variability measure to describe quantitative data, as it is expressed in the same units as the observations. When commenting on the variability while dealing with two or more groups or techniques, a special measure of variability called the coefficient of variation is used. The group in which the coefficient of variation is more is said to be
more variable than the other. Both measures of central tendency and measures of variability together describe the data set and often suggest possible hypotheses for future investigation.

Study Exercises
Short Notes : (1) Measures of central tendency (2) Measures of Variation
MCQs
1.	Which of the statistical averages takes into account all the numbers equally? (a) Mean (b) Median (c) Mode (d) None of the above
2.	Which of the following is a measure of spread? (a) Variance (b) Mean (c) p value (d) Mode
3.	Which of the following is a measure of location? (a) Variance (b) Mode (c) p value (d) Median
4.	Which among the following is not a measure of variability? (a) Standard deviation (b) Range (c) Median (d) Coefficient of Variation
5.	For a positively skewed curve, which measure of central tendency is largest? (a) Mean (b) Mode (c) Median (d) All are equal
6.	The most common value that repeats itself in the data set is (a) Mean (b) Mode (c) Median (d) All of the above
7.	Variance is the square of (a) p value (b) Mean deviation (c) Standard deviation (d) Coefficient of variation
8.	Percentiles divide the data into _____ equal parts (a) 100 (b) 50 (c) 10 (d) 25
9.	10 babies are born in a hospital on the same day. All weigh 2.8 Kg each. What would be the standard deviation? (a) 0.28 (b) 1 (c) 2.8 (d) 0
10.	To compare the variability in two populations we use this measure (a) Range (b) Coefficient of Variation (c) Median (d) Standard deviation
Answers : (1) a; (2) a; (3) d; (4) c; (5) a; (6) b; (7) c; (8) a; (9) d; (10) b.
Statistical Exercises
1.	A researcher who wanted to know the weights (in Kg) of children of the second standard collected the following information on 15 students: 10, 20, 11, 12, 12, 13, 11, 14, 13, 13, 15, 11, 16, 17, 18. What type of data is it? Calculate the mean, median and mode from the above data. Calculate the mean deviation and standard deviation. (Answer : Mean = 13.7, Median = 13, Mode = 11 & 13, Mean deviation = 2.34, Standard deviation = 2.9)
2.	The heights (cm) of the same students are 95, 110, 98, 100, 102, 102, 99, 103, 104, 103, 106, 99, 108, 108, 109. What type of data is it? What is the scale of measurement? Calculate the mean, median and mode from the above data. Calculate the mean deviation and standard deviation. Between height and weight, which is more variable and why? (Answer : Mean = 103.1, Median = 103, Mode = 99, 102, 103 & 108, Mean deviation = 3.55, Standard deviation = 4.4, Coefficient of variation of weight = 21.17, Coefficient of variation of height = 4.27; hence weight is more variable)
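Two calculations from this chapter, the standard deviation for grouped data (Table - 3) and the coefficient of variation comparison asked for in Statistical Exercise 2, can be sketched as follows. This is only an illustrative check in Python, assuming the (n - 1) denominator because n < 30 as stated in the chapter; the helper name `coefficient_of_variation` is hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """CV (%) = SD / mean * 100, with SD using the (n - 1) denominator."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Grouped standard deviation from Table - 3 (midpoints and frequencies).
midpoints = [187, 212, 237, 262, 287]
frequencies = [3, 3, 2, 1, 1]
n = sum(frequencies)
grouped_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n
grouped_ss = sum(
    f * (x - grouped_mean) ** 2 for f, x in zip(frequencies, midpoints)
)
grouped_sd = (grouped_ss / (n - 1)) ** 0.5
print("Grouped SD:", round(grouped_sd, 2))

# Data from Statistical Exercises 1 and 2 (the same 15 students).
weights_kg = [10, 20, 11, 12, 12, 13, 11, 14, 13, 13, 15, 11, 16, 17, 18]
heights_cm = [95, 110, 98, 100, 102, 102, 99, 103,
              104, 103, 106, 99, 108, 108, 109]

cv_weight = coefficient_of_variation(weights_kg)
cv_height = coefficient_of_variation(heights_cm)
print("CV of weight (%):", round(cv_weight, 1))
print("CV of height (%):", round(cv_height, 1))

# The data set with the larger CV is the more variable one.
more_variable = "weight" if cv_weight > cv_height else "height"
print("More variable:", more_variable)
```

Small differences from the printed answers are expected, because the printed coefficients of variation are obtained from the already rounded SD and mean (2.9 / 13.7 and 4.4 / 103.1).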




    41         Introducing Inferential Statistics : Gaussian Distribution and Central Limit Theorem

                                             Seema R. Patrikar

The Gaussian Distribution or Normal Curve
If we draw a smooth curve passing through the mid points of the bars of a histogram and the curve is bell shaped, then the data is said to be roughly following a normal distribution. Many different types of data distributions are encountered in medicine. The Gaussian or “normal” distribution is among the most important. Its importance stems from the fact that the characteristics of this theoretical distribution underlie many aspects of both descriptive and inferential statistics (Fig. - 1).

 Fig. - 1


Gaussian distribution is one of the important distributions in statistics. Most of the data relating to social and physical sciences conform to this distribution for a sufficiently large number of observations by virtue of the central limit theorem.
Normal distribution was first discovered by the mathematician De Moivre. Carl Friedrich Gauss and Pierre-Simon Laplace used this distribution to describe errors of measurement. Normal distribution is also called the ‘Gaussian distribution’.
A normal curve is determined entirely by the mean and the standard deviation. Hence it is possible to have various normal curves with different standard deviations but the same mean (Fig. - 2a) and various normal curves with different means but the same standard deviation (Fig. - 2b).

 Fig. - 2a : Normal curves with same mean but different standard deviations
 Fig. - 2b : Normal curves with same standard deviation but different means

The normal curve possesses many important properties and is of extreme importance in the theory of errors. The normal distribution is defined by the following characteristics:
●● It is a bell shaped curve, symmetric about the mean.
●● The curve on either side of the mean is a mirror image of the other side.
●● The mean, median and mode coincide.
●● The highest frequency (frequency means the number of observations for a particular value or in a particular class interval) is in the middle around the mean and the lowest at both the extremes, and the frequency decreases smoothly on either side of the mean.
●● The total area under the curve is equal to 1 or 100%.
●● The most important relationship in the normal curve is the area relationship.
The proportional area enclosed between the mean and multiples of SD is constant.
		     Mean ± 1 SD -------> 68% of the total area
		     Mean ± 2 SD -------> 95% of the total area
		     Mean ± 3 SD -------> 99.7% of the total area
Fig. - 3 shows the area enclosed by 1, 2 and 3 SD from the mean.

 Fig. - 3 : Area under the normal curve within Mean ± 1 SD (68 %), Mean ± 2 SD (95 %) and Mean ± 3 SD (99.7 %)

If these criteria are not met, then the distribution is not a Gaussian or normal distribution.
Standard Normal Variate (SNV)
As already specified, a normal frequency curve can be described completely by its mean and standard deviation. Even the same set of data would provide different values for the mean and SD, depending on the choice of unit of measurement. For example, the same person’s height can be expressed as 66 inches or 167.6 cm. An infant’s birth weight can be recorded as 2500 g or 5.5 pounds. Because the units of measurement differ, so do the numbers, although the true height and weight are the same. To eliminate the effect produced by the choice of units of measurement, the data can be put in unit-free form, i.e. the data can be normalized. The first step in transforming the original variable to the normalized variable is to calculate the mean and SD. The normalized values are then calculated by subtracting the mean from the individual values and dividing by the SD. These normalized values are also called the z values.
$$z = \frac{x - \mu}{\sigma}$$
(where x is the individual observation, µ = mean and σ = standard deviation)
When x follows a normal distribution, z follows the standard normal distribution, with mean of 0 and standard deviation of 1. The z values are often called the ‘Standard Normal Variate’.
Central Limit Theorem (CLT)
The CLT is responsible for the following remarkable result:
The distribution of an average tends to be Normal, even when the distribution from which the average is computed is non-Normal.
Furthermore, this normal distribution will have the same mean as the parent distribution, and variance equal to the variance of the parent distribution divided by the sample size (σ²/n).
The central limit theorem states that given a distribution with a mean μ and variance σ², the sampling distribution of the mean approaches a normal distribution with a mean (μ) and a variance σ²/N as N, the sample size, increases. The amazing and counter-intuitive thing about the central limit theorem is that no matter what the shape of the original distribution, the sampling distribution of the mean approaches a normal distribution. Furthermore, for most distributions, a normal distribution is approached very quickly as N increases. Thus, the central limit theorem is the foundation for many statistical procedures.
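The standard normal variate and the area relationship described in this section can be illustrated with a short sketch. It uses `statistics.NormalDist` from Python's standard library. The mean of 66 inches and SD of 3 inches are hypothetical values chosen only for illustration and do not come from the text.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Standard normal variate: z = (x - mean) / sd."""
    return (x - mean) / sd

# Hypothetical height distribution expressed in two units of measurement.
mean_in, sd_in = 66.0, 3.0                      # inches (assumed values)
mean_cm, sd_cm = mean_in * 2.54, sd_in * 2.54   # the same in centimetres

height_in = 69.0
height_cm = height_in * 2.54

# z is unit free: both unit systems give the same standardised value.
z_in = z_score(height_in, mean_in, sd_in)
z_cm = z_score(height_cm, mean_cm, sd_cm)
print(z_in, z_cm)

# Area under the standard normal curve within k SD of the mean.
standard_normal = NormalDist(mu=0.0, sigma=1.0)
areas = {}
for k in (1, 2, 3):
    areas[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"Mean +/- {k} SD: {areas[k] * 100:.1f}% of the total area")
```

The two z values agree because changing the unit rescales the observation, the mean and the SD by the same factor, which cancels in the ratio.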


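The statement of the central limit theorem can also be explored by simulation. The sketch below is an illustration, not part of the original text. It draws repeated samples from a uniform parent distribution on [0, 1] and checks that the sample averages have a mean close to μ = 0.5 and a variance close to σ²/n, where σ² = 1/12 for this parent distribution. The function name, the seed and the number of repetitions are arbitrary choices.

```python
import random
import statistics

def sample_means(n, repetitions=20000, seed=1):
    """Averages of `repetitions` samples of size n drawn from Uniform(0, 1)."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.random() for _ in range(n)])
        for _ in range(repetitions)
    ]

PARENT_MEAN = 0.5          # mean of Uniform(0, 1)
PARENT_VARIANCE = 1 / 12   # variance of Uniform(0, 1)

for n in (2, 8, 32):
    means = sample_means(n)
    observed_mean = statistics.fmean(means)
    observed_var = statistics.pvariance(means)
    print(
        f"n={n:2d}  mean of averages={observed_mean:.3f}  "
        f"variance={observed_var:.5f}  sigma^2/n={PARENT_VARIANCE / n:.5f}"
    )
```

Plotting a histogram of `means` for increasing n would show the flat parent distribution giving way to a bell shaped curve that narrows as n grows.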
To understand the concept of the central limit theorem in detail, let us consider the following example (Fig. - 4a - g).

 Fig. - 4a : Non Normal distribution of x
 The uniform distribution shown in the figure is obviously non-Normal. Call that the parent distribution.

 Fig. - 4b : Distribution of $\bar{x}$ when n=2
 To compute an average, $\bar{x}$, two observations are drawn at random from the parent distribution and averaged. Then another sample of two is drawn and another value of $\bar{x}$ computed. This process is repeated, over and over, and averages of two are computed. The distribution of averages of two is shown in the figure.

 Fig. - 4e : Distribution of $\bar{x}$ when n=8
 Repeatedly taking eight from the parent distribution, and computing the averages, produces the distribution curve shown in the figure.

 Fig. - 4f : Distribution of $\bar{x}$ when n=16
 Repeatedly taking sixteen from the parent distribution, and computing the averages, produces the distribution curve shown in the figure.

 Fig. - 4g : Distribution of $\bar{x}$ when n=32
 Repeatedly taking thirty-two from the parent distribution, and computing the averages, produces the probability density shown in the figure.

 Fig. - 4c : Distribution of $\bar{x}$ when n=3
 Repeatedly taking
 three from the
 parent distribution,            5
                                                                   Thus we notice that when the sample size approaches a couple
 and computing the                                                 dozen, the distribution of the average is very nearly Normal,
 averages, produces the                                            even though the parent distribution looks anything but
 distribution curve as                                             Normal.
 shown on the right.
                                 0                                 Summary
                                     0       0.5               1
                                                                   Normal distribution was first discovered by mathematician
                                                                   De-Moivre. Karl Gauss and Pierre-Simon Laplace used this
 Fig. - 4d : Distribution of x when n=4                            distribution to describe error of measurement. Normal
                                                                   distribution is also called as ‘Gaussian distribution’. When the
                                                                   midpoints of the histograms are joined by smooth curve and
 Repeatedly taking                                                 if the curve resembles a bell shaped curve, the data is said to
 four from the parent            5                                 be approximately normal. The normal distribution is defined
 distribution, and                                                 by certain characteristics. It is a bell shaped symmetric (about
 computing the averages,                                           the mean) curve. The curve on either side of the mean is mirror
 produces the probability                                          image of the other side. The mean, median and mode coincide.
 density on the left.                                              Highest frequency is in the middle around the mean and lowest
                                 0                                 at both the extremes and frequency is decreasing smoothly on
                                     0       0.5               1   either side of the mean. The total area under the curve is equal
                                                                   to 1 or 100%. The most important relationship in the normal
                                                                   curve is the area relationship. The proportional area enclosed
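
The area relationship of the normal curve (the share of the total area lying within 1, 2 and 3 SD of the mean) can be verified numerically. The sketch below is a minimal illustration that uses the standard identity linking the normal curve to the error function; the function name area_within is a hypothetical helper, not something defined in the text.

```python
import math

def area_within(k):
    """Area under a normal curve between mean - k*SD and mean + k*SD.

    Uses the identity P(|Z| <= k) = erf(k / sqrt(2)) for a standard normal Z.
    """
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"Mean ± {k} SD: {100 * area_within(k):.2f}% of the total area")
# Mean ± 1 SD: 68.27% of the total area
# Mean ± 2 SD: 95.45% of the total area
# Mean ± 3 SD: 99.73% of the total area
```

These areas do not depend on the particular mean or SD, which is why the same percentages apply to every normal distribution.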





  • 2. data.

To understand why statistics is necessary, we must begin by addressing the nature of science and experimentation. The characteristic method used by a researcher starting an experiment is to study a relatively small collection of subjects, as complete population based studies are time consuming, laborious, costly and resource intensive. The researcher draws a subset of the population, called a “sample”, and studies this sample in depth. The conclusions drawn after analyzing the sample are not restricted to the sample but are extrapolated to the population, i.e. people in general. Thus Statistics is the mathematical method by which the uncertainty inherent in the scientific method is rigorously quantified.

Summary
In recent times, the use of Statistics as a tool to describe various phenomena has been increasing in the biological sciences and health related fields, so much so that, irrespective of the sphere of investigation, a research worker has to plan his/her experiments in such a manner that the kind of conclusions he/she intends to draw are technically valid. Statistics comes to the researcher's aid at the stages of planning of the experiment, collection of data, analysis, and interpretation of the measures computed during the analysis. Biostatistics is defined as the application of statistical methods to medical, biological and public health related problems. Statistics is broadly categorized into descriptive statistics and inferential statistics. Descriptive statistics describes the data in meaningful tables or graphs so that the hidden pattern is brought out. Condensing complex data into a simple format and describing it with summary measures is part of descriptive statistics. Inferential statistics, on the other hand, deals with drawing inferences and taking decisions by studying a subset or sample from the population.

Study Exercises
Short Notes : (1) Differentiate between descriptive and inferential statistics (2) Describe briefly various scales of measurement.

MCQs & Exercises
1. An 85 year old man is rushed to the emergency department by ambulance during an episode of chest pain. The preliminary assessment of the condition of the man is performed by a nurse, who reports that the patient's pain seems to be ‘severe’. The characterization of pain as ‘severe’ is (a) Dichotomous (b) Nominal (c) Quantitative (d) Qualitative
2. If we ask the patient attending OPD to evaluate his pain on a scale of 0 (no pain) to 5 (the worst pain), then this commonly applied scale is a (a) Dichotomous (b) Ratio scale (c) Continuous (d) Nominal
3. For each of the following variables indicate whether it is quantitative or qualitative and specify the measurement scale for each variable : (a) Blood Pressure (mmHg) (b) Cholesterol (mmol/l) (c) Diabetes (Yes/No) (d) Body Mass Index (Kg/m2) (e) Age (years) (f) Sex (female/male) (g) Employment (paid work/retired/housewife) (h) Smoking Status (smokers/non-smokers, ex-smokers) (i) Exercise (hours per week) (j) Drink alcohol (units per week) (k) Level of pain (mild/moderate/severe)
Answers : (1) d; (2) b; (3) (a) Quantitative continuous; (b) Quantitative continuous; (c) Qualitative dichotomous; (d) Quantitative continuous; (e) Quantitative continuous; (f) Qualitative dichotomous; (g) Qualitative nominal; (h) Qualitative nominal; (i) Quantitative discrete; (j) Quantitative discrete; (k) Qualitative ordinal.

39 Descriptive Statistics: Displaying the Data
Seema R. Patrikar

The observations made on the subjects one after the other are called raw data. Raw data are often little more than a jumble of numbers and hence very difficult to handle. Data are collected by researchers so that they can answer the research question they started with. Raw data become useful only when they are arranged and organized in a manner that lets us extract information from the data and communicate it to others. In other words, data should be processed and subjected to further analysis. This is possible through data depiction, data summarization and data transformation.

The first step in handling the data, after they have been collected, is to ‘reduce’ and summarise them so that they become understandable; only then can meaningful conclusions be drawn. Data can be displayed in either tabular form or graphical form. Tables are used to categorize and summarize data, while graphs are used to provide an overall visual representation. To develop graphs and diagrams, we first need to condense the data in a table.

Understanding as to how the Data have been Recorded
Before we start summarizing or further analyzing the data, we should be very clear on which ‘scale’ they have been recorded (i.e. qualitative or quantitative; and whether continuous, discrete, ordinal, polychotomous or dichotomous). The details have already been covered earlier in the chapter on variables and scales of measurement (section on epidemiology) and the
  • 3. readers should quickly revise that chapter before proceeding.

Ordered Data
When the data are organized in order of magnitude from the smallest value to the largest value, the result is called an ordered array. For example, consider the ages (in years) of 11 subjects undergoing a tobacco cessation programme: 16, 27, 34, 41, 38, 53, 65, 52, 20, 26, 68. When we arrange these ages in increasing order of magnitude we get the ordered array 16, 20, 26, 27, 34, 38, 41, 52, 53, 65, 68. After observing the ordered array we can quickly determine that the youngest person is 16 years old and the oldest 68 years. We can also easily state that almost 55% of the subjects (6 of 11) are below 40 years of age, and that the midway person is aged 38 years.

Grouped Data - Frequency Table
Besides arranging the data in an ordered array, grouping of data is another useful way of summarizing them. We classify the data into appropriate groups, which are called “classes”. The basic purpose behind classification or grouping is to help comparison and to accommodate a large number of observations in a few classes only, by condensation, so that similarities and dissimilarities can be easily brought out. It also highlights important features and pinpoints the most significant ones at a glance.

Table 1 shows a set of raw data obtained from a cross-sectional survey of a random sample of 100 children under one year of age for malnutrition status. Information regarding age and sex of the child was also collected. We will use these data to illustrate the construction of various tables. If we show the distribution of children as per age, the result is called a simple table, as only one variable is considered.

Table - 1 : Raw data on malnutrition status (malnourished and normal) for 100 children below one year of age
Child   Sex   Age (months)   Malnutrition Status
1       f     6              Normal
2       m     4              Malnourished
3       m     2              Malnourished
4       m     5              Normal
5       m     3              Normal
6       f     1              Normal
7       m     5              Normal
8       f     8              Normal
9       f     7              Normal
10      f     9              Normal
11      f     10             Normal
12      f     2              Normal
13      m     4              Malnourished
14      f     6              Normal
15      m     8              Normal
16      f     1              Malnourished
17      f     2              Normal
18      m     11             Normal
19      m     12             Normal
20      m     11             Malnourished
21      m     10             Normal
22      f     9              Normal
23      f     5              Normal
24      f     6              Normal
25      m     4              Normal
26      f     7              Normal
27      f     11             Normal
28      f     12             Normal
29      m     10             Malnourished
30      m     4              Normal
31      m     6              Normal
32      m     8              Normal
33      m     12             Malnourished
34      m     1              Malnourished
35      m     1              Normal
36      f     3              Normal
37      m     5              Normal
38      f     6              Normal
39      f     8              Normal
40      f     9              Normal
41      f     10             Malnourished
42      m     1              Normal
43      f     12             Malnourished
44      f     2              Malnourished
45      m     1              Normal
46      m     6              Normal
47      m     4              Normal
48      f     9              Normal
49      f     4              Normal
50      m     9              Normal
51      m     7              Normal
52      m     6              Normal
53      m     4              Normal
54      f     2              Normal
55      m     5              Normal
56      m     3              Normal
57      f     1              Normal
58      m     5              Normal
  • 4. Table - 1 (continued)
Child   Sex   Age (months)   Malnutrition Status
60      m     7              Malnourished
61      m     9              Normal
62      m     10             Normal
63      f     2              Normal
64      f     4              Normal
65      f     6              Normal
66      f     8              Normal
67      m     1              Normal
68      m     2              Normal
69      m     11             Normal
70      f     12             Normal
71      m     11             Normal
72      m     10             Malnourished
73      f     9              Normal
74      f     5              Normal
75      f     6              Normal
76      m     4              Normal
77      m     7              Normal
78      m     11             Normal
79      f     12             Normal
80      f     10             Normal
81      m     4              Normal
82      m     6              Malnourished
83      m     8              Normal
84      m     12             Normal
85      m     1              Normal
86      m     1              Normal
87      m     3              Normal
88      f     5              Normal
89      m     6              Normal
90      f     8              Normal
91      f     9              Normal
92      f     10             Normal
93      f     1              Normal
94      m     12             Normal
95      m     2              Normal
96      f     1              Normal
97      m     6              Normal
98      f     4              Malnourished
99      f     9              Malnourished
100     m     4              Normal

Steps in Making a Summary Table for the Data
To group a set of observations we select a set of contiguous, non-overlapping intervals such that each value in the set of observations can be placed in one and only one of the intervals. These intervals are usually referred to as class intervals. For example, the above data can be grouped into the age groups 1-4, 5-8 and 9-12. The class interval 1-4 includes the values 1, 2, 3 and 4. The smallest value, 1, is called its lower class limit, whereas the highest value, 4, is called its upper class limit. The middle value of 1-4, i.e. 2.5, is called the midpoint or class mark. The number of subjects falling in the class interval 1-4 is called its class frequency. Such a presentation of data in class intervals along with frequencies is called a frequency distribution. When both limits are included in the range of values of the interval, the class intervals are known as inclusive class intervals (e.g. 1-4, 5-8, 9-12, etc.). When the lower boundary is included but the upper limit is excluded from the range of values, the class intervals are known as exclusive class intervals (e.g. 1-5, 5-9, 9-13, etc.). This latter type of class interval is suitable for a continuous variable. Tables can be formed for qualitative variables also.

Tables 2 and 3 display tabulations for a quantitative and a qualitative variable respectively.

Table - 2 : Age distribution of the 100 children
Age group (months)   Number of children
1-4                  36
5-8                  33
9-12                 31
Total                100

Table - 3 : Distribution of malnourishment in 100 children
Malnourishment Status   Number of children
Malnourished            17
Normal                  83
Total                   100

A tabulation which takes only one variable for classification is called a one way table. When two variables are involved the table is referred to as a cross tabulation or two way table. For example, Table - 4 displays the age and sex distribution of the children and Table - 5 displays the distribution of malnourishment status and sex of the children.

Table - 4 : Age and sex distribution of 100 children
Age group (months)   Female   Male   Total
1-4                  14       22     36
5-8                  15       18     33
9-12                 16       15     31
Total                45       55     100
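The grouping steps described above (inclusive class intervals, class marks and class frequencies) can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_distribution` is hypothetical, and the input is a small subset of Table 1 (the ages of children 1 to 16), so the frequencies are not those of Table 2.

```python
from collections import Counter

# Ages (months) of children 1-16 from Table 1, used as a small illustrative subset.
ages = [6, 4, 2, 5, 3, 1, 5, 8, 7, 9, 10, 2, 4, 6, 8, 1]

# Inclusive class intervals: both the lower and the upper limit belong to the class.
class_intervals = [(1, 4), (5, 8), (9, 12)]


def frequency_distribution(values, intervals):
    """Return (lower limit, upper limit, midpoint, frequency) for each class interval."""
    counts = Counter()
    for value in values:
        for lower, upper in intervals:
            if lower <= value <= upper:
                counts[(lower, upper)] += 1
                break
        else:
            # Every observation must fall in one and only one class interval.
            raise ValueError(f"{value} falls outside every class interval")
    return [(lower, upper, (lower + upper) / 2, counts[(lower, upper)])
            for lower, upper in intervals]


table = frequency_distribution(ages, class_intervals)
for lower, upper, midpoint, frequency in table:
    print(f"{lower}-{upper}\tmidpoint {midpoint}\tfrequency {frequency}")
print("Total", sum(row[3] for row in table))
```

Raising an error for a value that fits no interval is a design choice made here so that the requirement "each value can be placed in one and only one of the intervals" is checked rather than silently ignored.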
  • 5. Table - 5 : Malnourishment status and sex distribution of children
Malnourishment Status   Female   Male   Total
Malnourished            6        11     17
Normal                  39       44     83
Total                   45       55     100

How to Decide on the Number of Class Intervals?
When data are to be grouped, we must decide on the number of class intervals. Too few class intervals would result in loss of information. On the other hand, too many class intervals would not bring out the hidden pattern. The rule of thumb is that we should have no fewer than 5 and no more than 15 class intervals. To be specific, experts have suggested the following formula for the approximate number of class intervals (K):

$K = 1 + 3.322 \log_{10} N$, rounded to the nearest integer, where N is the number of values or observations under consideration.

For example, if N = 25 we have $K = 1 + 3.322 \log_{10} 25 \approx 5.6$, i.e. approximately 6 class intervals.

Having decided the number of class intervals, the next step is to decide the width of the class interval. The width of the class interval is taken as :

$\text{Width} = \dfrac{\text{Maximum observed value} - \text{Minimum observed value (= Range)}}{\text{Number of class intervals } (K)}$

The class limits should preferably be rounded figures, and the class intervals should be non-overlapping and must cover the range of the observed data. As far as possible the percentages and totals should be calculated column wise.

Graphical Presentation of Data
The tabular presentation discussed above shows the distribution of subjects in various groups or classes. This tabular representation of the frequency distribution is useful for further analysis and conclusions. But it is difficult for a layman to understand a complex distribution of data in tabular form. Graphical presentation of data is better understood and appreciated. Graphical representation brings out the hidden patterns and trends of complex data sets.
Thus the reason for displaying data graphically is twofold:
1) Investigators can have a better look at the information collected and the distribution of data, and
2) The information can be communicated to others quickly.
We shall discuss in detail some of the commonly used graphical presentations.

Bar Charts : Bar charts are used for qualitative variables. The variable studied is plotted in the form of bars along the X-axis (horizontal), and the height of each bar is equal to the percentage or frequency, which is plotted along the Y-axis (vertical). The width of the bars is kept constant for all the categories and the space between the bars also remains constant throughout. The number of subjects, along with percentages in brackets, may be written on top of each bar. When we draw a bar chart with only one variable or a single group it is called a simple bar chart, and when two variables or two groups are considered it is called a multiple bar chart. In a multiple bar chart the two bars representing the two variables are drawn adjacent to each other and an equal width of the bars is maintained. The third type of bar chart is the component bar chart, wherein we have two qualitative variables which are further segregated into different categories or components. Here the total height of the bar corresponding to one variable is further sub-divided into the different components or categories of the other variable. For example, consider the following data (Table - 6), which show the findings of a hypothetical research work intended to describe the pattern of blood groups among patients of essential hypertension.

Table - 6 : Distribution of blood group of patients of essential hypertension
Blood Group   Number of patients (frequency)   Percentage
A             232                              42.81
B             201                              37.08
AB            76                               14.02
O             33                               6.09
Total         542                              100.00

A simple bar chart of the above data on blood groups among patients of essential hypertension is shown in Fig. - 1.
Similarly, a multiple bar chart of the data in Table - 5 on the distribution of malnourishment status among males and females is shown in Fig. - 2.
The same information can also be depicted in the form of a component bar chart, as in Fig. - 3.

Fig. - 1 : Distribution of blood groups of patients with essential hypertension (X-axis: Blood Groups A, B, AB, O; Y-axis: Frequency)
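The rule for the approximate number of class intervals and the width formula given earlier in this section can be sketched as follows. The function names are hypothetical, and the constant 3.322 (that is, 1 / log10 2) is an assumption taken from the usual statement of this rule. The example uses N = 100 observations with ages ranging from 1 to 12 months, as in Table 1.

```python
import math


def number_of_classes(n):
    """Approximate number of class intervals: K = 1 + 3.322 * log10(N), rounded."""
    return round(1 + 3.322 * math.log10(n))


def class_width(minimum, maximum, k):
    """Width of a class interval: range of the data divided by the number of classes."""
    return (maximum - minimum) / k


n_children = 100            # number of observations in Table 1
k = number_of_classes(n_children)
width = class_width(1, 12, k)   # ages run from 1 to 12 months
print(f"K = {k}, width = {width}")
```

The computed width is only a starting point: as the text notes, class limits should preferably be rounded figures, so in practice the width would be rounded to a convenient value.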
  • 6. Fig. - 2 : Multiple Bar Chart showing the distribution of malnourishment status in males and females (Y-axis: Number; bars: Malnourished: Females 6, Males 11; Normal: Females 39, Males 44)

Fig. - 3 : Component Bar Chart showing the distribution of malnourishment status in males and females (X-axis: Female, Male; Y-axis: Number; components: Malnourished, Normal)

Pie Chart : Another interesting method of displaying categorical (qualitative) data is the pie diagram, also called a circular diagram. A pie diagram is essentially a circle in which the angle at the centre for each category is equal to its proportion multiplied by 360 (or, more easily, its percentage multiplied by 360 and divided by 100). A pie diagram is best when the total number of categories is between 2 and 6. If there are more than 6 categories, try to reduce them by “clubbing”, otherwise the diagram becomes overcrowded. A pie diagram of the data on blood groups among patients of essential hypertension is presented in Fig. - 4 a, after calculating the angles for the individual categories as in Fig. - 4 b.

Fig. - 4a : Distribution of patients according to blood group (pie chart: A 43%, B 37%, AB 14%, O 6%)

Fig. - 4b
Blood group A = $\frac{42.81}{100} \times 360 \approx 154$ degrees
Blood group B = $\frac{37.08}{100} \times 360 \approx 134$ degrees
Blood group AB = $\frac{14.02}{100} \times 360 \approx 50$ degrees
Blood group O = $\frac{6.09}{100} \times 360 \approx 22$ degrees

Frequency Curve and Polygon : To construct a frequency curve or a frequency polygon we plot the variable along the X-axis and the frequencies along the Y-axis. The observed values of the variable, or the midpoints of the class intervals, are plotted against the corresponding frequency. We then draw a smooth freehand curve passing through these points. Such a curve is known as a frequency curve. If, instead of joining the midpoints by a smooth curve, we join consecutive points by straight lines, the result is called a frequency polygon. Conventionally, we consider one imaginary value immediately preceding the first value and one succeeding the last value, and plot them with frequency = 0. An example is given in Table - 7 and Fig. - 5.

Table - 7 : Distribution of subjects as per age groups
Age     Midpoints   Number of subjects
20-25   22.5        2
25-30   27.5        3
30-35   32.5        6
35-40   37.5        14
40-45   42.5        7
45-50   47.5        5
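The angle rule for the pie diagram described above (proportion multiplied by 360) can be checked with a short sketch. The function name `pie_angles` is hypothetical; the frequencies are those of the blood group data in Table 6, and the angles are computed directly from the frequencies rather than from the rounded percentages.

```python
def pie_angles(frequencies):
    """Angle at the centre for each category: (frequency / total) * 360 degrees."""
    total = sum(frequencies.values())
    return {category: frequency / total * 360
            for category, frequency in frequencies.items()}


# Blood group frequencies from Table 6 (patients of essential hypertension).
blood_groups = {"A": 232, "B": 201, "AB": 76, "O": 33}

angles = pie_angles(blood_groups)
for group, angle in angles.items():
    print(f"Blood group {group}: {angle:.1f} degrees")
print(f"Sum of angles: {sum(angles.values()):.1f} degrees")
```

Whatever the data, the angles must add up to 360 degrees, which is a quick check that no category has been left out.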
  • 7. Fig. - 5 : Distribution of subjects in different age groups (X-axis: Age groups; Y-axis: Number of subjects)

Stem-and-leaf plots : This presentation is used for quantitative data. To construct a stem-and-leaf plot, we divide each value into a stem component and a leaf component. The digit in the tens place becomes the stem and the digit in the units place becomes the leaf. The plot is very useful for quickly assessing whether the data follow a “normal” distribution, by seeing whether the stem-and-leaf plot shows a bell shape or not. For example, consider a sample of 10 values of age in years : 21, 42, 05, 11, 30, 50, 28, 27, 24, 52. Here, 21 has a stem of 2 and a leaf of 1. Similarly the second value, 42, has a stem of 4 and a leaf of 2, and so on. The stem values are listed in numerical order (ascending or descending) to form a vertical axis, and a vertical line is drawn to outline the stems (Fig. - 6). Each leaf is then plotted in its appropriate location on the other side of the vertical line; if the stem value already exists, the leaf is placed to the right of the leaves already there (Fig. - 7).

To describe the central location, spread and shape of the stem plot we rotate it by 90 degrees, just to explain it more clearly, as in Fig. - 8. Roughly, we can say that the spread of the data is from 5 to 52 and that the median value is between 27 and 28. Regarding the shape of the distribution, although it is difficult to make firm statements when n is small, we can always determine (Fig. - 9) :
●● Whether data are more or less symmetrical or are extremely skewed
●● Whether there is a central cluster or mound
●● Whether there are any outliers

Fig. - 6
0 |
1 |
2 |
3 |
4 |
5 |

Fig. - 7
0 | 5
1 | 1
2 | 1 4 7 8
3 | 0
4 | 2
5 | 0 2

Fig. - 8 (annotations: rough estimate of the centre or middle observation, i.e. median value (27.5); spread of the data)

Fig. - 9

For the given example we notice the mound (heap) in the middle of the distribution. There are no outliers.

Histogram : The stem-and-leaf plot is a good way to explore distributions. A more traditional approach is to use a histogram. A histogram is used for quantitative continuous data. On the X-axis we plot the exclusive type of class intervals and on the Y-axis we plot the frequencies. The difference between a bar chart and a histogram is that, since a histogram represents quantitative data measured on a continuous scale, there are no gaps between the bars. Consider an example of data on the serum cholesterol of 10 subjects (Table - 8 & Fig. - 10).

Table - 8 : Distribution of the subjects
Serum cholesterol (mg/dl)   No of subjects   Percentage
175 – 200                   3                30
200 – 225                   3                30
225 – 250                   2                20
250 – 275                   1                10
275 – 300                   1                10
Total                       10               100%
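The construction of the stem-and-leaf plot described above can be sketched in Python. The function name `stem_and_leaf` is hypothetical, and the sketch assumes non-negative whole numbers with the tens digit as stem and the units digit as leaf, as in the example of 10 ages.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens digit) to the sorted list of its leaves (units digits)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


# The 10 ages (years) used in the example above.
ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]

plot = stem_and_leaf(ages)
# List every stem between the smallest and the largest, even one with no leaves,
# so that gaps in the distribution remain visible.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Sorting the values first means the leaves on each stem appear in ascending order, which makes the middle observations easy to read off.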
  • 8. Fig. - 10 : Distribution of subjects according to Serum Cholesterol Levels (X-axis: Serum Cholesterol Levels (mg/dl), classes 175-200, 200-225, 225-250, 250-275, 275-300; Y-axis: % of subjects)

Box-and-Whisker plot : A box-and-whisker plot conveys a great deal of information to the audience and can be useful for handling many data values. It allows people to explore data and to draw informal conclusions when two or more variables are present. It shows only certain statistics rather than all the data. Five-number summary is another name for the visual representation given by the box-and-whisker plot. The five-number summary consists of the median, the quartiles (lower quartile and upper quartile), and the smallest and greatest values in the distribution. Thus a box-and-whisker plot displays the centre, the spread and the overall range of the distribution (Fig. - 11).

Fig. - 11 (labels from top to bottom: Largest Value, Upper Quartile (Q3), Median Quartile (Q2), Lower Quartile (Q1), Smallest value)

Line chart : A line chart is used for quantitative data. It is an excellent method of displaying the changes that occur in disease frequency over time. It thus helps in assessing “temporal trends” and in displaying data on epidemics or localised outbreaks in the form of an epidemic curve. In a line diagram, the rates of disease are plotted along the vertical (Y) axis. However, in localised outbreaks with a well demarcated population at risk (such as sudden outbreaks of food poisoning), the actual numbers can be plotted on the Y-axis during quick investigations. The unit of time, as applicable to the disease in question, is plotted along the X-axis (horizontal). This unit of time would be hours for food poisoning, days (i.e. dates of the month) for cholera, weeks for typhoid, malaria or Hepatitis-A, months for Hepatitis-B, and years (or even decades) for IHD or lung cancer.

Scatter Diagram : A scatter diagram gives a quick visual display of the association between two variables, both of which are measured on a numerical continuous or numerical discrete scale. An example of a scatter plot of age (in months) against body weight (in kg) of infants is given in Fig. - 12.

Fig. - 12 : Scatter Diagram of the association between Age and Body Weight of infants (X-axis: Age in months; Y-axis: Body Weight (Kgs.))

The scatter diagram in Fig. - 12 shows at once that weight and age are associated: as age increases, weight increases. Be careful to record the dependent variable along the vertical (Y) axis and the independent variable along the horizontal (X) axis. In this example weight is dependent on age (as age increases, weight is likely to increase) but age is not dependent on weight (if weight increases, age will not necessarily increase). Thus weight is the dependent variable and has been plotted on the Y-axis, while age is the independent variable, plotted along the X-axis.

Summary
Raw information collected by the researcher, which is just a jumble of numbers, needs to be presented and displayed in a manner that makes sense and allows further processing. Data presented in an eye-catching way can highlight particular figures and situations, draw attention to specific information, bring out hidden patterns and simplify complex information. Raw information can be presented either in tables (tabular presentation) or in graphs and charts (graphical presentation). A table consists of rows and columns. The data are condensed into homogeneous groups called class intervals, and the number of individuals falling in each class interval, called the frequency, is displayed. A table is incomplete without a title; a clear title describing the data completely and concisely should be written. Graphical presentation is used when data need to be displayed in charts and graphs. A chart or diagram should have
  • 9. a clear title describing the data depicted. The X-axis and the Y-axis should be properly defined along with the scale. A legend is necessary when there is more than one variable or group. An optional footnote giving the source of information may be present. The appropriate graphical presentation depends on whether the data are quantitative or qualitative. For quantitative data, histograms, line charts, polygons, stem-and-leaf plots and box-and-whisker plots should be used, whereas bar charts, pictograms and pie charts should be used for qualitative data.

Study Exercises
Long Question : Discuss the art of effective presentation in the field of health, in respect of data and information, so as to convince the makers of decision.
Short Notes: (1) Discuss the need for graphical presentation of data (2) Differentiate between inclusive and exclusive type of class intervals (3) Box and Whisker Plot (4) Scatter diagram

MCQs
1. Which of the following is used for representing qualitative data (a) Histogram (b) Polygon (c) Pie chart (d) Line chart
2. The scatter plot is used to display (a) Causality (b) Correlation (c) Power (d) Type II error
3. Five summary plot consists of Quartiles and (a) Median (b) Mode (c) Mean (d) Range
4. The appropriate method of displaying the changes that occur in disease frequency over time (a) Line chart (b) Bar chart (c) Histogram (d) Stem and leaf.
5. Box and whisker plot is also known as (a) Magical box (b) Four summary plot (c) Five summary plot (d) None of the above
6. The type of diagram useful to detect linear relationship between two variables is (a) Histogram (b) Line Chart (c) Scatter Plot (d) Bar Chart
7. The following table shows the age distribution of cases of a certain disease reported during a year in a particular state. Which graphical presentation is appropriate to describe this data? (a) Pie chart (b) Line chart (c) Histogram (d) Pictogram

Age     Number of cases
5-14    5
15-24   10
25-34   120
35-44   22
45-54   13
55-64   5

8. Which graphical presentation is best to describe the following data? (a) Multiple bar chart (b) Pie chart (c) Histogram (d) Box plot

Year      Exports (crores of rupees)   Imports (crores of rupees)
1960-61   610.3                        624.65
1961-62   955.39                       742.78
1962-63   660.65                       578.36
1963-64   585.25                       527.98

9. Of the 140 children, 20 lived in owner occupied houses, 70 lived in council houses and 50 lived in private rented accommodation. Type of accommodation is a categorical variable. Appropriate graphical presentation will be (a) Line chart (b) Simple Bar chart (c) Histogram (d) Frequency Polygon
10. A study was conducted to assess the awareness of phimosis in young infants and children up to 5 years of age. The awareness level with respect to the family income is as tabulated below. Which graphical presentation is best to describe the following data? (a) Stem & Leaf (b) Pie Chart (c) Multiple Bar Chart (d) Component Bar Chart

          <2000   2000 – 5000   5000 – 8000   >8000
Aware     50      62            77            70
Unaware   50      28            23            30

11. Following is the frequency distribution of the serum levels of total cholesterol reported in a sample of 71 subjects. Which graphical presentation is best to describe the following data? (a) Stem & Leaf (b) Pie (c) Histogram (d) Component Bar Chart

Serum cholesterol level   Frequency
<130                      2
130-150                   7
150-170                   18
170-190                   20
190-210                   15
210-230                   7
>230                      2

12. Information from the Sports Committee member on representation in different games at the state level by gender is as given below. Which graphical presentation is best to describe the following data

Different Games   Females   Males
Long Jump         4         6
High Jump         2         4
Shot Put          9         11
Running           15        10
Swimming          5         4
  • 10. (a) Box plot (b) Histogram (c) Multiple Bar Chart (d) Pie chart
13. Which graphical presentation is best to describe the following data (a) Box Plot (b) Component Bar Chart (c) Histogram (d) Pie chart

Grade of malnutrition   Frequency
Normal                  60
Grade I                 30
Grade II                7
Grade III               2
Grade IV                1

Answers : (1) c; (2) b; (3) a; (4) a; (5) c; (6) c; (7) c; (8) a; (9) b; (10) d; (11) c; (12) c; (13) d.

Statistical Exercise
1. Following is the population data in a locality, present the data in tabular form as well as using appropriate graphs.

S. No.   Age   S. No.   Age   S. No.   Age
1        11    11       8     21       16
2        15    12       12    22       17
3        6     13       22    23       19
4        17    14       24    24       8
5        18    15       16    25       9
6        7     16       19    26       10
7        25    17       20    27       24
8        32    18       9     28       31
9        12    19       21    29       32
10       34    20       31    30       37

40 Summarising the Data: Measures of Central Tendency and Variability
Seema R. Patrikar

The huge raw information gathered by the researcher is organized and condensed in a table or graphical display. Compiling and presenting the data in tabular or graphical form will not, however, give complete information about the data collected. We need to “summarise” the entire data in one figure, looking at which we can get an overall idea of the data. Thus, the data set should be meaningfully described using summary measures. Summary measures describe the data in terms of the concentration of the data and the variability existing in the data. Having described our data set, we use these summary figures to draw conclusions about the reference population from which the sample has been drawn. Thus data are described by two kinds of summary measure, namely measures of central tendency and measures of variability. Before we discuss the various measures in detail, we should understand the distribution of the data set.

Measures of Central Tendency
A measure of central tendency gives the centrality of the data set, i.e. where the observations are concentrated. There are numerous measures of central tendency. These are : Mean; Median; Mode; Geometric Mean; Harmonic Mean.

Mean (Arithmetic Mean) or Average
This is the most appropriate measure for data following a normal distribution, but not for skewed distributions. It is calculated by summing all the observations and then dividing by the number of observations. It is generally denoted by $\bar{x}$ and is calculated as follows.

$\text{Mean } (\bar{x}) = \dfrac{\text{Sum of the values of all observations}}{\text{Total number of observations, that is, the total number of subjects (denoted by } n)}$

Mathematically,

$\bar{x} = \dfrac{\sum_i x_i}{n}$

It is the simplest of the measures of centrality, but it is influenced by extreme values and hence may at times give fallacious results. It depends on all values of the data set but is affected by the fluctuations of sampling.

Example : The serum cholesterol levels (mg/dl) of 10 subjects were found to be as follows: 192 242 203 212 175 284 256 218 182 228

We observe that the above data set is of quantitative type. To calculate the mean, the first step is to sum all the values. Thus,

$\sum_i x_i = 192 + 242 + 203 + \dots + 228 = 2192$

The second step is to divide this sum by the total number of observations (n), which is 10 in our example. Thus,

$\bar{x} = \dfrac{\sum_i x_i}{n} = \dfrac{2192}{10} = 219.2$

Thus the average value of serum cholesterol among the 10 subjects studied is 219.2 mg/dl. This summary value, the mean, describes our entire data in one value.
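The two steps of the calculation above (sum the values, then divide by n) can be reproduced with a short Python sketch. The function name `arithmetic_mean` is hypothetical. The last lines illustrate the remark that the mean is influenced by extreme values, using a made-up data set in which the largest reading, 284, is replaced by 584; that altered value is an assumption for illustration only and is not part of the example data.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)


# Serum cholesterol (mg/dl) of the 10 subjects in the example above.
cholesterol = [192, 242, 203, 212, 175, 284, 256, 218, 182, 228]

total = sum(cholesterol)               # step 1: sum of all values
mean = arithmetic_mean(cholesterol)    # step 2: divide by n = 10
print(f"Sum = {total}, n = {len(cholesterol)}, mean = {mean:.1f} mg/dl")

# Hypothetical data set: the largest reading (284) replaced by an extreme value (584).
with_extreme = [584 if value == 284 else value for value in cholesterol]
print(f"Mean with one extreme value = {arithmetic_mean(with_extreme):.1f} mg/dl")
```

A single extreme reading shifts the mean by 30 mg/dl here, although nine of the ten observations are unchanged.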
Calculation of mean from grouped data : For calculating the mean from “grouped data” we should first find out the midpoint (class mark) of each class interval, which we denote by x. (The midpoint is calculated by adding the upper limit and the lower limit of the respective class interval and dividing by 2.) The next step is to multiply the midpoints by the frequency of that class interval. Summing all these products and then dividing by the total sample size yields the mean value for grouped data.

Consider the following example on 10 subjects on serum cholesterol level (mg/dl), put in class intervals (Table - 1).

Table - 1
Serum cholesterol level (mg/dl) (a) | Midpoint (x) (b) | No. of subjects (f) (c) | x*f (b x c)
175-199 | 187 | 3 | 561
200-224 | 212 | 3 | 636
225-249 | 237 | 2 | 474
250-274 | 262 | 1 | 262
275-299 | 287 | 1 | 287
Total | | 10 = ∑f | 2220 = ∑fx

The mean is then calculated as

$\bar{x} = \frac{\sum f x}{\sum f} = \frac{2220}{10} = 222$

Median
When the data is skewed, another measure of central tendency called median is used. Median is a locative measure which is the middlemost observation after all the values are arranged in ascending or descending order. In other words, median is that value which divides the entire data set into 2 equal parts, when the data set is ordered in an ascending (or descending) fashion. When there is an odd number of observations we have a single middle value, which is the median. When there is an even number of observations there are two middle values, and the median is calculated by taking the mean of these two middle observations. Thus,

Median = the $\left(\frac{n+1}{2}\right)$th observation ; when n is odd
Median = mean of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations ; when n is even

Let us work on our example of serum cholesterol considered in the calculation of mean for ungrouped data. In the first step, we order the data set in ascending order as follows :
175, 182, 192, 203, 212, 218, 228, 242, 256, 284
Since n is 10 (even) we have two middlemost observations, 212 and 218 (i.e. the 5th and 6th values).

Therefore, median = $\frac{212 + 218}{2} = 215$

Like mean, median is also very easy to calculate. In fact, if the observations are few, median can be calculated by just inspection. Unlike mean, median can be calculated if the extreme observation is missing. It is less affected by fluctuations of sampling than mean.

Mode
Mode is the most common value that repeats itself in the data set. Though mode is easy to calculate, at times it may be impossible to calculate mode if we do not have any value repeating itself in the data set. At the other end, it may so happen that we come across two or more values repeating themselves the same number of times. In such cases the distribution is said to be bimodal or multimodal.

Geometric Mean
Geometric mean is defined as the nth root of the product of the observations. Mathematically,

Geometric Mean = $\sqrt[n]{x_1 \times x_2 \times x_3 \times \dots \times x_n}$

Thus if there are 3 observations in the data set, the first step would be to calculate the product of all the three observations. The second step would be to take the cube root of this product. Similarly the geometric mean of 4 values would be the 4th root of the product of the four observations.
The merits of geometric mean are that it is based on all the observations. It is also not much affected by the fluctuations of sampling. The disadvantage is that it is not easy to calculate and finds limited use in medical research.

Harmonic Mean
Harmonic mean of a set of values is the reciprocal of the arithmetic mean of the reciprocals of the values. Mathematically,

Harmonic mean = $\frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}}$

Thus if there are four values in the data set, 2, 4, 6 and 8, the harmonic mean is

$\frac{4}{\frac{1}{2} + \frac{1}{4} + \frac{1}{6} + \frac{1}{8}} = 3.84$

Though harmonic mean is based on all the values, it is not easy to understand and calculate. Like geometric mean this also finds limited use in medical research.

Relationship between the Three Measures of Mean, Median and Mode
1. For symmetric curve
Mean = Median = Mode
2. For moderately asymmetric (skewed) curve
Mean – Mode ≈ 3 (Mean – Median)
3. For positively skewed curve
Mean > Median > Mode
4. For negatively skewed curve
Mean < Median < Mode

Choice of Central Tendency
We observe that each measure of central tendency discussed above has some merits and demerits. No one average is good for all types
of research. The choice should depend on the type of information collected and the research question the investigator is trying to answer. If the collected data is of quantitative nature and symmetric or approximately symmetric, generally the measure used is arithmetic mean. But if the values in the series are such that only one or two observations are very big or very small compared to other observations, arithmetic mean gives fallacious conclusions. In such cases (skewed data) median or mode would give better results. In social and psychological studies which deal with scored observations or data which are not capable of direct quantitative measurement, like socio-economic status, intelligence or pain score, median or mode is a better measure than mean. However, ‘mode’ is generally not used since it is not amenable to statistical analysis.

Measures of Relative Position (Quantiles)
Quantiles are the values that divide a set of numerical data arranged in increasing order into an equal number of parts. Quartiles divide the numerical data arranged in increasing order into four equal parts of 25% each. Thus there are 3 quartiles, Q1, Q2 and Q3 respectively. Deciles are values which divide the arranged data into ten equal parts of 10% each. Thus we have 9 deciles which divide the data in ten equal parts. Percentiles are the values that divide the arranged data into hundred equal parts of 1% each. Thus there are 99 percentiles. The 50th percentile, 5th decile and 2nd quartile are equal to the median.

Measures of Variability
Knowledge of central tendency alone is not sufficient for complete understanding of a distribution. For example, if we have three series having the same mean, then the mean alone does not throw light on the composition of the data; hence to supplement it we need a measure which will tell us about the spread of the data. In contrast to measures of central tendency, which describe the center of the data set, measures of variability describe the variability or spread of the observations from the center of the data. Various measures of dispersion are as follows.
●● Range
●● Interquartile range
●● Mean deviation
●● Standard deviation
●● Coefficient of variation

Range
One of the simplest measures of variability is range. Range is the difference between the two extremes, i.e. the difference between the maximum and minimum observation.
Range = maximum observation - minimum observation
One of the drawbacks of range is that it uses only the extreme observations and ignores the rest. This variability measure is easy to calculate but it is affected by the fluctuations of sampling. It gives a rough idea of the dispersion of the data.

Interquartile Range
Just as range is found from the difference in the extreme observations, interquartile range is calculated by taking the difference in the values of the two extreme quartiles. Thus
Interquartile range = Q3 - Q1
Quartiles divide the total number of observations into 4 equal parts of 25% each. Thus there are three quartiles (Q1, Q2 and Q3). The second quartile Q2 is equivalent to the middle value, i.e. the median. The interquartile range gives the middle 50% values of the data set. Though interquartile range is easy to calculate, it suffers from the same defects as that of range.

Mean Deviation
Mean deviation is the mean of the absolute differences from a constant ‘A’ which can be taken as mean, median, mode or any constant observation from the data. The formula for mean deviation is given as follows:

Mean deviation = $\frac{\sum |x_i - A|}{n}$

where A may be mean, median, mode or a constant; $x_i$ is the value of individual observations; n is the total number of observations; and ∑ is a sign indicating “sum of”. The main drawback of this measure is that it ignores the algebraic signs, and hence to overcome this drawback we have another measure of variability called Variance.

Standard Deviation
Variance is the average of the squared deviations of each of the individual values from the mean ($\bar{x}$). It is mathematically given as follows:

Variance = $\frac{\sum (x_i - \bar{x})^2}{n}$

Most often we use the square root of the variance, called Standard Deviation, to describe the data. Variance squares the units; standard deviation, by taking the square root, brings the measure back to the same units as the original data and hence is the best measure of variability. It is given as follows:

Standard Deviation (SD) = $\sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$

The larger the standard deviation, the larger is the spread of the distribution.
Note: When n is less than 30, the denominator in the variance and standard deviation formulas changes to (n-1).
Let us demonstrate the calculation using our hypothetical data set on serum cholesterol (Table - 2).

Standard Deviation (SD) = $\sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$ ; since n<30

$= \sqrt{\frac{739.84 + 519.84 + \dots + 77.44}{10 - 1}}$

SD = $\sqrt{\frac{10543.6}{9}} = 34.227$
Table - 2
Sr. No | Serum cholesterol ($x_i$) | $x_i - \bar{x}$ ($\bar{x}$ = 219.2) | $(x_i - \bar{x})^2$
1 | 192 | 192-219.2 = -27.2 | (-27.2)² = 739.84
2 | 242 | 242-219.2 = 22.8 | (22.8)² = 519.84
3 | 203 | -16.2 | 262.44
4 | 212 | -7.2 | 51.84
5 | 175 | -44.2 | 1953.64
6 | 284 | 64.8 | 4199.04
7 | 256 | 36.8 | 1354.24
8 | 218 | -1.2 | 1.44
9 | 182 | -37.2 | 1383.84
10 | 228 | 8.8 | 77.44
Total | 2192 | | 10543.6

Calculation of Standard deviation in a grouped data : For grouped data the calculation of standard deviation changes slightly. It is given by the following formula.

SD = $\sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{n}}$ ; replace n by n-1 if observations are less than 30

where $f_i$ is the frequency (i.e. number of subjects in that group), $x_i$ is the midpoint of the class interval, $\sum f_i = n$, and $\bar{x}$ is the overall mean. Suppose the data on serum cholesterol was grouped, as we had demonstrated earlier in this chapter for calculation of the mean for grouped data. We had calculated the mean as 222. Now in the same table, make more columns as in Table - 3.

Table - 3
Serum cholesterol level (mg/dl) | Midpoint (x) | No. of subjects (f) | $x - \bar{x}$ | $(x - \bar{x})^2$ | $f_i (x - \bar{x})^2$
175-199 | 187 | 3 | (187-222) = -35 | (-35)² = 1225 | 3*1225 = 3675
200-224 | 212 | 3 | (212-222) = -10 | 100 | 300
225-249 | 237 | 2 | 15 | 225 | 450
250-274 | 262 | 1 | 40 | 1600 | 1600
275-299 | 287 | 1 | 65 | 4225 | 4225
Total | | 10 = ∑f | | 7375 | 10250

Thus,

SD = $\sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{10250}{10 - 1}} = 33.74$

Coefficient of Variation
Besides the measures of variability discussed above, we have one more important measure called the coefficient of variation, which compares the variability in two data sets. It measures the variability relative to the mean and is calculated as follows:

Coefficient of Variation (CV) = $\frac{SD}{\bar{x}} \times 100$

If the coefficient of variation is greater for one data set, it suggests that the data set is more variable than the other data set.
Thus, any information that is collected by the researcher needs to be described by measures of central tendency and measures of variability. Both the measures together describe the data. Measures of central tendency alone will not give any idea about the data set without a measure of variability. Descriptive statistics is critical because it often suggests possible hypotheses for future investigation.

Summary
Raw information is organized and condensed by using tabular and graphical presentations, but compiling and presenting the data in tabular or graphical form will not give complete information of the data collected. We need to “summarise” the entire data in one figure, looking at which we can get an overall idea of the data. Thus, the data set should be meaningfully described using summary measures. Summary measures provide a description of data in terms of concentration of data and variability existing in data. Having described our data set, we use these summary figures to draw certain conclusions about the reference population from which the sample data has been drawn. Thus data is described by two summary measures, namely, measures of central tendency and measures of variability. Measures of central tendency describe the centrality of the data set. In other words, central tendency tells us where the data is concentrated. If the researcher is dealing with quantitative data, mean is the best centrality measure, whereas for qualitative data median and mode describe the data appropriately. Measures of variability give the spread or dispersion of the data. In other words, they describe the scatter of the individual observations from the central value. The simplest of the variability measures is range, which is the difference between the two extreme observations. Other measures of dispersion are mean deviation, variance and standard deviation. Standard deviation is the most commonly used variability measure to describe quantitative data, since it is expressed in the same units as the data. When commenting on the variability while dealing with two or more groups or techniques, a special measure of variability called coefficient of variation is used. The group in which coefficient of variation is more is said to be
more variable than the other. Both measures of central tendency and measures of variability together describe the data set and often suggest possible hypotheses for future investigation.

Study Exercises
Short Notes : (1) Measures of central tendency (2) Measures of Variation

MCQs
1. Which of the statistical averages takes into account all the numbers equally? (a) Mean (b) Median (c) Mode (d) None of the above
2. Which of the following is a measure of spread (a) Variance (b) Mean (c) p value (d) Mode
3. Which of the following is a measure of location (a) Variance (b) Mode (c) p value (d) Median
4. Which among the following is not a measure of variability: (a) Standard deviation (b) Range (c) Median (d) Coefficient of Variation
5. For a positively skewed curve which measure of central tendency is largest (a) Mean (b) Mode (c) Median (d) All are equal
6. Most common value that repeats itself in the data set is (a) Mean (b) Mode (c) Median (d) All of the above.
7. Variance is square of (a) p value (b) Mean deviation (c) Standard deviation (d) Coefficient of variation.
8. Percentiles divide the data into _____ equal parts (a) 100 (b) 50 (c) 10 (d) 25
9. 10 babies are born in a hospital on the same day. All weigh 2.8 Kg each; what would be the standard deviation (a) 0.28 (b) 1 (c) 2.8 (d) 0
10. To compare the variability in two populations we use this measure (a) Range (b) Coefficient of Variation (c) Median (d) Standard deviation
Answers : (1) a; (2) a; (3) d; (4) c; (5) a; (6) b; (7) c; (8) a; (9) d; (10) b.

Statistical Exercises
1. A researcher who wanted to know the weights in Kg of children of second standard collected the following information on 15 students: 10, 20, 11, 12, 12, 13, 11, 14, 13, 13, 15, 11, 16, 17, 18. What type of data is it? Calculate mean, median and mode from the above data. Calculate mean deviation and standard deviation. (Answer : Mean = 13.7, Median = 13, Mode = 11 & 13, Mean deviation = 2.34, Standard deviation = 2.9)
2. If the height (cm) of the same students is 95, 110, 98, 100, 102, 102, 99, 103, 104, 103, 106, 99, 108, 108, 109. What type of data is it? What is the scale of measurement? Calculate mean, median and mode from the above data. Calculate mean deviation and standard deviation. Between height and weight, which is more variable and why? (Answer : Mean = 103.1, Median = 103, Mode = 99, 102, 103 & 108, Mean deviation = 3.55, Standard deviation = 4.4, Coefficient of variation of weight = 21.17, Coefficient of variation of height = 4.27, hence weight is more variable)

    41         Introducing Inferential Statistics : Gaussian Distribution and Central Limit Theorem

                                              Seema R. Patrikar

The Gaussian Distribution or Normal Curve
Many different types of data distributions are encountered in medicine. The Gaussian or “normal” distribution is among the most important. Its importance stems from the fact that the characteristics of this theoretical distribution underlie many aspects of both descriptive and inferential statistics (Fig. - 1).

[Fig. - 1]

If we draw a smooth curve passing through the midpoints of the bars of a histogram and the curve is bell shaped, then the data is said to be roughly following a normal distribution.
Gaussian distribution is one of the important distributions in statistics. Most of the data relating to social and physical sciences conform to this distribution for a sufficiently large number of observations by virtue of the central limit theorem.
Normal distribution was first discovered by the mathematician De-Moivre. Karl Gauss and Pierre-Simon Laplace used this distribution to describe error of measurement. Normal distribution is also called ‘Gaussian distribution’.
A normal curve is determined entirely by the mean and the standard deviation. Hence it is possible to have various normal curves with different standard deviations but the same mean (Fig. - 2a) and various normal curves with different means but the same standard deviation (Fig. - 2b).

[Fig. - 2a : Normal curves with same mean but different standard deviations]
[Fig. - 2b : Normal curves with same standard deviation but different means]

The normal curve possesses many important properties and is of extreme importance in the theory of errors. The normal distribution is defined by the following characteristics:
●● It is a bell shaped symmetric (about the mean) curve.
●● The curve on either side of the mean is a mirror image of the other side.
●● The mean, median and mode coincide.
●● Highest frequency (frequency means the number of observations for a particular value or in a particular class interval) is in the middle around the mean and lowest at both the extremes, and frequency decreases smoothly on either side of the mean.
●● The total area under the curve is equal to 1 or 100%.
●● The most important relationship in the normal curve is the area relationship.
The proportional area enclosed between mean and multiples of SD is constant.
Mean ± 1 SD -------> 68% of the total area
Mean ± 2 SD -------> 95% of the total area
Mean ± 3 SD -------> 99.7% of the total area
Fig. - 3 shows the area enclosed by 1, 2 and 3 SD from the mean.

[Fig. - 3 : Areas under the normal curve; 68% within Mean ± 1 SD, 95% within Mean ± 2 SD, 99.7% within Mean ± 3 SD]

If these criteria are not met, then the distribution is not a Gaussian or normal distribution.

Standard Normal Variate (SNV)
As already specified, a normal frequency curve can be described completely with the mean and standard deviation values. Even the same set of data would provide different values for the mean and SD, depending on the choice of units of measurement. For example, the same person's height can be expressed as 66 inches or 167.6 cm. An infant’s birth weight can be recorded as 2500 g or 5.5 pounds. Because the units of measurement differ, so do the numbers, although the true height and weight are the same. To eliminate the effect produced by the choice of units of measurement, the data can be put in unit-free form, i.e. the data can be normalized. The first step to transform the original variable to the normalized variable is to calculate the mean and SD. The normalized values are then calculated by subtracting the mean from the individual values and dividing by the SD. These normalized values are also called the z values.

$z = \frac{x - \mu}{\sigma}$

(where x is the individual observation, µ = mean and σ = standard deviation)
When x follows a normal distribution, z also follows a normal distribution, with mean of 0 and standard deviation of 1. The z values are often called the ‘Standard Normal Variate’.

Central Limit Theorem (CLT)
The CLT is responsible for the following remarkable result: the distribution of an average tends to be Normal, even when the distribution from which the average is computed is non-Normal.
Furthermore, this normal distribution will have the same mean as the parent distribution, and variance equal to the variance of the parent distribution divided by the sample size (σ²/n).
The central limit theorem states that given a distribution with a mean μ and variance σ², the sampling distribution of the mean approaches a normal distribution with a mean (μ) and a variance σ²/N as N, the sample size, increases. The amazing and counter-intuitive thing about the central limit theorem is that no matter what the shape of the original distribution, the sampling distribution of the mean approaches a normal distribution. Furthermore, for most distributions, a normal distribution is approached very quickly as N increases. Thus, the Central Limit theorem is the foundation for many statistical procedures.
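The statement of the central limit theorem above can be explored with a small simulation. The sketch below is an illustration in Python and is not taken from the text: it assumes a uniform parent distribution on the interval 0 to 1 (mean 0.5, variance 1/12), and the sample size, number of replicates and random seed are arbitrary choices of ours.

```python
import random

rng = random.Random(0)   # fixed seed so that the run is repeatable
n = 16                   # size of each sample (assumed value)
replicates = 20000       # number of sample averages computed (assumed value)

# Each average is the mean of n draws from the uniform(0, 1) parent
averages = [sum(rng.random() for _ in range(n)) / n for _ in range(replicates)]

# Mean and variance of the sampling distribution of the average
grand_mean = sum(averages) / replicates
var_of_means = sum((a - grand_mean) ** 2 for a in averages) / replicates
sd_of_means = var_of_means ** 0.5

parent_mean = 0.5        # mean of uniform(0, 1)
parent_var = 1 / 12      # variance of uniform(0, 1)

print("mean of averages:", round(grand_mean, 4), "parent mean:", parent_mean)
print("variance of averages:", round(var_of_means, 5),
      "parent variance / n:", round(parent_var / n, 5))

# Proportion of averages lying within 1 SD of their mean; for a normal
# distribution this proportion is about 68%
within_1sd = sum(abs(a - grand_mean) <= sd_of_means for a in averages) / replicates
print("proportion within 1 SD:", round(within_1sd, 3))
```

If the theorem holds, the mean of the averages should be close to the parent mean, their variance close to σ²/n, and the proportion within 1 SD close to the normal-curve figure, even though the parent distribution is flat rather than bell shaped.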
To understand the concept of central limit theorem in detail let us consider the following example (Fig. - 4a - g). In each panel the horizontal axis shows values from 0 to 1.

[Fig. - 4a : Non-Normal distribution of x] The uniform distribution shown is obviously non-Normal. Call that the parent distribution.

[Fig. - 4b : Distribution of $\bar{x}$ when n=2] To compute an average, $\bar{x}$, two samples are drawn, at random, from the parent distribution and averaged. Then another sample of two is drawn and another value of $\bar{x}$ computed. This process is repeated, over and over, and averages of two are computed. The panel shows the distribution of averages of two.

[Fig. - 4c : Distribution of $\bar{x}$ when n=3] Repeatedly taking three from the parent distribution, and computing the averages, produces the distribution curve shown.

[Fig. - 4d : Distribution of $\bar{x}$ when n=4] Repeatedly taking four from the parent distribution, and computing the averages, produces the probability density shown.

[Fig. - 4e : Distribution of $\bar{x}$ when n=8] Repeatedly taking eight from the parent distribution, and computing the averages, produces the distribution curve shown.

[Fig. - 4f : Distribution of $\bar{x}$ when n=16] Repeatedly taking sixteen from the parent distribution, and computing the averages, produces the distribution curve shown.

[Fig. - 4g : Distribution of $\bar{x}$ when n=32] Repeatedly taking thirty-two from the parent distribution, and computing the averages, produces the probability density shown.

Thus we notice that when the sample size approaches a couple dozen, the distribution of the average is very nearly Normal, even though the parent distribution looks anything but Normal.

Summary
Normal distribution was first discovered by the mathematician De-Moivre. Karl Gauss and Pierre-Simon Laplace used this distribution to describe error of measurement. Normal distribution is also called ‘Gaussian distribution’. When the midpoints of the histogram bars are joined by a smooth curve and the curve resembles a bell shaped curve, the data is said to be approximately normal. The normal distribution is defined by certain characteristics. It is a bell shaped symmetric (about the mean) curve. The curve on either side of the mean is a mirror image of the other side. The mean, median and mode coincide. Highest frequency is in the middle around the mean and lowest at both the extremes, and frequency decreases smoothly on either side of the mean. The total area under the curve is equal to 1 or 100%. The most important relationship in the normal curve is the area relationship. The proportional area enclosed