4. Descriptive Statistics
Data analysis begins with the calculation of descriptive
statistics for the research variables.
These statistics summarize various aspects of the
data, giving details about the sample and providing
information about the population from which the sample
was drawn.
Each variable’s type determines the nature of the
descriptive statistics that one calculates and the manner
in which one reports or displays those statistics.
Put simply, they describe what is going on in our data.
5. Inferential Statistics
Trying to reach conclusions that extend beyond the immediate data
alone (to INFER)
We use inferential statistics to try to infer from the sample data what
the population might think or experience
Make judgments of the probability that an observed difference
between groups is a dependable one or one that might have
happened by chance in this study
Make inferences from our data to more general conditions
http://www.socialresearchmethods.net/kb/statinf.php
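One way to make the "happened by chance" judgment concrete is a permutation test. This technique is not named on the slide; it is swapped in here as a minimal sketch, and the function name, the data, and the default settings are all hypothetical.

```python
import random
import statistics as st

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate how often a random regrouping of the pooled data gives a
    difference in means at least as large as the observed one."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    observed = abs(st.fmean(group_a) - st.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)            # reassign observations to groups at random
        diff = abs(st.fmean(pooled[:n_a]) - st.fmean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return hits / n_permutations

# Illustrative data only: two clearly separated groups.
p_separated = permutation_p_value([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
# Illustrative data only: two identical groups (observed difference is 0).
p_identical = permutation_p_value([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
print(p_separated, p_identical)
```

A small value suggests the observed difference would rarely arise from chance regrouping alone; a value near 1 suggests it easily could.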
10. Variables:
DISCRETE
Only certain values (fixed and
readily countable)
Examples of discrete variables
commonly encountered in
cardiovascular research include
species, strain, racial/ethnic group,
sex, education level,
treatment group, hypertension status,
and New York Heart
Association class.
CONTINUOUS
Infinite number of values
Fixed intervals between adjacent
values
They can be manipulated
mathematically, taking sums and
differences
Examples: age, height, weight, blood
pressure, measures of cardiac
structure and function, blood
chemistries, and survival time
11. Discrete variables (categorical)
NOMINAL (UNORDERED)
Take values such as yes/no,
human/dog/mouse, female/male,
treatment A/B/C; a nominal
variable that takes only 2 possible
values is called binary. One
may apply numbers as labels for
nominal categories, but there
is no natural ordering.
ORDINAL (ORDERED)
Take naturally ordered values such as
New York Heart Association class (I, II,
III, or IV), hypertension status (optimal,
normal, high-normal, or hypertensive),
or education level (less than high
school, high school, college, graduate
school).
12. Categorical
A categorical variable (sometimes called a nominal variable) is one
that has two or more categories, but there is no intrinsic ordering to
the categories. For example, gender is a categorical variable
having two categories (male and female) and there is no intrinsic
ordering to the categories. Hair color is also a categorical variable
having a number of categories (blonde, brown, brunette, red, etc.)
and again, there is no agreed way to order these from highest to
lowest. A purely categorical variable is one that simply allows you to
assign categories but you cannot clearly order the categories. If the
variable has a clear ordering, then that variable would be an
ordinal variable, as described below.
13. Ordinal
An ordinal variable is similar to a categorical variable. The difference between
the two is that there is a clear ordering of the categories. For example, suppose
you have a variable, economic status, with three categories (low, medium and
high). In addition to being able to classify people into these three categories,
you can order the categories as low, medium and high. Now consider a
variable like educational experience (with values such as elementary school
graduate, high school graduate, some college and college graduate). These
also can be ordered as elementary school, high school, some college, and
college graduate. Even though we can order these from lowest to highest, the
spacing between the values may not be the same across the levels of the
variable. Say we assign scores 1, 2, 3 and 4 to these four levels of educational
experience and we compare the difference in education between categories
one and two with the difference in educational experience between categories
two and three, or the difference between categories three and four. The
difference between categories one and two (elementary and high school) is
probably much bigger than the difference between categories two and three
(high school and some college).
14. Ordinal
In this example, we can order the people by level of educational
experience, but the size of the difference between categories is
inconsistent (because the spacing between categories one and
two is bigger than that between categories two and three). If these
categories were equally spaced, then the variable would be an interval
variable.
15. Continuous variables
Continuous variables can have an
infinite number of different values
between two given points. The number
of children within a family, by contrast,
cannot be measured on a continuous
scale. If height were being
measured, though, the variable
would be continuous, as there are
an unlimited number of possibilities
even when looking only between 1
and 1.1 meters.
16. Descriptive statistics for discrete variables
Absolute frequencies (raw counts) for each category
Relative frequencies (proportions or percentages of the total
number of observations)
Cumulative frequencies for successive categories of ordinal
variables
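The three kinds of frequency listed above can be sketched as follows. The patient data are made up for illustration, and the variable names are hypothetical.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical New York Heart Association classes for 10 patients.
nyha = ["I", "II", "II", "III", "I", "II", "IV", "III", "II", "I"]
order = ["I", "II", "III", "IV"]   # natural order of the ordinal categories

counts = Counter(nyha)
n = len(nyha)

absolute = [counts[c] for c in order]       # raw counts per category
relative = [k / n for k in absolute]        # proportions of the total
cumulative = list(accumulate(absolute))     # running totals in category order

for c, a, r, cum in zip(order, absolute, relative, cumulative):
    print(f"{c:>3}  count={a}  relative={r:.0%}  cumulative={cum}")
```

Cumulative frequencies only make sense here because the categories are ordinal; for a nominal variable there is no natural order to accumulate over.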
17. Collection
Formal Sampling
Recording Responses To Experimental Conditions
Observing A Process Repeatedly Over Time
18. Descriptive statistics for continuous variables
Location statistics [CENTRAL TENDENCY]
MEAN
MEDIAN
MODE
QUANTILES
Dispersion statistics
VARIANCE = S² = Σ(xᵢ - x̄)² / (n - 1)
STANDARD DEVIATION = S = √S²
RANGE
INTERQUARTILE RANGE
Shape statistics
SKEWNESS
KURTOSIS
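As a rough illustration, the basic location and dispersion statistics can be computed with Python's standard `statistics` module. The blood pressure values below are invented for the example.

```python
import statistics as st

# Hypothetical systolic blood pressures in mmHg (illustrative data only).
sbp = [118, 122, 122, 130, 135, 141, 160]

# Location statistics
mean = st.mean(sbp)
median = st.median(sbp)
mode = st.mode(sbp)

# Dispersion statistics
variance = st.variance(sbp)        # sample variance, divides by n - 1
sd = st.stdev(sbp)                 # square root of the sample variance
data_range = max(sbp) - min(sbp)

print(f"mean={mean:.1f} median={median} mode={mode}")
print(f"variance={variance:.1f} sd={sd:.1f} range={data_range}")
```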
19. ROBUST
MEDIAN is robust: not strongly affected by outliers or by extreme
changes to a small portion of the data
MEAN is sensitive (not robust) to those conditions
MODE is robust to outliers, but it may be affected by data
collection operations, such as rounding or digit preference, that
alter data precision.
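A small made-up example shows the difference in robustness: replacing one value with an extreme one moves the mean a lot and leaves the median where it was.

```python
import statistics as st

values = [10, 11, 12, 13, 14]
with_outlier = [10, 11, 12, 13, 140]   # last value replaced by an extreme one

# How far each statistic moves when the outlier is introduced.
mean_shift = st.mean(with_outlier) - st.mean(values)
median_shift = st.median(with_outlier) - st.median(values)

print(f"mean shift={mean_shift:.1f}, median shift={median_shift}")
```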
20. QUANTILES
Quantiles combine aspects of ordered data and cumulative
frequencies
The p-th quantile (0≤p≤1) is the value at or below which a
proportion p of the ordered data falls
When 100p is an integer, the quantiles are called percentiles
Median, or 0.50 quantile, is the 50th percentile; the 0.99 quantile is
the 99th percentile
Three specific percentiles, the quartiles, are widely used in
descriptive statistics [100p is an integer multiple of 25]
Q1: first quartile (25th percentile, 0.25 quantile)
Q2: second quartile (50th percentile, 0.50 quantile), median
Q3: third quartile (75th percentile, 0.75 quantile)
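A minimal sketch of the quartiles, using `statistics.quantiles` from the Python standard library. Several interpolation conventions exist for quantiles, so other software may give slightly different Q1 and Q3 for the same data; the data here are invented.

```python
import statistics as st

# Illustrative data only.
data = [2, 4, 4, 5, 7, 8, 9, 10, 12]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = st.quantiles(data, n=4)

# n=100 asks for the 99 percentile cut points; index 49 is the 50th percentile.
percentiles = st.quantiles(data, n=100)
p50 = percentiles[49]

print(f"Q1={q1} Q2={q2} Q3={q3} 50th percentile={p50}")
```

Q2, the 50th percentile, and the median all coincide, as the slide states.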
21. INTERQUARTILE RANGE [IQR]
It is a single number
defined as IQR = Q3 - Q1
Variance and standard deviation
are affected (increased) by the presence of extreme observations;
the IQR is not; it is robust
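The contrast between the IQR and the standard deviation can be checked on a small made-up data set. The helper name `iqr` is hypothetical, and the quartiles follow the default convention of `statistics.quantiles`.

```python
import statistics as st

def iqr(data):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = st.quantiles(data, n=4)
    return q3 - q1

values = [10, 11, 12, 13, 14, 15, 16]
with_outlier = [10, 11, 12, 13, 14, 15, 160]   # one extreme observation

print(f"IQR: {iqr(values)} vs {iqr(with_outlier)}")
print(f"SD:  {st.stdev(values):.2f} vs {st.stdev(with_outlier):.2f}")
```

In this example the IQR is unchanged by the extreme value, while the standard deviation grows many times over.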
22. SKEWNESS [skewness coefficient]
For a given data set
Distribution is symmetric (skewness=0), or has
a more pronounced tail in 1 direction than the
other (left tail, skewness<0; right tail, skewness>0)
If the distribution is symmetric, the mean = median
A right- (left-) skewed distribution typically has its mean value greater
(less) than the median
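The slide does not say which formula it uses for the skewness coefficient. The sketch below assumes one common definition, the third central moment divided by the 1.5 power of the second central moment; the function name and data are hypothetical.

```python
import statistics as st

def skewness(data):
    """Moment-based skewness coefficient: m3 / m2**1.5 (an assumed definition)."""
    n = len(data)
    mean = st.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 20]   # long tail to the right

print(f"symmetric: {skewness(symmetric):.2f}")
print(f"right-skewed: {skewness(right_skewed):.2f}, "
      f"mean={st.fmean(right_skewed):.2f}, median={st.median(right_skewed)}")
```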
23. Kurtosis
A measure of the “peakedness” of a distribution
A Gaussian distribution (also called “normal”) with a bell-shaped
frequency curve has kurtosis coefficient 0
Positive kurtosis coefficient indicates a sharper peak with longer/fatter
tails and relatively more variability due to extreme deviations
Negative kurtosis coefficient indicates broader shoulders with
shorter/thinner tails
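Since the slide gives the Gaussian distribution a kurtosis of 0, the coefficient meant is presumably the "excess" kurtosis. A minimal sketch under that assumption, using the fourth central moment divided by the squared second central moment, minus 3 (function name and data are hypothetical):

```python
def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (an assumed definition)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m4 = sum((x - mean) ** 4 for x in data) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

flat = list(range(1, 11))              # evenly spread values, thin tails
heavy_tails = [0] * 8 + [-10, 10]      # mostly central values, two extremes

print(f"flat: {excess_kurtosis(flat):.2f}")
print(f"heavy tails: {excess_kurtosis(heavy_tails):.2f}")
```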
27. DOT PLOT of continuous variable (BMI)
The dot plot is a simple
graph that is used
mainly with small data
sets to show individual
values of sample data in
1 dimension
28. Box-and-whisker plot = box plot graph
The graph displays the values of the quartiles (Q1, Q2, Q3) by
a rectangular box. The ends of the box
correspond to Q1 and Q3, such that
the length of the box is the interquartile range
(IQR = Q3 - Q1). There is a line drawn inside the box at
the median, Q2, and there is a symbol plotted
at the mean. Traditionally, “whiskers” (thin lines)
extend out to, at most, 1.5 times the box length
from both ends of the box: they connect all
values outside the box that are not more than 1.5 IQR away
from the box, and they must end at an observed
value. Beyond the whiskers are outliers, identified
individually by symbols such as circles or asterisks.
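The numbers behind such a plot can be sketched without any drawing. The function name `box_plot_stats` and the BMI values are hypothetical, and the quartiles follow the default convention of `statistics.quantiles`, so plotting software may place the box slightly differently.

```python
import statistics as st

def box_plot_stats(data):
    """Quartiles, mean, whisker ends, and outliers under the 1.5 x IQR rule."""
    data = sorted(data)
    q1, q2, q3 = st.quantiles(data, n=4)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr       # farthest a whisker may reach downward
    upper_limit = q3 + 1.5 * iqr       # farthest a whisker may reach upward
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    outliers = [x for x in data if x < lower_limit or x > upper_limit]
    return {
        "q1": q1, "median": q2, "q3": q3, "mean": st.fmean(data),
        "whisker_low": inside[0],      # whiskers end at observed values
        "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Hypothetical BMI values (illustrative data only).
bmi = [19, 21, 22, 23, 24, 25, 26, 27, 29, 31, 45]
stats = box_plot_stats(bmi)
print(stats)
```

Here the upper whisker stops at 31, the largest observed value within 1.5 IQR of the box, and 45 is plotted individually as an outlier.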
31. Univariate Analysis: Look at one variable at a time for 3 features
Distribution: frequency in %, bar diagram/histogram
Central Tendency: mean, median, mode
Dispersion: range, standard deviation, variance
32. Correlation [r] is a single number that shows the degree of relationship between 2 variables
Ranges from -1 to +1
34. r is also called Karl Pearson’s coefficient of correlation
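Pearson's r can be computed directly from its usual definition: the sum of products of deviations divided by the square root of the product of the sums of squared deviations. The function name and the height/weight data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Illustrative data only: heights in cm and weights in kg.
height = [150, 160, 170, 180, 190]
weight = [52, 58, 69, 77, 90]

r = pearson_r(height, weight)
r_perfect = pearson_r(height, [2 * h + 1 for h in height])   # exact rising line
r_inverse = pearson_r(height, [-h for h in height])          # exact falling line
print(f"r={r:.3f}, perfect={r_perfect:.1f}, inverse={r_inverse:.1f}")
```

The two extra cases show the ends of the scale: an exact rising straight-line relationship gives +1 and an exact falling one gives -1.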