This presentation gives an overview of the relevance and significance of statistics as a valid tool for enhancing the quality of research. It also touches upon some misuses and abuses of statistics.
Statistics Role in Research
1. Relevance of Statistics in
Research
Dr S G Deshmukh
ABV-Indian Institute of Information Technology &
Management Gwalior
15 Feb 2018
FDP on Statistics and Research Methodology (15-21 Feb, 2018)
2. Opening remarks..
“All knowledge is, in the final analysis, history.
All sciences are, in the abstract,
mathematics and all methods of acquiring
knowledge are essentially statistics.”
C. R. Rao, in the preface of his famous book
“Statistics and Truth”, 1997, World Scientific
3. Typically, such research wants to:
Describe the structure, hierarchy and
organization of societies
Identify regularities/anomalies that are worth
explaining
Construct and test explanations for such
patterns and regularities
Address societal problems, suggest
interventions, implement changes
Theories to explain why certain things
happen. Causes and effects.
4. Research is …….
Knowledge acquisition gained
through reasoning
through intuition/gut feelings
but most importantly through the use of
appropriate methods/tools/techniques
That is where the role of statistics comes into the
picture!
5. Research in pursuit of knowledge
Attributional:
Attributing a measurement (definition) to a particular
Concept.
Growth, Leadership, Managerial Efficiency
Relational:
Relating a phenomenon with its determinants
Explaining behavior
Classificational:
Understanding by categorizing on the basis of some
indicators
Taxonomy, Innovators vs Followers, Leaders vs Laggards
6. Basic Elements of the Scientific
Method
Empiricism: Enquiry is conducted through
observation and verified/validated through
evidence
Determinism: Events occur according to
regular laws and causes. The goal of research
is to discover/unfold these
Scepticism: Our propositions are open to analysis,
scrutiny and critique. That is how the body of
knowledge progresses!
7. Typical Scientific Method
1. Choose a question to investigate
2. Identify a hypothesis related to the question
3. Make testable predictions from the hypothesis
4. Design an experiment to test the hypothesis
5. Collect data in experiment
6. Determine results and assess their validity
7. Determine if results support or refute your
hypothesis
8. Some basic features of research
process
Always involves bringing together three sets of things:
some content that is of interest
some ideas that give meaning to that content, and
some techniques or procedures by means of which those ideas and
content can be studied.
These three sets of things can be described more formally as three distinct, though
interrelated, domains:
The Substantive domain, from which we draw contents that seem
worthy of our study and attention;
The Conceptual domain, from which we draw ideas that seem likely to
give meaning to our results; and
The Methodological domain, from which we draw techniques that seem
useful in conducting that research.
9. Stepping into research
Method and Methodology
Method refers to the techniques and Methodology to the
strategy
Logic as an Essence of Philosophy
Inference depends on the law of Causation
Deductive and inductive methods are not mutually exclusive
Structuralism as the holistic approach
Why Philosophy?
In Search of Knowledge, Understanding of Nature and
Meaning of Universe.
Creation of Theories OR Universality about Basic things.
In-depth knowledge of a phenomenon
10. Two models : AROHA & AVAROHA
AROHA:
A - Approach
R - Review
O - Objectives
H - Hypothesis
A - Analysis
AVAROHA:
A - Algorithm
V - Variables
A - Arrangement
R - Results
O - Objectivity
H - Humanistic
A - Analytical Rigour
Source: Deshpande R S, Institute for Social & Economic Change, B’lore
11. How to get into a research topic?
Searching for new evidence from facts and
concluding with a new hypothesis.
It should be a net addition to the existing knowledge
or at least a new interpretation of it.
It should be crystal clear in its meaning.
It should have a hypothesis which is not a
statement of existing facts.
It should be empirically analyzable.
12. Criteria of good research
Good research is systematic- structured with
specified steps taken in specified sequence in
accordance with well-defined rules
Good research is logical: logical reasoning makes
research more meaningful in the context of decision
making
Good research is empirical: dealing with concrete
data that provides the basis for external validity to
research results.
Good research is replicable
Good research is also visible : sharing with
community, peers and the society at large
14. What is Statistics ?
A collection of methods for planning
experiments, obtaining data, and then
organizing, summarizing, presenting,
analyzing, interpreting, and drawing
conclusions based on the data
https://www.amazon.in/Business-Statistics-2e-Naval-Bajpai/dp/8131797007
15. Statistics
The science of data to answer research
questions
Formulate a research question(s) (hypothesis)
Collect data
Analyze and summarize data
Draw conclusions to answer research question(s)
Statistical Inference
In the presence of variation
16. Why are statistics important in
research?
Communication
Credibility
Convergence
Answered on Quora by Stan Paxtan, Apr 2016
https://www.quora.com/Why-are-statistics-important-in-research
17. Why do we need statistics? ..1..
Measure things
Examine relationships
Make predictions
Test hypotheses
Construct concepts and develop theories
Explore issues
18. Why do we need statistics? ..2..
Explain activities or attitudes
Describe what is happening
Present information
Make comparisons to find similarities and
differences
Draw conclusions about populations
based only on sample results.
19. Common reasons for rejection of
a paper ..
Incomplete data such as too small a sample size or missing or poor controls
Poor analysis such as using inappropriate statistical tests or a lack of
statistics altogether
Inappropriate methodology for answering your hypothesis or using old
methodology that has been surpassed by newer, more powerful methods
that provide more robust results
Weak research motive where your hypothesis is not clear or scientifically
valid, or your data does not answer the question posed
Inaccurate conclusions on assumptions that are not supported by your data
Source: Springer Nature Guidelines
https://www.springer.com/gp/authors-editors/authorandreviewertutorials/submitting-to-a-journal-and-peer-review/what-is-open-access/10285582
20. Language of statistics
Variable and Constant
Discrete and Continuous
Population and Sample
Parameter and Statistic
21. Population vs Sample
Population — the whole
a collection of persons, objects, or items under
study
Census — gathering data from the entire
population
Sample — a portion of the whole
a subset of the population
22. Parameter vs. Statistic
Parameter — descriptive measure of the
population
Usually represented by Greek letters
Statistic — descriptive measure of a sample
Usually represented by Roman letters
23. Examples
Parameter
51% of the entire population of Gwalior is
female
Statistic
Based on a sample from the IIITM population,
it was determined that 23.2% consider
themselves addicted to the internet.
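A minimal Python sketch of the distinction. The population below is simulated: its size, the 51% proportion (reused from the parameter example) and the sample size of 500 are illustrative assumptions, not real data.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 100,000 people: 1 = female, 0 = male.
N = 100_000
population = [1] * 51_000 + [0] * 49_000

# Parameter: computed from the entire population (a census).
parameter = sum(population) / N

# Statistic: computed from a random sample drawn from the population.
sample = random.sample(population, 500)
statistic = sum(sample) / len(sample)

print(f"Parameter (population proportion): {parameter:.3f}")
print(f"Statistic (sample proportion):     {statistic:.3f}")
```

The parameter is a fixed number; the statistic changes from sample to sample and only estimates it.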
24. Variation
What if everyone:
Looked the same
Thought the same
Believed the same
How many people would you have to interview
to know everything about the population with
regard to looks, thoughts, and beliefs?
25. Variation
Populations with variation
Everyone looks different
Everyone thinks differently
Everyone believes differently
Interviews or observations are required on
multiple members of the population for valid
conclusions about population characteristics.
26. Variation
Variation is everywhere
Individuals
Repeated measurements on the same individual
Almost everything varies over time
Because variation is everywhere, statistical
conclusions are not certain.
Probability statement
Confidence statement
Margin of error
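As a rough illustration of a margin of error, the sketch below uses the usual normal approximation for a sample proportion, z × sqrt(p(1 − p)/n). The 23.2% figure is taken from the earlier statistic example; the sample size n = 500 and the 95% confidence level are assumptions made only for this sketch.

```python
import math
from statistics import NormalDist

def margin_of_error(p_hat, n, confidence=0.95):
    """Approximate (normal-theory) margin of error for a sample proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 0.232  # sample proportion from the earlier example
n = 500        # hypothetical sample size (not given in the slides)

moe = margin_of_error(p_hat, n)
print(f"Estimate: {p_hat:.1%} +/- {moe:.1%} (95% confidence)")
```

Quadrupling the sample size halves the margin of error, because n enters through its square root.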
27. Understanding Data
Individuals & Variables
Individuals – objects described by a set of data.
May be people, animals, or things
Also called subjects or units.
Variables – any characteristic of an individual.
A variable can take different values for different
individuals.
28. Statistics as a tool in research
Types of Research Questions
Descriptive (What does X look like?)
Correlational (Is there an association between X
and Y? As X increases, what does Y do?)
Experimental (Do changes in X cause changes in
Y?)
Different statistical procedures allow us to
answer the different kinds of research
questions
29. Statistical concepts & tools
Data representation
Various Probability Distributions
Discrete (Binomial, Geometric, Poisson, Uniform etc.)
Continuous (Uniform, Exponential, Normal etc.)
Central Limit Theorem
Moment generating functions
Distribution of Sample Means
Point Estimates
Confidence Interval
Type I and Type II errors
Hypothesis Testing
Regression: simple/multiple
Anova, DOE
Non-parametric tests
30. Common concern: Bias
Statistics: collection of data
Sample surveys vs. experiments
Sample surveys: population “snapshot”
Experiments: impose treatment on subjects/units;
observe response to imposed treatment
Bias:
Systematically favors certain outcomes
32. Central Limit Theorem
Most theory about sample means depends on the
assumption that the sample mean follows a normal
distribution.
The Central Limit Theorem says that for any
population, if the sample size is large enough, the
sample means will be approximately normally
distributed with the mean equal to the population
mean and standard deviation equal to the population
standard deviation σ divided by the square root of n
(σ/√n).
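A small simulation can make this concrete. The sketch below assumes a strongly skewed population (exponential with mean 1 and standard deviation 1), a sample size of n = 50 and 5,000 repeated samples; all of these choices are illustrative.

```python
import random
import statistics

random.seed(0)  # reproducible sketch

n = 50              # size of each sample (assumed)
num_samples = 5000  # number of repeated samples (assumed)

# Exponential population with rate 1: mean 1, standard deviation 1, skewed.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_sd = 1.0 / n ** 0.5  # sigma / sqrt(n)

print(f"Mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"SD of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {expected_sd:.3f})")
```

A histogram of `sample_means` would look close to bell-shaped even though the population itself is far from normal.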
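33. Normal distribution
Mother of all !
Normal variate X ~ N(μ, σ²); standard normal variate Z = (X − μ)/σ ~ N(0, 1)
χ² (chi-square): square of Z (more generally, a sum of squares of independent Z's)
t distribution: small sample size
F distribution: ratio of two χ² variates, each divided by its degrees of freedom
Approximation to discrete distributions: Binomial etc.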
34. Recall..
Descriptive Statistics
Describes data usually through the use of graphs, charts and
pictures. Simple calculations like mean, range, mode, etc., may
also be used.
Inferential Statistics
Uses sample data to make inferences (draw
conclusions) about an entire population
35. Recall: Important Characteristics of Data
1. Center: A representative or average value that indicates where the
middle of the data set is located
2. Variation: A measure of the amount that the values vary among
themselves or how data is dispersed
3. Distribution: The nature or shape of the distribution of data (such
as bell-shaped, uniform, or skewed)
4. Outliers: Sample values that lie very far away from the vast
majority of other sample values
5. Time: Changing characteristics of the data over time
36. Statistical significance
Significance is a statistical term that tells how sure you are that
a difference or relationship exists. To say that a significant
difference or relationship exists only tells half the story.
We might be very sure that a relationship exists, but is it a
strong, moderate, or weak relationship? After finding a
significant relationship, it is important to evaluate its strength.
Significant relationships can be strong or weak. Significant
differences can be large or small. Whether a weak relationship or
a small difference reaches significance depends largely on your
sample size: with a large enough sample, even a very small effect
can be statistically significant.
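To illustrate the link between significance and sample size, the sketch below computes the two-sided p-value of a one-sample z test for the same small effect at different sample sizes. The effect (a shift of 0.1 standard deviations) and the sample sizes are hypothetical.

```python
import math
from statistics import NormalDist

def two_sided_p(effect, sigma, n):
    """Two-sided p-value of a one-sample z test for an observed mean shift."""
    z = effect / (sigma / math.sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))

effect, sigma = 0.1, 1.0  # hypothetical: a small shift of 0.1 SD

for n in (30, 400, 10_000):
    print(f"n = {n:>6}: p = {two_sided_p(effect, sigma, n):.4f}")
```

The effect is equally small in all three cases; only the amount of evidence changes. This is why an effect size should be reported alongside the p-value.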
37. Steps in a test of hypothesis
1. Define the problem: determine H0 and HA and select alpha.
2. Collect data
3. Calculate xbar as an estimate of µ and s as an estimate of
σ.
4. Check assumptions:
Sample size n is reasonably large (n ≥ 30), so the
normal distribution can be used and σ estimated with s.
Check for outliers or strong skewness in the population distribution.
5. Calculate the standard score
6. Compare with the tabulated (critical) value.
7. State conclusions in the context of the problem.
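The steps above can be sketched in Python for a two-sided large-sample z test. The data are simulated, and the null value of 50, alpha = 0.05 and the helper name `z_test` are assumptions made for illustration.

```python
import math
import random
import statistics
from statistics import NormalDist

def z_test(data, mu0, alpha=0.05):
    """Two-sided large-sample z test of H0: mu = mu0 vs HA: mu != mu0."""
    n = len(data)
    xbar = statistics.fmean(data)                 # step 3: estimate of mu
    s = statistics.stdev(data)                    # step 3: estimate of sigma
    z = (xbar - mu0) / (s / math.sqrt(n))         # step 5: standard score
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # step 6: tabulated value
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"n": n, "xbar": xbar, "s": s, "z": z, "z_crit": z_crit,
            "p_value": p_value, "reject_H0": abs(z) > z_crit}

# Step 2: hypothetical data, 40 simulated measurements (n >= 30).
random.seed(7)
data = [random.gauss(52, 5) for _ in range(40)]

# Step 1: H0: mu = 50, HA: mu != 50, alpha = 0.05.
result = z_test(data, mu0=50, alpha=0.05)

# Step 7: state the conclusion in the context of the problem.
decision = "reject H0" if result["reject_H0"] else "fail to reject H0"
print(f"xbar = {result['xbar']:.2f}, s = {result['s']:.2f}, "
      f"z = {result['z']:.2f}, p = {result['p_value']:.4f}: {decision}")
```

Step 4 (checking for outliers and skewness) is not automated here; a plot of the data is usually the simplest check.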
38. If the test statistic is higher than the critical
value from the table
The finding is significant.
Reject the null hypothesis.
The probability is small that the difference or
relationship happened by chance, and p is less
than the critical alpha level (p < alpha ).
39. Regression and Correlation
Regression analysis is the process of
constructing a mathematical model or function
that can be used to predict or determine one
variable by another variable.
Correlation is a measure of the degree of
relatedness of two variables.
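One common measure of relatedness is the Pearson correlation coefficient r. The sketch below computes it from its definition; the data (hours studied and test scores) and the function name `pearson_r` are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied and test scores for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 71]

r = pearson_r(hours, scores)
print(f"r = {r:.3f}")
```

r always lies between -1 and +1 and measures only linear association; a strong r does not by itself show that one variable causes the other.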
40. Simple Regression analysis
bivariate (two variables) linear regression -- the
most elementary regression model
dependent variable, the variable to be
predicted, usually called Y
independent variable, the predictor or
explanatory variable, usually called X
41. Regression models
Probabilistic Regression Model
Y = β0 + β1X + ε
β0 and β1 are population parameters
β0 and β1 are estimated by sample statistics b0 and b1
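The sample statistics b0 and b1 are usually obtained by ordinary least squares: b1 = Sxy / Sxx and b0 = ybar − b1 · xbar. A minimal sketch with made-up data (the function name `fit_simple_regression` is also an assumption):

```python
def fit_simple_regression(x, y):
    """Ordinary least squares estimates (b0, b1) for the model Y = b0 + b1*X."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    b1 = sxy / sxx             # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1

# Hypothetical data that roughly follow Y = 2X plus noise.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_regression(x, y)
print(f"Fitted line: Y = {b0:.2f} + {b1:.2f} X")

# Predict Y for a new X value.
x_new = 6
print(f"Predicted Y at X = {x_new}: {b0 + b1 * x_new:.2f}")
```

The residuals (observed Y minus fitted Y) play the role of the error term and should be checked before the fitted line is used for inference.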
42. Parametric vs Nonparametric Statistics
Parametric Statistics are statistical techniques based on assumptions about
the population from which the sample data are collected.
Assumption that data being analyzed are randomly selected from a
normally distributed population.
Requires quantitative measurements that yield interval or ratio level data.
Nonparametric Statistics are based on fewer assumptions about the population and the
parameters.
Sometimes called “distribution-free” statistics.
A variety of nonparametric statistics are available for use with nominal or ordinal
data.
RUN TEST
MANN-WHITNEY
CHI-SQUARE
KRUSKAL-WALLIS
Etc.
43. Which test to use?
Goal | Measurement (from Gaussian population) | Rank, score, or measurement (from non-Gaussian population)
Describe one group | Mean, SD | Median, interquartile range
Compare one group to a hypothetical value | One-sample t test | Wilcoxon test
Compare two unpaired groups | Unpaired t test | Mann-Whitney test
Compare two paired groups | Paired t test | Wilcoxon test
Compare three or more unmatched groups | One-way ANOVA | Kruskal-Wallis test
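The first row of the table can be illustrated with a small skewed data set: the mean and SD are pulled by a single extreme value, while the median and interquartile range are not. The figures below are made-up monthly incomes (in thousands).

```python
import statistics

# Hypothetical, strongly skewed data: one very large value.
incomes = [22, 25, 27, 30, 31, 33, 35, 38, 250]

mean = statistics.fmean(incomes)
sd = statistics.stdev(incomes)

median = statistics.median(incomes)
q1, _, q3 = statistics.quantiles(incomes, n=4)  # quartile cut points
iqr = q3 - q1

print(f"Mean = {mean:.1f}, SD = {sd:.1f}")
print(f"Median = {median}, IQR = {iqr}")
```

For data like these, the median and IQR describe a typical value and its spread far better than the mean and SD.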
44. Importance of data origin..
Good data – intelligent human effort
Bad data – laziness, lack of understanding, or a
desire to mislead
Know where the data come from
Understand statistics
Example: Did you know that 45% of statistics
are made up on the spot????
45. Manipulating the facts
Data collection – sampling and measurement
biases, ignoring influential variables
Data summarization – graphically
misrepresenting data, choosing misleading
statistics
Statistical Inference – reporting invalid
conclusions and interpretations
46. Manipulating data collection
Sampling biases:
One group in a population is overrepresented
compared to another.
Example: “New Longitudinal Study Finds that
Having a Working Mother Does No Significant Harm
to Children.”
The sample was not representative of average
or higher income families.
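The effect of sampling from only part of a population can be simulated. The sketch below is not based on the study mentioned above: the two groups, their sizes and their outcome distributions are all invented for illustration.

```python
import random
import statistics

random.seed(1)  # reproducible sketch

# Hypothetical population made of two groups with different outcome levels.
low_income = [random.gauss(50, 10) for _ in range(6000)]
high_income = [random.gauss(70, 10) for _ in range(4000)]
population = low_income + high_income

true_mean = statistics.fmean(population)

# Representative sample: drawn at random from the whole population.
random_sample = random.sample(population, 400)

# Biased sample: drawn from one group only.
biased_sample = random.sample(low_income, 400)

random_estimate = statistics.fmean(random_sample)
biased_estimate = statistics.fmean(biased_sample)

print(f"Population mean:        {true_mean:.1f}")
print(f"Random sample estimate: {random_estimate:.1f}")
print(f"Biased sample estimate: {biased_estimate:.1f}")
```

Taking a larger biased sample does not help: the estimate only becomes a more precise estimate of the wrong quantity.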
47. Manipulating data production
Ignoring influential variables:
Reporting results without considering important influential
variables.
Example – Differences in pay due to gender
“As of 2016, full-time employed women earned on average
only about 76 percent as much as full-time employed men”
Does this difference show that women are discriminated
against?
Occupation has been ignored.
More men have received training for higher paying jobs.
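The arithmetic behind an ignored variable can be shown with entirely made-up numbers. In the sketch below both groups earn exactly the same within each occupation, yet the overall averages differ because the groups are distributed differently across occupations. The figures illustrate the calculation only and say nothing about real pay data.

```python
# Made-up numbers: (group, occupation) -> (headcount, average pay).
# Within each occupation, both groups earn exactly the same.
pay_table = {
    ("women", "occupation A"): (80, 30_000),
    ("women", "occupation B"): (20, 60_000),
    ("men", "occupation A"): (40, 30_000),
    ("men", "occupation B"): (60, 60_000),
}

def overall_mean(group):
    """Headcount-weighted average pay for one group across occupations."""
    rows = [v for (g, _), v in pay_table.items() if g == group]
    total_pay = sum(n * pay for n, pay in rows)
    total_n = sum(n for n, _ in rows)
    return total_pay / total_n

ratio = overall_mean("women") / overall_mean("men")
print(f"Overall pay ratio: {ratio:.2f}")

for occupation in ("occupation A", "occupation B"):
    w = pay_table[("women", occupation)][1]
    m = pay_table[("men", occupation)][1]
    print(f"{occupation}: ratio within occupation = {w / m:.2f}")
```

In this constructed case the aggregate gap comes entirely from the occupation mix. Why the mix differs is a separate question that the aggregate figure alone cannot answer.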
48. Abuses of Statistics
Bad Samples
Small Samples
Loaded Questions
Misleading Graphs
Precise Numbers
Distorted Percentages
Partial Pictures
Deliberate Distortions
49. Abuses of Statistics ..1..
Bad Samples
Inappropriate methods of collecting data introduce bias. Example: using
the yellow pages (phone book) to draw a sample.
Small Samples
The size of the sample may be too small to support the conclusion
Loaded Questions
Survey questions can be worded to elicit a desired response
51. Issues:
Sample size
Was sample representative?
Was the survey question biased?
How was the survey conducted?
Is the graph constructed accurately?
Is their conclusion valid?
52. Remarks..
Their conclusion is not valid (it may still be
true). You need more information about the
sample and the sample size, as well as the
survey itself.
56. Abuses of Statistics ..3..
Precise Numbers
There are 1,03,21,502 households in a Metro town.
This is actually an estimate, and it would be best to say
there are about 1.03 crore households.
Distorted Percentages
100% improvement doesn’t mean perfect.
Deliberate Distortions
Lies, Lies, all Lies
57. Abuses of Statistics
Partial Pictures
“Ninety percent of all our cars sold in Gwalior in the
last 10 years are still on the road.”
Problem: What if the 90% were sold in the last 3
years?
58. Some research hypotheses
“If you know the outcome of your research, then
you are not doing research” - Einstein.
Hypothesis: “The relationship between Emotional
Intelligence and job performance will be stronger for
individuals whose job involves a greater amount of
interpersonal interaction.”
Hunch says it is true, and so do the research findings: axiomatic
hypothesis testing. (Source: XXX, Vol. 14, No. 4, Oct.-Dec. 2010,
pp. 250-252.)
Such research sheds no new light.
Statistical packages such as SPSS and LISREL have
made it seem as if you are doing in-depth research.
61. Web exercise
Demo exercise : Spurious correlations
http://www.tylervigen.com/
Interesting article on improbability !
http://www.jmp.com/landing/hand_improbability-principle.shtml
62. Checklist for A Statistical Project ..1..
Statement of purpose/question of interest
Summary of data collection e.g. random sample, stratified sample, available
data
Identify possible sources of bias
Why do you believe sample was representative?
Summarize the data (concise, well-labeled, easy to read)
Numerical or quantitative data
Graphs: Pie diagram or histogram
Measures of central tendency (e.g. mean or median)
Measures of spread (e.g. range, SD, IQR)
A check for outliers (e.g. z scores)
A check for normality (prob. plot, 68-95-99.7 rule) if needed by your analysis
Qualitative (categorical) data
Proportion in each category
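The numerical part of this checklist can be sketched in a few lines of Python. The data are hypothetical, and the cut-off of |z| > 3 for flagging outliers is a common convention rather than a fixed rule.

```python
import statistics

# Hypothetical measurements with one suspicious value.
data = [12, 14, 13, 15, 14, 13, 16, 15, 14, 13, 12, 14, 15, 13, 40]

mean = statistics.fmean(data)
median = statistics.median(data)
sd = statistics.stdev(data)
data_range = max(data) - min(data)

# Outlier check: flag values more than 3 standard deviations from the mean.
z_scores = [(value - mean) / sd for value in data]
outliers = [value for value, z in zip(data, z_scores) if abs(z) > 3]

print(f"Mean = {mean:.2f}, median = {median}, SD = {sd:.2f}, range = {data_range}")
print(f"Possible outliers: {outliers}")
```

With very small samples a z-score check can miss outliers, since a single extreme value also inflates the SD; a box plot based on the IQR is a useful cross-check.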
63. Checklist for A Statistical Project ..2..
Statistical inference
Quantitative data -e.g. confidence intervals for mean(s), hypothesis test for
mean(s), regression, ANOVA
Qualitative data
Include a discussion of why your method is appropriate
Diagnostics
Verification of any assumptions made during statistical inference
Interpretation/Explanation of results
What does it all mean?
Use the above summaries to justify your interpretation
Suggest reasons for what you have observed
Overall conclusion, recommendations, future scope
References
64. Quotable quotes !!
Every model is an approximation. It is the data that are real !
All models are wrong ; some models are useful.
Discovering the unexpected is more important than confirming the known !
Among the factors to be considered there will usually be the vital few and the
trivial many ( Juran)
There’s never been a signal without noise !
Not everything that can be counted counts and not everything that counts
can be counted (Albert Einstein)
66. Some YouTube presentations..
Sn | Title | Link | Duration
1 | Choosing which statistical test to use - statistics help | https://www.youtube.com/watch?v=rulIUAN0U3w | 9.32 minutes
2 | Intro to Hypothesis Testing in Statistics - Hypothesis Testing Statistics Problems & Examples | https://www.youtube.com/watch?v=VK-rnA3-41c | 23.40 minutes
3 | Null and Alternate Hypothesis - Statistical Hypothesis Testing | https://www.youtube.com/watch?v=_Qlxt0HmuOo | 14.51 minutes
4 | What is a p-value? | https://www.youtube.com/watch?v=HTZ8YKgD0MI | 5.43 minutes
67. TED Talks on Statistics
Sn | Title | Link
1 | Why you should love statistics (Alan Smith) | https://www.ted.com/talks/alan_smith_why_we_re_so_bad_at_statistics?referrer=playlist-statistically_speaking
2 | How juries are fooled by statistics (Peter Donnelly) | https://www.ted.com/talks/peter_donnelly_shows_how_stats_fool_juries?referrer=playlist-statistically_speaking
68. An interesting blog:
Not awful and boring ideas for teaching
statistics
http://notawfulandboring.blogspot.in/
By Jess Hartnett, Gannon University's Department of
Psychology and Counseling.
Statistically funny
Blog by Hilda Bastian, National Institutes of Health, USA
https://statistically-funny.blogspot.in/
69. Useful resource
Prof J P Verma, Professor of Statistics at
LNIPE, Gwalior and author of several books on
statistics
Visit : http://jpverma.org/
And get complimentary material
(presentations ) on statistics..
70. Summary :
Relevance of statistics in research
Validity
Will this study help answer the research
question?(Content validity?)
Analysis
What analysis, & how should this be interpreted
and reported?(Stat. packages?)
Efficiency
Is the experiment the correct size,
making best use of resources?(Time, budget?)
71. Information literacy & statistics**
First generation of literacy is when you know
how to read and write.
Second generation of literacy is when you are
computer literate
Third generation - it is critical for all of us to be
information literate
And, one of the purposes of statistics is to bring
about information literacy in the society
**Dr K C Chakrabarty, “Uses and misuses of statistics” address by Deputy Governor of
the Reserve Bank of India, at the DST Centre for Interdisciplinary Mathematical
Sciences, Faculty of Science, BHU, 20 March 2012.