This document provides an introduction to statistics and data visualization. It discusses key topics including descriptive and inferential statistics, variables and types of data, sampling techniques, organizing and graphing data, measures of central tendency and variation, and random variables. Specifically, it defines statistics as collecting, organizing, summarizing, analyzing and making decisions from data. It also outlines the main differences between descriptive statistics, which describes data, and inferential statistics, which uses samples to make estimations about populations.
2. Topics
1. Introduction to Statistics
2. Sampling Techniques
3. Types of Statistics
a) Descriptive Statistics
b) Inferential Statistics
4. Variables and Types of Data
a) Qualitative
b) Quantitative
i. Discrete
ii. Continuous
5. Organizing and Graphing Data
a) Qualitative Data
b) Quantitative Data
3. Introduction to Statistics
Statistics is a branch of science dealing with collecting, organizing, summarizing, analyzing and
making decisions from data.
Statistics is divided into two main areas:
1. Descriptive Statistics:
It deals with methods for collecting, organizing, and describing data by using tables, graphs, and
summary measures.
For example, 66% of eligible voters voted in the 2020 presidential election
2. Inferential Statistics:
It deals with methods that use sample results to estimate or make decisions about a
population.
The set of all elements (observations) of interest in a study is called a population, and a selected
set of elements from the population is called a sample.
For example, a poll suggests that 75% of voters will select Candidate A. People haven't voted yet, so
we don't know what will happen, but we could reasonably conclude that Candidate A will win the
election.
4. Data is factual information. We collect data from a population, the collection of all
individuals or items a researcher is interested in. A sample is a subset of the population
from which we get data.
5. Primary Data vs Secondary Data
Primary Data:
This data has been compiled by the researcher using techniques such as surveys,
experiments, in-depth interviews, observations, and focus groups.
Secondary Data:
This data has been compiled or published elsewhere, e.g., census data.
Advantages:
It can be gathered quickly and inexpensively, and it enables researchers to build on past
research.
Disadvantages:
The data may be outdated or inaccurate.
6. Sampling Techniques
It is known that in some cases it is hard to study a large population in order to draw
conclusions about a phenomenon. For example, suppose we are interested in studying
obesity in India and the researcher has a limited period of time, say three months. It is
practically impossible to survey all citizens within that period, so sampling is the best
way to perform the study and obtain representative results in a shorter period; it also
saves effort and money.
Types of Sampling Techniques:
1. Simple Random Sampling – It is applicable when the population is relatively small.
2. Systematic Method
3. Stratified Method
4. Clustered Sampling Method
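As a rough sketch, the first two techniques can be illustrated with Python's standard `random` module; the population of 100 student IDs below is hypothetical:

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n elements uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n):
    """Take every k-th element, where k = population size // n."""
    k = len(population) // n
    return population[::k][:n]

students = list(range(1, 101))          # a hypothetical population of 100 IDs
print(simple_random_sample(students, 10, seed=42))
print(systematic_sample(students, 10))  # every 10th student: 1, 11, 21, ...
```

Systematic sampling is cheap but assumes the population list carries no periodic pattern; simple random sampling avoids that risk at the cost of needing a full list of the population.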
8. Stratified Sampling Method
In statistics, a subset of a population that shares some characteristics is called a
'stratum' (the plural is strata). In such conditions the stratified sampling method is
used, and elements are selected randomly from each stratum.
9. Clustered Sampling Method
Usually, each cluster consists of heterogeneous elements grouped on a
geographical basis. The advantage of this method is that it is cheaper than other
methods. Cluster sampling is used when the population is large or when it
involves elements residing in a large geographic area.
Suppose a researcher is interested in surveying the number of students in engineering
colleges, and assume that there are 25 colleges in Telangana. The researcher can
select 5 colleges across the geographical regions and survey all students in those
colleges. This type of cluster sampling is called 'single-stage cluster sampling'.
Assume that the needed sample size is 3000 students. In the first stage the
researcher chooses 5 colleges; the 3000 students are then selected from the five
colleges using the simple random sampling method or the systematic method. In this
case, we have 'two-stage cluster sampling'.
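The two-stage procedure can be sketched in Python. The college names and student IDs below are hypothetical, and the equal per-college allocation (3000 / 5 = 600 students each) is one simple way to reach the target sample size:

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, sample_size, seed=0):
    """Stage 1: randomly pick clusters; stage 2: simple random sample within each."""
    rng = random.Random(seed)
    chosen = rng.sample(list(clusters), n_clusters)
    per_cluster = sample_size // n_clusters
    sample = []
    for name in chosen:
        sample.extend(rng.sample(clusters[name], per_cluster))
    return chosen, sample

# Hypothetical data: 25 colleges, each with a list of 1200 student IDs.
colleges = {f"college_{i}": [f"c{i}_s{j}" for j in range(1200)] for i in range(25)}
chosen, sample = two_stage_cluster_sample(colleges, n_clusters=5, sample_size=3000)
print(len(chosen), len(sample))  # 5 3000
```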
10. Variables and Types of Data
The following basic terms are used frequently in statistical problems:
• an element,
• a variable and their types,
• a measurement, and
• a dataset.
Note:
Any study is based on a problem or phenomenon, such as heavy traffic,
accidents, rating scales, or grades. The researcher should define the
variables of interest before collecting data.
11. An element (or member of a sample or population) is a specific subject or object
about which the information is collected.
A variable is a characteristic under study that takes different values for different
elements.
For example, if we collect information about income of households, then income
is a variable. These households are expected to have different incomes; also,
some of them may have the same income.
The value of a variable for an element is called an observation or
measurement.
14. Types of Variables
In statistics, we have two types of variables, classified according to their values:
1. Quantitative/ Numerical
Quantitative variable gives us numbers representing counts or measurements.
Ex.: price of a shirt
a) Discrete – values that can be counted
b) Continuous – can take all values between any two specific values; these often
include fractions and decimals
2. Qualitative/ Categorical
A qualitative (or categorical) variable gives us names or labels, rather than
numbers, to represent the observations.
a) Nominal
b) Ordinal
18. Levels of Variables Based on Measurement Scale
Classification of Variables:
1. Nominal - The nominal level of measurement classifies data into mutually
exclusive (disjoint) categories in which no order or ranking can be imposed on the
data.
2. Ordinal - The ordinal level of measurement classifies data into categories that can
be ordered; however, precise differences between the ranks do not exist.
21. Organizing and Graphing Qualitative Data
Data recorded in the sequence in which they are collected, before they are
processed or ranked, are called raw data.
Suppose we collect information on the scores of 20 students from a class. The
data values, in the order they are collected, are recorded in a table
(quantitative data).
22. Following the previous example, suppose we ask the same students for their
grades (A, B, C, or D) after they have completed the course, and record the values in Table 2
(qualitative data).
Methods of organizing Qualitative Data:
1. Frequency Table
2. Relative Frequency and Percentage Distributions
3. Graphical Presentation of Qualitative Data
23. Frequency Table
Assume that a sample of 50 students from a class was selected, and those
students were asked about their degree of satisfaction with the program. The
responses are recorded below, where (v) means very high satisfaction, (s) means
somewhat satisfied, and (n) means not satisfied.
24. Relative Frequency and Percentage Distributions
The relative frequency shows what proportion of the total frequency
belongs to the corresponding category; the relative frequency of a
category is therefore calculated by dividing the frequency of that category by the sum
of all frequencies.
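As an illustration, the computation can be done with Python's `collections.Counter`. The response counts below (20 'v', 18 's', 12 'n' out of 50) are hypothetical, chosen only to match the survey setup:

```python
from collections import Counter

# Hypothetical responses for the 50-student satisfaction survey:
# 'v' = very high, 's' = somewhat, 'n' = no satisfaction.
responses = ['v'] * 20 + ['s'] * 18 + ['n'] * 12

freq = Counter(responses)
total = sum(freq.values())
for category in ('v', 's', 'n'):
    rel = freq[category] / total                        # relative frequency
    print(category, freq[category], rel, f"{rel:.0%}")  # percentage = rel * 100
```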
25. Graphical Presentation of Qualitative Data
There are many types of graphs used to display qualitative data; in this part we will study two of the
most commonly used, the bar chart and the pie chart.
A graph made of bars whose heights represent the frequencies of respective categories is called a bar
graph.
A circle divided into portions that represents the relative frequencies or percentages of a population or a
sample of different categories is called a pie chart.
26. Organizing and Graphing Quantitative Data
Frequency Distributions
The following data table represents the scores of 50 students in a statistics course.
Use this data to
a. Construct the frequency distribution table by using 12 classes
b. Construct the relative frequency distribution table
c. Find the percentages
27. Solution:
Step 1: The largest score is 90 and the lowest is 37, so the range = 90 − 37 = 53.
Step 2: Divide the range by the number of classes (twelve): 53 / 12 ≈ 4.4.
Step 3: Round the number in step 2 up to the next whole number, which is 5. Thus, the class width
is 5.
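The three steps can be sketched in Python. The score list below is a hypothetical data set whose minimum (37) and maximum (90) match the example:

```python
import math
from collections import Counter

def frequency_distribution(data, num_classes):
    """Group data into classes of equal width (width = range / num_classes, rounded up)."""
    low, high = min(data), max(data)
    width = math.ceil((high - low) / num_classes)   # steps 2 and 3
    counts = Counter((x - low) // width for x in data)
    table = []
    for i in range(num_classes):
        lower = low + i * width
        table.append((lower, lower + width - 1, counts.get(i, 0)))
    return table

# Hypothetical scores with minimum 37 and maximum 90, as in the example.
scores = [37, 42, 44, 51, 55, 58, 63, 66, 70, 74, 78, 81, 85, 88, 90]
for lower, upper, f in frequency_distribution(scores, 12):
    print(f"{lower}-{upper}: {f}")
```

Dividing each frequency by `len(data)` would give the relative frequency distribution, and multiplying by 100 the percentages.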
28. Graphing Grouped Data:
Grouped data can be displayed in a histogram, a polygon, or an ogive. In this part we will
learn how to construct such graphs.
1. A histogram of grouped data in a frequency distribution table with equal class
widths is a graph in which class boundaries are marked on the horizontal axis and
the frequencies, relative frequencies, or percentages are marked on a vertical axis.
31. The Histogram gives us insight into the shape of the data distribution, literally how the values
are distributed across the bins. The part of the distribution that “trails off” to one or both sides is
called a tail of the distribution.
32. We can also use a histogram to identify modes. For numeric data, especially continuous
variables, we think of modes as prominent peaks.
33. Finally, we can also “smooth out” these histograms and use a smooth curve to examine the
shape of the distribution. Below are the smooth curve versions of the distributions shown in the
four histograms used to demonstrate skew and symmetry.
34. 3. Ascending Cumulative Frequency Curve (Ogive):
An ogive is a curve drawn from the ascending cumulative frequencies of grouped data in
a distribution table: first plot dots above the upper boundaries of the classes at heights
equal to the ascending cumulative frequencies of the respective classes, then join these
points with a smooth curve.
35. Numerical Descriptive Measures
1. Measures of Central Tendency
a) Mean
b) Median
c) Mode
2. Measures of Variation
a) Maximum and Minimum
b) Range
c) Mean Deviation
d) Standard Deviation
e) Coefficient of Variation
3. Measures of Position
a) Standard Scores
b) Quartile and Interquartile Range
36. Measures of Central Tendency
Mean:
The most commonly used measure of central tendency is called mean (or the
average).
40. Median:
The median is the value of the middle term in a data set that has been ranked in
increasing or decreasing order.
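A quick sketch with Python's `statistics` module, applied to a small data set taken from the mode examples in this document:

```python
import statistics

data = [76, 150, 95, 750, 124, 985, 87, 490]

mean = statistics.mean(data)      # sum of the values divided by their count
median = statistics.median(data)  # middle term of the ranked data
print(mean, median)               # 344.625 137.0
```

For an even number of values, as here, the median is the average of the two middle terms of the ranked data.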
41. Median for Grouped Data:
Let a frequency distribution table be given as follows:
42. Mode:
The mode is the value that occurs most often in a data set.
1. Find the mode for the given data set:
76, 150, 95, 750, 124, 985, 87, 490
Solution:
Because each value in this data set occurs only once, this data set contains no mode.
2. Find the mode for the given data set:
77, 82, 74, 81, 79, 84, 74, 78
Solution:
Here the value 74 occurs twice, that is with the highest frequency, so 74 is the mode.
3. Find the mode for the given data set
7, 8, 9, 10, 11, 12, 13, 13, 14, 17, 17, 45
Solution:
Since each of the two values 13 and 17 occurs twice, while each of the remaining values
occurs only once, this data set has two modes: 13 and 17.
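The three cases can be checked with a small Python helper; per the definition above, a data set in which no value repeats is treated as having no mode:

```python
from collections import Counter

def modes(data):
    """Return all values with the highest frequency; empty list if every value is unique."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []   # no value repeats: no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([76, 150, 95, 750, 124, 985, 87, 490]))          # []
print(modes([77, 82, 74, 81, 79, 84, 74, 78]))               # [74]
print(modes([7, 8, 9, 10, 11, 12, 13, 13, 14, 17, 17, 45]))  # [13, 17]
```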
43. Measures of Variation
Range:
The range for a data set depends on two values, the smallest and the largest
among all values in the data set.
Range = Largest value – Smallest value
1. Find the range for the data set:
40, 10, 20, 30, 35, 40, 50, 60
Solution: The largest value is 60, and the smallest value is 10.
Therefore Range = Largest value – Smallest value = 60 – 10 = 50
44. Variance and Standard Deviation:
The most widely used measure of variation is the standard deviation, denoted by σ for the
population and S for the sample. Its numerical value tells us how closely the
values of the data set are clustered around the mean.
Another measure of variation is the variance, denoted by σ² for the population
and S² for the sample; it is found by squaring the standard deviation.
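A minimal sketch with Python's `statistics` module, reusing the data set from the range example; `pstdev`/`pvariance` are the population measures (σ, σ²) and `stdev`/`variance` the sample measures (S, S²):

```python
import statistics

data = [40, 10, 20, 30, 35, 40, 50, 60]   # the data set from the range example

sigma = statistics.pstdev(data)   # population standard deviation (divide by N)
s = statistics.stdev(data)        # sample standard deviation (divide by n - 1)

# The variance is the square of the corresponding standard deviation.
print(statistics.pvariance(data), sigma ** 2)
print(statistics.variance(data), s ** 2)
```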
46. Coefficient of Variation:
The coefficient of variation (CV) is a well-known measure used to compare the
variability of two data sets that have different units of measurement. It addresses
one disadvantage of the standard deviation: being a measure of absolute
variability rather than relative variability.
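A sketch of the CV expressed as a percentage; the height and weight data sets below are hypothetical, chosen only to show a unit-free comparison:

```python
import statistics

def coefficient_of_variation(data):
    """CV = standard deviation / mean, often reported as a percentage."""
    return statistics.pstdev(data) / statistics.mean(data) * 100

heights_cm = [160, 165, 170, 175, 180]   # hypothetical heights in centimetres
weights_kg = [55, 60, 70, 80, 95]        # hypothetical weights in kilograms

# CV is unit-free, so the two spreads can be compared directly.
print(f"heights: {coefficient_of_variation(heights_cm):.1f}%")
print(f"weights: {coefficient_of_variation(weights_kg):.1f}%")
```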
48. Measures of Position:
For a population or a sample data set, measures of position are tools used to
determine the position of a single value in relation to the other values.
Types of Measures:
• Standard Scores
• Quartiles
• Percentiles
• Percentile Rank
1. Standard Scores:
A standard score (z-score) measures how many standard deviations a value x lies
from the mean: z = (x − μ)/σ for a population, or z = (x − x̄)/S for a sample.
50. Quartiles:
Any data set can be divided into four equal parts using summary measures called
quartiles. There are three quartiles, denoted Qi for i = 1, 2, 3.
Quartiles are three summary measures that divide a ranked data set into four equal
parts. The second quartile is the same as the median; the first quartile is the middle
term among the observations that are less than the median; and the third quartile is
the middle term among the observations that are greater than the median.
Interquartile Range:
A measure of variation that depends on the first and the third quartiles is called the
interquartile range (IQR), defined as IQR = Q3 − Q1.
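The definition above can be turned into a short Python function. Note that this median-of-halves rule is only one of several quartile conventions; other software may give slightly different values:

```python
import statistics

def quartiles(data):
    """Q1/Q3 as medians of the values below/above the median, per the definition above."""
    x = sorted(data)
    n = len(x)
    half = n // 2
    q2 = statistics.median(x)
    q1 = statistics.median(x[:half])           # middle term of the lower half
    q3 = statistics.median(x[half + n % 2:])   # middle term of the upper half
    return q1, q2, q3

data = [7, 8, 9, 10, 11, 12, 13, 13, 14, 17, 17, 45]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3, "IQR =", q3 - q1)   # 9.5 12.5 15.5 IQR = 6.0
```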
53. Box Plot:
It summarizes the data with five statistics (minimum, Q1, median, Q3, and maximum) plus extreme observations.
Drawing a box plot:
1. Draw the vertical axis to include all possible values in the data.
2. Draw a horizontal line at the median, at Q1, and at Q3. Use these to form a box.
3. Draw the whiskers. The whiskers' upper limit is Q3 + 1.5×IQR and the lower limit is Q1 − 1.5×IQR.
4. The actual whiskers are then drawn at the closest data points within the limits.
5. Any points outside the whisker limits are drawn as individual points. These are potential outliers.
(Potential) outliers can help us:
- examine skew (outliers in the negative direction suggest left skew; outliers in the positive direction suggest right skew);
- identify issues with data collection or entry, especially if the value of the outliers doesn't make sense.
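Steps 3-5 reduce to a simple rule for flagging potential outliers. The sketch below uses Q1 = 9.5 and Q3 = 15.5, obtained by hand (medians of the lower and upper halves) for the twelve-value data set from the mode examples:

```python
def outliers(data, q1, q3):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the whisker limits."""
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lower or x > upper]

data = [7, 8, 9, 10, 11, 12, 13, 13, 14, 17, 17, 45]
print(outliers(data, q1=9.5, q3=15.5))   # [45] -- far above the upper whisker
```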
58. Random Variables:
Def: A random variable is the result of a chance event that you can measure.
(or)
A random variable is a function that assigns a numerical value to each of an experiment's
outcomes; its value is unknown until the random event occurs.
Random variables are used in econometric and regression analysis to determine statistical relationships among
variables. A typical example of a random variable is the outcome of a two-coin toss.
59. Outcome of a two-coin toss:
Let X be the random variable and P(x) its probability distribution, where H means
head and T means tail. If X is the number of heads we get from tossing two coins,
each outcome gives:
H H → x = 2
H T → x = 1
T H → x = 1
T T → x = 0
so X can take the values 0, 1, or 2.
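The distribution of X can be reproduced by enumerating the four equally likely outcomes:

```python
from itertools import product
from collections import Counter

# Enumerate the four equally likely outcomes of tossing two coins,
# and let X be the number of heads in each outcome.
outcomes = list(product('HT', repeat=2))           # HH, HT, TH, TT
x_counts = Counter(o.count('H') for o in outcomes)

for x in sorted(x_counts):
    print(f"P(X = {x}) = {x_counts[x]}/{len(outcomes)}")
```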
Examples of random variables:
• A person's blood type
• Number of persons who visited a park
• Number of times a user visits LinkedIn in a day
60. Types of Random Variables:
1. Discrete Random Variable
2. Continuous Random Variable
1. Discrete Random Variable:
Discrete random variables take on a countable number of distinct values. Consider an experiment where a
coin is tossed three times.
If X represents the number of times that the coin comes up heads, then X is a discrete random variable that can
only take the values 0, 1, 2, or 3. No other value is possible for X.
Discrete random variables are usually counts.
Example:
• the number of children in a family.
• the Friday night attendance at a cinema
• the number of patients in a doctor's surgery.
• the number of defective light bulbs in a box of ten.
61. 2. Continuous Random Variables:
Continuous random variables can represent any value within a specified range (e.g., heights from 4 ft to 6 ft)
and can take on an infinite number of possible values.
An example of a continuous random variable is an experiment that involves measuring the average height
of a random group of 25 people.
If X represents the random variable for this average height, the resulting outcome is a continuous figure,
since a height may be 5', 5'4", 5'8", and so on.
Clearly, there is an infinite number of possible values for height.
63. Sampling Distribution
A sampling distribution is the probability distribution of a statistic computed from random samples drawn from
a particular population.
Many researchers, academicians, market strategists, etc. work with sampling distributions instead of the
entire population; this makes the data set smaller and more manageable.
Example 1:
Suppose a marketer wants to analyse the number of youths aged 13-18 riding a bicycle in two regions.
For this purpose, he will not take into account the entire population aged 13-18 in the two regions, which is
practically impossible; even if it were done, it would be too time-consuming and the data set would be
unmanageable.
Instead, the marketer takes a sample of 200 from each region and works out the distribution.
The average count of bicycle usage here is termed the sample mean. Each sample chosen has its own
mean, and the distribution of these sample means is the sampling distribution. The
deviation obtained is termed the standard error.
64. Example 2:
Assume that a researcher is studying the weights of the inhabitants of a particular town and has five
observations or samples: 70 kg, 75 kg, 85 kg, 80 kg, and 65 kg. The town's weights are considered to be
normally distributed with a standard deviation of 5 kg in the aspect of weight measures. Thus the mean can
be calculated as (70+75+85+80+65)/5 = 75 kg.
Also, we assume that the population size is huge; thus, for the second step, we divide 1 by the number of
observations or samples, i.e., 1/5 = 0.20. Now we take the square root of 0.20, which comes to about 0.447. The
square root is then multiplied by the standard deviation, i.e., 0.447 × 5 ≈ 2.24 kg. Thus the standard error obtained
is about 2.24 kg, and the mean obtained was 75 kg. These two factors can be used to describe the distribution.
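The computation in Example 2 can be checked directly; the standard error of the mean is σ/√n:

```python
import math
import statistics

weights = [70, 75, 85, 80, 65]   # the five sample weights, in kg
n = len(weights)
sigma = 5                        # the population standard deviation assumed above

mean = statistics.mean(weights)   # (70 + 75 + 85 + 80 + 65) / 5 = 75
se = sigma * math.sqrt(1 / n)     # standard error = sigma / sqrt(n)
print(mean, round(se, 2))         # 75.0 2.24
```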
Types of Sampling Distribution
1. Sampling Distribution of the Mean
This can be defined as the probability distribution of all the means of samples of a fixed size chosen on a
random basis from a particular population. When the samples are drawn from a normal population, the
distribution of the sample mean is also normal. Even if the population is not normal, the distribution of
the sample means tends toward the normal distribution provided that the sample size is quite large.
2. Sampling Distribution of Proportion
This is primarily associated with the statistics of attributes; here the role of the binomial distribution comes into
play. Generally, it follows the laws of the binomial distribution, but as the sample size increases it
approaches the normal distribution.
66. 3. Student’s T-Distribution
This type of distribution is used when the standard deviation of the population is unknown to the researcher or
when the size of the sample is very small. It is symmetrical, like the standard normal variate, and as the
sample size increases the t distribution tends to become very close to the normal
distribution.
4. F Distribution
The F distribution is used when the greater variance is placed in the numerator. As the degrees
of freedom change, the critical values of F change too, for both large and small variances; these
can be read from the available tables.
The measured value of F from the sample set is compared with the value from the table: if the
former is equal to or larger than the table value, the null hypothesis of the
study is rejected.
5. Chi-Square Formula Distribution
This type of distribution is used when the data set involves values obtained by adding up squares.
The squared quantities belonging to the variances of samples are added, and the resulting distribution
spread is called the chi-square distribution.
68. Inferential Statistics
Inferential statistics is one of the two main branches of statistics. It draws inferences from data;
it makes a logical judgment.
Example: 80% of all vehicles in the USA are 4-wheelers.
Inferential statistics uses a random sample of data from a population to describe and make inferences
about the population.
Inferential statistics uses statistical (mathematical) models to compare your sample data to other
samples or to previous research.
Example:
you might stand in front of a mall and ask a sample of 100 people whether they like shopping, and then
make a bar chart of the yes and no answers (that would be descriptive statistics).
Whereas, you could use your research (and inferential statistics) to reason that around 75-80% of the
population (all shoppers in all malls) like shopping.
69. There are two main areas of Inferential Statistics
1. Estimating parameters:
This means taking a statistic from your sample data (for example the sample mean) and using it to say
something about a population parameter (i.e., the population mean).
(For example, if ¼ kg of rice serves 2 people, how much is required for 1000 people?)
2. Hypothesis tests:
A hypothesis is an assumption, an idea, or a guess. Hypothesis testing is the process
of determining whether a given hypothesis is true or not.
Examples of hypotheses:
whether a new cancer drug is effective,
or
whether breakfast helps children perform better in school.
70. Hypothesis
1. Statement of Hypothesis
If you want to test a relationship between two or more things, you need to write hypotheses before you start your
experiment or data collection.
2. Null Hypothesis
A null hypothesis (H0) is a hypothesis stating that there is no relationship between two population parameters.
In inferential statistics, the null hypothesis is the default hypothesis that the quantity to be measured is zero,
i.e., H0: the effect = 0.
3. Level of Significance (α)
The level of significance is the probability of rejecting the null hypothesis when it is true.
For example, a significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is none.
4. Level of Confidence (1 − α)
The level of confidence refers to the percentage of all possible samples that can be expected to include the true
population parameter.
71. 5. Rejection region
The rejection region is the ‘Area’ where the null hypothesis is rejected.
6. Non rejection region
The non-rejection region is the ‘Area’ where the null hypothesis is not rejected.
72. 7. One-tail and Two-tail Test
A one-tailed test rejects the null hypothesis only when the sample
mean is larger than the population mean, or only when it is smaller,
in one specified direction. This test is also referred to as a directional test or
directional hypothesis. The test is run to prove a claim either true
or false.
If a test considers the sample mean being both larger and smaller
than the population mean, it is a two-tailed test.
73. Types of Inferential Statistics Tests:
1. Linear Regression Analysis
2. Analysis of Variance
3. Analysis of Co-variance
4. Statistical Significance (T-Test)
5. Correlation Analysis
1. Linear Regression is a linear model, that
assumes a linear relationship between the
Independent variable (x) and the dependent
variable (y).
Here, y can be calculated from a linear
combination of the Independent variable (x).
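A minimal least-squares sketch of this idea; the toy data set below is hypothetical and exactly linear, so the fitted slope and intercept are easy to verify:

```python
def fit_line(xs, ys):
    """Least-squares slope b and intercept a for y = a + b*x (a minimal sketch)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]   # a toy data set that is exactly y = 2x
a, b = fit_line(xs, ys)
print(a, b)             # intercept 0.0, slope 2.0
```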
74. 2. Analysis of Variance (ANOVA):
This is another statistical method, extremely popular in data science. It is used to test and analyse whether the
differences between two or more means from the data set are larger than would be expected from error alone.
3. Analysis of Co-variance (ANCOVA):
It is an extension of the previous (ANOVA) models and it includes the combination of ANOVA and regression.
4. Statistical Significance (T-Test):
It is used to determine the significant difference between the means of two groups, which may be related in
certain features.
5. Correlation Analysis:
It is used to evaluate the strength of the relationship between two variables.
The regression slope is either negative or positive, depending on the variables.
A negative correlation means that the value of one variable decreases while the value of the other increases; a
positive correlation means that the values of both variables decrease or increase simultaneously.
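The strength and sign of the relationship can be computed as the Pearson correlation coefficient r; a minimal sketch with hypothetical data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient; r lies in [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0  (perfect positive)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0 (perfect negative)
```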
76. Differences between Descriptive and Inferential Statistics:
Descriptive Statistics / Inferential Statistics
- Describes the characteristics or properties of the target data. / Makes inferences from the sample and generalizes them to the population.
- Organizes, analyses, and presents the data in a meaningful way. / Compares, tests, and predicts the data for future outcomes.
- The analysed results are in the form of graphs, charts, plots, etc. / The analysed results are probability scores (40%, 80%, 95.3%, etc.).
- Describes the data that is already known. / Tries to draw conclusions about the population beyond the available data.
- Tools: measures of central tendency and measures of spread. / Tools: hypothesis tests, analysis of variance, etc.
77. Dr. Rambabu Palaka
Professor
School of Engineering
Malla Reddy University, Hyderabad
Mobile: +91-9652665840
Email: drrambabu@mallareddyuniversity.ac.in
Reference:
Introduction to Statistics Made Easy, by Prof. Dr. Hamid Al-Oklah, Dr. Said Titi, and Mr. Tareq Alodat.