Data analysis
• Data analysis is the technique of analysing data to enhance productivity and business growth. It involves processes such as cleansing, transforming, inspecting and modelling data in order to perform market analysis, gather the hidden insights in the data, improve business strategy and generate reports from the available data, using data analysis tools such as Tableau, Power BI, R and Python, Apache Spark, etc.
• In short, it refers to the technique of analysing data to enhance productivity and grow the business. It is the process of inspecting, cleansing, transforming, and modelling data.
Why do we need Data Analysis?
We need data analysis mainly for the reasons mentioned below:
• Gather hidden insights.
• Generate reports based on the available data.
• Perform market analysis.
• Improve business strategy.
Decision Science is the collection of quantitative techniques
used to inform decision-making at the individual and
population levels.
It includes decision analysis, risk analysis, cost-benefit and
cost-effectiveness analysis, constrained optimization,
simulation modeling, and behavioral decision theory, as well
as parts of operations research, microeconomics, statistical
inference, management control, cognitive and social
psychology, and computer science.
Decision Science
[Diagram: the pipeline Data collection → Processing & Modeling → Analysis & Insight, showing how data analysis and data conversion turn Data into Information and Intelligence.]
Data analytics
• Data analytics is the collection, transformation, and
organization of data in order to draw conclusions, make
predictions, and drive informed decision making.
• Data analytics is often confused with data analysis.
While these are related terms, they aren’t exactly the
same. In fact, data analysis is a subcategory of data
analytics that deals specifically with extracting meaning
from data. Data analytics, as a whole, includes
processes beyond analysis, including data
science (using data to theorize and forecast) and data
engineering (building data systems).
So why Data Analytics?
With data analytics, businesses can uncover hidden patterns and meaning in customer behavior.
For businesses, this means:
1. Informed Decision Making.
2. More Effective Marketing
3. More Efficient Operations
4. Cutting Costs.
What is Sampling?
Sampling is a method that allows us to get information about the
population based on the statistics from a subset of the population
(sample), without having to investigate every individual.
Why do we need Sampling?
Sampling is done to draw conclusions about populations
from samples, and it enables us to determine a
population’s characteristics by directly observing only a
portion (or sample) of the population.
• Selecting a sample requires less time than selecting every item in
a population
• Sample selection is a cost-efficient method
• Analysis of the sample is less cumbersome and more practical
than an analysis of the entire population
Population vs sample
• The population is the entire
group that you want to draw
conclusions about.
• The sample is the specific group
of individuals that you will
collect data from.
The population can be defined in
terms of geographical location, age,
income, and many other
characteristics.
Learn how to determine sample size
Stage 1: Consider your sample size variables
1. Population size
2. Margin of error (confidence interval)
3. Confidence level
4. Standard deviation
Stage 2: Calculate sample size
5. Find your Z-score
6. Use the sample size formula
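As a hedged illustration of Stage 2, the short Python sketch below combines the Z-score with the standard (Cochran) sample-size formula and a finite-population correction. The 95% confidence level (Z = 1.96), 5% margin of error and worst-case proportion of 0.5 are assumed defaults for the example, not values from the slides.

```python
# A minimal sketch of the sample-size calculation outlined above.
# Assumptions: 95% confidence (Z = 1.96), 5% margin of error, and a
# worst-case proportion p = 0.5 in place of the standard deviation.
import math

def sample_size(population_size, margin_of_error=0.05, z_score=1.96, p=0.5):
    # Step 5: the Z-score comes from the chosen confidence level (1.96 for 95%).
    # Step 6: Cochran's formula for an (effectively) infinite population.
    n0 = (z_score ** 2) * p * (1 - p) / (margin_of_error ** 2)
    # Finite-population correction so the result never exceeds the population.
    n = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n)

print(sample_size(population_size=10_000))  # roughly 370 respondents
```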
Different Types of Sampling Techniques
• Probability Sampling: In probability sampling, every element of the population has a known, non-zero chance of being selected. Probability sampling gives us the best chance to create a sample that is truly representative of the population.
• Non-Probability Sampling: In non-probability sampling, elements do not all have a known chance of being selected. Consequently, there is a significant risk of ending up with a non-representative sample which does not produce generalizable results.
Types of Probability Sampling
1. Simple Random Sampling
This is a type of sampling technique you must have come across at some point.
Here, every individual is chosen entirely by chance and each member of the
population has an equal chance of being selected.
Simple random sampling reduces selection bias.
One big advantage of this technique is
that it is the most direct method of
probability sampling. But it comes with a
caveat – it may not select enough
individuals with our characteristics of
interest.
Monte Carlo methods use repeated
random sampling for the estimation of
unknown parameters
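A minimal Python sketch of simple random sampling is shown below; the population of 20 numbered individuals and the sample size of 5 are illustrative, not from the slides.

```python
# A minimal sketch of simple random sampling: every individual has an
# equal chance of being drawn.
import random

population = list(range(1, 21))          # individuals 1..20 (illustrative)
sample = random.sample(population, k=5)  # 5 individuals chosen entirely by chance
print(sample)
```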
2.Systematic Sampling
In this type of sampling, the first individual is selected randomly and others are
selected using a fixed ‘sampling interval’. Let’s take a simple example to
understand this.
Say our population size is x and we have to select a sample of size n. The sampling interval is then x/n: after picking the first individual at random, every subsequent individual we select is x/n positions further along. We can select the rest in the same way.
Systematic sampling is more convenient than
simple random sampling. However, it might
also lead to bias if there is an underlying
pattern in which we are selecting items from
the population (though the chances of that
happening are quite rare).
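The sketch below illustrates systematic sampling in Python; the population size of 20 and sample size of 5 are illustrative, which gives a sampling interval of k = 4.

```python
# A minimal sketch of systematic sampling with a random start and a
# fixed sampling interval k.
import random

population = list(range(1, 21))   # individuals 1..20 (illustrative)
n = 5
k = len(population) // n          # sampling interval
start = random.randrange(k)       # random starting position
sample = population[start::k][:n] # every k-th individual after the start
print(sample)                     # e.g. [3, 7, 11, 15, 19]
```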
3.Stratified Sampling
In this type of sampling, we divide the population into subgroups
(called strata) based on different traits like gender, category, etc.
And then we select the sample(s) from these subgroups:
We use this type of sampling
when we want
representation from all the
subgroups of the
population. However,
stratified sampling requires
proper knowledge of the
characteristics of the
population.
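A minimal sketch of stratified sampling follows; the two gender strata and the per-stratum sample size of 2 are illustrative assumptions.

```python
# A minimal sketch of stratified sampling: draw a simple random sample
# from each stratum so every subgroup is represented.
import random

strata = {
    "male":   list(range(1, 11)),    # individuals 1..10 (illustrative)
    "female": list(range(11, 21)),   # individuals 11..20 (illustrative)
}
sample = {group: random.sample(members, k=2) for group, members in strata.items()}
print(sample)
```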
4.Cluster Sampling
In a clustered sample, we use the subgroups of the population as the
sampling unit rather than individuals. The population is divided into
subgroups, known as clusters, and a whole cluster is randomly selected
to be included in the study:
For example, suppose we divide our population into 5 clusters, each consisting of 4 individuals, and we randomly take the 4th cluster into our sample. We can include more clusters as per our sample size.
This type of sampling is used
when we focus on a specific
region or area.
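The sketch below mirrors that example in Python; the population of 20 split into 5 clusters of 4 is illustrative.

```python
# A minimal sketch of cluster sampling: the population is split into
# clusters and one whole cluster is drawn at random.
import random

population = list(range(1, 21))                              # individuals 1..20
clusters = [population[i:i + 4] for i in range(0, 20, 4)]    # 5 clusters of 4
chosen_cluster = random.choice(clusters)   # every individual in it is sampled
print(chosen_cluster)                      # e.g. [13, 14, 15, 16]
```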
Types of Non-Probability Sampling
1.Convenience Sampling
This is perhaps the easiest method of sampling because individuals are selected
based on their availability and willingness to take part.
Here, let’s say individuals numbered 4, 7, 12, 15 and 20 want to be part of our
sample, and hence, we will include them in the sample.
Convenience sampling is prone to significant bias, because the sample may not represent specific characteristics of the population, such as religion or gender.
2.Quota Sampling
In this type of sampling, we choose items based on predetermined
characteristics of the population. Consider that we have to select
individuals having a number in multiples of four for our sample:
Therefore, the individuals
numbered 4, 8, 12, 16, and 20 are
already reserved for our sample.
In quota sampling, the chosen sample might not be a good representation of those characteristics of the population that were not considered when forming the quotas.
3.Judgment Sampling
It is also known as selective sampling. It depends on the
judgment of the experts when choosing whom to ask to
participate.
Suppose, our experts believe that
people numbered 1, 7, 10, 15,
and 19 should be considered for
our sample as they may help us
to infer the population in a better
way. As you can imagine, judgment sampling is also prone to bias from the experts and may not necessarily be representative.
4.Snowball Sampling
I quite like this sampling technique. Existing people are asked to
nominate further people known to them so that the sample increases
in size like a rolling snowball. This method of sampling is effective when
a sampling frame is difficult to identify.
Here, we had randomly chosen person 1 for
our sample, and then he/she recommended
person 6, and person 6 recommended person
11, and so on.
1->6->11->14->19
There is a significant risk of selection bias in snowball sampling, as the referred individuals will tend to share common traits with the person who recommends them.
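A minimal sketch of the referral chain above, in Python; the referral network is made up for illustration and simply mirrors the chain 1 -> 6 -> 11 -> 14 -> 19.

```python
# A minimal sketch of snowball sampling: start from a seed individual
# and follow referrals until the desired sample size is reached.
referrals = {1: 6, 6: 11, 11: 14, 14: 19}   # who each person recommends (illustrative)

sample, person = [], 1                      # person 1 is the randomly chosen seed
while person is not None and len(sample) < 5:
    sample.append(person)
    person = referrals.get(person)          # follow the next referral, if any
print(sample)                               # [1, 6, 11, 14, 19]
```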
Statistics simply means numerical data; it is the field of mathematics that deals with the collection, tabulation, and interpretation of numerical data.
Statistics
1. Descriptive Statistics :
Descriptive statistics describes a population, either through numerical calculations or through graphs and tables. It provides a summary of the data and is simply used for summarizing observations. There are two categories, as described below.
(a). Measure of central tendency –
A measure of central tendency, also known as a summary statistic, represents the center point or typical value of a data set or sample set.
In statistics, there are three common measures of central tendency, as shown below:
(i) Mean :
It is the average of all values in a sample set.
For example, the mean of 2, 4, 6 and 8 is (2 + 4 + 6 + 8) ÷ 4 = 5.
(ii) Median :
It is the central value of a sample set. The data set is ordered from the lowest to the highest value and the exact middle value is taken.
For example, the median of 3, 5, 7, 9, 11 is 7.
(iii) Mode :
It is the value that occurs most frequently in a sample set.
For example, the mode of 2, 3, 3, 5, 7 is 3.
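The short sketch below computes all three measures with Python's standard library; the data values are illustrative.

```python
# A minimal sketch of the three measures of central tendency.
import statistics

data = [2, 3, 3, 5, 7, 9, 11]

print(statistics.mean(data))     # arithmetic average of all values
print(statistics.median(data))   # middle value of the ordered data
print(statistics.mode(data))     # most frequent value (3)
```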
(b). Measure of Variability –
A measure of variability, also known as a measure of dispersion, is used to describe the variability in a sample or population. In statistics, there are three common measures of variability, as shown below:
(i) Range :
It measures how spread apart the values in a sample set or data set are.
Range = Maximum value - Minimum value
(ii) Variance :
It describes how much a random variable differs from its expected value, and it is computed as the average of the squared deviations from the mean:
S² = Σᵢ₌₁ⁿ (xᵢ − x̄)² ÷ n
In this formula, n represents the total number of data points, x̄ represents the mean of the data points, and xᵢ represents an individual data point.
(iii) Standard Deviation :
It measures the dispersion of a data set from its mean, and is the square root of the variance:
σ = √[ (1 ÷ n) Σᵢ₌₁ⁿ (xᵢ − μ)² ]
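The sketch below computes all three measures of variability; the data values are illustrative, and the population formulas (divide by n) are used to match the formulas above.

```python
# A minimal sketch of range, variance and standard deviation.
# statistics.pvariance / pstdev use the population formulas (divide by n).
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)       # Range = max - min
variance = statistics.pvariance(data)    # S² = Σ(xᵢ − x̄)² / n
std_dev = statistics.pstdev(data)        # σ = square root of the variance
print(data_range, variance, std_dev)     # 7 4.0 2.0
```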
2. Inferential Statistics :
• Inferential statistics makes inferences and predictions about a population based on a sample of data taken from that population. It generalizes from a large dataset and applies probabilities to draw conclusions.
• It is simply used to explain the meaning of descriptive statistics.
• It is simply used to analyze and interpret results, and to draw conclusions.
• Inferential statistics is mainly related to and associated with hypothesis testing, whose main goal is to test whether the null hypothesis can be rejected.
• Hypothesis testing is a type of inferential procedure that uses sample data to evaluate and assess the credibility of a hypothesis about a population.
• Inferential statistics is generally used to determine how strong a relationship is within the sample. However, it is very difficult to obtain a full population list and to draw a truly random sample.
Inferential statistics can be carried out with the help of the steps given below (a worked example follows the list):
• Obtain and start with a theory.
• Generate a research hypothesis.
• Operationalize the variables.
• Identify the population to which the study results should apply.
• Form a null hypothesis for this population.
• Collect a sample from the population and run the study.
• Then, perform statistical tests to check whether the observed characteristics of the sample are sufficiently different from what would be expected under the null hypothesis, so that the null hypothesis can be rejected.
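As a hedged illustration of the final step, the sketch below runs a one-sample t-test with SciPy; the sample values and the hypothesized population mean of 100 are made up for illustration, and SciPy is assumed to be installed.

```python
# A minimal sketch of testing whether a sample differs from what the
# null hypothesis expects (H0: population mean = 100).
from scipy import stats

sample = [103, 98, 110, 105, 99, 104, 108, 101]
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)

# Reject the null hypothesis if the p-value falls below the chosen significance level.
print(t_stat, p_value, "reject H0" if p_value < 0.05 else "fail to reject H0")
```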
Types of inferential statistics –
Various types of inferential statistics are used widely
nowadays and are very easy to interpret. These are
given below:
• One sample test of difference/One sample
hypothesis test
• Confidence Interval
• Contingency Tables and Chi-Square Statistic
• t-test or ANOVA
• Pearson Correlation
• Bivariate Regression
• Multivariate Regression
Prescriptive analytics is a process that analyzes data
and provides instant recommendations on how to
optimize business practices to suit multiple predicted
outcomes.
In essence, prescriptive analytics takes the “what we
know” (data), comprehensively understands that data to
predict what could happen, and suggests the best steps
forward based on informed simulations.
Predictive analytics: Predictive analytics applies
mathematical models to the current data to inform
(predict) future behavior. It is the “what could happen."
Types of Variables in Statistics
1. Quantitative Variables: Sometimes referred to as “numeric” variables,
these are variables that represent a measurable quantity. Examples
include:
• Number of students in a class
• Number of square feet in a house
• Population size of a city
• Age of an individual
• Height of an individual
2. Qualitative Variables: Sometimes referred to as “categorical” variables, these
are variables that take on names or labels and can fit into categories. Examples
include:
• Eye color (e.g. “blue”, “green”, “brown”)
• Gender (e.g. “male”, “female”)
• Breed of dog (e.g. “lab”, “bulldog”, “poodle”)
• Level of education (e.g. “high school”, “Associate’s degree”, “Bachelor’s
degree”)
• Marital status (e.g. “married”, “single”, “divorced”)
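The small pandas sketch below shows how these two variable types typically appear in a data set; the column names and values are illustrative, and pandas is assumed to be installed.

```python
# A minimal sketch of quantitative (numeric) vs qualitative (categorical) variables.
import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 41],                                   # quantitative
    "height_cm": [170.2, 165.0, 180.5],                    # quantitative
    "eye_color": ["blue", "brown", "green"],               # qualitative
    "marital_status": ["single", "married", "divorced"],   # qualitative
})

# Numeric columns support arithmetic; categorical columns support counting.
print(df.select_dtypes(include="number").mean())
print(df["eye_color"].value_counts())
```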
Nominal Scale
A nominal scale is the 1st level of measurement scale in which the numbers serve as “tags” or
“labels” to classify or identify the objects. A nominal scale usually deals with the non-numeric
variables or the numbers that do not have any value.
Characteristics of Nominal Scale
• A nominal scale variable is classified into two or more categories. In this measurement
mechanism, the answer should fall into either of the classes.
• It is qualitative. The numbers are used here to identify the objects.
• The numbers don’t define the object characteristics. The only permissible aspect of
numbers in the nominal scale is “counting.”
Example:
An example of a nominal scale measurement is given below:
What is your gender?
M- Male
F- Female
Here, the variables are used as tags, and the answer to this question should be either M or F.
Ordinal Scale
The ordinal scale is the 2nd level of measurement that reports the ordering and ranking of data
without establishing the degree of variation between them. Ordinal represents the “order.”
Ordinal data is known as qualitative data or categorical data. It can be grouped, named and
also ranked.
Characteristics of the Ordinal Scale
• The ordinal scale shows the relative ranking of the variables
• It identifies and describes the magnitude of a variable
• Along with the information provided by the nominal scale, ordinal scales give the rankings
of those variables
• The interval properties are not known
• The surveyors can quickly analyse the degree of agreement concerning the identified order
of variables
Example:
Ranking of school students – 1st, 2nd, 3rd, etc.
Ratings in restaurants
Evaluating the frequency of occurrences
• Very often
• Often
Assessing the degree of agreement
• Totally agree
• Agree
• Totally disagree
Interval Scale
The interval scale is the 3rd level of measurement scale. It is defined as a quantitative
measurement scale in which the difference between the two variables is meaningful.
In other words, the differences between values are measured in an exact and equal manner; however, the zero point on an interval scale is arbitrary, so values cannot be compared as ratios.
Characteristics of Interval Scale:
• The interval scale is quantitative as it can quantify the difference between the values
• It allows calculating the mean and median of the variables
• To understand the difference between the variables, you can subtract the values between
the variables
• The interval scale is widely preferred in statistics, as it allows numerical values to be assigned to otherwise arbitrary assessments such as feelings, calendar dates, etc.
Example:
• Likert Scale
• Net Promoter Score (NPS)
• Bipolar Matrix Table
Ratio Scale
The ratio scale is the 4th level of measurement scale, which is quantitative. It is a type of
variable measurement scale. It allows researchers to compare the differences or intervals. The
ratio scale has a unique feature. It possesses the character of the origin or zero points.
Characteristics of Ratio Scale:
• Ratio scale has a feature of absolute zero
• It doesn’t have negative numbers, because of its zero-point feature
• It affords unique opportunities for statistical analysis. The values can be meaningfully added, subtracted, multiplied and divided, and the mean, median, and mode can all be calculated on a ratio scale.
• Ratio scale has unique and useful properties. One such feature is that it allows unit
conversions like kilogram – calories, gram – calories, etc.
Example:
An example of a ratio scale is:
What is your weight in Kgs?
Less than 55 kgs
55 – 75 kgs
76 – 85 kgs
86 – 95 kgs
More than 95 kgs