What do you mean by Statistics?
• There are many definitions of statistics; however, the term can be understood in two senses:
a) Statistics in plural sense; and
b) Statistics in singular sense
• As a plural noun, statistics denotes numerical and quantitative information. Thus, in the plural sense,
it means the same thing as data, e.g., statistics of cricket match scores, price statistics, export-import
statistics, etc.
• In the singular sense, Statistics can be defined as a branch of science which deals with scientific methods of
collection, organization, presentation, analysis, and interpretation of data obtained by conducting a survey or
an experimental study.
Some Basic Concepts
a) POPULATION: The term is used to denote a well-defined set, group or aggregate of observations relating to
a phenomenon under statistical investigation. A population can be classified into 2 broad classes:
a. Finite, e.g., a bag of wheat lying in a warehouse, the number of students in PMLSD B-school, etc.
b. Infinite, e.g., the blood cell count in a certain region of the brain.
b) TARGET POPULATION: A subset of the population, e.g., wheat bags lying in the eastern corner
of the warehouse, the number of girls in IMS B-school, etc.
c) SAMPLE FRAME: The frame from which the actual sample is drawn. For example, if a study concerns India
but we have narrowed it to the NCR region only, the NCR region becomes our sample frame. In other words, the
sampling frame in a study is the target population only.
d) SAMPLE AND SAMPLING: A sample is a fraction or subset of the population drawn through a valid statistical
procedure so that it can be regarded as representative of the entire population. The valid statistical procedure
of drawing a sample from the population is called sampling.
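As an illustration of drawing a sample, here is a minimal sketch of simple random sampling, which is only one of several valid sampling procedures. The population of 500 student IDs, the sample size of 50, and the seed are hypothetical values chosen for the example.

```python
import random

# Hypothetical finite population: ID numbers of 500 students.
population = list(range(1, 501))

# A seeded generator makes the draw reproducible; the seed value is arbitrary.
rng = random.Random(42)

# Simple random sampling without replacement: every student has the same
# chance of being selected, and no student is selected twice.
sample = rng.sample(population, k=50)

print(len(sample))         # number of sampling units drawn
print(sorted(sample)[:5])  # a few of the selected IDs
```

Each ID in `sample` is one sampling unit; the sample is a subset of the population by construction.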
Some Basic Concepts (contd.)
a) SAMPLING UNIT: The members that make up the sample, e.g., girls in the B-school in this case.
b) PARAMETER: A parameter is a descriptive measure of some characteristic of the population, e.g., the mean
height of students in a class.
c) FUNDAMENTALS OF MEASUREMENT
a. Construct: A wider term used to cover the broad concept and the underlying variables, e.g., firm
performance.
b. Variable: The characteristic on which individuals or objects differ among themselves is called a variable.
For example, to measure firm performance we generally use variables like RoA, ROCE, sales growth, etc.
Data, Frequency
• A collection of meaningful observations is called data.
• E.g., stock prices, heights of students, Twitter texts, etc.
• The frequency of a variable can be defined as the number of times an
observation occurs in a series of observations.
• For example, consider the series 5, 2, 5, 3, 4, 2, 3, 5, 2, 4, 2. Here, 2 occurs 4 times,
3 and 4 occur twice each, and 5 occurs thrice. Hence, the frequencies of 2,
3, 4 and 5 are 4, 2, 2 and 3, respectively.
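The frequency count above can be reproduced with a short sketch. It uses `collections.Counter` from the Python standard library; the variable names are illustrative only.

```python
from collections import Counter

# The series from the example above.
series = [5, 2, 5, 3, 4, 2, 3, 5, 2, 4, 2]

# Counter maps each distinct observation to the number of times it occurs.
frequencies = Counter(series)

for value in sorted(frequencies):
    print(value, frequencies[value])
# prints:
# 2 4
# 3 2
# 4 2
# 5 3
```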
Types of Data
• According to statistics
  • Quantitative
  • Qualitative
• According to decision making
  • Inferential data
• According to time
  • Time series data
  • Cross-sectional data
  • Longitudinal / panel data
    • Balanced panel
    • Unbalanced panel
• According to linearity
  • Linear data
  • Non-linear data
• According to parameters
  • Parametric
  • Non-parametric data
• According to features
  • Nominal
  • Ordinal
  • Interval
  • Ratio or continuous data
• According to interpretability
  • Structured
  • Unstructured
  • Semi-structured
  • Highly unstructured data
• According to classification
  • Binary classified data
  • Grouped data
• According to ML / DL algorithms
  • Supervised learning data
  • Unsupervised learning data
  • Reinforcement learning data
Measures of Central Tendency
• Mean
• Median
• Mode
• Trimmed Mean
• Rolling Mean ≈ Moving Average
• Outliers
• Outlier detection
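A minimal sketch of the measures listed above, using only the Python standard library. The data is the series from the frequency example. The helper names `trimmed_mean` and `rolling_mean`, the 10% trimming proportion, and the window of 3 are assumptions made for illustration, not definitions taken from these slides.

```python
import statistics

data = [5, 2, 5, 3, 4, 2, 3, 5, 2, 4, 2]

mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end of the sorted data."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each tail
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


def rolling_mean(values, window):
    """Mean of each run of `window` consecutive values, in the original order."""
    return [
        statistics.mean(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]


print(mean, median, mode)
print(trimmed_mean(data, 0.10))
print(rolling_mean(data, 3))
```

The trimmed mean discards the extremes before averaging, so it is less sensitive to outliers than the plain mean; the rolling mean depends on the order of the observations, which is why it is usually applied to time series data.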
Detecting Outliers
• Outliers are extreme data points which fall outside the normal minimum or maximum limits. There are
several approaches to detecting outliers:
a) Box Plots
b) Q-Q plot
c) P-P plot
d) Statistical measures
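The box-plot approach can be sketched numerically as follows. This assumes the common 1.5 × IQR rule for the fences, which the slides do not state explicitly; the `scores` data and the function names are hypothetical.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower fence, upper fence) under the assumed k * IQR rule."""
    # method="inclusive" treats the data as the full set of observations.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the values lying outside the fences."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical scores with one extreme value.
scores = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

print(iqr_fences(scores))
print(find_outliers(scores))  # → [95]
```

Different quartile conventions give slightly different fences, so borderline points may be flagged by one tool and not by another.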
Violin Plot
• Box plots can sometimes be misleading because they do not show the shape of the distribution.
• Hence, we suggest the violin plot.
• Now we can see directly whether the data lies outside the whiskers and the distribution, and hence
the potential outliers can be identified.
[Figure: violin plot annotated with Lower Fence, Q1, Me (median), Q3, Upper Fence, and the distribution of the
data at 95% C.I.]
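A minimal sketch comparing the two plots side by side, assuming matplotlib is installed. The `scores` data and the function name `draw_box_and_violin` are hypothetical; the box plot uses matplotlib's default whisker length of 1.5 × IQR.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical scores with one extreme value.
scores = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]


def draw_box_and_violin(values):
    """Draw a box plot and a violin plot of the same data on a shared y-axis."""
    fig, (ax_box, ax_violin) = plt.subplots(1, 2, sharey=True, figsize=(8, 4))

    # Box plot: shows quartiles, whiskers and fliers, but not the shape of the data.
    box = ax_box.boxplot([values])
    ax_box.set_title("Box plot")

    # Violin plot: adds a kernel density estimate of the distribution.
    violin = ax_violin.violinplot([values], showmedians=True)
    ax_violin.set_title("Violin plot")

    return fig, box, violin


fig, box, violin = draw_box_and_violin(scores)

# Points that matplotlib drew beyond the whiskers (potential outliers).
fliers = list(box["fliers"][0].get_ydata())
print(fliers)
```

Calling `fig.savefig(...)` with a file name of your choice writes the figure to disk for inspection.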