Data Analysis in Research: Descriptive Statistics & Normality

Ikbal Ahmed
PhD Researcher
Binary University, Malaysia

Data Analysis: What is Data?
In general, data is any set of characters that is gathered and translated for some
purpose, usually analysis. Without context, data means nothing to a human or a
computer. There are multiple types of data. Some of the more common types include
the following.
• Single character
• Boolean (true or false)
• Text (string)
• Number (integer or floating-point)
• Picture
• Sound
• Video
Research data can take many forms. It might be:
• documents, spreadsheets
• laboratory notebooks, field notebooks, diaries
• questionnaires, transcripts, codebooks
• audiotapes, videotapes
• photographs, films
• test responses
• slides, artefacts, specimens, samples
• collections of digital outputs
• database contents (video, audio, text, images)
• models, algorithms, scripts
• contents of an application (input, output,
logfiles for analysis software, simulation
software, schemas)
• methodologies and workflows
• Research data comes in many different formats and is gathered using a wide
variety of methodologies.
• Research data are collected and used in scholarship across all academic
disciplines. While they can consist of numbers in a spreadsheet, they also take
many other forms, including videos, images, artifacts, and diaries.
Whether it is a psychologist collecting survey data to better understand human
behavior, an artist using data to generate images and sounds, or an
anthropologist using audio files to document observations about different
cultures, scholarly research across all academic fields is increasingly data-driven.
• Research data can be read by researchers and used to support the research and to
test hypotheses.
• Data becomes information after analysis.
• Data types:
i. Primary
ii. Secondary

Sources of research data
Research data can be generated for different purposes and through different processes.
• Observational data is captured in real-time, and is usually irreplaceable, for
example sensor data, survey data, sample data, and neuro-images.
• Experimental data is captured from lab equipment. It is often reproducible, but this
can be expensive. Examples of experimental data are gene sequences,
chromatograms, and toroid magnetic field data.
• Simulation data is generated from test models where model and metadata are more
important than output data. For example, climate models and economic models.
• Derived or compiled data has been transformed from pre-existing data points. It is
reproducible if lost, but this would be expensive. Examples are data mining,
compiled databases, and 3D models.
• Reference or canonical data is a static or organic conglomeration or collection of
smaller (peer-reviewed) datasets, most probably published and curated. For example,
gene sequence databanks, chemical structures, or spatial data portals.

Primary Data Vs Secondary Data

Quantitative Data
Quantitative data answers key questions such as “how many”, “how much” and “how often”.
Quantitative data can be expressed as a number or can be quantified. Simply put, it can be
measured by numerical variables.
Examples of quantitative data:
• Scores on tests and exams, e.g. 85, 67, 90, etc.
• The weight of a person or a subject.
• Your shoe size.
• The temperature in a room.
There are two general types of quantitative data:
1. Discrete data
2. Continuous data.

Quantitative Data (Continued)
1. Discrete data: Discrete data is a count that involves only integers.
Examples of discrete data:
• The number of students in a class.
• The number of workers in a company.
• The number of home runs in a baseball game.
• The number of test questions you answered correctly
2. Continuous data: Continuous data is information that could be meaningfully divided into finer
levels. It can be measured on a scale or continuum and can have almost any numeric value.
Examples of continuous data:
• The amount of time required to complete a project.
• The height of children.
• The square footage of a two-bedroom house.
• The speed of cars.

Qualitative data
Qualitative data can’t be expressed as a number and can’t be measured numerically. It consists
of words, pictures, and symbols, not numbers. Qualitative data is also called categorical
data because the information can be sorted by category, not by number. Qualitative data can
answer questions such as “how did this happen?” and “why did this happen?”.
Examples of qualitative data:
• Colors, e.g. the color of the sea
• Your favorite holiday destination, such as Hawaii, New Zealand, etc.
• Names such as John, Patricia, etc.
• Ethnicity, such as American Indian, Asian, etc.
There are two general types of qualitative data:
1. Nominal data
2. Ordinal data

Qualitative Data (Continued)
1. Nominal data: Nominal data is used just for labeling variables, without any type of quantitative value. The
name ‘nominal’ comes from the Latin word “nomen” which means ‘name’.
Examples of Nominal Data:
• Gender (Women, Men)
• Hair color (Blonde, Brown, Brunette, Red, etc.)
• Marital status (Married, Single, Widowed)
• Ethnicity (Hispanic, Asian)
2. Ordinal data: Ordinal data is data placed into some kind of order by position on a scale, so it shows where
each value ranks. Ordinal data may indicate superiority. However, you cannot do meaningful arithmetic with
ordinal values because they only show sequence, not the size of the gap between ranks.
Examples of Ordinal Data:
• The first, second and third person in a competition.
• Letter grades: A, B, C, etc.
• When a company asks a customer to rate the sales experience on a scale of 1-10.
• Economic status: low, medium and high.

What is data analysis?
• According to LeCompte and Schensul, research data analysis is a process used by researchers for
reducing data to a story and interpreting it to derive insights. The data analysis process helps
reduce a large chunk of data into smaller fragments that make sense.
• Marshall and Rossman, on the other hand, describe data analysis as a messy, ambiguous, and
time-consuming, but also creative and fascinating, process through which a mass of collected data is
brought to order, structure and meaning.
Types of Data Analysis: Techniques and Methods
Several types of data analysis techniques exist, depending on the business and technology context. The
major types of data analysis are:
• Descriptive Analysis
• Diagnostic Analysis
• Predictive Analysis
• Prescriptive Analysis

Data Analysis Tools
Data Analysis Software:
1) Statistical Package for the Social Sciences (SPSS)
2) Stata
3) SmartPLS
4) EViews
5) Microsoft Excel
Data Analysis Programming Languages:

Qualitative Data Analysis
Qualitative data consists of words, pictures, symbols and observations. Qualitative analysis
refers to the procedures and processes used to analyze such data in order to provide some level
of understanding, explanation or interpretation. Unlike quantitative analysis, it does not rely
on statistical approaches to collect and analyze the data. There are a variety of
approaches to collecting this type of data and interpreting it. Some of the most commonly used
methods are:
1. Content Analysis: It is used to analyze verbal or behavioral data. This data can consist of
documents or communication artifacts like texts in various formats, pictures or audios/videos.
2. Narrative Analysis: This one is the most commonly used as it involves analyzing data that
comes from a variety of sources including field notes, surveys, diaries, interviews and other
written forms. It involves reformulating the stories given by people based on their experiences
and in different contexts.
3. Grounded Theory: This method involves the development of causal explanations of a single
phenomenon from the study of one or more cases. If further cases are studied, then the
explanations are altered until the researchers arrive at a statement that fits all of the cases.

Quantitative Data Analysis
The two most commonly used quantitative data analysis methods are:
1. Descriptive statistics
2. Inferential statistics
1) Descriptive Analysis: summarizes the complete data set or a sample of numerical
data. It reports the mean and standard deviation for continuous data, and the
percentage and frequency for categorical data.
2) Inferential Analysis: analyses a sample drawn from the complete data in order to
draw conclusions about the whole. Because the conclusions depend on the sample,
selecting different samples from the same data can lead to different conclusions.

Descriptive Statistics
Typically descriptive statistics (also known as descriptive analysis) is the first
level of analysis. This helps researchers find absolute numbers to summarize
individual variables and find patterns. A few commonly used descriptive
statistics are:
• Mean: numerical average of a set of values.
• Median: midpoint of a set of numerical values.
• Mode: most common value among a set of values.
• Percentage: used to express how a value or group of respondents within
the data relates to a larger group of respondents.
• Frequency: the number of times a value is found.
• Range: the difference between the highest and lowest values in a set of values.
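As a minimal sketch of the statistics listed above, the following Python snippet uses only the standard library. The exam scores and marital-status responses are made-up illustrative values, and `frequency_table` is a hypothetical helper name, not something taken from a statistics package.

```python
from collections import Counter
from statistics import mean, median, mode

# Hypothetical exam scores (illustrative values only).
scores = [85, 67, 90, 67, 72, 85, 67]

print("Mean:  ", round(mean(scores), 2))
print("Median:", median(scores))
print("Mode:  ", mode(scores))
print("Range: ", max(scores) - min(scores))

# Hypothetical categorical responses (illustrative values only).
status = ["Married", "Single", "Married", "Widowed", "Single", "Married"]


def frequency_table(values):
    """Return {category: (frequency, percentage)} for a list of values."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}


for category, (freq, pct) in frequency_table(status).items():
    print(f"{category:8s} frequency={freq} percentage={pct:.1f}%")
```

Mean, median and range suit the numeric scores, while frequency and percentage are the natural summaries for the categorical responses.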

Inferential Analysis
These complex analyses show the relationships between multiple variables
to generalize results and make predictions. A few examples are...
• Correlation: describes the relationship between 2 variables
• Regression: shows or predicts the relationship between 2 variables
• Analysis of variance: tests the extent to which 2+ groups differ

Descriptive statistics
Descriptive statistics can be useful for two purposes:
1) to provide basic information about variables in a dataset and
2) to highlight potential relationships between variables.
The most common descriptive statistics fall into four groups, and many of them can also
be displayed graphically or pictorially:
• Graphical/Pictorial Methods
• Measures of Central Tendency
• Measures of Dispersion
• Measures of Association

Graphical/Pictorial Methods
There are several graphical and pictorial methods that enhance researchers' understanding of individual variables and the
relationships between variables. Graphical and pictorial methods provide a visual representation of the data. Some of
these methods include:
• Histograms
• Scatter plots
• Bar charts
• Pie charts
• Line Graph
Histograms
• Each value (or interval of values) of a variable is displayed along the bottom of a histogram, and a bar is drawn for each one
• The height of the bar corresponds to the frequency with which that value occurs
Bar Charts
• The heights of bars in a bar chart represent the frequencies (or relative frequencies) in each group.
• Note that in a bar chart, there is a gap between each bar. This is unlike a histogram, where there are no gaps between the
bars, reflecting the continuous nature of the underlying variable.
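The difference between the two charts comes down to how the frequencies are counted. The sketch below, using made-up heights and hair colours, bins a continuous variable into adjacent intervals (histogram) and counts a categorical variable by label (bar chart). The function name `histogram_counts` and the bin width of 10 are illustrative assumptions.

```python
from collections import Counter


def histogram_counts(values, bin_width, start):
    """Count how many values fall in each interval [left, left + bin_width)."""
    counts = Counter()
    for v in values:
        left = start + bin_width * int((v - start) // bin_width)
        counts[left] += 1
    return dict(sorted(counts.items()))


# Hypothetical heights in cm (a continuous variable): histogram bins touch.
heights = [151.2, 158.9, 160.0, 163.4, 167.5, 168.1, 171.0, 174.8, 179.9]
for left, n in histogram_counts(heights, bin_width=10, start=150).items():
    print(f"[{left}, {left + 10}) {'#' * n}")

# Hypothetical hair colours (a categorical variable): bar chart groups are separate.
hair = ["Brown", "Blonde", "Brown", "Red", "Brown", "Blonde"]
for colour, n in Counter(hair).most_common():
    print(f"{colour:7s} {'#' * n}")
```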

Graphical/Pictorial Methods (Continued)
• Scatter plots
• Display the relationship between two quantitative or
numeric variables by plotting one variable against the
value of another variable
• For example, one axis of a scatter plot could represent
height and the other could represent weight. Each
person in the data would receive one data point on the
scatter plot that corresponds to his or her height and
weight
• Pie Chart
• In a pie chart, a circle is divided into slices, such that
each slice represents a different category and the size of
the slice is proportional to the relative frequency of that
category. Conventionally, the categories in a pie chart
are ordered clockwise from the largest slice to the
smallest, starting at the 12 o’clock position.

Graphical/Pictorial Methods (Continued)
Line graph
A line graph, also known as a line chart, is a type of chart used to
visualize the value of something over time. For example, a finance
department may plot the change in the amount of cash the
company has on hand over time.
The line graph consists of a horizontal x-axis and a vertical y-axis.
Most line graphs only deal with positive number values, so these
axes typically intersect near the bottom of the y-axis and the left
end of the x-axis. The point at which the axes intersect is usually
(0, 0). Each axis is labeled with a data type. For example, the
x-axis could be days, weeks, quarters, or years, while the y-axis
shows revenue in dollars.
The x-axis is also called the independent axis because its values
do not depend on anything. For example, time is conventionally placed
on the x-axis since it continues to move forward regardless of
anything else. The y-axis is also called the dependent axis because
its values depend on those of the x-axis: at this time, the company
had this much money. The result is that the line of the graph
progresses from left to right and each x value has only
one y value (the company cannot have two amounts of money
at the same time).

Measures of Central Tendency
Measures of central tendency are the most basic and, often, the most informative description of a
population's characteristics. There are three measures of central tendency:
• Mean -- the sum of a variable's values divided by the total number of values
• Median -- the middle value of a variable when its values are sorted
• Mode -- the value that occurs most often
Example:
The incomes of five randomly selected people in the United States are
$10,000, $10,000, $45,000, $60,000, and $1,000,000.
Mean Income = (10,000 + 10,000 + 45,000 + 60,000 + 1,000,000) / 5 = $225,000
Median Income = $45,000
Modal Income = $10,000
The mean is the most commonly used measure of central tendency. Medians are generally used when a few
values are extremely different from the rest of the values (this is called a skewed distribution).
For example, the median income is often the best measure of typical income because, while most
individuals earn between $0 and $200,000, a handful of individuals earn millions.
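The income example can be checked with Python's standard `statistics` module. As an added illustration (not part of the original example), the sketch also drops the $1,000,000 income to show how much more the mean moves than the median when one extreme value is removed.

```python
from statistics import mean, median, mode

# The five incomes from the example.
incomes = [10_000, 10_000, 45_000, 60_000, 1_000_000]

print("Mean income:  ", mean(incomes))
print("Median income:", median(incomes))
print("Modal income: ", mode(incomes))

# Illustration: remove the extreme value and compare again.
without_extreme = incomes[:-1]
print("Mean without the extreme value:  ", mean(without_extreme))
print("Median without the extreme value:", median(without_extreme))
```

The mean falls from 225,000 to 31,250, while the median only moves from 45,000 to 27,500, which is why the median is preferred for skewed data.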

Measures of Dispersion
Measures of dispersion provide information about
the spread of a variable's values. Four key
measures are covered here:
• Range
• Variance
• Standard Deviation
• Skew
Range is simply the difference between the smallest
and largest values in the data. The interquartile range
is the difference between the values at the
75th percentile and the 25th percentile of the data.
Variance is the most commonly used measure of
dispersion. It is calculated by taking the average of
the squared differences between each value and the
mean. (When the data are a sample rather than the whole
population, the sum of squared differences is usually
divided by n − 1 instead of n.)

Measures of Dispersion (Continued)
Standard deviation, another commonly used statistic,
is the square root of the variance.
Skewness: The term ‘skewness’ refers to a lack of
symmetry of the data about the mean. Deviations from
the mean are greater on one side than the other, i.e.
the distribution has one tail longer or heavier than the
other. Skewness is used to indicate the shape of the
distribution of data.
In a skewed distribution, the curve is extended to either
the left or the right side. When the plot extends more
towards the right side, it denotes positive skewness,
wherein typically mode < median < mean. On the other
hand, when the plot is stretched more towards the left,
it is called negative skewness, and typically
mean < median < mode.

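Measures of Dispersion (Continued)
Example:
The incomes of five randomly selected people in the United States are $10,000,
$10,000, $45,000, $60,000, and $1,000,000:
Range = 1,000,000 − 10,000
= 990,000
Variance = [(10,000 − 225,000)² + (10,000 − 225,000)² + (45,000 − 225,000)² + (60,000 − 225,000)² + (1,000,000 − 225,000)²] / 5
= 752,700,000,000 / 5
= 150,540,000,000
Standard Deviation = √150,540,000,000
≈ 387,995
Skew = Income is positively skewed
Note: dividing by 5 treats the five incomes as the whole population, matching the definition of variance as the
average squared difference. Dividing by n − 1 = 4 instead gives the sample variance, 188,175,000,000
(standard deviation ≈ 433,791).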
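The dispersion figures for the income example can be reproduced with the standard `statistics` module, where `pvariance` and `pstdev` divide by n and `variance` and `stdev` divide by n − 1. The final check uses the mode < median < mean ordering described earlier as a rough indicator of positive skew. It is a sketch, not a formal skewness test.

```python
from statistics import mean, median, mode, pstdev, pvariance, stdev, variance

# The five incomes from the example.
incomes = [10_000, 10_000, 45_000, 60_000, 1_000_000]

value_range = max(incomes) - min(incomes)
pop_var = pvariance(incomes)   # average squared deviation (divides by n)
samp_var = variance(incomes)   # sample variance (divides by n - 1)

print("Range:", value_range)
print("Population variance:", pop_var, "-> SD:", round(pstdev(incomes)))
print("Sample variance:    ", samp_var, "-> SD:", round(stdev(incomes)))

# Rough indicator of positive skew: mode < median < mean.
positively_skewed = mode(incomes) < median(incomes) < mean(incomes)
print("Positively skewed (by ordering)?", positively_skewed)
```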

Measures of Association
Measures of association indicate whether two variables are related. Two measures are commonly used:
• Chi-square
• Correlation
Chi-Square
• As a measure of association between variables, chi-square tests are used on nominal data (i.e., data that are put into
classes: e.g., gender [male, female] and type of job [unskilled, semi-skilled, skilled]) to determine whether they are
associated.
• A chi-square result is called significant if the test gives evidence of an association between the two variables
(conventionally, a p-value below 0.05), and non-significant if it does not.
Example: In the sample dataset, respondents were asked their gender and whether or not they were a cigarette smoker.
There were three answer choices: Nonsmoker, Past smoker, and Current smoker. To test for an association between smoking
behavior (nonsmoker, current smoker, or past smoker) and gender (male or female), we cross-tabulate the two variables and
run a Chi-Square Test of Independence, which compares the observed count in each cell with the count expected if the
variables were unrelated.
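The gender-by-smoking test can be sketched in plain Python as follows. The counts in `observed` are invented for illustration, because the sample dataset itself is not reproduced here, and `chi_square_independence` is a hypothetical helper name. For a 2 × 3 table the test has 2 degrees of freedom, and in that special case the chi-square upper-tail probability is exactly exp(−χ²/2), so no statistics library is needed.

```python
from math import exp


def chi_square_independence(observed):
    """Return (chi2, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


# Hypothetical counts. Rows: male, female. Columns: nonsmoker, past, current.
observed = [[60, 25, 15],
            [70, 20, 10]]

chi2, df = chi_square_independence(observed)
# Valid only because df == 2: the upper-tail probability is exp(-chi2 / 2).
p_value = exp(-chi2 / 2)

print(f"chi-square = {chi2:.3f}, df = {df}, p = {p_value:.3f}")
print("Significant at 0.05?", p_value < 0.05)
```

With these invented counts the p-value is well above 0.05, so the result would be reported as non-significant. For tables of other sizes the p-value would have to come from a chi-square table or a statistics package.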

Measures of Association (Continued)
Correlation
• A correlation coefficient is used to measure the strength and direction of the linear relationship between two
numeric variables (e.g., weight and height)
• The most common correlation coefficient is Pearson's r, which can range from -1 to +1.
• If the coefficient is between 0 and 1, as one variable increases, the other tends to increase as well. This is called
a positive correlation. For example, height and weight are positively correlated because taller people usually weigh more.
• If the correlation coefficient is between -1 and 0, as one variable increases the other tends to decrease. This is
called a negative correlation. For example, age and hours slept per night are negatively correlated because older people
usually sleep fewer hours per night
• A coefficient close to 0 indicates little or no linear relationship.
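Pearson's r can be computed directly from its definition: the sum of the products of the deviations, divided by the square root of the product of the two sums of squared deviations. The heights, weights, ages and sleep hours below are invented values chosen only to show one positive and one negative correlation.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data (illustrative values only).
height_cm = [150, 160, 165, 172, 180]
weight_kg = [52, 58, 63, 70, 79]
age_years = [20, 35, 50, 65, 80]
sleep_hours = [8.5, 7.8, 7.2, 6.9, 6.1]

print("height vs weight:", round(pearson_r(height_cm, weight_kg), 3))
print("age vs sleep:    ", round(pearson_r(age_years, sleep_hours), 3))
```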

Tools for Assessing Normality
A primary use of descriptive statistics is to check whether the data are normally
distributed. If the variable is normally distributed, you can use parametric statistics that
are based on this assumption. If the variable is not normally distributed, you might try
a transformation on the variable (such as the natural log or square root) to bring the
data closer to normal. There are several methods of assessing whether data are normally
distributed or not. They fall into two broad categories: graphical and statistical.
Some common techniques are:
• Graphical
i. Skewness & Kurtosis
ii. P-P Plot
iii. Q-Q Plot
iv. Boxplot
v. Histogram
vi. Normal Quantile Plot (also called
Normal Probability Plot)
• Statistical
i. Shapiro-Wilk Test
ii. Kolmogorov-Smirnov Test
iii. Jarque-Bera Test
iv. Anderson-Darling Test

Measure of Skewness
1. Symmetrical distribution: a frequency curve (Fig. (a)) for which
mean = median = mode
2. Asymmetrical distribution: for this type of frequency distribution the mean, median and mode do
not have the same value (Fig. (b) and (c)). Skewness is positive or negative depending on the location
of the mode with respect to the mean.
i. Positively skewed distribution: In a positively skewed distribution there is a long tail on the
right and the mean is on the right of the mode (Fig. (b)).
ii. Negatively skewed distribution: In a negatively skewed distribution there is a long tail on the
left and the mean is on the left of the mode (Fig. (c)).
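Skewness can also be expressed as a number. One common definition, assumed here, is the moment-based coefficient m3 / m2^1.5, where m2 and m3 are the average squared and cubed deviations from the mean: it is 0 for a symmetric data set and positive when the long tail is on the right. The sketch applies it to the five incomes used earlier and to their natural logs, to illustrate how a log transformation can pull a right-skewed variable closer to symmetry.

```python
from math import log
from statistics import mean


def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (0 for a symmetric data set)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
incomes = [10_000, 10_000, 45_000, 60_000, 1_000_000]
log_incomes = [log(v) for v in incomes]

print("Symmetric data:", round(skewness(symmetric), 2))
print("Raw incomes:   ", round(skewness(incomes), 2))
print("Log incomes:   ", round(skewness(log_incomes), 2))
```

The log-transformed incomes are still right-skewed, but noticeably less so than the raw values.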

Measure of Kurtosis
Kurtosis is a measure of the relative sharpness of the peak of the probability distribution
curve. It is used to indicate the flatness or peakedness of the frequency distribution curve and
reflects how heavy the tails of the distribution are, i.e. how prone it is to outliers.
The normal distribution has a kurtosis of 3. Kurtosis above 3 (positive excess kurtosis) indicates that
the distribution is more peaked than the normal distribution, whereas kurtosis below 3 (negative excess
kurtosis) shows that the distribution is less peaked than the normal distribution.
There are three types of distributions:
1. Mesokurtic (Kurtosis = 3): This distribution has a kurtosis statistic
similar to that of the normal distribution. Its extreme values behave
like those of a normal distribution.
2. Leptokurtic (Kurtosis > 3): The tails are longer and fatter, and the peak
is higher and sharper than in a mesokurtic distribution, which means that
the data are heavy-tailed or have a profusion of outliers. Outliers stretch
the horizontal axis of the histogram, which makes the bulk of the data
appear in a narrow (“skinny”) vertical range.
3. Platykurtic (Kurtosis < 3): The tails are shorter and thinner than those of
the normal distribution, and the peak is lower and broader than in a
mesokurtic distribution, which means that the data are light-tailed or lack
outliers. Extreme values are less frequent than under the normal
distribution.
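A minimal sketch of the kurtosis statistic, assuming the moment-based definition m4 / m2² (the convention under which a normal distribution scores 3). Some software reports excess kurtosis instead, which is this value minus 3. Both data sets are invented: one is spread evenly, the other is tightly clustered with two extreme values.

```python
from statistics import mean


def kurtosis(values):
    """Moment-based kurtosis m4 / m2**2 (a normal distribution gives 3)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m4 / m2 ** 2


# Hypothetical data (illustrative values only).
evenly_spread = list(range(1, 11))                         # 1, 2, ..., 10
with_outliers = [48, 49, 50, 50, 50, 50, 51, 52, 20, 80]   # two extreme values

for name, data in [("evenly spread", evenly_spread), ("with outliers", with_outliers)]:
    k = kurtosis(data)
    label = "leptokurtic" if k > 3 else "platykurtic" if k < 3 else "mesokurtic"
    print(f"{name}: kurtosis = {k:.2f}, excess = {k - 3:.2f} ({label})")
```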

Q-Q Plot & P-P Plot
The P-P (Probability-Probability) plot and the Q-Q (Quantile-Quantile) plot are called probability plots. A probability plot helps us compare two
data sets in terms of distribution. Generally one set is theoretical and one set is empirical. The two types of probability plots are
• Q-Q plot (more common)
• P-P plot
• In Fig-A, the "blue line" looks like the normal distribution of some data, and the "yellow line" represents the distribution of the same data in
cumulative form. A Q-Q plot plots the quantiles of one distribution (the data values at which the cumulative curve reaches given proportions)
against the same quantiles of the other. A P-P plot plots the cumulative probabilities of the two distributions (similar to the yellow line)
against each other.
• For example, a Q-Q plot is used to check whether a given data set is normally distributed by plotting its quantiles against those of a normal
distribution. If the data are normally distributed, the result is approximately a straight line with positive slope, as in Fig-B. Similarly, a P-P plot
shows how well a theoretical distribution fits the given data (observed distribution). The theoretical distribution can be normal, lognormal,
exponential, beta, gamma, etc. In both plots, if plotting theoretical against observed values gives a straight line, it indicates a good match
between the two distributions. A Q-Q plot is very similar to the P-P plot except that it compares quantiles (values that split a data set into equal
portions) rather than cumulative probabilities. Moreover, Q-Q plots are easier to interpret for large sample sizes.
Fig-A (Normal Distribution) Fig-B (Q-Q Plot) Fig-C (P-P Plot)
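The points behind both plots can be computed with the standard library's `NormalDist`. This is a sketch under two assumptions that statistical packages may handle differently: the reference normal distribution is fitted with the sample mean and standard deviation, and the plotting positions are (i − 0.5) / n. The sample values are invented.

```python
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Pairs (theoretical normal quantile, observed value) for a normal Q-Q plot."""
    data = sorted(sample)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    return [(fitted.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(data, start=1)]


def pp_points(sample):
    """Pairs (theoretical cumulative probability, empirical cumulative probability)."""
    data = sorted(sample)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    return [(fitted.cdf(x), (i - 0.5) / n) for i, x in enumerate(data, start=1)]


# Hypothetical measurements (illustrative values only).
sample = [4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.0, 6.3, 6.8, 7.4]

print("Q-Q points (theoretical quantile, observed value):")
for theoretical, observed in qq_points(sample):
    print(f"  {theoretical:6.2f}  {observed:6.2f}")

print("P-P points (theoretical probability, empirical probability):")
for theoretical, empirical in pp_points(sample):
    print(f"  {theoretical:6.3f}  {empirical:6.3f}")
```

If the data are close to normal, each pair contains two similar numbers, which is what places the points near the diagonal line when they are plotted.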

Boxplot, Normal Probability Plot & Histogram
• Boxplot: Descriptive statistics use numbers to describe how data values are alike and how
they differ. The box plot is a standardized way of displaying the distribution of
data based on the five-number summary: minimum, first quartile, median, third quartile, and
maximum. In the simplest box plot the central rectangle spans the first quartile to the third
quartile (the interquartile range or IQR). A roughly symmetric box with few outlying points is
consistent with normality.
• Normal Probability Plot: The normal probability plot was designed specifically to test
the assumption of normality. If your data come from a normal distribution, the points on the
graph will form an approximately straight line.
• Histogram: The popular histogram can give you a good idea about whether your data meet
the assumption. If your data look like a bell curve, they are probably normal.
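The five-number summary behind a box plot can be obtained with `statistics.quantiles`. Two choices in this sketch are assumptions rather than anything stated above: the `"inclusive"` quartile method (packages differ in how they interpolate quartiles) and the common convention of flagging points more than 1.5 × IQR beyond the quartiles as outliers. The data are the five incomes used in the earlier examples.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


incomes = [10_000, 10_000, 45_000, 60_000, 1_000_000]

low, q1, med, q3, high = five_number_summary(incomes)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the quartiles are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in incomes if v < lower_fence or v > upper_fence]

print("Min, Q1, median, Q3, max:", low, q1, med, q3, high)
print("IQR:", iqr)
print("Outliers:", outliers)
```

Here the box is short and sits at the bottom of the range, while the $1,000,000 income is flagged as an outlier far above it, a pattern that argues against normality.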

Kolmogorov-Smirnov Test & Shapiro-Wilk Test
Kolmogorov-Smirnov Goodness of Fit Test (K-S test): This test compares your data with a known distribution and
lets you know if they have the same distribution. Although the test is nonparametric (it doesn’t assume any
particular underlying distribution), it is commonly used as a test for normality to see if your data is normally
distributed. It’s also used to check the assumption of normality in Analysis of Variance.
Shapiro-Wilk test: This test is a way to tell if a random sample comes from a normal distribution. The test gives you
a W value; small values indicate your sample is not normally distributed.
If the sig. value (p-value) of the test is greater than 0.05, the data do not differ significantly from a normal distribution
and can be treated as normally distributed. If it is below 0.05, the distribution is significantly different from a normal
distribution, so the data are treated as NOT normally distributed.
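The 0.05 decision rule can be illustrated without a statistics package. Exact p-values for the Shapiro-Wilk and Kolmogorov-Smirnov tests require tables or specialised software, so this sketch swaps in the Jarque-Bera test from the list of statistical tests given earlier. Its statistic, JB = n/6 × (S² + (K − 3)²/4), is built from the sample skewness S and kurtosis K and is approximately chi-square distributed with 2 degrees of freedom, which gives a p-value of exp(−JB/2). That approximation is only reliable for large samples, so the small made-up samples below are purely illustrative.

```python
from math import exp
from statistics import mean


def jarque_bera(values):
    """Return (JB statistic, approximate p-value) for the Jarque-Bera test."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # Chi-square with 2 degrees of freedom: upper-tail probability is exp(-jb / 2).
    return jb, exp(-jb / 2)


# Hypothetical samples (illustrative values only).
bell_shaped = [2, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8]
right_skewed = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12, 20, 35, 60]

for name, data in [("bell-shaped", bell_shaped), ("right-skewed", right_skewed)]:
    jb, p = jarque_bera(data)
    verdict = "treated as normal" if p > 0.05 else "NOT normally distributed"
    print(f"{name}: JB = {jb:.3f}, p = {p:.4f} -> {verdict}")
```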
