Data Analysis in Research:
Descriptive Statistics & Normality
Ikbal Ahmed
PhD Researcher
Binary University, Malaysia
What is Data?
Data Analysis: What is Data?
In general, data is any set of characters that is gathered and translated for some
purpose, usually analysis. Data that is not put into context means nothing to a
human or a computer. There are multiple types of data. Some of the more common
types include the following.
• Single character
• Boolean (true or false)
• Text (string)
• Number (integer or floating-point)
• Picture
• Sound
• Video
Research data can take many forms. It might be:
• documents, spreadsheets
• laboratory notebooks, field notebooks, diaries
• questionnaires, transcripts, codebooks
• audiotapes, videotapes
• photographs, films
• test responses
• slides, artefacts, specimens, samples
• collections of digital outputs
• database contents (video, audio, text, images)
• models, algorithms, scripts
• contents of an application (input, output,
logfiles for analysis software, simulation
software, schemas)
• methodologies and workflows
• Research data comes in many different formats and is gathered using a wide
variety of methodologies.
• Research data is collected and used in scholarship across all academic
disciplines and, while it can consist of numbers in a spreadsheet, it also takes
many different formats, including videos, images, artifacts, and diaries.
Whether it is a psychologist collecting survey data to better understand human
behavior, an artist using data to generate images and sounds, or an
anthropologist using audio files to document observations about different
cultures, scholarly research across all academic fields is increasingly data-driven.
• Researchers read and use data to support their research and to test
hypotheses
• Data becomes information after analysis
• Data types:
i. Primary
ii. Secondary
Sources of research data
Research data can be generated for different purposes and through different processes.
• Observational data is captured in real-time, and is usually irreplaceable, for
example sensor data, survey data, sample data, and neuro-images.
• Experimental data is captured from lab equipment. It is often reproducible, but this
can be expensive. Examples of experimental data are gene sequences,
chromatograms, and toroid magnetic field data.
• Simulation data is generated from test models where model and metadata are more
important than output data. For example, climate models and economic models.
• Derived or compiled data has been transformed from pre-existing data points. It is
reproducible if lost, but this would be expensive. Examples are data mining,
compiled databases, and 3D models.
• Reference or canonical data is a static or organic conglomeration or collection of
smaller (peer-reviewed) datasets, most probably published and curated. For example,
gene sequence databanks, chemical structures, or spatial data portals.
Primary Data Vs Secondary Data
Types of data
Quantitative Data
Quantitative data is perhaps the easiest to explain. It answers key questions such as “how
many”, “how much” and “how often”. Quantitative data can be expressed as a number or can be
quantified. Simply put, it can be measured by numerical variables.
Examples of quantitative data:
• Scores on tests and exams, e.g. 85, 67, 90, etc.
• The weight of a person or a subject.
• Your shoe size.
• The temperature in a room.
There are two general types of quantitative data:
1. Discrete data
2. Continuous data.
Quantitative Data (Continued)
1. Discrete data: Discrete data is a count that involves only integers.
Examples of discrete data:
• The number of students in a class.
• The number of workers in a company.
• The number of home runs in a baseball game.
• The number of test questions you answered correctly
2. Continuous data: Continuous data is information that could be meaningfully divided into finer
levels. It can be measured on a scale or continuum and can have almost any numeric value.
Examples of continuous data:
• The amount of time required to complete a project.
• The height of children.
• The square footage of a two-bedroom house.
• The speed of cars.
Qualitative data
Qualitative data can’t be expressed as a number and can’t be measured. Qualitative data consist
of words, pictures, and symbols, not numbers. Qualitative data is also called categorical
data because the information can be sorted by category, not by number. Qualitative data can
answer questions such as “how has this happened” and “why has this happened”.
Examples of qualitative data:
• Colors, e.g. the color of the sea
• Your favorite holiday destination, such as Hawaii, New Zealand, etc.
• Names such as John, Patricia, etc.
• Ethnicity such as American Indian, Asian, etc.
There are two general types of qualitative data:
1. Nominal data
2. Ordinal data
Qualitative Data (Continued)
1. Nominal data: Nominal data is used just for labeling variables, without any type of quantitative value. The
name ‘nominal’ comes from the Latin word “nomen” which means ‘name’.
Examples of Nominal Data:
• Gender (Women, Men)
• Hair color (Blonde, Brown, Brunette, Red, etc.)
• Marital status (Married, Single, Widowed)
• Ethnicity (Hispanic, Asian)
2. Ordinal data: Ordinal data are categories that have a meaningful order. The values are ranked by their
position on a scale, so one value can be said to be higher or better than another. However, the gaps between
ranks are not necessarily equal, so arithmetic such as averaging is generally not meaningful; ordinal values only show sequence.
Examples of Ordinal Data:
• The first, second and third person in a competition.
• Letter grades: A, B, C, etc.
• When a company asks a customer to rate the sales experience on a scale of 1-10.
• Economic status: low, medium and high.
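The practical difference between the two types can be sketched in a few lines of Python: nominal values can only be counted, while ordinal values can also be ranked. The responses and the `LEVELS` ordering below are made up for illustration and are not from the slides.

```python
from collections import Counter

# Hypothetical survey responses (illustrative values only).
hair_color = ["Brown", "Blonde", "Brown", "Red", "Brown"]      # nominal
economic_status = ["low", "high", "medium", "low", "medium"]   # ordinal

# Nominal data: counting categories (and finding the mode) is meaningful.
color_counts = Counter(hair_color)
most_common_color = color_counts.most_common(1)[0][0]

# Ordinal data: categories have an order, so ranking and the median make sense.
LEVELS = ["low", "medium", "high"]            # assumed ordering of the scale
ranks = sorted(LEVELS.index(s) for s in economic_status)
median_status = LEVELS[ranks[len(ranks) // 2]]  # middle item of an odd-length list

print(most_common_color)  # Brown
print(median_status)      # medium
```

Note that no mean is computed for either variable: averaging category labels, or ranks with unequal gaps, would not be meaningful.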
What is data analysis?
• According to LeCompte and Schensul, research data analysis is a process used by researchers for
reducing data to a story and interpreting it to derive insights. The data analysis process helps in
reducing a large chunk of data into smaller fragments that make sense.
• Marshall and Rossman, on the other hand, describe data analysis as a messy, ambiguous, and
time-consuming, but also creative and fascinating, process through which a mass of collected data is
brought to order, structure and meaning.
• Types of Data Analysis: Techniques and Methods
Several types of data analysis techniques exist, depending on the business need and the technology used. The
major types of data analysis are:
• Descriptive Analysis
• Diagnostic Analysis
• Predictive Analysis
• Prescriptive Analysis
Data Analysis Tools
Data Analysis Software:
1) Statistical Package for the Social Sciences (SPSS)
2) Stata
3) SmartPLS
4) EViews
5) Microsoft Excel
Data Analysis Programming Languages:
Qualitative Data Analysis
The data obtained through this method consists of words, pictures, symbols and observations. This
type of analysis refers to the procedures and processes that are utilized for the analysis of data to
provide some level of understanding, explanation or interpretation. Unlike quantitative
analysis, no statistical approaches are used to collect and analyze this data. There are a variety of
approaches to collecting this type of data and interpreting it. Some of the most commonly used
methods are:
1. Content Analysis: It is used to analyze verbal or behavioral data. This data can consist of
documents or communication artifacts like texts in various formats, pictures or audios/videos.
2. Narrative Analysis: This is the most commonly used method. It involves analyzing data that
comes from a variety of sources, including field notes, surveys, diaries, interviews and other
written forms. It involves reformulating the stories people tell, based on their experiences
and the different contexts in which they are told.
3. Grounded Theory: This method involves the development of causal explanations of a single
phenomenon from the study of one or more cases. If further cases are studied, then the
explanations are altered until the researchers arrive at a statement that fits all of the cases.
Quantitative Data Analysis
The two most commonly used quantitative data analysis methods are:
1. Descriptive statistics
2. Inferential statistics
1) Descriptive Analysis: summarizes the complete data or a sample of numerical
data. It reports the mean and standard deviation for continuous data, and percentages
and frequencies for categorical data.
2) Inferential Analysis: analyses a sample drawn from the complete data in order to
draw conclusions about the whole. Because different samples from the same data can
lead to different conclusions, the results always carry some uncertainty.
Descriptive Statistics
Typically descriptive statistics (also known as descriptive analysis) is the first
level of analysis. This helps researchers find absolute numbers to summarize
individual variables and find patterns. A few commonly used descriptive
statistics are:
• Mean: numerical average of a set of values.
• Median: midpoint of a set of numerical values.
• Mode: most common value among a set of values.
• Percentage: used to express how a value or group of respondents within
the data relates to a larger group of respondents.
• Frequency: the number of times a value is found.
• Range: the difference between the highest and lowest value in a set of values.
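Frequency, percentage and range from the list above can be sketched with Python's standard library. The scores and answers here are invented for illustration.

```python
from collections import Counter

# Hypothetical test scores and survey answers (illustrative only).
scores = [85, 67, 90, 67, 72]
answers = ["yes", "no", "yes", "yes", "no"]

# Frequency: how many times each value occurs.
frequency = Counter(answers)

# Percentage: each group's share of all respondents.
percentage = {k: 100 * v / len(answers) for k, v in frequency.items()}

# Range: difference between the highest and lowest value.
score_range = max(scores) - min(scores)

print(frequency["yes"], percentage["yes"], score_range)  # 3 60.0 23
```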
Inferential Analysis
These more complex analyses examine the relationships between multiple variables
in order to generalize results and make predictions. A few examples are:
• Correlation: describes the strength and direction of the relationship between two variables
• Regression: models how one variable changes with another, which allows prediction
• Analysis of variance: tests whether the means of two or more groups differ
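As a rough illustration of regression used for prediction, here is a minimal least-squares fit of a straight line in plain Python. The data and the names `hours_studied`, `exam_score` and `predict` are hypothetical; real analyses would normally use a statistics package and also report uncertainty.

```python
from statistics import mean

# Hypothetical paired observations (illustrative only).
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 55, 61, 64, 68]

x_bar, y_bar = mean(hours_studied), mean(exam_score)

# Least-squares slope: co-variation of x and y divided by variation of x.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours_studied, exam_score))
sxx = sum((x - x_bar) ** 2 for x in hours_studied)
slope = sxy / sxx
intercept = y_bar - slope * x_bar


def predict(hours):
    """Predicted score for a given number of study hours."""
    return intercept + slope * hours


print(round(slope, 2), round(intercept, 2))  # 4.1 47.7
print(round(predict(6), 1))                  # 72.3
```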
Descriptive statistics
Descriptive statistics can be useful for two purposes:
1) to provide basic information about variables in a dataset and
2) to highlight potential relationships between variables.
Descriptive statistics can be displayed graphically or pictorially, or summarized numerically.
The four most common kinds are:
• Graphical/Pictorial Methods
• Measures of Central Tendency
• Measures of Dispersion
• Measures of Association
Graphical/Pictorial Methods
There are several graphical and pictorial methods that enhance researchers' understanding of individual variables and the
relationships between variables. Graphical and pictorial methods provide a visual representation of the data. Some of
these methods include:
• Histograms
• Scatter plots
• Bar charts
• Pie charts
• Line Graph
• Histograms
• The values of a variable, usually grouped into intervals (bins), are displayed along the bottom of a histogram, and a bar is drawn for each interval
• The height of the bar corresponds to the frequency with which values in that interval occur
• Bar Charts
• The heights of bars in a bar chart represent the frequencies (or relative frequencies) in each group.
Note that in a bar chart, there is a gap between each bar. This is unlike a histogram, where there are no gaps between the
bars, reflecting the continuous nature of the underlying variable.
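The binning behind a histogram can be sketched without any plotting library. The heights, the bin width and the starting edge below are assumptions chosen for illustration.

```python
# Hypothetical heights in cm (illustrative only).
heights = [150, 152, 158, 161, 163, 165, 166, 171, 174, 182]

bin_width = 10
start = 150  # left edge of the first bin (assumed)

# Count how many values fall in each interval [edge, edge + bin_width).
counts = {}
for h in heights:
    edge = start + ((h - start) // bin_width) * bin_width
    counts[edge] = counts.get(edge, 0) + 1

# Text histogram: one row per bin, bar length equals the frequency.
for edge in sorted(counts):
    print(f"{edge}-{edge + bin_width - 1}: {'#' * counts[edge]}")
```

Because the bins are adjacent intervals of a continuous variable, the rows have no gaps between them, which is the distinction from a bar chart noted above.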
Graphical/Pictorial Methods (Continued)
• Scatter plots
• Display the relationship between two quantitative or
numeric variables by plotting one variable against the
value of another variable
• For example, one axis of a scatter plot could represent
height and the other could represent weight. Each
person in the data would receive one data point on the
scatter plot that corresponds to his or her height and
weight
• Pie Chart
• In a pie chart, a circle is divided into slices, such that
each slice represents a different category and the size of
the slice is proportional to the relative frequency of that
category. Conventionally, the categories in a pie chart
are ordered clockwise from the largest slice to the
smallest, starting at the 12 o’clock position.
Graphical/Pictorial Methods (Continued)
• Line graph
A line graph, also known as a line chart, is a type of chart used to
visualize the value of something over time. For example, a finance
department may plot the change in the amount of cash the
company has on hand over time.
The line graph consists of a horizontal x-axis and a vertical y-axis.
Most line graphs only deal with positive number values, so these
axes typically intersect near the bottom of the y-axis and the left
end of the x-axis. The point at which the axes intersect is usually
(0, 0). Each axis is labeled with a data type. For example, the
x-axis could be days, weeks, quarters, or years, while the y-axis
shows revenue in dollars.
The x-axis is also called the independent axis because its values
do not depend on anything. For example, time is conventionally placed on
the x-axis since it continues to move forward regardless of
anything else. The y-axis is also called the dependent axis because
its values depend on those of the x-axis: at this time, the company
had this much money. The result is that the line of the graph
always progresses from left to right and each x value has only
one y value (the company cannot have two amounts of money
at the same time).
Measures of Central Tendency
Measures of central tendency are the most basic and, often, the most informative description of a
population's characteristics. There are three measures of central tendency:
• Mean -- the sum of a variable's values divided by the total number of values
• Median -- the middle value of a variable when its values are sorted
• Mode -- the value that occurs most often
Example:
The incomes of five randomly selected people in the United States are
$10,000, $10,000, $45,000, $60,000, and $1,000,000.
$$\text{Mean Income} = \frac{10{,}000 + 10{,}000 + 45{,}000 + 60{,}000 + 1{,}000{,}000}{5} = \$225{,}000$$
Median Income = $45,000
Modal Income = $10,000
The mean is the most commonly used measure of central tendency. Medians are generally used when a few
values are extremely different from the rest of the values (this is called a skewed distribution).
For example, the median income is often the best measure of typical income because, while most
individuals earn between $0 and $200,000, a handful of individuals earn millions.
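The income example can be reproduced with Python's standard `statistics` module. This is only a quick check of the three figures, using the five incomes from the slide.

```python
from statistics import mean, median, mode

# The five incomes from the example.
incomes = [10_000, 10_000, 45_000, 60_000, 1_000_000]

print(mean(incomes))    # 225000
print(median(incomes))  # 45000
print(mode(incomes))    # 10000
```

The single income of $1,000,000 pulls the mean far above what four of the five people earn, while the median is unaffected by it.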
Measures of Dispersion
Measures of dispersion provide information about
the spread of a variable's values. There are four key
measures of dispersion:
• Range
• Variance
• Standard Deviation
• Skew
Range is simply the difference between the smallest
and largest values in the data. The interquartile range
is the difference between the values at the
75th percentile and the 25th percentile of the data.
Variance is the most commonly used measure of
dispersion. It is calculated by taking the average of
the squared differences between each value and the
mean. (When the data are a sample, the sum of squared
differences is often divided by n - 1 instead of n.)
Measures of Dispersion (Continued)
Standard deviation, another commonly used statistic,
is the square root of the variance.
Skewness: the term ‘skewness’ refers to a lack of
symmetry around the mean of the dataset. Deviations
from the mean are greater on one side than the other,
i.e. the distribution has one tail longer or heavier than
the other. Skewness is used to describe the shape of the
distribution of data.
In a skewed distribution, the curve is stretched towards
the left or the right. When the tail extends further to
the right, the skewness is positive and typically
mode < median < mean. When the tail extends further
to the left, the skewness is negative and typically
mean < median < mode.
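Measures of Dispersion (Continued)
Example:
The incomes of five randomly selected people in the United States are $10,000,
$10,000, $45,000, $60,000, and $1,000,000:
$$\text{Range} = 1{,}000{,}000 - 10{,}000 = 990{,}000$$
$$\text{Variance} = \frac{(10{,}000 - 225{,}000)^2 + (10{,}000 - 225{,}000)^2 + (45{,}000 - 225{,}000)^2 + (60{,}000 - 225{,}000)^2 + (1{,}000{,}000 - 225{,}000)^2}{5} = 150{,}540{,}000{,}000$$
$$\text{Standard Deviation} = \sqrt{150{,}540{,}000{,}000} \approx 387{,}995$$
Dividing by n - 1 = 4 instead, as is usual when the five values are treated as a sample, gives a variance of
188,175,000,000 and a standard deviation of about 433,791.
Skew = Income is positively skewed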
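The dispersion figures for the income example can be checked with the standard `statistics` module. In this sketch `pvariance` and `pstdev` divide by n, while `variance` divides by n - 1. The skewness formula used (third central moment divided by the cubed standard deviation) is one common definition and is an assumption here, since the slides do not give a formula.

```python
from statistics import mean, pstdev, pvariance, variance

incomes = [10_000, 10_000, 45_000, 60_000, 1_000_000]
m = mean(incomes)
n = len(incomes)

value_range = max(incomes) - min(incomes)  # 990000
pop_var = pvariance(incomes)               # divides by n     -> 150540000000
sample_var = variance(incomes)             # divides by n - 1 -> 188175000000
pop_sd = pstdev(incomes)                   # square root of pop_var

# Moment-based skewness: third central moment over the cubed standard deviation.
skew = sum((x - m) ** 3 for x in incomes) / n / pop_sd ** 3

print(value_range, pop_var, sample_var)
print(round(pop_sd))   # 387995
print(skew > 0)        # True: the long tail is on the right
```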
Measures of Association
Measures of association indicate whether two variables are related. Two measures are commonly used:
• Chi-square
• Correlation
Chi-Square
• As a measure of association between variables, chi-square tests are used on nominal data (i.e., data that are put into
classes: e.g., gender [male, female] and type of job [unskilled, semi-skilled, skilled]) to determine whether they are
associated.
• A chi-square result is called significant when the data give evidence of an association between the two variables
(conventionally, a p-value below 0.05), and non-significant otherwise.
For example, in the sample dataset, respondents were asked their gender and whether or not they were a cigarette
smoker. There were three answer choices: Nonsmoker, Past smoker, and Current smoker. Suppose we want to test for an
association between smoking behavior (nonsmoker, current smoker, or past smoker) and gender (male or female) using a
Chi-Square Test of Independence. The test compares the observed count in each gender-by-smoking cell with the count
that would be expected if the two variables were independent.
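A minimal sketch of the Chi-Square Test of Independence for the gender-by-smoking example follows. The counts are invented, because the sample dataset itself is not reproduced in the slides. The closed-form p-value applies only because a 2 x 3 table has exactly 2 degrees of freedom; other table sizes need a chi-square distribution function from a statistics package.

```python
import math

# Hypothetical counts (illustrative only). Rows are gender,
# columns are nonsmoker, past smoker, current smoker.
observed = {
    "male":   [40, 20, 40],
    "female": [60, 20, 20],
}

rows = list(observed.values())
row_totals = [sum(r) for r in rows]
col_totals = [sum(c) for c in zip(*rows)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
chi2 = 0.0
for r, row in enumerate(rows):
    for c, obs in enumerate(row):
        expected = row_totals[r] * col_totals[c] / total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(rows) - 1) * (len(col_totals) - 1)

# For exactly 2 degrees of freedom the chi-square upper-tail
# probability has the closed form exp(-x / 2).
p_value = math.exp(-chi2 / 2) if dof == 2 else None

print(round(chi2, 2), dof, p_value < 0.05)  # 10.67 2 True
```

With these made-up counts the p-value is below 0.05, so the association between gender and smoking behavior would be called significant.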
Measures of Association (Continued)
Correlation
• A correlation coefficient is used to measure the strength and direction of the linear relationship between two
numeric variables (e.g., weight and height)
• The most common correlation coefficient is Pearson's r, which can range from -1 to +1. Values near -1 or +1 indicate
a strong relationship; values near 0 indicate a weak or no linear relationship.
• If the coefficient is between 0 and 1, as one variable increases, the other tends to increase as well. This is called a
positive correlation. For example, height and weight are positively correlated because taller people usually weigh more.
• If the correlation coefficient is between -1 and 0, as one variable increases the other tends to decrease. This is called a
negative correlation. For example, age and hours slept per night are negatively correlated because older people
usually sleep fewer hours per night
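Pearson's r can be computed directly from its definition. The function `pearson_r` and both small datasets below are hypothetical, chosen only to mirror the height/weight and age/sleep examples.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical measurements (illustrative only).
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 77, 88]
ages = [20, 30, 40, 50, 60]
hours_slept = [8.5, 8.0, 7.4, 7.1, 6.5]

r_height_weight = pearson_r(heights, weights)  # close to +1
r_age_sleep = pearson_r(ages, hours_slept)     # close to -1

print(r_height_weight > 0, r_age_sleep < 0)  # True True
```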
Data Analysis:
Normality
Tools for Assessing Normality
A primary use of descriptive statistics is to determine whether the data are normally
distributed. If the variable is normally distributed, you can use parametric statistics that
are based on this assumption. If the variable is not normally distributed, you might try
a transformation on the variable (such as the natural log or square root) to make the
data closer to normal. There are several methods of assessing whether data are normally
distributed or not. They fall into two broad categories: graphical and statistical.
Some common techniques are:
• Graphical
i. Skewness & Kurtosis
ii. P-P Plot
iii. Q-Q Plot
iv. Boxplot
v. Histogram
vi. Normal Quantile Plot (also called
Normal Probability Plot)
• Statistical
i. Shapiro-Wilk Test
ii. Kolmogorov-Smirnov Test
iii. Jarque-Bera Test
iv. Anderson-Darling Test
Measure of Skewness
1. Symmetrical Distribution: It is that type of frequency curve Fig (a), for which
mean = median = mode
2. Asymmetrical distribution: For this type of frequency distribution the mean, median and mode do
not have the same value (Fig. (b) and (c)). Skewness is positive or negative depending on the location of the
mode with respect to the mean.
i. Positively skewed distribution: In a positively skewed distribution there is a long tail on the
right and the mean is on the right of the mode (Fig. (b)).
ii. Negatively skewed distribution: In a negatively skewed distribution there is a long tail on the
left and the mean is on the left of the mode (Fig. (c)).
Measure of Kurtosis
Kurtosis is a measure of the relative sharpness of the peak of the probability distribution
curve. It is used to indicate the flatness or peakedness of the frequency distribution curve and
reflects the weight of the tails (outliers) of the distribution.
Positive excess kurtosis (kurtosis above 3) indicates that the distribution is more peaked than the normal
distribution, whereas negative excess kurtosis (kurtosis below 3) indicates that it is less peaked.
There are three types of distributions:
1. Mesokurtic (Kurtosis = 3): This distribution has a kurtosis statistic
similar to that of the normal distribution. It means that the extreme
values of the distribution are similar to those of a normal
distribution.
2. Leptokurtic (Kurtosis > 3): The tails are longer and fatter than those of the
normal distribution. The peak is higher and sharper than mesokurtic, which
means that the data are heavy-tailed or have a profusion of outliers. Outliers
stretch the horizontal axis of the histogram, which makes the bulk of the data
appear in a narrow (“skinny”) vertical range.
3. Platykurtic (Kurtosis < 3): The tails are shorter and thinner than those of
the normal distribution. The peak is lower and broader than mesokurtic,
which means that the data are light-tailed or lack outliers. This is
because the extreme values are less extreme than those of the normal
distribution.
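Skewness and kurtosis can be computed from the central moments of the data, and the Jarque-Bera test listed earlier combines the two into one statistic. The sketch below uses the moment-based definitions (kurtosis of a normal distribution equals 3) and the large-sample chi-square approximation with 2 degrees of freedom; both are assumptions not spelled out in the slides, and the approximation is unreliable for small samples such as the toy data here.

```python
import math


def skewness_kurtosis(data):
    """Moment-based skewness and kurtosis (normal distribution: 0 and 3)."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    m4 = sum((x - m) ** 4 for x in data) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


def jarque_bera(data):
    """Jarque-Bera statistic and its asymptotic p-value."""
    n = len(data)
    s, k = skewness_kurtosis(data)
    jb = n / 6 * (s ** 2 + (k - 3) ** 2 / 4)
    # Chi-square with 2 degrees of freedom: upper tail is exp(-x / 2).
    return jb, math.exp(-jb / 2)


data = [1, 2, 3, 4, 5]  # toy, perfectly symmetric data
s, k = skewness_kurtosis(data)
jb, p = jarque_bera(data)

print(s, round(k, 2))  # 0.0 1.7  (symmetric and platykurtic)
```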
Q-Q Plot & P-P Plot
The P-P (Probability-Probability) plot and the Q-Q (Quantile-Quantile) plot are called probability plots. A probability plot helps us compare two data sets in
terms of distribution. Generally one set is theoretical and one set is empirical. The two types of probability plots are
• Q-Q plot (more common)
• P-P plot
• The "blue line" of Fig-A looks like the normal distribution of some data. The "yellow line" represents the distribution of the same data in
cumulative form. A Q-Q plot plots the quantiles of one data set against the corresponding quantiles of the other. A P-P plot plots the
cumulative probabilities (similar to the yellow line) of the two sets against each other.
• For example, a Q-Q plot is used to check whether a given data set is normally distributed by plotting its quantiles against those of a normal distribution. If the
data are normally distributed, the result is approximately a straight line with positive slope, as in Fig-B. Similarly, a P-P plot shows how
well a theoretical distribution fits the given data (observed distribution). The theoretical distribution can be normal, lognormal, exponential, beta,
gamma, etc. In both the P-P plot and the Q-Q plot, a straight line obtained by plotting theoretical data against observed data indicates a good match between
the two distributions. A Q-Q plot is very similar to the P-P plot except that it plots the quantiles (values that split a data set into equal portions) of
the data set rather than the cumulative probability of every individual score. Moreover, Q-Q plots are easier to interpret in the case of large sample sizes.
Fig-A (Normal Distribution) Fig-B (Q-Q Plot) Fig-C (P-P Plot)
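The coordinates behind both plots can be computed with `statistics.NormalDist` from the standard library. The data are hypothetical, and the plotting positions (i - 0.5) / n are one common convention among several, so treat this as a sketch rather than the method used for the figures.

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical measurements (illustrative only).
data = [4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.0, 6.3, 6.8, 7.4]
n = len(data)
observed = sorted(data)

# Normal distribution fitted to the sample.
fitted = NormalDist(mean(data), pstdev(data))

# Plotting positions: assumed convention (i - 0.5) / n.
positions = [(i - 0.5) / n for i in range(1, n + 1)]

# Q-Q plot: theoretical quantile vs observed quantile.
theoretical = [fitted.inv_cdf(p) for p in positions]
qq_points = list(zip(theoretical, observed))

# P-P plot: theoretical cumulative probability vs empirical one.
pp_points = [(fitted.cdf(x), p) for x, p in zip(observed, positions)]

for t, o in qq_points:
    print(f"{t:6.2f}  {o:6.2f}")
```

If the two columns printed for the Q-Q points rise together at a roughly constant rate, the points would fall near a straight line when plotted.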
Boxplot, Normal Probability Plot & Histogram
• Boxplot: The box plot is a standardized way of displaying the distribution of
data based on the five number summary: minimum, first quartile, median, third quartile, and
maximum. In the simplest box plot the central rectangle spans the first quartile to the third
quartile (the interquartile range or IQR). For normally distributed data the box is roughly
symmetric around the median.
• Normal Probability Plot: The normal probability plot was designed specifically to test
the assumption of normality. If your data come from a normal distribution, the points on the
graph will fall approximately along a straight line.
• Histogram: The popular histogram can give you a good idea about whether your data meet
the assumption. If the histogram is roughly bell-shaped, the data are probably approximately normal.
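The five number summary behind a box plot can be sketched with `statistics.quantiles`. The data are hypothetical. Quartile conventions differ between packages (the `"inclusive"` method is one choice), and the 1.5 x IQR rule for flagging outliers is a common convention that is not taken from the slides.

```python
from statistics import quantiles

# Hypothetical data (illustrative only); 30 is a deliberate outlier.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30]

# Quartiles: cut points that split the sorted data into four parts.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
five_number_summary = (min(data), q1, q2, q3, max(data))

# Interquartile range and the conventional 1.5 * IQR outlier fences.
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(five_number_summary)  # (2, 4.5, 7.0, 9.5, 30)
print(iqr, outliers)        # 5.0 [30]
```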
Kolmogorov-Smirnov Test & Shapiro-Wilk Test
Kolmogorov-Smirnov Goodness of Fit Test (K-S test): This test compares your data with a known distribution and
lets you know if they have the same distribution. Although the test is nonparametric (it doesn’t assume any
particular underlying distribution), it is commonly used as a test for normality to see if your data are normally
distributed. It’s also used to check the assumption of normality in Analysis of Variance.
Shapiro-Wilk test: This test is a way to tell if a random sample comes from a normal distribution. The test gives you
a W value; small values indicate your sample is not normally distributed.
If the significance (p) value of the test is above 0.05, there is no significant evidence against normality and the data can be
treated as normally distributed. If it is below 0.05, the distribution is significantly different from a normal distribution,
so the data should NOT be treated as normally distributed.
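The K-S statistic is the largest vertical gap between the empirical cumulative distribution of the data and the theoretical one. The sketch below computes it against a normal distribution fitted to hypothetical data and compares it with the large-sample 5% critical value 1.36 / sqrt(n). Two caveats apply: that critical value is only approximate for small n, and it is conservative when the mean and standard deviation are estimated from the same sample (statistical packages usually apply a correction). Shapiro-Wilk depends on tabulated coefficients and is not sketched here.

```python
import math
from statistics import NormalDist, mean, stdev


def ks_statistic(data):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


# Hypothetical measurements (illustrative only).
data = [4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.0, 6.3, 6.8, 7.4]

d = ks_statistic(data)
critical = 1.36 / math.sqrt(len(data))  # approximate 5% critical value

# d below the critical value: no significant departure from normality.
print(d < critical)  # True
```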
Thank You!

Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 

Data Analysis in Research: Descriptive Statistics & Normality

  • 1. Data Analysis in Research: Descriptive Statistics & Normality Ikbal Ahmed PhD Researcher Binary University, Malaysia
  • 3. Data Analysis: What is Data? In general, data is any set of characters that is gathered and translated for some purpose, usually analysis. If data is not put into context, it doesn't do anything to a human or computer. There are multiple types of data. Some of the more common types of data include the following. • Single character • Boolean (true or false) • Text (string) • Number (integer or floating-point) • Picture • Sound • Video Research data can take many forms. It might be: • documents, spreadsheets • laboratory notebooks, field notebooks, diaries • questionnaires, transcripts, codebooks • audiotapes, videotapes • photographs, films • test responses • slides, artefacts, specimens, samples • collections of digital outputs • database contents (video, audio, text, images) • models, algorithms, scripts • contents of an application (input, output, logfiles for analysis software, simulation software, schemas) • methodologies and workflows • Research data comes in many different formats and is gathered using a wide variety of methodologies. • Research data are collected and used in scholarship across all academic disciplines and, while they can consist of numbers in a spreadsheet, they also take many different formats, including videos, images, artifacts, and diaries. Whether a psychologist collecting survey data to better understand human behavior, an artist using data to generate images and sounds, or an anthropologist using audio files to document observations about different cultures, scholarly research across all academic fields is increasingly data-driven. • Research data can be read by researchers and used to support the research and test hypotheses. • Data becomes information after analysis. • Data types: i. Primary ii. Secondary
  • 4. Sources of research data Research data can be generated for different purposes and through different processes. • Observational data is captured in real-time, and is usually irreplaceable, for example sensor data, survey data, sample data, and neuro-images. • Experimental data is captured from lab equipment. It is often reproducible, but this can be expensive. Examples of experimental data are gene sequences, chromatograms, and toroid magnetic field data. • Simulation data is generated from test models where model and metadata are more important than output data. For example, climate models and economic models. • Derived or compiled data has been transformed from pre-existing data points. It is reproducible if lost, but this would be expensive. Examples are data mining, compiled databases, and 3D models. • Reference or canonical data is a static or organic conglomeration or collection of smaller (peer-reviewed) datasets, most probably published and curated. For example, gene sequence databanks, chemical structures, or spatial data portals.
  • 5. Primary Data Vs Secondary Data
  • 7. Quantitative Data Quantitative data is the easiest type to explain. It answers key questions such as “how many”, “how much” and “how often”. Quantitative data can be expressed as a number or can be quantified. Simply put, it can be measured by numerical variables. Examples of quantitative data: • Scores on tests and exams, e.g. 85, 67, 90. • The weight of a person or a subject. • Your shoe size. • The temperature in a room. There are two general types of quantitative data: 1. Discrete data 2. Continuous data.
  • 8. Quantitative Data (Continued) 1. Discrete data: Discrete data is a count that involves only integers. Examples of discrete data: • The number of students in a class. • The number of workers in a company. • The number of home runs in a baseball game. • The number of test questions you answered correctly. 2. Continuous data: Continuous data is information that can be meaningfully divided into finer levels. It can be measured on a scale or continuum and can have almost any numeric value. Examples of continuous data: • The amount of time required to complete a project. • The height of children. • The square footage of a two-bedroom house. • The speed of cars.
  • 9. Qualitative data Qualitative data can’t be expressed as a number and can’t be measured. Qualitative data consist of words, pictures, and symbols, not numbers. Qualitative data is also called categorical data because the information can be sorted by category, not by number. Qualitative data can answer questions such as “how this has happened” or “why this has happened”. Examples of qualitative data: • Colors, e.g. the color of the sea • Your favorite holiday destination, such as Hawaii or New Zealand • Names such as John or Patricia • Ethnicity such as American Indian, Asian, etc. There are two general types of qualitative data: 1. Nominal data 2. Ordinal data
  • 10. Qualitative Data (Continued) 1. Nominal data: Nominal data is used just for labeling variables, without any type of quantitative value. The name ‘nominal’ comes from the Latin word “nomen”, which means ‘name’. Examples of Nominal Data: • Gender (Women, Men) • Hair color (Blonde, Brown, Brunette, Red, etc.) • Marital status (Married, Single, Widowed) • Ethnicity (Hispanic, Asian) 2. Ordinal data: Ordinal data shows where a value stands in an order. It is data placed into some kind of order by position on a scale, and it may indicate superiority. However, you cannot do arithmetic with ordinal values because they only show sequence, not the size of the gaps between positions. Examples of Ordinal Data: • The first, second and third person in a competition. • Letter grades: A, B, C, etc. • When a company asks a customer to rate the sales experience on a scale of 1-10. • Economic status: low, medium and high.
  • 11. What is data analysis? • According to LeCompte and Schensul, research data analysis is a process used by researchers for reducing data to a story and interpreting it to derive insights. The data analysis process helps in reducing a large chunk of data into smaller fragments that make sense. • Marshall and Rossman, on the other hand, describe data analysis as a messy, ambiguous, and time-consuming, but creative and fascinating process through which a mass of collected data is brought to order, structure and meaning. • Types of Data Analysis: Techniques and Methods There are several types of data analysis techniques, which vary with the business and technology context. The major types of data analysis are: • Descriptive Analysis • Diagnostic Analysis • Predictive Analysis • Prescriptive Analysis
  • 12. Data Analysis Tools Data Analysis Software: 1) Statistical Package for Social Science (SPSS) 2) STATA 3) Smart PLS 4) EViews 5) Microsoft Excel Data Analysis Programming Languages:
  • 13. Qualitative Data Analysis The data obtained through this method consists of words, pictures, symbols and observations. This type of analysis refers to the procedures and processes that are utilized for the analysis of data to provide some level of understanding, explanation or interpretation. Unlike the quantitative analysis, no statistical approaches are used to collect and analyze this data. There are a variety of approaches to collecting this type of data and interpreting it. Some of the most commonly used methods are: 1. Content Analysis: It is used to analyze verbal or behavioral data. This data can consist of documents or communication artifacts like texts in various formats, pictures or audios/videos. 2. Narrative Analysis: This one is the most commonly used as it involves analyzing data that comes from a variety of sources including field notes, surveys, diaries, interviews and other written forms. It involves reformulating the stories given by people based on their experiences and in different contexts. 3. Grounded Theory: This method involves the development of causal explanations of a single phenomenon from the study of one or more cases. If further cases are studied, then the explanations are altered until the researchers arrive at a statement that fits all of the cases.
  • 14. Quantitative Data Analysis The two most commonly used quantitative data analysis methods are: 1. Descriptive statistics 2. Inferential statistics. 1) Descriptive Analysis: summarizes the complete data set, or a sample of it, numerically. It reports the mean and deviation for continuous data, and percentage and frequency for categorical data. 2) Inferential Analysis: analyses a sample drawn from the complete data in order to draw conclusions about the whole. Because the conclusions depend on the sample, selecting different samples from the same data can lead to different conclusions.
  • 15. Descriptive Statistics Typically descriptive statistics (also known as descriptive analysis) is the first level of analysis. This helps researchers find absolute numbers to summarize individual variables and find patterns. A few commonly used descriptive statistics are: • Mean: numerical average of a set of values. • Median: midpoint of a set of numerical values. • Mode: most common value among a set of values. • Percentage: used to express how a value or group of respondents within the data relates to a larger group of respondents. • Frequency: the number of times a value is found. • Range: the difference between the highest and lowest values in a set of values.
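Frequency and percentage, as listed above, can be sketched in a few lines. This is a minimal illustration only; the `responses` list is hypothetical survey data, not taken from the slides.

```python
from collections import Counter

# Hypothetical categorical survey answers (illustrative only).
responses = ["yes", "no", "yes", "yes", "undecided", "no", "yes"]

# Frequency: how many times each value is found.
frequency = Counter(responses)

# Percentage: each value's share of all respondents.
total = len(responses)
percentage = {answer: 100 * count / total for answer, count in frequency.items()}

for answer, count in frequency.most_common():
    print(f"{answer}: n={count}, {percentage[answer]:.1f}%")
```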
  • 16. Inferential Analysis These complex analyses show the relationships between multiple variables to generalize results and make predictions. A few examples are... • Correlation: describes the relationship between 2 variables • Regression: shows or predicts the relationship between 2 variables • Analysis of variance: tests the extent to which 2+ groups differ
  • 17. Descriptive statistics Descriptive statistics can be useful for two purposes: 1) to provide basic information about variables in a dataset and 2) to highlight potential relationships between variables. The most common descriptive statistics fall into four groups, the first of which displays the data graphically or pictorially: • Graphical/Pictorial Methods • Measures of Central Tendency • Measures of Dispersion • Measures of Association
  • 18. Graphical/Pictorial Methods There are several graphical and pictorial methods that enhance researchers' understanding of individual variables and the relationships between variables. Graphical and pictorial methods provide a visual representation of the data. Some of these methods include: • Histograms • Scatter plots • Bar charts • Pie charts • Line Graph • Histograms • Each value of a variable is displayed along the bottom of a histogram, and a bar is drawn for each value • The height of the bar corresponds to the frequency with which that value occurs • Bar Charts • The heights of bars in a bar chart represent the frequencies (or relative frequencies) in each group. Note that in a bar chart, there is a gap between each bar. This is unlike a histogram, where there are no gaps between the bars, reflecting the continuous nature of the underlying variable.
  • 19. Graphical/Pictorial Methods (Continued) • Scatter plots • Display the relationship between two quantitative or numeric variables by plotting the value of one variable against the value of the other • For example, one axis of a scatter plot could represent height and the other could represent weight. Each person in the data would receive one data point on the scatter plot that corresponds to his or her height and weight • Pie Chart • In a pie chart, a circle is divided into slices, such that each slice represents a different category and the size of the slice is proportional to the relative frequency of that category. Conventionally, the categories in a pie chart are ordered clockwise from the largest slice to the smallest, starting at the 12 o’clock position.
  • 20. Graphical/Pictorial Methods (Continued) Line graph A line graph, also known as a line chart, is a type of chart used to visualize the value of something over time. For example, a finance department may plot the change in the amount of cash the company has on hand over time. The line graph consists of a horizontal x-axis and a vertical y-axis. Most line graphs only deal with positive number values, so these axes typically intersect near the bottom of the y-axis and the left end of the x-axis. The point at which the axes intersect is usually the origin (0, 0). Each axis is labeled with a data type. For example, the x-axis could be days, weeks, quarters, or years, while the y-axis shows revenue in dollars. The x-axis is also called the independent axis because its values do not depend on anything. For example, time is always placed on the x-axis since it continues to move forward regardless of anything else. The y-axis is also called the dependent axis because its values depend on those of the x-axis: at this time, the company had this much money. The result is that the line of the graph always progresses from left to right and each x value has only one y value (the company cannot have two amounts of money at the same time).
  • 21. Measures of Central Tendency Measures of central tendency are the most basic and, often, the most informative description of a population's characteristics. There are three measures of central tendency: • Mean: the sum of a variable's values divided by the total number of values • Median: the middle value of a variable • Mode: the value that occurs most often Example: The incomes of five randomly selected people in the United States are $10,000, $10,000, $45,000, $60,000, and $1,000,000. Mean Income = (10,000 + 10,000 + 45,000 + 60,000 + 1,000,000) / 5 = $225,000 Median Income = $45,000 Modal Income = $10,000 The mean is the most commonly used measure of central tendency. Medians are generally used when a few values are extremely different from the rest of the values (this is called a skewed distribution). For example, the median income is often the best measure of the average income because, while most individuals earn between $0 and $200,000, a handful of individuals earn millions.
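The income example can be checked with Python's standard `statistics` module. A minimal sketch; the variable names are illustrative.

```python
from statistics import mean, median, mode

# Incomes from the slide's example (US dollars).
incomes = [10_000, 10_000, 45_000, 60_000, 1_000_000]

mean_income = mean(incomes)      # sum of the values divided by their count
median_income = median(incomes)  # middle value of the sorted data
modal_income = mode(incomes)     # value that occurs most often

print(mean_income, median_income, modal_income)
```

The single $1,000,000 income pulls the mean far above the median, which is why the median is usually preferred for skewed data such as income.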
  • 22. Measures of Dispersion Measures of dispersion provide information about the spread of a variable's values. There are four key measures of dispersion: • Range • Variance • Standard Deviation • Skew Range is simply the difference between the smallest and largest values in the data. The interquartile range is the difference between the values at the 75th percentile and the 25th percentile of the data. Variance is the most commonly used measure of dispersion. It is calculated by taking the average of the squared differences between each value and the mean.
  • 23. Measures of Dispersion (Continued) Standard deviation, another commonly used statistic, is the square root of the variance. Skewness The term ‘skewness’ refers to a lack of symmetry around the mean of the dataset. It describes deviations from the mean being greater on one side than the other, i.e. the distribution having one tail heavier than the other. Skewness is used to indicate the shape of the distribution of data. In a skewed distribution, the curve is extended to either the left or the right side. When the plot extends further towards the right, it shows positive skewness, where mode < median < mean. When the plot stretches further towards the left, it shows negative skewness, where mean < median < mode.
  • 24. Measures of Dispersion (Continued) Example: The incomes of five randomly selected people in the United States are $10,000, $10,000, $45,000, $60,000, and $1,000,000: Range = 1,000,000 − 10,000 = 990,000 Variance = [(10,000 − 225,000)^2 + (10,000 − 225,000)^2 + (45,000 − 225,000)^2 + (60,000 − 225,000)^2 + (1,000,000 − 225,000)^2] / 5 = 150,540,000,000 Standard Deviation = sqrt(150,540,000,000) ≈ 387,995 Skew = Income is positively skewed
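The dispersion figures can be reproduced with the standard library, as a sketch. Dividing the summed squared deviations by n = 5 (the population form, `pvariance`) gives the slide's 150,540,000,000. Dividing by n − 1 = 4 (the sample form, `variance`) gives a larger value, shown here only for comparison.

```python
from statistics import pstdev, pvariance, variance

# Incomes from the slide's example (US dollars).
incomes = [10_000, 10_000, 45_000, 60_000, 1_000_000]

value_range = max(incomes) - min(incomes)  # largest minus smallest value
pop_variance = pvariance(incomes)          # average squared deviation (divides by n)
pop_sd = pstdev(incomes)                   # square root of the population variance
sample_variance = variance(incomes)        # divides by n - 1 instead of n

print(value_range, pop_variance, round(pop_sd), sample_variance)
```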
  • 25. Measures of Association Measures of association indicate whether two variables are related. Two measures are commonly used: • Chi-square • Correlation Chi-Square • As a measure of association between variables, chi-square tests are used on nominal data (i.e., data that are put into classes: e.g., gender [male, female] and type of job [unskilled, semi-skilled, skilled]) to determine whether they are associated. • A chi-square result is called significant if the data show evidence of an association between the two variables, and non-significant if they do not. To test for associations, a chi-square is calculated as in the following example: In the sample dataset, respondents were asked their gender and whether or not they were a cigarette smoker. There were three answer choices: Nonsmoker, Past smoker, and Current smoker. Suppose we want to test for an association between smoking behavior (nonsmoker, current smoker, or past smoker) and gender (male or female) using a Chi-Square Test of Independence.
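A minimal sketch of the Chi-Square Test of Independence for a gender by smoking-status table follows. The counts in `observed` are hypothetical, invented for illustration, because the slide's dataset is not reproduced in the text. The p-value shortcut in the last line is exact only when there are 2 degrees of freedom, as in a 2 × 3 table.

```python
import math

# Hypothetical counts (NOT from the slides): rows are gender,
# columns are nonsmoker, past smoker, current smoker.
observed = [
    [120, 40, 40],  # male
    [150, 30, 20],  # female
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (observed - expected)^2 / expected over every cell, where the
# expected count under independence is row total * column total / n.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

# With exactly 2 degrees of freedom, P(X > chi2) = exp(-chi2 / 2).
p_value = math.exp(-chi2 / 2)

print(f"chi2={chi2:.3f}, df={df}, p={p_value:.4f}")
```

A p-value below the chosen significance level (commonly 0.05) would be read as evidence of an association between gender and smoking behavior in this made-up table.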
  • 26. Measures of Association (Continued) Correlation • A correlation coefficient is used to measure the strength of the relationship between numeric variables (e.g., weight and height) • The most common correlation coefficient is Pearson's r, which can range from -1 to +1. • If the coefficient is between 0 and 1, then as one variable increases, the other tends to increase as well. This is called a positive correlation. For example, height and weight are positively correlated because taller people usually weigh more. • If the correlation coefficient is between -1 and 0, then as one variable increases, the other tends to decrease. This is called a negative correlation. For example, age and hours slept per night are negatively correlated because older people usually sleep fewer hours per night.
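Pearson's r can be computed directly from its definition: the sum of the products of the paired deviations from the means, divided by the product of the square roots of each variable's summed squared deviations. The sketch below uses hypothetical height and weight values, and `pearson_r` is an illustrative helper, not a library function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of paired deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of each variable's summed squared deviations.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cross / (spread_x * spread_y)

# Hypothetical measurements (illustrative only).
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 88]

r = pearson_r(heights_cm, weights_kg)
print(f"r = {r:.3f}")  # close to +1: a strong positive correlation
```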
  • 28. Tools for Assessing Normality A primary use of descriptive statistics is to determine whether the data are normally distributed. If the variable is normally distributed, you can use parametric statistics that are based on this assumption. If the variable is not normally distributed, you might try a transformation on the variable (such as the natural log or square root) to bring the data closer to normal. There are several methods of assessing whether data are normally distributed or not. They fall into two broad categories: graphical and statistical. Some common techniques are: • Graphical i. Skewness & Kurtosis ii. P-P Plot iii. Q-Q Plot iv. Boxplot v. Histogram vi. Normal Quantile Plot (also called Normal Probability Plot) • Statistical i. Shapiro-Wilk Test ii. Kolmogorov-Smirnov Test iii. Jarque-Bera Test iv. Anderson-Darling Test
  • 29. Measure of Skewness 1. Symmetrical distribution: This is the type of frequency curve (Fig. (a)) for which mean = median = mode. 2. Asymmetrical distribution: For this type of frequency distribution, the mean, median and mode do not have the same value (Fig. (b) and (c)). Skewness is positive or negative depending on the location of the mode with respect to the mean. i. Positively skewed distribution: In a positively skewed distribution there is a long tail on the right and the mean is to the right of the mode (Fig. (b)). ii. Negatively skewed distribution: In a negatively skewed distribution there is a long tail on the left and the mean is to the left of the mode (Fig. (c)).
  • 30. Measure of Kurtosis Kurtosis describes the relative sharpness of the peak of a probability distribution curve. It indicates the flatness or peakedness of the frequency distribution curve and reflects the weight of the tails (outliers) of the distribution. Measured relative to the normal distribution's kurtosis of 3, positive excess kurtosis means the distribution is more peaked than the normal distribution, whereas negative excess kurtosis means it is less peaked. There are three types of distributions: 1. Mesokurtic (Kurtosis = 3): This distribution has a kurtosis statistic similar to that of the normal distribution, meaning that its extreme values behave like those of a normal distribution. 2. Leptokurtic (Kurtosis > 3): The tails are longer and fatter, and the peak is higher and sharper than in a mesokurtic distribution, which means the data are heavy-tailed or have a profusion of outliers. Outliers stretch the horizontal axis of the histogram, which makes the bulk of the data appear in a narrow (“skinny”) vertical range. 3. Platykurtic (Kurtosis < 3): The tails are shorter and thinner than those of the normal distribution, and the peak is lower and broader than in a mesokurtic distribution, which means the data are light-tailed or lack outliers, because extreme values are less frequent than under the normal distribution.
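Skewness and kurtosis can be computed from the central moments of the data. The sketch below uses the simple moment-based (population) definitions, skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2, where m_k is the average of the k-th power of the deviations from the mean. Statistical packages often apply small-sample corrections, so their output may differ slightly. The helper names are illustrative.

```python
def central_moment(data, k):
    """Average of the k-th power of the deviations from the mean."""
    n = len(data)
    mu = sum(data) / n
    return sum((x - mu) ** k for x in data) / n

def skewness(data):
    # Zero for symmetric data, positive when the right tail is longer.
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    # Equals 3 for a normal distribution; subtract 3 for excess kurtosis.
    return central_moment(data, 4) / central_moment(data, 2) ** 2

# Incomes from the earlier slide example (US dollars).
incomes = [10_000, 10_000, 45_000, 60_000, 1_000_000]
print(f"skewness={skewness(incomes):.2f}, kurtosis={kurtosis(incomes):.2f}")
```

For the income example the skewness is positive, matching the slide's statement that income is positively skewed.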
  • 31. Q-Q Plot & P-P Plot The P-P (Probability-Probability) plot and the Q-Q (Quantile-Quantile) plot are called probability plots. A probability plot helps us compare two data sets in terms of distribution. Generally one set is theoretical and one set is empirical. The two types of probability plots are • Q-Q plot (more common) • P-P plot • The "blue line" of Fig-A shows the normal distribution of some data. The "yellow line" represents the distribution of the same data in cumulative form. Plotting the quantiles of two data sets against each other gives a Q-Q plot; plotting their cumulative probabilities (similar to the yellow line) against each other gives a P-P plot. • For example, a Q-Q plot is used to check whether a given data set is normally distributed by plotting its quantiles against those of a normal distribution. If the data are normally distributed, the result is approximately a straight line with positive slope, as in Fig-B. Similarly, with a P-P plot we can assess how well a theoretical distribution fits the given data (observed distribution). The theoretical distribution can be normal, lognormal, exponential, beta, gamma, etc. In either a P-P plot or a Q-Q plot, if plotting the theoretical data against the observed data gives a straight line, it indicates a good match between the two distributions. A Q-Q plot is very similar to the P-P plot except that it plots the quantiles (values that split a data set into equal portions) of the data set instead of cumulative probabilities. Moreover, Q-Q plots are easier to interpret for large sample sizes. Fig-A (Normal Distribution) Fig-B (Q-Q Plot) Fig-C (P-P Plot)
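The points of a normal Q-Q plot can be sketched without a plotting library by pairing each sorted observation with the matching quantile of a normal distribution fitted to the sample. The plotting positions (i − 0.5) / n used here are one common convention and an assumption of this sketch. The sample values and the function name are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(data):
    """Return (theoretical quantile, observed value) pairs for a normal Q-Q plot."""
    ordered = sorted(data)
    n = len(ordered)
    # Reference normal distribution fitted to the sample mean and standard deviation.
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Assumed plotting positions: (i - 0.5) / n for the i-th smallest value.
    return [
        (reference.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(ordered, start=1)
    ]

# Hypothetical sample (illustrative only).
sample = [2.1, 2.9, 3.4, 4.0, 4.6, 5.2, 5.9]
points = normal_qq_points(sample)

for theoretical, observed in points:
    print(f"{theoretical:6.2f}  {observed:5.2f}")
```

If the pairs fall close to the line y = x, the sample quantiles agree with the fitted normal distribution.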
  • 32. Boxplot, Normal Probability Plot & Histogram • Boxplot: Descriptive statistics use numbers to describe how data values are similar and how they differ. The box plot is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum. In the simplest box plot the central rectangle spans the first quartile to the third quartile (the interquartile range or IQR). • Normal Probability Plot: The normal probability plot was designed specifically to check the assumption of normality. If your data come from a normal distribution, the points on the graph will fall approximately along a straight line. • Histogram: The popular histogram can give you a good idea of whether your data meet the assumption. If the histogram looks like a bell curve, the data are probably normal.
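The five-number summary behind a box plot can be sketched with `statistics.quantiles`. The scores below are hypothetical, and the `"inclusive"` method is one of several quartile conventions, so other software may report slightly different quartiles.

```python
from statistics import quantiles

# Hypothetical test scores (illustrative only).
scores = [55, 61, 67, 70, 72, 75, 78, 84, 90]

# Quartile cut points: first quartile, median, third quartile.
q1, med, q3 = quantiles(scores, n=4, method="inclusive")

five_number_summary = (min(scores), q1, med, q3, max(scores))
iqr = q3 - q1  # width of the box in a box plot

print(five_number_summary, iqr)
```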
  • 33. Kolmogorov-Smirnov Test & Shapiro-Wilk Test Kolmogorov-Smirnov Goodness of Fit Test (K-S test): This test compares your data with a known distribution and lets you know if they have the same distribution. Although the test is nonparametric (it doesn’t assume any particular underlying distribution), it is commonly used as a test for normality to see if your data is normally distributed. It’s also used to check the assumption of normality in Analysis of Variance. Shapiro-Wilk test: This test is a way to tell if a random sample comes from a normal distribution. The test gives you a W value; small values indicate your sample is not normally distributed. If the significance (p) value of the test is greater than 0.05, the data do not differ significantly from a normal distribution, so normality can be assumed. If it is below 0.05, the distribution is significantly different from a normal distribution, so the data should not be treated as normally distributed.
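The K-S statistic itself is the largest vertical gap between the sample's empirical cumulative distribution and the reference distribution's cumulative distribution. The sketch below computes that gap against a normal distribution fitted to the sample. It stops short of a p-value, because standard K-S tables assume the reference distribution is fully specified in advance rather than estimated from the same data. The function name is illustrative.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic_normal(data):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    ordered = sorted(data)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        cdf = reference.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x,
        # so check the gap on both sides of the jump.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# Incomes from the earlier slide example versus an evenly spaced sample.
incomes = [10_000, 10_000, 45_000, 60_000, 1_000_000]
even = [1, 2, 3, 4, 5]

d_incomes = ks_statistic_normal(incomes)
d_even = ks_statistic_normal(even)
print(f"D(incomes)={d_incomes:.3f}, D(even)={d_even:.3f}")
```

The strongly skewed income sample yields a much larger statistic than the evenly spaced one, reflecting its poorer fit to a normal curve.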