SlideShare une entreprise Scribd logo
1  sur  32
Télécharger pour lire hors ligne
Data Analysis Course
Descriptive Statistics(Version-1)
Venkat Reddy
Data Analysis Course
• Data analysis design document
• Introduction to statistical data analysis

• Descriptive statistics
•   Data exploration, validation & sanitization




                                                                Venkat Reddy
                                                          Data Analysis Course
•   Probability distributions examples and applications
•   Simple correlation and regression analysis
•   Multiple liner regression analysis
•   Logistic regression analysis
•   Testing of hypothesis
•   Clustering and decision trees
•   Time series analysis and forecasting
•   Credit Risk Model building-1                                 2
•   Credit Risk Model building-2
Note
• This presentation is just class notes. The course notes for Data
  Analysis Training is by written by me, as an aid for myself.
• The best way to treat this is as a high-level summary; the
  actual session went more in depth and contained other




                                                                           Venkat Reddy
                                                                     Data Analysis Course
  information.
• Most of this material was written as informal notes, not
  intended for publication
• Please send questions/comments/corrections to
  venkat@trenwiseanalytics.com or 21.venkat@gmail.com
• Please check my website for latest version of this document
                                         -Venkat Reddy                      3
Contents
•   What are Descriptive statistics
•   Frequency tables and graphs, Histograms
•   Central Tendency
•   Mean, Median, Mode




                                                    Venkat Reddy
                                              Data Analysis Course
•   Dispersion
•   Range, variance, standard deviation
•   Quartiles, Percentiles
•   Box Plots
•   Bivariate Descriptive Statistics
    • Contingency Tables
    • Correlation                                    4
    • Regression
Why Descriptive statistics?
• Who is a better ODI batsmen - Sachin or Muralidharan?
   • Batting average?
• Who is the reliable- Dhoni or Afridi?
   • Score variance
• A triangular series among Aus, Eng & Newziland ; Who will win?
   • Most number of wins - Mode




                                                                                  Venkat Reddy
                                                                            Data Analysis Course
• I am going to buy shoes. Which brand has verity- Power or Adidas?
   • Price range - Range

• We used Average, Variance, Mode, Range to make some inferences.
  These are nothing but descriptive statistics
• Descriptive statistics tell us what happened in the past.
• Descriptive statistics avoid inferences but, they help us to get a feel
  of the data.
• Some times they are good enough to make an inference.                            5
Descriptive Statistics
• A statistic or a measure that describes the data
  • Average salary of employees
• Describing data with tables and graphs (quantitative or
  categorical variables)




                                                                          Venkat Reddy
                                                                    Data Analysis Course
• Numerical descriptions
  • Center – Give some example measures of center of the data
  • Variability– Give some example measures of variability of the
    data
• Bivariate descriptions (In practice, most studies have several
  variables)
  • Dependency measures(Correlation)
                                                                           6
Simple Descriptive Statistics
•   N
•   Sum
•   Min
•   Max




                                                                       Venkat Reddy
                                                                 Data Analysis Course
•   Average
•   Frequency of each level
•   Variance
•   Standard deviation

These simple descriptive statistics will be use in inferential
statistics later.                                                       7
Frequency tables & Histograms
• Frequency distribution: Lists possible values of variable and
  number of times each occurs




                                                                        Venkat Reddy
                                                                  Data Analysis Course
                                                                         8
Shapes of histograms
• Bell-shaped (IQ, SAT, political ideology in all U.S. )
• Skewed right
  • Example Annual income
  • No. times arrested




                                                                       Venkat Reddy
                                                                 Data Analysis Course
• Skewed left
  • Score on easy exam
  • Daily level if excitement in office
• Bimodal
  • Hardworking days in a year (Peaks near Mid year & year end
    Appraisal)

                                                                        9
Lab : Histogram
• Create a histogram on variable ‘actual’ in prdsale data
  • How many modes?
  • What is the skewness?
  • What is its kurtosis?
• Create a histogram on variable ‘msrp’ in cars data




                                                                  Venkat Reddy
                                                            Data Analysis Course
  • How many modes?
  • What is the skewness?
  • What is its kurtosis?
• Create a histogram on variable ‘weight’ in cars data
  • How many modes?
  • What is the skewness?
  • What is its kurtosis?

                                                                10
Compare the above three histograms.
Central tendency
• What is the flight fare from Bangalore to Delhi? 3500–Exact or
  average?
• What is central tendency? - Average
• Three types of Averages




                                                                         Venkat Reddy
                                                                   Data Analysis Course
  • Mean
  • Median
  • Mode




                                                                       11
Mean
• Center of gravity
• Evenly partitions the sum of all measurement among all cases;
  average of all measures
                              n

                            x




                                                                        Venkat Reddy
                                                                  Data Analysis Course
                                      i
                    x       i 1

                                  n
• Crucial for inferential statistics
• Mean is not very resistant to outliers –See in Median
                                                                      12
Median
• What is the mean of [0.1     0.8    0.4    0.3     0.1
      0.4      9.0    0.1      0.9    0.3    1.0     0.3
      0.1]
• Guess without calculation – Around 0.5?
• Now calculate the mean




                                                                      Venkat Reddy
                                                                Data Analysis Course
• Median is exactly in the middle. Isn’t mean exactly in the
  middle
• Order the observations in ascending or descending order and
  pick the middle observation
• less useful for inferential purposes
                                                                    13
• More resistant to effects of outliers…
Calculation of Median
        rim diameter (cm)

               unit 1 unit 2
                  9.7   9.0
                11.5 11.2
                11.6 11.3




                                         Venkat Reddy
                                   Data Analysis Course
                12.1 11.7
                12.4 12.2
                12.6 12.5
         12.9 <--      13.2 13.2
                13.1 13.8
                13.5 14.0
                13.6 15.5
                14.8 15.6
                16.3 16.2
                26.9 16.4
                                       14
Mode
• How do you express average size of the shoes ?
  • 6.567 or 6?

• Mode is the most numerous category
• Can be more or less created by the grouping procedure




                                                                        Venkat Reddy
                                                                  Data Analysis Course
• For theoretical distributions—simply the location of the peak
  on the frequency distribution




                                                                      15
Lab
•   Run Proc means data product data
•   What is the mean of ‘msrp’ in cars data?
•   Is it reflecting the average value of price?
•   What is median of ‘msrp’ in cars data?




                                                                       Venkat Reddy
                                                                 Data Analysis Course
•   Is it reflecting the average value of price?
•   Run Proc Univariate on weight varaibale in cars data. Find
    mean, Median & Mode.




                                                                     16
Dispersion
Person1: What is the average depth of this river? 5 feet
Person2: I am 5.5 I can easily cross it(and starts crossing it)
Person 2: Help….help.
Person 1: Some times just knowing the central tendency is not




                                                                        Venkat Reddy
                                                                  Data Analysis Course
sufficient

• Measures of dispersion summarize the degree of
  clustering/spread of cases, esp. with respect to central
  tendency…
  • range
  • variance
  • standard deviation                                                17
Range
                  unit 1 unit 2
• Max –Min          9.7    9.0
                   11.5 11.2
                   11.6 11.3
                   12.1 11.7
                   12.4 12.2




                                        Venkat Reddy
                                  Data Analysis Course
                   12.6 12.5
    R: range(x)    13.1 13.2
                   13.5 13.8
                   13.6 14.0
                   14.8 15.5
                   16.3 15.6
                   26.9 16.2
                          16.4


                                      18
Variance
 • Take deviation from Mean- It can be zero some times
 • Hence take square of deviation from mean  Take average of
   that
 • Average mean squared distance is variance




                                                                               Venkat Reddy
                                                                         Data Analysis Course
                               n

                               x  x 
                                         2
                                     i
                       2    i 1
                                     n

• Units of variance are squared… this makes variance hard to interpret
• Eg : Mean length = 22.6 mm variance = 38 mm2
• What does this mean??? –I don’t Know
                                                                             19
Standard Deviation
• Square root of variance                n

                                        xi  x 2
                                  s    i 1
                                               n




                                                            Venkat Reddy
                                                      Data Analysis Course
• Units are in same units as base measurements
• Mean = 22.6 mm standard deviation = 6.2 mm
• Mean +/- sd (16.4—28.8 mm)
  • should give at least some intuitive sense of
    where most of the cases lie, barring major
    effects of outliers

                                                          20
Quartiles & Percentiles
• pth percentile: p percent of observations below it, (100 - p)%
  above it.
• Like 95% of CAT percentile means 5% are above & 95% are
  below
• 1,2,3,4,5,6,7,8,9,10 - What is 25th percentile?




                                                                         Venkat Reddy
                                                                   Data Analysis Course
• 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20 - What is
  25th percentile? What is 80th percentile?

  • p = 50: median
  • p = 25: lower quartile (LQ)
  • p = 75: upper quartile (UQ)

• Interquartile range IQR = UQ - LQ                                    21
Box Plots
• Quartiles portrayed graphically by box plots




                                                       Venkat Reddy
                                                 Data Analysis Course
                                                     22
Box Plots




                                                         Venkat Reddy
                                                   Data Analysis Course
Example: weekly TV watching for n=60, 3 outliers       23
Box Plots Interpretation
• Box plots have box from LQ to UQ, with median marked. They
  portray a five-number summary of the data: Minimum, LQ,
  Median, UQ, Maximum
• Except for outliers identified separately
• Outlier = observation falling




                                                                       Venkat Reddy
                                                                 Data Analysis Course
          below LQ – 1.5(IQR) or above UQ + 1.5(IQR)
• Ex. If LQ = 2, UQ = 10, then IQR = 8 and outliers above 10 +
  1.5(8) = 22




                                                                     24
Lab
• Run proc univariate on a variable from sample data in sas
  default library(prd sale / cars)
• Run proc means on actual & predicted variables from product
  sales data
• What are the values of Range, Variance, SD




                                                                      Venkat Reddy
                                                                Data Analysis Course
• What are 1,2,3 & 4 quartile values
• What is 95th percentile?
• Use “all” option to display the box plots



                                                                    25
Contingency Tables
• Cross classifications of categorical variables in which rows (typically)
  represent categories of explanatory variable and columns represent
  categories of response variable.
• Counts in “cells” of the table give the numbers of individuals at the
  corresponding combination of levels of the two variables




                                                                                   Venkat Reddy
                                                                             Data Analysis Course
   Example: Happiness and Family Income of 1993 families (GSS 2008 data:
   “happy,” “finrela”)
                             Happiness
    Income       Very Pretty Nottoo               Total
                -------------------------------
     Above Aver. 164 233               26         423
     Average      293 473             117         883
     Below Aver. 132 383              172          687
                 ------------------------------
    Total         589 1089           315          1993
                                                                                 26
Contingency tables
• Example: Percentage “very happy” is
  • 39% for above average income (164/423 = 0.39)
  • 33% for average income (293/883 = 0.33)
  • What percent for below average income?




                                                                      Venkat Reddy
                                                                Data Analysis Course
                        Happiness
  Income      Very        Pretty Not oo                 Total
           --------------------------------------------
   Above 164 (39%) 233 (55%) 26 (6%)                     423
   Average 293 (33%) 473 (54%) 117 (13%) 883
   Below 132 (19%) 383 (56%) 172 (25%) 687
          ----------------------------------------------
• What can we conclude? Is happiness depending on Income? Or
  Happiness is independent of Income?                               27
• Inference questions for later chapters?
Correlation
• Correlation describes strength of association between
  two variables
• Falls between -1 and +1, with sign indicating direction of
  association (formula & other details later )




                                                                     Venkat Reddy
                                                               Data Analysis Course
• The larger the correlation in absolute value, the stronger
  the association (in terms of a straight line trend)
• Examples: (positive or negative, how strong?)
  • Mental impairment and life events, correlation =
  • GDP and fertility, correlation =
  • GDP and percent using Internet, correlation =
                                                                   28
Strength of Association
• Correlation 0 No linear association
• Correlation 0 to 0.25 Negligible positive
  association
• Correlation 0.25-0.5  Weak positive
  association
• Correlation 0.5-0.75 Moderate positive




                                                     Venkat Reddy
                                               Data Analysis Course
  association
• Correlation >0.75 Very Strong positive
  association
• What are the limits for negative
  correlation




                                                   29
Regression
• Regression analysis gives line predicting y using
  x(algorithm & other details later )

• y = college GPA, x = high school GPA




                                                            Venkat Reddy
                                                      Data Analysis Course
• Predicted y = 0.234 + 1.002(x)




                                                          30
Lab
• Create a contingency table for product sales data
• Find contingency tables for
  • Region by product type
  • Division by Product type




                                                                         Venkat Reddy
                                                                   Data Analysis Course
• Find the correlation between actual sales and predicted sales.
• Find the correlation between weight & msrp in cars data




                                                                       31
Venkat Reddy Konasani
Manager at Trendwise Analytics
venkat@TrendwiseAnalytics.com
21.venkat@gmail.com




                                          Venkat Reddy
                                    Data Analysis Course
+91 9886 768879
www.TrendwiseAnalytics.com/venkat




                                        32

Contenu connexe

Tendances

Statistics-Measures of dispersions
Statistics-Measures of dispersionsStatistics-Measures of dispersions
Statistics-Measures of dispersions
Capricorn
 
Central limit theorem
Central limit theoremCentral limit theorem
Central limit theorem
Vijeesh Soman
 
Statistical inference
Statistical inferenceStatistical inference
Statistical inference
Jags Jagdish
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
Aileen Balbido
 

Tendances (20)

Statistics-Measures of dispersions
Statistics-Measures of dispersionsStatistics-Measures of dispersions
Statistics-Measures of dispersions
 
Basic Concepts of Inferential statistics
Basic Concepts of Inferential statisticsBasic Concepts of Inferential statistics
Basic Concepts of Inferential statistics
 
Descriptive Statistics
Descriptive StatisticsDescriptive Statistics
Descriptive Statistics
 
Normal Distribution
Normal DistributionNormal Distribution
Normal Distribution
 
Quantitative analysis
Quantitative analysisQuantitative analysis
Quantitative analysis
 
Central limit theorem
Central limit theoremCentral limit theorem
Central limit theorem
 
Statistical inference
Statistical inferenceStatistical inference
Statistical inference
 
Frequency Distributions
Frequency DistributionsFrequency Distributions
Frequency Distributions
 
Torturing numbers - Descriptive Statistics for Growers (2013)
Torturing numbers - Descriptive Statistics for Growers (2013)Torturing numbers - Descriptive Statistics for Growers (2013)
Torturing numbers - Descriptive Statistics for Growers (2013)
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
T distribution | Statistics
T distribution | StatisticsT distribution | Statistics
T distribution | Statistics
 
Measures of Variation
Measures of VariationMeasures of Variation
Measures of Variation
 
Test for equal variances
Test for equal variancesTest for equal variances
Test for equal variances
 
Data Analysis and Statistics
Data Analysis and StatisticsData Analysis and Statistics
Data Analysis and Statistics
 
Analysis and Interpretation of Data
Analysis and Interpretation of DataAnalysis and Interpretation of Data
Analysis and Interpretation of Data
 
Statistical inference: Estimation
Statistical inference: EstimationStatistical inference: Estimation
Statistical inference: Estimation
 
Introduction to Statistics
Introduction to StatisticsIntroduction to Statistics
Introduction to Statistics
 
Introduction to Statistics - Basic concepts
Introduction to Statistics - Basic conceptsIntroduction to Statistics - Basic concepts
Introduction to Statistics - Basic concepts
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Measure of central tendency
Measure of central tendencyMeasure of central tendency
Measure of central tendency
 

Similaire à Descriptive statistics

LESSON 4_UNGROUPED.pptx.pdf
LESSON  4_UNGROUPED.pptx.pdfLESSON  4_UNGROUPED.pptx.pdf
LESSON 4_UNGROUPED.pptx.pdf
nnzuliyana2
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
Turi, Inc.
 

Similaire à Descriptive statistics (20)

Data analysis Design Document
Data analysis Design DocumentData analysis Design Document
Data analysis Design Document
 
Timeseries forecasting
Timeseries forecastingTimeseries forecasting
Timeseries forecasting
 
Multiple regression
Multiple regressionMultiple regression
Multiple regression
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
 
Testing of hypothesis
Testing of hypothesisTesting of hypothesis
Testing of hypothesis
 
Analysing & interpreting data.ppt
Analysing & interpreting data.pptAnalysing & interpreting data.ppt
Analysing & interpreting data.ppt
 
LESSON 4_UNGROUPED.pptx.pdf
LESSON  4_UNGROUPED.pptx.pdfLESSON  4_UNGROUPED.pptx.pdf
LESSON 4_UNGROUPED.pptx.pdf
 
The zen of predictive modelling
The zen of predictive modellingThe zen of predictive modelling
The zen of predictive modelling
 
Be a pro in statistics
Be a pro in statisticsBe a pro in statistics
Be a pro in statistics
 
Data Analysis, Intepretation
Data Analysis, IntepretationData Analysis, Intepretation
Data Analysis, Intepretation
 
Data Analysis
Data AnalysisData Analysis
Data Analysis
 
Data analysis
Data analysisData analysis
Data analysis
 
Statistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignoreStatistics in the age of data science, issues you can not ignore
Statistics in the age of data science, issues you can not ignore
 
T7 data analysis
T7 data analysisT7 data analysis
T7 data analysis
 
Kevin Swingler: Introduction to Data Mining
Kevin Swingler: Introduction to Data MiningKevin Swingler: Introduction to Data Mining
Kevin Swingler: Introduction to Data Mining
 
Introduction to Data Analytics with R
Introduction to Data Analytics with RIntroduction to Data Analytics with R
Introduction to Data Analytics with R
 
Calibration of weights in surveys with nonresponse and frame imperfections
Calibration of weights in surveys with nonresponse and frame imperfectionsCalibration of weights in surveys with nonresponse and frame imperfections
Calibration of weights in surveys with nonresponse and frame imperfections
 

Plus de Venkata Reddy Konasani

Plus de Venkata Reddy Konasani (20)

Transformers 101
Transformers 101 Transformers 101
Transformers 101
 
Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science Machine Learning Deep Learning AI and Data Science
Machine Learning Deep Learning AI and Data Science
 
Model selection and cross validation techniques
Model selection and cross validation techniquesModel selection and cross validation techniques
Model selection and cross validation techniques
 
Neural Network Part-2
Neural Network Part-2Neural Network Part-2
Neural Network Part-2
 
GBM theory code and parameters
GBM theory code and parametersGBM theory code and parameters
GBM theory code and parameters
 
Neural Networks made easy
Neural Networks made easyNeural Networks made easy
Neural Networks made easy
 
Decision tree
Decision treeDecision tree
Decision tree
 
Step By Step Guide to Learn R
Step By Step Guide to Learn RStep By Step Guide to Learn R
Step By Step Guide to Learn R
 
Credit Risk Model Building Steps
Credit Risk Model Building StepsCredit Risk Model Building Steps
Credit Risk Model Building Steps
 
Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS Table of Contents - Practical Business Analytics using SAS
Table of Contents - Practical Business Analytics using SAS
 
SAS basics Step by step learning
SAS basics Step by step learningSAS basics Step by step learning
SAS basics Step by step learning
 
Testing of hypothesis case study
Testing of hypothesis case study Testing of hypothesis case study
Testing of hypothesis case study
 
L101 predictive modeling case_study
L101 predictive modeling case_studyL101 predictive modeling case_study
L101 predictive modeling case_study
 
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau -  Data, Graphs, Filters, Dashboards and Advanced featuresLearning Tableau -  Data, Graphs, Filters, Dashboards and Advanced features
Learning Tableau - Data, Graphs, Filters, Dashboards and Advanced features
 
Machine Learning for Dummies
Machine Learning for DummiesMachine Learning for Dummies
Machine Learning for Dummies
 
Online data sources for analaysis
Online data sources for analaysis Online data sources for analaysis
Online data sources for analaysis
 
A data analyst view of Bigdata
A data analyst view of Bigdata A data analyst view of Bigdata
A data analyst view of Bigdata
 
R- Introduction
R- IntroductionR- Introduction
R- Introduction
 
Cluster Analysis for Dummies
Cluster Analysis for DummiesCluster Analysis for Dummies
Cluster Analysis for Dummies
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitization
 

Dernier

Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
KarakKing
 

Dernier (20)

On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 

Descriptive statistics

  • 1. Data Analysis Course Descriptive Statistics(Version-1) Venkat Reddy
  • 2. Data Analysis Course • Data analysis design document • Introduction to statistical data analysis • Descriptive statistics • Data exploration, validation & sanitization Venkat Reddy Data Analysis Course • Probability distributions examples and applications • Simple correlation and regression analysis • Multiple liner regression analysis • Logistic regression analysis • Testing of hypothesis • Clustering and decision trees • Time series analysis and forecasting • Credit Risk Model building-1 2 • Credit Risk Model building-2
  • 3. Note • This presentation is just class notes. The course notes for Data Analysis Training is by written by me, as an aid for myself. • The best way to treat this is as a high-level summary; the actual session went more in depth and contained other Venkat Reddy Data Analysis Course information. • Most of this material was written as informal notes, not intended for publication • Please send questions/comments/corrections to venkat@trenwiseanalytics.com or 21.venkat@gmail.com • Please check my website for latest version of this document -Venkat Reddy 3
  • 4. Contents • What are Descriptive statistics • Frequency tables and graphs, Histograms • Central Tendency • Mean, Median, Mode Venkat Reddy Data Analysis Course • Dispersion • Range, variance, standard deviation • Quartiles, Percentiles • Box Plots • Bivariate Descriptive Statistics • Contingency Tables • Correlation 4 • Regression
  • 5. Why Descriptive statistics? • Who is a better ODI batsmen - Sachin or Muralidharan? • Batting average? • Who is the reliable- Dhoni or Afridi? • Score variance • A triangular series among Aus, Eng & Newziland ; Who will win? • Most number of wins - Mode Venkat Reddy Data Analysis Course • I am going to buy shoes. Which brand has verity- Power or Adidas? • Price range - Range • We used Average, Variance, Mode, Range to make some inferences. These are nothing but descriptive statistics • Descriptive statistics tell us what happened in the past. • Descriptive statistics avoid inferences but, they help us to get a feel of the data. • Some times they are good enough to make an inference. 5
  • 6. Descriptive Statistics • A statistic or a measure that describes the data • Average salary of employees • Describing data with tables and graphs (quantitative or categorical variables) Venkat Reddy Data Analysis Course • Numerical descriptions • Center – Give some example measures of center of the data • Variability– Give some example measures of variability of the data • Bivariate descriptions (In practice, most studies have several variables) • Dependency measures(Correlation) 6
  • 7. Simple Descriptive Statistics • N • Sum • Min • Max Venkat Reddy Data Analysis Course • Average • Frequency of each level • Variance • Standard deviation These simple descriptive statistics will be use in inferential statistics later. 7
  • 8. Frequency tables & Histograms • Frequency distribution: Lists possible values of variable and number of times each occurs Venkat Reddy Data Analysis Course 8
  • 9. Shapes of histograms • Bell-shaped (IQ, SAT, political ideology in all U.S. ) • Skewed right • Example Annual income • No. times arrested Venkat Reddy Data Analysis Course • Skewed left • Score on easy exam • Daily level if excitement in office • Bimodal • Hardworking days in a year (Peaks near Mid year & year end Appraisal) 9
  • 10. Lab : Histogram • Create a histogram on variable ‘actual’ in prdsale data • How many modes? • What is the skewness? • What is its kurtosis? • Create a histogram on variable ‘msrp’ in cars data Venkat Reddy Data Analysis Course • How many modes? • What is the skewness? • What is its kurtosis? • Create a histogram on variable ‘weight’ in cars data • How many modes? • What is the skewness? • What is its kurtosis? 10 Compare the above three histograms.
  • 11. Central tendency • What is the flight fare from Bangalore to Delhi? 3500–Exact or average? • What is central tendency? - Average • Three types of Averages Venkat Reddy Data Analysis Course • Mean • Median • Mode 11
  • 12. Mean • Center of gravity • Evenly partitions the sum of all measurement among all cases; average of all measures n x Venkat Reddy Data Analysis Course i x i 1 n • Crucial for inferential statistics • Mean is not very resistant to outliers –See in Median 12
  • 13. Median • What is the mean of [0.1 0.8 0.4 0.3 0.1 0.4 9.0 0.1 0.9 0.3 1.0 0.3 0.1] • Guess without calculation – Around 0.5? • Now calculate the mean Venkat Reddy Data Analysis Course • Median is exactly in the middle. Isn’t mean exactly in the middle • Order the observations in ascending or descending order and pick the middle observation • less useful for inferential purposes 13 • More resistant to effects of outliers…
  • 14. Calculation of Median rim diameter (cm) unit 1 unit 2 9.7 9.0 11.5 11.2 11.6 11.3 Venkat Reddy Data Analysis Course 12.1 11.7 12.4 12.2 12.6 12.5 12.9 <-- 13.2 13.2 13.1 13.8 13.5 14.0 13.6 15.5 14.8 15.6 16.3 16.2 26.9 16.4 14
  • 15. Mode • How do you express average size of the shoes ? • 6.567 or 6? • Mode is the most numerous category • Can be more or less created by the grouping procedure Venkat Reddy Data Analysis Course • For theoretical distributions—simply the location of the peak on the frequency distribution 15
  • 16. Lab • Run Proc means data product data • What is the mean of ‘msrp’ in cars data? • Is it reflecting the average value of price? • What is median of ‘msrp’ in cars data? Venkat Reddy Data Analysis Course • Is it reflecting the average value of price? • Run Proc Univariate on weight varaibale in cars data. Find mean, Median & Mode. 16
  • 17. Dispersion Person1: What is the average depth of this river? 5 feet Person2: I am 5.5 I can easily cross it(and starts crossing it) Person 2: Help….help. Person 1: Some times just knowing the central tendency is not Venkat Reddy Data Analysis Course sufficient • Measures of dispersion summarize the degree of clustering/spread of cases, esp. with respect to central tendency… • range • variance • standard deviation 17
  • 18. Range unit 1 unit 2 • Max –Min 9.7 9.0 11.5 11.2 11.6 11.3 12.1 11.7 12.4 12.2 Venkat Reddy Data Analysis Course 12.6 12.5 R: range(x) 13.1 13.2 13.5 13.8 13.6 14.0 14.8 15.5 16.3 15.6 26.9 16.2 16.4 18
  • 19. Variance • Take deviation from Mean- It can be zero some times • Hence take square of deviation from mean  Take average of that • Average mean squared distance is variance Venkat Reddy Data Analysis Course n  x  x  2 i 2  i 1 n • Units of variance are squared… this makes variance hard to interpret • Eg : Mean length = 22.6 mm variance = 38 mm2 • What does this mean??? –I don’t Know 19
  • 20. Standard Deviation • Square root of variance n  xi  x 2 s i 1 n Venkat Reddy Data Analysis Course • Units are in same units as base measurements • Mean = 22.6 mm standard deviation = 6.2 mm • Mean +/- sd (16.4—28.8 mm) • should give at least some intuitive sense of where most of the cases lie, barring major effects of outliers 20
  • 21. Quartiles & Percentiles • pth percentile: p percent of observations below it, (100 - p)% above it. • Like 95% of CAT percentile means 5% are above & 95% are below • 1,2,3,4,5,6,7,8,9,10 - What is 25th percentile? Venkat Reddy Data Analysis Course • 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20 - What is 25th percentile? What is 80th percentile? • p = 50: median • p = 25: lower quartile (LQ) • p = 75: upper quartile (UQ) • Interquartile range IQR = UQ - LQ 21
  • 22. Box Plots • Quartiles portrayed graphically by box plots Venkat Reddy Data Analysis Course 22
  • 23. Box Plots Venkat Reddy Data Analysis Course Example: weekly TV watching for n=60, 3 outliers 23
  • 24. Box Plots Interpretation • Box plots have box from LQ to UQ, with median marked. They portray a five-number summary of the data: Minimum, LQ, Median, UQ, Maximum • Except for outliers identified separately • Outlier = observation falling Venkat Reddy Data Analysis Course below LQ – 1.5(IQR) or above UQ + 1.5(IQR) • Ex. If LQ = 2, UQ = 10, then IQR = 8 and outliers above 10 + 1.5(8) = 22 24
  • 25. Lab • Run proc univariate on a variable from sample data in sas default library(prd sale / cars) • Run proc means on actual & predicted variables from product sales data • What are the values of Range, Variance, SD Venkat Reddy Data Analysis Course • What are 1,2,3 & 4 quartile values • What is 95th percentile? • Use “all” option to display the box plots 25
  • 26. Contingency Tables • Cross classifications of categorical variables in which rows (typically) represent categories of explanatory variable and columns represent categories of response variable. • Counts in “cells” of the table give the numbers of individuals at the corresponding combination of levels of the two variables Venkat Reddy Data Analysis Course Example: Happiness and Family Income of 1993 families (GSS 2008 data: “happy,” “finrela”) Happiness Income Very Pretty Nottoo Total ------------------------------- Above Aver. 164 233 26 423 Average 293 473 117 883 Below Aver. 132 383 172 687 ------------------------------ Total 589 1089 315 1993 26
  • 27. Contingency tables • Example: Percentage “very happy” is • 39% for above average income (164/423 = 0.39) • 33% for average income (293/883 = 0.33) • What percent for below average income? Venkat Reddy Data Analysis Course Happiness Income Very Pretty Not oo Total -------------------------------------------- Above 164 (39%) 233 (55%) 26 (6%) 423 Average 293 (33%) 473 (54%) 117 (13%) 883 Below 132 (19%) 383 (56%) 172 (25%) 687 ---------------------------------------------- • What can we conclude? Is happiness depending on Income? Or Happiness is independent of Income? 27 • Inference questions for later chapters?
  • 28. Correlation • Correlation describes strength of association between two variables • Falls between -1 and +1, with sign indicating direction of association (formula & other details later ) Venkat Reddy Data Analysis Course • The larger the correlation in absolute value, the stronger the association (in terms of a straight line trend) • Examples: (positive or negative, how strong?) • Mental impairment and life events, correlation = • GDP and fertility, correlation = • GDP and percent using Internet, correlation = 28
  • 29. Strength of Association • Correlation 0 No linear association • Correlation 0 to 0.25 Negligible positive association • Correlation 0.25-0.5  Weak positive association • Correlation 0.5-0.75 Moderate positive Venkat Reddy Data Analysis Course association • Correlation >0.75 Very Strong positive association • What are the limits for negative correlation 29
  • 30. Regression • Regression analysis gives line predicting y using x(algorithm & other details later ) • y = college GPA, x = high school GPA Venkat Reddy Data Analysis Course • Predicted y = 0.234 + 1.002(x) 30
  • 31. Lab • Create a contingency table for product sales data • Find contingency tables for • Region by product type • Division by Product type Venkat Reddy Data Analysis Course • Find the correlation between actual sales and predicted sales. • Find the correlation between weight & msrp in cars data 31
  • 32. Venkat Reddy Konasani Manager at Trendwise Analytics venkat@TrendwiseAnalytics.com 21.venkat@gmail.com Venkat Reddy Data Analysis Course +91 9886 768879 www.TrendwiseAnalytics.com/venkat 32