Author(s): Alex Ocampo, Chong Zhang, Sangwon Hyun, Yiqun Hu, 2012

License: Unless otherwise noted, this material is made available under the terms
of the Creative Commons Attribution – Noncommercial – Share Alike 3.0 License:
http://creativecommons.org/licenses/by-nc-sa/3.0/
We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability
to use, share, and adapt it. The citation key on the following slide provides information about how you may share
and adapt this material.

Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions,
corrections, or clarification regarding the use of content.

For more information about how to cite these materials visit http://open.umich.edu/education/about/terms-of-use.
Attribution Key
                        for more information see: http://open.umich.edu/wiki/AttributionPolicy



Use + Share + Adapt
  { Content the copyright holder, author, or law permits you to use, share and adapt. }
               Public Domain – Government: Works that are produced by the U.S. Government. (17 USC § 105)
               Public Domain – Expired: Works that are no longer protected due to an expired copyright term.
               Public Domain – Self Dedicated: Works that a copyright holder has dedicated to the public domain.

               Creative Commons – Zero Waiver

               Creative Commons – Attribution License
               Creative Commons – Attribution Share Alike License
               Creative Commons – Attribution Noncommercial License
               Creative Commons – Attribution Noncommercial Share Alike License
               GNU – Free Documentation License

Make Your Own Assessment
  { Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for copyright. }
               Public Domain – Ineligible: Works that are ineligible for copyright protection in the U.S. (17 USC § 102(b)) *laws in your jurisdiction may differ
   { Content Open.Michigan has used under a Fair Use determination. }
               Fair Use: Use of works that is determined to be Fair consistent with the U.S. Copyright Act. (17 USC § 107) *laws in your jurisdiction may differ
               Our determination DOES NOT mean that all uses of this 3rd-party content are Fair Uses and we DO NOT guarantee that your use of the content is Fair.
               To use this content you should do your own independent analysis to determine whether or not your use will be Fair.
Descriptive Statistics
quantitatively describe the main features of a collection of data.



[Figure: cartoon with speech bubbles "How do salaries vary across the company?" and "What should I make of all this???!!!"; labels: manager, employee, Staff. Jones, HR]
Descriptive Statistics in R

Mean                        > mean(x)
Trimmed mean                > mean(x, trim=a)
Median                      > median(x)
Mode                        > sort(table(x))      (the most frequent value is listed last)
Standard deviation          > sd(x)
Variance                    > var(x)
Median absolute deviation   > mad(x)
Interquartile range         > IQR(x)
Range                       > range(x)
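As a small illustration of the functions listed above, the sketch below applies them to a made-up vector. The salary values are hypothetical and chosen only so that one extreme value shows how the mean, trimmed mean, and median react differently.

```r
# Hypothetical sample: ten salaries (in thousands), with one extreme value.
x <- c(32, 35, 35, 38, 40, 41, 45, 47, 52, 250)

mean_x    <- mean(x)               # arithmetic mean, pulled up by the 250
trimmed_x <- mean(x, trim = 0.1)   # drops the lowest and highest 10% first
median_x  <- median(x)             # middle value, robust to the extreme value

# table() counts each distinct value; sort() orders the counts ascending,
# so the most frequent value (the mode) is the last name in the result.
counts <- sort(table(x))
mode_x <- as.numeric(names(counts)[length(counts)])

# Measures of spread, collected in one named vector.
spread <- c(sd = sd(x), var = var(x), mad = mad(x), iqr = IQR(x))

range_x <- range(x)                # c(minimum, maximum)
```

Here the mean (61.5) sits well above the median (40.5), which is the usual sign of a skewed sample or an outlier.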
Data Dimensions

> length(x)
[1] 1000
-------------------------
Matrix X
> nrow(X)
[1] 2034
> ncol(X)
[1] 100000
> dim(X)
[1] 2034 100000
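A minimal sketch of the same calls on small stand-in objects follows. The sizes are made up for illustration and are not the data from the slide.

```r
x <- seq_len(1000)                    # a vector with 1000 elements
X <- matrix(0, nrow = 20, ncol = 5)   # small stand-in for the large matrix X

n_elements <- length(x)   # number of elements in the vector
n_rows     <- nrow(X)     # number of rows
n_cols     <- ncol(X)     # number of columns
dims       <- dim(X)      # c(rows, columns)

# For a matrix, length() counts cells, i.e. rows * columns.
n_cells <- length(X)
```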
Vectorization in R
Matrix X
> apply( X, MARGIN=1, FUN= mean)     (mean of each row)
> apply( X, MARGIN=2, FUN= mean)     (mean of each column)
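To make the role of MARGIN concrete, here is a sketch on a tiny made-up matrix. The comparison with rowMeans() and colMeans() is an addition for illustration; both are base R functions specialized for this case.

```r
# matrix() fills column by column, so the rows are (1, 3, 5) and (2, 4, 6).
X <- matrix(1:6, nrow = 2)

row_means <- apply(X, MARGIN = 1, FUN = mean)   # one value per row: 3, 4
col_means <- apply(X, MARGIN = 2, FUN = mean)   # one value per column: 1.5, 3.5, 5.5

# rowMeans() and colMeans() compute the same quantities directly.
same_rows <- isTRUE(all.equal(row_means, rowMeans(X)))
same_cols <- isTRUE(all.equal(col_means, colMeans(X)))
```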
> boxplot(X)
• Good for small data sets
• Easy to compare groups side by side
• Points more than 1.5*IQR beyond the quartiles are drawn as outliers
[Figure: side-by-side boxplots of epiE, epiS, epiImp, epilie, epiNeur; y-axis from 0 to 25]
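The 1.5*IQR rule can be inspected without drawing anything by calling boxplot.stats(), the base R helper that computes the numbers behind a boxplot. The sketch below uses a made-up vector; note that boxplot.stats() measures the box with hinges, which can differ slightly from the quartiles returned by IQR().

```r
# Hypothetical data with one value far from the rest.
x <- c(1, 2, 3, 4, 5, 100)

stats <- boxplot.stats(x, coef = 1.5)

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
five_numbers <- stats$stats

# Values beyond 1.5 times the box length from the hinges.
outliers <- stats$out

# Draw the plot only in an interactive session.
if (interactive()) boxplot(x)
```

With these values the box runs from 2 to 5, so the upper fence is 5 + 1.5 * 3 = 9.5 and only 100 is flagged.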
The Big Six
 Minimum, 1st Q, Median, Mean, 3rd Q, Maximum

> summary(X)
R tries to understand you

            > summary(X)
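A short sketch of how summary() adapts to its input: the same call returns the six numbers for a numeric vector and counts per level for a factor. The data below are made up for illustration.

```r
scores <- c(2, 4, 4, 5, 7, 20)                       # numeric variable
group  <- factor(c("a", "b", "b", "c", "b", "a"))    # categorical variable

num_summary <- summary(scores)   # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
fac_summary <- summary(group)    # number of observations in each level

# On a data frame, summary() picks the matching method for every column.
df_summary <- summary(data.frame(scores, group))
```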
Histograms: > hist(X)
[Figure: histograms of epiE, epiS, epiImp, epilie, epiNeur, bfagree, bfcon, bfext, bfneur, bfopen, and bdi; y-axis: Frequency]
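Base R's hist() draws one histogram from a numeric vector, so one way to get a panel per variable is to loop over the columns. This is a sketch under that assumption, with a made-up two-column data frame; it is not necessarily how the figure on the slide was produced.

```r
# Hypothetical data frame with two numeric columns.
X <- data.frame(a = c(1, 2, 2, 3, 3, 3),
                b = c(10, 20, 20, 30, 40, 50))

# Draw one histogram per column, side by side (interactive sessions only).
if (interactive()) {
  op <- par(mfrow = c(1, ncol(X)))
  for (name in names(X)) hist(X[[name]], main = name, xlab = name)
  par(op)   # restore the previous layout
}

# plot = FALSE returns the bin breaks and counts without drawing.
h <- hist(X$a, plot = FALSE)
bin_counts <- h$counts
```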
Correlation
> cor(wt,mpg)
[1] -0.8676594
> plot(x=wt,y=mpg)
[Figure: Scatterplot Example; x-axis: Car Weight, y-axis: Miles Per Gallon]
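The slide does not say where wt and mpg come from. The sketch below assumes they are the columns of the built-in mtcars data set, which reproduces the correlation shown above.

```r
# Assumption: wt and mpg are columns of the built-in mtcars data set.
r <- with(mtcars, cor(wt, mpg))   # Pearson correlation by default

# Scatterplot with the axis labels used on the slide (interactive sessions only).
if (interactive()) {
  with(mtcars, plot(x = wt, y = mpg,
                    xlab = "Car Weight", ylab = "Miles Per Gallon",
                    main = "Scatterplot Example"))
}
```

A value near -0.87 indicates a strong negative linear association: heavier cars tend to get fewer miles per gallon.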
Scatterplot Matrix




• Iris dataset
• 150 flowers
• 5 variables
[Image: photo credited to Goingslo, flickr]
Scatterplot Matrix
> pairs(data)
[Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species; legend: setosa, versicolor, virginica]
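A sketch of the call on the built-in iris data set follows. Coloring the points by species is an illustrative choice, and the color names are arbitrary.

```r
data <- iris        # built-in data set: 150 flowers, 5 variables
dims <- dim(data)

# One color per observation, looked up from the species name.
palette_by_species <- c(setosa = "red", versicolor = "green3", virginica = "blue")
point_colors <- palette_by_species[as.character(data$Species)]

# pairs() plots every variable against every other one (interactive sessions only).
if (interactive()) pairs(data, col = point_colors)
```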
> coplot(lat ~ long | depth)
[Figure: conditioning plot of lat against long, in panels given depth (100 to 600)]
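The data set is not named on the slide. The sketch below assumes it is the built-in quakes data set, which has lat, long, and depth columns. co.intervals() is the base R helper that coplot() uses by default to cut the conditioning variable into overlapping ranges.

```r
# Assumption: lat, long, and depth come from the built-in quakes data set.
# Six overlapping depth ranges, one per panel (these are coplot's defaults).
intervals <- co.intervals(quakes$depth, number = 6, overlap = 0.5)

# One lat-vs-long scatterplot per depth range (interactive sessions only).
if (interactive()) coplot(lat ~ long | depth, data = quakes)
```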
Linear Regression

 Why?
  Prediction of future or unknown observations
  Assessment of relationship between variables
  General description of data structure
 What?
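As a minimal sketch of the first two uses listed above (prediction and assessing a relationship), the example below fits a one-predictor model on the built-in mtcars data set. The choice of data set and of a 3000 lb car is for illustration only.

```r
# Fit miles per gallon as a linear function of weight (in 1000 lbs).
fit <- lm(mpg ~ wt, data = mtcars)

# Relationship: the slope is the expected change in mpg per unit of weight.
slope <- coef(fit)[["wt"]]

# Prediction: expected mpg for a hypothetical car weighing 3000 lbs.
new_car   <- data.frame(wt = 3)
predicted <- unname(predict(fit, newdata = new_car))

# Description: share of the variance in mpg explained by weight.
r_squared <- summary(fit)$r.squared
```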
Variable Selection

 Why?
   Simplification
   Elimination of multicollinearity and noise
   Time and money saving
 How?
   Testing-based Variable Selection Methods
     - Backward, Forward, Stepwise
   Criterion-based Procedures


 What?
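   AIC = n ln(RSS/n) + 2p, where n is the number of observations, RSS is the residual sum of squares, and p is the number of estimated coefficients (smaller is better)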
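The formula can be checked by hand against extractAIC(), the function step() relies on for linear models. The model below is an arbitrary one on the built-in mtcars data set, used only to have numbers to plug in.

```r
# Any linear model will do; this one is just for illustration.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)             # number of observations
rss <- sum(residuals(fit)^2)    # residual sum of squares
p   <- length(coef(fit))        # estimated coefficients, intercept included

aic_manual <- n * log(rss / n) + 2 * p

# extractAIC() returns c(equivalent degrees of freedom, AIC).
aic_step <- extractAIC(fit)[2]
```

Note that AIC() in base R includes additional constant terms, so its value differs from this one by a constant for a given data set; the ranking of models is the same.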
Example: U.S. State Facts and Figures

 Life Expectancy
    Population, Income, Illiteracy, Murder, HS Grad, Frost, Area

 Selected R code
    Linear Regression
      > g <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder
                            + HS.Grad + Frost + Area, data = statedata)
      > summary(g)
      Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
      (Intercept)   7.094e+01  1.748e+00  40.586  < 2e-16 ***
      Population    5.180e-05  2.919e-05   1.775   0.0832 .
      Income       -2.180e-05  2.444e-04  -0.089   0.9293
      Illiteracy    3.382e-02  3.663e-01   0.092   0.9269
      Murder       -3.011e-01  4.662e-02  -6.459 8.68e-08 ***
      HS.Grad       4.893e-02  2.332e-02   2.098   0.0420 *
      Frost        -5.735e-03  3.143e-03  -1.825   0.0752 .
      Area         -7.383e-08  1.668e-06  -0.044   0.9649

      > anova(g)
      Analysis of Variance Table
      Response: Life.Exp
                 Df  Sum Sq Mean Sq F value    Pr(>F)
      Population  1  0.4089  0.4089  0.7372  0.395434
      Income      1 11.5946 11.5946 20.9028 4.218e-05 ***
      Illiteracy  1 19.4207 19.4207 35.0116 5.228e-07 ***
      Murder      1 27.4288 27.4288 49.4486 1.308e-08 ***
      HS.Grad     1  4.0989  4.0989  7.3895  0.009494 **
      Frost       1  2.0488  2.0488  3.6935  0.061426 .
      Area        1  0.0011  0.0011  0.0020  0.964908
      Residuals  42 23.2971  0.5547

    AIC
      > step(g)
      AIC = n ln(RSS/n) + 2(p)
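The slide does not show how statedata was created. The sketch below assumes it was built from the built-in state.x77 matrix, which yields the same variable names, and then plugs the fitted model into the AIC formula.

```r
# Assumption: statedata comes from the built-in state.x77 matrix.
# data.frame() turns "Life Exp" and "HS Grad" into Life.Exp and HS.Grad.
statedata <- data.frame(state.x77)

g <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder
        + HS.Grad + Frost + Area, data = statedata)

coef_table  <- summary(g)$coefficients   # estimates, std. errors, t and p values
anova_table <- anova(g)                  # sequential sums of squares

# AIC = n ln(RSS/n) + 2p for the full model.
n   <- nrow(statedata)     # 50 states
rss <- deviance(g)         # residual sum of squares
p   <- length(coef(g))     # 7 predictors plus the intercept
aic_full <- n * log(rss / n) + 2 * p
```

Under this assumption the result is about -22.18, the starting value that step() reports.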
Continued: U.S. State Facts and Figures
Start: AIC=-22.18
Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area

            Df Sum of Sq      RSS       AIC
- Area       1    0.0011   23.298   -24.182
- Income     1    0.0044   23.302   -24.175
- Illiteracy 1    0.0047   23.302   -24.174
<none>                     23.297   -22.185
- Population 1    1.7472   25.044   -20.569
- Frost      1    1.8466   25.144   -20.371
- HS.Grad    1    2.4413   25.738   -19.202
- Murder     1   23.1411   46.438    10.305

Step: AIC=-24.18
Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost

            Df Sum of Sq      RSS       AIC
- Illiteracy 1    0.0038   23.302   -26.174
- Income     1    0.0059   23.304   -26.170
<none>                     23.298   -24.182
- Population 1    1.7599   25.058   -22.541
- Frost      1    2.0488   25.347   -21.968
- HS.Grad    1    2.9804   26.279   -20.163
- Murder     1   26.2721   49.570    11.569
Continued: U.S. State Facts and Figures
Step: AIC=-28.16
Life.Exp ~ Population + Murder + HS.Grad + Frost

             Df Sum of Sq      RSS       AIC
<none>                      23.308   -28.161
- Population  1     2.064   25.372   -25.920
- Frost       1     3.122   26.430   -23.877
- HS.Grad     1     5.112   28.420   -20.246
- Murder      1    34.816   58.124    15.528

Coefficients:
(Intercept)   Population       Murder      HS.Grad        Frost
  7.103e+01    5.014e-05   -3.001e-01    4.658e-02   -5.943e-03

[Figure: "Effect on Response Variable of One Unit Change of Predict Variable"; y-axis: Life Expectancy (70 to 73), x-axis: Predict Variables (Intercept, x1, x4, x5, x6); bar labels: 71.03, 0.00005014, 0.3001, 0.04658, 0.005943]
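The whole selection can be rerun in a few lines. As before, building statedata from the built-in state.x77 matrix is an assumption, and trace = 0 only suppresses the step-by-step printout.

```r
# Assumption: statedata comes from the built-in state.x77 matrix.
statedata <- data.frame(state.x77)

# Full model: life expectancy against all other variables.
g <- lm(Life.Exp ~ ., data = statedata)

# Backward elimination by AIC (the default direction when no scope is given).
final <- step(g, trace = 0)

selected  <- attr(terms(final), "term.labels")   # predictors that remain
aic_final <- extractAIC(final)[2]
final_coefficients <- coef(final)
```

Each step drops the predictor whose removal lowers the AIC the most and stops when no removal helps.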
What is Principal Component Analysis (PCA)?

 Two general approaches to reducing the number of variables:
 feature selection and feature extraction

  Feature selection: Akaike Information Criterion (AIC), BIC, or backward elimination
  Feature extraction: Principal Component Analysis (PCA) is the most widely used

      Creates several artificial variables (linear combinations of the originals)
      Built-in functions in R = Convenient!
Actual Pima Data

        pregnant glucose diastolic   triceps   insulin   bmi    diabetes   age   test
   1       6       148      72          35        0      33.6     0.627     50    1
   2       1        85      66          29        0      26.6     0.351     31    0
   3       8       183      64           0        0      23.3     0.672     32    1
   4       1        89      66          23        94     28.1     0.167     21    0
   5       0       137      40          35       168     43.1     2.288     33    1
   6       5       116      74           0        0      25.6     0.201     30    0

                                     ….

( Imagine a data set with many more (~1000) columns )

(Imagine a Linear Regression: Which variables affect diabetes in what ways?)
PCA Example: Pima Indians

 The National Institute of Diabetes and Digestive and Kidney Diseases conducted
  a study on 768 adult female Pima Indians living near Phoenix.
 9 variables (8 continuous, 1 categorical)
     pregnant: Number of times pregnant
     glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
     diastolic: Diastolic blood pressure (mm Hg)
     triceps: Triceps skin fold thickness (mm)
     insulin: 2-hour serum insulin (mu U/ml)
     bmi: Body mass index (weight in kg/(height in metres squared))
     diabetes: Diabetes pedigree function
     age: Age (years)
     test: Diabetes (coded 0 if negative, 1 if positive)
 Next Slide: PCA Implementation
What principal components might look like:


 PC1 : 1*Insulin + 0.01*Glucose + ..
 PC2 : 1*Glucose + 0.12*Age + 0.12*DiastolicBP + ..
 PC3 :     0.92 * DiastolicBP + 0.31*Triceps

    Principal components : What are they composed of?
       (less important)


    Difference with Linear Regression
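To make "what are they composed of" concrete: each principal component is a weighted sum of the centered original variables, with the weights stored in the rotation matrix. The sketch below verifies this on the numeric columns of the built-in iris data set, used here as a stand-in because the Pima data are not part of base R.

```r
# Stand-in data: the four numeric iris measurements.
X <- as.matrix(iris[, 1:4])

pca      <- prcomp(X)       # centers the columns by default, no rescaling
loadings <- pca$rotation    # one column of weights per principal component

# Rebuild the component scores by hand: centered data times the weights.
centered      <- scale(X, center = TRUE, scale = FALSE)
scores_manual <- centered %*% loadings

# Largest discrepancy from the scores that prcomp() reports.
max_diff <- max(abs(scores_manual - pca$x))

# Each column of weights has unit length.
pc1_norm <- sum(loadings[, "PC1"]^2)
```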
- Goal: obtain summary about data in lower dimensions
-- How many dimensions?
[Figure: biplot with PC1 on the x-axis and PC2 on the y-axis; observations drawn as "+", variables labeled insulin, triceps, pregnant, age, bmi, diastolic, glucose]
- R code in the next slide:
Brief : R-Code

> data.pca <- prcomp(data[,-9]); summary(data.pca)
Importance of components:
                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     116.002 30.5411 19.7630 14.0777 10.6155 6.76973 2.78575
Proportion of Variance   0.889  0.0616  0.0258  0.0131 0.00744 0.00303 0.00051
Cumulative Proportion    0.889  0.950   0.976   0.9890 0.996   0.999   1.00000

> data.pca
Rotation:
             PC1   PC2   PC3   PC4    PC5    PC6    PC7
pregnant   0.002 -0.02  0.02  0.05  2e-01 -0.005 -1e+00
glucose   -0.098 -0.97 -0.14 -0.12 -9e-02  0.051 -9e-04
diastolic -0.016 -0.14  0.92  0.26 -2e-01  0.076  1e-03
triceps   -0.061  0.06  0.31 -0.88  3e-01  0.221  4e-04
insulin   -0.993  0.09 -0.02  0.07 -2e-04 -0.006 -1e-03
bmi       -0.014 -0.05  0.13 -0.19  2e-02 -0.971  3e-03
age        0.004 -0.14  0.13  0.30  9e-01 -0.015  2e-01

> barplot(totalrep, main="Representation of Principal Components",
+         xlab="Principal Component", ylab="% of Total Variance")
> biplot(data.pca, xlabs=rep('+',768), xlim = c(-0.05,0.3), ylim = c(-0.15,0.12)); abline(h=0,v=0)
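The slide uses totalrep without defining it. One plausible definition, assumed here, is the proportion of total variance carried by each component. The sketch uses the numeric columns of the built-in iris data set as a stand-in, because the Pima data are not part of base R, and the 95% cutoff is an arbitrary example threshold.

```r
# Stand-in data: the four numeric iris measurements.
data <- iris[, 1:4]
data.pca <- prcomp(data)

# Assumed definition of totalrep: each component's share of the total variance.
variances <- data.pca$sdev^2
totalrep  <- variances / sum(variances)
names(totalrep) <- colnames(data.pca$rotation)

# "How many dimensions?": smallest number of components reaching 95%.
cumulative <- cumsum(totalrep)
k <- unname(which(cumulative >= 0.95)[1])

# Plots as on the slide (interactive sessions only).
if (interactive()) {
  barplot(totalrep, main = "Representation of Principal Components",
          xlab = "Principal Component", ylab = "% of Total Variance")
  biplot(data.pca, xlabs = rep("+", nrow(data)))
  abline(h = 0, v = 0)
}
```

The same proportions appear in the "Proportion of Variance" row of summary(data.pca).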
[Figure: bar chart "Representation of Principal Components"; x-axis: Principal Component, y-axis: % of Total Variance (0.0 to 0.5)]

 

A2DataDive workshop: Introduction to R

  • 1. Author(s): Alex Ocampo, Chong Zhang, Sangwon Hyun, Yiqun Hu, 2012 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution – Noncommercial – Share Alike 3.0 License: http://creativecommons.org/licenses/by-nc-sa/3.0/ We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use, share, and adapt it. The citation key on the following slide provides information about how you may share and adapt this material. Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of content. For more information about how to cite these materials visit http://open.umich.edu/education/about/terms-of-use.
  • 2. Attribution Key for more information see: http://open.umich.edu/wiki/AttributionPolicy Use + Share + Adapt { Content the copyright holder, author, or law permits you to use, share and adapt. } Public Domain – Government: Works that are produced by the U.S. Government. (17 USC § 105) Public Domain – Expired: Works that are no longer protected due to an expired copyright term. Public Domain – Self Dedicated: Works that a copyright holder has dedicated to the public domain. Creative Commons – Zero Waiver Creative Commons – Attribution License Creative Commons – Attribution Share Alike License Creative Commons – Attribution Noncommercial License Creative Commons – Attribution Noncommercial Share Alike License GNU – Free Documentation License Make Your Own Assessment { Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for copyright. } Public Domain – Ineligible: Works that are ineligible for copyright protection in the U.S. (17 USC § 102(b)) *laws in your jurisdiction may differ { Content Open.Michigan has used under a Fair Use determination. } Fair Use: Use of works that is determined to be Fair consistent with the U.S. Copyright Act. (17 USC § 107) *laws in your jurisdiction may differ Our determination DOES NOT mean that all uses of this 3rd-party content are Fair Uses and we DO NOT guarantee that your use of the content is Fair. To use this content you should do your own independent analysis to determine whether or not your use will be Fair.
  • 3. Descriptive Statistics quantitatively describe the main features of a collection of data. [Figure: cartoon with two speech bubbles, "How do salaries vary across the company?" and "What should I make of all this???!!!"; labels: employee, manager, Staff, Jones, HR]
  • 4. Descriptive Statistics in R
      Mean                        > mean(x); > mean(x,trim=a)
      Median                      > median(x)
      Mode                        > sort(table(x))
      Standard deviation          > sd(x)
      Variance                    > var(x)
      Median absolute deviation   > mad(c(x))
      Interquartile range         > IQR(x)
      Range                       > range(x)
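The functions listed on this slide can be tried on a small vector. The following is a minimal sketch; the sample values and the name `mode_value` are made up for illustration and are not part of the original slides.

```r
# Small made-up sample; the values are illustrative only.
x <- c(2, 4, 4, 5, 7, 9, 4, 100)

mean(x)               # arithmetic mean, pulled up by the outlier 100
mean(x, trim = 0.25)  # trimmed mean: drops 25% of values from each end first
median(x)             # middle value of the sorted data

# R has no built-in statistical mode; the most frequent value is the
# last entry once the frequency table is sorted in increasing order.
freq <- sort(table(x))
mode_value <- as.numeric(names(freq)[length(freq)])

sd(x)     # sample standard deviation (n - 1 denominator)
var(x)    # sample variance, equal to sd(x)^2
mad(x)    # median absolute deviation, scaled by 1.4826 by default
IQR(x)    # third quartile minus first quartile
range(x)  # c(minimum, maximum)
```

The trimmed mean and the median stay close to the bulk of the data here, while the plain mean is dominated by the single large value.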
  • 5. Data Dimensions
      > length(x)
      [1] 1000
      Matrix X
      > nrow(X)
      [1] 2030
      > ncol(X)
      [1] 100000
      > dim(X)
      [1] 2034 100000
  • 6. Vectorization in R
      Matrix X
      > apply( X, MARGIN=1, FUN= mean)    (MARGIN=1: one mean per row)
      > apply( X, MARGIN=2, FUN= mean)    (MARGIN=2: one mean per column)
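To see what the two `MARGIN` values do, here is a small sketch on a toy matrix. The 2 x 3 matrix and the names `row_means` and `col_means` are assumptions chosen for illustration; the slide's own `X` is far larger.

```r
# Toy 2 x 3 matrix standing in for the slide's much larger X.
X <- matrix(1:6, nrow = 2, ncol = 3)

dim(X)  # number of rows, then number of columns

row_means <- apply(X, MARGIN = 1, FUN = mean)  # one mean per row
col_means <- apply(X, MARGIN = 2, FUN = mean)  # one mean per column

# rowMeans() and colMeans() are faster built-in shortcuts for this case.
stopifnot(all.equal(row_means, rowMeans(X)),
          all.equal(col_means, colMeans(X)))
```

`apply()` accepts any function as `FUN`, so the same pattern works for `sd`, `median`, or a user-written summary.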
  • 7. boxplot(X) • Good for small data sets • Easy to compare groups side by side • 1.5*IQR defines outlier [Figure: side-by-side boxplots of epiE, epiS, epiImp, epilie, epiNeur]
  • 8. The Big Six • Minimum, 1st Q, Median, Mean, 3rd Q, Maximum > summary(X)
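The 1.5*IQR rule can be checked without drawing anything. This sketch uses `boxplot.stats()`, the function that `boxplot()` relies on for its numbers; the sample values and the names `hinges`, `fences` and `outliers` are illustrative assumptions.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)  # illustrative values only

stats <- boxplot.stats(x, coef = 1.5)
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # points beyond 1.5 * IQR from the hinges

# The same fences computed by hand from the hinges.
hinges <- fivenum(x)[c(2, 4)]
fences <- c(hinges[1] - 1.5 * diff(hinges), hinges[2] + 1.5 * diff(hinges))
outliers <- x[x < fences[1] | x > fences[2]]
```

Note that the boxplot rule is based on hinges, which can differ slightly from the quartiles returned by `quantile()` for some sample sizes.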
  • 9. R tries to understand you > summary(X)
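One way to read "R tries to understand you" is that `summary()` adapts its output to the type of object it receives. A minimal sketch, with made-up objects `num`, `grp` and `df` that are not in the original slides:

```r
num <- c(3, 1, 4, 1, 5, 9, 2, 6)                          # numeric vector
grp <- factor(c("a", "b", "a", "c", "a", "b", "c", "a"))  # categorical data
df  <- data.frame(num = num, grp = grp)

summary(num)  # minimum, quartiles, median, mean, maximum
summary(grp)  # counts per level
summary(df)   # column-by-column summary, each column by its own type
```

The same call therefore gives the "Big Six" for numeric data and a frequency count for factors.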
  • 10. Histograms: > hist(X) [Figure: frequency histograms, one panel each for epiE, epiS, epiImp, epilie, epiNeur, bfagree, bfcon, bfext, bfneur, bfopen, bdi]
  • 11. Correlation > cor(wt,mpg) [1] -0.8676594 > plot(x=wt,y=mpg) [Figure: "Scatterplot Example", Miles Per Gallon (y-axis) against Car Weight (x-axis)]
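The correlation on this slide can be reproduced as follows. This is a sketch that assumes `wt` and `mpg` come from R's built-in `mtcars` data set, which the slides do not state explicitly; `r_manual` is an illustrative name.

```r
# Assumption: wt and mpg are the columns of the built-in mtcars data set.
wt  <- mtcars$wt   # car weight in 1000 lbs
mpg <- mtcars$mpg  # miles per gallon

r <- cor(wt, mpg)  # Pearson correlation by default

# The same number from the definition: covariance over the product of SDs.
r_manual <- cov(wt, mpg) / (sd(wt) * sd(mpg))
```

A value close to -1 indicates a strong negative linear relationship: heavier cars tend to get fewer miles per gallon.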
  • 12. Scatterplot Matrix • Iris dataset • 150 flowers • 5 variables Goingslo, flickr
  • 13. Scatterplot Matrix plot > pairs(data) [Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species; legend: setosa, versicolor, virginica]
  • 14. > coplot(lat ~ long | depth) [Figure: conditioning plot of lat (y-axis) against long (x-axis), given depth in overlapping intervals]
  • 15. Linear Regression • Why? Prediction of future or unknown observations; assessment of relationship between variables; general description of data structure • What?
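The model behind the "What?" question is described in the editor's notes (a continuous response, p predictors, p+1 unknown betas, an additive error term, estimation by minimizing the residual sum of squares). Written out, under the assumption that the notation below matches what was displayed on the slide:

```latex
% Linear model with p predictors and additive error term
Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon

% Least-squares estimate: choose the betas that minimize the residual sum of squares
\hat{\beta} = \arg\min_{\beta_0, \dots, \beta_p}
  \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
```

"Linear" refers to the betas: a term such as \beta_1 x_1^2 still gives a linear model, whereas a term such as x_1^{\beta_1} does not.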
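  • 16. Variable Selection • Why? Simplification; elimination of multicollinearity and noise; time and money saving • How? Testing-based variable selection methods (backward, forward, stepwise); criterion-based procedures • What? AIC = n ln(RSS/n) + 2p, where n is the number of observations, RSS is the residual sum of squares, and p counts the estimated coefficients, intercept included (the convention used by R's step())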
  • 17. Example: U.S. State Facts and Figures • Response: Life Expectancy • Predictors: Population, Income, Illiteracy, Murder, HS Grad, Frost, Area • Selected R code
      Linear Regression
      > g <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area, data = statedata)
      > summary(g)
      Coefficients:
                     Estimate  Std. Error  t value  Pr(>|t|)
      (Intercept)   7.094e+01   1.748e+00   40.586   < 2e-16 ***
      Population    5.180e-05   2.919e-05    1.775    0.0832 .
      Income       -2.180e-05   2.444e-04   -0.089    0.9293
      Illiteracy    3.382e-02   3.663e-01    0.092    0.9269
      Murder       -3.011e-01   4.662e-02   -6.459  8.68e-08 ***
      HS.Grad       4.893e-02   2.332e-02    2.098    0.0420 *
      Frost        -5.735e-03   3.143e-03   -1.825    0.0752 .
      Area         -7.383e-08   1.668e-06   -0.044    0.9649
      > anova(g)
      Analysis of Variance Table
      Response: Life.Exp
                   Df   Sum Sq  Mean Sq  F value     Pr(>F)
      Population    1   0.4089   0.4089   0.7372   0.395434
      Income        1  11.5946  11.5946  20.9028  4.218e-05 ***
      Illiteracy    1  19.4207  19.4207  35.0116  5.228e-07 ***
      Murder        1  27.4288  27.4288  49.4486  1.308e-08 ***
      HS.Grad       1   4.0989   4.0989   7.3895   0.009494 **
      Frost         1   2.0488   2.0488   3.6935   0.061426 .
      Area          1   0.0011   0.0011   0.0020   0.964908
      Residuals    42  23.2971   0.5547
      AIC
      > step(g)
      AIC = n ln(RSS/n) + 2(p)
  • 18. Continued: U.S. State Facts and Figures
      Start: AIC=-22.18
      Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area
                     Df  Sum of Sq     RSS      AIC
      - Area          1     0.0011  23.298  -24.182
      - Income        1     0.0044  23.302  -24.175
      - Illiteracy    1     0.0047  23.302  -24.174
      <none>                        23.297  -22.185
      - Population    1     1.7472  25.044  -20.569
      - Frost         1     1.8466  25.144  -20.371
      - HS.Grad       1     2.4413  25.738  -19.202
      - Murder        1    23.1411  46.438   10.305
      Step: AIC=-24.18
      Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost
                     Df  Sum of Sq     RSS      AIC
      - Illiteracy    1     0.0038  23.302  -26.174
      - Income        1     0.0059  23.304  -26.170
      <none>                        23.298  -24.182
      - Population    1     1.7599  25.058  -22.541
      - Frost         1     2.0488  25.347  -21.968
      - HS.Grad       1     2.9804  26.279  -20.163
      - Murder        1    26.2721  49.570   11.569
  • 19. Continued: U.S. State Facts and Figures
      Step: AIC=-28.16
      Life.Exp ~ Population + Murder + HS.Grad + Frost
                     Df  Sum of Sq     RSS      AIC
      <none>                        23.308  -28.161
      - Population    1      2.064  25.372  -25.920
      - Frost         1      3.122  26.430  -23.877
      - HS.Grad       1      5.112  28.420  -20.246
      - Murder        1     34.816  58.124   15.528
      Coefficients:
      (Intercept)  Population      Murder     HS.Grad       Frost
        7.103e+01   5.014e-05  -3.001e-01   4.658e-02  -5.943e-03
      [Figure: "Effect on Response Variable of One Unit Change of Predict Variable"; y-axis Life Expectancy, x-axis Predict Variables; labelled values: Intercept 71.03, x1 0.00005014, x4 0.3001, x5 0.04658, x6 0.005943]
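The AIC values printed by `step()` can be reproduced by hand. The sketch below assumes that `statedata` was built from the `state.x77` matrix that ships with R, which the slides do not show; the names `aic_manual`, `aic_step`, `k` and `g_final` are illustrative.

```r
# Assumption: statedata is built from the state.x77 matrix shipped with R.
statedata <- data.frame(state.x77)  # spaces in names become dots (Life.Exp, HS.Grad)

g <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder +
          HS.Grad + Frost + Area, data = statedata)

n   <- nrow(statedata)       # number of observations (states)
rss <- sum(residuals(g)^2)   # residual sum of squares
k   <- length(coef(g))       # estimated coefficients, intercept included
aic_manual <- n * log(rss / n) + 2 * k

aic_step <- extractAIC(g)[2]  # the criterion that step() minimizes

# Backward elimination; trace = 0 suppresses the step-by-step printout.
g_final <- step(g, direction = "backward", trace = 0)
formula(g_final)
```

The manual value agrees with `extractAIC()` only when the intercept is counted in the penalty term, which is why the full model starts at about -22.18 rather than -24.18.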
  • 20. What is Principal Component Analysis (PCA)? • Two general approaches to reducing variables: feature selection and feature extraction • Feature selection: “Akaike Information Criterion” (AIC), BIC or backward elimination • Feature extraction: “Principal Component Analysis” (PCA) is most widely used • Create several artificial variables • Built-in functions in R = Convenient!
  • 21. Actual Pima Data
         pregnant  glucose  diastolic  triceps  insulin   bmi  diabetes  age  test
      1         6      148         72       35        0  33.6     0.627   50     1
      2         1       85         66       29        0  26.6     0.351   31     0
      3         8      183         64        0        0  23.3     0.672   32     1
      4         1       89         66       23       94  28.1     0.167   21     0
      5         0      137         40       35      168  43.1     2.288   33     1
      6         5      116         74        0        0  25.6     0.201   30     0
      ….
      (Imagine a data set with many more (~1000) columns)
      (Imagine a Linear Regression: Which variables affect diabetes in what ways?)
  • 22. PCA Example: Pima Indians • The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. • 9 Variables (8 continuous, 1 categorical)
      pregnant: Number of times pregnant
      Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
      Diastolic: Diastolic blood pressure (mm Hg)
      Triceps: Triceps skin fold thickness (mm)
      Insulin: 2-Hour serum insulin (mu U/ml)
      Bmi: Body mass index (weight in kg/(height in metres squared))
      Diabetes: Diabetes pedigree function
      Age: Age (years)
      Test: diabetes (coded 0 if negative, 1 if positive)
      Next Slide: PCA Implementation
  • 23. What principal components might look like: • PC1: 1*Insulin + 0.01*Glucose + .. • PC2: 1*Glucose + 0.12*Age + 0.12*DiastolicBP + .. • PC3: 0.92*DiastolicBP + 0.31*Triceps • Principal components: What are they composed of? (less important) • Difference with Linear Regression
  • 24. Goal: obtain summary about data in lower dimensions • How many dimensions? • R code in the next slide [Figure: biplot of PC1 (x-axis) against PC2 (y-axis); observations drawn as "+", with variable arrows labelled insulin, triceps, pregnant, age, bmi, diastolic, glucose]
  • 25. Brief: R-Code
      > data.pca <- prcomp(data[,-9]); summary(data.pca);
      Importance of components:
                                  PC1      PC2      PC3      PC4      PC5      PC6      PC7
      Standard deviation      116.002  30.5411  19.7630  14.0777  10.6155  6.76973  2.78575
      Proportion of Variance    0.889   0.0616   0.0258   0.0131  0.00744  0.00303  0.00051
      Cumulative Proportion     0.889    0.950    0.976   0.9890    0.996    0.999  1.00000
      > data.pca
      Rotation:
                     PC1    PC2    PC3    PC4     PC5     PC6     PC7
      pregnant     0.002  -0.02   0.02   0.05   2e-01  -0.005  -1e+00
      glucose     -0.098  -0.97  -0.14  -0.12  -9e-02   0.051  -9e-04
      Diastolic   -0.016  -0.14   0.92   0.26  -2e-01   0.076   1e-03
      triceps     -0.061   0.06   0.31  -0.88   3e-01   0.221   4e-04
      insulin     -0.993   0.09  -0.02   0.07  -2e-04  -0.006  -1e-03
      bmi         -0.014  -0.05   0.13  -0.19   2e-02  -0.971   3e-03
      age          0.004  -0.14   0.13   0.30   9e-01  -0.015   2e-01
      > barplot(totalrep, main="Representation of Principal Components", xlab="Principal Component", ylab="% of Total Variance")
      > biplot(data.pca, xlabs=rep('+',768), xlim = c(-0.05,0.3), ylim = c(-0.15,0.12)); abline(h=0,v=0);
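The quantities in the `prcomp()` output can be rebuilt from its components. This is a minimal sketch that swaps in the built-in `USArrests` data as a stand-in, so that it runs without the faraway package; it is not the Pima analysis itself, and the names `dat`, `prop_var`, `cum_var`, `centered` and `scores` are illustrative.

```r
# Stand-in data: the built-in USArrests set (4 numeric columns), used here
# only so the sketch runs without the faraway package.
dat <- USArrests

pca <- prcomp(dat)  # centers each column; no scaling, as on the slide

# Proportion of variance per component: squared standard deviations
# divided by their sum.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var  <- cumsum(prop_var)

# Each component score is a linear combination of the centered columns,
# with weights given by the columns of the rotation matrix.
centered <- scale(dat, center = TRUE, scale = FALSE)
scores   <- centered %*% pca$rotation
```

Because the columns are not scaled, the variable with the largest raw variance tends to dominate the first component, much as insulin does in the Pima output; passing `scale. = TRUE` to `prcomp()` is a common alternative when units differ.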
  • 26. [Figure: bar chart "Representation of Principal Components"; x-axis Principal Component, y-axis % of Total Variance]

Editor's Notes

  1. * Why do we do linear regression?
     * Give two examples of prediction (modern examples):
       Genetics (cancer, disease), personalized medicine. Subset: there are so many variables that we tend not to test all of them.
       Computational advertising: information about your history serves as the variables.
     * Multiple contributing factors: how does each uniquely affect the response variable?
     * When? When you need these kinds of information.
     * Y is a continuous variable.
     * The X’s can be continuous, discrete or categorical.
     * If p = 1, the model is called simple linear regression. If p > 1, the model is called multiple linear regression.
     * Typical linear model displayed; epsilon is the additive error term.
     * The unknown parameters beta zero to beta p are what we are going to estimate.
     * Note 1: “linear” means linear in the betas, not in the x’s. Oral examples given.
     * Note 2: there are always p+1 unknown parameters when there are p predictor variables. Beta zero is called the intercept.
     * The basic mechanism for building a linear regression model is to minimize the residual sum of squares, which involves a large amount of calculation.
     * With the help of R, we can easily obtain a model through a line of code, and we will get to that later.
  2. * We don’t want to include all the variables we have obtained. We also want to simplify the model, so that it does not contain, say, 20 variables.
     * Some variables are highly correlated with each other; selection reduces the chance of multicollinearity.
     * Reduce the noise caused by unnecessary variables.
     * Save time and money.
     * Backward elimination, forward selection and stepwise regression.
     * Testing-based methods are sensitive to outliers, so we introduce a better approach: criterion-based procedures.
     * General idea: choose the model that optimizes a criterion with minimal information loss, balancing goodness-of-fit against model size, or accuracy against complexity.
     * AIC: Akaike Information Criterion
     * BIC: Bayesian Information Criterion
     * Adjusted R-squared and Mallows’s C_p are similar criteria that can be easily calculated in R.
     * What exactly is a criterion?
     * It is a number that one can calculate based on the model that he or she has fitted.
     * n is the number of observations, p is the number of estimated coefficients (the predictor variables plus the intercept), and RSS, the residual sum of squares, is a number based on your model.
     * Once AIC is calculated, we fit other possible models containing other combinations of variables, and check whether they give a smaller AIC or BIC.
     * Our goal is to find the model with the smallest AIC or BIC.
     * R can help us check all possible AIC or BIC values.
  3. //WITH ANIMATION
     * How the predictor variables Population, Income, Illiteracy percentage, murder rate, high school graduation percentage, frost and land area affect life expectancy from 1969 to 1971.
     * R can easily help us get the result that we want.
     * After we read the data set into R, we only need to use the lm function to set up the linear regression model.
     * We can use “~.” to indicate all variables in the data.
     * summary(g) gives the intercept and the coefficients of the x’s, which are the betas that we are estimating.
     * R calculates the betas, as mentioned, by minimizing the residual sum of squares.
     * People who are familiar with statistics might notice that the p-values are very large for some variables, indicating that these predictor variables are going to be eliminated in later steps.
     * Here we have our complete model including all possible variables.
     * R also provides the ANOVA table, where we can easily see the RSS that is used to calculate AIC.
     * But we don’t need to calculate AIC by ourselves. R will help us do it.
     * Just simply use the “step” function.
  4. * This slide goes by very quickly, with just a simple explanation of the mechanism of AIC.
  5. //WITH ANIMATION
     * Final Result
     * Visualization of the model
  6. http://rss.acs.unt.edu/Rdoc/library/faraway/html/pima.html
     pregnant: Number of times pregnant
     Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
     Diastolic: Diastolic blood pressure (mm Hg)
     Triceps: Triceps skin fold thickness (mm)
     Insulin: 2-Hour serum insulin (mu U/ml)
     Bmi: Body mass index (weight in kg/(height in metres squared))
     Diabetes: Diabetes pedigree function
     Age: Age (years)
     Test: test whether the patient shows signs of diabetes (coded 0 if negative, 1 if positive)
     Artificial results: make BMI + 73*Cholesterol + Age
  7. The difference with Linear Regression is that LR has a specific goal: it is clear which variable is the dependent variable and which ones are independent variables whose effect we want to examine. Principal component’s goal would be less specific: we are examining the correlations between the variables. **Professor Shedden could you be more specific about what I should talk about in this slide?
  8. Goal: obtain several “principal components”. Often the first several principal components account for most of the variation. (Shown in last slide.) We can see that in the PC1 direction, insulin accounts for most of the variation. We can verify this by inspecting the PCs. The usefulness of PCA is that the first several principal components may tell us which variables account for more of the “variation”. Also, unlike “variable selection” in linear regression, we preserve the effects of all the columns while creating new (fewer) variables that can explain the data.
     * Many dimensions can be reduced to several principal components whose “directions” are combinations of the original variables. (The downside is that interpretation is sometimes unclear: what does it mean to have -0.3 of insulin and 0.594 of glucose? Maybe I shouldn’t talk about this.)
  9. ****Round down to 3 digits
     library(faraway)
     data <- pima
     summary(pima)
     data.pca <- prcomp(data[,-9])        # drop column 9 (test) before PCA
     quartz()                             # macOS-only graphics device
     plot(data.pca)
     totalrep <- data.pca$sdev^2          # component variances
     totalrep <- totalrep/sum(totalrep)   # share of total variance
     barplot(totalrep, main="Representation of Principal Components", xlab="Principal Component", ylab="% of Total Variance")
     biplot(data.pca)
     plotAll(data.pca)                    # not a base R function; presumably a helper defined elsewhere
     plot(data.pca$x[,1], data.pca$x[,2]) # scores on PC1 against scores on PC2
     biplot(data.pca, xlabs=rep('+',768), xlim = c(-0.05,0.3), ylim = c(-0.15,0.12)); abline(h=0,v=0)
     summary(data.pca)