A2DataDive workshop: Introduction to R

Author(s): Alex Ocampo, Chong Zhang, Sangwon Hyun, Yiqun Hu, 2012

License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution – Noncommercial – Share Alike 3.0 License: http://creativecommons.org/licenses/by-nc-sa/3.0/
We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use, share, and adapt it. The citation key on the following slide provides information about how you may share and adapt this material.

Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of content.

For more information about how to cite these materials visit http://open.umich.edu/education/about/terms-of-use.

Attribution Key
for more information see: http://open.umich.edu/wiki/AttributionPolicy

Use + Share + Adapt
{ Content the copyright holder, author, or law permits you to use, share and adapt. }
Public Domain – Government: Works that are produced by the U.S. Government. (17 USC § 105)
Public Domain – Expired: Works that are no longer protected due to an expired copyright term.
Public Domain – Self Dedicated: Works that a copyright holder has dedicated to the public domain.

Creative Commons – Zero Waiver

Creative Commons – Attribution License
Creative Commons – Attribution Share Alike License
Creative Commons – Attribution Noncommercial License
Creative Commons – Attribution Noncommercial Share Alike License
GNU – Free Documentation License

Make Your Own Assessment
{ Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for copyright. }
Public Domain – Ineligible: Works that are ineligible for copyright protection in the U.S. (17 USC § 102(b)) *laws in your jurisdiction may differ
{ Content Open.Michigan has used under a Fair Use determination. }
Fair Use: Use of works that is determined to be Fair consistent with the U.S. Copyright Act. (17 USC § 107) *laws in your jurisdiction may differ
Our determination DOES NOT mean that all uses of this 3rd-party content are Fair Uses and we DO NOT guarantee that your use of the content is Fair.
To use this content you should do your own independent analysis to determine whether or not your use will be Fair.

Descriptive Statistics
quantitatively describe the main features of a collection of data.

"How do salaries vary across the company?"
"What should I make of all this???!!!"

[Cartoon labels: employee, manager, Staff. Jones, HR]

Descriptive Statistics in R

Mean                           > mean(x)
                               > mean(x, trim=a)
Median                         > median(x)
Mode                           > sort(table(x))
Standard deviation             > sd(x)
Variance                       > var(x)
Median absolute deviation      > mad(c(x))
Interquartile range            > IQR(x)
Range                          > range(x)
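As a small illustration of the functions listed above, the sketch below applies them to a made-up salary vector. The values in x are hypothetical and chosen only so that one large outlier shows how the mean, trimmed mean, and median differ.

```r
# Hypothetical sample: ten salaries in thousands of dollars (made-up values)
x <- c(42, 45, 45, 48, 51, 53, 55, 60, 62, 150)

mean(x)                            # arithmetic mean, pulled up by the value 150
mean(x, trim = 0.1)                # drops the lowest and highest 10% first
median(x)                          # middle of the sorted values
sort(table(x), decreasing = TRUE)  # counts per value, most frequent first
sd(x)                              # standard deviation
var(x)                             # variance (sd squared)
mad(x)                             # median absolute deviation
IQR(x)                             # interquartile range
range(x)                           # minimum and maximum
```

Base R has no dedicated function for the statistical mode, which is why the slide uses sort(table(x)); passing decreasing = TRUE simply puts the most frequent value first.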

Data Dimensions

Vector x
> length(x)
[1] 1000
-------------------------
Matrix X
> nrow(X)
[1] 2034
> ncol(X)
[1] 100000
> dim(X)
[1] 2034 100000
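The relationship between these functions can be checked on a small object. The vector and matrix below are hypothetical stand-ins (much smaller than the ones on the slide); the point is only that dim() returns the row count followed by the column count.

```r
# Hypothetical vector and matrix, much smaller than the slide's examples
x <- seq_len(1000)
length(x)          # number of elements in the vector

X <- matrix(0, nrow = 20, ncol = 5)
nrow(X)            # number of rows
ncol(X)            # number of columns
dim(X)             # c(nrow(X), ncol(X))
length(X)          # for a matrix: total number of cells
```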

Vectorization in R
Matrix X

> apply(X, MARGIN=1, FUN=mean)    # MARGIN=1: mean of each row

> apply(X, MARGIN=2, FUN=mean)    # MARGIN=2: mean of each column
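A minimal sketch of the two apply() calls on a tiny matrix, so the results can be checked by eye. The matrix is a made-up example; rowMeans() and colMeans() are base R shortcuts for these particular cases.

```r
# Hypothetical 2 x 3 matrix; rows are (1, 3, 5) and (2, 4, 6)
X <- matrix(1:6, nrow = 2)

row_means <- apply(X, MARGIN = 1, FUN = mean)  # one value per row
col_means <- apply(X, MARGIN = 2, FUN = mean)  # one value per column

row_means   # 3 4
col_means   # 1.5 3.5 5.5

# Specialised base R equivalents for means
rowMeans(X)
colMeans(X)
```

apply() accepts any function for FUN (for example sd or median), which is what makes it more general than rowMeans() and colMeans().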

> boxplot(X)

• Good for small data sets
• Easy to compare groups side by side
• 1.5*IQR defines outlier

[Figure: side-by-side boxplots of epiE, epiS, epiImp, epilie, epiNeur]
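The 1.5*IQR rule can be sketched by hand as follows. The data vector is hypothetical. Note that boxplot() computes its box edges from Tukey's hinges, which can differ slightly from quantile() for some sample sizes, so the manual fences here are an approximation of what boxplot() does rather than its exact internals.

```r
# Hypothetical data with one unusually large value
x <- c(2, 4, 5, 5, 6, 7, 8, 9, 30)

q <- unname(quantile(x, c(0.25, 0.75)))  # first and third quartiles
fence <- 1.5 * IQR(x)                    # whisker length limit

# Points beyond the fences are drawn individually as outliers
outliers <- x[x < q[1] - fence | x > q[2] + fence]
outliers

# The same points as reported by the routine behind boxplot()
boxplot.stats(x)$out

boxplot(x, main = "Hypothetical data")
```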

The Big Six
• Minimum, 1st Q, Median, Mean, 3rd Q, Maximum

> summary(X)

R tries to understand you

> summary(X)
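One way to read "R tries to understand you" is that summary() adapts its output to the type of object it receives. The sketch below uses made-up objects (scores, dept, staff) to show this: six numbers for a numeric vector, counts for a factor, and a per-column mix for a data frame.

```r
# Hypothetical numeric vector: summary() reports the "big six"
scores <- c(3, 7, 8, 10, 12)
summary(scores)

# Hypothetical factor: summary() reports counts per level instead
dept <- factor(c("HR", "HR", "IT", "Sales", "IT", "HR"))
summary(dept)

# Hypothetical data frame: each column is summarised according to its type
staff <- data.frame(salary = c(42, 45, 51, 60, 55, 48), dept = dept)
summary(staff)
```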

Histograms: > hist(X)

[Figure: grid of histograms (y-axis: Frequency), one panel each for epiE, epiS, epiImp, epilie, epiNeur, bfagree, bfcon, bfext, bfneur, bfopen, bdi]
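In base R, hist() expects a single numeric vector, so a grid of panels like the one above needs one call per column (the slide's hist(X) may rely on a helper from an add-on package; that is an assumption). A minimal base R sketch, using the built-in iris measurements as a stand-in for the slide's data:

```r
# Stand-in data: the four numeric columns of the built-in iris data set
X <- iris[, 1:4]

# Arrange a 2 x 2 grid and draw one histogram per column
op <- par(mfrow = c(2, 2))
for (name in names(X)) {
  hist(X[[name]], main = name, xlab = name)
}
par(op)  # restore the previous plotting layout

# hist() can also return the bin counts without drawing anything
h <- hist(X$Sepal.Length, plot = FALSE)
sum(h$counts)  # every observation falls in exactly one bin
```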

Correlation
> cor(wt,mpg)
[1] -0.8676594
> plot(x=wt,y=mpg)

[Figure: Scatterplot Example; x-axis: Car Weight, y-axis: Miles Per Gallon]
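The correlation shown above matches the built-in mtcars data set, so the sketch below assumes that wt and mpg are its weight and fuel-economy columns (the slide does not say where the variables come from).

```r
# Assumption: wt and mpg are columns of the built-in mtcars data set
wt  <- mtcars$wt    # car weight (1000 lbs)
mpg <- mtcars$mpg   # miles per (US) gallon

r <- cor(wt, mpg)   # Pearson correlation by default
r

plot(x = wt, y = mpg,
     xlab = "Car Weight", ylab = "Miles Per Gallon",
     main = "Scatterplot Example")
```

A correlation this close to -1 indicates a strong negative linear association: heavier cars tend to travel fewer miles per gallon.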

Scatterplot Matrix

• Iris dataset
• 150 flowers
• 5 variables

(Photo: Goingslo, flickr)

Scatterplot Matrix
> pairs(data)

[Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species; species shown: setosa, versicolor, virginica]
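A minimal sketch of the scatterplot matrix using the built-in iris data set. Colouring the points by species is an addition for readability and is not necessarily how the original figure was drawn.

```r
# The built-in iris data: 150 flowers, 5 variables
data <- iris
dim(data)

# One panel per pair of variables; Species is included as the fifth variable
pairs(data)

# Variation (an assumption, not from the slide): colour points by species
pairs(data[, 1:4], col = as.integer(data$Species))
```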

> coplot(lat ~ long | depth)

[Figure: conditioning plot of lat (y-axis) against long (x-axis), one panel per overlapping interval of depth ("Given : depth")]
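The variable names and axis ranges suggest the built-in quakes data set, so this sketch assumes that is the source. It also shows the overlapping depth intervals that coplot() uses by default to split the data into panels.

```r
# Assumption: lat, long and depth come from the built-in quakes data set
coplot(lat ~ long | depth, data = quakes)

# The conditioning intervals: 6 overlapping ranges of depth by default
intervals <- co.intervals(quakes$depth, number = 6, overlap = 0.5)
intervals
```

Each panel shows lat against long only for the earthquakes whose depth falls in one of these intervals, which makes it possible to see how the spatial pattern changes with depth.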

Linear Regression

• Why?
  - Prediction of future or unknown observations
  - Assessment of relationship between variables
  - General description of data structure
• What?

Variable Selection

• Why?
  - Simplification
  - Elimination of multicollinearity and noise
  - Time and money saving
• How?
  - Testing-based Variable Selection Methods: Backward, Forward, Stepwise
  - Criterion-based Procedures
• What?
  - AIC = n ln(RSS/n) + 2(p)
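The AIC formula above can be evaluated by hand, taking n as the number of observations, RSS as the residual sum of squares, and p as the number of estimated coefficients (including the intercept). The model below, fitted to the built-in mtcars data, is only a stand-in for illustration.

```r
# Stand-in model on built-in data (not the example used in the slides)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)            # number of observations
rss <- sum(residuals(fit)^2)   # residual sum of squares
p   <- length(coef(fit))       # coefficients, including the intercept

aic_by_hand <- n * log(rss / n) + 2 * p
aic_by_hand

# extractAIC() returns c(edf, AIC) on the same scale used by step()
extractAIC(fit)
```

The generic AIC() function reports a different number for the same model because it keeps additive terms that this formula drops; for a fixed data set those terms are the same for every candidate model, so the ranking of models is unchanged.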

Example: U.S. State Facts and Figures

• Life Expectancy
  - Population, Income, Illiteracy, Murder, HS Grad, Frost, Area

• Selected R code
  - Linear Regression
> g <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder
+ HS.Grad + Frost + Area, data = statedata)
> summary(g)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.094e+01  1.748e+00  40.586  < 2e-16 ***
Population   5.180e-05  2.919e-05   1.775   0.0832 .
Income      -2.180e-05  2.444e-04  -0.089   0.9293
Illiteracy   3.382e-02  3.663e-01   0.092   0.9269
Murder      -3.011e-01  4.662e-02  -6.459 8.68e-08 ***
HS.Grad      4.893e-02  2.332e-02   2.098   0.0420 *
Frost       -5.735e-03  3.143e-03  -1.825   0.0752 .
Area        -7.383e-08  1.668e-06  -0.044   0.9649

> anova(g)
Analysis of Variance Table
Response: Life.Exp
           Df  Sum Sq Mean Sq F value    Pr(>F)
Population  1  0.4089  0.4089  0.7372  0.395434
Income      1 11.5946 11.5946 20.9028 4.218e-05 ***
Illiteracy  1 19.4207 19.4207 35.0116 5.228e-07 ***
Murder      1 27.4288 27.4288 49.4486 1.308e-08 ***
HS.Grad     1  4.0989  4.0989  7.3895  0.009494 **
Frost       1  2.0488  2.0488  3.6935  0.061426 .
Area        1  0.0011  0.0011  0.0020  0.964908
Residuals  42 23.2971  0.5547

  - AIC
> step(g)
AIC = n ln(RSS/n) + 2(p)
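The slide does not show how statedata was created. A plausible reconstruction, offered as an assumption, builds it from the built-in state.x77 matrix; converting with data.frame() turns the column names "Life Exp" and "HS Grad" into Life.Exp and HS.Grad, matching the formula above.

```r
# Assumption: statedata is the built-in state.x77 matrix as a data frame
statedata <- data.frame(state.x77)
names(statedata)

# Full model: life expectancy against all seven other variables
g <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder
        + HS.Grad + Frost + Area, data = statedata)

summary(g)   # coefficient estimates, standard errors, t tests
anova(g)     # sequential (Type I) sums of squares, in formula order
```

The two tables answer different questions: summary() tests each predictor given all the others, while anova() tests each predictor given only those listed before it, which is why a variable such as Income can look significant in one table and not in the other.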

Continued: U.S. State Facts and Figures
Start: AIC=-22.18
Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area

Df Sum of Sq RSS AIC
- Area 1 0.0011 23.298 -24.182
- Income 1 0.0044 23.302 -24.175
- Illiteracy 1 0.0047 23.302 -24.174
<none> 23.297 -22.185
- Population 1 1.7472 25.044 -20.569
- Frost 1 1.8466 25.144 -20.371
- HS.Grad 1 2.4413 25.738 -19.202
- Murder 1 23.1411 46.438 10.305

Step: AIC=-24.18
Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost

Df Sum of Sq RSS AIC
- Illiteracy 1 0.0038 23.302 -26.174
- Income 1 0.0059 23.304 -26.170
<none> 23.298 -24.182
- Population 1 1.7599 25.058 -22.541
- Frost 1 2.0488 25.347 -21.968
- HS.Grad 1 2.9804 26.279 -20.163
- Murder 1 26.2721 49.570 11.569

Continued: U.S. State Facts and Figures
Step: AIC=-28.16
Life.Exp ~ Population + Murder + HS.Grad + Frost

             Df Sum of Sq    RSS     AIC
<none>                    23.308 -28.161
- Population  1     2.064 25.372 -25.920
- Frost       1     3.122 26.430 -23.877
- HS.Grad     1     5.112 28.420 -20.246
- Murder      1    34.816 58.124  15.528

Coefficients:
(Intercept)   Population       Murder      HS.Grad        Frost
  7.103e+01    5.014e-05   -3.001e-01    4.658e-02   -5.943e-03

[Figure: bar chart "Effect on Response Variable (Life Expectancy) of One Unit Change of Predictor Variable"; bars: Intercept 71.03, x1 0.00005014, x4 0.3001, x5 0.04658, x6 0.005943]
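The stepwise search traced above can be reproduced with step(). As before, building statedata from the built-in state.x77 matrix is an assumption; trace = 0 only suppresses the step-by-step printout.

```r
# Assumption: statedata is the built-in state.x77 matrix as a data frame
statedata <- data.frame(state.x77)

g <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder
        + HS.Grad + Frost + Area, data = statedata)

# Backward elimination by AIC; drops one term at a time while AIC improves
final <- step(g, trace = 0)

formula(final)     # the predictors that survive
coef(final)        # their estimated effects per unit change
extractAIC(final)  # c(edf, AIC) for the selected model
```

Each coefficient is the estimated change in life expectancy for a one-unit change in that predictor with the others held fixed, so the magnitudes are only comparable after accounting for the very different units of the predictors.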

What is Principal Component Analysis (PCA)?

• Two general approaches to reducing the number of variables: feature selection and feature extraction

• Feature selection: "Akaike Information Criterion" (AIC), BIC, or backward elimination
• Feature extraction: "Principal Component Analysis" (PCA) is the most widely used

• Create several artificial variables
• Built-in functions in R = Convenient!

Actual Pima Data

  pregnant glucose diastolic triceps insulin  bmi diabetes age test
1        6     148        72      35       0 33.6    0.627  50    1
2        1      85        66      29       0 26.6    0.351  31    0
3        8     183        64       0       0 23.3    0.672  32    1
4        1      89        66      23      94 28.1    0.167  21    0
5        0     137        40      35     168 43.1    2.288  33    1
6        5     116        74       0       0 25.6    0.201  30    0

….

( Imagine a data set with many more (~1000) columns )

(Imagine a Linear Regression: Which variables affect diabetes in what ways?)

PCA Example: Pima Indians

• The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix.
• 9 Variables (8 continuous, 1 categorical)
  - pregnant: Number of times pregnant
  - glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  - diastolic: Diastolic blood pressure (mm Hg)
  - triceps: Triceps skin fold thickness (mm)
  - insulin: 2-Hour serum insulin (mu U/ml)
  - bmi: Body mass index (weight in kg/(height in metres squared))
  - diabetes: Diabetes pedigree function
  - age: Age (years)
  - test: diabetes (coded 0 if negative, 1 if positive)
• Next Slide: PCA Implementation

What principal components might look like:

• PC1 : 1*Insulin + 0.01*Glucose + ..
• PC2 : 1*Glucose + 0.12*Age + 0.12*DiastolicBP + ..
• PC3 : 0.92*DiastolicBP + 0.31*Triceps

• Principal components : What are they composed of? (less important)

• Difference with Linear Regression

- Goal: obtain summary about data in lower dimensions
- How many dimensions?
- R code in the next slide

[Figure: biplot of the Pima data on PC1 (x-axis) and PC2 (y-axis); observations drawn as "+", with variable arrows for insulin, triceps, pregnant, age, bmi, diastolic, and glucose]

Brief : R-Code

> data.pca <- prcomp(data[,-9]); summary(data.pca);
Importance of components:
                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     116.002 30.5411 19.7630 14.0777 10.6155 6.76973 2.78575
Proportion of Variance   0.889  0.0616  0.0258  0.0131 0.00744 0.00303 0.00051
Cumulative Proportion    0.889  0.950   0.976   0.9890 0.996   0.999   1.00000

> data.pca
Rotation:
             PC1   PC2   PC3   PC4    PC5    PC6    PC7
pregnant   0.002 -0.02  0.02  0.05  2e-01 -0.005 -1e+00
glucose   -0.098 -0.97 -0.14 -0.12 -9e-02  0.051 -9e-04
diastolic -0.016 -0.14  0.92  0.26 -2e-01  0.076  1e-03
triceps   -0.061  0.06  0.31 -0.88  3e-01  0.221  4e-04
insulin   -0.993  0.09 -0.02  0.07 -2e-04 -0.006 -1e-03
bmi       -0.014 -0.05  0.13 -0.19  2e-02 -0.971  3e-03
age        0.004 -0.14  0.13  0.30  9e-01 -0.015  2e-01

> barplot(totalrep, main="Representation of Principal Components", xlab="Principal Component", ylab="% of Total Variance")
> biplot(data.pca, xlabs=rep('+',768), xlim = c(-0.05,0.3), ylim = c(-0.15,0.12)); abline(h=0,v=0);
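The slide does not show how totalrep was computed. One plausible definition, offered here as an assumption, is each component's share of the total variance. Because the Pima data are not part of base R, the sketch uses the built-in USArrests data as a stand-in; only the workflow carries over, not the numbers.

```r
# Stand-in data (assumption): built-in USArrests instead of the Pima data
data <- USArrests

# PCA on the raw (unscaled) columns, as in the slide's prcomp() call
data.pca <- prcomp(data)
summary(data.pca)

# Assumed definition of totalrep: share of total variance per component
totalrep <- data.pca$sdev^2 / sum(data.pca$sdev^2)
names(totalrep) <- colnames(data.pca$rotation)

barplot(totalrep, main = "Representation of Principal Components",
        xlab = "Principal Component", ylab = "% of Total Variance")

# Observations as "+" with variable arrows, plus reference axes
biplot(data.pca, xlabs = rep("+", nrow(data)))
abline(h = 0, v = 0)

# Optional variation: standardise the variables before the PCA
data.pca.scaled <- prcomp(data, scale. = TRUE)
```

Without scaling, the variable with the largest variance tends to dominate the first component, which is consistent with insulin carrying a loading of about -0.99 on PC1 in the output above. Passing scale. = TRUE puts all variables on a comparable footing.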

[Figure: bar chart "Representation of Principal Components"; x-axis: Principal Component, y-axis: % of Total Variance]
