Data Manipulation on R 
Factor Manipulation, Subsetting, Sorting and Reshaping 
Abhik Seal 
Indiana University School of Informatics and Computing (dsdht.wikispaces.com)
Basic Data Manipulation 
So far, we've covered how to read data from files, the internet and databases, and how to handle various file 
formats. In this session we look at how to manipulate the data once it has been read in, so that later 
processing is easier. 
Sorting and Ordering data 
sort(x,decreasing=FALSE): 'sort (or order) a vector or factor (partially) into ascending or descending order.' 
order(...,decreasing=FALSE): 'returns a permutation which rearranges its first argument into ascending or descending order, breaking ties by further arguments.' 
x <- c(1,5,7,8,3,12,34,2) 
sort(x) 
## [1] 1 2 3 5 7 8 12 34 
order(x) 
## [1] 1 8 5 2 3 4 6 7 
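The relationship between the two functions can be sketched as follows: order() returns positions, and indexing the vector with those positions gives the same result as sort(). The names idx and sorted_via_order are illustrative only.

```r
x <- c(1, 5, 7, 8, 3, 12, 34, 2)

idx <- order(x)              # positions of the elements in ascending order
sorted_via_order <- x[idx]   # indexing with those positions sorts the vector

identical(sorted_via_order, sort(x))   # both routes give the same vector
sort(x, decreasing = TRUE)             # descending order
```

This is why order() is the one used for data frames: the positions can be used to rearrange whole rows.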
Some examples of sorting and ordering 
# sort by mpg 
newdata <- mtcars[order(mtcars$mpg),] 
head(newdata,3) 
## mpg cyl disp hp drat wt qsec vs am gear carb 
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 
## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 
# sort by mpg and cyl 
newdata <- mtcars[order(mtcars$mpg, mtcars$cyl),] 
head(newdata,3) 
## mpg cyl disp hp drat wt qsec vs am gear carb 
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 
## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 
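Sort directions can also be mixed. As a small sketch (the name by_cyl_mpg is illustrative), negating a numeric column inside order() sorts that column in descending order while the other stays ascending:

```r
# ascending cyl, and within each cyl group descending mpg
by_cyl_mpg <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]
head(by_cyl_mpg[, c("mpg", "cyl")], 3)
```

The minus sign only works for numeric columns; for other types, one option is to wrap the column in xtfrm() or rank() before negating it.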
Ordering with plyr 
library(plyr) 
head(arrange(mtcars,mpg),3) 
## mpg cyl disp hp drat wt qsec vs am gear carb 
## 1 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 
## 2 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 
## 3 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 
head(arrange(mtcars,desc(mpg)),3) 
## mpg cyl disp hp drat wt qsec vs am gear carb 
## 1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 
## 2 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 
## 3 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 
Subsetting data 
set.seed(12345) 
#create a dataframe 
X<-data.frame("A"=sample(1:10),"B"=sample(11:20),"C"=sample(21:30)) 
# shuffle the rows, then set three values of B to NA 
X <- X[sample(1:10),] 
X$B[c(1,6,10)] <- NA 
head(X) 
## A B C 
## 8 4 NA 27 
## 1 8 11 25 
## 2 10 12 23 
## 5 3 13 24 
## 3 7 16 28 
## 10 5 NA 26 
Basic data subsetting 
# Accessing only first row 
X[1,] 
## A B C 
## 8 4 NA 27 
# accessing only first column 
X[,1] 
## [1] 4 8 10 3 7 5 9 1 2 6 
# accessing first row and first column 
X[1,1] 
## [1] 4 
AND/OR conditions 
head(X[(X$A <=6 & X$C > 24),],3) 
## A B C 
## 8 4 NA 27 
## 10 5 NA 26 
## 7 2 19 29 
head(X[(X$A <=6 | X$C > 24),],3) 
## A B C 
## 8 4 NA 27 
## 1 8 11 25 
## 5 3 13 24 
Selecting non-NA values in a data frame 
# select the rows without NA values in column B 
head(X[!is.na(X$B),],4) 
## A B C 
## 1 8 11 25 
## 2 10 12 23 
## 5 3 13 24 
## 3 7 16 28 
# select those which have values > 11 (which() also drops the NA rows) 
head(X[which(X$B>11),],4) 
## A B C 
## 2 10 12 23 
## 5 3 13 24 
## 3 7 16 28 
## 4 9 20 30 
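The different ways of filtering around NA values behave slightly differently. The following sketch uses a small hypothetical data frame df (not the X from the slides) to compare them:

```r
# hypothetical toy data with two missing values in B
df <- data.frame(A = 1:6, B = c(11, NA, 13, NA, 15, 16))

observed  <- df[!is.na(df$B), ]        # rows where B is not NA
complete  <- df[complete.cases(df), ]  # rows with no NA in any column
via_sub   <- subset(df, B > 12)        # subset() drops rows where the test is NA
via_index <- df[df$B > 12, ]           # logical indexing keeps them as all-NA rows

nrow(via_sub)
nrow(via_index)
```

Plain logical indexing returns an all-NA row for every NA in the condition, which is why the slides wrap the condition in which().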
# creating a data frame with 2 variables 
data <- data.frame(x1=c(2,3,4,5,6),x2=c(5,6,7,8,1)) 
list_data<-list(dat=data,vec.obj=c(1,2,3)) 
list_data 
## $dat 
## x1 x2 
## 1 2 5 
## 2 3 6 
## 3 4 7 
## 4 5 8 
## 5 6 1 
## 
## $vec.obj 
## [1] 1 2 3 
# accessing the second element of list_data 
list_data[[2]] 
## [1] 1 2 3 
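List elements can be reached in several ways. This sketch rebuilds a smaller version of list_data and contrasts double brackets, the $ operator and single brackets; the variable names on the left are illustrative.

```r
list_data <- list(dat = data.frame(x1 = c(2, 3), x2 = c(5, 6)),
                  vec.obj = c(1, 2, 3))

by_position <- list_data[[2]]      # the element itself: a numeric vector
by_name     <- list_data$vec.obj   # the same element, accessed by name
as_sublist  <- list_data[2]        # single brackets return a list of length 1

class(by_position)
class(as_sublist)
list_data$dat$x1                   # drill into the data frame inside the list
```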
Factors 
Factors are used to represent categorical data, and can also be used for ordinal data (i.e. categories that have an 
intrinsic ordering). Note that before R 4.0.0, functions like read.table() read character strings in as factors by 
default; since R 4.0.0 the default is stringsAsFactors=FALSE. 
'The function factor is used to encode a vector as a factor (the terms 'category' and 'enumerated type' are also used 
for factors). If argument ordered is TRUE, the factor levels are assumed to be ordered. For compatibility with S 
there is also a function ordered.' 
is.factor, is.ordered, as.factor and as.ordered are the membership and coercion functions for these classes. 
Factors 
Suppose we have a vector of case-control status 
cc=factor(c("case","case","case","control","control","control")) 
cc 
## [1] case case case control control control 
## Levels: case control 
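# note: assigning to levels() renames the levels in place, so the values are relabelled, not reordered 
levels(cc)=c("control","case") 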
cc 
## [1] control control control case case case 
## Levels: control case 
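Assigning to levels() as above renames the levels, which is why the values appear swapped in the output. If the goal is only to change the level order while keeping each observation's value, a sketch along these lines may be closer to what is wanted (reordered and releveled are illustrative names):

```r
cc <- factor(c("case", "case", "control", "control"))

# re-create the factor with an explicit level order; the values are unchanged
reordered <- factor(cc, levels = c("control", "case"))

# or move a single level to the front
releveled <- relevel(cc, ref = "control")

as.character(reordered)
levels(reordered)
```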
Factors 
Factors can be converted to numeric or character very easily (as.numeric() returns the integer level codes) 
x=factor(c("case","case","case","control","control","control"),levels=c("control","case")) 
as.character(x) 
## [1] "case" "case" "case" "control" "control" "control" 
as.numeric(x) 
## [1] 2 2 2 1 1 1 
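Because as.numeric() returns level codes, a factor whose labels are themselves numbers needs one extra step. A minimal sketch with a made-up factor f:

```r
f <- factor(c("10", "20", "20", "5"))

codes  <- as.numeric(f)                 # level codes, not the labels
values <- as.numeric(as.character(f))   # the numbers the labels represent
same   <- as.numeric(levels(f))[f]      # equivalent, converts each level once

codes
values
```

The levels are sorted as character strings here ("10", "20", "5"), so the codes do not follow the numeric order of the labels.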
Cut 
Now that we know more about factors, cut() will make more sense: 
x=1:100 
cx=cut(x,breaks=c(0,10,25,50,100)) 
head(cx) 
## [1] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10] 
## Levels: (0,10] (10,25] (25,50] (50,100] 
table(cx) 
## cx 
## (0,10] (10,25] (25,50] (50,100] 
## 10 15 25 50 
Cut 
We can also leave off the labels: with labels=FALSE, cut() returns integer bin codes instead of a factor 
cx=cut(x,breaks=c(0,10,25,50,100),labels=FALSE) 
head(cx) 
## [1] 1 1 1 1 1 1 
table(cx) 
## cx 
## 1 2 3 4 
## 10 15 25 50 
Cut 
cx=cut(x,breaks=c(10,25,50),labels=FALSE) 
head(cx) 
## [1] NA NA NA NA NA NA 
table(cx) 
## cx 
## 1 2 
## 15 25 
table(cx,useNA="ifany") 
## cx 
## 1 2 <NA> 
## 15 25 60 
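As the last example shows, values that fall outside the breaks become NA. A few other cut() arguments control the labels and which end of each interval is closed; the sketch below uses illustrative label names that are not from the slides.

```r
x <- 1:100
breaks <- c(0, 10, 25, 50, 100)

# custom labels, one per interval
cx <- cut(x, breaks = breaks, labels = c("low", "mid", "high", "top"))
table(cx)

# right = FALSE closes intervals on the left: [0,10), [10,25), ...
# the value 100 then falls outside the last interval and becomes NA
cl <- cut(x, breaks = breaks, right = FALSE)
table(cl, useNA = "ifany")

# include.lowest = TRUE also closes the outermost interval, so 100 is kept
ci <- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)
sum(is.na(ci))
```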
Adding to data frames 
m1=matrix(1:9,nrow=3,ncol=3,byrow=FALSE) 
m1 
## [,1] [,2] [,3] 
## [1,] 1 4 7 
## [2,] 2 5 8 
## [3,] 3 6 9 
m2=matrix(1:9,nrow=3,ncol=3,byrow=TRUE) 
m2 
## [,1] [,2] [,3] 
## [1,] 1 2 3 
## [2,] 4 5 6 
## [3,] 7 8 9 
Adding using cbind 
You can add columns (or another matrix/data frame) to a data frame or matrix using cbind() ('column bind'). 
You can also add rows (or another matrix/data frame) using rbind() ('row bind'). Note that the vector you are 
adding has to have the same length as the number of rows (for cbind()) or the number of columns (for rbind()). 
cbind(m1,m2) 
## [,1] [,2] [,3] [,4] [,5] [,6] 
## [1,] 1 4 7 1 2 3 
## [2,] 2 5 8 4 5 6 
## [3,] 3 6 9 7 8 9 
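The same idea applies to rbind() and to data frames. A short sketch, using a hypothetical data frame df that is not part of the slides:

```r
m1 <- matrix(1:9, nrow = 3, ncol = 3)

# add a row: the vector length must equal ncol(m1)
stacked <- rbind(m1, c(10, 11, 12))
dim(stacked)

# add columns to a data frame: one value per existing row
df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.0))
df <- cbind(df, passed = df$score > 3)
df$grade <- c("C", "B", "A")   # direct assignment has the same effect
df
```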
Reshape data 
A dataset's layout can be long or wide. In long layout, multiple rows represent a single subject's record, 
whereas in wide layout a single row represents a single subject's record. Some statistical analyses require 
wide data and others require long data, so it is useful to be able to reshape a dataset to meet the 
requirements of the analysis. Reshaping is just a rearrangement of the form of the data; it does not 
change the content of the dataset. This section focuses on the melt and cast paradigm of reshaping 
datasets, which was implemented in the contributed package reshape. That package was later 
reimplemented under a new name, reshape2, which is much more time and memory efficient (see the paper 
Reshaping Data with the reshape Package, by Wickham, at http://www.jstatsoft.org/v21/i12/paper). 
Wide data has a column for each variable. For example, this is wide-format data: 
# ozone wind temp 
# 1 23.62 11.623 65.55 
# 2 29.44 10.267 79.10 
# 3 59.12 8.942 83.90 
# 4 59.96 8.794 83.97 
Data in long format 
# variable value 
# 1 ozone 23.615 
# 2 ozone 29.444 
# 3 ozone 59.115 
# 4 ozone 59.962 
# 5 wind 11.623 
# 6 wind 10.267 
# 7 wind 8.942 
# 8 wind 8.794 
# 9 temp 65.548 
# 10 temp 79.100 
# 11 temp 83.903 
# 12 temp 83.968 
reshape2 Package 
"In reality, you need long-format data much more commonly than wide-format data. For example, ggplot2 
requires long-format data, plyr requires long-format data, and most modelling functions (such as lm(), glm(), 
and gam()) require long-format data. But people often find it easier to record their data in wide format." 
reshape2 is based around two key functions, melt and cast: 
melt takes wide-format data and melts it into long-format data. 
cast takes long-format data and casts it into wide-format data. 
Melt 
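library(reshape2) 
# lower-case the column names (Ozone, Solar.R, ...) so that they match the output shown here 
names(airquality) <- tolower(names(airquality)) 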
head(airquality,2) 
## ozone solar.r wind temp month day 
## 1 41 190 7.4 67 5 1 
## 2 36 118 8.0 72 5 2 
aql <- melt(airquality) # [a]ir [q]uality [l]ong format 
head(aql,5) 
## variable value 
## 1 ozone 41 
## 2 ozone 36 
## 3 ozone 12 
## 4 ozone 18 
## 5 ozone NA 
By default, melt has assumed that all columns with numeric values are variables with values. Maybe here we 
want to know the values of ozone, solar.r, wind, and temp for each month and day. We can do that with melt 
by telling it that we want month and day to be “ID variables”. ID variables are the variables that identify 
individual rows of data. 
m <- melt(airquality, id.vars = c("month", "day")) 
head(m,4) 
## month day variable value 
## 1 5 1 ozone 41 
## 2 5 2 ozone 36 
## 3 5 3 ozone 12 
## 4 5 4 ozone 18 
melt also allows us to control the column names in the long-format data 
m <- melt(airquality, id.vars = c("month", "day"), 
          variable.name = "climate_variable", 
          value.name = "climate_value") 
head(m) 
## month day climate_variable climate_value 
## 1 5 1 ozone 41 
## 2 5 2 ozone 36 
## 3 5 3 ozone 12 
## 4 5 4 ozone 18 
## 5 5 5 ozone NA 
## 6 5 6 ozone 28 
Long- to wide-format data: the cast functions 
In reshape2 there are multiple cast functions. Since you will most commonly work with data.frame objects, 
we’ll explore the dcast function. (There is also acast to return a vector, matrix, or array.) dcast uses a formula 
to describe the shape of the data. 
m <- melt(airquality, id.vars = c("month", "day")) 
aqw <- dcast(m, month + day ~ variable) 
head(aqw) 
## month day ozone solar.r wind temp 
## 1 5 1 41 190 7.4 67 
## 2 5 2 36 118 8.0 72 
## 3 5 3 12 149 12.6 74 
## 4 5 4 18 313 11.5 62 
## 5 5 5 NA NA 14.3 56 
## 6 5 6 28 NA 14.9 66 
Here, we need to tell dcast that month and day are the ID variables. 
Besides re-arranging the columns, we’ve recovered our original data. 
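When the formula leaves out one of the ID variables, several values land in the same cell and dcast needs an aggregation function. The following is a sketch only, assuming reshape2 is installed; aq and monthly are illustrative names, and the data set is copied first so the built-in airquality is left untouched.

```r
library(reshape2)

aq <- airquality
names(aq) <- tolower(names(aq))   # lower-case names, matching the slides' output

# na.rm = TRUE drops the missing measurements while melting
m <- melt(aq, id.vars = c("month", "day"), na.rm = TRUE)

# one row per month, each cell holding the mean over the days of that month
monthly <- dcast(m, month ~ variable, fun.aggregate = mean)
monthly
```

Without fun.aggregate, dcast falls back to counting the values in each cell and prints a message saying so.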
Data Manipulation Using plyr 
For large-scale data, we can split the dataset, perform the manipulation or analysis on each piece, and then combine the results into a 
single output again. Doing this split with base R alone is not very efficient; to overcome this limitation, 
Wickham developed an R package called plyr in 2011, in which he efficiently implemented the split-apply-combine 
strategy. This strategy can be compared to the map-reduce strategy for processing large amounts of data. 
In the coming slides I will give examples of the split-apply-combine strategy: 
· Without loops 
· With loops 
· Using the plyr package 
Without loops 
I am using the iris dataset here 
1. Split the iris dataset into three parts. 
2. Remove the species name variable from the data. 
3. Calculate the mean of each variable for the three different parts separately. 
4. Combine the output into a single data frame. 
# splitting by species and dropping the Species column (The split step) 
iris.set <- iris[iris$Species=="setosa",-5] 
iris.versi <- iris[iris$Species=="versicolor",-5] 
iris.virg <- iris[iris$Species=="virginica",-5] 
# calculating mean for each piece (The apply step) 
mean.set <- colMeans(iris.set) 
mean.versi <- colMeans(iris.versi) 
mean.virg <- colMeans(iris.virg) 
# combining the output (The combine step) 
mean.iris <- rbind(mean.set,mean.versi,mean.virg) 
# giving row names so that the output could be easily understood 
rownames(mean.iris) <- c("setosa","versicolor","virginica") 
With Loops 
mean.iris.loop <- NULL 
for(species in unique(iris$Species)) 
{ 
iris_sub <- iris[iris$Species==species,] 
column_means <- colMeans(iris_sub[,-5]) 
mean.iris.loop <- rbind(mean.iris.loop,column_means) 
} 
# giving row names so that the output could be easily understood 
rownames(mean.iris.loop) <- unique(iris$Species) 
NB: The split-apply-combine strategy requires that each piece is independent of the others. The strategy 
won't work if one piece depends on another. 
Using plyr 
library(plyr) 
ddply(iris, ~Species, function(x) colMeans(x[, -which(colnames(x)=="Species")])) 
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width 
## 1 setosa 5.006 3.428 1.462 0.246 
## 2 versicolor 5.936 2.770 4.260 1.326 
## 3 virginica 6.588 2.974 5.552 2.026 
mean.iris.loop 
## Sepal.Length Sepal.Width Petal.Length Petal.Width 
## setosa 5.006 3.428 1.462 0.246 
## versicolor 5.936 2.770 4.260 1.326 
## virginica 6.588 2.974 5.552 2.026 
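For comparison, the same split-apply-combine result can be sketched in base R without plyr. The names agg, pieces and means are illustrative.

```r
# formula interface: mean of every other column, grouped by Species
agg <- aggregate(. ~ Species, data = iris, FUN = mean)
agg

# the three steps spelled out
pieces <- split(iris[, -5], iris$Species)   # split: one data frame per species
means  <- t(sapply(pieces, colMeans))       # apply colMeans, combine into a matrix
means
```

aggregate() returns a data frame with Species as a column, while the split()/sapply() route returns a matrix with the species as row names.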
Merging data frames 
# Make a data frame mapping story numbers to titles 
stories <- read.table(header=TRUE, text=' 
storyid title 
1 lions 
2 tigers 
3 bears 
') 
# Make another data frame with the data and story numbers (no titles) 
data <- read.table(header=TRUE, text=' 
subject storyid rating 
1 1 6.7 
1 2 4.5 
1 3 3.7 
2 2 3.3 
2 3 4.1 
2 1 5.2 
') 
Merge the two data frames 
merge(stories, data, by="storyid") 
## storyid title subject rating 
## 1 1 lions 1 6.7 
## 2 1 lions 2 5.2 
## 3 2 tigers 1 4.5 
## 4 2 tigers 2 3.3 
## 5 3 bears 1 3.7 
## 6 3 bears 2 4.1 
If the two data frames have different names for the columns you want to match on, the names can be 
specified: 
# In this case, the column is named 'id' instead of storyid 
stories2 <- read.table(header=TRUE, text=' 
id title 
1 lions 
2 tigers 
3 bears 
') 
merge(x=stories2, y=data, by.x="id", by.y="storyid") 
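By default merge() keeps only the rows that match in both data frames. The sketch below uses small made-up data frames (stories and ratings here are hypothetical, with deliberately unmatched ids) to show how the all, all.x and all.y arguments change that.

```r
# story 3 and 4 have no rating; the rating for storyid 5 has no story
stories <- data.frame(storyid = 1:4,
                      title = c("lions", "tigers", "bears", "wolves"))
ratings <- data.frame(subject = c(1, 1, 2),
                      storyid = c(1, 2, 5),
                      rating = c(6.7, 4.5, 3.3))

inner <- merge(stories, ratings, by = "storyid")                # matching ids only
left  <- merge(stories, ratings, by = "storyid", all.x = TRUE)  # keep every story
full  <- merge(stories, ratings, by = "storyid", all = TRUE)    # keep everything

nrow(inner)
nrow(left)
nrow(full)
```

Rows without a match are filled with NA in the columns that come from the other data frame.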
Resources and Materials used 
· Data Manipulation with R by Phil Spector 
· Getting and Cleaning Data Coursera course 
· plyr by Hadley Wickham 
· Andrew Jaffe Notes 
· R Cookbook 
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translation
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
Oppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and FilmOppenheimer Film Discussion for Philosophy and Film
Oppenheimer Film Discussion for Philosophy and Film
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
TEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docxTEACHER REFLECTION FORM (NEW SET........).docx
TEACHER REFLECTION FORM (NEW SET........).docx
 
Millenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptxMillenials and Fillennials (Ethical Challenge and Responses).pptx
Millenials and Fillennials (Ethical Challenge and Responses).pptx
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptx
 

Data manipulation on r

  • 1. Data Manipulation on R
      Factor Manipulations, Subset, Sorting and Reshape
      Abhik Seal
      Indiana University School of Informatics and Computing (dsdht.wikispaces.com)
  • 2. Basic Manipulating Data
      So far, we've covered how to read in data from various sources, such as files, the internet, and databases, and how to read various file formats. In this session we look at how to manipulate the data after reading it in, so that later processing is easier.
  • 3. Sorting and Ordering data
      sort(x, decreasing=FALSE): 'sort (or order) a vector or factor (partially) into ascending or descending order.'
      order(..., decreasing=FALSE): 'returns a permutation which rearranges its first argument into ascending or descending order, breaking ties by further arguments.'
      x <- c(1,5,7,8,3,12,34,2)
      sort(x)
      ## [1]  1  2  3  5  7  8 12 34
      order(x)
      ## [1] 1 8 5 2 3 4 6 7
  • 4. Some examples of sorting and ordering
      # sort by mpg (columns are referenced through mtcars$, so attach() is not needed)
      newdata <- mtcars[order(mtcars$mpg),]
      head(newdata,3)
      ##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
      ## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
      ## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
      ## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
      # sort by mpg, breaking ties by cyl
      newdata <- mtcars[order(mtcars$mpg, mtcars$cyl),]
      head(newdata,3)
      ##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
      ## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
      ## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
      ## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
  • 5. Ordering with plyr
      library(plyr)
      head(arrange(mtcars,mpg),3)
      ##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
      ## 1 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
      ## 2 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
      ## 3 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
      head(arrange(mtcars,desc(mpg)),3)
      ##    mpg cyl disp hp drat    wt  qsec vs am gear carb
      ## 1 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
      ## 2 32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
      ## 3 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2
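The sorting slides above can also be reproduced in base R alone. The following is a minimal sketch, assuming we want cyl ascending and mpg descending within each cyl group; the variable names by_cyl_mpg and same are illustrative and not from the slides.

```r
# Sort mtcars by cyl (ascending), then by mpg (descending) within each cyl group.
# Negating a numeric column inside order() reverses its direction.
by_cyl_mpg <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]
head(by_cyl_mpg[, c("mpg", "cyl")], 3)

# with() is a shorter spelling of the same call; it avoids attach().
same <- mtcars[with(mtcars, order(cyl, -mpg)), ]
identical(by_cyl_mpg, same)
```

Unlike plyr's arrange(), indexing with order() keeps the row names, which matters for mtcars because the car names are stored there.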
  • 6. Subsetting data
      set.seed(12345)
      # create a data frame
      # (the values shown are from the slides; sample() results for a given seed changed in R 3.6.0)
      X <- data.frame("A"=sample(1:10),"B"=sample(11:20),"C"=sample(21:30))
      # shuffle the rows, then add NA values to column B
      X <- X[sample(1:10),]
      X$B[c(1,6,10)] <- NA
      head(X)
      ##     A  B  C
      ## 8   4 NA 27
      ## 1   8 11 25
      ## 2  10 12 23
      ## 5   3 13 24
      ## 3   7 16 28
      ## 10  5 NA 26
  • 7. Basic data subsetting
      # accessing only the first row
      X[1,]
      ##   A  B  C
      ## 8 4 NA 27
      # accessing only the first column
      X[,1]
      ##  [1]  4  8 10  3  7  5  9  1  2  6
      # accessing the element in the first row and first column
      X[1,1]
      ## [1] 4
  • 8. AND/OR conditions
      head(X[(X$A <=6 & X$C > 24),],3)
      ##    A  B  C
      ## 8  4 NA 27
      ## 10 5 NA 26
      ## 7  2 19 29
      head(X[(X$A <=6 | X$C > 24),],3)
      ##   A  B  C
      ## 8 4 NA 27
      ## 1 8 11 25
      ## 5 3 13 24
  • 9. Select non-NA values in a data frame
      # select the rows without NA values in column B
      # (is.na() is the reliable test; comparing with the string 'NA' only works by accident)
      head(X[!is.na(X$B),],4)
      ##    A  B  C
      ## 1  8 11 25
      ## 2 10 12 23
      ## 5  3 13 24
      ## 3  7 16 28
      # select the rows where B > 11; which() drops the NA comparisons
      head(X[which(X$B>11),],4)
      ##    A  B  C
      ## 2 10 12 23
      ## 5  3 13 24
      ## 3  7 16 28
      ## 4  9 20 30
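To see why the slides wrap the condition in which(), it helps to compare the row counts on a tiny data frame. This is a small sketch with made-up values; df and the result names are hypothetical and not taken from the slides.

```r
# A small data frame with missing values in column B (values are made up).
df <- data.frame(A = 1:5, B = c(11, NA, 13, NA, 15))

# Keep only rows where B is present.
complete_b <- df[!is.na(df$B), ]

# A bare logical condition propagates NA: each NA comparison yields an all-NA row.
with_na_rows <- df[df$B > 12, ]

# which() keeps only the positions that are TRUE, silently dropping the NAs.
no_na_rows <- df[which(df$B > 12), ]

# subset() treats NA in the condition as FALSE, giving the same rows as which().
via_subset <- subset(df, B > 12)

nrow(with_na_rows)  # 4: two real matches plus two NA rows
nrow(no_na_rows)    # 2
```

The choice between which() and subset() is mostly stylistic here; the point is that a plain logical index does not drop missing values by itself.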
  • 10. # creating a data frame with 2 variables
      data <- data.frame(x1=c(2,3,4,5,6),x2=c(5,6,7,8,1))
      list_data <- list(dat=data,vec.obj=c(1,2,3))
      list_data
      ## $dat
      ##   x1 x2
      ## 1  2  5
      ## 2  3  6
      ## 3  4  7
      ## 4  5  8
      ## 5  6  1
      ##
      ## $vec.obj
      ## [1] 1 2 3
      # accessing the second element of list_data
      list_data[[2]]
      ## [1] 1 2 3
  • 11. Factors
      Factors are used to represent categorical data, and can also be used for ordinal data (i.e. categories that have an intrinsic ordering).
      Note that before R 4.0.0, functions like read.table() read character strings in as factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE.
      'The function factor is used to encode a vector as a factor (the terms 'category' and 'enumerated type' are also used for factors). If argument ordered is TRUE, the factor levels are assumed to be ordered. For compatibility with S there is also a function ordered.'
      'is.factor, is.ordered, as.factor and as.ordered are the membership and coercion functions for these classes.'
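  • 12. Factors
      Suppose we have a vector of case-control status
      cc=factor(c("case","case","case","control","control","control"))
      cc
      ## [1] case    case    case    control control control
      ## Levels: case control
      # note: assigning to levels() renames the labels in place, so the displayed values change;
      # to reorder the levels without changing the data, use factor(cc, levels=c("control","case"))
      levels(cc)=c("control","case")
      cc
      ## [1] control control control case    case    case
      ## Levels: control case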
  • 13. Factors
      Factors can be converted to numeric or character very easily
      x=factor(c("case","case","case","control","control","control"),levels=c("control","case"))
      as.character(x)
      ## [1] "case"    "case"    "case"    "control" "control" "control"
      as.numeric(x)
      ## [1] 2 2 2 1 1 1
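Two factor pitfalls are easy to miss in the slides above: assigning to levels() relabels the data, and as.numeric() returns the internal level codes rather than the printed values. The sketch below illustrates both on small made-up vectors; all variable names are illustrative.

```r
cc <- factor(c("case", "case", "control"))  # default levels: "case", "control"

# Assigning to levels() renames the labels, so every "case" becomes "control".
renamed <- cc
levels(renamed) <- c("control", "case")
as.character(renamed)    # "control" "control" "case"

# Passing levels= to factor() reorders the levels and leaves the data alone.
reordered <- factor(cc, levels = c("control", "case"))
as.character(reordered)  # "case" "case" "control"

# relevel() is a shortcut for moving one level to the front (the reference level).
releveled <- relevel(cc, ref = "control")

# as.numeric() gives the level codes, not the numbers that the labels spell out.
f <- factor(c("10", "20", "10"))
as.numeric(f)                # 1 2 1
as.numeric(as.character(f))  # 10 20 10
```

Going through as.character() first is the usual way to recover numbers that were read in as a factor.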
  • 14. Cut
      Now that we know more about factors, cut() will make more sense:
      x=1:100
      cx=cut(x,breaks=c(0,10,25,50,100))
      head(cx)
      ## [1] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10]
      ## Levels: (0,10] (10,25] (25,50] (50,100]
      table(cx)
      ## cx
      ##   (0,10]  (10,25]  (25,50] (50,100]
      ##       10       15       25       50
  • 15. Cut
      We can also leave off the labels
      cx=cut(x,breaks=c(0,10,25,50,100),labels=FALSE)
      head(cx)
      ## [1] 1 1 1 1 1 1
      table(cx)
      ## cx
      ##  1  2  3  4
      ## 10 15 25 50
  • 16. Cut
      # values outside the range of the breaks become NA
      cx=cut(x,breaks=c(10,25,50),labels=FALSE)
      head(cx)
      ## [1] NA NA NA NA NA NA
      table(cx)
      ## cx
      ##  1  2
      ## 15 25
      table(cx,useNA="ifany")
      ## cx
      ##    1    2 <NA>
      ##   15   25   60
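The cut() slides above use intervals that are open on the left, which is why the breaks start at 0 rather than 1. The sketch below shows what happens when the first break equals the smallest value, and how include.lowest and custom labels change the result; the label names are made up for illustration.

```r
x <- 1:100
breaks <- c(1, 10, 25, 50, 100)

# With the default right-closed intervals, x = 1 is not inside (1,10] and becomes NA.
open_left <- cut(x, breaks = breaks)
sum(is.na(open_left))  # 1

# include.lowest = TRUE closes the first interval to [1,10]; labels= names the bins.
closed <- cut(x, breaks = breaks, include.lowest = TRUE,
              labels = c("low", "mid", "high", "top"))
table(closed)
```

Checking sum(is.na(...)) after a cut() call is a cheap way to catch values that fell outside the breaks.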
  • 17. Adding to data frames
      m1=matrix(1:9,nrow=3,ncol=3,byrow=FALSE)
      m1
      ##      [,1] [,2] [,3]
      ## [1,]    1    4    7
      ## [2,]    2    5    8
      ## [3,]    3    6    9
      m2=matrix(1:9,nrow=3,ncol=3,byrow=TRUE)
      m2
      ##      [,1] [,2] [,3]
      ## [1,]    1    2    3
      ## [2,]    4    5    6
      ## [3,]    7    8    9
  • 18. Adding using cbind
      You can add columns (or another matrix/data frame) to a data frame or matrix using cbind() ('column bind'). You can also add rows (or another matrix/data frame) using rbind() ('row bind').
      Note that the vector you are adding has to have the same length as the number of rows (for cbind()) or the number of columns (for rbind()).
      cbind(m1,m2)
      ##      [,1] [,2] [,3] [,4] [,5] [,6]
      ## [1,]    1    4    7    1    2    3
      ## [2,]    2    5    8    4    5    6
      ## [3,]    3    6    9    7    8    9
  • 19. Reshape data
      A dataset's layout can be long or wide. In the long layout, multiple rows represent a single subject's record, whereas in the wide layout a single row represents a single subject's record. Some statistical analyses require wide data and others require long data, so we often need to reshape the data to meet the requirements of the analysis. Data reshaping is just a rearrangement of the form of the data; it does not change the content of the dataset.
      This section focuses mainly on the melt and cast paradigm of reshaping datasets, which is implemented in the contributed package reshape. The package was later reimplemented under a new name, reshape2, which is much more time and memory efficient. See the paper "Reshaping Data with the reshape Package" by Wickham (http://www.jstatsoft.org/v21/i12/paper).
  • 20. Wide data has a column for each variable. For example, this is wide-format data:
      #   ozone   wind  temp
      # 1 23.62 11.623 65.55
      # 2 29.44 10.267 79.10
      # 3 59.12  8.942 83.90
      # 4 59.96  8.794 83.97
      Data in long format
      #    variable  value
      # 1     ozone 23.615
      # 2     ozone 29.444
      # 3     ozone 59.115
      # 4     ozone 59.962
      # 5      wind 11.623
      # 6      wind 10.267
      # 7      wind  8.942
      # 8      wind  8.794
      # 9      temp 65.548
      # 10     temp 79.100
      # 11     temp 83.903
      # 12     temp 83.968
  • 21. reshape2 Package
      "In reality, you need long-format data much more commonly than wide-format data. For example, ggplot2 requires long-format data, plyr requires long-format data, and most modelling functions (such as lm(), glm(), and gam()) require long-format data. But people often find it easier to record their data in wide format."
      reshape2 is based around two key functions, melt and cast:
      · melt takes wide-format data and melts it into long-format data.
      · cast takes long-format data and casts it into wide-format data.
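  • 22. Melt
      library(reshape2)
      # the built-in airquality columns are capitalised (Ozone, Solar.R, ...);
      # lower-case them so the names match the outputs and id.vars used on these slides
      names(airquality) <- tolower(names(airquality))
      head(airquality,2)
      ##   ozone solar.r wind temp month day
      ## 1    41     190  7.4   67     5   1
      ## 2    36     118  8.0   72     5   2
      aql <- melt(airquality) # [a]ir [q]uality [l]ong format
      head(aql,5)
      ##   variable value
      ## 1    ozone    41
      ## 2    ozone    36
      ## 3    ozone    12
      ## 4    ozone    18
      ## 5    ozone    NA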
  • 23. By default, melt has assumed that all columns with numeric values are variables with values. Maybe here we want to know the values of ozone, solar.r, wind, and temp for each month and day. We can do that with melt by telling it that we want month and day to be "ID variables". ID variables are the variables that identify individual rows of data.
      m <- melt(airquality, id.vars = c("month", "day"))
      head(m,4)
      ##   month day variable value
      ## 1     5   1    ozone    41
      ## 2     5   2    ozone    36
      ## 3     5   3    ozone    12
      ## 4     5   4    ozone    18
  • 24. melt also allows us to control the column names in the long data format
      m <- melt(airquality, id.vars = c("month", "day"),
                variable.name = "climate_variable",
                value.name = "climate_value")
      head(m)
      ##   month day climate_variable climate_value
      ## 1     5   1            ozone            41
      ## 2     5   2            ozone            36
      ## 3     5   3            ozone            12
      ## 4     5   4            ozone            18
      ## 5     5   5            ozone            NA
      ## 6     5   6            ozone            28
  • 25. Long- to wide-format data: the cast functions
      In reshape2 there are multiple cast functions. Since you will most commonly work with data.frame objects, we'll explore the dcast function. (There is also acast to return a vector, matrix, or array.) dcast uses a formula to describe the shape of the data.
      m <- melt(airquality, id.vars = c("month", "day"))
      aqw <- dcast(m, month + day ~ variable)
      head(aqw)
      ##   month day ozone solar.r wind temp
      ## 1     5   1    41     190  7.4   67
      ## 2     5   2    36     118  8.0   72
      ## 3     5   3    12     149 12.6   74
      ## 4     5   4    18     313 11.5   62
      ## 5     5   5    NA      NA 14.3   56
      ## 6     5   6    28      NA 14.9   66
      Here, we need to tell dcast that month and day are the ID variables. Besides re-arranging the columns, we've recovered our original data.
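The wide-to-long round trip can also be sketched without reshape2, using base R's stack() and unstack(). This is a swapped-in base R technique, not the melt/dcast API from the slides, and it has no equivalent of id.vars; the names wide, long and back are illustrative.

```r
# Take a small wide slice of the built-in airquality data (original column names).
wide <- airquality[1:4, c("Ozone", "Wind", "Temp")]

# stack() is a rough base R analogue of melt() with no ID variables:
# it returns one row per cell, with columns "values" and "ind" (the source column).
long <- stack(wide)
dim(long)  # 12 rows, 2 columns

# unstack() reverses the operation, much as dcast() does for melted data.
back <- unstack(long)
all.equal(back$Wind, wide$Wind)
```

For data with identifier columns such as month and day, melt() and dcast() remain the more convenient tools, which is presumably why the slides use them.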
  • 26. Data Manipulation Using plyr
      For large-scale data, we can split the dataset, perform the manipulation or analysis on each piece, and then combine the results into a single output again. Doing this kind of split with base R alone is not very efficient; to overcome this limitation, Wickham developed the R package plyr in 2011, which efficiently implements the split-apply-combine strategy. We can compare this strategy to the map-reduce strategy for processing large amounts of data.
      In the coming slides I will give examples of the split-apply-combine strategy:
      · Without Loops
      · With Loops
      · Using plyr package
  • 27. Without loops
      I am using the iris dataset here
      1. Split the iris dataset into three parts.
      2. Remove the species name variable from the data.
      3. Calculate the mean of each variable for the three different parts separately.
      4. Combine the output into a single data frame.
      # splitting by species and dropping the Species column (the split step)
      iris.set <- iris[iris$Species=="setosa",-5]
      iris.versi <- iris[iris$Species=="versicolor",-5]
      iris.virg <- iris[iris$Species=="virginica",-5]
      # calculating mean for each piece (the apply step)
      mean.set <- colMeans(iris.set)
      mean.versi <- colMeans(iris.versi)
      mean.virg <- colMeans(iris.virg)
      # combining the output (the combine step)
      mean.iris <- rbind(mean.set,mean.versi,mean.virg)
      # giving row names so that the output could be easily understood
      rownames(mean.iris) <- c("setosa","versicolor","virginica")
  • 28. With Loops
      mean.iris.loop <- NULL
      for(species in unique(iris$Species))
      {
        iris_sub <- iris[iris$Species==species,]
        column_means <- colMeans(iris_sub[,-5])
        mean.iris.loop <- rbind(mean.iris.loop,column_means)
      }
      # giving row names so that the output could be easily understood
      rownames(mean.iris.loop) <- unique(iris$Species)
      NB: The split-apply-combine strategy requires each piece to be independent of the others. The strategy won't work if one piece depends on another.
  • 29. Using plyr
      library(plyr)
      ddply(iris,~Species,function(x) colMeans(x[,-which(colnames(x)=="Species")]))
      ##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
      ## 1     setosa        5.006       3.428        1.462       0.246
      ## 2 versicolor        5.936       2.770        4.260       1.326
      ## 3  virginica        6.588       2.974        5.552       2.026
      mean.iris.loop
      ##            Sepal.Length Sepal.Width Petal.Length Petal.Width
      ## setosa            5.006       3.428        1.462       0.246
      ## versicolor        5.936       2.770        4.260       1.326
      ## virginica         6.588       2.974        5.552       2.026
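For comparison with the three approaches above, base R has two compact spellings of the same split-apply-combine computation. This is a sketch of alternatives, not part of the original slides; means_agg and means_split are illustrative names.

```r
# aggregate() with a formula: split by Species, apply mean to every other column,
# and combine the results into a data frame with one row per species.
means_agg <- aggregate(. ~ Species, data = iris, FUN = mean)
means_agg

# The same three steps written out: split() the numeric columns by species,
# sapply() colMeans over the pieces, then t() so that species are the rows.
means_split <- t(sapply(split(iris[, -5], iris$Species), colMeans))
means_split
```

aggregate() returns a data frame like ddply(), while the split()/sapply() version returns a matrix with species as row names, like mean.iris.loop.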
  • 30. Merging data frames
      # Make a data frame mapping story numbers to titles
      stories <- read.table(header=TRUE, text='
      storyid title
      1 lions
      2 tigers
      3 bears
      ')
      # Make another data frame with the data and story numbers (no titles)
      data <- read.table(header=TRUE, text='
      subject storyid rating
      1 1 6.7
      1 2 4.5
      1 3 3.7
      2 2 3.3
      2 3 4.1
      2 1 5.2
      ')
  • 31. Merge the two data frames
      merge(stories, data, "storyid")
      ##   storyid  title subject rating
      ## 1       1  lions       1    6.7
      ## 2       1  lions       2    5.2
      ## 3       2 tigers       1    4.5
      ## 4       2 tigers       2    3.3
      ## 5       3  bears       1    3.7
      ## 6       3  bears       2    4.1
      If the two data frames have different names for the columns you want to match on, the names can be specified:
      # In this case, the column is named 'id' instead of storyid
      stories2 <- read.table(header=TRUE, text='
      id title
      1 lions
      2 tigers
      3 bears
      ')
      merge(x=stories2, y=data, by.x="id", by.y="storyid")
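The merges above keep only the keys present in both data frames. The sketch below shows how the all.x and all arguments of merge() change that, using a hypothetical ratings table whose values are made up so that the keys only partly overlap.

```r
stories <- data.frame(storyid = 1:3, title = c("lions", "tigers", "bears"))
# Hypothetical ratings: story 3 has no rating, and story 4 has no title.
ratings <- data.frame(storyid = c(1, 2, 2, 4), rating = c(6.7, 4.5, 3.3, 5.0))

# Default (inner join): only storyid values found in both data frames.
inner <- merge(stories, ratings, by = "storyid")

# all.x = TRUE (left join): keep every story, with NA where no rating exists.
left <- merge(stories, ratings, by = "storyid", all.x = TRUE)

# all = TRUE (full join): keep unmatched rows from both sides.
full <- merge(stories, ratings, by = "storyid", all = TRUE)

c(inner = nrow(inner), left = nrow(left), full = nrow(full))
```

Comparing the row counts of the three results is a quick check for keys that fail to match.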
  • 32. Resources and Materials used
      · Data Manipulation with R by Phil Spector
      · Getting and Cleaning Data, Coursera course
      · plyr by Hadley Wickham
      · Andrew Jaffe Notes
      · R cookbook