SlideShare une entreprise Scribd logo
1  sur  40
DataHandlinginRDataHandlinginR
Getting,ReadingandCleaningdata
Abhik Seal
Indiana University School of Informatics and Computing(dsdht.wikispaces.com)
Get/set your working directoryGet/set your working directory
A basic component of working with data is knowing your working directory
The two main commands are getwd() and setwd().
Be aware of relative versus absolute paths
Important difference in Windows setwd("C:UsersdatascDownloads")
·
·
·
Relative - setwd("./data"), setwd("../")
Absolute - setwd("/Users/datasc/data/")
-
-
·
2/40
Checking for and creating directoriesChecking for and creating directories
file.exists("directoryName") will check to see if the directory exists
dir.create("directoryName") will create a directory if it doesn't exist
Here is an example checking for a "data" directory and creating it if it doesn't exist
·
·
·
if(!file.exists("data")){
dir.create("data")
}
3/40
Reading data filesReading data files
We wil look at each of the methods
From Internet
Reading local files
Reading Excel Files
Reading XML
Reading JSON
Reading MySQL
Reading HDF5
Reading from other resources
·
·
·
·
·
·
·
·
4/40
Getting data from InternetGetting data from Internet
Data from Healthit.gov
Use of download.file()
Useful for downloading tab-delimited, csv, and other files
·
·
fileUrl <- "http://dashboard.healthit.gov/data/data/NAMCS_2008-2013.csv"
download.file(fileUrl,destfile="./data/NAMCS.csv",method="curl")
list.files("./data")
## [1] "NAMCS.csv"
5/40
Getting data from InternetGetting data from Internet
Reading the data using read.csv()
data<-read.csv("http://dashboard.healthit.gov/data/data/NAMCS_2008-2013.csv")
head(data,2)
## Region Period Adoption.of.Basic.EHRs..Overall.Physician.Practices
## 1 Alabama 2013 0.48
## 2 Alaska 2013 0.50
## Adoption.of.Basic.EHRs..Primary.Care.Providers
## 1 0.50
## 2 0.52
## Adoption.of.Basic.EHRs..Rural.Providers
## 1 0.54
## 2 0.37
## Adoption.of.Basic.EHRs..Small.Practices
## 1 0.40
## 2 0.39
## Percent.of.office.based.physicians.with.computerized.capability.to.view.lab.results
## 1 0.74
## 2 0.75 6/40
Some notes about download.file()Some notes about download.file()
If the url starts with http you can use download.file()
If the url starts with https on Windows you may be ok
If the url starts with https on Mac you may need to set method="curl"
If the file is big, this might take a while
Be sure to record when you downloaded.
·
·
·
·
·
7/40
Loading flat files - read.table()Loading flat files - read.table()
This is the main function for reading data into R
Flexible and robust but requires more parameters
Reads the data into RAM - big data can cause problems
Important parameters file, header, sep, row.names, nrows
Related: read.csv(), read.csv2()
Both read.table() and read.fwf() use scan to read the file, and then process the results of scan.
They are very convenient, but sometimes it is better to use scan directly
·
·
·
·
·
·
8/40
Example dataExample data
fileUrl <- "http://dashboard.healthit.gov/data/data/NAMCS_2008-2013.csv"
download.file(fileUrl,destfile="./data/NAMCS.csv",method="curl")
list.files("./data")
## [1] "NAMCS.csv"
Data <- read.table("./data/NAMCS.csv")
## Error: line 2 did not have 87 elements
head(Data,2)
## Error: object 'Data' not found
9/40
Example parametersExample parameters
read.csv sets sep="," and header=TRUE
same as
cameraData <- read.table("./data/NAMCS.csv",sep=",",header=TRUE)
cameraData <- read.csv("./data/NAMCS.csv")
head(cameraData)
## Region Period Adoption.of.Basic.EHRs..Overall.Physician.Practices
## 1 Alabama 2013 0.48
## 2 Alaska 2013 0.50
## 3 Arizona 2013 0.51
## 4 Arkansas 2013 0.46
## 5 California 2013 0.54
## 6 Colorado 2013 0.39
## Adoption.of.Basic.EHRs..Primary.Care.Providers
## 1 0.50
## 2 0.52
## 3 0.63
10/40
Some more important parametersSome more important parameters
People face trouble with reading flat files those have quotation marks ` or " placed in data values,
setting quote="" often resolves these.
quote - you can tell R whether there are any quoted values quote="" means no quotes.
na.strings - set the character that represents a missing value.
nrows - how many rows to read of the file (e.g. nrows=10 reads 10 lines).
skip - number of lines to skip before starting to read
·
·
·
·
11/40
read.xlsx(), read.xlsx2() {xlsx package}read.xlsx(), read.xlsx2() {xlsx package}
Reading specific rows and columnsReading specific rows and columns
library(xlsx)
Data <- read.xlsx("./data/ADME_genes.xlsx",sheetIndex=1,header=TRUE)
colIndex <- 2:3
rowIndex <- 1:4
dataSub <- read.xlsx("./data/ADME_genes.xlsx",sheetIndex=1,
colIndex=colIndex,rowIndex=rowIndex)
12/40
Further notesFurther notes
The write.xlsx function will write out an Excel file with similar arguments.
read.xlsx2 is much faster than read.xlsx but for reading subsets of rows may be slightly unstable.
The XLConnect is a Java-based solution, so it is cross platform and returns satisfactory results.
For large data sets it may be very slow.
xlsReadWrite is very fast: it doesn't support .xlsx files
gdata package provides a good cross platform solutions. It is available for Windows, Mac or
Linux. gdata requires you to install additional Perl libraries. Perl is usually already installed in
Linux and Mac, but sometimes require more effort in Windows platforms.
In general it is advised to store your data in either a database or in comma separated files (.csv)
or tab separated files (.tab/.txt) as they are easier to distribute.
I found on the web a self made function to easily import xlsx files. It should work in all platforms
and use XML
·
·
·
·
·
·
·
source("https://gist.github.com/schaunwheeler/5825002/raw/3526a15b032c06392740e20b6c9a179add2cee49/
xlsxToR = function("myfile.xlsx", header = TRUE)
13/40
Working with XMLWorking with XML
http://en.wikipedia.org/wiki/XML
Extensible markup language
Frequently used to store structured data
Particularly widely used in internet applications
Extracting XML is the basis for most web scraping
Components
·
·
·
·
·
Markup - labels that give the text structure
Content - the actual text of the document
-
-
14/40
Read the file into RRead the file into R
library(XML)
fileUrl <- "http://www.w3schools.com/xml/simple.xml"
doc <- xmlTreeParse(fileUrl,useInternal=TRUE)
rootNode <- xmlRoot(doc)
xmlName(rootNode)
## [1] "breakfast_menu"
names(rootNode)
## food food food food food
## "food" "food" "food" "food" "food"
15/40
Directly access parts of the XML documentDirectly access parts of the XML document
rootNode[[1]]
## <food>
## <name>Belgian Waffles</name>
## <price>$5.95</price>
## <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
## <calories>650</calories>
## </food>
rootNode[[1]][[1]]
## <name>Belgian Waffles</name>
Go for a tour of XML package
Official XML tutorials short, long
An outstanding guide to the XML package
·
·
·
16/40
JSONJSON
http://en.wikipedia.org/wiki/JSON
Javascript Object Notation
Lightweight data storage
Common format for data from application programming interfaces (APIs)
Similar structure to XML but different syntax/format
Data stored as
·
·
·
·
·
Numbers (double)
Strings (double quoted)
Boolean (true or false)
Array (ordered, comma separated enclosed in square brackets [])
Object (unorderd, comma separated collection of key:value pairs in curley brackets {})
-
-
-
-
-
17/40
Example JSON fileExample JSON file
18/40
Reading data from JSON {jsonlite package}Reading data from JSON {jsonlite package}
library(jsonlite)
# Using chembl api
jsonData <- fromJSON("https://www.ebi.ac.uk/chemblws/compounds/CHEMBL1.json")
names(jsonData)
## [1] "compound"
jsonData$compound$chemblId
## [1] "CHEMBL1"
jsonData$compound$stdInChiKey
## [1] "GHBOEFUAGSHXPO-XZOTUCIWSA-N"
19/40
Writing data frames to JSONWriting data frames to JSON
myjson <- toJSON(iris, pretty=TRUE)
cat(myjson)
## [
## {
## "Sepal.Length" : 5.1,
## "Sepal.Width" : 3.5,
## "Petal.Length" : 1.4,
## "Petal.Width" : 0.2,
## "Species" : "setosa"
## },
## {
## "Sepal.Length" : 4.9,
## "Sepal.Width" : 3,
## "Petal.Length" : 1.4,
## "Petal.Width" : 0.2,
## "Species" : "setosa"
## },
## {
## "Sepal.Length" : 4.7, 20/40
Further resourcesFurther resources
http://www.json.org/
A good tutorial on jsonlite - http://www.r-bloggers.com/new-package-jsonlite-a-smarter-json-
encoderdecoder/
jsonlite vignette
·
·
·
21/40
mySQLmySQL
http://en.wikipedia.org/wiki/MySQL http://www.mysql.com/
Free and widely used open source database software
Widely used in internet based applications
Data are structured in
Each row is called a record
·
·
·
Databases
Tables within databases
Fields within tables
-
-
-
·
22/40
Step 2 - Install RMySQL ConnectorStep 2 - Install RMySQL Connector
On a Mac: install.packages("RMySQL")
On Windows:
·
·
Official instructions - http://biostat.mc.vanderbilt.edu/wiki/Main/RMySQL (may be useful for
Mac/UNIX users as well)
Potentially useful guide - http://www.ahschulz.de/2013/07/23/installing-rmysql-under-
windows/
-
-
23/40
UCSC MySQLUCSC MySQL
http://genome.ucsc.edu/goldenPath/help/mysql.html
24/40
Connecting and listing databasesConnecting and listing databases
library(DBI)
library(RMySQL)
ucscDb <- dbConnect(MySQL(),user="genome",
host="genome-mysql.cse.ucsc.edu")
result <- dbGetQuery(ucscDb,"show databases;"); dbDisconnect(ucscDb);
## [1] TRUE
head(result)
## Database
## 1 information_schema
## 2 ailMel1
## 3 allMis1
## 4 anoCar1
## 5 anoCar2
## 6 anoGam1
25/40
Connecting to hg19 and listing tablesConnecting to hg19 and listing tables
library(RMySQL)
hg19 <- dbConnect(MySQL(),user="genome", db="hg19",
host="genome-mysql.cse.ucsc.edu")
allTables <- dbListTables(hg19)
length(allTables)
## [1] 11006
allTables[1:5]
## [1] "HInv" "HInvGeneMrna" "acembly" "acemblyClass"
## [5] "acemblyPep"
26/40
Get dimensions of a specific tableGet dimensions of a specific table
dbListFields(hg19,"affyU133Plus2")
## [1] "bin" "matches" "misMatches" "repMatches" "nCount"
## [6] "qNumInsert" "qBaseInsert" "tNumInsert" "tBaseInsert" "strand"
## [11] "qName" "qSize" "qStart" "qEnd" "tName"
## [16] "tSize" "tStart" "tEnd" "blockCount" "blockSizes"
## [21] "qStarts" "tStarts"
dbGetQuery(hg19, "select count(*) from affyU133Plus2")
## count(*)
## 1 58463
27/40
Read from the tableRead from the table
affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData)
## bin matches misMatches repMatches nCount qNumInsert qBaseInsert
## 1 585 530 4 0 23 3 41
## 2 585 3355 17 0 109 9 67
## 3 585 4156 14 0 83 16 18
## 4 585 4667 9 0 68 21 42
## 5 585 5180 14 0 167 10 38
## 6 585 468 5 0 14 0 0
## tNumInsert tBaseInsert strand qName qSize qStart qEnd tName
## 1 3 898 - 225995_x_at 637 5 603 chr1
## 2 9 11621 - 225035_x_at 3635 0 3548 chr1
## 3 2 93 - 226340_x_at 4318 3 4274 chr1
## 4 3 5743 - 1557034_s_at 4834 48 4834 chr1
## 5 1 29 - 231811_at 5399 0 5399 chr1
## 6 0 0 - 236841_at 487 0 487 chr1
## tSize tStart tEnd blockCount
## 1 249250621 14361 15816 5
## 2 249250621 14381 29483 17 28/40
Select a specific subsetSelect a specific subset
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
affyMis <- fetch(query); quantile(affyMis$misMatches)
## 0% 25% 50% 75% 100%
## 1 1 2 2 3
affyMisSmall <- fetch(query,n=10); dbClearResult(query);
## [1] TRUE
dim(affyMisSmall)
## [1] 10 22
# close connection
dbDisconnect(hg19)
29/40
Further resourcesFurther resources
RMySQL vignette http://cran.r-project.org/web/packages/RMySQL/RMySQL.pdf
R data import and export
Set up R odbc with postgres
A nice blog post summarizing some other commands http://www.r-bloggers.com/mysql-and-r/
·
·
·
·
30/40
HDF5HDF5
http://www.hdfgroup.org/
Used for storing large data sets
Supports storing a range of data types
Heirarchical data format
groups containing zero or more data sets and metadata
datasets multidimensional array of data elements with metadata
·
·
·
·
Have a group header with group name and list of attributes
Have a group symbol table with a list of objects in group
-
-
·
Have a header with name, datatype, dataspace, and storage layout
Have a data array with the data
-
-
31/40
R HDF5 packageR HDF5 package
The rhdf5 package works really well, although it is not in CRAN. To install it:
source("http://bioconductor.org/biocLite.R")
## Bioconductor version 2.13 (BiocInstaller 1.12), ?biocLite for help
## A newer version of Bioconductor is available after installing a new
## version of R, ?BiocUpgrade for help
biocLite("rhdf5")
## BioC_mirror: http://bioconductor.org
## Using Bioconductor version 2.13 (BiocInstaller 1.12.1), R version 3.0.3.
## Installing package(s) 'rhdf5'
##
## The downloaded binary packages are in
## /var/folders/pm/jg6blwt55b71g8jl64wfw8ch0000gn/T//RtmpuYnNzs/downloaded_packages
32/40
Creating an HDF5 file and group hierarchyCreating an HDF5 file and group hierarchy
library(rhdf5)
h5createFile("myhdf5.h5")
## [1] TRUE
h5createGroup("myhdf5.h5","foo")
## [1] TRUE
h5createGroup("myhdf5.h5","baa")
## [1] TRUE
h5createGroup("myhdf5.h5","foo/foobaa")
## [1] TRUE
33/40
hdf5 continuedhdf5 continued
Saving multiple objects to an HDF5 file
h5ls("myhdf5.h5")
## group name otype dclass dim
## 0 / baa H5I_GROUP
## 1 / foo H5I_GROUP
## 2 /foo foobaa H5I_GROUP
A = 1:7; B = 1:18; D = seq(0,1,by=0.1)
h5save(A, B, D, file="newfile2.h5")
h5dump("newfile2.h5")
## $A
## [1] 1 2 3 4 5 6 7
##
## $B
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 34/40
Reading from other resourcesReading from other resources
foreign package
Loads data from Minitab, S, SAS, SPSS, Stata,Systat
Basic functions read.foo
See the help page for more details http://cran.r-project.org/web/packages/foreign/foreign.pdf
·
·
read.arff (Weka)
readline() read from console
read.dta (Stata)
read.clipboard()
read.mtp (Minitab)
read.octave (Octave)
read.spss (SPSS)
read.xport (SAS)
-
-
-
-
-
-
-
-
·
35/40
Reading imagesReading images
jpeg - http://cran.r-project.org/web/packages/jpeg/index.html
readbitmap - http://cran.r-project.org/web/packages/readbitmap/index.html
png - http://cran.r-project.org/web/packages/png/index.html
EBImage (Bioconductor) - http://www.bioconductor.org/packages/2.13/bioc/html/EBImage.html
·
·
·
·
36/40
Reading GIS dataReading GIS data
rgdal - http://cran.r-project.org/web/packages/rgdal/index.html
rgeos - http://cran.r-project.org/web/packages/rgeos/index.html
raster - http://cran.r-project.org/web/packages/raster/index.html
·
·
·
37/40
Reading music dataReading music data
tuneR - http://cran.r-project.org/web/packages/tuneR/
seewave - http://rug.mnhn.fr/seewave/
·
·
38/40
AcknowledgemntAcknowledgemnt
Jeff Leek University of Washington and Coursera Getting and Cleaning data
R For Natural Resources Course
R Data import comprehensive guide
·
·
·
39/40
40/40

Contenu connexe

Tendances

4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply FunctionSakthi Dasans
 
R Programming: Export/Output Data In R
R Programming: Export/Output Data In RR Programming: Export/Output Data In R
R Programming: Export/Output Data In RRsquared Academy
 
R Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RR Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RRsquared Academy
 
Stata cheat sheet: data transformation
Stata  cheat sheet: data transformationStata  cheat sheet: data transformation
Stata cheat sheet: data transformationTim Essam
 
Stata cheatsheet transformation
Stata cheatsheet transformationStata cheatsheet transformation
Stata cheatsheet transformationLaura Hughes
 
January 2016 Meetup: Speeding up (big) data manipulation with data.table package
January 2016 Meetup: Speeding up (big) data manipulation with data.table packageJanuary 2016 Meetup: Speeding up (big) data manipulation with data.table package
January 2016 Meetup: Speeding up (big) data manipulation with data.table packageZurich_R_User_Group
 
Stata cheat sheet: data processing
Stata cheat sheet: data processingStata cheat sheet: data processing
Stata cheat sheet: data processingTim Essam
 
3 R Tutorial Data Structure
3 R Tutorial Data Structure3 R Tutorial Data Structure
3 R Tutorial Data StructureSakthi Dasans
 
Stata Cheat Sheets (all)
Stata Cheat Sheets (all)Stata Cheat Sheets (all)
Stata Cheat Sheets (all)Laura Hughes
 
SAS and R Code for Basic Statistics
SAS and R Code for Basic StatisticsSAS and R Code for Basic Statistics
SAS and R Code for Basic StatisticsAvjinder (Avi) Kaler
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Serban Tanasa
 
Move your data (Hans Rosling style) with googleVis + 1 line of R code
Move your data (Hans Rosling style) with googleVis + 1 line of R codeMove your data (Hans Rosling style) with googleVis + 1 line of R code
Move your data (Hans Rosling style) with googleVis + 1 line of R codeJeffrey Breen
 
R code descriptive statistics of phenotypic data by Avjinder Kaler
R code descriptive statistics of phenotypic data by Avjinder KalerR code descriptive statistics of phenotypic data by Avjinder Kaler
R code descriptive statistics of phenotypic data by Avjinder KalerAvjinder (Avi) Kaler
 
Stata Programming Cheat Sheet
Stata Programming Cheat SheetStata Programming Cheat Sheet
Stata Programming Cheat SheetLaura Hughes
 
Manipulating data with dates
Manipulating data with datesManipulating data with dates
Manipulating data with datesRupak Roy
 
3. R- list and data frame
3. R- list and data frame3. R- list and data frame
3. R- list and data framekrishna singh
 

Tendances (19)

4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function
 
R Programming: Export/Output Data In R
R Programming: Export/Output Data In RR Programming: Export/Output Data In R
R Programming: Export/Output Data In R
 
R seminar dplyr package
R seminar dplyr packageR seminar dplyr package
R seminar dplyr package
 
R Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RR Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In R
 
Stata cheat sheet: data transformation
Stata  cheat sheet: data transformationStata  cheat sheet: data transformation
Stata cheat sheet: data transformation
 
Stata cheatsheet transformation
Stata cheatsheet transformationStata cheatsheet transformation
Stata cheatsheet transformation
 
January 2016 Meetup: Speeding up (big) data manipulation with data.table package
January 2016 Meetup: Speeding up (big) data manipulation with data.table packageJanuary 2016 Meetup: Speeding up (big) data manipulation with data.table package
January 2016 Meetup: Speeding up (big) data manipulation with data.table package
 
Stata cheat sheet: data processing
Stata cheat sheet: data processingStata cheat sheet: data processing
Stata cheat sheet: data processing
 
3 R Tutorial Data Structure
3 R Tutorial Data Structure3 R Tutorial Data Structure
3 R Tutorial Data Structure
 
Stata Cheat Sheets (all)
Stata Cheat Sheets (all)Stata Cheat Sheets (all)
Stata Cheat Sheets (all)
 
SAS and R Code for Basic Statistics
SAS and R Code for Basic StatisticsSAS and R Code for Basic Statistics
SAS and R Code for Basic Statistics
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
 
R language introduction
R language introductionR language introduction
R language introduction
 
Move your data (Hans Rosling style) with googleVis + 1 line of R code
Move your data (Hans Rosling style) with googleVis + 1 line of R codeMove your data (Hans Rosling style) with googleVis + 1 line of R code
Move your data (Hans Rosling style) with googleVis + 1 line of R code
 
R code descriptive statistics of phenotypic data by Avjinder Kaler
R code descriptive statistics of phenotypic data by Avjinder KalerR code descriptive statistics of phenotypic data by Avjinder Kaler
R code descriptive statistics of phenotypic data by Avjinder Kaler
 
Sparklyr
SparklyrSparklyr
Sparklyr
 
Stata Programming Cheat Sheet
Stata Programming Cheat SheetStata Programming Cheat Sheet
Stata Programming Cheat Sheet
 
Manipulating data with dates
Manipulating data with datesManipulating data with dates
Manipulating data with dates
 
3. R- list and data frame
3. R- list and data frame3. R- list and data frame
3. R- list and data frame
 

En vedette

Clinicaldataanalysis in r
Clinicaldataanalysis in rClinicaldataanalysis in r
Clinicaldataanalysis in rAbhik Seal
 
Introduction to Adverse Drug Reactions
Introduction to Adverse Drug ReactionsIntroduction to Adverse Drug Reactions
Introduction to Adverse Drug ReactionsAbhik Seal
 
Adverse Drug Reactions - Identifying, Causality & Reporting
Adverse Drug Reactions - Identifying, Causality & ReportingAdverse Drug Reactions - Identifying, Causality & Reporting
Adverse Drug Reactions - Identifying, Causality & ReportingRuella D'Costa Fernandes
 
Adverse drug reactions
Adverse drug  reactionsAdverse drug  reactions
Adverse drug reactionssuniu
 
Adverse drug reactions
Adverse drug reactionsAdverse drug reactions
Adverse drug reactionsDr.Vijay Talla
 
QSAR : Activity Relationships Quantitative Structure
QSAR : Activity Relationships Quantitative StructureQSAR : Activity Relationships Quantitative Structure
QSAR : Activity Relationships Quantitative StructureSaramita De Chakravarti
 
Qsar and drug design ppt
Qsar and drug design pptQsar and drug design ppt
Qsar and drug design pptAbhik Seal
 

En vedette (12)

Clinicaldataanalysis in r
Clinicaldataanalysis in rClinicaldataanalysis in r
Clinicaldataanalysis in r
 
Chemical data
Chemical dataChemical data
Chemical data
 
Introduction to Adverse Drug Reactions
Introduction to Adverse Drug ReactionsIntroduction to Adverse Drug Reactions
Introduction to Adverse Drug Reactions
 
Adverse Drug Reactions - Identifying, Causality & Reporting
Adverse Drug Reactions - Identifying, Causality & ReportingAdverse Drug Reactions - Identifying, Causality & Reporting
Adverse Drug Reactions - Identifying, Causality & Reporting
 
Adverse drug reactions
Adverse drug  reactionsAdverse drug  reactions
Adverse drug reactions
 
Adverse drug reactions
Adverse drug reactionsAdverse drug reactions
Adverse drug reactions
 
Adverse drug reactions ppt
Adverse drug reactions pptAdverse drug reactions ppt
Adverse drug reactions ppt
 
Adverse drug reactions
Adverse drug reactionsAdverse drug reactions
Adverse drug reactions
 
QSAR : Activity Relationships Quantitative Structure
QSAR : Activity Relationships Quantitative StructureQSAR : Activity Relationships Quantitative Structure
QSAR : Activity Relationships Quantitative Structure
 
Adverse drug reactions
Adverse drug reactionsAdverse drug reactions
Adverse drug reactions
 
Qsar and drug design ppt
Qsar and drug design pptQsar and drug design ppt
Qsar and drug design ppt
 
Adverse Drug Reactions
Adverse Drug ReactionsAdverse Drug Reactions
Adverse Drug Reactions
 

Similaire à Data handling in r

ASP.NET 08 - Data Binding And Representation
ASP.NET 08 - Data Binding And RepresentationASP.NET 08 - Data Binding And Representation
ASP.NET 08 - Data Binding And RepresentationRandy Connolly
 
Building node.js applications with Database Jones
Building node.js applications with Database JonesBuilding node.js applications with Database Jones
Building node.js applications with Database JonesJohn David Duncan
 
Slick: Bringing Scala’s Powerful Features to Your Database Access
Slick: Bringing Scala’s Powerful Features to Your Database Access Slick: Bringing Scala’s Powerful Features to Your Database Access
Slick: Bringing Scala’s Powerful Features to Your Database Access Rebecca Grenier
 
Python (Jinja2) Templates for Network Automation
Python (Jinja2) Templates for Network AutomationPython (Jinja2) Templates for Network Automation
Python (Jinja2) Templates for Network AutomationRick Sherman
 
Dax Declarative Api For Xml
Dax   Declarative Api For XmlDax   Declarative Api For Xml
Dax Declarative Api For XmlLars Trieloff
 
Local data storage for mobile apps
Local data storage for mobile appsLocal data storage for mobile apps
Local data storage for mobile appsIvano Malavolta
 
Ado.Net Architecture
Ado.Net ArchitectureAdo.Net Architecture
Ado.Net ArchitectureUmar Farooq
 
Boost Your Environment With XMLDB - UKOUG 2008 - Marco Gralike
Boost Your Environment With XMLDB - UKOUG 2008 - Marco GralikeBoost Your Environment With XMLDB - UKOUG 2008 - Marco Gralike
Boost Your Environment With XMLDB - UKOUG 2008 - Marco GralikeMarco Gralike
 
Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP3...
Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP3...Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP3...
Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP3...Amazon Web Services
 
Data science at the command line
Data science at the command lineData science at the command line
Data science at the command lineSharat Chikkerur
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Sameer Tiwari
 

Similaire à Data handling in r (20)

ASP.NET 08 - Data Binding And Representation
ASP.NET 08 - Data Binding And RepresentationASP.NET 08 - Data Binding And Representation
ASP.NET 08 - Data Binding And Representation
 
Building node.js applications with Database Jones
Building node.js applications with Database JonesBuilding node.js applications with Database Jones
Building node.js applications with Database Jones
 
Sql
SqlSql
Sql
 
Slick: Bringing Scala’s Powerful Features to Your Database Access
Slick: Bringing Scala’s Powerful Features to Your Database Access Slick: Bringing Scala’s Powerful Features to Your Database Access
Slick: Bringing Scala’s Powerful Features to Your Database Access
 
Python (Jinja2) Templates for Network Automation
Python (Jinja2) Templates for Network AutomationPython (Jinja2) Templates for Network Automation
Python (Jinja2) Templates for Network Automation
 
DataMapper
DataMapperDataMapper
DataMapper
 
Sequelize
SequelizeSequelize
Sequelize
 
Dax Declarative Api For Xml
Dax   Declarative Api For XmlDax   Declarative Api For Xml
Dax Declarative Api For Xml
 
Intake 37 ef2
Intake 37 ef2Intake 37 ef2
Intake 37 ef2
 
Local data storage for mobile apps
Local data storage for mobile appsLocal data storage for mobile apps
Local data storage for mobile apps
 
R stata
R stataR stata
R stata
 
Having Fun with Play
Having Fun with PlayHaving Fun with Play
Having Fun with Play
 
Chapter 15
Chapter 15Chapter 15
Chapter 15
 
Ado.Net Architecture
Ado.Net ArchitectureAdo.Net Architecture
Ado.Net Architecture
 
R data interfaces
R data interfacesR data interfaces
R data interfaces
 
Boost Your Environment With XMLDB - UKOUG 2008 - Marco Gralike
Boost Your Environment With XMLDB - UKOUG 2008 - Marco GralikeBoost Your Environment With XMLDB - UKOUG 2008 - Marco Gralike
Boost Your Environment With XMLDB - UKOUG 2008 - Marco Gralike
 
Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP3...
Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP3...Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP3...
Distributed, Incremental Dataflow Processing on AWS with GRAIL's Reflow (CMP3...
 
Data science at the command line
Data science at the command lineData science at the command line
Data science at the command line
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...
 
Ch 7 data binding
Ch 7 data bindingCh 7 data binding
Ch 7 data binding
 

Plus de Abhik Seal

Virtual Screening in Drug Discovery
Virtual Screening in Drug DiscoveryVirtual Screening in Drug Discovery
Virtual Screening in Drug DiscoveryAbhik Seal
 
Modeling Chemical Datasets
Modeling Chemical DatasetsModeling Chemical Datasets
Modeling Chemical DatasetsAbhik Seal
 
Mapping protein to function
Mapping protein to functionMapping protein to function
Mapping protein to functionAbhik Seal
 
Sequencedatabases
SequencedatabasesSequencedatabases
SequencedatabasesAbhik Seal
 
Chemical File Formats for storing chemical data
Chemical File Formats for storing chemical dataChemical File Formats for storing chemical data
Chemical File Formats for storing chemical dataAbhik Seal
 
Understanding Smiles
Understanding Smiles Understanding Smiles
Understanding Smiles Abhik Seal
 
Learning chemistry with google
Learning chemistry with googleLearning chemistry with google
Learning chemistry with googleAbhik Seal
 
3 d virtual screening of pknb inhibitors using data
3 d virtual screening of pknb inhibitors using data3 d virtual screening of pknb inhibitors using data
3 d virtual screening of pknb inhibitors using dataAbhik Seal
 
R scatter plots
R scatter plotsR scatter plots
R scatter plotsAbhik Seal
 
Q plot tutorial
Q plot tutorialQ plot tutorial
Q plot tutorialAbhik Seal
 
Pharmacohoreppt
PharmacohorepptPharmacohoreppt
PharmacohorepptAbhik Seal
 

Plus de Abhik Seal (16)

Virtual Screening in Drug Discovery
Virtual Screening in Drug DiscoveryVirtual Screening in Drug Discovery
Virtual Screening in Drug Discovery
 
Networks
NetworksNetworks
Networks
 
Modeling Chemical Datasets
Modeling Chemical DatasetsModeling Chemical Datasets
Modeling Chemical Datasets
 
Mapping protein to function
Mapping protein to functionMapping protein to function
Mapping protein to function
 
Sequencedatabases
SequencedatabasesSequencedatabases
Sequencedatabases
 
Chemical File Formats for storing chemical data
Chemical File Formats for storing chemical dataChemical File Formats for storing chemical data
Chemical File Formats for storing chemical data
 
Understanding Smiles
Understanding Smiles Understanding Smiles
Understanding Smiles
 
Learning chemistry with google
Learning chemistry with googleLearning chemistry with google
Learning chemistry with google
 
3 d virtual screening of pknb inhibitors using data
3 d virtual screening of pknb inhibitors using data3 d virtual screening of pknb inhibitors using data
3 d virtual screening of pknb inhibitors using data
 
Poster
PosterPoster
Poster
 
R scatter plots
R scatter plotsR scatter plots
R scatter plots
 
Indo us 2012
Indo us 2012Indo us 2012
Indo us 2012
 
Q plot tutorial
Q plot tutorialQ plot tutorial
Q plot tutorial
 
Weka guide
Weka guideWeka guide
Weka guide
 
Pharmacohoreppt
PharmacohorepptPharmacohoreppt
Pharmacohoreppt
 
Document1
Document1Document1
Document1
 

Dernier

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 

Dernier (20)

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 

Data handling in r

  • 2. Get/set your working directoryGet/set your working directory A basic component of working with data is knowing your working directory The two main commands are getwd() and setwd(). Be aware of relative versus absolute paths Important difference in Windows setwd("C:UsersdatascDownloads") · · · Relative - setwd("./data"), setwd("../") Absolute - setwd("/Users/datasc/data/") - - · 2/40
  • 3. Checking for and creating directoriesChecking for and creating directories file.exists("directoryName") will check to see if the directory exists dir.create("directoryName") will create a directory if it doesn't exist Here is an example checking for a "data" directory and creating it if it doesn't exist · · · if(!file.exists("data")){ dir.create("data") } 3/40
  • 4. Reading data filesReading data files We wil look at each of the methods From Internet Reading local files Reading Excel Files Reading XML Reading JSON Reading MySQL Reading HDF5 Reading from other resources · · · · · · · · 4/40
  • 5. Getting data from InternetGetting data from Internet Data from Healthit.gov Use of download.file() Useful for downloading tab-delimited, csv, and other files · · fileUrl <- "http://dashboard.healthit.gov/data/data/NAMCS_2008-2013.csv" download.file(fileUrl,destfile="./data/NAMCS.csv",method="curl") list.files("./data") ## [1] "NAMCS.csv" 5/40
  • 6. Getting data from InternetGetting data from Internet Reading the data using read.csv() data<-read.csv("http://dashboard.healthit.gov/data/data/NAMCS_2008-2013.csv") head(data,2) ## Region Period Adoption.of.Basic.EHRs..Overall.Physician.Practices ## 1 Alabama 2013 0.48 ## 2 Alaska 2013 0.50 ## Adoption.of.Basic.EHRs..Primary.Care.Providers ## 1 0.50 ## 2 0.52 ## Adoption.of.Basic.EHRs..Rural.Providers ## 1 0.54 ## 2 0.37 ## Adoption.of.Basic.EHRs..Small.Practices ## 1 0.40 ## 2 0.39 ## Percent.of.office.based.physicians.with.computerized.capability.to.view.lab.results ## 1 0.74 ## 2 0.75 6/40
  • 7. Some notes about download.file()Some notes about download.file() If the url starts with http you can use download.file() If the url starts with https on Windows you may be ok If the url starts with https on Mac you may need to set method="curl" If the file is big, this might take a while Be sure to record when you downloaded. · · · · · 7/40
  • 8. Loading flat files - read.table()Loading flat files - read.table() This is the main function for reading data into R Flexible and robust but requires more parameters Reads the data into RAM - big data can cause problems Important parameters file, header, sep, row.names, nrows Related: read.csv(), read.csv2() Both read.table() and read.fwf() use scan to read the file, and then process the results of scan. They are very convenient, but sometimes it is better to use scan directly · · · · · · 8/40
  • 9. Example dataExample data fileUrl <- "http://dashboard.healthit.gov/data/data/NAMCS_2008-2013.csv" download.file(fileUrl,destfile="./data/NAMCS.csv",method="curl") list.files("./data") ## [1] "NAMCS.csv" Data <- read.table("./data/NAMCS.csv") ## Error: line 2 did not have 87 elements head(Data,2) ## Error: object 'Data' not found 9/40
  • 10. Example parametersExample parameters read.csv sets sep="," and header=TRUE same as cameraData <- read.table("./data/NAMCS.csv",sep=",",header=TRUE) cameraData <- read.csv("./data/NAMCS.csv") head(cameraData) ## Region Period Adoption.of.Basic.EHRs..Overall.Physician.Practices ## 1 Alabama 2013 0.48 ## 2 Alaska 2013 0.50 ## 3 Arizona 2013 0.51 ## 4 Arkansas 2013 0.46 ## 5 California 2013 0.54 ## 6 Colorado 2013 0.39 ## Adoption.of.Basic.EHRs..Primary.Care.Providers ## 1 0.50 ## 2 0.52 ## 3 0.63 10/40
  • 11. Some more important parametersSome more important parameters People face trouble with reading flat files those have quotation marks ` or " placed in data values, setting quote="" often resolves these. quote - you can tell R whether there are any quoted values quote="" means no quotes. na.strings - set the character that represents a missing value. nrows - how many rows to read of the file (e.g. nrows=10 reads 10 lines). skip - number of lines to skip before starting to read · · · · 11/40
  • 12. read.xlsx(), read.xlsx2() {xlsx package}read.xlsx(), read.xlsx2() {xlsx package} Reading specific rows and columnsReading specific rows and columns library(xlsx) Data <- read.xlsx("./data/ADME_genes.xlsx",sheetIndex=1,header=TRUE) colIndex <- 2:3 rowIndex <- 1:4 dataSub <- read.xlsx("./data/ADME_genes.xlsx",sheetIndex=1, colIndex=colIndex,rowIndex=rowIndex) 12/40
  • 13. Further notesFurther notes The write.xlsx function will write out an Excel file with similar arguments. read.xlsx2 is much faster than read.xlsx but for reading subsets of rows may be slightly unstable. The XLConnect is a Java-based solution, so it is cross platform and returns satisfactory results. For large data sets it may be very slow. xlsReadWrite is very fast: it doesn't support .xlsx files gdata package provides a good cross platform solutions. It is available for Windows, Mac or Linux. gdata requires you to install additional Perl libraries. Perl is usually already installed in Linux and Mac, but sometimes require more effort in Windows platforms. In general it is advised to store your data in either a database or in comma separated files (.csv) or tab separated files (.tab/.txt) as they are easier to distribute. I found on the web a self made function to easily import xlsx files. It should work in all platforms and use XML · · · · · · · source("https://gist.github.com/schaunwheeler/5825002/raw/3526a15b032c06392740e20b6c9a179add2cee49/ xlsxToR = function("myfile.xlsx", header = TRUE) 13/40
  • 14. Working with XMLWorking with XML http://en.wikipedia.org/wiki/XML Extensible markup language Frequently used to store structured data Particularly widely used in internet applications Extracting XML is the basis for most web scraping Components · · · · · Markup - labels that give the text structure Content - the actual text of the document - - 14/40
  • 15. Read the file into RRead the file into R library(XML) fileUrl <- "http://www.w3schools.com/xml/simple.xml" doc <- xmlTreeParse(fileUrl,useInternal=TRUE) rootNode <- xmlRoot(doc) xmlName(rootNode) ## [1] "breakfast_menu" names(rootNode) ## food food food food food ## "food" "food" "food" "food" "food" 15/40
  • 16. Directly access parts of the XML documentDirectly access parts of the XML document rootNode[[1]] ## <food> ## <name>Belgian Waffles</name> ## <price>$5.95</price> ## <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description> ## <calories>650</calories> ## </food> rootNode[[1]][[1]] ## <name>Belgian Waffles</name> Go for a tour of XML package Official XML tutorials short, long An outstanding guide to the XML package · · · 16/40
  • 17. JSONJSON http://en.wikipedia.org/wiki/JSON Javascript Object Notation Lightweight data storage Common format for data from application programming interfaces (APIs) Similar structure to XML but different syntax/format Data stored as · · · · · Numbers (double) Strings (double quoted) Boolean (true or false) Array (ordered, comma separated enclosed in square brackets []) Object (unorderd, comma separated collection of key:value pairs in curley brackets {}) - - - - - 17/40
  • 18. Example JSON fileExample JSON file 18/40
  • 19. Reading data from JSON {jsonlite package}Reading data from JSON {jsonlite package} library(jsonlite) # Using chembl api jsonData <- fromJSON("https://www.ebi.ac.uk/chemblws/compounds/CHEMBL1.json") names(jsonData) ## [1] "compound" jsonData$compound$chemblId ## [1] "CHEMBL1" jsonData$compound$stdInChiKey ## [1] "GHBOEFUAGSHXPO-XZOTUCIWSA-N" 19/40
  • 20. Writing data frames to JSONWriting data frames to JSON myjson <- toJSON(iris, pretty=TRUE) cat(myjson) ## [ ## { ## "Sepal.Length" : 5.1, ## "Sepal.Width" : 3.5, ## "Petal.Length" : 1.4, ## "Petal.Width" : 0.2, ## "Species" : "setosa" ## }, ## { ## "Sepal.Length" : 4.9, ## "Sepal.Width" : 3, ## "Petal.Length" : 1.4, ## "Petal.Width" : 0.2, ## "Species" : "setosa" ## }, ## { ## "Sepal.Length" : 4.7, 20/40
  • 21. Further resourcesFurther resources http://www.json.org/ A good tutorial on jsonlite - http://www.r-bloggers.com/new-package-jsonlite-a-smarter-json- encoderdecoder/ jsonlite vignette · · · 21/40
  • 22. mySQLmySQL http://en.wikipedia.org/wiki/MySQL http://www.mysql.com/ Free and widely used open source database software Widely used in internet based applications Data are structured in Each row is called a record · · · Databases Tables within databases Fields within tables - - - · 22/40
  • 23. Step 2 - Install RMySQL ConnectorStep 2 - Install RMySQL Connector On a Mac: install.packages("RMySQL") On Windows: · · Official instructions - http://biostat.mc.vanderbilt.edu/wiki/Main/RMySQL (may be useful for Mac/UNIX users as well) Potentially useful guide - http://www.ahschulz.de/2013/07/23/installing-rmysql-under- windows/ - - 23/40
  • 25. Connecting and listing databasesConnecting and listing databases library(DBI) library(RMySQL) ucscDb <- dbConnect(MySQL(),user="genome", host="genome-mysql.cse.ucsc.edu") result <- dbGetQuery(ucscDb,"show databases;"); dbDisconnect(ucscDb); ## [1] TRUE head(result) ## Database ## 1 information_schema ## 2 ailMel1 ## 3 allMis1 ## 4 anoCar1 ## 5 anoCar2 ## 6 anoGam1 25/40
  • 26. Connecting to hg19 and listing tablesConnecting to hg19 and listing tables library(RMySQL) hg19 <- dbConnect(MySQL(),user="genome", db="hg19", host="genome-mysql.cse.ucsc.edu") allTables <- dbListTables(hg19) length(allTables) ## [1] 11006 allTables[1:5] ## [1] "HInv" "HInvGeneMrna" "acembly" "acemblyClass" ## [5] "acemblyPep" 26/40
  • 27. Get dimensions of a specific tableGet dimensions of a specific table dbListFields(hg19,"affyU133Plus2") ## [1] "bin" "matches" "misMatches" "repMatches" "nCount" ## [6] "qNumInsert" "qBaseInsert" "tNumInsert" "tBaseInsert" "strand" ## [11] "qName" "qSize" "qStart" "qEnd" "tName" ## [16] "tSize" "tStart" "tEnd" "blockCount" "blockSizes" ## [21] "qStarts" "tStarts" dbGetQuery(hg19, "select count(*) from affyU133Plus2") ## count(*) ## 1 58463 27/40
  • 28. Read from the tableRead from the table affyData <- dbReadTable(hg19, "affyU133Plus2") head(affyData) ## bin matches misMatches repMatches nCount qNumInsert qBaseInsert ## 1 585 530 4 0 23 3 41 ## 2 585 3355 17 0 109 9 67 ## 3 585 4156 14 0 83 16 18 ## 4 585 4667 9 0 68 21 42 ## 5 585 5180 14 0 167 10 38 ## 6 585 468 5 0 14 0 0 ## tNumInsert tBaseInsert strand qName qSize qStart qEnd tName ## 1 3 898 - 225995_x_at 637 5 603 chr1 ## 2 9 11621 - 225035_x_at 3635 0 3548 chr1 ## 3 2 93 - 226340_x_at 4318 3 4274 chr1 ## 4 3 5743 - 1557034_s_at 4834 48 4834 chr1 ## 5 1 29 - 231811_at 5399 0 5399 chr1 ## 6 0 0 - 236841_at 487 0 487 chr1 ## tSize tStart tEnd blockCount ## 1 249250621 14361 15816 5 ## 2 249250621 14381 29483 17 28/40
  • 29. Select a specific subsetSelect a specific subset query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3") affyMis <- fetch(query); quantile(affyMis$misMatches) ## 0% 25% 50% 75% 100% ## 1 1 2 2 3 affyMisSmall <- fetch(query,n=10); dbClearResult(query); ## [1] TRUE dim(affyMisSmall) ## [1] 10 22 # close connection dbDisconnect(hg19) 29/40
  • 30. Further resourcesFurther resources RMySQL vignette http://cran.r-project.org/web/packages/RMySQL/RMySQL.pdf R data import and export Set up R odbc with postgres A nice blog post summarizing some other commands http://www.r-bloggers.com/mysql-and-r/ · · · · 30/40
  • 31. HDF5HDF5 http://www.hdfgroup.org/ Used for storing large data sets Supports storing a range of data types Heirarchical data format groups containing zero or more data sets and metadata datasets multidimensional array of data elements with metadata · · · · Have a group header with group name and list of attributes Have a group symbol table with a list of objects in group - - · Have a header with name, datatype, dataspace, and storage layout Have a data array with the data - - 31/40
  • 32. R HDF5 packageR HDF5 package The rhdf5 package works really well, although it is not in CRAN. To install it: source("http://bioconductor.org/biocLite.R") ## Bioconductor version 2.13 (BiocInstaller 1.12), ?biocLite for help ## A newer version of Bioconductor is available after installing a new ## version of R, ?BiocUpgrade for help biocLite("rhdf5") ## BioC_mirror: http://bioconductor.org ## Using Bioconductor version 2.13 (BiocInstaller 1.12.1), R version 3.0.3. ## Installing package(s) 'rhdf5' ## ## The downloaded binary packages are in ## /var/folders/pm/jg6blwt55b71g8jl64wfw8ch0000gn/T//RtmpuYnNzs/downloaded_packages 32/40
  • 33. Creating an HDF5 file and group hierarchyCreating an HDF5 file and group hierarchy library(rhdf5) h5createFile("myhdf5.h5") ## [1] TRUE h5createGroup("myhdf5.h5","foo") ## [1] TRUE h5createGroup("myhdf5.h5","baa") ## [1] TRUE h5createGroup("myhdf5.h5","foo/foobaa") ## [1] TRUE 33/40
  • 34. hdf5 continuedhdf5 continued Saving multiple objects to an HDF5 file h5ls("myhdf5.h5") ## group name otype dclass dim ## 0 / baa H5I_GROUP ## 1 / foo H5I_GROUP ## 2 /foo foobaa H5I_GROUP A = 1:7; B = 1:18; D = seq(0,1,by=0.1) h5save(A, B, D, file="newfile2.h5") h5dump("newfile2.h5") ## $A ## [1] 1 2 3 4 5 6 7 ## ## $B ## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ## 34/40
  • 35. Reading from other resourcesReading from other resources foreign package Loads data from Minitab, S, SAS, SPSS, Stata,Systat Basic functions read.foo See the help page for more details http://cran.r-project.org/web/packages/foreign/foreign.pdf · · read.arff (Weka) readline() read from console read.dta (Stata) read.clipboard() read.mtp (Minitab) read.octave (Octave) read.spss (SPSS) read.xport (SAS) - - - - - - - - · 35/40
  • 36. Reading imagesReading images jpeg - http://cran.r-project.org/web/packages/jpeg/index.html readbitmap - http://cran.r-project.org/web/packages/readbitmap/index.html png - http://cran.r-project.org/web/packages/png/index.html EBImage (Bioconductor) - http://www.bioconductor.org/packages/2.13/bioc/html/EBImage.html · · · · 36/40
  • 37. Reading GIS dataReading GIS data rgdal - http://cran.r-project.org/web/packages/rgdal/index.html rgeos - http://cran.r-project.org/web/packages/rgeos/index.html raster - http://cran.r-project.org/web/packages/raster/index.html · · · 37/40
  • 38. Reading music dataReading music data tuneR - http://cran.r-project.org/web/packages/tuneR/ seewave - http://rug.mnhn.fr/seewave/ · · 38/40
  • 39. AcknowledgemntAcknowledgemnt Jeff Leek University of Washington and Coursera Getting and Cleaning data R For Natural Resources Course R Data import comprehensive guide · · · 39/40
  • 40. 40/40