Pau Corral Montañés

         RUGBCN
        03/05/2012
VIDEO:
"what we used to do in Bioinformatics"


note: due to memory space limitations, the video has not been embedded into
this presentation.

Content of the video:
A search in NCBI (www.ncbi.nlm.nih.gov) for the entry NC_001477 is done
interactively, leading to a single entry in the database related to Dengue Virus,
complete genome. Thanks to the facilities that the site provides, a FASTA file for
this complete genome is downloaded and thereafter opened with a text editor.

The purpose of the video is to show how tedious it is to access the site and
download an entry: it takes no fewer than 5 mouse clicks.
Unified records with in-house coding:
        The 3 databases share all sequences, but they use different
        accession numbers (IDs) to refer to each entry.
        They update every night.

        NCBI – ex.: #000134
        EMBL – ex.: #000012
        DDBJ – ex.: #002221

ACNUC:
        A "mirror" of the content of other databases.
        A series of commands and their arguments were defined that allow
        (1) database opening,
        (2) query execution,
        (3) annotation and sequence display,
        (4) annotation, species and keywords browsing, and
        (5) sequence extraction

[Diagram labels: NCBI, EMBL, DDBJ, ACNUC, Other Databases]
– NCBI - National Center for Biotechnology Information - (www.ncbi.nlm.nih.gov)
– EMBL - European Molecular Biology Laboratory - (www.ebi.ac.uk/embl)
– DDBJ - DNA Data Bank of Japan - (www.ddbj.nig.ac.jp)
– ACNUC - http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html
ACNUC retrieval programs:

     i) SeqinR - (http://pbil.univ-lyon1.fr/software/seqinr/)

     ii) C language API - (http://pbil.univ-lyon1.fr/databases/acnuc/raa_acnuc.html)

     iii) Python language API - (http://pbil.univ-lyon1.fr/cgi-bin/raapythonhelp.csh)


     a.I) Query_win – GUI client for remote ACNUC retrieval operations
                   (http://pbil.univ-lyon1.fr/software/query_win.html)
     a.II) raa_query – same functionality as Query_win in command line interface
                  (http://pbil.univ-lyon1.fr/software/query.html)
Why use seqinR:
the Wheel          - available packages
Hotline            - discussion forum
Automation         - same source code for different purposes
Reproducibility    - tutorials in vignette format
Fine tuning        - function arguments

Usage of the R environment!!



#Install the package seqinr_3.0-6 (April 2012)

>install.packages("seqinr")
package ‘seqinr’ was built under R version 2.14.2




A vignette:
http://seqinr.r-forge.r-project.org/seqinr_2_0-7.pdf
#Choose a mirror
>chooseCRANmirror()          #pick a mirror from the interactive menu
>install.packages("seqinr")

>library(seqinr)

#The command lseqinr() lists everything that is defined in the package:
>lseqinr()[1:9]

  [1]   "a"              "aaa"
  [3]   "AAstat"         "acnucclose"
  [5]   "acnucopen"      "al2bp"
  [7]   "alllistranks"   "alr"
  [9]   "amb"

>length(lseqinr())
[1] 209
How many different ways are there to work with biological sequences using SeqinR?

1)Sequences you have locally:

     i) read.fasta() and s2c() and c2s() and GC() and count() and translate()
     ii) write.fasta()
     iii) read.alignment() and consensus()

2) Sequences you download from a Database:

     i) browse Databases
     ii) query() and getSequence()
FASTA files example:
 Example with DNA data: (4 different characters, normally)

 The FASTA format is very simple and widely used for simple import of
 biological sequences. It begins with a single-line description starting
 with the character '>', followed by lines of sequence data of at most
 80 characters each. Lines starting with a semicolon character ';' are
 comment lines.

 Example with Protein data: (20 different characters, normally)

 Check Wikipedia for:
 i) Sequence representation
 ii) Sequence identifiers
 iii) File extensions
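
The format rules above can be sketched as a tiny parser in base R. This is only an illustration of the rules ('>' opens a record, ';' marks a comment, sequence lines are concatenated); it is not how read.fasta() is implemented, and the function name parse_fasta and the toy records are invented for this example.

```r
# Toy FASTA parser in base R (illustrative; not seqinr's read.fasta()).
# parse_fasta() and the example records below are hypothetical.
parse_fasta <- function(lines) {
  lines <- lines[nzchar(lines)]              # drop empty lines
  lines <- lines[!startsWith(lines, ";")]    # drop ';' comment lines
  is_header <- startsWith(lines, ">")        # '>' opens a new record
  headers <- sub("^>", "", lines[is_header])
  # cumsum() gives, for every line, the index of the record it belongs to
  record <- factor(cumsum(is_header)[!is_header], levels = seq_along(headers))
  seqs <- vapply(split(lines[!is_header], record), paste, character(1),
                 collapse = "")
  names(seqs) <- headers
  seqs
}

example_lines <- c("; a comment line",
                   ">seq1 toy record", "acgt", "ttga",
                   ">seq2", "ggcc")
parsed <- parse_fasta(example_lines)
parsed[["seq2"]]   # "ggcc"
```

For real files, read.fasta() remains the tool to use; the sketch only shows why the format is easy to handle.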
Read a file with read.fasta()

#Read the sequence from a local directory
> setwd("H:/Documents and Settings/Pau/Mis documentos/R_test")
> dir()
[1] "dengue_whole_sequence.fasta"

#Use the read.fasta (see next slide) function to load the sequence
> read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA")
$`gi|9626685|ref|NC_001477.1|`
[1] "a" "g" "t" "t" "g" "t" "t" "a" "g" "t" "c" "t" "a" "c" "g" "t" "g" "g"
[19] "a" "c" "c" "g" "a" "c" "a" "a" "g" "a" "a" "c" "a" "g" "t" "t" "t" "c"
[37] "g" "a" "a" "t" "c" "g" "g" "a" "a" "g" "c" "t" "t" "g" "c" "t" "t" "a"
[55] "a" "c" "g" "t" "a" "g" "t" "t" "c" "t" "a" "a" "c" "a" "g" "t" "t" "t"
  [............................................................]
[10711] "t" "g" "g" "t" "g" "c" "t" "g" "t" "t" "g" "a" "a" "t" "c" "a" "a" "c"
[10729] "a" "g" "g" "t" "t" "c" "t"
attr(,"name")
[1] "gi|9626685|ref|NC_001477.1|"
attr(,"Annot")
[1] ">gi|9626685|ref|NC_001477.1| Dengue virus 1, complete genome"
attr(,"class")
[1] "SeqFastadna"



> read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F)
$`gi|9626685|ref|NC_001477.1|`
[1] "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"

> seq <- read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F)
> str(seq)
List of 1
 $ gi|9626685|ref|NC_001477.1|: chr "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"
Usage:
read.fasta(file, seqtype = c("DNA", "AA"), as.string = FALSE, forceDNAtolower = TRUE,
    set.attributes = TRUE, legacy.mode = TRUE, seqonly = FALSE, strip.desc = FALSE,
    bfa = FALSE, sizeof.longlong = .Machine$sizeof.longlong,
    endian = .Platform$endian, apply.mask = TRUE)

Arguments:
file - path to the FASTA file (relative to the working directory, see getwd, if an absolute path is not given)
seqtype - the nature of the sequence: DNA or AA
as.string - if TRUE sequences are returned as a string instead of a vector of single characters
forceDNAtolower - if TRUE, DNA sequences are returned in lower case
set.attributes - whether sequence attributes should be set
legacy.mode - if TRUE lines starting with a semicolon ’;’ are ignored
seqonly - if TRUE, only sequences are returned (execution time is divided approximately by a factor 3)
strip.desc - if TRUE, removes the '>' at the beginning of the description lines
bfa - if TRUE the fasta file is in MAQ binary format
sizeof.longlong - relative to MAQ files
endian - relative to MAQ files
apply.mask - relative to MAQ files

Value:
a list of vectors of chars
Basic manipulations:
#Turn the sequence into single characters and count how many there are
> length(s2c(seq[[1]]))
[1] 10735

#Count how many times each nucleotide occurs
> table(s2c(seq[[1]]))
   a    c    g    t
3426 2240 2770 2299

#Count the fraction of G and C bases in the sequence
> GC(s2c(seq[[1]]))
[1] 0.4666977

#Count all possible words in a sequence with a sliding window of size = wordsize
> seq_2 <- "actg"
> count(s2c(seq_2), wordsize=2)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
 0  1  0  0  0  0  0  1  0  0  0  0  0  0  1  0

> count(s2c(seq_2), wordsize=2, by=2)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
 0  1  0  0  0  0  0  0  0  0  0  0  0  0  1  0
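
To see what the sliding window and the `by` step do, here is a minimal base R sketch. It is not seqinr's count() implementation; count_words() is a hypothetical helper written for this example, and it assumes a lower-case a/c/g/t alphabet. The last lines recompute the GC fraction from the nucleotide counts shown earlier.

```r
# Base R sketch of sliding-window word counting (not seqinr's count()).
# count_words() is a hypothetical helper written for this example.
count_words <- function(chars, wordsize = 2, by = 1,
                        alphabet = c("a", "c", "g", "t")) {
  stopifnot(length(chars) >= wordsize)
  # Window start positions: 1, 1 + by, 1 + 2*by, ...
  starts <- seq(1, length(chars) - wordsize + 1, by = by)
  words <- vapply(starts, function(i) {
    paste(chars[i:(i + wordsize - 1)], collapse = "")
  }, character(1))
  # Every possible word, in the order aa, ac, ag, at, ca, ...
  grid <- expand.grid(rep(list(alphabet), wordsize), stringsAsFactors = FALSE)
  all_words <- do.call(paste0, unname(rev(as.list(grid))))
  table(factor(words, levels = all_words))
}

chars <- strsplit("actg", "")[[1]]
overlapping <- count_words(chars, wordsize = 2)          # windows: ac, ct, tg
stepped     <- count_words(chars, wordsize = 2, by = 2)  # windows: ac, tg

# GC fraction from the nucleotide counts of the dengue genome shown above
base_counts <- c(a = 3426, c = 2240, g = 2770, t = 2299)
gc_fraction <- sum(base_counts[c("g", "c")]) / sum(base_counts)
```

With by = 1 the windows overlap, so "ct" is counted; with by = 2 the windows are adjacent and "ct" is skipped, which matches the two outputs above.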

#Translate the nucleotide sequence into amino acids.
> c2s(translate(s2c(seq[[1]]), frame=0, sens='F', numcode=1))
[1] "SC*STWTDKNSFESEACLT*F*QFFIREQISDEQPTEKDGSTVFQYAETREKPRVNCFTVGEEILK...-Truncated-...
RIAFRPRTHEIGDGFYSIPKISSHTSNSRNFG*MGLIQEEWSDQGDPSQDTTQQRGPTPGEAVPWW*GLEVRGDPPHNNKQHIDAGRD
QRSCCLYSIIPGTERQKMEWCC*INRF"
#Note:
#    1)frame=0 means that the first nucleotide is taken as the first position in the codon
#    2)sens='F' means that the sequence is translated in the forward sense
#    3)numcode=1 means the standard genetic code is used (in SeqinR, up to 23 variations)
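
As a rough illustration of what frame = 0, forward sense and the standard genetic code amount to, here is a base R sketch. It is not seqinr's translate(): translate_sketch() is a hypothetical helper that handles only the forward strand and the standard code, and it writes "X" for any codon it does not recognise (an assumption of this sketch).

```r
# Base R sketch of forward translation with the standard genetic code
# (illustrative; not seqinr's translate()). translate_sketch() is hypothetical.
bases <- c("t", "c", "a", "g")
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
codons <- paste0(grid$first, grid$second, grid$third)  # ttt, ttc, tta, ttg, tct, ...
amino_acids <- strsplit(
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG", "")[[1]]
genetic_code <- setNames(amino_acids, codons)

translate_sketch <- function(chars, frame = 0) {
  chars <- chars[(frame + 1):length(chars)]  # frame = 0 starts at the first base
  n_codons <- length(chars) %/% 3            # trailing 1-2 bases are ignored
  triplets <- vapply(seq_len(n_codons), function(i) {
    paste(chars[(3 * i - 2):(3 * i)], collapse = "")
  }, character(1))
  aa <- unname(genetic_code[triplets])
  aa[is.na(aa)] <- "X"                       # codons not in the table
  paste(aa, collapse = "")
}

# First 60 nucleotides of the dengue genome, as printed by read.fasta() earlier
first_60 <- "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag"
protein_start <- translate_sketch(strsplit(first_60, "")[[1]])
protein_start   # "SC*STWTDKNSFESEACLT*"
```

The result agrees with the first 20 residues of the translate() output above; shifting to frame = 1 or 2 reads different codons and gives a different protein string.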
Write a FASTA file (and how to save time)
> dir()
[1] "dengue_whole_sequence.fasta" "ER.fasta"
> system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F))
   user system elapsed
   3.28    0.00    3.33

> length(seq)
[1] 9984
> seq[[1]]
[1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagcctt..._truncated_...ctcata"
> names(seq)[1:5]
[1] "ER0101A01F" "ER0101A03F" "ER0101A05F" "ER0101A07F" "ER0101A09F"

#Clean the sequences keeping only the ones with at least 25 nucleotides
> char_seq = rapply(seq, s2c, how="list")       # create a list turning strings to characters
> len_seq = rapply(char_seq, length, how="list") # count how many characters are in each list
> bigger_25_list = c()
> for (val in 1:length(len_seq)){
+     if (len_seq[[val]] >= 25){
+         bigger_25_list = c(bigger_25_list, val)
+     }
+ }
> seq_25 = seq[bigger_25_list]            #indexing to get the desired list
> length(seq_25)
[1] 8928
> seq_25[1]
$ER0101A01F
[1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagccttctcatacct...truncated..."
#Write a FASTA file (see next slide)
> write.fasta(seq_25, names = names(seq_25), file.out="clean_seq.fasta", open="w")
> dir()
[1] "clean_seq.fasta" "dengue_whole_sequence.fasta" "ER.fasta"
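
The length filter above can also be written without an explicit loop. The following base R sketch assumes, as in the session above, that each list element is a single string (as.string=T); the toy list and its names are invented and stand in for the ER.fasta reads.

```r
# Vectorised length filter in base R; the toy list stands in for the ER.fasta reads.
seqs <- list(read_a = strrep("acgt", 7),   # 28 nucleotides
             read_b = "acgt",              #  4 nucleotides
             read_c = strrep("a", 25))     # 25 nucleotides
keep <- nchar(unlist(seqs)) >= 25          # one TRUE/FALSE per sequence
seqs_25 <- seqs[keep]
names(seqs_25)                             # "read_a" "read_c"
```

nchar() works directly on the strings, so the intermediate s2c() conversion is not needed just to measure lengths.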
Usage:
write.fasta(sequences, names, file.out, open = "w", nbchar = 60)

Arguments:
sequences - A DNA or protein sequence (in the form of a vector of single characters) or a list of such sequences.
names - The name(s) of the sequences.
file.out - The name of the output file.
open - Open the output file, use "w" to write into a new file, use "a" to append at the end of an already existing file.
nbchar - The number of characters per line (default: 60)


Value:
none in the R space. A FASTA formatted file is created
Write a FASTA file (and how to save time)
# Remember what we did before
> system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F))
   user system elapsed
   3.28    0.00    3.33


# After cleaning seq, we produced a file "clean_seq.fasta" that we read now
> system.time(seq_1 <- read.fasta(file="clean_seq.fasta", seqtype="DNA", as.string=T, set.attributes=F))
   user system elapsed
   1.30    0.01    1.31

#Time can be saved with the save function:
> save(seq_25, file = "ER_CLEAN_seqs.RData")

> system.time(load("ER_CLEAN_seqs.RData"))
   user system elapsed
   0.11    0.02    0.12
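
A small self-contained sketch of the same save/load round trip, using a temporary file so that nothing is written to the working directory. The object seq_demo and its contents are invented for illustration; only base R functions are used.

```r
# Round trip through an .RData file (toy data; names are hypothetical).
seq_demo <- list(ER_A = "acgtacgt", ER_B = "ttgacca")
rdata_path <- tempfile(fileext = ".RData")
save(seq_demo, file = rdata_path)

# load() restores objects under their original names; loading into a
# separate environment avoids overwriting anything in the workspace.
restored <- new.env()
loaded_names <- load(rdata_path, envir = restored)
identical(restored$seq_demo, seq_demo)     # TRUE
```

The speed-up comes from skipping text parsing: the .RData file stores the R object in binary form, so loading it does not have to re-read and re-split the FASTA text.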
Read an alignment (or create an alignment object to be aligned):
#Create an alignment object through reading a FASTA file with two sequences (see next slide)
> fasta <- read.alignment(file = system.file("sequences/Anouk.fasta", package = "seqinr"), format ="fasta")
> fasta
$nb
[1] 2
$nam
[1] "LmjF01.0030" "LinJ01.0030"
$seq
$seq[[1]]
[1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..."
$seq[[2]]
[1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..."
$com
[1] NA
attr(,"class")
[1] "alignment"

# The consensus() function builds a consensus sequence from the alignment. With method="IUPAC", positions where the
# sequences differ are coded with IUPAC ambiguity symbols.
> fixed_align = consensus(fasta, method="IUPAC")

> table(fixed_align)
fixed_align
  a   c   g   k   m   r   s   t   w   y
411 636 595   3   5  20  13 293   2  20
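
The letters k, m, r, s, w and y in the table are IUPAC codes for positions where the two sequences carry different bases. A minimal base R sketch of that coding for two already-aligned sequences follows. It is not seqinr's consensus() implementation: consensus_pair() is a hypothetical helper, and it assumes equal-length, lower-case sequences containing only a, c, g and t (no gaps).

```r
# IUPAC coding of two aligned sequences (illustrative; not seqinr's consensus()).
# consensus_pair() is a hypothetical helper; gaps and 'n' are not handled.
iupac <- c(a = "a", c = "c", g = "g", t = "t",
           ag = "r", ct = "y", gt = "k", ac = "m", cg = "s", at = "w")

consensus_pair <- function(s1, s2) {
  x <- strsplit(s1, "")[[1]]
  y <- strsplit(s2, "")[[1]]
  stopifnot(length(x) == length(y))
  # For each column, the sorted set of observed bases is the lookup key
  keys <- mapply(function(a, b) paste(sort(unique(c(a, b))), collapse = ""),
                 x, y)
  paste(unname(iupac[keys]), collapse = "")
}

consensus_pair("acgta", "atgca")   # "aygya"
```

Columns where both sequences agree keep their base; a c/t mismatch becomes y, an a/g mismatch becomes r, and so on.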
Usage:
read.alignment(file, format, forceToLower = TRUE)

Arguments:
file - The name of the file which the aligned sequences are to be read from. If it does not contain an absolute or relative
path, the file name is relative to the current working directory, getwd.
format - A character string specifying the format of the file: mase, clustal, phylip, fasta or msf
forceToLower - A logical defaulting to TRUE stating whether the returned characters in the sequence should be in
lower case



Value:
An object is created of class alignment which is a list with the following components:
nb ->the number of aligned sequences
nam ->a vector of strings containing the names of the aligned sequences
seq ->a vector of strings containing the aligned sequences
com ->a vector of strings containing the commentaries for each sequence or NA if there are no comments
Access a remote server:
> choosebank()
 [1] "genbank"       "embl"          "emblwgs"         "swissprot"   "ensembl"       "hogenom"
 [7] "hogenomdna"    "hovergendna"   "hovergen"        "hogenom5"    "hogenom5dna"   "hogenom4"
[13] "hogenom4dna"   "homolens"      "homolensdna"     "hobacnucl"   "hobacprot"     "phever2"
[19] "phever2dna"    "refseq"        "greviews"        "bacterial"   "protozoan"     "ensbacteria"
[25] "ensprotists"   "ensfungi"      "ensmetazoa"      "ensplants"   "mito"          "polymorphix"
[31] "emglib"        "taxobacgen"    "refseqViruses"

#Access a bank and see complementary information:
> choosebank(bank="genbank", infobank=T)
> ls()
[1] "banknameSocket"
> str(banknameSocket)
List of 9
 $ socket :Classes 'sockconn', 'connection' atomic [1:1] 3
  .. ..- attr(*, "conn_id")=<externalptr>
 $ bankname: chr "genbank"
 $ banktype: chr "GENBANK"
 $ totseqs : num 1.65e+08
 $ totspecs: num 968157
 $ totkeys : num 3.1e+07
 $ release : chr "           GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012"
 $ status :Class 'AsIs' chr "on"
 $ details : chr [1:4] "              ****     ACNUC Data Base Content      ****                        " "
      GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012" "139,677,722,280 bases; 152,280,170
sequences; 12,313,982 subseqs; 684,079 refers." "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive,
Universite Lyon I "
> banknameSocket$details
[1] "              ****     ACNUC Data Base Content      ****                        "
[2] "           GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012"
[3] "139,677,722,280 bases; 152,280,170 sequences; 12,313,982 subseqs; 684,079 refers."
[4] "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive, Universite Lyon I "
Make a query:
# query (see next slide) all sequences that contain the words "virus" and "dengue" in the taxonomy field and
# that are not partial sequences
> system.time(query("All_Dengue_viruses_NOTpartial", '"sp=@virus@" AND "sp=@dengue@" AND NOT "k=partial"'))
   user system elapsed
   0.78    0.00    6.72

> All_Dengue_viruses_NOTpartial[1:4]
$call
query(listname = "All_Dengue_viruses_NOTpartial", query = "\"sp=@virus@\" AND \"sp=@dengue@\" AND NOT \"k=partial\"")
$name
[1] "All_Dengue_viruses_NOTpartial"
$nelem
[1] 7741
$typelist
[1] "SQ"

> All_Dengue_viruses_NOTpartial[[5]][[1]]
    name   length    frame   ncbicg
"A13666"    "456"      "0"      "1"

> myseq = getSequence(All_Dengue_viruses_NOTpartial[[5]][[1]])
> myseq[1:20]
 [1] "a" "t" "g" "g" "c" "c" "a" "t" "g" "g" "a" "c" "c" "t" "t" "g" "g" "t" "g" "a"

> closebank()
Usage:
query(listname, query, socket = autosocket(), invisible = T, verbose = F, virtual = F)

Arguments:
listname - The name of the list as a quoted string of chars
query - A quoted string of chars containing the request with the syntax given in the details section
socket - An object of class sockconn connecting to a remote ACNUC database (default is a socket to the last opened
database).
invisible - if FALSE, the result is returned visibly.
verbose - if TRUE, verbose mode is on
virtual - if TRUE, no attempt is made to retrieve the information about all the elements of the list. In this case, the req
component of the list is set to NA.

Value:
The result is a list with the following 6 components:

call - the original call
name - the ACNUC list name
nelem - the number of elements (for instance sequences) in the ACNUC list
typelist - the type of the elements of the list. Could be SQ for a list of sequence names, KW for a list of keywords, SP for
a list of species names.
req - a list of sequence names that fit the required criteria or NA when called with parameter virtual is TRUE
socket - the socket connection that was used
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 

Dernier (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

SeqinR - biological data handling

  • 1. Pau Corral Montañés RUGBCN 03/05/2012
  • 2. VIDEO: “what we used to do in Bioinformatics” note: due to memory space limitations, the video has not been embedded into this presentation. Content of the video: A search in NCBI (www.ncbi.nlm.nih.gov) for the entry NC_001477 is done interactively, leading to a single entry in the database related to Dengue Virus, complete genome. Thanks to the facilities that the site provides, a FASTA file for this complete genome is downloaded and thereafter opened with a text editor. The purpose of the video is to show how tedious it is to access the site and download an entry, which takes no fewer than 5 mouse clicks.
  • 3. Unified records with in-house coding: The 3 databases share all sequences, but they use different accession numbers (IDs) to refer to each entry. They update every night.
       NCBI – ex.: #000134
       EMBL – ex.: #000012
       DDBJ – ex.: #002221
       Other Databases
     ACNUC: A "mirror" of the content of other databases. A series of commands and their arguments were defined that allow (1) database opening, (2) query execution, (3) annotation and sequence display, (4) annotation, species and keywords browsing, and (5) sequence extraction.
       – NCBI - National Center for Biotechnology Information - (www.ncbi.nlm.nih.gov)
       – EMBL - European Molecular Biology Laboratory - (www.ebi.ac.uk/embl)
       – DDBJ - DNA Data Bank of Japan - (www.ddbj.nig.ac.jp)
       – ACNUC - http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html
  • 4.
  • 5. ACNUC retrieval programs:
       i) SeqinR - (http://pbil.univ-lyon1.fr/software/seqinr/)
       ii) C language API - (http://pbil.univ-lyon1.fr/databases/acnuc/raa_acnuc.html)
       iii) Python language API - (http://pbil.univ-lyon1.fr/cgi-bin/raapythonhelp.csh)
       a.I) Query_win – GUI client for remote ACNUC retrieval operations (http://pbil.univ-lyon1.fr/software/query_win.html)
       a.II) raa_query – same functionality as Query_win in command line interface (http://pbil.univ-lyon1.fr/software/query.html)
  • 6. Why use seqinR:
       the Wheel - available packages
       Hotline - discussion forum
       Automation - same source code for different purposes
       Reproducibility - tutorials in vignette format
       Fine tuning - function arguments
       Usage of the R environment!!
     #Install the package seqinr_3.0-6 (April 2012)
     > install.packages("seqinr")
     package ‘seqinr’ was built under R version 2.14.2
     A vignette: http://seqinr.r-forge.r-project.org/seqinr_2_0-7.pdf
  • 7. #Choose a mirror (pick one from the list that R offers)
     > chooseCRANmirror()
     > install.packages("seqinr")
     > library(seqinr)
     #The command lseqinr() lists everything that is defined in the package:
     > lseqinr()[1:9]
     [1] "a" "aaa"
     [3] "AAstat" "acnucclose"
     [5] "acnucopen" "al2bp"
     [7] "alllistranks" "alr"
     [9] "amb"
     > length(lseqinr())
     [1] 209
  • 8. How many different ways are there to work with biological sequences using SeqinR?
     1) Sequences you have locally:
       i) read.fasta() and s2c() and c2s() and GC() and count() and translate()
       ii) write.fasta()
       iii) read.alignment() and consensus()
     2) Sequences you download from a Database:
       i) browse Databases
       ii) query() and getSequence()
  • 9. FASTA files example: The FASTA format is very simple and widely used for simple import of biological sequences. It begins with a single-line description starting with the character '>', followed by lines of sequence data of at most 80 characters each. Lines starting with a semicolon character ';' are comment lines.
       Example with DNA data: (4 different characters, normally)
       Example with Protein data: (20 different characters, normally)
     Check Wikipedia for: i) Sequence representation ii) Sequence identifiers iii) File extensions
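The format rules described in the slide above can be made concrete with a small base-R sketch. This is only an illustration of the format, not how seqinr's read.fasta() is implemented; the function name parse_fasta and the example records are hypothetical, and the sketch assumes that every '>' header is followed by at least one line of sequence data.

```r
# Minimal FASTA parser in base R (illustrative sketch, not seqinr's implementation).
# Assumption: every header line is followed by at least one sequence line.
parse_fasta <- function(lines) {
  lines <- lines[nzchar(lines)]              # drop empty lines
  lines <- lines[!startsWith(lines, ";")]    # drop ';' comment lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)                # record index for every remaining line
  headers <- sub("^>", "", lines[is_header])
  # Concatenate the sequence lines that belong to the same record
  seqs <- vapply(split(lines[!is_header], record[!is_header]),
                 paste, character(1), collapse = "")
  # Use the first whitespace-delimited word of each header as the sequence name
  ids <- sub("\\s.*$", "", headers)
  setNames(as.list(tolower(unname(seqs))), ids)
}

# Hypothetical two-record example
example_lines <- c(
  ";a comment line",
  ">seq1 hypothetical DNA example",
  "ACGTAC",
  "GTTA",
  ">seq2 another hypothetical entry",
  "TTGACA"
)
parsed <- parse_fasta(example_lines)
print(parsed)
```

The result is a named list of lower-case strings, similar in shape to what read.fasta(..., as.string = TRUE, set.attributes = FALSE) returns in the following slides.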
  • 10. Read a file with read.fasta()
     #Read the sequence from a local directory
     > setwd("H:/Documents and Settings/Pau/Mis documentos/R_test")
     > dir()
     [1] "dengue_whole_sequence.fasta"
     #Use the read.fasta function (see next slide) to load the sequence
     > read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA")
     $`gi|9626685|ref|NC_001477.1|`
     [1] "a" "g" "t" "t" "g" "t" "t" "a" "g" "t" "c" "t" "a" "c" "g" "t" "g" "g"
     [19] "a" "c" "c" "g" "a" "c" "a" "a" "g" "a" "a" "c" "a" "g" "t" "t" "t" "c"
     [37] "g" "a" "a" "t" "c" "g" "g" "a" "a" "g" "c" "t" "t" "g" "c" "t" "t" "a"
     [55] "a" "c" "g" "t" "a" "g" "t" "t" "c" "t" "a" "a" "c" "a" "g" "t" "t" "t"
     [............................................................]
     [10711] "t" "g" "g" "t" "g" "c" "t" "g" "t" "t" "g" "a" "a" "t" "c" "a" "a" "c"
     [10729] "a" "g" "g" "t" "t" "c" "t"
     attr(,"name")
     [1] "gi|9626685|ref|NC_001477.1|"
     attr(,"Annot")
     [1] ">gi|9626685|ref|NC_001477.1| Dengue virus 1, complete genome"
     attr(,"class")
     [1] "SeqFastadna"
     > read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=TRUE, set.attributes=FALSE)
     $`gi|9626685|ref|NC_001477.1|`
     [1] "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"
     > seq <- read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=TRUE, set.attributes=FALSE)
     > str(seq)
     List of 1
     $ gi|9626685|ref|NC_001477.1|: chr "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"
  • 11. Usage: read.fasta(file, seqtype = c("DNA", "AA"), as.string = FALSE, forceDNAtolower = TRUE, set.attributes = TRUE, legacy.mode = TRUE, seqonly = FALSE, strip.desc = FALSE, bfa = FALSE, sizeof.longlong = .Machine$sizeof.longlong, endian = .Platform$endian, apply.mask = TRUE)
     Arguments:
       file - path to the FASTA file (taken relative to getwd() if an absolute path is not given)
       seqtype - the nature of the sequence: DNA or AA
       as.string - if TRUE sequences are returned as a string instead of a vector of characters
       forceDNAtolower - whether DNA sequences are converted to lower case
       set.attributes - whether sequence attributes should be set
       legacy.mode - if TRUE lines starting with a semicolon ';' are ignored
       seqonly - if TRUE, only sequences are returned (execution time is divided approximately by a factor 3)
       strip.desc - if TRUE, removes the '>' at the beginning
       bfa - if TRUE the fasta file is in MAQ binary format
       sizeof.longlong, endian - relative to MAQ files
       apply.mask - relative to MAQ files
     Value: a list of vectors of chars
  • 12. Basic manipulations:
     #Turn the sequence into characters and count how many there are
     > length(s2c(seq[[1]]))
     [1] 10735
     #Count the occurrences of each character
     > table(s2c(seq[[1]]))
     a c g t
     3426 2240 2770 2299
     #Compute the fraction of G and C bases in the sequence
     > GC(s2c(seq[[1]]))
     [1] 0.4666977
     #Count all possible words in a sequence with a sliding window of size = wordsize
     > seq_2 <- "actg"
     > count(s2c(seq_2), wordsize=2)
     aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
     0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0
     > count(s2c(seq_2), wordsize=2, by=2)
     aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
     0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0
     #Translate the genetic sequence into amino acids.
     > c2s(translate(s2c(seq[[1]]), frame=0, sens='F', numcode=1))
     [1] "SC*STWTDKNSFESEACLT*F*QFFIREQISDEQPTEKDGSTVFQYAETREKPRVNCFTVGEEILK...-Truncated-... RIAFRPRTHEIGDGFYSIPKISSHTSNSRNFG*MGLIQEEWSDQGDPSQDTTQQRGPTPGEAVPWW*GLEVRGDPPHNNKQHIDAGRD QRSCCLYSIIPGTERQKMEWCC*INRF"
     #Note:
     # 1) frame=0 means that the first nucleotide is taken as the first position in the codon
     # 2) sens='F' means that the sequence is translated in the forward sense
     # 3) numcode=1 means the standard genetic code is used (in SeqinR, up to 23 variations)
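To make clear what the GC fraction and the sliding-window word count in the slide above measure, here is a base-R sketch that reproduces the same numbers. The helper names gc_fraction and count_words are hypothetical stand-ins, not seqinr functions, and the sketch assumes a DNA alphabet of a, c, g, t and a sequence at least as long as the word size.

```r
# Base-R sketch of what GC() and count() compute (assumed equivalents, not seqinr code).
gc_fraction <- function(bases) {
  bases <- tolower(bases)
  sum(bases %in% c("g", "c")) / sum(bases %in% c("a", "c", "g", "t"))
}

count_words <- function(bases, wordsize = 2, by = 1) {
  alphabet <- c("a", "c", "g", "t")
  # Enumerate every possible word so that absent words are reported with a count of 0
  grid <- expand.grid(rep(list(alphabet), wordsize), stringsAsFactors = FALSE)
  all_words <- sort(do.call(paste0, unname(as.list(grid))))
  # Window start positions; by = 1 gives overlapping windows
  starts <- seq(1, length(bases) - wordsize + 1, by = by)
  words <- vapply(starts,
                  function(i) paste(bases[i:(i + wordsize - 1)], collapse = ""),
                  character(1))
  table(factor(words, levels = all_words))
}

bases <- strsplit("actg", "")[[1]]
overlapping <- count_words(bases, wordsize = 2)           # windows: ac, ct, tg
stepped     <- count_words(bases, wordsize = 2, by = 2)   # windows: ac, tg
print(overlapping)
print(stepped)

# GC fraction recomputed from the base counts reported in the slide
dengue_counts <- c(a = 3426, c = 2240, g = 2770, t = 2299)
dengue_gc <- sum(dengue_counts[c("g", "c")]) / sum(dengue_counts)
print(dengue_gc)
```

With by = 2 the windows no longer overlap, which is why the word "ct" disappears from the second count in the slide.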
  • 13. Write a FASTA file (and how to save time)
     > dir()
     [1] "dengue_whole_sequence.fasta" "ER.fasta"
     > system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F))
     user system elapsed
     3.28 0.00 3.33
     > length(seq)
     [1] 9984
     > seq[[1]]
     [1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagcctt..._truncated_...ctcata"
     > names(seq)[1:5]
     [1] "ER0101A01F" "ER0101A03F" "ER0101A05F" "ER0101A07F" "ER0101A09F"
     #Clean the sequences, keeping only the ones with at least 25 nucleotides
     > char_seq = rapply(seq, s2c, how="list") # create a list turning strings into characters
     > len_seq = rapply(char_seq, length, how="list") # count how many characters are in each element
     > bigger_25_list = c()
     > for (val in 1:length(len_seq)){
     + if (len_seq[[val]] >= 25){
     + bigger_25_list = c(bigger_25_list, val)
     + }
     + }
     > seq_25 = seq[bigger_25_list] #indexing to get the desired list
     > length(seq_25)
     [1] 8928
     > seq_25[1]
     $ER0101A01F
     [1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagccttctcatacct...truncated..."
     #Write a FASTA file (see next slide)
     > write.fasta(seq_25, names = names(seq_25), file.out="clean_seq.fasta", open="w")
     > dir()
     [1] "clean_seq.fasta" "dengue_whole_sequence.fasta" "ER.fasta"
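The length filter in the slide above can also be written without an explicit loop. The following is a sketch of an alternative, not the author's method: because the sequences were read with as.string=T, each list element is a single string and nchar() gives its length directly. The list seqs and its read names are hypothetical stand-ins for the ER.fasta data.

```r
# Hypothetical stand-in for the list returned by read.fasta(..., as.string = TRUE)
seqs <- list(
  readA = "gtggtatcaacgcagagtacgcggggacg",
  readB = "acgtacgt",
  readC = "ttgacattgacattgacattgacat"
)

# Vectorised filter: keep the sequences with at least 25 nucleotides
seq_lengths <- nchar(unlist(seqs))
seq_25 <- seqs[seq_lengths >= 25]
print(names(seq_25))
```

Logical indexing keeps the element names, so the filtered list can be passed to write.fasta() in the same way as in the slide.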
  • 14. Usage: write.fasta(sequences, names, file.out, open = "w", nbchar = 60)
     Arguments:
       sequences - A DNA or protein sequence (in the form of a vector of single characters) or a list of such sequences.
       names - The name(s) of the sequences.
       file.out - The name of the output file.
       open - How to open the output file: use "w" to write into a new file, use "a" to append at the end of an already existing file.
       nbchar - The number of characters per line (default: 60)
     Value: none in the R space. A FASTA formatted file is created.
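As a usage sketch for the two functions documented so far, the example below writes a small list of sequences to a temporary FASTA file and reads it back. It assumes the seqinr package is installed; the sequence names and contents are hypothetical. The strings are converted with s2c() first, since write.fasta() is documented above as taking vectors of single characters.

```r
library(seqinr)

# Hypothetical sequences, stored as strings
seqs <- list(seqA = "acgtacgtacgt", seqB = "ttgacattgaca")
out_file <- tempfile(fileext = ".fasta")

# Convert each string to a vector of single characters before writing
write.fasta(sequences = lapply(seqs, s2c), names = names(seqs),
            file.out = out_file, open = "w", nbchar = 60)

# Read the file back as plain strings, without attributes
back <- read.fasta(file = out_file, seqtype = "DNA",
                   as.string = TRUE, set.attributes = FALSE)
print(back)
```

Using tempfile() keeps the round trip from overwriting anything in the working directory.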
  • 15. Write a FASTA file (and how to save time)
     # Remember what we did before
     > system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F))
     user system elapsed
     3.28 0.00 3.33
     # After cleaning seq, we produced a file "clean_seq.fasta" that we read now
     > system.time(seq_1 <- read.fasta(file="clean_seq.fasta", seqtype="DNA", as.string=T, set.attributes=F))
     user system elapsed
     1.30 0.01 1.31
     #Time can be saved with the save function:
     > save(seq_25, file = "ER_CLEAN_seqs.RData")
     > system.time(load("ER_CLEAN_seqs.RData"))
     user system elapsed
     0.11 0.02 0.12
  • 16. Read an alignment (create an alignment object from aligned sequences):
     #Create an alignment object by reading a FASTA file with two sequences (see next slide)
     > fasta <- read.alignment(file = system.file("sequences/Anouk.fasta", package = "seqinr"), format ="fasta")
     > fasta
     $nb
     [1] 2
     $nam
     [1] "LmjF01.0030" "LinJ01.0030"
     $seq
     $seq[[1]]
     [1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..."
     $seq[[2]]
     [1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..."
     $com
     [1] NA
     attr(,"class")
     [1] "alignment"
     # The consensus() function computes a consensus sequence from the aligned sequences. IUPAC symbols are used where they differ.
     > fixed_align = consensus(fasta, method="IUPAC")
     > table(fixed_align)
     fixed_align
     a c g k m r s t w y
     411 636 595 3 5 20 13 293 2 20
  • 17. Usage: read.alignment(file, format, forceToLower = TRUE)
     Arguments:
       file - The name of the file which the aligned sequences are to be read from. If it does not contain an absolute or relative path, the file name is relative to the current working directory, getwd.
       format - A character string specifying the format of the file: mase, clustal, phylip, fasta or msf
       forceToLower - A logical defaulting to TRUE stating whether the returned characters in the sequence should be in lower case
     Value: An object is created of class alignment which is a list with the following components:
       nb -> the number of aligned sequences
       nam -> a vector of strings containing the names of the aligned sequences
       seq -> a vector of strings containing the aligned sequences
       com -> a vector of strings containing the commentaries for each sequence or NA if there are no comments
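The idea behind an IUPAC consensus, as requested with consensus(fasta, method="IUPAC") earlier, can be sketched in base R: each alignment column is reduced to the set of bases it contains, and that set is mapped to its IUPAC ambiguity code. This is an illustrative sketch, not seqinr's implementation; the function name iupac_consensus and the two input sequences are hypothetical, and it assumes aligned sequences of equal length that contain only a, c, g and t (no gaps).

```r
# Base-R sketch of an IUPAC consensus (assumption: equal-length, gap-free DNA strings).
iupac_consensus <- function(aligned) {
  # IUPAC ambiguity codes, keyed by the sorted set of bases seen in a column
  codes <- c(a = "a", c = "c", g = "g", t = "t",
             ag = "r", ct = "y", cg = "s", at = "w", gt = "k", ac = "m",
             cgt = "b", agt = "d", act = "h", acg = "v", acgt = "n")
  # One row per sequence, one column per alignment position
  mat <- do.call(rbind, strsplit(tolower(aligned), ""))
  apply(mat, 2, function(column) {
    key <- paste(sort(unique(column)), collapse = "")
    unname(codes[key])
  })
}

# Hypothetical pair of aligned sequences that differ at positions 2 and 4
cons <- iupac_consensus(c("acgta", "aggca"))
print(cons)
```

Columns where both sequences agree keep the plain base, while a c/g column becomes "s" and a c/t column becomes "y", which is the kind of symbol counted in the table(fixed_align) output.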
  • 18. Access a remote server:
     > choosebank()
     [1] "genbank" "embl" "emblwgs" "swissprot" "ensembl" "hogenom"
     [7] "hogenomdna" "hovergendna" "hovergen" "hogenom5" "hogenom5dna" "hogenom4"
     [13] "hogenom4dna" "homolens" "homolensdna" "hobacnucl" "hobacprot" "phever2"
     [19] "phever2dna" "refseq" "greviews" "bacterial" "protozoan" "ensbacteria"
     [25] "ensprotists" "ensfungi" "ensmetazoa" "ensplants" "mito" "polymorphix"
     [31] "emglib" "taxobacgen" "refseqViruses"
     #Access a bank and see complementary information:
     > choosebank(bank="genbank", infobank=T)
     > ls()
     [1] "banknameSocket"
     > str(banknameSocket)
     List of 9
     $ socket :Classes 'sockconn', 'connection' atomic [1:1] 3 .. ..- attr(*, "conn_id")=<externalptr>
     $ bankname: chr "genbank"
     $ banktype: chr "GENBANK"
     $ totseqs : num 1.65e+08
     $ totspecs: num 968157
     $ totkeys : num 3.1e+07
     $ release : chr " GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012"
     $ status :Class 'AsIs' chr "on"
     $ details : chr [1:4] " **** ACNUC Data Base Content **** " " GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012" "139,677,722,280 bases; 152,280,170 sequences; 12,313,982 subseqs; 684,079 refers." "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive, Universite Lyon I "
     > banknameSocket$details
     [1] " **** ACNUC Data Base Content **** "
     [2] " GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012"
     [3] "139,677,722,280 bases; 152,280,170 sequences; 12,313,982 subseqs; 684,079 refers."
     [4] "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive, Universite Lyon I "
  • 19. Make a query:
     # query (see next slide) all sequences that contain the words "virus" and "dengue" in the taxonomy field and
     # that are not partial sequences (the inner double quotes of the request must be escaped)
     > system.time(query("All_Dengue_viruses_NOTpartial", "\"sp=@virus@\" AND \"sp=@dengue@\" AND NOT \"k=partial\""))
     user system elapsed
     0.78 0.00 6.72
     > All_Dengue_viruses_NOTpartial[1:4]
     $call
     query(listname = "All_Dengue_viruses_NOTpartial", query = "\"sp=@virus@\" AND \"sp=@dengue@\" AND NOT \"k=partial\"")
     $name
     [1] "All_Dengue_viruses_NOTpartial"
     $nelem
     [1] 7741
     $typelist
     [1] "SQ"
     > All_Dengue_viruses_NOTpartial[[5]][[1]]
     name length frame ncbicg
     "A13666" "456" "0" "1"
     > myseq = getSequence(All_Dengue_viruses_NOTpartial[[5]][[1]])
     > myseq[1:20]
     [1] "a" "t" "g" "g" "c" "c" "a" "t" "g" "g" "a" "c" "c" "t" "t" "g" "g" "t" "g" "a"
     > closebank()
  • 20. Usage: query(listname, query, socket = autosocket(), invisible = T, verbose = F, virtual = F)
     Arguments:
       listname - The name of the list as a quoted string of chars
       query - A quoted string of chars containing the request with the syntax given in the details section
       socket - An object of class sockconn connecting to a remote ACNUC database (default is a socket to the last opened database).
       invisible - if FALSE, the result is returned visibly.
       verbose - if TRUE, verbose mode is on
       virtual - if TRUE, no attempt is made to retrieve the information about all the elements of the list. In this case, the req component of the list is set to NA.
     Value: The result is a list with the following 6 components:
       call - the original call
       name - the ACNUC list name
       nelem - the number of elements (for instance sequences) in the ACNUC list
       typelist - the type of the elements of the list. Could be SQ for a list of sequence names, KW for a list of keywords, SP for a list of species names.
       req - a list of sequence names that fit the required criteria, or NA when called with the parameter virtual set to TRUE
       socket - the socket connection that was used
  • 21. Pau Corral Montañés RUGBCN 03/05/2012