SlideShare une entreprise Scribd logo
1  sur  8
rCAD: A Novel Database Schema for the Comparative Analysis of RNA
Stuart Ozer Kishore J. Doshi Weijia Xu Robin R. Gutell
Microsoft Corporation
Center for Computational
Biology and Bioinformatics
Texas Advanced
Computing Center
Institute for Cellular and
Molecular Biology
1 Microsoft Way The University of Texas at Austin
Redmond, WA 98052 Austin, Texas 78712
stuarto@microsoft.com kjdoshi@gmail.com xwj@tacc.utexas.edu robin.gutell@mail.utexas.edu
Abstract-- Beyond its direct involvement in protein
synthesis with mRNA, tRNA, and rRNA, RNA is now
being appreciated for its significance in the overall
metabolism and regulation of the cell. Comparative
analysis has been very effective in the identification and
characterization of RNA molecules, including the
accurate prediction of their secondary structure. We are
developing an integrative scalable data management and
analysis system, the RNA Comparative Analysis
Database (rCAD), implemented with SQL Server to
support RNA comparative analysis. The platform-
agnostic database schema of rCAD captures the essential
relationships between the different dimensions of
information for RNA comparative analysis datasets. The
rCAD implementation enables a variety of comparative
analysis manipulations with multiple integrated data
dimensions for advanced RNA comparative analysis
workflows. In this paper, we describe details of the
rCAD schema design and illustrate its usefulness with
two usage scenarios.
Keywords: Biological Database; RNA Sequence Analysis;
Bioinformatics; Database Schema
I. INTRODUCTION
A new perspective is now emerging in Biology: RNAs
have a dominant role in the structure, function and
regulation of the cell. Like DNA, it has a well-defined set of
rules for nucleotide base pairing – A pairs with U, and G
pairs with C. These consecutive and antiparallel base pairs
form canonical helices. Like protein, RNA is capable of
forming a three-dimensional structure composed of helices,
hairpin, internal, and multi-stem loops, and other structural
motifs, and like proteins, RNA is capable of catalyzing
chemical reactions [1-7]. It is now widely appreciated that
RNA, as a precursor to DNA and proteins, was essential to
the origin of life and to establish the mechanism for the
association of a cell’s genotype to its phenotype [8-11].
Comparative analysis, used effectively by Darwin to
compare and contrast anatomical features of animals [12],
has the potential to facilitate the identification and
characterization of the RNAs primary and higher-order
structure, and patterns of variation and conservation for the
set of analyzed sequences [13]. The utilization of
comparative analysis is based on a very important discovery
in molecular biology - the same RNA secondary and tertiary
structure can have different RNA sequences [14, 15]. The
predicted secondary structure models are generally
conserved for each of the specific RNAs. The accuracy from
these comparative studies is impressive. Approximately
98% of the base pairs identified with comparative analysis
in the three ribosomal RNAs – 16S, 23S, and 5S that
contain more than 4,500 nucleotides are in the high
resolution crystal structures [16]. In addition to the
prediction of an entire RNA structure model, comparative
analysis has identified RNA structural motifs, the basic
building blocks of RNA structure and biases in the
distribution of nucleotides in the ribosomal RNAs
secondary structure. These include: unpaired adenosines in
the rRNA secondary structures [17, 18], preponderance of
tetraloops - hairpin loops with four nucleotides [19] and
other types of irregular structural elements in the ribosomal
RNA [20].
Successful application of comparative analysis to an RNA
dataset requires an interactive workflow that includes the
acquisition, management and analysis of large amounts of
biological information divided into multiple dimensions: 1)
sequences and the alignments of those sequences based on a
common structure model, 2) evolutionary relationships
between sequences and 3) higher-order structure and
structural motifs. The iterative nature of the comparative
analysis workflow ultimately improves the predicted
structure model and the quality of the sequence alignment.
However, the comparative analysis workflow does not
lend itself to the software pipeline architecture currently
favored by most computational biology and bioinformatics
applications [21, 22]. The pipeline architecture assumes that
relevant data (e.g., sequence alignment) is primarily stored
in flat-files. Different programs load the data from flat-files
into memory, perform analysis and output the results to a
different flat-file. The pipeline is created by chaining
different programs together. Raw data enters at one end of
the pipeline and the value-added analyses exit at the other
end in an automated fashion.
The comparative analysis workflow requires a semi-
automated iterative approach. An important part of the
comparative analysis is the interaction between biologists
and the data. For example, as more sequences become
available, the existing multiple sequence alignment are
updated and curated to improve the comparative model.
Similarly, while comparative analysis leads to new
discoveries about structure and function of RNA sequences,
2011 Seventh IEEE International Conference on eScience
978-0-7695-4597-4/11 $26.00 © 2011 IEEE
DOI 10.1109/eScience.2011.11
15
th
c
a
u
c
r
a
o
s
in
th
p
n
b
d
u
d
f
e
in
e
im
a
p
to
d
s
m
hese findings
comparative m
analysis workf
using non-inter
We propose
comparative an
rapid increase i
and allows a bi
of RNA datase
structure and ev
nfrastructure i
he RNA Com
persists both th
natural form a
between entitie
database mana
using a scalab
database system
features of the
evolving data,
ntegration of m
The rCAD
entities for an
mplements a
alignment ont
phylogenetic) i
o facilitate an
database has ad
system elimin
minimizes cu
subsequently
model. Ther
flow cannot be
ractive, in-mem
Figure 1 Overvie
a different in
nalysis workflo
in size and num
iologist to perf
ets across the d
volutionary rel
is an efficient
mparative Anal
he data entities
and the biolo
es. The schema
agement system
ble, high perf
m, Microsoft
e rCAD schem
, such as ch
multiple dimen
schema enabl
RNA dataset
primary obje
tology: the in
nformation wi
nalysis and ex
dvantages over
ates the trans
stomized I/O
need to be int
refore, the f
e supported b
mory pipeline a
ew of rCAD system
nfrastructure to
ow that include
mber of suitab
form manipula
different dimen
lationships). A
data storage l
lysis Database
within each d
ogically releva
a is applicable f
m and curren
formance ente
SQL Server 2
ma are to eff
hanges in alig
sions of inform
es direct stor
in rCAD. Th
ective of the
ntegration of
th an RNA seq
xploration [23
the use of flat
slation of da
and memo
tegrated into t
full comparati
y software too
architectures.
m
o implement t
es support for t
ble RNA datas
tion and analy
nsions (sequenc
At the core of o
layer (Figure
e (rCAD), whi
imension in th
ant relationshi
for any relation
tly implement
erprise relation
2008. Two nov
fficiently supp
gnment and t
mation.
rage of the da
he rCAD syste
RNA structu
structural (a
quence alignme
]. A centraliz
-files. The rCA
ta formats, a
ry manageme
the
ive
ols
the
the
ets
ysis
ce,
our
1),
ich
heir
ips
nal
ted
nal
vel
ort
the
ata
em
ure
and
ent
zed
AD
and
ent
routine
data. In
two exa
The
analysi
While r
are rel
informa
A w
RNA
collecti
represe
models
databas
externa
users
alignme
taxonom
format
Ther
For ex
contain
alignme
alignme
probabi
trained
updates
searchi
browse
feature
analysi
sample
project
16S rRN
WAT
publicly
analysi
include
exampl
read fro
The
of RNA
types o
forms o
their n
rRNAs
RNA
alignme
RNA
maintai
various
sequenc
Chad
sequenc
es. Thus it is m
n this paper, w
amples of the b
II. R
rCAD project
is within a ce
rCAD is a uni
lated to the
ation.
well-known dat
family databa
ion of RNA se
ented by multip
s. Rfam is pri
se is frequent
al data sources
to browse, s
ents and co
my or keyword
for further ana
re are other pro
xample the R
ns bacterial
ents and som
ents maintaine
ilistic model s
from a set
s on RDP intro
ing RNA sequ
er and a taxono
of RDP is
is of bacterial r
es. Another sim
which enable
RNA sequences
TERS is a w
y available 16
is pipeline [22
e a dedicated
le, the alignm
om external so
rCAD project
A sequences an
of rRNAs – 5S
of life: bacteri
nuclear, chlor
s) as well as o
sequences are
ent data is cur
Web (CRW)
ins other dim
s analysis fe
ces and their a
do is a relati
ces published
more efficient
we describe deta
benefits obtain
RELATED WO
t integrates da
entralized rela
ique applicatio
management
ta source for R
ase (Rfam) [2
equence famili
ple sequence al
imarily a data
ly updated an
s. The web in
search and r
rresponding c
ds. Users can d
alysis.
ojects focused
Ribosomal D
and archaeal
me analysis o
ed in the RDP p
system. The pr
of representa
oduced new fe
uence informat
omy visualizat
its pyroseque
rRNA compos
milar web ser
e users to align
s [21].
workflow tool
6S rDNA ana
2]. It doesn’t
d data manag
ent has to be
ources into mem
maintains a c
nd multiple seq
S, 16S, and 23S
ia, archaea, an
roplast, and
other dimensio
e retrieved d
rated and avai
Site[13]. Th
mensions of in
eatures in ad
alignments.
ional database
d in recent
for analyzing
ails of rCAD s
ned with rCAD
ORKS
ata curation, a
ational databas
on, several oth
and analysis
RNA informa
24]. Rfam m
ies. Each RNA
lignments and
a provider ser
nd curated fro
nterface of Rfa
retrieve RNA
covariation m
download data
d on specific R
Database Proje
small subu
of this data[2
project are alig
robability para
ative sequence
eatures for bro
tion, including
tion tool. A ne
encing pipelin
ition from env
rvice is the G
n and analyze
that bundles
alysis tools to
address data c
gement compo
computed on
mory for proce
comprehensive
quence alignm
S, from the thr
nd eukaryotes
mitochondrial
ons of inform
daily from N
lable at the Co
he rCAD sch
nformation and
ddition to sto
e schema for
literature [26
large scale
schema and
.
access and
se system.
her projects
of RNA
tion is the
maintains a
A family is
covariance
rvice. This
om various
am enables
sequence
models by
a in flat file
RNA types.
ect (RDP)
unit rRNA
25]. The
gned with a
ameters are
es. Recent
owsing and
a genome
ew analytic
ne for the
vironmental
GreenGenes
e their own
a suite of
o construct
curation or
onent. For
the fly or
ssing.
e collection
ments for all
ree primary
(including
l encoded
mation. The
NCBI. The
omparative
hema also
d supports
oring raw
biological
6]. Chado
16
m
o
a
p
e
m
c
s
a
a
s
o
c
a
H
s
p
m
s
w
C
r
d
e
n
to
r
e
c
in
(
r
in
o
f
to
a
s
b
w
R
m
o
c
e
b
b
c
a
b
c
manages biolo
organisms with
associated with
protein product
existing ontolo
model organism
conform to this
support genom
analysis. Align
are typically s
sequence alignm
of the alignmen
comparative fe
analysis progra
However, it is
sequence alignm
provide an
management a
schema used in
within the rela
Chado.
III. SCH
The design
requirements: 1
dataset (seque
evolutionary re
natural relation
o the biologi
repository sup
execution of
comparative an
Our RNA da
nto three inter
(metadata and n
relationships.
nformation is
ontology [23].
for different co
o be executed
avoiding the n
separate analy
between the da
with the biolog
RNA sequence
matrices) as p
ontology [23].
column of the
entire alignmen
by the databas
base pairs, heli
columns of the
alignment are m
be devised to
containing sequ
ogical knowle
h information th
h genome sequ
ts encoded by
ogy and interop
m databases or
s schema. The
me analysis ins
ments and com
stored as pairs
ment may incl
nt. Such schem
features and e
ams (e.g. Blast)
less efficient
ments at large
integrative p
ccess and ana
n rCAD is to fa
tional database
EMA DESIGN A
n of rCAD
1) persist the
nces and seq
elationships and
nships; 2) effic
ical data, and
pporting the
computation
nalysis.
ataset for comp
rrelated dimens
nucleotides), st
The organiza
congruent wit
The unique ca
omparative ana
d on very lar
need to expo
ysis applicatio
ata entities pers
gical relations
e alignments a
proposed by t
SQL queries
e sequence al
nt into memory
se. RNA secon
ices, etc.) are
e sequence ali
mapped onto th
retrieve specif
uences from sp
edge for a w
hat can be dire
uences or the pr
a genome. Ch
perability betw
r any biologic
e primary focu
stead of comp
mparative featu
s. For exampl
ude hits and h
ma is a generic
enables results
) to be stored i
for storing ev
scale. The rCA
platform for
alysis. Therefo
acilitate comm
e system and
AND IMPLEMEN
is motivate
different elem
quence alignm
d alignments) t
iently support
d 3) provide
manipulation
nal algorithm
parative analysi
sions of inform
tructure (2-D)
ation of our
th the proposed
apabilities of S
alysis algorith
rge RNA data
ort large amou
ons (Figure 1
sisted in rCAD
ships and are
are stored as 2
the RNA stru
s can be used
ignment, with
y, using the in
ndary structure
directly relate
gnment. The s
he Tree of Life
fic fragments o
pecific taxonom
wide variety
ectly or indirec
rimary RNA a
hado is based
ween open acce
al databases th
us of Chado is
arative sequen
ures of sequenc
le, features of
high-scoring pa
c way for stori
s from extern
in a unified wa
volving multip
AD is designed
sequence da
ore, the databa
mon analysis tas
is different fro
NTATIONS
ed by seve
ments of an RN
ments, structur
that mimics th
frequent updat
a central da
n and efficie
ms involved
is is decompos
mation: sequen
and evolutiona
dimensions
d RNA structu
QL Server allo
ms and analys
asets in-proce
unts of data
). Relationshi
D are coordinat
queried direct
2-D grids (spar
ucture alignme
d to access a
hout loading t
ndexing provid
e elements (e.
d to the releva
sequences in t
e, and queries c
of the alignme
my subsets.
of
ctly
and
on
ess
hat
to
nce
ces
f a
airs
ing
nal
ay.
ple
d to
ata
ase
sks
om
eral
NA
res,
heir
tes
ata
ent
in
sed
nce
ary
of
ure
ow
ses
ess,
to
ips
ted
tly.
rse
ent
any
the
ded
g.,
ant
the
can
ent
The
compar
RNA d
Alignm
and Str
A. Sequ
Meta
Sequen
metada
Genban
Sequen
16S rR
organel
Mitoch
design
instanc
Genban
Figure 2 rC
database sc
rtments extract
dataset used
ment, Sequence
ructure Relation
uence Metadata
Figure 3
adata describin
nce Metadata
ata include
nk Accession
nceAccession
RNA) stored
llar location
hondrion) stor
of the Seque
ce to include re
nk.
CAD database sch
chema is de
ted from three
for Comparat
e Metadata, Ev
nships (Figure
ta Compartmen
Schema for seque
ng RNA sequ
compartment
external data
n and revis
table, the sequ
in Sequenc
within the
red in CellL
enceAccession
evisions of a s
hema overview.
e-normalized
e major dimens
tive Analysis:
volutionary Re
e-2).
nt
ence metadata.
uences is stor
t (Figure 3).
abase identifi
sions) stored
uence classific
ceType table
cell (e.g., N
ocationInfo t
table allows
sequence intro
into four
sions of an
Sequence
elationships
red in the
Types of
fiers (e.g.,
d in the
ation (e.g.,
and the
Nucleus or
table. The
the rCAD
oduced into
17
B
in
a
T
in
P
T
T
to
E
n
w
ta
s
(
d
C
B. Evolutionary
Figu
Evolutionary
n the Evolutio
and are obtaine
The evolutiona
n the Taxonom
ParentTaxID).
Taxonomy,
TaxonomyNam
o link betw
Evolutionary R
name of each
while all other
able for refere
stores the full l
(e.g., root/cellu
depth-based qu
C. Sequence Al
F
y Relationships
ure 4 Schema for ev
y relationships
onary Relation
ed from the N
ary relationship
my table as a s
The TaxID fi
Tax
mesOrdered ta
ween the Se
Relationships
taxon is store
r names are s
ences. The Ta
lineage to any
ular organisms/
ueries of the ev
lignment Comp
igure 5 Schema fo
s Compartmen
volutionary relatio
(taxonomy) in
nships compart
NCBI Taxonom
ps among sequ
et of parent-ch
ield is the prim
onomyNames
ables and acts
equence Met
compartments
ed in Taxono
stored in the A
axonomyName
leaf in the Lin
/Bacteria) in o
olutionary rela
partment
or sequence alignm
t
onships
rCAD are stor
tment (Figure
my database [2
uences are stor
hild pairs (TaxI
mary key for t
s a
as a foreign k
adata and t
s. The scienti
omyNames tab
AlternateNam
esOrdered tab
neageName fie
order to facilita
ationships.
ment
red
4),
7].
red
ID,
the
and
key
the
ific
ble
mes
ble
eld
ate
All
Sequen
more l
for a sp
dimens
juxtapo
The jux
remova
the alig
Unlike
sequenc
alignme
Each
unique
table (
stored i
of the A
nucleot
an align
nucleot
Physica
specific
alignme
alignme
Alignm
Table 1
5S
rRNA
16S
rRNA
23S
rRNA
This
alignme
updates
First,
values
sequenc
increas
The in
insertio
Tree of
in the
nucleot
For exa
rRNA)
[13], th
(Table
sequenc
gaps. T
gaps,
biological seq
nce Alignment
ogical sequen
pecific RNA m
sional matrix. A
osed with one a
xtaposition is a
al of gaps. Thu
gnment matrix
in other schem
ce level, rCA
ent at nucleotid
h sequence alig
key, the AlnID
(Figure 5). M
in the Alignm
AlnID to SeqID
tide in a sequen
nment as a row
tide is
alColumnNum
c nucleotide’s
ent is modifi
ent are iden
mentColumn ta
: Topology of Rib
Sequences
5355
5762
670
particular sc
ent has two a
s when changin
, in this schem
as they are inf
ces and observ
ses, the numbe
ncreasing sequ
ons and deletio
f Life. The res
alignment c
tides for any se
ample, for the
sequence alig
he average per
1). Thus, fo
ce alignment,
Therefore our
results in a s
quences in rC
t compartment
nce alignments
molecule type c
All sequences
another to iden
accomplished
us, the contents
x is either a n
ma where alig
AD directly st
de level.
gnment stored
D, and is catal
embers of a
mentSequence
D. The Alignm
nce aligned to
w record. In Al
associated
ber which doe
relative posit
fied. The colu
ntified and
able.
bosomal RNA sequ
Tree of Life
Total
Alignment
Columns
M
Se
Le
404
9655 3
17047 5
chema design
advantages, sp
ng the alignme
ma, there is no
ferred at query
ved sequence v
er of gaps also
ence variation
ons observed in
sult is that the
can greatly e
equence within
e small subuni
gnment spannin
row ratio of ga
or any given
on average 85
database sche
significant spa
CAD are stor
t, organized in
s. A sequence
an be describe
within the alig
ntify equivalent
through the ad
s of any individ
nucleotide or a
gnments are sto
tores multiple
d in rCAD is
logued in the A
sequence alig
table through
mentData table
a specific colu
lignmentData
to a
es not change
tion within the
umns of any
managed thr
uence alignments s
Max/Min
equence
ength
Avg.
Ratio
242/14 73%
316/506 85%
317/880 85%
n for storing
ace saving an
ent data.
o need to stor
y time. As the
variation in an
o increases sig
n is partially a
n specific bran
total number o
exceed the n
n the alignment
it Ribosomal R
ng the entire T
aps to nucleoti
row of the 1
5% of the colu
ema, which do
ace reduction
red in the
nto one or
alignment
d as a two-
gnment are
t positions.
ddition and
dual cell in
a gap (‘-‘).
ored at the
e sequence
assigned a
Alignment
gnment are
a mapping
maps each
umn within
table, each
specific
unless that
e sequence
y sequence
rough the
spanning the
Gap
o (%)
% ± 7%
% ± 2%
% ± 5%
sequence
nd efficient
re any gap
number of
n alignment
gnificantly.
a result of
ches of the
of columns
number of
t (Table 1).
RNA (16S
Tree of Life
ides is 85%
16S rRNA
umns have
oesn’t store
for storing
18
la
a
s
L
s
o
in
th
d
q
o
a
c
a
r
F
m
o
o
P
A
c
m
d
a
o
th
a
v
s
arge alignmen
and support v
sequence alignm
Figur
Secondly, th
LogicalColumn
supports effici
operations on
ndirection betw
he logical view
datasets are a
quickly becom
one entry per
alignment in
columnNumber
alignment, suc
require updatin
For example,
modified by co
only adds colum
of where the c
PhysicalColum
AlignmentCol
change, and n
modified (Figu
data structure
alignments, suc
operations, requ
he AlignmentD
To simplify
alignment sto
vAlignmentGri
sequence align
nts and permit
very large (>
ments.
re 6 an example of
he mapping o
nNumber in t
ient (from a
a sequence ali
ween the physi
w of the sequ
added to rCA
es the largest t
nucleotide fo
n an rCAD
r mapping, any
ch as insertin
ng a significan
when the seq
olumn insertion
mns to the “en
olumn is logic
mnNumber to L
umn table is u
no rows in th
ure 6). Without
e, global ope
ch as column i
uiring a signifi
Data table.
queries that
red in rCAD
id and vAlignm
ment with or w
ts the rCAD d
>106
rows x
f updating alignme
of PhysicalCo
the Alignmen
database per
ignment by pl
ical storage in
ence alignmen
D, the Align
table, even wit
or each sequen
D instance.
y further updat
g or deleting
nt portion of th
quence alignm
n (Figure 6), t
nd” of the align
cally inserted.
LogicalColumn
updated to refl
he Alignmen
t the column i
erations on
insertions, wou
icant number o
obtain data f
D, two view
mentGridUnga
without gaps r
database to sca
>104
column
ent data.
olumnNumber
ntColumn tab
rspective) glob
lacing a layer
the database a
nt. As new RN
mentData tab
th a minimum
nce within ea
Without th
tes of an existi
a column w
he existing row
ment topology
the data structu
nment, regardle
The mapping
nNumber in t
lect the topolo
tData table a
indirection in t
large sequen
uld be expensi
of row updates
from a sequen
ws are create
apped that retu
respectively. T
ale
ns)
to
ble
bal
of
and
NA
ble
of
ach
his
ing
will
ws.
is
ure
ess
of
the
ogy
are
the
nce
ive
to
nce
ed:
urn
The
ability
sequenc
advanc
structur
7 depic
query f
vAlignm
sequenc
among
the su
(Figure
only re
recursiv
used to
contain
alignme
(Figure
Figu
D. Stru
to retrieve spe
ce alignment th
ced applicatio
ral statistics an
cts the power o
for retrieving a
mentGrid -- l
ce alignment
sequences wi
ubset are iden
e 7a). The exa
etrieve rows
ve query (SQL
o identify all
n Bacterial seq
ent is then ret
e 7c) for the row
ure 7 An exemplar
ucture Relation
Figure 8 S
ecific sub-grid
hrough SQL qu
ons of the r
nd evolutionar
of the rCAD sy
a subset of a s
everaging the
and the ev
ithin the align
ntified by ev
ample query in
that contain
L Server comm
rows in the s
quences (Figure
trieved on a c
ws that contain
ar SQL query for re
nships Compart
chema of Structur
ds (rows x colu
ueries drives m
rCAD system
ry event counti
ystem through a
equence alignm
integration be
volutionary re
nment. First, th
volutionary re
n Figure 7 is d
Bacterial seq
mon table exp
sequence align
e 7b). The su
column by col
n Bacterial seq
etrieving partial al
tment
re Relationships
umns) of a
many of the
m such as
ing. Figure
an example
ment using
etween the
elationships
he rows of
elationships
designed to
quences. A
pression) is
nment that
ubset of the
lumn basis
quences.
lignment
19
s
s
b
F
c
lo
o
k
s
F
T
s
S
e
S
H
tw
s
lo
(
s
e
s
d
h
lo
p
S
4
ta
E
4
in
3
4
The Structu
stores seconda
structure for a
base pairs betw
From the set of
can be inferred
oops (un-paire
or multi-stem.
The Seconda
known or pred
sequence alignm
FivePrimeElem
ThreePrimeEle
secondary str
SecondaryStru
elements depe
SecondaryStru
Helices and in
wo extents for
split into n exte
oop. An exten
(index), Extent
structural elem
extent of the str
Figure 9 E
The mappin
structure into t
depicted in Fi
helices (5’ 1-4
oop (5’ 5-9, 3
pairs from
SecondaryStru
4, 3’ 32-35) ha
able (Exten
ExtentEndIndex
4, 1) and the
nternal loop (E
3’ half, but the
4). The extent
ural Relationsh
ary structure r
given sequenc
ween different p
f base pairs, ot
d such as helice
ed nucleotides)
aryStructureB
dicted base pai
ment. Each bas
mentSequenceIn
ementSequence
ructural elem
uctureExtents
ending on ext
uctureExtentT
ternal loop str
r their 5’ and 3
ents depending
nt is defined b
tStartIndex an
ent is given an
ructural elemen
Example of mappin
ng of a simpl
the Structural
igure 9. The
, 3’ 32-35 & 5
’ 26-31) and a
the two he
uctureBasePai
s two entries in
ntID, Exten
x, ExtentTypeID
3’ half is (1,
ExtentID 2) als
e hairpin loop
model enable
hips compartm
relationships.
ce, at a minim
positions of an
ther structural
es (consecutive
) classified as
BasePairs table
rs for any RN
se pair is store
ndex
eIndex. The m
ents are ide
s table and spl
tent type enu
Types table.
ructural elemen
3’ halves. Mul
g on the numbe
by its first and
nd ExtentEndIn
n identifier, Ex
nt has its own E
ng RNA structure i
le stem-loop
Relationships
stem-loop str
5’ 10-18, 3’ 20
a hairpin loop (
elices are e
irs table. The
n SecondaryS
ntOrdinal, E
ID) where the 5
2, 32, 35, 1)
so has two exte
has only one
es SQL querie
ment (Figure
RNA seconda
mum, is the set
n RNA sequenc
elements/exten
e base pairs) a
hairpin, intern
e holds the set
NA sequence in
d with two fiel
a
more complicat
entified in t
lit into their su
umerated in t
For examp
nts are split in
ti-stem loops a
er of stems in t
d last nucleoti
ndex. The ent
xtentID, and ea
ExtentOrdinal
into database
RNA seconda
compartment
ructure has tw
0-25) an intern
(16-19). All ba
entered in t
first helix (5’
tructureExten
ExtentStartInde
5’ half is (1, 1,
(Figure 9). T
ents for its 5’ a
extent (Extent
s to characteri
8)
ary
of
ce.
nts
and
nal
of
n a
lds
and
ted
the
ub-
the
ple,
nto
are
the
ide
tire
ach
.
ary
is
wo
nal
ase
the
1-
nts
ex,
, 1,
The
and
tID
ize
the patt
structur
Algo
from rC
other
program
already
exampl
present
countin
A. Struc
Here
frequen
observe
model
include
order s
terns of sequen
ral motif, acros
IV.
orithms that op
CAD streamlin
dimensions o
ms for rCAD
y organized an
les of applica
ted below: stru
ng.
Figure 10 Fin
ctural Statistic
e, we present
ncies. The fre
ed for a base p
has many app
es the predictio
structure with c
nce variation fo
ss the phylogen
USE CASE EXA
perate on sequ
ning and mana
of data. Thu
is not encum
nd cross-index
ations utilizing
uctural statistic
nding base pair fre
cs
a common ta
equency of di
pair in a refer
plications in co
on of base pair
covariation an
for different ele
netic tree.
AMPLES
uence alignme
aging their acc
us the develo
mbered with d
xed. Two rep
g the rCAD s
cs and evolutio
equency example.
ask of finding
ifferent base
rence secondar
omparative ana
rings in an RN
alysis and the
ements of a
nts benefit
cess to the
opment of
data that is
presentative
system are
onary event
g base pair
pair types
ry structure
alysis. This
NAs higher-
evaluation
20
o
e
p
c
f
s
s
c
S
s
d
1
q
(
p
p
a
B
p
th
m
d
s
th
p
n
th
1
p
c
to
H
p
id
a
th
c
of the alignmen
existing sequen
The selecte
projected acros
column ordinal
from a referenc
sequence alignm
selected using
column ordinal
SQL query is
sequence alignm
determine frequ
10 bottom).
Different stru
queries or co
(http://www.rna
presentations o
potentials for
application of o
B. Evolutionary
Figur
Positions in
patterns of vari
he RNAs high
methods for id
determine the f
sequences in tw
hese methods
prediction of a
not explicitly d
he evolution o
1983 that the
pair will enha
covariation ana
o identify the
However the
phylogenetic tr
dentified manu
and quantify th
hat time. Atte
computational
nt accuracy fo
nce alignment [
d base pair,
ss the sequenc
ls associated w
ce row (Figure
ment included
evolutionary
ls for the 5’ a
defined to r
ment, and a gr
uencies of obs
uctural statistic
mpiled applic
a.ccbb.utexas.e
on structural s
RNA foldin
our structural s
y Event Counti
re 11 Evolutionary
a sequence
iation (covariat
her-order struc
dentifying the
frequency of ea
wo columns of
have been v
an RNAs highe
determine the
of the RNA. It
evolutionary h
ance the acc
alysis [31]. We
first tertiary
number of
ree (called evo
ually and thus
hese phylogene
empts to iden
statistical alg
or any new seq
[13, 28].
labeled (1) in
e alignment by
with the 5’ an
e 10 middle).
in the frequen
relationships.
and 3’ half of
retrieve nucle
rouping statem
served base pa
cs have been de
cations. Visit
edu/SAE/2D)
statistics. Com
ng programs
tatistics[29, 30
ing
y Event Counting e
alignment th
tion) are usual
cture [17, 31]
ese positions w
ach base pair t
f the alignment
very effective
er-order structu
number of cov
has been know
history for eac
uracy and re
e utilized the p
interaction in
covariations
olutionary even
a rigorous att
etic events wa
ntify phylogen
gorithms of t
quence within
n Figure 10,
y identifying t
nd 3’ nucleotid
The rows of t
ncy tabulation a
Using the tw
f the base pair
otides from t
ment is applied
air types (Figu
eveloped as SQ
the CRW S
for mo
mputing statistic
is one speci
0].
example
hat have simi
lly base paired
. The tradition
with covariati
type for all of t
t [32, 33]. Wh
in the accura
ure [16], they
variations duri
wn since at le
ch putative ba
esolution of t
phylogenetic tr
the rRNA [3
based on t
nt counting) w
tempt to ident
as not possible
etic events w
the phylogene
an
is
the
des
the
are
wo
, a
the
to
ure
QL
Site
ore
cal
ific
ilar
d in
nal
ion
the
hile
ate
do
ing
ast
ase
the
ree
4].
the
was
ify
at
with
etic
trees im
[35, 36
How
precise
change
rCAD.
populat
candida
11). Th
Evoluti
based
applica
recursio
nodes a
version
current
more tr
false p
Xu, Oz
Prese
applica
predica
of three
sequenc
seconda
evolutio
The
traditio
sequenc
manipu
schema
alignme
Usin
perform
insertio
sequenc
evolutio
them di
an RN
analysi
queries
betwee
their oc
are easi
The
Server
algorith
memor
not req
resourc
manage
system
elemen
mproved the s
].
wever, instead
e count of the n
es on the phyl
Our method
ted with the nu
ate positions a
he in-memory
ionary Relation
on the NC
ation traverses
on to deduce t
and compute th
n of our metho
t event count
rue strong and
positives when
zer, & Gutell, m
ented here is
ation of compa
ated on the inte
e primary dim
ce alignment
ary, and
onary relations
rCAD system
onal computati
ce alignments
ulated in-memo
a directly pe
ents as sparse m
ng a column ind
ming global
on and deletion
ce alignment
onary and sec
irectly to the s
NA sequence
is algorithms c
s. For examp
en sequences, d
ccurrence on d
ily determined
rCAD databas
enables the
hms as compil
ry space of the
quire significan
ce allocation
ement and inp
places the r
nts on the en
sensitivity of
of using stati
number of chan
logenetic tree
builds an in
ucleotides obta
across the sequ
y tree structur
nships compar
CBI Taxonom
s the in-memo
the nucleotide
he evolutionary
od was publish
ting algorithm
d weakly cova
n compared to
manuscript in p
V. CONCLUSIO
s a new dat
arative analysi
egration and sim
mensions of inf
ts, associatio
three-dimensi
ships.
m is significa
ional biology
are primarily
ory by differe
ersists and in
matrices.
direction techn
alignment op
n of columns
s. The rCAD
condary struct
sequences and
e alignment.
can be impleme
ple, structural
different RNA
different parts
d with rCAD.
se implemente
development
led application
e database eng
nt amounts of c
n, parallel
nput/output op
responsibility
nterprise data
the covariatio
istical method
nges and locat
can be determ
n-memory tree
ained from pro
uence alignme
re is obtained
rtment of rCAD
my database
ory data struc
composition o
y event counts.
hed previously
m successfully
ariant positions
o other method
preparation).
ONS
tabase schem
is to RNA dat
multaneous ma
formation - seq
ons between
ional structu
antly different
workflows w
y stored in fla
ent programs. T
ndexes RNA
nique, rCAD is
perations incl
efficiently for
D schema st
ture entities a
individual nuc
Comparative
ented as declar
statistics, re
A structural ele
of the phylog
d on the Micr
t of more co
ns that execute
gine. These pr
custom progra
processing,
timization. T
for managing
abase engine,
n methods
ds, a more
tions of the
mined with
e structure,
ojecting the
ent (Figure
d from the
D, which is
[27]. The
cture using
of ancestor
. An earlier
y [37]. Our
y identifies
s with less
ds (Shang,
ma for the
tasets. It is
anipulation
quence and
primary,
ure, and
from the
where RNA
at-files and
The rCAD
sequence
capable of
luding the
very large
tores both
and relates
cleotides in
sequence
rative SQL
elationships
ements and
genetic tree
rosoft SQL
omplicated
within the
rograms do
amming for
memory
The rCAD
g the data
which is
21
optimized for manipulating large quantities of information,
and is scalable to support extremely large RNA datasets.
Readers who are interested in RNA sequence data can
visit CRW Site (http://www.rna.ccbb.utexas.edu/) which
uses rCAD for data management. Readers interested in
building customized database can visit
http://rcat.codeplex.com/ for available codes and utilities.
ACKNOWLEDGEMENTS
This project has been funded by an External Research
grant from Microsoft Research, National Institutes of Health
(GM067317 and GM085337) and the Welch Foundation
(#1427).
REFERENCES
[1] H. F. Noller and J. B. Chaires, "Functional modification of 16S
ribosomal RNA by kethoxal," Proc. Natl. Acad. Sci. USA, vol. 69, pp.
3115-8, Nov 1972.
[2] K. Kruger, P. J. Grabowski, A. J. Zaug, J. Sands, D. E. Gottschling,
and T. R. Cech, "Self-splicing RNA: autoexcision and autocyclization
of the ribosomal RNA intervening sequence of Tetrahymena," Cell,
vol. 31, pp. 147-57, Nov 1982.
[3] C. Guerrier-Takada, K. Gardiner, T. Marsh, N. Pace, and S. Altman,
"The RNA moiety of ribonuclease P is the catalytic subunit of the
enzyme," Cell, vol. 35, pp. 849-57, Dec 1983.
[4] H. F. Noller, V. Hoffarth, and L. Zimniak, "Unusual resistance of
peptidyl transferase to protein extraction procedures," Science, vol.
256, pp. 1416-9, Jun 5 1992.
[5] P. Nissen, J. Hansen, N. Ban, P. B. Moore, and T. A. Steitz, "The
structural basis of ribosome activity in peptide bond synthesis,"
Science, vol. 289, pp. 920-30, Aug 11 2000.
[6] B. T. Wimberly, D. E. Brodersen, W. M. Clemons, Jr., R. J. Morgan-
Warren, A. P. Carter, C. Vonrhein, T. Hartsch, and V. Ramakrishnan,
"Structure of the 30S ribosomal subunit," Nature, vol. 407, pp. 327-
39, Sep 21 2000.
[7] M. M. Yusupov, G. Z. Yusupova, A. Baucom, K. Lieberman, T. N.
Earnest, J. H. Cate, and H. F. Noller, "Crystal structure of the
ribosome at 5.5 A resolution," Science, vol. 292, pp. 883-96, May 4
2001.
[8] C. R. Woese and G. E. Fox, "The concept of cellular evolution," J.
Mol. Evol., vol. 10, pp. 1-6, Sep 20 1977.
[9] C. R. Woese, The genetic code; the molecular basis for genetic
expression. New York,: Harper & Row, 1967.
[10] L. E. Orgel, "Evolution of the genetic apparatus," J. Mol. Biol., vol.
38, pp. 381-93, Dec 1968.
[11] F. H. Crick, "The origin of the genetic code," J. Mol. Biol., vol. 38,
pp. 367-79, Dec 1968.
[12] C. Darwin, The origin of species. New York, NY: Barnes & Noble
Classics, 2008.
[13] J. J. Cannone, S. Subramanian, M. N. Schnare, J. R. Collett, L. M.
D'Souza, Y. Du, B. Feng, N. Lin, L. V. Madabusi, K. M. Muller, N.
Pande, Z. Shang, N. Yu, and R. R. Gutell, "The comparative RNA
web (CRW) site: an online database of comparative sequence and
structure information for ribosomal, intron, and other RNAs," BMC
Bioinformatics, vol. 3, p. 2, 2002.
[14] R. W. Holley, J. Apgar, G. A. Everett, J. T. Madison, M. Marquisee,
S. H. Merrill, J. R. Penswick, and A. Zamir, "Structure of a
ribonucleic acid," Science, vol. 147, pp. 1462-5, Mar 19 1965.
[15] G. E. Fox and C. R. Woese, "5S RNA secondary structure," Nature,
vol. 256, pp. 505-7, Aug 7 1975.
[16] R. R. Gutell, J. C. Lee, and J. J. Cannone, "The accuracy of ribosomal
RNA comparative structure models," Curr. Opin. Struct. Biol., vol.
12, pp. 301-10, Jun 2002.
[17] R. R. Gutell, B. Weiser, C. R. Woese, and H. F. Noller, "Comparative
anatomy of 16-S-like ribosomal RNA," Prog. Nucleic Acid Res. Mol.
Biol., vol. 32, pp. 155-216, 1985.
[18] R. R. Gutell, J. J. Cannone, Z. Shang, Y. Du, and M. J. Serra, "A
story: unpaired adenosine bases in ribosomal RNAs," J. Mol. Biol.,
vol. 304, pp. 335-54, Dec 1 2000.
[19] C. R. Woese, S. Winker, and R. R. Gutell, "Architecture of ribosomal
RNA: constraints on the sequence of "tetra-loops"," Proc. Natl. Acad.
Sci. USA, vol. 87, pp. 8467-71, Nov 1990.
[20] R. R. Gutell, N. Larsen, and C. R. Woese, "Lessons from an evolving
rRNA: 16S and 23S rRNA structures from a comparative
perspective," Microbiol. Rev., vol. 58, pp. 10-26, Mar 1994.
[21] T. Z. DeSantis, P. Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie, K.
Keller, T. Huber, D. Dalevi, P. Hu, and G. L. Andersen, "Greengenes,
a chimera-checked 16S rRNA gene database and workbench
compatible with ARB," Appl. Environ. Microbiol., vol. 72, pp. 5069-
72, Jul 2006.
[22] A. L. Hartman, S. Riddle, T. McPhillips, B. Ludascher, and J. A.
Eisen, "Introducing W.A.T.E.R.S.: a workflow for the alignment,
taxonomy, and ecology of ribosomal sequences," BMC
Bioinformatics, vol. 11, p. 317, 2010.
[23] J. W. Brown, A. Birmingham, P. E. Griffiths, F. Jossinet, R.
Kachouri-Lafond, R. Knight, B. F. Lang, N. Leontis, G. Steger, J.
Stombaugh, and E. Westhof, "The RNA structure alignment
ontology," RNA, vol. 15, pp. 1623-31, Sep 2009.
[24] P. P. Gardner, J. Daub, J. G. Tate, E. P. Nawrocki, D. L. Kolbe, S.
Lindgreen, A. C. Wilkinson, R. D. Finn, S. Griffiths-Jones, S. R.
Eddy, and A. Bateman, "Rfam: updates to the RNA families
database," Nucleic Acids Res., vol. 37, pp. D136-40, Jan 2009.
[25] J. R. Cole, Q. Wang, E. Cardenas, J. Fish, B. Chai, R. J. Farris, A. S.
Kulam-Syed-Mohideen, D. M. McGarrell, T. Marsh, G. M. Garrity,
and J. M. Tiedje, "The Ribosomal Database Project: improved
alignments and new tools for rRNA analysis," Nucleic Acids Res.,
vol. 37, pp. D141-5, Jan 2009.
[26] C. J. Mungall and D. B. Emmert, "A Chado case study: an ontology-
based modular schema for representing genome-associated biological
information," Bioinformatics, vol. 23, pp. i337-46, Jul 1 2007.
[27] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W.
Sayers, "GenBank," Nucleic Acids Res., vol. 37, pp. D26-31, Jan
2009.
[28] K. J. Doshi, "Ph.D. dissertation. The University of Texas at Austin.,"
2007.
[29] J. C. Wu, D. P. Gardner, S. Ozer, R. R. Gutell, and P. Ren,
"Correlation of RNA secondary structure statistics with
thermodynamic stability and applications to folding," J. Mol. Biol.,
vol. 391, pp. 769-83, Aug 28 2009.
[30] D. P. Gardner, P. Ren, S. Ozer, and R. R. Gutell, "Statistical
Potentials for Hairpin and Internal Loops Improve the Accuracy of
the Predicted RNA Structure," Journal of Molecular Biology, vol. in
press, 2011.
[31] C. R. Woese, R. Gutell, R. Gupta, and H. F. Noller, "Detailed
analysis of the higher-order structure of 16S-like ribosomal
ribonucleic acids," Microbiol. Rev., vol. 47, pp. 621-69, Dec 1983.
[32] D. K. Chiu and T. Kolodziejczak, "Inferring consensus structure from
nucleic acid sequences," Comput. Appl. Biosci., vol. 7, pp. 347-52, Jul
1991.
[33] R. R. Gutell, A. Power, G. Z. Hertz, E. J. Putz, and G. D. Stormo,
"Identifying constraints on the higher-order structure of RNA:
continued development and application of comparative sequence
analysis methods," Nucleic Acids Res., vol. 20, pp. 5785-95, Nov 11
1992.
[34] R. R. Gutell, H. F. Noller, and C. R. Woese, "Higher order structure
in ribosomal RNA," EMBO J., vol. 5, pp. 1111-3, May 1986.
[35] J. Dutheil, T. Pupko, A. Jean-Marie, and N. Galtier, "A model-based
approach for detecting coevolving positions in a molecule," Mol. Biol.
Evol., vol. 22, pp. 1919-28, Sep 2005.
[36] C. H. Yeang, J. F. Darot, H. F. Noller, and D. Haussler, "Detecting
the coevolution of biosequences--an example of RNA interaction
prediction," Mol. Biol. Evol., vol. 24, pp. 2119-31, Sep 2007.
[37] W. Xu, S. Ozer, and R. R. Gutell, "Covariant Evolutionary Event
Analysis for Base Interaction Prediction Using a Relational Database
Management System for RNA " Lecture Notes in Computer Science,
vol. 5566, pp. 200-216, May 2009.
22

Contenu connexe

En vedette

Muchas cosas para ti
Muchas cosas para tiMuchas cosas para ti
Muchas cosas para tiBecomecol
 
The Hunger
The  HungerThe  Hunger
The Hungerrlakis
 
unofficial transcript
unofficial transcriptunofficial transcript
unofficial transcriptRyan McIntyre
 
Erika
ErikaErika
Erika99175
 
Vesna vulovic.pptx. horacio stiusso
Vesna vulovic.pptx. horacio stiussoVesna vulovic.pptx. horacio stiusso
Vesna vulovic.pptx. horacio stiussoAntonioCabrala
 
iPhone Programming [2/17] : Introduction to iOS Programming
iPhone Programming [2/17] : Introduction to iOS ProgrammingiPhone Programming [2/17] : Introduction to iOS Programming
iPhone Programming [2/17] : Introduction to iOS ProgrammingIMC Institute
 
Gutell 084.jmb.2002.323.0035
Gutell 084.jmb.2002.323.0035Gutell 084.jmb.2002.323.0035
Gutell 084.jmb.2002.323.0035Robin Gutell
 
Technical Excellence Award
Technical Excellence AwardTechnical Excellence Award
Technical Excellence AwardBhaumik Patel
 
Convergencia multimedia de la información noticiosa y publicitaria de un medi...
Convergencia multimedia de la información noticiosa y publicitaria de un medi...Convergencia multimedia de la información noticiosa y publicitaria de un medi...
Convergencia multimedia de la información noticiosa y publicitaria de un medi...Alban Herlan
 
Inteligência Coletiva em Ambientes Corporativos
Inteligência Coletiva em Ambientes CorporativosInteligência Coletiva em Ambientes Corporativos
Inteligência Coletiva em Ambientes CorporativosElvis Fusco
 
Casa Porto seguro
Casa Porto seguroCasa Porto seguro
Casa Porto segurobusinessss
 

En vedette (20)

Muchas cosas para ti
Muchas cosas para tiMuchas cosas para ti
Muchas cosas para ti
 
Open office pantallazos
Open office pantallazosOpen office pantallazos
Open office pantallazos
 
iFluids_Trainings 2016
iFluids_Trainings 2016iFluids_Trainings 2016
iFluids_Trainings 2016
 
On The Spot Award
On The Spot AwardOn The Spot Award
On The Spot Award
 
The Hunger
The  HungerThe  Hunger
The Hunger
 
Next Play Ground
Next Play GroundNext Play Ground
Next Play Ground
 
C004
C004C004
C004
 
Bullying preentacion
Bullying preentacionBullying preentacion
Bullying preentacion
 
unofficial transcript
unofficial transcriptunofficial transcript
unofficial transcript
 
Erika
ErikaErika
Erika
 
Avestruz ii
Avestruz iiAvestruz ii
Avestruz ii
 
Vesna vulovic.pptx. horacio stiusso
Vesna vulovic.pptx. horacio stiussoVesna vulovic.pptx. horacio stiusso
Vesna vulovic.pptx. horacio stiusso
 
iPhone Programming [2/17] : Introduction to iOS Programming
iPhone Programming [2/17] : Introduction to iOS ProgrammingiPhone Programming [2/17] : Introduction to iOS Programming
iPhone Programming [2/17] : Introduction to iOS Programming
 
Gutell 084.jmb.2002.323.0035
Gutell 084.jmb.2002.323.0035Gutell 084.jmb.2002.323.0035
Gutell 084.jmb.2002.323.0035
 
Journal 9
Journal 9Journal 9
Journal 9
 
Trabajo tics
Trabajo ticsTrabajo tics
Trabajo tics
 
Technical Excellence Award
Technical Excellence AwardTechnical Excellence Award
Technical Excellence Award
 
Convergencia multimedia de la información noticiosa y publicitaria de un medi...
Convergencia multimedia de la información noticiosa y publicitaria de un medi...Convergencia multimedia de la información noticiosa y publicitaria de un medi...
Convergencia multimedia de la información noticiosa y publicitaria de un medi...
 
Inteligência Coletiva em Ambientes Corporativos
Inteligência Coletiva em Ambientes CorporativosInteligência Coletiva em Ambientes Corporativos
Inteligência Coletiva em Ambientes Corporativos
 
Casa Porto seguro
Casa Porto seguroCasa Porto seguro
Casa Porto seguro
 

Similaire à Gutell 117.rcad_e_science_stockholm_pp15-22

Gutell 107.ssdbm.2009.200
Gutell 107.ssdbm.2009.200Gutell 107.ssdbm.2009.200
Gutell 107.ssdbm.2009.200Robin Gutell
 
Gutell 121.bibm12 alignment 06392676
Gutell 121.bibm12 alignment 06392676Gutell 121.bibm12 alignment 06392676
Gutell 121.bibm12 alignment 06392676Robin Gutell
 
Gutell 116.rpass.bibm11.pp618-622.2011
Gutell 116.rpass.bibm11.pp618-622.2011Gutell 116.rpass.bibm11.pp618-622.2011
Gutell 116.rpass.bibm11.pp618-622.2011Robin Gutell
 
Genomic repeats detection using Boyer-Moore algorithm on Apache Spark Streaming
Genomic repeats detection using Boyer-Moore algorithm on Apache Spark Streaming Genomic repeats detection using Boyer-Moore algorithm on Apache Spark Streaming
Genomic repeats detection using Boyer-Moore algorithm on Apache Spark Streaming TELKOMNIKA JOURNAL
 
Gutell 114.jmb.2011.413.0473
Gutell 114.jmb.2011.413.0473Gutell 114.jmb.2011.413.0473
Gutell 114.jmb.2011.413.0473Robin Gutell
 
Automatically converting tabular data to
Automatically converting tabular data toAutomatically converting tabular data to
Automatically converting tabular data toIJwest
 
Gutell 112.j.phys.chem.b.2010.114.13497
Gutell 112.j.phys.chem.b.2010.114.13497Gutell 112.j.phys.chem.b.2010.114.13497
Gutell 112.j.phys.chem.b.2010.114.13497Robin Gutell
 
Executive SummaryIntroductionProtein engineering
Executive SummaryIntroductionProtein engineeringExecutive SummaryIntroductionProtein engineering
Executive SummaryIntroductionProtein engineeringBetseyCalderon89
 
Pattern recognition system based on support vector machines
Pattern recognition system based on support vector machinesPattern recognition system based on support vector machines
Pattern recognition system based on support vector machinesAlexander Decker
 
Performance Analysis and Parallelization of CosineSimilarity of Documents
Performance Analysis and Parallelization of CosineSimilarity of DocumentsPerformance Analysis and Parallelization of CosineSimilarity of Documents
Performance Analysis and Parallelization of CosineSimilarity of DocumentsIRJET Journal
 
Bioinformatic, and tools by kk sahu
Bioinformatic, and tools by kk sahuBioinformatic, and tools by kk sahu
Bioinformatic, and tools by kk sahuKAUSHAL SAHU
 
Gutell 108.jmb.2009.391.769
Gutell 108.jmb.2009.391.769Gutell 108.jmb.2009.391.769
Gutell 108.jmb.2009.391.769Robin Gutell
 
art%3A10.1186%2Fs12859-016-1260-x
art%3A10.1186%2Fs12859-016-1260-xart%3A10.1186%2Fs12859-016-1260-x
art%3A10.1186%2Fs12859-016-1260-xJames Hennessy
 
Gutell 079.nar.2001.29.04724
Gutell 079.nar.2001.29.04724Gutell 079.nar.2001.29.04724
Gutell 079.nar.2001.29.04724Robin Gutell
 
Gutell 090.bmc.bioinformatics.2004.5.105
Gutell 090.bmc.bioinformatics.2004.5.105Gutell 090.bmc.bioinformatics.2004.5.105
Gutell 090.bmc.bioinformatics.2004.5.105Robin Gutell
 

Similaire à Gutell 117.rcad_e_science_stockholm_pp15-22 (20)

Gutell 107.ssdbm.2009.200
Gutell 107.ssdbm.2009.200Gutell 107.ssdbm.2009.200
Gutell 107.ssdbm.2009.200
 
Gutell 121.bibm12 alignment 06392676
Gutell 121.bibm12 alignment 06392676Gutell 121.bibm12 alignment 06392676
Gutell 121.bibm12 alignment 06392676
 
chemengine karthi acs sandiego rev1.0
chemengine karthi acs sandiego rev1.0chemengine karthi acs sandiego rev1.0
chemengine karthi acs sandiego rev1.0
 
Gutell 116.rpass.bibm11.pp618-622.2011
Gutell 116.rpass.bibm11.pp618-622.2011Gutell 116.rpass.bibm11.pp618-622.2011
Gutell 116.rpass.bibm11.pp618-622.2011
 
Genomic repeats detection using Boyer-Moore algorithm on Apache Spark Streaming
Genomic repeats detection using Boyer-Moore algorithm on Apache Spark Streaming Genomic repeats detection using Boyer-Moore algorithm on Apache Spark Streaming
Genomic repeats detection using Boyer-Moore algorithm on Apache Spark Streaming
 
Automatic Parallelization for Parallel Architectures Using Smith Waterman Alg...
Automatic Parallelization for Parallel Architectures Using Smith Waterman Alg...Automatic Parallelization for Parallel Architectures Using Smith Waterman Alg...
Automatic Parallelization for Parallel Architectures Using Smith Waterman Alg...
 
Gutell 114.jmb.2011.413.0473
Gutell 114.jmb.2011.413.0473Gutell 114.jmb.2011.413.0473
Gutell 114.jmb.2011.413.0473
 
Automatically converting tabular data to
Automatically converting tabular data toAutomatically converting tabular data to
Automatically converting tabular data to
 
Gutell 112.j.phys.chem.b.2010.114.13497
Gutell 112.j.phys.chem.b.2010.114.13497Gutell 112.j.phys.chem.b.2010.114.13497
Gutell 112.j.phys.chem.b.2010.114.13497
 
Executive SummaryIntroductionProtein engineering
Executive SummaryIntroductionProtein engineeringExecutive SummaryIntroductionProtein engineering
Executive SummaryIntroductionProtein engineering
 
Pattern recognition system based on support vector machines
Pattern recognition system based on support vector machinesPattern recognition system based on support vector machines
Pattern recognition system based on support vector machines
 
Performance Analysis and Parallelization of CosineSimilarity of Documents
Performance Analysis and Parallelization of CosineSimilarity of DocumentsPerformance Analysis and Parallelization of CosineSimilarity of Documents
Performance Analysis and Parallelization of CosineSimilarity of Documents
 
Semester Report
Semester ReportSemester Report
Semester Report
 
Bioinformatic, and tools by kk sahu
Bioinformatic, and tools by kk sahuBioinformatic, and tools by kk sahu
Bioinformatic, and tools by kk sahu
 
Karyotype DAS client
Karyotype DAS clientKaryotype DAS client
Karyotype DAS client
 
Gutell 108.jmb.2009.391.769
Gutell 108.jmb.2009.391.769Gutell 108.jmb.2009.391.769
Gutell 108.jmb.2009.391.769
 
art%3A10.1186%2Fs12859-016-1260-x
art%3A10.1186%2Fs12859-016-1260-xart%3A10.1186%2Fs12859-016-1260-x
art%3A10.1186%2Fs12859-016-1260-x
 
Gutell 079.nar.2001.29.04724
Gutell 079.nar.2001.29.04724Gutell 079.nar.2001.29.04724
Gutell 079.nar.2001.29.04724
 
Gutell 090.bmc.bioinformatics.2004.5.105
Gutell 090.bmc.bioinformatics.2004.5.105Gutell 090.bmc.bioinformatics.2004.5.105
Gutell 090.bmc.bioinformatics.2004.5.105
 
Poster (1)
Poster (1)Poster (1)
Poster (1)
 

Plus de Robin Gutell

Gutell 124.rna 2013-woese-19-vii-xi
Gutell 124.rna 2013-woese-19-vii-xiGutell 124.rna 2013-woese-19-vii-xi
Gutell 124.rna 2013-woese-19-vii-xiRobin Gutell
 
Gutell 123.app environ micro_2013_79_1803
Gutell 123.app environ micro_2013_79_1803Gutell 123.app environ micro_2013_79_1803
Gutell 123.app environ micro_2013_79_1803Robin Gutell
 
Gutell 122.chapter comparative analy_russell_2013
Gutell 122.chapter comparative analy_russell_2013Gutell 122.chapter comparative analy_russell_2013
Gutell 122.chapter comparative analy_russell_2013Robin Gutell
 
Gutell 120.plos_one_2012_7_e38320_supplemental_data
Gutell 120.plos_one_2012_7_e38320_supplemental_dataGutell 120.plos_one_2012_7_e38320_supplemental_data
Gutell 120.plos_one_2012_7_e38320_supplemental_dataRobin Gutell
 
Gutell 119.plos_one_2017_7_e39383
Gutell 119.plos_one_2017_7_e39383Gutell 119.plos_one_2017_7_e39383
Gutell 119.plos_one_2017_7_e39383Robin Gutell
 
Gutell 118.plos_one_2012.7_e38203.supplementalfig
Gutell 118.plos_one_2012.7_e38203.supplementalfigGutell 118.plos_one_2012.7_e38203.supplementalfig
Gutell 118.plos_one_2012.7_e38203.supplementalfigRobin Gutell
 
Gutell 113.ploso.2011.06.e18768
Gutell 113.ploso.2011.06.e18768Gutell 113.ploso.2011.06.e18768
Gutell 113.ploso.2011.06.e18768Robin Gutell
 
Gutell 111.bmc.genomics.2010.11.485
Gutell 111.bmc.genomics.2010.11.485Gutell 111.bmc.genomics.2010.11.485
Gutell 111.bmc.genomics.2010.11.485Robin Gutell
 
Gutell 110.ant.v.leeuwenhoek.2010.98.195
Gutell 110.ant.v.leeuwenhoek.2010.98.195Gutell 110.ant.v.leeuwenhoek.2010.98.195
Gutell 110.ant.v.leeuwenhoek.2010.98.195Robin Gutell
 
Gutell 109.ejp.2009.44.277
Gutell 109.ejp.2009.44.277Gutell 109.ejp.2009.44.277
Gutell 109.ejp.2009.44.277Robin Gutell
 
Gutell 106.j.euk.microbio.2009.56.0142.2
Gutell 106.j.euk.microbio.2009.56.0142.2Gutell 106.j.euk.microbio.2009.56.0142.2
Gutell 106.j.euk.microbio.2009.56.0142.2Robin Gutell
 
Gutell 105.zoologica.scripta.2009.38.0043
Gutell 105.zoologica.scripta.2009.38.0043Gutell 105.zoologica.scripta.2009.38.0043
Gutell 105.zoologica.scripta.2009.38.0043Robin Gutell
 
Gutell 104.biology.direct.2008.03.016
Gutell 104.biology.direct.2008.03.016Gutell 104.biology.direct.2008.03.016
Gutell 104.biology.direct.2008.03.016Robin Gutell
 
Gutell 103.structure.2008.16.0535
Gutell 103.structure.2008.16.0535Gutell 103.structure.2008.16.0535
Gutell 103.structure.2008.16.0535Robin Gutell
 
Gutell 102.bioinformatics.2007.23.3289
Gutell 102.bioinformatics.2007.23.3289Gutell 102.bioinformatics.2007.23.3289
Gutell 102.bioinformatics.2007.23.3289Robin Gutell
 
Gutell 101.physica.a.2007.386.0564.good
Gutell 101.physica.a.2007.386.0564.goodGutell 101.physica.a.2007.386.0564.good
Gutell 101.physica.a.2007.386.0564.goodRobin Gutell
 
Gutell 100.imb.2006.15.533
Gutell 100.imb.2006.15.533Gutell 100.imb.2006.15.533
Gutell 100.imb.2006.15.533Robin Gutell
 
Gutell 099.nature.2006.443.0931
Gutell 099.nature.2006.443.0931Gutell 099.nature.2006.443.0931
Gutell 099.nature.2006.443.0931Robin Gutell
 
Gutell 098.jmb.2006.360.0978
Gutell 098.jmb.2006.360.0978Gutell 098.jmb.2006.360.0978
Gutell 098.jmb.2006.360.0978Robin Gutell
 
Gutell 097.jphy.2006.42.0655
Gutell 097.jphy.2006.42.0655Gutell 097.jphy.2006.42.0655
Gutell 097.jphy.2006.42.0655Robin Gutell
 

Plus de Robin Gutell (20)

Gutell 124.rna 2013-woese-19-vii-xi
Gutell 124.rna 2013-woese-19-vii-xiGutell 124.rna 2013-woese-19-vii-xi
Gutell 124.rna 2013-woese-19-vii-xi
 
Gutell 123.app environ micro_2013_79_1803
Gutell 123.app environ micro_2013_79_1803Gutell 123.app environ micro_2013_79_1803
Gutell 123.app environ micro_2013_79_1803
 
Gutell 122.chapter comparative analy_russell_2013
Gutell 122.chapter comparative analy_russell_2013Gutell 122.chapter comparative analy_russell_2013
Gutell 122.chapter comparative analy_russell_2013
 
Gutell 120.plos_one_2012_7_e38320_supplemental_data
Gutell 120.plos_one_2012_7_e38320_supplemental_dataGutell 120.plos_one_2012_7_e38320_supplemental_data
Gutell 120.plos_one_2012_7_e38320_supplemental_data
 
Gutell 119.plos_one_2017_7_e39383
Gutell 119.plos_one_2017_7_e39383Gutell 119.plos_one_2017_7_e39383
Gutell 119.plos_one_2017_7_e39383
 
Gutell 118.plos_one_2012.7_e38203.supplementalfig
Gutell 118.plos_one_2012.7_e38203.supplementalfigGutell 118.plos_one_2012.7_e38203.supplementalfig
Gutell 118.plos_one_2012.7_e38203.supplementalfig
 
Gutell 113.ploso.2011.06.e18768
Gutell 113.ploso.2011.06.e18768Gutell 113.ploso.2011.06.e18768
Gutell 113.ploso.2011.06.e18768
 
Gutell 111.bmc.genomics.2010.11.485
Gutell 111.bmc.genomics.2010.11.485Gutell 111.bmc.genomics.2010.11.485
Gutell 111.bmc.genomics.2010.11.485
 
Gutell 110.ant.v.leeuwenhoek.2010.98.195
Gutell 110.ant.v.leeuwenhoek.2010.98.195Gutell 110.ant.v.leeuwenhoek.2010.98.195
Gutell 110.ant.v.leeuwenhoek.2010.98.195
 
Gutell 109.ejp.2009.44.277
Gutell 109.ejp.2009.44.277Gutell 109.ejp.2009.44.277
Gutell 109.ejp.2009.44.277
 
Gutell 106.j.euk.microbio.2009.56.0142.2
Gutell 106.j.euk.microbio.2009.56.0142.2Gutell 106.j.euk.microbio.2009.56.0142.2
Gutell 106.j.euk.microbio.2009.56.0142.2
 
Gutell 105.zoologica.scripta.2009.38.0043
Gutell 105.zoologica.scripta.2009.38.0043Gutell 105.zoologica.scripta.2009.38.0043
Gutell 105.zoologica.scripta.2009.38.0043
 
Gutell 104.biology.direct.2008.03.016
Gutell 104.biology.direct.2008.03.016Gutell 104.biology.direct.2008.03.016
Gutell 104.biology.direct.2008.03.016
 
Gutell 103.structure.2008.16.0535
Gutell 103.structure.2008.16.0535Gutell 103.structure.2008.16.0535
Gutell 103.structure.2008.16.0535
 
Gutell 102.bioinformatics.2007.23.3289
Gutell 102.bioinformatics.2007.23.3289Gutell 102.bioinformatics.2007.23.3289
Gutell 102.bioinformatics.2007.23.3289
 
Gutell 101.physica.a.2007.386.0564.good
Gutell 101.physica.a.2007.386.0564.goodGutell 101.physica.a.2007.386.0564.good
Gutell 101.physica.a.2007.386.0564.good
 
Gutell 100.imb.2006.15.533
Gutell 100.imb.2006.15.533Gutell 100.imb.2006.15.533
Gutell 100.imb.2006.15.533
 
Gutell 099.nature.2006.443.0931
Gutell 099.nature.2006.443.0931Gutell 099.nature.2006.443.0931
Gutell 099.nature.2006.443.0931
 
Gutell 098.jmb.2006.360.0978
Gutell 098.jmb.2006.360.0978Gutell 098.jmb.2006.360.0978
Gutell 098.jmb.2006.360.0978
 
Gutell 097.jphy.2006.42.0655
Gutell 097.jphy.2006.42.0655Gutell 097.jphy.2006.42.0655
Gutell 097.jphy.2006.42.0655
 

Dernier

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Dernier (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Gutell 117.rcad_e_science_stockholm_pp15-22

  • 1. rCAD: A Novel Database Schema for the Comparative Analysis of RNA Stuart Ozer Kishore J. Doshi Weijia Xu Robin R. Gutell Microsoft Corporation Center for Computational Biology and Bioinformatics Texas Advanced Computing Center Institute for Cellular and Molecular Biology 1 Microsoft Way The University of Texas at Austin Redmond, WA 98052 Austin, Texas 78712 stuarto@microsoft.com kjdoshi@gmail.com xwj@tacc.utexas.edu robin.gutell@mail.utexas.edu Abstract-- Beyond its direct involvement in protein synthesis with mRNA, tRNA, and rRNA, RNA is now being appreciated for its significance in the overall metabolism and regulation of the cell. Comparative analysis has been very effective in the identification and characterization of RNA molecules, including the accurate prediction of their secondary structure. We are developing an integrative scalable data management and analysis system, the RNA Comparative Analysis Database (rCAD), implemented with SQL Server to support RNA comparative analysis. The platform- agnostic database schema of rCAD captures the essential relationships between the different dimensions of information for RNA comparative analysis datasets. The rCAD implementation enables a variety of comparative analysis manipulations with multiple integrated data dimensions for advanced RNA comparative analysis workflows. In this paper, we describe details of the rCAD schema design and illustrate its usefulness with two usage scenarios. Keywords: Biological Database; RNA Sequence Analysis; Bioinformatics; Database Schema I. INTRODUCTION A new perspective is now emerging in Biology: RNAs have a dominant role in the structure, function and regulation of the cell. Like DNA, it has a well-defined set of rules for nucleotide base pairing – A pairs with U, and G pairs with C. These consecutive and antiparallel base pairs form canonical helices. Like protein, RNA is capable of forming a three-dimensional structure composed of helices, hairpin, internal, and multi-stem loops, and other structural motifs, and like proteins, RNA is capable of catalyzing chemical reactions [1-7]. It is now widely appreciated that RNA, as a precursor to DNA and proteins, was essential to the origin of life and to establish the mechanism for the association of a cell’s genotype to its phenotype [8-11]. Comparative analysis, used effectively by Darwin to compare and contrast anatomical features of animals [12], has the potential to facilitate the identification and characterization of the RNAs primary and higher-order structure, and patterns of variation and conservation for the set of analyzed sequences [13]. The utilization of comparative analysis is based on a very important discovery in molecular biology - the same RNA secondary and tertiary structure can have different RNA sequences [14, 15]. The predicted secondary structure models are generally conserved for each of the specific RNAs. The accuracy from these comparative studies is impressive. Approximately 98% of the base pairs identified with comparative analysis in the three ribosomal RNAs – 16S, 23S, and 5S that contain more than 4,500 nucleotides are in the high resolution crystal structures [16]. In addition to the prediction of an entire RNA structure model, comparative analysis has identified RNA structural motifs, the basic building blocks of RNA structure and biases in the distribution of nucleotides in the ribosomal RNAs secondary structure. These include: unpaired adenosines in the rRNA secondary structures [17, 18], preponderance of tetraloops - hairpin loops with four nucleotides [19] and other types of irregular structural elements in the ribosomal RNA [20]. Successful application of comparative analysis to an RNA dataset requires an interactive workflow that includes the acquisition, management and analysis of large amounts of biological information divided into multiple dimensions: 1) sequences and the alignments of those sequences based on a common structure model, 2) evolutionary relationships between sequences and 3) higher-order structure and structural motifs. The iterative nature of the comparative analysis workflow ultimately improves the predicted structure model and the quality of the sequence alignment. However, the comparative analysis workflow does not lend itself to the software pipeline architecture currently favored by most computational biology and bioinformatics applications [21, 22]. The pipeline architecture assumes that relevant data (e.g., sequence alignment) is primarily stored in flat-files. Different programs load the data from flat-files into memory, perform analysis and output the results to a different flat-file. The pipeline is created by chaining different programs together. Raw data enters at one end of the pipeline and the value-added analyses exit at the other end in an automated fashion. The comparative analysis workflow requires a semi- automated iterative approach. An important part of the comparative analysis is the interaction between biologists and the data. For example, as more sequences become available, the existing multiple sequence alignment are updated and curated to improve the comparative model. Similarly, while comparative analysis leads to new discoveries about structure and function of RNA sequences, 2011 Seventh IEEE International Conference on eScience 978-0-7695-4597-4/11 $26.00 © 2011 IEEE DOI 10.1109/eScience.2011.11 15
  • 2. th c a u c r a o s in th p n b d u d f e in e im a p to d s m hese findings comparative m analysis workf using non-inter We propose comparative an rapid increase i and allows a bi of RNA datase structure and ev nfrastructure i he RNA Com persists both th natural form a between entitie database mana using a scalab database system features of the evolving data, ntegration of m The rCAD entities for an mplements a alignment ont phylogenetic) i o facilitate an database has ad system elimin minimizes cu subsequently model. Ther flow cannot be ractive, in-mem Figure 1 Overvie a different in nalysis workflo in size and num iologist to perf ets across the d volutionary rel is an efficient mparative Anal he data entities and the biolo es. The schema agement system ble, high perf m, Microsoft e rCAD schem , such as ch multiple dimen schema enabl RNA dataset primary obje tology: the in nformation wi nalysis and ex dvantages over ates the trans stomized I/O need to be int refore, the f e supported b mory pipeline a ew of rCAD system nfrastructure to ow that include mber of suitab form manipula different dimen lationships). A data storage l lysis Database within each d ogically releva a is applicable f m and curren formance ente SQL Server 2 ma are to eff hanges in alig sions of inform es direct stor in rCAD. Th ective of the ntegration of th an RNA seq xploration [23 the use of flat slation of da and memo tegrated into t full comparati y software too architectures. m o implement t es support for t ble RNA datas tion and analy nsions (sequenc At the core of o layer (Figure e (rCAD), whi imension in th ant relationshi for any relation tly implement erprise relation 2008. Two nov fficiently supp gnment and t mation. rage of the da he rCAD syste RNA structu structural (a quence alignme ]. A centraliz -files. The rCA ta formats, a ry manageme the ive ols the the ets ysis ce, our 1), ich heir ips nal ted nal vel ort the ata em ure and ent zed AD and ent routine data. In two exa The analysi While r are rel informa A w RNA collecti represe models databas externa users alignme taxonom format Ther For ex contain alignme alignme probabi trained updates searchi browse feature analysi sample project 16S rRN WAT publicly analysi include exampl read fro The of RNA types o forms o their n rRNAs RNA alignme RNA maintai various sequenc Chad sequenc es. Thus it is m n this paper, w amples of the b II. R rCAD project is within a ce rCAD is a uni lated to the ation. well-known dat family databa ion of RNA se ented by multip s. Rfam is pri se is frequent al data sources to browse, s ents and co my or keyword for further ana re are other pro xample the R ns bacterial ents and som ents maintaine ilistic model s from a set s on RDP intro ing RNA sequ er and a taxono of RDP is is of bacterial r es. Another sim which enable RNA sequences TERS is a w y available 16 is pipeline [22 e a dedicated le, the alignm om external so rCAD project A sequences an of rRNAs – 5S of life: bacteri nuclear, chlor s) as well as o sequences are ent data is cur Web (CRW) ins other dim s analysis fe ces and their a do is a relati ces published more efficient we describe deta benefits obtain RELATED WO t integrates da entralized rela ique applicatio management ta source for R ase (Rfam) [2 equence famili ple sequence al imarily a data ly updated an s. The web in search and r rresponding c ds. Users can d alysis. ojects focused Ribosomal D and archaeal me analysis o ed in the RDP p system. The pr of representa oduced new fe uence informat omy visualizat its pyroseque rRNA compos milar web ser e users to align s [21]. workflow tool 6S rDNA ana 2]. It doesn’t d data manag ent has to be ources into mem maintains a c nd multiple seq S, 16S, and 23S ia, archaea, an roplast, and other dimensio e retrieved d rated and avai Site[13]. Th mensions of in eatures in ad alignments. ional database d in recent for analyzing ails of rCAD s ned with rCAD ORKS ata curation, a ational databas on, several oth and analysis RNA informa 24]. Rfam m ies. Each RNA lignments and a provider ser nd curated fro nterface of Rfa retrieve RNA covariation m download data d on specific R Database Proje small subu of this data[2 project are alig robability para ative sequence eatures for bro tion, including tion tool. A ne encing pipelin ition from env rvice is the G n and analyze that bundles alysis tools to address data c gement compo computed on mory for proce comprehensive quence alignm S, from the thr nd eukaryotes mitochondrial ons of inform daily from N lable at the Co he rCAD sch nformation and ddition to sto e schema for literature [26 large scale schema and . access and se system. her projects of RNA tion is the maintains a A family is covariance rvice. This om various am enables sequence models by a in flat file RNA types. ect (RDP) unit rRNA 25]. The gned with a ameters are es. Recent owsing and a genome ew analytic ne for the vironmental GreenGenes e their own a suite of o construct curation or onent. For the fly or ssing. e collection ments for all ree primary (including l encoded mation. The NCBI. The omparative hema also d supports oring raw biological 6]. Chado 16
  • 3. m o a p e m c s a a s o c a H s p m s w C r d e n to r e c in ( r in o f to a s b w R m o c e b b c a b c manages biolo organisms with associated with protein product existing ontolo model organism conform to this support genom analysis. Align are typically s sequence alignm of the alignmen comparative fe analysis progra However, it is sequence alignm provide an management a schema used in within the rela Chado. III. SCH The design requirements: 1 dataset (seque evolutionary re natural relation o the biologi repository sup execution of comparative an Our RNA da nto three inter (metadata and n relationships. nformation is ontology [23]. for different co o be executed avoiding the n separate analy between the da with the biolog RNA sequence matrices) as p ontology [23]. column of the entire alignmen by the databas base pairs, heli columns of the alignment are m be devised to containing sequ ogical knowle h information th h genome sequ ts encoded by ogy and interop m databases or s schema. The me analysis ins ments and com stored as pairs ment may incl nt. Such schem features and e ams (e.g. Blast) less efficient ments at large integrative p ccess and ana n rCAD is to fa tional database EMA DESIGN A n of rCAD 1) persist the nces and seq elationships and nships; 2) effic ical data, and pporting the computation nalysis. ataset for comp rrelated dimens nucleotides), st The organiza congruent wit The unique ca omparative ana d on very lar need to expo ysis applicatio ata entities pers gical relations e alignments a proposed by t SQL queries e sequence al nt into memory se. RNA secon ices, etc.) are e sequence ali mapped onto th retrieve specif uences from sp edge for a w hat can be dire uences or the pr a genome. Ch perability betw r any biologic e primary focu stead of comp mparative featu s. For exampl ude hits and h ma is a generic enables results ) to be stored i for storing ev scale. The rCA platform for alysis. Therefo acilitate comm e system and AND IMPLEMEN is motivate different elem quence alignm d alignments) t iently support d 3) provide manipulation nal algorithm parative analysi sions of inform tructure (2-D) ation of our th the proposed apabilities of S alysis algorith rge RNA data ort large amou ons (Figure 1 sisted in rCAD ships and are are stored as 2 the RNA stru s can be used ignment, with y, using the in ndary structure directly relate gnment. The s he Tree of Life fic fragments o pecific taxonom wide variety ectly or indirec rimary RNA a hado is based ween open acce al databases th us of Chado is arative sequen ures of sequenc le, features of high-scoring pa c way for stori s from extern in a unified wa volving multip AD is designed sequence da ore, the databa mon analysis tas is different fro NTATIONS ed by seve ments of an RN ments, structur that mimics th frequent updat a central da n and efficie ms involved is is decompos mation: sequen and evolutiona dimensions d RNA structu QL Server allo ms and analys asets in-proce unts of data ). Relationshi D are coordinat queried direct 2-D grids (spar ucture alignme d to access a hout loading t ndexing provid e elements (e. d to the releva sequences in t e, and queries c of the alignme my subsets. of ctly and on ess hat to nce ces f a airs ing nal ay. ple d to ata ase sks om eral NA res, heir tes ata ent in sed nce ary of ure ow ses ess, to ips ted tly. rse ent any the ded g., ant the can ent The compar RNA d Alignm and Str A. Sequ Meta Sequen metada Genban Sequen 16S rR organel Mitoch design instanc Genban Figure 2 rC database sc rtments extract dataset used ment, Sequence ructure Relation uence Metadata Figure 3 adata describin nce Metadata ata include nk Accession nceAccession RNA) stored llar location hondrion) stor of the Seque ce to include re nk. CAD database sch chema is de ted from three for Comparat e Metadata, Ev nships (Figure ta Compartmen Schema for seque ng RNA sequ compartment external data n and revis table, the sequ in Sequenc within the red in CellL enceAccession evisions of a s hema overview. e-normalized e major dimens tive Analysis: volutionary Re e-2). nt ence metadata. uences is stor t (Figure 3). abase identifi sions) stored uence classific ceType table cell (e.g., N ocationInfo t table allows sequence intro into four sions of an Sequence elationships red in the Types of fiers (e.g., d in the ation (e.g., and the Nucleus or table. The the rCAD oduced into 17
  • 4. B in a T in P T T to E n w ta s ( d C B. Evolutionary Figu Evolutionary n the Evolutio and are obtaine The evolutiona n the Taxonom ParentTaxID). Taxonomy, TaxonomyNam o link betw Evolutionary R name of each while all other able for refere stores the full l (e.g., root/cellu depth-based qu C. Sequence Al F y Relationships ure 4 Schema for ev y relationships onary Relation ed from the N ary relationship my table as a s The TaxID fi Tax mesOrdered ta ween the Se Relationships taxon is store r names are s ences. The Ta lineage to any ular organisms/ ueries of the ev lignment Comp igure 5 Schema fo s Compartmen volutionary relatio (taxonomy) in nships compart NCBI Taxonom ps among sequ et of parent-ch ield is the prim onomyNames ables and acts equence Met compartments ed in Taxono stored in the A axonomyName leaf in the Lin /Bacteria) in o olutionary rela partment or sequence alignm t onships rCAD are stor tment (Figure my database [2 uences are stor hild pairs (TaxI mary key for t s a as a foreign k adata and t s. The scienti omyNames tab AlternateNam esOrdered tab neageName fie order to facilita ationships. ment red 4), 7]. red ID, the and key the ific ble mes ble eld ate All Sequen more l for a sp dimens juxtapo The jux remova the alig Unlike sequenc alignme Each unique table ( stored i of the A nucleot an align nucleot Physica specific alignme alignme Alignm Table 1 5S rRNA 16S rRNA 23S rRNA This alignme updates First, values sequenc increas The in insertio Tree of in the nucleot For exa rRNA) [13], th (Table sequenc gaps. T gaps, biological seq nce Alignment ogical sequen pecific RNA m sional matrix. A osed with one a xtaposition is a al of gaps. Thu gnment matrix in other schem ce level, rCA ent at nucleotid h sequence alig key, the AlnID (Figure 5). M in the Alignm AlnID to SeqID tide in a sequen nment as a row tide is alColumnNum c nucleotide’s ent is modifi ent are iden mentColumn ta : Topology of Rib Sequences 5355 5762 670 particular sc ent has two a s when changin , in this schem as they are inf ces and observ ses, the numbe ncreasing sequ ons and deletio f Life. The res alignment c tides for any se ample, for the sequence alig he average per 1). Thus, fo ce alignment, Therefore our results in a s quences in rC t compartment nce alignments molecule type c All sequences another to iden accomplished us, the contents x is either a n ma where alig AD directly st de level. gnment stored D, and is catal embers of a mentSequence D. The Alignm nce aligned to w record. In Al associated ber which doe relative posit fied. The colu ntified and able. bosomal RNA sequ Tree of Life Total Alignment Columns M Se Le 404 9655 3 17047 5 chema design advantages, sp ng the alignme ma, there is no ferred at query ved sequence v er of gaps also ence variation ons observed in sult is that the can greatly e equence within e small subuni gnment spannin row ratio of ga or any given on average 85 database sche significant spa CAD are stor t, organized in s. A sequence an be describe within the alig ntify equivalent through the ad s of any individ nucleotide or a gnments are sto tores multiple d in rCAD is logued in the A sequence alig table through mentData table a specific colu lignmentData to a es not change tion within the umns of any managed thr uence alignments s Max/Min equence ength Avg. Ratio 242/14 73% 316/506 85% 317/880 85% n for storing ace saving an ent data. o need to stor y time. As the variation in an o increases sig n is partially a n specific bran total number o exceed the n n the alignment it Ribosomal R ng the entire T aps to nucleoti row of the 1 5% of the colu ema, which do ace reduction red in the nto one or alignment d as a two- gnment are t positions. ddition and dual cell in a gap (‘-‘). ored at the e sequence assigned a Alignment gnment are a mapping maps each umn within table, each specific unless that e sequence y sequence rough the spanning the Gap o (%) % ± 7% % ± 2% % ± 5% sequence nd efficient re any gap number of n alignment gnificantly. a result of ches of the of columns number of t (Table 1). RNA (16S Tree of Life ides is 85% 16S rRNA umns have oesn’t store for storing 18
  • 5. la a s L s o in th d q o a c a r F m o o P A c m d a o th a v s arge alignmen and support v sequence alignm Figur Secondly, th LogicalColumn supports effici operations on ndirection betw he logical view datasets are a quickly becom one entry per alignment in columnNumber alignment, suc require updatin For example, modified by co only adds colum of where the c PhysicalColum AlignmentCol change, and n modified (Figu data structure alignments, suc operations, requ he AlignmentD To simplify alignment sto vAlignmentGri sequence align nts and permit very large (> ments. re 6 an example of he mapping o nNumber in t ient (from a a sequence ali ween the physi w of the sequ added to rCA es the largest t nucleotide fo n an rCAD r mapping, any ch as insertin ng a significan when the seq olumn insertion mns to the “en olumn is logic mnNumber to L umn table is u no rows in th ure 6). Without e, global ope ch as column i uiring a signifi Data table. queries that red in rCAD id and vAlignm ment with or w ts the rCAD d >106 rows x f updating alignme of PhysicalCo the Alignmen database per ignment by pl ical storage in ence alignmen D, the Align table, even wit or each sequen D instance. y further updat g or deleting nt portion of th quence alignm n (Figure 6), t nd” of the align cally inserted. LogicalColumn updated to refl he Alignmen t the column i erations on insertions, wou icant number o obtain data f D, two view mentGridUnga without gaps r database to sca >104 column ent data. olumnNumber ntColumn tab rspective) glob lacing a layer the database a nt. As new RN mentData tab th a minimum nce within ea Without th tes of an existi a column w he existing row ment topology the data structu nment, regardle The mapping nNumber in t lect the topolo tData table a indirection in t large sequen uld be expensi of row updates from a sequen ws are create apped that retu respectively. T ale ns) to ble bal of and NA ble of ach his ing will ws. is ure ess of the ogy are the nce ive to nce ed: urn The ability sequenc advanc structur 7 depic query f vAlignm sequenc among the su (Figure only re recursiv used to contain alignme (Figure Figu D. Stru to retrieve spe ce alignment th ced applicatio ral statistics an cts the power o for retrieving a mentGrid -- l ce alignment sequences wi ubset are iden e 7a). The exa etrieve rows ve query (SQL o identify all n Bacterial seq ent is then ret e 7c) for the row ure 7 An exemplar ucture Relation Figure 8 S ecific sub-grid hrough SQL qu ons of the r nd evolutionar of the rCAD sy a subset of a s everaging the and the ev ithin the align ntified by ev ample query in that contain L Server comm rows in the s quences (Figure trieved on a c ws that contain ar SQL query for re nships Compart chema of Structur ds (rows x colu ueries drives m rCAD system ry event counti ystem through a equence alignm integration be volutionary re nment. First, th volutionary re n Figure 7 is d Bacterial seq mon table exp sequence align e 7b). The su column by col n Bacterial seq etrieving partial al tment re Relationships umns) of a many of the m such as ing. Figure an example ment using etween the elationships he rows of elationships designed to quences. A pression) is nment that ubset of the lumn basis quences. lignment 19
  • 6. s s b F c lo o k s F T s S e S H tw s lo ( s e s d h lo p S 4 ta E 4 in 3 4 The Structu stores seconda structure for a base pairs betw From the set of can be inferred oops (un-paire or multi-stem. The Seconda known or pred sequence alignm FivePrimeElem ThreePrimeEle secondary str SecondaryStru elements depe SecondaryStru Helices and in wo extents for split into n exte oop. An exten (index), Extent structural elem extent of the str Figure 9 E The mappin structure into t depicted in Fi helices (5’ 1-4 oop (5’ 5-9, 3 pairs from SecondaryStru 4, 3’ 32-35) ha able (Exten ExtentEndIndex 4, 1) and the nternal loop (E 3’ half, but the 4). The extent ural Relationsh ary structure r given sequenc ween different p f base pairs, ot d such as helice ed nucleotides) aryStructureB dicted base pai ment. Each bas mentSequenceIn ementSequence ructural elem uctureExtents ending on ext uctureExtentT ternal loop str r their 5’ and 3 ents depending nt is defined b tStartIndex an ent is given an ructural elemen Example of mappin ng of a simpl the Structural igure 9. The , 3’ 32-35 & 5 ’ 26-31) and a the two he uctureBasePai s two entries in ntID, Exten x, ExtentTypeID 3’ half is (1, ExtentID 2) als e hairpin loop model enable hips compartm relationships. ce, at a minim positions of an ther structural es (consecutive ) classified as BasePairs table rs for any RN se pair is store ndex eIndex. The m ents are ide s table and spl tent type enu Types table. ructural elemen 3’ halves. Mul g on the numbe by its first and nd ExtentEndIn n identifier, Ex nt has its own E ng RNA structure i le stem-loop Relationships stem-loop str 5’ 10-18, 3’ 20 a hairpin loop ( elices are e irs table. The n SecondaryS ntOrdinal, E ID) where the 5 2, 32, 35, 1) so has two exte has only one es SQL querie ment (Figure RNA seconda mum, is the set n RNA sequenc elements/exten e base pairs) a hairpin, intern e holds the set NA sequence in d with two fiel a more complicat entified in t lit into their su umerated in t For examp nts are split in ti-stem loops a er of stems in t d last nucleoti ndex. The ent xtentID, and ea ExtentOrdinal into database RNA seconda compartment ructure has tw 0-25) an intern (16-19). All ba entered in t first helix (5’ tructureExten ExtentStartInde 5’ half is (1, 1, (Figure 9). T ents for its 5’ a extent (Extent s to characteri 8) ary of ce. nts and nal of n a lds and ted the ub- the ple, nto are the ide tire ach . ary is wo nal ase the 1- nts ex, , 1, The and tID ize the patt structur Algo from rC other program already exampl present countin A. Struc Here frequen observe model include order s terns of sequen ral motif, acros IV. orithms that op CAD streamlin dimensions o ms for rCAD y organized an les of applica ted below: stru ng. Figure 10 Fin ctural Statistic e, we present ncies. The fre ed for a base p has many app es the predictio structure with c nce variation fo ss the phylogen USE CASE EXA perate on sequ ning and mana of data. Thu is not encum nd cross-index ations utilizing uctural statistic nding base pair fre cs a common ta equency of di pair in a refer plications in co on of base pair covariation an for different ele netic tree. AMPLES uence alignme aging their acc us the develo mbered with d xed. Two rep g the rCAD s cs and evolutio equency example. ask of finding ifferent base rence secondar omparative ana rings in an RN alysis and the ements of a nts benefit cess to the opment of data that is presentative system are onary event g base pair pair types ry structure alysis. This NAs higher- evaluation 20
  • 7. o e p c f s s c S s d 1 q ( p p a B p th m d s th p n th 1 p c to H p id a th c of the alignmen existing sequen The selecte projected acros column ordinal from a referenc sequence alignm selected using column ordinal SQL query is sequence alignm determine frequ 10 bottom). Different stru queries or co (http://www.rna presentations o potentials for application of o B. Evolutionary Figur Positions in patterns of vari he RNAs high methods for id determine the f sequences in tw hese methods prediction of a not explicitly d he evolution o 1983 that the pair will enha covariation ana o identify the However the phylogenetic tr dentified manu and quantify th hat time. Atte computational nt accuracy fo nce alignment [ d base pair, ss the sequenc ls associated w ce row (Figure ment included evolutionary ls for the 5’ a defined to r ment, and a gr uencies of obs uctural statistic mpiled applic a.ccbb.utexas.e on structural s RNA foldin our structural s y Event Counti re 11 Evolutionary a sequence iation (covariat her-order struc dentifying the frequency of ea wo columns of have been v an RNAs highe determine the of the RNA. It evolutionary h ance the acc alysis [31]. We first tertiary number of ree (called evo ually and thus hese phylogene empts to iden statistical alg or any new seq [13, 28]. labeled (1) in e alignment by with the 5’ an e 10 middle). in the frequen relationships. and 3’ half of retrieve nucle rouping statem served base pa cs have been de cations. Visit edu/SAE/2D) statistics. Com ng programs tatistics[29, 30 ing y Event Counting e alignment th tion) are usual cture [17, 31] ese positions w ach base pair t f the alignment very effective er-order structu number of cov has been know history for eac uracy and re e utilized the p interaction in covariations olutionary even a rigorous att etic events wa ntify phylogen gorithms of t quence within n Figure 10, y identifying t nd 3’ nucleotid The rows of t ncy tabulation a Using the tw f the base pair otides from t ment is applied air types (Figu eveloped as SQ the CRW S for mo mputing statistic is one speci 0]. example hat have simi lly base paired . The tradition with covariati type for all of t t [32, 33]. Wh in the accura ure [16], they variations duri wn since at le ch putative ba esolution of t phylogenetic tr the rRNA [3 based on t nt counting) w tempt to ident as not possible etic events w the phylogene an is the des the are wo , a the to ure QL Site ore cal ific ilar d in nal ion the hile ate do ing ast ase the ree 4]. the was ify at with etic trees im [35, 36 How precise change rCAD. populat candida 11). Th Evoluti based applica recursio nodes a version current more tr false p Xu, Oz Prese applica predica of three sequenc seconda evolutio The traditio sequenc manipu schema alignme Usin perform insertio sequenc evolutio them di an RN analysi queries betwee their oc are easi The Server algorith memor not req resourc manage system elemen mproved the s ]. wever, instead e count of the n es on the phyl Our method ted with the nu ate positions a he in-memory ionary Relation on the NC ation traverses on to deduce t and compute th n of our metho t event count rue strong and positives when zer, & Gutell, m ented here is ation of compa ated on the inte e primary dim ce alignment ary, and onary relations rCAD system onal computati ce alignments ulated in-memo a directly pe ents as sparse m ng a column ind ming global on and deletion ce alignment onary and sec irectly to the s NA sequence is algorithms c s. For examp en sequences, d ccurrence on d ily determined rCAD databas enables the hms as compil ry space of the quire significan ce allocation ement and inp places the r nts on the en sensitivity of of using stati number of chan logenetic tree builds an in ucleotides obta across the sequ y tree structur nships compar CBI Taxonom s the in-memo the nucleotide he evolutionary od was publish ting algorithm d weakly cova n compared to manuscript in p V. CONCLUSIO s a new dat arative analysi egration and sim mensions of inf ts, associatio three-dimensi ships. m is significa ional biology are primarily ory by differe ersists and in matrices. direction techn alignment op n of columns s. The rCAD condary struct sequences and e alignment. can be impleme ple, structural different RNA different parts d with rCAD. se implemente development led application e database eng nt amounts of c n, parallel nput/output op responsibility nterprise data the covariatio istical method nges and locat can be determ n-memory tree ained from pro uence alignme re is obtained rtment of rCAD my database ory data struc composition o y event counts. hed previously m successfully ariant positions o other method preparation). ONS tabase schem is to RNA dat multaneous ma formation - seq ons between ional structu antly different workflows w y stored in fla ent programs. T ndexes RNA nique, rCAD is perations incl efficiently for D schema st ture entities a individual nuc Comparative ented as declar statistics, re A structural ele of the phylog d on the Micr t of more co ns that execute gine. These pr custom progra processing, timization. T for managing abase engine, n methods ds, a more tions of the mined with e structure, ojecting the ent (Figure d from the D, which is [27]. The cture using of ancestor . An earlier y [37]. Our y identifies s with less ds (Shang, ma for the tasets. It is anipulation quence and primary, ure, and from the where RNA at-files and The rCAD sequence capable of luding the very large tores both and relates cleotides in sequence rative SQL elationships ements and genetic tree rosoft SQL omplicated within the rograms do amming for memory The rCAD g the data which is 21
  • 8. optimized for manipulating large quantities of information, and is scalable to support extremely large RNA datasets. Readers who are interested in RNA sequence data can visit CRW Site (http://www.rna.ccbb.utexas.edu/) which uses rCAD for data management. Readers interested in building customized database can visit http://rcat.codeplex.com/ for available codes and utilities. ACKNOWLEDGEMENTS This project has been funded by an External Research grant from Microsoft Research, National Institutes of Health (GM067317 and GM085337) and the Welch Foundation (#1427). REFERENCES [1] H. F. Noller and J. B. Chaires, "Functional modification of 16S ribosomal RNA by kethoxal," Proc. Natl. Acad. Sci. USA, vol. 69, pp. 3115-8, Nov 1972. [2] K. Kruger, P. J. Grabowski, A. J. Zaug, J. Sands, D. E. Gottschling, and T. R. Cech, "Self-splicing RNA: autoexcision and autocyclization of the ribosomal RNA intervening sequence of Tetrahymena," Cell, vol. 31, pp. 147-57, Nov 1982. [3] C. Guerrier-Takada, K. Gardiner, T. Marsh, N. Pace, and S. Altman, "The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme," Cell, vol. 35, pp. 849-57, Dec 1983. [4] H. F. Noller, V. Hoffarth, and L. Zimniak, "Unusual resistance of peptidyl transferase to protein extraction procedures," Science, vol. 256, pp. 1416-9, Jun 5 1992. [5] P. Nissen, J. Hansen, N. Ban, P. B. Moore, and T. A. Steitz, "The structural basis of ribosome activity in peptide bond synthesis," Science, vol. 289, pp. 920-30, Aug 11 2000. [6] B. T. Wimberly, D. E. Brodersen, W. M. Clemons, Jr., R. J. Morgan- Warren, A. P. Carter, C. Vonrhein, T. Hartsch, and V. Ramakrishnan, "Structure of the 30S ribosomal subunit," Nature, vol. 407, pp. 327- 39, Sep 21 2000. [7] M. M. Yusupov, G. Z. Yusupova, A. Baucom, K. Lieberman, T. N. Earnest, J. H. Cate, and H. F. Noller, "Crystal structure of the ribosome at 5.5 A resolution," Science, vol. 292, pp. 883-96, May 4 2001. [8] C. R. Woese and G. E. Fox, "The concept of cellular evolution," J. Mol. Evol., vol. 10, pp. 1-6, Sep 20 1977. [9] C. R. Woese, The genetic code; the molecular basis for genetic expression. New York,: Harper & Row, 1967. [10] L. E. Orgel, "Evolution of the genetic apparatus," J. Mol. Biol., vol. 38, pp. 381-93, Dec 1968. [11] F. H. Crick, "The origin of the genetic code," J. Mol. Biol., vol. 38, pp. 367-79, Dec 1968. [12] C. Darwin, The origin of species. New York, NY: Barnes & Noble Classics, 2008. [13] J. J. Cannone, S. Subramanian, M. N. Schnare, J. R. Collett, L. M. D'Souza, Y. Du, B. Feng, N. Lin, L. V. Madabusi, K. M. Muller, N. Pande, Z. Shang, N. Yu, and R. R. Gutell, "The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs," BMC Bioinformatics, vol. 3, p. 2, 2002. [14] R. W. Holley, J. Apgar, G. A. Everett, J. T. Madison, M. Marquisee, S. H. Merrill, J. R. Penswick, and A. Zamir, "Structure of a ribonucleic acid," Science, vol. 147, pp. 1462-5, Mar 19 1965. [15] G. E. Fox and C. R. Woese, "5S RNA secondary structure," Nature, vol. 256, pp. 505-7, Aug 7 1975. [16] R. R. Gutell, J. C. Lee, and J. J. Cannone, "The accuracy of ribosomal RNA comparative structure models," Curr. Opin. Struct. Biol., vol. 12, pp. 301-10, Jun 2002. [17] R. R. Gutell, B. Weiser, C. R. Woese, and H. F. Noller, "Comparative anatomy of 16-S-like ribosomal RNA," Prog. Nucleic Acid Res. Mol. Biol., vol. 32, pp. 155-216, 1985. [18] R. R. Gutell, J. J. Cannone, Z. Shang, Y. Du, and M. J. Serra, "A story: unpaired adenosine bases in ribosomal RNAs," J. Mol. Biol., vol. 304, pp. 335-54, Dec 1 2000. [19] C. R. Woese, S. Winker, and R. R. Gutell, "Architecture of ribosomal RNA: constraints on the sequence of "tetra-loops"," Proc. Natl. Acad. Sci. USA, vol. 87, pp. 8467-71, Nov 1990. [20] R. R. Gutell, N. Larsen, and C. R. Woese, "Lessons from an evolving rRNA: 16S and 23S rRNA structures from a comparative perspective," Microbiol. Rev., vol. 58, pp. 10-26, Mar 1994. [21] T. Z. DeSantis, P. Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie, K. Keller, T. Huber, D. Dalevi, P. Hu, and G. L. Andersen, "Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB," Appl. Environ. Microbiol., vol. 72, pp. 5069- 72, Jul 2006. [22] A. L. Hartman, S. Riddle, T. McPhillips, B. Ludascher, and J. A. Eisen, "Introducing W.A.T.E.R.S.: a workflow for the alignment, taxonomy, and ecology of ribosomal sequences," BMC Bioinformatics, vol. 11, p. 317, 2010. [23] J. W. Brown, A. Birmingham, P. E. Griffiths, F. Jossinet, R. Kachouri-Lafond, R. Knight, B. F. Lang, N. Leontis, G. Steger, J. Stombaugh, and E. Westhof, "The RNA structure alignment ontology," RNA, vol. 15, pp. 1623-31, Sep 2009. [24] P. P. Gardner, J. Daub, J. G. Tate, E. P. Nawrocki, D. L. Kolbe, S. Lindgreen, A. C. Wilkinson, R. D. Finn, S. Griffiths-Jones, S. R. Eddy, and A. Bateman, "Rfam: updates to the RNA families database," Nucleic Acids Res., vol. 37, pp. D136-40, Jan 2009. [25] J. R. Cole, Q. Wang, E. Cardenas, J. Fish, B. Chai, R. J. Farris, A. S. Kulam-Syed-Mohideen, D. M. McGarrell, T. Marsh, G. M. Garrity, and J. M. Tiedje, "The Ribosomal Database Project: improved alignments and new tools for rRNA analysis," Nucleic Acids Res., vol. 37, pp. D141-5, Jan 2009. [26] C. J. Mungall and D. B. Emmert, "A Chado case study: an ontology- based modular schema for representing genome-associated biological information," Bioinformatics, vol. 23, pp. i337-46, Jul 1 2007. [27] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers, "GenBank," Nucleic Acids Res., vol. 37, pp. D26-31, Jan 2009. [28] K. J. Doshi, "Ph.D. dissertation. The University of Texas at Austin.," 2007. [29] J. C. Wu, D. P. Gardner, S. Ozer, R. R. Gutell, and P. Ren, "Correlation of RNA secondary structure statistics with thermodynamic stability and applications to folding," J. Mol. Biol., vol. 391, pp. 769-83, Aug 28 2009. [30] D. P. Gardner, P. Ren, S. Ozer, and R. R. Gutell, "Statistical Potentials for Hairpin and Internal Loops Improve the Accuracy of the Predicted RNA Structure," Journal of Molecular Biology, vol. in press, 2011. [31] C. R. Woese, R. Gutell, R. Gupta, and H. F. Noller, "Detailed analysis of the higher-order structure of 16S-like ribosomal ribonucleic acids," Microbiol. Rev., vol. 47, pp. 621-69, Dec 1983. [32] D. K. Chiu and T. Kolodziejczak, "Inferring consensus structure from nucleic acid sequences," Comput. Appl. Biosci., vol. 7, pp. 347-52, Jul 1991. [33] R. R. Gutell, A. Power, G. Z. Hertz, E. J. Putz, and G. D. Stormo, "Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods," Nucleic Acids Res., vol. 20, pp. 5785-95, Nov 11 1992. [34] R. R. Gutell, H. F. Noller, and C. R. Woese, "Higher order structure in ribosomal RNA," EMBO J., vol. 5, pp. 1111-3, May 1986. [35] J. Dutheil, T. Pupko, A. Jean-Marie, and N. Galtier, "A model-based approach for detecting coevolving positions in a molecule," Mol. Biol. Evol., vol. 22, pp. 1919-28, Sep 2005. [36] C. H. Yeang, J. F. Darot, H. F. Noller, and D. Haussler, "Detecting the coevolution of biosequences--an example of RNA interaction prediction," Mol. Biol. Evol., vol. 24, pp. 2119-31, Sep 2007. [37] W. Xu, S. Ozer, and R. R. Gutell, "Covariant Evolutionary Event Analysis for Base Interaction Prediction Using a Relational Database Management System for RNA " Lecture Notes in Computer Science, vol. 5566, pp. 200-216, May 2009. 22