Gutell 117.rcad_e_science_stockholm_pp15-22

rCAD: A Novel Database Schema for the Comparative Analysis of RNA
Stuart Ozer Kishore J. Doshi Weijia Xu Robin R. Gutell
Microsoft Corporation
Center for Computational
Biology and Bioinformatics
Texas Advanced
Computing Center
Institute for Cellular and
Molecular Biology
1 Microsoft Way The University of Texas at Austin
Redmond, WA 98052 Austin, Texas 78712
stuarto@microsoft.com kjdoshi@gmail.com xwj@tacc.utexas.edu robin.gutell@mail.utexas.edu
Abstract-- Beyond its direct involvement in protein
synthesis with mRNA, tRNA, and rRNA, RNA is now
being appreciated for its significance in the overall
metabolism and regulation of the cell. Comparative
analysis has been very effective in the identification and
characterization of RNA molecules, including the
accurate prediction of their secondary structure. We are
developing an integrative scalable data management and
analysis system, the RNA Comparative Analysis
Database (rCAD), implemented with SQL Server to
support RNA comparative analysis. The platform-
agnostic database schema of rCAD captures the essential
relationships between the different dimensions of
information for RNA comparative analysis datasets. The
rCAD implementation enables a variety of comparative
analysis manipulations with multiple integrated data
dimensions for advanced RNA comparative analysis
workflows. In this paper, we describe details of the
rCAD schema design and illustrate its usefulness with
two usage scenarios.
Keywords: Biological Database; RNA Sequence Analysis;
Bioinformatics; Database Schema
I. INTRODUCTION
A new perspective is now emerging in Biology: RNAs
have a dominant role in the structure, function and
regulation of the cell. Like DNA, it has a well-defined set of
rules for nucleotide base pairing – A pairs with U, and G
pairs with C. These consecutive and antiparallel base pairs
form canonical helices. Like protein, RNA is capable of
forming a three-dimensional structure composed of helices,
hairpin, internal, and multi-stem loops, and other structural
motifs, and like proteins, RNA is capable of catalyzing
chemical reactions [1-7]. It is now widely appreciated that
RNA, as a precursor to DNA and proteins, was essential to
the origin of life and to establish the mechanism for the
association of a cell’s genotype to its phenotype [8-11].
Comparative analysis, used effectively by Darwin to
compare and contrast anatomical features of animals [12],
has the potential to facilitate the identification and
characterization of the RNAs primary and higher-order
structure, and patterns of variation and conservation for the
set of analyzed sequences [13]. The utilization of
comparative analysis is based on a very important discovery
in molecular biology - the same RNA secondary and tertiary
structure can have different RNA sequences [14, 15]. The
predicted secondary structure models are generally
conserved for each of the specific RNAs. The accuracy from
these comparative studies is impressive. Approximately
98% of the base pairs identified with comparative analysis
in the three ribosomal RNAs – 16S, 23S, and 5S that
contain more than 4,500 nucleotides are in the high
resolution crystal structures [16]. In addition to the
prediction of an entire RNA structure model, comparative
analysis has identified RNA structural motifs, the basic
building blocks of RNA structure and biases in the
distribution of nucleotides in the ribosomal RNAs
secondary structure. These include: unpaired adenosines in
the rRNA secondary structures [17, 18], preponderance of
tetraloops - hairpin loops with four nucleotides [19] and
other types of irregular structural elements in the ribosomal
RNA [20].
Successful application of comparative analysis to an RNA
dataset requires an interactive workflow that includes the
acquisition, management and analysis of large amounts of
biological information divided into multiple dimensions: 1)
sequences and the alignments of those sequences based on a
common structure model, 2) evolutionary relationships
between sequences and 3) higher-order structure and
structural motifs. The iterative nature of the comparative
analysis workflow ultimately improves the predicted
structure model and the quality of the sequence alignment.
However, the comparative analysis workflow does not
lend itself to the software pipeline architecture currently
favored by most computational biology and bioinformatics
applications [21, 22]. The pipeline architecture assumes that
relevant data (e.g., sequence alignment) is primarily stored
in flat-files. Different programs load the data from flat-files
into memory, perform analysis and output the results to a
different flat-file. The pipeline is created by chaining
different programs together. Raw data enters at one end of
the pipeline and the value-added analyses exit at the other
end in an automated fashion.
The comparative analysis workflow requires a semi-
automated iterative approach. An important part of the
comparative analysis is the interaction between biologists
and the data. For example, as more sequences become
available, the existing multiple sequence alignment are
updated and curated to improve the comparative model.
Similarly, while comparative analysis leads to new
discoveries about structure and function of RNA sequences,
2011 Seventh IEEE International Conference on eScience
978-0-7695-4597-4/11 $26.00 © 2011 IEEE
DOI 10.1109/eScience.2011.11
15

th
c
a
u
c
r
a
o
s
in
th
p
n
b
d
u
d
f
e
in
e
im
a
p
to
d
s
m
hese findings
comparative m
analysis workf
using non-inter
We propose
comparative an
rapid increase i
and allows a bi
of RNA datase
structure and ev
nfrastructure i
he RNA Com
persists both th
natural form a
between entitie
database mana
using a scalab
database system
features of the
evolving data,
ntegration of m
The rCAD
entities for an
mplements a
alignment ont
phylogenetic) i
o facilitate an
database has ad
system elimin
minimizes cu
subsequently
model. Ther
flow cannot be
ractive, in-mem
Figure 1 Overvie
a different in
nalysis workflo
in size and num
iologist to perf
ets across the d
volutionary rel
is an efficient
mparative Anal
he data entities
and the biolo
es. The schema
agement system
ble, high perf
m, Microsoft
e rCAD schem
, such as ch
multiple dimen
schema enabl
RNA dataset
primary obje
tology: the in
nformation wi
nalysis and ex
dvantages over
ates the trans
stomized I/O
need to be int
refore, the f
e supported b
mory pipeline a
ew of rCAD system
nfrastructure to
ow that include
mber of suitab
form manipula
different dimen
lationships). A
data storage l
lysis Database
within each d
ogically releva
a is applicable f
m and curren
formance ente
SQL Server 2
ma are to eff
hanges in alig
sions of inform
es direct stor
in rCAD. Th
ective of the
ntegration of
th an RNA seq
xploration [23
the use of flat
slation of da
and memo
tegrated into t
full comparati
y software too
architectures.
m
o implement t
es support for t
ble RNA datas
tion and analy
nsions (sequenc
At the core of o
layer (Figure
e (rCAD), whi
imension in th
ant relationshi
for any relation
tly implement
erprise relation
2008. Two nov
fficiently supp
gnment and t
mation.
rage of the da
he rCAD syste
RNA structu
structural (a
quence alignme
]. A centraliz
-files. The rCA
ta formats, a
ry manageme
the
ive
ols
the
the
ets
ysis
ce,
our
1),
ich
heir
ips
nal
ted
nal
vel
ort
the
ata
em
ure
and
ent
zed
AD
and
ent
routine
data. In
two exa
The
analysi
While r
are rel
informa
A w
RNA
collecti
represe
models
databas
externa
users
alignme
taxonom
format
Ther
For ex
contain
alignme
alignme
probabi
trained
updates
searchi
browse
feature
analysi
sample
project
16S rRN
WAT
publicly
analysi
include
exampl
read fro
The
of RNA
types o
forms o
their n
rRNAs
RNA
alignme
RNA
maintai
various
sequenc
Chad
sequenc
es. Thus it is m
n this paper, w
amples of the b
II. R
rCAD project
is within a ce
rCAD is a uni
lated to the
ation.
well-known dat
family databa
ion of RNA se
ented by multip
s. Rfam is pri
se is frequent
al data sources
to browse, s
ents and co
my or keyword
for further ana
re are other pro
xample the R
ns bacterial
ents and som
ents maintaine
ilistic model s
from a set
s on RDP intro
ing RNA sequ
er and a taxono
of RDP is
is of bacterial r
es. Another sim
which enable
RNA sequences
TERS is a w
y available 16
is pipeline [22
e a dedicated
le, the alignm
om external so
rCAD project
A sequences an
of rRNAs – 5S
of life: bacteri
nuclear, chlor
s) as well as o
sequences are
ent data is cur
Web (CRW)
ins other dim
s analysis fe
ces and their a
do is a relati
ces published
more efficient
we describe deta
benefits obtain
RELATED WO
t integrates da
entralized rela
ique applicatio
management
ta source for R
ase (Rfam) [2
equence famili
ple sequence al
imarily a data
ly updated an
s. The web in
search and r
rresponding c
ds. Users can d
alysis.
ojects focused
Ribosomal D
and archaeal
me analysis o
ed in the RDP p
system. The pr
of representa
oduced new fe
uence informat
omy visualizat
its pyroseque
rRNA compos
milar web ser
e users to align
s [21].
workflow tool
6S rDNA ana
2]. It doesn’t
d data manag
ent has to be
ources into mem
maintains a c
nd multiple seq
S, 16S, and 23S
ia, archaea, an
roplast, and
other dimensio
e retrieved d
rated and avai
Site[13]. Th
mensions of in
eatures in ad
alignments.
ional database
d in recent
for analyzing
ails of rCAD s
ned with rCAD
ORKS
ata curation, a
ational databas
on, several oth
and analysis
RNA informa
24]. Rfam m
ies. Each RNA
lignments and
a provider ser
nd curated fro
nterface of Rfa
retrieve RNA
covariation m
download data
d on specific R
Database Proje
small subu
of this data[2
project are alig
robability para
ative sequence
eatures for bro
tion, including
tion tool. A ne
encing pipelin
ition from env
rvice is the G
n and analyze
that bundles
alysis tools to
address data c
gement compo
computed on
mory for proce
comprehensive
quence alignm
S, from the thr
nd eukaryotes
mitochondrial
ons of inform
daily from N
lable at the Co
he rCAD sch
nformation and
ddition to sto
e schema for
literature [26
large scale
schema and
.
access and
se system.
her projects
of RNA
tion is the
maintains a
A family is
covariance
rvice. This
om various
am enables
sequence
models by
a in flat file
RNA types.
ect (RDP)
unit rRNA
25]. The
gned with a
ameters are
es. Recent
owsing and
a genome
ew analytic
ne for the
vironmental
GreenGenes
e their own
a suite of
o construct
curation or
onent. For
the fly or
ssing.
e collection
ments for all
ree primary
(including
l encoded
mation. The
NCBI. The
omparative
hema also
d supports
oring raw
biological
6]. Chado
16

m
o
a
p
e
m
c
s
a
a
s
o
c
a
H
s
p
m
s
w
C
r
d
e
n
to
r
e
c
in
(
r
in
o
f
to
a
s
b
w
R
m
o
c
e
b
b
c
a
b
c
manages biolo
organisms with
associated with
protein product
existing ontolo
model organism
conform to this
support genom
analysis. Align
are typically s
sequence alignm
of the alignmen
comparative fe
analysis progra
However, it is
sequence alignm
provide an
management a
schema used in
within the rela
Chado.
III. SCH
The design
requirements: 1
dataset (seque
evolutionary re
natural relation
o the biologi
repository sup
execution of
comparative an
Our RNA da
nto three inter
(metadata and n
relationships.
nformation is
ontology [23].
for different co
o be executed
avoiding the n
separate analy
between the da
with the biolog
RNA sequence
matrices) as p
ontology [23].
column of the
entire alignmen
by the databas
base pairs, heli
columns of the
alignment are m
be devised to
containing sequ
ogical knowle
h information th
h genome sequ
ts encoded by
ogy and interop
m databases or
s schema. The
me analysis ins
ments and com
stored as pairs
ment may incl
nt. Such schem
features and e
ams (e.g. Blast)
less efficient
ments at large
integrative p
ccess and ana
n rCAD is to fa
tional database
EMA DESIGN A
n of rCAD
1) persist the
nces and seq
elationships and
nships; 2) effic
ical data, and
pporting the
computation
nalysis.
ataset for comp
rrelated dimens
nucleotides), st
The organiza
congruent wit
The unique ca
omparative ana
d on very lar
need to expo
ysis applicatio
ata entities pers
gical relations
e alignments a
proposed by t
SQL queries
e sequence al
nt into memory
se. RNA secon
ices, etc.) are
e sequence ali
mapped onto th
retrieve specif
uences from sp
edge for a w
hat can be dire
uences or the pr
a genome. Ch
perability betw
r any biologic
e primary focu
stead of comp
mparative featu
s. For exampl
ude hits and h
ma is a generic
enables results
) to be stored i
for storing ev
scale. The rCA
platform for
alysis. Therefo
acilitate comm
e system and
AND IMPLEMEN
is motivate
different elem
quence alignm
d alignments) t
iently support
d 3) provide
manipulation
nal algorithm
parative analysi
sions of inform
tructure (2-D)
ation of our
th the proposed
apabilities of S
alysis algorith
rge RNA data
ort large amou
ons (Figure 1
sisted in rCAD
ships and are
are stored as 2
the RNA stru
s can be used
ignment, with
y, using the in
ndary structure
directly relate
gnment. The s
he Tree of Life
fic fragments o
pecific taxonom
wide variety
ectly or indirec
rimary RNA a
hado is based
ween open acce
al databases th
us of Chado is
arative sequen
ures of sequenc
le, features of
high-scoring pa
c way for stori
s from extern
in a unified wa
volving multip
AD is designed
sequence da
ore, the databa
mon analysis tas
is different fro
NTATIONS
ed by seve
ments of an RN
ments, structur
that mimics th
frequent updat
a central da
n and efficie
ms involved
is is decompos
mation: sequen
and evolutiona
dimensions
d RNA structu
QL Server allo
ms and analys
asets in-proce
unts of data
). Relationshi
D are coordinat
queried direct
2-D grids (spar
ucture alignme
d to access a
hout loading t
ndexing provid
e elements (e.
d to the releva
sequences in t
e, and queries c
of the alignme
my subsets.
of
ctly
and
on
ess
hat
to
nce
ces
f a
airs
ing
nal
ay.
ple
d to
ata
ase
sks
om
eral
NA
res,
heir
tes
ata
ent
in
sed
nce
ary
of
ure
ow
ses
ess,
to
ips
ted
tly.
rse
ent
any
the
ded
g.,
ant
the
can
ent
The
compar
RNA d
Alignm
and Str
A. Sequ
Meta
Sequen
metada
Genban
Sequen
16S rR
organel
Mitoch
design
instanc
Genban
Figure 2 rC
database sc
rtments extract
dataset used
ment, Sequence
ructure Relation
uence Metadata
Figure 3
adata describin
nce Metadata
ata include
nk Accession
nceAccession
RNA) stored
llar location
hondrion) stor
of the Seque
ce to include re
nk.
CAD database sch
chema is de
ted from three
for Comparat
e Metadata, Ev
nships (Figure
ta Compartmen
Schema for seque
ng RNA sequ
compartment
external data
n and revis
table, the sequ
in Sequenc
within the
red in CellL
enceAccession
evisions of a s
hema overview.
e-normalized
e major dimens
tive Analysis:
volutionary Re
e-2).
nt
ence metadata.
uences is stor
t (Figure 3).
abase identifi
sions) stored
uence classific
ceType table
cell (e.g., N
ocationInfo t
table allows
sequence intro
into four
sions of an
Sequence
elationships
red in the
Types of
fiers (e.g.,
d in the
ation (e.g.,
and the
Nucleus or
table. The
the rCAD
oduced into
17

B
in
a
T
in
P
T
T
to
E
n
w
ta
s
(
d
C
B. Evolutionary
Figu
Evolutionary
n the Evolutio
and are obtaine
The evolutiona
n the Taxonom
ParentTaxID).
Taxonomy,
TaxonomyNam
o link betw
Evolutionary R
name of each
while all other
able for refere
stores the full l
(e.g., root/cellu
depth-based qu
C. Sequence Al
F
y Relationships
ure 4 Schema for ev
y relationships
onary Relation
ed from the N
ary relationship
my table as a s
The TaxID fi
Tax
mesOrdered ta
ween the Se
Relationships
taxon is store
r names are s
ences. The Ta
lineage to any
ular organisms/
ueries of the ev
lignment Comp
igure 5 Schema fo
s Compartmen
volutionary relatio
(taxonomy) in
nships compart
NCBI Taxonom
ps among sequ
et of parent-ch
ield is the prim
onomyNames
ables and acts
equence Met
compartments
ed in Taxono
stored in the A
axonomyName
leaf in the Lin
/Bacteria) in o
olutionary rela
partment
or sequence alignm
t
onships
rCAD are stor
tment (Figure
my database [2
uences are stor
hild pairs (TaxI
mary key for t
s a
as a foreign k
adata and t
s. The scienti
omyNames tab
AlternateNam
esOrdered tab
neageName fie
order to facilita
ationships.
ment
red
4),
7].
red
ID,
the
and
key
the
ific
ble
mes
ble
eld
ate
All
Sequen
more l
for a sp
dimens
juxtapo
The jux
remova
the alig
Unlike
sequenc
alignme
Each
unique
table (
stored i
of the A
nucleot
an align
nucleot
Physica
specific
alignme
alignme
Alignm
Table 1
5S
rRNA
16S
rRNA
23S
rRNA
This
alignme
updates
First,
values
sequenc
increas
The in
insertio
Tree of
in the
nucleot
For exa
rRNA)
[13], th
(Table
sequenc
gaps. T
gaps,
biological seq
nce Alignment
ogical sequen
pecific RNA m
sional matrix. A
osed with one a
xtaposition is a
al of gaps. Thu
gnment matrix
in other schem
ce level, rCA
ent at nucleotid
h sequence alig
key, the AlnID
(Figure 5). M
in the Alignm
AlnID to SeqID
tide in a sequen
nment as a row
tide is
alColumnNum
c nucleotide’s
ent is modifi
ent are iden
mentColumn ta
: Topology of Rib
Sequences
5355
5762
670
particular sc
ent has two a
s when changin
, in this schem
as they are inf
ces and observ
ses, the numbe
ncreasing sequ
ons and deletio
f Life. The res
alignment c
tides for any se
ample, for the
sequence alig
he average per
1). Thus, fo
ce alignment,
Therefore our
results in a s
quences in rC
t compartment
nce alignments
molecule type c
All sequences
another to iden
accomplished
us, the contents
x is either a n
ma where alig
AD directly st
de level.
gnment stored
D, and is catal
embers of a
mentSequence
D. The Alignm
nce aligned to
w record. In Al
associated
ber which doe
relative posit
fied. The colu
ntified and
able.
bosomal RNA sequ
Tree of Life
Total
Alignment
Columns
M
Se
Le
404
9655 3
17047 5
chema design
advantages, sp
ng the alignme
ma, there is no
ferred at query
ved sequence v
er of gaps also
ence variation
ons observed in
sult is that the
can greatly e
equence within
e small subuni
gnment spannin
row ratio of ga
or any given
on average 85
database sche
significant spa
CAD are stor
t, organized in
s. A sequence
an be describe
within the alig
ntify equivalent
through the ad
s of any individ
nucleotide or a
gnments are sto
tores multiple
d in rCAD is
logued in the A
sequence alig
table through
mentData table
a specific colu
lignmentData
to a
es not change
tion within the
umns of any
managed thr
uence alignments s
Max/Min
equence
ength
Avg.
Ratio
242/14 73%
316/506 85%
317/880 85%
n for storing
ace saving an
ent data.
o need to stor
y time. As the
variation in an
o increases sig
n is partially a
n specific bran
total number o
exceed the n
n the alignment
it Ribosomal R
ng the entire T
aps to nucleoti
row of the 1
5% of the colu
ema, which do
ace reduction
red in the
nto one or
alignment
d as a two-
gnment are
t positions.
ddition and
dual cell in
a gap (‘-‘).
ored at the
e sequence
assigned a
Alignment
gnment are
a mapping
maps each
umn within
table, each
specific
unless that
e sequence
y sequence
rough the
spanning the
Gap
o (%)
% ± 7%
% ± 2%
% ± 5%
sequence
nd efficient
re any gap
number of
n alignment
gnificantly.
a result of
ches of the
of columns
number of
t (Table 1).
RNA (16S
Tree of Life
ides is 85%
16S rRNA
umns have
oesn’t store
for storing
18

la
a
s
L
s
o
in
th
d
q
o
a
c
a
r
F
m
o
o
P
A
c
m
d
a
o
th
a
v
s
arge alignmen
and support v
sequence alignm
Figur
Secondly, th
LogicalColumn
supports effici
operations on
ndirection betw
he logical view
datasets are a
quickly becom
one entry per
alignment in
columnNumber
alignment, suc
require updatin
For example,
modified by co
only adds colum
of where the c
PhysicalColum
AlignmentCol
change, and n
modified (Figu
data structure
alignments, suc
operations, requ
he AlignmentD
To simplify
alignment sto
vAlignmentGri
sequence align
nts and permit
very large (>
ments.
re 6 an example of
he mapping o
nNumber in t
ient (from a
a sequence ali
ween the physi
w of the sequ
added to rCA
es the largest t
nucleotide fo
n an rCAD
r mapping, any
ch as insertin
ng a significan
when the seq
olumn insertion
mns to the “en
olumn is logic
mnNumber to L
umn table is u
no rows in th
ure 6). Without
e, global ope
ch as column i
uiring a signifi
Data table.
queries that
red in rCAD
id and vAlignm
ment with or w
ts the rCAD d
>106
rows x
f updating alignme
of PhysicalCo
the Alignmen
database per
ignment by pl
ical storage in
ence alignmen
D, the Align
table, even wit
or each sequen
D instance.
y further updat
g or deleting
nt portion of th
quence alignm
n (Figure 6), t
nd” of the align
cally inserted.
LogicalColumn
updated to refl
he Alignmen
t the column i
erations on
insertions, wou
icant number o
obtain data f
D, two view
mentGridUnga
without gaps r
database to sca
>104
column
ent data.
olumnNumber
ntColumn tab
rspective) glob
lacing a layer
the database a
nt. As new RN
mentData tab
th a minimum
nce within ea
Without th
tes of an existi
a column w
he existing row
ment topology
the data structu
nment, regardle
The mapping
nNumber in t
lect the topolo
tData table a
indirection in t
large sequen
uld be expensi
of row updates
from a sequen
ws are create
apped that retu
respectively. T
ale
ns)
to
ble
bal
of
and
NA
ble
of
ach
his
ing
will
ws.
is
ure
ess
of
the
ogy
are
the
nce
ive
to
nce
ed:
urn
The
ability
sequenc
advanc
structur
7 depic
query f
vAlignm
sequenc
among
the su
(Figure
only re
recursiv
used to
contain
alignme
(Figure
Figu
D. Stru
to retrieve spe
ce alignment th
ced applicatio
ral statistics an
cts the power o
for retrieving a
mentGrid -- l
ce alignment
sequences wi
ubset are iden
e 7a). The exa
etrieve rows
ve query (SQL
o identify all
n Bacterial seq
ent is then ret
e 7c) for the row
ure 7 An exemplar
ucture Relation
Figure 8 S
ecific sub-grid
hrough SQL qu
ons of the r
nd evolutionar
of the rCAD sy
a subset of a s
everaging the
and the ev
ithin the align
ntified by ev
ample query in
that contain
L Server comm
rows in the s
quences (Figure
trieved on a c
ws that contain
ar SQL query for re
nships Compart
chema of Structur
ds (rows x colu
ueries drives m
rCAD system
ry event counti
ystem through a
equence alignm
integration be
volutionary re
nment. First, th
volutionary re
n Figure 7 is d
Bacterial seq
mon table exp
sequence align
e 7b). The su
column by col
n Bacterial seq
etrieving partial al
tment
re Relationships
umns) of a
many of the
m such as
ing. Figure
an example
ment using
etween the
elationships
he rows of
elationships
designed to
quences. A
pression) is
nment that
ubset of the
lumn basis
quences.
lignment
19

s
s
b
F
c
lo
o
k
s
F
T
s
S
e
S
H
tw
s
lo
(
s
e
s
d
h
lo
p
S
4
ta
E
4
in
3
4
The Structu
stores seconda
structure for a
base pairs betw
From the set of
can be inferred
oops (un-paire
or multi-stem.
The Seconda
known or pred
sequence alignm
FivePrimeElem
ThreePrimeEle
secondary str
SecondaryStru
elements depe
SecondaryStru
Helices and in
wo extents for
split into n exte
oop. An exten
(index), Extent
structural elem
extent of the str
Figure 9 E
The mappin
structure into t
depicted in Fi
helices (5’ 1-4
oop (5’ 5-9, 3
pairs from
SecondaryStru
4, 3’ 32-35) ha
able (Exten
ExtentEndIndex
4, 1) and the
nternal loop (E
3’ half, but the
4). The extent
ural Relationsh
ary structure r
given sequenc
ween different p
f base pairs, ot
d such as helice
ed nucleotides)
aryStructureB
dicted base pai
ment. Each bas
mentSequenceIn
ementSequence
ructural elem
uctureExtents
ending on ext
uctureExtentT
ternal loop str
r their 5’ and 3
ents depending
nt is defined b
tStartIndex an
ent is given an
ructural elemen
Example of mappin
ng of a simpl
the Structural
igure 9. The
, 3’ 32-35 & 5
’ 26-31) and a
the two he
uctureBasePai
s two entries in
ntID, Exten
x, ExtentTypeID
3’ half is (1,
ExtentID 2) als
e hairpin loop
model enable
hips compartm
relationships.
ce, at a minim
positions of an
ther structural
es (consecutive
) classified as
BasePairs table
rs for any RN
se pair is store
ndex
eIndex. The m
ents are ide
s table and spl
tent type enu
Types table.
ructural elemen
3’ halves. Mul
g on the numbe
by its first and
nd ExtentEndIn
n identifier, Ex
nt has its own E
ng RNA structure i
le stem-loop
Relationships
stem-loop str
5’ 10-18, 3’ 20
a hairpin loop (
elices are e
irs table. The
n SecondaryS
ntOrdinal, E
ID) where the 5
2, 32, 35, 1)
so has two exte
has only one
es SQL querie
ment (Figure
RNA seconda
mum, is the set
n RNA sequenc
elements/exten
e base pairs) a
hairpin, intern
e holds the set
NA sequence in
d with two fiel
a
more complicat
entified in t
lit into their su
umerated in t
For examp
nts are split in
ti-stem loops a
er of stems in t
d last nucleoti
ndex. The ent
xtentID, and ea
ExtentOrdinal
into database
RNA seconda
compartment
ructure has tw
0-25) an intern
(16-19). All ba
entered in t
first helix (5’
tructureExten
ExtentStartInde
5’ half is (1, 1,
(Figure 9). T
ents for its 5’ a
extent (Extent
s to characteri
8)
ary
of
ce.
nts
and
nal
of
n a
lds
and
ted
the
ub-
the
ple,
nto
are
the
ide
tire
ach
.
ary
is
wo
nal
ase
the
1-
nts
ex,
, 1,
The
and
tID
ize
the patt
structur
Algo
from rC
other
program
already
exampl
present
countin
A. Struc
Here
frequen
observe
model
include
order s
terns of sequen
ral motif, acros
IV.
orithms that op
CAD streamlin
dimensions o
ms for rCAD
y organized an
les of applica
ted below: stru
ng.
Figure 10 Fin
ctural Statistic
e, we present
ncies. The fre
ed for a base p
has many app
es the predictio
structure with c
nce variation fo
ss the phylogen
USE CASE EXA
perate on sequ
ning and mana
of data. Thu
is not encum
nd cross-index
ations utilizing
uctural statistic
nding base pair fre
cs
a common ta
equency of di
pair in a refer
plications in co
on of base pair
covariation an
for different ele
netic tree.
AMPLES
uence alignme
aging their acc
us the develo
mbered with d
xed. Two rep
g the rCAD s
cs and evolutio
equency example.
ask of finding
ifferent base
rence secondar
omparative ana
rings in an RN
alysis and the
ements of a
nts benefit
cess to the
opment of
data that is
presentative
system are
onary event
g base pair
pair types
ry structure
alysis. This
NAs higher-
evaluation
20

o
e
p
c
f
s
s
c
S
s
d
1
q
(
p
p
a
B
p
th
m
d
s
th
p
n
th
1
p
c
to
H
p
id
a
th
c
of the alignmen
existing sequen
The selecte
projected acros
column ordinal
from a referenc
sequence alignm
selected using
column ordinal
SQL query is
sequence alignm
determine frequ
10 bottom).
Different stru
queries or co
(http://www.rna
presentations o
potentials for
application of o
B. Evolutionary
Figur
Positions in
patterns of vari
he RNAs high
methods for id
determine the f
sequences in tw
hese methods
prediction of a
not explicitly d
he evolution o
1983 that the
pair will enha
covariation ana
o identify the
However the
phylogenetic tr
dentified manu
and quantify th
hat time. Atte
computational
nt accuracy fo
nce alignment [
d base pair,
ss the sequenc
ls associated w
ce row (Figure
ment included
evolutionary
ls for the 5’ a
defined to r
ment, and a gr
uencies of obs
uctural statistic
mpiled applic
a.ccbb.utexas.e
on structural s
RNA foldin
our structural s
y Event Counti
re 11 Evolutionary
a sequence
iation (covariat
her-order struc
dentifying the
frequency of ea
wo columns of
have been v
an RNAs highe
determine the
of the RNA. It
evolutionary h
ance the acc
alysis [31]. We
first tertiary
number of
ree (called evo
ually and thus
hese phylogene
empts to iden
statistical alg
or any new seq
[13, 28].
labeled (1) in
e alignment by
with the 5’ an
e 10 middle).
in the frequen
relationships.
and 3’ half of
retrieve nucle
rouping statem
served base pa
cs have been de
cations. Visit
edu/SAE/2D)
statistics. Com
ng programs
tatistics[29, 30
ing
y Event Counting e
alignment th
tion) are usual
cture [17, 31]
ese positions w
ach base pair t
f the alignment
very effective
er-order structu
number of cov
has been know
history for eac
uracy and re
e utilized the p
interaction in
covariations
olutionary even
a rigorous att
etic events wa
ntify phylogen
gorithms of t
quence within
n Figure 10,
y identifying t
nd 3’ nucleotid
The rows of t
ncy tabulation a
Using the tw
f the base pair
otides from t
ment is applied
air types (Figu
eveloped as SQ
the CRW S
for mo
mputing statistic
is one speci
0].
example
hat have simi
lly base paired
. The tradition
with covariati
type for all of t
t [32, 33]. Wh
in the accura
ure [16], they
variations duri
wn since at le
ch putative ba
esolution of t
phylogenetic tr
the rRNA [3
based on t
nt counting) w
tempt to ident
as not possible
etic events w
the phylogene
an
is
the
des
the
are
wo
, a
the
to
ure
QL
Site
ore
cal
ific
ilar
d in
nal
ion
the
hile
ate
do
ing
ast
ase
the
ree
4].
the
was
ify
at
with
etic
trees im
[35, 36
How
precise
change
rCAD.
populat
candida
11). Th
Evoluti
based
applica
recursio
nodes a
version
current
more tr
false p
Xu, Oz
Prese
applica
predica
of three
sequenc
seconda
evolutio
The
traditio
sequenc
manipu
schema
alignme
Usin
perform
insertio
sequenc
evolutio
them di
an RN
analysi
queries
betwee
their oc
are easi
The
Server
algorith
memor
not req
resourc
manage
system
elemen
mproved the s
].
wever, instead
e count of the n
es on the phyl
Our method
ted with the nu
ate positions a
he in-memory
ionary Relation
on the NC
ation traverses
on to deduce t
and compute th
n of our metho
t event count
rue strong and
positives when
zer, & Gutell, m
ented here is
ation of compa
ated on the inte
e primary dim
ce alignment
ary, and
onary relations
rCAD system
onal computati
ce alignments
ulated in-memo
a directly pe
ents as sparse m
ng a column ind
ming global
on and deletion
ce alignment
onary and sec
irectly to the s
NA sequence
is algorithms c
s. For examp
en sequences, d
ccurrence on d
ily determined
rCAD databas
enables the
hms as compil
ry space of the
quire significan
ce allocation
ement and inp
places the r
nts on the en
sensitivity of
of using stati
number of chan
logenetic tree
builds an in
ucleotides obta
across the sequ
y tree structur
nships compar
CBI Taxonom
s the in-memo
the nucleotide
he evolutionary
od was publish
ting algorithm
d weakly cova
n compared to
manuscript in p
V. CONCLUSIO
s a new dat
arative analysi
egration and sim
mensions of inf
ts, associatio
three-dimensi
ships.
m is significa
ional biology
are primarily
ory by differe
ersists and in
matrices.
direction techn
alignment op
n of columns
s. The rCAD
condary struct
sequences and
e alignment.
can be impleme
ple, structural
different RNA
different parts
d with rCAD.
se implemente
development
led application
e database eng
nt amounts of c
n, parallel
nput/output op
responsibility
nterprise data
the covariatio
istical method
nges and locat
can be determ
n-memory tree
ained from pro
uence alignme
re is obtained
rtment of rCAD
my database
ory data struc
composition o
y event counts.
hed previously
m successfully
ariant positions
o other method
preparation).
ONS
tabase schem
is to RNA dat
multaneous ma
formation - seq
ons between
ional structu
antly different
workflows w
y stored in fla
ent programs. T
ndexes RNA
nique, rCAD is
perations incl
efficiently for
D schema st
ture entities a
individual nuc
Comparative
ented as declar
statistics, re
A structural ele
of the phylog
d on the Micr
t of more co
ns that execute
gine. These pr
custom progra
processing,
timization. T
for managing
abase engine,
n methods
ds, a more
tions of the
mined with
e structure,
ojecting the
ent (Figure
d from the
D, which is
[27]. The
cture using
of ancestor
. An earlier
y [37]. Our
y identifies
s with less
ds (Shang,
ma for the
tasets. It is
anipulation
quence and
primary,
ure, and
from the
where RNA
at-files and
The rCAD
sequence
capable of
luding the
very large
tores both
and relates
cleotides in
sequence
rative SQL
elationships
ements and
genetic tree
rosoft SQL
omplicated
within the
rograms do
amming for
memory
The rCAD
g the data
which is
21

optimized for manipulating large quantities of information,
and is scalable to support extremely large RNA datasets.
Readers who are interested in RNA sequence data can
visit CRW Site (http://www.rna.ccbb.utexas.edu/) which
uses rCAD for data management. Readers interested in
building customized database can visit
http://rcat.codeplex.com/ for available codes and utilities.
ACKNOWLEDGEMENTS
This project has been funded by an External Research
grant from Microsoft Research, National Institutes of Health
(GM067317 and GM085337) and the Welch Foundation
(#1427).
REFERENCES
[1] H. F. Noller and J. B. Chaires, "Functional modification of 16S
ribosomal RNA by kethoxal," Proc. Natl. Acad. Sci. USA, vol. 69, pp.
3115-8, Nov 1972.
[2] K. Kruger, P. J. Grabowski, A. J. Zaug, J. Sands, D. E. Gottschling,
and T. R. Cech, "Self-splicing RNA: autoexcision and autocyclization
of the ribosomal RNA intervening sequence of Tetrahymena," Cell,
vol. 31, pp. 147-57, Nov 1982.
[3] C. Guerrier-Takada, K. Gardiner, T. Marsh, N. Pace, and S. Altman,
"The RNA moiety of ribonuclease P is the catalytic subunit of the
enzyme," Cell, vol. 35, pp. 849-57, Dec 1983.
[4] H. F. Noller, V. Hoffarth, and L. Zimniak, "Unusual resistance of
peptidyl transferase to protein extraction procedures," Science, vol.
256, pp. 1416-9, Jun 5 1992.
[5] P. Nissen, J. Hansen, N. Ban, P. B. Moore, and T. A. Steitz, "The
structural basis of ribosome activity in peptide bond synthesis,"
Science, vol. 289, pp. 920-30, Aug 11 2000.
[6] B. T. Wimberly, D. E. Brodersen, W. M. Clemons, Jr., R. J. Morgan-
Warren, A. P. Carter, C. Vonrhein, T. Hartsch, and V. Ramakrishnan,
"Structure of the 30S ribosomal subunit," Nature, vol. 407, pp. 327-
39, Sep 21 2000.
[7] M. M. Yusupov, G. Z. Yusupova, A. Baucom, K. Lieberman, T. N.
Earnest, J. H. Cate, and H. F. Noller, "Crystal structure of the
ribosome at 5.5 A resolution," Science, vol. 292, pp. 883-96, May 4
2001.
[8] C. R. Woese and G. E. Fox, "The concept of cellular evolution," J.
Mol. Evol., vol. 10, pp. 1-6, Sep 20 1977.
[9] C. R. Woese, The genetic code; the molecular basis for genetic
expression. New York,: Harper & Row, 1967.
[10] L. E. Orgel, "Evolution of the genetic apparatus," J. Mol. Biol., vol.
38, pp. 381-93, Dec 1968.
[11] F. H. Crick, "The origin of the genetic code," J. Mol. Biol., vol. 38,
pp. 367-79, Dec 1968.
[12] C. Darwin, The origin of species. New York, NY: Barnes & Noble
Classics, 2008.
[13] J. J. Cannone, S. Subramanian, M. N. Schnare, J. R. Collett, L. M.
D'Souza, Y. Du, B. Feng, N. Lin, L. V. Madabusi, K. M. Muller, N.
Pande, Z. Shang, N. Yu, and R. R. Gutell, "The comparative RNA
web (CRW) site: an online database of comparative sequence and
structure information for ribosomal, intron, and other RNAs," BMC
Bioinformatics, vol. 3, p. 2, 2002.
[14] R. W. Holley, J. Apgar, G. A. Everett, J. T. Madison, M. Marquisee,
S. H. Merrill, J. R. Penswick, and A. Zamir, "Structure of a
ribonucleic acid," Science, vol. 147, pp. 1462-5, Mar 19 1965.
[15] G. E. Fox and C. R. Woese, "5S RNA secondary structure," Nature,
vol. 256, pp. 505-7, Aug 7 1975.
[16] R. R. Gutell, J. C. Lee, and J. J. Cannone, "The accuracy of ribosomal
RNA comparative structure models," Curr. Opin. Struct. Biol., vol.
12, pp. 301-10, Jun 2002.
[17] R. R. Gutell, B. Weiser, C. R. Woese, and H. F. Noller, "Comparative
anatomy of 16-S-like ribosomal RNA," Prog. Nucleic Acid Res. Mol.
Biol., vol. 32, pp. 155-216, 1985.
[18] R. R. Gutell, J. J. Cannone, Z. Shang, Y. Du, and M. J. Serra, "A
story: unpaired adenosine bases in ribosomal RNAs," J. Mol. Biol.,
vol. 304, pp. 335-54, Dec 1 2000.
[19] C. R. Woese, S. Winker, and R. R. Gutell, "Architecture of ribosomal
RNA: constraints on the sequence of "tetra-loops"," Proc. Natl. Acad.
Sci. USA, vol. 87, pp. 8467-71, Nov 1990.
[20] R. R. Gutell, N. Larsen, and C. R. Woese, "Lessons from an evolving
rRNA: 16S and 23S rRNA structures from a comparative
perspective," Microbiol. Rev., vol. 58, pp. 10-26, Mar 1994.
[21] T. Z. DeSantis, P. Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie, K.
Keller, T. Huber, D. Dalevi, P. Hu, and G. L. Andersen, "Greengenes,
a chimera-checked 16S rRNA gene database and workbench
compatible with ARB," Appl. Environ. Microbiol., vol. 72, pp. 5069-
72, Jul 2006.
[22] A. L. Hartman, S. Riddle, T. McPhillips, B. Ludascher, and J. A.
Eisen, "Introducing W.A.T.E.R.S.: a workflow for the alignment,
taxonomy, and ecology of ribosomal sequences," BMC
Bioinformatics, vol. 11, p. 317, 2010.
[23] J. W. Brown, A. Birmingham, P. E. Griffiths, F. Jossinet, R.
Kachouri-Lafond, R. Knight, B. F. Lang, N. Leontis, G. Steger, J.
Stombaugh, and E. Westhof, "The RNA structure alignment
ontology," RNA, vol. 15, pp. 1623-31, Sep 2009.
[24] P. P. Gardner, J. Daub, J. G. Tate, E. P. Nawrocki, D. L. Kolbe, S.
Lindgreen, A. C. Wilkinson, R. D. Finn, S. Griffiths-Jones, S. R.
Eddy, and A. Bateman, "Rfam: updates to the RNA families
database," Nucleic Acids Res., vol. 37, pp. D136-40, Jan 2009.
[25] J. R. Cole, Q. Wang, E. Cardenas, J. Fish, B. Chai, R. J. Farris, A. S.
Kulam-Syed-Mohideen, D. M. McGarrell, T. Marsh, G. M. Garrity,
and J. M. Tiedje, "The Ribosomal Database Project: improved
alignments and new tools for rRNA analysis," Nucleic Acids Res.,
vol. 37, pp. D141-5, Jan 2009.
[26] C. J. Mungall and D. B. Emmert, "A Chado case study: an ontology-
based modular schema for representing genome-associated biological
information," Bioinformatics, vol. 23, pp. i337-46, Jul 1 2007.
[27] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W.
Sayers, "GenBank," Nucleic Acids Res., vol. 37, pp. D26-31, Jan
2009.
[28] K. J. Doshi, "Ph.D. dissertation. The University of Texas at Austin.,"
2007.
[29] J. C. Wu, D. P. Gardner, S. Ozer, R. R. Gutell, and P. Ren,
"Correlation of RNA secondary structure statistics with
thermodynamic stability and applications to folding," J. Mol. Biol.,
vol. 391, pp. 769-83, Aug 28 2009.
[30] D. P. Gardner, P. Ren, S. Ozer, and R. R. Gutell, "Statistical
Potentials for Hairpin and Internal Loops Improve the Accuracy of
the Predicted RNA Structure," Journal of Molecular Biology, vol. in
press, 2011.
[31] C. R. Woese, R. Gutell, R. Gupta, and H. F. Noller, "Detailed
analysis of the higher-order structure of 16S-like ribosomal
ribonucleic acids," Microbiol. Rev., vol. 47, pp. 621-69, Dec 1983.
[32] D. K. Chiu and T. Kolodziejczak, "Inferring consensus structure from
nucleic acid sequences," Comput. Appl. Biosci., vol. 7, pp. 347-52, Jul
1991.
[33] R. R. Gutell, A. Power, G. Z. Hertz, E. J. Putz, and G. D. Stormo,
"Identifying constraints on the higher-order structure of RNA:
continued development and application of comparative sequence
analysis methods," Nucleic Acids Res., vol. 20, pp. 5785-95, Nov 11
1992.
[34] R. R. Gutell, H. F. Noller, and C. R. Woese, "Higher order structure
in ribosomal RNA," EMBO J., vol. 5, pp. 1111-3, May 1986.
[35] J. Dutheil, T. Pupko, A. Jean-Marie, and N. Galtier, "A model-based
approach for detecting coevolving positions in a molecule," Mol. Biol.
Evol., vol. 22, pp. 1919-28, Sep 2005.
[36] C. H. Yeang, J. F. Darot, H. F. Noller, and D. Haussler, "Detecting
the coevolution of biosequences--an example of RNA interaction
prediction," Mol. Biol. Evol., vol. 24, pp. 2119-31, Sep 2007.
[37] W. Xu, S. Ozer, and R. R. Gutell, "Covariant Evolutionary Event
Analysis for Base Interaction Prediction Using a Relational Database
Management System for RNA " Lecture Notes in Computer Science,
vol. 5566, pp. 200-216, May 2009.
22

Gutell 117.rcad_e_science_stockholm_pp15-22

Recommandé

Recommandé

Contenu connexe

En vedette

En vedette (20)

Similaire à Gutell 117.rcad_e_science_stockholm_pp15-22

Similaire à Gutell 117.rcad_e_science_stockholm_pp15-22 (20)

Plus de Robin Gutell

Plus de Robin Gutell (20)

Dernier

Dernier (20)

Gutell 117.rcad_e_science_stockholm_pp15-22