Slides from a short tutorial on querying Chado databases using SQL, given for the Pathogen Genomics group at the Wellcome Trust Sanger Institute on 12th February 2009.
2. Overview
• Relational databases
• Chado
• Writing queries
• Saving the results
• More examples
Monday, February 16, 2009
3. Relational database
• Data are organised in tables:
• The columns of the table represent attributes,
• The rows represent entities.
11. The core of Chado
• Organism: organism_id, genus, species
• CV: cv_id, name
• CVterm: cvterm_id, cv_id, name
• Feature: feature_id, organism_id, type_id, uniquename, name, residues
13. Connecting to the database
• Make sure you have an account on the database,
• Log onto pcs4,
• Type “chado”,
• Enter your database password.
14. Connecting to the database
Welcome to psql 8.2.5, the PostgreSQL interactive terminal.
Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit
malaria_workshop=>
16. Example queries
\d for ‘describe’
\d cv
20. Example queries
\d cv
select * from cv;
• * means ‘all columns’
• cv is the name of the table
• Queries end with a semicolon
23. Example queries
\d cv
select * from cv;
\d cvterm
select name from cvterm
where cv_id = 10;
Just seeing the terms like this is pretty baffling. If you want to understand the structure of the ontology better, you can download OBO-Edit from oboedit.org, and the sequence ontology from sequenceontology.org.
25. Example queries
select name from cvterm
where cv_id = 10;
select cvterm.name
from cvterm
join cv on cv.cv_id = cvterm.cv_id
where cv.name = 'sequence';
select species from organism where
genus = 'Staphylococcus';
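The difference between filtering on a hard-coded cv_id and joining to cv by name can be tried on a toy database. The sketch below is not Chado itself: it assumes Python's built-in sqlite3 module with an in-memory database, and the two tables hold only the columns used here. The value 10 for the sequence ontology's cv_id is taken from the slides; every other id and row is invented for illustration.

```python
import sqlite3

# Toy stand-in for two Chado tables (assumption: only the columns used here).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table cv (cv_id integer primary key, name text);
    create table cvterm (cvterm_id integer primary key, cv_id integer, name text);
    insert into cv values (10, 'sequence'), (11, 'hypothetical_other_cv');
    insert into cvterm values
        (1, 10, 'gene'), (2, 10, 'pseudogene'), (3, 11, 'hypothetical_term');
""")

# Filter by numeric id: works only if you already know that 10 means 'sequence'.
by_id = conn.execute(
    "select name from cvterm where cv_id = 10 order by name"
).fetchall()

# Join to cv and filter by its name: same rows, no magic number.
by_name = conn.execute("""
    select cvterm.name
    from cvterm
    join cv on cv.cv_id = cvterm.cv_id
    where cv.name = 'sequence'
    order by cvterm.name
""").fetchall()

print(by_id)
print(by_name)
```

Both queries return the same terms here. The joined form keeps working if the numeric id differs between databases, which is why the later slides use it.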
26. Count the genes in MRSA252
select count(*)
from feature gene
where gene.type_id in (
    select cvterm.cvterm_id
    from cvterm
    join cv on cv.cv_id = cvterm.cv_id
    where cv.name = 'sequence'
    and cvterm.name = 'gene'
)
and gene.organism_id in (
    select organism_id
    from organism
    where genus = 'Staphylococcus'
    and species = 'aureus (MRSA252)'
);
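To see how the two subqueries narrow the count, the same query can be run against a handful of invented rows. This is a minimal sketch, assuming Python's sqlite3 module and cut-down versions of the four core tables. The feature names and ids are made up, and count_features is a hypothetical helper, not part of Chado or psql.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table organism (organism_id integer primary key, genus text, species text);
    create table cv (cv_id integer primary key, name text);
    create table cvterm (cvterm_id integer primary key, cv_id integer, name text);
    create table feature (
        feature_id integer primary key, organism_id integer,
        type_id integer, uniquename text
    );
    insert into organism values
        (1, 'Staphylococcus', 'aureus (MRSA252)'),
        (2, 'Plasmodium', 'falciparum');
    insert into cv values (10, 'sequence');
    insert into cvterm values (1, 10, 'gene'), (2, 10, 'pseudogene');
    -- Invented features: two genes and one pseudogene in MRSA252, one gene elsewhere.
    insert into feature values
        (100, 1, 1, 'toy_gene_a'), (101, 1, 1, 'toy_gene_b'),
        (102, 1, 2, 'toy_pseudogene'), (103, 2, 1, 'toy_gene_c');
""")

# Same shape as the query on the slide, with the literals turned into parameters.
COUNT_SQL = """
    select count(*)
    from feature gene
    where gene.type_id in (
        select cvterm.cvterm_id
        from cvterm
        join cv on cv.cv_id = cvterm.cv_id
        where cv.name = 'sequence'
        and cvterm.name = ?
    )
    and gene.organism_id in (
        select organism_id
        from organism
        where genus = ?
        and species = ?
    )
"""

def count_features(type_name, genus, species):
    """Count features of one sequence-ontology type in one organism."""
    row = conn.execute(COUNT_SQL, (type_name, genus, species)).fetchone()
    return row[0]

gene_count = count_features("gene", "Staphylococcus", "aureus (MRSA252)")
pseudogene_count = count_features("pseudogene", "Staphylococcus", "aureus (MRSA252)")
print(gene_count, pseudogene_count)
```

Swapping the type name from "gene" to "pseudogene" is the only change needed to count a different kind of feature.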
27. Editing queries
• Now type \e (for “edit”),
• Change “gene” to “pseudogene”,
• The query will run again, and count the pseudogenes.
28. More Chado tables
• Locations are stored in the table featureloc.
Featureloc
• featureloc_id
• feature_id: refers to the gene
• srcfeature_id: refers to the chromosome
• fmin, fmax: interbase coordinates
• strand: 1 (forward) or -1 (reverse)
• locgroup, rank: both 0 for the principal location
29. More Chado tables
• Locations are stored in the table featureloc.
Interbase coordinates
• Sequence: ACGGTCCATACGGTCCATACGGTCCATCGGTTA
• Positions: 0 1 2 3 4 5 etc.
• Example location: 13–18
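Interbase coordinates number the gaps between bases rather than the bases themselves, with position 0 before the first base. A short sketch of what that means in practice, using the sequence from the slide and assuming that the 13–18 shown there is an (fmin, fmax) pair:

```python
# Sequence from the slide; the location 13-18 is assumed to be (fmin, fmax).
sequence = "ACGGTCCATACGGTCCATACGGTCCATCGGTTA"
fmin, fmax = 13, 18

# Interbase positions line up with Python slice indices,
# so the located bases are simply sequence[fmin:fmax].
subsequence = sequence[fmin:fmax]

# The length needs no "+ 1" correction.
length = fmax - fmin

# The same span in 1-based, inclusive base numbering.
start_1based, end_1based = fmin + 1, fmax

print(subsequence, length, start_1based, end_1based)
```

This is why the location queries can take fmax - fmin directly as the feature length.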
31. Location example
• Find the mean gene length of MRSA252 genes on the forward strand.
32. Location example
select avg(geneloc.fmax - geneloc.fmin)
from feature gene
join featureloc geneloc
    on geneloc.feature_id = gene.feature_id
where gene.type_id in (
    select cvterm.cvterm_id
    from cvterm
    join cv on cv.cv_id = cvterm.cv_id
    where cv.name = 'sequence'
    and cvterm.name = 'gene'
)
and gene.organism_id in (
    select organism_id
    from organism
    where genus = 'Staphylococcus'
    and species = 'aureus (MRSA252)'
)
and geneloc.locgroup = 0
and geneloc.rank = 0
and geneloc.strand = 1;
33. Another location example
• How many genes are on each chromosome in Plasmodium falciparum?
34. Another location example
select chromosome.uniquename as chromosome
     , count(*) as "number of genes"
from feature gene
join featureloc geneloc
    on geneloc.feature_id = gene.feature_id
join feature chromosome
    on geneloc.srcfeature_id = chromosome.feature_id
where gene.type_id in (
    select cvterm.cvterm_id
    from cvterm
    join cv on cv.cv_id = cvterm.cv_id
    where cv.name = 'sequence'
    and cvterm.name = 'gene'
)
and gene.organism_id in (
    select organism_id
    from organism
    where genus = 'Plasmodium'
    and species = 'falciparum'
)
and geneloc.locgroup = 0
and geneloc.rank = 0
group by chromosome.uniquename
;
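The grouping step is easier to follow on a few invented rows. The sketch below assumes Python's sqlite3 module and cut-down feature and featureloc tables. To keep it short, the cvterm and organism subqueries are replaced by a plain type_id = 1 filter, and all names and ids are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table feature (feature_id integer primary key, type_id integer, uniquename text);
    create table featureloc (
        featureloc_id integer primary key, feature_id integer, srcfeature_id integer,
        fmin integer, fmax integer, strand integer, locgroup integer, rank integer
    );
    -- Invented type ids: 1 = gene, 2 = chromosome (real ones come from cvterm).
    insert into feature values
        (1, 2, 'toy_chr1'), (2, 2, 'toy_chr2'),
        (10, 1, 'toy_gene_a'), (11, 1, 'toy_gene_b'), (12, 1, 'toy_gene_c');
    -- The last row is a secondary location (rank 1) that the query must ignore.
    insert into featureloc values
        (1, 10, 1, 0, 300, 1, 0, 0),
        (2, 11, 1, 500, 900, -1, 0, 0),
        (3, 12, 2, 100, 250, 1, 0, 0),
        (4, 12, 1, 700, 850, 1, 0, 1);
""")

# Join each gene to its principal location, then to the feature it is located on,
# and count the genes per chromosome.
rows = conn.execute("""
    select chromosome.uniquename as chromosome
         , count(*) as "number of genes"
    from feature gene
    join featureloc geneloc
        on geneloc.feature_id = gene.feature_id
    join feature chromosome
        on geneloc.srcfeature_id = chromosome.feature_id
    where gene.type_id = 1
      and geneloc.locgroup = 0
      and geneloc.rank = 0
    group by chromosome.uniquename
    order by chromosome.uniquename
""").fetchall()

for chromosome, number_of_genes in rows:
    print(chromosome, number_of_genes)
```

Without the locgroup and rank conditions, toy_gene_c would be counted on both chromosomes.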
35. Transcripts and exons
Feature_relationship
• subject_id, object_id: both refer to feature
• type_id: refers to cvterm
• Each exon is related to a transcript,
• Each transcript is related to a gene,
• Each polypeptide is related to a transcript,
• Annotation is attached to the polypeptide.
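The chain from gene to polypeptide can be followed with joins on feature_relationship. This is a sketch only, assuming Python's sqlite3 module and invented rows. It also assumes the usual Chado direction, where the subject is the child and the object is the parent. The type ids (1 for part_of, 2 for derives_from) are made up; in a real database they would be looked up in cvterm.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table feature (feature_id integer primary key, uniquename text);
    create table feature_relationship (
        subject_id integer, object_id integer, type_id integer
    );
    insert into feature values
        (1, 'toy_gene'), (2, 'toy_transcript'),
        (3, 'toy_exon_1'), (4, 'toy_exon_2'), (5, 'toy_polypeptide');
    -- Assumed direction: subject is the child, object is the parent.
    -- Rows: transcript part_of gene, two exons part_of transcript,
    -- polypeptide derives_from transcript.
    insert into feature_relationship values
        (2, 1, 1), (3, 2, 1), (4, 2, 1), (5, 2, 2);
""")

# Exons of a transcript: one hop through feature_relationship.
exons = conn.execute("""
    select exon.uniquename
    from feature transcript
    join feature_relationship exon_transcript
        on exon_transcript.object_id = transcript.feature_id
    join feature exon
        on exon.feature_id = exon_transcript.subject_id
    where transcript.uniquename = 'toy_transcript'
      and exon_transcript.type_id = 1
    order by exon.uniquename
""").fetchall()

# Polypeptides of a gene: two hops, gene -> transcript -> polypeptide.
polypeptides = conn.execute("""
    select polypeptide.uniquename
    from feature gene
    join feature_relationship transcript_gene
        on transcript_gene.object_id = gene.feature_id
    join feature_relationship polypeptide_transcript
        on polypeptide_transcript.object_id = transcript_gene.subject_id
    join feature polypeptide
        on polypeptide.feature_id = polypeptide_transcript.subject_id
    where gene.uniquename = 'toy_gene'
      and transcript_gene.type_id = 1
      and polypeptide_transcript.type_id = 2
""").fetchall()

print(exons)
print(polypeptides)
```

Since annotation is attached to the polypeptide, a query that starts from a gene generally needs the two-hop form.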
36. Annotation
Feature_cvterm (products)
• feature_id
• cvterm_id
Featureprop (most other things)
• feature_id
• type_id
• value