1. CALIPHO: two missions for one goal: increasing our knowledge on human proteins
Amos Bairoch
April 4, 2012
2. Computer Analysis and Laboratory Investigation of Proteins of Human Origin
Its two missions:
• Carry out laboratory experiments on selected sets of uncharacterized human proteins to discover their function;
• Develop neXtProt, an ambitious new knowledge resource centered around human proteins.
3. ‘The’ human genome
• Sequencing a human genome is no longer a technological challenge;
• Making sense of what it tells us is still much more problematic than anyone ever expected.
4. Almost 12 years ago, at the 4th Siena meeting, we proposed to annotate in Swiss-Prot all the human proteins
5. Some stats on human proteins from UniProtKB/Swiss-Prot
• 20’244 reviewed entries (~protein-coding genes);
• 16’000 additional isoforms in about 8’100 entries (40%, but will probably rise to >60%): 50’000 different protein sequences;
• 65’000 variants; 22’500 linked to diseases; the rest are SNPs that are SAPs (2 per protein). This is the tip of the iceberg;
• 80’000 PTMs (50% of which are experimental). This is the tip of the tip of the iceberg!
6. Some issues about protein-coding genes
• We completely agree with what was shown earlier in this meeting by the HAVANA group: that there are slightly fewer than 20K protein-coding genes;
• Many weirdos in the genome: bicistronic mRNAs, genes that produce through splicing proteins with no sequence relationship, multiple genes for the same protein, etc.;
• Variation in terms not only of SNPs but of copy number. And some segregating pseudogenes (olfactory receptors);
• How many have been proven at protein level?
– Using the protein evidence “metric” used at UniProt and neXtProt, we are now at about 70%;
– But if we were hunting everywhere in good-quality MS data, it would rise to about 85%.
The big issue in proteomics is how to hunt for the last 15%
8. From genome to proteome
[Figure: protein complexity]
~20’000 protein-coding genes
-> alternative splicing of mRNA (2-5 fold increase)
-> ~50’000 to 100’000 transcripts (mRNAs)
-> post-translational modifications of proteins (PTMs) (50-100 fold increase)
-> ~5’000’000 different proteins
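The fold increases on this slide can be checked with a little arithmetic. The sketch below simply multiplies the gene count by the quoted fold ranges; treating the figures as a plain product is an assumption made for illustration, and the names `fold_range`, `SPLICING_FOLD` and `PTM_FOLD` are hypothetical.

```python
# Figures quoted on the slide; multiplying them through is an
# illustrative assumption, not a method stated in the talk.
GENES = 20_000
SPLICING_FOLD = (2, 5)     # alternative splicing: 2-5 fold increase
PTM_FOLD = (50, 100)       # PTMs: 50-100 fold increase


def fold_range(count_range, fold):
    """Scale a (low, high) count range by a (low, high) fold range."""
    low, high = count_range
    return (low * fold[0], high * fold[1])


transcripts = fold_range((GENES, GENES), SPLICING_FOLD)
proteins = fold_range(transcripts, PTM_FOLD)

print(f"transcripts: {transcripts[0]:,} to {transcripts[1]:,}")
print(f"proteins:    {proteins[0]:,} to {proteins[1]:,}")
```

The resulting ranges (40,000 to 100,000 transcripts; 2 to 10 million proteins) bracket the rounded figures given on the slide.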
10. Many human proteins for which we lack functional knowledge
1. Similar to characterized proteins in distant organisms (bacteria, plants, yeast), but no validation in mammals;
2. Presence of domains that help predict a ‘general’ function but not a precise one (examples: hydrolase fold, GPCR);
3. Presence of domains or sequence features that help define some properties (examples: PDZ -> PPI, many TMs -> integral membrane protein);
4. “Orphan”. With no similarity to any characterized proteins but that can be conserved across a more or less wide taxonomic space.
About 5’000 human proteins are in one of the above four categories
11. Overview of the CALIPHO wet lab strategy
[Figure: workflow diagram; legend: CALIPHO@UniGe, collaborators, CALIPHO@SIB]
• In silico selection: sequence analysis, phylogeny, data mining
• Tissue/cell line expression (RT-PCR)
• Cloning of cDNA in the Gateway system
• Yeast two hybrid
• Subcellular location in HeLa cells (confocal imaging)
• Recombinant protein production in E. coli
• Validation of protein-protein interactions (GST pull down, co-IP)
• 3D structure by NMR
• Data mining, modelling
• Hypothesis generation
• Functional assays on cell lines (RNAi)
• In vivo validation (animal models, e.g. zebrafish)
12. After 2.5 years…
• A protein involved in ciliogenesis;
• An enzyme involved in a salvage pathway not yet characterized in vertebrates;
• A myristoylated and palmitoylated protein that could be involved in membrane blebbing;
• A mitochondrial protein that may play a role in a Mt import mechanism.
13. Personal view
• Cons:
– It takes much longer than what you expect or want! And magic and luck seem to be the most important factors in successful experiments!
– The low ratio of quality/cost for many lab reagents (defective antibodies for example!);
– You can’t freely share preliminary results with everyone because you may (will!) be scooped.
• Pros:
– Fun to see bioinformatics predictions confirmed in the lab;
– Nice collaborations;
– Great lab atmosphere.
14.
• What: a comprehensive resource that complements SIB/EBI Swiss-Prot human protein annotation efforts. We expect neXtProt to become a central resource for human protein-centric information;
• How:
– by mining, in the most appropriate way and with stringent quality criteria, many high-throughput data resources. We plan to add protein/protein and protein/small molecule interactions, proteomics data, pathways/networks information, variation data (such as SNP frequencies), siRNA screen data, phylogenetic profiling, etc.;
– by integrating experimental results from an extensive network of collaborating laboratories.
15. [Figure: cross-references from UniProtKB/Swiss-Prot human entries to external resources, grouped by category]
In Swiss-Prot users always need to navigate toward many external resources so as to consolidate data into knowledge.
In neXtProt the most pertinent data will be integrated so as to enable complex queries.
• Sequence databases: EMBL, IPI, PIR, RefSeq, UniGene
• Proteomics: HPA, PeptideAtlas, PRIDE
• Enzyme and pathway databases: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
• Family and domain databases: Gene3D, InterPro, PANTHER, PIRSF, Pfam, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
• 2D-gel databases: ANU-2DPAGE, Aarhus/Ghent-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, HSC-2DPAGE, OGP, PMMA-2DPAGE, REPRODUCTION-2DPAGE, SWISS-2DPAGE, World-2DPAGE
• Organism-specific databases: GeneCards, H-InvDB, HGNC, MIM, Orphanet, PharmGKB
• Miscellaneous: ArrayExpress, Bgee, BindingDB, CleanEx, dbSNP, DIP, DrugBank, GO, HOGENOM, HOVERGEN, IntAct, LinkHub, NextBio
• Genome annotation databases: Ensembl, GeneID, KEGG, NMPDR
• PTM databases: GlycoSuiteDB, PhosphoSite
• 3D structure databases: DisProt, HSSP, PDB, PDBsum, SMR
• Protein family/group databases: GermOnline, MEROPS, PeroxiBase, REBASE, TCDB
16. What is not neXtProt?
• No, neXtProt is not a replacement for UniProtKB/Swiss-Prot;
• No, neXtProt is not universal in coverage: it is intended to provide knowledge pertinent to human proteins;
• No, neXtProt is not a sequence resource: it uses the sequence data curated in Swiss-Prot.
17. When and what?
• In early 2011 we released a first public version that contained in terms of data:
– All of Swiss-Prot human data: sequences and annotations;
– Human Protein Atlas (HPA) organ and tissue expression information from IHC (antibodies);
– Metadata on mRNA expression from microarrays and ESTs from Bgee (analyzed from ArrayExpress and UniGene);
– Additional SNPs from dbSNP and Ensembl;
– Chromosomal location and exon mapping from Ensembl;
– Affymetrix and Illumina chip set identifiers.
• In terms of interface, it offers:
– An intuitive query interface;
– Many specialized views (function, medical, expression, etc.);
– The possibility to tag and label proteins.
18. Bronze, silver and gold
• We have a three-tiered approach as to data quality:
– Bronze: noisy or low quality data that is not imported in the platform;
– Silver: good data, but…
– Gold: data that we believe to be of a swiss-(prot)-level quality.
• By default searches in neXtProt are carried out on gold data;
• Quality classification is a dynamic process.
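As a minimal sketch of how such a tiered scheme could behave, the example below drops bronze data at import time and searches gold data only unless silver is requested. The data model (`Annotation`, `Quality`), the function names and the sample records are all hypothetical; nothing here reflects the actual neXtProt implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Quality(Enum):
    BRONZE = 1  # noisy or low quality: never imported
    SILVER = 2  # good data, with reservations
    GOLD = 3    # Swiss-Prot-level quality


@dataclass(frozen=True)
class Annotation:
    protein: str   # hypothetical protein label
    text: str
    quality: Quality


def import_annotations(candidates):
    """Keep silver and gold annotations; bronze data is not imported."""
    return [a for a in candidates if a.quality is not Quality.BRONZE]


def search(store, term, include_silver=False):
    """Search annotation text; gold only unless silver is requested."""
    allowed = {Quality.GOLD, Quality.SILVER} if include_silver else {Quality.GOLD}
    needle = term.lower()
    return [a for a in store if a.quality in allowed and needle in a.text.lower()]


# Made-up sample records, for illustration only.
candidates = [
    Annotation("PROT_A", "phosphorylation at Ser-10", Quality.GOLD),
    Annotation("PROT_B", "phosphorylation at Thr-5", Quality.SILVER),
    Annotation("PROT_C", "phosphorylation (low score)", Quality.BRONZE),
]
store = import_annotations(candidates)
gold_hits = search(store, "phosphorylation")
all_hits = search(store, "phosphorylation", include_silver=True)
```

Because the slide notes that quality classification is dynamic, a real system would presumably need to re-evaluate tiers over time; this sketch treats them as fixed.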
26. PTMs
We are loading high-quality sets of PTMs, starting with N-glycosylation and phosphorylation
27. Peptide identifications
• HUPO brain and plasma project peptides from PeptideAtlas;
• Sets linked with PTMs;
• Carapito et al. mitochondrial N-terminome project.
And to be loaded soon:
• Other HUPO data sets;
• Data from various labs (Vienna, Geneva, Roche (Basel), Montpellier, etc.).
28. New subcellular localization data
• From two projects: DKFZ GFP-cDNA@EMBL and WIS Kahn Dynamic Proteomics db
29. Data export
• Export of data both in XML and in PEFF formats;
• neXtProt is the first resource to offer support for the PSI PEFF format;
• This enriched FASTA format allows search engines and other tools to easily and consistently access data essential to the success of HPP, namely sequence variations and PTMs.
30. Download by FTP
• At ftp.nextprot.org
• To obtain downloads in XML or PEFF;
• These files are also available per chromosome, as are ‘report’ files
31. What’s next in terms of tools
• A tool for the analysis of lists of proteins so as to explore their enrichment in various types of annotations, including Gene Ontology (GO) terms.
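The slide does not say which statistic the planned tool will use. One common way to score enrichment of an annotation in a protein list is a one-sided hypergeometric test, sketched below under that assumption; the function name and the example numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """One-sided hypergeometric p-value.

    Probability of drawing at least `hits` annotated proteins when
    `sample` proteins are picked at random from a `population` in
    which `annotated` proteins carry the annotation.
    """
    upper = min(annotated, sample)
    favorable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return favorable / comb(population, sample)


# Hypothetical example: 20,000 proteins, 200 carry a given GO term,
# and 5 of the 50 proteins in the user's list carry it.
p = enrichment_p_value(20_000, 200, 50, 5)
print(f"p = {p:.3g}")
```

In practice many annotation terms are tested at once, so a multiple-testing correction would normally be applied on top of the raw p-values.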
32. Programmatic access
• We will build an API to allow third party software tools to make use of the data in neXtProt;
• Together with BIONEXT, we have obtained a grant to develop this API and integrate a version of their 3D structure visualisation tool in neXtProt.
33. A note about variants
• There are now over 420’000 variants loaded in neXtProt;
• 65’000 come from Swiss-Prot; the others have been loaded from dbSNP through Ensembl;
• We will also load the Cosmic variants as well as variants from other sources.
34. We also want to do many other things as quickly as possible but…
35. The road map: principles
• Our vision is to gradually build up neXtProt, not only by adding new data resources but:
– By integrating state of the art data mining tools;
– By integrating some forms of “social networking” functionalities allowing researchers to share ideas and data;
– By enabling the modeling of hypotheses inside the framework of the platform.
• To work closely with collaborators and users to define how the data and tools that we will incorporate into neXtProt will be useful for their research.
36. A new resource for cell lines
• There are three ontologies catering for cell lines (MCCL, CLO, Brenda);
• A large number of on-line catalogs: ATCC, CBA, CCRID, Coriell, DSMZ, ECACC, ICLC, IFO, IZSLER, JCRB, RCB, Riken;
• There are information resources: CABRI, CCLE, COPE, HyperCLDB, Lonza;
• Databases storing cell lines as “samples”: Cosmic;
• Topical reviews on ‘categories’ of cell lines;
• Various lists of contaminated cell lines…
But there was so far no single resource pooling together all this information in an attempt to create a cell line thesaurus.
37.
• Not an ontology, but a thesaurus;
• Links to all the ontologies, catalogs, resources, publications, web sites, etc. (over 20’000 Xref);
• Current version: 8766 cell lines. The next version (May) will have over 10’000 lines, 5’000 synonyms;
• Scope: vertebrates (80% human, 15% mouse and rat, the remainder are associated with about 100 species);
• Currently available in a Swiss-Prot like text-based format at: ftp://ftp.nextprot.org/
• But it will soon also be available in OBO format as it has a number of relationships (derives_from, etc.);
• Currently: no links to tissues and diseases, but this will be added later.
38.
ID 22Rv1
AC CVCL_1045
SY 22RV1; 22Rv-1; CWR22-Rv1; CWR22R-V1; CWR22Rv1
DR CLO; CLO_0001199
DR CLO; CLO_0001200
DR Brenda; BTO:0002999
DR CLDB; cl7072
DR ATCC; CRL-2505
DR CCLE; 22RV1_PROSTATE
DR CCRID; 3131C0001000700100
DR Cosmic; 924100
DR DSMZ; ACC-438
DR ECACC; 05092802
DR PubMed; 14518029
WW http://capcelllines.ca/details.asp?id=53
WW http://bio.lonza.com/extras/cell-transfection-database/..
OX NCBI_TaxID=9606; ! Homo sapiens
HI CVCL_3967 ! CWR22
//
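A line-oriented entry like this one is straightforward to read programmatically. The sketch below is an illustration only: it assumes each line starts with a two-letter code followed by whitespace, that `//` ends the entry, that `SY` holds semicolon-separated synonyms, and that `DR` holds a `database; accession` pair. The function name `parse_entry` and the shortened sample are hypothetical, not an official parser.

```python
from collections import defaultdict

# Shortened copy of the 22Rv1 entry shown on the slide.
ENTRY = """\
ID 22Rv1
AC CVCL_1045
SY 22RV1; 22Rv-1; CWR22-Rv1; CWR22R-V1; CWR22Rv1
DR CLO; CLO_0001199
DR ATCC; CRL-2505
OX NCBI_TaxID=9606; ! Homo sapiens
HI CVCL_3967 ! CWR22
//
"""


def parse_entry(text):
    """Parse one entry into a dict mapping line code to a list of values."""
    record = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue                      # ignore blank lines
        if line.startswith("//"):
            break                         # end-of-entry marker
        code, _, value = line.partition(" ")
        value = value.strip()
        if code == "SY":
            # Synonyms are separated by semicolons.
            record[code].extend(s.strip() for s in value.split(";") if s.strip())
        elif code == "DR":
            # Cross-reference: database name, then accession.
            database, _, accession = value.partition(";")
            record[code].append((database.strip(), accession.strip()))
        else:
            record[code].append(value)
    return dict(record)


entry = parse_entry(ENTRY)
print(entry["ID"][0], entry["AC"][0], len(entry["SY"]), "synonyms")
```

Storing every field as a list keeps repeated codes such as `DR` and `WW` uniform with single-valued ones such as `ID`.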
39.
40. The ISB
• A young society but already very active:
• Pros:
– Over 310 active members from 15 countries;
– The international meeting (now yearly);
– Good links to journals such as Database and NAR;
– Common projects such as BioDBCore.
• Cons:
– Not enough grass root involvement of the members;
– Not yet enough awareness of the existence of the society by would-be members in many countries (Eastern Europe, South America, etc.) but also closer to ‘home’ (in the US).
Be more proactive!
41. Biocuration is an expanding field
• Good news:
– Increasing number of biocurators in academia and industry;
– More and more knowledge resources incorporate some amount of manual biocuration.
• Bad news:
– The usual problem of long-term funding and sustainability of key resources;
– A lot of re-inventing the wheel as annotation SOPs are generally not easily available.
42. The data flood
• Yes it exists but…
• A big proportion of the data that accumulates today is not going to be useful in a few years;
• For example: if we have clean full length genome sequence of “all” representative species on earth this is only 10 petabytes of information (10 million species with 1 billion bp each);
• The genome of a human being stored as variant file is only 60 MB (compressed). So storing the variation information for 10 billion individuals is slightly less than 1 exabyte, not a big challenge in terms of technology and price in 2020;
• In the meantime we are still encapsulating our most important knowledge using a 16th century technology: free
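The storage estimates on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes decimal units (1 PB = 10^15 bytes, 1 EB = 10^18 bytes) and one byte per base pair for uncompressed sequence; both are assumptions chosen because they reproduce the slide's figures, not details given in the talk.

```python
# Decimal storage units (assumption).
MEGABYTE = 10**6
PETABYTE = 10**15
EXABYTE = 10**18

# "All" representative species: 10 million species, 1 billion bp each.
species = 10_000_000
bp_per_genome = 1_000_000_000
bytes_per_bp = 1  # assumption: one byte per base pair, uncompressed

all_species_pb = species * bp_per_genome * bytes_per_bp / PETABYTE

# Human variation: one compressed 60 MB variant file per individual.
individuals = 10_000_000_000
variant_file_bytes = 60 * MEGABYTE

all_individuals_eb = individuals * variant_file_bytes / EXABYTE

print(f"all species:     {all_species_pb:g} PB")
print(f"all individuals: {all_individuals_eb:g} EB")
```

With a denser encoding (for example 2 bits per base) the first figure would shrink fourfold, which only strengthens the slide's point.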
43. CALIPHO@UniGe_and_SIB
• neXtProt content:
– Coordinator: Pascale Gaudet
– Biocurators: Guislaine Argoud-Puy, Aurore Britan, Jonas Cicenas, Isabelle Cusin, Paula Duek, Nevila, Nouspikel
– QA: Monique Zahn
• neXtProt software developers:
– Olivier Evalet, Alain Gateau, Anne Gleizes, Mario Pereira, Catherine Zwahlen (and for two years: Alexandre Masselot)
• Laboratory research:
– Franck Bontems, Marjorie Desmurs, Camille Mary, Rachel Porcelli, Irene Rossito, Lisa Salleron, Fabiana Tirone
• Directed by:
– Amos Bairoch and Lydie Lane
And we have a position open for a Java developer (will soon be announced on the ISB web)