Bairoch ISB closing-talk: CALIPHO

CALIPHO: two missions for one goal:
increasing our knowledge on human
proteins

Amos Bairoch
April 4, 2012

Computer Analysis and Laboratory Investigation of Proteins of Human Origin

Its two missions:

Carry out laboratory experiments on selected sets of uncharacterized human proteins to discover their function

Develop neXtProt, an ambitious new knowledge resource centered around human proteins

‘The’ human genome

•  Sequencing a human genome is no longer a technological challenge;
•  Making sense of what it tells us is still much more problematic than anyone ever expected.

Almost 12 years ago, at the 4th
Siena meeting, we proposed to
annotate in Swiss-Prot all the
human proteins

Some stats on human proteins from UniProtKB/Swiss-Prot

•  20’244 reviewed entries (~protein-coding genes);
•  16’000 additional isoforms in about 8’100 entries (40%, but will probably rise to >60%): 50’000 different protein sequences;
•  65’000 variants; 22’500 linked to diseases; the rest are SNPs that are SAPs (2 per protein). This is the tip of the iceberg;
•  80’000 PTMs (50% of which are experimental). This is the tip of the tip of the iceberg!

Some issues about protein-coding genes

•  We completely agree with what was shown earlier in this meeting by the HAVANA group: that there are slightly fewer than 20K protein-coding genes;
•  Many weirdos in the genome: bicistronic mRNAs, genes that produce through splicing proteins with no sequence relationship, multiple genes for the same protein, etc.;
•  Variation in terms not only of SNPs but also of copy number. And some segregating pseudogenes (olfactory receptors);
•  How many have been proven at protein level?
–  Using the protein evidence “metric” used at UniProt and neXtProt, we are now at about 70%;
–  But if we were hunting everywhere in good-quality MS data, it would rise to about 85%. The big issue in proteomics is how to hunt for the last 15%.
The PTM world is still largely uncharted

(3R)-3-hydroxyasparagine, (3R)-3-hydroxyaspartate, (3S)-3-hydroxyasparagine, 1'-histidyl-3'-tyrosine,
1-thioglycine, 2',4',5'-topaquinone, 2,3-didehydroalanine, 3'-(S-cysteinyl)-tyrosine, 3-hydroxyproline,
3-oxoalanine, 4-amino-3-isothiazolidinone serine, 4-carboxyglutamate, 4-hydroxyproline, 5-glutamyl,
5-glutamyl glycerylphosphorylethanolamine, 5-hydroxylysine, 5-imidazolinone, ADP-ribosylasparagine,
ADP-ribosylcysteine, ADP-ribosylserine, Allysine, Arginine amide, Asparagine amide,
Aspartate 1-(chondroitin 4-sulfate)-ester, Asymmetric dimethylarginine, Beta-decarboxylated aspartate,
Cholesterol glycine ester, Citrulline, Cysteine methyl ester, Cysteine sulfenic acid, Cysteinyl-selenocysteine,
Deamidated asparagine, Deamidated glutamine, Dimethylated arginine, Diphthamide, Disulfide bond,
GPI-anchor amidated alanine, GPI-anchor amidated asparagine, GPI-anchor amidated aspartate,
GPI-anchor amidated cysteine, GPI-anchor amidated glycine, GPI-anchor amidated serine, Glutamic acid 1-amide,
Glutamine amide, Glycine amide, Glycyl adenylate, Glycyl lysine isopeptide, Hydroxyproline,
Hydroxyproline, Hypusine, Isoglutamyl cysteine thioester, Isoglutamyl lysine isopeptide, Isoleucine amide,
Leucine amide, Leucine methyl ester, Lysine amide, Lysine tyrosylquinone, Methionine amide,
N,N,N-trimethylalanine, N-acetylalanine, N-acetylaspartate, N-acetylcysteine, N-acetylglutamate, N-acetylglycine,
N-acetylmethionine, N-acetylproline, N-acetylserine, N-acetylthreonine, N-acetylvaline,
N-myristoyl glycine, N-palmitoyl cysteine, N-palmitoyl glycine, N-pyruvate 2-iminyl-valine, N4,N4-dimethylasparagine,
N6,N6,N6-trimethyllysine, N6,N6-dimethyllysine, N6-(pyridoxal phosphate)lysine, N6-(retinylidene)lysine,
N6-1-carboxyethyl lysine, N6-acetyllysine, N6-biotinyllysine, N6-carboxylysine, N6-lipoyllysine,
N6-methylated lysine, N6-methyllysine, N6-myristoyl lysine, Nitrated tyrosine,
O-(pantetheine 4'-phosphoryl)serine, O-AMP-threonine, O-AMP-tyrosine, O-acetylserine, O-acetylthreonine,
O-decanoyl serine, O-palmitoyl serine, Omega-N-methylarginine, Omega-N-methylated arginine,
Omega-hydroxyceramide glutamate ester, Phenylalanine amide, Phosphatidylethanolamine amidated glycine,
Phosphohistidine, Phosphoserine, Phosphothreonine, Phosphotyrosine, PolyADP-ribosyl glutamic acid,
Proline amide, Pyrrolidone carboxylic acid, Pyruvic acid, S-(dipyrrolylmethanemethyl)cysteine,
S-8alpha-FAD cysteine, S-Lysyl-methionine sulfilimine, S-cysteinyl cysteine, S-farnesyl cysteine,
S-geranylgeranyl cysteine, S-glutathionyl cysteine, S-methylcysteine, S-nitrosocysteine, S-palmitoyl cysteine,
S-stearoyl cysteine, Sulfoserine, Sulfotyrosine, Symmetric dimethylarginine, Tele-8alpha-FAD histidine, Tele-

And all this does not include all the different glycosylation forms and the processing events

From genome to proteome

~ 20’000 protein-coding genes
-> alternative splicing of mRNA (2-5 fold increase)
~ 50 to 100’000 transcripts (mRNAs)
-> post-translational modifications of proteins (PTMs) (50-100 fold increase)
~ 5'000'000 different proteins
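
The orders of magnitude on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: it takes the gene count and the two fold-increase ranges from the slide and multiplies the low and high ends through. The helper name `scale` and the constant names are made up for this example.

```python
# Rough orders of magnitude taken from the slide; the ranges are
# illustrative estimates, not measurements.
PROTEIN_CODING_GENES = 20_000
SPLICING_FOLD = (2, 5)      # alternative splicing of mRNA
PTM_FOLD = (50, 100)        # post-translational modifications of proteins


def scale(value_range, fold_range):
    """Multiply a (low, high) range by a (low, high) fold increase."""
    return (value_range[0] * fold_range[0], value_range[1] * fold_range[1])


genes = (PROTEIN_CODING_GENES, PROTEIN_CODING_GENES)
transcripts = scale(genes, SPLICING_FOLD)
proteins = scale(transcripts, PTM_FOLD)

print(f"transcripts: {transcripts[0]:,} to {transcripts[1]:,}")
print(f"proteins:    {proteins[0]:,} to {proteins[1]:,}")
```

The slide's round figures (about 50 to 100’000 transcripts and about 5'000'000 different proteins) fall inside the ranges this produces.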

Protein complexity

The complexity of life and of its molecular actors is fractal

Many human proteins for which we lack functional knowledge

1.  Similar to characterized proteins in distant organisms (bacteria, plants, yeast), but no validation in mammals;
2.  Presence of domains that help predict a ‘general’ function but not a precise one (examples: hydrolase fold, GPCR);
3.  Presence of domains or sequence features that help define some properties (examples: PDZ -> PPI, many TMs -> integral membrane protein);
4.  “Orphan”. With no similarity to any characterized proteins but that can be conserved across a more or less wide taxonomic space.

About 5’000 human proteins are in one of the above four categories

Overview of the CALIPHO wet lab strategy

In silico selection: sequence analysis, phylogeny, data mining
Tissue/cell line expression (RT-PCR)
Cloning of cDNA in the Gateway system
Yeast two hybrid
Subcellular location in HeLa cells (confocal imaging)
Recombinant protein production in E. coli
Validation of protein-protein interactions (GST pull down, co-IP)
3D structure by NMR
Data mining, modelling
Hypothesis generation
Functional assays on cell lines (RNAi)
In vivo validation (animal models, e.g. zebrafish)

CALIPHO@UniGe / collaborators / CALIPHO@SIB

After 2.5 years…

•  A protein involved in ciliogenesis;
•  An enzyme involved in a salvage pathway not yet characterized in vertebrates;
•  A myristoylated and palmitoylated protein that could be involved in membrane blebbing;
•  A mitochondrial protein that may play a role in a Mt import mechanism.

Personal view

•  Cons:
–  It takes much longer than you expect or want! And magic and luck seem to be the most important factors in successful experiments!
–  The low quality/cost ratio of many lab reagents (defective antibodies, for example!);
–  You can’t freely share preliminary results with everyone because you may (will!) be scooped.
•  Pros:
–  Fun to see bioinformatics predictions confirmed in the lab;
–  Nice collaborations;
–  Great lab atmosphere.

•  What: a comprehensive resource that complements SIB/EBI Swiss-Prot human protein annotation efforts. We expect neXtProt to become a central resource for human protein-centric information;
•  How:
–  by mining, in the most appropriate way and with stringent quality criteria, many high-throughput data resources. We plan to add additional protein/protein and protein/small molecule interactions, proteomics data, pathways/networks information, variation data (such as SNP frequencies), siRNA screen data, phylogenetic profiling, etc.;
–  by integrating experimental results from an extensive network of collaborating laboratories.

In Swiss-Prot users always need to navigate toward many external resources so as to consolidate data into knowledge.
In neXtProt the most pertinent data will be integrated so as to enable complex queries.

UniProtKB/Swiss-Prot human entries links:

Sequence databases: EMBL, IPI, PIR, RefSeq, UniGene
Proteomics databases: HPA, PeptideAtlas, PRIDE
Enzyme and pathway databases: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
Family and domain databases: Gene3D, InterPro, PANTHER, PIRSF, Pfam, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
2D-gel databases: ANU-2DPAGE, Aarhus/Ghent-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, HSC-2DPAGE, OGP, PMMA-2DPAGE, REPRODUCTION-2DPAGE, SWISS-2DPAGE, World-2DPAGE
Organism-specific databases: GeneCards, H-InvDB, HGNC, MIM, Orphanet, PharmGKB
Miscellaneous: ArrayExpress, Bgee, BindingDB, CleanEx, dbSNP, DIP, DrugBank, GO, HOGENOM, HOVERGEN, IntAct, LinkHub, NextBio
Genome annotation databases: Ensembl, GeneID, KEGG, NMPDR
PTM databases: GlycoSuiteDB, PhosphoSite
3D structure databases: DisProt, HSSP, PDB, PDBsum, SMR
Protein family/group databases: GermOnline, MEROPS, PeroxiBase, REBASE, TCDB

What is not neXtProt?

•  No, neXtProt is not a replacement for UniProtKB/Swiss-Prot;
•  No, neXtProt is not universal in coverage, it is intended to provide knowledge pertinent to human proteins;
•  No, neXtProt is not a sequence resource: it uses the sequence data curated in Swiss-Prot.

When and what?

•  In early 2011 we released a first public version that contained in terms of data:
–  All of Swiss-Prot human data: sequences and annotations;
–  Human Protein Atlas (HPA) organ and tissue expression information from IHC (antibodies);
–  Metadata on mRNA expression from microarrays and ESTs from Bgee (analyzed from ArrayExpress and UniGene);
–  Additional SNPs from dbSNP and Ensembl;
–  Chromosomal location and exon mapping from Ensembl;
–  Affymetrix and Illumina chip set identifiers.
•  In terms of interface, it offers:
–  An intuitive query interface;
–  Many specialized views (function, medical, expression, etc.);
–  The possibility to tag and label proteins.

Bronze, silver and gold

•  We have a three-tiered approach as to data quality:
–  Bronze: noisy or low quality data that is not imported in the platform;
–  Silver: good data, but…
–  Gold: data that we believe to be of a Swiss-(Prot)-level quality.
•  By default searches in neXtProt are carried out on gold data;
•  Quality classification is a dynamic process.
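
To make the three tiers concrete, here is a minimal sketch of how such a policy could be expressed in code. It is not neXtProt's implementation: the `Annotation` record, the function names, the `include_silver` switch and the placeholder proteins P1 to P3 are all assumptions made for illustration. It only encodes the two rules stated above: bronze data is not imported, and searches run on gold data by default.

```python
from dataclasses import dataclass
from enum import Enum


class Quality(Enum):
    BRONZE = 1   # noisy or low quality: not imported
    SILVER = 2   # good data, with reservations
    GOLD = 3     # Swiss-Prot-level quality


@dataclass(frozen=True)
class Annotation:            # hypothetical record type
    protein: str
    description: str
    quality: Quality


def import_annotations(candidates):
    """Keep everything except bronze data, which is not imported."""
    return [a for a in candidates if a.quality is not Quality.BRONZE]


def search(store, term, include_silver=False):
    """Search gold data by default; include silver only on request."""
    allowed = {Quality.GOLD}
    if include_silver:
        allowed.add(Quality.SILVER)
    term = term.lower()
    return [a for a in store
            if a.quality in allowed and term in a.description.lower()]


# Placeholder data, invented for the example.
candidates = [
    Annotation("P1", "kinase activity", Quality.GOLD),
    Annotation("P2", "putative kinase", Quality.SILVER),
    Annotation("P3", "kinase (noisy screen)", Quality.BRONZE),
]
store = import_annotations(candidates)
gold_hits = search(store, "kinase")
all_hits = search(store, "kinase", include_silver=True)
```

Because classification is described as a dynamic process, the tier is stored as a field on each record rather than implied by where the record lives, so a record could be reclassified without being moved.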

A variety of views for a single protein

An innovative sequence viewer

Information at the genomic level

Expression data at mRNA and protein levels

A new proteomics page

PTMs

We are loading high-quality sets of PTMs, starting with N-glycosylation and phosphorylation

Peptide identifications

•  HUPO brain and plasma project peptides from PeptideAtlas;
•  Sets linked with PTMs;
•  Carapito et al. mitochondrial N-terminome project.

And to be loaded soon:

•  Other HUPO data sets;
•  Data from various labs (Vienna, Geneva, Roche (Basel), Montpellier, etc.).

New subcellular localization data

•  From two projects: DKFZ GFP-cDNA@EMBL and WIS Kahn Dynamic Proteomics db

Data export

•  Export of data both in XML and in PEFF formats;
•  neXtProt is the first resource to offer support to the PSI PEFF format;
•  This enriched FASTA format allows search engines and other tools to easily and consistently access data essential to the success of HPP, namely sequence variations and PTMs.

Download by FTP

•  At ftp.nextprot.org
•  To obtain downloads in XML or PEFF;
•  These files are also available per chromosome, as well as ‘report’ files

What’s next in terms of tools

•  A tool for the analysis of lists of proteins so as to explore their enrichment in various types of annotations, including Gene Ontology (GO) terms.

Programmatic access

•  We will build an API to allow third party software tools to make use of the data in neXtProt;
•  Together with BIONEXT, we have obtained a grant to develop this API and integrate a version of their 3D structure visualisation tool in neXtProt.

A note about variants

•  There are now over 420’000 variants loaded in neXtProt;
•  65’000 come from Swiss-Prot; the others have been loaded from dbSNP through Ensembl;
•  We will also load the Cosmic variants as well as other sources.

We also want to do many other things as quickly as possible but…

The road map: principles

•  Our vision is to gradually build up neXtProt, not only by adding new data resources but:
–  By integrating state of the art data mining tools;
–  By integrating some forms of “social networking” functionalities allowing researchers to share ideas and data;
–  By enabling the modeling of hypotheses inside the framework of the platform.
•  To work closely with collaborators and users to define how the data and tools that we will incorporate into neXtProt will be useful for their research.

A new resource for cell lines

•  There are three ontologies catering for cell lines (MCCL, CLO, Brenda);
•  A large number of on-line catalogs: ATCC, CBA, CCRID, Coriell, DSMZ, ECACC, ICLC, IFO, IZSLER, JCRB, RCB, Riken;
•  There are information resources: CABRI, CCLE, COPE, HyperCLDB, Lonza;
•  Databases storing cell lines as “samples”: Cosmic;
•  Topical reviews on ‘categories’ of cell lines;
•  Various lists of contaminated cell lines…

But so far there was no single resource pooling together all this information in an attempt to create a cell line thesaurus.

•  Not an ontology, but a thesaurus;
•  Links to all the ontologies, catalogs, resources, publications, web sites, etc. (over 20’000 Xrefs);
•  Current version: 8766 cell lines. The next version (May) will have over 10’000 lines, 5’000 synonyms;
•  Scope: vertebrates (80% human, 15% mouse and rat, the remainder are associated with about 100 species);
•  Currently available in a Swiss-Prot like text-based format at: ftp://ftp.nextprot.org/
•  But it will soon also be available in OBO format as it has a number of relationships (derives_from, etc.);
•  Currently: no links to tissues and diseases, but this will be added later.

ID 22Rv1
AC CVCL_1045
SY 22RV1; 22Rv-1; CWR22-Rv1; CWR22R-V1; CWR22Rv1
DR CLO; CLO_0001199
DR CLO; CLO_0001200
DR Brenda; BTO:0002999
DR CLDB; cl7072
DR ATCC; CRL-2505
DR CCLE; 22RV1_PROSTATE
DR CCRID; 3131C0001000700100
DR Cosmic; 924100
DR DSMZ; ACC-438
DR ECACC; 05092802
DR PubMed; 14518029
WW http://capcelllines.ca/details.asp?id=53
WW http://bio.lonza.com/extras/cell-transfection-database/..
OX NCBI_TaxID=9606; ! Homo sapiens
HI CVCL_3967 ! CWR22
//
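
A line-oriented format like the entry above is straightforward to read programmatically. The following is a minimal sketch of a parser, written only from what the example entry shows: a two-letter line code, a space, a value, and `//` closing the entry. The function names are invented for this illustration, the sample is a shortened copy of the entry, and stripping a stray trailing `!` is an assumption meant to cope with extraction residue rather than a rule of the format.

```python
from collections import defaultdict

# Shortened copy of the 22Rv1 entry shown on the slide.
SAMPLE_ENTRY = """\
ID 22Rv1
AC CVCL_1045
SY 22RV1; 22Rv-1; CWR22-Rv1; CWR22R-V1; CWR22Rv1
DR CLO; CLO_0001199
DR ATCC; CRL-2505
DR PubMed; 14518029
OX NCBI_TaxID=9606; ! Homo sapiens
HI CVCL_3967 ! CWR22
//
"""


def parse_entries(text):
    """Yield one dict per entry, mapping each line code to a list of values."""
    entry = defaultdict(list)
    for raw in text.splitlines():
        # Assumption: a trailing '!' is extraction residue, not content.
        line = raw.strip().rstrip("!").rstrip()
        if not line:
            continue
        if line == "//":                 # end-of-entry marker
            if entry:
                yield dict(entry)
            entry = defaultdict(list)
            continue
        code, _, value = line.partition(" ")
        entry[code].append(value.strip())


def cross_references(entry):
    """Return (database, identifier) pairs from the DR lines."""
    pairs = []
    for value in entry.get("DR", []):
        database, _, identifier = value.partition(";")
        pairs.append((database.strip(), identifier.strip()))
    return pairs


def synonyms(entry):
    """Split the SY lines on ';' into individual names."""
    return [name.strip()
            for value in entry.get("SY", [])
            for name in value.split(";") if name.strip()]


entry = next(parse_entries(SAMPLE_ENTRY))
print(entry["AC"][0], synonyms(entry), cross_references(entry))
```

Keeping every line code as a list preserves repeated lines such as DR and WW without special cases.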

The ISB

•  A young society but already very active:
•  Pros:
–  Over 310 active members from 15 countries;
–  The international meeting (now yearly);
–  Good links to journals such as Database and NAR;
–  Common projects such as BioDBCore
•  Cons:
–  Not enough grass-roots involvement of the members;
–  Not yet enough awareness of the existence of the society by would-be members in many countries (Eastern Europe, South America, etc.) but also closer to ‘home’ (in the US).

Be more proactive!

Biocuration is an expanding field

•  Good news:
–  Increasing number of biocurators in academia and industry;
–  More and more knowledge resources incorporate some amount of manual biocuration.
•  Bad news:
–  The usual problem of long-term funding and sustainability of key resources;
–  A lot of re-inventing the wheel as annotation SOPs are generally not easily available.

The data flood

•  Yes it exists but…
•  A big proportion of the data that accumulates today is not going to be useful in a few years;
•  For example: if we have clean full length genome sequence of “all” representative species on earth this is only 10 petabytes of information (10 million species with 1 billion bp each);
•  The genome of a human being stored as variant file is only 60 Mb (compressed). So storing the variation information for 10 billion individuals is slightly less than 1 exabyte, not a big challenge in terms of technology and price in 2020;
•  In the meantime we are still encapsulating our most important knowledge using a 16th century technology: free
CALIPHO@UniGe_and_SIB

•  neXtProt content:
–  Coordinator: Pascale Gaudet
–  Biocurators: Guislaine Argoud-Puy, Aurore Britan, Jonas Cicenas, Isabelle Cusin, Paula Duek, Nevila, Nouspikel
–  QA: Monique Zahn
•  neXtProt software developers:
–  Olivier Evalet, Alain Gateau, Anne Gleizes, Mario Pereira, Catherine Zwahlen (and for two years: Alexandre Masselot)
•  Laboratory research:
–  Franck Bontems, Marjorie Desmurs, Camille Mary, Rachel Porcelli, Irene Rossito, Lisa Salleron, Fabiana Tirone
•  Directed by:
–  Amos Bairoch and Lydie Lane

And we have a position open for a Java developer (will soon be announced on the ISB web)
