4. Data and tools to support life science research
www.ebi.ac.uk/services
Bioinformatics services
5. What services do we provide?
Labs around the world send us their data and we…
• Archive it
• Classify it
• Share it with other data providers
• Analyse, add value and integrate it
…provide tools to help researchers use it
A collaborative enterprise
6. Big Data, big demand for EMBL-EBI data services…
• ~64 million requests to EMBL-EBI websites every day
• Requests from 20 million unique IP addresses
• 273 petabytes of raw storage in our data centres
• 22,500 participants in EMBL-EBI Training events
8. Data resources for Genomics – Molecular Archives
BioSamples database - centralised resource for FAIR sample data
(>12 million samples)
Experimental Factor Ontology - systematic description of experimental
variables available in EBI databases and projects (26,764 terms)
European Genome-phenome Archive - sequence and genotype
experiments, including case-control and population studies (3,445 studies)
European Nucleotide Archive (ENA) - record of the world's nucleotide
sequencing information (>2,400 million sequences, > 7,200 billion bases)
European Variation Archive - sole international resource for human and
non-human variation
9. Data resources for Genomics – Genes, Genomes & Variation
Ensembl - genome browser (human: >0.6 billion SNV, >6 million SV)
Ensembl Genomes - 275 vertebrate species / strains; Metazoa; Plants;
Fungi; Protists; Bacteria
GWAS Catalog - moved to EBI in 2015 (4,390 publications, >17,000 associations)
HGNC - 41,787 approved gene entries (19,320 protein coding)
International Genome Sample Resource - ensures future usability and
accessibility of 1000 Genomes Project data
10. The Ensembl Variant Effect Predictor (VEP)
VEP started as a simple wrapper around the Ensembl API to map variants to
transcripts and predict molecular consequence.
As new data sets and algorithms have become available, functionality has
increased and VEP is now an extensive and sophisticated tool.
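To make "map variants to transcripts and predict molecular consequence" concrete, here is a minimal toy sketch. It is not VEP's algorithm and does not use the Ensembl API: the `Transcript` class, the coordinates and the helper names are hypothetical, and only four consequence terms are handled (the term names follow Sequence Ontology naming). It assumes forward-strand transcripts and single-base variants.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Transcript:
    """Toy forward-strand transcript; coordinates are 1-based and inclusive."""
    name: str
    exons: Tuple[Tuple[int, int], ...]  # sorted, non-overlapping exon intervals
    cds_start: int  # first coding base
    cds_end: int    # last coding base


def consequence(position: int, tx: Transcript) -> Optional[str]:
    """Return a consequence term for a single-base variant,
    or None if the position lies outside the transcript."""
    if position < tx.exons[0][0] or position > tx.exons[-1][1]:
        return None
    if not any(start <= position <= end for start, end in tx.exons):
        return "intron_variant"
    if position < tx.cds_start:
        return "5_prime_UTR_variant"
    if position > tx.cds_end:
        return "3_prime_UTR_variant"
    return "coding_sequence_variant"


def annotate(position: int, transcripts: List[Transcript]) -> List[Tuple[str, str]]:
    """Map one variant position to every overlapping transcript."""
    results = []
    for tx in transcripts:
        term = consequence(position, tx)
        if term is not None:
            results.append((tx.name, term))
    return results


# Hypothetical transcripts, invented for illustration only.
TRANSCRIPTS = [
    Transcript("TX_A", ((100, 200), (300, 400)), cds_start=150, cds_end=350),
    Transcript("TX_B", ((380, 500),), cds_start=400, cds_end=480),
]

for pos in (160, 250, 390):
    print(pos, annotate(pos, TRANSCRIPTS))
```

In this toy data, position 390 falls in the 3' UTR of TX_A and the 5' UTR of TX_B, which illustrates why a consequence is reported per transcript rather than per variant.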
11. New resource for Genomics
• New resource for gene expression and splicing QTLs
• https://www.ebi.ac.uk/eqtl/
12. Global Alliance for Genomics and Health (GA4GH)
• Chaired by EMBL-EBI Director Ewan Birney
• EMBL-EBI teams leading various activities in Technical Work Streams:
• Large Scale Genomics (file formats and htsget subgroups)
• Clinical and phenotypic data capture
• Data Use and Researcher identification
• ENA/EGA/EVA and HCA DCP are also Driver Projects
13. Data resources for Genomics – Molecular Atlas
• Human Cell Atlas Data Coordination Platform
• In 2017, Chan Zuckerberg Initiative (CZI) funding to EMBL-EBI,
Broad Institute and the UCSC Genomics Institute, to
build a cloud-based data coordination platform
• HCA will generate petabytes of data for billions of cells,
across multiple modalities, generated by hundreds of labs
around the world
• DCP will organise, curate, standardise and analyse this data
and enable open data access
14. Data resources for Genomics – Proteins and Protein Families
A free to use resource for the archiving,
assembly, analysis, & browsing of
microbiome data
Data archiving, Assembly, Analysis
15. NEW Resource: BioImage Archive
Graphic courtesy of Jan Ellenberg
Scales: Molecules; Molecular Machines; Cells; Tissues / Organisms
Techniques: Light Sheet Microscopy; High Throughput Microscopy; Superresolution Microscopy; Cryo Electron Microscopy
Data volumes (as listed on the graphic): 0.1 TB / day, 0.5 TB / dataset; 0.5 TB / day, 7.5 TB / dataset; 40 TB / day, 10 TB / dataset; 5 TB / day, 20 TB / dataset
Correlate Technologies; Integrate Data
18. Pedro Beltrao: Functional landscape of the human phosphoproteome
Ochoa et al Nature Biotech 2019
• Created largest phosphoproteome
resource to date
(120,000 human phosphosites)
• Used machine learning methods
to compile and analyse large
phosphorylation related biological
datasets
• Identifying new functional
phosphosites has enormous
potential to progress research
into many biological processes
and diseases
19. Evangelia Petsalaki: Inference of kinase-kinase regulatory networks
from phosphoproteomics data (collaboration with Beltrao group)
Invergo*, Petursson* et al., bioRxiv
20. Moritz Gerstung: Pan-cancer computational histopathology
• Analysis with deep learning extracts histopathological patterns
• Accurately discriminates 28 cancer and 14 normal tissue types
• Predicts: whole genome duplications; focal amplifications and deletions; driver gene
mutations
• Correlations with gene expression indicative of immune infiltration and proliferation
• Prognostic information augments conventional grading and histopathology subtyping
https://doi.org/10.1101/813543
21. Zam Iqbal: Mykrobe – predicting TB drug resistance from WGS data
https://wellcomeopenresearch.org/articles/4-191/v1
24. An example of best practice for complex datasets
Single Cell RNA-Seq analysis at EMBL-EBI
From Irene Papatheodorou
Team Leader – Gene Expression
25. ArrayExpress – functional genomics archive
• started in 2000 as an archive
for microarray data
• evolved into general archive for
high-throughput functional
genomics data (microarray- or
NGS- based)
• all data are manually curated
prior to inclusion
• microarray data stored directly
in ArrayExpress
• sequencing data brokered to
and stored in ENA
• curated datasets support
reproducible and re-usable
research
26. Annotare – Minimum information about a scRNA-Seq experiment
Items shown on the figure:
• Single cell isolation
• Single cell well quality (OK / doublet / debris)
• Single cell identifier
• Barcode, UMI, cDNA read
• Post-analysis single cell quality (pass / fail)
• Library construction
• Inferred cell type
• Files (R1, R2, I1)
• Sample metadata
https://arxiv.org/abs/1910.14623
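As a rough illustration of how such minimum information could be checked before submission, the sketch below validates one per-cell record. The `CellRecord` class, its field names and the `problems` helper are hypothetical and are not Annotare's schema; the allowed values (OK / doublet / debris, pass / fail) are taken from the labels on the slide, and the example values are invented.

```python
from dataclasses import dataclass
from typing import List, Optional

# Allowed values as read from the slide's figure labels.
WELL_QUALITY = {"OK", "doublet", "debris"}
POST_ANALYSIS_QUALITY = {"pass", "fail"}


@dataclass
class CellRecord:
    """Hypothetical per-cell metadata record (not the Annotare schema)."""
    cell_id: str
    isolation: str
    library_construction: str
    well_quality: str
    post_analysis_quality: Optional[str] = None  # only known after analysis
    inferred_cell_type: Optional[str] = None     # only known after analysis


def problems(record: CellRecord) -> List[str]:
    """Return human-readable problems; an empty list means the record passes."""
    issues = []
    for name in ("cell_id", "isolation", "library_construction"):
        if not getattr(record, name).strip():
            issues.append(f"{name} is empty")
    if record.well_quality not in WELL_QUALITY:
        issues.append(
            f"well_quality {record.well_quality!r} not in {sorted(WELL_QUALITY)}"
        )
    if (record.post_analysis_quality is not None
            and record.post_analysis_quality not in POST_ANALYSIS_QUALITY):
        issues.append(
            f"post_analysis_quality {record.post_analysis_quality!r} "
            f"not in {sorted(POST_ANALYSIS_QUALITY)}"
        )
    return issues


# Invented example records.
good = CellRecord("cell_001", "FACS", "Smart-seq2", "OK", "pass", "T cell")
bad = CellRecord("", "FACS", "Smart-seq2", "empty well", "maybe")

print(problems(good))
print(problems(bad))
```

Keeping the post-analysis fields optional reflects that they can only be filled in once the data have been processed, whereas the isolation and library details are known at submission time.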
27. From database to knowledgebase: Expression Atlases
Bulk: > 3,500 datasets (165 baseline expression, ~ 3,350 differential expression), 62 species, > 955,000 assays
Single cell: > 120 datasets, 12 species
https://www.ebi.ac.uk/gxa
29. Interactive Analysis with Galaxy
https://humancellatlas.usegalaxy.eu/
Flexible
Interoperable
Scalable
30. Main Points
• Enabling rational choices when composing workflows
• Using a common exchange format as ‘workflow glue’
• Galaxy integrations
31. What people usually do...
Read Filter Normalise Compare Cluster Markers
OR
Read Filter Normalise Compare Cluster Markers
OR
Read Filter Normalise Compare Cluster Markers
32. What we really should be doing
Read Filter Normalise Compare Cluster Markers
33. ... but to do that we need interoperable components
Problem 1: components in different languages
Problem 2: need format glue!
(Figure: the pipeline Read, Filter, Normalise, Compare, Cluster, Markers, repeated seven times)
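The idea of a common exchange format acting as "workflow glue" can be sketched as follows. This is a minimal illustration under stated assumptions, not the workflow used at EMBL-EBI: the `Exchange` class stands in for whatever shared format the components agree on, and `filter_cells`, `normalise_to` and `run_workflow` are hypothetical names. Because every step takes and returns the same object, a step from one toolkit can be swapped for another without touching the rest of the chain.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Exchange:
    """Hypothetical common exchange format: gene names plus per-cell values."""
    genes: List[str]
    counts: Dict[str, List[float]]  # cell id -> one value per gene


# Every component has the same signature, so components are interchangeable.
Step = Callable[[Exchange], Exchange]


def filter_cells(min_total: float) -> Step:
    """Build a step that drops cells whose total count is below min_total."""
    def step(data: Exchange) -> Exchange:
        kept = {c: v for c, v in data.counts.items() if sum(v) >= min_total}
        return Exchange(data.genes, kept)
    return step


def normalise_to(target: float) -> Step:
    """Build a step that scales each cell so its values sum to target."""
    def step(data: Exchange) -> Exchange:
        scaled = {}
        for cell, values in data.counts.items():
            total = sum(values)
            # Leave all-zero cells unchanged to avoid dividing by zero.
            scaled[cell] = [x * target / total for x in values] if total else list(values)
        return Exchange(data.genes, scaled)
    return step


def run_workflow(data: Exchange, steps: List[Step]) -> Exchange:
    """Apply the steps in order, passing the exchange object along the chain."""
    for step in steps:
        data = step(data)
    return data


# Invented toy data: three cells, two genes.
RAW = Exchange(
    genes=["g1", "g2"],
    counts={"c1": [1.0, 3.0], "c2": [0.0, 0.0], "c3": [2.0, 2.0]},
)
RESULT = run_workflow(RAW, [filter_cells(1.0), normalise_to(100.0)])
print(RESULT.counts)
```

In practice the components may be written in different languages, in which case the exchange object would be a file on disk rather than an in-memory structure; the composition principle is the same.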
39. Drug discovery
• Finding the right biological target for a
drug requires bioinformatics to:
• identify promising targets
• select candidate medicines.
• EMBL-EBI services support all stages
of drug discovery:
• Ensembl
• UniProt
• ChEMBL
• Protein Data Bank in Europe
• Reactome
40. Open Targets
www.opentargets.org
• Pinpointing the processes in the human body
that have a demonstrable effect on disease
• Aims to improve the success rate in the
discovery and repurposing of medicines
• A new kind of collaboration with:
• GSK
• EMBL-EBI
• Wellcome Sanger Institute
• Biogen
• Takeda
• Celgene
• Sanofi
41. Open Targets Platform and Open Targets Genetics
Open Targets Platform: www.targetvalidation.org
Open Targets Genetics: genetics.opentargets.org
42. Challenges for the near future
• Non-coding SNVs
• Data standardization to enable AI/ML
• Connecting data
• Moving to the cloud