Advanced Bioinformatics for Genomics and BioData Driven Research

European Bioinformatics Institute -
the home for big data in biology
www.ebi.ac.uk

The European Molecular Biology Laboratory
6 sites in Europe, >1700 personnel, 80+ nationalities
• Heidelberg, Germany: Main Laboratory
• Barcelona, Spain: Tissue Biology, Disease Modeling
• Hinxton, Cambridge, UK: Bioinformatics
• Rome, Italy: Mouse Biology
• Grenoble, France: Structural Biology
• Hamburg, Germany: Structural Biology

Our mission
• Deliver excellent research
• Train the next generation of scientists
• Engage with industry
• Coordinate bioinformatics in Europe
• Deliver scientific services

Data and tools to support life science research
www.ebi.ac.uk/services
Bioinformatics services

What services do we provide?
Labs around the world send us their data and we…
• Archive it
• Classify it
• Share it with other data providers
• Analyse, add value and integrate it
…provide tools to help researchers use it
A collaborative enterprise

Big Data, big demand for EMBL-EBI data services…
• ~64 million requests to EMBL-EBI websites every day
• Requests from 20 million unique IP addresses
• 273 petabytes of raw storage in our data centres
• 22,500 participants in EMBL-EBI Training events

Data resources for Genomics – Molecular Archives
BioSamples database - centralised resource for FAIR sample data
(>12 million samples)
Experimental Factor Ontology - systematic description of experimental
variables available in EBI databases and projects (26,764 terms)
European Genome-phenome Archive - sequence and genotype
experiments, including case-control and population studies (3,445 studies)
European Nucleotide Archive (ENA) - record of the world's nucleotide
sequencing information (>2,400 million sequences, > 7,200 billion bases)
European Variation Archive - sole international resource for human and
non-human variation

Data resources for Genomics – Genes, Genomes & Variation
Ensembl - genome browser (human: >0.6 billion SNV, >6 million SV)
Ensembl Genomes - 275 vertebrate species / strains; Metazoa; Plants;
Fungi; Protists; Bacteria
GWAS Catalog - moved to EBI in 2015 (4,390 publications, >17,000 associations)
HGNC - 41,787 approved gene entries (19,320 protein coding)
International Genome Sample Resource - ensures future usability and
accessibility of 1000 Genomes Project data

The Ensembl Variant Effect Predictor
VEP started as a simple wrapper around the Ensembl API to map variants to transcripts and predict molecular consequence.
As new data sets and algorithms have become available, functionality has increased and VEP is now an extensive and sophisticated tool.
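
As a rough illustration of what "map variants to transcripts and predict molecular consequence" produces, the sketch below summarises one VEP-style JSON record. The field names (`most_severe_consequence`, `transcript_consequences`, `consequence_terms`, `transcript_id`) follow the JSON output of the public Ensembl REST `vep` endpoints as we understand it; the sample record is invented for illustration and the helper `summarise_vep_record` is hypothetical.

```python
from collections import defaultdict

# Illustrative record shaped like one element of a VEP JSON response.
# All values are invented for this example; they are not real annotation.
SAMPLE_RECORD = {
    "input": "example_variant",
    "most_severe_consequence": "missense_variant",
    "transcript_consequences": [
        {"transcript_id": "TX_A", "gene_symbol": "GENE1",
         "consequence_terms": ["missense_variant"]},
        {"transcript_id": "TX_B", "gene_symbol": "GENE1",
         "consequence_terms": ["intron_variant",
                               "non_coding_transcript_variant"]},
    ],
}


def summarise_vep_record(record):
    """Group the transcripts hit by a variant under each predicted consequence."""
    by_consequence = defaultdict(list)
    for tc in record.get("transcript_consequences", []):
        for term in tc.get("consequence_terms", []):
            by_consequence[term].append(tc["transcript_id"])
    return {
        "most_severe": record.get("most_severe_consequence"),
        "transcripts_by_consequence": dict(by_consequence),
    }


summary = summarise_vep_record(SAMPLE_RECORD)
print(summary["most_severe"])
for term, transcripts in sorted(summary["transcripts_by_consequence"].items()):
    print(term, transcripts)
```

A real record could be obtained from the Ensembl REST service (for example its `vep/human/hgvs/...` endpoint, requesting JSON) and passed to the same helper.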

New resource for Genomics
• New resource for gene expression and splicing QTLs
• https://www.ebi.ac.uk/eqtl/

Global Alliance for Genomics and Health (GA4GH)
• Chaired by EMBL-EBI Director Ewan Birney
• EMBL-EBI teams leading various activities in Technical Work Streams:
• Large Scale Genomics (file formats and htsget subgroups)
• Clinical and phenotypic data capture
• Data Use and Researcher identification
• ENA/EGA/EVA and HCA DCP are also Driver Projects

Data resources for Genomics – Molecular Atlas
• Human Cell Atlas Data Coordination Platform
• In 2017, the Chan Zuckerberg Initiative (CZI) funded EMBL-EBI, the Broad Institute and the UCSC Genomics Institute to build a cloud-based data coordination platform
• HCA will generate petabytes of data for billions of cells, across multiple modalities, generated by hundreds of labs around the world
• DCP will organise, curate, standardise and analyse this data and enable open data access

Data resources for Genomics – Proteins and Protein Families
MGnify - a free-to-use resource for the archiving, assembly, analysis and browsing of microbiome data
Data archiving, Assembly, Analysis

NEW Resource: BioImage Archive
[Figure: imaging across biological scales (Molecules, Molecular Machines, Cells, Tissues / Organisms). Technologies shown: Light Sheet Microscopy, High Throughput Microscopy, Superresolution Microscopy, Cryo Electron Microscopy. Data volumes shown: 0.1 TB / day and 0.5 TB / dataset; 0.5 TB / day and 7.5 TB / dataset; 40 TB / day and 10 TB / dataset; 5 TB / day and 20 TB / dataset. Aims: Correlate Technologies, Integrate Data. Graphic courtesy of Jan Ellenberg]

Data-driven discovery
Research
www.ebi.ac.uk/research

Research groups at EMBL-EBI
• Zamin Iqbal
• Thomas Keane
• John Marioni
• Janet Thornton
• Andrew Leach
• Evangelia Petsalaki
• Virginie Uhlmann
• Daniel Zerbino
• Paul Flicek
• Nick Goldman
• Rob Finn
• Alvis Brazma
• Pedro Beltrao
• Alex Bateman
• Ewan Birney
• Moritz Gerstung
• Isidro Cortes-Ciriano
• Irene Papatheodorou
In 2018, EMBL-EBI had 165 grants awarded, 120 jointly funded with researchers and institutes in 62 countries

Pedro Beltrao: Functional landscape of the human phosphoproteome
Ochoa et al., Nature Biotechnology 2019
• Created largest phosphoproteome resource to date (120,000 human phosphosites)
• Used machine learning methods to compile and analyse large phosphorylation-related biological datasets
• Identifying new functional phosphosites has enormous potential to progress research into many biological processes and diseases

Evangelia Petsalaki: Inference of kinase-kinase regulatory networks
from phosphoproteomics data (collaboration with Beltrao group)
Invergo*, Petursson* et al., bioRxiv

Moritz Gerstung: Pan-cancer computational histopathology
• Analysis with deep learning extracts histopathological patterns
• Accurately discriminates 28 cancer and 14 normal tissue types
• Predicts whole genome duplications, focal amplifications and deletions, and driver gene mutations
• Correlations with gene expression indicative of immune infiltration and proliferation
• Prognostic information augments conventional grading and histopathology subtyping
https://doi.org/10.1101/813543

Zam Iqbal: Mykrobe – predicting TB drug resistance from WGS data
https://wellcomeopenresearch.org/articles/4-191/v1

Virginie Uhlmann: Mathematical models for bioimage analysis
doi.org/10.1371/journal.pone.0173433

Dictionary Learning for Two-Dimensional Kendall Shapes
https://arxiv.org/abs/1903.11356

An example of best practice for complex datasets
Single Cell RNA-Seq analysis at EMBL-EBI
From Irene Papatheodorou
Team Leader – Gene Expression

ArrayExpress – functional genomics archive
• Started in 2000 as an archive for microarray data
• Evolved into a general archive for high-throughput functional genomics data (microarray- or NGS-based)
• All data are manually curated prior to inclusion
• Microarray data stored directly in ArrayExpress
• Sequencing data brokered to and stored in ENA
• Curated datasets support reproducible and re-usable research

Annotare – Minimum information about a scRNA-Seq experiment
[Figure: metadata captured for a scRNA-Seq experiment: sample metadata; single cell isolation; single cell well quality (OK, doublet, debris); library construction (single cell identifier barcode, UMI, cDNA read); files (R1, R2, I1); post-analysis single cell quality (pass, fail); inferred cell type]
https://arxiv.org/abs/1910.14623
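
To make the metadata fields above more concrete, here is a minimal sketch of a per-cell record with simple validation. The allowed values (OK / doublet / debris, pass / fail) come from the slide; the class name, field names and the example values are assumptions for illustration and are not Annotare's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values as listed on the slide; everything else here is an assumption.
WELL_QUALITY = {"OK", "doublet", "debris"}
POST_ANALYSIS_QUALITY = {"pass", "fail"}


@dataclass
class SingleCellRecord:
    cell_barcode: str                              # single cell identifier
    isolation_method: str                          # single cell isolation
    well_quality: str                              # OK / doublet / debris
    post_analysis_quality: Optional[str] = None    # pass / fail, set after analysis
    inferred_cell_type: Optional[str] = None       # filled in after analysis

    def problems(self):
        """Return a list of validation problems (empty when the record is fine)."""
        found = []
        if not self.cell_barcode:
            found.append("missing cell barcode")
        if self.well_quality not in WELL_QUALITY:
            found.append(f"unknown well quality: {self.well_quality!r}")
        if (self.post_analysis_quality is not None
                and self.post_analysis_quality not in POST_ANALYSIS_QUALITY):
            found.append(
                f"unknown post-analysis quality: {self.post_analysis_quality!r}")
        return found


good = SingleCellRecord("ACGTACGT", "FACS", "OK", "pass", "T cell")
bad = SingleCellRecord("", "FACS", "triplet")
print(good.problems())
print(bad.problems())
```

Keeping the controlled vocabularies in one place makes it easy to reject free-text values at submission time rather than during analysis.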

From database to knowledgebase: Expression Atlases
• Expression Atlas (https://www.ebi.ac.uk/gxa): > 3,500 bulk datasets (165 baseline expression, ~ 3,350 differential expression), 62 species, > 955,000 assays
• Single Cell Expression Atlas: > 120 single-cell datasets, 12 species

https://www.ebi.ac.uk/gxa/sc/home

Interactive Analysis with Galaxy
https://humancellatlas.usegalaxy.eu/
Flexible
Interoperable
Scalable

Main Points
• Enabling rational choices when composing workflows
• Using a common exchange format as ‘workflow glue’
• Galaxy integrations

What people usually do...
Read → Filter → Normalise → Compare → Cluster → Markers

What we really should be doing

... but to do that we need interoperable components
Problem 1: components in different languages
Problem 2: need format glue!
Our solution
• Scripts layer (a CLI for each component)
• Environments & containers
• Workflows
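
The idea of CLI-style components joined by a common exchange format can be sketched as follows. This is only an illustration under stated assumptions: a JSON file stands in for whatever exchange format is actually used, each function mimics a command-line tool that takes an input path and an output path, and the step names, parameters and toy counts are hypothetical.

```python
import json
import os
import tempfile


def write_exchange(path, data):
    """Write the shared exchange file (JSON is a stand-in format here)."""
    with open(path, "w") as handle:
        json.dump(data, handle)


def read_exchange(path):
    with open(path) as handle:
        return json.load(handle)


def filter_step(in_path, out_path, min_total=5):
    """CLI-like step: keep cells whose total count is at least `min_total`."""
    data = read_exchange(in_path)
    data["cells"] = {c: row for c, row in data["cells"].items()
                     if sum(row) >= min_total}
    data["history"].append("filter")
    write_exchange(out_path, data)


def normalise_step(in_path, out_path, scale=100):
    """CLI-like step: rescale each cell so its counts sum to `scale`."""
    data = read_exchange(in_path)
    data["cells"] = {c: [v * scale / sum(row) for v in row]
                     for c, row in data["cells"].items()}
    data["history"].append("normalise")
    write_exchange(out_path, data)


def run_workflow(data, steps):
    """Chain steps; the only thing they share is the exchange file."""
    with tempfile.TemporaryDirectory() as workdir:
        current = os.path.join(workdir, "step_0.json")
        write_exchange(current, data)
        for i, step in enumerate(steps, start=1):
            nxt = os.path.join(workdir, f"step_{i}.json")
            step(current, nxt)
            current = nxt
        return read_exchange(current)


final = run_workflow(
    {"cells": {"a": [2, 6], "b": [0, 1]}, "history": []},
    [filter_step, normalise_step],
)
print(final)
```

Because every step only reads and writes the exchange file, any one of them could be replaced by a tool written in another language without touching the rest of the workflow.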

Galaxy integrations
• Extended Galaxy init container:
• Thin tool wrappers leveraging Bioconda wrappers
• Starting tertiary workflows
• Added logic for dynamic destinations
• Leverage existing Kubernetes integrations
• Improved LSF functionality for non-DRMAA clusters:
• Improved CLI executor
https://github.com/ebi-gene-expression-group/container-galaxy-sc-tertiary
Pablo Moreno
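
How the dynamic destination logic decides is not shown here. As a hypothetical sketch only, a rule of this kind might route a job by the total size of its inputs. The destination ids, the threshold and the function name below are all assumptions; a real Galaxy rule would be registered through the job configuration rather than called directly.

```python
# Hypothetical destination ids; a real deployment defines its own.
SMALL_DEST = "k8s_small"
LARGE_DEST = "k8s_large"


def choose_destination(input_sizes_bytes, threshold_bytes=1024 ** 3):
    """Pick a destination id from the total size of a job's inputs."""
    total = sum(input_sizes_bytes)
    return LARGE_DEST if total > threshold_bytes else SMALL_DEST


# One 10 MiB input stays on the small destination.
print(choose_destination([10 * 1024 ** 2]))
# Two inputs totalling 1400 MiB exceed the 1 GiB threshold.
print(choose_destination([800 * 1024 ** 2, 600 * 1024 ** 2]))
```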

Summary
• ArrayExpress/Annotare for data submissions
• Expression Atlas/Single Cell Expression Atlas
• Analysis Workflows in Galaxy

Open Targets
Data integration Platforms

Drug discovery
• Finding the right biological target for a
drug requires bioinformatics to:
• identify promising targets
• select candidate medicines.
• EMBL-EBI services support all stages
of drug discovery:
• Ensembl
• UniProt
• ChEMBL
• Protein Data Bank in Europe
• Reactome

Open Targets
www.opentargets.org
• Pinpointing the processes in the human body that have a demonstrable effect on disease
• Aims to improve the success rate in the discovery and repurposing of medicines
• A new kind of collaboration with:
• GSK
• EMBL-EBI
• Wellcome Sanger Institute
• Biogen
• Takeda
• Celgene
• Sanofi

Open Targets Platform and Open Targets Genetics
• Open Targets Platform: www.targetvalidation.org
• Open Targets Genetics: genetics.opentargets.org

Challenges for the near future
• Non-coding SNVs
• Data standardization to enable AI/ML
• Connecting data
• Moving to the cloud

www.ebi.ac.uk
Stay in touch
Twitter: @emblebi
Facebook: EMBLEBI
LinkedIn: /company/ebi
YouTube: EMBLMedia
