Irida bccdc dec10_2015

Canada’s Integrated Rapid Infectious Disease Analysis Platform for
Genomic Epidemiology

Outline
• IRIDA Project Overview
• IRIDA Design Principles
• Platform Overview
• Demo
• Extending IRIDA with the REST API
• Works-in-progress and future directions
• Discussion

Genome Canada Bioinformatics Competition: Large-Scale Project
“A Federated Bioinformatics Platform for
Public Health Microbial Genomics”
Our Goal
An open-source, standards-compliant, high-quality, and user-friendly
genomic epidemiology analysis platform based on web technology to
support real-time (food-borne) disease outbreak investigations
www.IRIDA.ca

Core Functions
• Rapid processing of genomic sequence data
• Informative display of genomic data
• Sample, case, and aggregate data (“metadata”) management
• Exporting of data for downstream analysis
Integrated Rapid Infectious Disease Analysis informatics platform

Target Audience
• Public health agencies and other organizations that
need a platform to manage and process large amounts
of pathogen genomic data
• Public health agencies who need a platform to use
genomics for outbreak investigations
• NML and BCCDC are our first testing centres
• Have engaged Australian and Scottish public health
laboratories; other countries have also expressed interest

GP1: Empowering the end users
• We design IRIDA for the end-users!
• User-friendly interface and usage
• Users’ institutes are in control of the system
• Local or private-cloud installation to allow each
organization to host their own instance of IRIDA
• Built on top of a popular workflow engine: tools are
highly customizable for specific end users
• Users don’t have to “fork over” their data to use
IRIDA

Interviews with key personnel to identify barriers
to implementing genomic epidemiology in public
health agencies

GP2: Promoting Responsible Data
Sharing
• Data sharing needs to be seamless and painless
• Designed programming interface (API) to import
and export data between instances of IRIDA or
between IRIDA and 3rd party (authenticated)
applications
• End users are in control of what to share and when
to share
• Harmonized metadata formats to facilitate data
sharing
• Use of Ontology to map different usages to the same
concept and standardized term

GP3: Accessible and Scalable
• Open-source development means anyone can help to
improve the platform and reduce redundant efforts
• Openness = transparent analysis pipeline and
comparable results
• Web technology = accessible from remote devices
• Well documented
• Installable on a single computer to a cluster of
computers
• VM available; deployment on a cloud is possible
• Free!

GP4: Secure and High Quality
• Use industry-standard authentication and
authorization protocols for access
• Built-in QC components (under testing)
• Full data and pipeline provenance to keep track of
tasks performed by end-users
• Versioned reference databases (under testing)

Genomics Analytic Ecosystem
Raw Genomics data
IRIDA backend: data management and routine analysis
Public Health Data
Raw Genomics data
Public Health Data
Geneious: ad-hoc analysis (interactive)
Galaxy: ad-hoc analysis (pipelined)
Bionumerics: PulseNet specific (pipelined)
CDMart and other Marts
EDW (LADW)
Primary Data
Processed Data (ready for analysis) and routine / automated analysis
Analysis Tools
Dashboards (information display)
IRIDA frontend: Visualization

Components of Genomic Analysis
1: Data generation
2: Data storage and/or analysis platform
3: Public Health Data
4: Analysts
5: Public data repository and other collaborating centres
Raw data
25GB-250GB (data transfer bottleneck)
Secure transfer
GSC?
NML?

IRIDA Project Phases
• Phase 1: genomics process and analysis pipeline to
produce categorical data (MLST and SNPs) suitable for
current epidemiological analysis – almost completed
• Phase 2: combine the categorical data with a subset of
public health data (line-list approach to replace the current
Excel-based approach) and export categorical data to
CDMARTs – in progress
• Phase 3: Develop IRIDA as an exploratory platform for
new ways of interpreting genomics data in light of
epidemiological and clinical data – in progress;
continuous process beyond current project

Public Health Contextual Data
Integration into IRIDA is needed
• Genomic data requires special storage and analysis
considerations due to its size and complexity
• While we can export genomic typing results to
existing epi. analysis systems, by bringing
“contextual” info into IRIDA, we can come up with
more complex visualization and analysis tools (e.g.
GenGIS)
Genomics, Epidemiology, Clinical, Lab Data

Platform Overview
• Suite of tools to facilitate genomic epidemiology
IRIDA
Sequencing Instruments
Web Application
Data Management
Built-in Analytical Tools
External Galaxy
Command-line Tools

Sequence Instruments
• Typically MiSeq or NextSeq
• An easy to use batch uploader is available to send
data from MiSeq or a data-staging PC to IRIDA
• Uploader available on Molecular-PC (BCCDC Lab’s
data staging PC)

Platform Overview – data model
• Data model inspired by INSDC
• Makes data uploading to NCBI
easy
• Currently with limited metadata
(we will see in demo)
• Plan to extend data model
Project (Members, Metadata)
Run (Metadata)
Sample (Metadata, Sequencing Data)
Sample (Metadata, Sequencing Data)
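
The data model above can be sketched in a few lines of Python. This is only an illustration, assuming a project that holds members, metadata, and samples, with each sample carrying its own metadata and sequencing files. The class names, field names, and example values are hypothetical and are not IRIDA's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class SequencingData:
    # Path to one sequence file, plus a hypothetical run identifier.
    file_path: str
    run_id: str


@dataclass
class Sample:
    name: str
    metadata: dict = field(default_factory=dict)
    sequencing_data: list = field(default_factory=list)


@dataclass
class Project:
    name: str
    members: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    samples: list = field(default_factory=list)

    def files_for(self, sample_name):
        """Return the sequence file paths attached to one sample."""
        for sample in self.samples:
            if sample.name == sample_name:
                return [d.file_path for d in sample.sequencing_data]
        return []


# Build a small project with one sample and a pair of sequence files.
project = Project(name="Outbreak-2015", members=["analyst1"])
sample = Sample(name="S1", metadata={"organism": "Salmonella enterica"})
sample.sequencing_data.append(SequencingData("S1_R1.fastq", run_id="run-01"))
sample.sequencing_data.append(SequencingData("S1_R2.fastq", run_id="run-01"))
project.samples.append(sample)
```

Keeping sequencing data under the sample, and samples under the project, mirrors the nesting shown on the slide.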

Platform Overview
• System Architecture
IRIDA Servlet Container
REST API
Web Interface
Application Logic
Central File Storage
Galaxy
Compute Cluster
Command line
Galaxy

Demo Topics
• Getting data into IRIDA
• Organizing data
• Getting data out of IRIDA
• Analyzing data in IRIDA

Getting data into IRIDA
• Manual web interface upload
• Automated instrument upload (Illumina MiSeq)

Organizing Data
• Sample management actions
• Creating
• Editing metadata
• Copy/move samples

Analyzing data in IRIDA
• Integrated analytical workflows
• Built-in Galaxy
• Assembly (SPAdes) and annotation (Prokka)
• Phylogenomics (SNVPhyl)
• Uses Galaxy in the back-end
• Extendable: if you can write a Galaxy workflow for a tool, it’s straightforward to integrate
into IRIDA.

Getting data out of IRIDA
• Sharing project data
• Downloading
• Export to external Galaxy instance
• Exporting to the command-line

Works-in-progress and Future
directions
• Ontology integration, line-list tool
• Quality control and Quality assurance in analytical workflows
• Robust metadata integration and management
• Simplification of SRA submission

Types of (Meta)Data Standardized Within IRIDA
Lab Analytics
Genomics, PFGE
Serotyping, Phage typing
MLST, AMR
Sample Metadata
Isolation Source (Food, Host
Body Product,
Environmental), BioSample
Epidemiology Investigation
Exposures
Clinical Data
Patient demographics, Medical
History, Comorbidities,
Symptoms, Health Status
Reporting
Case/Investigation Status
“Not just what data IS collected, but what SHOULD be collected”

Types of (Meta)Data Standardized Within IRIDA
Lab Analytics
Genomics, PFGE
Serotyping, Phage typing
MLST, Ribotyping
Sample Metadata
Isolation Source (Food, Host
Body Product,
Environmental), BioSample
Epidemiology Investigation
Exposures
Clinical Data
Patient demographics, Medical
History, Comorbidities,
Symptoms, Health Status
Reporting
Case/Investigation Status
“Not just what data IS collected, but what SHOULD be collected”

Improved Querying Using Genomic Epidemiology Application Ontology
1. Create a hierarchy of well-defined terms
(harmonized from different sources, for different
organisms)
2. Provide clearly-defined relationships between terms
3. Use OBO architecture
Water Related Exposure
Treated Untreated
Bottled Municipal Individual Pond River Lake
Transmission through ingestion or contact
Transmission through ingestion or contact
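
To illustrate how a term hierarchy improves querying, here is a minimal sketch that stores child-to-parent links and answers "is this exposure a kind of X?". The placement of the leaf terms under "Treated" and "Untreated" is an assumption (the slide layout does not make it explicit, and "Individual" is left out for that reason), and the case records are invented for the example.

```python
# Child -> parent links. Which leaves sit under "Treated" versus "Untreated"
# is an assumption made for this example.
PARENT = {
    "Treated": "Water Related Exposure",
    "Untreated": "Water Related Exposure",
    "Bottled": "Treated",
    "Municipal": "Treated",
    "Pond": "Untreated",
    "River": "Untreated",
    "Lake": "Untreated",
}


def is_a(term, ancestor):
    """Return True if term equals ancestor or descends from it."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)  # walk one level up; None at the root
    return False


# Hypothetical case records, each tagged with its most specific exposure term.
cases = [
    {"case_id": "C1", "exposure": "River"},
    {"case_id": "C2", "exposure": "Bottled"},
    {"case_id": "C3", "exposure": "Lake"},
]

# A query on a broad term also picks up cases recorded with narrower terms.
untreated = [c["case_id"] for c in cases if is_a(c["exposure"], "Untreated")]
any_water = [c["case_id"] for c in cases
             if is_a(c["exposure"], "Water Related Exposure")]
```

With flat free-text fields, the "Untreated" query would miss cases recorded as "River" or "Lake"; the defined relationships are what make the broader query possible.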

Advantage of Using Ontology
• Flexible – allow more transparent integration
• Invisible to the User (but you’ll feel the
convenience and familiarity)
• With defined relationships, computer can be used
to assist reasoning (better querying and better
automation)
• Build on a large body of existing work (OBO) means
we can benefit from other people’s effort

The Line List: The Primary Tool for Epidemiological Investigations
• “Person, place, time”
• Exposure, food items, geographical information, symptoms, onset of symptoms
• Created (manually in Excel) on an ad hoc basis per investigation
• Needs to be shared between stakeholders, but data governance is an issue
• Data integration and ontology-based reasoning → automated case definitions!
Integrating genomics and epidemiological data!

IRIDA Offers Line List Visualizations of Selectable Data!
1. Line List View
2. Timeline View
Hideable cases
Selectable fields
Travel
Symptoms and Onset
Exposure Types
Hospitalization
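
The idea of selectable fields and hideable cases can be sketched as two small functions over a list of case records. This is only an illustration of the concept, not IRIDA's implementation; the field names and case data are hypothetical.

```python
# Hypothetical case records; onset dates use ISO format so they sort as text.
cases = [
    {"case_id": "C1", "onset": "2015-11-05", "travel": "None",
     "exposure": "River", "hospitalized": False},
    {"case_id": "C2", "onset": "2015-11-02", "travel": "Nepal",
     "exposure": "Municipal", "hospitalized": True},
    {"case_id": "C3", "onset": "2015-11-09", "travel": "None",
     "exposure": "Lake", "hospitalized": False},
]


def line_list(cases, fields, hidden_ids=()):
    """Line-list view: keep only the selected fields and drop hidden cases."""
    hidden = set(hidden_ids)
    return [{f: c.get(f) for f in fields}
            for c in cases if c["case_id"] not in hidden]


def timeline(cases, hidden_ids=()):
    """Timeline view: visible case IDs ordered by onset of symptoms."""
    hidden = set(hidden_ids)
    visible = [c for c in cases if c["case_id"] not in hidden]
    return [c["case_id"] for c in sorted(visible, key=lambda c: c["onset"])]


rows = line_list(cases, fields=["case_id", "exposure"], hidden_ids=["C2"])
order = timeline(cases)
```

Both views read from the same records, so hiding a case or selecting a field changes what is displayed without touching the underlying data.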

Ontology - In Progress
• Create a smaller core (Lab, Epi exposure, and Food)
ontology for line-list testing
• Create a consortium so that groups can take on different
domains of the Genomic Epidemiology Application
Ontology
• Pursuing longer term funding for ontology

Workflow Quality Control Tool
Objectives:
• Develop a universal QC module for
various bioinformatics tools
• Provide generic text mining tools for
extracting key variables from
pipeline component log files.
• Make it easier to adjust pipeline QC
threshold parameters.
• Standardized rule engine with access
to many Python functions.
• Pathogen-specific configuration
settings.
• Galaxy tool or command line.
• E.g. quality control system for the
assembly, annotation, and SNP-calling pipelines.

Input: log files (datasets) + rule file (JSON)
Output: report file + optionally halt workflow

Simple workflow add-on:
Complex rule capability:
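
The objectives above (extract key variables from log files, compare them with adjustable thresholds from a JSON rule file, produce a report, and optionally halt the workflow) can be sketched as follows. The rule-file layout, the variable names, and the log text are all hypothetical; the real tool's format is not shown on the slide.

```python
import json
import re

# Hypothetical rule file: each rule names a variable, a regular expression that
# captures its value from a log, a comparison, a threshold, and whether a
# failure should halt the workflow.
RULES_JSON = r"""
[
  {"name": "n50", "pattern": "N50:\\s*(\\d+)", "op": ">=",
   "threshold": 50000, "halt": true},
  {"name": "contigs", "pattern": "contigs:\\s*(\\d+)", "op": "<=",
   "threshold": 200, "halt": false}
]
"""

OPS = {">=": lambda a, b: a >= b, "<=": lambda a, b: a <= b}


def evaluate(log_text, rules):
    """Apply each rule to the log text and return (report, halt)."""
    report = []
    halt = False
    for rule in rules:
        match = re.search(rule["pattern"], log_text)
        if match is None:
            value, passed = None, False  # a missing value counts as a failure
        else:
            value = float(match.group(1))
            passed = OPS[rule["op"]](value, rule["threshold"])
        report.append({"name": rule["name"], "value": value, "passed": passed})
        if not passed and rule["halt"]:
            halt = True
    return report, halt


# Invented assembler log for illustration.
log = "Assembly finished\nN50: 41230\ncontigs: 152\n"
report, halt = evaluate(log, json.loads(RULES_JSON))
```

Because thresholds live in the rule file rather than in the code, a pathogen-specific configuration would only need a different JSON file.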

External applications
• Authorized external applications can connect to
IRIDA to obtain data seamlessly via REST API
• Example: GenGIS can extract sample phylogeny
and geographic data from IRIDA to generate a map
that shows the phylogenetic and geographic
information associated with outbreak isolates
• Data can also be output to Dashboard applications
for real-time queries

2011 Cholera Outbreak
Red = Haiti Blue = Nepal

2011 Cholera Cartogram
Red = Haiti Blue = Nepal

Comments and Feedback on
IRIDA
• What existing features do you like?
• What existing features don’t you like?
• What features do you want to see soon?
• What’s needed before you will use the tool?
• What features do you want to see eventually?
• Longer term functionality

Discussion
• Infrastructure
• Short-term (Jenn’s cluster)
• Long-term?
• Network connectivity (speed and security)
• Access to Metadata (Epi and Lab)
• Sustainability
• Currently supported by a Genome Canada grant (expiring
June 2016)
• NML has committed to keeping core development going
• Buy-In from BCCDC (customization and maintenance of the
platform)

Genomics Analytic Platforms
Raw Genomics data
IRIDA backend: data management and routine analysis
Public Health Data
Raw Genomics data
Public Health Data
Geneious: ad-hoc analysis (interactive)
Galaxy: ad-hoc analysis (pipelined)
Bionumerics: PulseNet specific (pipelined)
CDMart and other Marts
EDW (LADW)
Primary Data
Processed Data (ready for analysis) and routine / automated analysis
Analysis Tools
Dashboards (information display)
IRIDA frontend: Visualization

Contact
• Project Information: http://www.irida.ca
• Project source:
• https://github.com/phac-nml/irida
• https://github.com/phac-nml/irida-miseq-uploader
• https://github.com/phac-nml/irida-galaxy-importer
• Documentation: https://irida.corefacility.ca/documentation/
• E-mail: IRIDA-mail@sfu.ca
• IRC: #irida on irc.freenode.net
Many slides provided by Franklin Bristow (NML), Alex Keddy and Rob
Beiko (Dalhousie U.), Melanie Courtot, Emma Griffiths, and Damion
Dooley

Extending IRIDA with the REST API
• OAuth2 authorization (industry standard)
• HTTP API
• Examples:
• External Galaxy importer tool
• Command-line linker
• GenGIS
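
As a rough sketch of how an external application might talk to an OAuth2-protected HTTP API, the example below builds a token request and an authenticated data request with Python's standard library. The host name, endpoint paths, grant type, client ID, and credentials are all assumptions made for illustration; the IRIDA documentation is the authority on the actual endpoints and supported grant types.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://irida.example.org"  # hypothetical IRIDA instance


def build_token_request(base_url, client_id, client_secret, username, password):
    """Build (but do not send) an OAuth2 password-grant token request."""
    form = urllib.parse.urlencode({
        "grant_type": "password",  # assumed grant type
        "client_id": client_id,
        "client_secret": client_secret,
        "username": username,
        "password": password,
    }).encode("ascii")
    # The token endpoint path is an assumption.
    return urllib.request.Request(base_url + "/api/oauth/token",
                                  data=form, method="POST")


def build_api_request(base_url, path, access_token):
    """Build a GET request that carries the bearer token."""
    return urllib.request.Request(
        base_url + path,
        headers={"Authorization": "Bearer " + access_token,
                 "Accept": "application/json"},
    )


def fetch_json(request):
    """Send a request and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


token_request = build_token_request(BASE_URL, "my-client", "secret",
                                    "analyst", "pw")
projects_request = build_api_request(BASE_URL, "/api/projects", "TOKEN")
```

In use, the access token returned by `fetch_json(token_request)` would replace the `"TOKEN"` placeholder before the data request is sent.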

Retrieving IRIDA data through REST API
GenGIS: kiwi.cs.dal.ca/GenGIS

Initial Data View
(mock data set retrieved from IRIDA)

Geographic Locations
Required fields: Site ID, Latitude, Longitude

Individual Samples
Site IDs keyed to locations.
Many Sequence IDs (= multiple samples) can key to a single Site ID.

Phylogenetic Tree
Tree leaf IDs keyed to samples.
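
The keying described on these slides (tree leaves keyed to samples, samples keyed to sites, sites carrying latitude and longitude) can be checked with a short sketch like the one below. The IDs and coordinates are invented, and the dictionary layout is an assumption for illustration rather than the GenGIS file format.

```python
# Site ID -> (latitude, longitude); the three fields a location requires.
locations = {
    "site1": (49.26, -123.12),
    "site2": (44.64, -63.57),
}

# Sequence ID -> Site ID; several sequences may share one site.
samples = {"seq1": "site1", "seq2": "site1", "seq3": "site2"}

# Leaf IDs of the phylogenetic tree; each should match a sequence ID.
tree_leaves = ["seq1", "seq2", "seq3"]


def check_keys(locations, samples, tree_leaves):
    """Return a list of messages describing broken key relationships."""
    problems = []
    for seq_id, site_id in samples.items():
        if site_id not in locations:
            problems.append(f"sample {seq_id} refers to unknown site {site_id}")
    for leaf in tree_leaves:
        if leaf not in samples:
            problems.append(f"tree leaf {leaf} has no matching sample")
    return problems


def leaf_coordinates(locations, samples, tree_leaves):
    """Map each tree leaf to the coordinates of its sample's site."""
    return {leaf: locations[samples[leaf]] for leaf in tree_leaves}


problems = check_keys(locations, samples, tree_leaves)
coords = leaf_coordinates(locations, samples, tree_leaves)
```

Running the check before plotting catches a leaf or sample that cannot be placed on the map.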

Geographically Coupled Phylogenetic
Distance (GCPD)
