2. Outline
• IRIDA Project Overview
• IRIDA Design Principles
• Platform Overview
• Demo
• Extending IRIDA with the REST API
• Works-in-progress and future directions
• Discussion
3. Genome Canada Bioinformatics Competition: Large-Scale Project
“A Federated Bioinformatics Platform for
Public Health Microbial Genomics”
Our Goal
An open source, standards compliant, high quality, and user friendly
genomic epidemiology analysis platform based on web-technology to
support real-time (food-borne) disease outbreak investigations
www.IRIDA.ca
4. Core Functions
• Rapid processing of genomic sequence data
• Informative display of genomic data
• Sample, Case, and aggregate data (“metadata”)
Management
• Exporting of data for downstream analysis
Integrated Rapid Infectious Disease Analysis informatics platform
5. Target Audience
• Public health agencies and other organizations that
need a platform to manage and process large amounts
of pathogen genomic data
• Public health agencies that need a platform to use
genomics for outbreak investigations
• NML and BCCDC are our first testing centres
• Have engaged Australian and Scottish Public Health
Laboratories; other countries have also expressed interest
6. GP1: Empowering the end users
• We design IRIDA for the end-users!
• User-friendly interface and usage
• Users’ institutes are in control of the system
• Local or private-cloud installation to allow each
organization to host their own instance of IRIDA
• Built on top of a popular workflow engine: tools are
highly customizable for specific end users
• Users don’t have to “fork over” their data to use
IRIDA
7. Interviews with key personnel to identify barriers
to implementing genomic epidemiology in public
health agencies
8. GP2: Promoting Responsible Data
Sharing
• Data sharing needs to be seamless and painless
• Designed programming interface (API) to import
and export data between instances of IRIDA or
between IRIDA and 3rd party (authenticated)
applications
• End users are in control of what to share and when
to share
• Harmonized metadata formats to facilitate data
sharing
• Use of Ontology to map different usages to the same
concept and standardized term
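A minimal sketch of the idea in the last two bullets: different local usages are mapped to one standardized term before records are shared. The synonym table, field names, and terms are hypothetical and are not taken from IRIDA or from any real ontology.

```python
# Hypothetical synonym table: each locally used label maps to one
# standardized term. The terms are illustrative, not from a real ontology.
SYNONYMS = {
    "stool": "feces",
    "faeces": "feces",
    "fecal sample": "feces",
    "chicken": "poultry",
}


def standardize(label: str) -> str:
    """Return the standardized term for a label.

    Unknown labels are returned cleaned (lower case, single spaces)
    rather than dropped, so no information is lost.
    """
    key = " ".join(label.lower().split())
    return SYNONYMS.get(key, key)


def harmonize(record: dict, fields: tuple) -> dict:
    """Copy a metadata record, standardizing only the named fields."""
    return {k: standardize(v) if k in fields else v for k, v in record.items()}
```

Two labs that record "Stool" and "faeces" would both end up with "feces" in the shared record, so a query on the standardized term finds both.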
9. GP3: Accessible and Scalable
• Open-source development means anyone can help to
improve the platform and reduce redundant efforts
• Openness = transparent analysis pipeline and
comparable results
• Web technology = accessible from remote devices
• Well documented
• Installable on a single computer to a cluster of
computers
• VM available; cloud deployment is possible
• Free!
10. GP4: Secure and High Quality
• Use industry-standard authentication and
authorization protocols for access
• Built-in QC components (under testing)
• Full data and pipeline provenance to keep track of
tasks performed by end-users
• Versioned reference databases (under testing)
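One way to picture a provenance record, as a sketch only: each pipeline step stores the tool, its version, its parameters, and checksums of its inputs, and the whole run gets a stable identifier. The class and field names are assumptions made for illustration and do not describe how IRIDA stores provenance.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ProvenanceStep:
    """One executed pipeline step (all field names are illustrative)."""
    tool: str
    version: str
    parameters: dict
    input_checksums: list = field(default_factory=list)


def checksum(data: bytes) -> str:
    """SHA-256 digest that identifies the exact content of an input."""
    return hashlib.sha256(data).hexdigest()


def record_id(steps: list) -> str:
    """Stable identifier for a whole pipeline run.

    Serializing with sorted keys makes the ID independent of the order in
    which parameters were written, so two runs with identical tools,
    versions, parameters, and inputs get the same ID.
    """
    payload = json.dumps([asdict(s) for s in steps], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Changing a tool version or a reference database version changes the ID, which is what makes results comparable across runs.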
11. Genomics Analytic Ecosystem
Raw Genomics data
IRIDA backend: data management and routine analysis
Public Health Data
Raw Genomics data
Public Health Data
Geneious: ad-hoc analysis (interactive)
Galaxy: ad-hoc analysis (pipelined)
BioNumerics: PulseNet-specific (pipelined)
CDMart and other Marts
EDW (LADW)
Primary Data
Processed Data (ready for analysis) and routine / automated analysis
Analysis Tools
Dashboards (information display)
IRIDA frontend: Visualization
12. Components of Genomic Analysis
1: data generation
2: data storage and/or analysis platform
3: Public Health Data
4: Analysts
5: public data repository and other collaborating centres
Raw data
25GB-250GB (data transfer bottleneck)
Secure transfer
GSC?
NML?
13. IRIDA Project Phases
• Phase 1: genomics process and analysis pipeline to
produce categorical data (MLST and SNPs) suitable for
current epidemiological analysis – almost completed
• Phase 2: combine the categorical data with subset of
public health data (line list approach to replace current
Excel based approach) and export of categorical data to
CDMARTs – in progress
• Phase 3: Develop IRIDA as an exploratory platform for
new ways of interpreting genomics data in light of
epidemiological and clinical data – in progress;
continuous process beyond current project
14. Public Health Contextual Data
Integration into IRIDA is needed
• Genomic data requires special storage and analysis
considerations due to its size and complexity
• While we can export genomic typing results to
existing epi. analysis systems, by bringing
“contextual” info into IRIDA, we can come up with
more complex visualization and analysis tools (e.g.
GenGIS)
Genomics, Epidemiology, Clinical, Lab Data
15. Platform Overview
• Suite of tools to facilitate genomic epidemiology
IRIDA
Sequencing Instruments
Web Application
Data management
Built-in Analytical Tools
External Galaxy
Command-line Tools
16. Sequencing Instruments
• Typically MiSeq or NextSeq
• An easy to use batch uploader is available to send
data from MiSeq or a data-staging PC to IRIDA
• Uploader available on Molecular-PC (BCCDC Lab’s
data staging PC)
17. Platform Overview – data model
• Data model inspired by INSDC
• Makes data uploading to NCBI
easy
• Currently with limited metadata
(we will see in demo)
• Plan to extend data model
Project
Run
Metadata
Sample
Metadata
Sequencing
Data
Sample
Metadata
Sequencing
Data
Members
Metadata
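The boxes in this diagram can be read as a small containment hierarchy. The sketch below is one possible reading, assuming a project holds members and samples, a sample holds sequencing data, and each sequencing file points back to the run that produced it. The class and attribute names are hypothetical and are not IRIDA's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """A sequencing run and its metadata (e.g. instrument settings)."""
    name: str
    metadata: dict = field(default_factory=dict)


@dataclass
class SequencingData:
    """One sequence file, linked to the run that generated it."""
    file_name: str
    run: Run


@dataclass
class Sample:
    name: str
    metadata: dict = field(default_factory=dict)
    sequencing_data: list = field(default_factory=list)


@dataclass
class Project:
    """Container for samples, with its own members and metadata."""
    name: str
    metadata: dict = field(default_factory=dict)
    members: list = field(default_factory=list)
    samples: list = field(default_factory=list)

    def files(self) -> list:
        """All sequence file names reachable from this project."""
        return [d.file_name for s in self.samples for d in s.sequencing_data]
```

Keeping project, sample, and run as separate levels mirrors the INSDC-style layout mentioned above, which is what would make a later submission to NCBI a mostly mechanical mapping.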
18. Platform Overview
• System Architecture
IRIDA
Servlet Container
REST API
Central File Storage
Web Interface
Application Logic
Compute Cluster
Galaxy
Command-line tools
Galaxy
19. Demo Topics
• Getting data into IRIDA
• Organizing data
• Getting data out of IRIDA
• Analyzing data in IRIDA
20. Getting data into IRIDA
• Manual web interface upload
• Automated instrument upload (Illumina MiSeq)
22. Analyzing data in IRIDA
• Integrated analytical workflows
• Built-in Galaxy
• Assembly (SPAdes) and annotation (Prokka)
• Phylogenomics (SNVPhyl)
• Uses Galaxy in the back-end
• Extensible: if you can write a Galaxy workflow for a tool, it’s straightforward to integrate
into IRIDA.
23. Getting data out of IRIDA
• Sharing project data
• Downloading
• Export to external Galaxy instance
• Exporting to the command-line
24. Works-in-progress and Future
directions
• Ontology integration, line-list tool
• Quality control and Quality assurance in analytical workflows
• Robust metadata integration and management
• Simplification of SRA submission
25. Types of (Meta)Data Standardized Within IRIDA
• Lab Analytics: Genomics, PFGE, Serotyping, Phage typing, MLST, AMR
• Sample Metadata: Isolation Source (Food, Host Body Product, Environmental), BioSample
• Epidemiology Investigation: Exposures
• Clinical Data: Patient demographics, Medical History, Comorbidities, Symptoms, Health Status
• Reporting: Case/Investigation Status
“Not just what data IS collected, but what SHOULD be collected”
26. Types of (Meta)Data Standardized Within IRIDA
• Lab Analytics: Genomics, PFGE, Serotyping, Phage typing, MLST, Ribotyping
• Sample Metadata: Isolation Source (Food, Host Body Product, Environmental), BioSample
• Epidemiology Investigation: Exposures
• Clinical Data: Patient demographics, Medical History, Comorbidities, Symptoms, Health Status
• Reporting: Case/Investigation Status
“Not just what data IS collected, but what SHOULD be collected”
27. Improved Querying Using Genomic Epidemiology Application Ontology
1. Create a hierarchy of well-defined terms
(harmonized from different sources, for different
organisms)
2. Provide clearly-defined relationships between terms
3. Use OBO architecture
Water Related Exposure
Treated Untreated
Bottled Municipal Individual Pond River Lake
Transmission through ingestion or contact
28. Advantage of Using Ontology
• Flexible – allows more transparent integration
• Invisible to the User (but you’ll feel the
convenience and familiarity)
• With defined relationships, computer can be used
to assist reasoning (better querying and better
automation)
• Building on a large body of existing work (OBO) means
we can benefit from other people’s effort
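The "assist reasoning" point can be illustrated with a tiny term hierarchy: a query for a broad term also finds cases annotated with any narrower term. This is a sketch under assumptions. The term names come from the water-exposure example, but which leaf belongs under which branch is a guess, the ambiguous "Individual" leaf is left out, and the cases are mock data.

```python
# Child -> parent ("is a") links. Which leaf sits under which branch is a
# guess made for illustration; the slide does not spell out the grouping.
PARENT = {
    "Treated": "Water Related Exposure",
    "Untreated": "Water Related Exposure",
    "Bottled": "Treated",
    "Municipal": "Treated",
    "Pond": "Untreated",
    "River": "Untreated",
    "Lake": "Untreated",
}


def ancestors(term: str) -> list:
    """All broader terms above `term`, nearest first."""
    found = []
    while term in PARENT:
        term = PARENT[term]
        found.append(term)
    return found


def matches(term: str, query: str) -> bool:
    """True when `term` is the query term or one of its narrower terms."""
    return term == query or query in ancestors(term)


def query_cases(cases: list, query: str) -> list:
    """IDs of cases whose exposure falls under the query term."""
    return [c["id"] for c in cases if matches(c["exposure"], query)]
```

With cases annotated "River", "Municipal", and "Lake", a query for "Untreated" returns the first and third without anyone having typed the word "Untreated" into a record.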
29. The Line List: The Primary Tool for Epidemiological Investigations
• “Person, place, time”
• Exposure, food items, geographical information, symptoms, onset of symptoms
• Created (manually in Excel) on an ad hoc basis per investigation
• Needs to be shared between stakeholders, but data governance is an issue
• Data integration and ontology-based reasoning:
automated case definitions!
Integrating genomics and epidemiological data!
30. IRIDA Offers Line List Visualizations of Selectable Data!
1. Line List
View
2. Timeline
View
Hideable cases
Selectable fields
Travel
Symptoms and Onset
Exposure Types
Hospitalization
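A rough sketch of what "selectable fields" and "hideable cases" amount to in data terms: the line-list view keeps only the chosen columns for visible cases, and the timeline view orders the same cases by onset date. The cases below are mock data invented for illustration, and the function and field names are hypothetical.

```python
from datetime import date

# Mock cases, invented for illustration only.
CASES = [
    {"id": "C1", "onset": date(2015, 6, 3), "travel": "none",
     "exposure": "River", "hospitalized": False},
    {"id": "C2", "onset": date(2015, 6, 1), "travel": "none",
     "exposure": "Municipal", "hospitalized": True},
    {"id": "C3", "onset": date(2015, 6, 5), "travel": "abroad",
     "exposure": "Lake", "hospitalized": False},
]


def line_list(cases, fields, hidden=()):
    """Line-list view: one row per visible case, selected fields only."""
    return [
        {"id": c["id"], **{f: c[f] for f in fields}}
        for c in cases
        if c["id"] not in hidden
    ]


def timeline(cases, hidden=()):
    """Timeline view: visible cases ordered by symptom onset."""
    visible = [c for c in cases if c["id"] not in hidden]
    ordered = sorted(visible, key=lambda c: c["onset"])
    return [(c["onset"].isoformat(), c["id"]) for c in ordered]
```

Both views read from the same case records, so hiding a case or changing the selected fields never requires copying data between spreadsheets.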
31. Ontology - In Progress
• Create a smaller core (Lab, Epi exposure, and Food)
ontology for line-list testing
• Create a consortium of groups to take on different
domains of Genomic Epidemiology Application
Ontology
• Pursuing longer term funding for ontology
32. Workflow Quality Control Tool
Objectives:
• Develop a universal QC module for
various bioinformatics tools
• Provide generic text mining tools for
extracting key variables from
pipeline component log files.
• Make it easier to adjust pipeline QC
threshold parameters.
• Standardized rule engine with access
to many Python functions.
• Pathogen-specific configuration
settings.
• Galaxy tool or command line.
• E.g. quality control system for the
assembly, annotation, and SNP-calling
pipelines.
33. Workflow Quality Control Tool
Input: log files (datasets) + rule file (json)
Output: report file + optionally halt workflow
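The input/output contract can be sketched as follows: values are mined from a log with regular expressions, compared against thresholds from a JSON rule file, and the result is a report plus a flag saying whether the workflow should halt. The rule format, rule names, thresholds, and log layout are all assumptions for illustration; the actual tool's rule syntax is not shown on these slides.

```python
import json
import operator
import re

# Hypothetical rule file content. A pathogen-specific configuration would
# simply be a different file with different thresholds.
RULES_JSON = """
[
  {"name": "n50", "pattern": "N50: *([0-9]+)", "op": ">=",
   "threshold": 50000, "on_fail": "halt"},
  {"name": "contigs", "pattern": "contigs: *([0-9]+)", "op": "<=",
   "threshold": 200, "on_fail": "warn"}
]
"""

OPS = {">=": operator.ge, "<=": operator.le}


def evaluate(log_text: str, rules: list):
    """Apply each rule to the log text; return (report, halt)."""
    report, halt = [], False
    for rule in rules:
        match = re.search(rule["pattern"], log_text)
        if match is None:
            status = "missing"  # the variable was not found in the log
        else:
            value = float(match.group(1))
            passed = OPS[rule["op"]](value, rule["threshold"])
            status = "pass" if passed else "fail"
        if status != "pass" and rule["on_fail"] == "halt":
            halt = True
        report.append({"rule": rule["name"], "status": status})
    return report, halt
```

For example, `evaluate(log_text, json.loads(RULES_JSON))` on a log containing "N50: 42000" and "contigs: 150" reports a failed N50 rule and asks for a halt, while the contig-count rule passes.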
35. External applications
• Authorized external applications can connect to
IRIDA to obtain data seamlessly via REST API
• Example: GenGIS can extract sample phylogeny
and geographic data from IRIDA to generate a map
that shows the phylogenetic and geographic
information associated with outbreak isolates
• Data can also be output to Dashboard applications
for real-time queries
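From the point of view of an external application, access looks roughly like the sketch below: a request carries an OAuth2 bearer token, and the JSON response is turned into the values the application needs (here, sample names and coordinates, as a GenGIS-style client might want). The host, path, and response layout are placeholders invented for illustration and are not IRIDA's documented endpoints.

```python
import json
import urllib.request

# Hypothetical values: the host is a placeholder, and the response layout
# assumed by sample_locations() is invented for illustration.
BASE_URL = "https://irida.example.org/api"


def authorized_request(path: str, token: str) -> urllib.request.Request:
    """Build a GET request that carries an OAuth2 bearer token."""
    return urllib.request.Request(
        BASE_URL + path,
        headers={
            "Authorization": "Bearer " + token,
            "Accept": "application/json",
        },
        method="GET",
    )


def sample_locations(payload: str) -> list:
    """Extract (name, latitude, longitude) tuples from a JSON response."""
    doc = json.loads(payload)
    return [(s["name"], s["latitude"], s["longitude"]) for s in doc["samples"]]
```

Passing the request to `urllib.request.urlopen` would send it; the token itself would be obtained beforehand through the server's OAuth2 authorization flow, so the external application never handles the user's password.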
38. Comments and Feedback on
IRIDA
• What existing features do you like?
• What existing features don’t you like?
• What features do you want to see soon?
• What’s needed before you will use the tool?
• What features do you want to see eventually?
• Longer term functionality
39. Discussion
• Infrastructure
• Short-term (Jenn’s cluster)
• Long-term?
• Network connectivity (speed and security)
• Access to Metadata (Epi and Lab)
• Sustainability
• Currently supported by a Genome Canada grant (expiring
June 2016)
• NML is committed to keeping core development going
• Buy-In from BCCDC (customization and maintenance of the
platform)
40. Genomics Analytic Platforms
Raw Genomics data
IRIDA backend: data management and routine analysis
Public Health Data
Raw Genomics data
Public Health Data
Geneious: ad-hoc analysis (interactive)
Galaxy: ad-hoc analysis (pipelined)
BioNumerics: PulseNet-specific (pipelined)
CDMart and other Marts
EDW (LADW)
Primary Data
Processed Data (ready for analysis) and routine / automated analysis
Analysis Tools
Dashboards (information display)
IRIDA frontend: Visualization
41. Contact
• Project Information: http://www.irida.ca
• Project source:
• https://github.com/phac-nml/irida
• https://github.com/phac-nml/irida-miseq-uploader
• https://github.com/phac-nml/irida-galaxy-importer
• Documentation: https://irida.corefacility.ca/documentation/
• E-mail: IRIDA-mail@sfu.ca
• IRC: #irida on irc.freenode.net
Many slides provided by Franklin Bristow (NML), Alex Keddy and Rob
Beiko (Dalhousie U.), Melanie Courtot, Emma Griffiths, and Damion
Dooley
Platform Overview
• Data structure
• System Architecture
Getting data into IRIDA
• Web interface
• From Illumina MiSeq instruments
Organizing data
• Projects as containers
• Managing samples (moving, copying, merging)
Getting data out of IRIDA
• Sharing data (permissions, sharing project data)
• Downloading data
• Exporting: to Galaxy and the command-line
Analyzing data in IRIDA
• Assembly and annotation, and core SNV Phylogenomics pipelines
Extending IRIDA with the REST API
• Data model exposed over HTTP
• OAuth2 authorization
• Example: GenGIS
Works-in-progress and future directions for IRIDA
• Ontology integration
• Epi line-list
• QA/QC
IRIDA was conceived about 2 years ago through a Genome Canada Bioinformatics Grant. It is an effort to build an open source, standards compliant, high quality genomic epidemiology analysis platform to support real-time disease outbreak investigations, initially focused on food-borne illnesses
Extending more interdisciplinary data integration further…
Both at NML and BCCDC
Built the foundation and now ready for more engagement
No manual copying and pasting into spreadsheets between programs
We have a standardized way to allow other programs to access data from IRIDA securely
So where do I see IRIDA sitting in the genomics analytics ecosystem?
What is IRIDA?
Sequencing Instruments
Java Web application
REST API
User interface
Central file storage area
Internal Galaxy
External Galaxy
Command-line tools
Emphasize that it’s a mock data set. Constructed from real sequences but created with “user generated” geographic locations. Locations were faked inside the IRIDA instance, and not after retrieval.