1. IPRStats: a Visualization Tool for InterProScan
Iddo Friedberg
Microbiology and
Computer Science & Software Engineering
Miami University
http://github.com/devrkel/IPRStats.git
2. Microbes are Everywhere
● 10^30 prokaryotic cells on Earth (give or take a couple)
● Dominate the biosphere
● 90% of the cells in your body are prokaryotic (10^14)
● Found in the most hostile
environments
3. Microbes do (almost) Everything
● Nutrient reservoir:
– 4x10^10 tons carbon (rivaling plants)
– 1x10^10 tons nitrogen
– 1x10^9 tons phosphorus
4. Of course there is health...
● Communicable
diseases
● Heart disease
● Gastric cancer
● Irritable Bowel
Syndrome
11. What is Metagenomics?
• Culture independent approach to study
microbial communities
– < 1% of microbes can be cultured
– DNA directly isolated from environmental sample
and sequenced
• Examining genomic content of organisms in
community/environment to better understand:
– Diversity of organisms
– Their roles and interactions in the ecosystem
13. Some things we can learn using Metagenomics
• Taxonomic content: taxon diversity in a habitat (using taxonomic markers)
• Functional content: biological functions, qualitative and quantitative
profiles
• Coping with the environment: differences in functional content
between habitats
• Decompose the biotic / abiotic elements in a habitat: metadata
analysis
16. A Metagenomic project
● Sequencing
● Assembly
● Annotation
– Gene finding
– Function prediction
● Population analysis tools
– Diversity analysis
– Comparative analysis
17. InterProScan
● Signature search against an
integrated resource of domains
and functional sites
● Easy to install, cluster-enabled
(pleasantly parallel)
● Maintained by EBI
● Can annotate whole genomes
● PIR, Pfam, TIGRFam, Panther,
Prodom, PRINTS,...
● Needs a visualization tool for
population / metagenomic
annotation
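18. IPRStats overview
[Figure: IPRStats workflow and main window]
● Diagram labels: open XML file, Python SAX parser, full databases, aggregate queries, resulting tables, charting
● GUI: wxPython
● Excel export: xlwt
● Main window (menus: File, Help) listing the member databases: PFAM, PIR, GENE3D, HAMAP, PANTHER, PRINTS, PRODOM, PROFILE, PROSITE, SMART, SUPERFAMILY, TIGRFAMs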
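The SAX parsing step can be sketched as follows. This is a minimal illustration, not the IPRStats code: it assumes the InterProScan raw XML nests `<match>` elements (with `id` and `dbname` attributes) inside `<protein>` elements, and the sample document, the `MatchCounter` class and the `count_matches` helper are all made up for the example.

```python
import xml.sax
from collections import Counter

# Made-up sample in the ASSUMED InterProScan raw XML layout.
SAMPLE_XML = b"""<?xml version="1.0"?>
<interpro_matches>
  <protein id="seq1" length="320">
    <interpro id="IPR000001" name="Example domain" type="Domain">
      <match id="PF00001" name="example_a" dbname="PFAM">
        <location start="10" end="120" status="T"/>
      </match>
      <match id="SM00001" name="example_b" dbname="SMART">
        <location start="15" end="118" status="T"/>
      </match>
    </interpro>
  </protein>
  <protein id="seq2" length="210">
    <interpro id="IPR000001" name="Example domain" type="Domain">
      <match id="PF00001" name="example_a" dbname="PFAM">
        <location start="5" end="100" status="T"/>
      </match>
    </interpro>
  </protein>
</interpro_matches>
"""


class MatchCounter(xml.sax.ContentHandler):
    """Hypothetical handler: tallies matches per database and per signature."""

    def __init__(self):
        super().__init__()
        self.by_database = Counter()
        self.by_signature = Counter()

    def startElement(self, name, attrs):
        # SAX streams one element at a time, so the whole file
        # never has to be held in memory.
        if name == "match":
            db = attrs.get("dbname", "UNKNOWN")
            self.by_database[db] += 1
            self.by_signature[(db, attrs.get("id", "?"))] += 1


def count_matches(xml_bytes):
    """Parse the XML bytes and return the populated handler."""
    handler = MatchCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler


if __name__ == "__main__":
    result = count_matches(SAMPLE_XML)
    for db, hits in sorted(result.by_database.items()):
        print(db, hits)
```

A streaming parser suits population-scale annotation files because memory use depends on the number of distinct signatures, not on the size of the input.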
19. IPRStats Architecture
[Figure: architecture diagram]
● IPRStats standalone (wx.Frame)
● Menu (wx.MenuBar)
● PropertiesDlg (wx.Dialog)
● Settings
● Chart (wx.StaticBitmap)
● Table (wx.PyGridTableBase)
● importers: XML, IPS
● exporters: HTML, XLS (using xlwt)
● StatsData
● Results (sqlite or pytables)
● IPS
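For the sqlite option of the Results store, an aggregate query feeding the chart and table might look like the sketch below. The `matches` schema, the column names and both helper functions are assumptions made for illustration; the slides do not show the actual IPRStats schema.

```python
import sqlite3


def build_results(rows):
    """Load (protein, dbname, signature, name) tuples into an in-memory table.

    The schema is hypothetical, chosen only for this example.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE matches ("
        "protein TEXT, dbname TEXT, signature TEXT, name TEXT)"
    )
    conn.executemany("INSERT INTO matches VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def top_signatures(conn, dbname, limit=10):
    """Return (signature, name, hits) for one member database, most hits first."""
    cur = conn.execute(
        "SELECT signature, name, COUNT(*) AS hits "
        "FROM matches WHERE dbname = ? "
        "GROUP BY signature, name "
        "ORDER BY hits DESC, signature ASC "
        "LIMIT ?",
        (dbname, limit),
    )
    return cur.fetchall()


# Made-up rows standing in for parsed InterProScan matches.
ROWS = [
    ("seq1", "PFAM", "PF00001", "example_a"),
    ("seq2", "PFAM", "PF00001", "example_a"),
    ("seq2", "PFAM", "PF00002", "example_c"),
    ("seq1", "SMART", "SM00001", "example_b"),
]

conn = build_results(ROWS)

if __name__ == "__main__":
    for signature, name, hits in top_signatures(conn, "PFAM"):
        print(signature, name, hits)
```

Each returned row maps directly onto one table line or one chart slice, so the same query could serve both views.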
20. What is PyTables?
- A package for creating data structures that can handle large amounts of data
- Uses NumPy (for in-memory) and HDF5 (for on-disk) structures
- Uses Numexpr (a JIT compiler) for evaluating expressions (such as queries)
- In the context of IPRScan, it provides a way of accessing a huge table of data without requiring that all the data be in memory
Pros
- HDF5 provides very fast, compact and efficient indexing
- NumPy provides efficient in-memory storage
- Minimizes disk and memory usage
- Very fast read times compared to SQLite and MySQL
Cons
- Large memory overhead (particularly in comparison to smaller datasets)
- Many large, complex dependencies including HDF5, NumPy, Numexpr and Cython
- Slow write times (particularly important since IPRStats bottlenecks with writing)
24. Conclusions & Future
● A lightweight, machine-independent
visualization tool for InterProScan annotations
● License: AFL
● Todo:
● Comparative population analysis
● Large dataset handling
● More graphic options
● Anything else you like...
– http://github.com/devrkel/IPRStats.git
25. Thanks
● David Ream
● Han Wang
● Ian Fleming
● David Vincent
● Ryan Kelly
● EBI
● Miami University startup funding
● Miami University Undergraduate Summer Scholars
Program
26. The Friedberg Lab is Recruiting
● Graduate students
● Postdocs
● Catch me later, email me, or look at
iddo-friedberg.net to learn more