Towards a brokering framework for knowledge-based services: learning from the Pistoia Alliance SESL pilot
Ian Harrow PhD for the Pistoia Alliance
This presentation describes a pilot project to determine the feasibility of biomedical knowledge brokering. It demonstrates querying across multiple disparate data sources through a brokering demonstrator built on RDF triple store technology. The lessons from this pilot are contributing to larger-scale projects such as the Innovative Medicines Initiative project Open PHACTS.
Pistoia Alliance SESL pilot Bio IT World Hanover 12 Oct 2011
1. Towards a brokering framework for knowledge-based services: Learning from the Pistoia Alliance SESL pilot
Ian Harrow, PhD
Co-Leader of Pistoia Alliance SESL pilot (ex-Pfizer)
Founder, Director & Principal Consultant at Ian Harrow Consulting Ltd
Bio IT World, Hanover, October 2011
http://pistoiaalliance.org
2. Outline
• Industry Drivers
• Mission and Strategy of Pistoia
• Vision for the SESL pilot
• Minimal configuration to test a
brokering service
• Public demonstrator and standards
• Deliverables achieved by SESL pilot
• Learning and future direction
3. What is Core to your Business? What is Critical?
[Figure: 2x2 matrix with axes "Core?" and "Critical?" and a timeline from 1990 to 2012. Quadrant labels: Focus Staff on Innovation; Externalize for Best Practices; Reduce Non-Value Added Work; Externalize for Cost Reduction.]
4. Why the Pistoia Alliance?
• Industry was at a crossroads (Henry Chesbrough, UC Berkeley, 2011)
– Change in business models required
• We are all in this (mess) together (Life Science,
technology vendors, service IT, academia, etc.)
• Need industry applicable services and
standards
• Collect all the stakeholders together
– Agree on commonly-shared, pre-competitive use
cases
• Focus on delivery of proofs of concept to
stimulate and foster new business models
5. The Mission of the Pistoia Alliance
Lowering the barriers to innovation
by improving the interoperability of
R&D business processes
via pre-competitive collaborations
11. Domains of Action
• Biology & Translational Medicine
• Chemistry
• Scientific Collaboration
12. The Focus of Each Domain
• Biology: Big Data, Analytics, Semantics
• Chemistry: Supply Chain, Tech Transfer
• Scientific Collaboration: Vocabularies, Use Cases, Best Practices
13. Try this at your desk…
Which diseases are correlated with the gene TCF7L2?
• Gene/Protein
• Literature – Abstracts
• Literature – Full Text
• Inherited diseases
• Gene expression
14. Try it again with Pistoia’s SESL…
Gene naming/synonyms
Gene Function
Literature statistics
Disease co-occurrences
Gene/protein interactions
…all in one report from one
search
HOW? A standard vocabulary,
data model, query language,
report structure, etc.
15. SESL Pilot project description
• Deliverables:
– Publication of standards and recommendations for brokering service
implementation
– Public demonstrator service for a single disease area
– Dialogue and assessment of potential business impact with key content
suppliers
• Scope:
– Development of an assertion database in combination with a user
interface and associated web services for one
disease/indication/phenotype of broad interest: Type II Diabetes
– Assertional content derived from 3 structured data sources and limited
Journal content (co-occurrence and statistical derivation from full text)
– Assertional evidence for filtering and drill down to primary data.
– Limited vocabulary development for area of focus: Type II Diabetes
• Participants and Cost:
– AZ, Pfizer, GSK, Roche, Unilever, EMBL-EBI, NPG, OUP, Elsevier & RSC
– Single contract between Pistoia Alliance & EMBL-EBI
– £200K cost (=2 x FTEs) – shared by industry
– 12 month project, January 2010 start
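The scope above mentions assertions backed by evidence that supports filtering and drill-down to primary data. A minimal Python sketch of one way such a record could be shaped follows. The class names, field names, the placeholder gene `GENE_X` and the `example.org` links are all hypothetical and are not taken from the SESL implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    """One piece of supporting evidence for an assertion (hypothetical shape)."""
    source: str  # e.g. "OMIM", "UniProt", "ArrayExpress", "full text"
    link: str    # link out to the primary record or document


@dataclass
class Assertion:
    """A gene-disease statement plus the evidence behind it."""
    gene: str
    disease: str
    evidence: list[Evidence] = field(default_factory=list)

    def sources(self) -> set[str]:
        return {e.source for e in self.evidence}


def filter_by_source(assertions, source):
    """Keep only assertions supported by at least one record from `source`."""
    return [a for a in assertions if source in a.sources()]


def drill_down(assertion):
    """Return the primary-data links behind an assertion."""
    return [e.link for e in assertion.evidence]


# Illustrative records only; the links are placeholders, not real identifiers.
assertions = [
    Assertion("TCF7L2", "Type II Diabetes", [
        Evidence("OMIM", "https://example.org/omim/record-1"),
        Evidence("full text", "https://example.org/doc/42"),
    ]),
    Assertion("GENE_X", "Type II Diabetes", [
        Evidence("ArrayExpress", "https://example.org/arrayexpress/exp-7"),
    ]),
]

omim_backed = filter_by_source(assertions, "OMIM")
print([a.gene for a in omim_backed])
print(drill_down(omim_backed[0]))
```

Keeping the evidence attached to each assertion means a filter only needs to inspect the sources, and a drill-down only needs to follow the stored links.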
16. The Knowledge Service Framework
[Figure: layered architecture, top to bottom.
• Multiple consumers using 'Disease Dossier' knowledge applications, behind the consumer firewall.
• A knowledge service broker built on open standards: Service Layer, Std Public Vocabularies, Assertion & Meta Data Management, Transform / Translate (RDF triples), Integrator / Aggregator (triple store), Business Rules.
• Content suppliers behind the supplier firewall: Corpus 1, Db 2, Db 3, Db 4, Corpus 5.]
17. Minimal configuration to test the technical feasibility of a Knowledge Broker Service
[Figure: three-layer architecture.
• User Interface Layer: a single interface.
• Brokering service layer: Broker #1 and Broker #2, each with a Service Layer, Std Public Vocabularies, Assertion & Meta Data Mgmt, Transform / Translate, Query templates and its own triple store (Triple store 1, Triple store 2). Condition: identical structure, different content which can overlap.
• Primary source layer: UK-PubMed Central corpus, RSC corpus, NPG corpus, OUP corpus, Elsevier corpus, EBI UniProt database (shown twice), EBI ArrayExpress database, NCBI OMIM database.]
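The configuration on this slide, two brokers with identical structure and overlapping content queried through one interface, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `TripleStore` class stands in for a real RDF triple store, and the predicate `associatedWith` and the placeholder diseases `DISEASE_A` and `DISEASE_B` are invented for the example.

```python
class TripleStore:
    """A toy in-memory stand-in for an RDF triple store (assumption)."""

    def __init__(self, triples):
        self.triples = set(triples)

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return {
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        }


def diseases_for_gene(store, gene):
    """Query template shared by every broker: diseases linked to a gene."""
    return {o for (_, _, o) in store.match(s=gene, p="associatedWith")}


def broker_query(stores, gene):
    """Run the same template against each store and merge the answers,
    recording which broker(s) supplied each disease."""
    merged = {}
    for name, store in stores.items():
        for disease in diseases_for_gene(store, gene):
            merged.setdefault(disease, set()).add(name)
    return merged


# Identical structure, different content that overlaps on one triple.
stores = {
    "broker1": TripleStore({
        ("TCF7L2", "associatedWith", "Type II Diabetes"),
        ("TCF7L2", "associatedWith", "DISEASE_A"),
    }),
    "broker2": TripleStore({
        ("TCF7L2", "associatedWith", "Type II Diabetes"),
        ("TCF7L2", "associatedWith", "DISEASE_B"),
    }),
}

result = broker_query(stores, "TCF7L2")
for disease in sorted(result):
    print(disease, sorted(result[disease]))
```

Because both stores answer the same query template, the interface layer can merge results without knowing which broker holds which content, and overlapping answers are reported once with both brokers listed as providers.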
18. Simple Graphical User Interface to the SESL public demonstrator
1. Single point of query through a simple GUI
• Show: A. Gene Query and/or B. Disease Query
• Filtered by: 1) Everything 2) Consensus 3) Co-occurrence 4) OMIM 5) ArrayExpress
2. Aggregated results on a single web page
• A. Gene query results summary
– 1) Co-occurrence
– 2) UniProt names and annotation
– 3) OMIM disease names
– 4) ArrayExpress disease and/or pancreas expression
– 5) UniProt GO terms
– 6) UniProt binary interactions
• B. Disease query results summary
– 1) Co-occurrence
– 2) OMIM disease names
– 3) ArrayExpress disease expression
• Full text detail: Title, Authors, Documents, Citation, and co-occurrence of gene and disease mentions in text extracts
• The results include links out to the primary sources
SESL public demonstrator: http://www.pistoia-sesl.org
20. Gene discovery in SESL demonstrator
[Figure: four-set Venn diagram of gene count intersections from the data sources in the demonstrator. Sets: pancreas expression in ArrayExpress db; T2D disease gene mention in OMIM db; T2D disease genes in full text documents; T2D disease gene mention in UniProt db. Numbers shown in the figure, in extraction order: 1; 3, 1; 3; 4; 1; plus a "Gene count" scale of 20, 10, 0.]
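Gene count intersections of this kind reduce to set operations over the gene list returned by each source. The Python sketch below counts the genes in each exclusive region of a four-set Venn diagram. The gene memberships are invented for illustration (`GENE_A` to `GENE_D` are placeholders) and do not reproduce the counts in the figure.

```python
# Hypothetical gene lists per source; not the demonstrator's actual data.
gene_sets = {
    "ArrayExpress": {"TCF7L2", "GENE_A", "GENE_B"},
    "OMIM": {"TCF7L2", "GENE_A", "GENE_C"},
    "Full text": {"TCF7L2", "GENE_B", "GENE_C", "GENE_D"},
    "UniProt": {"TCF7L2", "GENE_D"},
}


def exclusive_region_counts(sets):
    """Count genes per Venn region, i.e. genes found in exactly
    that combination of sources and in no others."""
    membership = {}
    for source, genes in sets.items():
        for gene in genes:
            membership.setdefault(gene, set()).add(source)
    counts = {}
    for sources in membership.values():
        key = tuple(sorted(sources))
        counts[key] = counts.get(key, 0) + 1
    return counts


counts = exclusive_region_counts(gene_sets)
for region in sorted(counts):
    print(" & ".join(region), counts[region])

# Genes reported by every source.
in_all_sources = set.intersection(*gene_sets.values())
print(sorted(in_all_sources))
```

Counting by exact membership, rather than by pairwise intersection, ensures that each gene lands in one region only, which is what the numbers inside a Venn diagram represent.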
21. Selected content loaded as RDF triples
Source description | # triples | %
Expression data (ArrayExpress) | 182,840 | 0.5%
Experimental Factor Ontology from ArrayExpress | 49,026 | 0.1%
Disease vocabulary from UMLS | 6,906,735 | 18.8%
Vocabulary from Disease Ontology | 1,863,664 | 5.1%
Terms from Gene Ontology | 495,595 | 1.3%
Human genes from UniProt | 12,552,239 | 34.1%
Meta data from Full Text documents | 3,485,212 | 9.5%
Gene annotations from Full Text documents | 2,373,584 | 6.5%
Disease annotations from Full Text documents | 4,983,788 | 13.6%
GO annotations from Full Text documents | 3,870,834 | 10.5%
Totals | 36,763,517 | 100%
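The percentage column follows directly from the triple counts. As a quick check, the short Python sketch below recomputes the total and each source's share from the figures in the table. The dictionary keys are shortened labels chosen for this example.

```python
# Triple counts copied from the table; keys are shortened labels.
triple_counts = {
    "ArrayExpress expression data": 182_840,
    "Experimental Factor Ontology": 49_026,
    "UMLS disease vocabulary": 6_906_735,
    "Disease Ontology vocabulary": 1_863_664,
    "Gene Ontology terms": 495_595,
    "UniProt human genes": 12_552_239,
    "Full text meta data": 3_485_212,
    "Full text gene annotations": 2_373_584,
    "Full text disease annotations": 4_983_788,
    "Full text GO annotations": 3_870_834,
}

total = sum(triple_counts.values())

# Share of each source, rounded to one decimal place as in the table.
shares = {
    name: round(100 * count / total, 1)
    for name, count in triple_counts.items()
}

print(f"{total:,}")
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The recomputed total agrees with the 36,763,517 triples stated in the table. Vocabulary and UniProt content together account for well over half of the store, with the full text annotations making up most of the rest.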
22. Signposting: Standards used in SESL
Blending of existing standards
Category | Name | Community
Triple Store | RDF | W3C
Triple Store | SPARQL | W3C
Triple Store | Jena, Sesame, Virtuoso | Open Source
Text Mining | IeXML | EBI & CALBC
Text Mining | LexEBI/BioLexicon | EBI, NaCTeM, U of Pisa
Text Mining | CALBC | EBI & CALBC
URIs | UniProt | EBI, PIR, SIB, etc.
URIs | Disease Ontology and UMLS | OBO, NIH/NLM
URIs | ArrayExpress | EBI
URIs | NCBI Taxonomy | NCBI
RDF Schema | Dublin Core | W3C
RDF Schema | N3 notation | W3C
RDF Schema | Co-occurrence of gene-disease | EBI
RDF Schema | PMC doc standard | NCBI
RDF Schema | Relation ontology | OBO
RDF Schema | Ontology URI server | W3C
23. The Deliverables of the SESL pilot
• A proof-of-concept to demonstrate feasibility and
clarify requirements
– http://www.pistoia-sesl.org
• A functional specification for query brokering,
result filtering, report generation
– Expect publication by end 2011
– http://www.pistoiaalliance.com/workinggroups/sesl.html
• Academia, Life Science Industry and Publishers
– Attained a better understanding of each other’s needs
– Demonstration of potential for a new business model
– Explore follow-on via Open Innovation consortia
24. Learning and Future Direction
• Framework to maximise re-use of existing standards
– Minimise use of bespoke, hard-coded implementations
• Crucial features of a knowledge brokering service:-
– RDF triples for a scalable, meta index to broker across
primary sources (both databases and literature)
– Important to define business rules for query & extraction
– Recommend a registry of suitable data sources
• similar to web services registry
• What is next?
– Example, follow-on to the SESL pilot:-
– Open PHACTS consortium => www.openphacts.org
– 3 year IMI pre-competitive project (started early 2011)
– Data providers and Life Science industry working together
25. Acknowledgements
Industry
• Wendy Filsell, Unilever (SESL co-leader)
• Ian Stott, Unilever
• Nigel Wilkinson, PFE
• Catherine Marshall, PFE
• Peter Woollard, GSK
• Ashley George, GSK
• Mike Westaway, AZ
• Nick Lynch, AZ
• Ian Dix, AZ
• Michael Braxenthaler, Roche
• John Wise, Pistoia Alliance
EMBL-EBI
• Dietrich Rebholz-Schuhmann (Technical Team Leader)
• Christoph Grabmueller
• Silvestras Kavaliauskas
• Dominic Clark
• Rodrigo Lopez
• Jo McEntyre, UK-PMC
• Janet Thornton
Publishers
• Claire Bird, OUP
• Richard O’Beirne, OUP
• Colin Batchelor, RSC
• Richard Kidd, RSC
• David Hoole, NPG
• Alf Eaton, NPG
• Jabe Wilson, Elsevier
• Bradley Allen, Elsevier