Mapping a human brain generates petabytes of gene listings and the corresponding locations of those genes throughout the brain. Due to the size of this dataset, a prototype Semantic Web application was created with the unique ability to link new datasets from similar fields of research and present the resulting models to an online community. The application presents a large set of gene-to-location mappings and provides new information about diseases, drugs, and side effects in relation to the genes and areas of the human brain.
In this presentation we will discuss the normalization processes and tools for adding new datasets, the user experience throughout the publishing process, the underlying technologies behind the application, and demonstrate the preliminary use cases of the project.
2. Today we are discussing…
• What is the use case and who requested it?
• How do you import and normalize thousands of RDF triples worth of gene data?
• How do we enrich the normalized gene data with parallel research data sets?
• Creating instance pages without knowing exactly what will be displayed on them
• Demonstration of the initial use cases
• Question and answer session
3. Why?
• Prototype: How do we assemble the data mine and refine the authoring tools? How do we expand this to the research community?
• How do we expand ownership of the data to research professionals?
• How do we build systems in a way that research professionals can author and link the data?
• How do we publish these new relationships to the wider research community?
4. What is the Allen Institute for Brain Science?
• Launched in 2003 with seed funding from founder and philanthropist Paul G. Allen.
• Serving the scientific community is at the center of our mission to accelerate progress toward understanding the brain and neurological systems.
• The Allen Institute's multidisciplinary staff includes neuroscientists, molecular biologists, informaticists, and engineers.
"The Allen Institute for Brain Science is an independent 501(c)(3) nonprofit medical research organization dedicated to accelerating the understanding of how the human brain works."
5. Human Brain Map
• Open, public online access
• A detailed, interactive three-dimensional anatomic atlas of the "normal" human brain
• Data from multiple human brains
• Genomic analysis of every brain structure, providing a quantitative inventory of which genes are turned on where
• High-resolution atlases of key brain structures, pinpointing where selected genes are expressed down to the cellular level
• Navigation and analysis tools for accessing and mining the data
6. Biological Linked Data Map
• Open, public online access
• Data from multiple RDF data stores
• Complete import pipeline using the LDIF framework
• Outlines of each imported instance, embedding inline wiki properties and providing views of imported properties from the original RDF datasets
• Charting tools that "pivot" SPARQL queries, providing several views of each query (see the sketch below)
• Navigation and composition tools for accessing and mining the data
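A minimal sketch of what a "pivotable" chart query could look like. The endpoint URL and the nb:treatsDisease / nb:diseaseClass properties are assumptions invented for illustration, not the project's actual schema:

```python
# One query, several chart views: the charting tools re-group ("pivot") the
# same result set. Endpoint and vocabulary below are hypothetical.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://neurobase.example.org/sparql")  # assumed URL
sparql.setQuery("""
    PREFIX nb: <http://neurobase.example.org/ontology/>
    SELECT ?diseaseClass (COUNT(DISTINCT ?drug) AS ?n)
    WHERE {
        ?drug    nb:treatsDisease ?disease .
        ?disease nb:diseaseClass  ?diseaseClass .
    }
    GROUP BY ?diseaseClass
    ORDER BY DESC(?n)
""")
sparql.setReturnFormat(JSON)
rows = sparql.query().convert()["results"]["bindings"]

# The same bindings can be pivoted client-side into a second view,
# e.g. a donut chart keyed by disease class.
counts = {r["diseaseClass"]["value"]: int(r["n"]["value"]) for r in rows}
```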
7. Where did we get the data?
• KEGG : Kyoto Encyclopedia of Genes and Genomes
  • "KEGG GENES is a collection of gene catalogs for all complete genomes generated from publicly available resources, mostly NCBI RefSeq."
• Diseasome
  • "The Diseasome website is a disease/disorder relationships explorer and a sample of an innovative map-oriented scientific work. Built by a team of researchers and engineers, it uses the Human Disease Network data set."
• DrugBank
  • "The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug data with comprehensive drug target information."
• SIDER
  • "SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and package inserts."
8. New ontology map for import
• Genes
• Diseases
  • DrugBank : 4,772
  • KEGG : 2,482
  • SIDER : 924
• Effects
  • Diseasome : 4,213
  • KEGG : 459
• Drugs
  • DrugBank : 4,553
  • Diseasome : 3,919
  • KEGG : 9,841
  • SIDER : 1,737
• Pathways
  • KEGG : 28,442
We chose to intentionally simplify the ontology due to disagreements between researchers about entity relationships and subclasses.
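The deck never shows the flattened ontology itself, so here is a hedged sketch of what five entity types with basic 1-1 domain/range relations could look like. Every class, property, and namespace below is invented for illustration:

```python
# Hypothetical sketch of the simplified wiki ontology described above; the
# slide only states it was flattened to a few entity types with simple
# domain/range relations, so all names here are assumptions.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

NB = Namespace("http://neurobase.example.org/ontology/")  # assumed base URI
g = Graph()
g.bind("nb", NB)

for cls in ("Gene", "Disease", "Drug", "SideEffect", "Pathway"):
    g.add((NB[cls], RDF.type, OWL.Class))

# One object property per pairwise relation, each with a single domain/range.
relations = [
    ("affectsGene", NB.Drug, NB.Gene),
    ("associatedWithDisease", NB.Gene, NB.Disease),
    ("treatsDisease", NB.Drug, NB.Disease),
    ("hasSideEffect", NB.Drug, NB.SideEffect),
    ("participatesInPathway", NB.Gene, NB.Pathway),
]
for name, domain, rng in relations:
    prop = NB[name]
    g.add((prop, RDF.type, OWL.ObjectProperty))
    g.add((prop, RDFS.domain, domain))
    g.add((prop, RDFS.range, rng))

print(g.serialize(format="turtle"))
```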
9. Importing and mapping the Linked Data
• R2R
  • 32,900 instances were converted to the wiki ontology.
  • 583,746 properties mapped.
  • Pathways were ignored for the wiki ontology import, but are available within the triple store's KEGG Pathway graph.
• SIEVE
  • 20,849 instances available in the wiki ontology after Silk normalization.
  • Instance merging affected drugs, genes, and diseases across datasets.
• Triple Store SPARQL Update
[Diagram: download → networked storage → local storage → R2R mapping engine (maps entities to the new ontology) → import to wiki; the Silk/Sieve mapping engines normalize entities across data sources; the triple store makes the result available via SPARQL queries.]
10. Importing and mapping the Linked Data
[Repeats the previous slide's content; the pipeline steps discussed next are highlighted.]
12. Linked Data challenges
• Data sources that overlap in content may:
• Use a wide range of different RDF vocabularies
• Use different identifiers for the same real-world entity
• Provide conflicting values for the same properties
• Implications
  • Queries become hand-crafted for a specific RDF data set – no different than using a proprietary API.
  • Individual, improvised, and manual merging techniques for each data set.
• Integrating public datasets with internal databases poses the same problems.
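A minimal sketch of the "hand-crafted query" problem. Both endpoints, both IDs, and both predicate URIs are placeholders standing in for whatever each dataset actually publishes:

```python
# Two hand-crafted queries asking the same question ("what does this drug
# target?") against two sources that use different vocabularies and IDs.
from SPARQLWrapper import SPARQLWrapper, JSON

KEGG_QUERY = """
    SELECT ?target WHERE {
        <http://example.org/kegg/drug/D00001> <http://example.org/kegg/target> ?target .
    }
"""
DRUGBANK_QUERY = """
    SELECT ?target WHERE {
        <http://example.org/drugbank/DB00001> <http://example.org/drugbank/target> ?target .
    }
"""

def fetch_targets(endpoint: str, query: str) -> list:
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [row["target"]["value"] for row in rows]

# Same question, two different "APIs" -- exactly the silo the slide describes.
kegg_targets = fetch_targets("http://example.org/kegg/sparql", KEGG_QUERY)
drugbank_targets = fetch_targets("http://example.org/drugbank/sparql", DRUGBANK_QUERY)
```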
13. Linked Data Integration Framework
• LDIF normalizes the Linked Data from multiple sources into a clean, local target representation while keeping track of data provenance.
1. Collect data: managed download and update
2. Translate data into a single, target vocabulary
3. Resolve identifier aliases into local target URIs
4. Cleanse data and resolve conflicting values
5. Output to local file system or triple store
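A conceptual sketch of these five stages as plain Python functions. LDIF itself is driven by declarative job configurations, so the function bodies here are placeholders that illustrate the data flow, not LDIF's real API:

```python
# The five LDIF stages, sketched. Helper names are invented.
from rdflib import Graph

def collect(urls):
    """1. Collect data: managed download of each published RDF dump."""
    g = Graph()
    for url in urls:
        g.parse(url)
    return g

def translate(g):
    """2. Translate data into the single target vocabulary (R2R's job)."""
    return g  # placeholder: apply vocabulary mappings here

def resolve_identities(g):
    """3. Resolve identifier aliases into local target URIs (Silk's job)."""
    return g  # placeholder: rewrite aliased URIs here

def cleanse(g):
    """4. Cleanse data and resolve conflicting values (Sieve's job)."""
    return g  # placeholder: apply fusion policies here

def output(g, path):
    """5. Output to the local file system (or push to a triple store)."""
    g.serialize(destination=path, format="nt")

output(cleanse(resolve_identities(translate(collect([])))), "ldif-out.nt")
```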
14. LDIF Pipeline
[Pipeline strip: 1 Collect data · 2 Translate data · 3 Resolve identities · 4 Cleanse data · 5 Output data]
Step 1 – Collect data. Supported data formats:
• RDF files (multiple formats)
• SPARQL endpoints
• Crawling Linked Data
[Component stack diagram]
15. LDIF Pipeline
[Pipeline strip: 1 Collect data · 2 Translate data · 3 Resolve identities · 4 Cleanse data · 5 Output data]
Step 2 – Translate data. Sources use a wide range of different RDF vocabularies: dbpedia-owl:City, schema:Place, and fb:location.citytown all describe the same kind of thing. R2R translates them into the single target term, e.g. location:City. (See the sketch below.)
[Component stack diagram]
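What the translate step accomplishes, phrased as an equivalent SPARQL CONSTRUCT. R2R mappings are actually written in their own RDF-based mapping language; this query is only a conceptual stand-in, and the location: target namespace is assumed from the slide:

```python
# Rewrite every source typing into the single target class.
from rdflib import Graph

CITY_MAPPING = """
    PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
    PREFIX schema:      <http://schema.org/>
    PREFIX fb:          <http://rdf.freebase.com/ns/>
    PREFIX location:    <http://example.org/location/>

    CONSTRUCT { ?s a location:City }
    WHERE {
        { ?s a dbpedia-owl:City }
        UNION { ?s a schema:Place }
        UNION { ?s a fb:location.citytown }
    }
"""

def translate(source: Graph) -> Graph:
    target = Graph()
    for triple in source.query(CITY_MAPPING):
        target.add(triple)  # every source typing becomes location:City
    return target
```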
16. LDIF Pipeline
[Pipeline strip: 1 Collect data · 2 Translate data · 3 Resolve identities · 4 Cleanse data · 5 Output data]
Step 3 – Resolve identities. Sources use different identifiers for the same entity: "London" might be London, England; London, MA, USA; London, TN, USA; or London, TX, USA. Silk resolves the alias, e.g. London = London, England. (See the sketch below.)
[Component stack diagram]
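A toy stand-in for Silk's identity resolution, assuming label similarity is the only evidence. Real Silk linkage rules are declared in its Link Specification Language and combine several distance metrics and thresholds:

```python
# Crude "same entity?" test via normalized string similarity (stdlib only).
from difflib import SequenceMatcher

def same_entity(label_a: str, label_b: str, threshold: float = 0.9) -> bool:
    """Crude owl:sameAs test: true when the labels are nearly identical."""
    ratio = SequenceMatcher(None, label_a.lower(), label_b.lower()).ratio()
    return ratio >= threshold

assert same_entity("Propofol", "PROPOFOL")          # same drug, different case
assert not same_entity("Propofol", "Propranolol")   # similar name, different drug
```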
17. LDIF Pipeline
[Pipeline strip: 1 Collect data · 2 Translate data · 3 Resolve identities · 4 Cleanse data · 5 Output data]
Step 4 – Cleanse data. Sources provide different values for the same property: one source says London, England has a population of 8.174M people, another says 9.2M. The cleansing step (Sieve) resolves the conflict to a single value, e.g. population = 8.174M. (See the sketch below.)
[Component stack diagram]
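A sketch of a Sieve-style fusion policy, under the assumption that each candidate value carries the named graph it came from plus a trust score we assign per source. Sieve's real quality metrics and fusion functions are configured declaratively:

```python
# Keep the value from the most trusted source graph.
def fuse(candidates):
    """candidates: (value, source_graph, trust) tuples; keep the most trusted."""
    value, _graph, _trust = max(candidates, key=lambda c: c[2])
    return value

population = fuse([
    (8_174_000, "urn:graph:census", 0.9),     # hypothetical provenance graphs
    (9_200_000, "urn:graph:estimate", 0.4),
])
assert population == 8_174_000
```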
18. LDIF Pipeline
[Pipeline strip: 1 Collect data · 2 Translate data · 3 Resolve identities · 4 Cleanse data · 5 Output data]
Step 5 – Output data. Supported output formats:
• N-Quads
• N-Triples
• SPARQL Update Stream
Provenance tracking using Named Graphs (example below).
[Component stack diagram]
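How named-graph provenance looks in N-Quads output: the fourth term of each quad names the graph the triple came from. All URIs here are hypothetical:

```python
# Parse one quad and read its named graph back out.
from rdflib import Dataset

NQUAD = (
    "<http://example.org/drug/propofol> "
    "<http://example.org/ontology/hasSideEffect> "
    "<http://example.org/effect/apnea> "
    "<http://example.org/provenance/sider-import> .\n"
)

ds = Dataset()
ds.parse(data=NQUAD, format="nquads")
for _s, _p, _o, graph in ds.quads((None, None, None, None)):
    print(graph)  # the named graph, i.e. where this triple came from
```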
23. Semantic MediaWiki
Semantic MediaWiki is a full-fledged framework, in
conjunction with many spinoff extensions, that can turn a
wiki into a powerful and flexible knowledge management
system. All data created within SMW can easily be
published via the Semantic Web, allowing other systems to
use this data seamlessly.
24. Four initial templates for each instance by category
1. Custom infobox within outline template
   • Visible inline properties
2. Outline template providing instance information
3. Widget template displaying dynamic charts or third-party services
   • Donut charts and AIBS gene feed
4. Broad table SPARQL queries showing instance relationships
5. Hidden inline properties for other extensions
25. Creating instance wiki pages
• The triple store now contained tens of thousands of recognized category instances. Creating the pages required a bot.
1. Fetch the RDF dumps from an active D2R server (1.0: RDF data download).
2. Use regex to fetch the rdfs:label property that was mapped by R2R as the instance name (2.0: a sanitize script creates a CSV of category page names).
3. Open a category-specific text file of wiki markup – a page of template includes (3.0: read the wiki markup for the page instance).
4. Contact Neurowiki and request a new page from the list of names with the category content (4.0: create the MediaWiki page through the MediaWiki Gateway rb framework's REST interface).
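A hypothetical reconstruction of the page-creation bot. The original used the Ruby MediaWiki Gateway gem; this sketch substitutes the Python mwclient library, and the host, dump URL, label regex, and template names are all invented to show the four steps:

```python
# Sketch of the four-step bot pipeline above, with assumed names throughout.
import re
import requests
import mwclient

D2R_DUMP_URL = "http://example.org/d2r/all.nt"   # assumed D2R export URL
NEUROWIKI_HOST = "neurowiki.example.org"         # assumed wiki host

# 1. Fetch the RDF dump from the D2R server.
dump = requests.get(D2R_DUMP_URL).text

# 2. Regex out rdfs:label literals to use as page names.
LABEL_RE = re.compile(
    r'<http://www\.w3\.org/2000/01/rdf-schema#label>\s+"([^"]+)"')
page_names = sorted(set(LABEL_RE.findall(dump)))

# 3. Category-specific wiki markup: a page of template includes that the
#    four templates per category render into infoboxes, outlines, widgets,
#    and tables (template names are illustrative).
PAGE_TEXT = "{{DrugInfobox}}\n{{DrugOutline}}\n{{DrugWidgets}}\n{{DrugTables}}\n"

# 4. Ask the wiki to create each page via its API.
site = mwclient.Site(NEUROWIKI_HOST)
for name in page_names:
    page = site.pages[name]
    if not page.exists:
        page.save(PAGE_TEXT, summary="Bot: create drug instance page")
```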
28. How are base entities like Calcium represented?
1. The wiki page and corresponding template components are rendered (1.0: a drug search returns the wiki page, an aggregate page of components).
2. Relations are pulled from the normalized data store of linked data (2.0: Calcium relations from the Neurobase data stores).
3. The JavaScript components are populated via a data feed (3.0: the selected widget for display).
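How a widget's data feed might pull Calcium's relations from the normalized store (steps 2-3 above). The endpoint and the shape of the query are assumptions carried over from the earlier sketches:

```python
# Fetch every property of the "Calcium" instance for the widget feed.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://neurobase.example.org/sparql")  # hypothetical
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?p ?o WHERE {
        ?calcium rdfs:label "Calcium" ;
                 ?p ?o .
    }
""")
sparql.setReturnFormat(JSON)
feed = sparql.query().convert()["results"]["bindings"]
# The JavaScript chart components consume this JSON feed directly.
```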
29. How are base entities like Calcium represented?
• Because so many organisms contain calcium, the mappings to affected species were never created, to conserve space in the data store.
[Chart: Drug and Disease Class Ratios of Calcium – inner circle: drugs by affected species; outer circle: disease ratios by class]
30. What are the dangers of Propofol?
1. Propofol's DrugBank relations are rendered in the corresponding JavaScript components (1.0: Propofol relations from the Neurobase data stores; 2.0: aggregate components).
2. The Diseasome disease relations show the classes of illness Propofol affects (Propofol disease relations).
3. An aggregate of SIDER side effects is rendered in relation to Propofol and the disease classes (3.0: Propofol side effects).
34. Which drugs are used in Chemotherapy?
1. Diseasome disease relations normalized by LDIF (1.0: disease relations).
2. DrugBank and AIBS relations to genes affected by both the disease and drug (2.0: gene–drug relations).
3. SIDER side effects related to the gene, disease, and drug (3.0: drug side effects).
4. DrugBank drug glossary definition specifying various forms of cancer treatment (4.0: drug info box).
[Flow: disease search → Neurobase data stores → aggregate components]
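A sketch of the cross-dataset join behind this use case, reusing the invented nb: vocabulary and endpoint from the earlier sketches: disease → genes → drugs acting on those genes → side effects of those drugs:

```python
# Walk the interlinked graph from a disease out to drug side effects.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://neurobase.example.org/sparql")  # hypothetical
sparql.setQuery("""
    PREFIX nb:   <http://neurobase.example.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?gene ?drug ?effect WHERE {
        ?disease rdfs:label "Cancer" .
        ?gene    nb:associatedWithDisease ?disease .
        ?drug    nb:affectsGene           ?gene .
        ?drug    nb:hasSideEffect         ?effect .
    }
""")
sparql.setReturnFormat(JSON)
rows = sparql.query().convert()["results"]["bindings"]
```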
36. Which drugs are used in Chemotherapy?
[Chart: Drug and Disease Class Ratios of AR – inner circle: drugs by affected species; outer circle: disease ratios by class]
37. Which drugs are used in Chemotherapy?
[Chart: Drug and Side Effect Ratios of AR – inner circle: drugs by affected species; outer circle: side effect ratios of drugs]
39. Which drugs are used in Chemotherapy?
[Chart: Drug and Disease Class Ratios of Nilutamide – inner circle: drugs by affected species; outer circle: disease ratios by class]
40. Which drugs are used in Chemotherapy?
[Chart: Drug and Disease Class Ratios of Bicalutamide – inner circle: drugs by affected species; outer circle: disease ratios by class]
42. Expanding the Prototype
• Semantic MediaWiki query construction
  • Could this be done in SPARQL?
• Authoring SILK / R2R mappings for the LDIF pipeline
  • Extremely difficult, and the editors are not intuitive
• How do you get data owners to fuse the sets and create the data store themselves?
  • Tested with the Aura Wiki prototype
• Expand authoring provenance
  • How do we ensure new data / links come from an authoritative source?
43. Today we discussed…
• The Allen Institute for Brain Science (AIBS)
• Four similar research data sets to interlink with the AIBS data set
• An import pipeline named the Linked Data Integration Framework (LDIF)
• The interlinking process for 5 concurrent research data sets (AIBS, DrugBank, Diseasome, KEGG, SIDER)
• A prototype neurobiology authoring platform
• Creating instance pages to display the new connections
• Demonstration of the initial use cases
Hello, my name is William Smith, and today we will be talking about a project near and dear to my heart. I served as project manager for a prototype application, worked closely with 2 German teams, and we were the first customer for several of the tools used to assemble this application. I was also the chief integration point into Vulcan, so I am well aware of the technologies, code bases, and data sets that went into assembling this project…
So what are we discussing today? First and foremost, this was a project for an internal organization at Vulcan involved in mapping the human brain. This, of course, generates petabytes of data and millions of triples worth of gene mappings – but we took a smaller slice of a couple hundred thousand genes for the initial prototype. There were also several parallel research programs generating data in a format we could use, and a conference of industry professionals was held to find the interlinking pieces of these datasets. Finally, I'm going to walk through the data pipeline, the application itself, and a set of our original use cases.
Why? Well, a core problem in neurobiology, and most sciences for that matter, is the inability of industry professionals to share and author sets of data across projects. This leaves an odd gap where people with computer science degrees are linking data they don't fully understand, and the people who understand the data don't have the ability to add the interlinks for greater vision into the data. With this problem known, our original prototype soon expanded into how we get these tools into the hands of the research community, and that in itself created 3 core questions: ownership, authorship, and publishing provenance of the newly linked data.
So who requested this? The organization that chartered this project, and provided the original data sets, is the Allen Institute for Brain Science – or AIBS. When you hear me say AIBS by accident, I'm referring to this organization. It was launched in 2003 by Paul G. Allen and has the explicit focus of mapping the human brain to accelerate our understanding of the brain and neurological systems. Furthermore, the institute is a 501(c)(3) nonprofit medical research organization employing hundreds of neuroscientists, molecular biologists, informaticists, and engineers in the Seattle area.
And this is the Institute's core product… or several screenshots of the core product. Here we have gene heat maps… some location data… where it all sits, location-wise, in the human brain. As odd as those screen caps are, they are accessed by thousands of researchers daily, and this is considered a major success. It's open: the public right now can go to this site and browse the catalog. There are currently 3 human brains fully mapped, with a 4th in progress. Each of these donors has generated genomic analysis of brain structure and a thorough catalog of genes with respect to location. While the captions are small, they are part of a much larger suite of atlas navigation tools with several components – e.g. the heat map – pinpointing genes expressed down to the cellular level. And most importantly, for our purposes, they generate terabytes of data with industry-wide IDs we can link to other sources!
And here’s our prototype in screenshots. No page is hand type, no graph is hand entered, 4 static templates pulling data from our normalized mine creating all these pretty pictures and full pages of text. There are over 30 thousand of these pages.We will be discussing the first two points in depth – RDF and the LDIF pipeline. Charting tools use SPARQL which we will not be discussing in depth – however I have a hidden slide of the details should somebody be really malicious and want to ask about SPARQL queries. Finally, our navigation closely resembles the common MediaWiki installation which everybody who has been on the internet in the last 10 years is familiar with… editing on the other hand is very different and currently only bots create and maintain the pages.
Which brings us to these parallel tracks of research data I keep mentioning. To choose these sets, we had a conference of industry researchers and data professionals go through the hundreds of biology mines looking for useful projects that closely relate to genes found in the human brain. The 4 prototype sets chosen were <read slides>.
Our original cross-section of data found these connections. Not the full dump, but with roughly 15 thousand gene connections, plenty of pages produced relevant connections and were filled with interesting data points. <read numbers> And to the right we have our simplified ontology. Looks incredible, right… hey, they can't all be winners, and don't blame me – blame Protégé. This was generated with basic 1-1 relations and domain-range logic, where applicable. <joke about line colors> The simplification was created in part because nobody who does anything in neuroscience agrees with another person who does the same thing. We could get them to agree that, in some gray-area way, these things are related on the domain-range level… so that generates that, and it looks way worse if I try to spread the boxes out in any other way.
Which brings us to the pretty graph I hate… because it makes unifying things into that ugly Protégé graph look easy. It's not, but it does give a good overall view of what we were able to convert directly into the wiki: 32,900 instances turned directly into pages, with over 500 thousand properties across the set. Even more important, after "same as" connections were made we had 20 thousand fully populated pages – and these are the pages with connections across the datasets. That brings up an important point: if I imported all of the gene data I would end up with a huge wiki by page count, but the better part of those pages would be nothing more than a page title and empty templates. Hence the importance of finding these connections and only tracking the useful data points – like pages with more than a title. On the right we have the simplified process, which I will be going into in more detail very soon. <read right graph>
And those parts that just turned red – <read red parts> – are the process we will be discussing, for a section I like to call: Linked Data Integration Framework.
LDIF was created over the last 4 years by the Free University of Berlin – the same team that helped build the prototype; we were their first customer. It is still active (last update late 2013) and has 2 main components: R2R and SILK.
And this is why I don’t like the oversimplification of that process chart. Plenty of difficult computer science problems and none of them cut and dry to solve…Assuming we can find overlapping data sources you then have to unify vocabularies – the predicate of the triple. Once this is done and you can agree on what the name of the entity is, then you will have data sets with the same entity going by a range of names and ids. Finally, once you’ve located the same entities there’s no guarantee the normalized vocabularies will be referencing the same value.Without the normalization pipeline – LDIF - this creates queries that are silo’d to a specific data set basically creating an API… and that’s good for companies like facebook and Google but terrible for independent research. The last point is less of a problem for us because we decided long ago this was a philanthropic prototype with 501-c-3 data – but it is something to be considered when working with say – national security data.
Lucky for us, as customer 1 of the LDIF framework, we get to test all of the steps in normalization and hope for the best, or fix it ourselves! If this works right we will… <read steps>
And here’s the LDIF architecture.All this stuff on the bottom are the 5 data sets, the arrows don’t really apply because they didn’t link up that well before LDIF, and then to the pipeline.After processing and RE-releasing the arrows apply, and then we shove that all in our own public triple store for use in the application.
And here’s your application.
Pubby was created 5 years ago by the Free University of Berlin and is used for DBpedia. There is no search – you have to follow links – it is not a very modern viewing experience, and there is no expression of data via links.
Less than helpful – FINE.
Well, I am in this business to please the consumer, and my consumer understands common web architectures – even if they don't know they do – so let's try an installation of Semantic MediaWiki. Invented roughly 5 years ago, it's a series of plugins that run on MediaWiki, the platform created by the good folks that brought us Wikipedia! Millions of people see it every day while researching homework they don't feel like doing, when sloppily referencing college term papers, or, in my opinion, while using one of the most accurate and comprehensive encyclopedias humanity has to date. Even better, we can display the semantic properties of our normalized data inline! <show arrows> <can you expand> Of course I can.
I’m going to build you 4 base templates by category – Gene, Drug, Disease, and Side Effect.These templates will have the base information displaying our semantic properties - <run through wireframe>
This created a problem – namely, how do I create 30,000 pages and not get fired for entering data over the course of 2 years? A lot of what you see on Wikipedia isn't actually input or maintained by humans. The gene pages all have very complex infoboxes tracking IDs, regions, and a variety of known properties mined from other sources. The pieces of code that do this mining and page creation are called wiki-bots. We wrote a wiki-bot to create our 30,000 pages, one for each page type, and this is the creation pipeline these bots utilized.
I'll be running through 3 core use cases we used to test the project and explaining how the pages and graphs were generated. All of the graphs relating to the genes, diseases, drugs, and side effects within the next few slides are generated from the wiki. However, it's far easier to view the wiki when you have access behind the Vulcan firewall… so I had to run on screenshots for this portion.
Calcium is a difficult use case: it is within all creatures and has lots of connections to other entities, but we don't want to create all the pages.
Propofol had its 15 minutes of fame 5 years ago. It is a powerful sedative used in anesthesiology – you should not use it as a sleep aid – and was listed as the cause of death of a popular musician.
Finally, we head over to DrugBank and search for an obscure drug page… Bicalutamide… It's an oral non-steroidal anti-androgen used in the treatment of cancers that affect the androgen receptor, thus validating our links across the data. An example of how a not-so-simple correlation of data can give researchers deeper vision by merging sets and presenting the interlinks.
Aura Wiki was used to test crowd-sourcing of data authoring for a proto-AI.