SciBite is an award-winning, leading provider of semantic solutions for the life sciences industry. Our fast, scalable, easy-to-use semantic technologies understand the complexity and variability of content within life sciences. We can quickly identify and extract scientific terminology from unstructured text and transform it into valuable machine-readable data for your downstream applications. Our hand-curated ontologies ensure accurate, reliable, high-quality results. Headquartered in the UK, we support our customers with additional sites in the US and Japan.
More information at: www.scibite.com
2. Science isn’t simple
• The majority of scientific information is unstructured and underused
• Information overload (Volume, Variety, Velocity, Quality)
• Highly synonymous and ambiguous terminology
• Complex hierarchical relationships
3. The downstream impact
• Poor results when applying computational/AI approaches
• Up to 80% of time taken to prepare data
• Inaccessible and underused data
• Duplication of research
• Not building on existing knowledge
4. Our Purpose
To enable scientists to use insights locked in unstructured data to power their decisions and speed up innovation by:
• Using world-class ontologies to revolutionise the access to and utilisation of scientific information
• Transforming unstructured text into contextualised, machine-readable data suitable for computational analysis
5. The SciBite Platform
Harmonise terminology
Scientific ontologies
Adhering to public standards
Manage / augment / curate your own
#MANAGE
Automated cleansing of semi-structured data
Text to data
Standardise data formats
Indexing data at point of entry
#CLEAN
Semantic search
Regular expressions
Knowledge networks
Visualise results
Platform enrichment
#DISCOVER
6. TERMite
TEXT IN: Any format of biomedical text-based document can be processed by TERMite
STRUCTURED DATA OUT: Contextualized, machine-readable data ready for analysis
(Diagram labels: TERMite, VOCab)
Augment VOCab
VOCab Creation
Variation engine
e.g. breast cancer is automatically expanded to include the synonyms breast neoplasm, cancer of the breast & mammary tumour
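A toy illustration of the kind of expansion described for the variation engine. This is not SciBite's actual algorithm: the rules, the `MODIFIER_SYNONYMS` and `HEAD_SYNONYMS` tables and the `expand` function are all hypothetical, chosen only to reproduce the breast cancer example.

```python
from itertools import product

# Hypothetical synonym tables; a real vocabulary would be expert-curated.
MODIFIER_SYNONYMS = {"breast": ["breast", "mammary"]}
HEAD_SYNONYMS = {"cancer": ["cancer", "neoplasm", "tumour"]}


def expand(term: str) -> set[str]:
    """Generate variants of a two-word 'modifier head' term."""
    modifier, head = term.lower().split()
    variants = set()
    for m, h in product(MODIFIER_SYNONYMS.get(modifier, [modifier]),
                        HEAD_SYNONYMS.get(head, [head])):
        variants.add(f"{m} {h}")         # e.g. "mammary tumour"
        variants.add(f"{h} of the {m}")  # e.g. "cancer of the breast"
    return variants


variants = expand("breast cancer")
```

Naive rules like these also produce unnatural variants (for example "tumour of the mammary"), which is one reason the deck pairs expansion with expert curation and iterative testing.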
Source Ontology
Expert Curation
Synonym Expansion
Disambiguation settings
Iterative testing
VOCabs can be updated by users with simple three-column augment files. The following (saved as drug.dictionary.aug) would add the extra synonym extrasyn to the DRUG entity aspirin:
# ID name syns
CHEMBL25 extrasyn
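A minimal sketch of how such a file could be generated programmatically. The extracted example does not preserve the column delimiters, so the tab separator and the empty name column are assumptions, and the `augment_lines` helper is hypothetical.

```python
def augment_lines(rows):
    """Build the text of an augment file from (id, name, syns) tuples.

    Assumption: columns are tab-separated; the slide text does not
    preserve the exact delimiter.
    """
    lines = ["# ID\tname\tsyns"]
    for entity_id, name, syns in rows:
        lines.append("\t".join([entity_id, name, syns]))
    return "\n".join(lines) + "\n"


# Add the synonym "extrasyn" to aspirin (CHEMBL25); name column left empty.
content = augment_lines([("CHEMBL25", "", "extrasyn")])
```

Under those assumptions, writing `content` to drug.dictionary.aug would give a file with the three-column layout described above.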
Java-based, RESTful service
RDBMS, NoSQL, Solr/Elastic, Hadoop, RDF, AWS & Docker compatible
Scalable & fast. Runs on a server, in the cloud or on a laptop
7. Ontologies are at the heart of everything
• Hand-curated and maintained by our expert team
• Comprehensive coverage
• Aligned to industry standards to maintain interoperability
• Enriched with synonyms and rules to manage the complexity of scientific language
• Customise or augment our existing vocabularies, or deploy your own
VOCabs
CORE
CLINICAL
AGRO
BIO-PHARM
BUS
INT
GEN
PHEN
8. Modular Microservices Architecture
Compile / test vocabularies
Manage / distribute ontologies
VOCabulary curation
Data cleaning platform
Smart forms (HTML/JS)
Automated data ingestion
Semantic search UI
Pattern matching
Browser-based enrichment
Workflow automation (PLP/KNIME)
AI-based classification
13. Enterprise Search
The Problem
• Poor keyword search results
• Inability to search across a specific concept e.g. [GENE]
• Unable to manage synonymy/ambiguity
The Solution
• SciBite Vocabularies cover >80 different scientific concepts
• Rule-based system to translate the language of science
• Flexible architecture to integrate seamlessly with partner systems
The Outcome
• Powerful enterprise search transformed into a scientifically aware system
15. Articles identified that don’t use the word “Gilenya” but do use a synonym
Articles must mention an indication. We don’t care which at this stage
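The query logic in these two annotations can be sketched as follows. This is a toy, dictionary-based illustration rather than the SciBite search API: the mini-vocabularies, the entity type names and the `tag` and `matches` helpers are all hypothetical.

```python
import re

# Hypothetical mini-vocabularies: entity type -> {concept: synonyms}
VOCABS = {
    "DRUG": {"fingolimod": ["gilenya", "fingolimod", "fty720"]},
    "INDICATION": {
        "multiple sclerosis": ["multiple sclerosis"],
        "asthma": ["asthma"],
    },
}


def tag(text):
    """Return {entity_type: set of concepts mentioned in text}."""
    found = {etype: set() for etype in VOCABS}
    lowered = text.lower()
    for etype, concepts in VOCABS.items():
        for concept, synonyms in concepts.items():
            for syn in synonyms:
                # Whole-word match so short synonyms do not hit substrings.
                if re.search(rf"\b{re.escape(syn)}\b", lowered):
                    found[etype].add(concept)
    return found


def matches(text, drug="fingolimod"):
    """True if text mentions the drug (by any synonym) and any indication."""
    tags = tag(text)
    return drug in tags["DRUG"] and bool(tags["INDICATION"])


articles = [
    "FTY720 reduced relapse rates in multiple sclerosis.",
    "Gilenya sales grew last quarter.",
    "A new asthma inhaler was approved.",
]
hits = [a for a in articles if matches(a)]
```

Only the first article is returned: it never uses the word "Gilenya" but mentions a synonym together with an indication, while the other two lack either an indication or the drug.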
16. Improving searchability in an ELN
The Problem
• Search functionality often limited to keyword matching with no synonym support
• Difficult to gain an aggregate view of the innovation within the business
• Data not structured/tagged to facilitate linking with other stores
The Solution
• Extraction and semantic enrichment of ELN records transforms the knowledge into richly annotated, machine-readable data
• Interoperable output that can be delivered into various downstream environments
The Outcome
• Greatly improved ability to search and analyse internal R&D information
• Understand who is researching what and how people/groups are interacting
17. Ontology & Curation Services
• Highly experienced curation team
• Active engagement with initiatives such as Pistoia, OBO, ICBO…
• Supported by custom-developed curation software to rapidly develop and maintain new or existing ontologies
Bespoke Ontology Curation: For new domains or when using internal data sources
Enrich & Manage Public Ontologies: For example in the areas of bioassays, technologies and devices
Augment SciBite Vocabs: Expand/customise our hand-curated vocabs (gene, indication, etc…)
Engage Our Experts: Thought leadership on standards, ontologies and metadata
Bespoke Semantic Queries: For new domains or to find novel entity relationships
18. BioAssay Data Repositories
The Problem
• Legacy systems missing metadata
• Limited ability to search results in high levels of duplication
The Solution
• Retrospective generation of metadata using semantic entity recognition to find relevant terms
• Prospective auto-metadata curation using intelligent forms incorporating semantic autocomplete
• Flexible architecture to integrate seamlessly into existing systems
The Outcome
• Greater ability to find relevant information improves re-use of legacy data and reduces duplication
• Semantically annotated, interoperable assay data
19. Monitoring Patient Forums
The Problem
• Data from patient and social media forums are of increasing interest and value to researchers but are challenging to monitor
• Multiple formats, locations and structures of data make integration a difficult process
• Consumer language used in these forums doesn’t map to standardised ontologies
The Solution
• Index data irrespective of source and store in a central repository for analysis
• Customisable vocabularies can accommodate consumer language and map to existing public standards
• DOCstore provides customisable and extensible search capabilities
• Alerting function allows monitoring of relevant threads
The Outcome
• Ability to transform, integrate and analyse patient forum data alongside existing workflows
• Powerful multi-source search through a simple, easy-to-use interface
• Tailored vocabularies provide a unique search environment
20. CI/ Horizon Scanning
The Problem
• Many sources of unstructured external data are difficult to monitor and search across consistently
• Data aggregation and review is a time-consuming process
• Persisting legacy data – not all information is relevant right now
The Solution
• Index data irrespective of source and store in a central repository for analysis
• Customisable vocabularies allow for unique / proprietary search methodologies
• DOCstore provides customisable and extensible search capabilities
• Alerting function allows monitoring of many pre-defined search strategies
The Outcome
• Powerful multi-source search through a simple, easy-to-use interface
• Tailored vocabularies provide a unique search environment
• Reduce data review times by up to 80%
https://www.scibite.com/artificial-intelligence-platform/
DOCstore
News, Grants, Publications – any Data Source
Semantic enrichment + text analytics using customised vocabularies
21. Phenotypic Triangulation
The Problem
• Many diseases are understudied and lack clear molecular mechanisms
• Some entities (e.g. Phenotypes) are highly synonymous and difficult to standardise
• Scraping, standardising, and analysing research is time-consuming
The Solution
• Standardise terminology using SciBite VOCabularies
• Transform unstructured text into interoperable machine-readable data compatible with downstream applications
• Build network views of disease-phenotype mappings to identify common mechanistic pathways and shared knowledge
The Outcome
• Uncovering novel relationships in disease biology not previously evident in the source data
• Scalable, structured analysis mappable to public ontologies with the flexibility to integrate additional sources over time
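One way to picture the network view of disease-phenotype mappings is as a graph in which two diseases are linked when they share standardised phenotypes. The sketch below uses placeholder disease names and an invented mapping; `disease_phenotypes` and `shared_phenotype_edges` are hypothetical and do not reflect SciBite data or tooling.

```python
from itertools import combinations

# Placeholder disease -> phenotype annotations, assumed to be already
# standardised to a single vocabulary.
disease_phenotypes = {
    "disease_A": {"ataxia", "seizure", "hypotonia"},
    "disease_B": {"ataxia", "seizure"},
    "disease_C": {"hypotonia"},
    "disease_D": {"anaemia"},
}


def shared_phenotype_edges(mapping, min_shared=1):
    """Link every pair of diseases sharing at least min_shared phenotypes."""
    edges = {}
    for d1, d2 in combinations(sorted(mapping), 2):
        shared = mapping[d1] & mapping[d2]
        if len(shared) >= min_shared:
            edges[(d1, d2)] = shared
    return edges


edges = shared_phenotype_edges(disease_phenotypes)
```

Raising `min_shared` keeps only the more strongly connected disease pairs, which may be a useful first filter before looking for common mechanisms.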
22. Data Preparation / Cleansing
The Problem
• Many sources of internal data are ‘messy’; even structured data is not always consistently tagged
• Messy data in = Messy data out
• Cleaning/curating data is a time-consuming manual process
The Solution
• FactBio + SciBite integration = automated cleaning/annotation using highly curated vocabularies spanning life science research
• User-friendly blend of automated tagging augmented with manual review where necessary
• Flexible architecture to integrate seamlessly into existing systems
The Outcome
• Greatly reduced effort required to cleanse / prepare data for downstream utility
• Semantically annotated, interoperable assay data
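The blend of automated tagging and manual review can be sketched as a lookup that normalises known values and sets aside anything it cannot map. This is a generic illustration, not the FactBio or SciBite implementation: the `SYNONYM_TO_CANONICAL` table and `clean_column` function are hypothetical.

```python
# Hypothetical vocabulary: lower-cased synonym -> canonical label
SYNONYM_TO_CANONICAL = {
    "homo sapiens": "Homo sapiens",
    "human": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "mus musculus": "Mus musculus",
    "mouse": "Mus musculus",
}


def clean_column(values):
    """Map messy values to canonical labels; collect the rest for review."""
    cleaned, needs_review = [], []
    for value in values:
        # Normalise case and collapse repeated whitespace before lookup.
        key = " ".join(value.lower().split())
        canonical = SYNONYM_TO_CANONICAL.get(key)
        if canonical is None:
            needs_review.append(value)  # left unchanged for a curator
            cleaned.append(value)
        else:
            cleaned.append(canonical)
    return cleaned, needs_review


cleaned, needs_review = clean_column(["Human", " mouse ", "H.  sapiens", "rat?"])
```

Here three of the four values are normalised automatically and "rat?" is flagged for manual review.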