We will present the updated Allotrope framework and cover .adf files and how they are used. We’ll demonstrate semantic modeling in .adf (OWL models plus the SHACL constraint language). We’ll show how the data description layer in .adf can be extended via a “semantic hub” that we call Reference Master Data Management (RMDM), which can be used across the enterprise. RMDM provides a means to integrate metadata about any data source within your enterprise, including structured, semi-structured and unstructured data. Customer examples from current project work will be given where possible. Last, we’ll show that this approach scales: data science techniques can be employed beyond just the metadata, which we refer to as Big Analysis.
From Allotrope to Reference Master Data Management
1. V.2.2
Eric Little, PhD
Chief Data Officer
OSTHUS
eric.little@osthus.com
From Allotrope to Reference Master
Data Management:
How semantic metadata in .adf can
be extended across the enterprise
2. Slide 2
LIMS
Studies
Registration
The Silo Situation: Expensive, Ineffective and Error Prone
ISO3: DEU, FRA, …
Country: Germany, France, …
ISO3-num: 276, 250, …
?
?
• Applications use different names for the same things.
• Data exchange is expensive and limited (mapping knowledge in interfaces).
3. Slide 3
Situation with Semantic Reference Master Data Management
LIMS
Study
Management
Product
Registration
DataGovernance
Semantic Reference Master Data System
“France”@en
“FRA”
“250”
…
“EU”
“European Union”@en
registered
“AAFYZ-1217”
products locations
Value of Semantics:
• standardized naming conventions
for your core entities
• standardized meta models
(vendor agnostic)
• reuse of public ontologies (see
e.g., BioPortal)
• well defined hierarchies
• synonyms & mappings
• qualified relationships
• flexibility of graph models
• rules and inference
• data validation
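The "synonyms & mappings" point can be illustrated with a minimal sketch, assuming a tiny in-memory list of triples in place of a real semantic reference master data system. The identifiers `ex:France` and `ex:Germany`, the predicate names, and the helper functions are hypothetical; only the labels and codes come from the slides.

```python
# Hypothetical reference data as (subject, predicate, value) triples.
# The labels and codes are the ones shown on slides 2 and 3.
TRIPLES = [
    ("ex:France", "label_en", "France"),
    ("ex:France", "iso3", "FRA"),
    ("ex:France", "iso3_num", "250"),
    ("ex:Germany", "label_en", "Germany"),
    ("ex:Germany", "iso3", "DEU"),
    ("ex:Germany", "iso3_num", "276"),
]


def build_index(triples):
    """Map every known name or code to the entity it identifies."""
    index = {}
    for subject, _predicate, value in triples:
        index[value.casefold()] = subject
    return index


def resolve(term, index):
    """Return the shared entity for an application-specific term, or None."""
    return index.get(term.strip().casefold())


INDEX = build_index(TRIPLES)

# Three applications, three spellings, one shared entity.
for term in ("FRA", "France", "250"):
    print(term, "->", resolve(term, INDEX))
```

Because each application resolves its own spelling against the shared hub, the mapping knowledge no longer has to live in every point-to-point interface.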
4. Slide 4
Documents are processed for term/concept extraction
Extracted concepts are checked for accuracy
A Gold Standard Doc is created by a human – fully accurate reading
Documents are re-run based on human/machine corrections
Machine Learning improves performance over time
How Text Extraction Basically works (highly simplified version)
Documents
Text
Analytics
Engine
• Strains
• Persons
• Organizations
• Seasons
• Locations
• Etc…
Gold
Standard
Document
Extracted
Entities
Human In
The Loop
Feedback Loop for
Learning/Improvement
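The accuracy check against the gold standard can be sketched as follows. Precision and recall are used here as one common way to score an extraction run; the slide does not say which measure is actually applied, and the example entities and the `score_extraction` helper are hypothetical.

```python
def score_extraction(extracted, gold):
    """Compare machine-extracted entities with a human-made gold standard."""
    extracted, gold = set(extracted), set(gold)
    true_positives = extracted & gold
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    recall = len(true_positives) / len(gold) if gold else 0.0
    return {
        "precision": precision,
        "recall": recall,
        # Entities the engine missed: candidates for the feedback loop.
        "missed": sorted(gold - extracted),
        # Entities the engine reported that the human reader did not.
        "spurious": sorted(extracted - gold),
    }


# Hypothetical (type, text) pairs using entity types named on the slide.
gold = {
    ("Person", "Eric Little"),
    ("Organization", "OSTHUS"),
    ("Location", "France"),
}
extracted = {
    ("Person", "Eric Little"),
    ("Organization", "OSTHUS"),
    ("Location", "Paris"),
}

report = score_extraction(extracted, gold)
print(report)
```

The `missed` and `spurious` lists are what a human in the loop would review before the documents are re-run.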
5. Slide 5
Extracted entities from the text source are stored in a DB or File Store
They are mapped to other data
Legacy RDBs
Semantic Models (shown here)
Other data sources
The semantic model adds context to the extracted information
A term can now be related to other objects from other sources
Linking to Semantics (Knowledge Graph)
Semantic Model
Documents
Text
Analytics
Engine
• Strains
• Persons
• Organizations
• Seasons
• Locations
• Etc…
Extracted
Entities
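How a semantic model adds context to an extracted term can be sketched as below. This is an illustration only: the edge list, the relation names `memberOf` and `registeredIn`, and the `ex:` identifiers are assumptions loosely based on the labels shown on slide 3, not the actual model.

```python
# Hypothetical semantic model: (subject, relation, object) edges.
MODEL = [
    ("ex:France", "memberOf", "ex:EU"),
    ("ex:Product_AAFYZ-1217", "registeredIn", "ex:France"),
]

# Hypothetical labels that map extracted text to model entities.
LABELS = {
    "france": "ex:France",
    "eu": "ex:EU",
    "european union": "ex:EU",
}


def link(term):
    """Link an extracted term to a model entity, or None if unknown."""
    return LABELS.get(term.strip().casefold())


def context(entity):
    """Return every edge in which the entity is subject or object."""
    return [(s, r, o) for (s, r, o) in MODEL if entity in (s, o)]


entity = link("France")
for edge in context(entity):
    print(edge)
```

Once the extracted term "France" is linked, it is related both to a region and to a registered product, which are objects that came from other sources.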
6. Slide 6
A Semantic Framework can connect the entire enterprise using a common semantics
The Semantic Hub should only focus on metadata (not instance level data)
Benefits: Common Terms, Models, Queries, Rules and Results (End-to-End)
Integrating Data Across the Enterprise
Lab Instruments, Clinical Trials, Regulatory Affairs, Production, eArchiving

8. Slide 8
Allotrope Structure 2017
Astrix Technology Group
BSSN Software
Elemental Machines
Erasmus MC
Fraunhofer IPA
The HDF Group
LabAnswer
LabWare
Mettler Toledo
NIST
SciBite
Stanford University
University of Illinois at Chicago
University of Southampton
10. Slide 10
Allotrope Data Format (ADF)
HDF5
Platform Independent File Format
Allotrope Data Format (ADF)
Descriptive metadata about
• Method, instrument, sample,
process, result, etc.
• Provenance, audit trail
• Data Cube, Data Package
Analytical data represented by
one- or multidimensional arrays
of homogeneous data structures.
Analytical data represented by
arbitrary formats, incl. native
instrument formats, images,
pdf, video, etc.
Specifically designed to store
and organize large amounts
of scientific data.
Data Description
Semantic Graph Model
Data Cubes
Universal Data Container
Data Package
Virtual File System
APIs (Java & .NET class libraries)
Chromatogram 2D HDF
13. Slide 13
expected answer
specified percentage of components,
e.g. 25% A, 75% B
specified composition of components,
e.g. A = 0.5 mol/L Acetonitrile, B = Methanol
specified qualities of chemical compounds
What mobile phase is required ?
MeCN/MeOH 40/60
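A shorthand answer such as "MeCN/MeOH 40/60" packs component names and percentages into one string. A minimal sketch of turning it into structured data follows; the `Component` class and `parse_mobile_phase` function are hypothetical, and reading the numbers as percentages of the whole is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    percent: float


def parse_mobile_phase(text):
    """Parse a string such as 'MeCN/MeOH 40/60' into components."""
    # Expect exactly two whitespace-separated parts: names and ratios.
    names_part, ratios_part = text.split()
    names = names_part.split("/")
    ratios = [float(r) for r in ratios_part.split("/")]
    if len(names) != len(ratios):
        raise ValueError("each component needs exactly one ratio")
    total = sum(ratios)
    if total <= 0:
        raise ValueError("ratios must sum to a positive number")
    # Normalise so the percentages always add up to 100.
    return [Component(n, 100.0 * r / total) for n, r in zip(names, ratios)]


phase = parse_mobile_phase("MeCN/MeOH 40/60")
for component in phase:
    print(component.name, component.percent)
```

A structured form like this is what lets a specification be compared against the composition actually used in an experiment.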
14. Slide 14
What mobile phase is required ?
specification of
mobile phase
composition
of
mobile
phase
device
experiment
19. Slide 19
Manual state of batch comparison
Final Step
Manual Report
LIMS
Purity Summary –
Crude to Drug Prod
Batch Comparison Table
Early/Late
Impurities
Batch to Batch
Comparison
Submission/Sample #
Embedded in ELN
Analyst
ELN
Manual Communication
SME
• Significant amounts of
manual effort
• Disconnected data sources
• Locally stored information
• Lack of traceability
• Data is difficult to interpret
or manipulate
Instrument
Data
Inst.
File
DB
% Purity
Full Length
Prod
% Indiv
Impurities
• No Automation
• Limited Batch Comparisons
can be produced
• Limited Distribution
20. Slide 20
Integration for batch comparison
Final Step
Manual Report
LIMS
Purity Summary –
Crude to Drug Prod
Batch Comparison Table
Early/Late
Impurities
Batch to Batch
Comparison
Submission/Sample #
Embedded in ELN
Analyst
ELN
Manual Communication SME
• Shows data integration
capabilities from LIMS + ELN
data
• Utilizes important metadata
• Metadata is a key component of
ADF files (Data Description)
Instrument
Data
Inst. File
DB
% Purity
Full Length
Prod
% Indiv
Impurities
• Can be expanded to include
all Batch Comparison steps
• Provides Integration +
Automation over time
21. Slide 21
Moving to “product genealogy”
• ZONTAL integrates data across the
enterprise
• Reporting and visibility utilizes the
entire Data Lake
• Instrument data is captured via the
Allotrope Framework
• Expanded to include all scientific
data feeding into ELN, LIMS, etc.
Enterprise-Wide User Community
22. Slide 22
Benefits of Data Lifecycle Management
Cost Saving Measures:
• Scientists spend more time doing science – not computer science
• Data can be generated and found easily – saves time/money
• Conceptual information is more easily shared/understood upstream
and downstream (with traceability)
• Faster project decisions can be made (with more complete data)
• Managing data/projects across multiple locations/labs is easier
• Integration provides a more complete picture
Innovation:
• Leading your organization to better leverage the value of Data
Science
• Adopting new technologies fosters new ideas and breakthroughs
• 86% of CEOs surveyed said “technological advances will transform
business the most over the next 5 years” (PWC, Jan 2014)