Using Automated
Workflow Tools to
Improve Wikipedia
MITCH MILLER
SCIENTIFIC THINKING
VERMONT CODE CAMP 2016
SEPTEMBER 17, 2016
Disclaimer
 This talk represents my opinion and personal experience using software
systems developed by third parties
 The software systems shown are very complex and have hundreds of
components. I have only worked with a small number.
 Every task shown today can be accomplished in multiple ways. I’m only
showing some of those ways.
Overview
 Introduction: how are we improving Wikipedia? Why are we doing this?
 The list of information we need to compile
 First method of generating the list
 The second method of generating the list
 The third method of generating the list
What chemistry does Wikipedia
contain?
 9,736 articles with the Chembox; 5,656 with the Drug box (15,392 total)
[source: https://en.wikipedia.org/wiki/Wikipedia:List_of_infoboxes]
 Chembox? Drug box?
 Templates of selected content within Wikipedia articles
 Contents of Chembox:
 Molecular structure image
 Name (systematically assigned name + synonyms)
 Identifiers: CASNo, ChEBI, ChEMBL, ChemSpiderID, DrugBank, InChI, KEGG,
PubChem, SMILES, UNII…
 Key properties
Chemical identifiers
 Different specific databases
 Individual IDs have strengths and weaknesses
 The UNII is a non-proprietary, free, unique, unambiguous, non-semantic,
alphanumeric identifier based on a substance’s composition and/or
descriptive information.
 http://www.fda.gov/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/
 UNIIs contain 9 randomly generated alphanumeric characters with a tenth
check alphanumeric character
 When two samples have the same UNII, “they represent the same molecular
entity or elements upon which the definition is based.”
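As a rough illustration of the format described above, the following Python sketch checks only that a string has the ten-character alphanumeric shape of a UNII. It does not verify the check character, because the check algorithm is not described in this talk. The function name and the restriction to digits and uppercase letters are assumptions made for the example.

```python
import re

# Assumed shape: ten characters drawn from digits and uppercase letters.
# The tenth (check) character is NOT validated here.
_UNII_SHAPE = re.compile(r"[0-9A-Z]{10}")


def looks_like_unii(candidate: str) -> bool:
    """Return True if the string has the general shape of a UNII."""
    return _UNII_SHAPE.fullmatch(candidate.strip()) is not None
```

A shape check like this can flag obviously malformed entries (for example, a CAS number pasted into the UNII field) before a subject matter expert reviews the list.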
SRS group goal
 Manages Substance Registration System (SRS)
 Assure uniformity of UNII assignments across internet resources that
reference UNIIs
The assignment
 Generate a report of all chemicals and drugs in Wikipedia
 Name, UNII (when present), CAS (when present), Wikipedia URL
 Idea: subject matter experts will review list and correct assignments, add
new UNIIs to Wikipedia as needed
 Result: more accurate Wikipedia that links to the FDA’s Substance
Registration System unambiguously
 https://fdasis.nlm.nih.gov/srs/srs.jsp
Development tool: KNIME
 Graphic, component based programming environment
 Drag functional components from palette onto canvas to create program
 Configure most components by setting parameters
 Connect components to route data from one to another
 Run and observe data traveling down the lines
 KNIME stands for KoNstanz Information MinEr
 Pronounced “Nighm”
 Originally developed at the University of Konstanz, Germany (2004)
 Currently produced by KNIME.com AG, a company in Zurich, Switzerland
 Free version available for download
 Windows, Linux, Mac
First method of report generation
 Read list of pages with each infobox
 E.g.,
https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Template:Chembox&limit=50000&from=16225610&back=0
 Retrieve each individual page mentioned in the list
 Parse HTML
 Use XPath to get Name, CAS, UNII
 The Infobox templates lead to pages with defined structure – straightforward to
parse
 Format data for output
 Write to a file
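The parsing step above can be sketched in Python. This is not the KNIME workflow used in the talk: it is a minimal stand-in that applies XPath-style lookups to a small, well-formed fragment. The sample markup, the row labels, and the function name are hypothetical, and real Wikipedia HTML is messier and would likely need a tolerant HTML parser rather than the standard-library XML parser used here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for an infobox table.
SAMPLE_INFOBOX = """
<table class="infobox">
  <caption>Ethanol</caption>
  <tr><th>CAS Number</th><td>64-17-5</td></tr>
  <tr><th>UNII</th><td>3K9958V90M</td></tr>
</table>
"""


def extract_identifiers(fragment: str) -> dict:
    """Collect the caption and every label/value row from the fragment."""
    root = ET.fromstring(fragment.strip())
    record = {"name": root.findtext("caption", default="").strip()}
    # ".//tr" is the limited XPath subset supported by ElementTree.
    for row in root.findall(".//tr"):
        label = row.findtext("th")
        value = row.findtext("td")
        if label and value:
            record[label.strip()] = value.strip()
    return record


record = extract_identifiers(SAMPLE_INFOBOX)
```

Each resulting dictionary corresponds to one row of the report; writing the rows out with the `csv` module would cover the final two steps.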
First method: pluses/minuses
 Plus: it works
 Minus: had to run in batches to get all records
 Minus: XPath parsing was more cumbersome than expected
 Minus: misses some data
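The batching mentioned above might look like the following in Python. This is only an illustration of splitting a long list of page titles into fixed-size groups; the helper name and the batch size are assumptions, and the talk does not say how the batches were actually defined in KNIME.

```python
from typing import Iterator, List, Sequence


def in_batches(titles: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` titles."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(titles), size):
        yield list(titles[start:start + size])
```

Processing one batch at a time keeps a failed retrieval from forcing a restart of the whole run.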
The Semantic Web
 A connected set of data resources that can be understood by machines
 Data encoded in a standard way that allows unattended processors to
traverse links from one entity to another across organizational and
geographic boundaries
 [Standard WWW is a web of documents meant to be understood by
humans]
 Tim Berners-Lee has a great TED talk on the semantic web
 https://www.youtube.com/watch?v=OM6XIICm_qo
Understand Semantic Web in
comparison to WWW
 Compare pages on same subject:
 Wikipedia article on ethanol: https://en.wikipedia.org/wiki/Ethanol
 Wikidata page on ethanol: https://www.wikidata.org/wiki/Q153
Technological foundations of Semantic
Web
 RDF – Resource Description Framework – organizing facts as
 Subject – Predicate – Object
 Conceptual example:
 [Ethanol] [has a boiling point] [173 degrees Fahrenheit]
 Coded example:
 wd:Q153 wdt:P2102 "173±1 degree Fahrenheit" .
 Represented in Turtle - Terse RDF Triple Language
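To make the subject, predicate, object pattern concrete, here is a small Python sketch that models a triple and prints it as one Turtle-style statement. The class and function names are invented for the example, and the object is written as a plain string literal, which simplifies how Wikidata actually stores quantities with units.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


def to_turtle(triple: Triple) -> str:
    """Render the triple as a single Turtle-style line ending in ' .'."""
    return f"{triple.subject} {triple.predicate} {triple.obj} ."


# Ethanol (Q153) has boiling point (P2102) 173, written as a plain literal.
boiling_point = Triple("wd:Q153", "wdt:P2102", '"173"')
line = to_turtle(boiling_point)
```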
SPARQL
 Query language for RDF data
 SPARQL Protocol and RDF Query Language
 Similar to SQL
 Syntax based on the RDF triple
Wikidata
 Conceptually: semantic web version of Wikipedia
 Add grain of salt
 “Free and open knowledge base that can be read and edited by both
humans and machines.”
 Designed as ‘central storage’ for Wikipedia and other Wikimedia projects
 Approximately: programmatic interface to Wikipedia
 See https://query.wikidata.org/
 Run the example queries
Second method
 Search Wikidata programmatically for chemical information
 Wikidata SPARQL interface
 Format list
 Write file
SPARQL for chemical and
pharmaceutical compounds
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX bd: <http://www.bigdata.com/rdf#>
# All chemicals with, optionally, CAS registry numbers and UNIIs in Wikidata
SELECT DISTINCT ?compound ?compoundLabel ?formula ?unii ?pubchem ?cas WHERE {
  ?compound wdt:P31 wd:Q11173 .
  OPTIONAL { ?compound wdt:P231 ?cas . }
  OPTIONAL { ?compound wdt:P274 ?formula . }
  OPTIONAL { ?compound wdt:P652 ?unii . }
  OPTIONAL { ?compound wdt:P662 ?pubchem . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
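A query like the one above can also be sent from a script rather than from KNIME. The Python sketch below is one possible way to do that with the standard library: it builds the request URL for the public endpoint at https://query.wikidata.org/sparql and flattens the standard SPARQL JSON results into simple rows. The function names, the User-Agent string, and the sample payload are assumptions for illustration; `run_query` is defined but deliberately not called here, since it needs network access.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def build_url(query: str) -> str:
    """Encode the SPARQL text as a GET request asking for JSON results."""
    return ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})


def flatten_bindings(payload: dict) -> list:
    """Turn SPARQL JSON results into one dict per row; unbound variables become None."""
    columns = payload["head"]["vars"]
    return [
        {column: binding.get(column, {}).get("value") for column in columns}
        for binding in payload["results"]["bindings"]
    ]


def run_query(query: str) -> list:
    """Send the query to the endpoint (requires network access)."""
    request = urllib.request.Request(
        build_url(query),
        headers={"User-Agent": "chem-report-sketch/0.1 (example)"},  # hypothetical
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return flatten_bindings(json.load(response))


# Hand-written sample in the SPARQL JSON results layout, for illustration only.
SAMPLE_PAYLOAD = {
    "head": {"vars": ["compound", "compoundLabel", "unii", "cas"]},
    "results": {"bindings": [{
        "compound": {"type": "uri", "value": "http://www.wikidata.org/entity/Q153"},
        "compoundLabel": {"type": "literal", "value": "ethanol"},
        "unii": {"type": "literal", "value": "3K9958V90M"},
    }]},
}
rows = flatten_bindings(SAMPLE_PAYLOAD)
```

Because the identifiers are wrapped in `OPTIONAL`, a compound without a CAS number or UNII still appears in the results; the flattening step records those gaps as `None` so they are easy to spot in the report.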
Second method: pluses/minuses
 Fast and easy!
 Data arrives in a format we can use – no parsing!
 Minus:
 *some* Wikidata data does not match up with Wikipedia!
Third method
 Hybrid approach
 Use Wikidata SPARQL query to get list of chemicals
 Query Wikipedia for individual items to compare values
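The comparison step of the hybrid approach could be sketched as follows in Python. The record layout (dictionaries keyed by `cas` and `unii`) and the function name are assumptions for the example; the talk does not describe how the comparison was implemented.

```python
from typing import Dict, Optional, Sequence, Tuple

Record = Dict[str, Optional[str]]


def find_mismatches(
    wikidata_row: Record,
    wikipedia_row: Record,
    fields: Sequence[str] = ("cas", "unii"),
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {field: (wikidata_value, wikipedia_value)} for fields that differ."""
    mismatches = {}
    for field in fields:
        from_wikidata = wikidata_row.get(field)
        from_wikipedia = wikipedia_row.get(field)
        if from_wikidata != from_wikipedia:
            mismatches[field] = (from_wikidata, from_wikipedia)
    return mismatches


# Illustrative records: the Wikipedia side is missing the UNII.
wikidata_ethanol = {"cas": "64-17-5", "unii": "3K9958V90M"}
wikipedia_ethanol = {"cas": "64-17-5", "unii": None}
differences = find_mismatches(wikidata_ethanol, wikipedia_ethanol)
```

An empty result means the two sources agree on the compared fields; anything else is a candidate for expert review.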
Conclusion
 Using Wikidata, Wikipedia and KNIME we compiled a list of chemicals with
the required data
 Subject matter experts are in the process of updating Wikipedia
 Semantic web technology made the job easier!
 Thank you!
References
 Scholarly article on KNIME and Pipeline Pilot
 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3414708/
 KNIME
 www.knime.org
 Wikipedia
 https://en.wikipedia.org/wiki/Template:Chembox
 https://en.wikipedia.org/wiki/Wikipedia:Chemical_infobox
 Wikidata: https://query.wikidata.org
Who is your speaker?
 Mitch Miller, Ph.D. in Chemistry and 20+ years of IT experience
 Independent consultant: Scientific Thinking, LLC
 mitch.miller@thinkscience.us
 Some recent projects
 Ongoing custodian of one chemical database implementation for ChemIDplus
project within the National Library of Medicine
 Reporting systems
 Web service to link collaborative object management system to reporting
system
 Import wizard for chemical array designer
 Merged a set of chemical databases and harmonized data

Using Automated Workflow Tools to Improve Wikipedia Chemistry Data

  • 1. Using Automated Workflow Tools to Improve Wikipedia MITCH MILLER SCIENTIFIC THINKING VERMONT CODE CAMP 2016 SEPTEMBER 17, 2016
  • 2. Disclaimer  This talk represents my opinion and personal experience using software systems developed by third parties  The software systems shown are very complex and have hundreds of components. I have only worked with a small number.  Every task shown today can be accomplished in multiple ways. I’m only showing some of those ways.
  • 3. Overview  Introduction: how are we improving Wikipedia? Why are we doing this?  The list of information we need to compile  First method of generating the list  The second method of generating the list  The third method of generating the list
  • 4. What chemistry does Wikipedia contain?  9,736 articles with the Chembox; 5,656 with the Drug box (15, 392 total) [source: https://en.wikipedia.org/wiki/Wikipedia:List_of_infoboxes]  Chembox? Drug box?  Templates of selected content within Wikipedia articles  Contents of Chembox:  Molecular structure image  Name (systematically assigned name + synonyms)  Identifiers: CASNo, ChEBI, ChEMBL, ChemSpiderID, DrugBank, InChI, KEGG, PubChem, SMILES, UNII…  Key properties
  • 5. Chemical identifiers  Different specific databases  Individual IDs have strengths and weakness  The UNII is a non- proprietary, free, unique, unambiguous, non semantic, alphanumeric identifier based on a substance’s composition and/or descriptive information.  http://www.fda.gov/ForIndustry/DataStandards/SubstanceRegistrationSystem- UniqueIngredientIdentifierUNII/  UNIIs contain 9 randomly generated alphanumeric characters with a tenth check alphanumeric character  When two samples have the same UNII, “they represent the same molecular entity or elements upon which the definition is based.”
  • 6. SRS group goal  Manages Substance Registration System (SRS)  Assure uniformity of UNII assignments across internet resources that reference UNIIs
  • 7. The assignment  Generate a report of all chemicals and drugs in Wikipedia  Name, UNII (when present), CAS (when present),Wikipedia URL  Idea: subject matter experts will review list and correct assignments, add new UNIIs to Wikipedia as needed  Result: more accurate Wikipedia that links to the FDA’s Substance Registration System unambiguously  https://fdasis.nlm.nih.gov/srs/srs.jsp
  • 8. Development tool: KNIME  Graphic, component based programming environment  Drag functional components from palette onto canvas to create program  Configure most components by setting parameters  Connect components to route data from one to another  Run and observe data traveling down the lines  KNIME stands for KoNstanz Information MinEr  Pronounced “Nighm”  Originally a production of the University of Konstanz, Germany 2004  Currently produced by KNIME.com AG, a company in Zurich, Switzerland  Free version available for download  Windows, Linux, Mac
  • 9. First method of report generation  Read list of pages with each infobox  E.g., https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Template:Ch embox&limit=50000&from=16225610&back=0  Retrieve each individual page mentioned in the list  Parse HTML  Use Xpath to get Name, CAS, UNII  The Infobox templates lead to pages with defined structure – straightforward to parse  Format data for output  Write to a file
  • 10. First method: pluses/minuses  Plus: it works  Minus: had to run in batches to get all records  Minus: XPath parsing was more cumbersome than expected  Minus: misses some data
  • 11. The Semantic Web  A connected set of data resources that can be understood by machines  Data encoded in a standard way that allows unattended processors to traverse links from one entity to another across organizational and geographic boundaries  [Standard WWW is a web of documents meant to be understood by humans]  Tim Berners-Lee has a great TED talk on the semantic web  https://www.youtube.com/watch?v=OM6XIICm_qo
  • 12. Understand Semantic Web in comparison to WWW  Compare pages on same subject:  Wikipedia article on ethanol: https://en.wikipedia.org/wiki/Ethanol  Wikidata page on ethanol: https://www.wikidata.org/wiki/Q153
  • 13. Technological foundations of Semantic Web  RDF – Resource Description Framework – organizing facts as  Subject – Predicate – Object  Conceptual example:  [Ethanol] [has a boiling point] [173 degrees Fahrenheit]  Coded example:  wd:Q153 wdt:P2102 "173±1 degree Fahrenheit" .  Represented in Turtle - Terse RDF Triple Language
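The subject, predicate, object idea can be made concrete with a few lines of Python. This is a toy sketch, not an RDF library: the triples are illustrative values taken from these slides rather than a Wikidata export, and the helper `match` is a hypothetical name. Passing `None` for a position plays the role of a variable in a query.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

# Illustrative triples only (not a Wikidata dump).
TRIPLES = [
    ("wd:Q153", "wdt:P2102", "173 degree Fahrenheit"),  # ethanol, boiling point
    ("wd:Q153", "wdt:P31", "wd:Q11173"),                 # assumed for illustration
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that agrees with the given positions; None matches anything."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple
```

For instance, `list(match(TRIPLES, p="wdt:P2102"))` returns every boiling point statement in the list. A query language for RDF generalizes this pattern matching to several triple patterns joined on shared variables.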
  • 14. SPARQL  Query language for RDF data  SPARQL Protocol and RDF Query Language  Similar to SQL  Syntax based on the RDF triple
  • 15. Wikidata  Conceptually: semantic web version of Wikipedia  Add grain of salt  “Free and open knowledge base that can be read and edited by both humans and machines.”  Designed as ‘central storage’ for Wikipedia and other Wikimedia projects  Approximately: programmatic interface to Wikipedia  See https://query.wikidata.org/  Run the example queries
  • 16. Second method  Search Wikidata programmatically for chemical information  Wikidata SPARQL interface  Format list  Write file
  • 17. SPARQL for chemical and pharmaceutical compounds
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX wikibase: <http://wikiba.se/ontology#>
    PREFIX bd: <http://www.bigdata.com/rdf#>
    # All chemicals with, optionally, CAS registry numbers and UNIIs in Wikidata
    SELECT DISTINCT ?compound ?compoundLabel ?formula ?unii ?pubchem ?cas
    WHERE {
      ?compound wdt:P31 wd:Q11173 .
      OPTIONAL { ?compound wdt:P231 ?cas . }
      OPTIONAL { ?compound wdt:P274 ?formula . }
      OPTIONAL { ?compound wdt:P652 ?unii . }
      OPTIONAL { ?compound wdt:P662 ?pubchem . }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
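Outside KNIME, a query like this one can be sent to the Wikidata query service over HTTP and the answer read as JSON. The sketch below is an assumption about how one might do that in Python, not the workflow used in the talk. It builds the request URL and flattens a response in the standard SPARQL JSON results layout; it makes no network call, and `SAMPLE_RESPONSE` is a hand-written stand-in for a real reply.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"


def build_url(query: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def rows_from_results(payload: str) -> list:
    """Flatten SPARQL JSON results into one dict per row.

    Variables left unbound by an OPTIONAL clause are absent from a binding;
    they are reported here as None.
    """
    data = json.loads(payload)
    columns = data["head"]["vars"]
    rows = []
    for binding in data["results"]["bindings"]:
        rows.append({c: binding[c]["value"] if c in binding else None for c in columns})
    return rows


# Hand-written stand-in for a service reply (illustrative values).
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["compound", "cas", "unii"]},
    "results": {"bindings": [{
        "compound": {"type": "uri", "value": "http://www.wikidata.org/entity/Q153"},
        "cas": {"type": "literal", "value": "64-17-5"},
    }]},
})
```

`rows_from_results(SAMPLE_RESPONSE)` gives a single row whose `unii` is `None`, which is how a compound without a UNII in Wikidata would show up in the report.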
  • 18. Second method: pluses/minuses  Plus: fast and easy!  Plus: data arrives in a format we can use – no parsing!  Minus: *some* Wikidata data does not match up with Wikipedia!
  • 19. Third method  Hybrid approach  Use Wikidata SPARQL query to get list of chemicals  Query Wikipedia for individual items to compare values
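The comparison step of the hybrid approach amounts to joining two record sets on the article and flagging fields that disagree. The following sketch assumes both sources have already been reduced to lists of dicts with hypothetical keys `title`, `cas` and `unii`; the talk does not specify the actual record layout or join key.

```python
def find_mismatches(wikidata_rows, wikipedia_rows, fields=("cas", "unii")):
    """Return (title, field, wikidata_value, wikipedia_value) for every disagreement.

    Only titles present in both inputs are compared.
    """
    wikipedia_by_title = {row["title"]: row for row in wikipedia_rows}
    mismatches = []
    for wd_row in wikidata_rows:
        wp_row = wikipedia_by_title.get(wd_row["title"])
        if wp_row is None:
            continue  # no Wikipedia record to compare against
        for field in fields:
            if wd_row.get(field) != wp_row.get(field):
                mismatches.append((wd_row["title"], field, wd_row.get(field), wp_row.get(field)))
    return mismatches
```

The resulting list is the kind of worklist that subject matter experts could review: each entry names the article, the field in dispute, and the two competing values.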
  • 20. Conclusion  Using Wikidata, Wikipedia and KNIME we compiled a list of chemicals with the required data  Subject matter experts are in the process of updating Wikipedia  Semantic web technology made the job easier!  Thank you!
  • 21. References  Scholarly article on KNIME and Pipeline Pilot  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3414708/  KNIME  www.knime.org  Wikipedia  https://en.wikipedia.org/wiki/Template:Chembox  https://en.wikipedia.org/wiki/Wikipedia:Chemical_infobox  Wikidata: https://query.wikidata.org
  • 22. Who is your speaker?  Mitch Miller, Ph.D. in Chemistry and 20+ years of IT experience  Independent consultant: Scientific Thinking, LLC  mitch.miller@thinkscience.us  Some recent projects  Ongoing custodian of one chemical database implementation for ChemIDplus project within the National Library of Medicine  Reporting systems  Web service to link collaborative object management system to reporting system  Import wizard for chemical array designer  Merged a set of chemical databases and harmonized data