Embedding Metadata and Other Semantics In Word-Processing Documents

Peter Sefton (University of Southern Queensland)
Ian Barnes (Australian National University)
Ron Ward (University of Southern Queensland)
Jim Downing (University of Cambridge) (presenting)


                                         [breath]


The paper supporting this presentation provides important detail and can be obtained from
http://www.dspace.cam.ac.uk/handle/1810/206423
Agenda

Motivations

Axioms of choice

Interoperability is Hard

The approach

Examples (+ chemistry)

                           http://www.flickr.com/photos/forezt/524108228
Why is this interesting?


We want to move towards semantically-rich
documents for e-Research. In some disciplines 100%
of documents start life in a word processor.

Introducing real-world constraints yields interesting results.
Semantically Rich Documents


         Enable automation

         Prevent information loss

         Better discovery

         Improved presentation




Automation - zero-click upload, no filling in redundant forms, etc.
Information loss - rich data is otherwise reduced to tables and images.
Semantic information leads to richer alternatives for discovery and communication of
research.
Fully Supported Research - all the supporting data is delivered with the text.
Constraints
Work in the real world, today
                                                http://www.flickr.com/photos/amirjina/2281612876
The solution had to work in ICE - the Integrated Content Environment, a distributed authoring
system in production at USQ.

Therefore the approach is PRAGMATIC!
Real World

Metadata, semantics and data not easily distinguished

Document creation == Metadata creation

 Not separable activities

 Metadata is in the document

Documents have multiple, distributed authors
Tools and Formats

Microsoft Word [Adoption]

OpenOffice.org Writer [Access]

ICE - Integrated Content Environment

.doc, .docx, OOXML, ODF

HTML, PDF
The Difference Between Standards and Interoperability




This is the test that semantic solutions must pass to be useful in production - once
semantics are created, they must survive when the document is edited in the wild.
This is the simple subset of document interop we’re talking about, including only Word and
OOo Writer.

In the wild you can’t control what formats people use to save, or the software they use.

If any of these routes destroys semantics, then we’ve lost interoperability.

There are a lot of standards already involved in this space, but none of them on their own
deliver semantic data interoperability.
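
The survival test described above can be sketched as a comparison of the styled spans found in a document before and after a round trip. This is only an illustration, not ICE's actual test harness: the `(style_name, text)` pair representation and the function name `lost_semantics` are assumptions made for the example, and reading the pairs out of real Word or Writer files is left out.

```python
from collections import Counter


def lost_semantics(before, after):
    """Return the styled spans present before a round trip but missing after it.

    Both arguments are lists of (style_name, text) pairs, a hypothetical
    representation of what an extractor would read from a saved document.
    """
    # Counter subtraction keeps only spans that occur more often in `before`.
    missing = Counter(before) - Counter(after)
    return sorted(missing.elements())


original = [("p-meta-author", "Ian Barnes"), ("p-meta-affiliation", "ANU")]
# Hypothetical outcome of one save/open route that flattened a custom style.
round_tripped = [("p-meta-author", "Ian Barnes"), ("Default", "ANU")]

print(lost_semantics(original, round_tripped))
```

A route counts as interoperable only if this list is empty for every combination of editor and file format that authors might use.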
Interoperability in Publishing




PDF - scholarly publishing now
HTML - the medium term future of scholarly publishing.

A converter is needed since HTML and PDF creation in OOo Writer and MS Word produces pretty
poor results.
http://www.flickr.com/photos/druclimb/289636172


When you apply these interoperability constraints, the solution space gets very small.
<metaphor>Like walking along a ridge, keep it simple and take small steps. The paths off to
the side lead quickly to peril.</metaphor>
Approaches Ruled Out
MS Word “Smart Tags”

 No interop with OOo, but not necessarily a bad idea

MS Word foreign namespace XML encoding

 Expensive, no interop with OOo, lock-in issues

ODF 1.2 embedded semantics

 No Word equivalent in sight

Things that would destroy WYSIWYG such as using wiki
markup in the word processor.
Define A New Encoding Standard?




Codifying a standard wouldn’t work unless vanilla word-processing software can be shown not
to destroy the information.

For delivering interoperability in this area, standards are not sufficient.
Microformats!




      http://www.flickr.com/photos/onion/2046003604
Encoding Microformats

         Tables: for, like, tabulating things

         Styles: The original extensible inline semantic
         mechanism for word processing and still working!

         Links

         Frames: fragile

         Bookmarks and fields: require lots of field testing, not
         all that reliable in an interop situation


The paper contains much more detail about the mechanism.
Styles




The style approach is:
 * Simple
 * Metadata schema agnostic
 * User extensible

It doesn’t /need/ any plugin / customized software to work.
Style: p-meta-author
Style: p-meta-affiliation
Style: p-meta-issued
Style: p-meta-abstract

Styles can be nested by placing inline styled text within styled paragraphs.
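
As a rough sketch of how style names like these could be turned into structured metadata, the function below walks a list of paragraphs and collects every paragraph whose style starts with `p-meta-`. The input shape (paragraph style plus a list of inline runs), the `i-meta-` prefix for inline styles, and the `name` key are assumptions made for this illustration; the paper describes the real mechanism.

```python
def extract_metadata(paragraphs):
    """Collect metadata from paragraphs whose style names start with 'p-meta-'.

    Each paragraph is (paragraph_style, runs) and each run is
    (inline_style, text), with inline_style None for unstyled text.
    This shape is hypothetical, not a real word-processor API.
    """
    metadata = {}
    for style, runs in paragraphs:
        if not style.startswith("p-meta-"):
            continue  # ordinary body text carries no metadata
        field = style[len("p-meta-"):]
        nested = {}
        plain = []
        for inline_style, text in runs:
            if inline_style and inline_style.startswith("i-meta-"):
                # Nested inline style: a sub-field of the enclosing paragraph.
                nested[inline_style[len("i-meta-"):]] = text.strip()
            else:
                plain.append(text)
        value = "".join(plain).strip(" ,")
        if nested:
            metadata.setdefault(field, []).append({"name": value, **nested})
        else:
            metadata.setdefault(field, []).append(value)
    return metadata


doc = [
    ("Title", [(None, "Metadata in ICE documents")]),
    ("p-meta-author", [(None, "Ian Barnes, "), ("i-meta-affiliation", "ANU")]),
    ("p-meta-author", [(None, "Peter Sefton, "), ("i-meta-affiliation", "USQ")]),
    ("p-meta-abstract", [(None, "A sample abstract.")]),
]

print(extract_metadata(doc))
```

Because the field name comes straight from the style name, the sketch stays schema agnostic: a user who invents a new `p-meta-` style gets a new field without any change to the code.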
{ 'title': ['Metadata in ICE documents'],
  'author': [{'name': 'Ian Barnes',
              'affiliation': 'ANU'},
             {'name': 'Peter Sefton',
              'affiliation': 'USQ'}]
}




Tables are also useful since the layout implies semantics
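
To illustrate how table layout can imply semantics, the sketch below reads a two-column table in which the left cell names a property and the right cell holds its value. The row representation (a list of cell-text pairs) and the function name `table_to_metadata` are assumptions for the example, not part of ICE.

```python
def table_to_metadata(rows):
    """Turn (label, value) table rows into a dict of lists.

    The left column is read as the property name and the right column as
    its value; repeated labels accumulate into one multi-valued field.
    """
    metadata = {}
    for label, value in rows:
        key = label.strip().rstrip(":").lower()
        value = value.strip()
        if key and value:  # skip empty or half-filled rows
            metadata.setdefault(key, []).append(value)
    return metadata


rows = [
    ("Title:", "Metadata in ICE documents"),
    ("Author", "Ian Barnes"),
    ("Author", "Peter Sefton"),
    ("", ""),
]

print(table_to_metadata(rows))
```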
Toolbars




The toolbars are implemented for Word and Writer. They provide easy access to the common
microformat encoding styles and structures. They also contain macros for communicating
with the ICE system, and uploading the document to the Institutional Repo / publisher system
etc.
http://www.flickr.com/photos/jima/460348206


To make it even easier, templates can be used that include sample text in the relevant places
- all the user has to do is replace the sample text.
Dublin Core metadata can be extracted directly from the document.
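
One way such an extraction could be surfaced is as Dublin Core `<meta>` elements in the HTML rendition. The sketch below assumes the metadata has already been collected into a dict of lists; the field-to-element mapping in `DC_FIELDS` and the function name are illustrative choices, not ICE's actual output.

```python
from html import escape

# Hypothetical mapping from extracted field names to Dublin Core elements.
DC_FIELDS = {
    "title": "title",
    "author": "creator",
    "issued": "date",
    "abstract": "description",
}


def dublin_core_meta(metadata):
    """Render a metadata dict as a list of Dublin Core HTML header lines."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for field, values in metadata.items():
        element = DC_FIELDS.get(field)
        if element is None:
            continue  # no Dublin Core equivalent in this sketch
        for value in values:
            # Structured values (e.g. an author with an affiliation) are
            # reduced to their name, since simple DC holds plain strings.
            text = value["name"] if isinstance(value, dict) else value
            lines.append(
                f'<meta name="DC.{element}" content="{escape(text, quote=True)}">'
            )
    return lines


metadata = {
    "title": ["Metadata in ICE documents"],
    "author": [{"name": "Ian Barnes", "affiliation": "ANU"}],
}

for line in dublin_core_meta(metadata):
    print(line)
```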
RDF metadata using the ORE vocabulary can be extracted in the same way.
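
For readers unfamiliar with ORE, the sketch below writes a minimal resource map as N-Triples: a resource map describes an aggregation, and the aggregation aggregates the document's parts. The ORE term URIs are the published ones; the `example.org` URIs and the function name `resource_map_triples` are placeholders, and this is not the serialisation ICE itself produces.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def resource_map_triples(rem_uri, aggregation_uri, parts):
    """Return N-Triples lines for a minimal ORE resource map."""
    triples = [
        (rem_uri, RDF_TYPE, ORE + "ResourceMap"),
        (rem_uri, ORE + "describes", aggregation_uri),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
    ]
    # Each part of the document (a chapter, a data file) is aggregated.
    triples += [(aggregation_uri, ORE + "aggregates", part) for part in parts]
    return [f"<{s}> <{p}> <{o}> ." for s, p, o in triples]


# Placeholder URIs for illustration only.
lines = resource_map_triples(
    "http://example.org/thesis/rem",
    "http://example.org/thesis",
    ["http://example.org/thesis/chapter1.html",
     "http://example.org/thesis/compound1.cml"],
)
print("\n".join(lines))
```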
ICE-TheOREM

        Semantics in chemistry thesis documents

        Structural elements, Chapters, Appendices etc

        Data (molecules, spectral data etc)

        Chemical entities in text




http://wwmm.ch.cam.ac.uk/trac/theorem/
Chemistry

Style: p-exptl-compound
Link to data.
Style: p-exptl-compnum
Style: p-compound-name




This text is from a synthetic chemistry thesis.

It highlights the grey area between data and metadata - the compound name is data, but also
the subject of the document.
These screenshots are taken from the CML in ICE demo at
http://ice.usq.edu.au/presentations/demos/index.htm
FIN
                 Thank you.




http://www.flickr.com/photos/jaysun/367670007
ICE - Integrated Content Environment http://ice.usq.edu.au/
Demos at http://ice.usq.edu.au/presentations/demos/
ICE-TheOREM. Tag: jisctheorem
https://wwmm.ch.cam.ac.uk/trac/theorem
http://www.jisc.ac.uk/whatwedo/programmes/digitalrepositories2007/theoremice.aspx

Peter Sefton
sefton@usq.edu.au
http://ptsefton.com/

Jim Downing
ojd20@cam.ac.uk
http://wwmm.ch.cam.ac.uk/blogs/downing/
