Left to Their Own Devices:
  Automating XML Parsing and
Rendering for Scholarly Publishing
    Alex Garnett & John Willinsky
      Public Knowledge Project
What do we want? XML Publishing!
• When do we want it? 2004 would’ve been
  nice…

• We’ve known the value of properly marked up
  documents for a few decades now
  – Unfortunately, this entails hours of marking.

• Open-source publishers on limited budgets can’t
  afford the outsourcing or the grad students that
  normally make this possible
The Public Knowledge Project
• Developers of Open Journal Systems &
  Open Monograph Press
  – Open source software to
    support open access
    publishing.
  – http://pkp.sfu.ca


• Our userbase happens to include many such
  small publishers, who publish almost exclusively
  in PDF, given its ease.
Nice things that PDF doesn’t have
•   Well-structured text mining & indexing
•   Rendering in different formats (e.g. mobile)
•   Embedded dynamic content
•   Citation parsing and lookup
•   Reliable metadata

• So why are we still using it, again?
XML Publishing Workflows
• Are complex and underdocumented, requiring
  lots of manual labour, since no author will ever
  write in XML, and only a small fraction will use
  Markdown or LaTeX or some other text format
  that’s easy to transform, and most automated
  parsing tools are in deplorable condition
  anyhow, rant rant rant, despite the fact that
  there are many very good piecemeal tools
  available at different stages of these
  workflows. We put some of them together.
Toolchain




• External Services:
  – LibreOffice – document conversion
  – pdfx – fuzzy parsing
  – ParsCit – fuzzy citation parsing
  – citeproc/CSL – citation transformation
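Two of the stages above can be sketched in Python. This is a minimal illustration under stated assumptions, not the project's actual code: it assumes LibreOffice is installed and invoked through its standard `soffice --headless --convert-to` command line, and it assumes a parsed citation arrives as a simple dictionary. The function names and the input field names (`title`, `authors`, `journal`, `year`) are hypothetical; only the output keys follow the CSL-JSON format that citeproc processors consume.

```python
from pathlib import Path


def libreoffice_convert_command(source: Path, out_dir: Path, target: str = "pdf") -> list[str]:
    """Build a headless LibreOffice command that converts `source` to `target`.

    The command is only built here, not run; pass the result to
    subprocess.run(...) on a machine where `soffice` is on the PATH.
    """
    return [
        "soffice",
        "--headless",
        "--convert-to", target,
        "--outdir", str(out_dir),
        str(source),
    ]


def to_csl_json(citation: dict, item_id: str) -> dict:
    """Map a parsed citation (hypothetical field names) to a CSL-JSON item."""
    authors = []
    for name in citation.get("authors", []):
        # Naive split: the last word is the family name, the rest is given.
        given, _, family = name.rpartition(" ")
        authors.append({"family": family, "given": given})

    item = {
        "id": item_id,
        "type": "article-journal",  # assumed; a real parser would detect the type
        "title": citation["title"],
        "author": authors,
    }
    if "journal" in citation:
        item["container-title"] = citation["journal"]
    if "year" in citation:
        item["issued"] = {"date-parts": [[int(citation["year"])]]}
    return item


if __name__ == "__main__":
    print(libreoffice_convert_command(Path("article.docx"), Path("out")))
    # Placeholder citation, not a real reference.
    print(to_csl_json(
        {"title": "An Example Article", "authors": ["Jane Doe"],
         "journal": "Example Journal", "year": "2013"},
        "ref1",
    ))
```

Keeping each stage as a small function that takes and returns plain data makes it easy to swap out one external service without touching the others.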
Future Work
• After incorporating upstream changes from pdfx
  (fixing punctuation & non-English languages)
  we’re aiming to have an OJS plugin by March.
• OMP will follow soon after.

• By the end of our initial funding period in June,
  we’ll have a source release (without pdfx) and
  plan to be supporting a set of OJS/OMP users.
Future Work not done by us
• Collaborators at Heidelberg University are
  working on a WYSIWYG in-browser XML
  editor for manually revising article formatting.

• The University of Michigan’s mPach system will
  add ePub generation and HathiTrust ingest.

• CrossRef will be contributing functionality to
  look up, verify, and link parsed citations.
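The slides do not say how the CrossRef lookup will be wired in. As one possible shape, the sketch below builds a query against CrossRef's public REST API, using its `query.bibliographic` and `rows` parameters on the `/works` endpoint. The function name is hypothetical, and the code only constructs the URL; fetching it and checking the returned match are left out.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_lookup_url(citation_text: str, rows: int = 1) -> str:
    """Build a CrossRef works query for a free-text citation string.

    `rows` limits how many candidate matches the API returns.
    """
    query = urlencode({"query.bibliographic": citation_text, "rows": rows})
    return f"{CROSSREF_WORKS}?{query}"


if __name__ == "__main__":
    # Placeholder citation text, not a real reference.
    print(crossref_lookup_url("Doe 2013 An Example Article"))
```

A caller would still need to compare the top result against the parsed citation before linking it, since a free-text query can return a plausible but wrong match.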
Thanks
• Damion Dooley, our primary developer
• Steve Pettifer and the University of Manchester
  for allowing us to use pdfx
• Juan Alperin and the rest of the PKP team for
  their support and earlier work
• Alf Eaton from the NLM for stylesheets
• MediaX for funding this project
Questions?
• If you want to use our service for document
  preparation right now, contact me (Alex) at
  axfelix@gmail.com.

• We’ll have a stable version available by the end
  of January (probably free with registration)

• OJS/OMP integration and standalone release
  (without pdfx) coming soon!

MediaX (Jan 2013) -- PKP XML Parsing


Editor's notes

  1. (5 minute demo happens here)