1. Left to Their Own Devices:
Automating XML Parsing and
Rendering for Scholarly Publishing
Alex Garnett & John Willinsky
Public Knowledge Project
2. What do we want? XML Publishing!
• When do we want it? 2004 would’ve been
nice…
• We’ve known the value of properly marked-up
documents for a few decades now
– Unfortunately, this entails hours of marking.
• Open-source publishers on limited budgets can’t
afford the outsourcing or the grad students that
normally make this possible
3. The Public Knowledge Project
• Developers of Open Journal Systems &
Open Monograph Press
– Open source software to
support open access
publishing.
– http://pkp.sfu.ca
• Our userbase happens to include many such
small publishers, who publish almost exclusively
in PDF, given its ease of use.
4. Nice things that PDF doesn’t have
• Well-structured text mining & indexing
• Rendering in different formats (e.g. mobile)
• Embedded dynamic content
• Citation parsing and lookup
• Reliable metadata
• So why are we still using it, again?
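To make the list above concrete, here is a minimal sketch of what structured markup buys you: once an article is in NLM/JATS-style XML, metadata and citations are a few path lookups away. The sample document, its title, author, and reference are all invented for illustration, and the helper name `summarize` is hypothetical; only the element names (`article-title`, `contrib`, `ref`, `element-citation`) follow the JATS tag set. This is not the pipeline described in these slides.

```python
import xml.etree.ElementTree as ET

# Invented sample in the style of NLM/JATS markup; not taken from a real article.
SAMPLE = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>An Example Article</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
  <back>
    <ref-list>
      <ref id="r1">
        <element-citation>
          <source>Journal of Examples</source>
          <year>2012</year>
        </element-citation>
      </ref>
    </ref-list>
  </back>
</article>"""


def summarize(xml_text):
    """Return the title, author names, and (source, year) pairs per citation."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//article-title")
    authors = [
        f"{name.findtext('given-names')} {name.findtext('surname')}"
        for name in root.iterfind(".//contrib/name")
    ]
    citations = [
        (cit.findtext("source"), cit.findtext("year"))
        for cit in root.iterfind(".//ref/element-citation")
    ]
    return {"title": title, "authors": authors, "citations": citations}


summary = summarize(SAMPLE)
print(summary)
```

Getting the same fields out of a PDF generally means guessing from layout, which is why reliable metadata and citation lookup appear on the list above.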
5. XML Publishing Workflows
• Are complex and underdocumented, requiring
lots of manual labour, since no author will ever
write in XML, and only a small fraction will use
Markdown or LaTeX or some other text format
that’s easy to transform, and most automated
parsing tools are in deplorable condition
anyhow, rant rant rant, despite the fact that
there are many very good piecemeal tools
available at different stages of these
workflows. We put some of them together.
8. Future Work
• After incorporating upstream changes from pdfx
(fixing punctuation & non-English languages)
we’re aiming to have an OJS plugin by March.
• OMP will follow soon after.
• By the end of our initial funding period in June,
we’ll have a source release (without pdfx) and
plan to be supporting a set of OJS/OMP users.
9. Future Work not done by us
• Collaborators at Heidelberg University are
working on a WYSIWYG in-browser XML
editor for manually revising article formatting.
• The University of Michigan’s mPach system will
add ePub generation and HathiTrust ingest.
• CrossRef will be contributing functionality to
look up, verify, and link parsed citations.
10. Thanks
• Damion Dooley, our primary developer
• Steve Pettifer and the University of Manchester
for allowing us to use pdfx
• Juan Alperin and the rest of the PKP team for
their support and earlier work
• Alf Eaton from the NLM for stylesheets
• MediaX for funding this project
11. Questions?
• If you want to use our service for document
preparation right now, contact me (Alex) at
axfelix@gmail.com.
• We’ll have a stable version available by the end
of January (probably free with registration)
• OJS/OMP integration and standalone release
(without pdfx) coming soon!