Putting quality to the test ... How do you define quality in content conversion? Is it only about the output? Or do the performance, speed and reliability of the process matter too? Stilo has put its leading OmniMark content processing solution to the test to see how it stands up against the DITA Open Toolkit. See the results of this technical benchmarking exercise in the following presentation and contact Stilo to learn how you can accelerate the adoption of DITA with OmniMark. www.stilo.com
2. Darwin Information Typing Architecture (DITA)
An OASIS standard for content
Reduce, Reuse, Repurpose
Tools: DITA Open Toolkit, Editors, CMSes
Topic-Level Transclusion, Metadata-Based Filtering, Specialization, Maps, Topic Types
Established Foundations and Best Practices: HTML, HyTime, CALS Tables, XML, SGML
DITA Publishing
Depends on efficient assembly, interpretation, filtering & formatting of content components
3. The DITA Open Toolkit Factor
The Open Toolkit has been a big part of DITA's success
Open source
Active development community
Thorough implementation of DITA
Out-of-the-box support for multiple output formats
Modular architecture
Easily customized
Components of the Open Toolkit are replaceable
Users have a choice of XSLT and FO processor components
Many commercial products bundle the Open Toolkit
As a result DITA is closely identified with the Open Toolkit
4. DITA Editors incorporating Open Toolkit
Adobe FrameMaker 8
Information Mapping Content Mapper
Inmedius DITA Storm
In.vision DITA Studio
JustSystems XMetaL Author Enterprise 5.1
PTC Arbortext 5.3
SyncRO Soft <oXygen/> 9.1
Syntext Serna 3.5
XMLmind XML Editor 3.6
Source: Bob Doyle, "DITA Tools from A to Z", STC Intercom, April 2008 (DITA issue), http://www.ditanews.com/tools/STC_Intercom.pdf
5. DITA CMS Integration with Open Toolkit
Astoria On Demand
Author-it
Bluestream XDocs
DITA Exchange
DocZone
Inmedius Horizon
IXIASOFT DITA CMS Framework
PTC Arbortext Content Manager
SiberLogic SiberSafe
Trisoft Infoshare
Vasont
X-Hive Docato
XyEnterprise Content@
Source: Bob Doyle, "DITA Tools from A to Z", STC Intercom, April 2008 (DITA issue), http://www.ditanews.com/tools/STC_Intercom.pdf
6. Exploiting DITA
As DITA evolves, it will be applied to ever more demanding situations
Many industries publish huge volumes of data
Aerospace, automotive, oil services, legal publishing
Aspects of DITA can be used for their own sake
DITA specialization may spin off into its own standard
Transclusion can allow reuse even among monolithic documents
Metadata-based filtering can provide general-purpose effectivity support
DITA is a very modular specification
Some of these scenarios will have very demanding requirements
Very large "topics"
Large numbers of topics
7. The DITA Continuum at Stilo
| DITA continuum | Content | Details | Drivers |
|---|---|---|---|
| Pure DITA | Semiconductor Datasheets | FrameMaker source; PDF publishing | Authoring costs; Consistency; Customized Pubs |
| | Legal Procedures | e-Learning; Word and HTML source | Adaptable; Simplified authoring; Integration with existing XML tools |
| Semi-DITA | Aerospace Standards, 2 projects | Monolithic; SGML, Interleaf, Word source; publish to ATA, S1000D, new web services | Many legacy formats; Multi-target; access to sub-contractors; S1000D support |
| | Aircraft Maint. Manuals | Monolithic; ATA source; E-manuals | Efficient update; Targeting; Costs; Regulatory compliance |
| | Automotive | Monolithic; Multiple sources; SGML | Efficient update; Targeting; Costs |
| Non-DITA | Software Docs | Topics; SGML; RDBMS storage | Authoring costs; Multi-target; Reuse |
8. Pushing the Boundaries
How well does the Toolkit cope with these situations?
The Toolkit has a modular architecture
It can be used as a base for partial DITA applications
Some coding tricks are required
XSLT rules must be implemented carefully to preserve support for specialization
Most importantly, XSLT is not known as a fast processing technology
Can the Toolkit cope with high volumes of data?
We can test this
9. Building a DITA Stress Test
Sample input is the DITA language reference
200+ topics
1468 conref references
741 targets referenced by conref
1.06 MB
Average file size 5 kB
The DITA language reference was inflated in two ways
Topic sizes were increased up to a factor of 100 (to 500 kB per file)
Number of files was increased up to a factor of 100 (to 20,000 files)
To increase topic sizes
The body of each topic was replicated
A random prefix was added to each word to create unique content
The number of links increased proportionately
To increase the number of files
The whole topic was replicated
A random prefix was added to each word, each id, and each idref
The number of links and link targets increased proportionately
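The topic-size inflation described on this slide can be sketched roughly as follows. This is only an illustration of the idea (replicate the body, prefix every word with random characters), not the script Stilo used; the function names, the four-letter prefix length, and the sample sentence are all assumptions.

```python
import random
import string


def random_prefix(rng, length=4):
    """Return a short random lowercase prefix (the length is an assumed value)."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def inflate_body(words, factor, rng):
    """Replicate a topic body `factor` times, prefixing every word in each copy.

    Every word in every copy gets its own random prefix, so the replicated
    text is unlikely to repeat verbatim.
    """
    inflated = []
    for _ in range(factor):
        for word in words:
            inflated.append(random_prefix(rng) + word)
    return inflated


rng = random.Random(0)  # fixed seed so the sketch is repeatable
body = "The conref attribute pulls content from another topic".split()
bigger = inflate_body(body, factor=10, rng=rng)
print(len(body), "words ->", len(bigger), "words")
```

Inflating the number of files would follow the same pattern, with the prefix also applied consistently to each id and idref so that links still resolve within the replicated copy.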
10. Open Toolkit performance (1)
Processing Time vs. Average File Size: from 5 kB to 50 kB
[Chart: processing time (seconds and hours) vs. average file size (kB)]
| AVG SIZE (kB) | Open Toolkit TIME (s) |
|---|---|
| 5 | 80 |
| 10 | 156 |
| 21 | 428 |
| 31 | 852 |
| 41 | 1422 |
| 51 | 2160 |
11. Open Toolkit performance (2)
Processing Time vs. Average File Size: from 5 kB to 500 kB
[Chart: processing time (seconds and hours) vs. average file size (kB)]
| AVG SIZE (kB) | Open Toolkit TIME (s) | Open Toolkit TIME (hr) |
|---|---|---|
| 5 | 80 | 0.02 |
| 10 | 156 | 0.04 |
| 21 | 428 | 0.12 |
| 31 | 852 | 0.24 |
| 41 | 1422 | 0.40 |
| 51 | 2160 | 0.60 |
| 103 | 8160 | 2.27 |
| 206 | 33660 | 9.35 |
| 309 | OUT OF MEMORY | |
| 412 | OUT OF MEMORY | |
| 515 | OUT OF MEMORY | |
12. Open Toolkit performance (3)
Processing Time vs. Number of Files: from 200 to 2,000
[Chart: processing time (s) vs. number of files]
| Number of Files | Open Toolkit TIME (s) |
|---|---|
| 206 | 80 |
| 412 | 144 |
| 824 | 286 |
| 1236 | 415 |
| 1648 | 557 |
| 2060 | 699 |
13. Open Toolkit performance (4)
Processing Time vs. Number of Files: from 200 to 20,000
[Chart: processing time (s) vs. number of files]
| Number of Files | Open Toolkit TIME (s) |
|---|---|
| 206 | 80 |
| 412 | 144 |
| 824 | 286 |
| 1236 | 415 |
| 1648 | 557 |
| 2060 | 699 |
| 4120 | 1429 |
| 8240 | OUT OF MEMORY |
| 12360 | OUT OF MEMORY |
| 16480 | OUT OF MEMORY |
| 20600 | OUT OF MEMORY |
14. Accelerating DITA for Production
An alternative to the Toolkit is required
Production-level quality
No limits on large volumes of content
Consistently high throughput speed as volume increases
Robust and maintainable
Rapid development architecture
Out-of-the-box rendering for standard DITA schemas/DTDs
Easily customized
DITA-aware
Built-in support for DITA concepts
Transclusion
Specialization
Filtering
No programming tricks required
15. OmniMark DITA Accelerator
DITA Accelerator implements HTML publishing
Implements all functionality required for language reference
HTML support still requires completion
PDF to be implemented in the future
Behavior is modeled on the Toolkit
Automated tests were written to ensure that the output is almost identical
The output of the DITA Accelerator is nearly identical to the Open Toolkit
index.html from the Open Toolkit
index.html from the DITA Accelerator
Some small differences remain
Table cell borders are inconsistent in some cases
Some errors in the DITA toolkit are corrected in the Accelerator
High performance is achieved with streaming technology
Leverages OmniMark's built-in support for streaming
Makes heavy use of referents
A DITA-aware library has been implemented
Programmers do not have to employ coding tricks
16. Gentlemen, start your engines
DITA language reference
206 files
1414 elements with ids (potential link or conref targets)
1468 conref references
741 targets referenced by conref
1.06 MB
Average file size 5 kB
Initial results are promising
DITA Open Toolkit: 1 minute, 21 seconds
DITA Accelerator: 18 seconds
Speed improvement: 4X
What about larger input sets?
17. Comparing DITA Accelerator and Open Toolkit (1)
Processing Time vs. Average File Size: from 5 kB to 50 kB
[Chart: processing time (seconds and hours) vs. average file size (kB); series: DITA Accelerator, Open Toolkit]
| AVG SIZE (kB) | Open Toolkit TIME (s) | DITA Accelerator TIME (s) |
|---|---|---|
| 5 | 80 | 18 |
| 10 | 156 | 20 |
| 21 | 428 | 35 |
| 31 | 852 | 41 |
| 41 | 1422 | 46 |
| 51 | 2160 | 57 |
18. Comparing DITA Accelerator and Open Toolkit (2)
Processing Time vs. Average File Size: from 5 kB to 500 kB
[Chart: processing time (seconds and hours) vs. average file size (kB); series: DITA Accelerator, Open Toolkit]
| AVG SIZE (kB) | Open Toolkit TIME (s) | DITA Accelerator TIME (s) |
|---|---|---|
| 5 | 80 | 18 |
| 10 | 156 | 20 |
| 21 | 428 | 35 |
| 31 | 852 | 41 |
| 41 | 1422 | 46 |
| 51 | 2160 | 57 |
| 103 | 8160 | 86 |
| 206 | 33660 | 150 |
| 309 | OUT OF MEMORY | 217 |
| 412 | OUT OF MEMORY | 292 |
| 515 | OUT OF MEMORY | 369 |
Chart annotations: the last Open Toolkit result (33660 s) is about 9 hours; the last DITA Accelerator result (369 s) is about 6 minutes.
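To make the widening gap concrete, the ratio of the two timings can be computed for every file size at which both tools finished. The figures below are copied from this slide's table; the variable and function names are only illustrative.

```python
# (average file size in kB, Open Toolkit seconds, DITA Accelerator seconds),
# copied from the slide; rows where the Open Toolkit ran out of memory are omitted.
timings = [
    (5, 80, 18), (10, 156, 20), (21, 428, 35), (31, 852, 41),
    (41, 1422, 46), (51, 2160, 57), (103, 8160, 86), (206, 33660, 150),
]


def speedups(rows):
    """Return (size, toolkit_time / accelerator_time) for each row."""
    return [(size, toolkit / accel) for size, toolkit, accel in rows]


ratios = speedups(timings)
for size, ratio in ratios:
    print(f"{size:>4} kB: {ratio:6.1f}x")
```

On these numbers the ratio grows with every increase in file size rather than staying near the initial figure for the unmodified language reference.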
20. Comparing DITA Accelerator and Open Toolkit (3)
Processing Time vs. Number of Files: from 200 to 20,000
[Chart: processing time (s) vs. number of files; series: DITA Accelerator, Open Toolkit]
| Number of Files | Open Toolkit TIME (s) | DITA Accelerator TIME (s) |
|---|---|---|
| 206 | 80 | 18 |
| 412 | 144 | 49.5 |
| 824 | 286 | 100.2 |
| 1236 | 415 | 142.4 |
| 1648 | 557 | 193.8 |
| 2060 | 699 | 247.1 |
| 4120 | 1429 | 491.2 |
| 8240 | OUT OF MEMORY | 1055.2 |
| 12360 | OUT OF MEMORY | 1601.5 |
| 16480 | OUT OF MEMORY | 2143.4 |
| 20600 | OUT OF MEMORY | 2788.5 |
21. Throughput rate as number of files increases
[Chart: throughput rate (kB/s) vs. number of files; series: DITA Accelerator, DITA Open Toolkit]
| Number of Files | Open Toolkit THROUGHPUT (kB/s) | DITA Accelerator THROUGHPUT (kB/s) |
|---|---|---|
| 206 | 13.3 | 35 |
| 412 | 14.7 | 35 |
| 824 | 14.8 | 42 |
| 1236 | 15.3 | 47 |
| 1648 | 15.2 | 47 |
| 2060 | 15.2 | 45 |
| 4120 | 14.8 | 43 |
| 8240 | OUT OF MEMORY | 42 |
| 12360 | OUT OF MEMORY | 42 |
| 16480 | OUT OF MEMORY | 41 |
| 20600 | OUT OF MEMORY | 39 |
22. Interpretation of timing statistics
DITA Open Toolkit is best for light duty
Performance degrades rapidly as file sizes increase
Performance is fairly flat as the number of files increases
In both sets of tests, the toolkit eventually fails when it runs out of memory
A great starting point
OmniMark DITA Accelerator is robust and scales well
Does not run out of memory
Throughput rate is fairly flat in both types of testing
DITA can play in demanding production environments
Because DITA is a standard, technology can be changed without changing the information architecture
23. Ongoing analysis
Tests used DITA Toolkit "out-of-the-box"
Different XSLT processors may improve performance
Forum discussions suggest a workaround for memory exhaustion
Reload XSLT stylesheet on every transformation
Currently requires toolkit modification (may be configurable in 1.5)
Expect slower performance on smaller topics
Even with improvements, best performance will still be quadratic for increasing file sizes
[Chart: Linear vs. Quadratic growth]
There will be room for improvement for the foreseeable future
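One way to check the "quadratic" description against the reported timings is to estimate the growth exponent between two measurements: on a log-log plot, a slope near 1 means linear growth and a slope near 2 means quadratic growth. The sketch below uses the 103 kB and 206 kB rows from the earlier file-size comparison; the function name is an assumption made for illustration.

```python
import math


def growth_exponent(size1, time1, size2, time2):
    """Slope between two points on a log-log plot (about 1 = linear, about 2 = quadratic)."""
    return math.log(time2 / time1) / math.log(size2 / size1)


# Average file size doubles from 103 kB to 206 kB in both cases.
toolkit = growth_exponent(103, 8160, 206, 33660)
accelerator = growth_exponent(103, 86, 206, 150)
print(f"Open Toolkit exponent:     {toolkit:.2f}")
print(f"DITA Accelerator exponent: {accelerator:.2f}")
```

Two points are a thin basis for a trend, so this is a sanity check on the slide's claim rather than a fitted model.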
24. Role of OmniMark
Most of the performance is due to engineering "behind the scenes"
Native efficiency of OmniMark
Streaming architecture reduces memory requirements
Record shelves can be used to implement high speed lookup for DITA processing rules
OmniMark referents simplify support for transclusion
Referents are a streaming mechanism for reordering content
Eliminate complex book-keeping
OmniMark language is easily extended
Macros
Modules (functions and data types)
Bonus: SGML support included
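The referent idea mentioned on this slide (emit a placeholder now, supply its value once the target has been seen) can be imitated in a few lines of Python. This is a loose analogy only: the class and method names are hypothetical, it buffers output in memory, and it does not reproduce OmniMark's actual referent syntax or implementation.

```python
class Referent:
    """A named placeholder in the output whose value is supplied later."""

    def __init__(self, name):
        self.name = name


class ReferentStream:
    """Collects output chunks and placeholders, then resolves them in order."""

    def __init__(self):
        self._chunks = []
        self._values = {}

    def write(self, text):
        self._chunks.append(text)

    def write_referent(self, name):
        self._chunks.append(Referent(name))

    def set_referent(self, name, value):
        self._values[name] = value

    def resolve(self):
        parts = []
        for chunk in self._chunks:
            if isinstance(chunk, Referent):
                parts.append(self._values[chunk.name])
            else:
                parts.append(chunk)
        return "".join(parts)


out = ReferentStream()
out.write("<p>See ")
out.write_referent("title-of-target")  # the conref target has not been read yet
out.write(".</p>")
out.set_referent("title-of-target", "Installing the widget")  # read later
html = out.resolve()
print(html)
```

The point of the pattern is that the code writing the paragraph never has to track which targets are still unresolved, which is the book-keeping the slide says referents eliminate.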
25. Usability
XSLT supports DITA reluctantly
XSLT rule selection mechanism is not DITA-aware
Two templates that match the element "u":
<xsl:template match="*[contains(@class,' hi-d/u ')]">
<xsl:template match="*[contains(@class,' topic/ph ')]">
Both have equal priority
Programmer must use tricks to ensure that the "hi-d/u" rule takes precedence over the "topic/ph" rule
Extra conditions on the "topic/ph" rule can invert the hierarchy!
The spaces around the class names are required
And no more than one on each side
XSLT does not enforce this
Programmer must code carefully to avoid inexplicable behavior
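The matching behaviour described on this slide can be imitated in Python to show why the surrounding spaces matter. The class value used here is the standard DITA class attribute for the `u` element; the helper name `has_class` is an assumption for illustration.

```python
def has_class(class_attr, token):
    """Mimic XSLT contains(@class, ' token ') with exactly one space on each side."""
    return f" {token} " in class_attr


u_class = "+ topic/ph hi-d/u "  # class attribute of a <u> element

# Both template conditions are true for the same element, so both rules match.
matches_u = has_class(u_class, "hi-d/u")
matches_ph = has_class(u_class, "topic/ph")

# Dropping the spaces turns the test into a loose substring match:
# "topic/p" is found inside "topic/ph" even though <u> is not a paragraph.
loose_match = "topic/p" in u_class
strict_match = has_class(u_class, "topic/p")

print(matches_u, matches_ph, loose_match, strict_match)
```

Nothing in the string test itself says which of the two matching rules is the more specialized one, which is why the precedence has to be arranged by hand.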
26. OmniMark extensions provide DITA support
DITA Accelerator augments OmniMark with "DITA rules"
Automatically prioritized according to the specialization hierarchy
Rule selection is optimized so that performance stays consistent as more rules are added
DITA rules can be grouped into sets, like OmniMark rules
DITA rules can be supplied as OmniMark modules
Local DITA rules take precedence over imported rules for the same DITA class
Module supplies support functions that understand DITA class specialization
27. DITA Accelerator specialization support
The syntax of DITA rules is based on OmniMark element rules
Element rules specify element names
element "u"
output "<u>" || "%c" || "</u>"
DITA rules specify classes instead of elements
declare dita-rule hi-d-u-rule
class "hi-d/u"
output "<u>" || dita.process-content || "</u>"
Callout on the class line: selection by DITA class, understands specialization
Callout on dita.process-content: processes content, like "%c" in element rules
DITA rule for "hi-d/u" will take precedence over "topic/ph"
Based on the class specialization in the DTD
Currently implemented by macros
Allows access to full OmniMark language
DITA module also provides utility functions
DITA class-based queries for current and ancestor elements
Mimics the element tests built into OmniMark
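The precedence rule on this slide (the most specialized class with a registered rule wins) can be sketched in Python. This is not how the DITA Accelerator is implemented; it only illustrates the selection idea, and the function names and the rule table are assumptions.

```python
def class_tokens(class_attr):
    """Split a DITA class attribute into tokens, dropping the leading '-' or '+'."""
    return class_attr.split()[1:]


def select_rule(class_attr, rules):
    """Return the rule for the most specialized class that has one.

    Tokens in a DITA class attribute run from most general to most
    specialized, so the search goes from right to left.
    """
    for token in reversed(class_tokens(class_attr)):
        if token in rules:
            return rules[token]
    return None


rules = {
    "topic/ph": lambda content: content,            # generic phrase: pass through
    "hi-d/u": lambda content: f"<u>{content}</u>",  # underline specialization
}

rendered = select_rule("+ topic/ph hi-d/u ", rules)("underlined")
fallback = select_rule("+ topic/ph hi-d/b ", rules)("bold")  # no hi-d/b rule here
print(rendered, fallback)
```

An element whose specialized class has no rule of its own falls back to the rule for its ancestor class, which is what lets unknown specializations still produce output.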
28. Conclusions
The OmniMark-based DITA Accelerator provides scalability
Robust
Consistent throughput as volumes increase
No catastrophic failures
The OmniMark language can be easily extended to provide a natural DITA programming environment
Programmers can "think in DITA", rather than trying to align a pre-existing programming model with the DITA semantics
Standards are about choice of tools
DITA Toolkit is a good choice for
Learning DITA
Prototyping
Less demanding production uses
OmniMark DITA Accelerator
Demanding production environments
Most importantly, tool choice must be governed by the unique characteristics of your environment