2. Software Architecture 04/2008/KW
1 VERSIONING INFORMATION
• V0.1 – Version 0.1 – April/May/June 2008: Start version; Klemens Waldhör,
Heartsome Europe - TOSS_Software_Architecure.doc;
• V1.0 – Version 1.0 – 05.08.2008: Initial version; Klemens Waldhör,
Heartsome Europe; based on discussion with Michael Schneider, beodoc,
04.07.2008 - OpenTMS_Software_Architecure_v1.0.doc
• V1.1 – Version 1.1 – 30.08.2008: Modifications based on the FOLT internal
architecture discussion meeting, 29.08.2008, Acolada GmbH, Nürnberg.
Participants: Ulrike Baral, beodoc; Torsten Kuprat; Michael Schneider,
beodoc; Klemens Waldhör, Heartsome Europe; Thomas Wedde, euroscript;
OpenTMS_Software_Architecure_v1.1.doc
Dok. Nr.: HEA-1-2008; Version 1.0; April/May/June/August 2008 2/72
2 PREFACE
This manual gives an overview of the OpenTMS software architecture. It is based
on the requirements defined in the FOLT Open Source Initiative (Folt, 2007b).
The architecture of OpenTMS is built on several models which describe the key
components of OpenTMS. Each model handles a specific aspect of the translation
process and its requirements. Together the models form a framework which
guides the construction of language-specific software tools.
The following core models are identified:
• Security model
• Document model
• Process model
• User model
• Data model
• GUI model
• Interface model
On top of these models the application model organises real applications (the
GUI model is one example).
OpenTMS uses a data source in the data model which organises access to a
database or any kind of device that can store (TM or terminology) data.
The architecture also describes some basic functions which can form the core
of translation tools. The architecture is defined in such a way that it can
easily be extended with new functions or by combining existing functions into
new functionality.
CONTENTS
1 VERSIONING INFORMATION
2 PREFACE
3 LIST OF TABLES AND FIGURES
4 DEFINITIONS
5 INTRODUCTION
5.1 Arguments for an OpenTMS Software Architecture
5.2 Basics
5.2.1 Naming conventions
5.2.2 Naming of OpenTMS specific functions/methods
5.3 Character set
5.4 Standards
5.5 Basic Requirements
5.6 Architecture
6 OPENTMS ARCHITECTURE AND MODELS
6.1 Parameters in OpenTMS models
6.2 Core Models of OpenTMS
6.3 OpenTMS Core Library
6.4 The Application Model
6.5 Implementation Languages
7 SECURITY MODEL
7.1 Security, OpenTMS and Programming Languages
7.2 Communication Level
7.3 Document Level
7.4 Database Level
7.5 Security Level
8 BASIC OPENTMS COMPONENTS
9 DOCUMENT MODEL
9.1 Documents
9.2 Character Sets
9.3 XML document handling
9.4 XLIFF Documents
9.4.1 OpenTMS and Skeleton files
9.4.2 Security and encryption in XLIFF – secureXLIFF
9.5 TMX Documents
9.5.1 Security and encryption in TMX – secureTMX
9.6 TBX Documents
9.6.1 Security and encryption in TBX – secure TBX
9.7 Other Documents
9.8 Basic Document Access Functionality
10 OPENTMS AS A CLIENT/SERVER ARCHITECTURE
11 DATA MODEL
11.1 Data sources
11.2 TM Matches
11.3 Basic data source access functionality
11.4 Databases
11.4.1 Open source SQL databases
11.4.2 Closed source SQL databases
11.4.3 Alternatives
11.4.4 Database Access
11.4.5 Database and data source configuration
12 TRANSLATION OBJECTS
12.1 Format information
12.2 Terminology versus Translation Memory
12.3 Variables, placeholders, replacement classes
13 PROCESS MODEL
13.1 OpenTMS Process
13.2 OpenTMS Scripting Language
13.3 OpenTMSL Communication Methods
14 USER MODEL
14.1 User roles
14.2 Basic user functionality
15 GUI MODEL
16 INTERFACE MODEL
17 CONFIGURING OPENTMS
17.1 Naming of the configuration file
17.2 Structure of the configuration file
17.3 Configuration Options
18 DMS INTERFACE
19 BIBLIOGRAPHY
20 APPENDIX
20.1 Multiple translations for a linguistic concept
3 LIST OF TABLES AND FIGURES
Fig 1: OpenTMSName defined as a regular expression
Fig 2: Naming of OpenTMS functions for export
Fig 3: OpenTMS Procedure description
Fig 4: OpenTMS Models
Fig 5: Example securing XLIFF document exchange
Fig 6: OpenTMS Objects
Fig 7: XLIFF File
Fig 8: Some basic XLIFF File functions
Fig 9: Hierarchy of processes
Fig 10: Applications
Fig 11: Pipeline Architecture
Fig 12: Data sources and data components
Fig 13: Data sources with several data components
Fig 14: Data source access types
Fig 15: Data source access types
Fig 16: Configuring different database types
Fig 17: Representation of linguistic entities as General Linguistic Object
Fig 18: Conversions of linguistic entities
Fig 19: OpenTMS Scripting Language
Fig 20: OpenTMSL Inter-process and computer communication
Fig 21: Some basic user functions
Fig 22: Configuration of OpenTMS
Fig 23: Configuration file naming example
Fig 24: Configuration option structure
Fig 25: OpenTMS options table
4 DEFINITIONS
Client: A client is an application or system that accesses a (remote) service on
another computer system known as a server by way of a network. URL:
http://en.wikipedia.org/wiki/Client_%28computing%29
Client-Server: Client-server is a computing architecture which separates a client
from a server, and is almost always implemented over a computer network. A
client-server application is a distributed system that consists of both client
and server software. A client is a piece of software or a process that may
initiate a communication session, while a server cannot initiate sessions but
waits for requests from a client. The terms client and server may also refer to
the host computers connected to a network on which the client and server
software reside.
URL: http://en.wikipedia.org/wiki/Client-server
Doclet: By analogy with applets, doclets are modules which documentation tools
use to process and automatically generate documentation and possibly also code.
Doclets are best known in the context of the Java programming language, where
they are used as modules in the Javadoc documentation tool. URL:
http://de.wikipedia.org/wiki/Doclet.
GUI: Graphical User Interface. An application which allows a human user to
interact with a program through windows, menus etc.
“A graphical user interface (GUI) (IPA: /ˈguːiː/) is a type of user interface which
allows people to interact with electronic devices like computers, hand-held devices
(MP3 Players, Portable Media Players, Gaming devices), household appliances
and office equipment. A GUI offers graphical icons, and visual indicators as
opposed to text-based interfaces, typed command labels or text navigation to fully
represent the information and actions available to a user. The actions are usually
performed through direct manipulation of the graphical elements.” URL:
http://en.wikipedia.org/wiki/GUI
FOLT: Forum Open Language Tools URL: www.folt.org
HTTP: Hypertext Transfer Protocol (HTTP) is a communications protocol for the
transfer of information on intranets and the World Wide Web. Its original purpose
was to provide a way to publish and retrieve hypertext pages over the Internet.
URL: http://en.wikipedia.org/wiki/HTTP
HTTPS: Hypertext Transfer Protocol over Secure Socket Layer or HTTPS is a URI
scheme used to indicate a secure HTTP connection. It is syntactically identical to
the http:// scheme normally used for accessing resources using HTTP. Using an
https: URL indicates that HTTP is to be used, but with a different default TCP port
(443) and an additional encryption/authentication layer between the HTTP and
TCP. This system was designed by Netscape Communications Corporation to
provide authentication and encrypted communication and is widely used on the
World Wide Web for security-sensitive communication such as payment transactions
and corporate logons. URL: http://en.wikipedia.org/wiki/Https
Open Source: Open source is a development methodology which offers practical
accessibility to a product's source (goods and knowledge). Some consider
open source as one of various possible design approaches, while others consider
it a critical strategic element of their operations. Before open source became
widely adopted, developers and producers used a variety of phrases to describe
the concept; the term open source gained popularity with the rise of the Internet,
which provided access to diverse production models, communication paths, and
interactive communities.
The open source model of operation and decision making allows concurrent input
of different agendas, approaches and priorities, and differs from the more closed,
centralized models of development. The principles and practices are commonly
applied to the development of source code for software that is made available for
public collaboration, and it is usually released as open-source software. URL:
http://en.wikipedia.org/wiki/Open_source
RPC: Remote procedure call (RPC) is a technology that allows a computer program
to cause a subroutine or procedure to execute in another address space
(commonly on another computer on a shared network) without the programmer
explicitly coding the details for this remote interaction. That is, the programmer
would write essentially the same code whether the subroutine is local to the
executing program, or remote. When the software in question is written using
object-oriented principles, RPC may be referred to as remote invocation or remote
method invocation. URL: http://en.wikipedia.org/wiki/Remote_procedure_call
Server: In information technology, a server is an application or device that
performs services for connected clients as part of a client-server architecture. A
server application, as defined by RFC 2616 (HTTP/1.1), is "an application program
that accepts connections in order to service requests by sending back responses."
Server computers are devices designed to run such an application or applications,
often for extended periods of time with minimal human direction. Examples of d-
class servers include web servers, e-mail servers, and file servers. URL:
http://en.wikipedia.org/wiki/Server_%28computing%29
Software Architecture: The software architecture of a program or computing
system is the structure or structures of the system, which comprise software
components, the externally visible properties of those components, and the
relationships between them. The term also refers to documentation of a
system's software architecture. Documenting software architecture facilitates
communication between stakeholders, documents early decisions about high-level
design, and allows reuse of design components and patterns between projects. URL:
http://en.wikipedia.org/wiki/Software_architecture.
TOMCAT: Apache Tomcat is a Servlet container developed by the Apache Software
Foundation (ASF). Tomcat implements the Java Servlet and the JavaServer
Pages (JSP) specifications from Sun Microsystems, and provides a "pure Java"
HTTP web server environment for Java code to run. … Apache Tomcat includes
tools for configuration and management, but can also be configured by editing
configuration files that are normally XML-formatted. URL:
http://en.wikipedia.org/wiki/Apache_Tomcat
UML (Unified Modeling Language): In the field of software engineering, the
Unified / Universal Modeling Language (UML) is a standardized visual specification
language for object modeling. UML is a general-purpose modeling language that
includes a graphical notation used to create an abstract model of a system,
referred to as a UML model. UML is officially defined at the Object Management
Group (OMG) by the UML metamodel, a Meta-Object Facility metamodel (MOF).
Like other MOF-based specifications, UML has allowed software developers to
concentrate more on design and architecture. URL:
http://en.wikipedia.org/wiki/Unified_Modeling_Language
Unicode: In computing, Unicode is an industry standard allowing computers to
consistently represent and manipulate text expressed in most of the world's writing
systems. Developed in tandem with the Universal Character Set standard and
published in book form as The Unicode Standard, Unicode consists of a repertoire
of more than 100,000 characters, a set of code charts for visual reference, an
encoding methodology and set of standard character encodings, an enumeration of
character properties such as upper and lower case, a set of reference data
computer files, and a number of related items, such as character properties, rules for
normalization, decomposition, collation, rendering and bidirectional display order
(for the correct display of text containing both right-to-left scripts, such as Arabic or
Hebrew, and left-to-right scripts). URL: http://en.wikipedia.org/wiki/Unicode
UTF-8: UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length
character encoding for Unicode. It is able to represent any character in the
Unicode standard, yet the initial encoding of byte codes and character assignments
for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily
becoming the preferred encoding for e-mail, web pages, and other places where
characters are stored or streamed. URL: http://en.wikipedia.org/wiki/UTF-8
XML-RPC: XML-RPC is a remote procedure call protocol which uses XML to
encode its calls and HTTP as a transport mechanism. URL:
http://en.wikipedia.org/wiki/Xml-rpc
5 INTRODUCTION
5.1 Arguments for an OpenTMS Software Architecture
The arguments for an open source based localization tool have been discussed in
FOLT, 2007a.
Software design principles:
For end users (translators): easy to install
For translation providers: server version, networking
For customers: running own servers; secure interfaces
5.2 Basics
5.2.1 Naming conventions
OpenTMS uses a standardized naming convention for variables, names in XML
files, etc.
Each legal OpenTMS name (string, literal, variable name, function name)
consists of one or more words. Variable names start with an uppercase letter.
Function names (e.g. identifying processes) start with a lowercase letter. An
uppercase letter [A-Z] is only allowed as the first character of a word. The
remaining characters of a word are either [a-z] or [0-9]. No blanks are allowed
between words.
Word := [A-Z]([a-z]|[0-9])*
word := [a-z]([a-z]|[0-9])*
OpenTMSName := Word+
OpenTMSFunctionName := word Word*
Examples:
• The variable: xliffDocument
• The function: openXliffDocument
Fig 1: OpenTMSName defined as a regular expression
Exceptions to the naming conventions may be made where acronyms etc. are
used as words (e.g. TMX). Nevertheless this is not recommended.
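The grammar of Fig 1 can be checked mechanically. The following is a minimal sketch in Java, assuming the grammar is taken literally as written; the class name OpenTmsNames and its method names are illustrative and not part of OpenTMS. Under this literal reading a name such as XliffDocument is accepted as an OpenTMSName, while a name starting with a lowercase letter is accepted only as an OpenTMSFunctionName.

```java
import java.util.regex.Pattern;

/** Illustrative validator for the name grammar of Fig 1 (all names here are assumptions). */
final class OpenTmsNames {
    // Word := [A-Z]([a-z]|[0-9])*
    private static final String UPPER_WORD = "[A-Z][a-z0-9]*";
    // word := [a-z]([a-z]|[0-9])*
    private static final String LOWER_WORD = "[a-z][a-z0-9]*";

    // OpenTMSName := Word+
    private static final Pattern NAME =
            Pattern.compile("(?:" + UPPER_WORD + ")+");
    // OpenTMSFunctionName := word Word*
    private static final Pattern FUNCTION_NAME =
            Pattern.compile(LOWER_WORD + "(?:" + UPPER_WORD + ")*");

    private OpenTmsNames() { }

    /** Returns true if the whole string is an OpenTMSName. */
    static boolean isName(String candidate) {
        return NAME.matcher(candidate).matches();
    }

    /** Returns true if the whole string is an OpenTMSFunctionName. */
    static boolean isFunctionName(String candidate) {
        return FUNCTION_NAME.matcher(candidate).matches();
    }
}
```

A build step or code review tool could call isName and isFunctionName on identifiers to flag blanks, underscores or other characters outside the grammar.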
5.2.2 Naming of OpenTMS specific functions/methods
A consistent OpenTMS naming system is suggested for functions and variables
which are exported from OpenTMS. Exported functions are functions which can be
used in applications (similar to the public concept in Java or C++). This makes
it immediately clear which code is used in systems outside of OpenTMS. The
special prefix “OpenTMS_” is used for this purpose.
ExportOpenTMSName:= “OpenTMS_” Word+
ExportOpenTMSFunctionName := “OpenTMS_” word Word*
Examples:
• The variable: OpenTMS_Encoding
• The function: OpenTMS_openXliffDocument
Fig 2: Naming of OpenTMS functions for export
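The export rule of Fig 2 can be sketched in the same way. The Java fragment below is only an illustration under the assumption that the prefix is simply prepended to a name that already follows the Fig 1 grammar; OpenTmsExportNames and its methods are hypothetical helper names, not an existing OpenTMS API.

```java
import java.util.regex.Pattern;

/** Illustrative helper for the export naming rule of Fig 2 (all names here are assumptions). */
final class OpenTmsExportNames {
    static final String PREFIX = "OpenTMS_";

    // ExportOpenTMSName := "OpenTMS_" Word+
    private static final Pattern EXPORT_NAME =
            Pattern.compile(Pattern.quote(PREFIX) + "(?:[A-Z][a-z0-9]*)+");
    // ExportOpenTMSFunctionName := "OpenTMS_" word Word*
    private static final Pattern EXPORT_FUNCTION_NAME =
            Pattern.compile(Pattern.quote(PREFIX) + "[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)*");

    private OpenTmsExportNames() { }

    static boolean isExportName(String candidate) {
        return EXPORT_NAME.matcher(candidate).matches();
    }

    static boolean isExportFunctionName(String candidate) {
        return EXPORT_FUNCTION_NAME.matcher(candidate).matches();
    }

    /** Prepends the export prefix to an internal function name and validates the result. */
    static String toExportFunctionName(String internalName) {
        String exported = PREFIX + internalName;
        if (!isExportFunctionName(exported)) {
            throw new IllegalArgumentException("Not a legal function name: " + internalName);
        }
        return exported;
    }
}
```

With this sketch, toExportFunctionName("openXliffDocument") yields the exported form used in the example above, and names that do not follow the grammar are rejected before they reach an external interface.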
5.3 Character set
OpenTMS uses UTF-8 as its basic character encoding, especially for exchanging files.
5.4 Standards
FOLT builds heavily on the idea of Open Source and on the use of standards.
The FOLT requirements therefore use well-established, XML based localization
standards to represent various types of localization information.
• XLIFF - XML based localization exchange format
• TTX – Trados TagEditor bilingual file format
• TMX - XML based localization translation memory exchange format
• SRX - XML based format for describing segmentation rules
• GMX – standard for measuring quantitative aspects in the translation
process
• TBX / MARTIF / OLIF – formats for representing terminology
• CSV
• Language Encoding ISO 639…
In general the basic architecture makes heavy use of XML. XML based structures
are used as the basic mechanism to exchange information between different
applications (-> Translets). Using XML has the advantage that many (open source)
parsers are available for different programming languages, which enables
implementing the core OpenTMS architecture in different languages and environments.
5.5 Basic Requirements
The following main requirements are taken from FOLT (2007b):
• Software: Web based application; thin client; no installation; no proprietary
runtime components; preferably open source software (FOLT, 2007b, p. 17)
• Operating System: OS independent
• Hardware: standard hardware (FOLT, 2007b, p. 17)
• Interfaces: Integration into CMS and workflow management should be supported
(FOLT, 2007b, p. 17).
• Product interfaces: Exchange supported through XLIFF and TMX (FOLT,
2007b, p. 18).
• Database: Open source database (FOLT, 2007b, p. 21); basically all SQL
databases should be supported, therefore a generic database interface is
required.
• Scalability: single and multi user requirement
5.6 Architecture
The architecture is described mainly in diagrams and text. The target audience
of this document is mainly non-technical readers. The document is therefore
kept as informal as possible without losing the necessary precision. Further
documents or versions of this document may add more detail to the various items
discussed. Where possible the basic methods and classes have been written in
Java, but this does not imply that an implementation requires Java as its
implementation language.
The various components described in the document are called models. A model
organizes a certain functionality or aspect of the OpenTMS system. An example
of a model is the security model of OpenTMS. This model describes all necessary
functions and structures to implement the OpenTMS security system.
There are several ways to describe the architecture, methods and objects of a
piece of software. Within this document mainly diagrams and block diagrams are
used to show the structure of the software. For describing methods and objects an
XML based methodology is used (taken from Tomcat).
The following is an example of a method call description using the Tomcat
interface description. This description method will be extended to also describe
the possible return values.
<translet>
  <translet-name>ApplyTranslationMemoryToSegment</translet-name>
  <translet-class>com.OpenTMS.translet.translateSegment</translet-class>
  <init-param>
    <param-name>TMXDB</param-name>
    <param-value>OpenTMSexampledatabase</param-value>
  </init-param>
  <init-param>
    <param-name>SEGMENT</param-name>
    <param-value>This segment needs to be translated.</param-value>
  </init-param>
  <init-param>
    <param-name>FUZZYQUALITY</param-name>
    <param-value>70</param-value>
  </init-param>
</translet>
Fig 3: OpenTMS Procedure description
Annotation: In order to keep the text more compact, function names in this
document do not follow the naming scheme described in chapter 5.2.2. This is
just for readability purposes. The real implementation should adhere to the
naming scheme.
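To illustrate how an implementation might consume a procedure description like the one in Fig 3, the following Java sketch reads the init-param entries of a translet descriptor into a map using the standard JAXP DOM parser. It is a sketch under assumptions: the class TransletDescriptorReader and its method are hypothetical, and the document does not prescribe how descriptors are parsed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical reader for translet descriptors shaped like Fig 3. */
final class TransletDescriptorReader {
    private TransletDescriptorReader() { }

    /** Returns the init-param name/value pairs in document order. */
    static Map<String, String> readInitParams(String descriptorXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(descriptorXml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not a well-formed translet descriptor", e);
        }
        Map<String, String> params = new LinkedHashMap<>();
        NodeList initParams = doc.getElementsByTagName("init-param");
        for (int i = 0; i < initParams.getLength(); i++) {
            Element initParam = (Element) initParams.item(i);
            params.put(childText(initParam, "param-name"),
                       childText(initParam, "param-value"));
        }
        return params;
    }

    /** Returns the trimmed text of the first child element with the given tag name. */
    private static String childText(Element parent, String tagName) {
        NodeList children = parent.getElementsByTagName(tagName);
        if (children.getLength() == 0) {
            throw new IllegalArgumentException("Missing <" + tagName + "> element");
        }
        return children.item(0).getTextContent().trim();
    }
}
```

All values arrive as strings; a caller would convert them as needed, for example Integer.parseInt on the FUZZYQUALITY value. Trimming makes the reader tolerant of descriptors in which names and values are written on separate lines.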
6 OPENTMS ARCHITECTURE AND MODELS
The OpenTMS architecture is composed of several models. Each model implements
a specific aspect and behavior of the OpenTMS system. The models communicate
with each other through parameters and values.
6.1 Parameters in OpenTMS models
Realizing parameters, and especially their types, independently of a specific
programming language is not trivial, apart from simple types like characters,
strings, integers or other numbers. The transfer of more complex structured
information has to be built on those primitive types. Programming languages
typically use “serialization” approaches to transfer data at least from one
application instance to another.
OpenTMS tries to use a general parameter / value model which addresses both
programming language specific and programming language independent parameter /
value transfer. In order to make the integration of existing applications
possible, OpenTMS supports different options for parameter representation.
The following methods should be supported:
• XML based parameters: All values should be transferred through XML elements
where the value is given by the element content (string), the name of the
parameter as an attribute and the type of the parameter as an attribute too.
XML based parameter / value transfer is especially useful when transferring
complex structured values between functions (e.g. objects). Nevertheless
complex parameters (objects) need to be serialized. It is suggested that
OpenTMS defines some additional basic parameter types which often occur in
translation tools (e.g. date type, TransUnits from XLIFF, tu or tuvs in TMX).
• Tomcat parameters: This follows the way the TOMCAT server engine
defines method calls with parameter values. It is also XML based.
• XML-RPC parameters: This follows the way XML-RPC defines method
calls with parameter values. It supports some basic types like integer etc.
More complex parameters have to be serialized.
• Programming language specific parameters: These parameters should
be wrapped in a specific object through serialisation. This parameter type
should only be used within a specific implementation where it is very
unlikely that it will be used from other programming languages.
• Hash tables: Hash tables are supported by most programming languages,
and their transfer to and from databases is often supported. Basically an
entry in the table contains a key (the name of the parameter) and the value of
the parameter (value of the key).
The kernel of each language specific OpenTMS implementation contains a basic
library which supports creating, reading and writing OpenTMS parameters.
Type        Comment
int         Integer as in Java
float       Float as in Java
char        Character as in Java
String      String as in Java
Time
Date
TransUnit   XML based XLIFF TransUnit structure
tu          XML based TMX tu structure
GLO         General Linguistic Object - see chapter 12
MoLo        Monolingual Object - see chapter 12
Mulo        Multilingual Object - see chapter 12
Fig 4: Table of Core OpenTMS parameter types
An example of how parameters are used is given in Fig. 3.
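The XML based and hash table representations described above can be sketched as follows. This Java fragment is a minimal illustration, assuming an element called param with name and type attributes; the element name, the class OpenTmsParameter and its methods are assumptions, since the document fixes only that the value is the element content and that name and type are attributes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parameter holder; element and attribute names are assumptions. */
final class OpenTmsParameter {
    final String name;
    final String type;
    final String value;

    OpenTmsParameter(String name, String type, String value) {
        this.name = name;
        this.type = type;
        this.value = value;
    }

    /** XML representation: value as element content, name and type as attributes. */
    String toXml() {
        return "<param name=\"" + escape(name) + "\" type=\"" + escape(type) + "\">"
                + escape(value) + "</param>";
    }

    /** Hash table representation: parameter name as key, parameter value as value. */
    static Map<String, String> toHashTable(List<OpenTmsParameter> parameters) {
        Map<String, String> table = new LinkedHashMap<>();
        for (OpenTmsParameter parameter : parameters) {
            table.put(parameter.name, parameter.value);
        }
        return table;
    }

    /** Escapes the characters that are not allowed verbatim in XML text or attributes. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }
}
```

Note that the hash table form drops the type information, so a receiver would need to know the expected type of each key by convention. The XML form keeps the type and may therefore be the safer choice across programming languages.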
6.2 Core Models of OpenTMS
This chapter describes the core models of OpenTMS. The key idea is that
OpenTMS uses an extensible architecture which allows new models to be added to
the kernel architecture in an easy, yet compatible way. A new model has to
fulfill some basic requirements, e.g. that parameters are defined and used
as described in chapter 6.1.
Fig 5: OpenTMS Models and their relations
The OpenTMS models are arranged in a kind of “onion model”. The kernel is
represented by the process model, which in turn builds on the user, document
and data models, each of which models a specific aspect of the OpenTMS system.
These kernel models are “shielded” by the security model, which is responsible
for ensuring that only allowed operations are performed.
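One way to picture the shielding role of the security model is a wrapper that checks the rights of a user before delegating to a kernel process. The Java sketch below only illustrates that idea; the interface Process, the classes User, UpperCaseProcess and SecuredProcess, and the rights representation are all hypothetical and are not defined by this architecture.

```java
import java.util.Locale;
import java.util.Set;

/** Illustration of the "onion" idea: a security layer wrapped around a kernel process. */
final class SecurityShieldSketch {
    private SecurityShieldSketch() { }

    /** A kernel function, e.g. a converter or a translation memory search. */
    interface Process {
        String name();
        String run(String input);
    }

    /** A user (a human or another process) with the names of the processes it may run. */
    static final class User {
        final String id;
        final Set<String> allowedProcesses;

        User(String id, Set<String> allowedProcesses) {
            this.id = id;
            this.allowedProcesses = allowedProcesses;
        }
    }

    /** Stand-in kernel process used only for demonstration. */
    static final class UpperCaseProcess implements Process {
        @Override public String name() { return "upperCase"; }
        @Override public String run(String input) { return input.toUpperCase(Locale.ROOT); }
    }

    /** The shield: checks the user's rights before delegating to the wrapped process. */
    static final class SecuredProcess implements Process {
        private final Process inner;
        private final User user;

        SecuredProcess(Process inner, User user) {
            this.inner = inner;
            this.user = user;
        }

        @Override public String name() { return inner.name(); }

        @Override public String run(String input) {
            if (!user.allowedProcesses.contains(inner.name())) {
                throw new SecurityException(user.id + " may not run " + inner.name());
            }
            return inner.run(input);
        }
    }
}
```

Because SecuredProcess implements the same interface as the process it wraps, callers outside the kernel never need a direct reference to an unshielded process.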
• Security Model: This model describes the security aspects and require-
ments of OpenTMS. Other models use the security model to allow or re-
strict the access to OpenTMS specific functions. OpenTMS uses a security
model which on the one side secures the communication channel and on
the other side secures data (e.g. the value of elements in an xml file or the
values in a property file).
• User Model: This model realizes the user and its representation in the
OpenTMS. The user model works in tight connection with the security. User
does now only imply human users, but also other processes. User models
have rights attached to them which in turn support the security model of
OpenTMS.
• Process Model: This model implements the functions (combined finally into
applications – see application model) of the OpenTMS, e.g. a converter or a
translation memory search.
• Data Model: Basically this model implements the database side of
OpenTMS. It uses a generalized database model, called data sources.
Data sources are any kind of storage media for data, starting from plain text
files towards SQL and other types of databases.
• Document Model: The document model describes the core documents
used in OpenTMS. Basically this is based on XLIFF and TMX. The docu-
ment model also could be seen as part of the data model but due to the im-
portance of documents as one of the core output produced by the transla-
tion and localization process they are modeled separately.
• GUI Model: This model specifies editors and other functionality which re-
quires a GUI. The GUI model is not further detailed in the architecture
specification here. The GUI model should be defined in a separate docu-
ment.
• Interface Model: This model describes how to extend OpenTMS with new
models. The interface model is an abstract model and needs further
investigation. An example of such an extension is the interface to CMS
systems. Interface models are also quite important as they serve as the connection
to other applications (e.g. Web servers, CMS systems) and in general to
scripting languages like Perl, PHP etc.
• Application Model: This model realizes programs which perform tasks
like translation etc.
6.3 OpenTMS Core Library
In order to achieve a consistent and quick implementation, OpenTMS implements
its key functions in a core library. Functions implemented in the core library
should not be re-implemented (“reinvented”) in external functions or processes.
Obviously the set of key functions will evolve over time. Functionality and
implementation of the core should not be changed without important reasons
(similar to the Linux implementation process).
Using a core library, OpenTMS ensures that certain functions behave in the
same way across applications. It also assures developers and users that
functionality does not change unexpectedly.
Core library functions should be the first to be realized when OpenTMS is
implemented in a different programming language.
6.4 The Application Model
The OpenTMS architecture serves only as a model of how the different aspects of
tools supporting the translation process can be implemented. As a model it is
independent of any programming language.
Applications need to be written in order to make the functionality of OpenTMS
accessible to users. This is realized in the application model. The GUI model can
be seen as an example of an application model.
Applications obviously depend on the existence of a concrete implementation in an
existing programming language (Java, C#, Perl or any other). In this sense
OpenTMS provides a programming framework which allows the construction of
language support tools.
In the beginning OpenTMS will come with some basic applications (editors etc.).
But the main idea is to define and specify a sound framework which allows the
construction of new language applications.
OpenTMS also supports its own scripting language (OpenTMSL). This language
makes the OpenTMS functions accessible through simple calls (similar to batch files).
This scripting language can also be used to construct applications.
6.5 Implementation Languages
As a first step it is suggested to implement a Java version of OpenTMS. Compared
to other languages Java has the advantage that it runs on several operating
systems (which is one of the goals of FOLT and OpenTMS). Tools written in other
languages can be integrated because OpenTMS is designed, from its basic model
onwards, to use XML-RPC and similar communication modes.
The basic Java implementation can serve as the basis for other implementations
(C, C#, C++, Perl, PHP etc.).
For security issues associated with choosing a proper programming language
see chapter 7.
7 SECURITY MODEL
A key success factor of the OpenTMS system is security. As translation can
always involve documents of various security levels, proper handling of the
documents and of document transmission is required.
Depending on the security level, data can be encoded/encrypted. It is suggested to
use three different levels; a further GUI related level (Level 4) is described below.
• Level 0: No security procedures are applied, data are transferred as they
are.
• Level 1: The communication channel is secured. It uses standard secure
protocols here.
• Level 2: Security is applied at data level. Basically this
means that strings are encrypted when they are sent through a
communication channel or are written to or retrieved from a database. This
also covers encrypted XLIFF files (or parts of them).
• Level 4: GUI level related security
Level 1 and 2 can be used together to achieve optimal security where necessary.
Security is attached to the OpenTMS User model.
A key feature of the OpenTMS architecture is that the security model is
transparent. When writing a (new) application the programmer does not need to
take care of the security aspect. The OpenTMS kernel provides all the functions
and interfaces to make those calls transparent; supplying the correct parameters is
sufficient.
A further type of security level (Level 4) can be introduced at GUI level. At
this level functions like copy and paste are secured in addition. This should
prevent users from copying and pasting the content of text windows (editing
windows) into other applications. Defining this security level is left to the
GUI model definition.
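The way the levels combine can be illustrated with a minimal Python sketch. It assumes (this is not stated in the specification) that the level numbers 1, 2 and 4 can be treated as combinable bit values; the names `SecurityLevel`, `CHANNEL`, `DATA`, `GUI` and `describe` are hypothetical.

```python
from enum import IntFlag


class SecurityLevel(IntFlag):
    """Hypothetical flags; the numeric values mirror Level 0, 1, 2 and 4."""
    NONE = 0      # Level 0: data are transferred as they are
    CHANNEL = 1   # Level 1: the communication channel is secured
    DATA = 2      # Level 2: strings / document parts are encrypted
    GUI = 4       # Level 4: GUI functions such as copy and paste are restricted


def describe(level):
    """Return the names of all measures switched on in `level`."""
    single_flags = (SecurityLevel.CHANNEL, SecurityLevel.DATA, SecurityLevel.GUI)
    return [flag.name for flag in single_flags if flag in level]


# Level 1 and Level 2 used together, as suggested where high security is needed.
combined = SecurityLevel.CHANNEL | SecurityLevel.DATA
print(describe(combined))
```

An application would then only pass such a value as a parameter, in line with the idea that the kernel keeps the security handling transparent.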
The following diagram shows how several methods can be combined to achieve
high security during the transmission of an XLIFF file. In this example the XLIFF
file is first secured (encrypted). When the file has to be transferred over the
network, the channel as such is also secured. Once the XLIFF file is received it is
decoded by the OpenTMS system. From a programming point of view this is realised
simply by setting and defining the security to be used.
Fig 6: Example securing XLIFF document exchange
7.1 Security, OpenTMS and Programming Languages
In the previous chapter the issue of programming languages has been discussed.
A commonly known problem with programming languages (more precisely with
applications written in those languages, and often associated only with specific
operating systems) is that security measures are often not properly implemented
(e.g. the very old problem of “buffer overflows” in C).
OpenTMS overcomes this problem by clearly defining specific modules which are
encapsulated and follow modern software development rules (e.g. access only
through well-defined interfaces). A special security layer wraps the various modules.
This architecture specification is mainly targeted towards the server part of
OpenTMS. Thus it is independent of any GUI application.
GUIs can use OpenTMS basically in two ways:
a) Through the OpenTMS server functionality: This approach encapsulates all
modules and functions and gives the highest possible level of security.
Here only “public server-side functionality” can be used.
b) By directly calling functions from the OpenTMS library: Obviously this can
cause problems if the GUI does not call the functions properly (especially in
programming languages like C or C++).
One group of OpenTMS target GUIs are web based (browser based) applications.
Those will call all the functionality through a web server, SOAP or XML-RPC
interfaces. This minimises the danger of introducing security problems on the
client side (e.g. for GUIs which have to follow requirements like ZDv 54/100 VS-NfD
„IT-Sicherheit in der Bundeswehr“). By restricting the client to “plain HTML” one can
reduce the risk to a minimum. Obviously increasing the security level goes along with
a decrease in comfort and user friendliness. This decision is up to the end user and
his organisation.
7.2 Communication Level
Communication which goes through TCP/IP should support (strong) encryption of
the data transmitted. This is done in addition to using protocols like HTTPS,
secure FTP etc.
7.3 Document Level
The basis of most activities in OpenTMS are documents.
A key problem is the transfer of XLIFF files. The content of the segments is
normally readable by humans. If required, the segments in the XLIFF files (as well
as in TMX or TBX files) can be encrypted (creating something like a secureXLIFF,
secureTMX, secureTBX). The segments can then only be read in conjunction with a
user name and password. The users who have regular access to the content can be
stored in encrypted form in the header of the XLIFF file or be supplied when opening
the XLIFF document.
7.4 Database Level
Database entries follow the same procedure. If required the entries should be
encrypted. At this level database specific security functionality can and should be
applied as well.
Without knowledge of the user/password combination an export etc. of the
database does not reveal any information in case of an attack.
In addition any database security layers need to be supported too.
7.5 Security Level
The following functions assume that each encryption and decryption process as-
sociates the relevant user and his roles with the security function. At this point no
function parameters are defined. This will be done in an implementation manual.
Function                          Comment
Encrypt / Decrypt                 General function which encrypts and decrypts any
                                  type of document.
Encrypt XLIFF / Decrypt XLIFF     This function encrypts the texts (segments) of an
                                  XLIFF document. The XML structure as such is still
                                  visible. Depending on the parameters supplied,
                                  attributes etc. are secured too.
Encrypt TMX / Decrypt TMX         This function encrypts the texts (segments) of a
                                  TMX document. The XML structure as such is still
                                  visible. Depending on the parameters supplied,
                                  attributes etc. are secured too.
Encrypt TBX / Decrypt TBX         This function encrypts the texts (segments) of a
                                  TBX document. The XML structure as such is still
                                  visible. Depending on the parameters supplied,
                                  attributes etc. are secured too.
Establish Secure Communication    Establishes a secure communication channel. The
                                  type of security depends on the supplied parameters.
Terminate Secure Communication    Terminates a secure communication channel.
Secure Data Source                Enables the encryption / decryption of database
                                  entries.
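The idea behind "Encrypt XLIFF" (texts are protected while the XML structure stays visible) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `encrypt_xliff` and `decrypt_xliff` are hypothetical, and the XOR keystream is a placeholder that must not be used as real encryption; a real implementation would use a vetted open source cipher. The sketch only handles the direct text of `source` and `target` elements, not text following inline elements.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET


def _keystream(password, length):
    """Placeholder keystream (SHA-256 in counter mode); NOT a vetted cipher."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(f"{password}:{counter}".encode("utf-8")).digest()
        counter += 1
    return out[:length]


def _xor(data, password):
    """XOR the data with the keystream; applying it twice restores the data."""
    return bytes(a ^ b for a, b in zip(data, _keystream(password, len(data))))


def encrypt_xliff(xliff, password):
    """Encrypt the text of <source>/<target>; elements and attributes stay visible."""
    root = ET.fromstring(xliff)
    for elem in root.iter():
        if elem.tag in ("source", "target") and elem.text:
            cipher = _xor(elem.text.encode("utf-8"), password)
            elem.text = base64.b64encode(cipher).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def decrypt_xliff(xliff, password):
    """Reverse encrypt_xliff using the same password."""
    root = ET.fromstring(xliff)
    for elem in root.iter():
        if elem.tag in ("source", "target") and elem.text:
            cipher = base64.b64decode(elem.text)
            elem.text = _xor(cipher, password).decode("utf-8")
    return ET.tostring(root, encoding="unicode")


SAMPLE = ('<xliff version="1.0"><file><body><trans-unit id="0">'
          '<source>Das ist ein Segment</source><target/></trans-unit>'
          '</body></file></xliff>')

secured = encrypt_xliff(SAMPLE, "secret")
restored = decrypt_xliff(secured, "secret")
print(secured)
```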
8 BASIC OPENTMS COMPONENTS
The OpenTMS framework is organized around a set of basic components called
models (see chapter 6) which interact and to which processes can be applied. The
following is a brief overview of the basic models:
• Documents: Documents form one key feature of the architecture. Basically,
documents are any form of text. Translations and other modification processes
(e.g. segmentation) are applied to documents. A key document type in
OpenTMS is the XLIFF document, which is the main paradigm for communicating
text between various processes.
• Database: Database refers to any kind of storage which can be used to
retrieve a specific text or sub-text (like a paragraph or segment). Database in the
OpenTMS context is understood broadly, ranging from simple text files to
highly sophisticated SQL or object-oriented database systems. OpenTMS uses
a general database object which can come in various flavors, e.g. a translation
memory, a phrase database or a terminology database. The OpenTMS database
architecture supports various security levels. Encryption of entries should be
supported. OpenTMS uses the notion of “data source” for these generalized
databases.
• Processes: Processes apply operations to documents and databases. Operations
could be: modifying, inserting, searching, editing, converting etc. A
key process in OpenTMS is the translation process. OpenTMS processes are
named “Translets” (singular: Translet). An example of a Translet is a Doclet,
a module which is applied for the conversion, modification etc. of documents.
Processes in OpenTMS are normally accessible through the OpenTMS
Scripting Language, a language which gives access to the core operations of
the OpenTMS architecture (similar to JavaScript).
Fig 7: OpenTMS Objects
From a certain perspective processes can be seen as a special type of
communication. Within OpenTMS three different communication types can be
distinguished. Communication is used here in a broad sense.
• Command (file) based process: Here an executable is run (batch mode).
Command processes use XML based command files as input parameters.
• Function based process: Here the specific process is called either as a
function or method within a piece of software.
• Net (TCP/IP) based process: Here a process is run over a network
(TCP/IP) using SOAP, RPC, XML-RPC or similar communication methods. The
method is invoked in one process while the actual execution runs in another
process (which could be a server, a virtual machine, another thread or similar).
• Workflow: A workflow is a set of processes which are applied in a specific
sequence. A workflow may also involve humans. A typical workflow could be:
PM receives document to translate, determines document characteristics,
computes statistics, provides offer, client accepts offer, PM determines
translator, converts document for translator, sends to translator, and so on.
This means that a workflow can also contain purely human actions
interwoven with computer processes. In any case each human process must be
mapped to a computer process.
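The difference between a command (file) based and a function based call can be sketched in Python. The command file layout (`command` and `param` elements), the registry `PROCESSES` and the stand-in process `segment_count` are assumptions made for illustration; the specification does not define them.

```python
import xml.etree.ElementTree as ET


def segment_count(params):
    """Stand-in process: counts sentences by splitting on full stops."""
    return len([s for s in params["text"].split(".") if s.strip()])


# Registry of processes; the same function serves both call styles.
PROCESSES = {"segment_count": segment_count}


def run_command_file(xml_text):
    """Command (file) based call: read process name and parameters from XML."""
    command = ET.fromstring(xml_text)
    params = {p.get("name"): (p.text or "") for p in command.findall("param")}
    return PROCESSES[command.get("process")](params)


COMMAND = """<command process="segment_count">
  <param name="text">Das ist ein Segment. Das ist noch ein Segment.</param>
</command>"""

print(run_command_file(COMMAND))                   # command (file) based
print(segment_count({"text": "Ein Satz. Zwei."}))  # function based
```

A net based call would wrap the same function behind SOAP or XML-RPC; only the transport changes, not the process itself.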
Later in the document it is mentioned that processes can be organized in pipe-
lines. Actually this means that one process can take the output of another process,
do some computation on this output and create a new output which itself can now
form the input to another process.
9 DOCUMENT MODEL
9.1 Documents
Documents (“texts”) are a core concept in OpenTMS. Documents are normally the
core interest, as documents need to be translated. Documents normally come into
OpenTMS as input or leave it as output. Documents are normally processed in
OpenTMS through XLIFF (chapter 9.4): they are converted into XLIFF and back.
Documents come in various formats, e.g.:
• WinWord
• RTF
• Plain text
• HTML
• XML
• OpenOffice
• program texts
• resource files
• property files
• database entries
• any other common localization industry formats
• any other document type
The simplest type of document is a string, a sequence of characters. For
OpenTMS processes strings are packed into XML structures, mainly a subset of
XLIFF.
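Packing a string into such a structure can be sketched in Python. The helper name `string_to_trans_unit` is hypothetical, and the choice of a bare `trans-unit` with a `source` child as the "subset of XLIFF" is an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace URI behind the reserved "xml:" prefix (used for xml:lang).
XML_NS = "http://www.w3.org/XML/1998/namespace"


def string_to_trans_unit(text, lang, unit_id="0"):
    """Wrap a plain string into a minimal XLIFF-like trans-unit element."""
    unit = ET.Element("trans-unit", {"id": unit_id})
    source = ET.SubElement(unit, "source", {f"{{{XML_NS}}}lang": lang})
    source.text = text
    return unit


unit = string_to_trans_unit("Das ist ein Segment", "de")
serialized = ET.tostring(unit, encoding="unicode")
print(serialized)
```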
A key property of a document is a language associated with it – although the lan-
guage itself may vary within the document. If a document gets translated at least a
second language is associated with it.
9.2 Character Sets
OpenTMS uses the Unicode character set for all (internal) representation
purposes. This has the advantage that most of the characters used worldwide can be
processed with OpenTMS. Also, most programming languages nowadays use
Unicode as their internal character representation.
UTF-8 is used as the core encoding when OpenTMS produces and
delivers files which are some kind of final document (e.g. for statistics output).
Deviations occur if the original character set differs.
The core library of OpenTMS contains basic functions to convert from one
character set to another. In addition the kernel library should contain some
functions which allow the detection of the character encoding of a document.
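A minimal Python sketch of such conversion and detection functions is shown below. The function names and the candidate list are assumptions; the detection is deliberately naive (byte order mark first, then the first candidate encoding that decodes without error) and is not a reliable general-purpose detector.

```python
import codecs


def convert(data, source_encoding, target_encoding="utf-8"):
    """Re-encode bytes from one character set to another."""
    return data.decode(source_encoding).encode(target_encoding)


def detect_encoding(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Naive detection: check for a BOM, then try the candidates in order."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    raise ValueError("none of the candidate encodings fits")


legacy = "Waldhör".encode("cp1252")          # single-byte legacy encoding
guessed = detect_encoding(legacy)            # not valid UTF-8, so cp1252 is chosen
as_utf8 = convert(legacy, guessed)
print(guessed, as_utf8)
```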
9.3 XML document handling
OpenTMS heavily uses XML based standards (XLIFF, TMX, TBX). There are
several good open source implementations for XML handling available (DOM model,
SAX parser, JDOM, just to name a few). Obviously those functions should be used to
manipulate those documents.
On top of the standard XML library functionality, functions are required to support
the manipulation of the translation / localization XML standards. Those functions
will also be part of the core library.
9.4 XLIFF Documents
XLIFF documents form the core document type on which most of the processes
are applied (segmentation, translation etc.). XLIFF documents are created by con-
verters. Converters take different document formats (rtf, xml, html etc.) and con-
vert them to the xml based XLIFF format (XLIFF, 2008).
The following shows a very simple example of an XLIFF document.
<?xml version="1.0" encoding="UTF-8" ?>
<xliff version="1.0">
  <file datatype="XML" original="D:arayatestsimplexmlsimplexml.xml"
        source-language="de" target-language="es">
    <header>
      <!-- Header of the XLIFF file -->
      <phase-group>
        <phase company-name="Araya" date="Sun May 11 11:29:11 CEST 2008"
               phase-name="1" process-name="pre-process" tool="XML2XLIFF version 2.0"/>
        <phase company-name="Araya" date="Sun May 11 11:29:11 CEST 2008"
               phase-name="2" process-name="Segmentation" tool="SEGMENTER version 2.0"/>
      </phase-group>
      <skl>
        <!-- Reference to an external file -->
        <external-file href="C:arayasklsimplexml.xml.27120.skl"/>
        <!-- Internal file -->
        <internal-file form="mimestring">PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiID8+DQo
8c2ltcGxleG1sPg0KPHNlZ21lbnQ+JSUlMCUlJQo8L3NlZ21lbnQ+DQo8c2VnbWVudD4lJSUxJSUlCjwvc2VnbWVudD4NC
jwvc2ltcGxleG1sPg==</internal-file>
      </skl>
      <!-- Properties of the XLIFF file -->
      <prop-group name="encoding"><prop prop-type="encoding">UTF-8</prop></prop-group>
      <prop-group name="xmlformat">
        <prop prop-type="donotresolveentitiesfile">C:arayainiedqm-ent.txt</prop>
        <prop prop-type="iniFile">c:/Araya/ini/config_simplexml.xml</prop>
      </prop-group>
      <prop-group name="specialinfo">
      </prop-group>
    </header>
    <body>
      <!-- Segments -->
      <trans-unit approved="no" help-id="0" id="0" xml:space="preserve">
        <source xml:lang="de">Das ist ein Segment</source>
        <target xml:lang="es" xml:space="preserve"/>
        <prop-group><prop prop-type="segmentid">1067381512</prop></prop-group>
      </trans-unit>
      <trans-unit approved="no" help-id="1" id="1" xml:space="preserve">
        <source xml:lang="de">Das ist ein <ph id="0">&lt;b&gt;</ph>Segment mit<ph id="1">&lt;/b&gt;</ph> Format</source>
        <target xml:lang="es" xml:space="preserve"/>
        <prop-group><prop prop-type="segmentid">1067381512</prop></prop-group>
      </trans-unit>
    </body>
  </file>
</xliff>
Fig 8: XLIFF File
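Reading the segments of such a file with a standard XML library can be sketched in Python. The sample below is a shortened version of the XLIFF example, and the helper names `plain_source` and `list_segments` are hypothetical. The sketch assumes that the content of `ph` placeholders is format information and is therefore skipped when extracting the plain text.

```python
import xml.etree.ElementTree as ET

XLIFF = """<xliff version="1.0">
  <file datatype="XML" original="simplexml.xml" source-language="de" target-language="es">
    <body>
      <trans-unit approved="no" id="0">
        <source xml:lang="de">Das ist ein Segment</source>
        <target xml:lang="es"/>
      </trans-unit>
      <trans-unit approved="no" id="1">
        <source xml:lang="de">Das ist ein <ph id="0">&lt;b&gt;</ph>Segment mit<ph id="1">&lt;/b&gt;</ph> Format</source>
        <target xml:lang="es"/>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def plain_source(unit):
    """Return the source text of a trans-unit without the placeholder content."""
    source = unit.find("source")
    parts = [source.text or ""]
    parts += [child.tail or "" for child in source]  # text after each <ph>
    return "".join(parts)


def list_segments(root):
    """Return (id, plain source text) for every trans-unit in document order."""
    return [(unit.get("id"), plain_source(unit)) for unit in root.iter("trans-unit")]


root = ET.fromstring(XLIFF)
segments = list_segments(root)
print(root.find("file").get("source-language"), segments)
```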
9.4.1 OpenTMS and Skeleton files
Skeleton files are one of the key features of XLIFF. In order to reduce the size of
the content of a segment (trans-unit, source and target) most converters move the
non-relevant part (e.g. format information) of an (external) document into an
external representation. They then use a kind of referencing scheme to specify where
parts of the text and the segment come together (mainly for back conversion). Skeleton
files mainly contain the format (non-textual) part of a document. Often this part is
bigger than the core text.
One can distinguish between internal and external skeleton files (also called skl
files).
External skl files keep the XLIFF file small, while internal skl files create a bigger
XLIFF file. With external files the problem of back conversion is more complicated,
as the back converter requires the skl file. One way to overcome this problem is to
use an internal skl file, compressed and encoded appropriately.
OpenTMS supports the back conversion of a document independently of the
place where it was created. Thus XLIFF files in OpenTMS normally use internal skl
files. In cases where this is not possible or wanted, a procedure must be supplied
which reintegrates the skl file into the XLIFF file before it is transmitted to another
machine, user etc.
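Compressing and encoding an internal skeleton, as mentioned above, can be sketched in Python. The choice of zlib compression followed by Base64 encoding and the function names are assumptions for illustration; the specification does not prescribe a particular scheme.

```python
import base64
import zlib


def pack_skeleton(skeleton):
    """Compress the skeleton bytes and encode them as ASCII text for embedding."""
    return base64.b64encode(zlib.compress(skeleton)).decode("ascii")


def unpack_skeleton(encoded):
    """Reverse pack_skeleton: decode the text and decompress the bytes."""
    return zlib.decompress(base64.b64decode(encoded))


# Hypothetical skeleton: the format part with placeholders for the segments.
skeleton = b"<simplexml>\n<segment>%%%0%%%</segment>\n<segment>%%%1%%%</segment>\n</simplexml>"
embedded = pack_skeleton(skeleton)   # text that could go into <internal-file>
print(embedded)
```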
9.4.2 Security and encryption in XLIFF – secureXLIFF
As described in the section about security, XLIFF documents must follow the
security architecture of OpenTMS. XLIFF documents are a potential threat to
security. If they are transmitted via the web or by another transport method (USB
stick etc.) other persons may read the XLIFF document. In order to prevent access by
unauthorized users it is proposed to encrypt the relevant parts (especially the source
and target elements) of the document. Only specified users with the correct password
will gain access to the content of the XLIFF document through an editor or similar.
XLIFF editors reading the file must support the OpenTMS security layer. Using
such a security approach one could also forbid copy and paste etc. for a given XLIFF
document.
Annotation: Obviously an open source encryption method should be used.
Using a secureXLIFF may be a good argument for industrial users to use the
OpenTMS concept and architecture.
9.5 TMX Documents
TMX documents form the core document type on which database operations apply
(fuzzy search, word based search etc.). TMX documents, or rather their entries, are
stored in databases. Converters take different translation memory exchange
formats (Trados etc.) and convert them to the XML based TMX format (TMX, 2008).
Databases store the TMX entries. While there is no problem with the meta
information associated with each TMX entry (tu), the global TMX document meta
information creates a problem. As databases are organized around entries, this meta
information must be stored in separate tables and referenced by each entry.
TMX files are normally imported into databases to support high access speed
(see footnote 1).
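Storing the global TMX meta information in a separate table that each entry references can be sketched in Python with SQLite. The table and column names are hypothetical (`creationtool` and `srclang` are taken from the TMX header attributes), and the example data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tmx_document (          -- one row per imported TMX file (header data)
    doc_id       INTEGER PRIMARY KEY,
    creationtool TEXT,
    srclang      TEXT
);
CREATE TABLE tmx_entry (             -- one row per translation unit (tu)
    entry_id INTEGER PRIMARY KEY,
    doc_id   INTEGER NOT NULL REFERENCES tmx_document(doc_id),
    source   TEXT,
    target   TEXT
);
""")

# Store the document-level meta information once ...
doc_id = conn.execute(
    "INSERT INTO tmx_document (creationtool, srclang) VALUES (?, ?)",
    ("ExampleTool", "de"),
).lastrowid

# ... and let every entry reference it.
conn.executemany(
    "INSERT INTO tmx_entry (doc_id, source, target) VALUES (?, ?, ?)",
    [(doc_id, "Das ist ein Segment", "Esto es un segmento"),
     (doc_id, "Noch ein Segment", "Otro segmento")],
)

rows = conn.execute("""
    SELECT e.source, e.target, d.srclang
    FROM tmx_entry AS e JOIN tmx_document AS d ON d.doc_id = e.doc_id
    ORDER BY e.entry_id
""").fetchall()
print(rows)
```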
9.5.1 Security and encryption in TMX – secureTMX
The same security architecture as for XLIFF should be applied to TMX.
9.6 TBX Documents
TBX documents form the core document type for terminology data. TBX documents
are imported into an OpenTMS database. TMX and TBX documents are internally
stored in the same entry structure. They can be distinguished by specific
markers.
The reason for storing both TMX and TBX documents in the same type of
database is that this allows both kinds of data to be re-used in similar situations.
Obviously the database functions need to support reading and writing the entries
given the context. Thus an (originally) TBX entry may be used as a TMX entry
(translation memory match) in one context, while a TMX entry could be used as a
terminology match in another context. This internally identical handling should not
imply that both entry types are the same, but reality shows that the usage patterns
often require that they can be used interchangeably.
9.6.1 Security and encryption in TBX – secureTBX
The same security architecture as for XLIFF should be applied to TBX.
1 A key question is whether OpenTMS should also allow direct access to TMX files
(like Star text files) without the need to import them into a database. The advantage
would be that, especially for small TMX files, there is no real need to store them in a
database. It would also not require any database drivers; XML access functions would
be sufficient. One could see this as a special type of database.
9.7 Other Documents
OpenTMS needs to process all other types of documents as well. Once those files
are brought into the OpenTMS system they are converted to XLIFF (except in
the cases discussed above). Once processed, the XLIFF documents are
converted back to their original format.
Ideally OpenTMS should contain or interact with a CMS system which provides a
convenient way of storing all kinds of documents. Interfaces to CMS will be
defined, although the implementation of these interfaces is not part of the OpenTMS
implementation. See chapter 18.
9.8 Basic Document Access Functionality
In the following some basic XLIFF file functions are described. Those functions
should go into the core library of OpenTMS. They are by no means exhaustive. A
more detailed function library for XLIFF will be defined later. Although most of the
functions can be realised using DOM functionality, a function library which
makes it easy to handle XLIFF files should be provided.
As the functions will involve complex parameter combinations, the parameters will
be supplied as XML constructs. For performance reasons one will not actually supply
flat XML files, but an in-memory version of the XML file (nodes etc.).
Basic Translation Functions
for XLIFF documents          Comment
Convert Document             Converts a given document to XLIFF.
Backconvert Document         Back converts a given document from XLIFF.
CreateXLIFFDocument          Creates an empty XLIFF document. This function
                             may be questionable, as XLIFF documents normally
                             have only a temporary status. They normally come
                             into existence through a converter call.
                             Nevertheless such a function may be helpful.
                             Pure text conversion can be achieved anyway.
GetProperties                Retrieves the (general) properties of the XLIFF
                             document.
SetProperties                Sets the (general) properties of the XLIFF
                             document.
Segment                      Segments the XLIFF document based on some
                             SRX rules (configuration file).
AddTransUnit                 Adds a new TransUnit at a certain position. This
                             function also depends on the original format.
                             Depending on the format this function may cause
                             problems in the back conversion process.
RetrieveTransUnit            Retrieves a segment of the XLIFF document; this
                             includes all the information of the segment (thus
                             the whole trans-unit is returned).
RemoveTransUnit              Removes a TransUnit; here one could distinguish
                             between executing the operation immediately (and
                             therefore permanently) or just making the
                             change in memory and saving the changes later.
ModifyTransUnit              Modifies a TransUnit; here one could distinguish
                             between executing the operation immediately (and
                             therefore permanently) or just making the
                             change in memory and saving the changes later.
TranslateTransUnit           The TransUnit is translated based on some
                             parameters supplied. This can include TM
                             translation, term translation, machine translation
                             or basically any other kind of translation or
                             invocation.
SplitTransUnit               Splits the source part of a TransUnit. Care has to
                             be taken with regard to validity.
CombineTransUnit             Combines the source parts of a TransUnit. Care
                             has to be taken with regard to validity.
SaveDocument                 Saves the XLIFF document.
GetStatistics                Returns some statistics of the translation process
                             (GMX based).
Fig 9: Some basic XLIFF File functions
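Three of the functions listed above can be sketched in Python on top of a standard XML library, working on the in-memory tree. The function names and signatures are hypothetical; for brevity the sketch takes plain arguments instead of the XML parameter constructs foreseen in the text.

```python
import xml.etree.ElementTree as ET


def retrieve_trans_unit(doc, unit_id):
    """RetrieveTransUnit: return the whole trans-unit element, or None."""
    for unit in doc.iter("trans-unit"):
        if unit.get("id") == unit_id:
            return unit
    return None


def add_trans_unit(doc, position, unit_id, source_text):
    """AddTransUnit: insert a new trans-unit at the given position in <body>."""
    body = doc.find("./file/body")
    unit = ET.Element("trans-unit", {"id": unit_id})
    ET.SubElement(unit, "source").text = source_text
    body.insert(position, unit)
    return unit


def remove_trans_unit(doc, unit_id):
    """RemoveTransUnit: change the in-memory tree only; saving is a separate step."""
    body = doc.find("./file/body")
    unit = retrieve_trans_unit(doc, unit_id)
    if unit is None:
        return False
    body.remove(unit)
    return True


doc = ET.fromstring('<xliff version="1.0"><file><body>'
                    '<trans-unit id="0"><source>Eins</source></trans-unit>'
                    '</body></file></xliff>')
add_trans_unit(doc, 1, "1", "Zwei")
ids_after_add = [u.get("id") for u in doc.iter("trans-unit")]
removed = remove_trans_unit(doc, "0")
ids_after_remove = [u.get("id") for u in doc.iter("trans-unit")]
print(ids_after_add, removed, ids_after_remove)
```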
10 OPENTMS AS A CLIENT/SERVER ARCHITECTURE
The kernel OpenTMS architecture is based on the client/server principle. Using
a client/server architecture brings many advantages, the most critical one being
that processes can be spread over several computers or threads in modern
operating systems and hardware architectures. This does not imply that the OpenTMS
architecture can only be implemented on a client/server basis. All the processes
(Translets) can also run in a single-user environment (e.g. through a procedure call
within an editor). But by using a client/server framework one avoids the problem of
having to re-program or re-implement a piece of software which was designed to run
in a single-threaded environment only. From an implementation point of view this
concerns, for example, the use of global or static variables.
Each procedure developed for OpenTMS should be designed with multi-threading
in mind. Each procedure should be encapsulated in such a way that
it can be surrounded by a process wrapper which allows it to run either as a
thread in the same software or computer environment or distributed
over several computers. In practice this means that “globally defined variables”
should be avoided as far as possible. As described before, the key functions are
implemented in the OpenTMS core library.
All (main) procedures should also be written in such a way that they can be called
easily by the OpenTMS scripting language.
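The requirement to avoid global variables can be illustrated with a small Python sketch. The procedure `count_words` and the wrapper `run_parallel` are hypothetical; the point is only that a procedure which depends solely on its arguments can be called directly or run in several threads without any change.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(segment):
    """Stateless procedure: uses only its argument and local variables."""
    return len(segment.split())


def run_parallel(procedure, inputs, workers=4):
    """Simple process wrapper: runs the same procedure in several threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(procedure, inputs))  # results keep the input order


segments = ["Das ist ein Segment", "Noch eins"]
sequential = [count_words(s) for s in segments]   # single-user style call
threaded = run_parallel(count_words, segments)    # same procedure, multi-threaded
print(sequential, threaded)
```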
Fig 10: Hierarchy of processes
Processes have to adhere to the security concept of OpenTMS. Processes can
only be executed if they (and the user associated with the process) have
appropriate rights (gained through the security model). This applies especially to
processes which use network connections.
Fig 11: Applications
Most of the processes are XLIFF exchange based (thinking in terms of functions
and variables this means that the parameters of functions are XLIFF documents or
substructures of XLIFF). The processes thus mainly operate on XLIFF
based XML structures: they add or modify XLIFF structures. In principle the
operations should be non-destructive, that is, information is not deleted or removed
but only added. In some cases this cannot be fully guaranteed: e.g. if a translator
modifies a translation (in a destructive way) the (older) information is lost. The same
may apply to database entries. This also depends on the usage of a proper versioning
system. As a consequence of using XLIFF related structures internally, conversions
to related XML based formats like TMX, TBX etc. must be supported. This
can be realized by attaching import and export procedures to the OpenTMS kernel.
Exceptions are, for example, converters which take a document in an arbitrary
format as input and produce an XLIFF document. The same applies to back
conversion.
Please note that the above figure also represents some kind of workflow. Basic
workflows can be part of the OpenTMS architecture (e.g. each process applying
changes to an XLIFF document should document this in the XLIFF header). But it
is not intended that OpenTMS as such comes with its own workflow solution. More
complex workflow procedures should be modeled either using proprietary or open
source software.
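Documenting a process step in the XLIFF header can be sketched in Python. The helper `record_phase` is hypothetical; the attribute names follow the `phase` elements of the XLIFF example in Fig 8. The function only adds information, in line with the non-destructive principle described above.

```python
import xml.etree.ElementTree as ET


def record_phase(doc, process_name, tool, date):
    """Append a <phase> entry to the header; existing entries are left untouched."""
    header = doc.find("./file/header")
    group = header.find("phase-group")
    if group is None:
        group = ET.SubElement(header, "phase-group")
    number = len(group.findall("phase")) + 1
    return ET.SubElement(group, "phase", {
        "phase-name": str(number),
        "process-name": process_name,
        "tool": tool,
        "date": date,
    })


doc = ET.fromstring('<xliff version="1.0"><file><header/><body/></file></xliff>')
record_phase(doc, "pre-process", "XML2XLIFF version 2.0", "Sun May 11 11:29:11 CEST 2008")
record_phase(doc, "Segmentation", "SEGMENTER version 2.0", "Sun May 11 11:29:11 CEST 2008")
history = [(p.get("phase-name"), p.get("process-name")) for p in doc.iter("phase")]
print(history)
```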
OpenTMS also follows the “old style” of UNIX pipelining. Processes (see the chapter
about the process model) take an input and produce an output. The next process
takes the output of the previous process, applies some further transformation to it
and creates new output. Nevertheless there is a difference: as parameters
can become quite complex, the UNIX style of interpreting the input just as “a
string” is extended here to support input and output in the form of the parameters
described before.
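Such a pipeline with structured input and output can be sketched in Python. The step functions and the dictionary used as a stand-in for an XLIFF structure are assumptions for illustration; each step returns a new structure with added information instead of modifying its input.

```python
from functools import reduce


def convert(doc):
    """Stand-in converter: marks the document as converted to XLIFF."""
    return {**doc, "format": "xliff"}


def segment(doc):
    """Stand-in segmenter: splits the text on full stops."""
    parts = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return {**doc, "segments": parts}


def pretranslate(doc):
    """Stand-in pre-translation: no matches found, so all targets stay empty."""
    return {**doc, "targets": [None] * len(doc["segments"])}


def pipeline(*steps):
    """Chain steps so that each one receives the output of the previous one."""
    return lambda doc: reduce(lambda acc, step: step(acc), steps, doc)


original = {"text": "Eins. Zwei."}
result = pipeline(convert, segment, pretranslate)(original)
print(result)
```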
Fig 12: Pipeline Architecture
Figure 12 shows a typical pipelining of several processes (Translets) during a
translation process. OpenTMS differentiates between two basic types of Translets.
• Human Initiated Translets: These are Translets which are invoked and
(fully) controlled by humans. Examples are a translation editor or operations
which insert or update entries in a database.
• Automated Translets: These are processes which normally run automatically
and do not require human interaction. Examples are the steps
conversion, segmentation and pre-translation. Automated procedures
(e.g. pre-translating a project, i.e. Translets applied to a set of documents)
also have to be mentioned here.
11 DATA MODEL
11.1 Data sources
Data (mostly databases) are modeled through data sources. Data sources are the
basic objects which allow access to all kinds of data, especially databases. Data
sources mainly store segments from TMX files or TBX entries. Data sources are
XML oriented, that is, depending on the XML document supplied a data source
converts the entry in such a way that it can be transferred to a data component.
Fig 13: Data sources and data components
Why not refer to databases directly? The basic idea behind using a data source
as the core data object in OpenTMS (representing databases etc.) is that such a
layer between the real databases (e.g. MySQL) and the OpenTMS software makes
adding new types of data quite easy. The various types of data are referred to
as data components. Thus an SQL database is a data component, but a TMX file
could also be seen as a data component if the relevant access operations are
supported. Similarly an Excel file can be considered as a data source. Using
this approach OpenTMS is not restricted to SQL databases, but can use flat
files, spreadsheets etc. too. It can also support direct access to vendor
specific databases or systems. A server side installation of OpenTMS can also
act as a data source.
[Figure: The OpenTMS software accesses data sources through a standardised
interface. The OpenTMS data source layer maps the OpenTMS access functions to
the data type specific data access functions of the specific data component.
Data components can be various kinds of data, like files etc.]
Fig 14: Data sources with several data components
A data component which is connected thru a data source must support a core
functionality. This core functionality is divided into three types of functions
(methods):
• Read methods: This involves all functions retrieving data from a data
component. Read methods also map the results to the form in which the caller
needs the data (e.g. TBX or TMX).
• Write methods: This involves all functions writing, updating and deleting
data in a data component. Write methods also take into account which input
format is used (e.g. TMX or TBX etc.) and convert it into the internal
data source format.
• Select methods: These methods are part of the read methods and allow
specific entries to be selected from the data source.
Care has to be taken with regard to the security level chosen. Depending on the
level, the data have to be encrypted and decrypted.
Two types of data components can be distinguished:
• Read only data components: This type of component can only retrieve
data, but not store data. An example could be if a plain TMX file is used as
data component.
• Full data components: Here both read and write methods are supported.
Depending on the user configuration, data components can be configured to
behave differently. A component can appear as a read only data component for
one user, while for another user it is accessible as a full data component.
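The distinction between full and read only data components, and the per-user configuration, can be sketched as follows. This is an illustration under assumptions, not the OpenTMS interface: DataComponent, InMemoryComponent, ReadOnlyView and component_for are hypothetical names, and the user configuration is reduced to a simple set of users with write access.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Set


class DataComponent(ABC):
    """Hypothetical minimal interface a data component offers to a data source."""

    @abstractmethod
    def read(self, key: str) -> str: ...

    @abstractmethod
    def select(self, text: str) -> List[str]: ...

    @abstractmethod
    def write(self, key: str, entry: str) -> None: ...


class InMemoryComponent(DataComponent):
    """Full data component: supports read, select and write methods."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}

    def read(self, key: str) -> str:
        return self._entries[key]

    def select(self, text: str) -> List[str]:
        # Select methods are part of the read methods.
        return [key for key, value in self._entries.items() if text in value]

    def write(self, key: str, entry: str) -> None:
        self._entries[key] = entry


class ReadOnlyView(DataComponent):
    """Makes any component appear as a read only data component."""

    def __init__(self, inner: DataComponent) -> None:
        self._inner = inner

    def read(self, key: str) -> str:
        return self._inner.read(key)

    def select(self, text: str) -> List[str]:
        return self._inner.select(text)

    def write(self, key: str, entry: str) -> None:
        raise PermissionError("data component is read only for this user")


def component_for(user: str, component: DataComponent, writers: Set[str]) -> DataComponent:
    # Per-user configuration: only users listed in `writers` get full access.
    return component if user in writers else ReadOnlyView(component)


store = InMemoryComponent()
editor = component_for("anna", store, {"anna"})
reader = component_for("bob", store, {"anna"})
editor.write("1", "Hello world")
print(reader.select("world"))
```

Wrapping the component instead of copying it means both users see the same data; only the permitted operations differ.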
11.2 TM Matches
OpenTMS differentiates between three types of matches:
• Perfect Match: This is a match where the segment to be searched
matches the segment in the TM both with regard to the text content and
the format.
• Exact Match: In this case only the text part of the segment matches with
the database entry perfectly, the format information differs.
• Fuzzy Match: In this case there are some deviations between the search
segment and the match in the TM. The difference is usually stated in %
values. This type of match is also often called inexact match.
In the future one may consider other types of matches too, e.g. replacement
class matches where only the “blank characters (white spaces)” differ. For this
see also chapter 12.3.
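The three match types can be told apart with a few comparisons. The following is a minimal sketch under assumptions: the function name classify_match is hypothetical, format information is represented simply as a list of tag strings, and the ratio from Python's difflib stands in for the similarity measure, which this document does not prescribe.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def classify_match(search_text: str, search_format: List[str],
                   tm_text: str, tm_format: List[str]) -> Tuple[str, int]:
    """Return (match type, similarity in %) for a search segment and a TM entry."""
    if search_text == tm_text:
        # Same text: the format information decides between perfect and exact.
        if search_format == tm_format:
            return ("perfect", 100)
        return ("exact", 100)
    # Assumption: difflib's ratio stands in for the similarity measure.
    similarity = int(SequenceMatcher(None, search_text, tm_text).ratio() * 100)
    return ("fuzzy", similarity)


print(classify_match("Press the button", ["bold"], "Press the button", ["bold"]))
print(classify_match("Press the button", ["bold"], "Press the button", ["italic"]))
print(classify_match("Press the button", [], "Press the red button", []))
```

A caller would typically compare the returned percentage against a configurable minimum match quality before offering a fuzzy match.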
11.3 Basic data source access functionality
The following (read and write) access functions are the core functions needed.
Access results in matches. A basic idea is that the function decides, based on
the input supplied, how the entry is interpreted and written into the database.
This means that TMX entries are handled differently from TBX entries etc.
Please note that in the description of the functions no explicit reference is
made to the security model. It is assumed that the security level is set before
or together with the invocation of the database function.
Access Type                 Comment
Exact Access                A given entry is found by the “string=segment”
                            supplied, but independently of the format.
Exact Format Access         A given entry is found by the “string” supplied,
                            taking format information into account.
Fuzzy Access                A given entry is found by using a similarity
                            search. Similarity is measured in %, where 100%
                            is identical to an exact access.
Fuzzy Format Access         A given entry is found by using a similarity
                            search, taking the format into account.
                            Similarity is measured in %, where 100% is
                            identical to an exact format access.
Word Based Access           A search is done by splitting the string into
                            individual words. The word identification is
                            language dependent. The words can be searched
                            using either OR or AND (see footnote 2). Word
                            based access could be enhanced by supporting
                            stemming (e.g. Porter stemming algorithm).
Regular Expression Access   A regular expression is used to retrieve the
                            result set. Such a function is quite resource
                            consuming.
Sub segment Access          Segments are retrieved based on some sub segments
                            of a given search string. This could be seen as a
                            more specialized form of the regular expression
                            search or word based search. This type of search
                            is esp. important if a segment actually
                            represents a paragraph and may contain several
                            sentences.
Fig 15: Data source access types
2 It is suggested to use a logical representation of the query similar to
Google (www.google.com). Here + denotes “word must exist”, while - denotes that
the word is not allowed to exist in the result set.
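A word based access with the + and - notation suggested in footnote 2 could look like the sketch below. It rests on several assumptions: the function name word_query is hypothetical, words are identified by simple whitespace splitting (real word identification is language dependent), the ASCII hyphen is used for the minus sign, and words without a prefix are combined with OR.

```python
from typing import Iterable, List


def word_query(query: str, segments: Iterable[str]) -> List[str]:
    """Filter segments with a query such as '+printer -error cable'."""
    required, forbidden, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token.startswith("-") and len(token) > 1:
            forbidden.add(token[1:])
        else:
            optional.add(token)

    hits = []
    for segment in segments:
        # Assumption: whitespace tokenisation, no stemming.
        words = set(segment.lower().split())
        if not required <= words or forbidden & words:
            continue
        # If plain words are given, at least one of them must occur (OR).
        if optional and not optional & words:
            continue
        hits.append(segment)
    return hits


segments = [
    "Connect the printer cable",
    "Printer error detected",
    "Replace the cable",
]
print(word_query("+printer -error", segments))
```

Stemming, as mentioned for the word based access, would be applied to both the query tokens and the segment words before the set comparisons.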
Access Functions for TM     Comment
and TBX data
RetrieveTMMatch             Get a match from the Translation Memory. The
                            actual result depends on the data source access
                            type chosen. Parameters involve match quality
                            etc.
RetrieveTBXMatch            Get a TBX match from the terminology database.
                            The actual result depends on the data source
                            access type chosen.
AddEntry                    This is a generic function adding data (e.g. TMX
                            entries) to data sources. The function is generic
                            in the sense that it decides, based on the type
                            of the XML document supplied, how the entry is
                            stored (TMX, TBX etc.).
CreateEntry                 Creates an empty data source entry of a specific
                            type.
AddTMEntry                  Adds a TM entry; a specialization of AddEntry.
AddTBXEntry                 Adds a TBX entry; a specialization of AddEntry.
RemoveEntry                 This is a generic function removing data (e.g.
                            TMX entries) from data sources. The function is
                            generic in the sense that it decides, based on
                            the type of the XML document supplied, how the
                            entry is handled (TMX, TBX etc.).
ModifyEntry                 This is a generic function modifying data (e.g.
                            TMX entries) in data sources. The function is
                            generic in the sense that it decides, based on
                            the type of the XML document supplied, how the
                            entry is handled (TMX, TBX etc.).
CopyEntry                   This is a generic function copying data (e.g. TMX
                            entries) to data sources. The function is generic
                            in the sense that it decides, based on the type
                            of the XML document supplied, how the entry is
                            stored (TMX, TBX etc.).
Fig 16: Data source access types
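How a generic AddEntry might choose its specialization from the XML supplied is sketched below. The class DataSource and its method names are hypothetical. The dispatch relies on the element names tu (TMX translation unit) and termEntry (TBX terminology entry); whether OpenTMS inspects the root element in this way is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


class DataSource:
    """Hypothetical in-memory data source keeping TMX and TBX entries apart."""

    def __init__(self) -> None:
        self.entries: Dict[str, List[ET.Element]] = {"TMX": [], "TBX": []}

    def add_tm_entry(self, entry: ET.Element) -> None:
        self.entries["TMX"].append(entry)

    def add_tbx_entry(self, entry: ET.Element) -> None:
        self.entries["TBX"].append(entry)

    def add_entry(self, xml_text: str) -> str:
        """Generic AddEntry: choose the specialization from the XML root element."""
        entry = ET.fromstring(xml_text)
        if entry.tag == "tu":          # TMX translation unit
            self.add_tm_entry(entry)
            return "TMX"
        if entry.tag == "termEntry":   # TBX terminology entry
            self.add_tbx_entry(entry)
            return "TBX"
        raise ValueError(f"unsupported entry type: {entry.tag}")


source = DataSource()
tmx_kind = source.add_entry(
    '<tu><tuv xml:lang="en"><seg>Hello</seg></tuv>'
    '<tuv xml:lang="de"><seg>Hallo</seg></tuv></tu>'
)
tbx_kind = source.add_entry(
    '<termEntry id="t1"><langSet xml:lang="en"><tig><term>printer</term></tig>'
    '</langSet></termEntry>'
)
print(tmx_kind, tbx_kind)
```

RemoveEntry, ModifyEntry and CopyEntry could reuse the same dispatch step before calling their type specific counterparts.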
11.4 Databases
A key principle of the OpenTMS architecture is its independence from database
products. OpenTMS defines a core subset of access functions (based on SQL)
which can be implemented by nearly all database systems.
The following gives a (non-exhaustive) list of database types which should be
supported (see footnote 3).
11.4.1 Open source SQL databases
• MySQL - www.mysql.de
• Postgres - www.postgresql.org
• H2 - www.h2database.com
• Cloudscape - www.ibm.com/software/data/cloudscape (IBM)
• …
11.4.2 Closed source SQL databases
• SQL Server (different flavors) -
www.microsoft.com/germany/sql/default.mspx
• Oracle - www.oracle.com
• …
11.4.3 Alternatives
SQL databases are not the only databases out there. Other database formats
could be:
• Spreadsheets (like SQL)
• Object oriented databases
• XML database systems (e.g. XINDICE)
• Plain text files
3 A key question at this point is whether OpenTMS should implement something
like an “internal database”, which would just mean storing the database as
“simple hash tables” which can be serialised and de-serialised. See also the
discussion of TMX documents (Footnote 1). Alternatively the internal database
could just consist of an XML file.
11.4.4 Database Access
Internally all main access functions of OpenTMS are based on specific objects
(see chapter 12) and all access happens through these objects. By using this
additional abstraction level (interfaces, as they are called in most
programming languages nowadays) one even becomes independent of SQL and is open
to future advances in the area of database development.
All access functions are mapped to SQL statements (or their equivalents) which
are not hardcoded but stored in XML database configuration files.
Up to this point there is no real necessity to implement the database in SQL
only. The advantage of using SQL as the language describing the access
functions is that it is a) widespread and b) standardized.
Fig 17: Configuring different database types
11.4.5 Database and data source configuration
As OpenTMS needs to support a lot of different database / data source types,
adding a new database type should not require changing the data source code
kernel. Therefore for each data source type a configuration file defines the
main parameters of the database. Depending on the security requirements the
configuration file can be secured using the security model functions for
documents. The configuration file includes:
• Database class driver – e.g. com.mysql.jdbc.Driver
• Connection String – e.g. jdbc:mysql:
• Any other connection string specific commands (e.g. buffer size)
• Commit support
• Unicode support
• Server Address
• Port
• User (encrypted)
• Password (encrypted)
• Mapping of OpenTMS database access functions to database specific access
code (e.g. SQL code like <command step="1">DROP TABLE MONO
IF EXISTS MONO</command>). The commands can be organized in groups if a
specific functionality requires running several database functions (e.g.
creating all the necessary tables for a new database). This is mainly
important for SQL databases, as a variety of SQL dialects exists.
• Reference to code (e.g. jar file, dll etc.) if a specific function needs to
run at a specific point in time (e.g. creating a new database). This should
make it possible to inject specific implementation code for specific tasks
(e.g. if some functionality cannot be executed thru SQL commands).
In addition a more generic interface can be called if a database cannot be inte-
grated with the configuration file specifications above. In this case the whole inter-
face for the new database needs to be implemented and made available to
OpenTMS.
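The mapping of access functions to database specific commands can be sketched with a small configuration and a loader. This is an illustration only: apart from the command element with its step attribute, all element and attribute names (datasource, function, name) are assumptions, the function names are hypothetical, and SQLite from the Python standard library stands in for a real database product, so the SQL is written in SQLite's dialect.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical configuration; only the <command step="..."> element is taken
# from the text, all other element and attribute names are assumptions.
CONFIG = """
<datasource type="sqlite">
  <function name="createDatabase">
    <command step="1">DROP TABLE IF EXISTS MONO</command>
    <command step="2">CREATE TABLE MONO (id INTEGER PRIMARY KEY, lang TEXT, segment TEXT)</command>
  </function>
  <function name="addEntry">
    <command step="1">INSERT INTO MONO (lang, segment) VALUES (:lang, :segment)</command>
  </function>
</datasource>
"""


def load_functions(config_xml: str) -> dict:
    """Map each access function name to its SQL commands, ordered by step."""
    root = ET.fromstring(config_xml)
    functions = {}
    for function in root.findall("function"):
        commands = sorted(function.findall("command"), key=lambda c: int(c.get("step")))
        functions[function.get("name")] = [c.text for c in commands]
    return functions


def run_function(connection, functions: dict, name: str, params: dict = None) -> None:
    # Execute all commands of a group in step order, then commit.
    for sql in functions[name]:
        connection.execute(sql, params or {})
    connection.commit()


functions = load_functions(CONFIG)
connection = sqlite3.connect(":memory:")
run_function(connection, functions, "createDatabase")
run_function(connection, functions, "addEntry", {"lang": "en", "segment": "Hello world"})
rows = connection.execute("SELECT lang, segment FROM MONO").fetchall()
print(rows)
```

Supporting another database product would then mean supplying a different configuration with the same function names, while load_functions and run_function stay unchanged.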
12 TRANSLATION OBJECTS
Translations are a key entity in the translation process. Translations
(inherently multilingual) usually consist of segments (monolingual) and the
languages associated with those segments.
As a consequence the architecture uses three types of language related entities.
These objects are used by processes to create the translation functionality.
A “General Linguistic Object” (GLO) contains information (features, attributes)
which is common to all linguistic information types. Examples are: unique id,
creation and modification dates, authors etc. Linguistic Objects can always be
serialized to XML. The main supported formats are XLIFF, TMX and TBX.
From that object two objects are derived:
• A “Monolingual Object” (MoLO) which represents a linguistic entity for a
given language. It inherits all the features of the GLO and adds for example
the language of the entity (segment).
• A “Multilingual Object” (MuLO) represents translations by linking one or
more MoLOs into one object. A MuLO consists of at least one MoLO and
can contain up to n MoLOs. It is not required that each MoLO of a MuLO
has a different language (see footnote 4).
Each of those object types contains a unique id; in addition a MoLO holds a
MuLO related id so that it can easily be associated with its translations.
4 The behaviour of multilingual objects can be configured. One option can be to
treat all entries as bi-lingual objects only. Thus one MuLO would only contain
two MoLOs, a source and a target MoLO. Normally options like this should be
used with caution as they introduce problems in managing real multilingual
databases. This is esp. true if one source segment may have several
translations (target MoLOs). Nevertheless there may be cases where one requires
several translations for a source segment, e.g. something like a temporary
translation. In this case it is suggested to associate “status attributes” with
the MoLO. These could then be used on the one hand as a sorting criterion for
matches and on the other hand for identifying problem translations.
Obviously attributes are associated with Linguistic Objects. As several standards
are used (TMX, XLIFF and TBX) a mapping of the attributes between the different
types is required. Within the object the attributes may be identified through their
name space.
Fig 18: Representation of linguistic entities as General Linguistic Object
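The object hierarchy of this chapter can be sketched with three small classes. The class and field names below are hypothetical, the XML serialization is omitted, and the rule that a MuLO consists of at least one MoLO is not enforced here.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class GeneralLinguisticObject:
    """Features common to all linguistic information types."""
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    author: str = ""
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class MonolingualObject(GeneralLinguisticObject):
    """A segment in one language; mulo_id links it to its translations."""
    language: str = ""
    segment: str = ""
    mulo_id: Optional[str] = None


@dataclass
class MultilingualObject(GeneralLinguisticObject):
    """Links one or more MoLOs into one translation object."""
    molos: List[MonolingualObject] = field(default_factory=list)

    def add(self, molo: MonolingualObject) -> None:
        molo.mulo_id = self.id
        self.molos.append(molo)

    def in_language(self, language: str) -> List[MonolingualObject]:
        # Several MoLOs may share a language, so a list is returned.
        return [m for m in self.molos if m.language == language]


mulo = MultilingualObject()
mulo.add(MonolingualObject(language="en", segment="Save file"))
mulo.add(MonolingualObject(language="de", segment="Datei speichern"))
mulo.add(MonolingualObject(language="de", segment="Datei sichern"))
print([m.segment for m in mulo.in_language("de")])
```

The free-form attributes dictionary is one possible place for namespace qualified attributes and for the status attributes suggested for temporary translations.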
12.1 Format information
Format information (e.g. transported thru the <ph> tag in XLIFF) and its correct
handling is a key kernel function of OpenTMS. The core OpenTMS library
contains all the necessary functions to handle format information correctly.
OpenTMS should aim at providing the highest possible support in format handling.
12.2 Terminology versus Translation Memory
Within computational linguistics a key difference is made between terminology
and translation memory. Both concepts clearly are used in two different
contexts. This is also reflected in the fact that there are (at least) two
standards: TMX (TMX, 2008) and TBX (TBX, 2008). Nevertheless from a conceptual
and software engineering point of view both concepts have more in common than
distinguishes them. Both have “strings” as their basic representations, either
as terms or as segments, and their meta information also matches in most
cases. A main difference is their usage context. TMs are normally applied at
segment level (and normally consist of more characters), while terms are used
at a sub segment (word, phrase) level.
As these differences only appear at the usage level, OpenTMS consequently
implements the same underlying (database) structure for TM and term entries.
Using special markers a distinction can be made at run time (= usage time). The
immediate advantage of this approach is that both concepts can be used in
different usage contexts. Search and retrieval functionality is available for
both concepts (e.g. fuzzy search is rarely available for term databases; using
a common internal representation this drawback is overcome).
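A shared structure with a marker can be sketched in a few lines. The names Entry, kind and fuzzy_search are hypothetical, and difflib's ratio again stands in for whatever similarity measure an implementation would use. The sketch shows that the same fuzzy search serves term entries as well as TM entries.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


@dataclass
class Entry:
    text: str
    language: str
    kind: str  # hypothetical marker: "tm" for segments, "term" for terms


def fuzzy_search(entries: List[Entry], query: str, min_percent: int,
                 kind: Optional[str] = None) -> List[Tuple[int, Entry]]:
    """Search the shared store; `kind` narrows the search at usage time."""
    hits = []
    for entry in entries:
        if kind is not None and entry.kind != kind:
            continue
        # Assumption: difflib's ratio stands in for the similarity measure.
        ratio = SequenceMatcher(None, query.lower(), entry.text.lower()).ratio()
        percent = int(ratio * 100)
        if percent >= min_percent:
            hits.append((percent, entry))
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


store = [
    Entry("Switch off the printer", "en", "tm"),
    Entry("printer", "en", "term"),
    Entry("scanner", "en", "term"),
]
term_hits = fuzzy_search(store, "printers", 80, kind="term")
print([entry.text for _, entry in term_hits])
```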
Fig 19: Conversions of linguistic entities
12.3 Variables, placeholders, replacement classes
Translation memory entries, sometimes also terminology entries, often contain
textual parts which can act as placeholders. Typical examples of placeholders
are numbers, month names, acronyms etc. In many cases it is possible to
automatically replace those “variable parts” with their actual counterparts in
a segment. This is esp. useful in matching: just by replacing the numbers in a
match with their correct values a better match, even a perfect match, can be
achieved.
OpenTMS supports for this reason the concept of replacement classes. A
replacement class is a specific construct which generalizes a certain type of
string or information. A replacement class basically consists of two parts:
• A class name (e.g. number)
• A procedure describing the replacement class. In many cases the procedure
can be defined through a regular expression. Another option may be that
specific strings (e.g. terms from a terminology database) act as
replacement class.
A procedure may be language dependent. If a procedure is language dependent,
transformation rules have to be defined which describe how a value of language
A is transformed into a value of language B.
Example:
Class: GeneralNumber
Procedures:
  General:
    Definition: ([0-9]+)\.([0-9]+)
    Transform: $1.$2
  German:
    Definition: ([0-9]+),([0-9]+)
    Transform: $1,$2
The basic idea is that a language specific procedure involves two parts:
• a definition part which describes how to detect (evaluate) an instance of a
replacement class
• a transformation part which describes how to compute the instance of a
replacement class given that a replacement class has been detected (e.g.
in another language)
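The two parts of a language specific procedure can be tried out with a short sketch based on the GeneralNumber example. The dictionary layout and the function name transform_value are hypothetical. The decimal separator is matched literally, and the transform is written as a Python format template rather than in the $1 notation used in the example.

```python
import re
from typing import Dict, Optional

# Hypothetical encoding of the GeneralNumber example: per language a
# definition (regular expression) and a transform (output template).
GENERAL_NUMBER: Dict[str, Dict[str, str]] = {
    "general": {"definition": r"([0-9]+)\.([0-9]+)", "transform": "{0}.{1}"},
    "german": {"definition": r"([0-9]+),([0-9]+)", "transform": "{0},{1}"},
}


def transform_value(value: str, source: str, target: str) -> Optional[str]:
    """Detect `value` with the source definition and render it for the target."""
    match = re.fullmatch(GENERAL_NUMBER[source]["definition"], value)
    if match is None:
        return None  # not an instance of the replacement class
    return GENERAL_NUMBER[target]["transform"].format(*match.groups())


print(transform_value("3.14", "general", "german"))
print(transform_value("3,14", "german", "general"))
```

The definition part performs the detection and the transformation part computes the instance for the other language from the captured groups.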
When a replacement class matches part of a segment, the matching part is
replaced with the replacement class, carrying forward the class name and the
original value.
Replacement classes pose two main challenges:
• A key problem in defining replacement classes is the order in which they are
invoked (checked). Depending on the definition of the regular expressions
several expressions may match (e.g. numbers without and with decimal
points). OpenTMS should apply a strict linear order procedure. The first
matching expression is applied and used.
• The other key problem is checking whether all the replacement classes appear
a) in both the source and the target of the match and b) in the source segment
(the one which requires translation). For OpenTMS the proposed solution is
that the replacement classes in both source and target have to match exactly.
If this is given, the replacement classes also have to match the source
segment to be translated. It has to be noted that another approach could be
used too: removing the non matching replacement classes in all three involved
strings.
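Both challenges can be made concrete in a short sketch. It is one possible interpretation, not the OpenTMS algorithm: the class list, the function names generalize and classes_consistent, and the {ClassName} placeholder notation are hypothetical, and “match exactly” is read here as “the same class names occur the same number of times”.

```python
import re
from typing import List, Tuple

# Hypothetical ordered class list: earlier classes win over later ones.
CLASSES: List[Tuple[str, str]] = [
    ("DecimalNumber", r"[0-9]+\.[0-9]+"),
    ("Number", r"[0-9]+"),
]


def generalize(segment: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Replace class instances by {ClassName}, keeping (class, value) pairs."""
    found: List[Tuple[str, str]] = []
    # One alternation in list order: at each position the first matching
    # class is used, which gives the strict linear order.
    pattern = "|".join(f"(?P<{name}>{regex})" for name, regex in CLASSES)

    def replace(match: re.Match) -> str:
        found.append((match.lastgroup, match.group()))
        return "{" + match.lastgroup + "}"

    return re.sub(pattern, replace, segment), found


def classes_consistent(tm_source: str, tm_target: str, segment: str) -> bool:
    """True if all three strings contain the same replacement classes."""
    names = [
        sorted(name for name, _ in generalize(text)[1])
        for text in (tm_source, tm_target, segment)
    ]
    return names[0] == names[1] == names[2]


print(generalize("Set 3.5 bar for 10 minutes"))
print(classes_consistent("Wait 5 minutes", "5 Minuten warten", "Wait 7 minutes"))
```

If the two entries of CLASSES were swapped, the value 3.5 would be split into two Number instances, which is why the order has to be fixed.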
13 PROCESS MODEL
13.1 OpenTMS Process
An OpenTMS process realizes the functionality of the OpenTMS system, mainly
supporting the translation process. Examples of processes are converters,
segmenters, translation memories, machine translation, statistics modules etc.
OpenTMS processes build on the core library functions and move them into a
process environment. In many cases this does not really mean that a process is
created in the strict sense of a process; it could also mean that a function of
the core library (or any other function defined in another OpenTMS context) is
called from an application.
13.2 OpenTMS Scripting Language
Most OpenTMS processes are available through the OpenTMS Scripting Lan-
guage (OpenTMSL). The OpenTMS Scripting language enables developers and
users to write their own scripts to adapt the OpenTMS processes to their needs.
OpenTMSL is defined in a programming language independent way and should be
implemented in different programming languages. It basically makes the functions
defined in the core library accessible to the public through an easy to learn script-
ing language.
Fig 20: OpenTMS Scripting Language