Requirements for Localizable DTD Design

Working Draft 7 July 2003

Latest version:
http://people.w3.org/rishida/localizable-dtds/
This version:
http://people.w3.org/rishida/localizable-dtds/localisable-dtds-030704.html
Previous version:
http://groups.yahoo.com/group/lisa-its/files/ITS-Requirements/Drafts/ITS-Requirements-k.html (June 2001)
Editors:
Richard Ishida, Yves Savourel

Abstract

When creating DTDs it is important to include constructs that meet the needs of the localisation community, and that enable documents produced using the DTD to be successfully rendered in another language and / or locale. This document sets out to list the key requirements in this regard. It will be used to provide a framework and direction for a detailed solution proposal (or set of proposals) to be developed later. The concepts in this document apply equally to schemas developed with XML Schema or similar languages.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document.

This is a refreshed version of the first edition of this requirements document and is intended to provide a starting point for discussion. It is in no way final. This document is a Working Draft for review by any interested parties. It is a draft document and may be updated, replaced, or made obsolete by other documents at any time. It is inappropriate to use this document as reference material or to cite it as other than "work in progress".

Changes in this version include:

There have been no changes to the actual content.

(There is a public mailing list <lisa-its@yahoogroups.com> associated with this document, but it has been dormant for some time now. To view archived mail and files visit <http://groups.yahoo.com/group/lisa-its>.)

Table of Contents

1 Introduction
2 Requirements

Appendices

A Acknowledgements
B References

1 Introduction

1.1 Rationale

As XML usage grows in various domains, more and more XML documents and applications are localised into different languages and / or locales. In order to localise XML data in a cost-effective and time-efficient manner, a number of conditions must exist. Most of these conditions can be set early in the development of the XML schemas or DTDs. In addition, various guidelines should be followed during the development of the XML content. This document highlights the requirements for creating XML document types and authoring XML content that are well suited to localisation.

Internationalisation activities should always have two aims in mind:

  1. to enable localised versions of the application to function correctly
  2. to ensure that the localisation activity itself is as painless, rapid and cost-effective as possible.

This document places significant emphasis on the second of these points. The localisation process and tools typically receive scant consideration in the creation of XML documents and applications. It is hoped that the localisation community and localisation tools vendors will contribute to the development of this document and its related proposals to ensure that they promote standards of value to that community.

It is hoped that these proposals will provide useful context and support information to localisers, in a standard way, and will promote the use of standard mechanisms for encapsulating information required by translation tools and processes. Such standardisations should result in faster, less labour-intensive, and more powerful localisation processes and tools, which in turn benefit the producers of content by lowering barriers to international deployment of their applications.

1.2 Target Audience

The target audience of this document includes the following categories:

  1. Developers of XML schemas and DTDs
  2. Developers of XML authoring tools
  3. Authors of XML content
  4. Developers of localisation tools
  5. Localisers involved with XML
  6. Developers of internet specifications at the World Wide Web Consortium and related bodies.

1.3 Document purpose and update strategy

In this document we try not to propose solutions (although it may be hard to avoid some of the more obvious possibilities) - the expectation is that the ITS group will produce a response to this document that proposes best practices or standard tags, etc. The role of this document is to provide some direction and list the issues needing resolution.

As it is unlikely that we will have identified the complete problem set for some time, the intention is for the document to be updated on an ongoing basis as new ideas and issues arise or are proposed. The updating of this document may therefore occur simultaneously with the development of the response in the early stages. For this period the versioning will be indicated by the date at the beginning of this document. At some point, when it is felt that we have a workable body of issues to address in an initial solution proposal, the numbering and version control will be phased to coincide with the solution proposal documents.

1.4 Terminology

Schema
Used on its own, this term refers to both document type definitions (DTDs) and XML Schemas.
MT (Machine translation)
Translation that is achieved by sending text to a computer program that applies linguistic rules and terminology and outputs an attempt at a translation. Most translation by vendors is currently achieved using MAT (Machine-Assisted Translation) tools - rather than attempt the translation alone, these applications provide support for a human translator.

2 Requirements

The requirements for localisation-specific information in XML can be defined at several levels.

2.1 General requirements

General requirements may include the following:

  • Implementation of the tags and guidelines should pose the minimum reasonable burden on the content author. Automation or inheritance should be considered wherever possible - content authors are already busy people!
  • Where tag or attribute names are proposed for a particular function, these should be used in a standard way throughout the localisation industry to maximise interoperability of data and ease of data manipulation.

2.2 Direct identification of content that should not be localised

It must be possible to signal to the localisation group that a particular item of content should not be changed during localisation. This may refer to a single character or a large chunk of data; it may refer to text data, structural items, or graphic or multimedia entities. The method used should allow localisation tools to automatically identify and isolate the specific data in question.

Background

There are a number of reasons why it may be necessary to leave particular words, phrases or parts of a document in English during translation.

  • Example 1: the text in the translated documentation refers to a user interface or part of a user interface that will not be translated (eg. "Click on the START button.", where 'START' must match the text on the untranslated UI, whatever the translation of the rest of the sentence.)
  • Example 2: the text provides an example of syntax the user can type, such as 'DateQuery( Year, Month )'. In this case the translator must be made aware that 'DateQuery' is command syntax which must remain in English, whereas the words 'Year' and 'Month' should be translated, since the user will enter these in their own language.

If text must not be translated for one reason or another it must be possible to indicate this in the XML. The mechanism for this should be defined in the DTD.

Without this, translators may at best waste time deciding what should and should not be translated, and at worst will make mistakes.

No-translate assignments can be particularly useful where machine translation or gisting is involved, since the computer typically has no way of deciding what it would be inappropriate to translate.

Notes

Using an industry-standard tag or attribute name would make it much easier to supply text to translation tools, resulting in reduced cost and time for localisation. This would require a proposal that meets the needs of the localisation industry.

A response to this requirement should consider whether it is appropriate to stipulate that the non-translate / translate setting is inherited - allowing a structural element to apply the property to all contents with minimum intervention from the author.
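The inheritance question can be made concrete with a minimal sketch. It is an illustration only, not a proposed solution: it assumes a hypothetical `translate` attribute taking the values "yes" and "no", a hypothetical `ui` element, and a default of translatable when nothing is declared.

```python
import xml.etree.ElementTree as ET

def translatable_text(root, default=True):
    """Yield (text, translate) pairs, inheriting the flag from ancestors."""
    def walk(elem, inherited):
        flag = elem.get("translate")  # hypothetical attribute name
        current = inherited if flag is None else (flag != "no")
        if elem.text and elem.text.strip():
            yield elem.text.strip(), current
        for child in elem:
            yield from walk(child, current)
            # Tail text belongs to the parent, so it takes the parent's flag.
            if child.tail and child.tail.strip():
                yield child.tail.strip(), current
    yield from walk(root, default)

SAMPLE = '<para>Click on the <ui translate="no">START</ui> button.</para>'
pairs = list(translatable_text(ET.fromstring(SAMPLE)))
```

With inheritance, a single `translate="no"` on a structural element would cover everything inside it, and an author would only need to mark the exceptions.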

Is there a requirement for a 'localise' attribute, that indicates that localisation changes should be made that don't entail translation - eg. changing a contact address? This may be helpful to show what must be addressed in post-editing after machine translation has taken place.

2.3 Indirect identification of content that should not be localised

An approach should be defined to signal to the localisation group that a particular item of content should not be changed during localisation where this is a function of contextual rules. This may refer to a single character or a large chunk of data; it may refer to text data, structural items, or graphic or multimedia entities. The method used should allow localisation tools to automatically identify and isolate the specific data in question.

Background

The following example based on UIML is taken from [XMLI&L]:

<style>
  <property part-name="Main" name="rendering">Main</property>
  <property part-name="Main" name="content">Sample UI</property>
  <property part-name="Component1" name="rendering">Text</property>
  <property part-name="Component1" name="content">Some text to translate.</property>
</style>

In this example only the element content of the 3rd and 5th lines ('Sample UI' and 'Some text to translate.') should be translated. The clue to the appropriateness of translation is given in this case by the value of the name attribute.

In other schemas it is possible that the content of a particular element should never be translated, due to the nature of the data it contains (eg. a part number).
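As a rough sketch of how a localisation tool might apply such a contextual rule to the UIML example, the code below selects only those property elements whose name attribute is "content". The rule table format and the function name are assumptions made for illustration; no standard rule format is defined here.

```python
import xml.etree.ElementTree as ET

UIML = """<style>
<property part-name="Main" name="rendering">Main</property>
<property part-name="Main" name="content">Sample UI</property>
<property part-name="Component1" name="rendering">Text</property>
<property part-name="Component1" name="content">Some text to translate.</property>
</style>"""

# Hypothetical rule table: (element name, attribute, value) marks translatable content.
RULES = [("property", "name", "content")]

def is_translatable(elem, rules=RULES):
    """Return True if any rule matches the element's name and attribute value."""
    return any(elem.tag == tag and elem.get(attr) == value
               for tag, attr, value in rules)

to_translate = [p.text for p in ET.fromstring(UIML).iter("property")
                if is_translatable(p)]
```

Here `to_translate` holds the two strings that a translator should see. The same table could carry a rule for an element whose content is never translated, such as a part number.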

2.4 Provision of a SPAN-like element

An element must be provided that behaves like the SPAN element in HTML 4.0. This will allow ITS requirements (e.g. translatability or language information) to be ascribed to a range of text that is not bounded by elements in the content.

Notes

Could result in an attribute or tag name that is promoted as a standard by the localisation industry.

2.5 Notes to localisers

A method must exist for authors to communicate information to localisers about a particular item of content. There should be two such types of information: firstly, notes that must be read before the localiser attempts to localise, and secondly, notes that provide optional background information. Localisation tools must be able to automatically identify and isolate the specific data to which the note refers, and automatically distinguish between the two different types of note.

Background

To assist the translator to achieve a correct translation, authors may need to provide information about the text that they have written. For example, the author may want to:

  • tell the translator how to translate part of the content
  • expand on the meaning or contextual usage of a particular element, such as what a variable refers to or how a string will be used on the UI
  • clarify ambiguity and show relationships between items sufficiently to allow correct translation (eg. in many languages it is impossible to translate the word 'enabled' in isolation without knowing the gender, number and case of the thing it refers to.)
  • explain why text is not translated, point to text re-use, or describe the use of conditional text
  • indicate why a piece of text is emphasised (important, sarcastic, etc.)
  • etc

This can help translators avoid mistakes or avoid spending time searching for information.

Two types of developer's note are needed:

  1. an alert
  2. a description

An alert contains information that the translator MUST read before translating a piece of text. The translation environment must bring this type of note to the attention of the translator before they begin to translate. (For example, an instruction to the translator to leave parts of the text in the source language.)

A description provides useful background information that the translator will refer to only if they wish. (For example, a clarification of ambiguity in the source text). The translation tool would still make this available to the translator, but would not force them to read it before attempting a translation. The translator may only receive an indication that such a note exists and have to take action to view the text.

Notes

Could result in an attribute or tag name that is promoted as a standard by the localisation industry.

2.6 Attributes and translatable text

A schema should ensure that translatable text is stored in elements rather than attributes whenever possible.

Background

If translatable text is provided as an attribute value rather than element content, the following problems may arise:

  1. It is impossible to apply meta-information such as no-translate flags, designer's notes, etc. to the text of the attribute value.
  2. It is difficult to identify in a generic fashion which attribute text is to be translated and which is not (i.e. without recalibrating the translation tools before every job to recognise a set list for each DTD).
  3. The inability to attach unique ids to localisable attribute text makes it more difficult to use ID-based leveraging tools.
  4. Translatable attributes can create problems when they are prepared for localisation because they can occur within the content of a translatable element, breaking it into different parts. An example of this can be found in [XMLI&L], Chapter 7.
  5. The language selection mechanism applies to the content of the element where it is declared, including its attribute values. In cases where the text in an attribute is in a different language from the text of the content, using an attribute prevents you from setting the language correctly.

In the example code below the no-translate flag applies to the content of the element, but not to the title text. The title text may benefit from id-based leveraging, but has no ID. The xml:lang attribute, after translation, will only be relevant for the element content, not the title text.

<extract id="0517.1447" translate="no" xml:lang="en" title="Ambiguous linguistic construct.">The man hit the boy with the stick in the bathroom.</extract>

In the next example part of the alt text should be left untranslated (the name of the picture), but it is difficult to see how that would be expressed so that a machine translation tool would exhibit the correct behaviour.

<image id="0517.1716" alt-text="Catalog number 123: The Fish Wife" src="fishwife.png"/>

2.7 Emphasis & document conventions

The schema should express the application of emphasis and document conventions to a particular range of content using naming that reflects the intention and that is not tied in any way to presentation.

Background

Formatting of emphasis and document conventions will be applied in different ways by different cultures and writing systems. For example:

  • some cultures use totally different methods of emphasising text to those used commonly in the West (eg. amikake and wakiten in Japanese), or they may express emphasis using language rather than presentation.
  • for electronic documents, Japanese may prefer not to use bolding or italicisation in small font sizes due to the complexity of the characters.
  • applying capitalisation as a way of indicating procedure names or on-screen text will fail in most non-Latin scripts, since these scripts have no upper- vs. lower-case distinction. (Also, there may not be an equivalent to mono-spaced fonts, as used in this document for tag names and examples.)

Allowing the author to choose bold or italicise or similar tags for emphasis and style elements can lead to:

  • inconsistency in the way emphasis is applied in the document by the author (relevant even if you don't localise)
  • a lack of ability to distinguish between applications of the tag which may need to be formatted differently in another language. (For example, take a document written in Japanese that perhaps uses only underline for all types of emphasis because bold and italic don't work well for small on-screen fonts. When that document is translated into English, it may be desirable to apply bold for certain types of emphasis and italic and underline in others, since this is common for English text. If, however, the Japanese author had associated the emphasis tags with the presentational aspects (ie. they were all simply called 'underline') there will no longer be a way of automatically distinguishing the appropriate context for a varied application of styles.)

Instead the DTD should provide tag or attribute names such as importance, irony, contrast, stress, exasperation, etc.

Notes

A stylesheet should be able to provide a method of applying different presentational mechanisms to a range of text on a language by language basis (ITS does not currently include stylesheet internationalisation in its remit).

2.8 Tags with linguistically-dependent scope

It must be possible for localisation tools and localisers to clearly identify and recognise tags whose applicability and extent will need to be changed during the process of localisation. There must be no restrictions on the modification of these tags as the content changes during localisation (ie. changes to the location and extent of the tag in relation to the content as well as duplication and deletion must all be allowable.)

Background

Certain element tags, such as emphasis, will need to be manipulated by the translator. For example, underlining will stretch across a different range of characters, and may need to be removed or doubled up by the translator within a given sentence.

It is expected (but not necessarily always true) that many such elements which contain content will be in-line in nature rather than block elements, since the key driver for their adaptation will be to achieve a natural sentence syntax in the foreign language. Further, many such tags are likely to be hooks for the application of formatting style. Examples may include emphasis, hyperlink, subscript, superscript, citation and span.

Empty tags also fall into this category. In order to position an empty tag appropriately in the text the translator will usually need to know what it refers to.

Much of the time the translator will simply need to know the meaning of the tags so that they can fit the translation around the protected tag names in the appropriate way. In some cases, achieving a good sentence flow will mean re-ordering multiple tags.

There will also be occasions when the translator wishes to remove or insert new tags because of the requirements of their language. For example, in a particular language, emphasis may be expressed by the language itself rather than by formatting changes and in this case the translator will want to remove the emphasis tag altogether. The converse of this, of course, is that when translating in the other direction the localiser will need to add markup that was not present in the original. Alternatively, some languages may require an additional range of emphasis to be defined within the same sentence.

Notes

The recommendations for this section touch on process issues as well as DTD design.

2.9 Unique identifiers

It should be possible to attach a unique identifier to any localisable item of content - be it text, structure or unparsed entity. This id should be completely unique across all documents but should be identical across all translations of the same item.

Background

In order to most effectively re-use translated text where content is re-used (either across update versions or across deliverables) it is necessary to have a totally unique and eternally persistent id associated with the element. This id allows the translation tool to correctly track an item of content from one version or location to the next. After one is sure that this is the same item, the content can be examined for changes, and if no change has taken place the potential for re-use of the previous translation is very high. This approach can be referred to as change analysis.

The potential for re-use of translations is very appealing in terms of productivity and cost savings for product launch.

Change analysis constitutes an extremely powerful productivity tool for translation when compared to the typical source matching (a.k.a. translation memory) techniques, which simply look for similar source text in the database without being able to tell whether the context of its use is the same.

This change analysis technique has been possible with UI messages in the past, but the introduction of structured XML (and SGML) documents will allow for its use in documents also.

Where text entities will be re-used across products, or where a localisation group is dealing with these ids must be totally unique.
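The change analysis described above can be sketched as follows. This is a minimal illustration under stated assumptions: each version of the source is represented as a plain mapping from persistent id to text, the function name is invented, and the French strings are sample data only.

```python
def change_analysis(old_source, new_source, old_translation):
    """Split the ids of new_source into reusable, changed and new items."""
    reuse, changed, new = {}, [], []
    for item_id, text in new_source.items():
        if item_id not in old_source:
            new.append(item_id)            # never seen before: translate
        elif old_source[item_id] == text and item_id in old_translation:
            reuse[item_id] = old_translation[item_id]  # same item, unchanged
        else:
            changed.append(item_id)        # same item, text changed: review
    return reuse, changed, new

old_source = {"msg1": "Root path:", "msg2": "Display Options:"}
new_source = {"msg1": "Root path:", "msg2": "Display options:", "msg3": "Help"}
old_translation = {"msg1": "Chemin racine :", "msg2": "Options d'affichage :"}

reuse, changed, new = change_analysis(old_source, new_source, old_translation)
```

Unlike source matching, the lookup is keyed on the id rather than on the text, which is why the id must persist across versions and be identical across translations.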

2.10 Character encoding declarations

Character encoding should be identified at the top of any parsed entity using the standard method defined for XML. The encoding should be declared for all external parsed entities that will be included in a document.

Background

The encoding of the text in an entity must be declared or predictable in order for the application reading it to understand the contents of that entity - whether this is a document entity or an external parsed entity.

An encoding used to create source text might not support the characters required for translation. If the translated text is in a different encoding from the source, it is especially important to declare which encodings are being used where. In addition, if different encodings are being used for external entities, some conversion will be necessary to produce a single encoding throughout the document instance.

If the application cannot automatically apply the correct character set mappings the document will not be readable.

It would be useful to always declare the encoding used at the top of any file, in ASCII, to aid in reading or processing files.
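To illustrate why an ASCII-readable declaration at the top of the file helps, here is a minimal sketch of how a tool might read the declared encoding before decoding anything else. It assumes an ASCII-compatible encoding with no byte order mark, and it does not attempt the full encoding detection procedure described in the XML specification. The function name is an assumption.

```python
import re

# Matches the encoding pseudo-attribute of an XML or text declaration.
ENCODING_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

def declared_encoding(data, default="UTF-8"):
    """Return the encoding named in the declaration, or default if none is given."""
    match = ENCODING_DECL.match(data[:200])
    return match.group(1).decode("ascii") if match else default

doc_encoding = declared_encoding(
    b'<?xml version="1.0" encoding="ISO-8859-1"?><doc/>')
entity_encoding = declared_encoding(b'<?xml encoding="Shift_JIS"?>text')
```

The second call shows the text declaration of an external parsed entity, which may declare an encoding that differs from that of the document entity.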

2.11 Declaring the language of the content

The main language (or languages of a truly multilingual document) must be declared at the beginning of any document, using industry standard approaches. Such declarations should also apply to any external parsed entities that are stored separately.

Any content in another language within a document should be labelled appropriately.

In addition, it must be possible to declare a single document as being composed of multilingual parts of equal standing, ie. the document entity does not represent a single language.

Background

A number of rendering practices will vary according to the locale of the text - i.e. the language and market region. Examples include text expansion, hyphenation, wrapping rules, colour usage, fonts, spell checking, line height and inter-line spacing, quotation marks and other punctuation, etc. For the appropriate presentation to be applied automatically to documents in different languages it is essential to know the language of the text.

It would also be useful to indicate the locale of the document as a whole to facilitate both processing and identification of translated documents during localisation and content management.

It should also be possible to indicate the language of the XML content for any element or range of text where the language differs from that of the document as a whole. Note that this includes graphics, audio and other unparsed entities, which may need labelling for or treatment specific to a given locale.
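For text content, the standard xml:lang attribute already provides this kind of labelling. The sketch below shows one way a tool might work out the effective language of each element, with the value inherited from the nearest ancestor that declares it. The element names and the function name are invented for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def languages_in(elem, inherited=None):
    """Yield (element name, effective language) for elem and its descendants."""
    lang = elem.get(XML_LANG, inherited)
    yield elem.tag, lang
    for child in elem:
        yield from languages_in(child, lang)

DOC = ('<doc xml:lang="en"><p>The motto is '
       '<q xml:lang="fr">Je me souviens</q>.</p></doc>')
result = list(languages_in(ET.fromstring(DOC)))
```

An element with no declaration anywhere above it yields None, which is exactly the case this requirement asks schema designers to avoid.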

2.12 Describing other cultural aspects of the content

It must be possible to declare more than just language at the beginning of any document or any external parsed entities that are stored separately. This information may include any combination of language, script usage, geographical area, dialect, or historical period.

Any content within a document which varies from the declaration at the head of the document should be labelled appropriately.

In addition, it must be possible to declare a single document as being composed of multicultural parts of equal standing - ie. the document entity does not represent a single culture.

Background

The current system of language identification does allow for an approximation to 'locales' by appending country codes to the language codes, but there are difficulties with this classification system that are already being encountered in localisation. For example: How does one distinguish in a standard way between simplified and traditional Chinese without using codes for China and Taiwan, or describe whether Serbian text is in the Latin or the Cyrillic script? How does one indicate that a voice track is in the language spoken in German-speaking Switzerland rather than the language written there, since one is Schwytzertuutsch and the other is very close to but not the same as 'High German'? How does one indicate that a piece of content is in 'International Spanish'? How does one indicate that this is English as spoken in the time of Chaucer?

Most importantly, how does one do this in a way that lends to tools and systems automatically recognising the labels used in order to apply presentation or processing?

Notes

This is an area that cries out for a solution that provides interoperability through standardisation; however, the development of locale and script related tag standards is a significant area of study in its own right that is outside the remit of ITS.

Proposals in this area will impact W3C specifications significantly.

2.13 Citations

Any citation used in text should be identified as such, and should be accompanied by information about the source of the citation. A standard approach should be used to identify the source so that localisation tools can automatically retrieve the information about the source.

Background

Take an example such as the following:

Selecting Initialise Auditron will always produce a confirmation screen. If OK is selected twice, the Auditron will be initialised and the account data deleted.

The text 'Initialise Auditron' is a quotation from the user interface. Other types of quotation include mimics of the operating system or application messages. Since these messages have typically already been translated, the translator's job is to locate the actual translation previously supplied so as to maintain consistency.

For example, 'Initialise Auditron' may have been translated in Spanish in any of the following ways:

  • Inicializar Auditrón
  • Inicializar el Auditrón
  • Inicialización del Auditrón

Any of these are acceptable translations, but the translator must try to choose the exact words used in the context which is being quoted.

This requirement calls for a standard way of delimiting quotes so that change analysis, source matching or other approaches can be used to locate the appropriate translation quickly.

Notes

Could result in an attribute or tag name that is promoted as a standard by the localisation industry.

2.14 References to UI messages in documentation

Quotations of user interface messages in documentation text should be implemented in such a way that it is possible to retrieve the actual text from the UI resource database.

Background

This takes the previous idea a little further. If a UI message is quoted in the documentation, it is likely to improve productivity and quality of localisation to pull the translated text directly from the UI database, rather than asking the translator to type it in again.

In this case, the example given immediately above may look something more like the following:

Selecting <ui-message name="Initialise Auditron" id="msg123" /> will always produce a confirmation screen. If <ui-message name="OK" id="msg124" /> is selected twice, the Auditron will be initialised and the account data deleted.
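A minimal sketch of the retrieval step, assuming the UI resource database can be queried as a simple mapping from message id to translated string. The mapping, the function name and the fallback behaviour are assumptions; the Spanish string is one of the candidate translations listed in the previous section. Only a flat paragraph is handled, and the surrounding sentence is left untranslated.

```python
import xml.etree.ElementTree as ET

# Hypothetical extract of a Spanish UI resource database, keyed by message id.
UI_STRINGS_ES = {"msg123": "Inicializar Auditrón"}

def resolve_ui_messages(fragment, ui_strings):
    """Return the text of fragment with each ui-message replaced by its UI string."""
    root = ET.fromstring(fragment)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "ui-message":
            # Fall back to the source-language name if the id is unknown.
            parts.append(ui_strings.get(child.get("id"), child.get("name", "")))
        parts.append(child.tail or "")
    return "".join(parts)

PARA = ('<p>Selecting <ui-message name="Initialise Auditron" id="msg123" /> '
        'will always produce a confirmation screen.</p>')
resolved = resolve_ui_messages(PARA, UI_STRINGS_ES)
```

Because the string comes from the database, the documentation quotes exactly the wording the user sees on screen, whichever of the acceptable translations was chosen for the UI.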

Notes

Could result in an attribute or tag name that is promoted as a standard by the localisation industry.

2.15 Indication of container size

Where fixed sizes are used for containers or objects (such as tables, table cells, frames, buffers, screens, images, etc.) a standard method should be used for indicating the dimensions of the container so that localisation tools can automatically recognise them.

Background

This helps localisers ensure that content will fit as text expands in translation or if graphics need to be adapted.

Notes

Could result in an attribute or tag name that is promoted as a standard by the localisation industry.

2.16 Infinite naming scheme

Well-formed documents should not use element or attribute names that are dynamically created.

Background

For example, the following XML excerpt, which has automatically generated element names, is not very conducive to localisation because most translation tools will be unable to deal with it efficiently. This is because the translation tools do not know what each element represents, and therefore how to deal with it.

<message001>Root path: </message001>
<message002>Display Options: </message002>
...

2.17 Allowed characters

Where repertoire restrictions apply, there should be a means of indicating the range of characters that can be used in a local version of a document.

Background

For example, a document may contain UI strings for a firmware application where the character set is limited and allows only a small sub-set of Unicode.
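If the permitted repertoire were declared in a machine-readable way, a localisation tool could check translated strings against it. The sketch below assumes the repertoire is given as a list of inclusive code point ranges; that representation, the names, and the printable-ASCII repertoire are all illustrative assumptions.

```python
def disallowed_characters(text, allowed_ranges):
    """Return the characters of text that fall outside every allowed range."""
    return [ch for ch in text
            if not any(lo <= ord(ch) <= hi for lo, hi in allowed_ranges)]

# Hypothetical firmware repertoire: printable ASCII only.
FIRMWARE_REPERTOIRE = [(0x20, 0x7E)]

problems = disallowed_characters("Inicializar Auditr\u00f3n", FIRMWARE_REPERTOIRE)
```

In this example `problems` contains the single accented letter, which tells the translator that the string cannot be displayed by the target device as written.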

2.18 Term identification

It should be possible to indicate that a given element or span of text is a term.

Background

The capability to specify terms within the source content is of great interest for various translation and terminology-related tools. It facilitates, for example, the creation of glossaries and indexes, and allows terminology validation between source and target documents.

Notes

Could result in a set of attributes and tag names that is promoted as a standard by the localisation industry.

2.19 Inline and subflow elements

There should be a means of indicating whether an element is equivalent or not to a unit that will be used for automated translation processing. Some elements may contain other elements which are translation units in their own right.

Background

Documents are organized in elements containing text, elements or both (mixed content). Identifying the type of each element is important for the translation tools because they base the segmentation of the text on these properties. It is also necessary to have provision for identifying an element inside a mixed content that is a segment by itself. [RI: note that we should probably tease out the idea that segments may not be equivalent to elements - eg. a number of sentences may be included within a single para element. I assume that we are not planning to try to segment at this level, are we? Although perhaps some automated process run on the XML file prior to localisation may do something of the sort???]

For example in XHTML, <p> and <td> have mixed content, while <b>, <small>, etc. are inline elements. At the same time a quote within a paragraph may be a subflow-type element.

2.20 Word breaking

A property to indicate whether an element can mark the end of a word is necessary for tools to get accurate word counts.

Background

For example in XHTML, <br/> is a word-breaking element, while <b> or <big> are not.
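The effect on word counts can be shown with a small sketch. It assumes the word-breaking property is available to the tool as a set of element names (here a hypothetical set for XHTML) and counts words by splitting on white space, which is itself only adequate for scripts that separate words with spaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-schema property: elements that mark the end of a word.
WORD_BREAKING = {"br", "p", "td"}

def flatten(elem, breaking=WORD_BREAKING):
    """Concatenate text, putting a space around each word-breaking element."""
    sep = " " if elem.tag in breaking else ""
    inner = (elem.text or "") + "".join(
        flatten(child, breaking) + (child.tail or "") for child in elem)
    return sep + inner + sep

def word_count(xml_fragment):
    """Count white-space separated words in the flattened text."""
    return len(flatten(ET.fromstring(xml_fragment)).split())

inline_count = word_count("<p>un<b>believ</b>able</p>")
break_count = word_count("<p>one<br/>two</p>")
```

The first fragment counts as one word because `<b>` does not break the word. The second counts as two because `<br/>` does, even though no space appears in the source.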

2.21 White space handling

It must be possible to specify whether a given element allows white spaces to be collapsed during translation, and the XML must appropriately handle spaces for non-Latin scripts (such as Thai and CJK).

Background

The xml:space="preserve" attribute allows an author to specify that white space must be preserved, but this property should also be available at the document type level.

Knowing whether the white space in a given element is collapsible or not is important for proper matching when using translation memory tools.
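As a sketch of why this matters for matching, the function below builds the string a tool might look up in a translation memory, collapsing runs of white space unless the element itself carries xml:space="preserve". It is a simplification under stated assumptions: inheritance of xml:space from ancestors and any declaration at the document type level are not handled, and the function name is invented.

```python
import re
import xml.etree.ElementTree as ET

XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def segment_key(elem):
    """Return the text of elem as it might be looked up in a translation memory."""
    text = "".join(elem.itertext())
    if elem.get(XML_SPACE) == "preserve":
        return text                           # white space is significant
    return re.sub(r"\s+", " ", text).strip()  # collapse and trim

collapsed = segment_key(ET.fromstring("<p>Root\n   path:  </p>"))
preserved = segment_key(ET.fromstring('<pre xml:space="preserve">a  b</pre>'))
```

Two source segments that differ only in line wrapping produce the same key when white space is collapsible, so they can share one translation memory entry.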

2.22 Unicode characters vs. markup

Unicode formatting and control code characters should only be used when markup is not appropriate.

Background

For example, there are Unicode control characters that allow the user to control bidirectional formatting of Arabic and Hebrew, but it is better to use markup to achieve this behaviour. There are some Unicode characters, however, that should be used for controlling format.

Notes

The guidelines for this requirement should be provided by [UXML] which is a Unicode TR and a W3C note.

2.23 Markup to support international script features

Markup should be available to support the required behaviours of all scripts.

Background

This covers behaviours that are not common to all scripts. For example, ruby support may be needed for Far Eastern documents, and bidirectional control tags are needed for Middle Eastern documents. These should be incorporated into the schema if it is to be properly internationalised.

Notes

The W3C specifications can provide the lead here.

[RI: There may also be aspects relating to the implementation of apparently standard features which bear investigation - for example, is there a recommended way of implementing lists given that numbering systems may vary widely in different areas - ie. the list type properties may need to be defined, but also a localisable approach may need to be taken to allow the application of such.]

2.24 Support for localisable resource data

The DTD should provide support for storage of all localisable resource data used in stylesheets, templates and user interfaces.

Background

Stylesheets often contain text which is presentational in nature, for example a 'Warning' title for warning text. If this text is in the stylesheet it can create difficulties for localisation. A better approach is to gather all such text into an XML file and refer to the appropriate piece of text from the stylesheet.

Note that the scope of this requirement is not limited to translatable text strings. Punctuation may need to be changed for different languages / scripts, as may images.

Notes

An externalised UI string may contain embedded elements, such as a variable that refers to a figure number. Such variables will need to be equipped with all necessary attribute values to convey information to the translator - such as what the variable stands for, whether or not it truncates and if so its minimum length, otherwise its maximum length. There must also be a unique identifier to allow the stylesheet to provide the value of the variable. [RI: need to develop this further - may lead to additional requirements or at least an amplification of the current requirement.]

Externalised strings should be accompanied by elements (designer's notes) that describe the use of the string and unique ids that the stylesheet will use to reference them. They may also be grouped.


A Acknowledgements

This document was developed with contributions from the following people:


B References

DTD-DESIGN
Localizable DTD Design, Richard Ishida, September 2000.
Available at: http://www.multilingual.com/ishida49.htm.
UXML
Martin Dürst and Asmus Freytag, Unicode in XML and other Markup Languages, Unicode Technical Report #20 and W3C Note. (See http://www.w3.org/TR/unicode-xml.)
XMLI&L
Yves Savourel, XML Internationalisation and Localisation, Sams, ISBN: 0672320967