Saturday, February 19, 2005

Annotating XHTML links to specify translation relationships between linked documents.

Background and Motivation

A fundamental property of the World Wide Web is its ability to represent links between documents. Historically, however, both the capability and the practice of specifying machine-readable relationship metadata between linked documents on the web have been limited. The XHTML standard does provide the rel attribute on links, and there has recently been some activity to formalize relationship metadata about linked documents and/or the entities they represent using the “Semantic XHTML” approach, e.g. XFN.

This document introduces the concept of XHTML annotations which indicate to users and software agents that the linked documents represent parallel versions of the same text in different languages – that is, that they represent translations of each other.

The ability to specify that linked documents represent translations of each other has at least two substantial applications of interest to the authors:
  1. The facilitation of multilingual blogging, as a bridge between cultures and peoples.

  2. The facilitation of crawlers/bots which can “harvest” parallel texts in different languages to be used as a corpus for training human language translation programs based on statistical computing and machine learning techniques. The widespread adoption of quality machine translation will hopefully further application 1. Further, an extensive database of parallel texts can serve as a “translation memory” to further the study of human languages themselves.

The facilitation of crawlers/bots which can harvest parallel texts from the web is of particular interest to the authors. We refer to these as “Rosettabots” after the most famous parallel text, the Rosetta stone. The facilitation of Rosettabots motivates many of the requirements on the link metadata we propose.

Vision and Goals

We have an ultimate vision which is unapologetically grand: nothing less than closing the distance between peoples and cultures of the world by making translations on the “social web” easy, ubiquitous and discoverable. The mechanisms and formats we propose, however, eschew grandiosity: we advocate an incremental, interoperable, and simple approach. The metadata format annotating translated document links which we propose here is a small, pragmatic first step toward the ultimate vision of interlingual blogging.

The initial metadata format should be sufficient to convey a bare minimum of information about the relationship between the documents and the translator(s). The metadata format should be extensible, to address issues including versioning, authentication and rights management in the future.

The link annotations indicating a translation relationship should be amenable to hand editing in existing tools. In the intermediate term, tools analogous to XFN Creator will enable authors and translators to easily add the required metadata. Widespread adoption of the metadata format will likely depend on support within popular blogging tools (Blogger, Movable Type, WordPress…) and syndication mechanisms (RSS, Atom).

Requirements for a metadata format specifying the translation relationship between linked documents:

  • The format should enable programs (crawlers, user agents, syndication services) to easily extract metadata about linked, translated documents, facilitating the applications listed above.

  • The proposed translation metadata must be usable within existing blogging and wiki tools, and must not confuse existing browsers. In particular, the format of the proposed translation metadata must not prevent a document containing the metadata from validating against the XHTML 1.0 Strict DTD.

  • A pair of linked documents consists of the original document and its translation; the annotated link may be present in either document, or both. Therefore, both a translates and a translated-by annotation type must be provided. The translates annotation represents an assertion by the translator, and therefore arguably has the least credibility. The translated-by annotation is embedded in the original document and therefore confers the approval of the original author. Bidirectional annotated links will confer the most credibility.

  • There must be a means to specify the identity and type of entity which performed the translation between documents. Note that this entity need not be the nominal “author” of either of the documents. The entity could be a human individual, an institution, a machine translation program, or some combination. The format for specifying the translating entity must therefore be flexible and extensible.

  • The proposed metadata format should provide the ability to delimit ranges within the linked documents which correspond to translated segments. The format should not preclude multiple such annotated and linked segments within a single XHTML document.

  • An original document can be modified subsequent to a translation being published, rendering the translation invalid. Therefore a translation of a document must be capable of specifying the version of the document being translated. This could be accomplished in different ways, e.g. timestamps or hashes.

  • The proposed metadata format should avail itself of all available internet standards for the relevant components, such as identity (URN), language (ISO 639), timestamps, etc., thus minimizing the unique functionality required of programs using the metadata.

  • The proposed metadata format should co-exist with existing and future Semantic XHTML proposals, for example by limiting “namespace pollution”.
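
To make the extraction and credibility requirements above more concrete, here is a minimal sketch of what a Rosettabot might do with a pair of pages. It assumes that the translates and translated-by annotations are carried as values of the rel attribute on ordinary links, alongside the standard hreflang attribute; the actual attribute placement would be fixed by a profile, and the function names and example.org URLs below are hypothetical.

```python
from html.parser import HTMLParser

# Annotation names taken from the requirements above; carrying them in
# the rel attribute is an assumption of this sketch.
TRANSLATION_RELS = {"translates", "translated-by"}


class TranslationLinkParser(HTMLParser):
    """Collects (rel value, href, hreflang) for annotated translation links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # rel holds a space-separated list of link types, as in XFN.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        for token in rel_tokens:
            if token in TRANSLATION_RELS and href:
                self.links.append((token, href, attributes.get("hreflang")))


def extract_translation_links(xhtml):
    """Return every annotated translation link found in the markup."""
    parser = TranslationLinkParser()
    parser.feed(xhtml)
    return parser.links


def is_bidirectional(original_url, original_xhtml, translation_url, translation_xhtml):
    """True when the original links out with translated-by and the
    translation links back with translates."""
    forward = any(
        rel == "translated-by" and href == translation_url
        for rel, href, _ in extract_translation_links(original_xhtml)
    )
    backward = any(
        rel == "translates" and href == original_url
        for rel, href, _ in extract_translation_links(translation_xhtml)
    )
    return forward and backward


# Hypothetical pair: an English original and its French translation.
ORIGINAL = (
    '<p><a href="http://example.org/fr/post" rel="translated-by" '
    'hreflang="fr">French version</a></p>'
)
TRANSLATION = (
    '<p><a href="http://example.org/en/post" rel="translates" '
    'hreflang="en">English original</a></p>'
)

print(extract_translation_links(ORIGINAL))
print(is_bidirectional("http://example.org/en/post", ORIGINAL,
                       "http://example.org/fr/post", TRANSLATION))
```

Because rel is an attribute that XHTML 1.0 Strict already permits on links, a page annotated this way would still validate, and a browser that does not understand the values would simply ignore them.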

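The versioning requirement above mentions hashes as one option. As a rough illustration only, a translator could record a fingerprint of the original text at translation time, and a program could later compare it with the current text. The choice of SHA-256, the whitespace normalization, and the function names are assumptions of this sketch; how such a value would be carried in the link metadata is not specified here.

```python
import hashlib


def content_fingerprint(text):
    """SHA-256 hex digest of the text with runs of whitespace collapsed,
    so that reflowing the markup does not change the fingerprint."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def translation_is_current(recorded_fingerprint, current_original_text):
    """True if the original still matches what the translator worked from."""
    return recorded_fingerprint == content_fingerprint(current_original_text)


# Fingerprint recorded when the translation was made (hypothetical text).
recorded = content_fingerprint("Hello  world\n")

print(translation_is_current(recorded, "Hello world"))
print(translation_is_current(recorded, "Hello, brave new world"))
```

A timestamp would be simpler to read by hand, while a hash lets a program detect a changed original without trusting either document's clock.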

UPDATE 3/14: A draft Profile can be found here.

The examples in this post in French and its English translation have been updated.
