Saturday, February 19, 2005

Annotating XHTML links to specify translation relationships between linked documents.

Background and Motivation

A fundamental property of the World Wide Web is its ability to represent links between documents. Historically, however, both the capability and the practice of specifying machine-readable relationship metadata between linked documents on the web have been limited. The XHTML standard does provide the rel attribute on links, and recently there has been some activity to formalize relationship metadata about linked documents and/or the entities they represent using the “Semantic XHTML” approach, e.g. XFN.

This document introduces the concept of XHTML annotations which indicate to users and software agents that the linked documents represent parallel versions of the same text in different languages – that is, that they represent translations of each other.

The ability to specify that linked documents represent translations of each other has at least two substantial applications of interest to the authors:
  1. The facilitation of multilingual blogging, as a bridge between cultures and peoples.

  2. The facilitation of crawlers/bots which can “harvest” parallel texts in different languages, to be used as a corpus for training human-language translation programs based on statistical computing / machine learning techniques. The widespread adoption of quality machine translation will, we hope, further application 1. In addition, an extensive database of parallel texts can serve as a “translation memory” to further the study of human languages themselves.

The facilitation of crawlers/bots which can harvest parallel texts from the web is of particular interest to the authors. We refer to these as “Rosettabots” after the most famous parallel text, the Rosetta Stone. The facilitation of Rosettabots motivates many of the requirements on the link metadata we propose.

Vision and Goals

We have an ultimate vision which is unapologetically grand: nothing less than closing the distance between peoples and cultures of the world by making translations on the “social web” easy, ubiquitous and discoverable. The mechanisms and formats we propose, however, eschew grandiosity: we advocate an incremental, interoperable, and simple approach. The metadata format annotating translated document links which we propose here is a small, pragmatic first step toward the ultimate vision of interlingual blogging.

The initial metadata format should be sufficient to convey a bare minimum of information about the relationship between the documents and the translator(s). The metadata format should be extensible, to address issues including versioning, authentication and rights management in the future.

The link annotations indicating a translation relationship should be amenable to hand editing in existing tools. In the intermediate term, tools analogous to XFN Creator will enable authors and translators to easily add the required metadata. Widespread adoption of the metadata format will likely depend on support within popular blogging tools (Blogger, Movable Type, WordPress…) and syndication mechanisms (RSS, Atom).

Requirements for a metadata format specifying the translation relationship between linked documents:

  • The format should enable programs (crawlers, user agents, syndication services) to easily extract metadata about linked, translated documents, facilitating the applications listed above.

  • The proposed translation metadata must be usable within existing blogging and wiki tools, and must not confuse existing browsers. In particular, the format of the proposed translation metadata must not prevent a document containing the metadata from validating against the XHTML 1.0 Strict DTD.

  • A pair of linked documents consists of the original document and its translation; the annotated link may be present in either document, or in both. Therefore, both a translates and a translated-by type of annotation must be provided. The translates annotation is embedded in the translation and represents an assertion by the translator alone, so it arguably has the least credibility. The translated-by annotation is embedded in the original document and therefore confers the approval of the original author. Bidirectional annotated links confer the most credibility.

  • There must be a means to specify the identity and type of entity which performed the translation between documents. Note that this entity need not be the nominal “author” of either of the documents. The entity could be a human individual, an institution, a machine translation program, or some combination. The format for specifying the translating entity must therefore be flexible and extensible.

  • The proposed metadata format should provide the ability to delimit ranges within the linked documents which correspond to translated segments. The format should not preclude multiple such annotated and linked segments within a single XHTML document.

  • An original document can be modified subsequent to a translation being published, rendering the translation invalid. Therefore a translation of a document must be capable of specifying the version of the document being translated. This could be accomplished in different ways, e.g. timestamps or hashes.

  • The proposed metadata format should make use of existing internet standards for the relevant components, such as identity (URN), language (ISO 639), timestamps, etc., thus minimizing the unique functionality required of programs using the metadata.

  • The proposed metadata format should co-exist with existing and future Semantic XHTML proposals, e.g. by limiting “namespace pollution”.
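
As a rough illustration of the first requirement, here is a minimal sketch, in Python, of how a Rosettabot might pull annotated links out of a page. It assumes the annotations are carried as rel tokens named translates and translated-by (the names used in the requirements above, not a finalized vocabulary) and that the language of the linked document is given by the standard hreflang attribute. The class and function names and the example.org URL are purely illustrative.

```python
from html.parser import HTMLParser

# Assumed rel tokens, borrowed from the requirement names above.
# The final vocabulary is not fixed by this post.
TRANSLATION_RELS = {"translates", "translated-by"}


class TranslationLinkParser(HTMLParser):
    """Collect <a> elements whose rel attribute carries a translation token."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel is a space-separated list, so other tokens (e.g. XFN values)
        # can sit alongside the translation tokens on the same link.
        tokens = set((attributes.get("rel") or "").lower().split())
        for rel in sorted(tokens & TRANSLATION_RELS):
            self.links.append(
                {
                    "rel": rel,
                    "href": attributes.get("href"),
                    "hreflang": attributes.get("hreflang"),
                }
            )


def extract_translation_links(xhtml):
    """Return one dict per translation annotation found in the markup."""
    parser = TranslationLinkParser()
    parser.feed(xhtml)
    return parser.links


if __name__ == "__main__":
    # Hypothetical markup: a translation pointing back at its French original.
    sample = (
        '<p><a rel="translates" hreflang="fr" '
        'href="http://example.org/original">original</a></p>'
    )
    for link in extract_translation_links(sample):
        print(link["rel"], link["hreflang"], link["href"])
```

Because the tokens live in the existing rel attribute, a page annotated this way adds no new elements or attributes, which is in keeping with the XHTML 1.0 Strict requirement above. Translator identity, segment ranges and versioning are left out of this sketch.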


UPDATE 3/14: A draft Profile can be found here.

The examples in this post in French and its English translation have been updated.

Wednesday, February 02, 2005

Phonak gets reprieve - translation

It would seem that "innocent until proven guilty" holds with the UCI: Phonak will contest the ProTour next year.
At the end of November, the team had been barred from the ProTour after the revelation of the doping cases concerning the Swiss Oscar Camenzind, the American Tyler Hamilton and the Spaniard Santiago Perez, three of its major riders who are no longer part of the group in 2005.

The three arbitrators of the CAS, referring to the date of November 12, 2004, the deadline for the examination of license requests by the UCI, stated that "it was not possible, at this stage, to exclude the Phonak team from the ProTour only on the basis of suspicions of doping concerning these two racers (Hamilton and Perez) and even before knowing the result of the disciplinary proceedings with regard to them".

Last year, the Swiss team had made doping headlines. Camenzind, positive for EPO, had been dismissed at once, but Hamilton, Olympic time trial champion in Athens and the first athlete to be declared positive for blood transfusion, and then Perez (also transfusion), had been supported by their organization, which produced a report raising doubts about the method of detection.

[Translated (badly) from the original French here, by yours truly with the aid of Google]

Hamilton quit the team so Phonak would have a better shot. Looks like it paid off.