Bourbon With Attitude: 2005

Monday, March 14, 2005

DRAFT Translation Link Metadata Tutorial

What’s this about?

Do you blog in more than one language? Do you sometimes post translations of your own or other people’s posts (or excerpts of posts)? Would you like to make your translations easier to discover on the web, and maybe help to train machine translators of the future? If so, read on – “decorating” your blog posts with a few tags and attributes can make it happen.

For some background on translation metadata and why it’s cool, see this post. [Todo – this section needs some expanding]. For a more formal, “Reference Manual” type specification for these annotations, see here.

This note is intended for folks who are annotating their blog posts “by hand”, as well as those who are writing tools to help automate the process. It describes how to annotate some simple links on your blog post which indicate that it’s a translation (or “original”), what part of the post is translated, and who’s doing the translating. These are links you’d probably include anyway, so let’s start with a simple example. [Both of the example posts here and here are from my own blog for now; let’s pretend they’re from different blogs. 8) ]

Basics of Linking to Translations.

Let’s say you have a post which translates an excerpt of some bike racing news from a French source. Somewhere, you’d probably include a link to the original – say for instance like this:

[Translated (badly) from the original French here.]

Let’s look at the markup for this section:

[Translated (badly) from the original French <a href="http://lewy14.blogspot.com/2005/01/phonak-gets-reprieve.html"> here</a>.]

The first thing to do is add the ISO-639 language code with an hreflang attribute to indicate the language of the original (French, in this case):

[Translated (badly) from the original French <a href="http://lewy14.blogspot.com/2005/01/phonak-gets-reprieve.html" hreflang="fr">here</a>.]

Which Document is the Original?

Now, add an indication that the current document is a translation of the original linked document: we do this by using the rel attribute, “which specifies the relationship from the current document” (see here) with a “space separated list of link types”. For this we’ll define a link type called original, and use it as a value for the rel attribute.

[Translated (badly) from the original French <a href="http://lewy14.blogspot.com/2005/01/phonak-gets-reprieve.html" hreflang="fr" rel="original">here</a>.]
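If you’re writing a tool to automate this step, here is a minimal sketch of a helper that produces the annotated link. The function name translation_link and its parameters are made up for illustration; only the href, hreflang and rel attributes come from the markup above.

```python
from html import escape


def translation_link(href, hreflang, link_type, text):
    """Build an <a> element annotated with hreflang and a translation link type.

    link_type must be "original" or "translation", the two link types
    described in this tutorial. The function itself is a hypothetical helper.
    """
    if link_type not in ("original", "translation"):
        raise ValueError("link_type must be 'original' or 'translation'")
    return '<a href="{}" hreflang="{}" rel="{}">{}</a>'.format(
        escape(href, quote=True),      # escape &, <, > and quotes in the URL
        escape(hreflang, quote=True),
        link_type,
        escape(text),
    )


link = translation_link(
    "http://lewy14.blogspot.com/2005/01/phonak-gets-reprieve.html",
    "fr",
    "original",
    "here",
)
print(link)
```

The check on link_type is a design choice of this sketch, so that a typo in the link type fails loudly instead of producing a link that no ‘bot will recognize.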

Note, there’s a complementary link-type called translation which annotates links to translations in the language indicated by the hreflang attribute: the implication is that the “current document” is the “original”. The translation link type can be thought of as conveying more “authority”, in that the translation is explicitly endorsed by the author of the original. Finally, note that documents can link to each other reciprocally with original and translation link types (this is the case with the two examples I’ve posted here and here).

Where within the Document is the translation Excerpt(s)?

To make things even easier for automatic translation harvesting (“rosettabots”), consider wrapping the section of your post which consists of translated text with a div element, and give that element a unique id attribute. The id attribute value acts as a fragment identifier, allowing the “rosettabot” to easily identify the translated text. Let’s say the translated text is wrapped in a div element with an id of “rb-1”. We’d add the following to the rel attribute:

[Translated (badly) from the original French <a href="http://lewy14.blogspot.com/2005/01/phonak-gets-reprieve.html" hreflang="fr" rel="original xlt-id:#rb-1">here</a>.]

This indicates that the translation is contained in the element indicated by the fragment identifier “rb-1”. (There’s an org-id: link type prefix as well, indicating the fragment identifier (if any) for the original. Plus there are a couple of other ways of specifying excerpts within posts; we’ll cover those in another tutorial, but for now refer to the spec and the examples.)
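To see why the fragment identifier helps, here is a rough sketch of what a “rosettabot” might do with it: read the xlt-id: token from the rel value, then collect the text inside the div carrying that id. The class FragmentText and the PAGE sample are invented for this illustration, and the parser is Python’s standard html.parser rather than anything the spec prescribes.

```python
from html.parser import HTMLParser


class FragmentText(HTMLParser):
    """Collect the text inside a div whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # div nesting depth inside the target; 0 means outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1             # a nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1              # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

    def text(self):
        # Collapse runs of whitespace so line breaks in the markup don't matter.
        return " ".join("".join(self.chunks).split())


# Hypothetical page content; only the id "rb-1" comes from the tutorial.
PAGE = """
<div class="post">
  <p>Some commentary.</p>
  <div id="rb-1"><p>At the end of November, the team had been barred
  from the ProTour.</p></div>
  <p>More commentary.</p>
</div>
"""

rel = "original xlt-id:#rb-1"
fragment = next(
    token.split(":", 1)[1].lstrip("#")
    for token in rel.split()
    if token.startswith("xlt-id:")
)

parser = FragmentText(fragment)
parser.feed(PAGE)
print(parser.text())
```

Without the id, the ‘bot would have to guess which part of the post is commentary and which part is translation.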

And the Translator is…

Now, an interested reader (or ‘bot) might want to know – who’s doing the translating? In the case of this example, a (passable) translation was constructed by a person (me) with lousy French skills, from a risibly bad machine translation (Google – hey, no offense Google, but most current, free, public machine translation services are pretty pathetic). So how can we capture this? Simple. Once again, we take it from the top, this time adding links to both the human and machine translator. And I’ll cut right to the chase this time, since you know the drill: we’ll annotate each link with its own special “link type”, a rel attribute value indicating the translator is either a human or machine.

[Translated (badly) from the original French <a href="http://lewy14.blogspot.com/2005/01/phonak-gets-reprieve.html" hreflang="fr" rel="original xlt-id:#rb-1">here</a>, by <a href="mailto:lewykatorz@yahoo.com" rel="human">yours truly</a> with the aid of <a href="http://www.google.com/language_tools?hl=en" rel="machine">Google</a>]

Wrapping it up, literally.

Almost done. One last step: bundle up all the links mentioned above into their own div section, and give that section a class attribute with the value rosettabot. Why do we do this? A few reasons:

It creates a relationship between the links, which is important when there are multiple such sets of links in a single document (the front page of a blog, for instance, or a document with many translation excerpts from different sources).
It separates the link text and markup from the main body of the post, which can make it easier for “rosettabots” to separate the “data” from the “metadata”.
It serves as a “namespace”, to limit the scope of the “link types” (values of the rel attribute) that we defined above.

Putting it all together – here’s all the annotated links, grouped together within a div element:

<div class="rosettabot">[Translated (badly) from the original French <a href="http://lewy14.blogspot.com/2005/01/phonak-gets-reprieve.html" hreflang="fr" rel="original xlt-id:#rb-1">here</a>, by <a href="mailto:lewykatorz@yahoo.com" rel="human">yours truly</a> with the aid of <a href="http://www.google.com/language_tools?hl=en" rel="machine">Google</a>]</div>
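For the tool writers: here is a sketch of how a harvester might read such a block back out, grouping the links by their enclosing rosettabot div. The class name RosettabotBlocks and the shape of the result are my own assumptions for this illustration. Note that it matches rosettabot as a whole class token, which is a little stricter than a substring test.

```python
from html.parser import HTMLParser


class RosettabotBlocks(HTMLParser):
    """Group translation links and translator links per rosettabot div."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.depth = 0   # div nesting depth inside a rosettabot block

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "div":
            if self.depth:
                self.depth += 1
            elif "rosettabot" in (attr.get("class") or "").split():
                self.depth = 1
                self.blocks.append({"links": [], "translators": []})
        elif tag == "a" and self.depth:
            block = self.blocks[-1]
            tokens = (attr.get("rel") or "").split()
            href = attr.get("href")
            if "original" in tokens or "translation" in tokens:
                block["links"].append(
                    {"href": href, "hreflang": attr.get("hreflang"), "rel": tokens}
                )
            for kind in ("human", "machine"):
                if kind in tokens:
                    block["translators"].append((kind, href))

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


MARKUP = (
    '<div class="rosettabot">[Translated (badly) from the original French '
    '<a href="http://lewy14.blogspot.com/2005/01/phonak-gets-reprieve.html" '
    'hreflang="fr" rel="original xlt-id:#rb-1">here</a>, by '
    '<a href="mailto:lewykatorz@yahoo.com" rel="human">yours truly</a> '
    'with the aid of '
    '<a href="http://www.google.com/language_tools?hl=en" rel="machine">Google</a>]</div>'
)

harvester = RosettabotBlocks()
harvester.feed(MARKUP)
for block in harvester.blocks:
    print(block["links"])
    print(block["translators"])
```

Because each div starts a new entry in blocks, a front page with several translated posts yields one entry per post, which is the grouping the first reason above is after.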

So there you go – not hard at all. There are some more techniques for delimiting excerpts as I mentioned above, but this should be enough to get you started. Any feedback on this tutorial, the spec, the ideas behind them, or my bad French, leave a comment below or email me – thanks!

Wednesday, March 09, 2005

DRAFT Translation Link Metadata Profile

[UPDATE 3/14] Minor edits and a few "issues" added.
[UPDATE 3/15] changed rel values to original and translation

The following is a brief but reasonably formal specification of a proposed profile for translation link annotation metadata. This profile specifies the elements and attributes proposed to annotate links to (human language) translations of HTML documents (and excerpts of such documents) with metadata. This metadata can specify the language of the translation, whether the translation is endorsed by the author of the original, and the identity of the translator. It can also specify (or give hints about) the range of text translated. The metadata profile can be extended to satisfy other requirements. An informal requirements document for translation link metadata can be found here.

This brief specification assumes the reader is familiar with the XHTML Meta Data Profile from the Global Multimedia Protocols Group. XMDP defines a format to define “properties” and the values that those properties can take on. This profile format extends the notion of “property” to cover XPath expressions. Nested nodeset specifications can be viewed as location steps, allowing a complete XPath expression to be "read off" from the Profile. Using these techniques, richer Semantic XHTML constructs can be specified more precisely, with an appropriate level of profile complexity.

The following is a DRAFT specification for the purpose of discussion and review. In particular, the XPath expressions need to be tested.

/descendant::div[contains(attribute::class, "rosettabot")]

Required. The links to translated text and the link to the translating entity are contained within a div element, which is marked with a class attribute which contains "rosettabot". The translation link metadata is contained within a div element in order to group together the link to the translated document (or excerpt) and the link(s) identifying the translating entity(ies). This is especially convenient when there are several translation link blocks within the same document. The grouping also serves to separate these links from the text being translated. Finally, the grouping acts as a kind of informal “namespace” (not to be confused with formal namespaces in XML), delimiting the scope of certain tokens used within rel attributes on the links.
[Issue: should we also allow span elements to serve as containers?]
[Issue: earlier examples contained the "urn:" prefix on "rosettabot". Should Semantic XHTML profiles which define class attribute values ("class names") employ a urn naming scheme to prevent collisions?]

descendant::a[contains(attribute::rel, "original") or contains(attribute::rel, "translation")]

Required. This link is the link to the translated text, and will always have a rel attribute which contains either "translation" or "original"

attribute::href

This is the link to a document which represents a "parallel text" of the containing document, expressed in another language.

attribute::hreflang

The ISO-639 language code for the language of the linked document (e.g. "fr" for French).

attribute::rel

The value of the rel attribute is a whitespace delimited list of tokens. The meaning of the presence of these tokens is specified below. Other tokens from other profiles (e.g. XFN) may be present.

contains(., "original"): Indicates that the linked document is the original document, and that the containing document is the translation. Either "original" or "translation" is required to be present in the rel attribute value. [Issue: the “translated” document is presumed to have a language code somewhere – do we want to rely on this, or for convenience require it be specified as well?]
contains(., "translation"): Indicates that the linked document is a translation of the containing document, which is the original. This confers somewhat greater authority in that it represents an endorsement by the author of the original document. When a back link in the translated document is present, even more authority is conferred. Either "original" or "translation" is required to be present in the rel attribute value. [Issue: the “original” document is presumed to have a language code somewhere – do we want to rely on this, or for convenience require it be specified as well?]
contains(., "org-id:"): Optional. A fragment identifier, ideally one that targets a container (e.g. a div element) holding the original text. The characters after the ':' and before the next space character specify the fragment identifier (if any) which contains or anchors the original text excerpt. Note that the "original" text is in the linked document if original is present in the rel attribute value, and in the containing document if translation is present.
contains(., "xlt-id:"): Optional. Like "org-id:", but indicating the fragment identifier of the translated text excerpt in the translated document.
contains(., "org-xp:"): Optional. Like "org-id", but followed by an XPath spec which points to the original text excerpt in the original document.
contains(., "xlt-xp:"): Optional. Like "xlt-id", but followed by an XPath spec which points to the translated text excerpt in the translated document.
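As an informal illustration of how these tokens might be read, here is a sketch that splits a rel value on whitespace and sorts the tokens into the link direction, the excerpt pointers, and everything else. The function name parse_rel and the result layout are assumptions of this example, not part of the profile.

```python
POINTER_PREFIXES = ("org-id", "xlt-id", "org-xp", "xlt-xp")


def parse_rel(rel_value):
    """Sort the whitespace-delimited tokens of a rel attribute value.

    Returns a dict with the link direction ("original" or "translation"),
    the excerpt pointers keyed by prefix, and any tokens from other profiles.
    """
    result = {"direction": None, "pointers": {}, "other": []}
    for token in rel_value.split():
        prefix, sep, rest = token.partition(":")
        if token in ("original", "translation"):
            result["direction"] = token
        elif sep and prefix in POINTER_PREFIXES:
            result["pointers"][prefix] = rest
        else:
            result["other"].append(token)   # e.g. an XFN value
    return result


parsed = parse_rel("original xlt-id:#rb-1 friend")
print(parsed)

# With "original", the linked document holds the original text;
# with "translation", the containing document does.
original_is_linked = parsed["direction"] == "original"
```

One caveat this sketch makes visible: because tokens are split on whitespace, an XPath spec after org-xp: or xlt-xp: cannot contain spaces unless some escaping convention is added, which this draft does not define.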

descendant::a[contains(attribute::rel, "human")]

Optional. This is a link identifying a human translator, or an entity (corporation, non-profit, etc) which did the translation.

attribute::href: Web page, email address, etc.

descendant::a[contains(attribute::rel, "machine")]

Optional. This is a link identifying a machine translation program. Note that both a human translator and a machine translator may be specified, in which case the translation should be considered to be a machine translation which was "cleaned up" by the human entity. (Note: while technically optional, the translating entity link(s) really should be present.)

attribute::href: Ideally, specifies a web page where the machine translation program can be accessed.

child::comment()[starts-with(., "rosettabot")]

Optional. Translation hints may be present to aid rosettabots which are attempting to strip extraneous text and markup and identify aligned parallel texts. Alignment hints are present as HTML comments. (Note: Processing Instructions were examined for this role and rejected as some current blogging tools permit HTML comments to be embedded in a friendlier manner.) Alignment hints may be present with or without fragment identifiers for the original and translated text (see above). Note that a rosettabot may succeed in identifying parallel texts without any hints or fragment identifiers at all, but will require substantially more sophistication.

contains(., "org-hint-begin:"): Indicates an alignment hint, which specifies the first few words of the original text excerpt as a double quote delimited string following the ':'.
contains(., "org-hint-end:"): Indicates an alignment hint, which specifies the last few words of the original text excerpt as a double quote delimited string following the ':'.
contains(., "xlt-hint-begin:"): Like "org-hint-begin", except for the translated text.
contains(., "xlt-hint-end:"): Like "org-hint-end", except for the translated text.
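To make the hint mechanism concrete, here is a sketch of a collector that reads hint comments and uses them to cut an excerpt out of a text. The exact comment layout in SAMPLE (several hints in one comment, separated by spaces) is an assumption, as are the names HintCollector and excerpt. The sketch also does not check that the comment is a child of a rosettabot div, and it tolerates leading whitespace in the comment, which a strict starts-with test would not.

```python
import re
from html.parser import HTMLParser

# Matches e.g. org-hint-begin:"first few words"
HINT = re.compile(r'((?:org|xlt)-hint-(?:begin|end)):\s*"([^"]*)"')


class HintCollector(HTMLParser):
    """Gather alignment hints from comments that start with "rosettabot"."""

    def __init__(self):
        super().__init__()
        self.hints = {}

    def handle_comment(self, data):
        if data.strip().startswith("rosettabot"):
            self.hints.update(HINT.findall(data))


def excerpt(text, begin, end):
    """Return the span of text from begin through end, or None if not found."""
    start = text.find(begin)
    stop = text.find(end, start)
    if start < 0 or stop < 0:
        return None
    return text[start:stop + len(end)]


# Hypothetical markup and text, for illustration only.
SAMPLE = (
    '<div class="rosettabot">'
    '<!--rosettabot org-hint-begin:"Fin novembre" org-hint-end:"de détection" -->'
    '<!-- an unrelated comment -->'
    '</div>'
)
TEXT = ("Yahoo says: Fin novembre, l'équipe avait été évincée. "
        "Doutes sur la méthode de détection. The end.")

collector = HintCollector()
collector.feed(SAMPLE)
print(collector.hints)
print(excerpt(TEXT, collector.hints["org-hint-begin"], collector.hints["org-hint-end"]))
```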

For a couple quick examples, view the HTML source of this original document and this translation. For a brief tutorial outlining how to annotate your links according to this profile, see here.
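Since the XPath expressions in this profile still need testing, here is one way to start poking at them without extra libraries. Python’s standard xml.etree.ElementTree only supports a subset of XPath (it has no contains() function), so this sketch emulates the two nested steps, the rosettabot div and the link inside it, in plain Python. The sample document and its example.org URLs are invented. It also compares the substring test that contains() performs against a whole-token test.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed sample; the second div is a deliberate near miss.
XHTML = """<html><body>
<div class="rosettabot">[Translated from the original French
<a href="http://example.org/original" hreflang="fr" rel="original xlt-id:#rb-1">here</a>]</div>
<div class="rosettabot-archive"><a href="http://example.org/other" rel="unoriginal">x</a></div>
</body></html>"""


def xpath_contains(element, attribute, needle):
    """Mimic XPath contains(attribute::NAME, "needle"): a plain substring test."""
    return needle in element.get(attribute, "")


def has_token(element, attribute, token):
    """Stricter alternative: match a whole whitespace-delimited token."""
    return token in element.get(attribute, "").split()


root = ET.fromstring(XHTML)


def select(test):
    """Apply the container step, then the link step, using the given test."""
    hits = []
    for div in root.iter("div"):
        if test(div, "class", "rosettabot"):
            for a in div.iter("a"):
                if test(a, "rel", "original") or test(a, "rel", "translation"):
                    hits.append(a.get("href"))
    return hits


substring_hits = select(xpath_contains)
token_hits = select(has_token)
print(substring_hits)
print(token_hits)
```

The substring version also picks up the near miss, since "rosettabot-archive" contains "rosettabot" and "unoriginal" contains "original". Whether that matters in practice is a question for review of the draft.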

Saturday, February 19, 2005

Annotating XHTML links to specify translation relationships between linked documents.

Background and Motivation

A fundamental property of the World Wide Web is its ability to represent links between documents. The capability and practice of specifying machine readable relationship metadata between linked documents on the web has historically been limited. The XHTML standard does however provide the rel attribute on links, and recently there has been some activity attempting to formalize relationship metadata about linked documents and / or the entities they represent using the “Semantic XHTML” approach, e.g. XFN.

This document introduces the concept of XHTML annotations which indicate to users and software agents that the linked documents represent parallel versions of the same text in different languages – that is, that they represent translations of each other.

The ability to specify that linked documents represent translations of each other has at least two substantial applications of interest to the authors:

1) The facilitation of multilingual blogging, as a bridge between cultures and peoples.

2) The facilitation of crawlers/bots which can “harvest” parallel texts in different languages to be used as a corpus for training human language translation programs based on statistical computing / machine learning techniques. The widespread adoption of quality machine translation will hopefully further application 1). Further, an extensive database of parallel texts can serve as a "translation memory" to further the study of human languages themselves.

The facilitation of crawlers/bots which can harvest parallel texts from the web is of particular interest to the authors. We refer to these as “Rosettabots” after the most famous parallel text, the Rosetta stone. The facilitation of Rosettabots motivates many of the requirements on the link metadata we propose.

Vision and Goals

We have an ultimate vision which is unapologetically grand: nothing less than closing the distance between peoples and cultures of the world by making translations on the “social web” easy, ubiquitous and discoverable. The mechanisms and formats we propose, however, eschew grandiosity: we advocate an incremental, interoperable, and simple approach. The metadata format annotating translated document links which we propose here is a small, pragmatic first step toward the ultimate vision of interlingual blogging.

The initial metadata format should be sufficient to convey a bare minimum of information about the relationship between the documents and the translator(s). The metadata format should be extensible, to address issues including versioning, authentication and rights management in the future.

The link annotations indicating a translation relationship should be amenable to hand editing in existing tools. In the intermediate term, tools analogous to XFN Creator will enable authors and translators to easily add the required metadata. Widespread adoption of the metadata format will likely depend on support within popular blogging tools (Blogger, Movable Type, WordPress…) and syndication mechanisms (RSS, Atom).

Requirements for a metadata format specifying the translation relationship between linked documents:

The format should enable programs (crawlers, user agents, syndication services) to easily extract metadata about linked, translated documents, facilitating the applications listed above.

The proposed translation metadata must be usable within existing blogging and wiki tools, and must not confuse existing browsers. In particular, the format of the proposed translation metadata must not preclude a document containing the metadata from being validated by the XHTML 1.0 Strict DTD.

A pair of linked documents consists of the original document and its translation; the annotated link may be present in either document, or both. Therefore, both a translates and translated-by type of annotation must be provided. The translates annotation represents an assertion of the translator, and therefore arguably has the least credibility. The translated-by annotation is embedded in the original document and therefore confers the approval of the original author. Bidirectional annotated links will confer the most credibility.

There must be a means to specify the identity and type of entity which performed the translation between documents. Note that this entity need not be the nominal “author” of either of the documents. The entity could be a human individual, an institution, a machine translation program, or some combination. The format for specifying the translating entity must therefore be flexible and extensible.

The proposed metadata format should provide the ability to delimit ranges within the linked documents which correspond to translated segments. The format should not preclude multiple such annotated and linked segments within a single XHTML document.

An original document can be modified subsequent to a translation being published, rendering the translation invalid. Therefore a translation of a document must be capable of specifying the version of the document being translated. This could be accomplished in different ways, e.g. timestamps or hashes.

The proposed metadata format should avail itself of all available internet standards for the relevant components, such as identity (URN), language (ISO-639), timestamps, etc, thus minimizing the unique functionality required of programs using the metadata.

The proposed metadata format should co-exist with existing and future Semantic XHTML proposals by e.g. limiting “namespace pollution”.
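On the versioning requirement above, a hash is one of the options mentioned. Here is a minimal sketch of what that could look like, assuming a translator records a fingerprint of the original text at translation time and compares it later. The function content_fingerprint, the choice of SHA-256, and the whitespace normalization are all assumptions of this example; no version token is defined yet.

```python
import hashlib


def content_fingerprint(text):
    """Hash the text after collapsing whitespace, so reflowed markup still matches."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical before/after versions of an original passage.
original_v1 = "Fin novembre, l'équipe avait été évincée du ProTour."
original_v2 = "Fin novembre, l'équipe avait été exclue du ProTour."

recorded = content_fingerprint(original_v1)               # stored with the translation
is_stale = content_fingerprint(original_v2) != recorded   # checked on a later crawl
print(is_stale)
```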

UPDATE 3/14: A draft Profile can be found here.

The examples in this post in French and its English translation have been updated.

Wednesday, February 02, 2005

Phonak gets reprieve - translation

It would seem that "innocent until proven guilty" holds with the UCI: Phonak will contest the ProTour next year.

At the end of November, the team had been barred from the ProTour after the revelation of doping cases concerning the Swiss Oscar Camenzind, the American Tyler Hamilton and the Spaniard Santiago Perez, three of its major racers who are no longer part of the group in 2005.
For the three arbitrators of the CAS, who referred to the date of November 12, 2004, the deadline for the examination of license requests by the UCI, "it was not possible, at this stage, to exclude the Phonak team from the ProTour solely on the basis of suspicions of doping concerning these two racers (Hamilton and Perez) and even before knowing the result of the disciplinary proceedings with regard to them".
…
Last year, the Swiss team had made the doping buzz. Camenzind, positive for EPO, had been terminated at once but Hamilton, Olympic time trial champion in Athens and the first athlete to be declared positive for blood transfusion, then Perez (also transfusion), had been supported by their organization which gave a report raising doubts on the method of detection.
[Translated (badly) from the original French here, by yours truly with the aid of Google]

Hamilton quit the team so Phonak would have a better shot. Looks like it paid off.

Saturday, January 29, 2005

Phonak gets reprieve

It would seem that "innocent until proven guilty" holds with the UCI: Phonak will contest the ProTour next year. From the French Yahoo site:

Fin novembre, l'équipe avait été évincée du ProTour après la révélation de cas de dopage concernant le Suisse Oscar Camenzind, l'Américain Tyler Hamilton et l'Espagnol Santiago Perez, trois de ses coureurs majeurs qui ne font plus partie du groupe en 2005.
Pour les trois arbitres du TAS, qui ont fait référence à la date du 12 novembre 2004, la date-butoir pour l'examen des demandes de licences par l'UCI, "il n’était pas possible, à ce stade, d'écarter l'équipe Phonak du ProTour sur la seule base de soupçons de dopage concernant ces deux coureurs (Hamilton et Perez) et avant même de connaître le résultat des procédures disciplinaires les concernant".
…
L'an dernier, l'équipe suisse avait fait l'actualité en matière de dopage. Camenzind, positif à l'EPO, avait été licencié sur le champ mais Hamilton, champion olympique du contre-la-montre à Athènes et premier sportif à être déclaré positif pour transfusion sanguine, puis Perez (transfusion lui aussi), avaient été soutenus par leur formation qui faisait état de doutes supposés sur la méthode de détection.
[Translated (badly) to English here, by yours truly with the aid of Google]

Hamilton quit the team so Phonak would have a better shot. Looks like it paid off.