Monday, March 14, 2005

DRAFT Translation Link Metadata Tutorial

What’s this about?

Do you blog in more than one language? Do you sometimes post translations of your own or other people’s posts (or excerpts of posts)? Would you like to make your translations easier to discover on the web, and maybe help to train machine translators of the future? If so, read on – “decorating” your blog posts with a few tags and attributes can make it happen.

For some background on translation metadata and why it’s cool, see this post. [Todo – this section needs some expanding]. For a more formal, “Reference Manual” type specification for these annotations, see here.

This note is intended for folks who are annotating their blog posts “by hand”, as well as those who are writing tools to help automate the process. It describes how to annotate some simple links on your blog post which indicate that it’s a translation (or “original”), what part of the post is translated, and who’s doing the translating. These are links you’d probably include anyway, so let’s start with a simple example. [Both of the example posts here and here are from my own blog for now; let’s pretend they’re from different blogs. 8) ]

Basics of Linking to Translations.

Let’s say you have a post which translates an excerpt of some bike racing news from a French source. Somewhere, you’d probably include a link to the original post – say for instance like this:

[Translated (badly) from the original French here.]


Let’s look at the markup for this section:

<p><small>[Translated (badly) from the original French <a href="http://lewy14.blogspot.com/2005/01/phonak-gets-reprieve.html"> here</a>.]</small></p>

The first thing to do is add the ISO-639 language code with an hreflang attribute to indicate the language of the original (French, in this case):

<p><small>[Translated (badly) from the original French <a href="http://lewy14.blogspot.com/2005/01/phonak-gets-reprieve.html" hreflang="fr">here</a>.]</small></p>

Which Document is the Original?

Now, add an indication that the current document is a translation of the original linked document: we do this by using the rel attribute, “which specifies the relationship from the current document” (see here) with a “space separated list of link types”. For this we’ll define a link type called original, and use it as a value for the rel attribute.

<p><small>[Translated (badly) from the original French <a href="http://lewy14.blogspot.com/2005/01/phonak-gets-reprieve.html" hreflang="fr" rel="original">here</a>.]</small></p>
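To see what a tool might make of this markup, here is a minimal sketch using only Python's standard html.parser module. The names LinkCollector and translation_links are made up for illustration and are not part of the spec; the sketch simply collects each link's href, hreflang and rel tokens, and keeps the links whose rel value includes original or translation.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href, hreflang and rel tokens for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        self.links.append({
            "href": attr.get("href"),
            "hreflang": attr.get("hreflang"),
            # rel is a space separated list of link types
            "rel": (attr.get("rel") or "").split(),
        })


def translation_links(html):
    """Return only the links annotated as 'original' or 'translation'."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return [link for link in collector.links
            if "original" in link["rel"] or "translation" in link["rel"]]


snippet = (
    '<p><small>[Translated (badly) from the original French '
    '<a href="http://lewy14.blogspot.com/2005/01/phonak-gets-reprieve.html" '
    'hreflang="fr" rel="original">here</a>.]</small></p>'
)
links = translation_links(snippet)
```

Here links holds a single entry whose hreflang is "fr" and whose rel tokens are ["original"], which tells the tool that the linked page is the French original of the current post.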

Note, there’s a complementary link type called translation, which annotates links to translations in the language indicated by the hreflang attribute: the implication is that the “current document” is the “original”. The translation link type can be thought of as conveying more “authority”, in that the translation is explicitly endorsed by the author of the original. Finally, note that documents can link to each other reciprocally with the original and translation link types (this is the case with the two examples I’ve posted here and here).
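The idea of “authority” can be sketched in a few lines of Python. Everything here is hypothetical: the function names, the labels ("reciprocal", "endorsed", "claimed", "none") and the example.org URLs are invented for illustration, and the links are assumed to have already been extracted as (href, rel value) pairs.

```python
def has_link(links, target_url, link_type):
    """True if any (href, rel) pair points at target_url with link_type."""
    return any(href == target_url and link_type in rel.split()
               for href, rel in links)


def authority(original_url, original_links, translation_url, translation_links):
    """Classify how strongly two documents vouch for each other."""
    forward = has_link(original_links, translation_url, "translation")
    back = has_link(translation_links, original_url, "original")
    if forward and back:
        return "reciprocal"   # both documents link to each other
    if forward:
        return "endorsed"     # the original's author links to the translation
    if back:
        return "claimed"      # only the translator links back to the original
    return "none"


# Hypothetical URLs, for illustration only.
orig_url = "http://example.org/fr/post"
xlt_url = "http://example.org/en/post"
result = authority(
    orig_url, [(xlt_url, "translation")],
    xlt_url, [(orig_url, "original xlt-id:#rb-1")],
)
```

With links in both directions, as in this example, result is "reciprocal".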

Where within the Document is the Translated Excerpt?

To make things even easier for automatic translation harvesting (“rosettabots”), consider wrapping the section of your post which consists of translated text with a div element, and give that element a unique id attribute. The id attribute value acts as a fragment identifier, allowing the “rosettabot” to easily identify the translated text. Let’s say the translated text is wrapped in a div element with an id of “rb-1”. We’d add the following to the rel attribute:

<p><small>[Translated (badly) from the original French <a href="http://lewy14.blogspot.com/2005/01/phonak-gets-reprieve.html" hreflang="fr" rel="original xlt-id:#rb-1">here</a>.]</small></p>

This indicates that the translation is contained in the element indicated by the fragment identifier “rb-1”. (There’s an org-id: link type prefix as well, indicating the fragment identifier (if any) for the original. There are also a couple of other ways of specifying excerpts within posts; we’ll cover those in another tutorial, but for now refer to the spec and the examples.)
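As a rough sketch of how a “rosettabot” might use the fragment identifier, the Python below pulls the id out of the rel value and then collects the text inside the matching element. The class and function names are hypothetical, the sample post text is invented, and the parser assumes reasonably well-formed markup (every non-void element inside the wrapper is closed).

```python
from html.parser import HTMLParser


def fragment_from_rel(rel, prefix="xlt-id:"):
    """Return the fragment id carried by a token such as 'xlt-id:#rb-1'."""
    for token in rel.split():
        if token.startswith(prefix):
            return token[len(prefix):].lstrip("#")
    return None


class ExcerptExtractor(HTMLParser):
    """Collect the text inside the element whose id equals target_id."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # never closed

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_excerpt(html, fragment_id):
    parser = ExcerptExtractor(fragment_id)
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse runs of whitespace.
    return " ".join("".join(parser.chunks).split())


# Invented post body, for illustration only.
post = ('<div id="rb-1"><p>Phonak gets a <b>reprieve</b>.</p></div>'
        '<p>Other text.</p>')
fragment = fragment_from_rel("original xlt-id:#rb-1")
excerpt = extract_excerpt(post, fragment)
```

The text outside the wrapped div (“Other text.”) is ignored, which is exactly why wrapping the translated excerpt makes harvesting easier.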

And the Translator is…

Now, an interested reader (or ‘bot) might want to know – who’s doing the translating? In the case of this example, a (passable) translation was constructed by a person (me) with lousy French skills, from a risibly bad machine translation (Google – hey, no offense Google, but most current, free, public machine translation services are pretty pathetic). So how can we capture this? Simple. Once again, we take it from the top, this time adding links to both the human and machine translator. And I’ll cut right to the chase this time, since you know the drill: we’ll annotate each link with its own special “link type”, a rel attribute value indicating the translator is either a human or a machine.

<p><small>[Translated (badly) from the original French <a href="http://lewy14.blogspot.com/2005/01/phonak-gets-reprieve.html" hreflang="fr" rel="original xlt-id:#rb-1">here</a>, by <a href="mailto:lewykatorz@yahoo.com" rel="human">yours truly</a> with the aid of <a href="http://www.google.com/language_tools?hl=en" rel="machine">Google</a>.]</small></p>

Wrapping it up, literally.

Almost done. One last step: bundle up all the links mentioned above into their own div section, and give that section a class attribute with the value rosettabot. Why do we do this? A few reasons:
  • It creates a relationship between the links, which is important when there are multiple such sets of links in a single document (the front page of a blog, for instance, or a document with many translation excerpts from different sources).
  • It separates the link text and markup from the main body of the post, which can make it easier for “rosettabots” to separate the “data” from the “metadata”.
  • It serves as a “namespace”, to limit the scope of the “link types” (values of the rel attribute) that we defined above.

Putting it all together – here are all the annotated links, grouped together within a div element:

<div class="rosettabot"><p><small>[Translated (badly) from the original French <a href="http://lewy14.blogspot.com/2005/01/phonak-gets-reprieve.html" hreflang="fr" rel="original xlt-id:#rb-1">here</a>, by <a href="mailto:lewykatorz@yahoo.com" rel="human">yours truly</a> with the aid of <a href="http://www.google.com/language_tools?hl=en" rel="machine">Google</a>.]</small></p></div>
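For tool writers, here is a hedged sketch of how the grouping might be consumed, again with Python's standard html.parser. RosettabotBlocks and summarize are hypothetical names, and the sketch treats rosettabot as a whole class token, which is a slightly stricter reading than a plain substring match.

```python
from html.parser import HTMLParser


class RosettabotBlocks(HTMLParser):
    """Group the links found inside each <div class="rosettabot">."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # one list of links per rosettabot div
        self.div_depth = 0   # div nesting depth inside the current block

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "div":
            if self.div_depth:
                self.div_depth += 1
            elif "rosettabot" in (attr.get("class") or "").split():
                self.div_depth = 1
                self.blocks.append([])
        elif tag == "a" and self.div_depth:
            self.blocks[-1].append(
                (attr.get("href"), attr.get("hreflang"),
                 (attr.get("rel") or "").split()))

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1


def summarize(links):
    """Sort one block's links into the document link and the translators."""
    summary = {"direction": None, "href": None, "lang": None,
               "human": [], "machine": []}
    for href, lang, rel in links:
        if "original" in rel or "translation" in rel:
            summary["direction"] = "original" if "original" in rel else "translation"
            summary["href"], summary["lang"] = href, lang
        elif "human" in rel:
            summary["human"].append(href)
        elif "machine" in rel:
            summary["machine"].append(href)
    return summary


html = (
    '<div class="rosettabot"><p><small>[Translated (badly) from the original French '
    '<a href="http://lewy14.blogspot.com/2005/01/phonak-gets-reprieve.html" '
    'hreflang="fr" rel="original xlt-id:#rb-1">here</a>, by '
    '<a href="mailto:lewykatorz@yahoo.com" rel="human">yours truly</a> with the aid of '
    '<a href="http://www.google.com/language_tools?hl=en" rel="machine">Google</a>.]'
    '</small></p></div>'
    '<p>A stray <a href="http://example.org/" rel="human">link</a> outside the block.</p>'
)
parser = RosettabotBlocks()
parser.feed(html)
parser.close()
summary = summarize(parser.blocks[0])
```

Note that the stray link after the div (added here only to make the point) is not picked up, even though it carries rel="human": the container limits the scope of the link types, as described in the list above.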

So there you go – not hard at all. There are some more techniques for delimiting excerpts as I mentioned above, but this should be enough to get you started. If you have any feedback on this tutorial, the spec, the ideas behind them, or my bad French, leave a comment below or email me – thanks!

Wednesday, March 09, 2005

DRAFT Translation Link Metadata Profile

[UPDATE 3/14] Minor edits and a few "issues" added.
[UPDATE 3/15] changed rel values to original and translation

The following is a brief but reasonably formal specification of a proposed profile for translation link annotation metadata. This profile specifies the elements and attributes proposed to annotate links to (human language) translations of HTML documents (and excerpts of such documents) with metadata. This metadata can specify the language of the translation, whether the translation is endorsed by the author of the original, and the identity of the translator. It can also specify (or give hints about) the range of text translated. The metadata profile can be extended to satisfy other requirements. An informal requirements document for translation link metadata can be found here.

This brief specification assumes the reader is familiar with the XHTML Meta Data Profile from the Global Multimedia Protocols Group. XMDP defines a format to define “properties” and the values that those properties can take on. This profile format extends the notion of “property” to cover XPath expressions. Nested nodeset specifications can be viewed as location steps, allowing a complete XPath expression to be "read off" from the Profile. Using these techniques, richer Semantic XHTML constructs can be specified more precisely, with an appropriate level of profile complexity.

The following is a DRAFT specification for the purpose of discussion and review. In particular, the XPath expressions need to be tested.



Required. The links to translated text and the link to the translating entity are contained within a div element, which is marked with a class attribute which contains "rosettabot". The translation link metadata is contained within a div element in order to group together the link to the translated document (or excerpt) and the link(s) identifying the translating entity(ies). This is especially convenient when there are several translation link blocks within the same document. The grouping also serves to separate these links from the text being translated. Finally, the grouping acts as a kind of informal “namespace” (not to be confused with formal namespaces in XML), delimiting the scope of certain tokens used within rel attributes on the links.
[Issue: should we also allow span elements to serve as containers?]
[Issue: earlier examples contained the "urn:" prefix on "rosettabot". Should Semantic XHTML profiles which define class attribute values ("class names") employ a urn naming scheme to prevent collisions?]


Required. This is the link to the translated text, and will always have a rel attribute which contains either "translation" or "original".


This is the link to a document which represents a "parallel text" of the containing document, expressed in another language.


The ISO-639 language code (for example, "fr") for the language of the linked document.


The value of the rel attribute is a whitespace delimited list of tokens. The meaning of the presence of these tokens is specified below. Other tokens from other profiles (e.g. XFN) may be present.

contains(., "original")

Indicates that the linked document is the original document, and that the containing document is the translation. Either "original" or "translation" is required to be present in the rel attribute value. [Issue: the “translated” document is presumed to have a language code somewhere – do we want to rely on this, or for convenience require it be specified as well?]

contains(., "translation")

Indicates that the linked document is a translation of the containing document, which is the original. This confers somewhat greater authority in that it represents an endorsement by the author of the original document. When a back link in the translated document is present, even more authority is conferred. Either "original" or "translation" is required to be present in the rel attribute value. [Issue: the “original” document is presumed to have a language code somewhere – do we want to rely on this, or for convenience require it be specified as well?]

contains(., "org-id:")

Optional, but recommended where possible. A fragment identifier which ideally targets a container (e.g. a div element) holding the original text. The characters after the ':' and before the next space character specify the fragment identifier (if any) of the element which contains or anchors the original text excerpt. Note that the "original" text is in the linked document if original is present in the rel attribute value, and in the containing document if translation is present.

contains(., "xlt-id:")

Optional. Like "org-id:", but indicating the fragment identifier of the translated text excerpt in the translated document.

contains(., "org-xp:")

Optional. Like "org-id", but followed by an XPath spec which points to the original text excerpt in the original document.

contains(., "xlt-xp:")

Optional. Like "xlt-id", but followed by an XPath spec which points to the translated text excerpt in the translated document.
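As an informal illustration of the token rules above, here is a small Python sketch that splits a rel value into its direction, its excerpt locators and any leftover tokens from other profiles. The names RelInfo and parse_rel are hypothetical. Two assumptions are worth flagging: tokens are matched whole rather than with the substring test that contains() implies, and because tokens are whitespace delimited, an XPath value containing spaces would not survive this simple split.

```python
from dataclasses import dataclass, field
from typing import Optional

PREFIXES = ("org-id:", "xlt-id:", "org-xp:", "xlt-xp:")


@dataclass
class RelInfo:
    direction: Optional[str] = None               # "original" or "translation"
    locators: dict = field(default_factory=dict)  # e.g. {"xlt-id": "#rb-1"}
    other: list = field(default_factory=list)     # tokens from other profiles


def parse_rel(value):
    """Split a rel attribute value into profile tokens and everything else."""
    info = RelInfo()
    for token in value.split():
        if token in ("original", "translation"):
            info.direction = token
            continue
        for prefix in PREFIXES:
            if token.startswith(prefix):
                # Key is the prefix without its colon; value is the remainder.
                info.locators[prefix.rstrip(":")] = token[len(prefix):]
                break
        else:
            info.other.append(token)
    if info.direction is None:
        raise ValueError('rel must contain "original" or "translation"')
    return info


info = parse_rel("original xlt-id:#rb-1 friend")
```

In this example info.direction is "original", info.locators is {"xlt-id": "#rb-1"}, and the XFN token friend ends up in info.other.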






Optional. This is a link identifying a human translator, or an entity (corporation, non-profit, etc) which did the translation.


Web page, email address, etc.




Optional. This is a link identifying a machine translation program. Note that both a human translator and a machine translator may be specified, in which case the translation should be considered to be a machine translation which was "cleaned up" by the human entity. (Note: while technically optional, the translating entity link(s) really should be present.)


Ideally, specifies a web page where the machine translation program can be accessed.



child::comment()[starts-with(., "rosettabot")]

Optional. Translation hints may be present to aid rosettabots which are attempting to strip extraneous text and markup and identify aligned parallel texts. Alignment hints are present as HTML comments. (Note: Processing Instructions were examined for this role and rejected as some current blogging tools permit HTML comments to be embedded in a friendlier manner.) Alignment hints may be present with or without fragment identifiers for the original and translated text (see above). Note that a rosettabot may succeed in identifying parallel texts without any hints or fragment identifiers at all, but will require substantially more sophistication.

contains(., "org-hint-begin:")

Indicates an alignment hint, which specifies the first few words of the original text excerpt as a double quote delimited string following the ':'.

contains(., "org-hint-end:")

Indicates an alignment hint, which specifies the last few words of the original text excerpt as a double quote delimited string following the ':'.

contains(., "xlt-hint-begin:")

Like "org-hint-begin", except for the translated text.

contains(., "xlt-hint-end:")

Like "org-hint-end", except for the translated text.
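A rough Python sketch of reading these hints follows. The exact layout of a hint comment is not pinned down above, so the sample comment (several hints in one comment, separated by spaces) is an assumption, as are the hint text and the names HINT_RE and HintCollector.

```python
import re
from html.parser import HTMLParser

# A hint name, a colon, then a double quote delimited string.
HINT_RE = re.compile(r'\b((?:org|xlt)-hint-(?:begin|end)):\s*"([^"]*)"')


class HintCollector(HTMLParser):
    """Collect alignment hints from comments that start with 'rosettabot'."""

    def __init__(self):
        super().__init__()
        self.hints = {}

    def handle_comment(self, data):
        # Mirrors starts-with(., "rosettabot"): no leading whitespace allowed.
        if not data.startswith("rosettabot"):
            return
        for name, text in HINT_RE.findall(data):
            self.hints[name] = text


# Invented hint text, for illustration only.
page = ('<div class="rosettabot">'
        '<!--rosettabot xlt-hint-begin:"The Phonak team" '
        'xlt-hint-end:"next season."-->'
        '<!-- an ordinary comment with org-hint-begin:"ignored" -->'
        '</div>')
collector = HintCollector()
collector.feed(page)
collector.close()
hints = collector.hints
```

Only the first comment is read; the second one does not start with rosettabot, so its contents are ignored.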



For a couple of quick examples, view the HTML source of this original document and this translation. For a brief tutorial outlining how to annotate your links according to this profile, see here.