Internationalizing Permalinks
July 15, 2010
Creating permalinks for English sentences
Creating a permalink for a sentence written in English is simply a matter of removing any characters that are not Latin letters or digits and joining the remaining groups of characters with hyphens. In Java this can be achieved using regular expressions (the pattern \w+ matches runs of ASCII letters, digits and underscores):
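import java.util.regex.Matcher;
import java.util.regex.Pattern;

// \w+ matches each run of ASCII letters, digits and underscores
Pattern permalinkPattern = Pattern.compile("\\w+");
Matcher matcher = permalinkPattern.matcher("I like red porridge with cream");
StringBuilder sb = new StringBuilder();
boolean separate = false;
while (matcher.find()) {
    // Put a hyphen before every group except the first
    if (separate) {
        sb.append("-");
    }
    sb.append(matcher.group());
    separate = true;
}
String permalink = sb.toString();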
For instance, if the title of the webpage is “I like red porridge with cream”, then the permalink (excluding the domain name and a few other bits) is “I-like-red-porridge-with-cream”.
Internationalizing permalinks: the IRI and transliteration approaches
Now let’s see what happens when the webpage title is written in a national script, such as the Danish “Jeg kan lide rødgrød med fløde”. The code produces the permalink “Jeg-kan-lide-r-dgr-d-med-fl-de”. Hmm, it starts to look like Morse code, which is not very legible to the user.
We filter out anything but Latin characters because the path element of a URL only allows a subset of ASCII characters (which includes the unaccented Latin letters) plus percent-encoding, not raw UTF-8. While percent-encoding can represent national characters, it is not legible to the user and is therefore, by itself, not suitable for permalinks. RFC 3987 is a standard for multilingual URLs (called IRIs) whereby URLs encoded in UTF-8 on the client side are converted to percent-encoding before they are sent over HTTP to the server. Because both client-side and server-side software must support the standard, there are still issues with lack of support in legacy software.
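To make the conversion concrete, here is a minimal sketch of percent-encoding a path that contains national characters. The class and method names are my own illustration, and the sketch is a simplification of the mapping in RFC 3987: it encodes every UTF-8 byte that is not an unreserved ASCII character or a slash, so it also escapes reserved ASCII characters that a full implementation might leave alone. The Danish letter ø is written as the escape \u00f8 so the source compiles regardless of file encoding.

```java
import java.nio.charset.StandardCharsets;

class PercentEncodingSketch {

    /**
     * Percent-encodes each byte of the UTF-8 form of the path unless it is an
     * unreserved ASCII character (letter, digit, '-', '.', '_', '~') or '/'.
     * Simplified illustration, not a complete IRI-to-URI converter.
     */
    static String toPercentEncodedPath(String path) {
        StringBuilder sb = new StringBuilder();
        for (byte b : path.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF; // treat the byte as an unsigned value
            boolean unreserved = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || c == '-' || c == '.' || c == '_' || c == '~';
            if (unreserved || c == '/') {
                sb.append((char) c);
            } else {
                sb.append('%').append(String.format("%02X", c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00F8 is the Danish letter o with stroke; its UTF-8 bytes are C3 B8
        String path = "/Jeg-kan-lide-r\u00f8dgr\u00f8d-med-fl\u00f8de";
        System.out.println(toPercentEncodedPath(path));
        // prints /Jeg-kan-lide-r%C3%B8dgr%C3%B8d-med-fl%C3%B8de
    }
}
```

The output shows why percent-encoding alone makes a poor permalink: every ø turns into the six characters %C3%B8.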
Until IRIs become pervasive, another approach is to transliterate the webpage title. Transliteration is the practice of writing a character or a word in another alphabet. For instance, by convention the Danish character “ø” is transliterated to “oe” in the Latin alphabet. The sentence “Jeg kan lide rødgrød med fløde” is thus transliterated to “Jeg kan lide roedgroed med floede”. This is perfectly legible to a Dane. If we run the transliterated sentence through the permalink code, we get “Jeg-kan-lide-roedgroed-med-floede”, which is fine.
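The two steps can be sketched together as follows. This is only an illustration under my own assumptions: the class, the method names and the hard-coded lowercase Danish table are hypothetical and are not part of any library. The national characters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TransliteratedPermalinkSketch {

    // Hypothetical, lowercase-only table following the Danish conventions
    static final Map<String, String> DANISH = new LinkedHashMap<>();
    static {
        DANISH.put("\u00e6", "ae"); // U+00E6, ae ligature
        DANISH.put("\u00f8", "oe"); // U+00F8, o with stroke
        DANISH.put("\u00e5", "aa"); // U+00E5, a with ring above
    }

    /** Replaces every character found in the table; all others pass through. */
    static String transliterate(String source) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < source.length(); i++) {
            String ch = String.valueOf(source.charAt(i));
            sb.append(DANISH.getOrDefault(ch, ch));
        }
        return sb.toString();
    }

    /** Joins the runs of ASCII word characters with hyphens. */
    static String permalink(String title) {
        Matcher matcher = Pattern.compile("\\w+").matcher(title);
        StringBuilder sb = new StringBuilder();
        while (matcher.find()) {
            if (sb.length() > 0) {
                sb.append('-');
            }
            sb.append(matcher.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String title = "Jeg kan lide r\u00f8dgr\u00f8d med fl\u00f8de";
        System.out.println(permalink(transliterate(title)));
        // prints Jeg-kan-lide-roedgroed-med-floede
    }
}
```

Transliterating before applying the pattern matters: applied in the other order, the national characters would already have been dropped.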
A Java API for transliteration
I have created an open source Java API to transliterate national scripts into Latin, but I’m facing the challenge that some European alphabets share the same national characters yet have different conventions for transliterating them to Latin characters. Therefore I have to create a transliteration table for each language. So far the API supports Danish, Swedish, Norwegian and German.
To use the API, download the following two JAR files from Google Code and put them on your classpath:
scalemania-latin-transliteration-1.0.jar
scalemania-latin-transliteration-data-1.0.jar
To transliterate a String, you must first invoke the factory method to create a Transliterator for a language and then call the Transliterator’s method transliterate(String source).
String language = "da";
String title = "Jeg kan lide rødgrød med fløde";
Transliterator transliterator = TransliteratorFactory.newInstance().getTransliterator(language);
// The title is left unchanged if no Transliterator is returned
if (transliterator != null) {
    title = transliterator.transliterate(title);
}
How to determine the user’s language
In a web application, the language of the user can be determined from the HTTP request header called Accept-Language:
String acceptLanguage = request.getHeader("accept-language");
The value of the header may list several languages separated by commas:
Accept-Language: da,en;q=0.7,en-gb;q=0.3
Some entries have a quality rating that represents the user’s preference for the language. The default value of the quality “q” is 1.0. In the example above, the language “da” has the highest preference of 1.0.
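Picking the preferred language out of the header value could look roughly like the sketch below. The class and method names are hypothetical, and the parsing is deliberately simple: it ignores the wildcard “*”, keeps the first entry when two entries have the same quality, and returns only the primary language subtag (so “en-gb” becomes “en”), on the assumption that the language code alone is what selects a transliteration table.

```java
import java.util.Locale;

class AcceptLanguageSketch {

    /**
     * Returns the primary language subtag of the entry with the highest
     * quality value, or null if the header holds no usable entry.
     */
    static String preferredLanguage(String header) {
        if (header == null) {
            return null;
        }
        String best = null;
        double bestQ = -1.0;
        for (String entry : header.split(",")) {
            String[] parts = entry.trim().split(";");
            String tag = parts[0].trim();
            if (tag.isEmpty() || tag.equals("*")) {
                continue;
            }
            double q = 1.0; // the quality defaults to 1.0 when absent
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.startsWith("q=")) {
                    try {
                        q = Double.parseDouble(param.substring(2));
                    } catch (NumberFormatException e) {
                        q = 0.0; // treat a malformed quality as lowest preference
                    }
                }
            }
            if (q > bestQ) { // strict comparison keeps the first entry on ties
                bestQ = q;
                best = tag;
            }
        }
        if (best == null) {
            return null;
        }
        int dash = best.indexOf('-');
        String primary = dash < 0 ? best : best.substring(0, dash);
        return primary.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(preferredLanguage("da,en;q=0.7,en-gb;q=0.3"));
        // prints da
    }
}
```

The returned code could then be handed to the factory method shown earlier as the language argument.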
Adding support for languages to the API
Each language has its own XML file containing the mapping of native characters to Latin characters. For instance, the content of the Danish XML file looks like this:
<?xml version="1.0" encoding="utf-8"?>
<transliterator language="da">
    <char native="æ" latin="ae" />
    <char native="ø" latin="oe" />
    <char native="å" latin="aa" />
</transliterator>
To add support for another language, one simply has to provide such a mapping file, with no programming involved. All the language XML files are packaged in a separate JAR file to allow other implementations to use the same data.
Feedback and language contributions are welcome.
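For anyone who wants to use the data files from their own code, reading a mapping file of this shape takes only the JDK’s built-in DOM parser. The sketch below is an assumption about how such a file could be consumed, not a description of how the API itself loads it; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class MappingFileSketch {

    /**
     * Parses XML in the format shown above and returns the native-to-Latin
     * mapping in document order.
     */
    static Map<String, String> loadMapping(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> mapping = new LinkedHashMap<>();
            NodeList chars = doc.getElementsByTagName("char");
            for (int i = 0; i < chars.getLength(); i++) {
                Element element = (Element) chars.item(i);
                mapping.put(element.getAttribute("native"), element.getAttribute("latin"));
            }
            return mapping;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions
            throw new IllegalArgumentException("Could not parse mapping file", e);
        }
    }

    public static void main(String[] args) {
        // U+00F8 is the Danish letter o with stroke
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
                + "<transliterator language=\"da\">"
                + "<char native=\"\u00f8\" latin=\"oe\" />"
                + "</transliterator>";
        System.out.println(loadMapping(xml).get("\u00f8"));
        // prints oe
    }
}
```

A real caller would read the XML from the data JAR rather than from a string; the string input just keeps the sketch self-contained.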
March 30, 2011 at 08:28
Hello,
I find your post quite interesting. I would like to ask what the filename of the XML file containing the mapping for another language would be, let’s say Spanish.
March 30, 2011 at 20:13
The ISO 639 code for Spanish is “es”, so following the pattern of the existing files the name would be “generic-transliterator_es.xml”.
You can use one of the existing files as a template for your own file. Once you’re done with the XML file, you would have to register it in “generic-transliterator-index.xml” by adding a tag for the language “es”.
March 30, 2011 at 20:32
BTW, there’s also a wiki page that explains how to create your own mapping file:
http://code.google.com/p/scalemania-latin-transliteration/wiki/AddLanguage