Internationalizing Permalinks

July 15, 2010

Creating permalinks for English sentences

Creating a permalink for a sentence written in English is simply a matter of removing any non-Latin characters and joining the remaining groups of Latin characters with hyphens. In Java this can be achieved with a regular expression (note that \w also matches digits and the underscore):

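import java.util.regex.Matcher;
import java.util.regex.Pattern;

// \w+ matches a run of ASCII letters, digits or underscores
Pattern permalinkPattern = Pattern.compile("\\w+");

Matcher matcher = permalinkPattern.matcher("I like red porridge with cream");

StringBuilder sb = new StringBuilder();
boolean separate = false;

while (matcher.find()) {
	// put a hyphen in front of every group except the first
	if (separate) {
		sb.append("-");
	}
	sb.append(matcher.group());
	separate = true;
}

String permalink = sb.toString(); // "I-like-red-porridge-with-cream"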

For instance, if the title of the webpage is “I like red porridge with cream”, then the permalink (excluding the domain name and a few other bits) is “I-like-red-porridge-with-cream”.

Internationalizing permalinks: the IRI and transliteration approaches

Now let’s see what happens to a webpage title written in a national script, such as the Danish “Jeg kan lide rødgrød med fløde”. The code produces the permalink “Jeg-kan-lide-r-dgr-d-med-fl-de”, because every “ø” falls outside \w and is treated as a separator. Hmm, it starts to look like Morse code: not very legible to the user.
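To make the problem easy to reproduce, here is a minimal sketch that wraps the loop from the previous section in a method and runs it on both titles. The class name PermalinkDemo and the method name toPermalink are my own, chosen for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PermalinkDemo {

    // Without the UNICODE_CHARACTER_CLASS flag, \w matches only [a-zA-Z_0-9],
    // so a character such as "ø" ends the current group.
    private static final Pattern WORD = Pattern.compile("\\w+");

    /** Joins the runs of word characters found in the title with hyphens. */
    static String toPermalink(String title) {
        Matcher matcher = WORD.matcher(title);
        StringBuilder sb = new StringBuilder();
        while (matcher.find()) {
            if (sb.length() > 0) {
                sb.append('-');
            }
            sb.append(matcher.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints I-like-red-porridge-with-cream
        System.out.println(toPermalink("I like red porridge with cream"));
        // prints Jeg-kan-lide-r-dgr-d-med-fl-de
        System.out.println(toPermalink("Jeg kan lide rødgrød med fløde"));
    }
}
```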

The reason we filter out everything but Latin characters is that the path element of a URL only allows a subset of ASCII (the Latin alphabet, digits and a few punctuation marks) plus percent-encoding; raw UTF-8 is not allowed. While percent-encoding can represent national characters, it is not legible to the user and is therefore, by itself, not suitable for permalinks. RFC 3987 defines a standard for multilingual URLs (called IRIs), whereby an address written with Unicode characters on the client side is converted to percent-encoded UTF-8 before it is sent to the server over HTTP. Because both client-side and server-side software must support the standard, lack of support in legacy software is still an issue.
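To see why percent-encoding alone is not an attractive permalink, the sketch below encodes a single Danish word as UTF-8. It uses the JDK's java.net.URLEncoder, which is really meant for form data and writes a space as “+”, so the hypothetical helper encodeSegment swaps that for “%20”. Treat it as an illustration rather than a complete IRI-to-URI conversion.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class PercentEncodingDemo {

    /** Percent-encodes one path segment as UTF-8 (illustrative helper). */
    static String encodeSegment(String segment) {
        try {
            // URLEncoder encodes a literal "+" as %2B, so any "+" left in the
            // result stands for a space and can safely be rewritten as %20.
            return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // "ø" is U+00F8, which is the two bytes C3 B8 in UTF-8
        System.out.println(encodeSegment("rødgrød")); // prints r%C3%B8dgr%C3%B8d
    }
}
```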

Until IRIs become pervasive, another approach is to transliterate the webpage title. Transliteration is the practice of writing a character or a word in another alphabet. For instance, by convention the Danish character “ø” is transliterated to “oe” in the Latin alphabet. The sentence “Jeg kan lide rødgrød med fløde” is thus transliterated to “Jeg kan lide roedgroed med floede”, which is perfectly legible to a Dane. If we run the transliterated sentence through the permalink code, we get “Jeg-kan-lide-roedgroed-med-floede”, which is fine.
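The idea can be sketched with nothing more than a lookup table. The class below is a hypothetical, hard-coded illustration for lower-case Danish letters only; it is not the API described in the next section, and the names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

class DanishTransliterationSketch {

    // Hard-coded table for illustration; upper-case letters are not handled.
    // Unicode escapes keep the source file independent of its encoding.
    private static final Map<Character, String> DANISH = new HashMap<>();
    static {
        DANISH.put('\u00e6', "ae"); // æ
        DANISH.put('\u00f8', "oe"); // ø
        DANISH.put('\u00e5', "aa"); // å
    }

    /** Replaces each mapped character and copies everything else unchanged. */
    static String transliterate(String source) {
        StringBuilder sb = new StringBuilder(source.length() + 8);
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            String latin = DANISH.get(c);
            if (latin != null) {
                sb.append(latin);
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints Jeg kan lide roedgroed med floede
        System.out.println(transliterate("Jeg kan lide rødgrød med fløde"));
    }
}
```

The transliterated string can then be passed through the permalink code unchanged, since it now contains only characters that \w matches.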

A Java API for transliteration

I have created an open source Java API to transliterate national scripts into Latin, but I’m facing the challenge that some European alphabets share the same national characters yet have different conventions for transliterating them into Latin characters. I therefore have to create a transliteration table for each language. So far the API supports Danish, Swedish, Norwegian and German.

To use the API, download the following two Jar files from Google Code and put them on your classpath:

scalemania-latin-transliteration-1.0.jar
scalemania-latin-transliteration-data-1.0.jar

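To transliterate a String you must first invoke the factory method to create a Transliterator for a language and then call the Transliterator’s method transliterate(String source). The null check in the code below guards against a language for which no Transliterator is available.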

String language = "da";
String title = "Jeg kan lide rødgrød med fløde";

Transliterator transliterator = TransliteratorFactory.newInstance().getTransliterator(language);
if (transliterator != null) {
	title = transliterator.transliterate(title);
}

How to determine the user’s language

In a web application the language of the user can be determined from the HTTP request header called Accept-Language (header names are case-insensitive):

String acceptLanguage = request.getHeader("accept-language");

The contents of the field may list several languages separated by commas:

Accept-Language: da,en;q=0.7,en-gb;q=0.3

Some entries have a quality rating that represents the user’s preference for the language. The default value of the quality “q” is 1.0. In the example above the language “da” has no explicit rating and therefore has the highest preference, 1.0.
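Picking the preferred language out of the header can be sketched as follows. This is a simplified parser written for illustration: the class and method names are my own, it ignores the “*” wildcard, and it reduces a tag such as “en-gb” to its primary part “en” on the assumption that transliterators are looked up by two-letter language code as in the earlier example.

```java
import java.util.Locale;

class AcceptLanguageSketch {

    /** Returns the primary language with the highest q value, or null if none. */
    static String preferredLanguage(String header) {
        if (header == null) {
            return null;
        }
        String best = null;
        double bestQ = 0.0;
        for (String entry : header.split(",")) {
            String[] parts = entry.trim().split(";");
            String tag = parts[0].trim();
            if (tag.isEmpty() || tag.equals("*")) {
                continue; // nothing usable in this entry
            }
            double q = 1.0; // default quality when no q parameter is given
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (param.startsWith("q=")) {
                    try {
                        q = Double.parseDouble(param.substring(2));
                    } catch (NumberFormatException e) {
                        q = 0.0; // treat a malformed rating as "not wanted"
                    }
                }
            }
            // Strictly greater: on a tie the entry listed first wins.
            if (q > bestQ) {
                bestQ = q;
                best = tag.split("-")[0].toLowerCase(Locale.ROOT);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(preferredLanguage("da,en;q=0.7,en-gb;q=0.3")); // prints da
    }
}
```

In a servlet container it may be simpler to call request.getLocale(), which the Servlet API defines as returning the client's preferred locale based on the same header.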

Adding support for languages to the API

Each language has its own XML file containing the mapping of native characters to Latin characters. For instance, the content of the Danish XML file looks like this:

<?xml version="1.0" encoding="utf-8"?>
<transliterator language="da">
	<char native="æ" latin="ae" />
	<char native="ø" latin="oe" />
	<char native="å" latin="aa" />
</transliterator>

To add support for another language one simply has to provide such a mapping file, with no programming involved. All the language XML files are packaged in a separate Jar file to allow other implementations to use the same data.
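Because the data lives in plain XML, another implementation could read it with the JDK's built-in DOM parser. The sketch below shows one way to turn a mapping file into a lookup table; the class MappingFileSketch and its methods are hypothetical and say nothing about how the API itself loads the files.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class MappingFileSketch {

    /** Reads every <char native="..." latin="..."/> element into a map. */
    static Map<String, String> load(InputStream xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xml);
        NodeList chars = doc.getElementsByTagName("char");
        Map<String, String> table = new LinkedHashMap<>();
        for (int i = 0; i < chars.getLength(); i++) {
            Element e = (Element) chars.item(i);
            table.put(e.getAttribute("native"), e.getAttribute("latin"));
        }
        return table;
    }

    /** Convenience wrapper for XML held in a String. */
    static Map<String, String> parse(String xml) {
        try {
            return load(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a valid mapping file", e);
        }
    }

    public static void main(String[] args) {
        String danish =
                "<transliterator language=\"da\">"
              + "<char native=\"æ\" latin=\"ae\" />"
              + "<char native=\"ø\" latin=\"oe\" />"
              + "<char native=\"å\" latin=\"aa\" />"
              + "</transliterator>";
        System.out.println(parse(danish).get("ø")); // prints oe
    }
}
```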

Feedback and language contributions are welcome :)

3 Responses to “Internationalizing Permalinks”

  1. george Says:

    hello,
    i find your post quite interesting. i would like to ask you what would be the filename of the XML file containing the mapping for another language, lets say spain.

