Understanding Sorbian
21 Feb 2022
A team led by Professor Alexander Fraser has programmed translation software for a language that is now spoken by just under 20,000 people.
How do you translate a language you don’t know? Machine translation usually works with two language versions of a text: the original and its rendering in the language you want to translate into. In some cases, however, there is only a small amount of this so-called parallel data: exact translations of sentences, grammar, equivalent words. This complicates the work of computer scientists like Professor Alexander Fraser from the Center for Information and Language Processing at LMU Munich. Using his ERC Starting Grant, Fraser studied languages without parallel data.
“I became interested in Sorbian because I found that there were digital traces of the language, such as a small Sorbian Wikipedia,” Fraser recalls. What he did not find was parallel data, in other words, word-for-word translations. Sorbian is a protected language in the German-speaking world and is divided into two written languages: Lower Sorbian and Upper Sorbian.
Professor Fraser got in touch with the Witaj Sorbian Language Center, which was already working on developing translation software. “My team and I organized an international competition where we tried out our translation applications. We focused on Upper Sorbian first,” he explains. To do this, his team first had to devise a method that could produce translations in the absence of parallel data.
In what’s known as supervised translation, up to three million sentences are normally fed into a machine translation system together with their translations. The system then learns to link each sentence to its translation and can translate individual words or sentences in both directions.
However, this process was not possible here; the researchers did not have enough parallel data. The team therefore developed an unsupervised translation system. “This was very exciting for us, because designing translation systems without knowing the exact translations is a very complex business,” Fraser recalls.
Using statistical models, the translation programs can learn the correlations in both languages. To facilitate this, the researchers create a data structure for each language. Each word is embedded in this structure together with its lexical environment, that is, the words that tend to appear around it. The two data structures are then linked together. The process is very error-prone, so it is repeated several times to improve the results.
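The idea of building one structure per language and then linking the two can be illustrated with a toy sketch. It is not the LMU team's system, whose details the article does not give. It assumes that the "data structure" is a simple table of co-occurrence counts, that linking means matching words whose neighbourhoods have the same shape, and it uses tiny invented German and English corpora as stand-ins for the real language pair. All function names are hypothetical, and the two corpora are deliberately built with the same structure so that a single pass succeeds; real monolingual data is far noisier, which is why the linking has to be repeated and refined.

```python
from collections import Counter
from itertools import combinations

# Toy monolingual corpora (invented data). The sentences are NOT aligned:
# they appear in a different order and nothing marks which translates which.
GERMAN = [
    "hund bellt", "hund bellt laut", "katze rennt",
    "hund rennt", "hund bellt", "katze rennt",
]
ENGLISH = [
    "cat runs", "dog barks", "dog runs",
    "dog barks loudly", "cat runs", "dog barks",
]


def cooccurrence(corpus):
    """Count how often each pair of distinct words shares a sentence."""
    counts = Counter()
    for sentence in corpus:
        words = sorted(set(sentence.split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def signatures(corpus):
    """Describe each word by its sorted co-occurrence counts.

    Sorting discards WHICH neighbours a word has and keeps only the shape of
    its lexical environment, so signatures can be compared across languages.
    """
    counts = cooccurrence(corpus)
    vocab = sorted({w for s in corpus for w in s.split()})
    return {
        w: sorted((counts[(w, o)] for o in vocab if o != w), reverse=True)
        for w in vocab
    }


def link(source_sigs, target_sigs):
    """Pair each source word with the target word whose signature is closest."""
    def distance(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    return {
        word: min(target_sigs, key=lambda cand: distance(sig, target_sigs[cand]))
        for word, sig in source_sigs.items()
    }


lexicon = link(signatures(GERMAN), signatures(ENGLISH))
for german, english in sorted(lexicon.items()):
    print(german, "->", english)
```

In this toy setting every word has a distinct signature, so the first pass already recovers a small word list without any translated sentence pairs. With real text, many words would share near-identical signatures, and a first guess like this would only serve as the starting point for further rounds of refinement.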
The result, created in close cooperation with the Witaj Sorbian Language Center, is the sotra translator, short for Sorbian Translator. Following on from the first unsupervised translations, the Sorbian community produced parallel translations, from which the software then learned. The program is now based on a corpus of some 200,000 Upper Sorbian and German sentence pairs from all spheres of life.
The goal is now to include Lower Sorbian too. As of 21 February 2022, Microsoft Translator itself offers Upper Sorbian translations, inspired by the sotra project. “This is an important step for the Sorbian community,” Fraser says, “because it brings more attention to the language.”