Polyglot machines

2 Jan 2024

How artificial intelligence learns the rich variety of human languages: Hinrich Schütze, computational linguist at LMU, researches multilingual software that can handle small languages. From the research magazine EINSICHTEN

Quantum leaps in the treatment of language: computational linguist Hinrich Schütze works at the interface between linguistics and computer science.

© Oliver Jung / LMU

In mid-September 2023, a video circulated on social media featuring a young man who goes by the handle “Jon Finger” on Twitter (X).

In English, he extols the virtues of a new, AI-powered automatic translation software, which is supposedly capable of rendering the speech of any speaker captured on video into pretty much any language out there – in their own voice, lip-synced, regardless of the position of the camera and their face, and independently of light conditions and ambient noise. It is touted as a program for simultaneous interpreting under natural conditions, one that is so accomplished that the viewer does not realize that translation software is responsible for the images (lip movements), sound (voice), and languages spoken – and not a polyglot speaker.

EINSICHTEN cover: “Echt Jetzt – Künstlich, Natürlich: Die Grenzen verschwimmen” (“For Real – Artificial, Natural: The Boundaries Are Blurring”)

Read more about the boundary between the natural and the artificial in the current issue of our research magazine EINSICHTEN. | © LMU

As a video, the film impresses with its matter-of-fact presentation and lo-fi esthetics: no polished visuals, no preening CEO, no contrived ambience, no bells and whistles. The video stops. And immediately starts up again. This time, the young man speaks French without an accent. But the images are the exact same as before when he was speaking English. The video stops and resumes again. This time, he is speaking German. The viewer hears and sees the same speaker saying the same script, only in a different language. Even if you look closely, you cannot see any lip movements that seem ‘off,’ and you do not hear any voice other than that of the original speaker. Jon Finger’s company is called HeyGen and its advertising slogan proclaims: “No camera? No crew? No problem!”

But in reality, things are not so easy and there are substantial problems that remain to be solved.

Automatic translation still presents a technical challenge, even if people’s expectations have fundamentally changed. After all, natural language processing (NLP) operates in a cycle: Language spoken by humans is acoustically captured by a machine, which converts it into text. The textual data is semantically processed and then translated and converted into another spoken language. Implementation of each of these steps was in its infancy just half a decade ago.

Probabilities of sequences of sounds, words, and sentences

As it transpired, the startling solution to these problems would be to skip the steps! AI systems such as ChatGPT use a neural network architecture and unsupervised learning to generate translations. They are able to learn without an explicit rule set. To this end, the machines had to be trained on huge volumes of audio and textual data from natural languages so that they could recognize the probabilities of sequences of sounds, words, and sentences, both spoken and written. AI-generated texts are plausible, then, because these systems can mine a huge reservoir of spoken and written texts (from media, books, podcasts, and blogs) that were already in logical, functional, and comprehensible form in their original context.
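To make the idea of "probabilities of sequences" concrete, here is a minimal sketch in Python. It is a toy bigram model and not how ChatGPT or any production system works: it only counts which word follows which in a tiny invented corpus and turns those counts into conditional probabilities. The corpus and the function names `train_bigrams` and `next_word_probability` are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, which words follow it in the training text."""
    follow_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for current, following in zip(tokens, tokens[1:]):
            follow_counts[current][following] += 1
    return follow_counts


def next_word_probability(follow_counts, current, candidate):
    """Estimate P(candidate | current) as a relative frequency."""
    followers = follow_counts.get(current, Counter())
    total = sum(followers.values())
    if total == 0:
        return 0.0  # the model has never seen `current` followed by anything
    return followers[candidate] / total


# Invented toy corpus (an assumption, not real training data)
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = train_bigrams(corpus)
print(next_word_probability(model, "the", "cat"))
```

With so little text, most word pairs are never observed and receive a probability of zero. Real systems learn far richer statistics with neural networks, but they depend just as much on how much text they have seen.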

But what if only a small body of texts is available for training the systems? This applies, for example, to natural languages with a limited local range of speakers, where the volume of spoken and written texts available is correspondingly small.

Enter Professor Hinrich Schütze, a computational linguist at LMU’s Center for Information and Language Processing (CIS). In his field of computer-assisted language processing, researchers seek to algorithmically process and generate natural language in the form of textual data. Operating at the interface between linguistics and computer science, computational linguistics has been around since the early 1960s. At the beginning, the machine processing of language was confined to the capturing of limited text corpora and the collection of statistical data – things like the frequency of occurrence and the environments of individual words or sentences in the works of Goethe. But concordances and lexical and formal statistics alone are not sufficient for teaching a model to understand a language.

Then the internet and advances in machine learning revolutionized the possibilities of natural language understanding (NLU) and natural language generation (NLG), including for computational linguistics. “In our field, we’re living in fascinating times. Suddenly, intelligences have developed that nobody can really explain,” summarizes Schütze. “People have been researching artificial intelligence for decades. But all the specific algorithms, all the methods that made it possible, say, for IBM to build the AI program Watson have proven in retrospect to have been not so important,” he continues. “Scaling was the key: sheer data volumes and computing power and the resulting size of the models – this was what led to emergence,” giving the AIs properties of a new kind. “This enabled quantum leaps in the treatment of language. Some AI researchers simply invoke the ‘God of Scale’ and have given up on explaining emergent behavior. But we should never deploy models that are ‘black boxes’ for critical applications. So it is absolutely necessary to develop more explainable models that do more than scaling.”

Previously, researchers had tentatively begun to move away from text and voice input to explore meaning. Computers segmented chains of letters into words and sentences, and person and case markers were analyzed to extract the grammatical information and return the words to their basic forms. Then the words were analyzed to determine their structural function in the sentence (subject, predicate, object, article, etc.). And finally, meanings were assigned to sentences and the relationships between successive sentences were determined. This consecutive process remained laborious, time-consuming, and prone to errors. With the ‘God of Scale’ epiphany, it was no longer deemed necessary for machine language processing to break the input down into syntax and semantics. “The algorithm no longer goes through the input step by step, but processes everything all together at the same time. Just like you take in a picture in one glance. Naturally, this process is much faster.”

7,000 languages in the world, classified into 400 families

Just because AI is able to handle English, French, German, Spanish, Chinese, and Russian does not mean that it has mastered all the languages of the world equally. This is where Schütze’s current research projects come in. “There are more than 7,000 languages in the world, which are classified into some 400 families. Plentiful resources are available for only about 100 languages. There is not enough training data for most languages. For starters, we need look no further than the Sorbian language in Germany. Then there are various African languages, languages of the indigenous peoples of America, languages in Southeast Asia, Australia, Oceania …”

It was meant to be a tower that would reach to the heavens, until God put a stop to construction by confounding the speech of the inhabitants of Babel – thus creating a bewildering multitude of languages. So goes the narrative in the Book of Genesis. Painting by Pieter Bruegel the Elder. Today, computational linguists are working to unravel this confusion of tongues.

© Mithra-Index / Heritage-Images / Picture Alliance

Most pioneers of AI technology to date have been American companies interested in commercial applications for automatic language processing. As such, they have ignored the ‘small’ or ‘low-resource’ languages, because they hold out little promise of profitability. The lack of technical support for minor languages, indeed the overlooking of entire language communities, erects virtual barriers and deepens the digital divide. To remedy this situation, Schütze started at the beginning: by collecting language data and categorizing relationships for these small languages.

Training with the Bible

Existing typologies classify languages according to their geographical characteristics (such as the continent on which the language is spoken), their phylogenetic features (genealogical relationships between languages), or their structural features (morphology and syntax). Schütze’s approach is able to analyze over 1,000 languages with AI because it compares them using a so-called super-parallel dataset: the Parallel Bible Corpus (PBC). After all, no other book has been translated into as many languages as the Bible. First translated as the Septuagint around 250 BC, the Holy Scriptures remain the most translated work in the world to this day.

A large number of Bible translations are now available in electronic form, which enables Schütze to analyze many different languages in parallel. “We’re working simultaneously on multiple languages,” says Schütze, “now on Walloon, now on an African language, now on an Oceanic language. We don’t concentrate on one language, check it off our list, and move on to the next. We work on all of them simultaneously.”

His team’s method of ‘conceptualization’ tries first of all to find similarities and differences in the huge corpus – for example, in how languages divide the world into concepts and what they associate with these concepts. Chinese, Japanese, and Korean, for instance, all associate the concept “mouth” with “entrance” due to the influence of the corresponding Chinese ideograph. This association is absent in European languages, indicating that the three East Asian languages share a similar conceptualization of their linguistic world, which is clearly distinct from European languages. Thus, conceptual similarity supplements conventional techniques based on lexical and typological similarities.
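How a parallel corpus can reveal which words a language uses for a concept may be easier to see in a small sketch. The Python code below is not the team's actual method. It is a simplified co-occurrence heuristic, assuming verse-aligned texts: for every verse whose source side contains a concept word, it counts the words on the target side and ranks them by how exclusively they appear in those verses. The function name `associated_words` and the target "language" (made-up tokens such as `mek`) are invented for illustration.

```python
from collections import Counter


def associated_words(parallel_verses, concept_word, top_n=1):
    """Rank target-language words by how strongly they co-occur with
    verses whose source side contains `concept_word`."""
    with_concept = Counter()  # verses containing the concept, per target word
    overall = Counter()       # all verses, per target word
    for source, target in parallel_verses:
        target_words = set(target.split())
        overall.update(target_words)
        if concept_word in source.split():
            with_concept.update(target_words)
    # Score: share of a word's verses that also contain the concept
    scores = {word: with_concept[word] / overall[word] for word in with_concept}
    # Ties are broken by raw co-occurrence count, then alphabetically
    ranked = sorted(scores, key=lambda w: (-scores[w], -with_concept[w], w))
    return ranked[:top_n]


# Invented verse pairs: English source, made-up target tokens
verses = [
    ("he opened his mouth", "zu mek ri"),
    ("the mouth of the cave", "mek wa don"),
    ("he entered the cave", "zu lu don"),
    ("he spoke to them", "zu ri sa"),
]
print(associated_words(verses, "mouth"))
```

Running the same kind of lookup in the reverse direction, from the target word back to the source language, is one conceivable way to surface associations such as "mouth" and "entrance" that a language bundles into a single word.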

But how do researchers get hold of new language data if there’s hardly any available? “One way,” says Schütze, “is to scour the web for everything that exists in this language. That’s not easy at first, and classification remains difficult. Gradually, however, you can build up a corpus on which to train an AI. Then we get the trained system to explain, for example, what a sentence is about whose contents we understand but which was not in the AI’s training data. If the model answers correctly, this shows it has learned something about the language. We evaluate the capabilities of the AI using the large text corpora, in which of course we have sentences that are already translated. But apart from the scarcity of data for smaller languages, the main problem in working with AI remains the occurrence of hallucinations.”

Schütze explains that an AI model, though equipped with large volumes of data, does not possess an ‘explicit memory.’ It tries out meaningful probabilities in a language. Intelligent speech, however, is contingent on there being a memory to distinguish probabilities from facts. Nevertheless, the AI improvises knowledge even when it doesn’t possess facts. This gives rise to hallucinations, syntactically correct but purely invented answers, such as a reference to a quote that does not actually exist. In view of this problem, Schütze’s team is working on endowing AIs with a memory and training them on factual knowledge, so that in addition to implicit knowledge about how a language works, they can also build up explicit knowledge about what is correct. In this way, the AIs will be able to distinguish answers that merely sound good from ones that are based on facts, and prefer the latter.

Language models upgraded with a ‘working memory’

All language processing models are currently faced with these problems, including ChatGPT: Without subsequent cooperation with humans – human-centered NLP – and a strong connection to knowledge databases, the AIs lack factual correctness. This results in bias and hallucinations, where the AI models generate answers that are nonsensical, illogical, or irrelevant. As a first step, for example, ChatGPT could be made to furnish references with its answers, which it currently does not provide. And so Schütze is working on upgrading the language models with a ‘working memory’ that prompts the saving of information and the retrieval of knowledge.
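The idea of pairing a language model with an explicit memory can be illustrated with a deliberately simple Python sketch. It does not reflect the design Schütze's team is working on. It only shows the principle of looking up a stored, sourced fact first and flagging anything else as unverified. The class `ExplicitMemory`, the function `answer`, and the stand-in `guess` function (which plays the role of a language model improvising an answer) are all hypothetical.

```python
class ExplicitMemory:
    """A tiny key-value store of facts together with their sources."""

    def __init__(self):
        self._facts = {}

    @staticmethod
    def _normalize(text):
        # Ignore case and extra whitespace when matching questions
        return " ".join(text.lower().split())

    def save(self, question, fact, source):
        self._facts[self._normalize(question)] = (fact, source)

    def retrieve(self, question):
        return self._facts.get(self._normalize(question))


def answer(question, memory, guess):
    """Prefer a stored fact with its source; otherwise mark the guess as unverified."""
    hit = memory.retrieve(question)
    if hit is not None:
        fact, source = hit
        return f"{fact} (source: {source})"
    return f"{guess(question)} (unverified)"


memory = ExplicitMemory()
memory.save(
    "How many languages are there in the world?",
    "More than 7,000",
    "interview with Hinrich Schütze",
)


def guess(question):
    # Stand-in for a model that produces a plausible-sounding answer
    return "Probably a few hundred"


print(answer("how many languages are there in the world?", memory, guess))
print(answer("Who painted the Tower of Babel?", memory, guess))
```

The hard part, deciding when to save and when to retrieve, is exactly what this sketch leaves out: here a human fills the memory by hand and every question triggers a lookup.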

But when should the machine save something in explicit memory and when should it retrieve the information? “This is still a huge problem,” says Schütze. “Yet explicit memory is a decisive feature of intelligence. There’s only one instance of genuine intelligence, and that’s human beings. We should be guided by human intelligence when designing AIs,” explains Schütze. “Admittedly, there’s also the view that it’s not necessary to base AI on humans. Airplanes don’t beat their wings like birds, but they fly all the same. It could turn out to be like that with AI, that the machine is intelligent but works in a different way. However, I remain convinced that our work on AI should be based on the human model. And our intelligence needs memory to work.”

Author: Bernd Graff

Portrait of Prof. Dr. Hinrich Schütze.

Models for machines: computational linguists like Hinrich Schütze get AIs to learn how a language works.

© Oliver Jung / LMU

Prof. Dr. Hinrich Schütze holds the Chair of Computational Linguistics and is Co-Director of the Center for Information and Language Processing (CIS) at LMU. Born in 1964, Schütze studied computer science in Braunschweig and Stuttgart. He earned his Ph.D. in computational linguistics at Stanford University, California, USA. He then worked for five years at the Xerox Palo Alto Research Center and for another five years as a founder and executive at search engine and text mining companies in Silicon Valley. Schütze held the Chair of Theoretical Computational Linguistics at the University of Stuttgart before being appointed to LMU in 2013. He is a Fellow of the Association for Computational Linguistics, the European Laboratory for Learning and Intelligent Systems, and HessianAI. In 2018, the European Research Council (ERC) awarded him one of its Advanced Grants.
