Insight Brain: Deep learning and the fight to protect language diversity

Submitted on Monday, 26/09/2022

It is estimated that there are 7,000 languages spoken in the world. Globally, 26 languages are dying every year and with them the knowledge and stories of their cultures. Encoding minority languages online is a key strategy in ensuring their survival.

Cardamom is a fascinating project being run from Insight@NUI Galway that employs deep learning to develop resources to support minority languages online.

Digital language tools currently support only a small fraction of the world’s languages. While languages in Africa, the Americas or the Pacific may spring to mind first, we don’t have to look further than Europe: a recent report on European languages classified all but two EU languages as ‘severely’ under-resourced.

Fortunately, the scientific community is well aware of the resource gap. The last decade or so has seen a global surge of research into language technology for low-resource languages.

The rise of deep learning, a subfield of artificial intelligence and machine learning that tries to emulate how the brain works and how humans learn, has been important for language technology. In Natural Language Processing, a subfield of computer science that aims to make computers understand human language, deep learning models have shown that we can extract linguistic information from texts without having to teach the model anything about the language or languages in question. This has led to impressive results: machine translation improves in quality year on year and can even translate between pairs of languages it has never seen before.

There is one caveat, however: since deep learning models essentially learn by example, they need a lot of data. This makes such models unsuitable for under-resourced languages, for which much less linguistic data is available, at least when machine-learning techniques are simply applied in the same way as for better-resourced languages.

This is where the project ‘Comparative deep models for minority and historical languages’, Cardamom for short, comes in. The aim of this Irish Research Council-funded project is to use insights from linguistics and data gathered from the Web to bolster natural language processing techniques and applications that benefit low-resource languages, which have been largely ignored by current approaches.

The project’s methodology involves two parallel and complementary strategies. Firstly, we aim to significantly enlarge datasets for minority languages, focusing on European and Indian languages, by gathering as much text from the Web as possible. Secondly, we will develop models of language, based on deep learning, that learn features of low-resource languages from closely related, better-resourced languages, thus reducing the need for large datasets in minority-language and other low-resource scenarios.
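The second strategy, borrowing evidence from a related language, can be illustrated with a deliberately small sketch. It does not use deep learning and is not the project's own method: it swaps in a count-based character bigram model so the idea fits in a few lines. The word lists are invented toy strings rather than real language data, and the function names (bigram_counts, combine, avg_log_prob) are hypothetical.

```python
import math
from collections import Counter

START = "^"  # start-of-text marker, assumed not to occur in the texts


def bigram_counts(text):
    """Count adjacent character pairs, including the start marker."""
    padded = START + text
    return Counter(zip(padded, padded[1:]))


def combine(low_counts, related_counts, weight=1.0):
    """Add (optionally down-weighted) related-language counts to the low-resource counts."""
    combined = Counter()
    for pair, n in low_counts.items():
        combined[pair] += n
    for pair, n in related_counts.items():
        combined[pair] += weight * n
    return combined


def avg_log_prob(counts, text, alphabet, alpha=1.0):
    """Average log-probability per character under an add-alpha smoothed bigram model."""
    context_totals = Counter()
    for (prev, _), n in counts.items():
        context_totals[prev] += n
    padded = START + text
    pairs = list(zip(padded, padded[1:]))
    total = 0.0
    for prev, nxt in pairs:
        p = (counts[(prev, nxt)] + alpha) / (context_totals[prev] + alpha * len(alphabet))
        total += math.log(p)
    return total / len(pairs)


# Invented toy data: a "well-resourced" relative and a tiny low-resource sample.
related_text = "bana kana lana mana " * 10
low_train = "bana"
low_heldout = "kana lana"
alphabet = set(related_text + low_train + low_heldout)

low_only = bigram_counts(low_train)
transferred = combine(low_only, bigram_counts(related_text))

low_score = avg_log_prob(low_only, low_heldout, alphabet)
transfer_score = avg_log_prob(transferred, low_heldout, alphabet)
print("low-resource data only:", round(low_score, 3))
print("with related-language counts:", round(transfer_score, 3))
```

In this toy setup the model that also sees the related language assigns a higher average log-probability to the held-out low-resource text, because most of the character patterns it needs are attested only in the larger sample. The weight argument is one simple way to limit how much the related language dominates; in practice its value would have to be tuned.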

Speakers of minority languages are among the fastest growing communities on the Web and meeting their needs is of major societal and commercial importance.

The project involved the creation of translations not only in Bengali (the sixth most spoken language in the world, yet still with less text available on the Web than even Irish) but also in languages like Chittagonian and Rohingya, which had next to no resources. For those languages, we developed some of the first digital corpora, an important step towards the development of much-needed language technology.

Our focus in the Cardamom project is not only on contemporary minority languages, but also on historical varieties. From a methodological and linguistic data point of view, this makes sense: languages like Old English or Old Irish are characterised by scanty textual evidence, a scenario not unlike that of the Indian subcontinent, where many languages have little presence on the Web (although many are becoming increasingly important in a rapidly developing and globalising world).

Just as we can use linguistic information from a well-resourced language like Hindi to understand any of the hundreds of closely related Indo-European languages spoken in India, we can use features from, say, Modern Irish to gain more insight into the historical stages of the Irish language.

By incorporating historical language data in our models, we aspire to meet the growing demand for text analysis in digital humanities, where access to large corpora in languages such as Sanskrit or Old Irish can enable new insights in the study of history and literature. Our research will therefore contribute to digitally safeguarding and future-proofing linguistic heritage, not only in a contemporary, but also in a historical sense.