Microsoft has announced plans to expand the development and adoption of multilingual large language models (LLMs) as part of a new partnership drive across Europe.
Europe’s 24 official languages and 250 indigenous languages are currently underrepresented in the web content on which the industry’s LLMs were trained.
The result is that LLMs are currently unable to process Swedish or Romanian at the same standard as English. To bridge this language gap, Microsoft will make multilingual data from GitHub accessible to the European community in collaboration with Hugging Face.
On 1st September, Microsoft will also issue a call for grant applications to build out content in 10 underrepresented European languages.
“We have learnt that, basically, one needs to record several hundred hours of people, speaking a particular language in order to support the multi-modal capability of AI,” explained Brad Smith, vice chair and president at Microsoft.
“So for example, to be able to handle text to speech and speech to text, and we can do that by employing people to go record more audio in more languages.”
Microsoft will also create new jobs at its innovation centers in Strasbourg, the Microsoft Open Innovation Center (MOIC) and the AI for Good Lab. It will partner with the ICube Laboratory at the University of Strasbourg, which is already working on this problem, and fund two post-doctoral researchers.
This will involve digitizing existing content such as books, as well as creating audio content in those languages to improve multimodal training data. To back these efforts, the company said it will provide groups with Azure cloud credits, grants, and engineering support.
Microsoft has stressed that this data will be in the public domain and will be made freely available to European citizens.
“It’s important to underscore that all of this work is designed to donate more data so that others can use it,” Smith told assembled media.
“Our goal is to make it available to the European public and to open source developers. And at the same time, if there are particular partners that submit a proposal that work within a certain approach in terms of terms and the like, we want to be open to honoring their terms,” he added.
“But across the board, I want to be clear Microsoft is not going to have a proprietary interest in any of this new content that is made available.”
English dominates AI training
Microsoft research has found that 46% of the web content used to train LLMs is in English.
“When you crawl the whole open web, what you see is predominantly the number one language on the web is English,” explained Juan Lavista, adding that German, Spanish, and French come second but still make up less than 6% of the total.
This is massively disproportionate to speaker numbers: the 379.7 million native English speakers worldwide are outnumbered by the 485.1 million native Spanish speakers, for example.
Lavista added that this has been a problem since the beginning of the internet, as it was established in English and didn’t universally support special characters such as those required in French until 2003.
In a presentation, he showed how Meta’s Llama 3.1 drops 10 points in performance benchmarks when used in Swedish compared to English. In Latvian or Estonian, the gap is even starker, with the model 25 points down.
Microsoft and European governments have identified these limitations as a clear barrier to unlocking productivity through AI in the coming years.
The MOIC and AI for Good Lab will also publish an open blueprint for training LLMs and creating regional language datasets, targeting organizations such as the Basque Center for Language Technology, Barcelona Supercomputing Center, and University of Santiago de Compostela, which are working on Azure-based AI models in Basque, Catalan, and Galician.
In addition to supporting the new languages through improved datasets and hands-on help, Microsoft announced new collaborations with IE University School of Science & Technology in Madrid and the University of Strasbourg to back other ongoing research projects.
