Building large language models for European languages that have far less training data than English is a hard problem in artificial intelligence. Tech companies have been working on it, and a startup from Helsinki, Finland, recently introduced a new approach.
Earlier language models existed, but they were often built for a single language and performed poorly on languages with less data. They also failed to capture each European language’s unique characteristics, culture, and value base. Existing solutions were limited, and a more inclusive approach was needed.
Now, a Finnish AI startup has developed an open-source solution called Poro. It is a large language model that aims to cover all 24 official languages of the European Union. The idea is to create a family of models that understand and represent the diversity of European languages. The startup believes that this is important for digital sovereignty, ensuring that the value created by these models stays within Europe.
Poro is designed to tackle the challenge of training language models for languages with less available data, like Finnish. It uses a cross-lingual training approach, meaning it learns from data in higher-resourced languages, like English, to enhance its performance for lower-resourced languages.
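To make the idea of mixing high-resource and low-resource data more concrete, here is a minimal sketch of one common balancing technique, exponent-scaled (temperature-based) language sampling. This is an illustration only: the article does not say how Poro mixes its data, and the function name `sampling_weights`, the `alpha` parameter, and the token counts are all hypothetical.

```python
def sampling_weights(token_counts, alpha=0.5):
    """Turn raw per-language token counts into sampling probabilities.

    alpha = 1.0 samples each language in proportion to its data size.
    alpha < 1.0 flattens the distribution, so smaller languages are
    seen more often than their raw share would suggest.
    NOTE: an illustrative assumption, not Poro's documented method.
    """
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    scaled = {lang: count ** alpha for lang, count in token_counts.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}


# Hypothetical corpus sizes (arbitrary units), not real figures.
counts = {"en": 900, "fi": 100}
proportional = sampling_weights(counts, alpha=1.0)
flattened = sampling_weights(counts, alpha=0.5)
```

With these made-up numbers, Finnish makes up 10% of the raw data but is sampled 25% of the time when `alpha` is 0.5, so the model still learns mostly from English while seeing Finnish more often.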
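The Poro 34B model has 34.2 billion parameters and uses a BLOOM-style transformer architecture with ALiBi (Attention with Linear Biases), a technique that encodes token positions by adding a distance-based penalty to attention scores instead of adding position embeddings to the input. It is trained on a large multilingual dataset that covers natural languages as well as programming languages such as Python and Java. The training runs on one of Europe’s fastest supercomputers, which provides enormous computing power.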
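For readers curious about ALiBi, the sketch below shows the general idea in plain Python: each attention head gets a fixed slope, and the attention score between two tokens is lowered in proportion to how far apart they are. It follows the slope formula from the original ALiBi paper for head counts that are powers of two. The function names and the tiny sizes are assumptions made for illustration, and this is not Poro's actual implementation.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads).

    Only the power-of-two case from the ALiBi paper is handled here.
    """
    if num_heads <= 0 or num_heads & (num_heads - 1) != 0:
        raise ValueError("this sketch only supports power-of-two head counts")
    start = 2 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Return bias[h][i][j], the value added to the raw attention score
    of query position i and key position j in head h, before softmax."""
    bias = []
    for slope in alibi_slopes(num_heads):
        head = []
        for i in range(seq_len):              # query position
            row = []
            for j in range(seq_len):          # key position
                if j <= i:
                    # Linear penalty that grows with distance.
                    row.append(-slope * (i - j))
                else:
                    # Standard causal mask (not part of ALiBi itself):
                    # a token cannot attend to later tokens.
                    row.append(float("-inf"))
            head.append(row)
        bias.append(head)
    return bias


slopes = alibi_slopes(8)    # 1/2, 1/4, ..., 1/256
bias = alibi_bias(8, 4)     # hypothetical toy size: 8 heads, 4 tokens
```

Because the penalty depends only on distance, no position embeddings need to be learned, which is one reason the approach is used in BLOOM-style models.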
The startup releases checkpoints throughout the training process to show the model's progress. Even at 30% completion, it reports state-of-the-art results for Poro. In tests, the model outperforms existing models for Finnish and is on track to match or surpass existing models' performance in English.
In conclusion, Poro represents a step forward in AI, specifically for European languages. It’s not just about creating a powerful language model but doing so in a way that is open and transparent and respects the diversity of languages and cultures in Europe. If successful, Poro could be a game-changer, offering a homegrown alternative to the language models from major tech companies.
Niharika is a technical consulting intern at Marktechpost. She is a third-year undergraduate pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a highly enthusiastic individual with a keen interest in machine learning, data science, and AI, and an avid reader of the latest developments in these fields.