The High-Performance Language Technologies (HPLT) project is creating very large-scale multilingual resources for large language models and machine translation.
Massive text collections for pre-training are the ‘crude oil’ of the large language model (LLM) era. The process of ‘refining’ high-quality datasets from web data at scale presupposes computational infrastructure and technological muscle that is typically characteristic of corporate environments, as evidenced, for example, by some notable generally available pre-training datasets: C4,¹ FineWeb 1 & 2,²,³ MADLAD-400,⁴ or Nemotron-CC.⁵ With a few notable exceptions, this line of work tends to capitalise on the English language.
Here, we present the open-source results⁶,⁹,¹⁰ of the European R&D consortium HPLT – a project that has been funded under the auspices of the Horizon Europe programme in 2022–2025. In addition to a myriad of other outcomes, HPLT has produced massive pre-training datasets of high-quality texts in close to 200 distinct language–script combinations. Its 2025 monolingual data release, HPLT 3.0, comprises some 30 trillion sub-word tokens in total, of which close to half represent languages other than English. We make this resource publicly available under the most permissive terms of use possible. We further share a state-of-the-art and open-source data preparation pipeline, an innovative multilingual evaluation framework, as well as hundreds of language models pre-trained on HPLT data.
Furthermore, the project has produced novel bilingual datasets for more than 50 language pairs, hundreds of corresponding machine translation models, open-source pipelines for data preparation, model training, and evaluation, as well as synthesised additional pre-training data for underrepresented languages by machine translation of very high-quality English documents. In our view, it is the totality of generally available and very large-scale resources and the documentation of the underlying processes that bears promise of ‘democratising’ the current LLM and MT landscape.
Organisation
The HPLT consortium comprised partners from across Europe: five different universities (Charles University in Prague and the Universities of Edinburgh, Helsinki, Oslo, and Turku), two national HPC centres (CESNET in the Czech Republic and Sigma2 in Norway), and a language engineering company (Prompsit). The project received about €4.1m from the Horizon Europe programme and £960,000 from UK Research and Innovation, and ran from September 2022 through December 2025. The project was coordinated by Jan Hajič (Charles University), with technical coordination by Kenneth Heafield (Edinburgh) and Stephan Oepen (Oslo) in its first and second halves, respectively.
Data curation
HPLT has gathered and processed more than ten petabytes of raw web data. The project has released more than 30 trillion tokens (word-like units) of high-quality textual data, accompanied by rich metadata, for close to 200 distinct languages. The process of extracting, cleaning, annotating, and filtering texts from raw web archives, composed of about a dozen modules, is schematically depicted in Fig. 1.
Raw web archives were drawn from three sources: the Internet Archive (IA), host of the iconic Wayback Machine; the non-profit Common Crawl Foundation (CC); and the ArchiveBot volunteer infrastructure for long-term web archiving. Sub-tasks like, for example, the extraction of ‘running text’ from marked-up document formats, language identification at the document and paragraph levels, ‘fuzzy’ near-deduplication, annotation with a wealth of text quality and regulatory compliance indicators, and final filtering based on all available information each directly affect the practical utility of the final datasets. Here, text quality versus overall volume present separate and often antithetical dimensions for optimisation, creating a rich space for different design choices and trade-offs. This remains an active area of research. The open-source HPLT processing pipelines are highly flexible and parameterisable, where default values represent the current state of knowledge.
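To give a flavour of one of these sub-tasks, the ‘fuzzy’ near-deduplication idea can be sketched with MinHash signatures: two documents that share most of their character n-grams receive similar signatures, so near-duplicates can be detected without pairwise full-text comparison. This is a toy stand-in under stated assumptions, not the actual HPLT pipeline code; all function names and parameters here are illustrative.

```python
import hashlib
import re

def shingles(text, n=5):
    # character-level n-gram 'shingles' of a whitespace-normalised document
    text = re.sub(r"\s+", " ", text.lower()).strip()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash(text, num_perm=64):
    # one seeded MD5-based hash family per 'permutation'; keep the minimum
    # hash value per permutation as the signature slot
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        )
        for seed in range(num_perm)
    ]

def est_jaccard(sig_a, sig_b):
    # the fraction of matching signature slots estimates Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash("The quick brown fox jumps over the lazy dog.")
b = minhash("The quick brown fox jumped over the lazy dog!")
c = minhash("Completely unrelated text about web archives.")
assert est_jaccard(a, b) > est_jaccard(a, c)
```

At web scale, signatures like these are bucketed with locality-sensitive hashing so that only candidate pairs in the same bucket are ever compared.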
Monolingual statistics
To put the HPLT monolingual data into perspective, Table 1 (below) presents document and token counts (see note) for the English and multilingual (non-English) partitions of the data, as well as counts for a small sample of individual languages. For ease of comparison, these statistics are accompanied by average document lengths and per-language proportions, and contrasted with corresponding figures for three other publicly available multilingual datasets mentioned above.

As is evident from these numbers, HPLT 3.0 is by far the largest publicly available such dataset, and its multilingual breadth compares favourably to other widely used resources. In Gemma-3 tokens, the multilingual HPLT 3.0 partition is about 2–3 times larger than FineWeb and the earlier version HPLT 2.0, respectively, and five times larger than the older MADLAD-400 dataset. In terms of average document length, which often correlates with text quality, HPLT 3.0 and 2.0 pattern alike, markedly ahead of FineWeb but well behind MADLAD-400. For a small selection of European languages, the table shows languages ranging from a ‘mere’ billion available tokens to others with hundreds of billions.
In-depth analytics
Training data quality is arguably the most important factor in model quality, but in-depth data inspection at scale is a challenging endeavour. HPLT has developed an open-source tool, HPLT Analytics, to compute a broad range of fine-grained statistics and enable interactive visualisation and exploration. The datasets are internally structured into documents, paragraph-like segments, and tokens. Descriptive frequency and length statistics, combined with basic correlation analysis against metadata like internet domains or predicted text register labels, can reveal distributional trends or outliers. Annotations are predominantly available at the document level, but in some cases also for smaller units. Contrasting the distributions of document versus segment language predictions, for example, allows insights into both degrees of in-document ‘code switching’ and uncertainty in language identification, often among closely related languages.
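The document-versus-segment comparison can be made concrete with a small sketch: for each document, compute the share of segments whose predicted language disagrees with the document-level label, and flag high-disagreement documents for inspection. The data and threshold below are invented for illustration and do not reflect HPLT Analytics internals.

```python
# Toy documents: a document-level language label plus per-segment predictions
docs = [
    {"doc_lang": "nb", "seg_langs": ["nb", "nb", "da", "nb"]},
    {"doc_lang": "fi", "seg_langs": ["fi", "fi"]},
    {"doc_lang": "gl", "seg_langs": ["pt", "gl", "pt"]},
]

def segment_disagreement(doc):
    # share of segments whose predicted language differs from the document label
    segs = doc["seg_langs"]
    return sum(lang != doc["doc_lang"] for lang in segs) / len(segs)

rates = [segment_disagreement(d) for d in docs]
# High-disagreement documents hint at code switching or uncertain language
# identification, often between closely related languages (e.g. nb/da, gl/pt)
flagged = [d["doc_lang"] for d, r in zip(docs, rates) if r > 0.5]
assert flagged == ["gl"]
```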
Multilingual evaluation
As an additional tool to gauge data quality and experimentally inform design choices in training data preparation (as well as in language model training), the project has developed a framework for automated large-scale multilingual evaluation, dubbed HPLT-e. In its current state of development, the framework comprises 127 language understanding and generation tasks across the nine European languages highlighted in Table 1.
This selection provided both availability of native speakers within the project team and a minimal degree of diversity in terms of language resources, families, and scripts. Tasks in HPLT-e are typically drawn from pre-existing benchmark suites, but emphasising natively constructed (rather than translated) tasks and extending each with three to seven human-written prompts to mitigate the methodological challenge of prompt sensitivity. Similar to Penedo et al.,²,³ we pretrain separate ‘smallish’ (2B parameters) GPT-like models per language using an otherwise fixed pretraining setup, and evaluate them at regular checkpoint intervals in a zero-shot regime, carefully selecting tasks that meet a range of evaluation signal criteria, i.e. can be expected to act as informative and reliable indicators of training data quality. Such criteria include monotonicity and relative stability of model performance as pretraining progresses, ranking consistency across pretraining intervals, and multiple indicators of limited prompt sensitivity. Fig. 2 shows a comparison of the four datasets introduced above using HPLT-e. To aggregate scores across different prompts, tasks, and languages, per-task scores are maximised across prompts and min-max normalised relative to a task-specific random baseline. Per-task scores are then averaged across task categories within each language and, finally, across languages. An alternative approach to overall aggregation is known as Borda’s count, using Vote’n’Rank,⁷ which is essentially the average of per-language counts of a model outranking all the others. Models trained on all four datasets for up to 100B tokens show a monotonic performance improvement on our selected tasks. Models pretrained on (the comparatively smaller) MADLAD-400 achieve the highest multilingual score, followed by HPLT 3.0, while HPLT 2.0 and FineWeb perform on par.
These results are corroborated by rank-based aggregation across tasks and languages, which yields the same ordering: MADLAD-400, HPLT 3.0, then HPLT 2.0 and FineWeb.
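The score aggregation described above (best prompt per task, min-max normalisation against a random baseline, then averaging over categories and languages) can be sketched as follows. The scores, baselines, and task names are hypothetical, and this is an illustration of the scheme rather than the HPLT-e implementation.

```python
# Hypothetical per-task accuracies keyed by prompt, plus each task's
# random-guessing baseline, category, and language
scores = {
    "qa_fi":  {"p1": 0.60, "p2": 0.70, "p3": 0.65},
    "nli_fi": {"p1": 0.50},
    "qa_no":  {"p1": 0.80},
}
baselines     = {"qa_fi": 0.25, "nli_fi": 0.25, "qa_no": 0.25}
task_category = {"qa_fi": "qa", "nli_fi": "nli", "qa_no": "qa"}
task_language = {"qa_fi": "fi", "nli_fi": "fi", "qa_no": "no"}

def aggregate(scores, baselines, task_category, task_language):
    # 1) best prompt per task, min-max normalised so that the random
    #    baseline maps to 0 and a perfect score to 1
    norm = {
        t: (max(p.values()) - baselines[t]) / (1.0 - baselines[t])
        for t, p in scores.items()
    }
    # 2) average tasks within each (language, category) bucket
    buckets = {}
    for t, s in norm.items():
        buckets.setdefault((task_language[t], task_category[t]), []).append(s)
    # 3) average categories within each language
    per_lang = {}
    for (lang, _), vals in buckets.items():
        per_lang.setdefault(lang, []).append(sum(vals) / len(vals))
    # 4) finally, average across languages
    return sum(sum(v) / len(v) for v in per_lang.values()) / len(per_lang)

overall = aggregate(scores, baselines, task_category, task_language)
```

Normalising against the task-specific random baseline keeps tasks with different numbers of answer options comparable before any averaging takes place.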
Language models
While training data creation has taken centre stage in the HPLT work plan, the project has also developed a wealth of language models of varying sizes and architectures supporting various languages and language groups.
In addition to large language models trained from scratch for Finnish and Norwegian, a common theme in this work was a strong emphasis on smaller, specialised models that are efficient to run. In total, publicly available project results comprise hundreds of language models, including the following sub-groups:
- 55 monolingual encoder-only (BERT-like) models for a typologically diverse set of languages. When fine-tuned as embedders for ‘classic’ language understanding tasks, these models uniformly show performance superior to standard multilingual models.
- 57 monolingual encoder–decoder (T5-like) models, again for a typologically broad set of languages. These models exhibit competitive performance in both embedding and generation benchmarks, thus offering a novel platform for experimentation.
- 38 monolingual decoder-only (GPT-like) reference models, each with 2.15B parameters and trained to 100B tokens. These models can serve a number of purposes, including as baselines for mono- and multilingual training, references for the comparison of HPLT and other data, and tools for contrasting HPLT data quality across different languages.
- Two larger (13B parameters), continually pretrained generative models, for Finnish and Norwegian, built on the fully open-source OLMo 2 platform. These models compare favourably to language-specific adaptations of the Mistral NeMo model, suggesting that fully transparent foundation models can yield results competitive with their merely open-weight counterparts.
Mining for bilingual text
Another wealth of open-source results from HPLT relates to machine translation (MT), notably large collections of parallel texts derived from mining the monolingual datasets for translational correspondences at the sentence or document levels. These resources are created using the additional processing block labelled Bitextor Pipeline in Fig. 1. The pipeline applies a multi-stage text extraction procedure that identifies documents with identical content in different languages using various matching and alignment techniques implemented as an open-source toolbox. Heavy parallel computing makes it possible to run such bitext mining at the scale offered by the monolingual web crawls coming from HPLT. Traditionally, parallel texts are provided as sentence-aligned bitexts that can directly be fed into machine translation training. HPLT provides three releases of parallel text corpora with a language coverage of 57 language pairs. The data is collected in an English-centric manner, aligning documents with English counterparts in our dataset. Pivoting on these English documents, we can then also derive multilingual parallel text collections spanning 1,446 language pairs. In total, HPLT provides 2.7 million sentence alignments released from our repository of parallel corpora, OPUS.
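The pivoting step can be illustrated in miniature: given two English-centric sentence-pair collections, joining them on their shared English side yields a new language pair with no English in it. The sentences below are invented toy data, not drawn from the HPLT corpora.

```python
# English-centric sentence pairs: English sentence -> other-language sentence
en_de = {"Hello.": "Hallo.", "Thank you.": "Danke."}
en_fi = {"Hello.": "Hei.", "Good night.": "Hyvää yötä."}

def pivot(en_x, en_y):
    # join two English-centric corpora on their shared English side,
    # producing a derived x-y bitext without any English sentences
    return {en_x[en]: en_y[en] for en in en_x.keys() & en_y.keys()}

de_fi = pivot(en_de, en_fi)
assert de_fi == {"Hallo.": "Hei."}
```

With 57 English-centric pairs, pivoting over the shared English side is what expands coverage to the 1,446 language pairs mentioned above.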

Machine translation
Mirroring the interplay of data creation and model building in the LLM track, HPLT has worked intensely on the development and evaluation of new translation models for 100 language pairs, combined with novel infrastructures for automated training at scale and integration of benchmarking results into the OPUS dashboard. A special focus is set on efficiency, emphasising the need for compact translation models that can run locally on edge devices. Specialised models that are several orders of magnitude smaller than common general-purpose language models enable fast inference without losing translation performance, and allow secure deployments that are independent of external services and online connections. Translation models trained with HPLT data show competitive performance in comparison, especially for lesser-resourced languages. To further reduce computational costs, we also developed a pipeline for systematic multilingual knowledge distillation that supports the transfer from expensive teacher models to compact student models that can be as small as 20 megabytes in size.
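One common form of teacher-to-student transfer in MT is sequence-level distillation: the student is trained on the teacher's translations of monolingual source text rather than on the original references. The article does not specify which distillation variant the HPLT pipeline uses, so the sketch below is only an illustration of that general idea; the word-lookup ‘teacher’ is a deliberately trivial placeholder for an expensive neural model.

```python
# Placeholder for an expensive teacher model's best translation of a sentence
def teacher_translate(sentence):
    lexicon = {"hello": "hei", "world": "maailma"}
    return " ".join(lexicon.get(w, w) for w in sentence.split())

def build_distillation_corpus(monolingual_source):
    # sequence-level distillation: pair each source sentence with the
    # teacher's output; the compact student is then trained on these pairs
    return [(src, teacher_translate(src)) for src in monolingual_source]

corpus = build_distillation_corpus(["hello world", "hello"])
assert corpus == [("hello world", "hei maailma"), ("hello", "hei")]
```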
Computational infrastructure
All work in HPLT has been exceedingly compute- and storage-intensive, made possible through a combination of resources covered by the project grant and of additional substantial resources allocated to consortium members from national (Czech, Finnish, and Norwegian) quotas and through the EuroHPC system. ‘Bulk’ storage for very large-scale web data, in total close to 21 petabytes, was distributed over facilities in the Czech Republic (CESNET), Norway (Sigma2), and Finland (LUMI). Exclusive access to dedicated compute nodes tightly integrated with the storage systems made possible a first stage of lightweight document and metadata extraction (see Fig. 1), reducing the data volume for further processing by about a factor of three.
In addition to some experimentation on national superclusters, the EuroHPC LUMI system served as the main ‘workhorse’ for HPLT, where the consortium used combined allocations of around 60 million CPU and about 11.5 million GPU hours over the 40-month project duration, which is the theoretical equivalent – on average – of more than 2,000 active CPUs at all times.
Please Note: This is a Commercial Profile
References
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67
- Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin A Raffel, Leandro Von Werra, Thomas Wolf, et al. 2024. The FineWeb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849
- Colin Raffel, Martin Jaggi, Leandro Von Werra, and Thomas Wolf. 2025. FineWeb2: One pipeline to scale them all – adapting pre-training data processing to every language
- Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. 2023. MADLAD-400: A multilingual and document-level large audited dataset. In Proceedings of the 37th International Conference on Neural Information Processing Systems, Red Hook, NY, USA
- Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. 2025. Nemotron-CC: Transforming Common Crawl into a refined long-horizon pretraining dataset. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2459–2475, Vienna, Austria
- Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, and Jörg Tiedemann. 2024. A new massive multilingual dataset for high-performance language technologies. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 1116–1128, Torino, Italia
- Mark Rofin, Vladislav Mikhailov, Mikhail Florinsky, Andrey Kravchenko, Tatiana Shavrina, Elena Tutubalina, Daniel Karabekyan, and Ekaterina Artemova. 2023. Vote’n’Rank: Revision of benchmarking with social choice theory. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 670–686, Dubrovnik, Croatia
- Gemma Team. 2025. Gemma 3. Google Technical Report
- Laurie Burchell, Ona De Gibert Bonet, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Nikita Moghe, Amanda Myntti, Dayyán O’Brien, Stephan Oepen, Proyag Pal, Jousia Piha, Sampo Pyysalo, Gema Ramírez-Sánchez, David Samuel, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, Tereza Vojtěchová, and Jaume Zaragoza-Bernabeu. 2025. An expanded massive multilingual dataset for high-performance language technologies (HPLT). In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17452–17485, Vienna, Austria
- Stephan Oepen, Nikolay Arefev, Mikko Aulamo, Marta Bañón, Maja Buljan, Laurie Burchell, Lucas Charpentier, Pinzhen Chen, Mariya Fedorova, Ona de Gibert, Barry Haddow, Jan Hajič, Jindřich Helcl, Andrey Kutuzov, Veronika Laippala, Zihao Li, Risto Luukkonen, Bhavitvya Malik, Vladislav Mikhailov, Amanda Myntti, Dayyán O’Brien, Lucie Poláková, Sampo Pyysalo, Gema Ramírez Sánchez, Janine Siewert, Pavel Stepachev, Jörg Tiedemann, Teemu Vahtola, Dušan Variš, Fedor Vitiugin, Tea Vojtěchov, Jaume Zaragoza. 2025. HPLT 3.0: Very Large-Scale Multilingual Resources for LLMs and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models. arXiv:2511.01066 [cs.CL]
Please note, this article will also appear in the 25th edition of our quarterly publication.
