Norway's 2 petabytes of Huawei flash storage and LLM training
Recorded: May 26, 2026, 1:15 p.m.
| Original | Summarized |
By Chris Mellor, Blocks & Files editor. Published May 22, 2026.
Norway’s National Library is developing a large language model (LLM) that understands the Norwegian language and is using 2 PB of Huawei OceanStor Dorado flash storage in its AI training data pipeline.
Marius Husnes, the Head of IT Platform at the library (Nasjonalbiblioteket), discussed the project at Huawei’s ID Forum 2026 in Paris, saying that no commercial LLM provider was developing a local (Norwegian) language LLM. He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage, as a globally trained, English-speaking LLM would not know about that country’s history, news and culture as described in the local language.
Norway’s Ministry of Culture tasked the National Library with building a sovereign AI (LLM) as the library has the single largest digital collection of Norwegian books, newspapers, web pages and so forth in the country. Like many state libraries, it is entitled to receive copies of every published book and all broadcast content. Its legal deposit mandate extended beyond books, as it was duty-bound to collect and preserve all of Norway’s cultural heritage. An agreement with Norwegian newspapers permitted LLM training on copyrighted content and, Husnes said: “No private company has this.”
The library was also well-placed to do this as it had been digitizing its collection since 2005 and had amassed 20 PB of unique data stored in 3-2-1 form (3 copies, 2 media types, 1 off-site), meaning some 60 PB overall. The digitization process for the raw text, sound, moving pictures, still images and web content involved much OCR scanning and generated a lot of metadata, as well as APIs for online access. The bulk of the data was deposited in a digital disk plus tape archive, a preservation system.
Husnes’ task was to get this data to the LLM training system. He said the bottleneck was not compute; it was data quality, cleaning and pipeline throughput. There were two main processing stages. First there was in-house computation, using an Nvidia DGX H200 system, a 384-core CPU cluster and multiple Huawei OceanStor Dorado all-flash arrays, totalling 2 PB of flash capacity. This is low-latency storage for the data pipelines and training preparation.
[Image: Husnes - training national LLM]
The pipeline has data ingestion, cleaning, deduplication, format normalization, validation and preparation steps.
Once the data has passed through the pipeline it is sent to Norway’s national supercomputer, the Sigma2 Olivia system, for the actual training runs. The Olivia system is an HPE Cray Supercomputing EX system with 448 GPUs and 64,512 CPU cores. It uses a 5.3 PB Cray ClusterStor E1000 storage system.
One large problem area has been reconciling two different sets of storage requirements. The 60 PB preservation system is optimized for durability and cost, not fast IO, and has a high read latency, being designed for infrequent access. The AI pipeline storage is designed for high-throughput, low-latency, parallel data IO. Husnes said he learnt that nobody was talking about the problems involved in moving PB-scale datasets from an archive to, and through, an AI data pipeline system. His team had to find out how to do it themselves.
[Image: Husnes - preservation and AI pipeline storage]
The LLM training is ongoing and he finished his talk with a summary of what his team is still learning about.
Our takeaways here are, one, that Huawei storage is playing a serious and significant role in the European market and, two, that any country developing a sovereign, local-language LLM would do well to consult with Husnes and get acquainted with what’s involved.
As Husnes put it: Norway is a small country solving a problem every non-English-speaking nation will face: how do you build AI that reflects your language, your culture and your history? AI needs custodians, not just builders.
|
Norway’s National Library is developing a sovereign large language model (LLM) tailored to the Norwegian language, using two petabytes of Huawei OceanStor Dorado flash storage in its AI training data pipeline. The initiative stems from the recognition that a nation lacking its own sovereign LLM is disadvantaged when relying on globally trained, English-speaking models that may not contain specific historical, cultural, or regional knowledge. The library, which holds the largest digital collection of Norwegian books, newspapers, and other cultural heritage, was tasked by the Ministry of Culture with building this sovereign AI, drawing on its extensive digitized materials, which amount to approximately 20 petabytes of unique data preserved under a 3-2-1 backup strategy. An agreement with Norwegian newspapers permits LLM training on this copyrighted content.
The process required handling massive datasets from archiving to AI training, presenting significant storage infrastructure challenges. The data processing pipeline involved two primary stages: in-house computation and downstream training. For the initial data pipelines and preparation phases, the team used an Nvidia DGX H200 system alongside multiple Huawei OceanStor Dorado all-flash arrays, providing two petabytes of low-latency storage for data movement. The pipeline encompassed data ingestion, cleaning, deduplication, format normalization, validation, and preparation before the data was sent to Norway’s national supercomputer, the Sigma2 Olivia system, which uses a Cray ClusterStor E1000 storage system, for the actual training runs.
A major technical hurdle identified by Marius Husnes, Head of IT Platform at the library, was reconciling the disparate requirements of the storage systems. The 60-petabyte preservation system is optimized for durability and cost, which means high read latency and a design suited to infrequent access. In contrast, the AI pipeline storage required high-throughput, low-latency, parallel data IO. Husnes noted that the practical difficulty lay in moving petabyte-scale datasets from the archival preservation system to, and through, the high-performance AI data pipeline system, a problem he said was not commonly discussed.
The ongoing training project revealed further challenges in evaluation, governance, and orchestration. There are currently no standardized tools for objectively assessing the quality of a sovereign Norwegian LLM, prompting the team to develop its own evaluation methods that account for the language's multifaceted nature, including dialects and historical linguistic changes. Institutional and political questions surrounding governance, such as access control and usage policies for the sovereign LLM, remain unresolved. Orchestrating the three distinct systems (the preservation archive, the on-premise AI environment, and the national supercomputer) so that they work together is an ongoing project.
The experience underscores that AI development needs custodianship: responsible oversight in addition to technical construction. The project also highlights the significant role of storage technology, such as Huawei's, in enabling large-scale AI initiatives within the context of national data sovereignty. |
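
The pipeline steps named in the article (cleaning, deduplication, format normalization, validation) can be illustrated with a small Python sketch. This is not the library's actual implementation, which the article does not describe. Every name here (`Document`, `clean`, `normalize`, `is_valid`, `run_pipeline`) and the 20-character threshold are hypothetical, and exact-match hashing stands in for whatever deduplication method the team really uses. The sketch also normalizes before deduplicating, an assumption made so that differently encoded copies of the same text hash identically.

```python
import hashlib
import unicodedata
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Document:
    """Hypothetical record for one ingested text item."""
    doc_id: str
    text: str


def clean(text: str) -> str:
    """Drop non-printable characters (a typical OCR artifact) and collapse whitespace."""
    kept = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    return " ".join(kept.split())


def normalize(text: str) -> str:
    """Apply Unicode NFC so that e.g. 'a' + combining ring becomes a single 'å'."""
    return unicodedata.normalize("NFC", text)


def is_valid(text: str, min_chars: int = 20) -> bool:
    """Illustrative validation rule: reject fragments shorter than min_chars."""
    return len(text) >= min_chars


def run_pipeline(docs: Iterable[Document], min_chars: int = 20) -> Iterator[Document]:
    """Stream documents through clean -> normalize -> validate -> exact dedup."""
    seen: set[str] = set()
    for doc in docs:
        text = normalize(clean(doc.text))
        if not is_valid(text, min_chars):
            continue
        # Exact-duplicate detection on case-folded text; near-duplicate
        # detection (e.g. MinHash) is out of scope for this sketch.
        digest = hashlib.sha256(text.casefold().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield Document(doc.doc_id, text)
```

Because `run_pipeline` is a generator, documents stream through one at a time rather than being loaded into memory together, which is the shape a petabyte-scale pipeline would need. At that scale the in-memory `seen` set would itself have to be replaced by something distributed or disk-backed.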