At least 47,000 duplicate images are estimated to sit in the cataloguing systems maintained by the Bibliothèque nationale de France, according to internal audits cited by archival professionals in the sector. The figure, which the BnF has been working to reduce since 2023 under its Plan numérique stratégique, represents roughly 6 percent of the photographic holdings digitised through its Gallica platform, a proportion that administrators describe as unsustainable given the storage and indexing costs involved.
The timing matters. Paris is in the middle of a sustained push to activate the cultural and civic legacy of the 2024 Summer Olympics, with institutions from the Musée de l'Histoire de l'Immigration in the 12th arrondissement to the Pavillon de l'Arsenal near the Bastille committing public funds to expand their digital public-access programmes. When the same image exists in a database under three different file names and two different metadata tags, researchers waste hours, automated search tools return degraded results, and storage bills compound. For a city running tight budgets — Paris's municipal operating budget for 2025 was set at approximately 11 billion euros — that inefficiency has a real cost.
What the Data Actually Shows
The duplicate problem is not unique to the BnF. The Paris Musées network, which groups 14 municipal museums including the Musée Carnavalet on Rue des Francs-Bourgeois and the Petit Palais on Avenue Winston Churchill, completed a cross-collection audit in late 2024. Administrators found that roughly one in twelve images uploaded during a bulk digitisation sprint between 2020 and 2022 had been ingested more than once. The problem was particularly acute in that period because institutions rushed material online during pandemic closures, often bypassing the deduplication checks that slower, on-site workflows would normally trigger.
Storage costs for uncompressed archival image files typically run between 0.03 and 0.08 euros per gigabyte per month on the cloud infrastructure that French public institutions now predominantly use, according to procurement frameworks published by the Direction interministérielle du numérique. Multiply that across hundreds of thousands of files held redundantly, and the annual waste across Paris's major cultural institutions runs into the low hundreds of thousands of euros — not a catastrophe, but money that could pay for additional digitisation of physical collections that remain entirely offline. The BnF estimates that around 30 percent of its pre-1900 photographic prints have still not been scanned at all.
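The arithmetic behind that estimate can be sketched in a few lines. The per-gigabyte price range is the one quoted above; the count of redundant files and the average file size are illustrative assumptions chosen for this example, not audited figures.

```python
def annual_redundant_storage_cost(duplicate_files, avg_file_gb, eur_per_gb_month):
    """Yearly cost in euros of keeping redundant copies in cloud storage."""
    return duplicate_files * avg_file_gb * eur_per_gb_month * 12


# Assumptions for illustration only (not from any audit):
DUPLICATE_FILES = 400_000   # redundant files across several institutions
AVG_FILE_GB = 0.75          # average size of one uncompressed archival master

# Price range quoted in the article, in euros per gigabyte per month.
low = annual_redundant_storage_cost(DUPLICATE_FILES, AVG_FILE_GB, 0.03)
high = annual_redundant_storage_cost(DUPLICATE_FILES, AVG_FILE_GB, 0.08)

print(f"{low:,.0f} to {high:,.0f} euros per year")
```

Under these assumptions the range comes out at roughly 108,000 to 288,000 euros a year. Smaller files or fewer duplicates shrink the figure proportionally.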
Fixing It: Technology, Funding and the Grand Paris Complication
Several institutions are now piloting perceptual hashing tools — software that assigns each image a visual fingerprint and flags near-identical files regardless of what they have been named or how their metadata has been entered. The Agence nationale de la cohésion des territoires has co-funded two pilot projects under its Territoires d'innovation programme, one of which is running at a documentation centre in Saint-Denis, north of Paris, where the post-Olympics urban regeneration has increased pressure to make local historical image archives publicly searchable before new construction obliterates the visual record.
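To make the idea of a visual fingerprint concrete, here is a minimal sketch of one common perceptual hashing technique, the difference hash. It is not the tool used in any of the pilots, whose software the audits do not name. It assumes each image has already been decoded into a grid of grayscale values from 0 to 255, and the function names and the duplicate threshold are hypothetical choices made for this example.

```python
def resize_nearest(pixels, width, height):
    """Shrink a 2D grid of grayscale values by nearest-neighbour sampling."""
    src_h, src_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def dhash(pixels, hash_size=8):
    """Difference hash: one bit per pair of horizontally adjacent thumbnail pixels."""
    small = resize_nearest(pixels, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, max_distance=5):
    """Flag two fingerprints as probable duplicates (threshold is an assumption)."""
    return hamming(a, b) <= max_distance


# Synthetic 64x64 test images standing in for decoded scans.
original = [[(x * 37 + y * 11) % 200 for x in range(64)] for y in range(64)]
brighter = [[value + 20 for value in row] for row in original]   # same scan, re-exposed
inverted = [[199 - value for value in row] for row in original]  # tonal inverse

h_original, h_brighter, h_inverted = dhash(original), dhash(brighter), dhash(inverted)
print(hamming(h_original, h_brighter))  # 0: brightness shift keeps every comparison
print(hamming(h_original, h_inverted))  # 64: every comparison is reversed
```

Because the fingerprint depends only on whether each pixel is brighter than its neighbour, a uniformly brightened copy produces the same hash whatever the file is called, while a tonally different image lands far away. A production pipeline would decode real files with an imaging library and tune the threshold against a hand-checked sample.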
The Grand Paris Express project adds another layer of urgency. As workers sink tunnels beneath communes including Bagneux, Villejuif, and Le Bourget, archaeological services are generating tens of thousands of new site photographs each quarter. The Institut national de recherches archéologiques préventives warned in its 2024 annual report that without standardised ingest protocols, the excavation image libraries for Grand Paris alone risk replicating the duplication problems that took a decade to accumulate in older collections.
For cultural institutions, the practical path forward involves three steps now being discussed across the sector: adopting shared metadata standards before upload rather than attempting correction after the fact; allocating a dedicated deduplication budget line, which archivists suggest should be at least 1.5 percent of any digitisation project's total cost; and scheduling quarterly cross-database reconciliation checks. Institutions that have implemented these protocols, including the Cinémathèque française on Rue de Bercy, report reducing redundant files by more than 70 percent within eighteen months of adoption. The lesson for Paris's broader digital heritage network is blunt: duplication is cheaper to prevent than to cure.
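A quarterly reconciliation check can start with something as simple as the sketch below, which groups records from several catalogues by a SHA-256 digest of the file contents and reports every group containing more than one entry. The record layout, database names and file names are hypothetical. Exact hashing of this kind only catches byte-identical copies, so it complements rather than replaces the perceptual hashing described earlier.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(records):
    """Group (database, filename, content_bytes) records by SHA-256 of the bytes.

    Returns one list of (database, filename) pairs per set of identical files.
    """
    by_digest = defaultdict(list)
    for database, filename, content in records:
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append((database, filename))
    return [entries for entries in by_digest.values() if len(entries) > 1]


# Hypothetical records; short byte strings stand in for real image files.
records = [
    ("catalogue_a", "img_001.tif", b"scan-of-street-view"),
    ("catalogue_b", "scan_final.tif", b"scan-of-street-view"),
    ("catalogue_a", "img_002.tif", b"scan-of-market-hall"),
    ("catalogue_a", "IMG001_copy.tif", b"scan-of-street-view"),
]

duplicates = find_exact_duplicates(records)
for group in duplicates:
    print(group)
```

Here the same file appears under three different names across two catalogues, and the check returns all three entries as a single group for an archivist to review.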