Conference paper: Extracting structured information from technical specification PDFs using LLMs and RAG

Published: 15/12/2025

This paper presents an end-to-end artificial intelligence pipeline for extracting structured, asset-level information on data centres from unstructured technical specification PDFs. 

Addressing the growing sustainability challenges associated with the rapid global expansion of data centres, the authors combine Retrieval-Augmented Generation (RAG) with large language models to automate the collection of key operational and environmental attributes. The study evaluates multiple open-source and proprietary language models using both quantitative RAGAS metrics and expert human validation, demonstrating that RAG-based approaches can significantly improve the scalability and transparency of data centre data collection while still requiring targeted human oversight. 
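As an illustration of the retrieve-then-generate pattern described above, here is a minimal sketch of the retrieval and prompt-building steps. It is not the authors' implementation: the function names, the bag-of-words cosine similarity (standing in for a learned embedding model), and the prompt wording are all assumptions made for this example, and the call to a language model is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question.

    Assumption: a real pipeline would use dense embeddings and a vector
    store; bag-of-words similarity keeps this sketch dependency-free.
    """
    q_vec = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda c: cosine(q_vec, Counter(tokenize(c))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, reply 'unknown'.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical chunks, as if split from a specification PDF.
chunks = [
    "The facility has a design PUE of 1.3.",
    "Cooling uses an adiabatic system.",
    "The site is located near a river.",
]
question = "What is the design PUE of the facility?"
prompt = build_prompt(question, retrieve(question, chunks, k=1))
```

The resulting prompt would then be passed to whichever language model is under evaluation. Restricting the model to retrieved context is what lets each extracted value be traced back to a source passage.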

By illustrating how the extracted data can be used to estimate facility-level carbon and water footprints, the paper lays the groundwork for a global, open database to support policymakers, investors, researchers, and civil society in assessing and managing the environmental impacts of the data centre sector.
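To give a sense of how extracted attributes might feed a footprint estimate, the sketch below uses the standard definitions of Power Usage Effectiveness (PUE, total facility energy divided by IT energy) and Water Usage Effectiveness (WUE, litres of water per kWh of IT energy). Whether the paper uses exactly this calculation is an assumption; the function name, parameters, and input values are hypothetical.

```python
def estimate_footprint(it_power_kw, pue, wue_l_per_kwh,
                       grid_kgco2_per_kwh, hours=8760):
    """Rough facility-level footprint from a few extracted attributes.

    Assumptions (not taken from the paper): constant IT load over the
    period, a single average grid carbon intensity, and on-site water
    use only.
    """
    it_energy_kwh = it_power_kw * hours
    # PUE scales IT energy up to total facility energy.
    facility_energy_kwh = it_energy_kwh * pue
    return {
        "facility_energy_kwh": facility_energy_kwh,
        "carbon_kgco2": facility_energy_kwh * grid_kgco2_per_kwh,
        # WUE is defined per kWh of IT energy, not facility energy.
        "water_litres": it_energy_kwh * wue_l_per_kwh,
    }


# Illustrative inputs only; these are not values from the paper.
result = estimate_footprint(
    it_power_kw=1000, pue=1.5, wue_l_per_kwh=1.8,
    grid_kgco2_per_kwh=0.4, hours=10,
)
```

Any attribute missing from a specification document would need a documented default or a flag for human review before such an estimate is reported.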

This paper was first presented on 4 August at the 2025 Fragile Earth workshop, part of the KDD Conference held on 3-7 August in Toronto, Canada. It was first published on OpenReview.