A benchmark of text embedding models for semantic harmonization of Alzheimer's disease cohorts

Adams, Tim; Salimi, Yasamin; Can Ay, Mehmet; Valderrama, Diego; Jacobs, Marc; Fröhlich, Holger

J Prev Alz Dis 2026;1(13)

BACKGROUND: Harmonizing diverse healthcare datasets is a challenging task due to inconsistent naming conventions. Manual harmonization is time- and resource-intensive, limiting scalability for multi-cohort Alzheimer's disease research. Large language models, or specifically text-embedding models, offer a promising solution, but their rapid development necessitates continuous, domain-specific benchmarking, especially since general established benchmarks lack clinical data harmonization use cases.
OBJECTIVES: To evaluate how different text-embedding models perform for the harmonization of clinical variables.
DESIGN AND SETTING: We created a novel benchmark to assess how well different language model embeddings can be used to harmonize cohort study metadata with an in-house Common Data Model that includes cohort-to-cohort mappings for a wide range of Alzheimer's disease cohorts. We evaluated five different state-of-the-art text-embedding models for seven different data sets in the context of Alzheimer's disease.
PARTICIPANTS: No patient data were utilized for any of the analyses, as the evaluation was based on semantic harmonization of cohort metadata only.
MEASUREMENTS: Text descriptions of variables from different modalities were included for the analyses, namely clinical, lifestyle, demographics, and imaging.
RESULTS: Our benchmark results favored different models compared to general-purpose benchmarks. This suggests that models fine-tuned for generic tasks may not translate well to real-world data harmonization, particularly in Alzheimer's disease. We propose guidelines to format metadata to facilitate manual or model-assisted data harmonization. We introduce an open-source library (https://github.com/SCAI-BIO/ADHTEB) and an interactive leaderboard (https://adhteb.scai.fraunhofer.de) to aid future model benchmarking.
CONCLUSIONS: Our findings highlight the importance of domain-specific benchmarks for clinical data harmonization in the field of Alzheimer’s disease and motivate standards for naming conventions that may support semi-automated mapping applications in the future.
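
To illustrate the kind of evaluation the abstract describes, the following minimal sketch maps cohort variable descriptions to Common Data Model concepts by cosine similarity and scores top-1 accuracy. It is not the ADHTEB library's API or the authors' scoring method: `toy_embed` is a bag-of-words stand-in for a real text-embedding model, and the `CDM` and `COHORT` entries are hypothetical examples invented for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model (assumption): bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(description, cdm_concepts, embed=toy_embed):
    """Return the CDM concept whose description is most similar to the input."""
    query = embed(description)
    return max(cdm_concepts, key=lambda name: cosine(query, embed(cdm_concepts[name])))


def top1_accuracy(cohort_variables, cdm_concepts, embed=toy_embed):
    """Fraction of cohort variables whose best match is the expected concept."""
    hits = sum(
        best_match(desc, cdm_concepts, embed) == expected
        for desc, expected in cohort_variables
    )
    return hits / len(cohort_variables)


# Hypothetical CDM concepts and cohort variable descriptions (not from the paper).
CDM = {
    "age": "age of the participant at baseline in years",
    "mmse": "mini mental state examination total score",
    "education": "years of formal education completed",
}
COHORT = [
    ("participant age in years", "age"),
    ("MMSE total score", "mmse"),
    ("education years completed", "education"),
]

if __name__ == "__main__":
    print(top1_accuracy(COHORT, CDM))
```

Passing a different `embed` callable, for example a wrapper around an actual embedding model, would let the same loop compare models on the same mappings, which is the general idea behind a harmonization benchmark.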

CITATION:
Tim Adams; Yasamin Salimi; Mehmet Can Ay; Diego Valderrama; Marc Jacobs; Holger Fröhlich (2025): A benchmark of text embedding models for semantic harmonization of Alzheimer's disease cohorts. The Journal of Prevention of Alzheimer's Disease (JPAD). https://doi.org/10.1016/j.tjpad.2025.100420

OPEN ACCESS
