A BENCHMARK OF TEXT EMBEDDING MODELS FOR SEMANTIC HARMONIZATION OF ALZHEIMER'S DISEASE COHORTS
Tim Adams, Yasamin Salimi, Mehmet Can Ay, Diego Valderrama, Marc Jacobs, Holger Fröhlich
J Prev Alz Dis 2026;1(13)
BACKGROUND: Harmonizing diverse healthcare datasets is challenging because naming conventions are inconsistent. Manual harmonization is time- and resource-intensive, which limits scalability for multi-cohort Alzheimer's disease research. Large language models, and text-embedding models in particular, offer a promising solution, but their rapid development necessitates continuous, domain-specific benchmarking, especially since established general-purpose benchmarks lack clinical data harmonization use cases.
OBJECTIVES: To evaluate how different text-embedding models perform for the harmonization of clinical variables.
DESIGN AND SETTING: We created a novel benchmark to assess how well embeddings from different language models can be used to harmonize cohort study metadata with an in-house Common Data Model that includes cohort-to-cohort mappings for a wide range of Alzheimer’s disease cohorts. We evaluated five state-of-the-art text-embedding models on seven data sets in the context of Alzheimer’s disease.
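As a rough illustration of the kind of evaluation described under DESIGN AND SETTING, the sketch below ranks Common Data Model concepts for each cohort variable by cosine similarity of their embeddings and scores the ranking with top-k accuracy. It is a minimal sketch under stated assumptions, not the ADHTEB library's API or the authors' method: the function names, the variable and concept names, the three-dimensional vectors (stand-ins for real embedding-model output), and the choice of top-k accuracy as the metric are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_concepts(query_vec, concept_vecs):
    """Return concept names sorted from most to least similar to the query."""
    return sorted(
        concept_vecs,
        key=lambda name: cosine(query_vec, concept_vecs[name]),
        reverse=True,
    )

def top_k_accuracy(variable_vecs, concept_vecs, gold, k=1):
    """Fraction of variables whose gold concept is among the k nearest concepts."""
    hits = 0
    for variable, vec in variable_vecs.items():
        if gold[variable] in rank_concepts(vec, concept_vecs)[:k]:
            hits += 1
    return hits / len(variable_vecs)

# Hypothetical toy data: in practice these vectors would come from a
# text-embedding model applied to variable and concept descriptions.
cdm_concepts = {
    "Age": [0.9, 0.1, 0.0],
    "MMSE total score": [0.1, 0.9, 0.2],
    "Hippocampal volume": [0.0, 0.2, 0.9],
}
cohort_variables = {
    "AGE_AT_VISIT": [0.8, 0.2, 0.1],
    "MMSCORE": [0.2, 0.8, 0.3],
    "HIPPO_VOL": [0.1, 0.7, 0.6],  # deliberately ambiguous embedding
}
# Hypothetical expert-curated mapping used as ground truth.
gold = {
    "AGE_AT_VISIT": "Age",
    "MMSCORE": "MMSE total score",
    "HIPPO_VOL": "Hippocampal volume",
}

top1 = top_k_accuracy(cohort_variables, cdm_concepts, gold, k=1)
top2 = top_k_accuracy(cohort_variables, cdm_concepts, gold, k=2)
print(f"top-1 accuracy: {top1:.2f}, top-2 accuracy: {top2:.2f}")
```

With these toy vectors the ambiguous third variable is matched correctly only within the top two candidates, so top-1 accuracy is 2/3 and top-2 accuracy is 1.0. Comparing such scores across embedding models, with the gold mapping held fixed, is one plausible way to rank models for this task.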
PARTICIPANTS: No patient data were utilized for any of the analyses, as the evaluation was based on semantic harmonization of cohort metadata only.
MEASUREMENTS: Text descriptions of variables from different modalities were included for the analyses, namely clinical, lifestyle, demographics, and imaging.
RESULTS: Our benchmark favored different models than general-purpose benchmarks do. This suggests that models fine-tuned for generic tasks may not translate well to real-world data harmonization, particularly in Alzheimer’s disease. We propose guidelines for formatting metadata to facilitate manual or model-assisted data harmonization. We introduce an open-source library (https://github.com/SCAI-BIO/ADHTEB) and an interactive leaderboard (https://adhteb.scai.fraunhofer.de) to aid future model benchmarking.
CONCLUSIONS: Our findings highlight the importance of domain-specific benchmarks for clinical data harmonization in the field of Alzheimer’s disease and motivate standards for naming conventions that may support semi-automated mapping applications in the future.
CITATION:
Tim Adams; Yasamin Salimi; Mehmet Can Ay; Diego Valderrama; Marc Jacobs; Holger Fröhlich (2025): A benchmark of text embedding models for semantic harmonization of Alzheimer's disease cohorts. The Journal of Prevention of Alzheimer’s Disease (JPAD). https://doi.org/10.1016/j.tjpad.2025.100420
