Do They Really Know? Evaluating Large Language Models’ Ability to Reference and Cite Oncology Guidelines
Published in International Conference on Artificial Intelligence in Medicine, 2025
Recommended citation: Belligoli, P., Bitterman, D., Miller, T. (2025). Do They Really Know? Evaluating Large Language Models’ Ability to Reference and Cite Oncology Guidelines. In: Bellazzi, R., Juarez Herrero, J.M., Sacchi, L., Zupan, B. (eds) Artificial Intelligence in Medicine. AIME 2025. Lecture Notes in Computer Science, vol 15735. Springer, Cham. https://doi.org/10.1007/978-3-031-95841-0_6
Abstract: Large language models (LLMs) hold significant promise for clinical decision support because they can generate evidence-based recommendations, particularly in complex domains like breast cancer. This study investigates whether LLMs possess specific knowledge of restricted oncology guidelines (NCCN) and open-access guidelines (ASCO and ESMO) by evaluating their performance on 50 synthetic breast cancer case vignettes. Two proprietary models (GPT-4 and Claude-3.5-Sonnet) and two open-source models (LLaMA-3.2 3B and Mistral-7B) were prompted to generate treatment recommendations and to provide the exact guideline citations on which they based those recommendations. References were manually evaluated and classified as exact matches, paraphrased, or hallucinated. Although none of the models successfully retrieved verbatim quotes, GPT-4 generated citations that reflected the content of the NCCN, ASCO, and ESMO guidelines in 90%, 64%, and 70% of cases, respectively. Claude-3.5-Sonnet performed similarly, with 80% for NCCN, 84% for ASCO, and 88% for ESMO. In contrast, LLaMA-3.2 3B showed weaker performance, with citations reflecting NCCN, ASCO, and ESMO content in 26%, 28%, and 50% of cases, respectively. Mistral-7B performed comparably to LLaMA-3.2 3B for NCCN (14%) but achieved higher rates for ASCO (68%) and ESMO (84%). As LLMs evolve, ensuring consistent output, accurate citations, and reliable referencing of clinical guidelines will be essential for their integration into clinical decision support systems.
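The percentages in the abstract come from manually assigned labels. The sketch below shows one way such labels could be tallied into per-category rates for a single model and guideline. It is an illustration only, not the authors' evaluation code: the function name `label_rates`, the label strings, and the toy input are all assumptions.

```python
from collections import Counter

# Label names are assumptions mirroring the three categories in the abstract.
LABELS = ("exact", "paraphrased", "hallucinated")


def label_rates(labels):
    """Return the percentage of vignettes assigned to each label."""
    if not labels:
        raise ValueError("no labels to summarise")
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)
    # Counter returns 0 for labels that never occur, so every category is reported.
    return {name: 100.0 * counts[name] / len(labels) for name in LABELS}


# Toy input, NOT the study's data: ten hypothetical manual labels
# for one model evaluated against one guideline.
toy_labels = ["paraphrased"] * 7 + ["hallucinated"] * 3
rates = label_rates(toy_labels)
print(rates)
```

Reporting all three categories, including those with a zero count, keeps results comparable across models and guidelines.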