Large Language Model-Derived Digital Twins for Predicting Medication Treatments in the Intensive Care Unit
Published in Am J Respir Crit Care Med, 2025
Recommended citation: M. Afshar, M.S. Tootooni, A. Mayampurath, T. Miller, M.M. Churpek, Y. Gao, D. Dligach, and B. Eslami. Large Language Model-Derived Digital Twins for Predicting Medication Treatments in the Intensive Care Unit [abstract]. Am J Respir Crit Care Med 2025;211:A7181. https://doi.org/10.1164/ajrccm.2025.211.Abstracts.A7181
Abstract:

RATIONALE: A digital twin is a computational representation of a physical object, individual, or group of individuals that uses real-time data to simulate, monitor, and optimize tasks, including decision-making and outcome prediction. Integrating digital twins into the intensive care unit (ICU) can alleviate cognitive burdens on providers and enhance decision-making. We created medical digital twins by training a large language model (LLM) using supervised fine-tuning (SFT) with Low-Rank Adapters (LoRA) on physician notes from different medical ICUs. We hypothesized that treatment recommendations are more accurate when a digital twin is trained on the same ICU specialty on which it is tested than when it is trained on other ICU specialties.

METHODS: We analyzed discharge summaries from the Medical Information Mart for Intensive Care (MIMIC-III) dataset, a publicly available ICU electronic health record with notes from the medical, cardiothoracic, and surgical ICUs. We identified all medications mentioned in ICU discharge summaries using the SparkNLP tool and masked them with a special token, creating separate datasets for training and testing our models. All models were tested on the medical ICU dataset of 1,000 notes. We performed SFT on an open-source LLM, LLaMA-3, using LoRA adapters to predict the masked medications. We also evaluated a baseline model without training in a zero-shot setting, using only an instruction establishing the ICU context (e.g., “You are a Medical ICU physician …”). Evaluation used BERTScore, which measures text similarity by comparing token embeddings, capturing meaning rather than character string overlap, and ROUGE-L, which computes the longest common subsequence between the generated and reference text. Illustrative code sketches of the masking, fine-tuning, zero-shot, and evaluation steps follow the abstract.

RESULTS: LLMs trained on medical ICU notes and then tested on medical ICU notes showed the highest performance, with a BERTScore of 0.820, outperforming those trained on other ICU specialties (Table 1). The model trained on a random sample of patients from all ICU specialties scored slightly lower at 0.813, while the cardiothoracic ICU digital twin showed the lowest performance, highlighting distinct differences in medication preferences between ICU specialties. Untrained, out-of-the-box LLMs performed the worst.

CONCLUSIONS: Our findings demonstrate the potential of LLMs for medication prediction and highlight the importance of context-specific training with supervised fine-tuning. Digital twins can lead to more effective and personalized decision-support systems, ultimately assisting the human provider. Our work is foundational in evaluating the capabilities of LLMs to act as digital twins for clinical decision support.
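The abstract does not include implementation details, so the following is a minimal sketch of the medication-masking step, assuming medication mentions have already been extracted as character spans (the abstract uses SparkNLP for the extraction itself; the `<MED>` token name and the example spans here are hypothetical placeholders).

```python
# Minimal sketch: replace extracted medication spans with a special token.
# Assumes non-overlapping (start, end) character offsets from an upstream
# NER step (SparkNLP in the abstract); token name is hypothetical.

MASK_TOKEN = "<MED>"  # hypothetical; the actual special token is not given

def mask_medications(note: str, med_spans: list[tuple[int, int]]) -> str:
    """Replace each (start, end) medication span in the note with MASK_TOKEN."""
    out, prev = [], 0
    # Process spans left to right so the surrounding text is preserved.
    for start, end in sorted(med_spans):
        out.append(note[prev:start])
        out.append(MASK_TOKEN)
        prev = end
    out.append(note[prev:])
    return "".join(out)

note = "Started metoprolol 25 mg twice daily and furosemide 40 mg daily."
spans = [(8, 18), (41, 51)]  # offsets of "metoprolol" and "furosemide"
print(mask_medications(note, spans))
# Started <MED> 25 mg twice daily and <MED> 40 mg daily.
```

Masked notes like this can then serve both as training targets (predict the hidden medication) and as held-out test items.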
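A hedged sketch of the SFT-with-LoRA setup follows, using the Hugging Face transformers and peft libraries; the model checkpoint, adapter rank, target modules, and other hyperparameters are assumptions not stated in the abstract.

```python
# Sketch of wrapping a LLaMA-3 base model with LoRA adapters for SFT.
# All hyperparameters below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed LLaMA-3 variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

lora_cfg = LoraConfig(
    r=16,                                  # adapter rank (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # common choice for Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapter weights train
```

Training only the low-rank adapter matrices keeps the trainable parameter count small, which is part of what makes fitting a separate specialty-specific twin per ICU practical; a standard Trainer (or trl's SFTTrainer) loop over the masked-note dataset would then complete the fine-tuning.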
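The untrained baseline might look like the following zero-shot prompt, built around the role instruction quoted in the abstract; the prompt wording beyond that quote, the instruct checkpoint, and the generation settings are assumptions.

```python
# Sketch of the zero-shot baseline: no fine-tuning, only an ICU-role
# instruction. Prompt format and decoding settings are assumptions.
from transformers import pipeline

generator = pipeline("text-generation",
                     model="meta-llama/Meta-Llama-3-8B-Instruct")

prompt = (
    "You are a Medical ICU physician. Fill in the masked medication "
    "(<MED>) in the following discharge summary excerpt:\n"
    "Started <MED> 25 mg twice daily for rate control."
)
print(generator(prompt, max_new_tokens=20)[0]["generated_text"])
```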
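The two metrics named in the abstract can be computed with the bert-score and rouge-score Python packages; the package choice and the toy candidate/reference pair are assumptions, shown only to illustrate what each metric compares.

```python
# Sketch of the evaluation step: BERTScore (embedding similarity) and
# ROUGE-L (longest common subsequence overlap). Toy strings are illustrative.
from bert_score import score as bert_score
from rouge_score import rouge_scorer

candidates = ["metoprolol 25 mg twice daily"]
references = ["metoprolol tartrate 25 mg twice daily"]

# BERTScore: compares token embeddings, so it credits meaning-level
# similarity rather than exact character-string overlap.
P, R, F1 = bert_score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")

# ROUGE-L: longest common subsequence between generated and reference text.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print(scorer.score(references[0], candidates[0])["rougeL"].fmeasure)
```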