Publications

LCD benchmark: long clinical document benchmark on mortality prediction for language models

Published in Journal of the American Medical Informatics Association, 2024

Abstract:
Objectives: The application of natural language processing (NLP) in the clinical domain is important due to the rich unstructured information in clinical documents, which often remains inaccessible in structured data. When applying NLP methods to a certain domain, the role of benchmark datasets is crucial as benchmark datasets not only guide the selection of best-performing models but also enable the assessment of the reliability of the generated outputs. Despite the recent availability of language models capable of longer context, benchmark datasets targeting long clinical document classification tasks are absent.

Materials and Methods: To address this issue, we propose the Long Clinical Document (LCD) benchmark, a benchmark for the task of predicting 30-day out-of-hospital mortality using discharge notes from the Medical Information Mart for Intensive Care IV and statewide death data. We evaluated this benchmark dataset using baseline models ranging from bag-of-words and convolutional neural networks to instruction-tuned large language models. Additionally, we provide a comprehensive analysis of the model outputs, including manual review and visualization of model weights, to offer insights into their predictive capabilities and limitations.

Results: In terms of F1, the best-performing supervised baseline reached 28.9% and GPT-4 reached 32.2%. Notes in our dataset have a median word count of 1687.

Discussion: Our analysis of the model outputs showed that our dataset is challenging for both models and human experts, but the models can find meaningful signals from the text.

Conclusion: We expect our LCD benchmark to be a resource for the development of advanced supervised models, or prompting methods, tailored for clinical text.

GitHub repository for data processing and LLM experiments:
https://github.com/Machine-Learning-for-Medical-Language/long-clinical-doc

Recommended citation:

WonJin Yoon, Shan Chen, Yanjun Gao, Zhanzhan Zhao, Dmitriy Dligach, Danielle S Bitterman, Majid Afshar, Timothy Miller, LCD benchmark: long clinical document benchmark on mortality prediction for language models. Journal of the American Medical Informatics Association, 2024, ocae287, https://doi.org/10.1093/jamia/ocae287

Generalizable clinical note section identification with large language models

Published in JAMIA Open, 2024

Objectives: Clinical note section identification helps locate relevant information and could be beneficial for downstream tasks such as named entity recognition. However, traditional supervised methods suffer from transferability issues. This study proposes a new framework for using large language models (LLMs) for section identification to overcome these limitations.

Recommended citation:

Weipeng Zhou, Timothy Miller. 2024. Generalizable clinical note section identification with large language models, JAMIA Open, Volume 7, Issue 3, October 2024, ooae075, https://doi.org/10.1093/jamiaopen/ooae075

Cumulus: a federated electronic health record-based learning system powered by Fast Healthcare Interoperability Resources and artificial intelligence

Published in Journal of the American Medical Informatics Association, 2024

Abstract:

Objective:
To address challenges in large-scale electronic health record (EHR) data exchange, we sought to develop, deploy, and test an open source, cloud-hosted app “listener” that accesses standardized data across the SMART/HL7 Bulk FHIR Access application programming interface (API).

Methods:
We advance a model for scalable, federated, data sharing and learning. Cumulus software is designed to address key technology and policy desiderata including local utility, control, and administrative simplicity as well as privacy preservation during robust data sharing, and artificial intelligence (AI) for processing unstructured text.

Results:
Cumulus relies on containerized, cloud-hosted software, installed within a healthcare organization’s security envelope. Cumulus accesses EHR data via the Bulk FHIR interface and streamlines automated processing and sharing. The modular design enables use of the latest AI and natural language processing tools and supports provider autonomy and administrative simplicity. In an initial test, Cumulus was deployed across 5 healthcare systems, each partnered with public health. Cumulus output consists of patient counts, which were aggregated into a table stratified by variables of interest to enable population health studies. All code is available open source. A policy stipulating that only aggregate data leave the institution greatly facilitated data sharing agreements.

Discussion and Conclusion:
Cumulus addresses barriers to data sharing based on (1) federally required support for standard APIs, (2) increasing use of cloud computing, and (3) advances in AI. There is potential for scalability to support learning across myriad network configurations and use cases.

Recommended citation:

Andrew J McMurry, Daniel I Gottlieb, Timothy A Miller, James R Jones, Ashish Atreja, Jennifer Crago, Pankaja M Desai, Brian E Dixon, Matthew Garber, Vladimir Ignatov, Lyndsey A Kirchner, Philip R O Payne, Anil J Saldanha, Prabhu R V Shankar, Yauheni V Solad, Elizabeth A Sprouse, Michael Terry, Adam B Wilcox, Kenneth D Mandl, Cumulus: a federated electronic health record-based learning system powered by Fast Healthcare Interoperability Resources and artificial intelligence, Journal of the American Medical Informatics Association, Volume 31, Issue 8, August 2024, Pages 1638–1647, https://doi.org/10.1093/jamia/ocae130

Automated stratification of trauma injury severity across multiple body regions using multi-modal, multi-class machine learning models

Published in Journal of the American Medical Informatics Association, 2024

Abstract: Objective: The timely stratification of trauma injury severity can enhance the quality of trauma care, but it requires intensive manual annotation from certified trauma coders. The objective of this study is to develop machine learning models for the stratification of trauma injury severity across various body regions using clinical text and structured electronic health record (EHR) data.

Recommended citation:

Jifan Gao, Guanhua Chen, Ann P O’Rourke, John Caskey, Kyle A Carey, Madeline Oguss, Anne Stey, Dmitriy Dligach, Timothy Miller, Anoop Mayampurath, Matthew M Churpek, Majid Afshar, Automated stratification of trauma injury severity across multiple body regions using multi-modal, multi-class machine learning models, Journal of the American Medical Informatics Association, Volume 31, Issue 6, June 2024, Pages 1291–1302, https://doi.org/10.1093/jamia/ocae071

Development of a Benchmark Corpus for Medical Device Adverse Event Detection

Published in CL4Health Workshop, 2024

Abstract: The U.S. Food and Drug Administration (FDA) collects real-world adverse events, including device-associated deaths, injuries, and malfunctions, through passive reporting to the agency’s Manufacturer and User Facility Device Experience (MAUDE) database. However, this system’s full potential remains untapped given the extensive use of unstructured text in medical device adverse event reports and the lack of FDA resources and expertise to properly analyze all available data. In this work, we focus on addressing this limitation through the development of an annotated benchmark corpus to support the design and development of state-of-the-art NLP approaches towards automatic extraction of device-related adverse event information from FDA Medical Device Adverse Event Reports. We develop a dataset of labeled medical device reports from a diverse set of high-risk device types that can be used for supervised machine learning. We develop annotation guidelines and manually annotate for nine entity types. The resulting dataset contains 935 annotated adverse event reports with 12,252 annotated spans across the nine entity types. The dataset developed in this work will be made publicly available upon publication.

Recommended citation:

Susmitha Wunnava, David A. Harris, Florence T. Bourgeois, and Timothy A. Miller. 2024. Development of a Benchmark Corpus for Medical Device Adverse Event Detection. In Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024, pages 240–245, Torino, Italia. ELRA and ICCL. https://aclanthology.org/2024.cl4health-1.29

Moving Biosurveillance Beyond Coded Data Using AI for Symptom Detection From Physician Notes: Retrospective Cohort Study

Published in JMIR, 2024

Abstract: Background: Real-time surveillance of emerging infectious diseases necessitates a dynamically evolving, computable case definition, which frequently incorporates symptom-related criteria. For symptom detection, both population health monitoring platforms and research initiatives primarily depend on structured data extracted from electronic health records.

Recommended citation:

McMurry A, Zipursky A, Geva A, Olson K, Jones J, Ignatov V, Miller T, Mandl K. Moving Biosurveillance Beyond Coded Data Using AI for Symptom Detection From Physician Notes: Retrospective Cohort Study. J Med Internet Res 2024;26:e53367. URL: https://www.jmir.org/2024/1/e53367 DOI: https://doi.org/10.2196/53367

Deep Learning-Based Natural Language Processing to Automate Esophagitis Severity Grading from the Electronic Health Records

Published in International Journal of Radiation Oncology, Biology, Physics, 2023

Abstract: Radiotherapy (RT) toxicities can impair survival and quality-of-life, yet their risk factors and optimal management are under-studied. Real-world evidence holds enormous potential to improve our understanding of RT adverse events, but this information is often only documented in clinic notes and cannot, at present, be automatically extracted. To address this unmet need, we developed natural language processing (NLP) algorithms to automatically identify the presence and severity of esophagitis from notes of patients treated with thoracic RT.

Improving Model Transferability for Clinical Note Section Classification Models Using Continued Pretraining

Published in Journal of the American Medical Informatics Association (JAMIA), 2023

Objective: The classification of clinical note sections is a critical step before performing more fine-grained natural language processing tasks such as social determinants of health extraction and temporal information extraction. Often, clinical note section classification models that achieve high accuracy for 1 institution experience a large drop in accuracy when transferred to another institution. The objective of this study is to develop methods that classify clinical note sections under the SOAP (“Subjective,” “Objective,” “Assessment,” and “Plan”) framework with improved transferability.

Recommended citation:

Weipeng Zhou, Meliha Yetisgen, Yanjun Gao, Guergana Savova, and Timothy Miller. 2023. Improving Model Transferability for Clinical Note Section Classification Models Using Continued Pretraining. JAMIA, September 2023, ocad190 https://academic.oup.com/jamia/advance-article/doi/10.1093/jamia/ocad190/7277369

Improving the Transferability of Clinical Note Section Classification Models with BERT and Large Language Model Ensembles

Published in Proceedings of the 5th Clinical Natural Language Processing Workshop, 2023

Abstract: Text in electronic health records is organized into sections, and classifying those sections into section categories is useful for downstream tasks. In this work, we attempt to improve the transferability of section classification models by combining the dataset-specific knowledge in supervised learning models with the world knowledge inside large language models (LLMs). Surprisingly, we find that zero-shot LLMs out-perform supervised BERT-based models applied to out-of-domain data. We also find that their strengths are synergistic, so that a simple ensemble technique leads to additional performance gains.

Recommended citation:

Weipeng Zhou, Majid Afshar, Dmitriy Dligach, Yanjun Gao, and Timothy Miller. 2023. Improving the Transferability of Clinical Note Section Classification Models with BERT and Large Language Model Ensembles. In Proceedings of the 5th Clinical Natural Language Processing Workshop, pages 125–130, Toronto, Canada. Association for Computational Linguistics. https://aclanthology.org/2023.clinicalnlp-1.16/

Two-Stage Fine-Tuning for Improved Bias and Variance for Large Pretrained Language Models

Published in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023

Abstract: The bias-variance trade-off is the idea that learning methods need to balance model complexity with data size to minimize both under-fitting and over-fitting. Recent empirical work and theoretical analysis with over-parameterized neural networks challenge the classic bias-variance trade-off notion, suggesting that no such trade-off holds: as the width of the network grows, bias monotonically decreases while variance initially increases and then decreases. In this work, we first provide a variance decomposition-based justification criterion to examine whether large pretrained neural models in a fine-tuning setting are generalizable enough to have low bias and variance. We then perform theoretical and empirical analysis using ensemble methods explicitly designed to decrease variance due to optimization. This results in essentially a two-stage fine-tuning algorithm that first ratchets down bias and variance iteratively, and then uses a selected fixed-bias model to further reduce variance due to optimization by ensembling. We also analyze the nature of variance change with the ensemble size in low- and high-resource classes. Empirical results show that this two-stage method obtains strong results on SuperGLUE tasks and clinical information extraction tasks. Code and settings are available: https://github.com/christa60/bias-var-fine-tuning-plms.git

Recommended citation:

Lijing Wang, Yingya Li, Timothy Miller, Steven Bethard, and Guergana Savova. 2023. Two-Stage Fine-Tuning for Improved Bias and Variance for Large Pretrained Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15746–15761, Toronto, Canada. Association for Computational Linguistics. https://aclanthology.org/2023.acl-long.877/

End-to-end clinical temporal information extraction with multi-head attention

Published in The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, 2023

Abstract: Understanding temporal relationships in text from electronic health records can be valuable for many important downstream clinical applications. Since Clinical TempEval 2017, there has been little work on end-to-end systems for temporal relation extraction, with most work focused on the setting where gold standard events and time expressions are given. In this work, we make use of a novel multi-headed attention mechanism on top of a pre-trained transformer encoder to allow the learning process to attend to multiple aspects of the contextualized embeddings. Our system achieves state-of-the-art results on the THYME corpus by a wide margin, in both the in-domain and cross-domain settings.

Recommended citation:

Timothy Miller, Steven Bethard, Dmitriy Dligach, and Guergana Savova. 2023. End-to-end clinical temporal information extraction with multi-head attention. In The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 313–319, Toronto, Canada. Association for Computational Linguistics. https://aclanthology.org/2023.bionlp-1.28/

Representing and utilizing clinical textual data for real world studies: An OHDSI approach

Published in Journal of Biomedical Informatics, 2023

Abstract: Clinical documentation in electronic health records contains crucial narratives and details about patients and their care. Natural language processing (NLP) can unlock the information conveyed in clinical notes and reports, and thus plays a critical role in real-world studies. The NLP Working Group at the Observational Health Data Sciences and Informatics (OHDSI) consortium was established to develop methods and tools to promote the use of textual data and NLP in real-world observational studies. In this paper, we describe a framework for representing and utilizing textual data in real-world evidence generation, including representations of information from clinical text in the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), the workflow and tools that were developed to extract, transform and load (ETL) data from clinical notes into tables in OMOP CDM, as well as current applications and specific use cases of the proposed OHDSI NLP solution at large consortia and individual institutions with English textual data. Challenges faced and lessons learned during the process are also discussed to provide valuable insights for researchers who are planning to implement NLP solutions in real-world studies.

Recommended citation:

Vipina K. Keloth, Juan M. Banda, Michael Gurley, Paul M. Heider, Georgina Kennedy, Hongfang Liu, Feifan Liu, Timothy Miller, Karthik Natarajan, Olga V Patterson, Yifan Peng, Kalpana Raja, Ruth M. Reeves, Masoud Rouhizadeh, Jianlin Shi, Xiaoyan Wang, Yanshan Wang, Wei-Qi Wei, Andrew E. Williams, Rui Zhang, Rimma Belenkaya, Christian Reich, Clair Blacketer, Patrick Ryan, George Hripcsak, Noémie Elhadad, Hua Xu, Representing and utilizing clinical textual data for real world studies: An OHDSI approach, Journal of Biomedical Informatics, Volume 142, 2023 https://doi.org/10.1016/j.jbi.2023.104343

Natural Language Processing Methods to Empirically Explore Social Contexts and Needs in Cancer Patient Notes

Published in JCO Clinical Cancer Informatics, 2023

Recommended citation:

Abigail Derton, Marco Guevara, Shan Chen, Shalini Moningi, David E. Kozono, Dianbo Liu, Timothy A. Miller, Guergana K. Savova, Raymond H. Mak, and Danielle S. Bitterman. 2023. Natural Language Processing Methods to Empirically Explore Social Contexts and Needs in Cancer Patient Notes. JCO Clinical Cancer Informatics 2023:7