Radiology Report Summarization With Large Language Models


Authors

Idlbi, Mahmoud

Date

2025-05-30

Type

thesis

Language

eng

Keyword

Large language models, Radiology, Clinical natural language processing, Medical text summarization, Fine-tuning


Abstract

Cancer remains a leading cause of mortality worldwide, with timely and accurate assessments of disease progression critical for optimal treatment. In routine oncology practice, repeated radiology examinations yield numerous reports—each containing detailed Findings and a more concise Impression—that clinicians must synthesize to make informed treatment decisions. Motivated by the time-intensive and error-prone nature of manually reviewing these longitudinal imaging histories, this thesis proposes a two-stage fine-tuning strategy for radiology report summarization using both decoder-only and encoder-decoder large language models (LLMs), specifically LLaMa 3.1-8B-Instruct and T5-Base. In the first stage, we fine-tune each model on MIMIC-CXR, a large publicly available dataset of chest radiographs with free-text radiology reports, to transform the Findings into an Impression. Due to its large parameter count, the LLaMa model was fine-tuned using QLoRA—a quantized adaptation approach that enables memory-efficient training—while T5-Base was fine-tuned using standard LoRA. Our results demonstrate substantial performance gains over the respective base models. For LLaMa, improvements of 174.18%, 115.57%, 129.48%, 16.08%, and 111.15% were observed in BLEU-4, ROUGE-L, METEOR, BERTScore, and F1-RadGraph, respectively. Similarly, the fine-tuned T5-Base model achieved BLEU-4, ROUGE-L, and F1-RadGraph scores of 31.25, 47.57, and 40.73—representing increases of 677.36%, 255.18%, and 337.31% over the base model. In the second stage, we refine both models on a small, custom-curated dataset of 174 Impressions and their associated summaries. The objective of this stage is to enable the model to summarize individual Impressions while emphasizing longitudinal and clinically relevant changes. This setup facilitates iterative summarization of radiology reports over time. 
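The LoRA and QLoRA adaptation used in this stage both rest on the same low-rank update idea: the pretrained weight matrix is frozen and a small trainable delta, factored as the product of two narrow matrices, is added to it. The following is a minimal numeric sketch of that computation in plain Python (an illustration of the technique only, not the thesis's training code, which would use a library such as Hugging Face PEFT):

```python
# Sketch of a LoRA-adapted linear layer. The base weight W is frozen;
# the adapter learns a rank-r delta B @ A, scaled by alpha / r, so the
# layer computes  y = W x + (alpha / r) * B (A x).

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    base = matvec(W, x)                # frozen pretrained path
    delta = matvec(B, matvec(A, x))    # low-rank adapter path
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Tiny example: 2x2 base weight with a rank-1 adapter (r = 1).
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen identity weight
A = [[1.0, 1.0]]               # r x d_in  = 1 x 2
B = [[0.5], [0.5]]             # d_out x r = 2 x 1
x = [2.0, 4.0]
y = lora_forward(W, A, B, x, alpha=1.0, r=1)   # -> [5.0, 7.0]
```

Because only A and B are trained, the number of trainable parameters shrinks from d_out * d_in to r * (d_in + d_out); QLoRA adds 4-bit quantization of the frozen base weights on top of this, which is what makes fine-tuning the 8B-parameter LLaMa model feasible on limited memory.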
The two-stage fine-tuned LLaMa model significantly outperforms its base version on this task, achieving improvements of 456.88%, 209.15%, 57.01%, 15.19%, and 35.65% across BLEU-4, ROUGE-L, METEOR, BERTScore, and F1-RadGraph, respectively. The two-stage T5 model achieved improvements of 35.16%, 20.09%, 4.39%, 3.05%, and 7.96% across BLEU-4, ROUGE-L, METEOR, BERTScore, and F1-RadGraph, respectively. While both models performed well overall, the two-stage LLaMa model outperformed the T5 model on the downstream Impression summarization task, whereas T5 achieved the best results on the MIMIC-CXR task. These findings suggest that LLaMa’s decoder-only architecture may offer advantages in the two-stage fine-tuning strategy.
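Among the reported metrics, ROUGE-L scores a generated Impression against the reference by the length of their longest common subsequence (LCS). A compact sketch of that computation (the standard LCS-based F1 formulation; the thesis's evaluation would in practice use an established library such as rouge-score, and the example strings below are hypothetical):

```python
# Hedged sketch of ROUGE-L: LCS-based precision, recall, and F1
# over whitespace tokens.

def lcs_len(a, b):
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a, 1):
        for j, tb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ta == tb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)         # LCS coverage of the reference
    precision = lcs / len(cand)     # LCS coverage of the candidate
    return 2 * precision * recall / (precision + recall)

# Hypothetical radiology-style example:
score = rouge_l_f1("no acute cardiopulmonary process",
                   "no acute process")   # LCS = 3 tokens -> F1 = 6/7
```

Unlike BLEU-4's contiguous n-gram matching, the LCS allows gaps, so reordered but clinically faithful summaries are penalized less; F1-RadGraph goes further still by comparing extracted clinical entities and relations rather than surface tokens.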
