Radiology Report Summarization With Large Language Models
Authors
Idlbi, Mahmoud
Date
2025-05-30
Type
thesis
Language
eng
Keyword
Large language models, Radiology, Clinical natural language processing, Medical text summarization, Fine-tuning
Abstract
Cancer remains a leading cause of mortality worldwide, with timely and accurate assessments of disease progression critical for optimal treatment. In routine oncology practice, repeated radiology examinations yield numerous reports—each containing detailed Findings and a more concise Impression—that clinicians must synthesize to make informed treatment decisions. Motivated by the time-intensive and error-prone nature of manually reviewing these longitudinal imaging histories, this thesis proposes a two-stage fine-tuning strategy for radiology report summarization using both decoder-only and encoder-decoder large language models (LLMs), specifically LLaMa 3.1-8B-Instruct and T5-Base.
In the first stage, we fine-tune each model on MIMIC-CXR, a large publicly available dataset of chest radiographs with free-text radiology reports, to transform the Findings into an Impression. Because of its large parameter count, the LLaMa model is fine-tuned using QLoRA—a quantized adaptation approach that enables memory-efficient training—while T5-Base is fine-tuned using standard LoRA. Our results demonstrate substantial performance gains over the respective base models. For LLaMa, improvements of 174.18%, 115.57%, 129.48%, 16.08%, and 111.15% were observed in BLEU-4, ROUGE-L, METEOR, BERTScore, and F1-RadGraph, respectively. Similarly, the fine-tuned T5-Base model achieved BLEU-4, ROUGE-L, and F1-RadGraph scores of 31.25, 47.57, and 40.73—representing increases of 677.36%, 255.18%, and 337.31% over the base model.
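The LoRA and QLoRA adapters mentioned above keep the pretrained weight matrix frozen and learn only a low-rank update, W' = W + (alpha/r) * B A. The following numpy sketch is illustrative only (not the thesis code, which uses the Hugging Face ecosystem); all variable names and dimensions here are hypothetical:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    # Frozen base weight W plus low-rank update (alpha / r) * B @ A.
    # Only A and B are trained; W stays fixed, which is what makes
    # (Q)LoRA memory-efficient for large models such as LLaMa 3.1-8B.
    return x @ (W + (alpha / r) * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 4, 2, 16      # toy dimensions, low rank r << d
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # small Gaussian init
B = np.zeros((d_out, r))                 # zero init: adapter starts as a no-op
x = rng.normal(size=(1, d_in))

# With B = 0 the adapted layer matches the frozen base layer exactly,
# so training begins from the pretrained model's behavior.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), x @ W.T)
```

QLoRA follows the same scheme but stores W in 4-bit quantized form, which is what made fine-tuning the 8B-parameter LLaMa feasible in this setting.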
In the second stage, we refine both models on a small, custom-curated dataset of 174 Impressions and their associated summaries. The objective of this stage is to enable the model to summarize individual Impressions while emphasizing longitudinal and clinically relevant changes. This setup facilitates iterative summarization of radiology reports over time. The two-stage fine-tuned LLaMa model significantly outperforms its base version on this task, achieving improvements of 456.88%, 209.15%, 57.01%, 15.19%, and 35.65% across BLEU-4, ROUGE-L, METEOR, BERTScore, and F1-RadGraph, respectively. The two-stage T5 model achieved improvements of 35.16%, 20.09%, 4.39%, 3.05%, and 7.96% across BLEU-4, ROUGE-L, METEOR, BERTScore, and F1-RadGraph, respectively. While both models performed well overall, the two-stage LLaMa model outperformed the T5 model on the downstream Impression summarization task, whereas T5 achieved the best results on the MIMIC-CXR task. These findings suggest that LLaMa’s decoder-only architecture may offer advantages in the two-stage fine-tuning strategy.
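The percentage figures reported above are plain relative improvements over each base model's score. For example, the fine-tuned T5 BLEU-4 of 31.25 at a +677.36% gain implies a base BLEU-4 of roughly 4.02. A small helper illustrating the arithmetic (the function names are hypothetical, not from the thesis):

```python
def relative_gain(base, fine_tuned):
    # Percentage improvement of a fine-tuned score over its base score.
    return (fine_tuned - base) / base * 100.0

def implied_base(fine_tuned, gain_pct):
    # Back out the base score from a fine-tuned score and a reported gain.
    return fine_tuned / (1.0 + gain_pct / 100.0)

# Fine-tuned T5 BLEU-4 of 31.25 at +677.36% implies a base of about 4.02.
print(round(implied_base(31.25, 677.36), 2))  # 4.02
print(round(relative_gain(4.02, 31.25), 1))   # 677.4
```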
