Towards Generating a Library of Metastatic Phenotypes from Structured Radiology Reports

Batch, Karen
metastases, machine learning, radiology, natural language processing (NLP), cancer, convolutional neural networks (CNN), consecutive reports, TF-IDF, computed tomography (CT)
One in four Canadians will die from cancer. The majority of cancer patients die not from their primary tumor but from the progression of metastatic disease. Among patients with metastatic disease, the patterns of disease spread are increasingly recognized as an important stratification variable for treatment selection. Historically, disease patterns have been described in autopsy series; identifying these patterns in all patients undergoing treatment may improve our understanding of the natural history of different primary cancers and the organotropism of metastatic disease, thereby facilitating better clinical reasoning and decision-making. Radiographic imaging plays a central role in identifying metastatic disease. This thesis tests the working hypothesis that a ten-year database of computed tomography (CT) radiology reports contains sufficient information for machine learning (ML) to identify metastatic phenotypes based on disease progression patterns. Natural language processing (NLP) allows structured radiology reports to be used as weak labels for semi-supervised classification of over 700,000 CT reports. This research is made possible by recent advances in the field and capitalizes on our unique access to large amounts of structured reporting data. The thesis makes two contributions: the first uses NLP models for single-report metastases detection, while the second uses deep learning models to detect metastatic disease across multiple consecutive reports. The single-report work aims to measure the frequency of metastatic disease in different organs reported on CT scans of the chest, abdomen, and pelvis over a ten-year period at a cancer center, using term frequency-inverse document frequency (TF-IDF) features and ensemble machine-learning models to make predictions. This work demonstrated that NLP can achieve high accuracy (90%-99%) in extracting metastatic disease labels for different organs from structured radiology reports of cancer patients.
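As an illustration of the TF-IDF weighting named above, the following is a minimal sketch and not the thesis pipeline. The example report fragments are invented, whitespace tokenization is a simplification, and the smoothed inverse document frequency used here is one common variant chosen as an assumption; the abstract does not state which variant or which ensemble classifier was used.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of reports that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # Term frequency times a smoothed inverse document frequency
            # (assumed variant, not taken from the thesis).
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return weights


# Invented report fragments, for illustration only.
reports = [
    "liver multiple hypodense lesions consistent with metastases",
    "liver no focal lesions",
    "lungs no nodules",
]
vectors = tfidf(reports)
```

Terms that appear in few reports, such as "metastases" here, receive a higher weight than terms shared across many reports, such as "liver". Vectors of this kind could then be passed to a classifier, for example an ensemble model, to predict a per-organ metastasis label.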
The second contribution further improves the detection of metastatic disease over time from these same structured radiology reports by exposing prediction models to historical information. We use NLP to extract and encode important features from the structured text reports, which are then used to develop, train, and validate models. Three models – a simple convolutional neural network, a convolutional neural network augmented with an attention layer, and a recurrent neural network – were developed to classify the type of metastatic disease and validated against the ground truth labels. The models use features from a patient's consecutive structured text radiology reports to predict the presence of metastatic disease in those reports. This research demonstrated that neural network NLP prediction models can generate better weak labels for semi-supervised classification of CT reports when exposed to consecutive reports through a patient's treatment history. Our results suggest that NLP models can extract cancer progression patterns from multiple consecutive radiology reports and predict the presence of metastatic disease in multiple organs with high performance. This is the first time that NLP has been applied to study metastatic progression. We demonstrate a promising automated approach that labels large numbers of radiology reports in a time- and cost-effective manner without involving human experts and enables tracking of cancer progression over time.
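The idea of exposing a model to a patient's consecutive reports can be sketched as follows. This is only an illustration under stated assumptions: the field names (`patient_id`, `date`, `text`, `label`), the window size, and the example records are all hypothetical, and the abstract does not describe how reports were actually grouped or encoded.

```python
from collections import defaultdict


def build_sequences(reports, window=3):
    """Pair each report with up to `window - 1` preceding reports of the same patient.

    Hypothetical helper; the label of a sequence is the label of its latest report.
    """
    by_patient = defaultdict(list)
    for report in reports:
        by_patient[report["patient_id"]].append(report)

    sequences = []
    for patient_id, history in by_patient.items():
        # ISO-formatted dates sort chronologically as plain strings.
        history.sort(key=lambda r: r["date"])
        for i, current in enumerate(history):
            start = max(0, i - window + 1)
            sequences.append({
                "patient_id": patient_id,
                "texts": [r["text"] for r in history[start:i + 1]],
                "label": current["label"],
            })
    return sequences


# Invented records, for illustration only.
reports = [
    {"patient_id": "A", "date": "2015-03-01", "text": "no focal liver lesions", "label": 0},
    {"patient_id": "A", "date": "2015-09-01", "text": "new hypodense liver lesion", "label": 1},
    {"patient_id": "B", "date": "2016-01-10", "text": "lungs clear", "label": 0},
    {"patient_id": "A", "date": "2016-02-01", "text": "liver lesions increased in size", "label": 1},
]
sequences = build_sequences(reports, window=2)
```

Each resulting sequence holds the current report preceded by its most recent history, which is the kind of input a convolutional or recurrent model could consume after the text is encoded as features.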