Classification and Analysis of Osteoarthritis Using Unstructured Chart Notes and Structured Data from EMRs

Authors

Cai, Jiahao

Date

2024-08-29

Type

thesis

Language

eng

Keyword

Natural Language Processing (NLP), Unstructured and structured medical data, Machine learning, Osteoarthritis, Electronic medical records (EMR), Text analytics

Abstract

Osteoarthritis (OA) is a very common musculoskeletal condition defined by the progressive degradation of joint cartilage and the underlying bone, resulting in pain and functional limitations. Timely intervention and effective management of OA depend on efficient and accurate diagnosis. Electronic Medical Records (EMR) in primary care settings contain patients' structured historical data as well as unstructured text in the encounter chart notes. The unstructured notes are often very long, compiled from multiple patient-physician encounters, and contain medical jargon and personal data. These data pose a variety of computational challenges but contain valuable information for disease diagnosis, especially for detecting hip or knee OA. We applied methodologies including information extraction, feature selection, and word embeddings, and developed analytical pipelines using statistical machine learning (ML) and deep learning (DL) algorithms to detect the OA-affected joints (knee and hip OA, knee OA, hip OA, and other OA). To represent textual data as vectors, our methodology incorporated a range of text encodings: TF-IDF, Word2Vec, GloVe, FastText, and Bidirectional Encoder Representations from Transformers (BERT). We presented a variety of statistical ML and DL models, including Extreme Gradient Boosting (XGBoost), Random Forest (RF), Support Vector Machines (SVM), Multilayer Perceptrons (MLP), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks, Bidirectional LSTMs (Bi-LSTM), CNN-Bi-LSTM, and BERT. We trained these models on the encoded text data and evaluated whether integrating paragraph extraction, duplicate removal, and negation tagging could improve OA classification performance.
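As a rough illustration of two of the preprocessing steps named above, duplicate removal and negation tagging, the following is a minimal sketch and not the thesis's implementation. The cue list, the three-token scope window, the `NEG_` prefix, and the function names are all assumptions made for this example.

```python
import re

# Hypothetical cue list for illustration; a real system would use a fuller lexicon.
NEGATION_CUES = {"no", "not", "denies", "without", "negative"}
# Punctuation that ends a negation scope in this sketch.
SCOPE_END = re.compile(r"[.,;:!?]")


def tag_negation(text, window=3):
    """Prefix up to `window` tokens after a negation cue with NEG_."""
    tokens = re.findall(r"[A-Za-z']+|[.,;:!?]", text.lower())
    tagged, remaining = [], 0
    for tok in tokens:
        if SCOPE_END.fullmatch(tok):
            remaining = 0  # punctuation closes the scope and is dropped
            continue
        if tok in NEGATION_CUES:
            remaining = window
            tagged.append(tok)
        elif remaining > 0:
            tagged.append("NEG_" + tok)
            remaining -= 1
        else:
            tagged.append(tok)
    return " ".join(tagged)


def remove_duplicate_sentences(note):
    """Keep the first occurrence of each sentence, compared case-insensitively."""
    seen, kept = set(), []
    for sentence in re.split(r"(?<=[.!?])\s+", note.strip()):
        key = " ".join(sentence.lower().split())
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence)
    return " ".join(kept)


if __name__ == "__main__":
    note = "Knee pain on left. Knee pain on left. Patient denies hip pain."
    print(tag_negation(remove_duplicate_sentences(note)))
```

Tagging negated terms lets a downstream encoder such as TF-IDF treat "denies hip pain" differently from "hip pain", which is presumably why the step is evaluated alongside duplicate removal.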
We first trained and evaluated a Bi-LSTM model with FastText on the gold-labelled data; it achieved the highest F1 score (87.13%) and accuracy (87.97%) among all models. The trained Bi-LSTM model then predicted OA labels for the unlabelled data to produce pseudo labels, and we combined the pseudo-labelled and gold-labelled data to train a new Bi-LSTM model. This proposed method achieved an F1 score of 88.23% and an accuracy of 88.84%.
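The pseudo-labelling procedure described above can be sketched as follows. This is a hedged illustration only: `TokenCountClassifier` is a toy stand-in for the Bi-LSTM, and the confidence threshold used to filter pseudo labels is an assumption of this example, since the abstract does not say whether predictions were filtered.

```python
from collections import Counter, defaultdict


class TokenCountClassifier:
    """Toy stand-in for the Bi-LSTM: scores each label by smoothed token counts."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)
        self.labels = sorted(set(labels))
        for text, label in zip(texts, labels):
            self.counts[label].update(text.lower().split())
        return self

    def predict_with_confidence(self, text):
        tokens = text.lower().split()
        # Add-one smoothing so every label has a non-zero score.
        scores = {
            label: 1 + sum(self.counts[label][t] for t in tokens)
            for label in self.labels
        }
        best = max(self.labels, key=lambda label: scores[label])
        return best, scores[best] / sum(scores.values())


def pseudo_label_and_retrain(gold_texts, gold_labels, unlabelled_texts, threshold=0.6):
    """Train on gold data, pseudo-label unlabelled data, retrain on the union.

    `threshold` is a hypothetical confidence cut-off, not taken from the source.
    """
    teacher = TokenCountClassifier().fit(gold_texts, gold_labels)
    pseudo = []
    for text in unlabelled_texts:
        label, confidence = teacher.predict_with_confidence(text)
        if confidence >= threshold:
            pseudo.append((text, label))
    student = TokenCountClassifier().fit(
        list(gold_texts) + [text for text, _ in pseudo],
        list(gold_labels) + [label for _, label in pseudo],
    )
    return student, pseudo


if __name__ == "__main__":
    student, pseudo = pseudo_label_and_retrain(
        ["knee pain swelling", "hip stiffness groin pain"],
        ["knee OA", "hip OA"],
        ["left knee swelling", "hip groin ache", "follow up visit"],
    )
    print(pseudo)
```

In practice the toy classifier would be replaced by the actual model, and the retrained model would be evaluated on held-out gold-labelled data only, so that pseudo labels never leak into the reported metrics.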

License

Queen's University's Thesis/Dissertation Non-Exclusive License for Deposit to QSpace and Library and Archives Canada
ProQuest PhD and Master's Theses International Dissemination Agreement
Intellectual Property Guidelines at Queen's University
Copying and Preserving Your Thesis
This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner.
Attribution-NonCommercial-NoDerivatives 4.0 International
