A Privacy-Preserving Analytics Pipeline for De-Identified Primary Care Data

Authors
Pepin, Ian
Keyword
De-identification, Natural Language Processing, Protected Health Information, PHI, Secure Multi-Party Computation, MPC, Secret Sharing, Private Set Intersection, PSI, Privacy
Abstract
Data breaches in the healthcare industry are at an all-time high. The average cost of a healthcare data breach reached US$10.10 million in 2022, the highest of any industry for the 12th consecutive year [106]. Although healthcare is one of the most highly regulated industries, initial attack vectors such as phishing, compromised credentials, and insider threats remain at the root of many breaches, and these vulnerabilities pose a serious risk to individuals and organizations that share medical data with others. This research addresses the challenges of securely sharing and processing clinical text data. The research objectives include the evaluation and comparison of de-identification tools for clinical notes, and the assessment of Secure Multi-Party Computation (MPC) protocols and frameworks for performing computations on encrypted medical data. The thesis makes several contributions to the area of secure analytics of sensitive data. First, we compare the features and performance of five state-of-the-art de-identification tools for free-text clinical notes, highlighting the strengths and weaknesses of each. Next, we propose a de-identification pipeline that removes most of the manual work associated with this type of task. Finally, we build a solution based on MPC, specifically Secret Sharing, that allows multiple parties to jointly evaluate functions on their encrypted inputs without revealing the unencrypted data to anyone. We evaluate the performance of this solution against the same analysis performed on unencrypted medical data. These contributions benefit researchers and medical professionals by demonstrating the feasibility of our proposed methods for privacy-preserving data analytics.
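To illustrate the secret-sharing idea referred to in the abstract, the following minimal Python sketch shows additive secret sharing: each party's private value is split into random shares, the parties operate only on shares, and only the aggregate result is reconstructed. This is an illustrative sketch, not the thesis's actual framework or code; the field modulus, party count, and example values are assumptions chosen for clarity.

```python
import secrets

PRIME = 2**61 - 1  # field modulus for the shares (illustrative choice)

def share(value: int, n_parties: int) -> list[int]:
    """Split a private integer into n additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]

def reconstruct(shares: list[int]) -> int:
    """Recombine shares; no proper subset of shares reveals the value."""
    return sum(shares) % PRIME

# Hypothetical example: two data holders each secret-share a patient count,
# and a joint total is computed without either seeing the other's raw input.
hospital_a, hospital_b = 1200, 950
shares_a = share(hospital_a, 3)
shares_b = share(hospital_b, 3)

# Addition is local in additive secret sharing: each computing party adds
# the shares it holds, and only the summed shares are reconstructed.
sum_shares = [(a + b) % PRIME for a, b in zip(shares_a, shares_b)]
assert reconstruct(sum_shares) == hospital_a + hospital_b
```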