De-identification of clinical narratives with open source generative models
DOI:
https://doi.org/10.59681/2175-4411.v16.iEspecial.2024.1365
Keywords:
Artificial Intelligence, Natural Language Processing, Medical Records
Abstract
Objectives: De-identifying clinical narratives is essential to protect patient privacy and ensure regulatory compliance. The task is complex because many types of entities must be de-identified and because the texts must be processed locally for security and privacy reasons. Methods: This article presents an experimental study on the de-identification of clinical narratives using open-source generative models that can be run locally. Results: We evaluated the effectiveness of five language models and compared them to GPT-4, a proprietary model. The models were assessed on precision, recall, and F-score. Our preliminary results indicate that, while GPT-4 achieved the best performance, the open-source model Llama3 by Meta demonstrated robustness and effectiveness in this task. Conclusion: This study contributes to the field by providing insights into the performance of different models in de-identifying clinical narratives.
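As an illustration of the metrics named in the abstract, the following is a minimal sketch of how precision, recall, and F-score could be computed for a de-identification system. It rests on assumptions not stated in the abstract: entities are compared by exact match on a hypothetical (start, end, label) tuple, and the F-score is taken to be the balanced F1. The function name, the span format, and the sample data are illustrative only and do not come from the study.

```python
from typing import Set, Tuple

# Hypothetical span format: (start offset, end offset, entity label).
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level scores under exact matching (an assumption of this sketch)."""
    # A prediction counts as correct only if offsets and label both match.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-score (F1): harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations for one narrative, for illustration only.
gold = {(0, 10, "NAME"), (25, 35, "DATE"), (50, 62, "HOSPITAL")}
predicted = {(0, 10, "NAME"), (25, 35, "DATE"), (70, 75, "PHONE")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In de-identification, recall is often the more critical of the two, since a missed entity leaves identifying information in the text, whereas a spurious one only removes extra content.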
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Submission of a paper to Journal of Health Informatics is understood to imply that it is not being considered for publication elsewhere, and that the authors' permission to publish their article in this Journal implies the exclusive authorization of the publishers to deal with all issues concerning the copyright therein. Upon submission of an article, authors will be asked to sign a Copyright Notice. Acceptance of the agreement will ensure the widest possible dissemination of information. An e-mail will be sent to the corresponding author confirming receipt of the manuscript and acceptance of the agreement.