De-identification of clinical narratives with open source generative models

Elisa Terumi Rubel Schneider; Fernando Henrique Schneider; Yohan Bonescki Gumiel; Lilian Mie Mukai Cintho; Adriana Pagano; Emerson Cabrera Paraiso; Marina de Sa Rebelo; Marco Antonio Gutierrez; Jose Eduardo Krieger; Claudia Moro

doi:10.59681/2175-4411.v16.iEspecial.2024.1365

Authors

Elisa Terumi Rubel Schneider FMUSP
Fernando Henrique Schneider FMUSP
Yohan Bonescki Gumiel FMUSP
Lilian Mie Mukai Cintho Universidade Estadual de Ponta Grossa
Adriana Pagano Universidade Federal de Minas Gerais
Emerson Cabrera Paraiso Pontifícia Universidade Católica do Paraná
Marina de Sa Rebelo FMUSP
Marco Antonio Gutierrez FMUSP
Jose Eduardo Krieger FMUSP
Claudia Moro Pontifícia Universidade Católica do Paraná

Keywords:

Artificial Intelligence, Natural Language Processing, Medical Records

Abstract

Objectives: De-identifying clinical narratives is essential to protect patient privacy and ensure regulatory compliance. The task is complex, however, because many types of entities must be de-identified and the texts must be processed locally for security and privacy reasons. Methods: This article presents an experimental study on the de-identification of clinical narratives using open-source generative models that can be run locally. Results: We evaluated the effectiveness of five language models, comparing them to GPT-4, a proprietary model. The models were assessed on precision, recall, and F-score. Our preliminary results indicate that, while GPT-4 achieved the best performance, Meta's open-source model Llama 3 proved robust and effective in this task. Conclusion: This study contributes to the field by providing insights into the performance of different models in anonymizing clinical narratives.
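
The abstract reports precision, recall, and F-score without stating how they were computed. As a minimal sketch, assuming entity-level exact matching (an entity counts as correct only when its character span and label both match the gold annotation) and the balanced F1 variant of the F-score, the three metrics could be calculated as follows. The `Entity` tuple layout, the function name, and the sample annotations are hypothetical and are not taken from the study.

```python
from typing import Set, Tuple

# Hypothetical representation: (start_offset, end_offset, label).
# Exact span-and-label matching is an assumption, not the paper's stated protocol.
Entity = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for predicted entities against gold entities."""
    true_positives = len(gold & predicted)
    # Precision: share of predicted entities that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold entities that were found.
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative annotations for one invented note (not real data).
gold = {(0, 10, "NAME"), (25, 35, "DATE"), (50, 60, "HOSPITAL")}
predicted = {(0, 10, "NAME"), (25, 35, "DATE"), (70, 80, "NAME")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In de-identification, recall is usually the more critical of the two, since a missed entity leaves identifying information in the text, whereas a false positive only removes extra content.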

Author Biographies

Elisa Terumi Rubel Schneider, FMUSP

PhD, Instituto do Coração - InCor/HC FMUSP, São Paulo (SP), Brazil

Fernando Henrique Schneider, FMUSP

BSc, Instituto do Coração - InCor/HC FMUSP, São Paulo (SP), Brazil

Yohan Bonescki Gumiel, FMUSP

PhD, Instituto do Coração - InCor/HC FMUSP, São Paulo (SP), Brazil

Lilian Mie Mukai Cintho, Universidade Estadual de Ponta Grossa

PhD, Universidade Estadual de Ponta Grossa (UEPG), Ponta Grossa (PR), Brazil

Adriana Pagano, Universidade Federal de Minas Gerais

PhD, Universidade Federal de Minas Gerais (UFMG), Belo Horizonte (MG), Brazil

Emerson Cabrera Paraiso, Pontifícia Universidade Católica do Paraná

PhD, Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba (PR), Brazil

Marina de Sa Rebelo, FMUSP

PhD, Instituto do Coração - InCor/HC FMUSP, São Paulo (SP), Brazil

Marco Antonio Gutierrez, FMUSP

PhD, Instituto do Coração - InCor/HC FMUSP, São Paulo (SP), Brazil

Jose Eduardo Krieger, FMUSP

PhD, Instituto do Coração - InCor/HC FMUSP, São Paulo (SP), Brazil

Claudia Moro, Pontifícia Universidade Católica do Paraná

PhD, Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba (PR), Brazil

Published

2024-11-19

How to Cite

Schneider, E. T. R., Schneider, F. H., Gumiel, Y. B., Cintho, L. M. M., Pagano, A., Paraiso, E. C., … Moro, C. (2024). De-identification of clinical narratives with open source generative models. Journal of Health Informatics, 16(Especial). https://doi.org/10.59681/2175-4411.v16.iEspecial.2024.1365

Issue

Vol. 16 No. Especial (2024): Congresso Brasileiro de Informática em Saúde

Section

CBIS 2024

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
