Anonymization of medical texts with natural language processing
DOI:
https://doi.org/10.59681/2175-4411.v17.2025.1227
Keywords:
data anonymization, medical records, natural language processing
Abstract
Objective: To present and evaluate an anonymization method for medical records in Portuguese, using a pre-trained named entity recognition (NER) model without fine-tuning. Method: The GLiNER (Generalist and Lightweight Model for Named Entity Recognition) model was applied to identify and mask potentially identifying information (e.g., name, age, organization, and city) in 27,540 discharge summaries (12,163 patients) from a tertiary hospital in São Paulo (2017-2023). Information loss was evaluated with ROUGE F1, BLEU-4, and BERTScore, and a human error analysis was performed on a random sample (N=400). Results: Human analysis found anonymization failures in two cases (0.50%), either of which allowed identification of the patient or the assistant. Quantitative metrics indicated that textual utility was preserved (median BERTScore: 0.76). Conclusion: The model is efficient but not perfect, highlighting the need for a hybrid anonymization approach (automatic masking plus human validation) to comply with the General Law for the Protection of Personal Data. It can be used as a step in creating the medical datasets needed for the development of natural language processing in Brazil.
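The mask-then-measure idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the entity spans are written by hand as stand-ins for what an NER model such as GLiNER might return, the `[LABEL]` placeholder format is an assumption, and the whitespace-tokenized unigram F1 is only a simplified stand-in for ROUGE-1 F1. The example note and all function names are hypothetical.

```python
from collections import Counter


def find_span(text, surface, label):
    """Locate `surface` in `text` and return a (start, end, label) span.

    Hypothetical helper: stands in for the character offsets an NER model would return.
    """
    start = text.index(surface)
    return (start, start + len(surface), label)


def mask_entities(text, spans):
    """Replace each (start, end, label) span with a [LABEL] placeholder (assumed format)."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip spans that overlap one already masked.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{label.upper()}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def unigram_f1(reference, candidate):
    """Simplified ROUGE-1-style F1 over lowercased, whitespace-split tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Invented example note; not taken from the study data.
note = "Patient Maria Silva, 54 years old, admitted in Campinas."
spans = [
    find_span(note, "Maria Silva", "name"),
    find_span(note, "54 years old", "age"),
    find_span(note, "Campinas", "city"),
]

masked = mask_entities(note, spans)
score = unigram_f1(note, masked)
print(masked)
print(round(score, 2))
```

A score near 1 would mean that masking removed little of the original wording; short notes dense with identifiers, like this one, score low because most tokens are replaced. A real evaluation would use established implementations of ROUGE, BLEU, and BERTScore rather than this simplified overlap.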
 
License
Copyright (c) 2025 Rildo Pinto da Silva, Antonio Pazin-Filho

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Submission of a paper to the Journal of Health Informatics is understood to imply that it is not being considered for publication elsewhere, and that the authors' permission to publish their article in this Journal implies the exclusive authorization of the publishers to deal with all issues concerning the copyright therein. Upon submission of an article, authors will be asked to sign a Copyright Notice. Acceptance of the agreement will ensure the widest possible dissemination of information. An e-mail will be sent to the corresponding author confirming receipt of the manuscript and acceptance of the agreement.