Developing a Transformer-based Clinical Part-of-Speech Tagger for Brazilian Portuguese
DOI: https://doi.org/10.59681/2175-4411.v15.iEspecial.2023.1086
Keywords: Natural language processing, Electronic Health Records, Deep Learning
Abstract
Electronic Health Records are a valuable source of information that can be extracted with natural language processing (NLP) tasks such as morphosyntactic word tagging. Despite significant advances in health NLP, notably the Transformer architecture, languages such as Portuguese remain underrepresented. This paper presents part-of-speech (POS) taggers for Portuguese texts, obtained by fine-tuning the BioBERTpt (clinical/biomedical) and BERTimbau (generic) models on a POS-tagged corpus. We achieved an accuracy of 0.9826, state-of-the-art for the corpus used. In addition, we performed a human-based evaluation of the trained models and of others from the literature, using authentic clinical narratives. Our clinical model achieved an accuracy of 0.8145, compared to 0.7656 for the generic model. It also showed competitive results against models trained specifically on clinical texts, which indicates that the domain of the base model affects performance in NLP tasks.
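As an illustration of what fine-tuning a BERT-style model for POS tagging and reporting token-level accuracy typically involves, the sketch below shows two common steps in plain Python: assigning each word-level tag to the first subword of that word, and computing accuracy over the unmasked positions only. This is a minimal sketch under stated assumptions, not the authors' pipeline. The function names `align_labels` and `token_accuracy`, the integer tag ids, and the "label the first subword, mask the rest" strategy are all assumptions made for illustration. The `word_ids` list mimics the output of the `word_ids()` method of Hugging Face fast tokenizers, and -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import List, Optional, Sequence

# Default ignore_index of PyTorch's CrossEntropyLoss; masked positions
# contribute neither to the loss nor to the accuracy below.
IGNORE_INDEX = -100


def align_labels(
    word_ids: Sequence[Optional[int]], word_labels: Sequence[int]
) -> List[int]:
    """Assign each word's tag id to its first subword and mask the rest.

    word_ids holds, for every subword position, the index of the word it
    came from, or None for special tokens such as [CLS] and [SEP].
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)      # special token
        elif wid != previous:
            aligned.append(word_labels[wid])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)      # continuation subword
        previous = wid
    return aligned


def token_accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of unmasked positions where the predicted tag matches."""
    pairs = [(g, p) for g, p in zip(gold, pred) if g != IGNORE_INDEX]
    if not pairs:
        return 0.0
    return sum(g == p for g, p in pairs) / len(pairs)


# Hypothetical example: three words, the second split into two subwords.
example_word_ids = [None, 0, 1, 1, 2, None]
example_word_labels = [3, 7, 5]          # made-up tag ids
example_gold = align_labels(example_word_ids, example_word_labels)
example_pred = [0, 3, 7, 9, 1, 0]        # made-up model output
example_accuracy = token_accuracy(example_gold, example_pred)
```

Masking continuation subwords keeps the metric at the word level, so a word split into many pieces is counted once. Other choices, such as labelling every subword, are also used in practice and would change the reported number slightly.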
References
Jurafsky D, Martin JH. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd ed. Prentice Hall; 2008.
Fonseca ER, Rosa JLG, Aluísio SM. Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese. Journal of the Brazilian Computer Society. 2014;21:1-14.
Aluísio S, Pelizzoni J, Marchi AR, de Oliveira L, Manenti R, Marquiafável V. An Account of the Challenge of Tagging a Reference Corpus for Brazilian Portuguese. In: Mamede NJ, Trancoso I, Baptista J, das Graças Volpe Nunes M, editors. Computational Processing of the Portuguese Language. Berlin, Heidelberg: Springer Berlin Heidelberg; 2003. p. 110-7.
Fonseca ER, Rosa JLG. Mac-Morpho revisited: towards robust part-of-speech tagging [Internet]. Proceedings. 2013 [cited 2022 Aug 09]. Available from: http://www.lbd.dcc.ufmg.br/colecoes/stil/2013/0011.pdf
Dos Santos CN, Zadrozny B. Learning character-level representations for part-of-speech tagging. ICML’14 Proc. 31st Int. Conf. Mach. Learn. 2014;32:1818–26.
Fernandes ER, Rodrigues IM, Milidiu RL. Portuguese Part-of-Speech Tagging with Large Margin Structure Learning. 2014 Brazilian Conf. Intell. Syst., IEEE; 2014, p. 25–30. doi: https://doi.org/10.1109/BRACIS.2014.16.
De Sousa RCC, Lopes H. Portuguese POS Tagging Using BLSTM Without Handcrafted Features. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 11896 LNCS, 2019, p. 120–30. doi: https://doi.org/10.1007/978-3-030-33904-3_11.
Oleynik M, Nohama P, Cancian PS, Schulz S. Performance analysis of a POS tagger applied to discharge summaries in Portuguese. Stud Health Technol Inform. 2010;160:959–63. doi: https://doi.org/10.3233/978-1-60750-588-4-959.
Ferro Antunes de Oliveira L, Oliveira L, Gumiel Y, Carvalho D, Moro C. Defining a state-of-the-art POS-tagging environment for Brazilian Portuguese clinical texts. Research on Biomedical Engineering. 2020 Jun;36.
De Oliveira LFA, Pagano A, e Oliveira LES, Moro C. Challenges in Annotating a Treebank of Clinical Narratives in Brazilian Portuguese. In: Pinheiro V, Gamallo P, Amaro R, Scarton C, Batista F, Silva D, et al., editors. Computational Processing of the Portuguese Language. Cham: Springer International Publishing; 2022. p. 90-100.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is All you Need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, et al., editors. Advances in Neural Information Processing Systems. vol. 30. Curran Associates, Inc.; 2017.
Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics; 2019. p. 4171-86.
Schneider ETR, de Souza JVA, Knafou J, Oliveira LESe, Copara J, Gumiel YB, et al. BioBERTpt - A Portuguese Neural Language Model for Clinical Named Entity Recognition. In: Proceedings of the 3rd Clinical Natural Language Processing Workshop. Online: Association for Computational Linguistics; 2020. p. 65-72.
Souza F, Nogueira R, Lotufo R. BERTimbau: Pretrained BERT Models for Brazilian Portuguese. In: Cerri R, Prati RC, editors. Intelligent Systems. Cham: Springer International Publishing; 2020. p. 403-17.
E Oliveira LES, Peters AC, da Silva AMP, Gebeluca CP, Gumiel YB, Cintho LMM, et al. SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks. Journal of Biomedical Semantics. 2022 May;13(1).
Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, et al. Transformers: State-of-the-Art Natural Language Processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics; 2020. p. 38-45.
Santos CND, Zadrozny B. Training state-of-the-art Portuguese POS taggers without handcrafted features. In: International Conference on Computational Processing of the Portuguese Language. Cham: Springer; 2014. p. 82-93.
Gumiel YB, Oliveira LES, Claveau V, Grabar N, Paraiso EC, Moro C, et al. Temporal Relation Extraction in Clinical Texts: A Systematic Review. ACM Computing Surveys (CSUR). 2021;54(7):1-36.
License
Copyright (c) 2023 Elisa Terumi Rubel Schneider, Yohan Bonescki Gumiel, Lucas Ferro Antunes de Oliveira, Carolina de Oliveira Montenegro, Laura Rubel Barzotto, Claudia Moro, Adriana Pagano, Emerson Cabrera Paraiso
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.