Developing a Transformer-based Clinical Part-of-Speech Tagger for Brazilian Portuguese

Elisa Terumi Rubel Schneider; Yohan Bonescki Gumiel; Lucas Ferro Antunes de Oliveira; Carolina de Oliveira Montenegro; Laura Rubel Barzotto; Claudia Moro; Adriana Pagano; Emerson Cabrera Paraiso

doi:10.59681/2175-4411.v15.iEspecial.2023.1086

Authors

Elisa Terumi Rubel Schneider Pontifícia Universidade Católica do Paraná - PUCPR
Yohan Bonescki Gumiel Universidade Federal de Minas Gerais - UFMG
Lucas Ferro Antunes de Oliveira Pontifícia Universidade Católica do Paraná - PUCPR
Carolina de Oliveira Montenegro Pontifícia Universidade Católica do Paraná - PUCPR
Laura Rubel Barzotto Pontifícia Universidade Católica do Paraná - PUCPR
Claudia Moro Pontifícia Universidade Católica do Paraná - PUCPR
Adriana Pagano Universidade Federal de Minas Gerais - UFMG
Emerson Cabrera Paraiso Pontifícia Universidade Católica do Paraná - PUCPR

Keywords:

Natural language processing, Electronic Health Records, Deep Learning

Abstract

Electronic Health Records are a valuable source of information that can be extracted through natural language processing (NLP) tasks such as morphosyntactic (part-of-speech, POS) tagging. Although there have been significant advances in health NLP, such as the Transformer architecture, languages such as Portuguese remain underrepresented. This paper presents POS taggers for Portuguese texts, obtained by fine-tuning the BioBERTpt (clinical/biomedical) and BERTimbau (generic) models on a POS-tagged corpus. We achieved an accuracy of 0.9826, which is state of the art for the corpus used. In addition, we performed a human evaluation of the trained models, and of other models from the literature, on authentic clinical narratives. Our clinical model achieved an accuracy of 0.8145, compared with 0.7656 for the generic model. It also showed competitive results against models trained specifically on clinical texts, which is evidence that the domain of the base model affects performance on NLP tasks.
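The accuracy figures above are presumably token-level scores, that is, the share of tokens whose predicted tag matches the reference tag. The sketch below illustrates that computation only; it is not the authors' evaluation code. The function name `token_accuracy`, the toy sentences, and the tag labels are assumptions introduced for illustration and are not taken from the paper or its corpus.

```python
from typing import Sequence


def token_accuracy(
    gold: Sequence[Sequence[str]], pred: Sequence[Sequence[str]]
) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of sentences")
    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("each sentence pair must have the same number of tags")
        for gold_tag, pred_tag in zip(gold_sent, pred_sent):
            correct += int(gold_tag == pred_tag)
            total += 1
    if total == 0:
        raise ValueError("no tokens to score")
    return correct / total


# Hypothetical toy data: two tagged sentences with placeholder tag labels.
gold_tags = [["NOUN", "ADJ", "PUNCT"], ["VERB", "NOUN"]]
pred_tags = [["NOUN", "NOUN", "PUNCT"], ["VERB", "NOUN"]]

accuracy = token_accuracy(gold_tags, pred_tags)
print(f"{accuracy:.4f}")  # 4 of 5 tags match
```

Scoring per token rather than per sentence keeps long and short narratives on the same footing, which matters for clinical notes of very uneven length.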

Author Biographies

Yohan Bonescki Gumiel, Universidade Federal de Minas Gerais - UFMG

Universidade Federal de Minas Gerais - UFMG. Laboratório de Informática Biomédica - Instituto do Coração - HC FMUSP.

Lucas Ferro Antunes de Oliveira, Pontifícia Universidade Católica do Paraná - PUCPR

Pontifícia Universidade Católica do Paraná - PUCPR. Universidade Federal de Minas Gerais - UFMG.

Published

July 20, 2023

How to Cite

Schneider, E. T. R., Gumiel, Y. B., Oliveira, L. F. A. de, Montenegro, C. de O., Barzotto, L. R., Moro, C., … Paraiso, E. C. (2023). Developing a Transformer-based Clinical Part-of-Speech Tagger for Brazilian Portuguese. Journal of Health Informatics, 15(Especial). https://doi.org/10.59681/2175-4411.v15.iEspecial.2023.1086

Issue

Vol. 15 No. Especial (2023): XIX Congresso Brasileiro de Informática em Saúde

Section

CBIS 2022

License

Copyright (c) 2023 Elisa Terumi Rubel Schneider, Yohan Bonescki Gumiel, Lucas Ferro Antunes de Oliveira, Carolina de Oliveira Montenegro, Laura Rubel Barzotto, Claudia Moro, Adriana Pagano, Emerson Cabrera Paraiso

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Anyone may read, download, copy, redistribute, and adapt the material, provided that proper credit is given to the authors and to the original source. Commercial use of the content is not permitted. If the material is modified, remixed, or used to create derivative works, those works must be distributed under the same license. The license supports broad dissemination of knowledge while ensuring that authorship is acknowledged, commercial exploitation is restricted, and derivative versions remain available under the same terms.
