Developing a Transformer-based Clinical Part-of-Speech Tagger for Brazilian Portuguese
DOI:
https://doi.org/10.59681/2175-4411.v15.iEspecial.2023.1086
Keywords:
Natural language processing, Electronic Health Records, Deep Learning
Abstract
Electronic Health Records are a valuable source of information that can be extracted through natural language processing (NLP) tasks such as morphosyntactic word tagging. Although health NLP has advanced significantly, notably with the Transformer architecture, languages such as Portuguese are still underrepresented. This paper presents part-of-speech (POS) taggers for Portuguese texts, obtained by fine-tuning the BioBERTpt (clinical/biomedical) and BERTimbau (generic) models on a POS-tagged corpus. We achieved an accuracy of 0.9826, state-of-the-art for the corpus used. In addition, we performed a human-based evaluation of the trained models and of others from the literature, using authentic clinical narratives. Our clinical model achieved an accuracy of 0.8145, compared to 0.7656 for the generic model. It also showed competitive results compared to models trained specifically on clinical texts, which is evidence of the impact that the domain of the base model has on NLP tasks.
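The accuracy figures reported in the abstract can be read as the fraction of tokens that receive the correct tag. The following is a minimal sketch of that metric, assuming token-level scoring over sentences that are already tokenized identically in the gold and predicted annotations. The function name `token_accuracy`, the toy sentences, and the tag labels are hypothetical illustrations. They are not taken from the paper or its corpus, and the authors' actual evaluation procedure may differ.

```python
from typing import Sequence


def token_accuracy(
    gold: Sequence[Sequence[str]],
    pred: Sequence[Sequence[str]],
) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of sentences")
    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("each sentence needs one predicted tag per gold tag")
        # Count positions where the predicted tag matches the gold tag.
        correct += sum(g == p for g, p in zip(gold_sent, pred_sent))
        total += len(gold_sent)
    if total == 0:
        raise ValueError("no tokens to score")
    return correct / total


# Hypothetical toy annotations: two sentences, six tokens, one tagging error.
gold_tags = [["N", "V", "ART", "N"], ["PREP", "N"]]
pred_tags = [["N", "V", "ART", "ADJ"], ["PREP", "N"]]
score = token_accuracy(gold_tags, pred_tags)
print(f"accuracy = {score:.4f}")
```

Under this definition every token carries equal weight, so long sentences contribute more to the score than short ones. A per-sentence average would be a different metric.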
License
Copyright (c) 2023 Elisa Terumi Rubel Schneider, Yohan Bonescki Gumiel, Lucas Ferro Antunes de Oliveira, Carolina de Oliveira Montenegro, Laura Rubel Barzotto, Claudia Moro, Adriana Pagano, Emerson Cabrera Paraiso
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Submission of a paper to the Journal of Health Informatics is understood to imply that it is not being considered for publication elsewhere, and that the authors' permission to publish their article(s) in this Journal implies the exclusive authorization of the publishers to deal with all issues concerning the copyright therein. Upon submission of an article, authors will be asked to sign a Copyright Notice. Acceptance of the agreement will ensure the widest possible dissemination of information. An e-mail will be sent to the corresponding author confirming receipt of the manuscript and acceptance of the agreement.