Challenges and Issues on Extracting Named Entities from Oncology Clinical Notes
DOI:
https://doi.org/10.59681/2175-4411.v15.iEspecial.2023.1097
Keywords:
Natural Language Processing, Electronic Health Records, Medical Oncology
Abstract
This article describes the annotation of a multi-institutional corpus of oncology clinical texts and the training of Named Entity Recognition (NER) models on that corpus. The models were obtained by fine-tuning BioBERTpt, a Bidirectional Encoder Representations from Transformers (BERT) model adapted to the biomedical domain in Portuguese. To study how model performance changes with the amount of training data, we trained models on incrementally larger portions of the annotated corpus and compared their results. We found that models trained on smaller but fully revised datasets performed better than models trained on larger datasets with little revision.
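The incremental-data comparison described in the abstract can be illustrated with a minimal Python sketch. It only shows one way to build nested training subsets of growing size, so that each larger subset contains all smaller ones. The function name `incremental_subsets`, the fractions, the seed, and the toy corpus are assumptions made for illustration and are not taken from the article; the fine-tuning step itself is left out.

```python
import random
from typing import List, Sequence, Tuple

# A sentence is a list of (token, BIO tag) pairs; the tags here are illustrative only.
Sentence = List[Tuple[str, str]]


def incremental_subsets(
    sentences: Sequence[Sentence],
    fractions: Sequence[float],
    seed: int = 13,
) -> List[List[Sentence]]:
    """Return nested training subsets, one per fraction, in increasing size.

    The corpus is shuffled once with a fixed seed, and each subset is a
    prefix of that shuffled list, so every larger subset contains the
    smaller ones. This keeps the comparison between models about the
    amount of data rather than about which sentences were sampled.
    """
    if any(not 0 < f <= 1 for f in fractions):
        raise ValueError("each fraction must be in the interval (0, 1]")
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    subsets = []
    for fraction in sorted(fractions):
        size = max(1, round(fraction * len(shuffled)))
        subsets.append(shuffled[:size])
    return subsets


# Toy corpus of 20 one-token sentences (hypothetical data).
corpus = [[(f"token{i}", "O")] for i in range(20)]
subsets = incremental_subsets(corpus, [0.25, 0.5, 1.0])
sizes = [len(subset) for subset in subsets]
print(sizes)
```

Each subset would then be used to fine-tune a separate model that is evaluated on the same held-out test set, so that the results differ only in the training data used.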
License
Copyright (c) 2023 Luiz Henrique Pereira Niero, João Vitor Andrioli de Souza, Luciana Martins Gomes da Silva, Yohan Bonescki Gumiel, Nícolas Henrique Borges, Gustavo Henrique Munhoz Piotto, Gustavo Giavarini, Lucas Emanuel Silva e Oliveira
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.