Corpora Analysis: Journalistic and Scientific
Keywords:
Natural Language Processing, Medical Informatics, Information Science
Abstract
Objective: This study compared two corpora, one compiled from newspapers (the Journalistic Corpus) and one from scientific papers (the Scientific Corpus), under the hypothesis that the Scientific Corpus is more appropriate for Part-of-Speech (PoS) information extraction from similar scientific texts. The aims were to analyze differences and similarities through descriptive analysis, accuracy measurement, and the independence of components in the corpora. Methods: The analysis consisted of three steps: Descriptive Analysis, Accuracy Assessment, and Pointwise Mutual Information (PMI). Results: There was an important difference in the words that do not match between the two corpora. Accuracy was higher for the Scientific Corpus (92.95%) than for the Newspaper Corpus (88.32%). The PMI values for the bigrams of the Newspaper and Scientific Corpora did not show a statistically significant difference. Conclusion: The experiments indicate that PoS information is extracted more accurately when scientific text is paired with its own domain-specific corpus rather than a generic one such as the Newspaper Corpus.
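The abstract names PMI as the third analysis step but does not state the formula or the estimation details used. As a minimal sketch, assuming the standard definition PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ) with probabilities estimated from raw unigram and bigram counts, the computation for the bigrams of a tokenized corpus could look like this. The function name `bigram_pmi` and the toy token list are illustrative assumptions, not taken from the study.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair.

    Assumption: maximum-likelihood estimates from raw counts, no smoothing
    and no frequency cutoff. The study's actual settings are not stated.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi            # joint probability of the bigram
        p_x = unigrams[w1] / n_uni     # marginal probability of the first word
        p_y = unigrams[w2] / n_uni     # marginal probability of the second word
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy example (hypothetical tokens, not data from either corpus).
toy_tokens = ["a", "b", "a", "b"]
toy_scores = bigram_pmi(toy_tokens)
```

A PMI near zero suggests the two words co-occur about as often as independence would predict, while a large positive value suggests a strong association. Comparing the distributions of such scores across two corpora would then require a separate statistical test, which this sketch does not cover.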
License
Submission of an article to the Journal of Health Informatics is understood to be exclusive, meaning that the article is not under consideration for publication in another journal. The authors' permission to publish their article in J. Health Inform. implies exclusive authorization granted to the editors to include it in the journal. Upon submitting an article, the author will be asked to give electronic consent to a Copyright Transfer Agreement. An e-mail will be sent to the corresponding author confirming receipt of the manuscript and acceptance of the Copyright Declaration.