Corpora Analysis: Journalistic and Scientific
Keywords:
Natural Language Processing, Medical Informatics, Information Science
Abstract
Objective: This study compared two corpora, one compiled from newspapers (the Journalistic Corpus, also called the Newspaper Corpus) and the other from scientific papers (the Scientific Corpus), under the hypothesis that the Scientific Corpus is more appropriate for Part-of-Speech (PoS) information extraction from similar scientific texts. The aims were to analyze differences and similarities through accuracy measurement, descriptive analysis, and the independence of components in the corpora. Methods: The analysis consisted of three steps: Descriptive Analysis, Accuracy Assessment, and Pointwise Mutual Information (PMI). Results: There was an important difference between the corpora in the words that do not occur in both. Accuracy was higher with the Scientific Corpus (92.95%) than with the Newspaper Corpus (88.32%). The PMI values calculated for the bigrams of the Newspaper and Scientific Corpora did not show a statistically significant difference. Conclusion: The experiments lead us to conclude that PoS information is extracted more accurately when a scientific text is paired with its own specific corpus rather than with a generic one such as the Newspaper Corpus.
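As an illustration of the PMI step named in the Methods, the following is a minimal sketch of how PMI can be computed for the bigrams of a tokenized text. It is not the authors' implementation: the function name `bigram_pmi`, the use of maximum-likelihood probability estimates, and the base-2 logarithm are all assumptions made for this example.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every adjacent word pair in `tokens`.

    Hypothetical helper for illustration only. Probabilities are plain
    relative frequencies (an assumption, not the paper's stated method).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi            # joint probability of the bigram
        p_x = unigrams[w1] / n_uni     # marginal probability of the first word
        p_y = unigrams[w2] / n_uni     # marginal probability of the second word
        # PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


if __name__ == "__main__":
    sample = "the patient received the treatment and the patient improved".split()
    for pair, value in sorted(bigram_pmi(sample).items(), key=lambda kv: -kv[1]):
        print(pair, round(value, 3))
```

A PMI near zero suggests that the two words co-occur about as often as independence would predict, while a high positive value suggests a strong association. Comparing the distributions of such scores across two corpora is one way to examine the independence of their components.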
License
Submission of an article to the Journal of Health Informatics is understood to be exclusive, meaning that the article is not being considered for publication in another journal. The authors' permission to publish their article in JHI implies exclusive authorization granted to the editors to include it in the journal. On submitting an article, the author will be asked to accept a Copyright Notice electronically. An e-mail will be sent to the corresponding author confirming receipt of the manuscript and acceptance of the Copyright Notice.