Análise do desempenho de extratores automáticos de candidatos a termos: proposta metodológica para tratamento de filtragem dos dados

Authors

  • Rosana de Barros Silva e Teixeira Pontifícia Universidade Católica de São Paulo (PUC-SP)

DOI:

https://doi.org/10.11606/issn.2317-9511.tradterm.2011.36765

Keywords:

Terminology, Corpus Linguistics, Computational tools, Automatic extraction of term candidates.

Abstract

This article presents one aspect of the master's dissertation entitled (Onco)mastology terms: a corpus-mediated approach (2011). It explores one of the goals that guided the study, namely, verifying the success rates of four computational tools for automatic extraction of term candidates: Corpógrafo 4.0, WordSmith Tools 3.0, e-Termos and ZExtractor. Two corpora were used in the investigation: the study corpus (MAMAtex), with a total of 563,482 words, and a reference corpus (Banco de Português 1.0), with 125,927,624 words. The first, which is specialized, consists of some of the genres of scientific discourse, scientific dissemination and instruction in (Onco)mastology, while the second, a general-language corpus, includes various genres. Two approaches were chosen to support this analysis from the theoretical and methodological standpoint: the Communicative Theory of Terminology (CABRÉ 1993) and Corpus Linguistics (SINCLAIR 1991; BERBER SARDINHA 2004, 2005). As revealed by the data, Corpógrafo 4.0 ranks highest, with 27.56% accuracy, followed by ZExtractor (26.05%), WordSmith Tools 3.0 (21.77%) and e-Termos (14.44%). Because the lists generated by the programs included thousands of words, a methodology was developed using Microsoft Office Excel 2007 to make the examination of candidates feasible, by filtering the candidates common to all the tools and those unique to each one. This partitioning of the data served as a potentially viable "methodological shortcut" for optimizing the selection of term candidates from lists processed by two or more programs.
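The filtering step described above (candidates common to all tools versus candidates unique to one tool) can be sketched with set operations. This is only an illustrative sketch, not the Excel 2007 procedure used in the study: the function name `split_candidates`, the lowercase/whitespace normalization, and the sample candidate words are all assumptions made for the example.

```python
from collections import Counter


def split_candidates(lists):
    """Partition term candidates by how many tools proposed them.

    `lists` maps a tool name to an iterable of candidate strings.
    Returns (common, unique): `common` is the set of candidates found
    by every tool; `unique` maps each tool to the candidates that no
    other tool proposed.
    """
    # Assumed normalization: trim whitespace, lowercase, drop empties.
    normalized = {
        tool: {c.strip().lower() for c in cands if c.strip()}
        for tool, cands in lists.items()
    }
    if not normalized:
        return set(), {}

    # Candidates present in every tool's list.
    common = set.intersection(*normalized.values())

    # Count in how many tools each candidate appears.
    counts = Counter(c for cands in normalized.values() for c in cands)
    unique = {
        tool: {c for c in cands if counts[c] == 1}
        for tool, cands in normalized.items()
    }
    return common, unique


# Hypothetical sample data; the candidate words are invented.
sample = {
    "Corpógrafo 4.0": ["mama", "Tumor ", "carcinoma"],
    "ZExtractor": ["mama", "tumor", "biópsia"],
    "e-Termos": ["mama", "tumor", "carcinoma", "nódulo"],
}
common, unique = split_candidates(sample)
```

Candidates that fall in neither group (proposed by some but not all tools) are left out of both results here; whether and how the study handled that middle group is not stated in the abstract.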

Author Biography

  • Rosana de Barros Silva e Teixeira, Pontifícia Universidade Católica de São Paulo (PUC-SP)
    Journalist and teacher of Portuguese. She holds a master's degree in Applied Linguistics and Language Studies from PUC-SP and is a member of GELC/CNPq. She conducts research in Terminology, *Corpus* Linguistics and Discourse Analysis. She is the author of *Glossário de Oncomastologia: um repertório de termos sobre o câncer de mama*, scheduled for release in 2012 by the publisher Olho d´Água.

Published

2011-12-04

Section

Articles

How to Cite

Teixeira, R. de B. S. e. (2011). Análise do desempenho de extratores automáticos de candidatos a termos: proposta metodológica para tratamento de filtragem dos dados. TradTerm, 18, 297-319. https://doi.org/10.11606/issn.2317-9511.tradterm.2011.36765