Evaluation of the use of different text distance metrics in an aggregated algorithm for imputing missing values through classification
Abstract
Missing values are a widespread problem in databases around the world, with
causes ranging from hardware malfunctions to non-mandatory fields in forms.
Data imputation can be defined as the use of some method to find plausible
values for those that are missing. When the missing value can be inferred from
a text-valued attribute, the problem can be treated as a classification problem
in which text documents are organized into categories that represent the
plausible missing values. This in turn requires calculating how similar one text
value is to another. The literature on this kind of problem is extensive; over
the last 25 years, statistical methods (in which similarity functions are applied
over vectors of words) have achieved good results in many areas of text mining
[38]. Additionally, topic modeling has emerged in recent years as a promising
alternative to existing methods, achieving dimensionality reduction and
incorporating a semantic factor when classifying documents [30]. This project
focuses on evaluating traditional data representation techniques and similarity
metrics (word vectors, Cosine and Jaccard) against topic modeling techniques
and probability distribution comparison (Latent Dirichlet Allocation and
Kullback-Leibler Divergence). A statistical analysis is applied to the results of
several experiments that used these metrics, both individually and combined,
to classify data sets of text documents.
At a high level, the results show that the accuracy scores achieved by using document
representations obtained through Latent Dirichlet Allocation, combined with
the relative entropy metric, were statistically similar to those obtained by using
traditional text classification techniques. Topic modeling manages to abstract
thousands of words into fewer than 60 topics for the main set of experiments. The
results also highlight drawbacks, areas for improvement and potential scenarios
where such models could achieve better performance.
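To make the metrics named in the abstract concrete, the following is a minimal Python sketch of Cosine and Jaccard similarity over word vectors and of Kullback-Leibler divergence (relative entropy) between two topic distributions. It is not the project's implementation. The function names, the whitespace tokenization, the `eps` smoothing constant and the sample inputs are all assumptions made for illustration, and the topic distributions are hand-written rather than produced by Latent Dirichlet Allocation.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    # An empty text has no direction, so treat its similarity as 0.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_similarity(a: str, b: str) -> float:
    """Size of the shared vocabulary divided by the size of the joint vocabulary."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def kl_divergence(p: list[float], q: list[float], eps: float = 1e-12) -> float:
    """Relative entropy D(p || q) in nats between two discrete distributions.

    Assumption: q is floored at eps so that zero entries do not produce
    a division by zero; terms with p_i == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Hypothetical text values of an attribute used to infer a missing one.
    print(cosine_similarity("wireless optical mouse", "optical mouse usb"))
    print(jaccard_similarity("wireless optical mouse", "optical mouse usb"))
    # Hypothetical per-document topic distributions over three topics.
    print(kl_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
```

Cosine and Jaccard are similarities (higher means closer), whereas Kullback-Leibler divergence is a dissimilarity (0 means identical distributions) and is not symmetric, so a combined score would need to put them on a common scale first.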
Description
Graduation Project (Master's in Computing), Instituto Tecnológico de Costa Rica, Escuela de Ingeniería en Computación, 2017.