Propuesta de algoritmo que combina el agrupamiento en subespacios basado en densidad y el agrupamiento basado en restricciones para la detección de grupos que incluyan atributos de interés en conjuntos de datos de alta dimensionalidad

Vallejos-Peña, Alonso

Files

propuesta_algoritmo_combina_agrupamiento_subespacios.pdf (1.36 MB)

Date

2017

Authors

Vallejos-Peña, Alonso

Publisher

Instituto Tecnológico de Costa Rica

Abstract

Cluster analysis is one of the most common data mining tasks, frequently used in finance, biology, medicine and market analysis [12]. High-dimensional data poses a challenge to traditional clustering algorithms because similarity measures lose their meaning there, which degrades the quality of the groups. As a result, subspace clustering algorithms have been proposed as an alternative; they aim to find all groups in all subspaces of the dataset [45]. Because groups are detected in lower-dimensional spaces, each group can belong to a different subspace of the original dataset [31]. Consequently, attributes that the user considers of interest can be excluded from some or all groups, which reduces the value of the result for data analysts. Improving the results and detecting more significant groups is currently considered one of the biggest areas of opportunity in the cluster analysis of high-dimensional data; in particular, taking the relevance of attributes into account in the subspace pruning logic and in group detection is an open research area [30]. This project proposes a new algorithm that combines SUBCLU [1] with constraint-based clustering algorithms [6]. It allows users to mark variables as attributes of interest based on prior domain knowledge, with the aim of directing group detection towards subspaces that include those attributes and thereby generating more meaningful groups. Using this new algorithm (SUBCLU-R), an experiment was carried out to compare the results of SUBCLU and SUBCLU-R. First, the average cohesion, separation and silhouette index were obtained for both algorithms by running multiple tests on the dataset. Then, a statistical hypothesis test was used to compare the averages and determine whether the observed differences were significant. Finally, the results were analyzed with a focus on comparing the performance of the proposed algorithm against the original SUBCLU.
The results indicate that it is possible to steer groupings towards those that include attributes of interest, thanks to the use of constrained clustering for subspace pruning. With this proposal, N - d detected subspaces (where N is the total number of detected subspaces and d is the number of attributes in the dataset) include the attribute of interest. After comparing the results of both algorithms, it was determined that SUBCLU-R detects a significantly higher percentage of groupings with the attribute of interest, while no statistically significant differences were found in the internal metrics of the groupings.
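
The idea of steering subspace search towards attributes of interest can be illustrated with a minimal sketch. This is not the SUBCLU-R implementation from the thesis, whose pruning logic is not given in this record. It assumes a simplified bottom-up candidate generation step (joining k-dimensional subspaces into (k+1)-dimensional ones) followed by a hypothetical constraint filter that keeps only candidates containing at least one attribute of interest. All function names and attribute names below are illustrative assumptions.

```python
from itertools import combinations


def generate_candidates(subspaces, k):
    """Join pairs of k-dimensional subspaces that differ in exactly one
    attribute, producing (k+1)-dimensional candidate subspaces."""
    candidates = set()
    for a, b in combinations(subspaces, 2):
        union = a | b
        if len(union) == k + 1:
            candidates.add(union)
    return candidates


def keep_with_interest(candidates, attributes_of_interest):
    """Hypothetical constraint filter: keep only the subspaces that
    contain at least one attribute of interest."""
    return {s for s in candidates if s & attributes_of_interest}


def share_with_interest(subspaces, attributes_of_interest):
    """Fraction of subspaces that contain an attribute of interest."""
    if not subspaces:
        return 0.0
    hits = sum(1 for s in subspaces if s & attributes_of_interest)
    return hits / len(subspaces)


# Illustrative attribute names (assumptions, not from the thesis dataset).
one_dim = {frozenset({"age"}), frozenset({"income"}), frozenset({"height"})}
interest = frozenset({"income"})

two_dim = generate_candidates(one_dim, k=1)
constrained = keep_with_interest(two_dim, interest)

print(sorted(sorted(s) for s in constrained))
print(share_with_interest(two_dim, interest))
print(share_with_interest(constrained, interest))
```

In this toy setting the unconstrained search yields three 2-dimensional candidates, of which two contain the attribute of interest; after the filter, every remaining candidate does. The percentage of groupings with the attribute of interest reported in the abstract is a comparison of this kind, although the thesis measures it on detected clusters rather than on raw candidates.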

Description

Graduation Project (Master's in Computing), Instituto Tecnológico de Costa Rica, Escuela de Ingeniería en Computación, 2017.

Keywords

Data mining, Algorithms, Data, Density, Computing, Research Subject Categories::TECHNOLOGY::Information technology::Computer science

URI

https://hdl.handle.net/2238/9374

Collections

Maestría en Computación
