Propuesta de algoritmo que combina el agrupamiento en subespacios basado en densidad y el agrupamiento basado en restricciones para la detección de grupos que incluyan atributos de interés en conjuntos de datos de alta dimensionalidad
Resumen
Cluster analysis is one of data mining most common tasks, used frequently in finance, biology, medicine and market analysis problems [12].
High dimensional data poses a challenge to traditional clustering algorithms, where the similarity measures are not meaningful, affecting the quality of the groups. As a result, subspace clustering algorithms have been proposed as an alternative, aiming to find all groups in all spaces of the dataset [45].
By detecting groups on lower dimensional spaces, each group can belong to different subspaces of the original dataset [31]. Therefore, attributes the user may consider of interest can be excluded in some or all groups, decreasing the value of the result for the data analysts.
Currently, the improvement of the results and the detection of more significant groups, is considered one of the biggest opportunity areas in the cluster analysis of high dimensional data, particularly, the capability to consider the relevance of attributes on the subspace pruning logic and the group detection is an open research area [30].
For this project, a new algorithm is proposed, that combines SUBCLU [1] and the constraint clustering algorithms [6] that allows the users to identify variables as attributes of interest based on prior domain knowledge, targeting to direct group detection towards spaces that include users attributes of interest, thereafter, generating more meaningful groups.
Using this new algorithm (SUBCLU-R), an experiment was executed to compare the results from SUBCLU and SUBCLU-R. In this experiment, first, the average cohesion, separation and silhouette index was obtained for both algorithms by executing multiple tests in our dataset. Then, using a statistical hypothesis test we compared the obtained averages to find out if the observed differences were significant. Finally, a result analysis was performed, focused on comparing the performance of the proposed algorithm against the original SUBCLU.
6
The results indicate that it is possible to influence groupings towards those including attributes of interest, thanks to the inclusion of constrained clustering for subspace pruning. With this proposal, N-d detected subspaces (N is the total number of detected subspaces and d the number of attributes in the dataset) include the attribute of interest. After comparing both algorithm results, it was determined that SUBCLU-R detects a significantly higher percentage of groupings with the attribute of interest, while no significant statistical differences were found for the internal metrics of the groupings.
Descripción
Proyecto de Graduación (Maestría en Computación) Instituto Tecnológico de Costa Rica, Escuela de Ingeniería en Computación, 2017.
Compartir
Métricas
Colecciones
- Maestría en Computación [107]