Abstract
An integrated framework for density-based cluster analysis, outlier detection, and data visualization is introduced in this article. The main module consists of an algorithm to compute hierarchical estimates of the level sets of a density, following Hartigan’s classic model of density-contour clusters and trees. Such an algorithm generalizes and improves existing density-based clustering techniques with respect to different aspects. It provides as a result a complete clustering hierarchy composed of all possible density-based clusters following the nonparametric model adopted, for an infinite range of density thresholds. The resulting hierarchy can be easily processed so as to provide multiple ways for data visualization and exploration. It can also be further postprocessed so that: (i) a normalized score of “outlierness” can be assigned to each data object, which unifies both the global and local perspectives of outliers into a single definition; and (ii) a “flat” (i.e., nonhierarchical) clustering solution composed of clusters extracted from local cuts through the cluster tree (possibly corresponding to different density thresholds) can be obtained, either in an unsupervised or in a semisupervised way. In the unsupervised scenario, the algorithm corresponding to this postprocessing module provides a global, optimal solution to the formal problem of maximizing the overall stability of the extracted clusters. If partially labeled objects or instance-level constraints are provided by the user, the algorithm can solve the problem by considering both constraints violations/satisfactions and cluster stability criteria. An asymptotic complexity analysis, both in terms of running time and memory space, is described. Experiments are reported that involve a variety of synthetic and real datasets, including comparisons with state-of-the-art, density-based clustering and (global and local) outlier detection methods.
Topics

No keywords indexed for this article. Browse by subject →

References
126
[6]
F. Angiulli and C. Pizzuti . 2002. Fast outlier detection in high dimensional spaces . In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD) . Helsinki, Finland. 15--26. DOI:http://dx.doi.org/10.1007/3-540-45681-3_2 F. Angiulli and C. Pizzuti. 2002. Fast outlier detection in high dimensional spaces. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD). Helsinki, Finland. 15--26. DOI:http://dx.doi.org/10.1007/3-540-45681-3_2
[8]
K. Bache and M. Lichman. 2013. UCI Machine Learning Repository. (2013). http://archive.ics.uci.edu/ml. K. Bache and M. Lichman. 2013. UCI Machine Learning Repository. (2013). http://archive.ics.uci.edu/ml.
[10]
V. Barnett and T. Lewis. 1994. Outliers in Statistical Data (3rd ed.). John Wiley & Sons. V. Barnett and T. Lewis. 1994. Outliers in Statistical Data (3rd ed.). John Wiley & Sons.
[13]
R. J. Beckman and R. D. Cook . 1983 . Outlier..........s . Technometrics 25 , 2 (1983), 119 -- 149 . R. J. Beckman and R. D. Cook. 1983. Outlier..........s. Technometrics 25, 2 (1983), 119--149.
[15]
C. M. Bishop . 2006. Pattern Recognition and Machine Learning . Springer . C. M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
[16]
S. Brecheisen , H.-P. Kriegel , P. Kröger , and M. Pfeifle . 2004. Visually mining through cluster hierarchies . In Proceedings of the 4th SIAM International Conference on Data Mining (SDM). Lake Buena Vista, FL. 400--412 . S. Brecheisen, H.-P. Kriegel, P. Kröger, and M. Pfeifle. 2004. Visually mining through cluster hierarchies. In Proceedings of the 4th SIAM International Conference on Data Mining (SDM). Lake Buena Vista, FL. 400--412.
[20]
R. J. G. B. Campello , D. Moulavi , and J. Sander . 2013a. Density-based clustering based on hierarchical density estimates . In Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) . Gold Coast, Australia. 160--172. DOI:http://dx.doi.org/10.1007/978-3-642-37456-2_14 R. J. G. B. Campello, D. Moulavi, and J. Sander. 2013a. Density-based clustering based on hierarchical density estimates. In Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Gold Coast, Australia. 160--172. DOI:http://dx.doi.org/10.1007/978-3-642-37456-2_14
[27]
X. H. Dang , I. Assent , R. T. Ng , A. Zimek , and E. Schubert . 2014. Discriminative features for identifying and interpreting outliers . In Proceedings of the 30th International Conference on Data Engineering (ICDE) , Chicago, IL. 88--99. DOI:http://dx.doi.org/10.1109/ICDE. 2014 .6816642 X. H. Dang, I. Assent, R. T. Ng, A. Zimek, and E. Schubert. 2014. Discriminative features for identifying and interpreting outliers. In Proceedings of the 30th International Conference on Data Engineering (ICDE), Chicago, IL. 88--99. DOI:http://dx.doi.org/10.1109/ICDE.2014.6816642
[32]
L. Ertöz , M. Steinbach , and V. Kumar . 2002. A new shared nearest neighbor clustering algorithm and its applications . In Workshop on Clustering High Dimensional Data and its Applications at 2nd SIAM International Conference on Data Mining. 105--115 . L. Ertöz, M. Steinbach, and V. Kumar. 2002. A new shared nearest neighbor clustering algorithm and its applications. In Workshop on Clustering High Dimensional Data and its Applications at 2nd SIAM International Conference on Data Mining. 105--115.
[33]
L. Ertöz , M. Steinbach , and V. Kumar . 2003. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data . In Proceedings of the 3rd SIAM International Conference on Data Mining (SDM) , San Francisco, CA. L. Ertöz, M. Steinbach, and V. Kumar. 2003. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In Proceedings of the 3rd SIAM International Conference on Data Mining (SDM), San Francisco, CA.
[34]
M. Ester . 2009. Density-based Clustering . In Encyclopedia of Database Systems, L. Liu and M. T. Özsu (Eds.) . Springer , 795--799. DOI:http://dx.doi.org/10.1007/978-0-387-39940-9_605 M. Ester. 2009. Density-based Clustering. In Encyclopedia of Database Systems, L. Liu and M. T. Özsu (Eds.). Springer, 795--799. DOI:http://dx.doi.org/10.1007/978-0-387-39940-9_605
[35]
M. Ester , H.-P. Kriegel , J. Sander , and X. Xu . 1996. A density-based algorithm for discovering clusters in large spatial databases with noise . In Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD) . Portland, OR. 226--231. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD). Portland, OR. 226--231.
[36]
B. S. Everitt S. Landau and M. Leese. 2001. Cluster Analysis (4th ed.). Arnold. B. S. Everitt S. Landau and M. Leese. 2001. Cluster Analysis (4th ed.). Arnold. 10.1002/9781118887486.ch6
[37]
A. Foss and O. R. Zaïane . 2002. A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets . In Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM), Maebashi City. Japan. 179--186 . A. Foss and O. R. Zaïane. 2002. A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets. In Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM), Maebashi City. Japan. 179--186.
[42]
Sample Criteria for Testing Outlying Observations

Frank E. Grubbs

The Annals of Mathematical Statistics 10.1214/aoms/1177729885
[46]
S. Hang , Z. You , and L. Y. Chun . 2009. Incorporating biological knowledge into density-based clustering analysis of gene expression data . In 6th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD) , Vol. 5 . Tianjin, China, 52--56. S. Hang, Z. You, and L. Y. Chun. 2009. Incorporating biological knowledge into density-based clustering analysis of gene expression data. In 6th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Vol. 5. Tianjin, China, 52--56.
[47]
The meaning and use of the area under a receiver operating characteristic (ROC) curve.

J A Hanley, B J McNeil

Radiology 10.1148/radiology.143.1.7063747
[48]
J. A. Hartigan . 1975. Clustering Algorithms . John Wiley & Sons , New York , London, Sydney, Toronto. J. A. Hartigan. 1975. Clustering Algorithms. John Wiley & Sons, New York, London, Sydney, Toronto.
[50]
D. Hawkins . 1980. Identification of Outliers . Chapman and Hall . D. Hawkins. 1980. Identification of Outliers. Chapman and Hall.

Showing 50 of 126 references

Metrics
673
Citations
126
References
Details
Published
Jul 22, 2015
Vol/Issue
10(1)
Pages
1-51
License
View
Funding
FAPESP Award: #2013/18698-4 and #2010/20032-6
CNPq Award: #304137/2013-8 and#201239/2012-4
NSERC
Cite This Article
Ricardo J. G. B. Campello, Davoud Moulavi, Arthur Zimek, et al. (2015). Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. ACM Transactions on Knowledge Discovery from Data, 10(1), 1-51. https://doi.org/10.1145/2733381
Related

You May Also Like

Graph evolution

Jure Leskovec, Jon Kleinberg · 2007

2,024 citations

Isolation-Based Anomaly Detection

Fei Tony Liu, Kai Ming Ting · 2012

1,600 citations

A Survey on Causal Inference

Liuyi Yao, Zhixuan Chu · 2021

376 citations

Temporal Link Prediction Using Matrix and Tensor Factorizations

Daniel M. Dunlavy, Tamara G. Kolda · 2011

351 citations