Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection

Ricardo J. G. B. Campello; Davoud Moulavi; Arthur Zimek; Jorg Sander

doi:10.1145/2733381

journal article Jul 22, 2015

Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection

Ricardo J. G. B. Campello Davoud Moulavi Arthur Zimek Jorg Sander

ACM Transactions on Knowledge Discovery from Data Vol. 10 No. 1 pp. 1-51 · Association for Computing Machinery (ACM)

View at Publisher Save 10.1145/2733381

Abstract

An integrated framework for density-based cluster analysis, outlier detection, and data visualization is introduced in this article. The main module consists of an algorithm to compute hierarchical estimates of the level sets of a density, following Hartigan’s classic model of density-contour clusters and trees. Such an algorithm generalizes and improves existing density-based clustering techniques with respect to different aspects. It provides as a result a complete clustering hierarchy composed of all possible density-based clusters following the nonparametric model adopted, for an infinite range of density thresholds. The resulting hierarchy can be easily processed so as to provide multiple ways for data visualization and exploration. It can also be further postprocessed so that: (i) a normalized score of “outlierness” can be assigned to each data object, which unifies both the global and local perspectives of outliers into a single definition; and (ii) a “flat” (i.e., nonhierarchical) clustering solution composed of clusters extracted from local cuts through the cluster tree (possibly corresponding to different density thresholds) can be obtained, either in an unsupervised or in a semisupervised way. In the unsupervised scenario, the algorithm corresponding to this postprocessing module provides a global, optimal solution to the formal problem of maximizing the overall stability of the extracted clusters. If partially labeled objects or instance-level constraints are provided by the user, the algorithm can solve the problem by considering both constraints violations/satisfactions and cluster stability criteria. An asymptotic complexity analysis, both in terms of running time and memory space, is described. Experiments are reported that involve a variety of synthetic and real datasets, including comparisons with state-of-the-art, density-based clustering and (global and local) outlier detection methods.

Topics

No keywords indexed for this article. Browse by subject →

References

126

[1]

10.1145/1150402.1150459

[2]

10.1145/2463676.2463696

[3]

10.1145/375663.375668

[4]

10.5555/1609942.1609946

[5]

10.1145/1497577.1497581

[6]

F. Angiulli and C. Pizzuti . 2002. Fast outlier detection in high dimensional spaces . In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD) . Helsinki, Finland. 15--26. DOI:http://dx.doi.org/10.1007/3-540-45681-3_2 F. Angiulli and C. Pizzuti. 2002. Fast outlier detection in high dimensional spaces. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD). Helsinki, Finland. 15--26. DOI:http://dx.doi.org/10.1007/3-540-45681-3_2

[7]

10.1145/304182.304187

[8]

K. Bache and M. Lichman. 2013. UCI Machine Learning Repository. (2013). http://archive.ics.uci.edu/ml. K. Bache and M. Lichman. 2013. UCI Machine Learning Repository. (2013). http://archive.ics.uci.edu/ml.

[9]

10.2307/2347159

[10]

V. Barnett and T. Lewis. 1994. Outliers in Statistical Data (3rd ed.). John Wiley & Sons. V. Barnett and T. Lewis. 1994. Outliers in Statistical Data (3rd ed.). John Wiley & Sons.

[11]

10.5555/1404506

[12]

10.1145/956750.956758

[13]

R. J. Beckman and R. D. Cook . 1983 . Outlier..........s . Technometrics 25 , 2 (1983), 119 -- 149 . R. J. Beckman and R. D. Cook. 1983. Outlier..........s. Technometrics 25, 2 (1983), 119--149.

[14]

10.1145/1015330.1015360

[15]

C. M. Bishop . 2006. Pattern Recognition and Machine Learning . Springer . C. M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

[16]

S. Brecheisen , H.-P. Kriegel , P. Kröger , and M. Pfeifle . 2004. Visually mining through cluster hierarchies . In Proceedings of the 4th SIAM International Conference on Data Mining (SDM). Lake Buena Vista, FL. 400--412 . S. Brecheisen, H.-P. Kriegel, P. Kröger, and M. Pfeifle. 2004. Visually mining through cluster hierarchies. In Proceedings of the 4th SIAM International Conference on Data Mining (SDM). Lake Buena Vista, FL. 400--412.

[17]

10.1145/375663.375672

[18]

10.1145/342009.335388

[19]

10.1145/1353343.1353398

[20]

R. J. G. B. Campello , D. Moulavi , and J. Sander . 2013a. Density-based clustering based on hierarchical density estimates . In Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) . Gold Coast, Australia. 160--172. DOI:http://dx.doi.org/10.1007/978-3-642-37456-2_14 R. J. G. B. Campello, D. Moulavi, and J. Sander. 2013a. Density-based clustering based on hierarchical density estimates. In Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Gold Coast, Australia. 160--172. DOI:http://dx.doi.org/10.1007/978-3-642-37456-2_14

[21]

10.1007/s10618-013-0311-4

[22]

10.1016/j.datak.2008.06.006

[23]

10.1007/s100440050011

[24]

10.1016/s0003-2670(01)83197-4

[25]

10.2307/3315985

[26]

10.1016/s0167-9473(00)00052-9

[27]

X. H. Dang , I. Assent , R. T. Ng , A. Zimek , and E. Schubert . 2014. Discriminative features for identifying and interpreting outliers . In Proceedings of the 30th International Conference on Data Engineering (ICDE) , Chicago, IL. 88--99. DOI:http://dx.doi.org/10.1109/ICDE. 2014 .6816642 X. H. Dang, I. Assent, R. T. Ng, A. Zimek, and E. Schubert. 2014. Discriminative features for identifying and interpreting outliers. In Proceedings of the 30th International Conference on Data Engineering (ICDE), Chicago, IL. 88--99. DOI:http://dx.doi.org/10.1109/ICDE.2014.6816642

[28]

10.1016/s0169-7439(01)00111-3

[29]

10.5555/3120676.3120693

[30]

10.1109/icdm.2010.151

[31]

10.1007/s10115-011-0430-4

[32]

L. Ertöz , M. Steinbach , and V. Kumar . 2002. A new shared nearest neighbor clustering algorithm and its applications . In Workshop on Clustering High Dimensional Data and its Applications at 2nd SIAM International Conference on Data Mining. 105--115 . L. Ertöz, M. Steinbach, and V. Kumar. 2002. A new shared nearest neighbor clustering algorithm and its applications. In Workshop on Clustering High Dimensional Data and its Applications at 2nd SIAM International Conference on Data Mining. 105--115.

[33]

L. Ertöz , M. Steinbach , and V. Kumar . 2003. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data . In Proceedings of the 3rd SIAM International Conference on Data Mining (SDM) , San Francisco, CA. L. Ertöz, M. Steinbach, and V. Kumar. 2003. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In Proceedings of the 3rd SIAM International Conference on Data Mining (SDM), San Francisco, CA.

[34]

M. Ester . 2009. Density-based Clustering . In Encyclopedia of Database Systems, L. Liu and M. T. Özsu (Eds.) . Springer , 795--799. DOI:http://dx.doi.org/10.1007/978-0-387-39940-9_605 M. Ester. 2009. Density-based Clustering. In Encyclopedia of Database Systems, L. Liu and M. T. Özsu (Eds.). Springer, 795--799. DOI:http://dx.doi.org/10.1007/978-0-387-39940-9_605

[35]

M. Ester , H.-P. Kriegel , J. Sander , and X. Xu . 1996. A density-based algorithm for discovering clusters in large spatial databases with noise . In Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD) . Portland, OR. 226--231. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD). Portland, OR. 226--231.

[36]

B. S. Everitt S. Landau and M. Leese. 2001. Cluster Analysis (4th ed.). Arnold. B. S. Everitt S. Landau and M. Leese. 2001. Cluster Analysis (4th ed.). Arnold. 10.1002/9781118887486.ch6

[37]

A. Foss and O. R. Zaïane . 2002. A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets . In Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM), Maebashi City. Japan. 179--186 . A. Foss and O. R. Zaïane. 2002. A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets. In Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM), Maebashi City. Japan. 179--186.

[38]

10.1109/tpami.2005.113

[39]

10.1109/tit.1975.1055330

[40]

10.1109/icdm.2006.43

[41]

10.1023/b:visi.0000042993.50813.60

[42]

Sample Criteria for Testing Outlying Observations

Frank E. Grubbs

The Annals of Mathematical Statistics 10.1214/aoms/1177729885

[43]

10.1109/icdmw.2006.92

[44]

10.1109/tcbb.2008.32

[45]

10.1002/wics.6

[46]

S. Hang , Z. You , and L. Y. Chun . 2009. Incorporating biological knowledge into density-based clustering analysis of gene expression data . In 6th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD) , Vol. 5 . Tianjin, China, 52--56. S. Hang, Z. You, and L. Y. Chun. 2009. Incorporating biological knowledge into density-based clustering analysis of gene expression data. In 6th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Vol. 5. Tianjin, China, 52--56.

[47]

The meaning and use of the area under a receiver operating characteristic (ROC) curve.

J A Hanley, B J McNeil

Radiology 10.1148/radiology.143.1.7063747

[48]

J. A. Hartigan . 1975. Clustering Algorithms . John Wiley & Sons , New York , London, Sydney, Toronto. J. A. Hartigan. 1975. Clustering Algorithms. John Wiley & Sons, New York, London, Sydney, Toronto.

[49]

10.1080/01621459.1987.10478428

[50]

D. Hawkins . 1980. Identification of Outliers . Chapman and Hall . D. Hawkins. 1980. Identification of Outliers. Chapman and Hall.

Showing 50 of 126 references

Cited By

673

Multimodal Generalized Defect Category Discovery in industrial scenarios via defect-aware representation guided calibrated clustering

Hao Cheng, Jiaxiang Luo · 2026

Advanced Engineering Informatics

A Spatiotemporal Cluster Analysis and Dynamic Evaluation Model for the Rock Mass Instability Risk During Deep Mining of Metal Mine

Yuting Bian, Wei Zhu · 2026

Mathematics

Streaming Hierarchical Clustering Based on Point-Set Kernel

Xin Han, Ye Zhu · 2026

ACM Transactions on Knowledge Disco...

Unsupervised machine-learning pipeline for data-driven defect detection and characterisation: Application to displacement cascades

Samuel Del Fré, Andrée De Backer · 2026

Computational Materials Science

Clustering with Minimum Spanning Trees: How Good Can It Be?

Marek Gagolewski, Anna Cena · 2024

Journal of Classification

Development and Application of an Advanced Automatic Identification System (AIS)-Based Ship Trajectory Extraction Framework for Maritime Traffic Analysis

I-Lun Huang, Man-Chun Lee · 2024

Journal of Marine Science and Engin...

Global machine learning potentials for molecular crystals

Ivan Žugec, R. Matthias Geilhufe · 2024

The Journal of Chemical Physics

Metrics

673

Citations

126

References

Details

Published: Jul 22, 2015
Vol/Issue: 10(1)
Pages: 1-51
License: View

Authors

R

Ricardo J. G. B. Campello

Department of Computer Sciences, University of São Paulo, Brazil

D

Davoud Moulavi

Department of Computing Science, University of Alberta, Canada

A

Arthur Zimek

Ludwig-Maximilians-Universität München, Germany

J

Jorg Sander

Department of Computing Science, University of Alberta, Canada

Funding

FAPESP Award: #2013/18698-4 and #2010/20032-6

CNPq Award: #304137/2013-8 and#201239/2012-4

NSERC

Cite This Article

Ricardo J. G. B. Campello, Davoud Moulavi, Arthur Zimek, et al. (2015). Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. ACM Transactions on Knowledge Discovery from Data, 10(1), 1-51. https://doi.org/10.1145/2733381

Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection

You May Also Like