Information Theory and Estimation of Hierarchies from Metric Data

Estimation of Distance-Based Hierarchies in the Presence of Noise.

with Dekang Zhu (NUDT), Xuezhi Wang (RMIT), Xiang Li (NUDT) and Bill Moran (University of Melbourne)

Distance-based hierarchical clustering (DHC) methods are widely used in unsupervised data analysis, with their majority occurring in highly uncertain domains. This uncertainty can be an inherent “noise” in the measurement process but more often it is merely a technique for modelling the unknowns in the weight assignment process. Conventional approaches to statistical estimation of partitions and hierarchies view the objects to be clustered as random samples of certain distributions over a prescribed geometry (e.g. Gaussian mixture model estimation using expectation-maximization in Euclidean spaces), but this is, generally, unsatisfactory.

Instead, we propose to directly attribute uncertainty to the process of obtaining values for the pairwise distances rather than distort the data by mapping it into one’s “favorite space”. To the best of our knowledge, very little work has been done in this vein. We incorporate a statistical model of the uncertainty through corruption or noise in the pairwise distances and investigate the problem of estimating the DHC as unknown parameters from measurements. With work by Carlsson and Mèmoli establishing the primacy of Single Linkage Hierarchical Clustering as the only natural candidate for DHC, we focus on single linkage hierarchical clustering (SLHC). Statistical estimation of SLHC is of particular importance due to evidence that the negative effects of the so-called “chaining phenomenon” which tends to turn users away from SLHC (which otherwise would be a preferred tool in their toolkit) may be mitigated by the introduction of noise (`dithering’). In [1], we study the geometry of SLHC point-pre-images and uncover the inherent super-exponential complexity of maximum likelihood estimation (MLE) of SLHC. We find conditions on the distance measurement noise guaranteeing that SLHC is equivalent to maximum partial profile likelihood estimation (MPPLE) ignoring certain measured distances. At the same time, we show that direct evaluation of SLHC on noisy data yields a consistent estimator. Consequently, a full maximum likelihood estimation (MLE) is expected to perform better than SLHC in getting the correct HC results for the ground truth metric — a claim supported by our work on numerical approximation of the MLE estimator of SLHC [4,5].

[1]  “Statistical properties of the single linkage hierarchical clustering estimator,” Journal of Statistical Planning and Inference, volume 185 (2017), pp. 15-28.

[2] “Statistical Estimation for Single Linkage Hierarchical Clustering,” Cyber Technology in Automation, Control, and Intelligent Systems (CYBER), 2015 IEEE International Conference on.

[3] “Maximum Likelihood Estimation for Single Linkage Hierarchical Clustering,”  preprint: arXiv:1511.07944


Entropy-Based Control of Hierarchical Classification Schemes.

with Bill Moran (University of Melbourne), Ali Pezeshki and Lorenzo DeCarli (Colorado State University)

A statistical outlook on hierarchical clustering motivates additional questions. Among these a fundamental problem would be: Extend the notions of Shannon entropy and relative entropy, already employed in non-hierarchical clustering, from the domain of partitions to the domain of hierarchies (dendrograms) while retaining their qualities as statistically meaningful notions of diversity and divergence.

Using the well-known equivalence between dendrograms and ultra-metrics, one extends the Statistical Mechanics approach to defining the entropy of a partition to ultra-metrics [1], showing how to view the (weighted) collection of minimum spanning trees of an ultra-metric as a measure of diversity. Computing the weights is the crux of the project: direct methods are intractable due to a very tight relationship between this problem and the long-open problem of enumerating the extremal directions of the metric cone, but some progress has been made towards a solution using the theory of Multivariate Splines.

In the meantime, we have manged to derive a surprisingly useful surrogate quantity, the discrete entropy of a combinatorial dendrogram, which we introduce in [4]. Observing that the collection of MSTs of an ultra-metric space characterizes the combinatorial structure of the associated dendrogram, one can view these MSTs as micro-states in the macro-state corresponding to this combinatorial structure. An application of the matrix-tree theorem yields a simple expression for the combinatorial fraction of this collection of trees in the collection of all spanning trees, resulting in a natural measure of diversity of the combinatorial dendrogram, which reduces to Shannon entropy in the standard thermodynamical limit and has linear computational complexity in the size of the data set.

This simplified notion of entropy of a hierarchy proves useful for the detection of “hierarchical anomalies” – data elements which, if introduced into the data set, will result in significant collapse of its existing SLHC hierarchy. Our paper [4] develops these tools for a potential application in cyber security:

[4]  “Detecting Poisoning Attacks on Hierarchical Malware Classification Systems” (with Bill Moran, Ali Pezeshki and Omur Arslan), SPIE Proceedings Volume 10185, Cyber Sensing 2017; 101850E (2017).