Apakah boleh menggunakan jarak Manhattan dengan tautan antar-klaster Ward dalam hierarki klaster?

15

I am using hierarchical clustering to analyze time series data. My code is implemented using the Mathematica function DirectAgglomerate[...], which generates hierarchical clusters given the following inputs:

a distance matrix D
the name of the method used to determine inter-cluster linkage.

I have calculated the distance matrix D using Manhattan distance:

d (x, y) = \sum_{i} | x_{i} - y_{i} |

$d(x,y) = \sum_i|x_i - y_i|$

where $i = 1,\cdots, n$ and $n \approx 150$ is the number of data points in my time series.

My question is, is it ok to use Ward's inter-cluster linkage with a Manhattan distance matrix? Some sources suggest that Ward's linkage should only be used with Euclidean distance.

Note that DirectAgglomerate[...] calculates Ward's linkage using the distance matrix only, not the original observations. Unfortunately, I am unsure how Mathematica modifies Ward's original algorithm, which (from my understanding) worked by minimizing the error sum of squares of the observations, calculated with respect to the cluster mean. For example, for a cluster $c$ consisting of a vector of univariate observations, Ward formulated the error sum of squares as:

(\sum_{j} | | c_{j} - m e a n (c) | |_{2})^{2}

$(\sum_j||c_j - mean(c)||_2)^2$

(Other software tools such as Matlab and R also implement Ward's clustering using just a distance matrix so the question isn't specific to Mathematica.)

clustering distance-functions ward

— Rachel
sumber

I have recently analyzed a fairly big set of data using Ward method. In my specific case Manatthan distance gave essentially the same clustering as Euclidean distance. I cannot give you any mathematical prove in favour of any combination of methods, but -at least in my case- the clustering was not affected by the distance method

— nico

All R functions do not necessarily wait for a distance matrix. See e.g., the on-line help for agnes in the cluster package.

— chl

It is actually okay to use any distance. Check vlado.fmf.uni-lj.si/pub/preprint/ward.pdf The only catch is that, the mean we are talking about is no longer the arithmetic mean but Frechet mean.

— Randy Lai

but can we use manhattan distance for complete linkage??

— Payel Banerjee

8

The Ward clustering algorithm is a hierarchical clustering method that minimizes an 'inertia' criteria at each step. This inertia quantifies the sum of squared residuals between the reduced signal and the initial signal: it is a measure of the variance of the error in an l2 (Euclidean) sens. Actually, you even mention it in your question. This is why, I believe, it makes no sens to apply it to a distance matrix that is not a l2 Euclidean distance.

On the other hand, an average linkage or a single linkage hierarchical clustering would be perfectly suitable for other distances.

— Gael Varoquaux
sumber

2

Thanks for your comment; I think you are correct. However, in practise it seems that Ward's linkage is often used with non Euclidean distances. I am still not sure what the implications of this might be.

— Rachel

It probably comes from people using Ward simply because it is well known. I would say that Ward brings no gain compared to an average linkage in this settings. However, it is more computationally expensive (you need to compute the first two moments for each merge, or to precompute them). Thus, from a pragmatic standpoint, I would simply go for average linkage.

— Gael Varoquaux

1

Actually, the inertia would be defined using sum of squared distance (not necessary to be euclidean) see vlado.fmf.uni-lj.si/pub/preprint/ward.pdf

— Randy Lai

5

I can't think of any reason why Ward should favor any metric. Ward's method is just another option to decide which clusters to fusion next during agglomeration. This is achieved by finding the two clusters whose fusion will minimize a certain error (examplary source for the formula).

Hence it relies on two concepts:

The mean of vectors which (for numerical vectors) is generally calculated by averaging over every dimension separately.
The distance metric itself i.e. the concept of similarity expressed by this metric.

So: As long as the properties of the choosen metric (like e.g. rotation,translation or scale invariance) satisfy your needs (and the metric fits to the way the cluster mean is calculated), I don't see any reason to not use it.

I suspect that most people suggest the euclidean metric because they

want to increase the weight of the differences between a cluster mean and a single observation vector (which is done by quadration)
or because it came out as best metric in the validation based on their data
or because it is used in general.

— steffen
sumber

Thanks for your response. I have clarified my question a little to highlight that the 'DirectAgglomerate[...]' algorithm takes a distance matrix only. Given this, would the modified implementation of Ward's linkage be based on the assumption that the distance Matrix is Euclidean? Matlab's implementation of Ward's linkage, for example, notes that it is suitable for Euclidean distances only (mathworks.com/help/toolbox/stats/linkage.html).

— Rachel

1

@Rachel: aaah, I see. Any ward implementation has to calculate the distance between cluster members and the centroid. Intuitively it is clear the metric used for this should be equivalent to the metric used to calculate the distances between observations... hence matlab requires an euclidean distmatrix. But now the question arises why implementations do not request a function instead of distance matrix ? How much damage is done when one uses different metrices for both tasks ? I admit, I don't know it right know.

— steffen

hello example removed. any other website?

— MonsterMMORPG

2

Another way of thinking about this, which might lend itself to an adaptation for $\ell_1$ is that choice of the mean comes from the fact that the mean is the point that minimizes the sum of squared Euclidean distances. If you're using $\ell_1$ to measure the distance between time series, then you should be using a center that minimizes the sum of squared $\ell_1$ distances.

— Suresh Venkatasubramanian
sumber