Example of Single Pass Clustering Technique

Suppose that we have the following set of documents and terms, and that
we are interested in clustering the terms using the single pass method
(note that the same method can beused to cluster the documents, but in
that case, we would be using the document vectors (rows) rather than the
term vector (columns).

	T1	T2	T3	T4	T5
Doc1	1	2	0	0	1
Doc2	3	1	2	3	0
Doc3	3	0	0	0	1
Doc4	2	1	0	3	0
Doc5	2	2	1	5	1

Start with T1 in a cluster by itself, say C1. At this point, C1 contains
only one item, T1, so the centroid of C1 is simply the vector for T1:

C1 = <1, 3, 3, 2, 2>.

Now compare (i.e., measure similarities) of the next item (T2) to centroids
of all existing clusters. At this point we have only one cluster, C1 (we
will use dot product for simplicity):

SIM(T2, C1) = 1*2 + 1*3 + 0*3 + 1*2 + 2*2 = 11

Now we need a pre-specified similarity threshold. Let's say that our
threshold is 10. This means that if the similarity of T2 to the cluster
centroid is >= 10, then we add T2 to the cluster, otherwise we use T2 to
start a new cluster.

In this case. SIM(T2, C1) = 11 > 10. Therefore we add T2 to cluster C1.
We now need to compute the new centroid for C1 (which now contains T1 and
T2). The centroid (which is the average vector for T1 and T2 is:

C1 = <3/2, 4/2, 3/2, 3/2, 4/2>

Now, we move to the next item, T3. Again, there is only one cluster, C1,
so we only need to compare T3 with C1 centroid. The dot product of T3 and
the above centroid is:

SIM(T3, C1) = 0 + 8/2 + 0 + 0 + 4/2 = 6

This time, T3 does not pass the threshold test (the similarity is less
than 10). Therefore, we use T3 to start a new cluster, C2. Now we have two
clusters

C1 = {T1, T2}
C2 = {T3}

We move to the next unclustered item, T4. Since we now have two clusters,
we need to compute the MAX similarity of T4 to the 2 cluster centroids
(note that the centroid of cluster C2 right now is just the vector for T3):

SIM(T4, C1) = <0, 3, 0, 3, 5> . <3/2, 4/2, 3/2, 3/2, 4/2>
            = 0 + 12/2 + 0 + 9/2 + 20/2 = 20.5

SIM(T4, C2) = <0, 3, 0, 3, 5> . <0, 2, 0, 0, 1>
            = 0 + 6 + 0 + 0 + 5 = 11

Note that both similarity scores pass the threshold (10), however, we pick
the MAX, and therefore, T4 will be added to cluster C1. Now we have the
following:

C1 = {T1, T2, T4}
C2 = {T3}

The centroid for C2 is still just the vector for T3:

C2 = <0, 2, 0, 0, 1>

and the new centroid for C1 is now:

C1 = <3/3, 7/3, 3/3, 6/3, 9/3>

The only item left unclustered is T5. We compute its similarity to the
centroids of existing clusters:

SIM(T5, C1) = <1, 0, 1, 0, 1> . <3/3, 7/3, 3/3, 6/3, 9/3>
            = 3/3 + 0 + 3/3 + 0 + 9/3 = 5

SIM(T5, C2) = <1, 0, 1, 0, 1> . <0, 2, 0, 0, 1>
            = 0 + 0 + 0 + 0 +1 = 1

Neither of these similarity values pass the threshold. Therefore, T5 will
have to go into a new cluster C3. There are no more unclustered items, so
we are done (after making a single pass through the items). The final
clusters are:

C1 = {T1, T2, T4}
C2 = {T3}
C3 = {T5}

Note: Obviously, the results for this method are highly dependent on the
similarity threshold that is used. You should use your judgment in setting
this threshold so that you are left with a reasonable number of clusters.


Copyright © 2002, Bamshad Mobasher, School of CTI, DePaul University.