Cluster Analysis

Summary

This demo illustrates how the JMSL Library can be used to create a Java™ application that performs K-Means or hierarchical cluster analysis. Cluster analysis aims to group data with similar properties together. For the two-dimensional cases shown here, physical interpretation is straightforward, but this is rarely the case for complex data mining.

K-Means

This chart, known as the Hertzsprung-Russell diagram, plots a star's temperature versus its luminosity (scaled so that the Sun has unit luminosity). The main diagonal contains the set of stars known as the "Main Sequence" that show increasing luminosity with increasing temperature (note that the top x axis increases from right to left). The group of stars below this diagonal are White Dwarfs (hot stars that are relatively dim). The stars in the upper right of the diagram are Red Giants (cooler stars that are nonetheless quite bright). A few Blue Giants can be seen above the Main Sequence on the left side.

Hierarchical

The data, from the NIH, are displayed in a heatmap chart. This display shows the reaction of different genes to various treatments. The chart contains a great deal of information that is easier to comprehend when similar results are grouped together.

Usage

K-Means

Select the number of clusters to identify and then press the Run Analysis button. The seeds for the clustering algorithm are generated randomly each time the analysis is run. The Show Report button will present the results of the cluster analysis in tabular format. As the analysis is run again, the report is updated automatically. The Annotate Graph button will label regions of the graph, including our Sun.

Hierarchical

The data are initially shown ungrouped. The analysis has two stages, each with its own options: the top JComboBox selects the method used to compute distances, and the second JComboBox selects the option used for hierarchical clustering. Once the selections are made, press the Run Analysis button to cluster the data; a Dendrogram appears with more detailed information on the similarity between individual rows and columns. Color and row/column labels can be added with the Highlight Clusters and Show Labels toggle buttons. After setting either of these, press the Update button. The View Report button displays a report of the results.

JMSL Library Math/Stat Classes

com.imsl.stat.ClusterKMeans - This class accepts the input vectors as a 2D array along with an array of cluster centers in its constructor. Once instantiated, the analysis is run by calling the compute() method. Then various statistics are available through getClusterSSQ(), getClusterMembership() and getClusterCounts(). Refer to the runAnalysis() method for details.
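
To make the three outputs named above concrete, here is a minimal plain-Java sketch of a K-Means pass. It uses Lloyd's iteration, which is an assumption chosen for brevity and is not claimed to be the algorithm inside ClusterKMeans. The class name KMeansSketch, its fields, and the sample points are all hypothetical; membership numbers start at 1 to match the i+1 convention used later in this guide.

```java
import java.util.Arrays;

/** Sketch only: Lloyd's iteration in plain Java, not the JMSL ClusterKMeans code. */
final class KMeansSketch {
    final double[][] centers; // final cluster centers
    final int[] membership;   // 1-based cluster number of each observation
    final int[] counts;       // number of observations in each cluster
    final double[] ssq;       // within-cluster sums of squares

    KMeansSketch(double[][] data, double[][] seeds, int maxIter) {
        if (maxIter < 1) throw new IllegalArgumentException("maxIter must be at least 1");
        int n = data.length, k = seeds.length, d = data[0].length;
        centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = seeds[c].clone();
        membership = new int[n];
        counts = new int[k];
        ssq = new double[k];

        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: attach each observation to its nearest center.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (dist2(data[i], centers[c]) < dist2(data[i], centers[best])) best = c;
                }
                if (membership[i] != best + 1) { membership[i] = best + 1; changed = true; }
            }
            if (!changed) break; // assignments are stable, so stop iterating
            // Update step: move each center to the mean of its members.
            double[][] sums = new double[k][d];
            int[] size = new int[k];
            for (int i = 0; i < n; i++) {
                int c = membership[i] - 1;
                size[c]++;
                for (int j = 0; j < d; j++) sums[c][j] += data[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (size[c] == 0) continue; // an empty cluster keeps its previous center
                for (int j = 0; j < d; j++) centers[c][j] = sums[c][j] / size[c];
            }
        }
        // Summary statistics for the final assignment.
        for (int i = 0; i < n; i++) {
            int c = membership[i] - 1;
            counts[c]++;
            ssq[c] += dist2(data[i], centers[c]);
        }
    }

    /** Squared Euclidean distance between two points. */
    private static double dist2(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) sum += (a[j] - b[j]) * (a[j] - b[j]);
        return sum;
    }

    public static void main(String[] args) {
        // Made-up data: two well-separated pairs of points and one seed near each pair.
        double[][] data = { {0, 0}, {0, 1}, {10, 10}, {10, 11} };
        double[][] seeds = { {0, 0}, {10, 10} };
        KMeansSketch km = new KMeansSketch(data, seeds, 10);
        System.out.println("membership = " + Arrays.toString(km.membership));
        System.out.println("counts     = " + Arrays.toString(km.counts));
        System.out.println("ssq        = " + Arrays.toString(km.ssq));
    }
}
```

The three fields play the same roles as the values returned by getClusterMembership(), getClusterCounts() and getClusterSSQ(); the demo's runAnalysis() method remains the reference for the actual JMSL calls.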

com.imsl.stat.Summary - This class is used in computeScaledData() to convert the input data into statistically similar ranges. If the input data were used without conversion, the y values, ranging from -10 to 20, would be ten times more influential in the clustering of the data than the x values, ranging from -0.5 to 2.5. To avoid this undesired weighting effect, data to be used for cluster analysis should be scaled consistently. For this example, the Summary class is used to scale both input vectors to [0,1].
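
The scaling step can be sketched in plain Java as follows. This is an illustration of min-max scaling under stated assumptions, not the demo's computeScaledData() method: the class name MinMaxScaleSketch, the method scaleColumns(), and the sample points are hypothetical, and the minimum and maximum are computed by hand rather than obtained from the Summary class.

```java
import java.util.Arrays;

/** Sketch only: rescales each column of a data matrix linearly onto [0,1]. */
final class MinMaxScaleSketch {

    /** Returns a scaled copy of data; the input array is left unchanged. */
    static double[][] scaleColumns(double[][] data) {
        int rows = data.length;
        int cols = data[0].length;
        double[][] scaled = new double[rows][cols];
        for (int j = 0; j < cols; j++) {
            // Find the extremes of column j.
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] row : data) {
                min = Math.min(min, row[j]);
                max = Math.max(max, row[j]);
            }
            double range = max - min;
            for (int i = 0; i < rows; i++) {
                // A constant column has zero range; map it to 0 to avoid dividing by zero.
                scaled[i][j] = (range == 0.0) ? 0.0 : (data[i][j] - min) / range;
            }
        }
        return scaled;
    }

    public static void main(String[] args) {
        // Made-up points spanning the x and y ranges quoted in the text.
        double[][] data = { {-0.5, -10.0}, {1.0, 5.0}, {2.5, 20.0} };
        System.out.println(Arrays.deepToString(scaleColumns(data)));
    }
}
```

After scaling, both columns span the same interval, so neither coordinate dominates a Euclidean distance simply because of its units.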

com.imsl.stat.Dissimilarities and com.imsl.stat.ClusterHierarchical - These two classes are used to group and cluster the data displayed in the heatmap. The getDistanceMatrix() method obtains the output from Dissimilarities, which is then used as the input to ClusterHierarchical. The options selected in the JComboBoxes are passed to the corresponding class in its constructor.
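
The two-stage flow described above (distance matrix first, clustering second) can be sketched in plain Java. The demo lets the user choose the distance and clustering options; this sketch fixes them to Euclidean distance and single linkage purely as an assumption for illustration. The class HierarchicalSketch and its methods are hypothetical and do not reproduce the JMSL implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: Euclidean row distances followed by single-linkage merging. */
final class HierarchicalSketch {

    /** Symmetric matrix of Euclidean distances between the rows of x. */
    static double[][] distanceMatrix(double[][] x) {
        int n = x.length;
        double[][] dist = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < x[i].length; k++) {
                    double diff = x[i][k] - x[j][k];
                    sum += diff * diff;
                }
                dist[i][j] = dist[j][i] = Math.sqrt(sum);
            }
        }
        return dist;
    }

    /** Merges the two closest clusters until one remains; returns the distance at each merge. */
    static double[] singleLinkageHeights(double[][] dist) {
        int n = dist.length; // assumes at least one row
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single);
        }
        double[] heights = new double[n - 1];
        for (int step = 0; step < n - 1; step++) {
            int bestA = 0, bestB = 1;
            double best = Double.POSITIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = linkage(clusters.get(a), clusters.get(b), dist);
                    if (d < best) { best = d; bestA = a; bestB = b; }
                }
            }
            heights[step] = best;
            // bestB > bestA, so removing bestB does not shift the position of bestA.
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return heights;
    }

    /** Single linkage: the smallest distance between any member of a and any member of b. */
    private static double linkage(List<Integer> a, List<Integer> b, double[][] dist) {
        double min = Double.POSITIVE_INFINITY;
        for (int i : a) for (int j : b) min = Math.min(min, dist[i][j]);
        return min;
    }

    public static void main(String[] args) {
        double[][] rows = { {0, 0}, {0, 1}, {5, 0} }; // made-up data
        double[] heights = singleLinkageHeights(distanceMatrix(rows));
        System.out.println(Arrays.toString(heights));
    }
}
```

The merge distances are the quantity a dendrogram draws as branch heights: rows that merge at a small distance are joined low in the tree.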

JMSL Library Charting Classes

This application uses the com.imsl.chart.Data object to generate a basic scatter plot in the K-Means example. When the clusters are recomputed, the previous data must be removed and the new data plotted. This is accomplished without redrawing the entire chart by removing only the Data objects, leaving the axes and other chart elements intact:

private void clearChart() {
    ChartNode[] children = axis.getChildren();
    for (int i=0; i<children.length; i++) {
        if (children[i] instanceof Data) {
            children[i].remove();
        }
    }
}

The Hierarchical example uses the com.imsl.chart.Heatmap class to display the 2D data to be clustered. Once the analysis is run, com.imsl.chart.Dendrogram displays the relationship between rows and columns. The Dendrograms are placed next to the Heatmap through careful use of the AxisXY.setViewport() method. The parameters used in setViewport() are scaled values on [0,1] to fill the chart window. The values for each of the charts follow:

Heatmap: axis.setViewport(0.1,0.75,0.25,0.9);

Top vertical dendrogram: axisV.setViewport(0.1,0.75,0.1,0.25);

Right horizontal dendrogram: axisH.setViewport(0.75,0.9,0.25,0.9);

With these values, the entire charting space is utilized and the three separate axes combine to form a single informative view.
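
To see how the three viewports fit together, the fractions can be converted to pixel rectangles. The sketch below is plain Java and makes two assumptions that should be checked against the JMSL documentation: the arguments are read as (xmin, xmax, ymin, ymax), and y is measured downward from the top of the chart window. The class ViewportSketch, the method toPixels(), and the 800 by 600 window size are hypothetical.

```java
import java.util.Arrays;

/** Sketch only: maps viewport fractions to pixel rectangles on a canvas. */
final class ViewportSketch {

    /** Returns {x, y, width, height} in pixels for the given viewport fractions. */
    static int[] toPixels(double xmin, double xmax, double ymin, double ymax,
                          int canvasWidth, int canvasHeight) {
        int x = (int) Math.round(xmin * canvasWidth);
        int y = (int) Math.round(ymin * canvasHeight);
        int w = (int) Math.round(xmax * canvasWidth) - x;
        int h = (int) Math.round(ymax * canvasHeight) - y;
        return new int[] { x, y, w, h };
    }

    public static void main(String[] args) {
        int width = 800, height = 600; // assumed window size
        System.out.println("heatmap = " + Arrays.toString(toPixels(0.1, 0.75, 0.25, 0.9, width, height)));
        System.out.println("top     = " + Arrays.toString(toPixels(0.1, 0.75, 0.1, 0.25, width, height)));
        System.out.println("right   = " + Arrays.toString(toPixels(0.75, 0.9, 0.25, 0.9, width, height)));
    }
}
```

Under these assumptions the top dendrogram ends exactly where the heatmap begins vertically (y = 0.25), and the right dendrogram begins exactly where the heatmap ends horizontally (x = 0.75), which is why the three axes read as a single view.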

Java Code

To work more efficiently with the data associated with a particular cluster, a ClusterSet class was written. One ClusterSet object is created per cluster by looping over the clusters and passing the input data array, the output cluster membership array, the cluster number (the loop index plus one), and the number of points in that cluster. The object's x and y fields then hold the x,y coordinates of the data points assigned to that cluster. These coordinates are used to create a new Data object that is added to the chart.

The code that makes use of this class from the runAnalysis() method is:

ClusterSet cset = new ClusterSet(data, ic, i+1, nc[i]);
Data cData = new Data(axis, cset.x, cset.y);

And the source code for the ClusterSet class itself:

class ClusterSet {
    double[] x, y;

    // minX, rangeX, minY and rangeY are defined outside this snippet.
    ClusterSet(double[][] data, int[] member, int select, int total) {
        x = new double[total];
        y = new double[total];
        int count = 0;
        for (int i=0; i<data.length; i++) {
            // Keep only the points assigned to the selected cluster.
            if (member[i] == select) {
                x[count] = minX+rangeX*(data[i][0]);
                y[count] = minY+rangeY*(data[i][1]);
                count++;
            }
        }
    }
}
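
The snippet above relies on minX, rangeX, minY and rangeY being visible from the enclosing code. A self-contained variant, offered only as a sketch, passes them in explicitly and shows the per-cluster loop. The class name ClusterPointsSketch, the extra constructor parameters, and the sample arrays are hypothetical; the assumption is that the data were scaled to [0,1] and are being mapped back to their original ranges for plotting.

```java
import java.util.Arrays;

/** Sketch only: collects the points of one cluster and maps them back to original units. */
final class ClusterPointsSketch {
    final double[] x, y;

    ClusterPointsSketch(double[][] scaled, int[] member, int select, int total,
                        double minX, double rangeX, double minY, double rangeY) {
        x = new double[total];
        y = new double[total];
        int count = 0;
        for (int i = 0; i < scaled.length; i++) {
            if (member[i] == select) {
                // Undo the [0,1] scaling: original = min + range * scaled.
                x[count] = minX + rangeX * scaled[i][0];
                y[count] = minY + rangeY * scaled[i][1];
                count++;
            }
        }
    }

    public static void main(String[] args) {
        double[][] scaled = { {0.0, 0.0}, {0.5, 0.5}, {1.0, 1.0} }; // made-up scaled data
        int[] member = { 1, 2, 1 };                                 // 1-based cluster numbers
        int[] counts = { 2, 1 };                                    // points per cluster
        for (int i = 0; i < counts.length; i++) {
            ClusterPointsSketch set = new ClusterPointsSketch(
                    scaled, member, i + 1, counts[i], -0.5, 3.0, -10.0, 30.0);
            System.out.println("cluster " + (i + 1) + ": x=" + Arrays.toString(set.x)
                    + " y=" + Arrays.toString(set.y));
        }
    }
}
```

In the demo itself, each such set of coordinates is handed to a new Data object so that every cluster is drawn as its own series.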

Links to Source Code

ClusterMain.java This is the main class, and it extends JFrame. This small class instantiates a JTabbedPane and adds the two JPanels below, which in turn implement their own listeners for the user interfaces and interaction with chart areas.
StarCluster.java This JPanel contains the K-Means example; it draws the input data in a chart, then waits for user input to run the cluster analysis, create a report window, or draw annotations.
HCluster.java This JPanel contains the Hierarchical example; it draws the input data in a chart, then waits for user input to run the cluster analysis, create a report window, or draw annotations.
ClusterReport.java This class extends JDialog and is called by StarCluster and HCluster to display a report containing details from the cluster analysis. It contains a setText() method used to update its HTML content.

Running This Demo

Two alternatives are available to run this demo:

1) Use the source code in your development environment as any other Java code. More information is available in the How To.

2) An executable jar file containing all of the demos referenced in this guide is included in the jmsl/lib directory. On Windows, you may double-click the file to run it if files with a ".jar" extension are properly registered with javaw.exe. Alternatively, for both Windows and UNIX environments, the jar file may be executed from the command line using java -jar gallery.jar.

A list of buttons, one for each demo, is created. The list can be filtered by area (Math, Stat, Finance, Charting) by choosing the appropriate selection in the JComboBox. To run the Additional Demos, select Quick Start in the JComboBox.