This demo illustrates how the JMSL Library can be used to easily create a Java application that performs K-Means Cluster Analysis or Hierarchical Cluster Analysis. Cluster analysis aims to group data with similar properties together. For these two-dimensional cases, physical interpretation is straightforward, but this is rarely the case for complex data mining.
This chart, known as the Hertzsprung-Russell diagram, plots a star's temperature versus its luminosity (scaled so that the Sun has unit luminosity). The main diagonal contains the set of stars known as the "Main Sequence" that show increasing luminosity with increasing temperature (note that the top x axis increases from right to left). The group of stars below this diagonal are White Dwarfs (hot stars that are relatively dim). The stars in the upper right of the diagram are Red Giants (cooler stars that are nonetheless quite bright). A few Blue Giants can be seen above the Main Sequence on the left side.
The data from the NIH (direct link, more detail) is displayed in a heatmap chart. This display shows the reaction of different genes to various treatments. There is significant information included in this chart that is easier to comprehend if similar results are grouped together.
Select the number of clusters to identify and then press the Run Analysis button. The seeds for the clustering algorithm are generated randomly each time the analysis is run. The Show Report button will present the results of the cluster analysis in tabular format. As the analysis is run again, the report is updated automatically. The Annotate Graph button will label regions of the graph, including our Sun.
The data are shown initially ungrouped. Cluster analysis is done using two methods with various options available. The top JComboBox selects the method used to compute distances, and the second JComboBox sets the option used for hierarchical clustering. Once selections are made, press the Run Analysis button to cluster the data and see a Dendrogram appear with more detailed information on the similarity between individual rows and columns. Color and row/column labels can be added with the Highlight Clusters and Show Labels toggle buttons. After setting either of these, press the Update button. A report is shown using the View Report button.
com.imsl.stat.ClusterKMeans - This class accepts the input vectors as a 2D array along with an array of cluster centers in its constructor. Once instantiated, the analysis is run by calling the compute() method. Then various statistics are available through getClusterSSQ(), getClusterMembership() and getClusterCounts(). Refer to the runAnalysis() method for details.
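The quantities named above can be illustrated with a small plain-Java sketch. It is a Lloyd-style K-means written for this guide, not the JMSL implementation, and the class KMeansSketch and its fields are hypothetical names. Its fields are only analogues of what getClusterMembership(), getClusterCounts() and getClusterSSQ() report, assuming 1-based cluster numbers as the ClusterSet code later in this guide suggests.

```java
import java.util.Arrays;

// Plain-Java Lloyd-style K-means; an illustration, not the JMSL implementation.
class KMeansSketch {
    final double[][] centers; // final cluster centers
    final int[] membership;   // 1-based cluster number for each observation
    final int[] counts;       // number of observations in each cluster
    final double[] ssq;       // within-cluster sum of squares

    KMeansSketch(double[][] x, double[][] seeds, int maxIter) {
        int n = x.length, k = seeds.length, d = x[0].length;
        centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = seeds[c].clone();
        membership = new int[n];
        counts = new int[k];
        ssq = new double[k];

        for (int iter = 0; iter < Math.max(1, maxIter); iter++) {
            // Assignment step: attach each observation to its nearest center.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (dist2(x[i], centers[c]) < dist2(x[i], centers[best])) best = c;
                }
                if (membership[i] != best + 1) {
                    membership[i] = best + 1;
                    changed = true;
                }
            }
            if (!changed) break; // assignments are stable, so the centers are too

            // Update step: move each center to the mean of its members.
            Arrays.fill(counts, 0);
            double[][] sum = new double[k][d];
            for (int i = 0; i < n; i++) {
                int c = membership[i] - 1;
                counts[c]++;
                for (int j = 0; j < d; j++) sum[c][j] += x[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // leave an empty cluster's center in place
                for (int j = 0; j < d; j++) centers[c][j] = sum[c][j] / counts[c];
            }
        }
        // Accumulate each cluster's sum of squared distances to its center.
        for (int i = 0; i < n; i++) {
            int c = membership[i] - 1;
            ssq[c] += dist2(x[i], centers[c]);
        }
    }

    // Squared Euclidean distance between two points.
    static double dist2(double[] a, double[] b) {
        double s = 0.0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return s;
    }

    public static void main(String[] args) {
        double[][] x = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] seeds = {{0, 0}, {10, 10}};
        KMeansSketch km = new KMeansSketch(x, seeds, 20);
        System.out.println("membership = " + Arrays.toString(km.membership));
        System.out.println("counts     = " + Arrays.toString(km.counts));
        System.out.println("ssq        = " + Arrays.toString(km.ssq));
    }
}
```

As in the demo, the result depends on the seeds passed in, which is why rerunning the analysis with new random seeds can produce a different grouping.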
com.imsl.stat.Summary - This class is used in computeScaledData() to convert the input data into statistically similar ranges. If the input data were used without conversion, the y values, ranging from -10 to 20, would be ten times more influential in the clustering of the data than the x values, ranging from -0.5 to 2.5. To avoid this undesired weighting effect, data to be used for cluster analysis should be scaled consistently. For this example, the Summary class is used to scale both input vectors to [0,1].
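The scaling step can be sketched in plain Java as follows. This is a minimal illustration that finds each column's minimum and maximum directly rather than through the Summary class, so MinMaxScale and scaleColumns are hypothetical names and not part of the demo source.

```java
import java.util.Arrays;

// Min-max scaling of each column onto [0,1]; an illustration only.
class MinMaxScale {
    // Returns a copy of data in which every column is mapped linearly onto [0,1].
    static double[][] scaleColumns(double[][] data) {
        int n = data.length, d = data[0].length;
        double[][] scaled = new double[n][d];
        for (int j = 0; j < d; j++) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] row : data) {
                min = Math.min(min, row[j]);
                max = Math.max(max, row[j]);
            }
            double range = max - min;
            for (int i = 0; i < n; i++) {
                // A constant column has zero range; map it to 0 to avoid dividing by zero.
                scaled[i][j] = (range == 0.0) ? 0.0 : (data[i][j] - min) / range;
            }
        }
        return scaled;
    }

    public static void main(String[] args) {
        // x spans [-0.5, 2.5] and y spans [-10, 20], the ranges quoted in the text.
        double[][] data = {{-0.5, -10.0}, {2.5, 20.0}, {1.0, 5.0}};
        System.out.println(Arrays.deepToString(scaleColumns(data)));
    }
}
```

After scaling, a unit step in x and a unit step in y contribute equally to the distance between two points.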
com.imsl.stat.Dissimilarities and com.imsl.stat.ClusterHierarchical - These two classes are used to group and cluster the data displayed in the heatmap. The getDistanceMatrix() method obtains the output from Dissimilarities, which is then used as the input to ClusterHierarchical. The options selected in the JComboBoxes are passed to the constructor of the corresponding class.
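To make the hand-off between the two classes concrete, the sketch below builds a distance matrix in plain Java and finds the closest pair of rows, which is the first pair an agglomerative hierarchical method would merge. It assumes Euclidean distance only, one of several distance options such a demo might offer, and it does not call the JMSL classes. DistanceMatrixSketch and its methods are hypothetical names.

```java
import java.util.Arrays;

// Pairwise distances between rows, plus the first agglomerative merge candidate.
class DistanceMatrixSketch {
    // Symmetric matrix of Euclidean distances between the rows of the input.
    static double[][] euclidean(double[][] rows) {
        int n = rows.length;
        double[][] dist = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < rows[i].length; k++) {
                    double diff = rows[i][k] - rows[j][k];
                    s += diff * diff;
                }
                dist[i][j] = Math.sqrt(s);
                dist[j][i] = dist[i][j];
            }
        }
        return dist;
    }

    // Indices of the two closest rows; requires at least two rows.
    static int[] closestPair(double[][] dist) {
        int bi = 0, bj = 1;
        for (int i = 0; i < dist.length; i++) {
            for (int j = i + 1; j < dist.length; j++) {
                if (dist[i][j] < dist[bi][bj]) {
                    bi = i;
                    bj = j;
                }
            }
        }
        return new int[] {bi, bj};
    }

    public static void main(String[] args) {
        double[][] rows = {{0, 0}, {3, 4}, {0, 1}};
        double[][] dist = euclidean(rows);
        System.out.println(Arrays.deepToString(dist));
        System.out.println("first merge: " + Arrays.toString(closestPair(dist)));
    }
}
```

To cluster the columns of the heatmap instead of its rows, the same computation can be applied to the transposed data.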
This application utilizes the com.imsl.chart.Data object to generate a basic scatter plot in the K-Means example. When the clusters are recomputed, we would like to remove the previous data and plot new data. This is accomplished without redrawing the entire chart by removing only the Data objects, leaving the axes, etc. intact:
private void clearChart() {
    ChartNode[] children = axis.getChildren();
    for (int i = 0; i < children.length; i++) {
        if (children[i] instanceof Data) {
            children[i].remove();
        }
    }
}
The Hierarchical example uses the com.imsl.chart.Heatmap class to display the 2D data to be clustered. Once the analysis is run, com.imsl.chart.Dendrogram displays the relationship between rows and columns. The Dendrograms are placed next to the Heatmap through careful use of the AxisXY.setViewport() method. The parameters used in setViewport() are scaled values on [0,1] to fill the chart window. The values for each of the charts follow:
Heatmap: axis.setViewport(0.1, 0.75, 0.25, 0.9);
Top vertical dendrogram: axisV.setViewport(0.1, 0.75, 0.1, 0.25);
Right horizontal dendrogram: axisH.setViewport(0.75, 0.9, 0.25, 0.9);
With these values, the entire charting space is utilized and the three separate axes combine to form a single informative view.
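To see how the three viewports fit together, the fractions can be converted to pixel rectangles for a given window size. The sketch below is plain arithmetic, not a JMSL call. It assumes the four arguments are (xmin, xmax, ymin, ymax) and that fractions are measured from the upper-left corner, as the smaller y values of the top dendrogram suggest. The 800 by 600 window and the class ViewportSketch are hypothetical.

```java
import java.util.Arrays;

// Converts fractional viewports to pixel rectangles; illustration only.
class ViewportSketch {
    // Returns {left, top, width, height} in pixels for a fractional viewport.
    static int[] toPixels(double xmin, double xmax, double ymin, double ymax,
                          int width, int height) {
        int left = (int) Math.round(xmin * width);
        int right = (int) Math.round(xmax * width);
        int top = (int) Math.round(ymin * height);
        int bottom = (int) Math.round(ymax * height);
        return new int[] {left, top, right - left, bottom - top};
    }

    public static void main(String[] args) {
        int w = 800, h = 600; // hypothetical chart size in pixels
        System.out.println("heatmap: " + Arrays.toString(toPixels(0.1, 0.75, 0.25, 0.9, w, h)));
        System.out.println("top:     " + Arrays.toString(toPixels(0.1, 0.75, 0.1, 0.25, w, h)));
        System.out.println("right:   " + Arrays.toString(toPixels(0.75, 0.9, 0.25, 0.9, w, h)));
    }
}
```

The top dendrogram shares the heatmap's x range and ends at y = 0.25 where the heatmap begins, while the right dendrogram shares the heatmap's y range and begins at x = 0.75 where the heatmap ends, so the three rectangles meet edge to edge.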
To work more efficiently with the data associated with a particular cluster, a class ClusterSet was written. An object of this class is created for each cluster by looping over the clusters and passing the input data array, the output cluster membership array, the cluster number (the loop index plus one), and the number of points in the current cluster. The ClusterSet's x and y fields then contain all of the x,y coordinates for the data points associated with this particular cluster. These coordinates are then used to create a new Data object to be added to the chart.
The code that makes use of this class from the runAnalysis() method is:
ClusterSet cset = new ClusterSet(data, ic, i+1, nc[i]);
Data cData = new Data(axis, cset.x, cset.y);
And the source code for the ClusterSet class itself:
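class ClusterSet {
    // Coordinates of the points in the selected cluster, in original data units.
    double[] x, y;

    // data: scaled input points; member: cluster number assigned to each point;
    // select: cluster number to extract; total: number of points in that cluster.
    ClusterSet(double[][] data, int[] member, int select, int total) {
        x = new double[total];
        y = new double[total];
        int count = 0;
        for (int i = 0; i < data.length; i++) {
            if (member[i] == select) {
                // minX, rangeX, minY and rangeY are not declared in this class;
                // they are presumably fields of the enclosing class. They map
                // the scaled [0,1] values back to the original data range.
                x[count] = minX + rangeX * data[i][0];
                y[count] = minY + rangeY * data[i][1];
                count++;
            }
        }
    }
}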
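For experimenting outside the demo, a standalone variant of the same idea is sketched below. It assumes the minimum and range of each coordinate are passed in explicitly instead of being read from an enclosing class, and it counts the cluster's points itself rather than taking the count as an argument. ClusterPoints is a hypothetical name, not part of the demo source.

```java
import java.util.Arrays;

// Standalone variant of the ClusterSet idea with explicit bounds; illustration only.
class ClusterPoints {
    final double[] x, y; // coordinates of the selected cluster, in original units

    ClusterPoints(double[][] scaled, int[] member, int select,
                  double minX, double rangeX, double minY, double rangeY) {
        // Count the members first so the arrays can be sized exactly.
        int total = 0;
        for (int m : member) {
            if (m == select) total++;
        }
        x = new double[total];
        y = new double[total];
        int count = 0;
        for (int i = 0; i < scaled.length; i++) {
            if (member[i] == select) {
                // Undo the [0,1] scaling: original = min + range * scaled.
                x[count] = minX + rangeX * scaled[i][0];
                y[count] = minY + rangeY * scaled[i][1];
                count++;
            }
        }
    }

    public static void main(String[] args) {
        double[][] scaled = {{0.0, 0.0}, {1.0, 1.0}, {0.5, 0.5}};
        int[] member = {1, 2, 1}; // 1-based cluster numbers
        ClusterPoints first = new ClusterPoints(scaled, member, 1, -0.5, 3.0, -10.0, 30.0);
        System.out.println("x = " + Arrays.toString(first.x));
        System.out.println("y = " + Arrays.toString(first.y));
    }
}
```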
ClusterMain.java | This is the main class, and it extends JFrame. This small class instantiates a JTabbedPane and adds the two JPanels below, which in turn implement their own listeners for the user interfaces and interaction with chart areas. |
StarCluster.java | This JPanel contains the K-Means example; it draws the input data in a chart, then waits for user input to run the cluster analysis, create a report window, or draw annotations. |
HCluster.java | This JPanel contains the Hierarchical example; it draws the input data in a chart, then waits for user input to run the cluster analysis, create a report window, or draw annotations. |
ClusterReport.java | This class extends JDialog and is called by StarCluster and HCluster to display a report containing details from the cluster analysis. It contains a setText() method used to update its HTML content. |
Two alternatives are available to run this demo:
1) Use the source code in your development environment as any other Java code. More information is available in the How To.
2) An executable jar file containing all of the demos referenced in this guide is included in the jmsl/lib directory. On Windows, you may double-click the file to run it if files with a ".jar" extension are properly registered with javaw.exe. Alternatively, for both Windows and UNIX environments, the jar file may be executed from the command line using java -jar gallery.jar.
A list of buttons, one for each demo, is created. Demos can be filtered by area (Math, Stat, Finance, Charting) by choosing the appropriate selection in the JComboBox. To run the Additional Demos, select Quick Start in the JComboBox.