Homework 7 - Analysis of precipitation data

To obtain the data set for this homework, ftp to yosemite.atmos.ucla.edu, log on as "anonymous", cd to the directory pub/fovell, and get westcoast.sas. Problems? E-mail me.

WARNING: This assignment can generate reams of output. Edit your *.lst files to remove useless stuff before printing them!

Introduction

The data set westcoast.sas contains 50-year-long time series of monthly precipitation (600 observations per series) for 26 climate divisions located along the west coast of the conterminous United States. A "climate division" is a (usually small) area comprising one or more meteorological stations. These data, which span the period from January 1948 to December 1997, inclusive, were originally obtained from the National Climatic Data Center (NCDC) data archive. Here is an ugly but usable base map for the west coast climate divisions.

The data matrix X is 600 (months) by 26 (objects). We will perform object-based PCA, spectral and cluster analyses on these data. First, however, each object will be standardized to zero mean and unit variance. Thus, our PCA will be on the object correlation matrix. I am intentionally removing mean and variance differences among the objects, leaving the phasing of the objects' temporal fluctuations as their sole source of distinctiveness. The intent here is to create "component objects" or object clusters that emphasize raw temporal similarity.
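To see what the standardization buys us, here is an illustrative sketch in Python with NumPy (not part of the SAS assignment; the matrix is shrunk from 600x26 for brevity): the covariance matrix of column-standardized data is exactly the correlation matrix of the original data, so the eigenanalysis below is a PCA on the object correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))              # stand-in for the 600x26 data matrix

# Standardize each object (column) to zero mean and unit variance,
# using the n-1 divisor as PROC STANDARD does.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance of the standardized data = correlation of the original data,
# so eigenanalysis of R is PCA on the object correlation matrix.
R = Z.T @ Z / (len(Z) - 1)
eigvals, eigvecs = np.linalg.eigh(R)       # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Eigenvalue-scaled eigenvectors (the "V" loadings); each object's squared
# loadings sum to its diagonal entry of R, i.e. to 1.
loadings = eigvecs * np.sqrt(eigvals)
```

Since mean and variance differences are gone, the only structure left in R is the phasing of the objects' fluctuations, which is the point of the exercise.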

What is to be done

We will do the following: (1) Unrotated PCA, with spectral analysis of the component time series; (2) Rotated PCA, as a form of cluster analysis; (3) Traditional geometric cluster analysis, for comparison and contrast with PCA-based clustering.

Periodogram spectral analysis
Here is a brief explanation (I'll have a handout on this topic). This technique, which distributes the adjusted TSS of a time series among a set of periods, is actually closely related to linear regression. Suppose you have a time series of M (here, 600) data points, spanning temporal length T (here, 600 months) with a time resolution of DT (here, 1 month). The series is centered to zero mean. You examine the "spectral power" at a set of periodicities tied to the total time series length: T, T/2, T/3, ... down to 2DT, for a total of M/2 periods. For a given period P, you fit the OLS model Y = Beta1*cos(2*pi*t/P) + Beta2*sin(2*pi*t/P) + error, where Y is the component time series you're analyzing and t is time. The SSR of this regression is the series' spectral power at period P.
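To make the regression view concrete, here is an illustrative Python/NumPy sketch (not part of the SAS assignment): a synthetic series with a planted 12-month cycle is regressed on a cosine/sine pair at each candidate period, and the SSR peaks at the planted period.

```python
import numpy as np

rng = np.random.default_rng(1)
M, DT = 600, 1.0
t = np.arange(M) * DT
# synthetic component series: a 12-month cycle buried in noise, then centered
y = np.cos(2 * np.pi * t / 12 + 0.7) + rng.normal(scale=1.5, size=M)
y = y - y.mean()

def power_at(k):
    """SSR from regressing y on a cosine/sine pair at period T/k = M*DT/k."""
    period = (M * DT) / k
    X = np.column_stack([np.cos(2 * np.pi * t / period),
                         np.sin(2 * np.pi * t / period)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return fitted @ fitted        # = SSR, since y (and the fit) are centered

powers = np.array([power_at(k) for k in range(1, M // 2 + 1)])
peak_period = (M * DT) / (np.argmax(powers) + 1)   # 12 months for this series
```

This is exactly what PROC SPECTRA's periodogram reports for each period, just computed one regression at a time.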

At this point, it sounds like we'd have to do M/2 separate regressions and, more importantly, worry about redundancy and partial correlations. Not so. Keep in mind that, over a whole number of cycles, a sine and cosine of the same period are uncorrelated. Further, a sine (or cosine) at any one of these periods is uncorrelated with the sines and cosines at all the other periods. Thus, we can build a single OLS model combining all M terms (M/2 sines and M/2 cosines, spanning M/2 different periods) without worrying about collinearities, inflated standard errors, biased estimates, or overlapping SSRs. Because the regressors are mutually uncorrelated, there is no overlap, and the sum of the individual SSRs over the M/2 periods equals the adjusted TSS of the time series Y.
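The no-overlap claim is easy to verify numerically. An illustrative Python/NumPy sketch (assumptions: M even, Fourier periods T/k; note that at the shortest period, 2DT, the sine term vanishes identically, so only the cosine contributes there):

```python
import numpy as np

rng = np.random.default_rng(2)
M = 600
t = np.arange(M)
y = rng.normal(size=M)
y = y - y.mean()                  # centered series
tss = y @ y                       # adjusted TSS

# Because the Fourier-period sines and cosines are mutually orthogonal,
# each period's SSR is just the squared projection of y onto its pair,
# and the M/2 SSRs partition the TSS exactly -- no overlap, no partials.
ssr = np.zeros(M // 2)
for k in range(1, M // 2 + 1):
    c = np.cos(2 * np.pi * k * t / M)
    s = np.sin(2 * np.pi * k * t / M)
    ssr[k - 1] = (y @ c) ** 2 / (c @ c)
    if s @ s > 1e-9:              # the sine is identically zero at period 2*DT
        ssr[k - 1] += (y @ s) ** 2 / (s @ s)
```

The sum of the 300 per-period SSRs reproduces the adjusted TSS to machine precision, which is the partitioning PROC SPECTRA relies on.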

Geometric cluster analysis
This will also be covered in much more detail in class. Traditional cluster analysis starts with a "dissimilarity matrix" for the N objects: a square, symmetric NxN matrix comprising the geometric distances between all possible object pairs. "Agglomerative hierarchical" cluster analysis starts with N clusters (one object per cluster) and at each step identifies and fuses the cluster pair with the smallest geometric separation. Once a fusion takes place, the distances between the new, multiobject cluster and all remaining clusters have to be recomputed, and clustering algorithms differ dramatically in how this is accomplished.

In each step of agglomerative hierarchical clustering, the number of remaining clusters is reduced by one, and clustering terminates when one, all-inclusive cluster is forged. Then, the results are inspected and one or more clustering levels are selected. There are tests and rules for choosing among the N possible clustering levels. Generally, they aren't very good tests. Most algorithms make "hard" clusters; each object belongs to one, and only one, cluster. This is often unreasonable; real clusters should be "fuzzy" (overlapping).
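The fusion loop described above can be sketched directly. This is an illustrative pure-Python/NumPy implementation of average linkage (the same idea PROC CLUSTER's METHOD=AVERAGE uses, though not its actual algorithm), run on three well-separated toy groups:

```python
import numpy as np
from itertools import combinations

def average_linkage(X, n_clusters):
    """Agglomerative hierarchical clustering, average linkage.
    Start with one cluster per object; at each step fuse the pair of
    clusters whose mean inter-object Euclidean distance is smallest."""
    n = len(X)
    # square, symmetric dissimilarity matrix of pairwise Euclidean distances
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best, best_d = None, np.inf
        for a, b in combinations(range(len(clusters)), 2):
            d = D[np.ix_(clusters[a], clusters[b])].mean()  # average linkage
            if d < best_d:
                best, best_d = (a, b), d
        a, b = best
        clusters[a].extend(clusters.pop(b))   # fuse; one fewer cluster remains
    return clusters

# three tight, well-separated groups of 5 points each
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(5, 2))
               for c in (0.0, 5.0, 10.0)])
groups = average_linkage(X, 3)
```

Stopping the loop at n_clusters = 3 corresponds to picking one clustering level from the hierarchy, which is exactly what PROC TREE's NCLUSTERS= option does; note this produces "hard" clusters.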

List of specific tasks, with detailed notes

  1. Perform an unrotated, truncated PCA on the object correlation matrix.

    Notes

    Tasks

  2. Perform a rotated PCA, without and with truncation, for a PCA-based clustering of the objects.

    Notes

    Tasks

  3. Perform a traditional, geometric cluster analysis on the objects for comparison.

    Notes

    Task
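For task 2, the idea of a PCA-based clustering can be sketched outside SAS: varimax-rotate the eigenvalue-scaled loadings, then assign each object to the component on which it loads most heavily (a "hard" reading of an inherently fuzzy assignment). This is illustrative Python/NumPy on a toy block correlation matrix; the varimax routine is the standard Kaiser algorithm in its SVD form, not anything taken from westcoast.sas.

```python
import numpy as np

def varimax(loadings, tol=1e-8, max_iter=100):
    """Orthogonal varimax rotation (Kaiser criterion, SVD form)."""
    n, k = loadings.shape
    R = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / n))
        R = u @ vt
        if s.sum() < var * (1.0 + tol):   # stop when the criterion stalls
            break
        var = s.sum()
    return loadings @ R

# toy correlation matrix: two blocks of mutually correlated objects
C = np.zeros((6, 6))
C[:3, :3] = 0.8
C[3:, 3:] = 0.6
np.fill_diagonal(C, 1.0)

w, V = np.linalg.eigh(C)
idx = np.argsort(w)[::-1][:2]                 # truncate: keep two largest roots
loadings = V[:, idx] * np.sqrt(w[idx])        # eigenvalue-scaled eigenvectors
rotated = varimax(loadings)

# hard cluster assignment: the component with the largest absolute loading
cluster = np.abs(rotated).argmax(axis=1)
```

The rotation is orthogonal, so each object's communality (row sum of squared loadings) is unchanged; only the split of that communality among components moves, which is what makes the loading table readable as a clustering.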

SAS code from westcoast.sas

* ------------------------------------------------------------------------;
* preliminary data manipulation;
* ------------------------------------------------------------------------;
* the data set starts out divisions (objects) x months (obs/vars) 26x600;
* here we transpose the data set to make it months x divisions or 600x26;
* the objects are now called col1-col26;

proc transpose data=precip out=trans;

* now standardize the objects to zero mean, variance one
  (we can make FACTOR do this... just making it obvious it was done);

proc standard m=0 s=1 data=trans out=stand; var col1-col26;

* ------------------------------------------------------------------------;
* (1) PCA on the objects -- do an unrotated analysis first;
* ------------------------------------------------------------------------;
* proc factor does the PCA on the 26x26 correlation matrix for the objects,
    and reports the V eigenvalue-scaled eigenvectors;

proc factor data=stand rotate=NONE score outstat=factout1; var col1-col26;

* proc score creates the standardized Y scores (called factorN,
   where N is the root number), the 600x26 replacement data matrix;
* the Y score data set contains time series for the component objects;

proc score data=stand score=factout1 out=score1; var col1-col26;

* if score worked correctly, the new component time series should be
   mutually uncorrelated -- check the first 3 factors (we are such skeptics);

proc corr data=score1; var factor1-factor3;

* send a few of the component time series to proc spectra for
   periodogram spectral analysis;
* the data set "specout" contains spectral power ("P_*") for each
   included period, which is in months (600 months is max period);

proc spectra p adjmean data=score1 out=specout; var factor1-factor3;
proc print data=specout;

* ------------------------------------------------------------------------;
* (2) now do a rotated PCA on the objects;
* ------------------------------------------------------------------------;

* here, we truncate the PCA, keeping only the roots with eigenvalues
   greater than one, and then perform a Varimax rotation;
* check out the rotated "factor" (eigenvector) loadings and perform
   a PCA-based fuzzy clustering;

proc factor data=stand rotate=varimax; var col1-col26;

* what if we had NOT truncated -- would the clustering have
   been different?  Check it out;

proc factor data=stand N=26 score outstat=factoutr rotate=varimax; var col1-col26;

* ------------------------------------------------------------------------;
* (3) A traditional Euclidean-based cluster analysis;
* ------------------------------------------------------------------------;

* first, get the standardized data set back to divisions x months;
* now the months/obs will be called obs1-obs600;

proc transpose data=stand out=tempor;

* we need a division identifier, so make one in a DATA step;

data precips; set tempor;
 id=_N_;

* do an average linkage cluster analysis, requesting "pseudo" statistics;
* our cluster routine computes a 26x26 Euclidean distance matrix from the
   standardized 26x600 data set;

proc cluster data=precips method=average pseudo outtree=tree; var obs1-obs600;
id id;

* proc tree processes the output from cluster and prints a large, useless
   diagram unless NOPRINT is specified;
* the number of clusters kept is set with the NCLUSTERS= option;

proc tree data=tree noprint out=output nclusters=3; id id;

* sorting the output from tree for easy presentation;

proc sort data=output; by cluster id;

* print it out;

proc print uniform; id id;

* ------------------------------------------------------------------------;
* extra material after this line - remove asterisks to enable proc calls;
* ------------------------------------------------------------------------;

* above, we did a cluster analysis on the 26x600 standardized data set;
* we also could have performed the cluster analysis on the 26x26 V eigenvectors
   from proc factor... even after rotation... that's the same information
   so long as we don't __truncate__;
* to prove that, we take the output data set from the untruncated, varimax-rotated
   PCA on the objects above, extract the 26x26 eigenvector loading matrix
   (observations with _TYPE_='PATTERN'), and transpose it for input to cluster;
* in the first cluster proc, we computed a Euclidean distance (ED) matrix
   from a 600x26 data set.  In the second we computed the ED matrix from a
   26x26 data set.  But it told us the very same thing;

* proc score data=stand score=factoutr out=scorer;
*  var col1-col26;
* data fact1;
*  set factoutr;
*  if _type_='PATTERN';
* proc transpose data=fact1 out=fact2;
* proc print;
* proc cluster data=fact2 method=average pseudo outtree=tree1;
*  var factor1-factor26;