C162/C213 Fovell

Revised 9 January 2014

SAS (which once stood for "Statistical Analysis System") is an integrated collection of
statistical and other procedures (PROCs) that we have installed on the Synoptic Lab Linux computers. **SAS should be available from any of the lab workstations**,
but you may need to run a script first:

`source /home/fovell/dosas.csh` (if you are using tcsh)

OR

`source /home/fovell/dosas.sh` (if you are using bash)

If you have done this and SAS still does not work, please see Carl Evans or me.

SAS tasks are segregated into **DATA** and **PROC** steps.
DATA steps read in data and perform manipulations on those data (transformations, rescalings,
deletion of specified cases, etc.). PROC steps perform analyses, generate informational statistics,
make plots and print listings. For example:

| Procedure | Function |
|---|---|
| REG | Linear regression of a given model |
| MEANS | Computes summary statistics (mean, variance, max, min) |
| UNIVARIATE | Computes distributional statistics, normality tests |
| STANDARD | Transforms the mean and variance of data |
| CORR | Computes Pearson product moment correlations |
| PLOT | Makes 2D ASCII plots of data for the printer |
| PRINT | Prints out a given data set |
| FACTOR | Does principal components and factor analyses |
| CLUSTER | Performs cluster analyses |

SAS is a programming language, and thus we need to adhere to certain syntax rules. The most important rule is:

**Every SAS statement must end with a semicolon.**

Forgetting the semicolon probably accounts for 75% of all SAS errors.
The advantage of the semicolon terminator, however, is that you may put more than one
SAS statement on a line,
and you may also spread a single SAS statement over more than one line without having to
worry about continuation markers.
There are no limits to the length of a SAS statement, and no column restrictions.
Another important point about SAS syntax is that it is **case
insensitive**.
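
The syntax rules above can be seen together in a short sketch. The data set name `demo`, the variable names, and the data values are made up for illustration only. Case insensitivity applies to keywords, data set names, and variable names.

```sas
* two statements share one line;
DATA demo; INPUT a b;
* one statement is spread over three lines, with no continuation marker;
total = a
      + b
      + 10;
* lower, upper, and mixed case are all the same to SAS;
cards;
1 2
3 4
;
Proc Print DATA=Demo; VAR A B TOTAL;
```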

*A sample DATA step.* Say you wish to read in a data set consisting of four variables, and need
to transform a couple of them.

    * a comment starts with an asterisk and terminates with semicolon, may span
      more than one line, and may go anywhere (but don't embed comments in
      your data sets);
    * the DATA statement defines the data set name. Here, it is "example";
    * in SAS releases before version 7 the data set name cannot exceed eight
      characters (later releases allow 32), a throwback to the IBM stone age;
    DATA example;
    * the INPUT statement tells SAS the variables to read in. Free format may
      be used whenever there is one or more blanks separating each column, so
      the columns need not line up;
    INPUT Y X1 X2 X3;
    * Transformations and creations of new variables would follow the INPUT
      statement;
    * Y is temperature in Fahrenheit, so convert it to Celsius below;
    Y = (Y - 32.)*5./9.; * standard algebraic order applies;
    * Replace X3 with ln(X1) and create a new variable X4 = X1*X2;
    X3 = LOG(X1);
    X4 = X1*X2;
    * The CARDS statement tells SAS the data follow. (Does this statement show
      SAS' age or what?) Do NOT use semicolons in the data;
    CARDS;
    97 33 -45 -2
    80 45 6 12
    32 -3 12 5

Sometimes you may wish to include character variables, for identification purposes. These variables are designated by the "$" sign following the variable name (as in NAME $). Note the space between the variable name and the dollar sign.
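
As a minimal sketch of a character variable in an INPUT statement, the step below reads a name followed by two numeric values. The data set name `stations`, the station names, and the numbers are hypothetical and chosen only for illustration.

```sas
DATA stations;
* NAME is read as a character variable because of the "$" that follows it;
INPUT NAME $ Y X1;
CARDS;
UCLA 97 33
LAX 80 45
;
PROC PRINT DATA=stations;
```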

*Sample PROC steps.*

    PROC MEANS DATA=example;
    VAR Y X1;
    * the statements above tell SAS to get the data set "example" and compute
      summary statistics on variables Y and X1. If you do not specify the data
      set name, SAS uses the most recently created data set. If you do not
      specify which VARs to process, SAS will use all numeric variables in the
      data set;

    PROC REG;
    MODEL Y = X1 X2;
    * this performs the simplest linear regression of Y on X1 and X2, yielding
      a minimum of output. Because we did not specifically use the
      DATA=datasetname designator, SAS uses the most recently created data set;

    PROC REG;
    MODEL Y = X1 X2 X4 / P R NOINT; * options follow the / sign;
    OUTPUT OUT=regout P=yhat R=resid;
    * the statements above fit a model of Y = f(X1, X2, X4) and SAS has been
      asked to: (1) Leave out the intercept term "NOINT", (2) Compute and print
      out predicted values and residuals ("P R"), (3) Save the predicted values
      and residuals into a new data set called "regout", with the former called
      "yhat" and the latter called "resid";
    * note this procedure creates a new output data set, which becomes the
      default "open" data set;

    PROC PRINT; * this statement alone would print data set "regout" by default;
    PROC PRINT DATA=example; * to print out the first created data set;
    * to avoid mistakes, it is a good idea to always tell SAS which data set
      you want it to operate on. So, use the DATA=datasetname option at every
      opportunity;

*How to run SAS.* SAS may be run interactively in full screen mode or
in "batch" mode, from the Unix command line. Here is how to use batch
mode.

- Save your SAS statements into a file with the `.sas` extension. Example: `cloud.sas`
- To run SAS, simply issue the Unix command: `sas cloud`
- When SAS runs, it creates two new files, `cloud.lst` and `cloud.log`. **If these files already existed, they are overwritten!** `cloud.lst` contains the SAS output. `cloud.log` is where to find where errors (if any) occurred.
- To make the SAS output listing file conform to a standard size terminal window, use this
statement as the first line in your SAS program:
`options linesize=72; * otherwise, the listing will be 132 characters wide;`

*Some advanced stuff.*

- Merging two or more data sets:

      * this data set reads in station id number and twelve monthly
        temperature values;
      * note the convenient shorthand T1-T12 reads in vars T1, T2, T3, ..., T12;
      data temps;
      input station T1-T12;
      cards;
      (data follows)
      ; * marks end of the temperature data;
      * now read in precipitation data set for the same stations;
      data precips;
      input station P1-P12;
      cards;
      (data follows)
      ; * marks end of the precip data;
      data combine;
      merge temps precips;
      by station;
      * the new data set "combine" contains the contents of both data sets,
        and consists of the variables: station, T1-T12 and P1-P12;

- Transforming data after the data set has been created

      data original;
      input Y X1 X2;
      cards;
      (data follows)
      ;
      data revised;
      set original;
      Y = Y/100.;
      * now two data sets exist, differing by how Y is scaled.;

- Standardizing data to specified mean and standard deviation

      proc standard data=somedataset M=0 S=1 OUT=newdataset;
      VAR variablelist;
      * the new data set "newdataset" contains the contents of "somedataset"
        but the variables specified in the variablelist have been transformed
        to zero mean (M=0) and unit standard deviation (S=1);

- Putting more than one observation in the data set onto one line

      * read in year and mean temperature for some station;
      data temps;
      input year temp @@;
      cards;
      1950 56.1 1951 54.3 1952 58.5 1953 59.1
      1954 60.1 1955 52.1 1956 52.0 1957 60.0
      etc.
      ;

- SAS functions that may be used between the INPUT and CARDS statements generally mirror
their FORTRAN counterparts.

      Y = LOG(X); * natural logarithm;
      Y = LOG10(X); * log base 10;
      Y = EXP(X); Y = COS(X); Y = SIN(X); Y = ABS(X); Y = SQRT(X); * obvious!;
      Y = ATAN(X); Y = ARCOS(X); Y = ARSIN(X); * inverse trig functions;

- Before two or more data sets can be merged, they have to be sorted by some common variable. If
they are not already sorted by that variable, or if SAS complains for some reason,
then you have to do the sort yourself, using the SORT procedure.

      PROC SORT DATA=dset1;
      BY station;
      * alters data set "dset1" to be sorted by the station variable value;
      PROC SORT DATA=dset1 OUT=sort1;
      BY station;
      * the sorted data set is called "sort1" and the original data set is
        unaltered;

- The TITLE statement may be used anywhere

      TITLE this statement is placed atop each printed page of the xxxx.lst file;

- The PLOT procedure may be used to make multiple plots or superimpose plots

      * this statement simply plots Y vs X1;
      PROC PLOT DATA=example;
      PLOT Y*X1;
      * here, we give SAS the symbol to use in the plotting;
      PROC PLOT DATA=example;
      PLOT Y*X1='+';
      * this statement makes two plots: Y vs X1 and Y vs X2 but doesn't
        overlay them;
      PROC PLOT DATA=example;
      PLOT Y*(X1 X2);
      * that could also have been written as PLOT Y*X1 Y*X2;
      * this overlays the plots of Y vs X1 and X2;
      PROC PLOT DATA=example;
      PLOT Y*(X1 X2) / OVERLAY;

- Deleting specific cases (useful when you need to remove designated observations that
are suspected of being outliers or overly influential on the model).

      DATA example;
      INPUT casenum Y X1 X2 X3;
      if casenum = 42 then delete; * deletes case number 42;
      if X1 >= 4.02 then delete; * deletes every case in which X1 equals or
                                   exceeds 4.02;
      if X1 < 4.02; * same effect as statement above, since the default
                      action is KEEP;

- Removing variables from a data set.

      * example of removing variables from original data set;
      DATA original;
      INPUT Y X1 X2 X3;
      * say you create X4=X2/X3 and do not need X2 and X3 anymore;
      X4=X2/X3;
      DROP X2 X3;
      CARDS;

      * example of preventing variables from carrying forward to a newly
        created data set;
      DATA new;
      set example(drop=X2 X3);

- Assigning case numbers. Useful when you want to designate which cases to drop and
do not have a case number as an input variable.

      DATA example;
      INPUT Y X1 X2 X3;
      casenum=_N_; * _N_ is SAS's built-in case counter;
      if casenum = 42 then delete;
      CARDS;

- Formatted input examples.

      INPUT A 3-4 B 10-12 C 13-20; * giving column numbers;
      INPUT A B 10-12 C 13-20; * you can mix free and fixed formats;
      INPUT @3 A 2. B 4.; * start at column 3, read in 2 cols (numbers 3 and 4)
                            into A, and next 4 cols into B. Note periods after
                            "2" and "4";

- More than one line per observation.

      INPUT Y X1 #2 X2 X3; * two lines per case, with X2 and X3 being on
                             second line;

- Transposing data sets. SAS reads in data sets as cases-by-variables.
If you then need to operate on them as variables-by-cases for some reason, use
the TRANSPOSE procedure.
Careful: make sure you know what the names of your new "variables" are.

      PROC TRANSPOSE DATA=input OUT=output;

- Getting your output into another application

      * say you wish to get a few variables from data set "example" into list
        format for input to some other app (maybe a graphing app);
      * first, set the "pagesize" to some huge number so your data don't
        become interspersed with page banners;
      OPTIONS pagesize=9999;
      DATA _null_;
      SET example;
      PUT X Y YHAT RESID; * you can also specify output formats;
      * now look for your data having been written out to the xxxx.log file;

- SAS contains a powerful matrix programming language which is called using PROC IML.
Good for operations on entire matrices at once.
You could write most of PROC REG with a very few statements in IML. (A
good exercise that proves to yourself you know what's going on.)
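
As a starting point for that exercise, here is a minimal sketch of ordinary least squares in IML. It is not a reproduction of what PROC REG does internally. It assumes the data set `example` from the DATA step above, with more cases than coefficients, and the matrix names `X`, `yvec`, `b`, `yhat` and `resid` are chosen here for illustration.

```sas
* a sketch only: least squares fit of Y = b0 + b1*X1 + b2*X2;
PROC IML;
USE example;                    /* open the data set for reading */
READ ALL VAR {X1 X2} INTO X;    /* predictors become the columns of X */
READ ALL VAR {Y} INTO yvec;     /* response becomes a column vector */
CLOSE example;
n = NROW(X);
X = J(n, 1, 1) || X;            /* prepend a column of ones for the intercept */
b = SOLVE(T(X)*X, T(X)*yvec);   /* solve the normal equations (X'X) b = X'y */
yhat = X*b;                     /* predicted values */
resid = yvec - yhat;            /* residuals */
PRINT b;
QUIT;
```

Comparing `b` against the parameter estimates printed by `PROC REG; MODEL Y = X1 X2;` is one way to check the result.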

*Page created September, 1998, by Robert Fovell*