C162/C213 Fovell
Revised 9 January 2014
SAS (which once stood for "Statistical Analysis System") is an integrated collection of statistical and other procedures (PROCs) that we have installed on the Synoptic Lab Linux computers. SAS should be available from any of the lab workstations, but you may need to run a script first:
source /home/fovell/dosas.csh (if you are using tcsh)
OR
source /home/fovell/dosas.sh (if you are using bash)
If you have done this and SAS still does not work, please see Carl Evans or me.
SAS tasks are segregated into DATA and PROC steps. DATA steps read in data and perform manipulations on those data (transformations, rescalings, deletion of specified cases, etc.). PROC steps perform analyses, generate informational statistics, make plots and print listings. For example:
Procedure | Function |
REG | Linear regression of a given model |
MEANS | Computes summary statistics (mean, variance, max, min) |
UNIVARIATE | Computes distributional statistics, normality tests |
STANDARD | Transforms the mean and variance of data |
CORR | Computes Pearson product moment correlations |
PLOT | Makes 2D ASCII plots of data for the printer |
PRINT | Prints out a given data set |
FACTOR | Does principal components and factor analyses |
CLUSTER | Performs cluster analyses |
SAS is a programming language, and thus we need to adhere to certain syntax rules. The most important rule is: every SAS statement must end with a semicolon.
Forgetting the semicolon probably accounts for 75% of all SAS errors. The advantage of the semicolon terminator, however, is that you may put more than one SAS statement on a line, and you may also spread a single SAS statement over several lines without having to worry about continuation markers. There are no limits to the length of a SAS statement, and no column restrictions. Another important point about SAS syntax is that it is case insensitive.
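The short program below is a minimal sketch of these syntax rules, not part of the original handout. The data set name "demo", the variables a, b, total and half, and the two data lines are all made up for illustration.

```sas
* two statements on one line, each ended by its own semicolon;
DATA demo; INPUT a b;
* one statement spread over three lines, with no continuation marker;
total = a
      + b
      ;
* case does not matter, so TOTAL below is the same variable as total above;
Half = TOTAL / 2;
CARDS;
1 2
3 4
;
proc print data=demo;
run;
```

If the semicolon after "b" were left out, SAS would try to read the following lines as part of the same assignment statement, which is the typical way a missing semicolon produces a confusing error.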
A sample DATA step. Say you wish to read in a data set consisting of four variables, and need to transform a couple of them.
* a comment starts with an asterisk and terminates with a semicolon, may
span more than one line, and may go anywhere (but do not embed comments in your data sets);
* the DATA statement defines the data set name. Here, it is "example";
* in older versions of SAS the data set name could not exceed eight characters, a throwback to the IBM stone age (current versions allow up to 32);
DATA example;
* the INPUT statement tells SAS the variables to read in. Free format may be used
whenever there is one or more blanks separating each column, so the
columns need not line up;
INPUT Y X1 X2 X3;
* Transformations and creations of new variables would follow the INPUT statement;
* Y is temperature in Fahrenheit, so convert it to Celsius below;
Y = (Y - 32.)*5./9.; * standard algebraic order applies;
* Replace X3 with ln(X1) and create the new variable X4 = X1*X2;
X3 = LOG(X1); X4 = X1*X2;
* The CARDS statement tells SAS the data follow. (Does this statement
show the age of SAS or what?) Do NOT use semicolons in the data;
CARDS;
97 33 -45 -2
80 45 6 12
32 -3 12 5
It helps sometimes to use a single semicolon on an otherwise blank line
at the end of the data to tell SAS it has reached the end of the DATA step.
This is why your data cannot include semicolons.
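As a sketch of that advice, here is the same DATA step stripped of its comments and transformations, with the lone semicolon closing the data lines. The PROC PRINT and RUN statements are an addition for illustration and were not part of the example above.

```sas
DATA example;
INPUT Y X1 X2 X3;
CARDS;
97 33 -45 -2
80 45 6 12
32 -3 12 5
;
* the lone semicolon above marks the end of the data lines;
PROC PRINT DATA=example;
RUN;
```

Everything between CARDS and the semicolon line is read as data, which is why a semicolon anywhere in the data would cut the data set short.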
Sometimes you may wish to include character variables, for identification purposes. These variables are designated by the "$" sign following the variable name (as in NAME $). Note the space between the variable name and the dollar sign.
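A minimal sketch of a character variable follows. The data set name "sites", the variable names and all of the values are made up for illustration.

```sas
* name is a character variable because of the $ that follows it;
DATA sites;
INPUT name $ elev temp;
CARDS;
siteA 38 64.2
siteB 4 57.1
;
* name is left out of the summary statistics because it is not numeric;
PROC MEANS DATA=sites;
PROC PRINT DATA=sites;
RUN;
```

With free-format input like this, a character value is stored with a default length of eight characters, so longer identifiers would be truncated unless a length is declared.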
Sample PROC steps.
PROC MEANS DATA=example; VAR Y X1;
* the statement above tells SAS to get the data set "example" and compute summary
statistics on variables Y and X1. If you did not specify the data set name, SAS
uses the most recently created data set. If you do not specify which VARs to
process, SAS will use all numeric variables in the data set;
PROC REG; MODEL Y = X1 X2;
* this performs the simplest linear regression of Y on X1 and X2, yielding a minimum
of output. Because we did not specifically use the DATA= datasetname designator,
SAS uses the most recently created data set.;
PROC REG;
MODEL Y = X1 X2 X4 / P R NOINT; * options follow the / sign;
OUTPUT OUT=regout P=yhat R=resid;
* the statements above fit a model of Y = f(X1, X2, X4) and SAS has been asked to:
(1) Leave out the intercept term "NOINT",
(2) Compute and print out predicted values and residuals ("P R"),
(3) Save the predicted values and residuals into a new data set called "regout",
with the former called "yhat" and latter called "resid".;
* note this procedure creates a new output data set, which becomes the default
"open" dataset;
PROC PRINT; * this statement alone would print data set "regout" by default;
PROC PRINT DATA=example; * to print out the first created data set;
* to avoid mistakes, it is a good idea to always tell SAS which data set you want it
to operate on. So, use the DATA=datasetname option at every opportunity;
How to run SAS. SAS may be run interactively in full screen mode or in "batch" mode, from the Unix command line. Here is how to use batch mode.
options linesize=72; * otherwise, the listing will be 132 characters wide;
Some advanced stuff.
* this data set reads in station id number and twelve monthly temperature values;
* note the convenient shorthand T1-T12 reads in vars T1, T2, T3, ..., T12;
data temps;
input station T1-T12;
cards;
(data follows)
; * marks end of the temperature data;
* now read in precipitation data set for the same stations;
data precips;
input station P1-P12;
cards;
(data follows)
; * marks end of the precip data;
data combine;
merge temps precips;
by station;
* the new data set "combine" contains the contents of both data sets, and consists of
the variables: station, T1-T12 and P1-P12;
data original;
input Y X1 X2;
cards;
(data follows)
;
data revised;
set original;
Y = Y/100.;
* now two data sets exist, differing by how Y is scaled.;
proc standard data=somedataset M=0 S=1 OUT=newdataset;
VAR variablelist;
* the new data set "newdataset" contains the contents of "somedataset" but the
variables specified in the variablelist have been transformed to zero mean (M=0)
and unit standard deviation (S=1);
* read in year and mean temperature for some station;
data temps;
input year temp @@;
cards;
1950 56.1 1951 54.3 1952 58.5 1953 59.1
1954 60.1 1955 52.1 1956 52.0 1957 60.0
etc.
;
Y = LOG(X); * natural logarithm;
Y = LOG10(X); * log base 10;
Y = EXP(X); Y = COS(X); Y = SIN(X); Y = ABS(X); Y = SQRT(X); * obvious!;
Y = ATAN(X); Y = ARCOS(X); Y = ARSIN(X); * inverse trig functions;
PROC SORT DATA=dset1; BY station;
* alters data set "dset1" to be sorted by the station variable value;
PROC SORT DATA=dset1 OUT=sort1; BY station;
* the sorted data set is called "sort1" and the original data set is unaltered;
TITLE this statement is placed atop each printed page of the xxxx.lst file;
* this statement simply plots Y vs X1;
PROC PLOT DATA=example; PLOT Y*X1;
* here, we give SAS the symbol to use in the plotting;
PROC PLOT DATA=example; PLOT Y*X1='+';
* this statement makes two plots: Y vs X1 and Y vs X2 but does not overlay them;
PROC PLOT DATA=example; PLOT Y*(X1 X2);
* that could also have been written as PLOT Y*X1 Y*X2;
* this overlays the plots of Y vs X1 and X2;
PROC PLOT DATA=example; PLOT Y*(X1 X2) / OVERLAY;
DATA example;
INPUT casenum Y X1 X2 X3;
if casenum = 42 then delete; * deletes case number 42;
if X1 >= 4.02 then delete; * deletes every case in which X1 equals or exceeds 4.02;
if X1 < 4.02; * same effect as statement above, since the default action is KEEP;
* example of removing variables from original data set;
DATA original;
INPUT Y X1 X2 X3;
* say you create X4=X2/X3 and do not need X2 and X3 anymore;
X4=X2/X3;
DROP X2 X3;
CARDS;
(data follows)
;
* example of preventing variables from carrying forward to a newly created data set;
DATA new;
set example(drop=X2 X3);
DATA example;
INPUT Y X1 X2 X3;
casenum=_N_; * _N_ is the built-in SAS case counter;
if casenum = 42 then delete;
CARDS;
(data follows)
;
INPUT A 3-4 B 10-12 C 13-20; * giving column numbers;
INPUT A B 10-12 C 13-20; * you can mix free and fixed formats;
INPUT @3 A 2. B 4.; * start at column 3, read in 2 cols (numbers 3 and 4) into A, and
next 4 cols into B. Note periods after "2" and "4";
INPUT Y X1 #2 X2 X3; * two lines per case, with X2 and X3 being on second line;
PROC TRANSPOSE DATA=input OUT=output;
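The statement above gives only the bare form. As a hedged sketch of what a fuller call might look like, the program below reshapes the monthly temperature data set "temps" (station, T1-T12) from the merge example earlier, so that each station and month gets its own row. The output names "tsorted" and "tlong" are made up for illustration, and the BY and VAR statements are one common way to control the result, not the only one.

```sas
* sort first, because BY-group processing expects sorted input;
PROC SORT DATA=temps OUT=tsorted; BY station;
* turn the twelve columns T1-T12 into twelve rows per station;
PROC TRANSPOSE DATA=tsorted OUT=tlong;
BY station;
VAR T1-T12;
RUN;
* tlong should hold station, _NAME_ (T1 through T12) and COL1 (the value);
PROC PRINT DATA=tlong;
RUN;
```

Without a BY statement, PROC TRANSPOSE instead turns every observation into a column (COL1, COL2, and so on), which is the plain rows-to-columns flip.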
* say you wish to get a few variables from data set "example" into list format for
input to some other app (maybe a graphing app);
* first, set the "pagesize" to some huge number so your data do not become
interspersed with page banners;
OPTIONS pagesize=9999;
DATA _null_;
SET example;
PUT X Y YHAT RESID; * you can also specify output formats;
* now look for your data having been written out to the xxxx.log file;
Page created September, 1998, by Robert Fovell