How to use SAS

C162/C213 Fovell

Revised 9 January 2014

SAS (which once stood for "Statistical Analysis System") is an integrated collection of statistical and other procedures (PROCs) that we have installed on the Synoptic Lab Linux computers. SAS should be available from any of the lab workstations, but you may need to run a script first:

source /home/fovell/dosas.csh (if you are using tcsh)

OR

source /home/fovell/dosas.sh (if you are using bash)

If you have done this and SAS still does not work, please see Carl Evans or me.

SAS tasks are segregated into DATA and PROC steps. DATA steps read in data and perform manipulations on those data (transformations, rescalings, deletion of specified cases, etc.). PROC steps perform analyses, generate informational statistics, make plots and print listings. For example:

Procedure     Function
REG           Linear regression of a given model
MEANS         Computes summary statistics (mean, variance, max, min)
UNIVARIATE    Computes distributional statistics, normality tests
STANDARD      Transforms the mean and variance of data
CORR          Computes Pearson product moment correlations
PLOT          Makes 2D ASCII plots of data for the printer
PRINT         Prints out a given data set
FACTOR        Does principal components and factor analyses
CLUSTER       Performs cluster analyses

SAS is a programming language, and thus we need to adhere to certain syntax rules. The most important rule is:

virtually all SAS statements terminate in a semicolon

Forgetting the semicolon probably accounts for 75% of all SAS errors. The advantage of the semicolon terminator, however, is that you may put more than one SAS statement on a line, and you may also spread a single SAS statement over more than one line without having to worry about continuation markers. There are no limits to the length of a SAS statement, and no column restrictions. Another important point about SAS syntax is that it is case insensitive.
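
For instance, the following sketch (the data set name "demo" and its two data lines are made up for illustration) puts two statements on one line and then spreads a single statement over several lines, typed in lower case:

    * two statements sharing one line;
    DATA demo; INPUT Y X1;
    CARDS;
    1 2
    3 4
    ; * a lone semicolon marks the end of the data;

    * one statement spread over three lines and typed in lower case;
    proc print
        data=demo
    ;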

A sample DATA step. Say you wish to read in a data set consisting of four variables, and need to transform a couple of them.

	* a comment starts with an asterisk and terminates with semicolon, may 
	  span more than one line, and may go anywhere (but don't embed comments in your data sets);
	* the DATA statement defines the data set name.  Here, it is "example";
	* in old versions of SAS the data set name could not exceed eight characters, a throwback to the IBM stone age (current versions allow up to 32);
	
	DATA example; 
	
	* the INPUT statement tells SAS the variables to read in.  Free format may be used
		whenever there is one or more blanks separating each column, so the 
		columns need not line up;

	INPUT Y X1 X2 X3;
	
	* Transformations and creations of new variables would follow the INPUT statement;
	* Y is temperature in Fahrenheit, so convert it to Celsius below;
	
	Y = (Y - 32.)*5./9.; * standard algebraic order applies;
	
	* Replace X3 with ln(X1) and create the new variable X4 = X1*X2;
	
	X3 = LOG(X1); X4 = X1*X2;

	* The CARDS statement tells SAS the data follow.  (Does this statement 
	  show SAS' age or what?)  Do NOT use semicolons in the data;
	
	CARDS;
	97 33  -45       -2
	80           45     6      12
             32   -3    12     5
It sometimes helps to put a single semicolon on an otherwise blank line after the last line of data, to tell SAS that the data (and the DATA step) have ended. This is why your data cannot include semicolons.

Sometimes you may wish to include character variables, for identification purposes. These variables are designated by the "$" sign following the variable name (as in NAME $). Note the space between the variable name and the dollar sign.
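
A minimal sketch of a data set with a character variable follows. The data set name "cities", the station names and the numbers are invented for illustration. With free-format input, a character value should contain no embedded blanks, and by default only its first eight characters are kept.

    DATA cities;
    INPUT NAME $ Y X1;
    CARDS;
    Fresno 97 33
    Eureka 60 45
    ;
    PROC PRINT DATA=cities; * NAME prints as text alongside Y and X1;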

Sample PROC steps.

	PROC MEANS DATA=example; VAR Y X1;	
    
	* the statement above tells SAS to get the data set "example" and compute summary
	    statistics on variables Y and X1.  If you did not specify the data set name, SAS
	    uses the most recently created data set.  If you do not specify which VARs to
	    process, SAS will use all numeric variables in the data set;

	PROC REG; MODEL Y = X1 X2;

	* this performs the simplest linear regression of Y on X1 and X2, yielding a minimum 
	    of output.  Because we did not specifically use the DATA= datasetname designator,
	    SAS uses the most recently created data set.;

	PROC REG;
		MODEL Y = X1 X2 X4 / P R NOINT;   * options follow the / sign;
		OUTPUT OUT=regout P=yhat R=resid;

	* the statements above fit a model of Y = f(X1, X2, X4) and SAS has been asked to:
		(1) Leave out the intercept term "NOINT",
		(2) Compute and print out predicted values and residuals ("P R"),
		(3) Save the predicted values and residuals into a new data set called "regout", 
		    with the former called "yhat" and latter called "resid".;
		    
	* note this procedure creates a new output data set, which becomes the default 
	    "open" dataset;

	PROC PRINT; * this statement alone would print data set "regout" by default;
	
	PROC PRINT DATA=example; * to print out the first created data set;

	* to avoid mistakes, it is a good idea to always tell SAS which data set you want it
	    to operate on.  So, use the DATA=datasetname statement at every opportunity;

How to run SAS. SAS may be run interactively in full screen mode or in "batch" mode, from the Unix command line. Here is how to use batch mode.

  1. Save your SAS statements into a file with the .sas extension. Example: cloud.sas
  2. To run SAS, simply issue the Unix command: sas cloud
  3. When SAS runs, it creates two new files, cloud.lst and cloud.log. If these files already exist, they are overwritten! cloud.lst contains the SAS output. cloud.log is the place to look for error messages (if any) and for where in the program they occurred.
  4. To make the SAS output listing file conform to a standard size terminal window, use this statement as the first line in your SAS program:

    options linesize=72; * otherwise, the listing will be 132 characters wide;
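
Putting these steps together, a complete cloud.sas might look like the sketch below. The variable names and data values are invented for illustration.

    options linesize=72;
    DATA cloud;
    INPUT temp cover;
    CARDS;
    61.5 0.30
    58.2 0.75
    64.0 0.10
    ;
    PROC MEANS DATA=cloud; VAR temp cover;
    PROC PLOT DATA=cloud; PLOT temp*cover;

Running "sas cloud" on this file should leave the PROC MEANS and PROC PLOT output in cloud.lst and the record of the run in cloud.log.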

Some advanced stuff.

  1. Merging two or more data sets:
    		* this data set reads in station id number and twelve monthly temperature values;	
    		* note the convenient shorthand T1-T12 reads in vars T1, T2, T3, ..., T12;
    		
    		data temps; input station T1-T12; cards;		
    			(data follows)
    		; * marks end of the temperature data;
    		
    		* now read in precipitation data set for the same stations;
    		
    		data precips; input station P1-P12; cards;		
    			(data follows)
    		; * marks end of the precip data;
    		
    		data combine; merge temps precips; by station;
    		
    		* the new data set "combine" contains the contents of both data sets, and consists of
    			the variables: station, T1-T12 and P1-P12;
    
  2. Transforming data after the data set has been created
    		data original; input Y X1 X2; cards;		
    			(data follows)
    		;
    		data revised; set original; Y = Y/100.;
    		
    		* now two data sets exist, differing by how Y is scaled.;
    
  3. Standardizing data to specified mean and standard deviation
    		proc standard data=somedataset M=0 S=1 OUT=newdataset; VAR variablelist;
    		
    		* the new data set "newdataset" contains the contents of "somedataset" but 
    		     the variables specified in the variablelist have been transformed to zero
    		     mean (M=0) and unit standard deviation (S=1);
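
    A concrete sketch, assuming the data set "example" from the sample DATA step above (the output name "zscores" is made up):

            proc standard data=example M=0 S=1 OUT=zscores; VAR Y X1 X2;
            proc means data=zscores; VAR Y X1 X2;
            * PROC MEANS should now report a mean of zero and a standard
              deviation of one for each of Y, X1 and X2 (to within rounding);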
    
  4. Putting more than one observation in the data set onto one line
    		* read in year and mean temperature for some station;
    		
    		data temps; input year temp @@; cards;
    
    		1950 56.1  1951 54.3  1952 58.5  1953 59.1
    		1954 60.1  1955 52.1  1956 52.0  1957 60.0
    		etc.
    		;
    
  5. SAS functions that may be used between the INPUT and CARDS statements generally mirror their FORTRAN counterparts.
    		Y = LOG(X); * natural logarithm;
    		Y = LOG10(X); * log base 10;
    		Y = EXP(X); Y = COS(X); Y = SIN(X); Y = ABS(X); Y = SQRT(X); * obvious!;
    		Y = ATAN(X); Y = ARCOS(X); Y = ARSIN(X); * inverse trig functions;
    
  6. Before two or more data sets can be merged, each must be sorted by the common variable. If they are not already sorted by that variable, or if SAS complains for some reason, then you have to do the sort yourself, using the SORT procedure.
    		PROC SORT DATA=dset1; BY station; 
    		* alters data set "dset1" to be sorted by the station variable value;
    		
    		PROC SORT DATA=dset1 OUT=sort1; BY station; 
    		* the sorted data set is called "sort1" and the original data set is unaltered;
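
    For example, to make sure the "temps" and "precips" data sets from item 1 are in station order before merging them:

            PROC SORT DATA=temps; BY station;
            PROC SORT DATA=precips; BY station;
            DATA combine; MERGE temps precips; BY station;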
    
  7. The TITLE statement may be used anywhere
    		TITLE this statement is placed atop each printed page of the xxxx.lst file;
    
  8. The PLOT procedure may be used to make multiple plots or superimpose plots
    		* this statement simply plots Y vs X1;
    		PROC PLOT DATA=example; PLOT Y*X1;
    
    		* here, we give SAS the symbol to use in the plotting;
    		PROC PLOT DATA=example; PLOT Y*X1='+';
    
    		* this statement makes two plots: Y vs X1 and Y vs X2 but doesn't overlay them;
    		PROC PLOT DATA=example; PLOT Y*(X1 X2);
    		* that could also have been written as PLOT Y*X1 Y*X2;
    
    		* this overlays the plots of Y vs X1 and X2;
    		PROC PLOT DATA=example; PLOT Y*(X1 X2) / OVERLAY;
    
  9. Deleting specific cases (good for when you need to remove designated observations when they are suspected to be outliers or overly influential on the model).
    		DATA example;
    		INPUT casenum Y X1 X2 X3;
    		if casenum = 42 then delete; * deletes case number 42;
    		if X1 >= 4.02 then delete; * deletes every case in which X1 equals or exceeds 4.02;
    		if X1 < 4.02; * same effect as the statement above, since an IF with no THEN keeps only the cases that satisfy it;
    
  10. Removing variables from a data set.
    		* example of removing variables from original data set;
    		DATA original; INPUT Y X1 X2 X3;
    		* say you create X4=X2/X3 and do not need X2 and X3 anymore;
    		X4=X2/X3;
    		DROP X2 X3; CARDS;
    
    		* example of preventing variables from carrying forward to a newly created data set;
    
    		DATA new; set example(drop=X2 X3);
    
  11. Assigning case numbers. Good for when you want to designate which cases to drop and don't have casenumber as an input variable.
    		DATA example;
    		INPUT Y X1 X2 X3;
    		casenum=_N_; * _N_ is SAS's built-in case counter;
    		if casenum = 42 then delete;
    		CARDS;
    
  12. Formatted input examples.
    		INPUT A 3-4 B 10-12 C 13-20; * giving column numbers;
    		INPUT A B 10-12 C 13-20; * you can mix free and fixed formats;
    		INPUT @3 A 2. B 4.; * start at column 3, read in 2 cols (numbers 3 and 4)
    			into A, and next 4 cols into B.  Note periods after "2" and "4";
    
  13. More than one line per observation.
    		INPUT Y X1 #2 X2 X3; * two lines per case, with X2 and X3 being on
    			second line;
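
    A sketch with made-up data, in which each case occupies two lines:

            DATA twoline;
            INPUT Y X1 #2 X2 X3;
            CARDS;
            10 1
            2 3
            20 4
            5 6
            ;
            * the first case has Y=10, X1=1, X2=2, X3=3 and the second has Y=20, X1=4, X2=5, X3=6;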
    
  14. Transposing data sets. SAS reads in data sets as cases-by-variables. If you then need to operate on them as variables-by-cases for some reason, use the TRANSPOSE procedure. Careful: make sure you know what the names of your new "variables" are.
    		PROC TRANSPOSE DATA=input OUT=output;
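
    A sketch, assuming the data set "example" from earlier (the output name "flipped" is made up). By default, PROC TRANSPOSE stores the old variable names in a variable called _NAME_ and calls the new variables COL1, COL2, and so on, one for each original case:

            PROC TRANSPOSE DATA=example OUT=flipped; VAR Y X1 X2;
            PROC PRINT DATA=flipped;
            * "flipped" has three cases (one each for Y, X1 and X2) and the variables _NAME_, COL1, COL2, ...;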
    
  15. Getting your output into another application
            * say you wish to get a few variables from data set "regout" into list format for input 
                into some other app (maybe a graphing app);
            * first, set the "pagesize" to some huge number so your data don't 
                become interspersed with page banners;
            OPTIONS pagesize=9999;
            DATA _null_; SET regout;
                PUT X1 Y yhat resid;
            
            * you can also specify output formats;
            * now look for your data having been written out to the xxxx.log file;         
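
    A sketch of formatted output, assuming the data set "regout" created by PROC REG earlier. The choice of the 8.3 format is only an example:

            * the format 8.3 writes each value in a field 8 columns wide with 3 decimal places;
            DATA _null_; SET regout;
                PUT Y 8.3 yhat 8.3 resid 8.3;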
    
  16. SAS contains a powerful matrix programming language which is called using PROC IML. Good for operations on entire matrices at once. You could write most of PROC REG with a very few statements in IML. (A good exercise that proves to yourself you know what's going on.)


Page created September, 1998, by Robert Fovell
