When calculating descriptive statistics on a table of data, we often need to produce different statistics for the different measurements contained within. We can create any number of output data sets, each containing different statistics from a single call to the summary procedure. This article shows you how.
Indeed, either of Proc MEANS or Proc SUMMARY can produce the functionality demonstrated here, as under the covers, both procedures are one and the same. It is simply that each executes with different options by default.
Consider the following example:
proc summary data=sashelp.shoes;
types region
region*subsidiary;
class region subsidiary;
var stores sales returns;
output out=total_stores n(stores)=
sum(stores)=
/ autoname autolabel;
output out=sales_analysis
n(sales returns)=
mean(sales returns)=
std(sales returns)=
min(sales returns)=
max(sales returns)=
/ autoname autolabel;
run;
Submitting this code to SAS produces the following log, which illustrates the two output tables generated:
NOTE: There were 395 observations read from the data set
SASHELP.SHOES.
NOTE: The data set WORK.TOTAL_STORES has 63 observations and 6
variables.
NOTE: The data set WORK.SALES_ANALYSIS has 63 observations and
14 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 0.04 seconds
cpu time 0.06 seconds
The procedure statements are explained as follows:
proc summary data=sashelp.shoes;
The PROC statement calls the procedure and reads the SHOES data set from the SASHELP library. This table is available to all readers irrespective of SAS version and platform.
types region
region*subsidiary;
The TYPES statement overrides the default behaviour to summarise all combinations of classification variables. In this example groups will be formed for the unique values within REGION and the cross of values from REGION and SUBSIDIARY columns.
class region subsidiary;
The CLASS statement defines the columns whose values are used to form groups.
var stores sales returns;
The VAR statement defines which variables will be analysed. Not all variables need to be used on OUTPUT statements, and each variable may be used more than once.
output out=total_stores n(stores)=
sum(stores)=
/ autoname autolabel;
The first OUTPUT statement (above) defines the TOTAL_STORES table. It requests the N and SUM statistics are calculated for the STORES variable. Options AUTONAME and AUTOLABEL request the procedure creates unique and meaningful column names and labels for the results.
output out=sales_analysis
n(sales returns)=
mean(sales returns)=
std(sales returns)=
min(sales returns)=
max(sales returns)=
/ autoname autolabel;
The second OUTPUT statement defines the SALES_ANALYSIS table. Various statistics are requested, N through to MAX, for two variables, in this case SALES and RETURNS are analysed. Similarly to the previous example, the AUTONAME and AUTOLABEL options are applied, removing the programmers need to specify each output column name.
run;
The RUN statement marks the boundary of the procedure step and causes execution at this point.
No comments:
Post a Comment