The APPEND procedure (and APPEND statement of the DATASETS procedure) is an efficient method for concatenating observations from a smaller data set to a larger data set. The BASE= data set option is reserved for the larger of the two data sets with the DATA= option for the smaller data set. Essentially, the APPEND procedure avoids reading any observations in the BASE= data set by positioning the record pointer at the end of the BASE= data set. Each observation from the smaller data set is then applied one at a time to the end of the BASE= data set. In the first example, the BASE= data set identifies a larger data set called MASTER and the DATA= data set identifies the smaller TXN data set.
Code:
PROC APPEND
BASE=master
DATA=txn;
run;
When two or more data sets need to be concatenated, multiple APPEND procedures are issued. In the next example, two separate PROC APPEND steps are specified to concatenate the two smaller data sets (TXN1 and TXN2) at the end of the larger BASE= (MASTER) data set.
Code:
PROC APPEND
BASE=master
DATA=txn1;
RUN;
PROC APPEND
BASE=master
DATA=txn2;
RUN;
Forcing the Concatenation of Data Sets
When one or more variables in the input data set, represented by the DATA= option, are not present in the BASE= data set, or the variables in the input data set are not of the same type as the variables in the BASE= data set, an optional FORCE option can be specified on the PROC APPEND statement to force the concatenation of the data sets. Using the FORCE option also prevents an error from being generated in these situations. The next example illustrates the FORCE option being specified in the concatenation of the TXN data set to the MASTER data set.
Code:
PROC APPEND
BASE=master
DATA=txn
FORCE;
RUN;
Selecting Observations for Concatenation
By default, all the rows of an input data set are typically concatenated to the end of the base data set. But when only selected rows of an input data set are desired, an optional WHERE clause can be specified for the associated input data set as a data set option. This is accomplished by specifying an optional WHERE clause. The WHERE clause defines the logical conditions that control which observations will be selected from an input data set, and ultimately concatenated to the end of the base data set. In the next example, a WHERE clause is specified to select only ‘PG’ and ‘PG13’ movies from the input data set for concatenation to the end of the base data set.
Code:
PROC APPEND
BASE=master
DATA=txn (WHERE=(rating in ('PG', 'PG-13')));
RUN;
Designating the Master
It is not at all a given that BASE= should point to the longer of the two data sets and DATA= to the shorter. One might do the opposite and use the shorter as the base
§ because one has been established as the master and the other is merely a container for a subset
§ to preserve structural properties (such as variable order)
§ because the two are in different libraries and the concatenation belongs in one of those and not the other
§ to maintain BY group ordering
§ to preserve a name for the concatenation without the need to invoke PROC DATASETS to swap names
If none of these concerns arise, and one is indifferent about the order, and one knows which data set is the larger of the two, it does make sense to reduce I/O by making the larger data set the base.
Conclusion
The APPEND procedure is a powerful tool for SAS users to use for data set concatenation purposes. Users can also concatenate unlike data sets using the FORCE option on the PROC APPEND statement as well as select which observations are captured from the input data set using a WHERE clause and concatenated to the end of the base data set. Users are encouraged to explore the many capabilities the APPEND procedure (or the APPEND statement of the DATASETS procedure) has to offer.
No comments:
Post a Comment