Warren Repole

Don't Be a SAS® Dinosaur: Modernize Your SAS Programs
by Warren Repole

Combine Data: Fuzzy Merges


Scenario:

You need to combine two SAS data sets whose keys cannot be matched exactly. Specifically, the value of the key variable in the first data set differs from the corresponding variable in the second data set or falls between a range of values stored in two variables in the second data set.

A match merge only handles exact value matches.


The old way: Nested SET Statements

The original approach is to perform a loop within the DATA step to read all potential matching observations from the secondary data set for each observation of the primary data set.

Download this program
data pairs_dstep;
  set sashelp.class(keep=Name Age);
  do i=1 to Subjects;
    set sashelp.class(keep=Name Age rename=(Name=NameOlder Age=AgeOlder))
          nobs=Subjects point=i;
    if Age lt AgeOlder then output;
  end;
run;
proc sort data=pairs_dstep;
  by Age AgeOlder Name NameOlder;
run;
proc print data=pairs_dstep(obs=20);
  by Age AgeOlder;
  id Age AgeOlder;
  title1 "Student Mentor Pairings";
run;

SAS Log

4817   data pairs_dstep;
4818     set sashelp.class(keep=Name Age);
4819     do i=1 to Subjects;
4820       set sashelp.class(keep=Name Age rename=(Name=NameOlder Age=AgeOlder))
4821             nobs=Subjects point=i;
4822       if Age lt AgeOlder then output;
4823     end;
4824   run;
 
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.PAIRS_DSTEP has 145 observations and 4 variables.
 
4825   proc sort data=pairs_dstep;
4826     by Age AgeOlder Name NameOlder;
4827   run;
 
NOTE: There were 145 observations read from the data set WORK.PAIRS_DSTEP.
NOTE: The data set WORK.PAIRS_DSTEP has 145 observations and 4 variables.
 
4828   proc print data=pairs_dstep(obs=20);
4829     by Age AgeOlder;
4830     id Age AgeOlder;
4831     title1 "Student Mentor Pairings";
4832   run;
 
NOTE: There were 20 observations read from the data set WORK.PAIRS_DSTEP.
 
 

SAS Listing Output

Student Mentor Pairings
 
        Age               Name
Age    Older     Name     Older
 
 11      12     Joyce     James
                Joyce     Jane
                Joyce     John
                Joyce     Louise
                Joyce     Robert
                Thomas    James
                Thomas    Jane
                Thomas    John
                Thomas    Louise
                Thomas    Robert
 
 11      13     Joyce     Alice
                Joyce     Barbara
                Joyce     Jeffrey
                Thomas    Alice
                Thomas    Barbara
                Thomas    Jeffrey
 
 11      14     Joyce     Alfred
                Joyce     Carol
                Joyce     Henry
                Joyce     Judy
bronto

The new way: PROC SQL Join with Range Comparison
(available in SAS Version 6)

An alternate approach is to perform the match as an SQL join that produces a partial Cartesian product of the data set, then filters down to the matching rows.

Download this program
proc sql;
  create table pairs_sql as
    select y.Name, y.Age, o.Name as NameOlder, o.Age as AgeOlder
      from sashelp.class y, sashelp.class o
      where y.Age lt o.Age
      order by Age, AgeOlder, Name, NameOlder;
quit;
proc print data=pairs_sql(obs=20);
  by Age AgeOlder;
  id Age AgeOlder;
  title1 "Student Mentor Pairings";
run;

SAS Log

4844   proc sql;
4845     create table pairs_sql as
4846       select y.Name, y.Age, o.Name as NameOlder, o.Age as AgeOlder
4847         from sashelp.class y, sashelp.class o
4848         where y.Age lt o.Age
4849         order by Age, AgeOlder, Name, NameOlder;
NOTE: The execution of this query involves performing one or more Cartesian product joins
      that can not be optimized.
NOTE: Table WORK.PAIRS_SQL created, with 145 rows and 4 columns.
 
4850   quit;
 
4851   proc print data=pairs_sql(obs=20);
4852     by Age AgeOlder;
4853     id Age AgeOlder;
4854     title1 "Student Mentor Pairings";
4855   run;
 
NOTE: There were 20 observations read from the data set WORK.PAIRS_SQL.
 

SAS Listing Output

Same as for the old way


Advantages of the alternate approach: Good

  • A simple SQL query handles the joining process.
  • No sorting is required for either data set. The DATA step approach may require each data set to be sorted by the appropriate keys.

Disadvantages of the alternate approach: Bad

  • Additional data manipulation and programming logic is limited in an SQL query.
  • The resources required by the SQL query often exceed the resources used by the combination of PROC SORT steps and the DATA step.

Additional documentation for this technique can be found in Base SAS® 9.2 Procedures Guide. Cary, NC: SAS Institute Inc.

Visit http://support.sas.com/documentation/onlinedoc/sas9doc.html for SAS 9 documentation.

The URL for this page is http://www.repole.com/dinosaur/fuzzymerge.html

These techniques are mentioned in other SAS references and publications:


Back to SAS Dinosaur home page

Printable copy of this page (without sample output)

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries.

® indicates USA registration.