Warren Repole

Don't Be a SAS® Dinosaur: Modernize Your SAS Programs
by Warren Repole

Raw Data: Manipulate Input Data Lines


Scenario:

A raw data file contains anomalies that hinder your ability to convert the data properly into a SAS data set.

While it is preferable to make corrections to the raw data file itself, you do not have write access to the file.


The old way: Manipulate User-Defined Variables after INPUT

The original approach is to alter the main INPUT statement or modify the values after the main INPUT statement.

Download this program
data readEMdash1;
  infile "c:\temp\dinosaur\EMdash.dat";
  input @1 number1 number2
        @1 char1 $ char2 $ ;
  if number2=. then do;
    char2fixed=translate(char2,"-","97"x,);
    number2fixed=input(char2fixed,8.);
  end;
  else number2fixed=number2;
  format char: $hex16.;
run;
proc print data=readEMdash1;
  var Number1 Number2 Char2 Char2Fixed Number2Fixed;
  title1 "Data containing EM dash character instead of hyphen";
run;

SAS Log

4736   data readEMdash1;
4737     infile "c:\temp\dinosaur\EMdash.dat";
4738     input @1 number1 number2
4739           @1 char1 $ char2 $ ;
4740     if number2=. then do;
4741       char2fixed=translate(char2,"-","97"x,);
4742       number2fixed=input(char2fixed,8.);
4743     end;
4744     else number2fixed=number2;
4745     format char: $hex16.;
4746   run;
 
NOTE: The infile "c:\temp\dinosaur\EMdash.dat" is:
      Filename=c:\temp\dinosaur\EMdash.dat,
      RECFM=V,LRECL=256,File Size (bytes)=257,
      Last Modified=14May2008:20:54:36,
      Create Time=05Aug2007:22:03:48
 
NOTE: Invalid data for number2 in line 5 14-25.
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8--
5         711.35999966 -45.68000031 25
number1=711.35999966 number2=. char1=3731312E33353939 char2=9734352E36383030
char2fixed=2D34352E36383030 number2fixed=-45.68 _ERROR_=1 _N_=5
NOTE: Invalid data for number2 in line 7 14-25.
7         715.19999981 -6.719999313 25
number1=715.19999981 number2=. char1=3731352E31393939 char2=97362E3731393939
char2fixed=2D362E3731393939 number2fixed=-6.71999 _ERROR_=1 _N_=7
NOTE: Invalid data for number2 in line 8 13-23.
8         713.5899992 -1.61000061 23
number1=713.5899992 number2=. char1=3731332E35383939 char2=97312E3631303030
char2fixed=2D312E3631303030 number2fixed=-1.61 _ERROR_=1 _N_=8
NOTE: Invalid data for number2 in line 9 14-25.
9         708.34999943 -5.239999771 25
number1=708.34999943 number2=. char1=3730382E33343939 char2=97352E3233393939
char2fixed=2D352E3233393939 number2fixed=-5.23999 _ERROR_=1 _N_=9
NOTE: 10 records were read from the infile "c:\temp\dinosaur\EMdash.dat".
      The minimum record length was 14.
      The maximum record length was 25.
NOTE: The data set WORK.READEMDASH1 has 10 observations and 6 variables.
 
4747   proc print data=readEMdash1;
4748     var Number1 Number2 Char2 Char2Fixed Number2Fixed;
4749     title1 "Data containing EM dash character instead of hyphen";
4750   run;
 
NOTE: There were 10 observations read from the data set WORK.READEMDASH1.
 
 

SAS Listing Output

Data containing EM dash character instead of hyphen
 
Obs    number1    number2    char2               char2fixed          number2fixed
 
  1    740.200      .        2020202020202020    2020202020202020         .
  2    747.380     7.1800    372E313739393939    2020202020202020        7.1800
  3    750.400     3.0200    332E303230303030    2020202020202020        3.0200
  4    757.040     6.6400    362E363430303030    2020202020202020        6.6400
  5    711.360      .        9734352E36383030    2D34352E36383030      -45.6800
  6    721.920    10.5600    31302E3535393939    2020202020202020       10.5600
  7    715.200      .        97362E3731393939    2D362E3731393939       -6.7200
  8    713.590      .        97312E3631303030    2D312E3631303030       -1.6100
  9    708.350      .        97352E3233393939    2D352E3233393939       -5.2400
 10    721.810    13.4600    31332E3436303030    2020202020202020       13.4600
stego

The new way: Manipulate the _INFILE_ Variable before INPUT
(available in SAS Version 8)

An alternate approach is to alter the input buffer prior to the main INPUT statement. The DATA step variable _INFILE_ contains the current contents of the input buffer.

Download this program
data readEMdash2;
  infile "c:\temp\dinosaur\EMdash.dat";
  input @;
  _infile_=translate(_infile_,"-","97"x,);
  input @1 number1 number2;
run;
proc print data=readEMdash2;
  var Number1 Number2;
  title1 "Data containing EM dash character instead of hyphen";
run;

SAS Log

4796   data readEMdash2;
4797     infile "c:\temp\dinosaur\EMdash.dat";
4798     input @;
4799     _infile_=translate(_infile_,"-","97"x,);
4800     input @1 number1 number2;
4801   run;
 
NOTE: The infile "c:\temp\dinosaur\EMdash.dat" is:
      Filename=c:\temp\dinosaur\EMdash.dat,
      RECFM=V,LRECL=256,File Size (bytes)=257,
      Last Modified=14May2008:20:54:37,
      Create Time=05Aug2007:22:03:48
 
NOTE: 10 records were read from the infile "c:\temp\dinosaur\EMdash.dat".
      The minimum record length was 14.
      The maximum record length was 25.
NOTE: The data set WORK.READEMDASH2 has 10 observations and 2 variables.
 
4802   proc print data=readEMdash2;
4803     var Number1 Number2;
4804     title1 "Data containing EM dash character instead of hyphen";
4805   run;
 
NOTE: There were 10 observations read from the data set WORK.READEMDASH2.
 

SAS Listing Output

Data containing EM dash character instead of hyphen
 
Obs    number1     number2
 
  1    740.200       .
  2    747.380      7.1800
  3    750.400      3.0200
  4    757.040      6.6400
  5    711.360    -45.6800
  6    721.920     10.5600
  7    715.200     -6.7200
  8    713.590     -1.6100
  9    708.350     -5.2400
 10    721.810     13.4600

Advantages of the alternate approach: Good

  • This approach is feasible when no write access has been provided to the original data.
  • No programming logic is needed for choosing between different INPUT statements.
  • Many types of dynamic corrections can be accomplished, some with minimal effort.

Disadvantages of the alternate approach: Bad

  • This approach may not be appropriate if corrections require expanding the input buffer beyond its original size, as truncation of shorter records may occur. To avoid truncation, specify either the LRECL option or the PAD option in the INFILE statement.

Additional documentation for this technique can be found in SAS® 9.2 Language Reference: Dictionary. Cary, NC: SAS Institute Inc.

Visit http://support.sas.com/documentation/onlinedoc/sas9doc.html for SAS 9 documentation.

The URL for this page is http://www.repole.com/dinosaur/inputbuffer.html

To view the test data used in this example, go to http://www.repole.com/dinosaur/rawdata-data.html

These techniques are mentioned in other SAS references and publications:


Back to SAS Dinosaur home page

Printable copy of this page (without sample output)

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries.

® indicates USA registration.