Tutorial#

This tutorial walks through the Python equivalent (/Python/sample_programs/conversion_examples/Errorloc01.py) of sample program 1 for the SAS language Proc Errorloc (/Python/sample_programs/conversion_examples/Errorloc01.sas).
Many more paired SAS and Python examples are available in the /Python/sample_programs/conversion_examples/ folder.

The example program shows how to

  • create a synthetic table

  • sort the table

  • specify

    • parameters

    • input table

    • output table options

  • access results

It also discusses some relevant differences between SAS and Python.

SAS Language example#

/* create synthetic input table */
data example_indata;
input IDENT $ X1 X2 ZONE $1.;
cards;
R03 10 40 B
R02 -4 49 A
R04 4 49 A
R01 16 49 A
R05 15 51 B
R07 -4 29 B
R06 30 70 B
;
run;

/* sort by BY and KEY variables */
proc sort data=example_indata;
by ZONE IDENT;
run;

/* create Banff call */
proc errorloc
data=example_indata
outstatus=outstatus
outreject=outreject
edits="x1>=-5; x1<=15; x2>=30; x1+x2<=50;"
weights="x1=1.5;"
cardinality=2
timeperobs=.1
;
id IDENT;
by ZONE;

/* execute Banff call */
run; 

Python language equivalent#

# import Banff package
import banff
import pyarrow as pa

# create a schema for the indata table
indata_schema = pa.schema([
    ("IDENT", pa.string()),
    ("X1", pa.int64()),
    ("X2", pa.int64()),
    ("ZONE", pa.string()),
])

# create table using schema and lists of values for each column
indata = pa.table(
    schema=indata_schema,
    data=[
        ["R03", "R02", "R04", "R01", "R05", "R07", "R06"],
        [10, -4, 4, 16, 15, -4, 30],
        [40, 49, 49, 49, 51, 29, 70],
        ["B", "A", "A", "A", "B", "B", "B"],
    ],
)

# sort according to `by` variable and `unit_id`
indata = indata.sort_by([
    ("ZONE", "ascending"),
    ("IDENT", "ascending"),
])

# run Banff procedure
banff_call = banff.errorloc(
    indata=indata,
    edits="x1>=-5; x1<=15; x2>=30; x1+x2<=50;",
    weights="x1=1.5;",
    cardinality=2,
    time_per_obs=0.1,
    unit_id="IDENT",
    by="ZONE",
)

Line by line explanation#

Import packages#

In Python, packages must be imported into a session before they can be used.

import banff
import pyarrow as pa

The pyarrow package is used for creating and manipulating tables. It is imported under the alias pa, so its functions are called as pa.schema(), pa.table(), and so on.

Create synthetic data#

We create the same synthetic table that was created in SAS.

A Pyarrow Schema object is created and assigned to the variable indata_schema.

# create a schema for the indata table
indata_schema = pa.schema([
    ("IDENT", pa.string()),
    ("X1", pa.int64()),
    ("X2", pa.int64()),
    ("ZONE", pa.string()),
])

A Pyarrow Table object is created and stored in the indata variable.

# create table using schema and lists of values for each column
indata = pa.table(
    schema=indata_schema,
    data=[
        ["R03", "R02", "R04", "R01", "R05", "R07", "R06"],
        [10, -4, 4, 16, 15, -4, 30],
        [40, 49, 49, 49, 51, 29, 70],
        ["B", "A", "A", "A", "B", "B", "B"],
    ],
)
  • the order of columns in the schema corresponds to the order of lists of data

  • for more details, see pyarrow.table documentation

  • unlike SAS, where tables are typically stored as files, Pyarrow Tables are typically stored in memory as “objects”, hence the need to assign the table to the indata variable

The indata table is sorted using its sort_by() method and assigned back to itself.

# sort according to `by` variable and `unit_id`
indata = indata.sort_by([
    ("ZONE", "ascending"),
    ("IDENT", "ascending"),
])
  • in SAS this would be analogous to using proc sort data=indata without specifying the out option

  • Pyarrow Tables are immutable: most operations in pyarrow return a new table rather than modifying the source table in place

  • to retain the unsorted table, simply assign the sorted table to a different variable

    • indata_sorted = indata.sort_by(...) gives you a new, sorted table and leaves the original indata table as-is

Python Concepts#

  • in Python, variables refer to objects, which contain not only data but also methods

  • methods are often useful for getting information about, or performing operations on, the data

  • the indata variable is a Pyarrow Table (pyarrow.Table) object, and accordingly has its own sort_by() method

    • documentation on other methods available for Pyarrow Tables can be found in the pyarrow.Table documentation
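The same idea applies to every Python object, not only Pyarrow Tables. As a small illustration using only built-in Python, a plain list also carries both data and methods:

```python
# A plain Python list is an object: it holds data and exposes methods.
idents = ["R03", "R02", "R04", "R01"]

print(type(idents).__name__)  # the kind of object the variable refers to
print(idents.count("R02"))    # a method that reports information about the data
print(idents.index("R04"))    # a method that returns the position of a value

# Methods are attributes of the object and can be looked up by name.
has_sort = callable(getattr(idents, "sort", None))

# The built-in sorted() returns a new list and leaves `idents` untouched,
# much like sort_by() returns a new table.
idents_sorted = sorted(idents)
print(idents_sorted)
```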

Running The Banff Procedure#

banff_call = banff.errorloc(
    indata=indata,
    edits="x1>=-5; x1<=15; x2>=30; x1+x2<=50;",
    weights="x1=1.5;",
    cardinality=2,
    time_per_obs=0.1,
    unit_id="IDENT",
    by="ZONE",
)

Calling banff.errorloc() executes the errorloc procedure and returns an object, which is assigned to the banff_call variable. The object can be used to access output tables.

  • note that all parameters and tables are specified as comma-separated key-value pairs (keyword arguments in Python terminology)

  • they can appear in any order

  • indata=indata specifies that the indata table (recently sorted) should be provided as the “indata” table
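The fact that key-value pairs can appear in any order is a general property of Python keyword arguments. The sketch below shows this with describe_call(), a hypothetical stand-in function written for this illustration only; it is not part of the banff package and simply echoes what it receives.

```python
def describe_call(indata, edits, unit_id, by=None, cardinality=None):
    """Hypothetical stand-in for a procedure call; it only echoes its arguments."""
    return {
        "indata": indata,
        "edits": edits,
        "unit_id": unit_id,
        "by": by,
        "cardinality": cardinality,
    }

# The same keyword arguments in two different orders give the same result.
first = describe_call(
    indata="my_table", edits="x1<=15;", unit_id="IDENT", by="ZONE", cardinality=2
)
second = describe_call(
    cardinality=2, by="ZONE", unit_id="IDENT", edits="x1<=15;", indata="my_table"
)
print(first == second)

# Keyword arguments can also be collected in a dictionary and unpacked with **.
shared_params = {"edits": "x1<=15;", "unit_id": "IDENT", "by": "ZONE", "cardinality": 2}
third = describe_call(indata="my_table", **shared_params)
print(first == third)
```

Collecting shared parameters in a dictionary can be convenient when the same settings are reused across several calls.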

Procedure Execution#

During procedure execution, the text written to the console by the procedure should be nearly identical to the SAS log from the equivalent example.

Accessing output tables#

After execution completes, output table options are processed. The example above does not specify any output table options, so the default is used and the output tables are available as pyarrow tables.

They are stored in the banff_call object and can be accessed using banff_call.outstatus and banff_call.outreject. From here they can be handled by users like any other pyarrow table, for example:

  • written to file

  • manipulated (sorted, merged, etc.)

  • used as input for another procedure (instatus=banff_call.outstatus)

Other input and output options#

Support for various input and output table formats has been implemented. This provides a consistent means of reading and writing files and of converting between table formats, while maintaining the highest possible floating-point precision. For complete details, see the User Guide.

The following code demonstrates the use of files for input and output tables by modifying the above example.

banff.errorloc( 
    indata=r"C:\temp\in_data.feather",
    outstatus=r"C:\temp\out_status.parquet",
    outreject=r"C:\temp\out_reject.parquet",
    edits="x1>=-5; x1<=15; x2>=30; x1+x2<=50;",
    weights="x1=1.5;",
    cardinality=2,
    time_per_obs=0.1,
    unit_id="IDENT",
    by="ZONE"
)
  • note that there is no need to assign the object returned by banff.errorloc() to a variable because output data is written to disk