Overview#

Banff is a statistical package developed by Statistics Canada, consisting of nine modular procedures performing various statistical data editing (SDE) functions, including imputation. Some general notes about Banff:

  • Most of the SDE methods included in Banff are designed for economic surveys, and in particular for numerical variables such as revenue and employee counts. Banff does not currently include methods recommended for the imputation of categorical or ordinal data.

  • Banff includes a number of methods designed for data whose variables are constrained by linear relationships, also commonly referred to as linear edit rules or simply edits. This includes procedures that review data with respect to the edits, choose which variables to impute when the edits fail, and impute records to ensure that all edits are satisfied.

  • While each Banff procedure can be run independently, they follow a modular template and can be run in sequence as part of a larger SDE process flow. Outputs from one procedure act as natural inputs for subsequent procedures.

  • Banff uses status flags to track metadata such as selection and imputation flags. These status flags allow the Banff procedures to pass information from one procedure to another, and also serve as a log of the overall SDE process.

This user guide is meant to provide Banff users with general information about the package, covering important methodological concepts common to multiple procedures, and technical details. For information specific to individual procedures, including descriptions of all parameters, please see the Procedure Guides linked in the table of procedures below. A list of all outputs generated by the procedures can be found in the Output Tables document. A full description of the underlying methods, with examples and guidance on how they should be applied, can be found in the Functional Description.

When running Banff procedures in sequence as part of an SDE process flow, users are responsible for input and output between steps. An additional package, the Banff Processor, is a metadata-driven utility designed specifically for large-scale SDE production, incorporating the Banff procedures, and handling all intermediate data management.

The Banff user guide often uses terminology from the Generic Statistical Data Editing Model (GSDEM). Users are encouraged to reference the GSDEM for common terminology regarding SDE concepts.


Methodology#

Most SDE functions can be categorized into one of the following function types defined in the GSDEM:

  • Review: Functions that examine the data to identify potential problems.

  • Selection: Functions that select units or fields within units for specified further treatment.

  • Treatment: Functions that change the data in a way that is considered appropriate to improve the data quality. The modification of specific fields within a unit (i.e. filling in missing values or changing erroneous ones) is referred to as imputation.

The nine Banff procedures are listed in the table below, alongside a brief description of each, and the function types they perform. Each procedure has an alias in Banff, which is used to execute the procedure in Python; this is also used to reference the procedures within the user guides.

List of procedures#

| Procedure name | Alias | Function types | Description |
|---|---|---|---|
| Verify Edits | verifyedits | none | Checks the edits for consistency and redundancy. |
| Edit Statistics | editstats | Review | Produces edit summary statistics tables on records that pass, miss or fail each consistency edit. |
| Outlier Detection | outlier | Review, Selection | Identifies outlying observations using Hidiroglou-Berthelot or Sigma-Gap methods. |
| Error Localization | errorloc | Review, Selection | For each record, selects the minimum number of variables to impute such that each observation can be made to pass all edits. |
| Deterministic Imputation | deterministic | Treatment | Performs imputation when only one combination of values permits the record to pass the set of edits. |
| Donor Imputation | donorimp | Treatment | Performs nearest neighbour donor imputation such that each imputed record satisfies the specified post-imputation edits. |
| Estimator Imputation | estimator | Treatment | Performs imputation using estimation functions and/or linear regression estimators. |
| Prorating | prorate | Review, Selection, Treatment | Prorates and rounds records to satisfy user-specified edits. |
| Mass Imputation | massimp | Review, Selection, Treatment | Performs donor imputation for a block of variables using a nearest neighbour approach or random selection. |

This user guide does not include information about specific procedures; this can instead be accessed via the links in the table above or from the procedure guide index. The procedure guides include all information required to run the procedures, including detailed descriptions of every parameter, but only a brief description of the methods. For a full mathematical description of the procedure methods, with examples, please see the Functional Description.

Interaction between procedures#

The Banff procedures are designed to be run sequentially as part of an SDE process flow. The outputs from one procedure often act as inputs for subsequent procedures, and the statistical data that is the target of the SDE process is updated continuously throughout the process. A standard approach using the Banff procedures may resemble the following:

  1. Validity and consistency edits are reviewed and refined using verifyedits.

  2. An initial review of the raw statistical data is performed with editstats.

  3. Review and selection functions are performed using outlier and errorloc to identify potential problems, and select fields within units for further treatment.

  4. Imputation is performed using the treatment functions available in deterministic, donorimp, and estimator to impute missing values, outliers, and to correct inconsistencies.

  5. Prorating is performed using prorate to ensure all resulting values satisfy summation constraints, and to round any decimal values.

  6. Optionally, massimp is used to impute large blocks of non-response.

  7. A final review of the imputed data is performed with editstats to ensure the outgoing data meets desired quality standards.

For this process to work, information needs to be passed from one step to another. Selection flags generated by outlier and errorloc are stored in a status file that is read by subsequent treatment procedures. When one of the treatment procedures successfully imputes a field requiring imputation, the selection flag on the status file is replaced with an imputation flag, indicating to subsequent procedures that treatment is no longer required.

To manage the changes to both the statistical data and status flags, the Banff procedures are modular, with a standard set of inputs and outputs shared amongst procedures performing similar SDE function types. Of the nine Banff procedures, only two (prorate and massimp) perform all three function types; these procedures can be run in isolation to review the data, select records and/or fields for treatment, and perform imputation. Procedures that perform only treatment (deterministic, donorimp, and estimator) first require selection flags generated by one of the procedures performing selection (outlier or errorloc). The editstats procedure reviews the data for errors but does not produce selection flags. Finally, verifyedits does not perform any of the statistical data editing function types, but should be used before the SDE process begins to review and improve edits used by other procedures.
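
To illustrate how status flags flow between procedures, the following minimal sketch passes the outstatus produced by errorloc directly to donorimp as instatus. The file path, edits, and unit_id value are hypothetical placeholders, and other required parameters are omitted; see the procedure guides for full parameter lists.

    # Selection: errorloc flags fields to impute (FTI) where the edits fail
    errorloc_call = banff.errorloc(
        indata="./indata.parquet",  # hypothetical input file
        edits="Profit = Revenue - Expenses; Profit >= 0; Expenses >= 0;",
        unit_id="IDENT",  # hypothetical record identifier
        # ... other required parameters; see the errorloc procedure guide
    )

    # Treatment: donorimp reads the selection flags produced above
    donorimp_call = banff.donorimp(
        indata="./indata.parquet",
        instatus=errorloc_call.outstatus,  # selection flags from errorloc
        edits="Profit = Revenue - Expenses; Profit >= 0; Expenses >= 0;",
        unit_id="IDENT",
        # ... other required parameters; see the donorimp procedure guide
    )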

Users should familiarize themselves with two types of data that are integral to understanding Banff:

  1. Statistical tables indata and outdata: These input and output files represent the statistical data, or microdata, that is the target of the SDE process. All procedures except verifyedits act on statistical data and require indata as an input. Only those procedures that perform treatment functions (deterministic, donorimp, estimator, prorate, and massimp) produce outdata, containing the imputation results.

  2. Status files instatus and outstatus: These input and output files contain status flags, important metadata identifying which fields in the statistical data require treatment (selection flags) and which have already been treated (imputation flags).

Statistical tables and status files are discussed in further detail in the following sections.

Statistical tables indata and outdata#

Except for verifyedits, all Banff procedures operate on statistical data (also called microdata) arranged in tabular form. The main input for most procedures is the indata table, which is the target of the data editing process. Some procedures also make use of indata_hist, historical data for the same set of records. Procedures that perform treatment functions produce the output table outdata.

The indata table must consist of tabular data arranged in rows and columns. Banff documents refer to rows as records while columns are referred to as variables, or fields in the case of status flags. At least one character variable must serve as a unique record identifier, specified by the unit_id parameter in most procedures. Banff uses this identifier to track metadata throughout the data editing process. There are some restrictions on variable names for Banff inputs; please see this section for more information.

Procedures performing treatment functions (deterministic, donorimp, estimator, prorate and massimp) produce the outdata table, output statistical data (i.e., microdata) that includes the result of the treatment function. This includes both imputed values (e.g., imputed from donorimp) and modified values (e.g., prorated values from prorate). Some important notes about outdata:

  • The outdata table is typically not a complete copy of indata but only contains rows and columns affected by the procedure. For example, if indata includes 2000 rows and 25 columns, but only 500 rows and 10 columns are affected by a procedure, then outdata will only include 500 rows and 10 columns. To continue the SDE process, users should manually update the indata file with the new information from outdata; a sketch of this manual update appears after this list. (Note: The Banff team is looking into automatically updating indata with the results of outdata in a future release.)

  • The outdata table always contains the variables specified by the unit_id parameter and the by parameter.

  • If no records are successfully imputed or altered by the procedure, then outdata is empty. No error will occur.
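
The following is a minimal sketch of this manual update using pandas; the table contents and the unit_id column name (IDENT) are hypothetical, and the actual variables on your tables will differ.

    import pandas as pd

    # Hypothetical statistical data and procedure output, keyed by IDENT
    indata = pd.DataFrame({"IDENT": ["A1", "A2"], "Revenue": [100.0, None], "Expenses": [60.0, 40.0]})
    outdata = pd.DataFrame({"IDENT": ["A2"], "Revenue": [95.0]})

    # Align both tables on the record identifier, then overwrite only the
    # cells present on outdata; all other values on indata are left untouched.
    indata = indata.set_index("IDENT")
    indata.update(outdata.set_index("IDENT"))
    indata = indata.reset_index()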

Status flags and files#

Banff stores important metadata about the SDE process in status flags, including selection flags, exclusion flags, and imputation flags. This information is used in two ways:

  • As inputs for subsequent steps in the data editing process. For example, the error localization procedure errorloc produces selection flags to identify variables that require imputation. These selection flags are read by imputation procedures such as donorimp and estimator in order to perform imputation.

  • As a complete record of the data editing process. For example, the status flag history of a single observation can explain why, how and when it was modified.

Table of Banff-produced status flags#

| Status flag | Description |
|---|---|
| FTI | Field to impute: Selection flag indicating an observation requires additional treatment such as imputation. Generated by outlier and errorloc procedures, or by a user-defined procedure. |
| FTE | Field to exclude: Selection flag indicating an observation should be excluded from certain methods. Generated by outlier procedure, or by a user-defined procedure. |
| IDE | Imputed by deterministic imputation: Field has been imputed using the deterministic procedure and should be considered as a reported value (not imputed) for the purpose of subsequent methods. (Also referred to as deductive imputation.) |
| IDN | Imputed by donor imputation: Field has been imputed using the donorimp procedure. |
| I-- | Imputed by estimation imputation: Field has been imputed using the estimator procedure. The exact code depends on the specified algorithm, i.e., an ICR flag indicates the CURRATIO algorithm was used while an IDT flag indicates the DIFTREND algorithm was used. |
| IPR | Imputed by prorating: Field has been imputed using the prorate procedure. |
| IMAS | Imputed by mass imputation: Field has been imputed using the massimp procedure. |

Status files instatus and outstatus#

Selection and imputation flags are always associated with individual values on indata. Because indata is tabular, each observation can be associated with a specific record (row) and field (column). Records are identified by the user-specified unique record identifier unit_id, while fields are referenced by their variable name. Status flags are stored in status files with the following columns:

| Column | Description |
|---|---|
| <unit-id> | Record identifier (i.e., row) to which the status flag applies. (The actual column header is the name of the variable specified by the unit_id parameter.) |
| FIELDID | Field identifier (i.e., column) to which the status flag applies. |
| STATUS | Status flag such as “FTI”, “FTE”, or “IDN”. |
| VALUE | Value of the variable when the status code was generated. For procedures performing selection (errorloc and outlier), this column reflects the value of the observation on indata when selection occurred. For procedures performing treatment (deterministic, donorimp, estimator, prorate, massimp), this column reflects the value of the observation on outdata, after treatment has occurred. |
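
For illustration, a small hypothetical outstatus file produced by a selection procedure might look like the following, assuming the unit_id variable is named IDENT:

| IDENT | FIELDID | STATUS | VALUE |
|---|---|---|---|
| B001 | Revenue | FTI | |
| B001 | Expenses | FTE | 850 |
| B002 | Revenue | FTI | -25 |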

All procedures performing selection or treatment functions (i.e., all but verifyedits and editstats) automatically produce output status files labelled outstatus containing selection or imputation flags. Some procedures also read status files as inputs (instatus); these may be required, depending on the procedure. A brief summary of the behaviour of each procedure with respect to status files is included in the table below.

Status files by procedure#

The following table summarizes which flags are read from instatus or produced on outstatus by each procedure. If a flag is required by a procedure, then instatus is mandatory.

| Procedure | Flags read from instatus | Flags produced on outstatus |
|---|---|---|
| verifyedits | N/A | N/A |
| editstats | N/A | N/A |
| outlier | N/A | FTI, FTE |
| errorloc | FTI (optional) | FTI |
| deterministic | FTI (required) | IDE |
| donorimp | FTI (required), FTE (optional), I-- (optional) | IDN |
| estimator | FTI (required), FTE (optional), I-- (optional) | I-- (exact code depends on specified algorithm) |
| prorate | I-- (optional) | IPR |
| massimp | N/A | IMAS |

Some input flags are optional, but can change the behaviour of the procedure. For example, within prorate, users can choose whether to prorate all values, only original values, or only previously imputed values. If they choose to prorate only original or only previously imputed values, an instatus file with I-- status flags is required.

Specifying linear edits#

In the statistical data editing process, the term edits generally refers to constraints that records must satisfy to be considered valid. Linear edits refer to constraints that can be expressed as linear equalities or inequalities of the form

$$ a_1 x_1 + a_2 x_2 + \dots + a_n x_n = b \quad \text{or} \quad a_1 x_1 + a_2 x_2 + \dots + a_n x_n \le b $$

where $x_1$ to $x_n$ are numerical variables from the target statistical data, and $a_1$ to $a_n$ and $b$ are constants specified by the user. Of the nine Banff procedures, six of them require edits as input parameters:

  • verifyedits

  • editstats

  • errorloc

  • deterministic

  • donorimp

  • prorate (Note: there are additional restrictions and unique syntax for edits in the prorating procedure)

Formatting#

Use the edits parameter to specify a list of edits, following these rules:

  • As with all string parameters, the list of edits must be surrounded by quotes " or triple quotes """ for multi-line edits.

  • Each individual edit must be followed by a semi-colon ;.

  • Individual edits must include one of the following operators:

    • < (less than)

    • <= (less than or equal to)

    • = (equal to)

    • != (not equal to)

    • >= (greater than or equal to)

    • > (greater than)

  • Within an individual edit, one or more components must appear on each side of an operator, separated by + or -. A component can be a constant, a variable found on indata, or a constant and variable multiplied together: constant * variable or variable * constant. When multiplying constants and variables, they must be separated by an asterisk *.

  • Optionally, users may add a modifier at the beginning of an individual edit, followed by a colon :. Acceptable modifiers are pass or fail and are not case sensitive.

A simple example with three edits specified on one line:

    errorloc_call = banff.errorloc(
        edits="Profit = Revenue - Expenses; Profit >= 0; Expenses >= 0;",
        ... # etc. (parameters, output tables)
        )

In the next example, the edits are spread over multiple lines and surrounded by triple quotes """. These edits also include constants and the pass and fail modifiers.

    errorloc_call = banff.errorloc(
        edits="""
            Profit = Revenue - Expenses; Profit >= 0; Expenses >= 0;
            0.9 * Total <= Var1 + Var2 + Var3;
            Var1 + Var2 + Var3 <= 1.1 * Total;
            Pass: Var4 >= Var5;
            Fail: Var4 > Var5;
            Fail: Num_Employees != Employees_BR;
            """,
        ... # etc. (parameters, output tables)
        )

Allowable edits#

While users may express edits in a number of ways, the Banff procedures convert them to canonical form before processing, meaning each edit is expressed as a pass edit with an = or <= operator. Strict inequalities are not allowed, resulting in the following rules or modifications:

  • Pass edits with < or > are replaced by <= and >= respectively.

  • Fail edits with <= or >= are replaced by < and > respectively.

  • Pass edits with != cannot be converted to canonical form and generate an error.

  • Fail edits with = cannot be converted to canonical form and generate an error.

(Note: Users who wish to specify a strict inequality should instead include a small constant in their edit, i.e., A < B can be replaced by A <= B - 0.005 for values that are only recorded to two decimals of precision.)

The following table gives examples of valid and invalid original edits alongside their canonical form, if possible:

| Original edit | Canonical form |
|---|---|
| Pass: A > B + 3 | -A + B <= -3 |
| Pass: C = D | C - D = 0 |
| Pass: Z < A | -A + Z <= 0 |
| Pass: M != N | INVALID |
| Fail: A > B + 3 | A - B <= 3 |
| Fail: C = D | INVALID |
| Fail: Z <= A | A - Z <= 0 |
| Fail: N != M | -M + N = 0 |

Additionally, the set of edits specified by a user must be consistent; that is, the linear edits must form a non-empty feasible space. Specifying an inconsistent set of edits for a procedure will result in an error. Users are encouraged to use verifyedits to review their edits for consistency, redundancy, determinacy and hidden equalities before using them in one of the other procedures.

Processing within by-groups#

The by parameter found in all Banff procedures excluding verifyedits allows users to partition the input statistical data into groups that are processed independently. Within the Banff documents, these are referred to as “by-groups”. Users specify one or more variables on indata that the procedure uses to form the by-groups. For example, by = "Province" will create by-groups for each unique value of the variable Province on indata while by = "Province Industry" will create by-groups for each unique combination of the variables Province and Industry.
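
For example, the following sketch runs outlier detection independently within each combination of Province and Industry; the file path, unit_id value, and remaining parameters are hypothetical placeholders.

    outlier_call = banff.outlier(
        indata="./indata.parquet",
        unit_id="IDENT",
        by="Province Industry",  # one by-group per Province/Industry combination
        # ... other required parameters; see the outlier procedure guide
    )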

Notes on by-groups:

  • Procedures errorloc, deterministic, and prorate operate on each record on indata independently; processing within by-groups has no effect on the procedure outputs. Nonetheless, the by parameter is still included in these procedures, and may be used in the future for performance enhancements.

  • If by and indata_hist are both specified in a procedure, then the indata_hist table must also include the by variables.

  • Input tables such as indata and indata_hist must be sorted according to the by variables before execution. By default, Banff procedures handle this sorting automatically; this sorting can be disabled (for example, if the input file is already sorted) by setting presort=False.

  • When instatus is specified, some procedures are more efficient if the by variables are included on instatus. By default, Banff will add the by variables automatically; this action can be disabled (for example, if the variables are already present) by setting prefill_by_vars=False.

  • By default, many procedures will output diagnostics to the log for each by-group; to disable this feature, specify no_by_stats=True.

  • The outdata file will always contain the by variables.

Technical guide#

Executing the Banff procedures#

To execute a Banff procedure in a Python script, we first import the Banff package alongside any other packages we plan on using:

import banff
import pandas as pd

We now create a new object with a name of our choosing (in this case, “my_errorloc”) by calling one of the Banff procedures, using its alias from the table of procedures above:

my_errorloc = banff.errorloc(
    indata=example_indata,
    outstatus=True,
    outreject=True,
    edits="x1>=-5; x1<=15; x2>=30; x1+x2<=50;",
    weights="x1=1.5;",
    cardinality=2,
    time_per_obs=0.1,
    unit_id="IDENT",
    by="ZONE"
)
  • Procedures must be referenced using the Banff package name, i.e., banff.errorloc()

  • Parameters (e.g., edits) and tables (e.g., indata) are specified as comma-separated key-value pairs, and can appear in any order

Outputs can be stored in memory or saved to disc (see this section for details). If stored in memory, they are stored in the user-created object (e.g., my_errorloc) created when the procedure was executed:

print(my_errorloc.outstatus) # Print the outstatus table
errorloc_outstatus = my_errorloc.outstatus # Save outstatus as a new PyArrow Table called "errorloc_outstatus"

Variable names on input tables#

Many Banff procedure parameters reference variables (i.e., columns) from input tables such as indata. Any variable referenced by a Banff procedure must consist of a single string without spaces or special characters except underscore (_). For example, "first_name" is an acceptable variable name while "first name" and "first-name$" are not. To ensure that input tables are compatible with Banff, users may need to modify the variable names. Additional constraints:

  • Variable names cannot exceed 64 characters in length.

  • Variable names must be unique within an individual input table. (The same variable can appear on multiple input tables without issue.)

Common Banff parameters that reference variables include unit_id, var, weights, and by. They are simply referenced by their variable name, and lists of two or more variables are separated by a single space. For example:

  • Single variable: unit_id = "business_number"

  • Variable list: by = "province industry"

Note that variable names are not case-sensitive in Banff; if an input table has two or more columns that differ only in case, it is unclear which column will be used during processing. (Users are strongly encouraged to give distinct case-insensitive names to all columns on any input tables.)

Banff log#

Log messages are written to the terminal during execution. These messages originate from 2 sources: Python code and procedure C code. Messages can be displayed in English or French.

Setting The Log Language#

The banff package produces log messages in either English or French. The package attempts to detect the language from its host environment during import. Use the function banff.set_language() to set the language at runtime, specifying a member of the banff.SupportedLanguage enumeration (i.e., .en or .fr).

Example: setting the language to French

banff.set_language(banff.SupportedLanguage.fr)

Python Log Messages#

Python handles logging using the standard logging package. All Python log messages have an associated log level, such as ERROR, WARNING, INFO, and DEBUG. Messages from Python are generally one line per message and are prefixed with a timestamp and level.

By default, only warning and error messages are displayed. Use the trace parameter to change what log levels are displayed.

Example Python Messages
The following are 3 unrelated messages from the INFO, DEBUG, and ERROR log levels. By default, only the 3rd message would be printed to the terminal.

2024-11-29 10:47:51,853 [INFO]:  Time zone for log entries: Eastern Standard Time (UTC-5.0)
2024-11-29 10:48:47,654 [DEBUG   , banff.donorimp._execute._preprocess_inputs]:  Adding BY variables to 'instatus' dataset
2024-11-29 10:49:59,867 [ERROR]:  Procedure 'Donor Imputation' encountered an error and terminated early: missing mandatory dataset (return code 4)

Python Log Verbosity (trace=)#

Use the trace parameter to control which log levels are printed by specifying one of the following log levels

  • banff.log_level.ERROR

  • banff.log_level.WARNING

  • banff.log_level.INFO

  • banff.log_level.DEBUG

Messages from the specified log level and higher will be printed; messages from lower levels will not.

For convenience, specifying trace=True enables all logging output.
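
For example, the following sketch enables INFO-level (and higher) messages for a single procedure call; the remaining parameters are placeholders.

    my_errorloc = banff.errorloc(
        indata=example_indata,
        unit_id="IDENT",
        trace=banff.log_level.INFO,  # print INFO, WARNING and ERROR messages
        # ... other parameters as usual
    )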

Procedure C Log Messages#

Messages from procedure C code have no associated log level and do not include a timestamp. Some messages are prefixed with ERROR:, WARNING: or NOTE:, while other messages have no prefix and may span multiple lines. Most messages are informational. Whenever an ERROR: message is printed, there should be a subsequent related error-level message from Python and an exception should be raised.

Example Procedure C Messages

NOTE: --- Banff System 3.01.001b19.dev6 developed by Statistics Canada ---
NOTE: PROCEDURE DONORIMPUTATION Version 3.01.001b19.dev6
. . .
NOTE: The minimal set of edits is equivalent to the original set of edits specified.
. . .
Number of valid observations ......................:      12     100.00%
Number of donors ..................................:       5      41.67%
Number of donors to reach DONORLIMIT ..............:       3      60.00%
Number of recipients ..............................:       7      58.33%

NOTE: The above message was for the following by group:
      prov=30

Procedure C Log Verbosity#

See no_by_stats.

Suppressing and Troubleshooting Log Messages (capture=)#

The capture parameter can be used to suppress all log output (using capture=None). This option is disabled by default (capture=False).

Specifying capture=True will cause log messages to be printed all at once at the end of procedure execution, instead of being printed immediately throughout execution. The difference may only be noticeable during long-running procedure calls. Using this option can improve performance in some cases, such as when processing a very large number of by-groups without specifying no_by_stats=True.

Jupyter Notebooks and Missing Log Messages
When running a procedure in a Jupyter notebook, procedure C log messages may be missing, particularly when running in Visual Studio Code on Windows. Due to how Jupyter notebooks manage Python’s terminal output, messages from procedure C code may not be displayed. To resolve this, specify capture=True in the procedure call.

Use Your Own Logger (logger=)#

Use the logger parameter to specify a Logger that you have created. All messages from Python and C will be sent to that logger. This allows customization of message prefixes, support for writing to file, etc.

Note that procedure C messages are sent to the logger in a single INFO level Python log message.

Example: writing logs to file

import banff
import logging
my_logger = logging.getLogger(__name__)
logging.basicConfig(filename='example.log', encoding='utf-8', level=logging.DEBUG)

# run Banff procedure
banff_call = banff.donorimp(
    logger=my_logger,
    indata=indata,
    instatus=donorstat,
    outmatching_fields=True,
    ... # etc. (parameters, output tables)
)

Presort#

When by variables are specified, Banff procedures require that input tables be sorted according to those variables before execution. Some procedures also require sorting by unit_id. Setting presort = True automatically sorts any specified input tables (e.g., indata, indata_hist, and instatus) according to the procedure’s requirements. By default, presort = True for all applicable Banff procedures. Users may disable this feature by specifying presort = False.

prefill_by_vars#

Setting prefill_by_vars = True will automatically add any specified by variables to the input status file(s), if necessary, before running the Banff procedure. In some cases, the presence of by variables on input status files may significantly improve procedure performance. By default, prefill_by_vars = True for all applicable Banff procedures. Users may disable this feature by specifying prefill_by_vars = False.

no_by_stats#

Many of the Banff procedures output information to the Banff log. When by-groups are specified, this information is typically produced for each by-group. Specify no_by_stats = True to reduce log output by suppressing by-group-specific messages. This parameter is available for most procedures that allow by-groups, though some procedures have few by-group-specific messages.
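
The following sketch combines these three options for a hypothetical donorimp call whose inputs are already sorted, whose instatus already carries the by variables, and where per-by-group log output is not wanted; other parameters are omitted.

    donorimp_call = banff.donorimp(
        indata="./input_data.parquet",
        instatus="./input_status.parquet",
        by="province industry",
        presort=False,          # inputs are already sorted by the by variables
        prefill_by_vars=False,  # by variables already present on instatus
        no_by_stats=True,       # suppress per-by-group log output
        # ... other required parameters
    )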

Expressions#

The parameters exclude_where_indata and exclude_where_indata_hist apply boolean logic to input tables using SQL expressions. SQL expression support is implemented using DuckDB. See their documentation on expressions for a complete guide of supported syntax.
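
As a hedged example, the following sketch passes a DuckDB SQL expression through exclude_where_indata; the choice of procedure (donorimp) and the column names total_revenue and prov are assumptions for illustration only.

    donorimp_call = banff.donorimp(
        indata="./input_data.parquet",
        exclude_where_indata="total_revenue > 5000000 OR prov = '24'",  # DuckDB SQL expression (hypothetical columns)
        # ... other required parameters; see the procedure guide
    )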

Input and output table specification#

For both input and output tables, users can specify in-memory objects or files on disc. A number of different formats are supported for both types. Objects are associated with identifiers (e.g., "pandas dataframe") while files are associated with extensions (e.g., "filename.parquet"); please see the table below for details. Note that some formats are recommended for testing purposes only, and not all formats are supported for output tables.

Supported formats#

| Format | Type | Supported identifier(s) or extension(s) | Notes |
|---|---|---|---|
| PyArrow Table | Object | "pyarrow", "table", "pyarrow table" | Recommended format for in-memory objects. |
| Pandas DataFrame | Object | "pandas", "dataframe", "pandas dataframe" | |
| Apache Parquet | File | .parquet, .parq | Minimal RAM usage, good performance with large tables. |
| Apache Feather | File | .feather | Least RAM usage, good performance with large tables. |
| SAS Dataset | File | .sas7bdat | For testing purposes, input only; not recommended in production. |
| Comma Separated Value | File | .csv | For testing purposes only; not recommended in production. |

For tips related to file paths in Python, see Escape Characters and File Paths.

Specifying input tables#

To input from an in-memory object, simply reference the object name directly from the procedure call. The procedure will automatically detect the type of object from amongst the supported types.

    donorimp_call = banff.donorimp(
        indata=df, # where df is a Pandas DataFrame previously generated
        instatus=table, # where table is a PyArrow Table previously generated
        ... # etc. (parameters, output tables)
        )

To specify an input from file, include either a relative or complete file path:

    donorimp_call  = banff.donorimp(
        indata="./input_data.parquet", # Parquet file with local reference
        instatus=r"C:\temp\input_status.feather", # Feather file with Windows reference
        ... # etc. (parameters, output datasets)
        )

Users can also mix both types of inputs:

    donorimp_call  = banff.donorimp(
        indata="./input_data.parquet", # Parquet file with local reference
        instatus=table, # where table is a PyArrow Table previously generated
        ... # etc. (parameters, output tables)
        )

Specifying output tables#

The Banff procedures automatically create a number of output tables. Some of these are optional, and can be disabled by specifying False. (Specifying False for an optional output will prevent it from being produced at all, possibly reducing memory usage. Specifying False for a mandatory output will result in an error.) The default format for output tables is the in-memory PyArrow Table. To produce the output in another in-memory format, specify its associated identifier as a string. To write outputs to file, specify a file path with a supported extension.

See the table of supported formats for a list of identifiers and extensions. Please see the Output Tables document for a full list of output tables by procedure.

The following example includes both mandatory and optional outputs, saved as a mix of in-memory objects and files.

    estimator_call  = banff.estimator(
        outdata=True, # Saved as a PyArrow Table
        outstatus=True, # Saved as a PyArrow Table
        outacceptable=True, # Optional output saved as PyArrow Table
        outest_ef="pyarrow", # Optional output saved as a PyArrow Table
        outest_lr="dataframe", # Optional output saved as a Pandas Dataframe
        outest_parm=False, # Optional output disabled
        outrand_err="./output_data.parquet", # Optional output saved as parquet file
        ... # etc. (parameters, output tables)
        )

Note: because tables are enabled by default, and PyArrow Table is the default output format, the following would produce identical results:

    estimator_call  = banff.estimator(
        outest_lr="dataframe", # Optional output saved as a Pandas Dataframe
        outest_parm=False, # Optional output disabled
        outrand_err="./output_data.parquet", # Optional output saved as parquet file
        ... # etc. (parameters, output tables)
        )

NOTE: Outputs will automatically overwrite existing objects and files with the same name.

Customize Default Output Specification#

To determine the current default output table format:

>>> banff.get_default_output_spec()
'pyarrow'
  • this corresponds to pyarrow.Table

The default can be set to any identifier from the table of Supported Formats

Example: Switch default output format to pandas.DataFrame

banff.set_default_output_spec('pandas')

Accessing output tables#

For outputs stored in memory, access them as attributes of the procedure-call object, named after the corresponding output table:

    estimator_call  = banff.estimator(
        outdata=True, # Saved as a PyArrow Table
        outstatus=True, # Saved as a PyArrow Table
        ... # etc.
        )
    print(estimator_call.outdata) # Print outdata to the terminal
    my_table = estimator_call.outstatus # Save outstatus as a new object called my_table

Note: because outdata and outstatus are mandatory outputs, they would still be accessible as estimator_call.outdata and estimator_call.outstatus even if outdata=True and outstatus=True were not explicitly specified.

Other#

Escape Characters and File Paths#

On Windows, the backslash character (\) is typically used to separate folders and files in a file path.

  • Example "C:\users\stc_user\documents\dataset.csv"

In Python however, the character \ is an “escape character” and is treated specially. Providing a file path using the example above may cause a runtime error. To disable this special treatment, use a “raw string” by adding the r prefix:

  • r"C:\users\stc_user\documents\dataset.csv"

Alternatively,

  • double backslash: C:\\users\\stc_user\\documents\\dataset.csv

  • forward slash: C:/users/stc_user/documents/dataset.csv

Errors and Exceptions#

Python generally handles runtime errors by “raising an exception”. This has been adopted by the banff package. Whenever an error occurs, an exception is raised. This could occur while the package is loading or preprocessing input data, running a procedure, or writing output data.

Generally exceptions will contain a helpful error message. Exceptions are often “chained” to provide additional context to the exception.

Example: Exception while writing output dataset

The following is console output generated when the package fails to write an output dataset because the destination folder does not exist.

[ERROR   , banff.donorimp._execute._write_outputs]:  Error occurred while writing 'outmatching_fields' output dataset
[ERROR   , banff.donorimp._execute._write_outputs]:  Directory of output file does not exist: 'C:\temp\definitely\a\fake'
[ERROR   , banff.donorimp._execute._write_outputs]:  [WinError 3] The system cannot find the path specified: 'C:\\temp\\definitely\\a\\fake'
Traceback (most recent call last):
  File "C:\git\banff_redesign\Python\src\banff\_common\src\io_util\io_util.py", line 578, in write_output_dataset
    dst.parent.resolve(strict=True)  # strict: exception if not exists
  File "C:\Program Files\Python310\lib\pathlib.py", line 1077, in resolve
    s = self._accessor.realpath(self, strict=strict)
  File "C:\Program Files\Python310\lib\ntpath.py", line 689, in realpath
    path = _getfinalpathname(path)
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'C:\\temp\\definitely\\a\\fake'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\git\banff_redesign\Python\src\banff\_common\src\proc\stc_proc.py", line 649, in _write_outputs
    ds.user_output = io.write_output_dataset(ds.ds_intermediate, ds.user_spec, log_lcl)
  File "C:\git\banff_redesign\Python\src\banff\_common\src\io_util\io_util.py", line 581, in write_output_dataset
    raise FileNotFoundError(mesg) from e
FileNotFoundError: Directory of output file does not exist: 'C:\temp\definitely\a\fake'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\git\banff_redesign\Python\sample_programs\conversion_examples\DonorImp01.py", line 61, in <module>
    banff_call = banff.donorimp(
  File "C:\git\banff_redesign\Python\src\banff\proc\proc_donorimp.py", line 119, in __init__
    super().__init__(
  File "C:\git\banff_redesign\Python\src\banff\proc\banff_proc.py", line 66, in __init__
    self._execute()
  File "C:\git\banff_redesign\Python\src\banff\_common\src\proc\stc_proc.py", line 367, in _execute
    self._write_outputs(log=log_lcl)
  File "C:\git\banff_redesign\Python\src\banff\_common\src\proc\stc_proc.py", line 654, in _write_outputs
    raise ProcedureOutputError(mesg) from e
banff._common.src.exceptions.ProcedureOutputError: Error occurred while writing 'outmatching_fields' output dataset

The first 3 lines are log messages generated by the banff package. The remaining lines are a standard exception traceback generated by Python itself. From top to bottom it shows a chain of 3 exceptions.
The first is a low-level error indicating that a file path cannot be found, “[WinError 3] The system cannot find the path specified: 'C:\\temp\\definitely\\a\\fake'”.
The second more specifically indicates that the “Directory of output file does not exist: 'C:\temp\definitely\a\fake'”.
The third provides context about what was happening when this error occurred, “Error occurred while writing 'outmatching_fields' output dataset”.

Working with SAS Files in Python#

The banff package provides a few useful functions for reading SAS files into memory or converting to another format.

To use these functions, your program must import banff.

| Function | Description |
|---|---|
| banff.io_util.SAS_file_to_feather_file(file_path, destination) | Reads SAS dataset at file_path and writes it to feather file at destination |
| banff.io_util.SAS_file_to_parquet_file(file_path, destination) | Reads SAS dataset at file_path and writes it to parquet file at destination |
| banff.io_util.DF_from_sas_file(file_path) | Reads SAS dataset at file_path and returns it as a pandas.DataFrame |
| banff.io_util.PAT_from_sas_file(file_path) | Reads SAS dataset at file_path and returns it as a pyarrow.Table |
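
For example (file paths are hypothetical):

    import banff

    # Convert a SAS dataset to a Parquet file on disk
    banff.io_util.SAS_file_to_parquet_file("./indata.sas7bdat", "./indata.parquet")

    # Read a SAS dataset directly into memory as a pandas DataFrame
    indata_df = banff.io_util.DF_from_sas_file("./indata.sas7bdat")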

Performance Considerations#

The formats used for input and output datasets will affect performance.

When there may not be sufficient RAM available (due to small RAM size or large datasets), datasets should be stored on disk. The file format selected will have an effect on performance. Apache Parquet (.parquet) and Apache Feather (.feather) file formats currently deliver the best performance when using files for input or output datasets.

Feather should use the least amount of RAM, making it ideal for large datasets or execution environments with little RAM; it is the recommended format for temporary files. Parquet generally produces the smallest file size while still providing impressive read and write performance in multi-CPU environments and reasonably minimal RAM usage; it is recommended for medium- to long-term storage of data.

Using the SAS dataset format for large input datasets may result in degraded performance, particularly in environments with little RAM. This format is only recommended for use with small datasets (under a few hundred MB). Using the SAS format is discouraged in general, with Apache Arrow formats (parquet and feather) being recommended instead.