Overview#
Banff is a statistical package developed by Statistics Canada, consisting of nine modular procedures performing various statistical data editing (SDE) functions, including imputation. Some general notes about Banff:
Most of the SDE methods included in Banff are designed for economic surveys, in particular for numerical variables such as revenue and employee counts. Banff does not currently include methods recommended for the imputation of categorical or ordinal data.
Banff includes a number of methods designed for data whose variables are constrained by linear relationships, also commonly referred to as linear edit rules or simply edits. This includes procedures that review data with respect to the edits, choose which variables to impute when the edits fail, and impute records to ensure that all edits are satisfied.
While each Banff procedure can be run independently, they follow a modular template and can be run in sequence as part of a larger SDE process flow. Outputs from one procedure act as natural inputs for subsequent procedures.
Banff uses status flags to track metadata such as selection and imputation flags. These status flags allow the Banff procedures to pass information from one procedure to another, and also serve as a log of the overall SDE process.
This user guide is meant to provide Banff users with general information about the package, covering important methodological concepts common to multiple procedures, as well as technical details. For information specific to individual procedures, including descriptions of all parameters, please see the Procedure Guides linked in the table of procedures below. A list of all outputs generated by the procedures can be found in the Output Tables document. A full description of the underlying methods, with examples and guidance on how they should be applied, can be found in the Functional Description.
When running Banff procedures in sequence as part of an SDE process flow, users are responsible for input and output between steps. An additional package, the Banff Processor, is a metadata-driven utility designed specifically for large-scale SDE production, incorporating the Banff procedures, and handling all intermediate data management.
The Banff user guide often uses terminology from the Generic Statistical Data Editing Model (GSDEM). Users are encouraged to reference the GSDEM for common terminology regarding SDE concepts.
Methodology#
Most SDE functions can be categorized into one of the following function types defined in the GSDEM:
Review: Functions that examine the data to identify potential problems.
Selection: Functions that select units or fields within units for specified further treatment.
Treatment: Functions that change the data in a way that is considered appropriate to improve the data quality. The modification of specific fields within a unit (i.e. filling in missing values or changing erroneous ones) is referred to as imputation.
The nine Banff procedures are listed in the table below, alongside a brief description of each, and the function types they perform. Each procedure has an alias in Banff, which is used to execute the procedure in Python; this is also used to reference the procedures within the user guides.
List of procedures#
| Procedure name | Alias | Function types | Description |
|---|---|---|---|
| Verify Edits | `verifyedits` | none | Checks the edits for consistency and redundancy. |
| Edit Statistics | `editstats` | Review | Produces edit summary statistics tables on records that pass, miss or fail each consistency edit. |
| Outlier Detection | `outlier` | Review, Selection | Identifies outlying observations using Hidiroglou-Berthelot or Sigma-Gap methods. |
| Error Localization | `errorloc` | Review, Selection | For each record, selects the minimum number of variables to impute such that each observation can be made to pass all edits. |
| Deterministic Imputation | `deterministic` | Treatment | Performs imputation when only one combination of values permits the record to pass the set of edits. |
| Donor Imputation | `donorimp` | Treatment | Performs nearest neighbour donor imputation such that each imputed record satisfies the specified post-imputation edits. |
| Estimator | `estimator` | Treatment | Performs imputation using estimation functions and/or linear regression estimators. |
| Prorating | `prorate` | Review, Selection, Treatment | Prorates and rounds records to satisfy user-specified edits. |
| Mass Imputation | `massimp` | Review, Selection, Treatment | Performs donor imputation for a block of variables using a nearest neighbour approach or random selection. |
This user guide does not include information about specific procedures; this can instead be accessed via the links in the table above or from the procedure guide index. The procedure guides include all information required to run the procedures, including detailed descriptions of every parameter, but only a brief description of the methods. For a full mathematical description of the procedure methods, with examples, please see the Functional Description.
Interaction between procedures#
The Banff procedures are designed to be run sequentially as part of an SDE process flow. The outputs from one procedure often act as inputs for subsequent procedures, and the statistical data that is the target of the SDE process is updated continuously throughout the process. A standard approach using the Banff procedures may resemble the following:
1. Validity and consistency edits are reviewed and refined using `verifyedits`.
2. An initial review of the raw statistical data is performed with `editstats`.
3. Review and selection functions are performed using `outlier` and `errorloc` to identify potential problems, and to select fields within units for further treatment.
4. Imputation is performed using the treatment functions available in `deterministic`, `donorimp`, and `estimator` to impute missing values and outliers, and to correct inconsistencies.
5. Prorating is performed using `prorate` to ensure all resulting values satisfy summation constraints, and to round any decimal values.
6. Optionally, `massimp` is used to impute large blocks of non-response.
7. A final review of the imputed data is performed with `editstats` to ensure the outgoing data meets desired quality standards.
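A minimal sketch of the selection-to-treatment portion of such a flow is shown below. The table names, edits, and `unit_id` value are illustrative, and procedure-specific parameters not discussed in this guide are omitted:

```python
import banff

# Selection: errorloc writes FTI selection flags to its outstatus table.
errorloc_call = banff.errorloc(
    indata=indata,  # hypothetical input statistical table
    edits="Profit = Revenue - Expenses; Expenses >= 0;",
    unit_id="IDENT",
)

# Treatment: donorimp reads those flags via instatus, imputes the selected
# fields, and writes imputation flags to its own outstatus table.
donorimp_call = banff.donorimp(
    indata=indata,
    instatus=errorloc_call.outstatus,
    edits="Profit = Revenue - Expenses; Expenses >= 0;",
    unit_id="IDENT",
)
```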
For this process to work, information needs to be passed from one step to another. Selection flags generated by outlier and errorloc are stored in a status file that is read by subsequent treatment procedures. When one of the treatment procedures successfully imputes a field requiring imputation, the selection flag on the status file is replaced with an imputation flag, indicating to subsequent procedures that treatment is no longer required.
To manage the changes to both the statistical data and status flags, the Banff procedures are modular, with a standard set of inputs and outputs shared amongst procedures performing similar SDE function types. Of the nine Banff procedures, only two (prorate and massimp) perform all three function types; these procedures can be run in isolation to review the data, select records and/or fields for treatment, and perform imputation. Procedures that perform only treatment (deterministic, donorimp, and estimator) first require selection flags generated by one of the procedures performing selection (outlier or errorloc). The editstats procedure reviews the data for errors but does not produce selection flags. Finally, verifyedits does not perform any of the statistical data editing function types, but should be used before the SDE process begins to review and improve edits used by other procedures.
Users should familiarize themselves with two types of data that are integral to understanding Banff:

- Statistical tables `indata` and `outdata`: These input and output files represent the statistical data, or microdata, that is the target of the SDE process. All procedures except `verifyedits` act on statistical data and require `indata` as an input. Only those procedures that perform treatment functions (`deterministic`, `donorimp`, `estimator`, `prorate`, and `massimp`) produce `outdata`, containing the imputation results.
- Status files `instatus` and `outstatus`: These input and output files contain status flags, important metadata identifying which fields in the statistical data require treatment (selection flags) and which have already been treated (imputation flags).
Statistical tables and status files are discussed in further detail in the following sections.
Statistical tables indata and outdata#
Except for verifyedits, all Banff procedures operate on statistical data (also called microdata) arranged in tabular form. The main input for most procedures is the indata table, which is the target of the data editing process. Some procedures also make use of indata_hist, historical data for the same set of records. Procedures that perform treatment functions produce the output table outdata.
The indata table must consist of tabular data arranged in rows and columns. Banff documents refer to rows as records while columns are referred to as variables, or fields in the case of status flags. At least one character variable must serve as a unique record identifier, specified by the unit_id parameter in most procedures. Banff uses this identifier to track metadata throughout the data editing process. There are some restrictions on variable names for Banff inputs; please see this section for more information.
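For illustration, a minimal `indata` table could be built as a pandas DataFrame with a character identifier and numeric variables; the names and values here are hypothetical:

```python
import pandas as pd

indata = pd.DataFrame({
    "IDENT": ["B001", "B002", "B003"],   # character unique record identifier
    "Revenue": [1000.0, 2500.0, None],   # numeric variables subject to editing
    "Expenses": [800.0, None, 400.0],
})
```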
Procedures performing treatment functions (deterministic, donorimp, estimator, prorate and massimp) produce the outdata table, output statistical data (i.e., microdata) that includes the result of the treatment function. This includes both imputed values (e.g., imputed from donorimp) and modified values (e.g., prorated values from prorate). Some important notes about outdata:
- The `outdata` table is typically not a complete copy of `indata`; it contains only the rows and columns affected by the procedure. For example, if `indata` includes 2000 rows and 25 columns, but only 500 rows and 10 columns are affected by a procedure, then `outdata` will only include those 500 rows and 10 columns. To continue the SDE process, users should manually update the `indata` file with the new information from `outdata`, as sketched after this list. (Note: The Banff team is looking into automatically updating `indata` with the results of `outdata` in a future release.)
- The `outdata` table always contains the variables specified by the `unit_id` and `by` parameters.
- If no records are successfully imputed or altered by the procedure, then `outdata` is empty. No error will occur.
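One hedged way to perform this manual update, assuming pandas DataFrames and a `unit_id` column named `IDENT`, is an index-aligned update:

```python
# If outdata was produced as a PyArrow Table, convert it first with
# outdata.to_pandas(). DataFrame.update aligns on index and columns,
# so only the rows/columns present on outdata are overwritten.
updated = indata.set_index("IDENT")
updated.update(outdata.set_index("IDENT"))
indata = updated.reset_index()
```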
Status flags and files#
Banff stores important metadata information about the SDE process in status flags. These status flags capture important information including selection flags, exclusion flags, and imputation flags. This information is used in two ways:
- As inputs for subsequent steps in the data editing process. For example, the error localization procedure `errorloc` produces selection flags to identify variables that require imputation. These selection flags are read by imputation procedures such as `donorimp` and `estimator` in order to perform imputation.
- As a complete record of the data editing process. For example, the status flag history of a single observation can explain why, how and when it was modified.
Table of Banff-produced status flags#
| Status flag | Description |
|---|---|
| FTI | Field to impute: Selection flag indicating an observation requires additional treatment such as imputation. Generated by `errorloc` and `outlier`. |
| FTE | Field to exclude: Selection flag indicating an observation should be excluded from certain methods. Generated by `outlier`. |
| IDE | Imputed by deterministic imputation: Field has been imputed using `deterministic`. |
| IDN | Imputed by donor imputation: Field has been imputed using the `donorimp` procedure. |
| I– | Imputed by estimation imputation: Field has been imputed using the `estimator` procedure; the exact code depends on the specified algorithm. |
| IPR | Imputed by prorating: Field has been imputed using the `prorate` procedure. |
| IMAS | Imputed by mass imputation: Field has been imputed using the `massimp` procedure. |
Status files instatus and outstatus#
Selection and imputation flags are always associated with individual values on indata. Because indata is tabular, each observation can be associated with a specific record (row) and field (column). Records are identified by the user-specified unique record identifier unit_id, while fields are referenced by their variable name. Status flags are stored in status files with the following columns:
| Column | Description |
|---|---|
| `unit_id` variable | Record identifier (i.e., row) to which the status flag applies. (The actual column header is the name of the variable specified by the `unit_id` parameter.) |
| `FIELDID` | Field identifier (i.e., column) to which the status flag applies. |
| `STATUS` | Status flag such as "FTI", "FTE", or "IDN". |
| `VALUE` | Value of the variable when the status code was generated. For procedures performing selection (`outlier` and `errorloc`), this is the original, untreated value. |
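For illustration, a status file produced with `unit_id="IDENT"` might look like the following (the records and values are hypothetical):

| IDENT | FIELDID | STATUS | VALUE |
|---|---|---|---|
| B001 | Revenue | FTI | -500 |
| B002 | Expenses | IDN | 1200 |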
All procedures performing selection or treatment functions (i.e., all but verifyedits and editstats) automatically produce output status files labelled outstatus containing selection or imputation flags. Some procedures also read status files as inputs (instatus); these may be required, depending on the procedure. A brief summary of the behaviour of each procedure with respect to status files is included in the table below.
Status files by procedure#
The following table summarizes which flags are read from instatus or produced on outstatus by each procedure. If a flag is required by a procedure, then instatus is mandatory.
| Procedure | Flags read from `instatus` | Flags produced on `outstatus` |
|---|---|---|
| `verifyedits` | N/A | N/A |
| `editstats` | N/A | N/A |
| `outlier` | N/A | FTI, FTE |
| `errorloc` | FTI (optional) | FTI |
| `deterministic` | FTI (required) | IDE |
| `donorimp` | FTI (required), FTE (optional), I– (optional) | IDN |
| `estimator` | FTI (required), FTE (optional), I– (optional) | I– (exact code depends on specified algorithm) |
| `prorate` | I– (optional) | IPR |
| `massimp` | N/A | IMAS |
Some input flags are optional, but can change the behaviour of the procedure. For example, within `prorate`, users can choose whether to prorate all values, only original values, or only previously imputed values. If they choose to prorate only original or only previously imputed values, an `instatus` file with I– status flags is required.
Specifying linear edits#
In the statistical data editing process, the term edits generally refers to constraints that records must satisfy to be considered valid. Linear edits refer to constraints that can be expressed as linear equalities or inequalities of the form
$$a_1 x_1 + a_2 x_2 + \dots + a_n x_n = b \quad \text{or} \quad a_1 x_1 + a_2 x_2 + \dots + a_n x_n \le b$$

where $x_1$ to $x_n$ are numerical variables from the target statistical data, and $a_1$ to $a_n$ and $b$ are constants specified by the user. Of the nine Banff procedures, six require edits as input parameters:

- `verifyedits`
- `editstats`
- `errorloc`
- `deterministic`
- `donorimp`
- `prorate` (Note: there are additional restrictions and unique syntax for edits in the prorating procedure)
Formatting#
Use the edits parameter to specify a list of edits, following these rules:
- As with all string parameters, the list of edits must be surrounded by quotes `"` or triple quotes `"""` for multi-line edits.
- Each individual edit must be followed by a semi-colon `;`.
- Individual edits must include one of the following operators:
  - `<` (less than)
  - `<=` (less than or equal to)
  - `=` (equal to)
  - `!=` (not equal to)
  - `>=` (greater than or equal to)
  - `>` (greater than)
- Within an individual edit, one or more components must appear on each side of an operator, separated by `+` or `-`. A component can be a constant, a variable found on `indata`, or a constant and variable multiplied together: `constant * variable` or `variable * constant`. When multiplying constants and variables, they must be separated by an asterisk `*`.
- Optionally, users may add a modifier at the beginning of an individual edit, followed by a colon `:`. Acceptable modifiers are `pass` or `fail` and are not case sensitive.
A simple example with three edits specified on one line:
errorloc_call = banff.errorloc(
edits= "Profit = Revenue - Expenses; Profit >= 0; Expenses >= 0;"
... # etc. (parameters, output tables)
)
In the next example, the edits are spread over multiple lines and surrounded by triple quotes `"""`. These edits also include constants and the pass and fail modifiers.
errorloc_call = banff.errorloc(
edits= """
Profit = Revenue - Expenses; Profit >= 0; Expenses >= 0;
0.9 * Total <= Var1 + Var2 + Var3;
Var1 + Var2 + Var3 <= 1.1 * Total;
Pass: Var4 >= Var5;
Fail: Var4 > Var5;
Fail: Num_Employees != Employees_BR
"""
... # etc. (parameters, output tables)
)
Allowable edits#
While users may express edits in a number of ways, the Banff procedures convert them to canonical form before processing, meaning each edit is expressed as a pass edit with an = or <= operator. Strict inequalities are not allowed, resulting in the following rules or modifications:
- Pass edits with `<` or `>` are replaced by `<=` and `>=` respectively.
- Fail edits with `<=` or `>=` are replaced by `<` and `>` respectively.
- Pass edits with `!=` cannot be converted to canonical form and generate an error.
- Fail edits with `=` cannot be converted to canonical form and generate an error.
(Note: Users who wish to specify a strict inequality should instead include a small constant in their edit; e.g., A < B can be replaced by A <= B - 0.005 for values that are only recorded to two decimals of precision.)
The following table gives examples of valid and invalid original edits alongside their canonical form, where one exists:

| Original edit | Canonical form |
|---|---|
| `x1 + x2 <= 100` | `x1 + x2 <= 100` (already canonical) |
| `x1 < x2` | `x1 <= x2` |
| `x1 > x2` | `x2 <= x1` |
| `Fail: x1 <= x2` | `x2 <= x1` |
| `Fail: x1 >= x2` | `x1 <= x2` |
| `Fail: x1 != x2` | `x1 = x2` |
| `x1 != x2` | None (error: pass edit with `!=`) |
| `Fail: x1 = x2` | None (error: fail edit with `=`) |
Additionally, the set of edits specified by a user must be consistent; that is, the linear edits must form a non-empty feasible space. Specifying an inconsistent set of edits for a procedure will result in an error. Users are encouraged to use verifyedits to review their edits for consistency, redundancy, determinacy and hidden equalities before using them in one of the other procedures.
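As a minimal sketch, the following edit set is inconsistent: no value of x1 can satisfy both constraints, so the feasible space is empty and the procedure would report the conflict.

```python
import banff

# verifyedits reviews the edits themselves and requires no input data.
verify_call = banff.verifyedits(
    edits="x1 >= 10; x1 <= 5;",
)
```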
Processing within by-groups#
The by parameter found in all Banff procedures excluding verifyedits allows users to partition the input statistical data into groups that are processed independently. Within the Banff documents, these are referred to as “by-groups”. Users specify one or more variables on indata that the procedure uses to form the by-groups. For example, by = "Province" will create by-groups for each unique value of the variable Province on indata while by = "Province Industry" will create by-groups for each unique combination of the variables Province and Industry.
Notes on by-groups:
- Procedures `errorloc`, `deterministic`, and `prorate` operate on each record on `indata` independently; processing within by-groups has no effect on the procedure outputs. Nonetheless, the `by` parameter is still included in these procedures, and may be used in the future for performance enhancements.
- If `by` and `indata_hist` are both specified in a procedure, then the `indata_hist` table must also include the `by` variables.
- Input tables such as `indata` and `indata_hist` must be sorted according to the `by` variables before execution. By default, Banff procedures handle this sorting automatically; this sorting can be disabled (for example, if the input file is already sorted) by setting `presort=False`.
- When `instatus` is specified, some procedures are more efficient if the `by` variables are included on `instatus`. By default, Banff will add the `by` variables automatically; this action can be disabled (for example, if the variables are already present) by setting `prefill_by_vars=False`.
- By default, many procedures will output diagnostics to the log for each by-group; to disable this feature, specify `no_by_stats=True`.
- The `outdata` file will always contain the `by` variables.
Technical guide#
Executing the Banff procedures#
To execute a Banff procedure in a Python script, we first import the Banff package alongside any other packages we plan on using:
import banff
import pandas as pd
We then create a new object with a name of our choosing (in this case, "my_errorloc") by calling one of the Banff procedures, using the alias from the table of procedures:
my_errorloc = banff.errorloc(
indata=example_indata,
outstatus=True,
outreject=True,
edits="x1>=-5; x1<=15; x2>=30; x1+x2<=50;",
weights="x1=1.5;",
cardinality=2,
time_per_obs=0.1,
unit_id="IDENT",
by="ZONE"
)
- Procedures must be referenced using the Banff package name, i.e., `banff.errorloc()`.
- Parameters (e.g., `edits`) and tables (e.g., `indata`) are specified as comma-separated key-value pairs, and can appear in any order.
Outputs can be stored in memory or saved to disk (see this section for details). If stored in memory, they reside in the object (e.g., my_errorloc) created when the procedure was executed:
print(my_errorloc.outstatus) # Print the outstatus table
errorloc_outstatus = my_errorloc.outstatus # Save outstatus as a new PyArrow Table called "errorloc_outstatus"
Variable names on input tables#
Many Banff procedure parameters reference variables (i.e., columns) from input tables such as indata. Any variable referenced by a Banff procedure must consist of a single string without spaces or special characters except underscore (_). For example, "first_name" is an acceptable variable name while "first name" and "first-name$" are not. To ensure that input tables are compatible with Banff, users may need to modify the variable names. Additional constraints:
Variable names cannot exceed 64 characters in length.
Variable names must be unique within an individual input table. (The same variable can appear on multiple input tables without issue.)
Common Banff parameters that reference variables include unit_id, var, weights, and by. They are simply referenced by their variable name, and lists of two or more variables are separated by a single space. For example:
- Single variable: `unit_id = "business_number"`
- Variable list: `by = "province industry"`
Note that variable names are not case-sensitive in Banff; if an input table has two or more columns that differ only in case, it is unclear which column will be used during processing. (Users are strongly encouraged to give distinct case-insensitive names to all columns on any input tables.)
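As a sketch (assuming a pandas DataFrame, and one possible renaming rule), incompatible column names can be sanitized before passing a table to a procedure:

```python
import re

import pandas as pd

def banff_safe_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Replace disallowed characters with underscores and truncate to 64 characters."""
    out = df.copy()
    out.columns = [re.sub(r"[^0-9A-Za-z_]", "_", str(c))[:64] for c in out.columns]
    return out

# e.g. "first name" -> "first_name", "first-name$" -> "first_name_"
```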
Banff log#
Log messages are written to the terminal during execution. These messages originate from two sources: Python code and procedure C code. Messages can be displayed in English or French.
Setting The Log Language#
The banff package produces log messages in either English or French. The package attempts to detect the language from its host environment during import. Use the function banff.set_language() to set the language at runtime, specifying a member of the banff.SupportedLanguage enumeration (i.e., .en or .fr).
Example: setting the language to French
banff.set_language(banff.SupportedLanguage.fr)
Python Log Messages#
Python handles logging using the standard logging package. All Python log messages have an associated log level, such as ERROR, WARNING, INFO, and DEBUG. Messages from Python are generally one line per message and are prefixed with a timestamp and level.
By default, only warning and error messages are displayed. Use the trace parameter to change what log levels are displayed.
Example Python Messages
The following are 3 unrelated messages from the INFO, DEBUG, and ERROR log levels. By default, only the 3rd message would be printed to the terminal.

```
2024-11-29 10:47:51,853 [INFO]: Time zone for log entries: Eastern Standard Time (UTC-5.0)
2024-11-29 10:48:47,654 [DEBUG , banff.donorimp._execute._preprocess_inputs]: Adding BY variables to 'instatus' dataset
2024-11-29 10:49:59,867 [ERROR]: Procedure 'Donor Imputation' encountered an error and terminated early: missing mandatory dataset (return code 4)
```
Python Log Verbosity (trace=)#
Use the trace parameter to control which log levels are printed by specifying one of the following log levels:

- `banff.log_level.ERROR`
- `banff.log_level.WARNING`
- `banff.log_level.INFO`
- `banff.log_level.DEBUG`
Messages from the specified log level and higher will be printed; lower levels will not.
For convenience, specifying trace=True enables all logging output.
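For example, a hedged sketch enabling INFO-level messages (and higher) for a single call; the other parameters are illustrative:

```python
call = banff.errorloc(
    indata=indata,
    edits="x1 >= 0;",
    unit_id="IDENT",
    trace=banff.log_level.INFO,  # print INFO, WARNING, and ERROR messages
)
```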
Procedure C Log Messages#
Messages from procedure C code have no associated log level and do not include a timestamp. Some messages are prefixed with ERROR:, WARNING: or NOTE:, while other messages have no prefix and may span multiple lines. Most messages are informational. Whenever an ERROR: message is printed, there should be a subsequent related error-level message from Python and an exception should be raised.
Example Procedure C Messages
```
NOTE: --- Banff System 3.01.001b19.dev6 developed by Statistics Canada ---
NOTE: PROCEDURE DONORIMPUTATION Version 3.01.001b19.dev6
...
NOTE: The minimal set of edits is equivalent to the original set of edits specified.
...
Number of valid observations ......................: 12 100.00%
Number of donors ..................................: 5 41.67%
Number of donors to reach DONORLIMIT ..............: 3 60.00%
Number of recipients ..............................: 7 58.33%

NOTE: The above message was for the following by group: prov=30
```
Procedure C Log Verbosity#
See no_by_stats.
Suppressing and Troubleshooting Log Messages (capture=)#
The capture parameter can be used to suppress all log output (using capture=None). This option is disabled by default (capture=False).
Specifying capture=True will cause log messages to be printed all at once at the end of procedure execution, instead of immediately throughout execution. The difference may only be noticeable during long-running procedure calls. Using this option can improve performance in some cases, such as when processing a very large number of by-groups without specifying no_by_stats=True.
Jupyter Notebooks and Missing Log Messages

Due to how Jupyter notebooks manage Python's terminal output, procedure C log messages may be missing, particularly when running in Visual Studio Code on Windows. To resolve this, specify capture=True in the procedure call, or alternatively use your own logger (see below).
Use Your Own Logger (logger=)#
Use the logger parameter to specify a Logger that you have created. All messages from Python and C will be sent to that logger. This allows customization of message prefixes, support for writing to file, etc.
Note that procedure C messages are sent to the logger in a single INFO level Python log message.
Example: writing logs to file
```python
import banff
import logging

my_logger = logging.getLogger(__name__)
logging.basicConfig(filename='example.log', encoding='utf-8', level=logging.DEBUG)

# run Banff procedure
banff_call = banff.donorimp(
    logger=my_logger,
    indata=indata,
    instatus=donorstat,
    outmatching_fields=True,
    ...
```
Presort#
When by variables are specified, Banff procedures require input tables be sorted according to those variables before execution. Some procedures also require sorting by unit_id. Setting presort = True automatically sorts any specified input tables (e.g., indata, indata_hist, and instatus) according to the procedure's requirements. By default, presort = True for all applicable Banff procedures. Users may disable this feature by specifying presort = False.
prefill_by_vars#
Setting prefill_by_vars = True will automatically add any specified by variables to the input status file(s), if necessary, before running the Banff procedure. In some cases, the presence of by variables on input status files may significantly improve procedure performance. By default, prefill_by_vars = True for all applicable Banff procedures. Users may disable this feature by specifying prefill_by_vars = False.
no_by_stats#
Many of the Banff procedures output information to the Banff log. When by-groups are specified, this information is typically produced for each by-group. Specify no_by_stats = True to reduce log output by suppressing by-group-specific messages. This parameter is available for most procedures that allow by-groups, though some procedures have few by-group-specific messages.
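The following hedged sketch combines the three options above in one call; the tables and variable names are illustrative:

```python
call = banff.donorimp(
    indata=indata,
    instatus=instatus,
    unit_id="IDENT",
    by="province",
    presort=False,          # inputs assumed already sorted by "province"
    prefill_by_vars=False,  # instatus already carries the by variable
    no_by_stats=True,       # suppress per-by-group log output
    # other donorimp parameters (edits, etc.) omitted for brevity
)
```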
Expressions#
The parameters exclude_where_indata and exclude_where_indata_hist apply boolean logic to input tables using SQL expressions. SQL expression support is implemented using DuckDB. See their documentation on expressions for a complete guide of supported syntax.
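For example, a hedged sketch; the column names in the expression are hypothetical:

```python
donorimp_call = banff.donorimp(
    indata=indata,
    # exclude records matched by this DuckDB SQL expression
    exclude_where_indata="total_employees > 500 OR province = '24'",
    # other parameters and output tables omitted for brevity
)
```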
Input and output table specification#
For both input and output tables, users can specify in-memory objects or files on disk. A number of different formats are supported for both types. Objects are associated with identifiers (e.g., "pandas" for a Pandas DataFrame) while files are associated with extensions (e.g., "filename.parquet"); please see the table below for details. Note that some formats are recommended for testing purposes only, and not all formats are supported for output tables.
Supported formats#
| Format | Type | Supported identifier(s) or extension(s) | Notes |
|---|---|---|---|
| PyArrow Table | Object | `"pyarrow"` | Recommended format for in-memory objects. |
| Pandas DataFrame | Object | `"pandas"`, `"dataframe"` | |
| Apache Parquet | File | `.parquet` | Minimal RAM usage, good performance with large tables. |
| Apache Feather | File | `.feather` | Least RAM usage, good performance with large tables. |
| SAS Dataset | File | `.sas7bdat` | For testing purposes, input only; not recommended in production. |
| Comma Separated Value | File | `.csv` | For testing purposes only; not recommended in production. |
For tips related to file paths in Python, see Escape Characters and File Paths
Specifying input tables#
To input from an in-memory object, simply reference the object name directly from the procedure call. The procedure will automatically detect the type of object from amongst the supported types.
donorimp_call = banff.donorimp(
    indata=df, # where df is a Pandas DataFrame previously generated
instatus=table, # where table is a PyArrow Table previously generated
... # etc. (parameters, output tables)
)
To specify an input from file, include either a relative or complete file path:
donorimp_call = banff.donorimp(
indata="./input_data.parquet", # Parquet file with local reference
instatus=r"C:\temp\input_status.feather", # Feather file with Windows reference
... # etc. (parameters, output datasets)
)
Users can mix both types of inputs as well:
donorimp_call = banff.donorimp(
indata="./input_data.parquet", # Parquet file with local reference
instatus=table, # where table is a PyArrow Table previously generated
... # etc. (parameters, output tables)
)
Specifying output tables#
The Banff procedures automatically create a number of output tables. Some of these are optional, and can be disabled by specifying False. (Specifying False for an optional output will prevent it from being produced at all, possibly reducing memory usage. Specifying False for a mandatory output will result in an error.) The default format for output tables is the in-memory PyArrow Table. To produce the output in another in-memory format, specify its associated identifier as a string. To write outputs to file, specify a file path with a supported extension.
See the table of supported formats for a list of identifiers and extensions. Please see the Output Tables document for a full list of output tables by procedure.
The following example includes both mandatory and optional outputs, saved as a mix of in-memory objects and files.
estimator_call = banff.estimator(
outdata=True, # Saved as a PyArrow Table
outstatus=True, # Saved as a PyArrow Table
outacceptable=True, # Optional output saved as PyArrow Table
outest_ef="pyarrow", # Optional output saved as a PyArrow Table
outest_lr="dataframe", # Optional output saved as a Pandas Dataframe
outest_parm=False, # Optional output disabled
outrand_err="./output_data.parquet", # Optional output saved as parquet file
... # etc. (parameters, output tables)
)
Note: because tables are enabled by default, and PyArrow Table is the default output format, the following would produce identical results:
estimator_call = banff.estimator(
outest_lr="dataframe", # Optional output saved as a Pandas Dataframe
outest_parm=False, # Optional output disabled
outrand_err="./output_data.parquet", # Optional output saved as parquet file
... # etc. (parameters, output tables)
)
NOTE: Outputs will automatically overwrite existing objects and files with the same name.
Customize Default Output Specification#
To determine the current default output table format:

```python
>>> banff.get_default_output_spec()
'pyarrow'
```

This corresponds to `pyarrow.Table`. The default can be set to any identifier from the table of supported formats.

Example: switch the default output format to `pandas.DataFrame`:

```python
banff.set_default_output_spec('pandas')
```
Accessing output tables#
For objects saved in memory, access them through the object attribute named after the output table:
estimator_call = banff.estimator(
outdata=True, # Saved as a PyArrow Table
outstatus=True, # Saved as a PyArrow Table
... # etc.
)
print(estimator_call.outdata) # Print outdata to the terminal
my_table = estimator_call.outstatus # Save outstatus as a new object called my_table
Note: because outdata and outstatus are mandatory outputs, they would still be accessible as estimator_call.outdata and estimator_call.outstatus even if they were not explicitly specified with True.
Other#
Escape Characters and File Paths#
On Windows, the backslash character (\) is typically used to separate folders and files in a file path.
Example
"C:\users\stc_user\documents\dataset.csv"
In Python, however, the character \ is an "escape character" and is treated specially. Providing a file path like the example above may cause a runtime error. To disable this special treatment, use a "raw string" by adding the r prefix:
r"C:\users\stc_user\documents\dataset.csv"
Alternatively:

- double backslash: `C:\\users\\stc_user\\documents\\dataset.csv`
- forward slash: `C:/users/stc_user/documents/dataset.csv`
Errors and Exceptions#
Python generally handles runtime errors by "raising an exception", and the banff package follows this convention. Whenever an error occurs, an exception is raised. This could occur while the package is loading or preprocessing input data, running a procedure, or writing output data.

Exceptions generally contain a helpful error message, and are often "chained" to provide additional context.
Example: Exception while writing output dataset
The following is console output generated when the package fails to write an output dataset because the destination folder does not exist.
```
[ERROR , banff.donorimp._execute._write_outputs]: Error occurred while writing 'outmatching_fields' output dataset
[ERROR , banff.donorimp._execute._write_outputs]: Directory of output file does not exist: 'C:\temp\definitely\a\fake'
[ERROR , banff.donorimp._execute._write_outputs]: [WinError 3] The system cannot find the path specified: 'C:\\temp\\definitely\\a\\fake'
Traceback (most recent call last):
  File "C:\git\banff_redesign\Python\src\banff\_common\src\io_util\io_util.py", line 578, in write_output_dataset
    dst.parent.resolve(strict=True) # strict: exception if not exists
  File "C:\Program Files\Python310\lib\pathlib.py", line 1077, in resolve
    s = self._accessor.realpath(self, strict=strict)
  File "C:\Program Files\Python310\lib\ntpath.py", line 689, in realpath
    path = _getfinalpathname(path)
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'C:\\temp\\definitely\\a\\fake'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\git\banff_redesign\Python\src\banff\_common\src\proc\stc_proc.py", line 649, in _write_outputs
    ds.user_output = io.write_output_dataset(ds.ds_intermediate, ds.user_spec, log_lcl)
  File "C:\git\banff_redesign\Python\src\banff\_common\src\io_util\io_util.py", line 581, in write_output_dataset
    raise FileNotFoundError(mesg) from e
FileNotFoundError: Directory of output file does not exist: 'C:\temp\definitely\a\fake'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\git\banff_redesign\Python\sample_programs\conversion_examples\DonorImp01.py", line 61, in <module>
    banff_call = banff.donorimp(
  File "C:\git\banff_redesign\Python\src\banff\proc\proc_donorimp.py", line 119, in __init__
    super().__init__(
  File "C:\git\banff_redesign\Python\src\banff\proc\banff_proc.py", line 66, in __init__
    self._execute()
  File "C:\git\banff_redesign\Python\src\banff\_common\src\proc\stc_proc.py", line 367, in _execute
    self._write_outputs(log=log_lcl)
  File "C:\git\banff_redesign\Python\src\banff\_common\src\proc\stc_proc.py", line 654, in _write_outputs
    raise ProcedureOutputError(mesg) from e
banff._common.src.exceptions.ProcedureOutputError: Error occurred while writing 'outmatching_fields' output dataset
```

The first 3 lines are log messages generated by the banff package. The remaining lines are a standard exception traceback generated by Python itself. From top to bottom, it shows a chain of 3 exceptions.
The first is a low-level error indicating that a file path cannot be found, “[WinError 3] The system cannot find the path specified: 'C:\\temp\\definitely\\a\\fake'”.
The second more specifically indicates that the “Directory of output file does not exist: 'C:\temp\definitely\a\fake'”.
The third provides context about what was happening when this error occurred, “Error occurred while writing 'outmatching_fields' output dataset”.
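Since procedure failures surface as ordinary Python exceptions, a process-flow script can catch them like any other. A minimal hedged sketch (the tables and parameters are illustrative):

```python
try:
    banff_call = banff.donorimp(
        indata=indata,
        instatus=donorstat,
        unit_id="IDENT",
        # other parameters omitted for brevity
    )
except Exception as err:
    # log and re-raise so the surrounding process flow can halt cleanly
    print(f"Banff procedure failed: {err}")
    raise
```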
Working with SAS Files in Python#
The banff package provides a few useful functions for reading SAS files into memory or converting them to another format. To use these functions, your program must import banff.
Each function reads a SAS dataset from a given path and either loads it into an in-memory format or converts it to another supported file format.
Performance Considerations#
The formats used for input and output datasets will affect performance.
When there may not be sufficient RAM available (due to small RAM size or large datasets), datasets should be stored on disk. The file format selected will have an effect on performance. Apache Parquet (.parquet) and Apache Feather (.feather) file formats currently deliver the best performance when using files for input or output datasets.
Feather should use the least amount of RAM, making it ideal for large datasets or execution environments with little RAM; it is the recommended format for temporary files. Parquet generally produces the smallest file size while still providing impressive read and write performance in multi-CPU environments and reasonably minimal RAM usage; it is recommended for medium- to long-term storage of data.
Using the SAS dataset format for large input datasets may result in degraded performance, particularly in environments with little RAM. This format is only recommended for use with small datasets (under a few hundred MB). Using the SAS format is discouraged in general, with Apache Arrow formats (parquet and feather) being recommended instead.