Overview#
Banff is a statistical package developed by Statistics Canada, consisting of nine modular procedures performing various statistical data editing (SDE) functions, including imputation. Some general notes about Banff:
Most of the SDE methods included in Banff are designed for economic surveys, in particular for numerical variables such as revenue and employee counts. Banff does not currently include methods recommended for the imputation of categorical or ordinal data.
Banff includes a number of methods designed for data whose variables are constrained by linear relationships, also commonly referred to as linear edit rules or simply edits. This includes procedures that review data with respect to the edits, choose which variables to impute when the edits fail, and impute records to ensure that all edits are satisfied.
While each Banff procedure can be run independently, they follow a modular template and can be run in sequence as part of a larger SDE process flow. Outputs from one procedure act as natural inputs for subsequent procedures.
Banff uses status flags to track metadata such as selection and imputation flags. These status flags allow the Banff procedures to pass information from one procedure to another, and also serve as a log of the overall SDE process.
This user guide is meant to provide Banff users with general information about the package, covering important methodological concepts common to multiple procedures, and technical details. For information specific to individual procedures, including descriptions of all parameters, please see the Procedure Guides linked in the table of procedures below. A list of all outputs generated by the procedures can be found in the Output Tables document. A full description of the underlying methods, with examples and guidance on how they should be applied, can be found in the Functional Description.
When running Banff procedures in sequence as part of an SDE process flow, users are responsible for input and output between steps. An additional package, the Banff Processor, is a metadata-driven utility designed specifically for large-scale SDE production, incorporating the Banff procedures, and handling all intermediate data management.
The Banff user guide often uses terminology from the Generic Statistical Data Editing Model (GSDEM). Users are encouraged to reference the GSDEM for common terminology regarding SDE concepts.
Table of contents#
Methodology#
Most SDE functions can be categorized into one of the following function types defined in the GSDEM:
- Review: Functions that examine the data to identify potential problems.
- Selection: Functions that select units or fields within units for specified further treatment.
- Treatment: Functions that change the data in a way that is considered appropriate to improve the data quality. The modification of specific fields within a unit (i.e., filling in missing values or changing erroneous ones) is referred to as imputation.
The nine Banff procedures are listed in the table below, alongside a brief description of each, and the function types they perform. Each procedure has an alias in Banff, which is used to execute the procedure in Python; this is also used to reference the procedures within the user guides.
List of procedures#
| Procedure name | Alias | Function types | Description |
|---|---|---|---|
| Verify Edits | verifyedits | none | Checks the edits for consistency and redundancy. |
| Edit Statistics | editstats | Review | Produces edit summary statistics tables on records that pass, miss or fail each consistency edit. |
| Outlier Detection | outlier | Review, Selection | Identifies outlying observations using Hidiroglou-Berthelot or Sigma-Gap methods. |
| Error Localization | errorloc | Review, Selection | For each record, selects the minimum number of variables to impute such that each observation can be made to pass all edits. |
| Deterministic Imputation | deterministic | Treatment | Performs imputation when only one combination of values permits the record to pass the set of edits. |
| Donor Imputation | donorimp | Treatment | Performs nearest neighbour donor imputation such that each imputed record satisfies the specified post-imputation edits. |
| Estimator | estimator | Treatment | Performs imputation using estimation functions and/or linear regression estimators. |
| Prorate | prorate | Review, Selection, Treatment | Prorates and rounds records to satisfy user-specified edits. |
| Mass Imputation | massimp | Review, Selection, Treatment | Performs donor imputation for a block of variables using a nearest neighbour approach or random selection. |
This user guide does not include information about specific procedures; this can instead be accessed via the links in the table above or from the procedure guide index. The procedure guides include all information required to run the procedures, including detailed descriptions of every parameter, but only a brief description of the methods. For a full mathematical description of the procedure methods, with examples, please see the Functional Description.
Interaction between procedures#
The Banff procedures are designed to be run sequentially as part of an SDE process flow. The outputs from one procedure often act as inputs for subsequent procedures, and the statistical data that is the target of the SDE process is updated continuously throughout the process. A standard approach using the Banff procedures may resemble the following:
1. Validity and consistency edits are reviewed and refined using verifyedits.
2. An initial review of the raw statistical data is performed with editstats.
3. Review and selection functions are performed using outlier and errorloc to identify potential problems, and to select fields within units for further treatment.
4. Imputation is performed using the treatment functions available in deterministic, donorimp, and estimator to impute missing values and outliers, and to correct inconsistencies.
5. Prorating is performed using prorate to ensure all resulting values satisfy summation constraints, and to round any decimal values.
6. Optionally, massimp is used to impute large blocks of non-response.
7. A final review of the imputed data is performed with editstats to ensure the outgoing data meets desired quality standards.
For this process to work, information needs to be passed from one step to another. Selection flags generated by outlier and errorloc are stored in a status file that is read by subsequent treatment procedures. When one of the treatment procedures successfully imputes a field requiring imputation, the selection flag on the status file is replaced with an imputation flag, indicating to subsequent procedures that treatment is no longer required.
To manage the changes to both the statistical data and status flags, the Banff procedures are modular, with a standard set of inputs and outputs shared amongst procedures performing similar SDE function types. Of the nine Banff procedures, only two (prorate and massimp) perform all three function types; these procedures can be run in isolation to review the data, select records and/or fields for treatment, and perform imputation. Procedures that perform only treatment (deterministic, donorimp, and estimator) first require selection flags generated by one of the procedures performing selection (outlier or errorloc). The editstats procedure reviews the data for errors but does not produce selection flags. Finally, verifyedits does not perform any of the statistical data editing function types, but should be used before the SDE process begins to review and improve the edits used by other procedures.
Users should familiarize themselves with two types of data that are integral to understanding Banff:

- Statistical tables indata and outdata: These input and output files represent the statistical data, or microdata, that is the target of the SDE process. All procedures except verifyedits act on statistical data and require indata as an input. Only those procedures that perform treatment functions (deterministic, donorimp, estimator, prorate, and massimp) produce outdata, containing the imputation results.
- Status files instatus and outstatus: These input and output files contain status flags, important metadata identifying which fields in the statistical data require treatment (selection flags) and which have already been treated (imputation flags).
Statistical tables and status files are discussed in further detail in the following sections.
Statistical tables indata and outdata#
Except for verifyedits, all Banff procedures operate on statistical data (also called microdata) arranged in tabular form. The main input for most procedures is the indata table, which is the target of the data editing process. Some procedures also make use of indata_hist, historical data for the same set of records. Procedures that perform treatment functions produce the output table outdata.
The indata table must consist of tabular data arranged in rows and columns. Banff documents refer to rows as records, while columns are referred to as variables, or fields in the case of status flags. At least one character variable must serve as a unique record identifier, specified by the unit_id parameter in most procedures. Banff uses this identifier to track metadata throughout the data editing process. There are some restrictions on variable names for Banff inputs; please see this section for more information.
Procedures performing treatment functions (deterministic, donorimp, estimator, prorate and massimp) produce the outdata table, output statistical data (i.e., microdata) that includes the result of the treatment function. This includes both imputed values (e.g., values imputed by donorimp) and modified values (e.g., values prorated by prorate). Some important notes about outdata:
- The outdata table is typically not a complete copy of indata, but only contains rows and columns affected by the procedure. For example, if indata includes 2000 rows and 25 columns, but only 500 rows and 10 columns are affected by a procedure, then outdata will only include those 500 rows and 10 columns. To continue the SDE process, users should manually update the indata file with the new information from outdata, as sketched below. (Note: The Banff team is looking into automatically updating indata with the results of outdata in a future release.)
- The outdata table always contains the variables specified by the unit_id and by parameters.
- If no records are successfully imputed or altered by the procedure, then outdata is empty. No error will occur.
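Until automatic updating is available, the manual update can be done in a few lines. A minimal sketch using pandas (assuming both tables are DataFrames and the unique identifier is named "IDENT", as in the example later in this guide):

import pandas as pd

# Align both tables on the unique record identifier.
indata_df = indata_df.set_index("IDENT")
outdata_df = outdata_df.set_index("IDENT")

# Overwrite only the rows and columns returned by the procedure;
# update() copies non-missing values and leaves everything else untouched.
indata_df.update(outdata_df)

indata_df = indata_df.reset_index()  # restore IDENT as a regular column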
Status flags and files#
Banff stores important metadata about the SDE process in status flags, which include selection flags, exclusion flags, and imputation flags. This information is used in two ways:
- As inputs for subsequent steps in the data editing process. For example, the error localization procedure errorloc produces selection flags to identify variables that require imputation. These selection flags are read by imputation procedures such as donorimp and estimator in order to perform imputation.
- As a complete record of the data editing process. For example, the status flag history of a single observation can explain why, how and when it was modified.
Table of Banff-produced status flags#
| Status flag | Description |
|---|---|
| FTI | Field to impute: Selection flag indicating an observation requires additional treatment such as imputation. Generated by errorloc and outlier. |
| FTE | Field to exclude: Selection flag indicating an observation should be excluded from certain methods. Generated by outlier. |
| IDE | Imputed by deterministic imputation: Field has been imputed using the deterministic procedure. |
| IDN | Imputed by donor imputation: Field has been imputed using the donorimp procedure. |
| I– | Imputed by estimation imputation: Field has been imputed using the estimator procedure; the exact code depends on the specified algorithm. |
| IPR | Imputed by prorating: Field has been imputed using the prorate procedure. |
| IMAS | Imputed by mass imputation: Field has been imputed using the massimp procedure. |
Status files instatus and outstatus#
Selection and imputation flags are always associated with individual values on indata. Because indata is tabular, each observation can be associated with a specific record (row) and field (column). Records are identified by the user-specified unique record identifier unit_id, while fields are referenced by their variable name. Status flags are stored in status files with the following columns:
| Column | Description |
|---|---|
| unit_id variable | Record identifier (i.e., row) to which the status flag applies. (The actual column header is the name of the variable specified by the unit_id parameter.) |
| FIELDID | Field identifier (i.e., column) to which the status flag applies. |
| STATUS | Status flag such as "FTI", "FTE", or "IDN". |
| VALUE | Value of the variable when the status code was generated. For procedures performing selection (outlier and errorloc), this is the value before treatment; for procedures performing treatment, this is the imputed value. |
All procedures performing selection or treatment functions (i.e., all but verifyedits and editstats) automatically produce output status files labelled outstatus, containing selection or imputation flags. Some procedures also read status files as inputs (instatus); these may be required, depending on the procedure. A brief summary of the behaviour of each procedure with respect to status files is included in the table below.
Status files by procedure#
The following table summarizes which flags are read from instatus or produced on outstatus by each procedure. If a flag is required by a procedure, then instatus is mandatory.

| Procedure | Flags read from instatus | Flags produced on outstatus |
|---|---|---|
| verifyedits | N/A | N/A |
| editstats | N/A | N/A |
| outlier | N/A | FTI, FTE |
| errorloc | FTI (optional) | FTI |
| deterministic | FTI (required) | IDE |
| donorimp | FTI (required), FTE (optional), I– (optional) | IDN |
| estimator | FTI (required), FTE (optional), I– (optional) | I– (exact code depends on specified algorithm) |
| prorate | I– (optional) | IPR |
| massimp | N/A | IMAS |
Some input flags are optional, but can change the behaviour of the procedure. For example, within prorate, users can choose whether to prorate all values, only original values, or only previously imputed values. If they choose to prorate only original or only previously imputed values, an instatus file with I– status flags is required.
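To make the file structure concrete, here is a minimal sketch (assuming the status file columns described above and a unit_id variable named "IDENT"; the identifiers and field names are hypothetical) that builds an instatus table by hand and passes it to donorimp:

import banff
import pandas as pd

# Two fields flagged for imputation (FTI) and one flagged for exclusion (FTE).
instatus = pd.DataFrame({
    "IDENT":   ["B0001", "B0001", "B0007"],
    "FIELDID": ["Revenue", "Expenses", "Revenue"],
    "STATUS":  ["FTI", "FTI", "FTE"],
})

donorimp_call = banff.donorimp(
    indata=indata,
    instatus=instatus,
    unit_id="IDENT",
    # ... other required parameters omitted
)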
Specifying linear edits#
In the statistical data editing process, the term edits generally refers to constraints that records must satisfy to be considered valid. Linear edits refer to constraints that can be expressed as linear equalities or inequalities of the form
$$ a_1 x_1 + a_2 x_2 + \dots + a_n x_n = b \quad \text{or} \quad a_1 x_1 + a_2 x_2 + \dots + a_n x_n \le b $$
where $x_1$ to $x_n$ are numerical variables from the target statistical data, and $a_1$ to $a_n$ and $b$ are constants specified by the user. Of the nine Banff procedures, six require edits as input parameters:
- verifyedits
- editstats
- errorloc
- deterministic
- donorimp
- prorate (Note: there are additional restrictions and unique syntax for edits in the prorating procedure)
Formatting#
Use the edits parameter to specify a list of edits, following these rules:

- As with all string parameters, the list of edits must be surrounded by quotes " or triple quotes """ for multi-line edits.
- Each individual edit must be followed by a semi-colon ;.
- Individual edits must include one of the following operators:
  - < (less than)
  - <= (less than or equal to)
  - = (equal to)
  - != (not equal to)
  - >= (greater than or equal to)
  - > (greater than)
- Within an individual edit, one or more components must appear on each side of an operator, separated by + or -. A component can be a constant, a variable found on indata, or a constant and variable multiplied together: constant * variable or variable * constant. When multiplying constants and variables, they must be separated by an asterisk *.
- Optionally, users may add a modifier at the beginning of an individual edit, followed by a colon :. Acceptable modifiers are pass or fail and are not case sensitive.
A simple example with three edits specified on one line:
errorloc_call = banff.errorloc(
edits= "Profit = Revenue - Expenses; Profit >= 0; Expenses >= 0;"
... # etc. (parameters, output tables)
)
In the next example, the edits are spread over multiple lines and surrounded by triple quotes """. These edits also include constants and the pass and fail modifiers.
errorloc_call = banff.errorloc(
edits= """
Profit = Revenue - Expenses; Profit >= 0; Expenses >= 0;
0.9 * Total <= Var1 + Var2 + Var3;
Var1 + Var2 + Var3 <= 1.1 * Total;
Pass: Var4 >= Var5;
Fail: Var4 > Var5;
Fail: Num_Employees != Employees_BR
"""
... # etc. (parameters, output tables)
)
Allowable edits#
While users may express edits in a number of ways, the Banff procedures convert them to canonical form before processing, meaning each edit is expressed as a pass edit with an = or <= operator. Strict inequalities are not allowed, resulting in the following rules or modifications:
- Pass edits with < or > are replaced by <= and >= respectively.
- Fail edits with <= or >= are replaced by < and > respectively.
- Pass edits with != cannot be converted to canonical form and generate an error.
- Fail edits with = cannot be converted to canonical form and generate an error.
(Note: Users who wish to specify a strict inequality should instead include a small constant in their edit, i.e., A < B can be replaced by A <= B - 0.005 for values that are only recorded to two decimals of precision.)
The following table gives examples of valid and invalid original edits alongside their canonical form, if possible:
| Original edit | Canonical form |
|---|---|
| A + B <= C | A + B <= C (already canonical) |
| A = B + C | A = B + C (already canonical) |
| A < B | A <= B |
| A > B | B <= A |
| Fail: A >= B | A <= B |
| Fail: A != B | A = B |
| Pass: A != B | Invalid; generates an error |
| Fail: A = B | Invalid; generates an error |
Additionally, the set of edits specified by a user must be consistent; that is, the linear edits must form a non-empty feasible space. Specifying an inconsistent set of edits for a procedure will result in an error. Users are encouraged to use verifyedits to review their edits for consistency, redundancy, determinacy and hidden equalities before using them in one of the other procedures.
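For example, a set of edits can be checked on its own, before being used elsewhere, with a call along these lines (a sketch; the edit rules themselves are illustrative):

import banff

# Review the edits for consistency, redundancy, determinacy and hidden equalities.
verify_call = banff.verifyedits(
    edits="""
        Profit = Revenue - Expenses;
        Profit >= 0;
        Expenses >= 0;
        Revenue <= 1000000;
    """
)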
Processing within by-groups#
The by parameter, found in all Banff procedures excluding verifyedits, allows users to partition the input statistical data into groups that are processed independently. Within the Banff documents, these are referred to as "by-groups". Users specify one or more variables on indata that the procedure uses to form the by-groups. For example, by = "Province" will create by-groups for each unique value of the variable Province on indata, while by = "Province Industry" will create by-groups for each unique combination of the variables Province and Industry.
Notes on by-groups:

- Procedures errorloc, deterministic, and prorate operate on each record on indata independently; processing within by-groups has no effect on the procedure outputs. Nonetheless, the by parameter is still included in these procedures, and may be used in the future for performance enhancements.
- If by and indata_hist are both specified in a procedure, then the indata_hist table must also include the by variables.
- Input tables such as indata and indata_hist must be sorted according to the by variables before execution. By default, Banff procedures handle this sorting automatically; this sorting can be disabled (for example, if the input file is already sorted) by setting presort=False.
- When instatus is specified, some procedures are more efficient if the by variables are included on instatus. By default, Banff will add the by variables automatically; this action can be disabled (for example, if the variables are already present) by setting prefill_by_vars=False.
- By default, many procedures will output diagnostics to the log for each by-group; to disable this feature, specify no_by_stats=True.
- The outdata file will always contain the by variables.
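As an illustration, the following sketch (the variable names and parameter values are hypothetical) runs outlier detection independently within each Province-Industry combination:

import banff

outlier_call = banff.outlier(
    indata=indata,
    unit_id="IDENT",
    var="Revenue",             # variable to examine for outliers
    by="Province Industry",    # one by-group per Province x Industry combination
    # ... method-specific parameters omitted
)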
Technical guide#
Executing the Banff procedures#
To execute a Banff procedure in a Python script, we first import the Banff package alongside any other packages we plan on using:
import banff
import pandas as pd
We now create a new object with a name of our choosing (in this case, "my_errorloc") by calling one of the Banff procedures, using its alias from the table of procedures above:
my_errorloc = banff.errorloc(
indata=example_indata,
outstatus=True,
outreject=True,
edits="x1>=-5; x1<=15; x2>=30; x1+x2<=50;",
weights="x1=1.5;",
cardinality=2,
time_per_obs=0.1,
unit_id="IDENT",
by="ZONE"
)
- Procedures must be referenced using the Banff package name, i.e., banff.errorloc().
- Parameters (e.g., edits) and tables (e.g., indata) are specified as comma-separated key-value pairs, and can appear in any order.
Outputs can be stored in memory or saved to disk (see this section for details). If stored in memory, they are stored in the user-created object (e.g., my_errorloc) created when the procedure was executed:
print(my_errorloc.outstatus) # Print the outstatus table
errorloc_outstatus = my_errorloc.outstatus # Save outstatus as a new PyArrow Table called "errorloc_outstatus"
Variable names on input tables#
Many Banff procedure parameters reference variables (i.e., columns) from input tables such as indata. Any variable referenced by a Banff procedure must consist of a single string without spaces or special characters except underscore (_). For example, "first_name" is an acceptable variable name while "first name" and "first-name$" are not. To ensure that input tables are compatible with Banff, users may need to modify the variable names. Additional constraints:
- Variable names cannot exceed 64 characters in length.
- Variable names must be unique within an individual input table. (The same variable can appear on multiple input tables without issue.)
Common Banff parameters that reference variables include unit_id, var, weights, and by. Variables are simply referenced by their name, and lists of two or more variables are separated by a single space. For example:

- Single variable: unit_id = "business_number"
- Variable list: by = "province industry"
Note that variable names are not case-sensitive in Banff; if an input table has two or more columns that differ only in case, it is unclear which column will be used during processing. (Users are strongly encouraged to give distinct case-insensitive names to all columns on any input tables.)
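If an input table has non-compliant column names, they can be renamed before calling a procedure. A small pandas sketch (the column names are hypothetical):

import pandas as pd

df = pd.DataFrame({"first name": ["A"], "first-name$": ["B"]})

# Replace spaces and special characters so Banff can reference the columns.
df = df.rename(columns={
    "first name": "first_name",
    "first-name$": "first_name_alt",
})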
Banff log#
Log messages are written to the terminal during execution. These messages originate from two sources: Python code and procedure C code. Messages can be displayed in English or French.
Setting The Log Language#
The banff package produces log messages in either English or French. The package attempts to detect the language from its host environment during import. Use the function banff.set_language() to set the language at runtime, specifying a member of the banff.SupportedLanguage enumeration (i.e., .en or .fr).
Example: setting the language to French
banff.set_language(banff.SupportedLanguage.fr)
Python Log Messages#
Python handles logging using the standard logging package. All Python log messages have an associated log level, such as ERROR, WARNING, INFO, and DEBUG. Messages from Python are generally one line per message and are prefixed with a timestamp and level.
By default, only warning and error messages are displayed. Use the trace parameter to change which log levels are displayed.
Example Python Messages
The following are three unrelated messages from the INFO, DEBUG, and ERROR log levels. By default, only the third message would be printed to the terminal.

2024-11-29 10:47:51,853 [INFO]: Time zone for log entries: Eastern Standard Time (UTC-5.0)
2024-11-29 10:48:47,654 [DEBUG , banff.donorimp._execute._preprocess_inputs]: Adding BY variables to 'instatus' dataset
2024-11-29 10:49:59,867 [ERROR]: Procedure 'Donor Imputation' encountered an error and terminated early: missing mandatory dataset (return code 4)
Python Log Verbosity (trace=)#
Use the trace parameter to control which log levels are printed by specifying one of the following log levels:

- banff.log_level.ERROR
- banff.log_level.WARNING
- banff.log_level.INFO
- banff.log_level.DEBUG

Messages from the specified log level and higher levels will be printed; lower levels will not.

For convenience, specifying trace=True enables all logging output.
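For example, to display INFO-level and higher messages for a single call (a sketch reusing the errorloc example from earlier in this guide):

import banff

errorloc_call = banff.errorloc(
    indata=example_indata,
    edits="x1>=-5; x1<=15; x2>=30; x1+x2<=50;",
    unit_id="IDENT",
    trace=banff.log_level.INFO,  # show INFO, WARNING and ERROR messages
)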
Procedure C Log Messages#
Messages from procedure C code have no associated log level and do not include a timestamp. Some messages are prefixed with ERROR:, WARNING: or NOTE:, while other messages have no prefix and may span multiple lines. Most messages are informational. Whenever an ERROR: message is printed, there should be a subsequent related error-level message from Python, and an exception should be raised.
Example Procedure C Messages
NOTE: --- Banff System 3.01.001b19.dev6 developed by Statistics Canada ---
NOTE: PROCEDURE DONORIMPUTATION Version 3.01.001b19.dev6
...
NOTE: The minimal set of edits is equivalent to the original set of edits specified.
...
Number of valid observations ......................: 12 100.00%
Number of donors ..................................: 5 41.67%
Number of donors to reach DONORLIMIT ..............: 3 60.00%
Number of recipients ..............................: 7 58.33%
NOTE: The above message was for the following by group: prov=30
Procedure C Log Verbosity#
See no_by_stats.
Suppressing and Troubleshooting Log Messages (capture=)#
The capture parameter can be used to suppress all log output (using capture=None). This option is disabled by default (capture=False).
Specifying capture=True will cause log messages to be printed all at once at the end of procedure execution, instead of being printed immediately throughout execution. The difference may only be noticeable during long-running procedure calls. Using this option can improve performance in some cases, such as when processing a very large number of by-groups without specifying no_by_stats=True.
Jupyter Notebooks and Missing Log Messages

Due to how Jupyter notebooks manage Python's terminal output, messages from procedure C code may not be displayed, particularly when running in Visual Studio Code on Windows. To resolve this, specify capture=True in the procedure call when running in a Jupyter notebook, or use your own logger (see below).
Use Your Own Logger (logger=)#
Use the logger parameter to specify a Logger that you have created. All messages from Python and C will be sent to that logger. This allows customization of message prefixes, support for writing to file, etc.
Note that procedure C messages are sent to the logger in a single INFO level Python log message.
Example: writing logs to file
import banff
import logging

my_logger = logging.getLogger(__name__)
logging.basicConfig(filename='example.log', encoding='utf-8', level=logging.DEBUG)

# run Banff procedure
banff_call = banff.donorimp(
    logger=my_logger,
    indata=indata,
    instatus=donorstat,
    outmatching_fields=True,
    ...
Presort#
When by variables are specified, Banff procedures require input tables to be sorted according to those variables before execution. Some procedures also require sorting by unit_id. Setting presort = True automatically sorts any specified input tables (e.g., indata, indata_hist, and instatus) according to the procedure's requirements. By default, presort = True for all applicable Banff procedures. Users may disable this feature by specifying presort = False.
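For example, if the input table has already been sorted, the automatic sort can be skipped. A sketch (assuming indata is a pandas DataFrame; the by variables are hypothetical):

import banff

# indata is already sorted by the by variables, so Banff's automatic sort is unnecessary.
indata_sorted = indata.sort_values(["Province", "Industry"])

donorimp_call = banff.donorimp(
    indata=indata_sorted,
    unit_id="IDENT",
    by="Province Industry",
    presort=False,
    # ... other required parameters omitted
)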
prefill_by_vars#
Setting prefill_by_vars = True will automatically add any specified by variables to the input status file(s), if necessary, before running the Banff procedure. In some cases, the presence of by variables on input status files may significantly improve procedure performance. By default, prefill_by_vars = True for all applicable Banff procedures. Users may disable this feature by specifying prefill_by_vars = False.
no_by_stats#
Many of the Banff procedures output information to the Banff log. When by-groups are specified, this information is typically produced for each by-group. Specify no_by_stats = True to reduce log output by suppressing by-group-specific messages. This parameter is available for most procedures that allow by groups, though some procedures have few by-group-specific messages.
Expressions#
The parameters exclude_where_indata and exclude_where_indata_hist apply boolean logic to input tables using SQL expressions. SQL expression support is implemented using DuckDB; see their documentation on expressions for a complete guide to supported syntax.
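For example, records can be excluded from consideration using an SQL expression over indata columns. A sketch (the column names and condition are hypothetical):

import banff

donorimp_call = banff.donorimp(
    indata=indata,
    unit_id="IDENT",
    exclude_where_indata="Revenue < 0 OR Province = '99'",  # DuckDB SQL syntax
    # ... other required parameters omitted
)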
Input and output table specification#
For both input and output tables, users can specify in-memory objects or files on disk. A number of different formats are supported for both types. Objects are associated with identifiers (e.g., "pandas") while files are associated with extensions (e.g., "filename.parquet"); please see the table below for details. Note that some formats are recommended for testing purposes only, and not all formats are supported for output tables.
Supported formats#
| Format | Type | Supported identifier(s) or extension(s) | Notes |
|---|---|---|---|
| PyArrow Table | Object | "pyarrow" | Recommended format for in-memory objects. |
| Pandas DataFrame | Object | "pandas", "dataframe" | |
| Apache Parquet | File | .parquet | Minimal RAM usage, good performance with large tables. |
| Apache Feather | File | .feather | Least RAM usage, good performance with large tables. |
| SAS Dataset | File | .sas7bdat | For testing purposes, input only; not recommended in production. |
| Comma Separated Value | File | .csv | For testing purposes only; not recommended in production. |
For tips related to file paths in Python, see Escape Characters and File Paths
Specifying input tables#
To input from an in-memory object, simply reference the object name directly from the procedure call. The procedure will automatically detect the type of object from amongst the supported types.
donorimp_call = banff.donorimp(
indata=df, # where df is a Pandas DataFrame previously generated
instatus=table, # where table is a PyArrow Table previously generated
... # etc. (parameters, output tables)
)
To specify an input from file, include either a relative or complete file path:
donorimp_call = banff.donorimp(
indata="./input_data.parquet", # Parquet file with local reference
instatus=r"C:\temp\input_status.feather", # Feather file with Windows reference
... # etc. (parameters, output datasets)
)
Users can mix both types of inputs as well:
donorimp_call = banff.donorimp(
indata="./input_data.parquet", # Parquet file with local reference
instatus=table, # where table is a PyArrow Table previously generated
... # etc. (parameters, output tables)
)
Specifying output tables#
The Banff procedures automatically create a number of output tables. Some of these are optional, and can be disabled by specifying False. (Specifying False for an optional output will prevent it from being produced at all, possibly reducing memory usage. Specifying False for a mandatory output will result in an error.) The default format for output tables is the in-memory PyArrow Table. To produce the output in another in-memory format, specify its associated identifier as a string. To write outputs to file, specify a file path with a supported extension.
See the table of supported formats for a list of identifiers and extensions. Please see the Output Tables document for a full list of output tables by procedure.
The following example includes both mandatory and optional outputs, saved as a mix of in-memory objects and files.
estimator_call = banff.estimator(
outdata=True, # Saved as a PyArrow Table
outstatus=True, # Saved as a PyArrow Table
outacceptable=True, # Optional output saved as PyArrow Table
outest_ef="pyarrow", # Optional output saved as a PyArrow Table
outest_lr="dataframe", # Optional output saved as a Pandas Dataframe
outest_parm=False, # Optional output disabled
outrand_err="./output_data.parquet", # Optional output saved as parquet file
... # etc. (parameters, output tables)
)
Note: because tables are enabled by default, and PyArrow Table is the default output format, the following would produce identical results:
estimator_call = banff.estimator(
outest_lr="dataframe", # Optional output saved as a Pandas Dataframe
outest_parm=False, # Optional output disabled
outrand_err="./output_data.parquet", # Optional output saved as parquet file
... # etc. (parameters, output tables)
)
NOTE: Outputs will automatically overwrite existing objects and files with the same name.
Customize Default Output Specification#
To determine the current default output table format:

>>> banff.get_default_output_spec()
'pyarrow'

This corresponds to pyarrow.Table. The default can be set to any identifier from the table of supported formats.

Example: switch the default output format to pandas.DataFrame:

banff.set_default_output_spec('pandas')
Accessing output tables#
For objects saved in memory, access them using the object member named after the output table:
estimator_call = banff.estimator(
outdata=True, # Saved as a PyArrow Table
outstatus=True, # Saved as a PyArrow Table
... # etc.
)
print(estimator_call.outdata) # Print outdata to the terminal
my_table = estimator_call.outstatus # Save outstatus as a new object called my_table
Note: because outdata and outstatus are mandatory outputs, they would still be accessible as estimator_call.outdata and estimator_call.outstatus even if not explicitly enabled with True.
Other#
Escape Characters and File Paths#
On Windows, the backslash character (\) is typically used to separate folders and files in a file path.
Example
"C:\users\stc_user\documents\dataset.csv"
In Python, however, the character \ is an "escape character" and is treated specially. Providing a file path using the example above may cause a runtime error. To disable this special treatment, use a "raw string" by adding the r prefix:
r"C:\users\stc_user\documents\dataset.csv"
Alternatively:

- double backslash: "C:\\users\\stc_user\\documents\\dataset.csv"
- forward slash: "C:/users/stc_user/documents/dataset.csv"
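Another option, not covered in the original examples, is to build paths with pathlib from the Python standard library, which handles separators portably:

from pathlib import Path

# Forward slashes work in the source; Path renders the correct separator for the platform.
data_file = Path("C:/users/stc_user/documents") / "dataset.csv"
print(data_file)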
Errors and Exceptions#
Python generally handles runtime errors by "raising an exception", and the banff package has adopted this approach. Whenever an error occurs, an exception is raised. This could occur while the package is loading or preprocessing input data, running a procedure, or writing output data.
Generally, exceptions contain a helpful error message. Exceptions are often "chained" to provide additional context.
Example: Exception while writing output dataset
The following is console output generated when the package fails to write an output dataset because the destination folder does not exist.
[ERROR , banff.donorimp._execute._write_outputs]: Error occurred while writing 'outmatching_fields' output dataset
[ERROR , banff.donorimp._execute._write_outputs]: Directory of output file does not exist: 'C:\temp\definitely\a\fake'
[ERROR , banff.donorimp._execute._write_outputs]: [WinError 3] The system cannot find the path specified: 'C:\\temp\\definitely\\a\\fake'
Traceback (most recent call last):
  File "C:\git\banff_redesign\Python\src\banff\_common\src\io_util\io_util.py", line 578, in write_output_dataset
    dst.parent.resolve(strict=True) # strict: exception if not exists
  File "C:\Program Files\Python310\lib\pathlib.py", line 1077, in resolve
    s = self._accessor.realpath(self, strict=strict)
  File "C:\Program Files\Python310\lib\ntpath.py", line 689, in realpath
    path = _getfinalpathname(path)
FileNotFoundError: [WinError 3] The system cannot find the path specified: 'C:\\temp\\definitely\\a\\fake'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\git\banff_redesign\Python\src\banff\_common\src\proc\stc_proc.py", line 649, in _write_outputs
    ds.user_output = io.write_output_dataset(ds.ds_intermediate, ds.user_spec, log_lcl)
  File "C:\git\banff_redesign\Python\src\banff\_common\src\io_util\io_util.py", line 581, in write_output_dataset
    raise FileNotFoundError(mesg) from e
FileNotFoundError: Directory of output file does not exist: 'C:\temp\definitely\a\fake'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\git\banff_redesign\Python\sample_programs\conversion_examples\DonorImp01.py", line 61, in <module>
    banff_call = banff.donorimp(
  File "C:\git\banff_redesign\Python\src\banff\proc\proc_donorimp.py", line 119, in __init__
    super().__init__(
  File "C:\git\banff_redesign\Python\src\banff\proc\banff_proc.py", line 66, in __init__
    self._execute()
  File "C:\git\banff_redesign\Python\src\banff\_common\src\proc\stc_proc.py", line 367, in _execute
    self._write_outputs(log=log_lcl)
  File "C:\git\banff_redesign\Python\src\banff\_common\src\proc\stc_proc.py", line 654, in _write_outputs
    raise ProcedureOutputError(mesg) from e
banff._common.src.exceptions.ProcedureOutputError: Error occurred while writing 'outmatching_fields' output dataset

The first three lines are log messages generated by the banff package. The remaining lines are a standard exception traceback generated by Python itself. From top to bottom, it shows a chain of three exceptions.
The first is a low-level error indicating that a file path cannot be found: "[WinError 3] The system cannot find the path specified: 'C:\\temp\\definitely\\a\\fake'". The second more specifically indicates that the "Directory of output file does not exist: 'C:\temp\definitely\a\fake'". The third provides context about what was happening when this error occurred: "Error occurred while writing 'outmatching_fields' output dataset".
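Because errors surface as exceptions, procedure calls can be wrapped in standard Python error handling. A minimal sketch (catching the generic Exception, since the package's exception classes vary by failure):

import banff

try:
    banff_call = banff.donorimp(
        indata=indata,
        unit_id="IDENT",
        # ... other required parameters omitted
    )
except Exception as err:
    print(f"Banff procedure failed: {err}")
    raise  # re-raise after logging so the failure is not silently ignored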
Working with SAS Files in Python#
The banff package provides a few useful functions for reading SAS files into memory or converting them to another format. To use these functions, your program must import banff.
| Function | Description |
|---|---|
| banff.io_util.SAS_file_to_pyarrow_table() | Reads SAS dataset at the given path into an in-memory PyArrow Table. |
| banff.io_util.SAS_file_to_pandas_dataframe() | Reads SAS dataset at the given path into an in-memory Pandas DataFrame. |
| banff.io_util.SAS_file_to_parquet_file() | Reads SAS dataset at the given path and writes it to an Apache Parquet file. |
| banff.io_util.SAS_file_to_feather_file() | Reads SAS dataset at the given path and writes it to an Apache Feather file. |
Performance Considerations#
The formats used for input and output datasets will affect performance.
When there may not be sufficient RAM available (due to small RAM size or large datasets), datasets should be stored on disk. The file format selected will have an effect on performance. The Apache Parquet (.parquet) and Apache Feather (.feather) file formats currently deliver the best performance when using files for input or output datasets.
Feather should use the least amount of RAM, making it ideal for large datasets or execution environments with little RAM; it is the recommended format for temporary files. Parquet generally produces the smallest file size while still providing impressive read and write performance in multi-CPU environments and reasonably minimal RAM usage; it is recommended for medium- to long-term storage of data.
Using the SAS dataset format for large input datasets may result in degraded performance, particularly in environments with little RAM. This format is only recommended for use with small datasets (under a few hundred MB). Using the SAS format is discouraged in general, with Apache Arrow formats (parquet and feather) being recommended instead.