# Overview

Banff is a statistical package developed by Statistics Canada, consisting of nine modular procedures performing various statistical data editing (SDE) functions, including imputation. Some general notes about Banff:

* Most of the SDE methods included in Banff are designed for economic surveys, and in particular for numerical variables such as revenue and employee counts. Banff does not currently include methods recommended for the imputation of categorical or ordinal data.
* Banff includes a number of methods designed for data whose variables are constrained by linear relationships, also commonly referred to as linear edit rules or simply edits. This includes procedures that review data with respect to the edits, choose which variables to impute when the edits fail, and impute records to ensure that all edits are satisfied.
* While each Banff procedure can be run independently, they follow a modular template and can be run in sequence as part of a larger SDE process flow. Outputs from one procedure act as natural inputs for subsequent procedures.
* Banff uses status flags to track metadata such as selection and imputation flags. These status flags allow the Banff procedures to pass information from one procedure to another, and also serve as a log of the overall SDE process.

This user guide is meant to provide Banff users with general information about the package, covering important methodological concepts common to multiple procedures, as well as technical details. For information specific to individual procedures, including descriptions of all parameters, please see the Procedure Guides linked in the [table of procedures](#list-of-procedures) below. A list of all outputs generated by the procedures can be found in the [Output Tables](./output_tables.md) document. A full description of the underlying methods, with examples and guidance on how they should be applied, can be found in the [Functional Description](./Banff%20Functional%20Description.pdf).

When running Banff procedures in sequence as part of an SDE process flow, users are responsible for input and output between steps. An additional package, the *Banff Processor*, is a metadata-driven utility designed specifically for large-scale SDE production, incorporating the Banff procedures and handling all intermediate data management.

The Banff user guide often uses terminology from the [Generic Statistical Data Editing Model (GSDEM)](https://statswiki.unece.org/spaces/sde/pages/117771706/GSDEM). Users are encouraged to reference the GSDEM for common terminology regarding SDE concepts.
# Table of contents

- [Methodology](#methodology)
  - [List of procedures](#list-of-procedures)
  - [Interaction between procedures](#interaction-between-procedures)
  - [Statistical tables `indata` and `outdata`](#statistical-tables-indata-and-outdata)
  - [Status flags and files](#status-flags-and-files)
    - [Table of Banff-produced status flags](#table-of-banff-produced-status-flags)
    - [Status files `instatus` and `outstatus`](#status-files-instatus-and-outstatus)
    - [Status files by procedure](#status-files-by-procedure)
  - [Specifying linear edits](#specifying-linear-edits)
    - [Formatting](#formatting)
    - [Allowable edits](#allowable-edits)
  - [Processing within by-groups](#processing-within-by-groups)
- [Technical guide](#technical-guide)
  - [Executing the Banff procedures](#executing-the-banff-procedures)
  - [Variable names on input tables](#variable-names-on-input-tables)
  - [Banff log](#banff-log)
    - [Setting The Log Language](#setting-the-log-language)
    - [Python Log Messages](#python-log-messages)
      - [Python Log Verbosity (`trace=`)](#python-log-verbosity-trace)
    - [Procedure C Log Messages](#procedure-c-log-messages)
      - [Procedure C Log Verbosity](#procedure-c-log-verbosity)
    - [Suppressing and Troubleshooting Log Messages (`capture=`)](#suppressing-and-troubleshooting-log-messages-capture)
    - [Use Your Own Logger (`logger=`)](#use-your-own-logger-logger)
    - [Presort](#presort)
    - [prefill_by_vars](#prefill_by_vars)
    - [no_by_stats](#no_by_stats)
  - [Expressions](#expressions)
  - [Input and output table specification](#input-and-output-table-specification)
    - [Supported formats](#supported-formats)
    - [Specifying input tables](#specifying-input-tables)
    - [Specifying output tables](#specifying-output-tables)
    - [Customize Default Output Specification](#customize-default-output-specification)
    - [Accessing output tables](#accessing-output-tables)
  - [Other](#other)
    - [Escape Characters and File Paths](#escape-characters-and-file-paths)
    - [Errors and Exceptions](#errors-and-exceptions)
    - [Working with SAS Files in Python](#working-with-sas-files-in-python)
    - [Performance Considerations](#performance-considerations)

# Methodology

Most SDE functions can be categorized into one of the following **function types** defined in the GSDEM:

* **Review:** Functions that examine the data to identify potential problems.
* **Selection:** Functions that select units or fields within units for specified further treatment.
* **Treatment:** Functions that change the data in a way that is considered appropriate to improve the data quality. The modification of specific fields within a unit (i.e., filling in missing values or changing erroneous ones) is referred to as imputation.

The nine Banff procedures are listed in the table below, alongside a brief description of each and the function types they perform. Each procedure has an *alias* in Banff, which is used to execute the procedure in Python; this alias is also used to reference the procedures within the user guides.

## List of procedures

| Procedure name | Alias | Function types | Description |
| -------------- | ----- | -------------- | ----------- |
| [Verify Edits](./Procedure%20Guides/verifyedits.md) | `verifyedits` | *none* | Checks the edits for consistency and redundancy. |
| [Edit Statistics](./Procedure%20Guides/editstats.md) | `editstats` | Review | Produces edit summary statistics tables on records that pass, miss or fail each consistency edit. |
| [Outlier Detection](./Procedure%20Guides/outlier.md) | `outlier` | Review, Selection | Identifies outlying observations using Hidiroglou-Berthelot or Sigma-Gap methods. |
| [Error Localization](./Procedure%20Guides/errorloc.md) | `errorloc` | Review, Selection | For each record, selects the minimum number of variables to impute such that the record can be made to pass all edits. |
| [Deterministic Imputation](./Procedure%20Guides/deterministic.md) | `deterministic` | Treatment | Performs imputation when only one combination of values permits the record to pass the set of edits. |
| [Donor Imputation](./Procedure%20Guides/donorimp.md) | `donorimp` | Treatment | Performs nearest neighbour donor imputation such that each imputed record satisfies the specified post-imputation edits. |
| [Estimator Imputation](./Procedure%20Guides/estimator.md) | `estimator` | Treatment | Performs imputation using estimation functions and/or linear regression estimators. |
| [Prorating](./Procedure%20Guides/prorate.md) | `prorate` | Review, Selection, Treatment | Prorates and rounds records to satisfy user-specified edits. |
| [Mass Imputation](./Procedure%20Guides/massimp.md) | `massimp` | Review, Selection, Treatment | Performs donor imputation for a block of variables using a nearest neighbour approach or random selection. |

This user guide does not include information about specific procedures; this can instead be accessed via the links in the table above or from the procedure guide [index](./Procedure%20Guides/index.md). The procedure guides include all information required to run the procedures, including detailed descriptions of every parameter, but only a brief description of the methods. For a full mathematical description of the procedure methods, with examples, please see the [Functional Description](Banff%20Functional%20Description.pdf).

## Interaction between procedures

The Banff procedures are designed to be run sequentially as part of an SDE process flow. The outputs from one procedure often act as inputs for subsequent procedures, and the statistical data that is the target of the SDE process is updated continuously throughout the process. A standard approach using the Banff procedures may resemble the following:

1. Validity and consistency edits are reviewed and refined using `verifyedits`.
2. An initial review of the raw statistical data is performed with `editstats`.
3. Review and selection functions are performed using `outlier` and `errorloc` to identify potential problems and select fields within units for further treatment.
4. Imputation is performed using the treatment functions available in `deterministic`, `donorimp`, and `estimator` to impute missing values and outliers, and to correct inconsistencies.
5. Prorating is performed using `prorate` to ensure all resulting values satisfy summation constraints, and to round any decimal values.
6. Optionally, `massimp` is used to impute large blocks of non-response.
7. A final review of the imputed data is performed with `editstats` to ensure the outgoing data meets desired quality standards.

For this process to work, information needs to be passed from one step to another. Selection flags generated by `outlier` and `errorloc` are stored in a status file that is read by subsequent treatment procedures. When one of the treatment procedures successfully imputes a field requiring imputation, the selection flag on the status file is replaced with an imputation flag, indicating to subsequent procedures that treatment is no longer required.
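As a schematic illustration of this hand-off (written in the elided style of the examples later in this guide, with placeholder table names and parameter values), the status file produced by a selection procedure can be passed directly to a treatment procedure:

```python
import banff

# Selection: errorloc reviews indata against the edits and writes
# FTI (field to impute) flags to its outstatus table.
errorloc_call = banff.errorloc(
    indata=indata,                     # statistical data, loaded beforehand
    edits="Profit = Revenue - Expenses; Profit >= 0; Expenses >= 0;",
    unit_id="IDENT",
    ... # etc. (parameters, output tables)
)

# Treatment: donorimp reads the FTI flags produced above; fields it
# successfully imputes are flagged IDN on its own outstatus table.
donorimp_call = banff.donorimp(
    indata=indata,
    instatus=errorloc_call.outstatus,  # selection flags from the previous step
    unit_id="IDENT",
    ... # etc. (parameters, output tables)
)
```

Between treatment steps, users are also responsible for applying each procedure's `outdata` results back to the statistical data, as described in the following sections.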
To manage the changes to both the statistical data and status flags, the Banff procedures are modular, with a standard set of inputs and outputs shared amongst procedures performing similar SDE function types. Of the nine Banff procedures, only two (`prorate` and `massimp`) perform all three function types; these procedures can be run in isolation to review the data, select records and/or fields for treatment, and perform imputation. Procedures that perform only treatment (`deterministic`, `donorimp`, and `estimator`) first require selection flags generated by one of the procedures performing selection (`outlier` or `errorloc`). The `editstats` procedure reviews the data for errors but does not produce selection flags. Finally, `verifyedits` does not perform any of the statistical data editing function types, but should be used before the SDE process begins to review and improve the `edits` used by other procedures.

Users should familiarize themselves with two types of data that are integral to understanding Banff:

1. Statistical tables `indata` and `outdata`: These input and output files represent the statistical data, or microdata, that is the target of the SDE process. All procedures except `verifyedits` act on statistical data and require `indata` as an input. Only those procedures that perform treatment functions (`deterministic`, `donorimp`, `estimator`, `prorate`, and `massimp`) produce `outdata`, containing the imputation results.
2. Status files `instatus` and `outstatus`: These input and output files contain status flags, important metadata identifying which fields in the statistical data require treatment (selection flags) and which have already been treated (imputation flags).

Statistical tables and status files are discussed in further detail in the following sections.

## Statistical tables `indata` and `outdata`

Except for `verifyedits`, all Banff procedures operate on statistical data (also called microdata) arranged in tabular form. The main input for most procedures is the `indata` table, which is the target of the data editing process. Some procedures also make use of `indata_hist`, historical data for the same set of records. Procedures that perform treatment functions produce the output table `outdata`.

The `indata` table must consist of tabular data arranged in rows and columns. Banff documents refer to rows as *records* while columns are referred to as *variables*, or *fields* in the case of status flags. At least one character variable must serve as a unique record identifier, specified by the `unit_id` parameter in most procedures. Banff uses this identifier to track metadata throughout the data editing process. There are some restrictions on variable names for Banff inputs; please see [this section](#variable-names-on-input-tables) for more information.

Procedures performing treatment functions (`deterministic`, `donorimp`, `estimator`, `prorate` and `massimp`) produce the `outdata` table, output statistical data (i.e., microdata) that includes the result of the treatment function. This includes both imputed values (e.g., values imputed by `donorimp`) and modified values (e.g., prorated values from `prorate`). Some important notes about `outdata`:

* The `outdata` table is typically **not a complete copy of `indata`**, but only contains rows and columns affected by the procedure. For example, if `indata` includes 2000 rows and 25 columns, but only 500 rows and 10 columns are affected by a procedure, then `outdata` will only include 500 rows and 10 columns.
  To continue the SDE process, users should manually update the `indata` file with the new information from `outdata`. (*Note: The Banff team is looking into automatically updating `indata` with the results of `outdata` in a future release.*)
* The `outdata` table always contains the variables specified by the `unit_id` and `by` parameters.
* If no records are successfully imputed or altered by the procedure, then `outdata` is empty. No error will occur.

## Status flags and files

Banff stores important metadata about the SDE process in status flags. These status flags capture important information including selection flags, exclusion flags, and imputation flags. This information is used in two ways:

* As inputs for subsequent steps in the data editing process. For example, the error localization procedure `errorloc` produces selection flags to identify variables that require imputation. These selection flags are read by imputation procedures such as `donorimp` and `estimator` in order to perform imputation.
* As a complete record of the data editing process. For example, the status flag history of a single observation can explain *why*, *how* and *when* it was modified.

### Table of Banff-produced status flags

| Status flag | Description |
| ----------- | ----------- |
| `FTI` | *Field to impute:* Selection flag indicating an observation requires additional treatment such as imputation. Generated by the `outlier` and `errorloc` procedures, or by a user-defined procedure. |
| `FTE` | *Field to exclude:* Selection flag indicating an observation should be excluded from certain methods. Generated by the `outlier` procedure, or by a user-defined procedure. |
| `IDE` | *Imputed by deterministic imputation:* Field has been imputed using the `deterministic` procedure and should be considered as a reported value (not imputed) for the purpose of subsequent methods. (Also referred to as *deductive* imputation.) |
| `IDN` | *Imputed by donor imputation:* Field has been imputed using the `donorimp` procedure. |
| `I--` | *Imputed by estimator imputation:* Field has been imputed using the `estimator` procedure. The exact code depends on the specified algorithm, e.g., an `ICR` flag indicates the `CURRATIO` algorithm was used while an `IDT` flag indicates the `DIFTREND` algorithm was used. |
| `IPR` | *Imputed by prorating:* Field has been imputed using the `prorate` procedure. |
| `IMAS` | *Imputed by mass imputation:* Field has been imputed using the `massimp` procedure. |

### Status files `instatus` and `outstatus`

Selection and imputation flags are always associated with individual values on `indata`. Because `indata` is tabular, each observation can be associated with a specific record (row) and field (column). Records are identified by the user-specified unique record identifier `unit_id`, while fields are referenced by their variable name. Status flags are stored in status files with the following columns:

| Column | Description |
| ------ | ----------- |
| `<unit_id>` | Record identifier (i.e., row) to which the status flag applies. (The actual column header is the name of the variable specified by the `unit_id` parameter.) |
| `FIELDID` | Field identifier (i.e., column) to which the status flag applies. |
| `STATUS` | Status flag such as "FTI", "FTE", or "IDN". |
| `VALUE` | Value of the variable when the status code was generated. For procedures performing selection (`errorloc` and `outlier`), this column reflects the value of the observation on `indata` when selection occurred. For procedures performing treatment (`deterministic`, `donorimp`, `estimator`, `prorate`, `massimp`), this column reflects the value of the observation on `outdata`, after treatment has occurred. |

All procedures performing selection or treatment functions (i.e., all but `verifyedits` and `editstats`) automatically produce output status files labelled `outstatus` containing selection or imputation flags. Some procedures also read status files as inputs (`instatus`); these may be required, depending on the procedure. A brief summary of the behaviour of each procedure with respect to status files is included in the table below.

### Status files by procedure

The following table summarizes which flags are read from `instatus` or produced on `outstatus` by each procedure. If a flag is required by a procedure, then `instatus` is mandatory.

| Procedure | Flags read from `instatus` | Flags produced on `outstatus` |
| --------------- | ---------------------------------------------- | ----------------------------------------------- |
| `verifyedits` | N/A | N/A |
| `editstats` | N/A | N/A |
| `outlier` | N/A | FTI, FTE |
| `errorloc` | FTI (optional) | FTI |
| `deterministic` | FTI (required) | IDE |
| `donorimp` | FTI (required), FTE (optional), I-- (optional) | IDN |
| `estimator` | FTI (required), FTE (optional), I-- (optional) | I-- (exact code depends on specified algorithm) |
| `prorate` | I-- (optional) | IPR |
| `massimp` | N/A | IMAS |

Some input flags are optional, but can change the behaviour of the procedure. For example, within `prorate`, users can choose whether to prorate all values, only original values, or only previously imputed values. If they choose to prorate only original or only previously imputed values, an `instatus` file with I-- status flags is required.

## Specifying linear edits

In the statistical data editing process, the term *edits* generally refers to constraints that records must satisfy to be considered valid. Linear edits refer to constraints that can be expressed as linear equalities or inequalities of the form

$$
a_1 x_1 + a_2 x_2 + \ldots + a_n x_n = b
\quad \text{or} \quad
a_1 x_1 + a_2 x_2 + \ldots + a_n x_n \le b
$$

where $x_1$ to $x_n$ are numerical variables from the target statistical data, and $a_1$ to $a_n$ and $b$ are constants specified by the user.

Of the nine Banff procedures, six of them require edits as input parameters:

* `verifyedits`
* `editstats`
* `errorloc`
* `deterministic`
* `donorimp`
* `prorate` (Note: there are additional restrictions and unique syntax for edits in the prorating procedure)

### Formatting

Use the `edits` parameter to specify a list of edits, following these rules:

* As with all string parameters, the list of edits must be surrounded by quotes `"` or triple quotes `"""` for multi-line edits.
* Each individual edit must be followed by a semi-colon `;`.
* Individual edits must include one of the following *operators*:
  * `<` (less than)
  * `<=` (less than or equal to)
  * `=` (equal to)
  * `!=` (not equal to)
  * `>=` (greater than or equal to)
  * `>` (greater than)
* Within an individual edit, one or more *components* must appear on each side of an operator, separated by `+` or `-`. A component can be a `constant`, a `variable` found on `indata`, or a constant and variable multiplied together: `constant * variable` or `variable * constant`. When multiplying constants and variables, they must be separated by an asterisk `*`.
* Optionally, users may add a `modifier` at the beginning of an individual edit, followed by a colon `:`.
  Acceptable modifiers are `pass` or `fail` and are not case sensitive.

A simple example with three edits specified on one line:

```python
errorloc_call = banff.errorloc(
    edits="Profit = Revenue - Expenses; Profit >= 0; Expenses >= 0;",
    ... # etc. (parameters, output tables)
)
```

In the following example, the edits are spread over multiple lines and surrounded by triple quotes `"""`. These edits also include constants and the `pass` and `fail` modifiers.

```python
errorloc_call = banff.errorloc(
    edits="""
        Profit = Revenue - Expenses;
        Profit >= 0;
        Expenses >= 0;
        0.9 * Total <= Var1 + Var2 + Var3;
        Var1 + Var2 + Var3 <= 1.1 * Total;
        Pass: Var4 >= Var5;
        Fail: Var4 > Var5;
        Fail: Num_Employees != Employees_BR;
        """,
    ... # etc. (parameters, output tables)
)
```

### Allowable edits

While users may express edits in a number of ways, the Banff procedures convert them to *canonical form* before processing, meaning each edit is expressed as a *pass* edit with an `=` or `<=` operator. Strict inequalities are not allowed, resulting in the following rules or modifications:

* Pass edits with `<` or `>` are replaced by `<=` and `>=` respectively.
* Fail edits with `<=` or `>=` are replaced by `<` and `>` respectively.
* Pass edits with `!=` cannot be converted to canonical form and generate an error.
* Fail edits with `=` cannot be converted to canonical form and generate an error.

(*Note: Users who wish to specify a strict inequality should instead include a small constant in their edit, i.e., `A < B` can be replaced by `A <= B - 0.005` for values that are only recorded to two decimals of precision.*)

The following table gives examples of valid and invalid original edits alongside their canonical form, if possible:

| Original edit | Canonical form |
| ----------------- | -------------- |
| `Pass: A > B + 3` | `-A + B <= -3` |
| `Pass: C = D` | `C - D = 0` |
| `Pass: Z < A` | `-A + Z <= 0` |
| `Pass: M != N` | `INVALID` |
| `Fail: A > B + 3` | `A - B <= 3` |
| `Fail: C = D` | `INVALID` |
| `Fail: Z <= A` | `A - Z <= 0` |
| `Fail: N != M` | `-M + N = 0` |

Additionally, the set of edits specified by a user must be consistent; that is, the linear edits must form a non-empty feasible space. Specifying an inconsistent set of edits for a procedure will result in an error. Users are encouraged to use `verifyedits` to review their edits for consistency, redundancy, determinacy and hidden equalities before using them in one of the other procedures.

## Processing within by-groups

The `by` parameter, found in all Banff procedures excluding `verifyedits`, allows users to partition the input statistical data into groups that are processed independently. Within the Banff documents, these are referred to as "by-groups". Users specify one or more variables on `indata` that the procedure uses to form the by-groups. For example, `by = "Province"` will create by-groups for each unique value of the variable `Province` on `indata`, while `by = "Province Industry"` will create by-groups for each unique combination of the variables `Province` and `Industry`.

Notes on by-groups:

* Procedures `errorloc`, `deterministic`, and `prorate` operate on each record on `indata` independently; processing within by-groups has no effect on the procedure outputs. Nonetheless, the `by` parameter is still included in these procedures, and may be used in the future for performance enhancements.
* If `by` and `indata_hist` are both specified in a procedure, then the `indata_hist` table must also include the `by` variables.
* Input tables such as `indata` and `indata_hist` must be sorted according to the `by` variables before execution. By default, Banff procedures handle this sorting automatically; this sorting can be disabled (for example, if the input file is already sorted) by setting `presort=False`.
* When `instatus` is specified, some procedures are more efficient if the `by` variables are included on `instatus`. By default, Banff will add the `by` variables automatically; this action can be disabled (for example, if the variables are already present) by setting `prefill_by_vars=False`.
* By default, many procedures will output diagnostics to the log for each by-group; to disable this feature, specify `no_by_stats=True`.
* The `outdata` file will always contain the `by` variables.

# Technical guide

## Executing the Banff procedures

To execute a Banff procedure in a Python script, we first import the Banff package alongside any other packages we plan on using:

```python
import banff
import pandas as pd
```

We then create a new object with a name of our choosing (in this case, `my_errorloc`) by calling one of the Banff procedures, using the alias from the [list of procedures](#list-of-procedures):

```python
my_errorloc = banff.errorloc(
    indata=example_indata,
    outstatus=True,
    outreject=True,
    edits="x1>=-5; x1<=15; x2>=30; x1+x2<=50;",
    weights="x1=1.5;",
    cardinality=2,
    time_per_obs=0.1,
    unit_id="IDENT",
    by="ZONE"
)
```

* Procedures must be referenced using the Banff package name, i.e., `banff.errorloc()`
* Parameters (e.g., `edits`) and tables (e.g., `indata`) are specified as comma-separated key-value pairs, and can appear in any order

Outputs can be stored in memory or saved to disc (see [this section](#specifying-output-tables) for details). If stored in memory, they are held in the user-created object (e.g., `my_errorloc`) created when the procedure was executed:

```python
print(my_errorloc.outstatus)                # Print the outstatus table
errorloc_outstatus = my_errorloc.outstatus  # Save outstatus as a new PyArrow Table called "errorloc_outstatus"
```

## Variable names on input tables

Many Banff procedure parameters reference variables (i.e., columns) from input tables such as `indata`. Any variable referenced by a Banff procedure must consist of a single string without spaces or special characters except underscore (`_`). For example, `"first_name"` is an acceptable variable name while `"first name"` and `"first-name$"` are not. To ensure that input tables are compatible with Banff, users may need to modify the variable names. Additional constraints:

* Variable names cannot exceed 64 characters in length.
* Variable names must be unique within an individual input table. (The same variable can appear on multiple input tables without issue.)

Common Banff parameters that reference variables include `unit_id`, `var`, `weights`, and `by`. Variables are simply referenced by their name, and lists of two or more variables are separated by a single space. For example:

* Single variable: `unit_id = "business_number"`
* Variable list: `by = "province industry"`

Note that variable names are not case-sensitive in Banff; if an input table has two or more columns that differ only in case, it is unclear which column will be used during processing. (Users are strongly encouraged to give distinct case-insensitive names to all columns on any input tables.)

## Banff log

Log messages are written to the terminal during execution. These messages originate from two sources: Python code and procedure C code. Messages can be displayed in English or French.
### Setting The Log Language

The `banff` package produces log messages in either English or French. The package attempts to detect the language from its host environment during import. Use the function `banff.set_language()` to set the language at runtime, specifying a member of the `banff.SupportedLanguage` enumeration (i.e., `.en` or `.fr`).

> **Example**: setting the language to French
>
> ```python
> banff.set_language(banff.SupportedLanguage.fr)
> ```

### Python Log Messages

Python handles logging using the standard [`logging`](https://docs.python.org/3/library/logging.html#) package. All Python log messages have an associated [*log level*](https://docs.python.org/3/library/logging.html#logging-levels), such as *ERROR*, *WARNING*, *INFO*, and *DEBUG*. Messages from Python are generally one line per message and are prefixed with a timestamp and level. By default, only warning and error messages are displayed. Use the `trace` parameter to change what log levels are displayed.

> **Example Python Messages**
>
> The following are 3 unrelated messages from the *INFO*, *DEBUG*, and *ERROR* log levels. By default, only the 3rd message would be printed to the terminal.
>
> ```text
> 2024-11-29 10:47:51,853 [INFO]: Time zone for log entries: Eastern Standard Time (UTC-5.0)
> 2024-11-29 10:48:47,654 [DEBUG , banff.donorimp._execute._preprocess_inputs]: Adding BY variables to 'instatus' dataset
> 2024-11-29 10:49:59,867 [ERROR]: Procedure 'Donor Imputation' encountered an error and terminated early: missing mandatory dataset (return code 4)
> ```

#### Python Log Verbosity (`trace=`)

Use the `trace` parameter to control which log levels are printed by specifying one of the following log levels:

- `banff.log_level.ERROR`
- `banff.log_level.WARNING`
- `banff.log_level.INFO`
- `banff.log_level.DEBUG`

Messages from the specified log level and higher levels will be printed; lower levels will not. For convenience, specifying `trace=True` enables all logging output.

### Procedure C Log Messages

Messages from procedure C code have no associated log level and do not include a timestamp. Some messages are prefixed with `ERROR:`, `WARNING:` or `NOTE:`, while other messages have no prefix and may span multiple lines. Most messages are informational. Whenever an `ERROR:` message is printed, there should be a subsequent related error-level message from Python and an exception should be raised.

> **Example Procedure C Messages**
>
> ```text
> NOTE: --- Banff System 3.01.001b19.dev6 developed by Statistics Canada ---
> NOTE: PROCEDURE DONORIMPUTATION Version 3.01.001b19.dev6
> . . .
> NOTE: The minimal set of edits is equivalent to the original set of edits specified.
> . . .
> Number of valid observations ......................:       12   100.00%
> Number of donors ..................................:        5    41.67%
> Number of donors to reach DONORLIMIT ..............:        3    60.00%
> Number of recipients ..............................:        7    58.33%
>
> NOTE: The above message was for the following by group:
> prov=30
> ```

#### Procedure C Log Verbosity

See [no_by_stats](#no_by_stats).

### Suppressing and Troubleshooting Log Messages (`capture=`)

The `capture` parameter can be used to suppress all log output (using `capture=None`). This option is disabled by default (`capture=False`). Specifying `capture=True` will cause log messages to be printed all at once at the end of procedure execution, instead of immediately throughout execution. The difference may only be noticeable during long-running procedure calls.
Using this option can improve performance in some cases, such as when processing a very large number of by-groups without specifying `no_by_stats=True`.

> *Jupyter Notebooks and Missing Log Messages*
>
> Due to how Jupyter notebooks manage Python's terminal output, procedure C log messages may be missing when running a procedure in a Jupyter notebook, particularly when running in Visual Studio Code on Windows.
> To fix this issue, try using the option `capture=True`.
>
> * Alternatively, [use your own logger](#use-your-own-logger-logger).

### Use Your Own Logger (`logger=`)

Use the `logger` parameter to specify a [Logger](https://docs.python.org/3/library/logging.html#logger-objects) that you have created. All messages from Python and C will be sent to that logger. This allows customization of message prefixes, support for writing to file, etc. Note that procedure C messages are sent to the logger in a single *INFO* level Python log message.

> ***Example***: writing logs to file
>
> ```python
> import banff
> import logging
>
> my_logger = logging.getLogger(__name__)
> logging.basicConfig(filename='example.log', encoding='utf-8', level=logging.DEBUG)
>
> # run Banff procedure
> banff_call = banff.donorimp(
>     logger=my_logger,
>     indata=indata,
>     instatus=donorstat,
>     outmatching_fields=True,
>     ...
> ```

### Presort

When `by` variables are specified, Banff procedures require input tables to be sorted according to those variables before execution. Some procedures also require sorting by `unit_id`. Setting `presort = True` automatically sorts any specified input tables (e.g., `indata`, `indata_hist`, and `instatus`) according to the procedure's requirements.

By default, `presort = True` for all applicable Banff procedures. Users may disable this feature by specifying `presort = False`.

### prefill_by_vars

Setting `prefill_by_vars = True` will automatically add any specified `by` variables to the input status file(s), if necessary, before running the Banff procedure. In some cases, the presence of `by` variables on input status files may significantly improve procedure performance.

By default, `prefill_by_vars = True` for all applicable Banff procedures. Users may disable this feature by specifying `prefill_by_vars = False`.

### no_by_stats

Many of the Banff procedures output information to the Banff log. When by-groups are specified, this information is typically produced for each by-group. Specify `no_by_stats = True` to reduce log output by suppressing by-group-specific messages. This parameter is available for most procedures that allow by-groups, though some procedures have few by-group-specific messages.

## Expressions

The parameters `exclude_where_indata` and `exclude_where_indata_hist` apply boolean logic to input tables using SQL expressions. SQL expression support is implemented using DuckDB; see their [documentation on expressions](https://duckdb.org/docs/sql/expressions/overview) for a complete guide to supported syntax.

## Input and output table specification

For both input and output tables, users can specify in-memory objects or files on disc. A number of different formats are supported for both types. Objects are associated with identifiers (e.g., `"pandas dataframe"`) while files are associated with extensions (e.g., `"filename.parquet"`); please see the table below for details.
Note that some formats are recommended for testing purposes only, and not all formats are supported for output tables.

### Supported formats

| Format | Type | Supported identifier(s) or extension(s) | Notes |
| --------------------- | ------ | ----------------------------------------------- | ---------------------------------------------------------------- |
| PyArrow Table | Object | `"pyarrow"`, `"table"`, `"pyarrow table"` | Recommended format for in-memory objects. |
| Pandas DataFrame | Object | `"pandas"`, `"dataframe"`, `"pandas dataframe"` | |
| Apache Parquet | File | `.parquet`, `.parq` | Minimal RAM usage, good performance with large tables. |
| Apache Feather | File | `.feather` | Least RAM usage, good performance with large tables. |
| SAS Dataset | File | `.sas7bdat` | For testing purposes, input only; not recommended in production. |
| Comma Separated Value | File | `.csv` | For testing purposes only; not recommended in production. |

For tips related to file paths in Python, see [Escape Characters and File Paths](#escape-characters-and-file-paths).

### Specifying input tables

To input from an in-memory object, simply reference the object name directly in the procedure call. The procedure will automatically detect the type of object from amongst the supported types.

```python
donorimp_call = banff.donorimp(
    indata=df,       # where df is a Pandas DataFrame previously generated
    instatus=table,  # where table is a PyArrow Table previously generated
    ... # etc. (parameters, output tables)
)
```

To specify an input from file, include either a relative or complete file path:

```python
donorimp_call = banff.donorimp(
    indata="./input_data.parquet",             # Parquet file with local reference
    instatus=r"C:\temp\input_status.feather",  # Feather file with Windows reference
    ... # etc. (parameters, output datasets)
)
```

Users can mix both types of inputs as well:

```python
donorimp_call = banff.donorimp(
    indata="./input_data.parquet",  # Parquet file with local reference
    instatus=table,                 # where table is a PyArrow Table previously generated
    ... # etc. (parameters, output tables)
)
```

### Specifying output tables

The Banff procedures automatically create a number of output tables. Some of these are optional, and can be disabled by specifying `False`. (Specifying `False` for an optional output will prevent it from being produced at all, possibly reducing memory usage. Specifying `False` for a mandatory output will result in an error.)

The default format for output tables is the in-memory PyArrow Table. To produce the output in another in-memory format, specify its associated _identifier_ as a string. To write outputs to file, specify a file path with a supported _extension_. See the [table of supported formats](#supported-formats) for a list of identifiers and extensions. Please see the [Output Tables document](output_tables.md) for a full list of output tables by procedure.

The following example includes both mandatory and optional outputs, saved as a mix of in-memory objects and files.

```python
estimator_call = banff.estimator(
    outdata=True,                         # Saved as a PyArrow Table
    outstatus=True,                       # Saved as a PyArrow Table
    outacceptable=True,                   # Optional output saved as a PyArrow Table
    outest_ef="pyarrow",                  # Optional output saved as a PyArrow Table
    outest_lr="dataframe",                # Optional output saved as a Pandas DataFrame
    outest_parm=False,                    # Optional output disabled
    outrand_err="./output_data.parquet",  # Optional output saved as a parquet file
    ... # etc. (parameters, output tables)
)
```

*Note: because tables are enabled by default, and PyArrow Table is the default output format, the following would produce identical results:*

```python
estimator_call = banff.estimator(
    outest_lr="dataframe",                # Optional output saved as a Pandas DataFrame
    outest_parm=False,                    # Optional output disabled
    outrand_err="./output_data.parquet",  # Optional output saved as a parquet file
    ... # etc. (parameters, output tables)
)
```

**NOTE: Outputs will automatically overwrite existing objects and files with the same name.**

### Customize Default Output Specification

To determine the current default output table format:

```python
>>> banff.get_default_output_spec()
'pyarrow'
```

This corresponds to `pyarrow.Table`. The default can be set to any *identifier* from the [table of supported formats](#supported-formats).

> **Example**: Switch default output format to `pandas.DataFrame`
>
> ```python
> banff.set_default_output_spec('pandas')
> ```

### Accessing output tables

For outputs saved in memory, access them via the object attribute named after the output table:

```python
estimator_call = banff.estimator(
    outdata=True,    # Saved as a PyArrow Table
    outstatus=True,  # Saved as a PyArrow Table
    ... # etc.
)

print(estimator_call.outdata)        # Print outdata to the terminal
my_table = estimator_call.outstatus  # Save outstatus as a new object called my_table
```

*Note: because `outdata` and `outstatus` are mandatory outputs, they would still be accessible as `estimator_call.outdata` and `estimator_call.outstatus` even if not explicitly specified with `True`.*

## Other

### Escape Characters and File Paths

On Windows, the backslash character (`\`) is typically used to separate folders and files in a file path.

- Example: `"C:\users\stc_user\documents\dataset.csv"`

In Python, however, the character `\` is an *escape character* and is treated specially. Providing a file path like the example above may cause a runtime error. To disable this special treatment, use a *raw string* by adding the `r` prefix:

- `r"C:\users\stc_user\documents\dataset.csv"`

Alternatively,

- double backslash: `C:\\users\\stc_user\\documents\\dataset.csv`
- forward slash: `C:/users/stc_user/documents/dataset.csv`

### Errors and Exceptions

Python generally handles runtime errors by "raising an exception", and this convention is adopted by the `banff` package: whenever an error occurs, an exception is raised. This could occur while the package is loading or preprocessing input data, running a procedure, or writing output data. Exceptions generally contain a helpful error message, and are often "chained" to provide additional context.

> ***Example**: Exception while writing output dataset*
>
> The following is console output generated when the package fails to write an output dataset because the destination folder does not exist.
>
> ```text
> [ERROR , banff.donorimp._execute._write_outputs]: Error occurred while writing 'outmatching_fields' output dataset
> [ERROR , banff.donorimp._execute._write_outputs]: Directory of output file does not exist: 'C:\temp\definitely\a\fake'
> [ERROR , banff.donorimp._execute._write_outputs]: [WinError 3] The system cannot find the path specified: 'C:\\temp\\definitely\\a\\fake'
> Traceback (most recent call last):
>   File "C:\git\banff_redesign\Python\src\banff\_common\src\io_util\io_util.py", line 578, in write_output_dataset
>     dst.parent.resolve(strict=True)  # strict: exception if not exists
>   File "C:\Program Files\Python310\lib\pathlib.py", line 1077, in resolve
>     s = self._accessor.realpath(self, strict=strict)
>   File "C:\Program Files\Python310\lib\ntpath.py", line 689, in realpath
>     path = _getfinalpathname(path)
> FileNotFoundError: [WinError 3] The system cannot find the path specified: 'C:\\temp\\definitely\\a\\fake'
>
> The above exception was the direct cause of the following exception:
>
> Traceback (most recent call last):
>   File "C:\git\banff_redesign\Python\src\banff\_common\src\proc\stc_proc.py", line 649, in _write_outputs
>     ds.user_output = io.write_output_dataset(ds.ds_intermediate, ds.user_spec, log_lcl)
>   File "C:\git\banff_redesign\Python\src\banff\_common\src\io_util\io_util.py", line 581, in write_output_dataset
>     raise FileNotFoundError(mesg) from e
> FileNotFoundError: Directory of output file does not exist: 'C:\temp\definitely\a\fake'
>
> The above exception was the direct cause of the following exception:
>
> Traceback (most recent call last):
>   File "C:\git\banff_redesign\Python\sample_programs\conversion_examples\DonorImp01.py", line 61, in <module>
>     banff_call = banff.donorimp(
>   File "C:\git\banff_redesign\Python\src\banff\proc\proc_donorimp.py", line 119, in __init__
>     super().__init__(
>   File "C:\git\banff_redesign\Python\src\banff\proc\banff_proc.py", line 66, in __init__
>     self._execute()
>   File "C:\git\banff_redesign\Python\src\banff\_common\src\proc\stc_proc.py", line 367, in _execute
>     self._write_outputs(log=log_lcl)
>   File "C:\git\banff_redesign\Python\src\banff\_common\src\proc\stc_proc.py", line 654, in _write_outputs
>     raise ProcedureOutputError(mesg) from e
> banff._common.src.exceptions.ProcedureOutputError: Error occurred while writing 'outmatching_fields' output dataset
> ```
>
> The first 3 lines are log messages generated by the `banff` package. The remaining lines are a standard exception traceback generated by Python itself. From top to bottom it shows a chain of 3 exceptions.
> The first is a low-level error indicating that a file path cannot be found, "*`[WinError 3] The system cannot find the path specified: 'C:\\temp\\definitely\\a\\fake'`*".
> The second more specifically indicates that the "*`Directory of output file does not exist: 'C:\temp\definitely\a\fake'`*".
> The third provides context about what was happening when this error occurred, "*`Error occurred while writing 'outmatching_fields' output dataset`*".

### Working with SAS Files in Python

The `banff` package provides a few useful functions for reading SAS files into memory or converting them to another format.
To use these functions, your program must `import banff`.

| Function | Description |
| -------- | ----------- |
| `banff.io_util.SAS_file_to_feather_file(file_path, destination)` | Reads the *SAS dataset* at `file_path` and writes it to a *feather* file at `destination` |
| `banff.io_util.SAS_file_to_parquet_file(file_path, destination)` | Reads the *SAS dataset* at `file_path` and writes it to a *parquet* file at `destination` |
| `banff.io_util.DF_from_sas_file(file_path)` | Reads the *SAS dataset* at `file_path` and returns it as a *`pandas.DataFrame`* |
| `banff.io_util.PAT_from_sas_file(file_path)` | Reads the *SAS dataset* at `file_path` and returns it as a *`pyarrow.Table`* |

### Performance Considerations

The formats used for input and output datasets affect performance. When there may not be sufficient RAM available (due to a small amount of RAM or large datasets), datasets should be stored on disk; the file format selected will also have an effect on performance. The Apache Parquet (`.parquet`) and Apache Feather (`.feather`) file formats currently deliver the best performance when using files for input or output datasets.

Feather should use the least amount of RAM, making it ideal for large datasets or execution environments with little RAM; it is the recommended format for temporary files. Parquet generally produces the smallest files, while still providing impressive read and write performance in multi-CPU environments and reasonably minimal RAM usage; it is recommended for medium- to long-term storage of data.

Using the SAS dataset format for large input datasets may result in degraded performance, particularly in environments with little RAM. This format is only recommended for use with small datasets (under a few hundred MB). Using the SAS format is discouraged in general, with the Apache Arrow formats (*parquet* and *feather*) recommended instead.
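For example, the sketch below shows a one-time conversion of a SAS dataset to Parquet using the helper function listed in [Working with SAS Files in Python](#working-with-sas-files-in-python), after which the Parquet file can be used directly as a procedure input; the file paths and elided parameters are placeholders:

```python
import banff

# One-time conversion: read the SAS dataset and write it to a Parquet file.
banff.io_util.SAS_file_to_parquet_file("./current_data.sas7bdat", "./current_data.parquet")

# Subsequent procedure calls can then read the Parquet file directly.
donorimp_call = banff.donorimp(
    indata="./current_data.parquet",
    ... # etc. (parameters, output tables)
)
```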