Banff Processor User Guide#

Introduction#

The Banff Processor is a Python package used to execute a Statistical Data Editing (SDE) process, also commonly referred to as Edit and Imputation (E&I). A specific SDE process typically consists of numerous individual process steps, each executing an SDE function such as error localization, outlier detection, or imputation. The process flow describes which process steps to perform and the sequence in which they are executed. Given a set of input metadata tables that describe the SDE process flow, including each individual step, the Banff Processor executes each process step in sequence, handling all intermediate data management. The advantage of the Banff Processor’s metadata-driven system is that the design and modification of the SDE process is managed from metadata tables instead of source code.

Other notes about the Banff Processor:

  • Simplicity: Once the metadata tables are created, the processor can be executed from a single line of code

  • Efficiency: Designed for production-level SDE processes

  • Modularity: Within a process step, users may call the built-in Banff procedures or user-defined procedures

  • Flexibility: Process Controls and Process Blocks allow users to specify complex process flows

  • Informative: The Processor produces diagnostics for each step and for the overall process

  • Transparency: Source code is available and freely shared

The user guide often uses terminology from the Generic Statistical Data Editing Model (GSDEM). Users are encouraged to reference the GSDEM for common terminology regarding SDE concepts.

Input Metadata Files#

The Banff Processor is driven by metadata tables describing both the overall process flow and the parameters required for individual process steps. The primary metadata table is called JOBS, which specifies the overall process flow, specifically the built-in Banff procedures and/or user-defined programs (plugins) to execute and their relative sequencing. The JOBS table can contain more than one job (i.e., process flow), specified by the jobid (job identifier), though only one job can be executed at a time by the Banff Processor. Each job will include one row per process step; the key columns are:

  • jobid: Job identifier.

  • seqno: Sequence number of the individual process steps.

  • process: Name of either the built-in procedure or user-defined program to execute at each process step.

  • specid: Specification identifier, linking to other metadata tables containing parameters specific to the declared process.

Additional columns on the JOBS table include an optional process control identifier (controlid) as well as parameters that are common to multiple procedures (editgroupid, byid, acceptnegative).

Overall, the Banff Processor uses 18 metadata tables, which can be classified as follows:

  • Tables describing the overall process flow: JOBS, PROCESSCONTROLS

  • Process step parameters for built-in Banff procedures: VERIFYEDITSPECS, OUTLIERSPECS, ERRORLOCSPECS, DONORSPECS, ESTIMATORSPECS, ESTIMATORS, ALGORITHMS, PRORATESPECS, MASSIMPUTATIONSPECS

  • Process step parameters for User Defined Procedures: USERVARS

  • Tables used to define edits: EDITS, EDITGROUPS

  • Parameters used by multiple procedures: VARLISTS, WEIGHTS, EXPRESSIONS

  • Data management: PROCESSOUTPUTS

Only the JOBS table is mandatory, though other tables are required depending on which procedures or options are used. A full description of all metadata tables is included in this document: Metadata Tables

Formatting#

The metadata tables must be saved in .xml format. Users may create the .xml files on their own, or use the Banff Processor Template to create and save the metadata, and the Metadata Generation Tool to convert the template file into .xml tables. By default, the Processor will look for the XML files in the same location as your .json input file, either directly in the same folder or in a \metadata subfolder. Alternatively, you may provide a specific location by setting the metadata_folder parameter in your input JSON file.
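
As an illustration, a project that relies on these default locations might be organized as follows (the folder and file names are hypothetical):

my_survey\
    processor_input.json        <- input parameter (.json) file
    indata.parq                 <- input data file
    metadata\                   <- XML metadata tables (jobs.xml, etc.)
    plugins\                    <- user-defined procedure plugins, if any
    out\                        <- created by the Processor to hold outputs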

Example (JOBS table)#

The following table defines a single job, “Main”, which includes four process steps.

| jobid | seqno | controlid | process       | specid       | editgroupid | byid    | acceptnegative |
|-------|-------|-----------|---------------|--------------|-------------|---------|----------------|
| Main  | 1     |           | ERRORLOC      | errloc_specs | edits_1     |         | Y              |
| Main  | 2     |           | DETERMINISTIC |              | edits_1     |         | Y              |
| Main  | 10    |           | DONORIMP      | donor_specs  | edits_1     | by_list | Y              |
| Main  | 99    |           | PRORATE       | pro_specs    | edits_2     |         | Y              |

  • Only the ordering of the sequence numbers (seqno) is important; they do not need to be sequential integers.

  • This job includes four process steps, run sequentially, consisting of four built-in Banff procedures: errorloc, deterministic, donorimp, and prorate.

  • The parameters editgroupid, byid and acceptnegative, which are common to many of the built-in Banff procedures, are included in the JOBS table.

  • Most procedures include mandatory and/or optional parameters that define exactly how the procedure should be executed. These are contained in additional metadata tables, and linked to specific process steps via the specid column. Procedures that do not have any additional parameters (beyond those included in the JOBS table) do not require a specid.

  • The controlid column is optional, and can be used to specify Process Controls.

Additional examples can be found in the project’s ‘examples’ folder.

Metadata Generation Tool#

Metadata stored in the Banff Processor Template must be converted to XML files before running the processor. A conversion tool is provided for this purpose. With the banffprocessor package installed in your python environment, the conversion tool can be run with the following command:

banffconvert "\path\to\your\excel_metadata.xlsx" -o "\my\output\directory" -l fr

or:

banffconvert "\path\to\your\excel_metadata.xlsx" --outdir="\my\output\directory" --lang en

Alternatively, to run as a module:

python -m banffprocessor.util.metadata_excel_to_xml "\path\to\your\excel_metadata.xlsx" -o "\my\output\directory"

  • NOTE: The '-o'/'--outdir' parameter is optional. If it is not provided, the conversion tool will save the XML files to the same directory as the input file.

  • NOTE: The '-l'/'--lang' parameter is optional. Valid values are en and fr; if not specified, the default is en.

Finally, the tool can be imported and run directly in a python script:

import banffprocessor.util.metadata_excel_to_xml as e2x

e2x.convert_excel_to_xml("\\path\\to\\your\\excel_metadata.xlsx", "\\my\\output\\directory")

Banff Processor Input Parameters#

Input parameters are specified in a .json file or a ProcessorInput object, either of which is passed to the processor and used to specify your job’s parameters. These are the parameters that are currently available:

| Name | Purpose | Required? |
|------|---------|-----------|
| job_id | The job id from your jobs.xml that you wish to run. | Y |
| unit_id | The unit id variable is the unique identifier on the micro data files for the job. | Y |
| indata_filename | The filename or full filepath of your input/current data file. | Y (unless running only VerifyEdits) |
| indata_aux_filename | The filename or full filepath of your auxiliary data file. | N |
| indata_hist_filename | The filename or full filepath of your historic data file. | N |
| instatus_hist_filename | The filename or full filepath of your historic status data file. | N |
| instatus_filename | The filename or full filepath of a status file to use as input to the first proc in your job that requires a status file. [1] | N |
| user_plugins_folder | The optional location of the folder containing your custom python procedure plugins. See below for a description of how to create your own plugins. | N |
| metadata_folder | The optional path to a folder where your XML metadata can be found. | N |
| output_folder | The optional path to a folder where your output files will be saved. | N |
| process_output_type | Controls the output datasets retained by each process. Options are all, minimal and custom. When all is specified, all outputs are retained. When minimal is specified, only the imputed_file, status_file and status_log are retained. When custom is specified, the processor uses the ProcessOutputs metadata to determine what to keep. | N |
| seed | The seed value to use for consistent results when using the same input data and parameters. | N |
| no_by_stats | Determines whether no_by_stats is set to True when calling the standard procedures. | N |
| randnumvar | A random number variable to be used when a choice must be made during error localization or donor imputation. This parameter is optional and is only used by ErrorLoc and DonorImputation simultaneously (it cannot be used in one and not the other). It can be helpful when the same error localization or imputation results are needed from one run to the next. See the Banff User Guide for more details on the use of the randnumvar option in the ErrorLoc and DonorImputation procedures. | N |
| save_format | Optional list of file extensions used to determine the format(s) in which output files are saved. One or more may be provided. Currently supports CSV and Parquet extensions. | N |
| log_level | Configures whether or not to create a log file and what level of messages it should contain. Value should be 0, 1 (default) or 2. See Output. | N |

Example JSON input file:

{
    "job_id": "j1",
    "unit_id": "ident",
    "indata_filename": "indata.parq",
    "indata_aux_filename": "C:\\full\\filepath\\to\\auxdata.parq",
    "indata_hist_filename": "histdata.parq",
    "instatus_hist_filename": "histstatus.parq",
    "instatus_filename": "instatus.parq",
    "user_plugins_folder": "C:\\path\\to\\my\\plugins",
    "metadata_folder": "C:\\path\\to\\xml\\metadata",
    "output_folder": "my_output_subfolder",
    "process_output_type": "All",
    "seed": 1234,
    "no_by_stats": "N",
    "randnumvar": "",
    "save_format": [".parq", ".csv"],
    "log_level": 2
}

Example building a ProcessorInput object inline:

from banffprocessor.processor import Processor
from banffprocessor.processor_input import ProcessorInput
from pathlib import Path

# Supplying parameters directly, instead of an input file
input_params = ProcessorInput(job_id="j1", 
                              unit_id="ident",
                              # Gets the path to the folder containing this file
                              input_folder=Path(__file__).parent,
                              indata_filename="indata.parq",
                              indata_hist_filename="C:\\full\\filepath\\to\\histdata.parq",
                              seed=1234, 
                              save_format=[".parq", ".csv"],
                              log_level=2)

# Normal method with a JSON file
#my_bp = Processor("C:\\path\\to\\my\\processor_input.json")
# Method when providing parameters inline
my_bp = Processor(input_params)

Notes:

  • All folder locations may be given as absolute filepaths or relative to the input folder (either the location of the input .json file or the supplied input_folder parameter if creating inputs inline as demonstrated above).

    • This input_folder location is also used as the default location for other files required by the processor should no value be provided for them, such as metadata, input data files and user-defined procedures

  • File paths must have any backslashes \ escaped by replacing them with a double-backslash \\. For example, C:\this\is\a\filepath would become C:\\this\\is\\a\\filepath

  • Fields that are not required to run the procedures outlined in your jobs file can be omitted or left empty

  • The CSV format has been included for testing purposes and is not intended for production; Parquet is currently the recommended format for production for reasons related to accuracy, performance and efficiency

Executing the processor as a command line utility#

With the banffprocessor package installed in your python environment, the processor can be run with the following command:

banffprocessor "\path\to\your\processor_input.json" -l fr

Alternatively, to run as a module:

python -m banffprocessor.processor "\path\to\your\processor_input.json" --lang fr

  • NOTE: The '-l'/'--lang' parameter is optional. Valid values are en and fr; if not specified, the default is en.

Executing the processor from within a python script#

import banffprocessor

# Optional: Set the language to fr so that log and console messages are written in French.
banffprocessor.set_language(banffprocessor.SupportedLanguage.fr)

bp = banffprocessor.Processor.from_file("path\\to\\my\\input_file.json")
bp.execute()
bp.save_outputs()

Alternatively, you may load your input data files programmatically as Pandas DataFrames:

from banffprocessor.processor import Processor
import pandas as pd

indata = pd.DataFrame()
indata_aux = pd.DataFrame()
instatus = pd.DataFrame()
...
# Load your dataframes with data
...
bp = Processor.from_file(input_filepath="path\\to\\my\\input_file.json", indata=indata, instatus=instatus)
bp.execute()
bp.save_outputs()

User-Defined Procedures#

In addition to the standard Banff Procedures automatically integrated into the processor, you may also include your own .py files implementing custom procedures. By default, the Processor will look for python files placed in a \plugins subfolder in the same location as your input JSON file. Alternatively, you may provide a specific location to load plugins from in the user_plugins_folder parameter of your input JSON file. You may provide as many plugin files as needed for your job, and each plugin file may contain as many procedure classes as you wish, so long as each class is registered in a register() function.

Your plugin must define a class that implements the ProcedureInterface protocol which is found in the package’s source files at \src\banffprocessor\procedures\procedure_interface.py. Your implementing class must have the exact same attribute names and function signatures as the interface does. Here is an example of a plugin that implements the protocol:

import pandas as pd


class MyProcClass:
    @classmethod
    def execute(cls, processor_data) -> int:
        # These give you indata as a pyarrow Table
        #indata = processor_data.indata
        #indata = processor_data.get_dataset("indata", ds_format="pyarrow")
        # This gives you indata as a Pandas DataFrame
        indata = processor_data.get_dataset("indata", ds_format="pandas")
        
        # Get the uservar var1 from the metadata collection, by default a string
        my_var1 = processor_data.current_uservars["var1"]   
        # If our uservar is supposed to be numeric we would need to cast it
        #my_var1 = int(processor_data.current_uservars["var1"])
        
        # Create an outdata DataFrame containing the indata record(s) with ident value R01
        outdata = pd.DataFrame(indata.loc[indata['ident'] == 'R01'])

        # We are expecting to have at least one record found
        # If we don't, return 1 to indicate there was an error and the job should terminate
        if outdata.empty:
            return 1

        # Set the v1 field in the retrieved record(s) to contain uservar value
        outdata['v1'] = my_var1

        # In order for the processor to update our imputed_file we need to set the outdata file  
        processor_data.outdata = outdata
        
        # If we made it here return 0 to indicate there were no errors
        return 0

# Registers all plugin classes to the factory
# "myproc" is the same name you will provide in your Jobs file entries as 
# the process name, using any capitalization you want (i.e. mYpRoC, MyProc, myProc etc.)
def register(factory) -> None:
    factory.register("myproc", MyProcClass)
    # You may provide multiple names for your proc, if you like
    #factory.register(["myproc", "also_myproc"], MyProcClass)

  • When your execute() is complete, the processor automatically updates the status_file and/or imputed_file with the contents of the corresponding outstatus/outdata datasets, if you have set one.

    • This operation is unable to add or remove data from your imputed_file; it can only update existing records. If you need to add or remove data from imputed_file, you must do so in your plugin and set processor_data.indata to point to your updated data. This will log a warning in your log file, but it can be ignored if this was intended.

  • Additionally, the processor automatically appends any other datasets you output from a custom plugin to a single “cumulative” version if the process_output_type is set to “All” or “Custom” and the dataset name is specified in a ProcessOutput metadata entry for that custom plugin

  • The execute() method is marked as a classmethod, which means its first argument cls is a reference to MyProcClass. It also has a second argument processor_data, which is an object of type ProcessorData (the definition of which can be found in src\banffprocessor\processor_data.py). This object includes input files, output files (from previously run procedures and the current procedure), metadata and parameters from your input JSON file.

    • Your execute() method should also return an int representing the return code of the plugin. Any non-0 number indicates that the plugin did not complete successfully and that the processor should stop processing subsequent steps in the imputation strategy; alternatively, an exception can be raised.

  • Finally, your plugin’s module must implement a register() function, outside of any class definitions in your plugin file. This function has one parameter, factory. The function must call the register() function of the factory object, providing the name of your procedure as it will appear in your metadata and the name of the class that implements it. Though one plugin per file is recommended, if you have multiple classes in the same file that implement the Banff Procedure Interface, you can register all of them using the same register function. Just include a factory.register(...) call for each plugin procedure you would like to register.

    • NOTE: The name registered to the factory is the same name that you will provide in your Jobs entries as the name of the process. Process names are not case sensitive.
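
As a hypothetical illustration, a Jobs entry that calls the plugin registered above might look like the following row, where the specid value (assumed here) would link to the USERVARS entries, such as var1, used by the plugin:

| jobid | seqno | controlid | process | specid      | editgroupid | byid | acceptnegative |
|-------|-------|-----------|---------|-------------|-------------|------|----------------|
| Main  | 20    |           | myproc  | myproc_vars |             |      |                |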

For an example of a job that includes a user-defined procedure see banffprocessor\banff-processor\tests\integration_tests\udp_test with the plugin located in \plugins\my_plugin.py.

Process Controls#

Note: Process Controls are a new feature introduced with version 2.0.0 of the Python processor.

Process controls are processes that run before an imputation step; an example is a filter applied to the input data or status file. They enable the processor to be more generic, reduce the number of steps in an imputation strategy, and improve the information provided to subsequent processes (SEVANI).

This feature requires the use of a new field in the Jobs metadata file, controlid. This field references an entry (or entries) in a new metadata file, processcontrols.xml (produced from the PROCESSCONTROLS worksheet in the excel template):

| controlid | targetfile | parameter | value |
|-----------|------------|-----------|-------|
| Identifies the control or set of controls to apply to the Jobs steps with the same controlid | The dataset name to apply the control to (names should be written in the same case as they appear in the table) | The desired control type | Determined by the control type |
| control1 | indata | row_filter | strat > 15 and (rec_id not in (SELECT * FROM instatus WHERE status != 'FTI')) |
| control1 | instatus | column_filter | IDENT, AREA, V1 |
| control1 | indata | exclude_rejected | True |
| control1 | N/A | edit_group_filter | N/A |

All process controls with the specified controlid are applied to their respective targetfiles for the single job step on which they are declared. Upon completion of the job step the affected targetfiles are returned to their original state and the job continues. However, if the job step begins the execution of a new process block, the targetfile will remain in the state created by the process control(s) applied for the duration of the process block (and any sub-blocks within).

One controlid may be used as many times as needed per targetfile. If a controlid is repeated for the same targetfile AND parameter, then the values for those controls are combined into one. This is intended to allow more modularity in control sets, as individual parts of multi-part conditions can be interchanged as desired without affecting the other parts.
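
For instance, the following hypothetical entries repeat the same controlid, targetfile and parameter, so their conditions would be joined by AND (as described under ROW_FILTER below), effectively filtering indata with strat > 15 AND area = 'A':

| controlid | targetfile | parameter  | value       |
|-----------|------------|------------|-------------|
| control2  | indata     | row_filter | strat > 15  |
| control2  | indata     | row_filter | area = 'A'  |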

Control Types#

  • ROW_FILTER

    • Filters targetfile using an SQL WHERE clause

    • value - The SQL condition which can include column names and/or table names (exactly as shown in available table names)

    • If controlid, targetfile and parameter are repeated for more than one entry, the conditions in their value fields are joined by AND

  • COLUMN_FILTER (can apply multiple for one ID, column name lists are combined into one)

    • Filters targetfile to remove columns that don’t appear in the list in the value field

    • value - A comma-separated list of column names to KEEP in targetfile

    • If controlid, targetfile and parameter are repeated for more than one entry, the column lists in their value fields are combined

  • EXCLUDE_REJECTED

    • Filters targetfile by removing any entries with a unit_id that appears in the outreject table

    • value - The text ‘True’ or ‘False’, indicating if the control should be applied or not

    • For one controlid, only one EXCLUDE_REJECTED control may be used per-targetfile

    • NOTE: Errorloc and Prorate each produce slightly different outreject files.

      • Errorloc: outreject, produced by the current errorloc call, overwrites any existing outreject file. The contents of outreject are also appended to outreject_all.

      • Prorate: The contents of the outreject dataset produced by the current prorate call are appended to outreject_all and also appended to the existing outreject dataset (or just set as the outreject table if one does not yet exist).

  • EDIT_GROUP_FILTER

    • Filters instatus by removing any entries with an editgroupid matching the current job step OR any entries that were produced by an Outlier step with a status value of FTI or FTE

    • value and targetfile fields should not be given for this control type

    • Replaces existing SAS functionality where this filter was automatically applied prior to executing a DonorImputation or Deterministic proc

  • NOTE: Column names from your original input files should be referenced in their original case, those that are created or added by the Processor should be in ALL-CAPS.

Available Table Names#

| Table Name | Notes |
|------------|-------|
| status_log | Contains all produced outstatus files appended in order |
| indata | The input data to the current job step. Alias: imputed_file |
| indata_aux | Auxiliary input data. |
| indata_hist | Historic input data. |
| instatus | The input status data to the current job step. Alias: status_file |
| instatus_hist | Historic status data. |
| time_store | Information regarding the runtime and execution of each step in a job |
| outreject and outreject_all | Produced by Errorloc and Prorate |
| outedit_applic | Can be produced by Editstats |
| outedit_status | Can be produced by Editstats |
| outedits_reduced | Can be produced by Editstats |
| outglobal_status | Can be produced by Editstats |
| outk_edits_status | Can be produced by Editstats |
| outvars_role | Can be produced by Editstats |
| outacceptable | Can be produced by Estimator |
| outest_ef | Can be produced by Estimator |
| outest_lr | Can be produced by Estimator |
| outest_parm | Can be produced by Estimator |
| outrand_err | Can be produced by Estimator |
| outmatching_fields | Produced by DonorImputation |
| outdonormap | Produced by DonorImputation and MassImputation |
| outlier_status | Can be produced by Outlier |

  • Optional datasets are only available during execution and saved to disk if the input parameter process_output_type is set to “All” (2) or the dataset’s name is specified in a ProcessOutput metadata table entry for the process producing it and process_output_type is set to “Custom” (3)

  • Either the table name or its alias may be used, both refer to the same table and data

  • The indata and instatus files are always available

    • If the job does not provide an instatus file to start with, instatus is not available to be used in a filter for the first step, though it is available in subsequent steps

  • Any procedure-specific file is not available to reference until a job step for that procedure has run in a preceding step, and only if the file referenced is produced according to the process_output_type

  • All files will have the columns SEQNO and JOBID added. These can be filtered to obtain data from a specific job step.

  • The value field supports references to tables located on-disk as well as the in-memory table names found above

    • This does not apply to targetfile, which must be an in-memory table referenced by name

    • The filepath must either be absolute or relative to the input folder for the current job

    • The filepath should be single-quoted, e.g.:
      ident in (SELECT ident FROM '.\subfolder\of\input_folder\idents.csv')

    • See DuckDB docs for information on supported filetypes (i.e. only ‘.parquet’ is supported NOT ‘.parq’)

Process Blocks#

Note: Process Blocks are a new feature introduced in version 2.0.0 of the Python processor.

A job in the Banff Processor is a collection of Jobs metadata table entries, all joined by a common job identifier (jobid) and processed sequentially according to the sequence number (seqno). Only a single job is specified when executing the processor, which is done by specifying the job_id input parameter. A Process Block is essentially a job called from within a job. Process Blocks organize jobs into sub-jobs with the following goals:

  • to allow a process control to be associated with multiple job steps.

  • to allow the reuse of a sequence of steps that are repeated with different inputs.

  • to allow users to design and implement imputation strategies using a modular approach. This means that smaller jobs can be developed and tested in isolation rather than as one large job.

Process blocks are used by setting the process field of a Jobs metadata table entry to job (rather than a traditional Banff procedure such as prorate or donorimputation) and setting the specid field to be the jobid of the process block that is to be run.

Process blocks can call other process blocks, providing further flexibility. When preparing to execute the Banff Processor, the Jobs metadata is validated using the job_id input parameter as the root of the overall job structure. This validation ensures that no cycles (infinite loops) exist when the job has nested process blocks. If one is found, an error will be printed to the console and/or log file, and the issue will need to be corrected in order to successfully execute the job.

| jobid    | seqno | controlid | process  | specid         | editgroupid | byid | acceptnegative |
|----------|-------|-----------|----------|----------------|-------------|------|----------------|
| main_job | 1     | n/a       | job      | sub_job        | n/a         | n/a  | n/a            |
| main_job | 2     | n/a       | outlier  | outlier_spec1  | n/a         | n/a  | n/a            |
| main_job | 3     | n/a       | job      | sub_job        | n/a         | n/a  | n/a            |
| sub_job  | 1     | n/a       | prorate  | prorate_spec1  | n/a         | n/a  | n/a            |
| sub_job  | 2     | n/a       | donorimp | donorimp_spec1 | n/a         | n/a  | n/a            |

For example, the Jobs table above would result in the execution of:

  1. prorate (sub_job, 1)

  2. donorimp (sub_job, 2)

  3. outlier (main_job, 2)

  4. prorate (sub_job, 1)

  5. donorimp (sub_job, 2)

A working example of a job with a Process Block can be found in the project directory under ‘examples/example4’.

Output#

Saving Outputs#

If running from within a python script, output datasets can be saved to disk by calling the save_outputs() function of the banffprocessor object containing the results from a call to execute(). If running from the command line, save_outputs() will be called automatically once the processor has finished a job.

During operation, if no output_folder parameter is provided, the processor will create an out folder in the same location as your input JSON parameter file to save the Banff log as well as the output status and data files that are created during and after the execution of each Banff proc. The status and data files will be saved in the format determined by:

  1. The save_format parameter in your input JSON file

  2. If no value is provided for 1., the same format as the file given in indata_filename is used

  3. If neither 1. nor 2. is available, defaults to the .parq format

Output files/datasets#

The output files from each procedure can be retained and saved. The Processor will automatically add the columns JOBID and SEQNO to the outputs. When an output with the same name is generated and retained, the processor will append these output datasets together and the datasets will need to be filtered by JOBID and SEQNO to limit the data to a specified processing step.
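
For example, assuming the cumulative outdonormap output has been retained and saved in Parquet format, the records from a single processing step could be isolated with a short sketch like the one below (the file path and the JOBID/SEQNO values are illustrative):

import pandas as pd

# Read the cumulative output appended across the whole job (hypothetical path)
outdonormap = pd.read_parquet("out\\outdonormap.parq")

# Keep only the rows produced by seqno 10 of the job "Main"
# (depending on how the data was saved, SEQNO may need to be compared as a string)
step_rows = outdonormap[(outdonormap["JOBID"] == "Main") & (outdonormap["SEQNO"] == 10)]
print(step_rows)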

Minimal Outputs#

| Data File | Description |
|-----------|-------------|
| imputed_file | This data file contains the final imputed current data. |
| status_file | This data file contains the final imputed data statuses. |
| status_log | This data file contains the history of how the statuses changed during the imputation strategy. |
| outreject | This data file is generated by the ErrorLoc and Prorate procedures. It contains the identification of respondents that could not be processed and the reason why. |
| time_store | This data file stores the start time, end time and duration of each processing step along with the cumulative execution time. |

Optional Outputs#

| Data File | Related Procedure | Description |
|-----------|-------------------|-------------|
| outlier_status | Outlier | It contains the final status file including the additional variables from the outlier_stats option (which is always in effect in the Banff Processor). |
| outmatching_fields | Donor Imputation | It contains the status of the matching fields from the outmatching_fields option (which is always in effect in the Banff Processor). |
| outdonormap | DonorImputation | It contains the identifiers of recipients that have been imputed along with their donor identifier and the number of donors tried before the recipient passed the post-imputation edits. |
| outedits_reduced | EditStats | This data file contains the minimal set of edits. |
| outedit_status | EditStats | This data file contains the counts of records that passed, missed and failed for each edit. |
| outk_edits_status | EditStats | This data file contains the distribution of records that passed, missed and failed K edits. |
| outglobal_status | EditStats | This data file contains the overall counts of records that passed, missed and failed. |
| outedit_applic | EditStats | This data file contains the counts of edit applications of status pass, miss or fail that involve each field. |
| outvars_role | EditStats | This data file contains the counts of records of status pass, miss or fail for which field j contributed to the overall record status. |
| outrand_err | Estimator | This dataset contains the random error report if at least one of the estimator specifications has the RANDOMERROR variable in the ESTIMATOR metadata table set to Y. |
| outest_ef | Estimator | This dataset contains the report on the calculation of averages for estimator functions if at least one of the estimator specifications uses an estimator function (type EF). |
| outest_parm | Estimator | This dataset contains the report on imputation statistics by estimator. |
| outest_lr | Estimator | This dataset contains the report on the calculation of "beta" coefficients for linear regression estimators if at least one of the estimator specifications uses a linear regression (type LR). |
| outacceptable | Estimator | This data file contains the report on acceptable observations retained to calculate the parameters for each estimator given in the specifications. This file can be large and can slow down execution. |

Notes:

  • Refer to the Banff Procedure User Guide for a full description of each file generated by a Banff procedure.

  • Optional output files will be retained if process_output_type = all or if process_output_type = custom and the dataset name is specified in the ProcessOutputs metadata for the given process.

  • Plugins may output additional optional output files.

The Log#

The Python processor can generate an execution log that provides valuable information about the imputation process, useful for debugging and analytical purposes. The level of information logged can be configured via the log_level parameter of your input JSON file.

  • If 0, no log file is produced at all; only warnings, errors and a summary of each procedure are written to the console after it is performed. This summary is always printed, even at levels 1 and 2.

  • If 1 (the default value if log_level is not set), the log file contains INFO-level messages, which is primarily the output from the execution of each proc from the Banff package, as well as warnings and errors.

  • Finally, if 2, the log file contains all messages from 1 as well as any DEBUG-level messages, such as more granular information about produced and processed datasets.

The processor keeps a maximum of 6 log files at once. The most recent job is always logged to banffprocessor.log and when a new job is run, a number is appended to the old log file and a new log is created for the new job. The numbering goes from newest to oldest (i.e. banffprocessor.log is the log for the most recent job, banffprocessor.log.1 is from the next most recent and banffprocessor.log.5 is from the oldest job).

Process Block Output#

When a new process block is to be run, a special folder is created in the output folder for the calling block. This new output folder is named after the new block’s parameters and upon completion of the block will contain all of the files created by the child block. No new log file is created, however. All log outputs for child blocks can be found in the main log file found in the root input folder.