Banff Processor User Guide#
Introduction#
The Banff Processor is a Python package used to execute a Statistical Data Editing (SDE) process, also commonly referred to as Edit and Imputation (E&I). A specific SDE process typically consists of numerous individual process steps, each executing an SDE function such as error localization, outlier detection, or imputation. The process flow describes which process steps to perform and the sequence in which they are executed. Given a set of input metadata tables that describe the SDE process flow, including each individual step, the Banff Processor executes each process step in sequence, handling all intermediate data management. The advantage of the Banff Processor's metadata-driven system is that the design and modification of the SDE process is managed from metadata tables instead of source code.
Other notes about the Banff Processor:
- Simplicity: Once the metadata tables are created, the processor can be executed from a single line of code.
- Efficiency: Designed for production-level SDE processes.
- Modularity: Within a process step, users may call the built-in Banff procedures or user-defined procedures.
- Flexibility: Process Controls and Process Blocks allow users to specify complex process flows.
- Informative: The processor produces diagnostics for each step and for the overall process.
- Transparency: Source code is available and freely shared.
The user guide often uses terminology from the Generic Statistical Data Editing Model (GSDEM). Users are encouraged to reference the GSDEM for common terminology regarding SDE concepts.
Input Metadata Files#
The Banff Processor is driven by metadata tables describing both the overall process flow and the parameters required for individual process steps. The primary metadata table is called JOBS, which specifies the overall process flow, specifically the built-in Banff procedures and/or user-defined programs (plugins) to execute and their relative sequencing. The JOBS table can contain more than one job (i.e., process flow), specified by the `jobid` (job identifier), though only one job can be executed at a time by the Banff Processor. Each job will include one row per process step; the key columns are:

- `jobid`: Job identifier.
- `seqno`: Sequence number of the individual process steps.
- `process`: Name of either the built-in procedure or user-defined program to execute at each process step.
- `specid`: Specification identifier, linking to other metadata tables containing parameters specific to the declared `process`.

Additional columns on the JOBS table include an optional process control identifier (`controlid`) as well as parameters that are common to multiple procedures (`editgroupid`, `byid`, `acceptnegative`).
Overall, the Banff Processor uses 18 metadata tables, which can be classified as follows:

- Tables describing the overall process flow:
  - JOBS
  - PROCESSCONTROLS
- Process step parameters for built-in Banff procedures:
  - VERIFYEDITSPECS
  - OUTLIERSPECS
  - ERRORLOCSPECS
  - DONORSPECS
  - ESTIMATORSPECS
  - ESTIMATORS
  - ALGORITHMS
  - PRORATESPECS
  - MASSIMPUTATIONSPECS
- Process step parameters for User-Defined Procedures:
  - USERVARS
- Tables used to define edits:
  - EDITS
  - EDITGROUPS
- Parameters used by multiple procedures:
  - VARLISTS
  - WEIGHTS
  - EXPRESSIONS
- Data management:
  - PROCESSOUTPUTS
Only the JOBS table is mandatory, though other tables are required depending on which procedures or options are used. A full description of all metadata tables is included in this document: Metadata Tables
Formatting#
The metadata tables must be saved in `.xml` format. Users may create the `.xml` files on their own, or use the Banff Processor Template to create and save the metadata and the Metadata Generation Tool to convert the template file into `.xml` tables. By default, the processor will look for the XML files in the same location as your `.json` input file, either directly in the same folder or in a `\metadata` subfolder. Alternatively, you may provide a specific location by setting the `metadata_folder` parameter in your input JSON file.
Example (JOBS table)#
The following table defines a single job, “Main”, which includes four process steps.
| jobid | seqno | controlid | process | specid | editgroupid | byid | acceptnegative |
|---|---|---|---|---|---|---|---|
| Main | 1 |  | ERRORLOC | errloc_specs | edits_1 |  | Y |
| Main | 2 |  | DETERMINISTIC |  | edits_1 |  | Y |
| Main | 10 |  | DONORIMP | donor_specs | edits_1 | by_list | Y |
| Main | 99 |  | PRORATE | pro_specs | edits_2 |  | Y |
- Only the ordering of the sequence numbers (`seqno`) is important; they do not need to be sequential integers.
- This job includes four process steps, run sequentially, consisting of four built-in Banff procedures: `errorloc`, `deterministic`, `donorimp`, and `prorate`.
- The parameters `editgroupid`, `byid` and `acceptnegative`, which are common to many of the built-in Banff procedures, are included in the JOBS table.
- Most procedures include mandatory and/or optional parameters that define exactly how the procedure should be executed. These are contained in additional metadata tables and linked to specific process steps via the `specid` column. Procedures that do not have any additional parameters (beyond those included in the JOBS table) do not require a `specid`.
- The `controlid` column is optional and can be used to specify Process Controls.
Additional examples can be found in the project’s ‘examples’ folder.
Metadata Generation Tool#
Metadata stored in the Banff Processor Template must be converted to XML files before running the processor. A conversion tool is provided for this purpose. With the banffprocessor package installed in your python environment, the conversion tool can be run with the following command:
```
banffconvert "\path\to\your\excel_metadata.xlsx" -o "\my\output\directory" -l fr
```
or:
```
banffconvert "\path\to\your\excel_metadata.xlsx" --outdir="\my\output\directory" --lang en
```
Alternatively, to run as a module:
```
python -m banffprocessor.util.metadata_excel_to_xml "\path\to\your\excel_metadata.xlsx" -o "\my\output\directory"
```
NOTE: The '-o'/'--outdir' parameter is optional. If it is not provided, the conversion tool will save the XML files to the same directory as the input file.
NOTE: The '-l'/'--lang' parameter is optional. Valid values include en and fr; if not specified, the default is en.
Finally, the tool can be imported and run directly in a Python script:
```python
import banffprocessor.util.metadata_excel_to_xml as e2x

e2x.convert_excel_to_xml("\\path\\to\\your\\excel_metadata.xlsx", "\\my\\output\\directory")
```
Banff Processor Input Parameters#
Input parameters are specified in a `.json` file or a `ProcessorInput` object, either of which is passed to the processor and used to specify your job's parameters. These are the parameters that are currently available:
| Name | Purpose | Required? |
|---|---|---|
| job_id | The job id from your JOBS metadata table | Y |
| unit_id | The unit id variable is the unique identifier on micro data files for the job | Y |
| indata_filename | The filename or full filepath of your input/current data file | Y (unless running only VerifyEdits) |
| indata_aux_filename | The filename or full filepath of your auxiliary data file | N |
| indata_hist_filename | The filename or full filepath of your historic data file | N |
| instatus_hist_filename | The filename or full filepath of your historic status data file | N |
| instatus_filename | The filename or full filepath of a status file to use as input to the first proc in your job requiring a status file. [1] | N |
| user_plugins_folder | The optional location of the folder containing your custom python procedure plugins. See below for a description of how to create your own plugins | N |
| metadata_folder | The optional path to a folder where your XML metadata can be found | N |
| output_folder | The optional path to a folder where your output files will be saved | N |
| process_output_type | Controls the output datasets retained by each process (e.g. All or Custom; see Output) | N |
| seed | The seed value to use for consistent results when using the same input data and parameters | N |
| no_by_stats | If specified, determines if no_by_stats is set to True when calling standard procedures | N |
| randnumvar | Specify a random number variable to be used when having to make a choice during error localization or donor imputation. This parameter is optional and is only used by ErrorLoc and DonorImputation simultaneously (it cannot be used in one and not the other). Please see the Banff User Guide for more details on the use of the randnumvar parameter | N |
| save_format | Optional list of file extensions used to determine the format to save output files in. One or more may be provided. Currently supports CSV and Parquet extensions | N |
| log_level | Configures whether or not to create a log file and what level of messages it should contain. Value should be 0, 1 (default) or 2. See Output | N |
Example JSON input file:
```json
{
    "job_id": "j1",
    "unit_id": "ident",
    "indata_filename": "indata.parq",
    "indata_aux_filename": "C:\\full\\filepath\\to\\auxdata.parq",
    "indata_hist_filename": "histdata.parq",
    "instatus_hist_filename": "histstatus.parq",
    "instatus_filename": "instatus.parq",
    "user_plugins_folder": "C:\\path\\to\\my\\plugins",
    "metadata_folder": "C:\\path\\to\\xml\\metadata",
    "output_folder": "my_output_subfolder",
    "process_output_type": "All",
    "seed": 1234,
    "no_by_stats": "N",
    "randnumvar": "",
    "save_format": [".parq", ".csv"],
    "log_level": 2
}
```
Example building a `ProcessorInput` object inline:
```python
from banffprocessor.processor import Processor
from banffprocessor.processor_input import ProcessorInput
from pathlib import Path

# Supplying parameters directly, instead of an input file
input_params = ProcessorInput(job_id="j1",
                              unit_id="ident",
                              # Gets the path to the folder containing this file
                              input_folder=Path(__file__).parent,
                              indata_filename="indata.parq",
                              indata_hist_filename="C:\\full\\filepath\\to\\histdata.parq",
                              seed=1234,
                              save_format=[".parq", ".csv"],
                              log_level=2)

# Normal method with a JSON file
#my_bp = Processor("C:\\path\\to\\my\\processor_input.json")

# Method when providing parameters inline
my_bp = Processor(input_params)
```
Notes:
- All folder locations may be given as absolute filepaths or relative to the input folder (either the location of the input `.json` file or the supplied `input_folder` parameter if creating inputs inline as demonstrated above).
- This `input_folder` location is also used as the default location for other files required by the processor should no value be provided for them, such as metadata, input data files and user-defined procedures.
- File paths must have any backslashes `\` escaped by replacing them with a double backslash `\\`. For example, `C:\this\is\a\filepath` would become `C:\\this\\is\\a\\filepath` (see the sketch after these notes).
- Fields that are not required to run the procedures outlined in your Jobs file can be omitted or left empty.
- The CSV format has been included for testing purposes and is not intended for production; Parquet is currently the recommended format for production for reasons related to accuracy, performance and efficiency.
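Since backslash escaping only applies to paths written as string literals, one option when building a `ProcessorInput` inline is to construct paths with `pathlib`. The sketch below is illustrative only; the folder and file names are hypothetical.

```python
from pathlib import Path

# Sketch only: pathlib assembles platform-appropriate paths, so no manual
# backslash escaping is needed when the path is built in Python code.
# (JSON parameter files still require escaped backslashes.)
input_folder = Path("C:/projects/my_survey")          # hypothetical folder
indata_path = input_folder / "data" / "indata.parq"   # hypothetical file
print(indata_path)  # e.g. C:\projects\my_survey\data\indata.parq on Windows
```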
Executing the processor as a command line utility#
With the banffprocessor package installed in your python environment, the processor can be run with the following command:
banffprocessor "\path\to\your\processor_input.json" -l fr
Alternatively, to run as a module:
```
python -m banffprocessor.processor "\path\to\your\processor_input.json" --lang fr
```
NOTE: The '-l'/'--lang' parameter is optional. Valid values include en and fr; if not specified, the default is en.
Executing the processor from within a python script#
```python
import banffprocessor

# Optional: Set the language to fr so that log and console messages are written in French.
banffprocessor.set_language(banffprocessor.SupportedLanguage.fr)

bp = banffprocessor.Processor.from_file("path\\to\\my\\input_file.json")
bp.execute()
bp.save_outputs()
```
Alternatively, you may load your input data files programmatically as Pandas DataFrames:
```python
from banffprocessor.processor import Processor
import pandas as pd

indata = pd.DataFrame()
indata_aux = pd.DataFrame()
instatus = pd.DataFrame()
...
# Load your dataframes with data
...

bp = Processor.from_file(input_filepath="path\\to\\my\\input_file.json", indata=indata, instatus=instatus)
bp.execute()
bp.save_outputs()
```
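As a concrete illustration of the loading step elided above, the sketch below reads the input datasets from Parquet files before handing them to the processor; the file paths are hypothetical.

```python
import pandas as pd
from banffprocessor.processor import Processor

# Sketch only: load the input datasets yourself (hypothetical paths), then pass
# the DataFrames to the processor instead of relying on indata_filename /
# instatus_filename in the input JSON file.
indata = pd.read_parquet("C:\\data\\indata.parq")
instatus = pd.read_parquet("C:\\data\\instatus.parq")

bp = Processor.from_file(input_filepath="path\\to\\my\\input_file.json", indata=indata, instatus=instatus)
bp.execute()
bp.save_outputs()
```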
User-Defined Procedures#
In addition to the standard Banff procedures automatically integrated into the processor, you may also include your own `.py` files implementing custom procedures. By default the processor will look for Python files placed in a `\plugins` subfolder in the same location as your input JSON file. Alternatively, you may provide a specific location to load plugins from in the `user_plugins_folder` parameter of your input JSON file. You may provide as many plugin files as needed for your job, and each plugin file may contain as many procedure classes as you wish, so long as each class is registered in a `register()` method.

Your plugin must define a class that implements the `ProcedureInterface` protocol, which is found in the package's source files at `\src\banffprocessor\procedures\procedure_interface.py`. Your implementing class must have the exact same attribute names and function signatures as the interface does. Here is an example of a plugin that implements the protocol:
```python
import pandas as pd


class MyProcClass:
    @classmethod
    def execute(cls, processor_data) -> int:
        # These give you indata as a pyarrow Table
        #indata = processor_data.indata
        #indata = processor_data.get_dataset("indata", ds_format="pyarrow")
        # This gives you indata as a Pandas DataFrame
        indata = processor_data.get_dataset("indata", ds_format="pandas")

        # Get the uservar var1 from the metadata collection, by default a string
        my_var1 = processor_data.current_uservars["var1"]
        # If our uservar is supposed to be numeric we would need to cast it
        #my_var1 = int(processor_data.current_uservars["var1"])

        # Create an outdata DataFrame containing the indata record(s) with ident value R01
        outdata = pd.DataFrame(indata.loc[indata['ident'] == 'R01'])

        # We are expecting to have at least one record found
        # If we don't, return 1 to indicate there was an error and the job should terminate
        if outdata.empty:
            return 1

        # Set the v1 field in the retrieved record(s) to contain the uservar value
        outdata['v1'] = my_var1

        # In order for the processor to update our imputed_file we need to set the outdata file
        processor_data.outdata = outdata

        # If we made it here return 0 to indicate there were no errors
        return 0


# Registers all plugin classes to the factory
# "myproc" is the same name you will provide in your Jobs file entries as
# the process name, using any capitalization you want (i.e. mYpRoC, MyProc, myProc etc.)
def register(factory) -> None:
    factory.register("myproc", MyProcClass)
    # You may provide multiple names for your proc, if you like
    #factory.register(["myproc", "also_myproc"], MyProcClass)
```
- When your `execute()` is complete, the processor automatically updates the status_file and/or imputed_file with the contents of the corresponding outstatus/outdata datasets, if you have set one. This operation is unable to add or remove data from your imputed_file; it can only update existing records. If you need to add or remove data from imputed_file you must do so in your plugin and set `processor_data.indata` to point to your updated data. This will log a warning in your log file, but it can be ignored if this was intended.
- Additionally, the processor automatically appends any other datasets you output from a custom plugin to a single "cumulative" version if the process_output_type is set to "All" or "Custom" and the dataset name is specified in a ProcessOutput metadata entry for that custom plugin.
- The `execute()` method is marked as a classmethod, which means its first argument `cls` is a reference to `MyProcClass`. It also has a second argument `processor_data`, which is an object of type `ProcessorData` (the definition of which can be found in `src\banffprocessor\processor_data.py`). This object includes input files, output files (from previously run procedures and the current procedure), metadata and parameters from your input JSON file.
- Your `execute()` method should also return an `int` representing the return code of the plugin. Any non-zero number indicates that the plugin did not complete successfully and that the processor should stop processing subsequent steps in the imputation strategy; alternatively, an exception can be raised.
- Finally, your plugin's module must implement a `register()` function, outside of any class definitions in your plugin file. This function has one parameter, `factory`. The function must call the `register()` function of the factory object, providing the name of your procedure as it will appear in your metadata and the name of the class that implements it. Though one plugin per file is recommended, if you have multiple classes in the same file that implement the Banff Procedure Interface, you can register all of them using the same register function. Just include a `factory.register(...)` call for each plugin procedure you would like to register.
- NOTE: The name registered to the factory is the same name that you will provide in your Jobs entries as the name of the `process`. Process names are not case sensitive.
For an example of a job that includes a user-defined procedure, see `banffprocessor\banff-processor\tests\integration_tests\udp_test`, with the plugin located in `\plugins\my_plugin.py`.
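Before wiring a plugin into a job, it can be handy to exercise its `execute()` method on its own. The sketch below is not part of the banffprocessor API: it uses a hand-rolled stand-in that mimics only the pieces of `ProcessorData` used by the example above (`get_dataset()`, `current_uservars`, `outdata`), and assumes `MyProcClass` is defined or imported in the same script.

```python
import pandas as pd

class FakeProcessorData:
    """Minimal stand-in for ProcessorData, for trying out a plugin in isolation."""

    def __init__(self, indata: pd.DataFrame, uservars: dict) -> None:
        self._indata = indata
        self.current_uservars = uservars
        self.outdata = None

    def get_dataset(self, name: str, ds_format: str = "pandas") -> pd.DataFrame:
        # This stub only knows about indata in pandas format.
        if name != "indata" or ds_format != "pandas":
            raise ValueError("stub only supports indata as a pandas DataFrame")
        return self._indata

indata = pd.DataFrame({"ident": ["R01", "R02"], "v1": [10, 20]})
fake = FakeProcessorData(indata, {"var1": "99"})

return_code = MyProcClass.execute(fake)  # assumes MyProcClass from the example above
print(return_code)   # expect 0
print(fake.outdata)  # the R01 record with v1 set to the uservar value
```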
Process Controls#
Note: Process Controls are a new feature introduced with version 2.0.0 of the Python processor.
Process controls are processes that run before an imputation step; an example is a filter applied to the input data or status file. Process controls enable the processor to be more generic, reduce the number of steps in an imputation strategy, and improve the information provided to subsequent processes (SEVANI).
This feature requires the use of a new field in the Jobs metadata file, `controlid`. This field references an entry (or entries) in a new metadata file, processcontrols.xml (produced from the PROCESSCONTROLS worksheet in the Excel template):
| controlid | targetfile | parameter | value |
|---|---|---|---|
| Identifies the control or set of controls to apply to the Jobs steps with the same controlid | The dataset name to apply the control to (names should be written in the same case as they appear in the table) | The desired control type | Determined by the control type |
| control1 | indata | row_filter | strat > 15 and (rec_id not in (SELECT * FROM instatus WHERE status != 'FTI')) |
| control1 | instatus | column_filter | IDENT, AREA, V1 |
| control1 | indata | exclude_rejected | True |
| control1 | N/A | edit_group_filter | N/A |
All process controls with the specified `controlid` are applied to their respective `targetfile`s for the single job step on which they are declared. Upon completion of the job step the affected `targetfile`s are returned to their original state and the job continues. However, if the job step begins the execution of a new process block, the `targetfile` will remain in the state created by the applied process control(s) for the duration of the process block (and any sub-blocks within).

One `controlid` may be used as many times as needed per `targetfile`. If a `controlid` is repeated for the same `targetfile` AND `parameter`, then the `value` for those controls is combined into one. This is intended to allow more modularity in control sets, as individual parts of multi-part conditions can be interchanged as desired without affecting the other parts.
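As a purely conceptual sketch (the values below are hypothetical, and this is not the processor's actual code), repeating a ROW_FILTER entry for the same controlid, targetfile and parameter amounts to joining the value fields into one condition:

```python
# Two PROCESSCONTROLS entries sharing controlid/targetfile/parameter=row_filter:
values = ["strat > 15", "area = 'A1'"]             # their individual value fields
combined = " AND ".join(f"({v})" for v in values)  # joined with AND, per ROW_FILTER below
print(combined)                                    # (strat > 15) AND (area = 'A1')
```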
Control Types#
ROW_FILTER
- Filters `targetfile` using an SQL WHERE clause.
- `value` - The SQL condition, which can include column names and/or table names (exactly as shown in Available Table Names).
- If `controlid`, `targetfile` and `parameter` are repeated for more than one entry, the conditions in their `value` fields are joined by `AND`.

COLUMN_FILTER (can apply multiple for one ID; column name lists are combined into one)
- Filters `targetfile` to remove columns that don't appear in the list in the `value` field.
- `value` - A comma-separated list of column names to KEEP in `targetfile`.
- If `controlid`, `targetfile` and `parameter` are repeated for more than one entry, the column lists in their `value` fields are combined.

EXCLUDE_REJECTED
- Filters `targetfile` by removing any entries with a `unit_id` that appears in the `outreject` table.
- `value` - The text 'True' or 'False', indicating if the control should be applied or not.
- For one `controlid`, only one EXCLUDE_REJECTED control may be used per `targetfile`.
- NOTE: Errorloc and Prorate each produce slightly different `outreject` files.
  - Errorloc: `outreject`, produced by the current errorloc call, overwrites any existing `outreject` file. The contents of `outreject` are also appended to `outreject_all`.
  - Prorate: The contents of the `outreject` dataset produced by the current prorate call are appended to `outreject_all` and also appended to the existing `outreject` dataset (or just set as the `outreject` table if one does not yet exist).

EDIT_GROUP_FILTER
- Filters `instatus` by removing any entries with an `editgroupid` matching the current job step, OR any entries that were produced by an Outlier step with a status value of `FTI` or `FTE`.
- `value` and `targetfile` fields should not be given for this control type.
- Replaces existing SAS functionality where this filter was automatically applied prior to executing a DonorImputation or Deterministic proc.
NOTE: Column names from your original input files should be referenced in their original case; those that are created or added by the Processor should be in ALL-CAPS.
Available Table Names#
| Table Name | Notes |
|---|---|
| status_log | Contains all produced statuses (see status_log under Output) |
| indata | The input data to the current job step. Alias: imputed_file |
| indata_aux | Auxiliary input data. |
| indata_hist | Historic input data. |
| instatus | The input status data to the current job step. Alias: status_file |
| instatus_hist | Historic status data. |
| time_store | Information regarding the runtime and execution of each step in a job |
| outreject and outreject_all | Produced by Errorloc and Prorate |
| outedit_applic | Can be produced by Editstats |
| outedit_status | Can be produced by Editstats |
| outedits_reduced | Can be produced by Editstats |
| outglobal_status | Can be produced by Editstats |
| outk_edits_status | Can be produced by Editstats |
| outvars_role | Can be produced by Editstats |
| outacceptable | Can be produced by Estimator |
| outest_ef | Can be produced by Estimator |
| outest_lr | Can be produced by Estimator |
| outest_parm | Can be produced by Estimator |
| outrand_err | Can be produced by Estimator |
| outmatching_fields | Produced by DonorImputation |
| outdonormap | Produced by DonorImputation and MassImputation |
| outlier_status | Can be produced by Outlier |
- Optional datasets are only available during execution and saved to disk if the input parameter process_output_type is set to "All" (2), or if the dataset's name is specified in a ProcessOutput metadata table entry for the process producing it and process_output_type is set to "Custom" (3).
- Either the table name or its alias may be used; both refer to the same table and data.
- The indata and instatus files are always available.
- If the first step in a job does not provide an instatus file to start with, instatus is not available to be used in a filter for the first step, though it is available in subsequent steps.
- Any procedure-specific file is not available to reference until a job step for that procedure has run in a preceding step, and only if the file referenced is produced according to the `process_output_type`.
- All files will have the columns SEQNO and JOBID added. These can be filtered to obtain data from a specific job step.
- The `value` field supports references to tables located on-disk as well as the in-memory table names found above.
  - This does not apply to `targetfile`, which must be an in-memory table referenced by name.
  - The filepath must either be absolute or relative to the input folder for the current job.
  - The filepath should be single-quoted, e.g.: `ident in (SELECT ident FROM '.\subfolder\of\input_folder\idents.csv')`
  - See the DuckDB docs for information on supported filetypes (i.e. only '.parquet' is supported, NOT '.parq').
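Since the filter conditions are run through DuckDB (as referenced in the note above), one way to check a ROW_FILTER condition before adding it to the PROCESSCONTROLS metadata is to try it against sample data in DuckDB directly. This sketch is not part of the Banff Processor; the column values are hypothetical.

```python
import duckdb
import pandas as pd

# Hypothetical sample data standing in for indata and instatus
indata = pd.DataFrame({"rec_id": ["R01", "R02", "R03"], "strat": [20, 10, 30]})
instatus = pd.DataFrame({"rec_id": ["R03"], "status": ["FTE"]})

# A row_filter-style condition, written as an SQL WHERE clause
condition = "strat > 15 AND rec_id NOT IN (SELECT rec_id FROM instatus WHERE status != 'FTI')"

# DuckDB can query the in-scope DataFrames by name
filtered = duckdb.sql(f"SELECT * FROM indata WHERE {condition}").df()
print(filtered)  # keeps R01 only: R02 fails strat > 15, R03 is excluded by the subquery
```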
Process Blocks#
Note: Process Blocks are a new feature introduced in version 2.0.0 of the Python processor.
A job in the Banff Processor is a collection of Jobs metadata table entries, all joined by a common job identifier (jobid) and processed sequentially according to the sequence number (seqno). Only a single job is specified when executing the processor, which is done by setting the `job_id` input parameter. A Process Block is essentially a job called from within a job. Process Blocks organize jobs into sub-jobs with the following goals:

- to allow a process control to be associated with multiple job steps;
- to allow the reuse of a sequence of steps that are repeated with different inputs;
- to allow users to design and implement imputation strategies using a modular approach, meaning that smaller jobs can be developed and tested in isolation rather than as one large job.

Process blocks are used by setting the process field of a Jobs metadata table entry to `job` (rather than a traditional Banff procedure such as `prorate` or `donorimputation`) and setting the specid field to the `jobid` of the process block that is to be run.

Process blocks can call other process blocks, providing further flexibility. When preparing to execute the Banff Processor, the Jobs metadata is validated using the `job_id` input parameter as the root of the overall job structure. This validation ensures that no cycles (infinite loops) exist when the job has nested process blocks. If one is found, an error will be printed to the console and/or log file, and the issue will need to be corrected in order to successfully execute the job.
| jobid | seqno | controlid | process | specid | editgroupid | byid | acceptnegative |
|---|---|---|---|---|---|---|---|
| main_job | 1 | n/a | job | sub_job | n/a | n/a | n/a |
| main_job | 2 | n/a | outlier | outlier_spec1 | n/a | n/a | n/a |
| main_job | 3 | n/a | job | sub_job | n/a | n/a | n/a |
| sub_job | 1 | n/a | prorate | prorate_spec1 | n/a | n/a | n/a |
| sub_job | 2 | n/a | donorimp | donorimp_spec1 | n/a | n/a | n/a |
For example, the Jobs table above would result in the execution of:

1. prorate (sub_job, 1)
2. donorimp (sub_job, 2)
3. outlier (main_job, 2)
4. prorate (sub_job, 1)
5. donorimp (sub_job, 2)
A working example of a job with a Process Block can be found in the project directory under ‘examples/example4’.
Output#
Saving Outputs#
If running from within a Python script, output datasets can be saved to disk by calling the `save_outputs()` function of the `banffprocessor` object containing the results from a call to `execute()`. If running from the command line, `save_outputs()` will be called automatically once the processor has finished a job.
During operation, if no `output_folder` parameter is provided, the processor will create an `out` folder in the same location as your input JSON parameter file to save the Banff log as well as the output status and data files that are created during and after the execution of each Banff proc. The status and data files will be saved in the format determined by:

1. The `save_format` parameter in your input JSON file.
2. If no value is provided for 1., the same format as the file in `indata_filename`.
3. If neither 1. nor 2. is provided, the default `.parq` format.
Output files/datasets#
The output files from each procedure can be retained and saved. The processor will automatically add the columns `JOBID` and `SEQNO` to the outputs. When an output with the same name is generated and retained, the processor will append these output datasets together, and the datasets will need to be filtered by `JOBID` and `SEQNO` to limit the data to a specified processing step.
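For example, a saved output can be restricted to one processing step by filtering on those columns; the output path below is hypothetical and depends on your output_folder and save_format settings.

```python
import pandas as pd

# Sketch only: read a saved output dataset and keep one processing step
# using the JOBID and SEQNO columns added by the processor.
outdonormap = pd.read_parquet("out/outdonormap.parq")  # hypothetical filename
step_10 = outdonormap[(outdonormap["JOBID"] == "Main") & (outdonormap["SEQNO"] == 10)]
print(step_10)
```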
Minimal Outputs#
| Data File | Description |
|---|---|
| imputed_file | This data file contains the final imputed current data. |
| status_file | This data file contains the final imputed data statuses. |
| status_log | This data file contains the history of how the statuses changed during the imputation strategy. |
| outreject | This data file is generated by the ErrorLoc and Prorate procedures. It contains the identification of respondents that could not be processed and the reason why. |
| time_store | This data file stores the start time, end time and duration of each processing step along with the cumulative execution time. |
Optional Outputs#
| Data File | Related Procedure | Description |
|---|---|---|
| outlier_status | Outlier | It contains the final status file including the additional variables from the |
| outmatching_fields | Donor Imputation | It contains the status of the matching fields from the |
| outdonormap | DonorImputation | It contains the identifiers of recipients that have been imputed along with their donor identifier and the number of donors tried before the recipient passed the post-imputation edits. |
| outedits_reduced | EditStats | This data file contains the minimal set of edits. |
| outedit_status | EditStats | This data file contains the counts of records that passed, missed and failed for each edit. |
| outk_edits_status | EditStats | This data file contains the distribution of records that passed, missed and failed K edits. |
| outglobal_status | EditStats | This data file contains the overall counts of records that passed, missed and failed. |
| outedit_applic | EditStats | This data file contains the counts of edit applications of status pass, miss or fail that involve each field. |
| outvars_role | EditStats | This data file contains the counts of records of status pass, miss or fail for which field j contributed to the overall record status. |
| outrand_err | Estimator | This dataset contains the random error report if at least one of the estimator specifications has the |
| outest_ef | Estimator | This dataset contains the report on the calculation of averages for estimator functions if at least one of the estimator specifications uses an estimator function (type EF). |
| outest_parm | Estimator | This dataset contains the report on imputation statistics by estimator. |
| outest_lr | Estimator | This dataset contains the report on the calculation of "beta" coefficients for linear regression estimators if at least one of the estimator specifications uses a linear regression (type LR). |
| outacceptable | Estimator | This data file contains the report on acceptable observations retained to calculate the parameters for each estimator given in the specifications. This file can be large and can slow down execution. |
Notes:
- Refer to the Banff Procedure User Guide for a full description of each file generated by a Banff procedure.
- Optional output files will be retained if process_output_type = `all`, or if process_output_type = `custom` and the dataset name is specified in the ProcessOutputs metadata for the given process.
- Plugins may output additional optional output files.
The Log#
The Python processor can generate an execution log which provides valuable information about the imputation process, useful for debugging and analytical purposes. The level of information logged can be configured via the `log_level` parameter of your input JSON file.
- If 0, no log file is produced at all; only warnings, errors and a summarization of each procedure are written to the console after it is performed. This summary is always printed, even at levels 1 and 2.
- If 1 (the default value if `log_level` is not set), the log file contains INFO-level messages, which are primarily the output from the execution of each proc from the Banff package, as well as warnings and errors.
- Finally, if 2, the log file contains all messages from 1 as well as any DEBUG-level messages, such as more granular information about produced and processed datasets.
The processor keeps a maximum of 6 log files at once. The most recent job is always logged to `banffprocessor.log`, and when a new job is run, a number is appended to the old log file and a new log is created for the new job. The numbering goes from newest to oldest (i.e. `banffprocessor.log` is the log for the most recent job, `banffprocessor.log.1` is from the next most recent, and `banffprocessor.log.5` is from the oldest job).
Process Block Output#
When a new process block is to be run, a special folder is created in the output folder for the calling block. This new output folder is named after the new block's parameters and, upon completion of the block, will contain all of the files created by the child block. No new log file is created, however; all log outputs for child blocks can be found in the main log file located in the root input folder.