Error Localization
Execution: banff.errorloc()
SDE function type: Review, Selection
Input status flags: FTI (optional)
Output status flags: FTI
Description
For each record, selects the minimum number of variables to impute such that each observation can be made to pass all edits.
Consistency edits specify relationships between variables that a record must satisfy. When a record fails to satisfy these relationships, users must choose which variables to change, a process known as error localization. The Banff error localization procedure follows the Fellegi-Holt minimum-change principle, and uses an algorithm to select which variables to change. This process is performed independently on each record. Selected values are saved in the outstatus file, with a status flag of FTI (Field to impute).
This procedure requires a set of edits, consisting of linear equalities and inequalities, that must be internally consistent. The procedure will only perform error localization on the variables included in the list of edits. Any missing values from amongst the listed variables will automatically be selected for imputation.
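As a rough illustration of the two rules above (only variables named in the edits are considered, and missing values among them are selected automatically), the sketch below checks one record against a small edit set. The record, the variable names, and the representation of edits as Python predicates are assumptions made for this example; Banff takes edits as a string and does this work internally.

```python
import math

# Hypothetical edits: (label, variables involved, check function).
EDITS = [
    ("Profit = Revenue - Expenses",
     ("Profit", "Revenue", "Expenses"),
     lambda r: math.isclose(r["Profit"], r["Revenue"] - r["Expenses"])),
    ("Staff >= 0", ("Staff",), lambda r: r["Staff"] >= 0),
]


def edit_variables(edits):
    """Variables that appear in at least one edit, in first-seen order."""
    seen = []
    for _, variables, _ in edits:
        for name in variables:
            if name not in seen:
                seen.append(name)
    return seen


def missing_fields(record, edits):
    """Edit variables with no value; these are always selected for imputation."""
    return [v for v in edit_variables(edits) if record.get(v) is None]


def failed_edits(record, edits):
    """Labels of edits the record fails, skipping edits that touch a missing value."""
    failed = []
    for label, variables, check in edits:
        if any(record.get(v) is None for v in variables):
            continue  # cannot be evaluated; the missing field is already selected
        if not check(record):
            failed.append(label)
    return failed


# "Region" is not in any edit, so it is never a candidate for selection.
record = {"Profit": 50.0, "Revenue": 100.0, "Expenses": 60.0,
          "Staff": None, "Region": "A"}

print(missing_fields(record, EDITS))
print(failed_edits(record, EDITS))
```

Here `Staff` is selected because it is missing, and the balance edit fails because 100 - 60 is not 50, so at least one of `Profit`, `Revenue`, or `Expenses` must also be selected.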
By default, the procedure will minimize the number of variables to change. Users may also specify variable weights, in which case the procedure will minimize the weighted count of variables to change. For some records, the error localization problem may have multiple solutions (i.e., choices of variables) that satisfy the minimum-change principle; in this case one of the solutions is selected at random.
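The effect of weights on the minimum-change criterion can be sketched as follows. This is not the Banff algorithm: it assumes the candidate solutions (sets of variables whose change would let the record pass all edits) are already known, which is the hard part of the real procedure. It also assumes that a variable without an explicit weight counts as 1. All names are hypothetical.

```python
def weighted_count(solution, weights):
    """Sum of weights over the variables in a solution (default weight 1.0)."""
    return sum(weights.get(name, 1.0) for name in solution)


def minimal_solutions(candidates, weights=None):
    """Return every candidate whose weighted count equals the minimum."""
    weights = weights or {}
    costs = [weighted_count(c, weights) for c in candidates]
    best = min(costs)
    return [c for c, cost in zip(candidates, costs) if cost == best]


# Candidate solutions for a record failing "Profit = Revenue - Expenses;".
candidates = [
    ("Profit",),
    ("Revenue",),
    ("Expenses",),
    ("Revenue", "Expenses"),
]

# Unweighted: the three single-variable solutions tie.
unweighted = minimal_solutions(candidates)

# Weighted: making Revenue and Expenses costlier leaves Profit as the only minimum.
weighted = minimal_solutions(candidates, {"Revenue": 2.0, "Expenses": 2.0})

print(unweighted)
print(weighted)
```

When more than one solution remains, as in the unweighted case, the procedure picks one of them at random.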
For a full mathematical description of the procedure methods, with examples, please see the Functional Description.
Input and output tables
Descriptions of input and output tables are given below. Banff supports a number of input and output formats; please see the Banff User Guide for more information.
| Input Tables | Description |
|---|---|
| indata | Input statistical data. Mandatory. |
| instatus | Input status file containing FTI status flags. |
| Output Table | Description |
|---|---|
| outstatus | Output status file identifying selected fields with FTI status flags, and their values. |
| outreject | Output table containing records that failed error localization. |
For details on the content of output tables, please see the Output Tables document.
Parameters
| Parameter | Python type | Description |
|---|---|---|
| unit_id | str | Identify key variable (unit identifier) on indata. Mandatory. |
| edits | str | List of consistency edits. Mandatory. |
| weights | str | Specify the error localization weights. |
| accept_negative | bool | Treat negative values as valid. Default=False. |
| cardinality | float | Specify the maximum cardinality. |
| time_per_obs | float | Specify the maximum processing time allowed per observation. |
| seed | float | Specify the root for the random number generator. |
| rand_num_var | str | Specify a random number variable to be used when having to make a choice during error localization. |
| by | str | Variable(s) used to partition indata into by-groups for independent processing. |
| prefill_by_vars | bool | Add by-group variable(s) to input status file(s) to improve performance. Default=True. |
| presort | bool | Sort input tables before processing, according to procedure requirements. Default=True. |
| no_by_stats | bool | Reduce log output by suppressing by-group specific messages. Default=False. |
| display_level | int | Value (0 or 1) to request detail output to the log in relation to the random number variable. Default=0. |
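A call might be assembled along the following lines. This is a minimal sketch, assuming the parameters are passed to `banff.errorloc()` as keyword arguments under the names listed in the table; the column name `ident`, the edit string, and the helper `run_errorloc` are hypothetical. The Banff User Guide remains the reference for accepted input formats and for how to read the output tables.

```python
# Keyword arguments named after the parameter table.
# The column name and the edit string are hypothetical.
errorloc_params = {
    "unit_id": "ident",                       # key variable on indata (mandatory)
    "edits": "Profit = Revenue - Expenses;",  # consistency edits (mandatory)
    "accept_negative": False,                 # the documented default
    "seed": 1,                                # fixed seed for repeatable runs
}


def run_errorloc(indata):
    """Call banff.errorloc() on indata; requires the banff package."""
    import banff  # imported here so the sketch loads without banff installed

    return banff.errorloc(indata=indata, **errorloc_params)
```

Keeping the parameters in a dictionary makes it easy to reuse the same settings across runs, for example when comparing results with and without `weights`.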
Notes
Multiple equivalent solutions
In some cases, there may be multiple solutions that solve the error localization problem. For example, for a record failing the edit "Profit = Revenue - Expenses;", changing any one variable would be a valid solution. When this occurs, the procedure selects one of these solutions at random.
For development or testing purposes, users may wish to produce consistent results over multiple runs of the procedure, and may do so using the seed or rand_num_var parameters. Both ensure that the same solutions will be selected from one run to the next, if executed on the same set of inputs. Note that if both seed and rand_num_var are specified, seed is ignored. If neither is specified, the system generates a default seed.
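The idea behind repeatable tie-breaking can be illustrated with Python's standard `random` module. This sketch does not reproduce Banff's random number generator or its selection rule. In particular, mapping a per-record random number in [0, 1] to a position in the list of tied solutions is an assumption made for illustration, and both function names are hypothetical.

```python
import random


def pick_with_seed(solutions, seed):
    """Choose among tied solutions using a generator started from a fixed seed."""
    rng = random.Random(seed)
    return rng.choice(solutions)


def pick_with_random_number(solutions, u):
    """Choose among tied solutions using a supplied number u in [0, 1]."""
    index = min(int(u * len(solutions)), len(solutions) - 1)
    return solutions[index]


# Tied solutions for a record failing "Profit = Revenue - Expenses;".
solutions = [("Expenses",), ("Profit",), ("Revenue",)]

# The same seed yields the same choice on every run.
first = pick_with_seed(solutions, seed=42)
second = pick_with_seed(solutions, seed=42)
print(first == second)

# A stored per-record random number fixes the choice without any seed.
print(pick_with_random_number(solutions, 0.5))
```

A stored random number variable has the practical advantage that the choice for a given record does not depend on the order in which records are processed.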