Error Localization#

Execution: banff.errorloc()
SDE function type: Review, Selection
Input status flags: FTI (optional)
Output status flags: FTI

Description#

For each record, selects the minimum number of variables to impute such that each observation can be made to pass all edits.

Consistency edits specify relationships between variables that a record must satisfy. When a record fails to satisfy these relationships, users must choose which variables to change, a process known as error localization. The Banff error localization procedure follows the Fellegi-Holt minimum-change principle, and uses an algorithm to select which variables to change. This process is performed independently on each record. Selected values are saved in the outstatus file, with a status flag of FTI (Field to impute).

This procedure requires a set of edits, consisting of linear equalities and inequalities, that must be internally consistent. The procedure will only perform error localization on the variables included in the list of edits. Any missing values from amongst the listed variables will automatically be selected for imputation.

By default, the procedure will minimize the number of variables to change. Users may also specify variable weights, in which case the procedure will minimize the weighted count of variables to change. For some records, the error localization problem may have multiple solutions (i.e., choices of variables) that satisfy the minimum-change principle; in this case one of the solutions is selected at random.

For a full mathematical description of the procedure methods, with examples, please see the Functional Description.

Input and output tables#

Descriptions of input and output tables are given below. Banff supports a number of input and output formats; please see the Banff User Guide for more information.

Input Tables	Description
indata	Input statistical data. Mandatory.
instatus	Input status file containing FTI status flags. Values previously flagged as FTI will be prioritized for selection. This ensures the procedure doesn’t select a field for imputation when a previously flagged value already solves the error localization problem, thereby minimizing the overall number of variables requiring imputation.

Output Table	Description
outstatus	Output status file identifying selected fields with FTI status flags, and their values.
outreject	Output table containing records that failed error localization. The outreject table contains records for which error localization could not be performed, either because they exceeded maximum allowable cardinality (error = “CARDINALITY EXCEEDED”) or time per observation (error = “TIME EXCEEDED”).

For details on the content of output tables, please see the Output Tables document.

Parameters#

Parameter	Python type	Description
unit_id	str	Identify key variable (unit identifier) on indata. Mandatory. Must be unique for each record. Records with a missing value are dropped before processing.
edits	str	List of consistency edits. Mandatory. Example: `"Revenue - Expenses = Profit; Revenue >= 0; Expenses >= 0;"`
weights	str	Specify the error localization weights. Weights can be used to influence which variables are selected for subsequent treatment. When specified, the error procedures minimizes the weighted count of variables to change, such that the record in question may be altered to satisfy the consistency edits. By default, error localization assigns a weight of one to each variable. To assign a different weight, specify `variable = value` where variable is one of the variable specified in the `edits`, and value is a number greater than zero. Multiple weights can be assigned in this way, separated by a semi-colon. Example: `weights = "revenue = 1.5; expenses = 0.8"`
accept_negative	bool	Treat negative values as valid. Default=False. By default, a positivity edit is added for every variable in the list of edits; this parameter permits users to remove this restriction. If required, users may directly add positivity edits for individual variables.
cardinality	float	Specify the maximum cardinality. Cardinality refers to the weighted number of variables requiring imputation, which can vary by record. For records with high cardinality, the error localization problem can take a long time to solve. Specifying a maximum cardinality can improve processing time but may result in the procedure failing to solve the error localization problem for all records; these will be identified in the outreject table.
time_per_obs	float	Specify the maximum processing time allowed per observation.
seed	float	Specify the root for the random number generator. The seed is used to ensure consistent results from one run to the next. If not specified or specified as a non-positive value, a random number is generated by the procedure.
rand_num_var	str	Specify a random number variable to be used when having to make a choice during error localization. The random number variable must exist on indata; it must be numeric and values must be non-negative real numbers less than or equal to 1. The values should be unique for each record.
by	str	Variable(s) used to partition indata into by-groups for independent processing. Because error localization is performed on each record independently, this parameter has no effect. Future versions may use this parameter to improve the processing efficiency of the procedure. Example: `by = "province industry"`
prefill_by_vars	bool	Add by-group variable(s) to input status file(s) to improve performance. Default=True.
presort	bool	Sort input tables before processing, according to procedure requirements. Default=True.
no_by_stats	bool	Reduce log output by suppressing by-group specific messages. Default=False.
display_level	int	Value (0 or 1) to request detail output to the log in relation to the random number variable. Default=0.

Notes#

Multiple equivalent solutions#

In some cases, there may be multiple solutions that solve the error localization problem. For example, for a record failing the edit "Profit = Revenue - Expenses;", changing any one variable would be a valid solution. When this occurs, the procedure selects one of these solutions at random.

For development or testing purposes, users may wish to produce consistent results over multiple runs of the procedure, and may do so using the seed or rand_num_var parameters. Both ensure that the same solutions will be selected from one run to the next, if executed on the same set of inputs. Note that if both seed and rand_num_var are specified, seed is ignored. If neither is specified, the system generates a default seed.