banff.proc package#

Submodules#

banff.proc.banff_proc module#

class banff.proc.banff_proc.BanffProcedure(trace, capture, logger, input_datasets, output_datasets, presort=None, prefill_by_vars=None, exclude_where_indata=None, exclude_where_indata_hist=None, keyword_args=None)[source]#

Bases: GeneralizedProcedure

banff.proc.proc_determin module#

class banff.proc.proc_determin.ProcDetermin(accept_negative: bool | None = None, no_by_stats: bool | None = None, edits: str | None = None, unit_id: str | None = None, by: str | None = None, indata: Table | DataFrame | Path | str | None = None, instatus: Table | DataFrame | Path | str | None = None, outdata: Path | str | None = None, outstatus: Path | str | None = None, presort: bool | None = None, prefill_by_vars: bool | None = None, trace: int | bool | None = None, capture: bool | None = False, logger: Logger | None = None, **kwargs)[source]#

Bases: BanffProcedure

Performs imputation when only one combination of values permits the record to pass the set of edits.

The deterministic imputation procedure analyzes each field previously identified as requiring imputation to determine if there is only one possible value which would satisfy the original edits. If such a value is found, it is imputed during execution of this procedure. This method can also be referred to as deductive imputation, since a missing or inconsistent value can be deduced with certainty based upon other fields of the same record.
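
As an illustrative sketch, the call below mirrors the signature above on a toy record where the edit admits exactly one value for the flagged field. The table and column names and the instatus layout are assumptions for illustration; the banff invocation is shown commented since it requires the installed package:

```python
import pandas as pd

# Toy record: b is missing, and the edit a + b = total admits exactly one
# value for it (15.0), so deterministic imputation can deduce it.
indata = pd.DataFrame({"ident": ["R1"], "a": [10.0], "b": [None], "total": [25.0]})
# instatus flags field B of record R1 as FTI (Field to Impute);
# this column layout is an assumption for illustration.
instatus = pd.DataFrame({"ident": ["R1"], "fieldid": ["B"], "status": ["FTI"]})

# Sketch of the call, using parameters from the signature above:
# import banff
# determin = banff.proc.determin(
#     unit_id="ident",
#     edits="a + b = total;",
#     indata=indata,
#     instatus=instatus,
# )
# determin.outdata  # would contain b deduced as 15.0
```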

get_sort_list(include_by=True, include_unit_id=True)[source]#

Call superclass implementation using custom default values.

property indata#
property instatus#
property outdata#
property outstatus#

banff.proc.proc_donorimp module#

class banff.proc.proc_donorimp.ProcDonorimp(unit_id: str | None = None, by: str | None = None, must_match: str | None = None, data_excl_var: str | None = None, rand_num_var: str | None = None, random: bool | None = None, seed: int | None = None, edits: str | None = None, post_edits: str | None = None, display_level: int | None = None, accept_negative: bool | None = None, no_by_stats: bool | None = None, min_donors: int | None = None, percent_donors: float | None = None, n: int | None = None, eligdon: str | None = None, n_limit: int | None = None, mrl: float | None = None, indata: Table | DataFrame | Path | str | None = None, instatus: Table | DataFrame | Path | str | None = None, outdata: Path | str | None = None, outstatus: Path | str | None = None, outdonormap: Path | str | None = None, outmatching_fields: Path | str | None = None, presort: bool | None = None, prefill_by_vars: bool | None = None, exclude_where_indata: str | None = None, trace: int | bool | None = None, capture: bool | None = False, logger: Logger | None = None, **kwargs)[source]#

Bases: BanffProcedure

Performs nearest neighbour donor imputation such that each imputed record satisfies the specified post-imputation edits.

The donorimp procedure splits records into recipients (records requiring imputation) and donors (records not requiring imputation that pass the edits). For each recipient, the procedure performs the following steps:

  1. From the fields in the edits, a subset is chosen as matching fields for the purpose of distance calculations. This selection can vary depending on which fields require imputation. Users can also specify must-match fields (must_match), which will automatically be included in distance calculations alongside the system-selected ones.

  2. Matching fields are transformed into normalized ranks to remove the effect of scale and clustering from the data. Without this transformation, original data with wide ranges, such as dollar values, would always dominate the distance calculation.

  3. Distances between the recipient and donors are calculated using an L-infinity norm on the transformed matching fields. This is sometimes referred to as the minimax distance because the closest donor is the one with the smallest maximum absolute difference between the transformed values of its matching fields and those of the recipient.

  4. From the donors, a search algorithm is used to efficiently find the closest donor whose values allow the recipient record to pass the user-specified post-imputation edits (post_edits). These are typically a more relaxed form of the edits to ensure a donor can be found.

Note: The Banff distance metric will usually select different donors than a typical Euclidean distance metric. This is by design. When using Euclidean distance metrics, scale differences and skewed distributions in economic data typically result in a distance metric that is dominated by a single field such as revenue. The Banff distance metric ensures that all matching fields are given the same weight in the distance calculation.

Recipients are defined as any record with at least one field within the edits requiring imputation, as indicated by an FTI (Field to Impute) flag on the input status (instatus) file. Donors are defined as any record satisfying all the edits that is not a recipient. The donorimp procedure requires a set of edits; for a version of donor imputation that does not, please see the massimp procedure.

There are a number of ways to exclude records or values from the donor pool. Records can be excluded using the exclude_where_indata or data_excl_var parameters. This does not exclude them from the procedure completely; they may still be included as recipients if they require imputation. Records that have previously been imputed can also be excluded from the donor pool using the eligdon (eligible donor) parameter. The parameter n_limit will limit the number of times a single donor is used for imputation. Users may sometimes identify values that do not require imputation, but are sufficiently unusual that they should not be donated to other records; these should be flagged as FTE (Field to Exclude) on the instatus file.

The Banff distance metric does not accommodate categorical variables. Instead, users may create by-groups by specifying by variables. These by-groups act as imputation classes. Use the min_donors and percent_donors parameters to ensure an appropriate number or ratio of recipients and donors exist in each imputation class before performing imputation.
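
A sketch of a minimal call, using parameters documented in the signature above; the toy data, edit strings, and status layout are illustrative assumptions, and the banff invocation is shown commented:

```python
import pandas as pd

# R1 needs imputation (FTI on REVENUE); D1 and D2 pass the edits and can
# act as donors. employees serves as a must-match field for the distance.
indata = pd.DataFrame({
    "ident": ["R1", "D1", "D2"],
    "employees": [10, 12, 100],
    "revenue": [None, 500.0, 4000.0],
})
instatus = pd.DataFrame({"ident": ["R1"], "fieldid": ["REVENUE"], "status": ["FTI"]})

# Sketch of the call (requires the banff package):
# import banff
# donorimp = banff.proc.donorimp(
#     unit_id="ident",
#     edits="revenue >= 0; employees >= 0;",
#     post_edits="revenue >= 0; employees >= 0;",
#     must_match="employees",
#     min_donors=1,
#     indata=indata,
#     instatus=instatus,
# )
# donorimp.outdonormap  # recipient-to-donor pairing
```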

property indata#
property instatus#
property outdata#
property outdonormap#
property outmatching_fields#
property outstatus#

banff.proc.proc_editstat module#

class banff.proc.proc_editstat.ProcEditstat(accept_negative: bool | None = None, edits: str | None = None, by: str | None = None, indata: Table | DataFrame | Path | str | None = None, outedit_applic: Path | str | None = None, outedit_status: Path | str | None = None, outglobal_status: Path | str | None = None, outk_edits_status: Path | str | None = None, outedits_reduced: Path | str | None = None, outvars_role: Path | str | None = None, presort: bool | None = None, trace: int | bool | None = None, capture: bool | None = False, logger: Logger | None = None, **kwargs)[source]#

Bases: BanffProcedure

Produces edit summary statistics tables on records that pass, miss or fail each consistency edit.

This procedure applies a group of edits to statistical data and determines if each record passes, misses (due to missing values) or fails each edit. Resulting diagnostics are saved to five output tables, and can be used to fine-tune the group of edits, estimate the resources required for later procedures, or to evaluate the effects of imputation. Note that this procedure only reviews the data, producing summary statistics; use errorloc (with the same set of edits) to select records and fields for further treatment.
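
The pass/miss/fail classification can be illustrated without banff itself; this minimal sketch evaluates one edit, a + b = total, against a single record:

```python
def edit_status(a, b, total):
    """Classify the edit a + b = total for one record."""
    if a is None or b is None or total is None:
        return "MISS"  # a required value is missing: the edit cannot be evaluated
    return "PASS" if a + b == total else "FAIL"

print(edit_status(1, 2, 3))     # PASS
print(edit_status(1, None, 3))  # MISS
print(edit_status(1, 2, 9))     # FAIL
```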

property indata#
property outedit_applic#
property outedit_status#
property outedits_reduced#
property outglobal_status#
property outk_edits_status#
property outvars_role#

banff.proc.proc_errorloc module#

class banff.proc.proc_errorloc.ProcErrorloc(unit_id: str | None = None, by: str | None = None, rand_num_var: str | None = None, edits: str | None = None, weights: str | None = None, cardinality: float | None = None, time_per_obs: float | None = None, seed: int | None = None, display_level: int | None = None, accept_negative: bool | None = None, no_by_stats: bool | None = None, indata: Table | DataFrame | Path | str | None = None, instatus: Table | DataFrame | Path | str | None = None, outstatus: Path | str | None = None, outreject: Path | str | None = None, presort: bool | None = None, prefill_by_vars: bool | None = None, trace: int | bool | None = None, capture: bool | None = False, logger: Logger | None = None, **kwargs)[source]#

Bases: BanffProcedure

For each record, selects the minimum number of variables to impute such that each observation can be made to pass all edits.

Consistency edits specify relationships between variables that a record must satisfy. When a record fails to satisfy these relationships, users must choose which variables to change, a process known as error localization. The Banff error localization procedure follows the Fellegi-Holt minimum-change principle, and uses an algorithm to select which variables to change. This process is performed independently on each record. Selected values are saved in the outstatus file, with a status flag of FTI (Field to Impute).

This procedure requires a set of edits, consisting of linear equalities and inequalities, that must be internally consistent. The procedure will only perform error localization on the variables included in the list of edits. Any missing values from amongst the listed variables will automatically be selected for imputation.

By default, the procedure will minimize the number of variables to change. Users may also specify variable weights, in which case the procedure will minimize the weighted count of variables to change. For some records, the error localization problem may have multiple solutions (i.e., choices of variables) that satisfy the minimum-change principle; in this case one of the solutions is selected at random.
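
The minimum-change selection can be sketched by brute force on one record; the grid search below stands in for the procedure's actual algorithm, and the edits and values are illustrative:

```python
from itertools import combinations, product

# One record failing two consistency edits; find the smallest set(s) of
# variables that could be changed to satisfy both (the Fellegi-Holt
# minimum-change idea, brute-forced over a small value grid).
record = {"a": 10.0, "b": 20.0, "total": 50.0}
edits = [
    lambda r: r["a"] + r["b"] == r["total"],  # linear equality edit
    lambda r: r["b"] <= 15,                   # linear inequality edit
]

def feasible(freed):
    """Can some assignment of the freed variables satisfy every edit?"""
    grid = [float(v) for v in range(61)]
    for values in product(grid, repeat=len(freed)):
        trial = dict(record, **dict(zip(freed, values)))
        if all(edit(trial) for edit in edits):
            return True
    return False

solutions = []
for size in range(1, len(record) + 1):
    solutions = [set(s) for s in combinations(record, size) if feasible(s)]
    if solutions:
        break  # minimum-change: stop at the smallest cardinality

print(solutions)  # two minimal two-variable solutions; errorloc picks one at random
```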

get_sort_list(include_by=True, include_unit_id=True)[source]#

Call superclass implementation using custom default values.

property indata#
property instatus#
property outreject#
property outstatus#

banff.proc.proc_estimato module#

class banff.proc.proc_estimato.ProcEstimato(unit_id: str | None = None, by: str | None = None, data_excl_var: str | None = None, hist_excl_var: str | None = None, seed: int | None = None, verify_specs: bool | None = None, accept_negative: bool | None = None, no_by_stats: bool | None = None, indata: Table | DataFrame | Path | str | None = None, instatus: Table | DataFrame | Path | str | None = None, indata_hist: Table | DataFrame | Path | str | None = None, inalgorithm: Table | DataFrame | Path | str | None = None, inestimator: Table | DataFrame | Path | str | None = None, instatus_hist: Table | DataFrame | Path | str | None = None, outstatus: Path | str | None = None, outdata: Path | str | None = None, outacceptable: Path | str | None = None, outest_ef: Path | str | None = None, outest_lr: Path | str | None = None, outest_parm: Path | str | None = None, outrand_err: Path | str | None = None, presort: bool | None = None, prefill_by_vars: bool | None = None, exclude_where_indata: str | None = None, exclude_where_indata_hist: str | None = None, trace: int | bool | None = None, capture: bool | None = False, logger: Logger | None = None, **kwargs)[source]#

Bases: BanffProcedure

Performs imputation using estimation functions and/or linear regression estimators.

The estimator procedure offers imputation methods such as mean, ratio and regression imputation using current (indata) and/or historical data (indata_hist) for the variable to impute and potentially auxiliary variables. Users may choose from twenty (20) pre-defined imputation estimator algorithms that are included in the procedure, or define their own custom algorithms.

Only fields with an FTI (Field to Impute) flag on the instatus file are imputed. Fields with FTE (Field to Exclude) or I-- (Imputed Field) flags are excluded from the imputation model. (Note that this does not include the IDE flag, which indicates deterministic imputation.)

Estimator or linear regression parameters (e.g. means or regression coefficients) can be calculated from all records or from a particular subset of acceptable records. This restriction can be applied using an exclusion parameter or by performing imputation within by-groups.
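
A sketch of the inputs, with toy current and historical tables; the column layouts are assumptions, and the estimator-specification table (inestimator), whose exact columns are documented with the procedure, is left as a hypothetical placeholder:

```python
import pandas as pd

# Toy current and historical data; REVENUE of R1 is flagged FTI.
indata = pd.DataFrame({"ident": ["R1", "R2"], "revenue": [None, 120.0]})
indata_hist = pd.DataFrame({"ident": ["R1", "R2"], "revenue": [90.0, 100.0]})
instatus = pd.DataFrame({"ident": ["R1"], "fieldid": ["REVENUE"], "status": ["FTI"]})

# Sketch of the call (requires the banff package); my_estimator_spec is a
# hypothetical, separately prepared table naming one of the pre-defined
# algorithms (or a custom one defined via inalgorithm) for each variable:
# import banff
# estimato = banff.proc.estimato(
#     unit_id="ident",
#     indata=indata,
#     indata_hist=indata_hist,
#     instatus=instatus,
#     inestimator=my_estimator_spec,
# )
# estimato.outdata
```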

property inalgorithm#
property indata#
property indata_hist#
property inestimator#
property instatus#
property instatus_hist#
property outacceptable#
property outdata#
property outest_ef#
property outest_lr#
property outest_parm#
property outrand_err#
property outstatus#

banff.proc.proc_massimpu module#

class banff.proc.proc_massimpu.ProcMassimpu(accept_negative: bool | None = None, no_by_stats: bool | None = None, random: bool | None = None, mrl: float | None = None, percent_donors: float | None = None, min_donors: int | None = None, n_limit: int | None = None, seed: int | None = None, unit_id: str | None = None, by: str | None = None, must_impute: str | None = None, must_match: str | None = None, indata: Table | DataFrame | Path | str | None = None, outdata: Path | str | None = None, outstatus: Path | str | None = None, outdonormap: Path | str | None = None, presort: bool | None = None, trace: int | bool | None = None, capture: bool | None = False, logger: Logger | None = None, **kwargs)[source]#

Bases: BanffProcedure

Performs donor imputation for a block of variables using a nearest neighbour approach or random selection.

The massimp procedure is intended for use when a large block of variables is missing for a set of respondents, typically when detailed information is collected only for a subsample (or second phase sample) of units. While the donorimp procedure uses both system and user matching fields, mass imputation only considers user matching fields to find a valid record (donor) that is most similar to the one which needs imputation (recipient).

Mass imputation considers as a recipient any record for which all the variables to impute (must_impute) are missing on indata, and as a donor any record for which none of the listed variables are missing. If matching fields (must_match) are provided by the user, the massimp procedure uses them to find the nearest donor using the same distance function as donorimp. If matching fields are not provided, a donor is selected at random.

Unlike donorimp, the massimp procedure does not use edits. Before running the procedure, users should ensure that the pool of potential donors does not include any errors, such as outliers or consistency errors.

Users may create by-groups by specifying by variables. These by-groups act as imputation classes. Use the min_donors and percent_donors parameters to ensure an appropriate number or ratio of recipients and donors exist in each imputation class before performing imputation.
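
A sketch of a minimal call using the parameters documented above; the data and column names are illustrative, and the banff invocation is shown commented:

```python
import pandas as pd

# Second-phase block of variables (x, y) is entirely missing for R1, so R1
# is a recipient; D1 and D2 have complete blocks and are potential donors.
indata = pd.DataFrame({
    "ident": ["R1", "D1", "D2"],
    "size_class": [2, 2, 3],
    "x": [None, 5.0, 50.0],
    "y": [None, 7.0, 70.0],
})

# Sketch of the call (requires the banff package); with must_match given,
# the nearest donor is used, otherwise a donor is chosen at random:
# import banff
# massimpu = banff.proc.massimpu(
#     unit_id="ident",
#     must_impute="x y",
#     must_match="size_class",
#     min_donors=1,
#     indata=indata,
# )
# massimpu.outdonormap
```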

property indata#
property outdata#
property outdonormap#
property outstatus#

banff.proc.proc_outlier module#

class banff.proc.proc_outlier.ProcOutlier(unit_id: str | None = None, weight: str | None = None, by: str | None = None, var: str | None = None, with_var: str | None = None, accept_negative: bool | None = None, no_by_stats: bool | None = None, accept_zero: bool | None = None, outlier_stats: bool | None = None, beta_e: float | None = None, beta_i: float | None = None, exponent: float | None = None, mdm: float | None = None, mei: float | None = None, mii: float | None = None, start_centile: float | None = None, min_obs: int | None = None, method: str | None = None, side: str | None = None, sigma: str | None = None, indata: Table | DataFrame | Path | str | None = None, indata_hist: Table | DataFrame | Path | str | None = None, outstatus: Path | str | None = None, outstatus_detailed: Path | str | None = None, outsummary: Path | str | None = None, presort: bool | None = None, exclude_where_indata=None, trace: int | bool | None = None, capture: bool | None = False, logger: Logger | None = None, **kwargs)[source]#

Bases: BanffProcedure

Identifies outlying observations using Hidiroglou-Berthelot or Sigma-Gap methods.

This procedure offers two methods of univariate outlier detection. The Hidiroglou-Berthelot (HB) method selects outliers based on their distance from the median, relative to the interquartile distance. The Sigma-Gap (SG) method sorts the data in ascending order and searches for significant gaps (relative to the standard deviation) between consecutive values, selecting all subsequent values as outliers. Both methods can detect two types of outliers, which are flagged on the outstatus file:

  • Values that are extreme enough to be considered errors. These values are flagged as fields to impute (FTI) so they can be imputed in a subsequent step.

  • Values that are not extreme enough to be considered errors, but are sufficiently unusual to be deemed fields to exclude (FTE) by subsequent imputation procedures such as donorimp and estimator. (This flag can also be useful during weighting and robust estimation.)

For both methods, users must specify either an imputation or exclusion threshold; no default value is provided.

Additional features of the procedure:

  • Users can run outlier detection on multiple variables (var) in one call.

  • Users can also run outlier detection on ratios of variables. In this case, only the numerators (var) are flagged on outstatus. For the denominator, users may select auxiliary variables (with_var) from the current period (indata) or from historical data (indata_hist).

  • Outlier detection can be performed to the right, left, or on both sides (side).

  • Outlier detection can be performed within by-groups (by), with a user-specified minimum number of observations (min_obs) required to perform outlier detection.
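
A sketch with toy data containing one clear outlier. The data and parameter values are illustrative assumptions, in particular the method name and the mii/mei thresholds, which correspond to the imputation and exclusion thresholds discussed above:

```python
import pandas as pd

# Ten units; R10's revenue is far from the rest and should be flagged.
indata = pd.DataFrame({
    "ident": [f"R{i}" for i in range(1, 11)],
    "revenue": [100, 102, 98, 101, 99, 103, 97, 100, 102, 5000],
})

# Sketch of the call (requires the banff package):
# import banff
# outlier = banff.proc.outlier(
#     unit_id="ident",
#     var="revenue",
#     method="current",  # assumed value selecting the HB method on current data
#     mii=4,             # imputation threshold -> FTI
#     mei=2.5,           # exclusion threshold -> FTE
#     indata=indata,
# )
# outlier.outstatus
```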

property indata#
property indata_hist#
property outstatus#
property outstatus_detailed#
property outsummary#

banff.proc.proc_prorate module#

class banff.proc.proc_prorate.ProcProrate(accept_negative: bool | None = None, no_by_stats: bool | None = None, verify_edits: bool | None = None, lower_bound: float | None = None, upper_bound: float | None = None, decimal: int | None = None, edits: str | None = None, method: str | None = None, modifier: str | None = None, unit_id: str | None = None, by: str | None = None, indata: Table | DataFrame | Path | str | None = None, instatus: Table | DataFrame | Path | str | None = None, outstatus: Path | str | None = None, outdata: Path | str | None = None, outreject: Path | str | None = None, presort: bool | None = None, prefill_by_vars: bool | None = None, trace: int | bool | None = None, capture: bool | None = False, logger: Logger | None = None, **kwargs)[source]#

Bases: BanffProcedure

Prorates and rounds records to satisfy user-specified edits.

Unlike other Banff procedures, the edits for this procedure follow specific criteria: only equalities are permitted, and the set of edits must form a hierarchical structure that sums to a grand-total. For example:

```plaintext
subtotal1 + subtotal2 = grandtotal
a + b + c = subtotal1
d + e + f = subtotal2
```

Each individual edit must consist of a set of components x(i) that sum to a total y, i.e., of the form x(1) + … + x(n) = y. Inequalities and constants are not permitted. For each individual edit equation that is not satisfied, one of the two prorating algorithms (basic or scaling) is applied in order to rake the components to match the total. The procedure takes a top-down approach, beginning with the grand-total (which is never changed) and adjusting components as necessary, until the full set of edits is satisfied. Missing values are not prorated; they are set to zero during the procedure and reset to missing afterwards. Values of zero are never altered.

Additional features:

  • Automatic rounding to the desired number of decimal places.

  • Optional bounds to constrain the relative change of values during prorating.

  • Control over which variables are eligible for prorating.

  • Option to limit prorating to original or previously imputed values, either globally or for individual variables.

  • Weights to adjust the relative change of individual variables.
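
The scaling idea behind prorating can be illustrated without banff: when one edit fails, the components are multiplied by a common factor so that they sum to the unchanged total, then rounded. This simplification ignores bounds, weights, and the procedure's own rounding scheme:

```python
# One failed edit: a + b + c = subtotal1, with subtotal1 fixed at 60.
components = {"a": 10.0, "b": 20.0, "c": 40.0}  # currently sum to 70
total = 60.0

factor = total / sum(components.values())       # common scaling factor
prorated = {k: round(v * factor, 1) for k, v in components.items()}

print(prorated)  # {'a': 8.6, 'b': 17.1, 'c': 34.3} -- sums to the fixed total
```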

get_sort_list(include_by=True, include_unit_id=True)[source]#

Call superclass implementation using custom default values.

property indata#
property instatus#
property outdata#
property outreject#
property outstatus#

banff.proc.proc_verifyed module#

class banff.proc.proc_verifyed.ProcVerifyed(accept_negative: bool | None = None, extremal: int | None = None, imply: int | None = None, edits: str | None = None, trace: int | bool | None = None, capture: bool | None = False, logger: Logger | None = None, **kwargs)[source]#

Bases: BanffProcedure

Checks the edits for consistency and redundancy.

The verifyedits procedure does not analyze statistical data or perform any SDE functions (review, selection, treatment). Instead, it is used to review a set of user-specified edits to verify consistency and identify any redundant edits, deterministic variables, or hidden equalities. Once these features are identified, the minimal set of edits is determined. Users are encouraged to review any set of proposed edits using verifyedits before calling the edit-based procedures errorloc, deterministic, donorimp, or prorate. Functions performed:

  • Consistency: the set of edits is checked for consistency, i.e., that the constraints define a non-empty feasible region.

  • Redundancy: produces a list of edits that are redundant, i.e., that can be removed without affecting the feasible region.

  • Bounds: produces implied upper and lower bounds for each variable. This also reveals any deterministic variables, i.e., variables that can only take on a single value.

  • Extremal points: generates the set of extremal points, or vertices, of the feasible region.

  • Hidden equalities: produces a list of hidden equalities not specified in the original list of edits.

  • Implied edits: generates a set of implied edits not specified in the original list of edits.

  • Minimal edits: generates a set of minimal edits required to define the feasible region generated by the original edits.

Together, the outputs of verifyedits may give the user a better sense of the feasible region defined by the original edits, before using them in other procedures. Even if the original edits are consistent, the outputs may reveal unexpected or unintended constraints that can be addressed by adding, removing, or altering the edits. Using the minimal set of edits can also increase performance in other procedures.
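
A minimal consistency check in plain Python conveys what the procedure reports for an edit group whose feasible region is empty; the commented call sketches how the same group would be passed to verifyedits, with illustrative parameter values:

```python
# The group "x >= 10; x <= 5;" defines an empty feasible region.
lower_bound, upper_bound = 10.0, 5.0
consistent = lower_bound <= upper_bound
print("consistent" if consistent else "inconsistent")  # inconsistent

# Sketch of the corresponding call (requires the banff package):
# import banff
# verifyed = banff.proc.verifyed(
#     edits="x >= 10; x <= 5;",
#     imply=10,     # request up to 10 implied edits
#     extremal=1,
# )
```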

Module contents#

class banff.proc.BanffProcedure(trace, capture, logger, input_datasets, output_datasets, presort=None, prefill_by_vars=None, exclude_where_indata=None, exclude_where_indata_hist=None, keyword_args=None)[source]#

Bases: GeneralizedProcedure

banff.proc.determin#

alias of ProcDetermin

banff.proc.donorimp#

alias of ProcDonorimp

banff.proc.editstat#

alias of ProcEditstat

banff.proc.errorloc#

alias of ProcErrorloc

banff.proc.estimato#

alias of ProcEstimato

banff.proc.get_default(key=None)#

Get default value for Procedure option.

Returns dictionary of available default arguments, or a specific default-value when key specified.

Raises KeyError if key does not exist.

banff.proc.massimpu#

alias of ProcMassimpu

banff.proc.outlier#

alias of ProcOutlier

banff.proc.prorate#

alias of ProcProrate

banff.proc.set_default(key, value)#

Set default value for some Procedure options.

Sets the value of key in the Procedure _default_args dictionary to value.

Raises KeyError if key does not exist.
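
The get/set pair behaves like a guarded dictionary. The sketch below mirrors the behaviour described above but is not the banff implementation itself, and the keys shown are illustrative:

```python
# Illustrative stand-in for the Procedure _default_args dictionary.
_default_args = {"trace": None, "capture": False}

def get_default(key=None):
    """Return all defaults, or one value when key is given (KeyError if unknown)."""
    if key is None:
        return dict(_default_args)
    return _default_args[key]

def set_default(key, value):
    """Set an existing default; raise KeyError for unknown keys."""
    if key not in _default_args:
        raise KeyError(key)
    _default_args[key] = value

set_default("capture", True)
print(get_default("capture"))  # True
```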

banff.proc.verifyed#

alias of ProcVerifyed