Estimator Imputation#
Execution: banff.estimator()
SDE function type: Treatment
Input status flags: FTI (required), FTE (optional), I– (optional)
Output status flags: I– (exact code depends on specified algorithm)
Description#
Performs imputation using estimation functions and/or linear regression estimators.
The estimator procedure offers imputation methods such as mean, ratio and regression imputation using current (indata) and/or historical (indata_hist) data for the variable to impute and, potentially, auxiliary variables. Users may choose from twenty (20) pre-defined imputation estimator algorithms that are included in the procedure, or define their own custom algorithms.
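As a rough illustration of what a ratio-type estimator does, the sketch below fills a missing value with the auxiliary value multiplied by the ratio of the two means computed on complete records. It is a generic textbook version written in plain Python, not the procedure's implementation; the record layout and the names `ratio_impute`, `revenue` and `employees` are assumptions made for this example.

```python
from statistics import mean

def ratio_impute(records, target, aux):
    """Fill missing `target` values with aux * (mean(target) / mean(aux))."""
    # Acceptable records here: both the target and the auxiliary value are present.
    acceptable = [r for r in records if r[target] is not None and r[aux] is not None]
    ratio = mean(r[target] for r in acceptable) / mean(r[aux] for r in acceptable)
    imputed = []
    for r in records:
        row = dict(r)  # copy so the input records are left untouched
        if row[target] is None and row[aux] is not None:
            row[target] = row[aux] * ratio
        imputed.append(row)
    return imputed

# Hypothetical data: unit C is missing its revenue.
records = [
    {"ident": "A", "revenue": 100.0, "employees": 10.0},
    {"ident": "B", "revenue": 300.0, "employees": 30.0},
    {"ident": "C", "revenue": None, "employees": 20.0},
]
result = ratio_impute(records, target="revenue", aux="employees")
```

The means are 200 and 20, so the ratio is 10 and unit C receives a revenue of 200.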
Only fields with an FTI (Field to Impute) flag in the instatus file are imputed. Fields with FTE (Field to Exclude) or I– (Imputed Field) flags are excluded from the imputation model. (Note that this does not include the flag IDE, which indicates deterministic imputation.)
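The flag rules above can be sketched as two small filters over status rows. This is only an illustration of the rule as stated: the row layout (`ident`, `fieldid`, `status`) and the imputation code `IXY` are hypothetical, and the test "starts with I but is not IDE" is an assumed reading of the I– notation.

```python
def fields_to_impute(status_rows):
    """Return the (unit, field) pairs flagged FTI, i.e. the ones to impute."""
    return {(r["ident"], r["fieldid"]) for r in status_rows if r["status"] == "FTI"}

def excluded_from_model(status_rows):
    """Return the (unit, field) pairs flagged FTE or I-- (but not IDE)."""
    excluded = set()
    for r in status_rows:
        status = r["status"]
        # Assumption: an imputed-field flag is any code starting with "I",
        # except IDE (deterministic imputation), which is not excluded.
        is_imputed = status.startswith("I") and status != "IDE"
        if status == "FTE" or is_imputed:
            excluded.add((r["ident"], r["fieldid"]))
    return excluded

# Hypothetical status rows; "IXY" stands in for some I-- imputation code.
status_rows = [
    {"ident": "A", "fieldid": "revenue", "status": "FTI"},
    {"ident": "B", "fieldid": "revenue", "status": "FTE"},
    {"ident": "C", "fieldid": "revenue", "status": "IXY"},
    {"ident": "D", "fieldid": "revenue", "status": "IDE"},
]
to_impute = fields_to_impute(status_rows)
excluded = excluded_from_model(status_rows)
```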
Estimator or linear regression parameters (e.g. means or regression coefficients) can be calculated on all records or on a particular subset of acceptable records. The restriction of the acceptable records can be applied using an exclusion parameter or by specifying by-groups imputation.
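To make the idea of acceptable records concrete, the following sketch computes a mean per by-group while skipping records marked for exclusion. It is a simplified stand-in, not the procedure's logic: the `excluded` key plays the role of an exclusion variable and `province` the role of a by-variable, and both names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def group_means(records, target, by, exclude=lambda r: False):
    """Mean of `target` per by-group, using only acceptable records."""
    groups = defaultdict(list)
    for r in records:
        # A record is acceptable if it has a value and is not excluded.
        if r[target] is not None and not exclude(r):
            groups[r[by]].append(r[target])
    return {group: mean(values) for group, values in groups.items()}

# Hypothetical data: the third ON record is excluded from parameter estimation.
records = [
    {"province": "ON", "revenue": 10.0, "excluded": False},
    {"province": "ON", "revenue": 20.0, "excluded": False},
    {"province": "ON", "revenue": 1000.0, "excluded": True},
    {"province": "QC", "revenue": 30.0, "excluded": False},
    {"province": "QC", "revenue": 50.0, "excluded": False},
]
means = group_means(records, "revenue", by="province", exclude=lambda r: r["excluded"])
```

With the large ON value excluded, the ON mean is 15 rather than about 343, which shows how much the choice of acceptable records can move an estimator parameter.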
For a full mathematical description of the procedure methods, with examples, please see the Functional Description.
Input and output tables#
Descriptions of input and output tables are given below. Banff supports a number of input and output formats; please see the Banff User Guide for more information.
| Input Table | Description |
|---|---|
| indata | Input statistical data. Mandatory. |
| instatus | Input status file containing FTI, FTE and I– status flags. Mandatory. |
| inestimator | Estimator specifications table. Mandatory. |
| inalgorithm | User defined algorithms table. |
| indata_hist | Input historical data. |
| instatus_hist | Input historical status file containing FTI, FTE and I– status flags. |
| Output Table | Description |
|---|---|
| outdata | Output statistical table containing imputed data. |
| outstatus | Output status file identifying imputed fields with I– status flags, and their values after imputation. |
| outacceptable | Report on acceptable observations retained to calculate parameters for each estimator. |
| outest_ef | Report on calculation of averages for estimator functions. |
| outest_lr | Report on calculation of "beta" coefficients for linear regression estimators (type LR). |
| outest_parm | Report on imputation statistics by estimator. |
| outrand_err | Random error report when a random error is added to the imputed variable. |
For details on the content of output tables, please see the Output Tables document.
Parameters#
| Parameter | Python type | Description |
|---|---|---|
| unit_id | str | Identify key variable (unit identifier) on indata and indata_hist. Mandatory. |
| data_excl_var | str | Variable of the input table used to exclude observations from the set of acceptable observations. |
| hist_excl_var | str | Variable of the historical input table used to exclude historical observations from the set of acceptable observations. |
| exclude_where_indata | str | Exclusion expression using SQL syntax to specify which observations to exclude from the set of acceptable observations. |
| exclude_where_indata_hist | str | Exclusion expression using SQL syntax to specify which historical observations to exclude from the set of acceptable observations. |
| seed | float | Specify the root for the random number generator. |
| verify_specs | bool | Estimator specifications verified without running the imputation. |
| accept_negative | bool | Treat negative values as valid. Default=False. |
| by | str | Variable(s) used to partition indata into by-groups for independent processing. |
| prefill_by_vars | bool | Add by-group variable(s) to input status file(s) to improve performance. Default=True. |
| presort | bool | Sort input tables before processing, according to procedure requirements. Default=True. |
| no_by_stats | bool | Reduce log output by suppressing by-group specific messages. Default=False. |
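One way to keep a call tidy is to collect the parameters in a dictionary and check the names against the table above before passing them on. The sketch below does only that; the values (`ident`, `province`, the SQL expression) are hypothetical, and the call shape shown in the final comment is an assumption that should be checked against the Banff User Guide.

```python
# Parameter names documented in the table above.
DOCUMENTED_PARAMETERS = {
    "unit_id", "data_excl_var", "hist_excl_var", "exclude_where_indata",
    "exclude_where_indata_hist", "seed", "verify_specs", "accept_negative",
    "by", "prefill_by_vars", "presort", "no_by_stats",
}

# Hypothetical values for illustration only.
estimator_params = {
    "unit_id": "ident",
    "by": "province",
    "exclude_where_indata": "revenue > 1000000",
    "accept_negative": False,
    "verify_specs": True,  # check the specifications without imputing
}

# Catch misspelled parameter names early.
unknown = set(estimator_params) - DOCUMENTED_PARAMETERS

# Assumed call shape, not verified here:
# banff.estimator(indata=..., instatus=..., inestimator=..., **estimator_params)
```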
Notes#
The inalgorithm and inestimator tables are used to define the models (mean imputation, linear regression, etc.) used for imputation and to specify certain parameters and options. While the inestimator table is mandatory, the inalgorithm table is only required when using a custom algorithm instead of one of the 20 pre-defined algorithms available in the procedure.
Inestimator#
The inestimator table needs to be prepared before running the procedure. It is used to specify the algorithm (i.e., model) to use for imputation, the variables to impute, and other parameters. The specified algorithm can be either pre-defined or user-defined in the inalgorithm table. Multiple algorithms can be specified in this table, and they are processed in the order in which they appear. Note that the same variable to impute can be specified for multiple algorithms; in this case, the first algorithm is applied, and if it fails to impute a value requiring imputation, the next algorithm is applied, and so on.
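The ordered fallback described above can be pictured as a loop over the specification rows for one variable. The sketch is a simplification under stated assumptions: each algorithm is modelled as a Python callable that returns `None` when it cannot produce a value, and the algorithm names `HYP_RATIO` and `HYP_MEAN` are hypothetical, not names of pre-defined algorithms.

```python
def impute_in_order(record, fieldid, specs, algorithms):
    """Return (value, algorithmname) from the first spec that yields a value."""
    for spec in specs:  # rows are tried in the order they appear in the table
        if spec["fieldid"] != fieldid:
            continue
        value = algorithms[spec["algorithmname"]](record)
        if value is not None:
            return value, spec["algorithmname"]
    return None, None  # no algorithm could impute this field

# Hypothetical algorithms: the ratio needs an auxiliary value, the mean does not.
algorithms = {
    "HYP_RATIO": lambda r: r["employees"] * 10.0 if r["employees"] is not None else None,
    "HYP_MEAN": lambda r: 200.0,
}
specs = [
    {"fieldid": "revenue", "algorithmname": "HYP_RATIO"},
    {"fieldid": "revenue", "algorithmname": "HYP_MEAN"},
]

with_aux = impute_in_order({"employees": 5.0}, "revenue", specs, algorithms)
without_aux = impute_in_order({"employees": None}, "revenue", specs, algorithms)
```

The first record is imputed by the ratio algorithm; the second has no auxiliary value, so it falls through to the mean.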
The following describes the variables that must appear in the inestimator table. All columns are mandatory.
| Column | Description |
|---|---|
| fieldid | Name of the variable to be imputed. |
| algorithmname | Name of the algorithm used to impute the variable. |
| auxvariables | Comma separated list of auxiliary variable names on indata or indata_hist. |
| weightvariable | Name of the weight variable. |
| countcriteria | A positive integer indicating the minimum number of acceptable observations needed in the current by-group. |
| percentcriteria | Minimum percentage of acceptable observations needed in the current by-group (between 0 and 100). |
| variancevariable | Name of the variance variable. |
| varianceperiod | Period of the variance (‘C’ for current or ‘H’ for historical). |
| varianceexponent | A number indicating the power of the variance. |
| excludeimputed | ‘Y’ (yes) or ‘N’ (no) to indicate whether previously imputed values should be excluded from the set of acceptable observations. |
| excludeoutliers | ‘Y’ (yes) or ‘N’ (no) to indicate whether observations with an FTE status should be excluded from the set of acceptable observations. |
| randomerror | ‘Y’ (yes) or ‘N’ (no) to indicate whether a random error is added to the imputed variable. |
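Before running the procedure it can help to sanity-check each inestimator row against the column descriptions above. The helper below is a minimal sketch, not part of Banff. It assumes that a column may be present but empty (`None`), and that the percentage bounds are inclusive; neither point is confirmed by this section. The algorithm name `HYP_RATIO` is hypothetical.

```python
INESTIMATOR_COLUMNS = [
    "fieldid", "algorithmname", "auxvariables", "weightvariable",
    "countcriteria", "percentcriteria", "variancevariable", "varianceperiod",
    "varianceexponent", "excludeimputed", "excludeoutliers", "randomerror",
]

def check_inestimator_row(row):
    """Return a list of problems found in one specification row."""
    problems = []
    missing = [c for c in INESTIMATOR_COLUMNS if c not in row]
    if missing:
        problems.append(f"missing columns: {missing}")
    count = row.get("countcriteria")
    if count is not None and not (isinstance(count, int) and count > 0):
        problems.append("countcriteria must be a positive integer")
    percent = row.get("percentcriteria")
    if percent is not None and not (0 <= percent <= 100):  # bounds assumed inclusive
        problems.append("percentcriteria must be between 0 and 100")
    if row.get("varianceperiod") not in (None, "C", "H"):
        problems.append("varianceperiod must be 'C' or 'H'")
    for flag in ("excludeimputed", "excludeoutliers", "randomerror"):
        if row.get(flag) not in (None, "Y", "N"):
            problems.append(f"{flag} must be 'Y' or 'N'")
    return problems

# Start from a row with every column present and empty, then fill in values.
good_row = dict.fromkeys(INESTIMATOR_COLUMNS)
good_row.update(fieldid="revenue", algorithmname="HYP_RATIO", auxvariables="employees",
                countcriteria=5, percentcriteria=10, excludeimputed="Y",
                excludeoutliers="Y", randomerror="N")
bad_row = dict(good_row, countcriteria=0, varianceperiod="X")

good_problems = check_inestimator_row(good_row)
bad_problems = check_inestimator_row(bad_row)
```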
Inalgorithm#
In addition to the 20 pre-defined algorithms in the procedure, users can define custom algorithms in the inalgorithm table. These user-defined algorithms are of two types:
Estimator function (EF): mathematical expression involving constants, current and/or historical values of some variables of the record, and current and/or historical averages of some variables, those averages being calculated from acceptable records. The mathematical expressions may include parentheses and the arithmetic operators addition (+), subtraction (-), multiplication (*), division (/), and exponentiation (^).
Linear regression (LR): Regression imputation consists of imputing a variable $y_i$ by a linear regression estimate of the form
$$
\hat{y_i} = \hat{\beta_0} + \hat{\beta_1} x_{i_1 T_1}^{p_1} + \hat{\beta_2} x_{i_2 T_2}^{p_2} + \dots + \hat{\beta_m} x_{i_m T_m}^{p_m} + \hat{\epsilon_i}
$$
where the $T_j$ refer to current or historical periods and the $p_j$ are exponents. The variable $y_i$ being imputed is the dependent variable in the model, and the auxiliary variables $x_{ij}$ are the independent variables, or regressors. The $\hat\beta_j$ are the regression coefficients, whose values are obtained by the method of least squares. $\hat\epsilon_i$ is a random error term, which can be added to the model to introduce some variability into the fitted values of the $y_i$. Note that $\beta_0$, the intercept of the regression line, is optional and can be omitted from the model.
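For the simplest case of a single regressor, the least-squares fit behind the linear regression type can be sketched in a few lines. This is an unweighted, textbook version for illustration only: it leaves out the random error term, the weight and variance options, and the function names `fit_line` and `predict` are assumptions of this example rather than anything provided by the procedure.

```python
def fit_line(xs, ys, exponent=1.0, intercept=True):
    """Least-squares fit of y = b0 + b1 * x**exponent on acceptable records."""
    zs = [x ** exponent for x in xs]  # regressor raised to its exponent p
    if intercept:
        z_bar = sum(zs) / len(zs)
        y_bar = sum(ys) / len(ys)
        s_zz = sum((z - z_bar) ** 2 for z in zs)
        s_zy = sum((z - z_bar) * (y - y_bar) for z, y in zip(zs, ys))
        b1 = s_zy / s_zz
        b0 = y_bar - b1 * z_bar
    else:
        # Model without intercept: the line is forced through the origin.
        b0 = 0.0
        b1 = sum(z * y for z, y in zip(zs, ys)) / sum(z * z for z in zs)
    return b0, b1

def predict(x, b0, b1, exponent=1.0):
    """Imputed value for a record whose regressor value is x."""
    return b0 + b1 * x ** exponent

# Hypothetical acceptable records lying exactly on y = 3 + 2x.
b0, b1 = fit_line([1.0, 2.0, 3.0], [5.0, 7.0, 9.0])
imputed = predict(4.0, b0, b1)

# Same idea without the optional intercept, on data lying on y = 2x.
_, slope_only = fit_line([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], intercept=False)
```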
The following table describes the variables that must appear in the inalgorithm table. All columns but description are mandatory.
| Column | Description |
|---|---|
| algorithmname | Name of the algorithm. |
| type | Type of the algorithm: ‘EF’ for estimator function and ‘LR’ for linear regression. |
| status | 1 to 3 character string that will be inserted in the outstatus table (after adding the prefix “I”) when a variable is estimated by this algorithm. |
| formula | The algorithm formula. |
| description | Text to describe the algorithm. |
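As a small illustration of how the status column relates to the flags written to outstatus, the sketch below checks an inalgorithm row and builds the resulting code by adding the "I" prefix. The row values are hypothetical, and the formula is left as a placeholder because its syntax is not covered in this section.

```python
def outstatus_code(algorithm_row):
    """Return the status flag written when this algorithm imputes a value."""
    status = algorithm_row["status"]
    if not 1 <= len(status) <= 3:
        raise ValueError("status must be a 1 to 3 character string")
    if algorithm_row["type"] not in ("EF", "LR"):
        raise ValueError("type must be 'EF' or 'LR'")
    return "I" + status  # prefix "I" marks the field as imputed

# Hypothetical user-defined algorithm; the formula text is a placeholder.
custom_algorithm = {
    "algorithmname": "HYP_RATIO",
    "type": "EF",
    "status": "HR",
    "formula": "<formula as defined in the Functional Description>",
    "description": "Hypothetical ratio estimator used for illustration.",
}
code = outstatus_code(custom_algorithm)
```

Here the status `HR` yields the flag `IHR`, which is the kind of code the I– notation stands for throughout this page.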