Outlier Detection#
Execution: banff.outlier()
SDE function type: Review, Selection
Input status flags: None
Output status flags: FTI, FTE
Description#
Identifies outlying observations using Hidiroglou-Berthelot or Sigma-Gap methods.
This procedure offers two methods of univariate outlier detection. The Hidiroglou-Berthelot (HB) method selects outliers based on their distance from the median, relative to the interquartile distance. The Sigma-Gap (SG) method sorts the data in ascending order and searches for significant gaps (relative to the standard deviation) between consecutive values, selecting all subsequent values as outliers. Both methods can detect two types of outliers, which are flagged on the outstatus file:
- Values that are extreme enough to be considered errors. These values are flagged as fields to impute (FTI) so they can be imputed in a subsequent step.
- Values that are not extreme enough to be considered errors, but are sufficiently unusual to be deemed fields to exclude (FTE) by subsequent imputation procedures such as donorimp and estimator. (This flag can also be useful during weighting and robust estimation.)
For both methods, users must specify at least one threshold, for imputation or for exclusion; no default value is provided.
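The HB idea described above (distance from the median, measured relative to the interquartile distances, with separate imputation and exclusion thresholds) can be illustrated with a small standalone sketch. This is not the Banff implementation: the function name hb_flags is hypothetical, the quartile definition (inclusive method) is an assumption, and the way mdm is used as a floor on the quartile distances is a simplification. The Functional Description remains the reference for the exact formulas.

```python
from statistics import quantiles


def hb_flags(values, mii=None, mei=None, mdm=0.05):
    """Return 'FTI', 'FTE' or None per value (simplified HB-style rule)."""
    if mii is None and mei is None:
        raise ValueError("specify at least one of mii or mei")

    # Quartiles and median; the 'inclusive' definition is an assumption.
    q1, med, q3 = quantiles(values, n=4, method="inclusive")

    # Distances from the median to each quartile, floored at |mdm * median|
    # so the intervals do not collapse when values cluster at the median.
    d_lo = max(med - q1, abs(mdm * med))
    d_hi = max(q3 - med, abs(mdm * med))

    def outside(x, multiplier):
        return x < med - multiplier * d_lo or x > med + multiplier * d_hi

    flags = []
    for x in values:
        if mii is not None and outside(x, mii):
            flags.append("FTI")   # extreme enough to impute
        elif mei is not None and outside(x, mei):
            flags.append("FTE")   # unusual enough to exclude
        else:
            flags.append(None)
    return flags


data = [10, 11, 12, 13, 14, 15, 16, 30, 100]
print(hb_flags(data, mii=10, mei=3))
```

With these illustrative thresholds, 30 falls outside the exclusion interval but inside the imputation interval (FTE), while 100 falls outside both (FTI).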
Additional features of the procedure:
- Users can run outlier detection on multiple variables (var) in one call.
- Users can also run outlier detection on ratios of variables. In this case, only the numerators (var) are flagged on outstatus. For the denominator, users may select auxiliary variables (with_var) from the current period (indata) or from historical data (indata_hist).
- Outlier detection can be performed to the right, left, or on both sides (side).
- Outlier detection can be performed within by-groups (by), with a user-specified minimum number of observations (min_obs) required to perform outlier detection.
For a full mathematical description of the procedure methods, with examples, please see the Functional Description.
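As a rough illustration of the Sigma-Gap search described in this section, the sketch below handles the right side only: it sorts the values, looks for the first gap between consecutive values that exceeds a multiple of the deviation, and flags every value from that point on. It is a simplified sketch under stated assumptions, not Banff's implementation: the function name sigma_gap_right is hypothetical, the sample standard deviation stands in for the deviation (the MAD option is not shown), and the mapping from start_centile to a starting position is an assumption.

```python
from statistics import stdev


def sigma_gap_right(values, beta_e=None, beta_i=None, start_centile=0.0):
    """Return (value, flag) pairs in ascending order; flag is 'FTI', 'FTE' or None."""
    if beta_e is None and beta_i is None:
        raise ValueError("specify at least one of beta_e or beta_i")

    ordered = sorted(values)
    sigma = stdev(ordered)  # assumption: sample standard deviation
    # Assumption: the centile is mapped to a position by simple truncation.
    start = int(len(ordered) * start_centile / 100)

    def first_gap(beta):
        """Position of the first value that follows a gap larger than beta * sigma."""
        if beta is None:
            return len(ordered)
        for k in range(max(start, 1), len(ordered)):
            if ordered[k] - ordered[k - 1] > beta * sigma:
                return k
        return len(ordered)

    k_e = first_gap(beta_e)
    k_i = first_gap(beta_i)

    result = []
    for k, x in enumerate(ordered):
        if k >= k_i:
            result.append((x, "FTI"))
        elif k >= k_e:
            result.append((x, "FTE"))
        else:
            result.append((x, None))
    return result


print(sigma_gap_right([1, 2, 3, 4, 5, 6, 20, 60], beta_e=0.5, beta_i=1.5))
```

In this example the gap before 20 exceeds the exclusion threshold and the gap before 60 exceeds the imputation threshold, so 20 is flagged FTE and 60 is flagged FTI.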
Input and output tables#
Descriptions of input and output tables are given below. Banff supports a number of input and output formats; please see the Banff User Guide for more information.
Input Table | Description |
---|---|
indata | Input statistical data. Mandatory. |
indata_hist | Input historical data. |

Output Table | Description |
---|---|
outstatus | Contains the status of the fields (FTE/FTI) identified as outliers and their values. |
outstatus_detailed | Detailed status for the outliers (ODER/ODEL/ODIR/ODIL). |
outsummary | Outlier summary information such as observation counts and acceptance interval bounds. |
For details on the content of output tables, please see the Output Tables document.
Parameters#
Parameter | Python type | Description |
---|---|---|
unit_id | str | Identifies the key variable (unit identifier) on indata and indata_hist. Mandatory. |
method | str | Method to be used to detect outlying observations ('CURRENT', 'RATIO', 'HISTORIC' or 'SIGMAGAP'). Mandatory. |
var | str | Variable(s) for which to find outliers. |
with_var | str | Historical or auxiliary variables. |
weight | str | Variable to be used for weighting. |
exclude_where_indata | str | Expression in SQL syntax to exclude observations from the outlier detection. |
mii | float | HB multiplier for imputation interval (positive). |
mei | float | HB multiplier for exclusion interval (positive). |
mdm | float | HB minimum distance multiplier (positive). Default=0.05. |
exponent | float | HB exponent for a ratio or historical trend (between 0 and 1). Default=0. |
min_obs | int | Minimum number of observations that must exist in the input table or in a by-group (positive). Default=3 for HB, 5 for SG. |
side | str | Side ('LEFT', 'RIGHT', or 'BOTH') of the ordered data to be used for detecting outliers. Default='BOTH'. |
start_centile | float | SG centile to be used to determine the starting point (between 0 and 100). Default=75 for side='BOTH', 0 otherwise. |
beta_i | float | SG multiplier for imputation interval (non-negative). |
beta_e | float | SG multiplier for exclusion interval (non-negative). |
sigma | str | SG type of deviation ('MAD' or 'STD') to be calculated. Default='MAD'. |
outlier_stats | bool | Add more information to the outstatus_detailed output table, including imputation and exclusion interval bounds. Default=False. |
accept_zero | bool | Treat zero values as valid. Default=False in the presence of historical or auxiliary variables, True otherwise. |
accept_negative | bool | Treat negative values as valid. Default=False. |
by | str | Variable(s) used to partition indata into by-groups for independent processing. |
presort | bool | Sorts input tables before processing, according to procedure requirements. Default=True. |
no_by_stats | bool | Reduces log output by suppressing by-group specific messages. Default=False. |
Notes#
Imputation and exclusion thresholds#
The identification of outliers, either for imputation or exclusion, requires user-specified thresholds. There are no default or suggested values; these depend entirely on the user's criteria for what is considered extreme. At least one threshold must be specified for each method: mii or mei for the HB method, and beta_i or beta_e for the SG method.
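The threshold rule above can be expressed as a small pre-flight check. The helper below is hypothetical and not part of Banff (the procedure performs its own validation); it only mirrors the rule as stated here, assuming the three HB-based method values take mii/mei and 'SIGMAGAP' takes beta_i/beta_e.

```python
HB_METHODS = ("CURRENT", "RATIO", "HISTORIC")


def check_thresholds(method, mii=None, mei=None, beta_i=None, beta_e=None):
    """Raise ValueError unless at least one threshold for the method is given."""
    if method in HB_METHODS:
        names, values = ("mii", "mei"), (mii, mei)
    elif method == "SIGMAGAP":
        names, values = ("beta_i", "beta_e"), (beta_i, beta_e)
    else:
        raise ValueError(f"unknown method: {method!r}")

    if all(v is None for v in values):
        raise ValueError(
            f"method {method!r} needs at least one of {names[0]} or {names[1]}"
        )
    return True


check_thresholds("CURRENT", mei=3.0)        # passes: one HB threshold given
check_thresholds("SIGMAGAP", beta_i=2.0)    # passes: one SG threshold given
```

Passing only HB thresholds with method 'SIGMAGAP' (or the reverse) raises ValueError in this sketch, since the thresholds supplied do not belong to the chosen method.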
Specifying outlier detection methods#
The outlier procedure offers a variety of ways to specify outlier detection for both single variables and ratios, with and without historical data. These depend on the combination of the method, var, and with_var parameters, and whether or not indata_hist is provided:
To do this: | method | with_var specified? | indata_hist provided? |
---|---|---|---|
Apply HB method to current data | 'CURRENT' | No | No |
Apply HB method to ratio of current data | 'RATIO' | Yes | No |
Apply HB method to historical trend | 'HISTORIC' | Yes | Yes |
Apply SG method to current data | 'SIGMAGAP' | No | No |
Apply SG method to ratio of current data | 'SIGMAGAP' | Yes | No |
Apply SG method to historical trend | 'SIGMAGAP' | Yes | Yes |
The outlier detection method is always applied to the variables listed in var. (If var is not specified, outlier detection is applied to all numerical variables on indata except those listed in the by parameter.) To apply outlier detection to a ratio of variables, list the numerators in var and the denominators in with_var; the procedure runs through the ordered pairs one by one. (Note that the same variable can be listed multiple times in both the var and with_var lists.)
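The ordered pairing of numerators and denominators can be pictured as a positional zip of the two lists. The sketch below is illustrative only: the helper ratio_pairs and the variable names are hypothetical, Python lists are used purely for illustration regardless of how Banff accepts variable lists, and requiring the two lists to have equal length is an assumption.

```python
def ratio_pairs(var, with_var):
    """Pair each numerator with the denominator at the same position."""
    if len(var) != len(with_var):
        # Assumption: one denominator is expected per numerator.
        raise ValueError("var and with_var must have the same length")
    return list(zip(var, with_var))


# The same variable may appear more than once in either list.
numerators = ["sales", "sales", "expenses"]
denominators = ["employees", "assets", "employees"]

for num, den in ratio_pairs(numerators, denominators):
    print(f"{num} / {den}")
```

Only the numerators would be flagged on outstatus, as noted in the description of the procedure.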
If indata_hist is provided, the procedure uses current data (from indata) in the numerator and historical data (from indata_hist) in the denominator. If with_var is not specified, the procedure uses the same variable in both the numerator and denominator.
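The choice of numerator and denominator for a historical trend can be sketched as follows. This only illustrates which values are paired: records are matched on the unit identifier, and the denominator variable falls back to var when with_var is not given. The helper historical_ratios and the sample data are hypothetical, skipping unmatched units and zero denominators is an assumption, and any further transformation that the HB or SG methods apply to these ratios is not shown.

```python
def historical_ratios(current, historical, var, with_var=None):
    """Return {unit_id: current[var] / historical[denominator variable]}.

    current and historical map a unit identifier to a dict of variable values.
    """
    denom_var = with_var if with_var is not None else var
    ratios = {}
    for unit_id, row in current.items():
        hist_row = historical.get(unit_id)
        if hist_row is None:
            continue  # assumption: units without a historical record are skipped
        denom = hist_row[denom_var]
        if denom == 0:
            continue  # assumption: zero denominators are skipped
        ratios[unit_id] = row[var] / denom
    return ratios


current = {"A": {"sales": 150.0}, "B": {"sales": 50.0}, "C": {"sales": 80.0}}
historical = {"A": {"sales": 100.0}, "B": {"sales": 100.0}}

print(historical_ratios(current, historical, var="sales"))
```

Here unit C has no historical record, so only units A and B produce a ratio.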