Outlier Detection

  • Execution: banff.outlier()

  • SDE function type: Review, Selection

  • Input status flags: None

  • Output status flags: FTI, FTE

Description

Identifies outlying observations using Hidiroglou-Berthelot or Sigma-Gap methods.

This procedure offers two methods of univariate outlier detection. The Hidiroglou-Berthelot (HB) method selects outliers based on their distance from the median, relative to the interquartile distance. The Sigma-Gap (SG) method sorts the data in ascending order and searches for significant gaps (relative to the standard deviation) between consecutive values, selecting all subsequent values as outliers. Both methods can detect two types of outliers, which are flagged on the outstatus file:

  • Values that are extreme enough to be considered errors. These values are flagged as fields to impute (FTI) so they can be imputed in a subsequent step.

  • Values that are not extreme enough to be considered errors, but are sufficiently unusual to be deemed fields to exclude (FTE) by subsequent imputation procedures such as donorimp and estimator. (This flag can also be useful during weighting and robust estimation.)

For both methods, users must specify at least one threshold (imputation or exclusion); no default values are provided.

Additional features of the procedure:

  • Users can run outlier detection on multiple variables (var) in one call.

  • Users can also run outlier detection on ratios of variables. In this case, only the numerators (var) are flagged on outstatus. For the denominator, users may select auxiliary variables (with_var) from the current period (indata) or from historical data (indata_hist).

  • Outlier detection can be performed to the right, left, or on both sides (side).

  • Outlier detection can be performed within by-groups (by), with a user-specified minimum number of observations (min_obs) required to perform outlier detection.

For a full mathematical description of the procedure methods, with examples, please see the Functional Description.
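The Sigma-Gap idea described above can be illustrated with a short, self-contained sketch. This is not Banff's implementation: it handles the right side only, uses the sample standard deviation as the deviation measure, and starts scanning from a caller-supplied index rather than a centile. The function name `sigma_gap_right` and its arguments are hypothetical; the Functional Description remains the reference for the exact rules.

```python
import statistics


def sigma_gap_right(values, beta, start_index=0):
    """Return the values selected as right-side outliers (simplified sketch).

    Assumptions: a gap is "significant" when it exceeds beta * sigma, and
    sigma is the sample standard deviation of all values.
    """
    ordered = sorted(values)
    sigma = statistics.stdev(ordered)
    for i in range(max(start_index, 0) + 1, len(ordered)):
        gap = ordered[i] - ordered[i - 1]
        if gap > beta * sigma:
            # The first significant gap marks the cut: this value and
            # every larger value are selected as outliers.
            return ordered[i:]
    return []


data = [10, 11, 12, 13, 14, 15, 60, 65]
flagged = sigma_gap_right(data, beta=1.5)
print(flagged)
```

In this sketch, running the function once with a smaller multiplier (exclusion) and once with a larger one (imputation) would give the two nested sets of outliers that the procedure reports as FTE and FTI.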

Input and output tables

Descriptions of input and output tables are given below. Banff supports a number of input and output formats; please see the Banff User Guide for more information.

Input Table

Description

indata

Input statistical data. Mandatory.

indata_hist

Input historical data.

Records on indata are linked with records on indata_hist by the unit_id variable. Records that appear on indata_hist but not on indata are dropped before processing, as are records with missing values for unit_id.

Output Table

Description

outstatus

Contains the status of the fields (FTE/FTI) identified as outliers and their values.

outstatus_detailed

Detailed status for the outliers (ODER/ODEL/ODIR/ODIL).

Detailed status indicates whether the outlier falls outside the exclusion interval on the right (ODER) or on the left (ODEL). If it concerns an FTI outlier, the detailed status distinguishes between an outlier falling outside the imputation interval on the right (ODIR) and on the left (ODIL).

When outlier_stats=True, this table contains additional information (imputation and exclusion bounds, current and auxiliary values, etc.).

outsummary

Outlier summary information such as observation counts and acceptance interval bounds.

For details on the content of output tables, please see the Output Tables document.
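The relationship between the detailed status codes and the two intervals can be sketched as a small classifier. The function `detailed_status` and the mapping `STATUS_FLAG` are hypothetical helpers written for illustration; whether interval bounds are treated as inside or outside is an assumption here (bounds are treated as inside).

```python
# Assumed mapping from detailed status to the flag written on outstatus.
STATUS_FLAG = {"ODIL": "FTI", "ODIR": "FTI", "ODEL": "FTE", "ODER": "FTE"}


def detailed_status(value, exclusion=None, imputation=None):
    """Return ODIL/ODIR/ODEL/ODER, or None when the value is not an outlier.

    exclusion and imputation are (lower, upper) tuples; either may be None
    when the corresponding threshold was not specified.
    """
    # Check the (wider) imputation interval first: an FTI outlier takes
    # precedence over an FTE outlier.
    if imputation is not None:
        if value < imputation[0]:
            return "ODIL"
        if value > imputation[1]:
            return "ODIR"
    if exclusion is not None:
        if value < exclusion[0]:
            return "ODEL"
        if value > exclusion[1]:
            return "ODER"
    return None


for v in (-5, 5, 50, 95, 120):
    code = detailed_status(v, exclusion=(10, 90), imputation=(0, 100))
    print(v, code, STATUS_FLAG.get(code))
```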

Parameters

Parameter

Python type

Description

unit_id

str

Identifies the key variable (unit identifier) on indata and indata_hist. Mandatory.

Must be unique for each record. Records with a missing value are dropped before processing.

method

str

Method to be used to detect outlying observations (‘CURRENT’, ‘RATIO’, ‘HISTORIC’ or ‘SIGMAGAP’). Mandatory.

The SG method is applied when method='SIGMAGAP'; otherwise the HB method is used. Please see the notes below for details.

var

str

Variable(s) for which to find outliers.

var becomes mandatory when historical or auxiliary variables are used. If they are not used, var can be omitted, in which case all numeric variables in indata (except by-group variables) will be processed.

Example: var = "Revenue Expenses"

with_var

str

Historical or auxiliary variables.

The number of variables in the with_var list must be the same as the var list. If the table indata_hist is used, then with_var variables are read from it, otherwise they are read from indata. If var and with_var are identical (i.e. each variable in var has a historical variable with the same name on indata_hist), then with_var can be omitted.

weight

str

Variable to be used for weighting.

The weight is at the record level and must contain a value for each record. weight will be multiplied by the values of the variables for which one wants to detect outliers. Any record(s) in the input table with a missing weight will be dropped from the outlier detection.

exclude_where_indata

str

Expression in SQL syntax to exclude observations from the outlier detection.

mii

float

HB multiplier for imputation interval (positive).

mii controls the width of the imputation interval. A higher multiplier value for the imputation interval will lead to a lower number of detected outliers to impute. mii becomes mandatory for HB if mei is not specified.

mei

float

HB multiplier for exclusion interval (positive).

mei controls the width of the exclusion interval. A higher multiplier value for the exclusion interval will lead to a lower number of detected outliers to exclude. mei becomes mandatory for HB if mii is not specified.

mdm

float

HB minimum distance multiplier (positive). Default=0.05.

mdm refers to the minimum interquartile distance required to calculate intervals.

exponent

float

HB exponent for a ratio or historical trend (between 0 and 1). Default=0.

min_obs

int

Minimum number of observations that must exist in the input table or in a by-group (positive). Default=3 for HB, 5 for SG.

min_obs >= 3 for HB; min_obs >= 5 for SG.

A minimum of 10 observations per by-group is recommended; outlier detection results for by-groups with fewer than 10 observations should be used with caution.

side

str

Side (‘LEFT’, ‘RIGHT’, or ‘BOTH’) of the ordered data to be used for detecting outliers. Default=‘BOTH’.

start_centile

float

SG centile to be used to determine the starting point (between 0 and 100). Default=75 for side=‘BOTH’, 0 otherwise.

The centile must be greater than or equal to 0 and less than 100 when side='LEFT' or side='RIGHT'. The centile must be greater than or equal to 50 and less than 100 when side='BOTH'.

beta_i

float

SG multiplier for imputation interval (non-negative).

0<beta_e<beta_i. beta_i becomes mandatory for SG if beta_e is not specified.

beta_e

float

SG multiplier for exclusion interval (non-negative).

0<beta_e<beta_i. beta_e becomes mandatory for SG if beta_i is not specified.

sigma

str

SG type of deviation (‘MAD’ or ‘STD’) to be calculated. Default=‘MAD’.

MAD: median absolute deviation; STD: classical standard deviation.

outlier_stats

bool

Adds more information to the outstatus_detailed output table, including imputation and exclusion interval bounds. Default=False.

accept_zero

bool

Treat zero values as valid. Default=False in the presence of historical or auxiliary variables, True otherwise.

accept_negative

bool

Treat negative values as valid. Default=False.

By default, a positivity edit is added for every variable in the list of edits; this parameter permits users to remove this restriction. If required, users may directly add positivity edits for individual variables.

by

str

Variable(s) used to partition indata into by-groups for independent processing.

Outlier detection is performed on each by-group separately.

Example: by = "province industry"

presort

bool

Sorts input tables before processing, according to procedure requirements. Default=True.

no_by_stats

bool

Reduces log output by suppressing by-group specific messages. Default=False.
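Some of the constraints in the table above (mandatory parameters, at least one threshold per method, and the lower limit on min_obs) can be checked before a run. The sketch below does this for a plain dictionary of parameters. The helper `check_outlier_params` is hypothetical and not part of Banff, and the unit identifier name `ident` is invented for the example.

```python
def check_outlier_params(params):
    """Return a list of problems found in a dict of outlier parameters."""
    problems = []
    for name in ("unit_id", "method"):
        if not params.get(name):
            problems.append(f"{name} is mandatory")
    # SG uses beta_i/beta_e and needs at least 5 observations;
    # the HB methods use mii/mei and need at least 3.
    if params.get("method") == "SIGMAGAP":
        thresholds, min_allowed = ("beta_i", "beta_e"), 5
    else:
        thresholds, min_allowed = ("mii", "mei"), 3
    if all(params.get(t) is None for t in thresholds):
        problems.append(
            f"specify at least one of {thresholds[0]} or {thresholds[1]}"
        )
    if params.get("min_obs", min_allowed) < min_allowed:
        problems.append(f"min_obs must be >= {min_allowed}")
    return problems


hb_params = {
    "unit_id": "ident",  # hypothetical identifier column
    "method": "CURRENT",
    "var": "Revenue Expenses",
    "mei": 3.0,
    "mii": 5.0,
    "side": "BOTH",
    "by": "province industry",
}
sg_params = {"unit_id": "ident", "method": "SIGMAGAP", "min_obs": 4}

# With Banff installed, a dict like hb_params could presumably be passed
# as keyword arguments to banff.outlier() together with indata; that call
# style is an assumption and is not exercised here.
print(check_outlier_params(hb_params))
print(check_outlier_params(sg_params))
```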

Notes

Imputation and exclusion thresholds

The identification of outliers, either for imputation or exclusion, requires user-specified thresholds. There are no default or suggested values; these depend entirely on the user’s criteria for what is considered extreme. At least one threshold must be specified for each method: mii or mei for the HB method, and beta_i or beta_e for the SG method.
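To make the role of the two HB thresholds concrete, the sketch below builds an exclusion and an imputation interval around the median. The interval form used here (median plus or minus the multiplier times the distance to the neighbouring quartile, with that distance floored at |mdm × median|) is an assumed simplification for unweighted current data, and the quartile convention comes from Python's `statistics.quantiles`, which may differ from Banff's. The helper `hb_interval` is hypothetical; the Functional Description gives the exact definitions.

```python
import statistics


def hb_interval(values, multiplier, mdm=0.05):
    """Return (lower, upper) bounds around the median (simplified sketch)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # mdm guards against quartile distances that are too small.
    floor = abs(mdm * q2)
    lower = q2 - multiplier * max(q2 - q1, floor)
    upper = q2 + multiplier * max(q3 - q2, floor)
    return lower, upper


data = [10, 12, 13, 14, 15, 16, 18, 30, 55]
exclusion = hb_interval(data, multiplier=3)   # plays the role of mei
imputation = hb_interval(data, multiplier=6)  # plays the role of mii

# Outside the imputation interval: candidates for FTI.
fti = [v for v in data if not imputation[0] <= v <= imputation[1]]
# Outside the exclusion interval but inside the imputation one: FTE.
fte = [
    v for v in data
    if not exclusion[0] <= v <= exclusion[1] and v not in fti
]
print(exclusion, imputation, fte, fti)
```

A larger multiplier widens the interval, so fewer values fall outside it; this is why the imputation multiplier is normally chosen larger than the exclusion multiplier.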

Specifying outlier detection methods

The outlier method offers a variety of ways to specify outlier detection for both single variables and ratios, with and without historical data. These depend on the combination of the method, var, and with_var parameters, and whether or not indata_hist is provided:

| To do this | method parameter | with_var parameter | indata_hist provided |
| --- | --- | --- | --- |
| Apply HB method to current data | "CURRENT" | No | No |
| Apply HB method to ratio of current data | "RATIO" | Yes | No |
| Apply HB method to historical trend | "HISTORICAL" | Yes | Yes |
| Apply SG method to current data | "SIGMAGAP" | No | No |
| Apply SG method to ratio of current data | "SIGMAGAP" | Yes | No |
| Apply SG method to historical trend | "SIGMAGAP" | Yes | Yes |

The outlier detection method is always applied to the variables listed in var. (If not specified, outlier detection will be applied to all numerical variables on indata except those that are listed in the by parameter.) To apply outlier detection to a ratio of variables, specify the list of numerators by var and denominators by with_var; the procedure will run through the ordered pairs one-by-one. (Note that it is possible to list the same variables multiple times in both the var and with_var lists.)

If indata_hist is provided, the procedure will use current data (from indata) in the numerator and historical data (from indata_hist) in the denominator. If with_var is not specified, the procedure will use the same variable in both the numerator and denominator.
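The pairing rules described above can be summarized in a few lines. The sketch below resolves, for each numerator in var, which denominator is used and which table it is read from. The helper `resolve_pairs` is hypothetical, and the variable name `Employees` is invented for the example.

```python
def resolve_pairs(var, with_var=None, has_hist=False):
    """Return (numerator, denominator, denominator_table) tuples.

    var and with_var are space-separated lists, as in the parameter table.
    """
    numerators = var.split()
    if with_var is None:
        if not has_hist:
            # No auxiliary or historical data: plain current values.
            return [(name, None, None) for name in numerators]
        # Historical data with with_var omitted: same names on indata_hist.
        denominators = numerators
    else:
        denominators = with_var.split()
    if len(denominators) != len(numerators):
        raise ValueError("var and with_var must list the same number of variables")
    table = "indata_hist" if has_hist else "indata"
    return [(n, d, table) for n, d in zip(numerators, denominators)]


# Ratio of current data: the same denominator may be listed more than once.
print(resolve_pairs("Revenue Expenses", "Employees Employees"))
# Historical trend with with_var omitted.
print(resolve_pairs("Revenue Expenses", has_hist=True))
```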