Metadata Files#

Overall, the Banff Processor uses 18 metadata tables, which can be classified as follows:

Tables describing the overall process flow:

  • JOBS (required): Defines the overall process flow, including the process steps to execute and sequencing

  • PROCESSCONTROLS: Optional Process Controls give more control over individual process steps

Tables defining process step parameters for built-in Banff procedures:

  • VERIFYEDITSPECS: Specifications for the verifyedits procedure

  • OUTLIERSPECS: Specifications for the outlier procedure

  • ERRORLOCSPECS: Specifications for the errorloc procedure

  • DONORSPECS: Specifications for the donorimp procedure

  • ESTIMATORSPECS: Specifications for the estimator procedure

    • ESTIMATORS: Additional parameters for inestimator table

    • ALGORITHMS: Additional parameters for inalgorithm table

  • PRORATESPECS: Specifications for the prorate procedure

  • MASSIMPUTATIONSPECS: Specifications for the massimp procedure

Tables used to define edits, used by the verifyedits,editstats,errorloc,deterministic,donorimp, and prorate procedures:

  • EDITS: List of individual edits

  • EDITGROUPS: Combinations of edits referenced by individual process steps

Parameters used by multiple procedures:

  • VARLISTS: Lists of variables and ordering

  • WEIGHTS: Weights, used by some procedures

  • EXPRESSIONS: Exclusion expressions used by some procedures

Process step parameters for User Defined Procedures:

  • USERVARS: Specifications for user-defined procedures (plugins)

Data management:

  • PROCESSOUTPUTS: Specify which outputs should be saved from each procedure

General notes#

  • ID metadata fields such as jobid or controlid can be a maximum of 100 characters

  • Variable names on your input datasets can be to up 64 characters long.

  • Case sensitivity: Consistency must be maintained across metadata with respect to casing. This includes matching fields such as specID and controlID as well as variable names.

  • Variable names need to respect the casing of the column names on the applicable input. In certain contexts ‘income’ and ‘INCOME’ may be considered different variables on a table but in others they are not. For example, the expressions metadata is case insensitive. For this reason it is recommended to not re-use column names simply with different casing and to respect casing across data and metadata files.

Jobs#

Define your statistical data editing process flows.

Column

Type

Primary Key

Foreign Key

Required

Description

jobid

string

The value of the jobid column is used to extract from the table all the rows that belong to the same E&I strategy or Process Block.

seqno

float

The order of execution is defined by the seqno column which is read from the lowest value to the largest. The Banff processor will run even if there are gaps in the seqno values. For example, a value of SEQNO=10 can be followed in the next row with a value of 20.

controlid

string

Points to an ID defined in the ProcessControls metadata table.

process

string

The value of the process column identifies the name of the Banff procedure or plug-in to be executed.

specid

string

The specid column points to an ID in another metadata tables depending on the value of the PROCESS column.

editgroupid

string

The editgroupid column points to an entry in the EditGroups metadata table

byid

string

The byid column points to the VarLists metadata table. It can be used by all procedures except VerifyEdits and the Job process. The variable list indicated will be used for by-group processing

acceptnegative

string

The acceptnegative column is used to indicate whether negative values are considered valid values or not by the process. If not specified (left blank) or if the value is N, the accept negative option will not be specified, otherwise (if the value of the column is Y), the accept negative option will be set to True.

Process Controls#

Define specifications for Process Controls.

Column

Type

Primary Key

Foreign Key

Required

Description

controlid

string

ID used to identify the control.

targetfile

string

The name of the dataset file to apply the control to. This field is NOT required if parameter=EDIT_GROUP_FILTER.

parameter

string

The type of process control to apply.

value

string

The query (in SQL-lite syntax) to apply for this control. This field is not required and must be empty if parameter=EDIT_GROUP_FILTER.

Edits#

Define linear consistency edits to be used by the Banff procedures.

Column

Type

Primary Key

Foreign Key

Required

Description

editid

string

ID used to identify the edit.

leftside

string

The left-hand side of the equation.

operator

string

The operator, valid values include >, >=, <, <=, =, and !=.

rightside

string

The right-hand side of the equation.

modifier

string

The modifier value. The default value of PASS indicates that the equation must be True, FAIL can be specified to indicate that the equation must be false.

ACCEPTE and REJET are also valid values and can be used instead of PASS and FAIL respectively.

Edit groups#

Define sets of edits to be referenced by individual process steps.

Column

Type

Primary Key

Foreign Key

Required

Description

editgroupid

string

ID used to identify the edit group.

editid

string

ID used to identify the edit belonging to the group.

Expressions#

SQL expressions that are used to exclude donors from being processed in the DonorImp procedure or records to be used in calculations in the Estimator procedure.

Column

Type

Primary Key

Foreign Key

Required

Description

expressionid

string

The ID used to identify the expression.

expressions

string

The SQL expression.

Uservars#

Define names and values of substitution variables in user-defined processes (plugins).

Column

Type

Primary Key

Foreign Key

Required

Description

process

string

The name of the User Defined Process that this uservar applies to. This should be the same name (or one of the names) that is used when calling factory.register() in the register() function required for a UDP module.

specid

string

The specid of the specific job step that this uservar applies to.

var

string

The name of this variable.

value

string

The value of the variable. For numeric values, convert the given string value to the desired format within the UDP using this variable.

Varlists#

Lists names of variables that are used in procedure statements that require a list of variable names. All variable names in the fieldid column that have the same value of varlistid will be collected and used with the statement for which the list is created. The variable seqno defines the order in which the variables will appear in the list. seqno is mandatory and is important especially for the BY statement.

Column

Type

Primary Key

Foreign Key

Required

Description

varlistid

string

ID used to identify the varlist.

seqno

float

Defines the ordering of this fieldid within the overall varlist from the lowest value to the largest. Gaps in the sequence are allowed, the numbers must simply form some logical ordering.

fieldid

string

ID used to identify this specific field within the varlist.

Weights#

Contains names of fields grouped by weightid and the value of the weight for each field.

Column

Type

Primary Key

Foreign Key

Required

Description

weightid

string

ID used to identify this weighting.

fieldid

string

Identifies the field to which this weighting applies.

weight

float

The weight value for the field.

Process Outputs#

Define which outputs are to be kept from each process type. The metadata is used when process_output_type is set to custom.

Column

Type

Primary Key

Foreign Key

Required

Description

process

string

The name of the process for which to save this output file.

output_name

string

The name of the dataset to create and retain upon completion of the specified process and then to save to disk upon completion of the overall job.

Built-in Banff Procedure Specific Metadata#

The following metadata tables are used to store input parameters for built-in Banff Procedures. Each metadata table includes the column specid, which links the procedure-specific parameters listed in each row to the associated process step in the jobs metadata table.

  • VERIFYEDITSPECS: Specifications for the verifyedits procedure

  • OUTLIERSPECS: Specifications for the outlier procedure

  • ERRORLOCSPECS: Specifications for the errorloc procedure

  • DONORSPECS: Specifications for the donorimp procedure

  • ESTIMATORSPECS: Specifications for the estimator procedure

    • ESTIMATORS: Additional parameters for inestimator table

    • ALGORITHMS: Additional parameters for inalgorithm table

  • PRORATESPECS: Specifications for the prorate procedure

  • MASSIMPUTATIONSPECS: Specifications for the massimp procedure

For information on parameters for built-in Banff Procedures, please consult the Banff User guide in the Banff procedures repository.

Tips for programmers#

XML Schemas are available for each metadata file in the sub-package banffprocessor.metadata.models. These schemas can be useful for debugging purposes or for application developers who need to generate these files without using the Excel metadata template. For example, the schema for the Algorithms.xml can be printed with the following 2 lines for code:

from banffprocessor.metadata.models.algorithms import Algorithms

print(Algorithms.get_schema())

A duckdb database is used to store the metadata during processing and can be exported at the end of the process for debugging or informational purposes. The database connection can be accessed through the dbconn attribute of your Processor object as illustrated in the code snippet below:

my_bp = Processor.from_file(input_file)
my_bp.dbconn.execute("EXPORT DATABASE '<path to folder to save exported files>'")
---
title: Banff Processor Metadata Entity-Relationship Diagram
---
erDiagram
    ALGORITHMS {
        string algorithmname PK
        string type
        string status
        string formula
        string description
    }
    DONORSPECS {
        string specid PK
        int mindonors
        real pcentdonors
        int n
        string eligdon
        boolean random
        int nlimit
        real mrl
        string dataexclvar FK
        string mustmatchid FK
        string posteditgroupid FK
    }
    EDITGROUPS {
        string editgroupid PK
        string editid PK
    }
    EDITS {
        string editid PK
        string leftside
        string operator
        string rightside
        string modifier
        string edit
    }
    ERRORLOCSPECS {
        string specid PK
        real cardinality
        real timeperobs
        string weightid
    }
    ESTIMATORS {
        string estimatorid PK
        int seqno PK
        string fieldid
        string auxvariables
        string weightvariable
        string variancevariable
        real varianceexponent
        string varianceperiod
        boolean excludeimputed
        boolean excludeoutliers
        int countcriteria
        real percentcriteria
        boolean randomerror
        string algorithmname FK    
    }
    ESTIMATORSPECS {
        string specid  PK
        string dataexclvar FK
        string histexclvar FK
        string Estimatorid FK
    }
    EXPRESSIONS {
        string expressionid PK
        string expressions
    }
    MASSIMPUTATIONSPECS {
        string specid
        int mindonors
        real pcentdonors
        boolean random
        int nlimit
        real mrl
        string mustimputeid
        string mustmatchid   
    }
    OUTLIERSPECS {
        string specid PK
        string method
        real mei
        real mii
        real mdm
        real exponent
        int minobs
        string varid
        string withid
        string dataexclvar FK
        real betae
        real betai
        string weight FK
        string sigma
        string side
        real startcentile
        boolean acceptzero
    }
    PRORATESPECS {
        string specid PK
        int decimal
        real lowerbound
        real upperbound
        string modifier
        string method
    }
    USERVARS {
        string process PK
        string specid PK
        string var PK
        string value
    }
    VARLISTS {
        string varlistid PK
        real seqno PK
        string fieldid
    }
    VERIFYEDITSPECS {
        string specid PK
        int imply
        int extremal
    }
    WEIGHTS {
        string weightid PK
        string fieldid  PK
        real weight        
    } 
    PROCESSCONTROLS {
        string controlid PK
        string targetfile PK
        string parameter PK
        string value
    }  
    PROCESSOUTPUTS {
        string process PK
        string output_name PK
    }
    JOBS {
        string jobid PK
        real seqno PK
        string controlid FK
        string process FK
        string specid FK
        string editgroupid FK
        string byid FK
        boolean acceptnegative
    }
    
    DONORSPECS ||--}o JOBS : "Defines process specs"
    ERRORLOCSPECS ||--}o JOBS : "Defines specs"
    ESTIMATORSPECS ||--}o JOBS : "Defines specs"
    OUTLIERSPECS ||--}o JOBS : "Defines specs"
    PRORATESPECS ||--}o JOBS : "Defines specs"
    MASSIMPUTATIONSPECS||--}o JOBS : "Defines specs"
    VERIFYEDITSPECS ||--}o JOBS : "Defines specs"
    USERVARS ||--}o JOBS : "Defines custom parameters for plugins"

    EDITGROUPS ||--}o JOBS : "Defines edit group"
    EDITGROUPS ||--}o DONORSPECS : "Defines post edit group"
    EDITS ||--}o EDITGROUPS : "Defines edit groups"
    
    ALGORITHMS ||--}o ESTIMATORS : "Defines custom algorithm"
    ESTIMATORS ||--}o ESTIMATORSPECS : "Defines estimator"
    EXPRESSIONS ||--}o ESTIMATORSPECS : "Defines dataexclvar and histexclvar"
    EXPRESSIONS ||--}o DONORSPECS : "defines dataexclvar"

    VARLISTS ||--}o JOBS : "Defines by variables"
    VARLISTS ||--}o DONORSPECS : "defines mustmatchid"
    VARLISTS ||--}o OUTLIERSPECS : "defines varid and withid"
    VARLISTS ||--}o MASSIMPUTATIONSPECS : "Defines mustimputeid"
    
    WEIGHTS ||--}o ERRORLOCSPECS : "Defines variable weight"
    PROCESSCONTROLS ||--}o JOBS : "Defines Process Controls"
    PROCESSOUTPUTS ||--}o JOBS : "Defines datasets to retain"