banff.io_util package#

Module contents#

banff.io_util.DF_from_sas_file(file_path)[source]#

Load SAS dataset file into Pandas DataFrame.

class banff.io_util.GensysInputDataset(name, input_dataset, mandatory=True)[source]#

Bases: StcInputTable

Wrapper for procedure input datasets.

Please also read super-class (StcTable) documentation.

free_c_args()[source]#
init_c_args()[source]#

Low-level preparation of C input dataset arguments.

Given the packed input dataset, cast it to a datatype suitable for C to consume.

For Apache Arrow “C Array Stream” datasets, we create a pointer to the nanoarrow-created structure and pass it as a typeless “void pointer”, instead of implementing complex logic for passing the structure directly. If the dataset is None, a “null pointer” is sent.
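
A minimal sketch of this pattern, assuming the caller already holds the integer address of a nanoarrow ArrowArrayStream structure (the stream_addr name and helper are hypothetical, not part of the module):

    import ctypes

    def as_void_pointer(stream_addr):
        """Cast an ArrowArrayStream address to a typeless void pointer.

        stream_addr: integer address of the nanoarrow-created structure,
        or None when no input dataset was supplied (hypothetical argument).
        """
        if stream_addr is None:
            return ctypes.c_void_p(None)  # "null pointer": dataset not provided
        return ctypes.c_void_p(stream_addr)  # typeless "void pointer" for C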

validate_user_spec()[source]#

Validate the user specification.

class banff.io_util.GensysOutputDataset(name, output_specification, mandatory=True, requested_by_default=True)[source]#

Bases: StcOutputTable

Wrapper for procedure output datasets.

INTENDED USE:
  • user_output will be populated based on user_spec with either
    • the requested object format of ds_intermediate
    • the path to which ds_intermediate was written

MEMBERS:

  • c_output - C-code-generated output dataset

  • requested_by_default - whether or not to produce the dataset when unspecified by the user

  • user_output - output dataset in the user-requested format (or user_spec when written to file)

Please also read super-class (StcTable) documentation.

extract_c_output()[source]#
free_c_args()[source]#

Free C output dataset arguments (NOT THEIR CONTENTS).

init_c_args()[source]#

Low-level preparation of C output dataset arguments.

Some output datasets are optional. If a dataset is not requested, C code must receive NULL (C); this is accomplished by passing None (Python). Otherwise, Python passes a string pointer (ctypes.c_char_p()) to C code “by reference” (ctypes.byref()).
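
A minimal sketch of this convention (the helper name and its return shape are assumptions, not the module's API):

    import ctypes

    def make_output_arg(requested):
        """Prepare one C output-dataset argument (hypothetical helper)."""
        if not requested:
            return None, None  # C receives NULL for unrequested outputs
        out = ctypes.c_char_p()  # C code fills in this string pointer
        return out, ctypes.byref(out)  # passed "by reference" so C can write it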

banff.io_util.PAT_from_sas_file(file_path)[source]#

Load SAS dataset file into pyarrow Table.

banff.io_util.SAS_file_to_feather_file(file_path, destination)[source]#

Read SAS file and write to feather file.

banff.io_util.SAS_file_to_parquet_file(file_path, destination)[source]#

Read SAS file and write to parquet file.

banff.io_util.add_status_by_vars(ds_stat, ds_data, unit_id=None, by=None)[source]#

Add BY variables to input status dataset.

Filters ds_stat down to unit_id, ‘FIELDID’, and ‘STATUS’, then adds BY variables from ds_data, matching on unit_id in a safe, case-insensitive manner.

  • ds_stat - instatus or instatus_hist dataset, pandas.DataFrame

  • ds_data - indata or indata_hist dataset, pandas.DataFrame

  • unit_id - name of unit_id variable, str

  • by - list of 0+ variable names, str

pandas support via pandas.DataFrame.join()

Research indicates this may be more efficient than pandas.merge(). The join always happens on the “index”, so .set_index() is used to specify which column to join the dataframes on.

  • calls .reset_index() on the merged dataset to restore that column

pyarrow support via pyarrow.Table.join()

No special handling is required.

Case Insensitivity (safely)

Merging/joining pandas dataframes is a case-sensitive operation. Procedures are case-insensitive with respect to variable (column) names provided in:

  • datasets

  • edit strings

  • parameters

In this function, case insensitivity is implemented without modifying the case of either dataset; instead, dataset-case-specific metadata is generated (see match_case() for details).
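
An illustrative sketch of the join described above; the example data and the inline unit-id matching step are assumptions, not the library's implementation:

    import pandas as pd

    ds_stat = pd.DataFrame({"ident": ["A", "B"], "FIELDID": ["X", "X"], "STATUS": ["FTI", "FTE"]})
    ds_data = pd.DataFrame({"IDENT": ["A", "B"], "prov": [10, 24]})

    # resolve ds_data's spelling of the unit-id column without renaming anything
    data_id = next(c for c in ds_data.columns if c.lower() == "ident")

    merged = (
        ds_stat.set_index("ident")                   # join happens on the index
        .join(ds_data.set_index(data_id)[["prov"]])  # add BY variable(s)
        .reset_index()                               # restore the unit-id column
    )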

banff.io_util.arrow_to_pandas(pa_t)#

Convert PyArrow Table to Pandas DataFrame.

The pyarrow option split_blocks=True is recommended for minimizing memory usage.

See https://arrow.apache.org/docs/python/pandas.html#reducing-memory-use-in-table-to-pandas

Uses types_mapper to ensure pandas uses the string dtype, not object, for string data.
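
A sketch of these options together (the example table is an assumption):

    import pandas as pd
    import pyarrow as pa

    tbl = pa.table({"name": ["a", None], "x": [1, 2]})
    df = tbl.to_pandas(
        split_blocks=True,  # minimize peak memory during conversion
        types_mapper={pa.string(): pd.StringDtype()}.get,  # string dtype, not object
    )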

banff.io_util.c_argtype_input_dataset()[source]#

Return the type of argument C code uses for input datasets.

banff.io_util.c_argtype_output_dataset()[source]#

Return the type of argument C code uses for output datasets.

banff.io_util.c_argtype_parameters()[source]#

Return the type of argument C code uses for parameters.

banff.io_util.c_return_type()[source]#

Return the type of value C code returns.

banff.io_util.dest_is_file(destination)[source]#

Whether or not the requested destination appears to be a file path.

banff.io_util.dest_is_object(destination)[source]#

Whether or not the requested destination is an object.

banff.io_util.flag_rows_where(ds_in, where_stmt, new_col_name='_exclude', flag_value='E')[source]#

Add a new ‘flag’ string column with a value flagging certain records.

Add a new string column, new_col_name, to an intermediate dataset. For records matching where_stmt, set the value to flag_value. Unflagged records will have a null (missing) value.

Returns the modified dataset and new column name (may differ from new_col_name).

If new_col_name exists

A case-insensitive search for new_col_name is performed. If any matches are found, a random numeric suffix is added to the new column name. The new random name isn’t validated.

Use Case: add ‘exclude’ flag to indata or indata_hist datasets
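
A sketch of the flagging behaviour for a pandas intermediate (the example data and expression are assumptions):

    import pandas as pd

    df = pd.DataFrame({"ident": ["A", "B", "C"], "x": [1, 5, 9]})
    mask = df.eval("x > 4")  # where_stmt-style expression
    df["_exclude"] = pd.Series(pd.NA, index=df.index, dtype="string")  # null for unflagged rows
    df.loc[mask, "_exclude"] = "E"  # matching records get flag_value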

banff.io_util.get_default_output_spec()[source]#

Return current default output dataset specification.

banff.io_util.handle_arrow_string_data(ds, dest_type=None)[source]#

Return new Table with all string data converted to one datatype.

Casts pyarrow.Table ds string columns (string and large_string) to dest_type. By default dest_type is pa.large_string().
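
One way to express this cast (a sketch, not the library's code):

    import pyarrow as pa

    tbl = pa.table({"s": pa.array(["a", "b"], type=pa.string()), "n": [1, 2]})
    fields = [
        f.with_type(pa.large_string()) if f.type in (pa.string(), pa.large_string()) else f
        for f in tbl.schema
    ]
    tbl = tbl.cast(pa.schema(fields))  # all string columns are now large_string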

banff.io_util.handle_pandas_string_data(df)[source]#

Force string columns to use nullable string datatype.

Pandas 2.x often has issues loading “missing” string values. In some cases the column’s type is lost and assumed to be numeric. To avoid this, the code forcibly converts the column to a string type so that it is identified as an Arrow string when received by C code.

Optimize: the memory efficiency of this method has not yet been assessed. Furthermore, SAS itself provided a single space character (' ') to C when a character value was missing. If this causes issues, we could convert all missing character values to a single space, or have C emulate that behaviour for missing char values.
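
A sketch of the conversion described above (the example frame is an assumption):

    import pandas as pd

    df = pd.DataFrame({"s": ["a", None], "n": [1.0, 2.0]})
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].astype("string")  # nullable string dtype survives missing values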

banff.io_util.interm_to_DF(dataset)[source]#

Convert intermediate dataset to pandas dataframe.

banff.io_util.interm_to_PAT(dataset)[source]#

Convert intermediate dataset to pyarrow Table.

banff.io_util.load_input_dataset(dataset, log)[source]#

Inspect and (if applicable) load user specified input dataset into intermediate format.

This function takes a user-provided input dataset object or file path and returns that dataset in an “intermediate” data format.

The format of the user-provided input dataset determines how it is loaded. When a file path is provided, the data is loaded with interm_from_input_file(). A dataset which is already in an intermediate format is returned immediately.

Exceptions:

  • TypeError - input dataset’s type is not supported

  • FileNotFoundError - input dataset appears to be a non-existent file path

  • ValueError - input dataset is an empty string

banff.io_util.pack_dataset(dataset, log)[source]#

Format dataset for consumption by C code.

banff.io_util.pack_parms(parms)[source]#

Serialize parameters to JSON, encode as UTF-8, and return the result.
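
The step amounts to the following sketch (the helper name is hypothetical):

    import json

    def pack_parms_sketch(parms):
        """Serialize parameters to UTF-8-encoded JSON bytes."""
        return json.dumps(parms).encode("utf-8")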

banff.io_util.pandas_to_arrow(pd_dataframe)#

Convert Pandas DataFrame to PyArrow Table.

preserve_index=False
  • do not include the pandas “index” as an additional column in the pyarrow table (sketched below)
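
A sketch of the conversion call (the example frame is an assumption):

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"x": [1, 2]})
    tbl = pa.Table.from_pandas(df, preserve_index=False)  # drop the pandas index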

banff.io_util.remove_rows_where(ds_in, where_stmt)[source]#

Remove (delete/drop) certain records.

Remove rows that match where_stmt.
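
For a pandas intermediate, the behaviour can be sketched as follows (example data assumed):

    import pandas as pd

    df = pd.DataFrame({"x": [1, 5, 9]})
    df = df[~df.eval("x > 4")]  # drop records matching where_stmt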

banff.io_util.set_default_output_spec(new_spec)[source]#

Set the default output dataset specification to a supported object type identifier.

banff.io_util.sort_dataset(dataset, by=None, case_sensitive=False, inplace=False)[source]#

Sort dataset in ascending order by the variables in the by list.

When case_sensitive=False (default), the dataset’s column names are normalized prior to sorting and restored after.

If no variables are passed, the original dataset is returned. If dataset format is not supported, a TypeError is raised.
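
A sketch of the case-insensitive handling for a pandas dataset (example data and the matching step are assumptions):

    import pandas as pd

    df = pd.DataFrame({"Prov": [24, 10], "ident": ["B", "A"]})
    by = ["prov"]  # user-specified casing may differ from the dataset's
    actual = [next(c for c in df.columns if c.lower() == b.lower()) for b in by]
    df = df.sort_values(by=actual, ascending=True)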

banff.io_util.unpack_output_dataset(dataset)[source]#

Load C generated output dataset to intermediate format.

banff.io_util.write_output_dataset(dataset, destination, log, default_format=None)[source]#

Output dataset (dataset) according to user specification (destination).

Handles conversion from the intermediate format to the user-specified output format. This function inspects the type of the destination argument and calls the appropriate conversion function to facilitate further destination inspection or file conversion. Returns the dataset object in the output format, or the path to the written file.

dataset
  • must be a valid dataset in one of the “intermediate” formats

  • validation of dataset only occurs in the called conversion functions

destination is a user-provided value which determines various output settings.
  • If None, a Pandas DataFrame is returned

  • If a valid path to a supported file type is provided, the dataset is written to that file and the path is returned

default_format allows the caller to set a custom default output format.

If certain custom exceptions occur during output, an attempt will be made to return a DataFrame of the output dataset.

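A hypothetical sketch of the destination dispatch (the function name and the feather-only file handling are assumptions, not the module's implementation):

    from pathlib import Path

    import pyarrow as pa
    import pyarrow.feather as feather

    def write_output_sketch(tbl, destination):
        if destination is None:
            return tbl.to_pandas()  # default: pandas DataFrame
        path = Path(destination)
        if path.suffix == ".feather":
            feather.write_feather(tbl, str(path))  # write the output file...
            return str(path)  # ...and return the path it was written to
        raise TypeError(f"unsupported destination: {destination!r}")

    df = write_output_sketch(pa.table({"x": [1, 2]}), None)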