banff.io_util package#
Module contents#
- class banff.io_util.GensysInputDataset(name, input_dataset, mandatory=True)[source]#
Bases: StcInputTable
Wrapper for procedure input datasets.
Please also read the super-class (StcTable) documentation.
- init_c_args()[source]#
Low-level preparation of C input dataset arguments.
Given the packed input dataset, cast it to a datatype suitable for C to consume.
For the Apache Arrow “C Array Stream”, we create a pointer to the nanoarrow-created structure and pass it as a typeless “void pointer”, rather than implementing complex logic to pass the structure directly. If the dataset is None, a “null pointer” is sent.
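A minimal sketch of this pattern; it uses pyarrow’s PyCapsule stream export in place of nanoarrow, so the details are illustrative rather than the module’s actual implementation:

```python
import ctypes

import pyarrow as pa

# Illustrative sketch: export a Table as an ArrowArrayStream and hand C a
# typeless void pointer (the real code builds the structure via nanoarrow).
table = pa.table({"x": [1, 2, 3]})

capsule = table.__arrow_c_stream__()  # PyCapsule wrapping an ArrowArrayStream
get_ptr = ctypes.pythonapi.PyCapsule_GetPointer
get_ptr.restype = ctypes.c_void_p
get_ptr.argtypes = [ctypes.py_object, ctypes.c_char_p]

stream_ptr = get_ptr(capsule, b"arrow_array_stream")
c_arg = ctypes.c_void_p(stream_ptr)  # dataset provided: typeless void pointer
c_null = ctypes.c_void_p(None)       # dataset is None: null pointer
```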
- class banff.io_util.GensysOutputDataset(name, output_specification, mandatory=True, requested_by_default=True)[source]#
Bases: StcOutputTable
Wrapper for procedure output datasets.
- INTENDED USE:
user_output will be populated based on user_spec with either:
- the requested object format of ds_intermediate, or
- the path to which ds_intermediate was written
- MEMBERS:
- c_output - C code generated output dataset
- requested_by_default - whether or not to produce the dataset when not specified by the user
- user_output - output dataset in the user-requested format (or user_spec when written to file)
Please also read the super-class (StcTable) documentation.
- init_c_args()[source]#
Low-level preparation of C output dataset arguments.
Some output datasets are optional. If a dataset is not requested, the C code must receive NULL (C); this is accomplished by passing None (Python). Otherwise, Python passes a string pointer (ctypes.c_char_p()) to the C code “by reference” (ctypes.byref()).
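A minimal sketch of the two cases, assuming the C function accepts a pointer-to-string output argument:

```python
import ctypes

# Illustrative sketch: an optional output dataset is passed to C as either
# NULL (not requested) or a char* passed by reference (requested).
requested = True

if requested:
    out_buffer = ctypes.c_char_p()    # C code populates this pointer
    c_arg = ctypes.byref(out_buffer)  # passed "by reference"
else:
    c_arg = None                      # ctypes converts None to NULL (C)
```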
- banff.io_util.SAS_file_to_feather_file(file_path, destination)[source]#
Read SAS file and write to feather file.
- banff.io_util.SAS_file_to_parquet_file(file_path, destination)[source]#
Read SAS file and write to parquet file.
- banff.io_util.add_status_by_vars(ds_stat, ds_data, unit_id=None, by=None)[source]#
Add BY variables to input status dataset.
Filters ds_stat down to unit_id, ‘FIELDID’, and ‘STATUS’, then adds BY variables from ds_data, matching on unit_id in a safe, case-insensitive manner.
- ds_stat - instatus or instatus_hist dataset, pandas.DataFrame
- ds_data - indata or indata_hist dataset, pandas.DataFrame
- unit_id - name of unit_id, str
- by - list of 0+ variable names, str
- pandas support via pandas.DataFrame.join()
Research indicates this may be more efficient than pandas.merge(). A join always happens on the “index”, so .set_index() is used to specify which column to join the dataframes on. .reset_index() is called on the merged dataset to restore that column.
- pyarrow support via pyarrow.Table.join()
Nothing to really note here.
- Case Insensitivity (safely)
Merging/joining pandas dataframes is a case sensitive operation. Procedures are case insensitive with respect to variable (column) names provided in:
datasets
edit strings
parameters
In this function, case insensitivity is implemented without modifying the case of either dataset; instead, dataset-case-specific metadata is generated. See match_case() for details and the sketch below.
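A minimal sketch of the index-based, case-insensitive join described above (column names and data are illustrative):

```python
import pandas as pd

ds_stat = pd.DataFrame({"ident": ["A", "B"], "FIELDID": ["V1", "V1"], "STATUS": ["FTI", "FTE"]})
ds_data = pd.DataFrame({"IDENT": ["A", "B"], "prov": [10, 24], "V1": [1.0, 2.0]})
unit_id, by = "ident", ["prov"]

# Resolve ds_data's spelling of unit_id without modifying either dataset.
data_unit_id = next(c for c in ds_data.columns if c.lower() == unit_id.lower())

merged = (
    ds_stat.set_index(unit_id)                    # join happens on the index
    .join(ds_data.set_index(data_unit_id)[by])
    .reset_index()                                # restore the unit_id column
)
```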
- banff.io_util.arrow_to_pandas(pa_t)#
Convert PyArrow Table to Pandas DataFrame.
- The pyarrow option split_blocks=True is recommended for minimizing memory usage.
see https://arrow.apache.org/docs/python/pandas.html#reducing-memory-use-in-table-to-pandas
Uses types_mapper to ensure pandas uses the string dtype, not object, for string data.
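A minimal sketch of the conversion (the exact types_mapper used is an assumption here):

```python
import pandas as pd
import pyarrow as pa

def arrow_to_pandas(pa_t: pa.Table) -> pd.DataFrame:
    # Map arrow string types to pandas' nullable string dtype, not object.
    dtype_map = {pa.string(): pd.StringDtype(), pa.large_string(): pd.StringDtype()}
    return pa_t.to_pandas(
        split_blocks=True,  # minimize memory use during conversion
        types_mapper=dtype_map.get,
    )
```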
- banff.io_util.c_argtype_input_dataset()[source]#
Return the type of argument C code uses for input datasets.
- banff.io_util.c_argtype_output_dataset()[source]#
Return the type of argument C code uses for output datasets.
- banff.io_util.c_argtype_parameters()[source]#
Return the type of argument C code uses for parameters.
- banff.io_util.dest_is_file(destination)[source]#
Whether or not the requested destination appears to be a file path.
- banff.io_util.dest_is_object(destination)[source]#
Whether or not the requested destination is an object.
- banff.io_util.flag_rows_where(ds_in, where_stmt, new_col_name='_exclude', flag_value='E')[source]#
Add a new ‘flag’ string column with a value flagging certain records.
Add a new string column, new_col_name, to an intermediate dataset. For records matching where_stmt, set the value to flag_value. Unflagged records will have a null (missing) value.
Returns the modified dataset and new column name (may differ from new_col_name).
- If new_col_name exists
A case-insensitive search for new_col_name is performed. If any matches are found, a random numeric suffix is added to the new column name. The new random name isn’t validated.
Use Case: add ‘exclude’ flag to indata or indata_hist datasets
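A minimal sketch of the flagging behaviour, assuming where_stmt uses pandas.DataFrame.query() syntax:

```python
import pandas as pd

df = pd.DataFrame({"ident": ["A", "B", "C"], "total": [5, 50, 7]})
where_stmt, new_col_name, flag_value = "total > 10", "_exclude", "E"

# New string column: null for unflagged rows, flag_value for matches.
df[new_col_name] = pd.Series(pd.NA, index=df.index, dtype="string")
df.loc[df.query(where_stmt).index, new_col_name] = flag_value
```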
- banff.io_util.get_default_output_spec()[source]#
Return current default output dataset specification.
- banff.io_util.handle_arrow_string_data(ds, dest_type=None)[source]#
Return new Table with all string data converted to one datatype.
Casts pyarrow.Table ds string columns (string and large_string) to dest_type. By default dest_type is pa.large_string().
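A minimal sketch of this cast (the function body is an illustrative assumption):

```python
import pyarrow as pa

def handle_arrow_string_data(ds: pa.Table, dest_type=None) -> pa.Table:
    dest_type = dest_type or pa.large_string()
    for i, field in enumerate(ds.schema):
        if pa.types.is_string(field.type) or pa.types.is_large_string(field.type):
            # Cast each string/large_string column to the single dest_type.
            ds = ds.set_column(i, field.name, ds.column(i).cast(dest_type))
    return ds
```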
- banff.io_util.handle_pandas_string_data(df)[source]#
Force string columns to use nullable string datatype.
Pandas 2.x often has issues loading “missing” string values. In some cases the column’s type is lost and assumed to be numeric. To avoid this, the code forcibly converts the column to a string type so that it is identified as an arrow string when received by C code.
Optimize: the memory efficiency of this method has not yet been assessed. Furthermore, SAS itself provided a single space character (’ ‘) to C when a character value was missing. If this causes issues, we could convert all missing character values to a single space, or have C emulate that behaviour for missing char values.
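A minimal sketch of the conversion; the target dtype shown (arrow-backed strings) is an assumption:

```python
import pandas as pd

def handle_pandas_string_data(df: pd.DataFrame) -> pd.DataFrame:
    # Convert object-typed (string) columns to a nullable, arrow-backed
    # string dtype so missing values survive the handoff to C code.
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].astype("string[pyarrow]")
    return df
```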
- banff.io_util.load_input_dataset(dataset, log)[source]#
Inspect and (if applicable) load user specified input dataset into intermediate format.
This function takes a user-provided input dataset object or file path and returns that dataset in an “intermediate” data format.
The format of the user-provided input dataset determines how it is loaded. When a file path is provided, the data is loaded with interm_from_input_file(). A dataset that is already in an intermediate format is returned immediately.
- Exceptions:
- TypeError - the input dataset’s type is not supported
- FileNotFoundError - the input dataset appears to be a non-existent file path
- ValueError - the input dataset is an empty string
- banff.io_util.pandas_to_arrow(pd_dataframe)#
Convert Pandas DataFrame to PyArrow Table.
- preserve_index=False
do not include the pandas “index” as an additional column in the pyarrow table
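A minimal sketch of the conversion described above:

```python
import pandas as pd
import pyarrow as pa

def pandas_to_arrow(pd_dataframe: pd.DataFrame) -> pa.Table:
    # preserve_index=False: do not add the pandas index as a column
    return pa.Table.from_pandas(pd_dataframe, preserve_index=False)
```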
- banff.io_util.remove_rows_where(ds_in, where_stmt)[source]#
Remove (delete/drop) certain records.
Remove rows that match where_stmt.
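A minimal pandas sketch, assuming where_stmt uses pandas.DataFrame.query() syntax:

```python
import pandas as pd

ds_in = pd.DataFrame({"ident": ["A", "B", "C"], "total": [5, 50, 7]})
ds_out = ds_in.drop(ds_in.query("total > 10").index)  # drops row "B"
```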
- banff.io_util.set_default_output_spec(new_spec)[source]#
Set the default output dataset specification to a supported object type identifier.
- banff.io_util.sort_dataset(dataset, by=None, case_sensitive=False, inplace=False)[source]#
Sort dataset in ascending order by the variables in the by list.
When case_sensitive=False (default), the dataset’s column names are normalized prior to sorting and restored after.
If no variables are passed, the original dataset is returned. If dataset format is not supported, a TypeError is raised.
- banff.io_util.unpack_output_dataset(dataset)[source]#
Load C generated output dataset to intermediate format.
- banff.io_util.write_output_dataset(dataset, destination, log, default_format=None)[source]#
Output dataset (dataset) according to user specification (destination).
Handles conversion from the intermediate format to the user-specified output format. This function looks at the type of the argument destination and calls the appropriate conversion function to facilitate further destination inspection or file conversion. Returns the dataset object in the output format, or the path to the written file.
- dataset
must be a valid dataset in one of the “intermediate” formats
validation of dataset only occurs in the called conversion functions
- destination is a user-provided value which determines various output settings:
None - return a pandas DataFrame
valid path to a supported file type - write the dataset to file and return the path it was written to
- default_format allows the caller to set a custom default output format.
- If certain custom exceptions occur during output, an attempt will be made to return a DataFrame of the output dataset.
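A minimal sketch of the dispatch described above; the file-type handling and the use of arrow_to_pandas() for the default case are illustrative assumptions:

```python
from pathlib import Path

import pyarrow.feather as feather
import pyarrow.parquet as pq

def write_output_dataset(dataset, destination, log, default_format=None):
    if destination is None:
        return arrow_to_pandas(dataset)       # default: pandas DataFrame
    path = Path(destination)
    if path.suffix == ".parquet":
        pq.write_table(dataset, path)         # write file, return its path
    elif path.suffix == ".feather":
        feather.write_feather(dataset, path)
    else:
        raise TypeError(f"unsupported destination: {destination!r}")
    return path
```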