vardb.metadata_wrangling.oasis package

The oasis package is a collection of scripts that clean data from the Oasis database at BCCA and create tables for loading into vardb.


vardb.metadata_wrangling.oasis.demographics module


Add the biopsy_number column to the Demographics dataframe sorted by the biopsy_date in ascending order

Parameters:dataframe – Demographics dataframe
Returns:Demographics dataframe with the biopsy_number column added to it
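
The numbering step can be sketched with pandas as follows. This is only an illustration: the function name add_biopsy_number and the sample data are hypothetical, and numbering the biopsies separately for each patient_id is an assumption, since the description above only states that the ordering follows biopsy_date.

```python
import pandas as pd


def add_biopsy_number(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Number biopsies 1, 2, 3, ... in ascending biopsy_date order.

    Assumption: numbering restarts for every patient_id.
    """
    ordered = dataframe.sort_values(["patient_id", "biopsy_date"]).reset_index(drop=True)
    # cumcount() numbers the rows inside each patient group starting at 0
    ordered["biopsy_number"] = ordered.groupby("patient_id").cumcount() + 1
    return ordered


# Made-up sample data, for illustration only
demographics = pd.DataFrame(
    {
        "patient_id": ["POG001", "POG002", "POG001"],
        "biopsy_date": pd.to_datetime(["2016-05-01", "2015-03-10", "2015-01-20"]),
    }
)
numbered = add_biopsy_number(demographics)
```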

Extract the columns that store Biopsy information from the Clinical dataframe columns

Parameters:dataframe_columns – Clinical dataframe columns
Returns:List of columns that store the Biopsy information

Extract demographic data from the clinical dataframe

Parameters:dataframe – Clinical dataframe
Returns:Demographics dataframe

Work with Demographics data

Parameters:clinical_dataframe – The original clinical dataframe
Returns:Validated Demographics dataframe

Iterate over each row of the dataframe to validate biopsy_date <= pog_report_date

Parameters:row – Each row of the demographics dataframe
Returns:The error code string for that row
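
A row-wise date check of this kind might look like the sketch below. The row is modelled as a plain dict, the error code value is hypothetical, and skipping rows with a missing date is an assumption; the actual module may handle these cases differently.

```python
from datetime import date

# Hypothetical error code; the real codes used by the package are not documented here
BIOPSY_AFTER_REPORT = "BIOPSY_DATE_AFTER_POG_REPORT_DATE"


def validate_biopsy_date(row: dict) -> str:
    """Return an error code when biopsy_date is later than pog_report_date."""
    biopsy = row.get("biopsy_date")
    report = row.get("pog_report_date")
    if biopsy is None or report is None:
        # Assumption: rows with a missing date are not flagged by this check
        return ""
    return "" if biopsy <= report else BIOPSY_AFTER_REPORT


valid_row = {"biopsy_date": date(2015, 1, 20), "pog_report_date": date(2015, 3, 1)}
invalid_row = {"biopsy_date": date(2015, 4, 2), "pog_report_date": date(2015, 3, 1)}
```

The checks for blood_collection_date and consent_date follow the same pattern with a different column on the left-hand side.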

Iterate over each row of the dataframe to validate blood_collection_date <= pog_report_date

Parameters:row – Each row of the demographics dataframe
Returns:The error code string for that row

Iterate over each row of the dataframe to validate consent_date <= pog_report_date

Parameters:row – Each row of the demographics dataframe
Returns:The error code string for that row

Validate Demographics data

Parameters:dataframe – Demographics dataframe
Returns:Validated Demographics dataframe

Iterate over each row of the dataframe to validate the mandatory columns in the demographic data:

patient_id, sex, consent_date, consent_age
Parameters:row – Each row of the demographics dataframe
Returns:The error code string for that row

Iterate over each row of the dataframe to validate the columns that are mandatory if pog_report_date exists in the demographic data:

blood_collection_date, biopsy_date, bx_loc_radiated, prior_primary_tumour, biopsy_site, post_pog_activities, diag_changed, re_bx_prog1
Parameters:row – Each row of the demographics dataframe
Returns:The error code string for that row

Iterate over each row of the dataframe to validate that when post_pog_activities is not “POG informed treatment not given”, all of the “post_pog_treatment_*” columns are null:

patient_id, post_pog_treatment_deceased, post_pog_treatment_sick, post_pog_treatment_decision_pt, post_pog_treatment_decision_phys, post_pog_treatment_na, post_pog_treatment_cost, post_pog_treatment_travel, post_pog_treatment_unknown
Parameters:row – Each row of the demographics dataframe
Returns:The error code string for that row

Iterate over each row of the dataframe to validate that when post_pog_activities is “POG informed treatment not given”, exactly one of the “post_pog_treatment_*” columns has a “Y” and the rest are null:

post_pog_treatment_deceased, post_pog_treatment_sick, post_pog_treatment_decision_pt, post_pog_treatment_decision_phys, post_pog_treatment_na, post_pog_treatment_cost, post_pog_treatment_travel, post_pog_treatment_unknown
Parameters:row – Each row of the demographics dataframe
Returns:The error code string for that row
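
The "exactly one Y, the rest null" rule can be sketched as below. The column names and the activity label come from the description above; the function name, the error code, and the use of None to represent null are assumptions made for the example.

```python
POST_POG_TREATMENT_COLUMNS = [
    "post_pog_treatment_deceased",
    "post_pog_treatment_sick",
    "post_pog_treatment_decision_pt",
    "post_pog_treatment_decision_phys",
    "post_pog_treatment_na",
    "post_pog_treatment_cost",
    "post_pog_treatment_travel",
    "post_pog_treatment_unknown",
]
NOT_GIVEN = "POG informed treatment not given"
# Hypothetical error code, for illustration only
REASON_ERROR = "POST_POG_TREATMENT_REASON_INVALID"


def validate_not_given_reason(row: dict) -> str:
    """Require exactly one 'Y' among the reason columns, all others null."""
    if row.get("post_pog_activities") != NOT_GIVEN:
        return ""  # this rule only applies to the "not given" category
    values = [row.get(column) for column in POST_POG_TREATMENT_COLUMNS]
    yes_count = sum(1 for value in values if value == "Y")
    # Assumption: null is represented as None in this sketch
    unexpected = [value for value in values if value is not None and value != "Y"]
    return "" if yes_count == 1 and not unexpected else REASON_ERROR


base_row = {column: None for column in POST_POG_TREATMENT_COLUMNS}
base_row["post_pog_activities"] = NOT_GIVEN
one_reason = {**base_row, "post_pog_treatment_sick": "Y"}
two_reasons = {**one_reason, "post_pog_treatment_cost": "Y"}
```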

Iterate over each row of the dataframe to validate that when re_bx_prog1, re_bx_prog2, etc. is ‘Y’, the corresponding re_bx_date1, re_bx_date2, etc. is not null

Parameters:row – Each row of the demographics dataframe
Returns:The error code string for that row

Iterate over each row of the dataframe to validate re_bx_date1, re_bx_date2, etc. >= biopsy_date

Parameters:row – Each row of the demographics dataframe
Returns:The error code string for that row

vardb.metadata_wrangling.oasis.diagnosis module


Extract diagnosis data from the clinical dataframe

Parameters:dataframe – Clinical dataframe
Returns:Diagnosis dataframe

Work with Diagnosis data

Parameters:clinical_dataframe – The original clinical dataframe
Returns:Validated Diagnosis dataframe

Reshapes the Diagnosis dataframe from wide to long format using the pandas wide_to_long method

Parameters:dataframe – Diagnosis dataframe
Returns:Reshaped Diagnosis dataframe

Validate Diagnosis data

Parameters:dataframe – Diagnosis dataframe
Returns:Validated Diagnosis dataframe

Iterate over each row of the dataframe to validate the mandatory columns in the diagnosis data:

site_desc, tumour_group, diagnosis_date, age_at_diagnosis
Parameters:row – Each row of the diagnosis dataframe
Returns:The error code string for that row

vardb.metadata_wrangling.oasis.drug_map module


Drop the comma_separated_drugs column from the Drug Treatment dataframe

Parameters:drug_treatment_dataframe – Drug Treatment dataframe

Eliminate duplicate drug names and sort the drug names alphabetically

Parameters:drug_list – The list of drug names
Returns:The sorted drug list

Get the longest of the matched tokens

Parameters:matched_tokens – List of matched drug tokens
Returns:Matching token

Get the list of tokens that match to the drug map YAML file

Parameters:drug_tokens – Drug name tokens
Returns:List of matching tokens
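
The two matching steps above (collect the tokens found in the drug map, then keep the longest one) could be sketched like this. A plain dict stands in for the parsed drug map YAML file; its contents, the function names, and the case-insensitive lookup are all assumptions.

```python
# Stand-in for the parsed drug map YAML file; these entries are made up
DRUG_MAP = {
    "folinic": "FOLINIC ACID",
    "folinicacid": "FOLINIC ACID",
}


def get_matched_tokens(drug_tokens, drug_map):
    """Keep only the tokens that appear as keys in the drug map."""
    return [token for token in drug_tokens if token.lower() in drug_map]


def get_longest_token(matched_tokens):
    """Return the longest matched token, or None when nothing matched."""
    return max(matched_tokens, key=len) if matched_tokens else None


tokens = ["folinic", "acid", "folinicacid"]
matched = get_matched_tokens(tokens, DRUG_MAP)
best = get_longest_token(matched)
```

Preferring the longest token favours the most specific entry when a short token and a longer compound token both match.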

Copy the ‘drug_list’ column, rename the copy to ‘original_drug_string’, and insert the ‘original_drug_string’ column before the ‘error’ column

Parameters:drug_treatment_dataframe – Drug Treatment dataframe
Returns:Drug Treatment dataframe with the ‘original_drug_string’ column

Map Oasis drug names to Ontology

Parameters:row – Each row of the Drug Treatment Dataframe
Returns:List of Ontology-mapped drugs

Create the Drug Map and update the Drug Treatment table

Parameters:drug_treatment_dataframe – Drug Treatment dataframe

Split the comma-separated drug names into a list and strip the whitespace from the beginning and end of each drug name

Parameters:drug_names – List of drug names from Oasis
Returns:Processed column cell data with empty strings filtered out, e.g. ‘a’,,’ b’ –> [‘a’,’b’], or NaN for empty cells
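
A minimal sketch of this split-and-strip step is shown below. The function name is hypothetical, and treating any non-string cell as empty is an assumption.

```python
import math


def split_drug_names(drug_names):
    """Split a comma-separated cell into a clean list of drug names."""
    if not isinstance(drug_names, str):
        # Assumption: non-string cells (e.g. NaN) are treated as empty
        return math.nan
    names = [name.strip() for name in drug_names.split(",")]
    names = [name for name in names if name]  # drop the empty strings left by ',,'
    return names if names else math.nan


cleaned = split_drug_names("a,, b")
```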

Split the drug names on special characters (e.g. ‘ ‘, ‘(‘, ‘)’). For example, a (b) –> [‘a’, ‘b’]

Parameters:oasis_drug_name – Oasis drug name
Returns:List of split drug names

Tokenize the drug names. For example, the input ‘a b’ gives the output [‘a’, ‘b’, ‘ab’]

Parameters:oasis_drug_name – Drug name from Oasis
Returns:Drug name tokens
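
One way to reproduce the documented example is to return the individual words followed by their concatenation, as sketched below. How names with more than two words are tokenized is not stated above, so joining all words into a single extra token is an assumption.

```python
def tokenize_drug_name(oasis_drug_name):
    """Return the individual words plus all words joined together."""
    words = oasis_drug_name.split()
    tokens = list(words)
    if len(words) > 1:
        # Assumption: only the concatenation of all words is added
        tokens.append("".join(words))
    return tokens


tokens_for_example = tokenize_drug_name("a b")
```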

vardb.metadata_wrangling.oasis.error_code module

vardb.metadata_wrangling.oasis.error_code.append_error_codes(old_error_code, new_error_code)
Append the error code strings together
  • old_error_code – Old error code
  • new_error_code – New error code

Returns:Appended error code strings with or without comma(s)
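
The "with or without comma(s)" behaviour can be illustrated with a small sketch. It is not the package's implementation: the function is deliberately named append_error_codes_sketch, and joining with a bare comma and skipping empty codes are assumptions.

```python
def append_error_codes_sketch(old_error_code, new_error_code):
    """Join two error code strings, adding a comma only when both are non-empty."""
    codes = [code for code in (old_error_code, new_error_code) if code]
    return ",".join(codes)
```

With this sketch, appending to an empty string yields the new code unchanged, so no leading or trailing commas appear in the result.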

vardb.metadata_wrangling.oasis.error_code.collect_error_codes(error_reporting_dataframe, demographics_dataframe, diagnosis_dataframe, drug_treatment_dataframe, radiation_dataframe, diagnosis_error_dataframe, drug_treatment_error_dataframe, radiation_error_dataframe)

Perform operations to generate and process the error code column in the error and individual tables

  • error_reporting_dataframe – Clinical dataframe where the errors are reported
  • demographics_dataframe – Demographics dataframe
  • diagnosis_dataframe – Diagnosis dataframe
  • drug_treatment_dataframe – Drug Treatment dataframe
  • radiation_dataframe – Radiation dataframe
  • diagnosis_error_dataframe – Diagnosis Error dataframe
  • drug_treatment_error_dataframe – Drug Treatment Error dataframe
  • radiation_error_dataframe – Radiation Error dataframe

Returns:Clinical dataframe where the errors are reported


Iterate over every row of the error dataframe to concatenate the individual error codes

Parameters:row – Each row of the error code dataframe
Returns:Row information

vardb.metadata_wrangling.oasis.error_code.generate_error_dataframe(dataframe, demographics_dataframe, diagnosis_error_dataframe, drug_treatment_error_dataframe, radiation_error_dataframe)

Generate the error code reporting dataframe

  • dataframe – Copy of the Clinical dataframe
  • demographics_dataframe – Demographics error code dataframe
  • diagnosis_error_dataframe – Diagnosis error code dataframe
  • drug_treatment_error_dataframe – Drug Treatment error dataframe
  • radiation_error_dataframe – Radiation error dataframe

Returns:Generated error code dataframe

vardb.metadata_wrangling.oasis.error_code.group_and_aggregate_error_codes(dataframe, aggregate_column)

Group by and aggregate the error codes from individual dataframes

  • dataframe – Input dataframe
  • aggregate_column – The column on which to perform the aggregate operation

Returns:Dataframe grouped by the error codes for each patient id, with the index reset

vardb.metadata_wrangling.oasis.error_code.rename_error_code_column_to_errors(demographics_dataframe, diagnosis_dataframe, drug_treatment_dataframe, radiation_dataframe)

Rename the error code columns in the individual tables

  • demographics_dataframe – Demographics dataframe
  • diagnosis_dataframe – Diagnosis dataframe
  • drug_treatment_dataframe – Drug treatment dataframe
  • radiation_dataframe – Radiation dataframe

Replace string NaN values with the empty string ‘’

Parameters:dataframe – Error dataframe
Returns:Error dataframe with string NaN values replaced with ‘’

vardb.metadata_wrangling.oasis.helpers module

vardb.metadata_wrangling.oasis.helpers.extract_column_names_from_base_names(dataframe_columns, base_names, pattern_match)

Extracts column names from base names

  • dataframe_columns – Columns of the dataframe
  • base_names – Base names for that dataframe
  • pattern_match – Matching pattern for that dataframe column name

Returns:List of column names with patient_id


Extract the columns from the Clinical dataframe which stores dates

Parameters:dataframe – Clinical dataframe
Returns:List of columns that store dates

vardb.metadata_wrangling.oasis.helpers.reshape_dataframe(dataframe, stub_names, id_variable, sub_observation, separator='', suffix='\\d+')

Reshapes a dataframe from wide to long, drops NaN rows and resets the indices

  • dataframe – The dataframe to be reshaped
  • stub_names – Column names in the reshaped dataframe
  • id_variable – Column to use as id variable
  • sub_observation – Column name that you wish to name your suffix in the long format.
  • separator – A character indicating the separation of the variable names in the wide format, to be stripped from the names in the long format.
  • suffix – A regular expression capturing the wanted suffixes.

Returns:Reshaped dataframe
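
The parameters above mirror those of pandas.wide_to_long (stubnames, i, j, sep, suffix). The sketch below shows that call on made-up data; the numbered column names such as site_desc1, the name diagnosis_number, and dropping only rows where every stub column is empty are assumptions rather than the package's actual behaviour.

```python
import pandas as pd

# Made-up wide table: one row per patient, numbered columns per diagnosis
wide = pd.DataFrame(
    {
        "patient_id": ["POG001", "POG002"],
        "site_desc1": ["lung", "breast"],
        "site_desc2": ["liver", None],
        "diagnosis_date1": ["2014-01-05", "2015-02-11"],
        "diagnosis_date2": ["2016-07-19", None],
    }
)

long_df = pd.wide_to_long(
    wide,
    stubnames=["site_desc", "diagnosis_date"],  # stub_names
    i="patient_id",                             # id_variable
    j="diagnosis_number",                       # sub_observation
    sep="",                                     # separator
    suffix=r"\d+",                              # suffix
)
# Drop rows where every stub column is empty, then flatten the index
long_df = long_df.dropna(how="all").reset_index()
```

The result has one row per patient and diagnosis, with the numeric suffix moved into the diagnosis_number column.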

vardb.metadata_wrangling.oasis.oasis module

vardb.metadata_wrangling.oasis.oasis.parse_oasis_data(oasis_file_path, output_path)
  • oasis_file_path – Input OASIS file path
  • output_path – The folder path to store the output

vardb.metadata_wrangling.oasis.output module

vardb.metadata_wrangling.oasis.output.dataframe_to_tsv(dataframe, file_path, date_stamp)

Write the dataframe to a TSV file

  • dataframe – Input dataframe
  • file_path – File path where to write it
  • date_stamp – YYYYMMDD date format of the file

Filter out non pediatric ids from the error reporting dataframe

Parameters:error_dataframe – Error reporting dataframe
Returns:Filtered dataframe without pediatric ids

vardb.metadata_wrangling.oasis.output.output_to_tsv(demographics_dataframe, diagnosis_dataframe, drug_treatment_dataframe, radiation_dataframe, error_dataframe, output_path)

Write the output tables to TSV files

  • demographics_dataframe – Demographics dataframe
  • diagnosis_dataframe – Diagnosis dataframe
  • drug_treatment_dataframe – Drug Treatment dataframe
  • radiation_dataframe – Radiation dataframe
  • error_dataframe – Clinical dataframe where the errors are reported
  • output_path – The folder path to store the output

vardb.metadata_wrangling.oasis.output.write_output_to_tsv(demographics_dataframe, diagnosis_dataframe, drug_treatment_dataframe, radiation_dataframe, error_dataframe, output_path)

Convert the error code lists to strings in the individual tables

  • demographics_dataframe – Demographics dataframe
  • diagnosis_dataframe – Diagnosis dataframe
  • drug_treatment_dataframe – Drug treatment dataframe
  • radiation_dataframe – Radiation dataframe
  • error_dataframe – Error dataframe
  • output_path – The folder path to store the output

vardb.metadata_wrangling.oasis.preprocess module


Append a column named ‘patient_id’ that stores a compact form of the gsc_pog_id value, e.g. ‘POG 001-GIC’ is stored as ‘POG001’

Parameters:dataframe – Input dataframe
Returns:Dataframe with the appended column
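
The conversion from ‘POG 001-GIC’ to ‘POG001’ could be done with a regular expression, as in the sketch below. Only the single example above is documented, so the pattern, the function name, and returning None for values that do not match are assumptions.

```python
import re


def to_patient_id(gsc_pog_id):
    """Turn a value such as 'POG 001-GIC' into 'POG001'."""
    # Assumption: the id is the prefix 'POG', optional spaces, then digits
    match = re.match(r"\s*POG\s*(\d+)", gsc_pog_id)
    return f"POG{match.group(1)}" if match else None


patient_id = to_patient_id("POG 001-GIC")
```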

Clean the OASIS data

Parameters:clinical_dataframe – Clinical dataframe
Returns:(Clinical dataframe, Error reporting dataframe)

vardb.metadata_wrangling.oasis.preprocess.compare_death_dates_to_all_dates(row, date_columns)

Compare death date to the rest of the dates to validate death_date > all other dates

  • row – Each row of the Error Reporting dataframe
  • date_columns – List of date columns from the Error Reporting Dataframe

Returns:Error code string for that row
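
The comparison can be sketched as a loop over the date columns. The row is modelled as a dict, the error code is hypothetical, and missing dates are skipped by assumption; pog_report_date is excluded because the validation described later in this module exempts it.

```python
from datetime import date

# Hypothetical error code, for illustration only
DEATH_DATE_ERROR = "DEATH_DATE_NOT_AFTER_OTHER_DATES"


def compare_death_date_sketch(row, date_columns):
    """Flag the row when death_date is not later than every other date."""
    death = row.get("death_date")
    if death is None:
        return ""  # assumption: nothing to check without a death date
    for column in date_columns:
        if column in ("death_date", "pog_report_date"):
            continue  # pog_report_date can be reported at any time
        other = row.get(column)
        if other is not None and death <= other:
            return DEATH_DATE_ERROR
    return ""


columns = ["consent_date", "biopsy_date", "pog_report_date", "death_date"]
consistent = {
    "consent_date": date(2015, 1, 5),
    "biopsy_date": date(2015, 1, 20),
    "pog_report_date": date(2017, 1, 1),
    "death_date": date(2016, 6, 30),
}
inconsistent = {**consistent, "biopsy_date": date(2016, 7, 15)}
```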

vardb.metadata_wrangling.oasis.preprocess.compare_tumour_group_to_pog_tumour_groups(row, pog_tumour_group_columns)

Compare the tumour group in the gsc_pog_id column to pog_tumour_groups from the treatment data

  • row – Each row of the Error reporting dataframe
  • pog_tumour_group_columns – List of pog_tumour_group columns from the treatment data

Returns:Error reporting dataframe with the error reported


Drop multiple POG biopsies that share the same biopsy date. All of the affected rows are dropped.


Drop duplicate rows

Parameters:dataframe – Input dataframe
Returns:Dataframe without duplicate rows

Drop empty rows with only GSC POG ID

Parameters:dataframe – Input dataframe
Returns:Dataframe with the dropped rows

Drop the rows with missing GSC POG IDs

Parameters:dataframe – Input dataframe
Returns:New dataframe excluding the erroneous rows

Iterate over each row of the dataframe to identify empty rows with only GSC POG IDs to label them as ‘Empty’

Parameters:row – Each row of the Clinical dataframe
Returns:Error code string for that row

Iterate over each row of the Clinical dataframe to identify missing gsc_pog_ids

Parameters:row – Each row of the Clinical dataframe
Returns:Error code string for that row

Read the input OASIS Excel file

Parameters:oasis_file – Input OASIS file
Returns:Clinical Dataframe

Split the gsc_pog_id column into tumour_group, patient_id and pediatric_id. Drop the gsc_pog_id after that. Move the new columns as the first three columns

Parameters:dataframe – Input dataframe
Returns:Dataframe with the new columns and without gsc_pog_id column

Split the comma-separated strings, strip the whitespace, and join the parts back together as a comma-separated string

Parameters:column_cell_data – Data from the column cell
Returns:Processed column cell data as a comma-separated string with empty strings filtered out, e.g. ‘a’,,’ b’ –> ‘a’,’b’, or NaN for empty cells

Strip off a trailing ‘ 1’ from the dates

Parameters:dataframe – Clinical dataframe
Returns:Date formatted Clinical dataframe

Strips whitespace and consecutive commas from the column data, e.g. ‘a’,,’ b’ –> ‘a’,’b’

Parameters:dataframe – Clinical Dataframe
Returns:Clinical Dataframe with whitespace stripped from the data and consecutive commas (,,) eliminated

Strips the whitespace from the data for all the columns

Parameters:dataframe – Input dataframe
Returns:Dataframe without whitespace at the beginning and end of the data

vardb.metadata_wrangling.oasis.preprocess.uniform_date_format(dataframe, date_columns)

Format all the dates to a uniform pattern YYYY-MM-DD

  • dataframe – Clinical dataframe
  • date_columns – List of dataframe columns that store dates

Returns:Date formatted Clinical dataframe
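
Normalising dates to YYYY-MM-DD can be sketched for a single cell value as below. The input formats that occur in the OASIS export are not documented here, so the list of accepted formats is purely an assumption, as is leaving unparseable values untouched.

```python
from datetime import datetime

# Assumed input formats; the formats actually present in OASIS data may differ
ASSUMED_INPUT_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y")


def to_iso_date(value):
    """Return the date as YYYY-MM-DD, or the original value if it cannot be parsed."""
    for date_format in ASSUMED_INPUT_FORMATS:
        try:
            return datetime.strptime(value.strip(), date_format).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next candidate format
    return value


normalised = [to_iso_date(v) for v in ["2016/03/15", "15-03-2016", "2016-03-15"]]
```

With pandas, a function like this could be applied to each column named in date_columns, for example through Series.map.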

vardb.metadata_wrangling.oasis.preprocess.validate_data(clinical_dataframe, error_reporting_dataframe)

Validate the Clinical dataframe and report errors on the Error Reporting dataframe

  • clinical_dataframe – Clinical dataframe
  • error_reporting_dataframe – Error Reporting dataframe

Returns:Updated Clinical dataframe and errors reported on the Clinical dataframe


Validate death_date > all other date columns (except pog_report_date as that can be reported anytime)

Parameters:dataframe – Error Reporting dataframe
Returns:Updated Error Reporting dataframe with the appropriate error code

Identify multiple POG biopsies with the same biopsy date. Flag all rows with the error code.

Parameters:dataframe – Error dataframe
Returns:Error Reporting Dataframe updated with multiple POG biopsies with the same biopsy dates identified

Identify the duplicate rows and iterate over each of them to label it as ‘Duplicate’

Parameters:dataframe – Error Reporting Dataframe
Returns:Error Reporting Dataframe updated with duplicate records identified

Identify the empty POG ID rows in the Clinical dataframe, report them in the Error Reporting dataframe, and drop them from the Clinical dataframe

Parameters:error_reporting_dataframe – Error Reporting dataframe
Returns:Updated Error Reporting dataframe

Identify the missing POG ID rows in the Clinical dataframe, report them in the Error Reporting dataframe, and drop them from the Clinical dataframe

Parameters:error_reporting_dataframe – Error Reporting dataframe
Returns:Updated Error Reporting dataframe

Validate same tumour groups in the gsc_pog_id column and treatment columns

Parameters:dataframe – Error reporting Dataframe
Returns:Error reporting dataframe with the error code reported

vardb.metadata_wrangling.oasis.radiation module


Extract radiation data from the clinical dataframe

Parameters:dataframe – Clinical dataframe
Returns:Radiation dataframe

Work with Radiation data

Parameters:clinical_dataframe – The original clinical dataframe
Returns:Validated Radiation dataframe

Reshapes the Radiation dataframe from wide to long format using the pandas wide_to_long method

Parameters:dataframe – Radiation Dataframe
Returns:Reshaped Radiation Dataframe

vardb.metadata_wrangling.oasis.treatment module


For each treatment group (based on patient_id), count the number of non-NULL treatment_type entries

Parameters:pog_informed_group – pog_informed group for Treatment groups (based on patient_id)
Returns:Count of non-NULL treatment_type entries

For each treatment group (based on patient_id), count the number of pog_informed entries

Parameters:pog_informed_group – pog_informed group for Treatment groups (based on patient_id)
Returns:Count of pog_informed entries

Extract drug treatment data from the clinical dataframe

Parameters:dataframe – Clinical dataframe
Returns:Drug Treatment dataframe

Work with Drug Treatment data

Parameters:clinical_dataframe – The original clinical dataframe
Returns:Validated Drug Treatment dataframe

Reshapes the Drug Treatment dataframe from wide to long format using the pandas wide_to_long method

Parameters:dataframe – Drug Treatment Dataframe
Returns:Reshaped Drug Treatment Dataframe

Validate that best_response is not null for pog_informed (‘Y’) entries

Parameters:row – Each row of the Drug Treatment dataframe
Returns:The error code string for that row

vardb.metadata_wrangling.oasis.treatment.validate_data(drug_treatment_dataframe, demographics_dataframe)

All validations pertaining to the Drug Treatment dataframe

  • drug_treatment_dataframe – Drug Treatment dataframe
  • demographics_dataframe – Demographics dataframe supplied for cross validation with treatment table

Returns:Validated Drug Treatment dataframe

vardb.metadata_wrangling.oasis.treatment.validate_demographics_post_pog_activity_categories(pog_informed_y_dataframe, demographics_dataframe)

Validate that when pog_informed = ‘Y’ for at least one treatment, the Demographics data Post POG activities value is one of ‘POG informed out of province’, ‘ST/CT therapy at BCCA’, ‘POG informed compassionate access therapy’, or ‘POG informed private pay’

  • pog_informed_y_dataframe – Filtered Drug treatment dataframe with pog_informed = ‘Y’
  • demographics_dataframe – Demographics dataframe

Returns:Demographics dataframe with the error code reported

vardb.metadata_wrangling.oasis.treatment.validate_demographics_with_treatment(demographics_dataframe, drug_treatment_dataframe)

Validation performed on a dataframe obtained by merging Demographics with Drug Treatment dataframe and reporting error_codes on the Demographics dataframe

  • demographics_dataframe – Demographics dataframe
  • drug_treatment_dataframe – Drug Treatment dataframe

Returns:Demographics dataframe with the error codes reported

vardb.metadata_wrangling.oasis.treatment.validate_drug_treatment_for_bcca_treatment_type(pog_informed_y_dataframe, demographics_dataframe)

Validate Drug Treatment data for bcca treatment type

  • pog_informed_y_dataframe – Filtered Drug treatment dataframe with pog_informed = ‘Y’
  • demographics_dataframe – Demographics dataframe

Returns:Demographics dataframe with the error code reported


Iterate over each row of the merged dataframe to validate when demographics.post_pog_activities is ‘ST/CT therapy at BCCA’ then for at least one entry where drug_treatment.pog_informed = ‘Y’, drug_treatment.treatment_type should not be null

Parameters:row – Each row of the merged Demographics dataframe
Returns:The error code string for that row

Iterate over each row of the dataframe to validate whether Post POG activities is either ‘POG informed out of province’ or ‘ST/CT therapy at BCCA’

Parameters:row – Each row of the merged Demographics dataframe
Returns:The error string for that row

Iterate over each row of the dataframe to validate the mandatory columns in the drug treatment data:

tumour_group, pog_tumour_group, course_begin_on, course_end_on, drug_list, intent, treatment_time, pog_informed

Parameters:row – Each row of the demographics dataframe
Returns:The error code string for that row

Iterate over each row of the dataframe to validate that when progression_on is present, progression_documentation is also present

Parameters:row – Each row of the drug treatment dataframe
Returns:The error code string for that row

Validate Drug Treatment data

Parameters:drug_treatment_dataframe – Drug Treatment dataframe
Returns:Validated Drug Treatment dataframe

Iterate over each row of the dataframe to validate when course_begin_on date is <= demographics.pog_report_date then treatment_time should be ‘Pre POG Report’; otherwise, treatment_time should be ‘Post POG Report’

Parameters:row – Each row of the drug treatment dataframe
Returns:The error code string for that row

vardb.metadata_wrangling.oasis.treatment.validate_treatment_time_for_pre_or_post_pog_report(drug_treatment_dataframe, demographics_dataframe)

Validate that if the course_begin_on date is <= demographics.pog_report_date, then treatment_time is ‘Pre POG Report’; otherwise, treatment_time is ‘Post POG Report’.

  • drug_treatment_dataframe – Drug treatment dataframe
  • demographics_dataframe – Demographics dataframe

Returns:Drug treatment dataframe with the error code reported
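
The pre/post rule can be sketched for a single merged row as below. The row is assumed to already carry pog_report_date from the Demographics table (for example after a merge on patient_id); the function names and the error code are hypothetical.

```python
from datetime import date

# Hypothetical error code, for illustration only
TREATMENT_TIME_ERROR = "TREATMENT_TIME_MISMATCH"


def expected_treatment_time(course_begin_on, pog_report_date):
    """Derive the label that treatment_time should carry."""
    if course_begin_on <= pog_report_date:
        return "Pre POG Report"
    return "Post POG Report"


def validate_treatment_time(row):
    """Compare the recorded treatment_time with the label derived from the dates."""
    expected = expected_treatment_time(row["course_begin_on"], row["pog_report_date"])
    return "" if row["treatment_time"] == expected else TREATMENT_TIME_ERROR


merged_row = {
    "course_begin_on": date(2016, 2, 1),
    "pog_report_date": date(2016, 3, 1),
    "treatment_time": "Pre POG Report",
}
mislabelled_row = {**merged_row, "treatment_time": "Post POG Report"}
```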


Iterate over each row of the dataframe to validate that when pog_informed = ‘Y’, treatment_time is ‘Post POG Report’

Parameters:row – Each row of the drug treatment dataframe
Returns:The error code string for that row

vardb.metadata_wrangling.oasis.treatment.validate_treatment_with_demographics(drug_treatment_dataframe, demographics_dataframe)

Validation performed on a dataframe obtained by merging Drug Treatment with Demographics dataframe and reporting error_codes on the Drug Treatment dataframe

  • drug_treatment_dataframe – Drug Treatment dataframe
  • demographics_dataframe – Demographics dataframe

Returns:Drug Treatment dataframe with the error codes reported