vardb.variant_data_files package

There is one variant data file class for each file type that is loaded into vardb. Each class contains all of the information required to load that type of data into the database:

  • column names and data types for the files
  • information on the headers
  • which table the data is loaded to in vardb
  • PSQL commands for parsing and loading the data to tables on vardb
  • routines to compute required metadata from the file, such as the creation date and md5sum
  • the pipelines that the class belongs to

When a new data type is added to vardb, a new variant data file class must be created with this information.
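
As a rough illustration of the checklist above, a new class might bundle the column definitions, target table, load command, and pipelines along the following lines. This is only a sketch: VariantDataFileSketch is a stand-in defined here rather than the real vardb base class, and every attribute name, the ExampleBed class, the table name, and the pipeline name are hypothetical.

```python
class VariantDataFileSketch:
    """Hypothetical stand-in for the vardb base class (not the real API)."""

    # All attribute names below are assumptions made for this sketch.
    columns = ()          # (name, type) pairs for the columns of the file
    header_prefix = "#"   # how header lines are recognised
    table = None          # table the data is loaded to
    pipelines = ()        # pipelines that produce this file type

    def __init__(self, **kwargs):
        # Keep whatever metadata the caller supplies (for example the path).
        self.metadata = dict(kwargs)

    def copy_command(self):
        # Build a psql copy meta-command; the default text format is
        # tab-delimited, so no delimiter option is needed.
        names = ", ".join(name for name, _ in self.columns)
        return "\\copy {} ({}) FROM '{}'".format(
            self.table, names, self.metadata["path"])


class ExampleBed(VariantDataFileSketch):
    """Hypothetical new data type: a three-column interval file."""

    columns = (("chromosome", "TEXT"),
               ("start", "INTEGER"),
               ("stop", "INTEGER"))
    table = "example_bed"
    pipelines = ("example_pipeline",)


if __name__ == "__main__":
    print(ExampleBed(path="/tmp/example.bed").copy_command())
```

Keeping the per-type details as class attributes means a subclass only has to declare what differs from the base class.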

Submodules

vardb.variant_data_files.cnv module

cnv contains classes for germline (controlfree) and somatic (somatic_cnv) pipelines. The somatic_cnv pipeline actually creates several file types, which are all represented here.

class vardb.variant_data_files.cnv.ControlFreeC(**kwargs)

Bases: vardb.variant_data_files.variant_data_file.VariantDataFile

ControlFreeC class gets metadata for cnvs produced by the ControlFreeC pipeline

class vardb.variant_data_files.cnv.HomozygousDeletion(**kwargs)

Bases: vardb.variant_data_files.variant_data_file.VariantDataFile

class vardb.variant_data_files.cnv.HomozygousDeletion_v1(**kwargs)

Bases: vardb.variant_data_files.cnv.HomozygousDeletion

HomozygousDeletion class gets metadata for homozygous deletions that have been selected during review from the somatic cnv pipeline

class vardb.variant_data_files.cnv.HomozygousDeletion_v2(**kwargs)

Bases: vardb.variant_data_files.cnv.HomozygousDeletion

HomozygousDeletion class gets metadata for homozygous deletions that have been selected during review from the somatic cnv pipeline

class vardb.variant_data_files.cnv.HomozygousDeletion_v3(**kwargs)

Bases: vardb.variant_data_files.cnv.HomozygousDeletion

HomozygousDeletion class gets metadata for homozygous deletions that have been selected during review from the somatic cnv pipeline

class vardb.variant_data_files.cnv.SomaticCna(**kwargs)

Bases: vardb.variant_data_files.variant_data_file.VariantDataFile

SomaticCna class gets metadata for raw cna data produced by the somatic cnv pipeline.

class vardb.variant_data_files.cnv.SomaticCnv(**kwargs)

Bases: vardb.variant_data_files.variant_data_file.VariantDataFile

SomaticCnv class gets metadata for cnv segment data produced by the somatic cnv pipeline.

class vardb.variant_data_files.cnv.SomaticLOH(**kwargs)

Bases: vardb.variant_data_files.variant_data_file.VariantDataFile

SomaticLOH class gets metadata for loss of heterozygosity (LOH) states produced by APOLLOH.

Zygosity states are:
  • DLOH = deletion-LOH (state 1)
  • NLOH = copy-neutral-LOH (states 2,4)
  • ALOH = amplified-LOH (states 5,8,9,13,14,19)
  • HET = heterozygous (states 3,6,7)
  • ASCNA = allele-specific-amplification (states 10,12,15,18)
  • BCNA = balanced-amplification (states 11,16,17)
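
The state groupings listed above can be written down as a lookup table. The following is a minimal sketch, assuming the genotype states are plain integers; the names ZYGOSITY_STATES, STATE_TO_ZYGOSITY, and zygosity_for_state are illustrative and are not part of vardb.

```python
# Zygosity label -> APOLLOH genotype states, copied from the list above.
ZYGOSITY_STATES = {
    "DLOH": (1,),
    "NLOH": (2, 4),
    "ALOH": (5, 8, 9, 13, 14, 19),
    "HET": (3, 6, 7),
    "ASCNA": (10, 12, 15, 18),
    "BCNA": (11, 16, 17),
}

# Inverted view: genotype state -> zygosity label.
STATE_TO_ZYGOSITY = {
    state: label
    for label, states in ZYGOSITY_STATES.items()
    for state in states
}


def zygosity_for_state(state):
    """Return the zygosity label for a genotype state (KeyError if unknown)."""
    return STATE_TO_ZYGOSITY[state]
```
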
class vardb.variant_data_files.cnv.SomaticVAF(**kwargs)

Bases: vardb.variant_data_files.variant_data_file.VariantDataFile

SomaticVAF class gets metadata and allele frequencies from APOLLOH.

Tab-delimited output file for position-level results.
Nine fixed columns:
  1. chr (‘X’ and ‘Y’ will be output as 23 and 24)
  2. position
  3. reference count
  4. non-reference count
  5. total depth
  6. allelic ratio
  7. copy number (from input)
  8. APOLLOH genotype state
  9. Zygosity state.
N additional columns: posterior marginal probabilities (responsibilities) for each APOLLOH genotype state.
Zygosity states are:
  • DLOH = deletion-LOH (state 1)
  • NLOH = copy-neutral-LOH (states 2,4)
  • ALOH = amplified-LOH (states 5,8,9,13,14,19)
  • HET = heterozygous (states 3,6,7)
  • ASCNA = allele-specific-amplification (states 10,12,15,18)
  • BCNA = balanced-amplification (states 11,16,17)
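
To make the column layout concrete, here is one way a single line of the position-level file could be parsed. It is a sketch under assumptions: the field names, the Python types chosen for each column (in particular treating copy number as a float), and the function name parse_position_line are illustrative and not taken from vardb or APOLLOH.

```python
from collections import namedtuple

# Field names are assumptions; the order follows the nine columns above.
PositionResult = namedtuple("PositionResult", [
    "chrom", "position", "ref_count", "nonref_count", "depth",
    "allelic_ratio", "copy_number", "genotype_state", "zygosity",
    "responsibilities",
])


def parse_position_line(line):
    """Split one tab-delimited line into the nine fixed columns plus extras."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError("expected at least 9 tab-delimited columns")
    return PositionResult(
        chrom=fields[0],                 # 'X' and 'Y' arrive as 23 and 24
        position=int(fields[1]),
        ref_count=int(fields[2]),
        nonref_count=int(fields[3]),
        depth=int(fields[4]),
        allelic_ratio=float(fields[5]),
        copy_number=float(fields[6]),    # assumed numeric
        genotype_state=int(fields[7]),
        zygosity=fields[8],
        # Any remaining columns are the per-state responsibilities.
        responsibilities=tuple(float(value) for value in fields[9:]),
    )
```
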
class vardb.variant_data_files.cnv.TcgaCnv(**kwargs)

Bases: vardb.variant_data_files.variant_data_file.VariantDataFile

class vardb.variant_data_files.cnv.TcgaGermlineMaskedCnv(**kwargs)

Bases: vardb.variant_data_files.variant_data_file.VariantDataFile

vardb.variant_data_files.data_classes module

vardb.variant_data_files.data_classes.DataClass(**kwargs)

This is a factory for choosing the correct VariantDataFile subclass based on the pipeline information

Parameters:kwargs – metadata arguments
Returns:correct class
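
A factory of this kind is often implemented as a registry lookup. The sketch below assumes the pipeline is passed as a `pipeline` keyword and that the factory returns an instance; neither detail is confirmed by the description above. The classes and registry here are local stand-ins, not the real vardb classes.

```python
class VariantDataFileSketch:
    """Hypothetical stand-in for the vardb base class."""

    def __init__(self, **kwargs):
        self.metadata = dict(kwargs)


class ControlFreeCSketch(VariantDataFileSketch):
    pass


class RSEMSketch(VariantDataFileSketch):
    pass


# Assumed mapping from pipeline name to data file class.
_REGISTRY = {
    "controlfreec": ControlFreeCSketch,
    "rsem": RSEMSketch,
}


def data_class(**kwargs):
    """Pick the subclass registered for kwargs['pipeline'] and build it."""
    pipeline = kwargs.get("pipeline")
    try:
        cls = _REGISTRY[pipeline]
    except KeyError:
        raise ValueError(
            "no data class registered for pipeline {!r}".format(pipeline))
    return cls(**kwargs)
```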

vardb.variant_data_files.expression module

class vardb.variant_data_files.expression.RSEM(**kwargs)

Bases: vardb.variant_data_files.variant_data_file.VariantDataFile

RSEM class gets metadata for .rsem files

class vardb.variant_data_files.expression.TranscriptNormalized(**kwargs)

Bases: vardb.variant_data_files.variant_data_file.VariantDataFile

TranscriptNormalized class gets metadata for transcript.normalized files

vardb.variant_data_files.maf module

class vardb.variant_data_files.maf.TCGASimpleSomatic(**kwargs)

Bases: vardb.variant_data_files.variant_data_file.VariantDataFile

TCGASimpleSomatic class gets metadata for TCGA simple somatic MAF files

vardb.variant_data_files.variant_data_file module

class vardb.variant_data_files.variant_data_file.Columns(cols)

Bases: object

Immutable object containing a list of tuples with column name and type for each column of the data file

valid_types = ('INT', 'FLOAT', 'DATE', 'TIMESTAMP', 'TEXT', 'BIGINT', 'INTEGER', 'BOOLEAN')
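
One way to get an immutable, validated column list is sketched below. Only the valid_types tuple comes from the documentation above; the class name ColumnsSketch, the upper-casing of type names, and the way immutability is enforced are assumptions made for illustration.

```python
class ColumnsSketch:
    """Illustrative immutable container of (name, type) column pairs."""

    valid_types = ('INT', 'FLOAT', 'DATE', 'TIMESTAMP', 'TEXT',
                   'BIGINT', 'INTEGER', 'BOOLEAN')

    __slots__ = ("_cols",)

    def __init__(self, cols):
        checked = []
        for name, col_type in cols:
            col_type = col_type.upper()
            if col_type not in self.valid_types:
                raise ValueError(
                    "invalid type {!r} for column {!r}".format(col_type, name))
            checked.append((name, col_type))
        # Bypass our own __setattr__ exactly once, during construction.
        object.__setattr__(self, "_cols", tuple(checked))

    def __setattr__(self, name, value):
        raise AttributeError("columns are immutable")

    def __iter__(self):
        return iter(self._cols)

    def __len__(self):
        return len(self._cols)

    def names(self):
        return [name for name, _ in self._cols]
```
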
class vardb.variant_data_files.variant_data_file.VariantDataFile(**kwargs)

Bases: object

close()

Closes the file and resets the file pointer to None

get_columns_from_pandas(filename, **kwargs)

Reads a file into a pandas dataframe and extracts the column names and column types of the data file. This is useful in cases where the data file has variable numbers of columns.

Parameters:
  • filename – path to data
  • kwargs – any optional arguments for the pandas read_csv function
Returns:the Columns object corresponding to the columns in the data file
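
The idea of inferring columns from a dataframe can be sketched as follows. This is not the vardb implementation: the mapping from pandas dtypes to the type names is an assumption, and the function returns a plain list of tuples rather than a Columns object.

```python
import io

import pandas as pd


def columns_from_dataframe(df):
    """Map each dataframe column to an assumed SQL type name."""
    cols = []
    for name, dtype in df.dtypes.items():
        if pd.api.types.is_bool_dtype(dtype):
            sql_type = "BOOLEAN"
        elif pd.api.types.is_integer_dtype(dtype):
            sql_type = "BIGINT"
        elif pd.api.types.is_float_dtype(dtype):
            sql_type = "FLOAT"
        else:
            sql_type = "TEXT"   # fall back to text for everything else
        cols.append((str(name), sql_type))
    return cols


def get_columns_from_pandas(filename, **kwargs):
    """Read the file with pandas.read_csv and describe its columns."""
    df = pd.read_csv(filename, **kwargs)
    return columns_from_dataframe(df)


if __name__ == "__main__":
    sample = io.StringIO("chr\tpos\tratio\nX\t10\t0.5\n")
    print(get_columns_from_pandas(sample, sep="\t"))
```
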

get_data()

Sets the member variables for header and data. The header is a list of strings, and the file data is a pandas dataframe. Returns the data.

Returns:the data
get_data_ptr()

Sets the file pointer to the first line of data. If the _get_header function has been properly defined in the subclasses, this should always work.

Returns:file pointer at beginning of data
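
Positioning a file pointer at the first data line usually means reading past the header while remembering where each line started. A minimal sketch, assuming header lines begin with "#" (the real header rules are defined per subclass) and using the hypothetical name seek_to_data:

```python
import io


def seek_to_data(handle, header_prefix="#"):
    """Advance handle past header lines; return the header lines read."""
    header = []
    while True:
        offset = handle.tell()          # remember where this line starts
        line = handle.readline()
        if not line or not line.startswith(header_prefix):
            handle.seek(offset)         # rewind to the first data line
            return header
        header.append(line.rstrip("\n"))


if __name__ == "__main__":
    handle = io.StringIO("##meta\n#chr\tpos\n1\t100\n")
    print(seek_to_data(handle))
    print(handle.readline())
```
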
get_header()

Returns the file header, and closes the file

Returns:file header
get_md5sum()

Gets md5sum and adds it to the metadata
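
For reference, an md5sum can be computed in chunks so that large data files are never read into memory at once. This is a generic sketch using hashlib; the metadata key "md5sum" and the function names are assumptions rather than vardb's own.

```python
import hashlib


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex md5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_md5sum(metadata, path):
    """Store the digest in a metadata dict under an assumed key."""
    metadata["md5sum"] = md5sum(path)
    return metadata
```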

line_count()

Calculates the line count of the file

Returns:line count of file with filename
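
Counting lines can be done by iterating over the file rather than loading it. A small sketch; the standalone function taking a path is an assumption, since the method above operates on the object's own file.

```python
def line_count(path):
    """Count the lines in the file at path without loading it into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)
```
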
open()

Opens the data file for reading

Raises:VariantDataFileException if the file could not be opened
exception vardb.variant_data_files.variant_data_file.VariantDataFileException

Bases: exceptions.Exception

vardb.variant_data_files.vcf module

class vardb.variant_data_files.vcf.MutSeq_v1(**kwargs)

Bases: vardb.variant_data_files.vcf.VCF, vardb.variant_data_files.variant_data_file.VariantDataFile

Class for somatic vcf files created by mutation seq version 1.0.2

class vardb.variant_data_files.vcf.MutSeq_v2(**kwargs)

Bases: vardb.variant_data_files.vcf.VCF, vardb.variant_data_files.variant_data_file.VariantDataFile

Class for somatic vcf files created by mutation seq version 4.3.5

class vardb.variant_data_files.vcf.StrelkaIndels(**kwargs)

Bases: vardb.variant_data_files.vcf.VCF, vardb.variant_data_files.variant_data_file.VariantDataFile

Class for strelka indel files

class vardb.variant_data_files.vcf.StrelkaSnps(**kwargs)

Bases: vardb.variant_data_files.vcf.VCF, vardb.variant_data_files.variant_data_file.VariantDataFile

Class for strelka snp files

class vardb.variant_data_files.vcf.VCF

Bases: object

VCF class has functionality applicable to all vcf data classes

class vardb.variant_data_files.vcf.VCall(**kwargs)

Bases: vardb.variant_data_files.vcf.VCF, vardb.variant_data_files.variant_data_file.VariantDataFile

Class for vcf files created by the vcall pipeline (mpileup)

get_md5sum()

Gets md5sum and adds it to the metadata

normalize_indels()

Normalizes the self.unnormalized file to self.path if self.path does not already exist (that is, the file has not already been normalized)

class vardb.variant_data_files.vcf.VcfAnnotations(path)

Bases: vardb.variant_data_files.vcf.VCF, vardb.variant_data_files.variant_data_file.VariantDataFile

This class is for annotation VCFs. This VCF does not belong to a library and is only used for importing annotations.

vardb.variant_data_files.vcf_tools module

exception vardb.variant_data_files.vcf_tools.VcfToolsException

Bases: exceptions.Exception

vardb.variant_data_files.vcf_tools.annotate(log_path, vcf_path)

Wrapper function for running bioapps annotator on gphost

Parameters:
  • log_path – path to place log files
  • vcf_path – path to vcf file to annotate
Returns:

vardb.variant_data_files.vcf_tools.check_vt(normalized_vcf_file, log_file)

Checks to make sure that the normalized vcf file was correctly created

Parameters:
  • normalized_vcf_file
  • log_file
Returns:

vardb.variant_data_files.vcf_tools.normalize(unnormalized_file, normalized_file, log_path)

Wrapper function for running vt to normalize indels on gphost

Parameters:
  • unnormalized_file
  • normalized_file
  • log_path
Returns: