biofx.parser package¶
Submodules¶
biofx.parser.GSCDirectoryTreeParser module¶
@cchng
This module is for parsing POG-like directory structure.
Todo
fix all the hard-coded paths
-
class
biofx.parser.GSCDirectoryTreeParser.
GSCDirectoryTemplate
(config='/home/cchoo/git/indel_merge/venv/lib/python3.6/site-packages/biofx-0.18.3-py3.6.egg/biofx/configs/directories.conf')[source]¶ Bases:
object
-
TEMPLATE_CONFIG
= '/home/cchoo/git/indel_merge/venv/lib/python3.6/site-packages/biofx-0.18.3-py3.6.egg/biofx/configs/directories.conf'¶
-
get_path_template
(analysis_type, directory_type)[source]¶ Get the path template for a specific analysis type.
Parameters: - analysis_type (string) – analysis type
- directory_type (string) – directory type
Returns: template path that can be formatted with format()
Return type: p (string)
-
-
class
biofx.parser.GSCDirectoryTreeParser.
GSCDirectoryTree
(_id=None, project='POG', root=None, config='/home/cchoo/git/indel_merge/venv/lib/python3.6/site-packages/biofx-0.18.3-py3.6.egg/biofx/configs/tumour_characterization_project.cfg')[source]¶ Bases:
object
GSCDirectoryTree class contains methods for retrieving items in the tree.
-
_id
¶ string – Top level identifier. Defaults to None.
-
project
¶ string – Project name. Defaults to POG.
-
root
¶ string – Base path to standard directory. Defaults to None.
-
config
¶ string – Path to project configs. Defaults to configs/tumour_characterization_project.cfg.
-
DEFAULT_CONFIG
= '/home/cchoo/git/indel_merge/venv/lib/python3.6/site-packages/biofx-0.18.3-py3.6.egg/biofx/configs/tumour_characterization_project.cfg'¶
-
PROJECT_CONFIG
= '/home/cchoo/git/indel_merge/venv/lib/python3.6/site-packages/biofx-0.18.3-py3.6.egg/biofx/configs/tumour_characterization_project.cfg'¶
-
add_expression_analysis
(stranded=True)[source]¶ Add expression analysis.
Returns: number of expression analyses. Return type: (int)
-
add_paired_analysis
()[source]¶ Add paired analysis.
Returns: number of paired analyses. Return type: (int)
-
get_file_with_id
(template, library='*', prefix='*')[source]¶ Parameters: - template (string) – a string template
- library (string) – library
- prefix (string) – file prefix
Returns: a list of file paths matching template and other args provided
Return type: files (list)
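The sketch below shows one way a path template could be filled with format() and then expanded into matching file paths. It is a standalone illustration, not the biofx implementation: the function name find_files, the placeholder names (root, id, library, prefix), the TEMPLATE string and the use of glob are all assumptions, since the real templates come from the project config.

```python
import glob
import os


def find_files(template, root, _id, library="*", prefix="*"):
    """Fill a path template, then expand any wildcards with glob."""
    # Placeholder names (root, id, library, prefix) are hypothetical.
    pattern = template.format(root=root, id=_id, library=library, prefix=prefix)
    return sorted(glob.glob(pattern))


# Hypothetical template; real templates are read from the project config.
TEMPLATE = os.path.join("{root}", "{id}", "{library}", "{prefix}.vcf")
```

With the default '*' for library and prefix, every library directory and file prefix under the identifier is matched; passing a concrete library narrows the result.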
-
get_sample_info
(library)[source]¶ Get sample info, for example biop1 for library.
Parameters: library (string) – library ID
-
get_tumour_content
(library)[source]¶ Get tumour content for library.
Parameters: library (string) – library ID
Returns: tumour content. None if paired analysis is not available, and ‘not available’ if it hasn’t been reviewed.
Return type: (string)
-
get_wgs_libraries
(biotype=None)[source]¶ Get wgs libraries
Parameters: biotype (string) – NOT implemented. Returns: Return type: (list)
-
set_id
(_id)[source]¶ Set identifier.
Parameters: _id (string) – identifier
Raises: ValueError
– invalid identifier
-
biofx.parser.LRGparser module¶
@cchng
This module is for processing LRG transcript resource files, for example ftp://ftp.ebi.ac.uk/pub/databases/lrgex/list_LRGs_transcripts_GRCh37.txt. A description of the file is available at http://www.lrg-sequence.org/downloads.
biofx.parser.SnpEffparser module¶
@cchng
This module is for processing SnpEff files and EFF strings.
-
biofx.parser.SnpEffparser.
eff_has_transcript
(transcript_id, eff_maps, partial=False)[source]¶ Parameters: - transcript_id (string) – ensembl transcript id
- eff_maps (list) – output of
parse_effect()
- partial (bool) – check based on partial transcript match (in the case of refseq)
Returns: True if transcript in list of effects
Return type: (bool)
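A minimal sketch of this kind of membership check is shown below. It is not the library's code: the function name has_transcript is hypothetical, the 'transcript' key is taken from the parse_effect() example output, and treating a partial match as "same accession, version suffix ignored" is an assumption about what partial means for RefSeq IDs.

```python
def has_transcript(transcript_id, eff_maps, partial=False):
    """Return True if any eff map is annotated on the given transcript."""
    for eff in eff_maps:
        transcript = eff.get("transcript", "")
        if partial:
            # Assumed meaning of "partial": compare accessions without the
            # version suffix, e.g. NM_000546 matches NM_000546.5.
            if transcript and transcript.split(".")[0] == transcript_id.split(".")[0]:
                return True
        elif transcript == transcript_id:
            return True
    return False
```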
-
biofx.parser.SnpEffparser.
filter_snpeff_impact
(eff_maps, impact_filter=['HIGH', 'MODERATE'], hierarchical=False)[source]¶ Filter snpeff annotations by impact.
Parameters: - eff_maps (list) – output of
parse_effect()
- impact_filter (list) – list of impacts to be included in output. Defaults to ["HIGH", "MODERATE"].
- hierarchical (boolean) – return highest impact only
Returns: eff_maps format, filtered by impact
Return type: filtered_by_impact (list)
Raises: AssertionError
– Only HIGH, MODERATE, LOW, MODIFIER impacts are accepted, as defined in the Snpeff manual (section: EFF field).
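The filtering described above can be sketched as follows. This is an illustration under assumptions rather than the biofx implementation: the name filter_by_impact is hypothetical, the 'effect_impact' key is taken from the parse_effect() example output, and hierarchical is interpreted as "keep only the effects that share the highest impact present".

```python
# Impacts from highest to lowest, as listed in the Raises note above.
VALID_IMPACTS = ("HIGH", "MODERATE", "LOW", "MODIFIER")


def filter_by_impact(eff_maps, impact_filter=("HIGH", "MODERATE"), hierarchical=False):
    """Keep eff maps whose impact is in impact_filter."""
    unknown = set(impact_filter) - set(VALID_IMPACTS)
    assert not unknown, "unsupported impact(s): %s" % sorted(unknown)
    kept = [eff for eff in eff_maps if eff["effect_impact"] in impact_filter]
    if hierarchical and kept:
        # Lower index means higher impact; keep only the top tier.
        best = min(VALID_IMPACTS.index(eff["effect_impact"]) for eff in kept)
        kept = [eff for eff in kept
                if VALID_IMPACTS.index(eff["effect_impact"]) == best]
    return kept
```

A tuple is used as the default here instead of a list to avoid a mutable default argument; that is a design choice of this sketch only.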
-
biofx.parser.SnpEffparser.
has_chromosome_error
(eff, chromosome, valid_pattern)[source]¶ Check for ERROR_CHROMOSOME_NOT_FOUND
Parameters: - eff (string) – snpeff EFF string
- chromosome (string) – chromosome to be checked
- valid_pattern (string) – regex for valid chromosome patterns; see biofx.parser.GSCDefinitions for more info.
Returns: True if error
Return type: bool
-
biofx.parser.SnpEffparser.
merge_eff_maps
(hgvs, classic, ordered=True, exclude_hgvs_effect_type=['chromosome_number_variation'], check_genotype=True)[source]¶ Merge effect maps
Parameters: - hgvs (list) – list of dictionaries (parse_effect() output with hgvs)
- classic (list) – list of dictionaries (parse_effect() output)
- ordered (bool) – whether both hgvs and classic lists are in order
- exclude_hgvs_effect_type (list) – hgvs effect types to exclude
- check_genotype (bool) – check genotype if True - used when there are multiple alleles
Returns: merged eff maps
Return type: merged_eff (list)
Raises: ValueError
– if annotation is not in exclude_hgvs_effect_type but is of high/moderate importance when there is no transcript assigned.
ValueError
– no classic annotation on transcript for variant, can’t merge
-
biofx.parser.SnpEffparser.
parse_effect
(effect, hgvs=False)[source]¶ Parses snpeff effect with the following format:
>>> EFF= Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change| Amino_Acid_Length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank | Genotype_Number [ | ERRORS | WARNINGS ] )
Parameters: - effect (tuple) – output of parse_snpeff(). A tuple with effect type as the first element followed by effect descriptions.
- hgvs (bool) – True if hgvs effects
Returns: a dictionary mapping effect key and value.
Return type: eff (dict)
Example:
>>> effect = ("INTRON","(MODIFIER|||||DDX12P|unprocessed_pseudogene|NON_CODING|ENST00000290818|19|1)")
>>> eff_map = parse_effect(effect)
>>> eff_map
{'classic_protein_sequence_change': '', 'codon_change': '', 'transcript': 'ENST00000290818', 'functional_class': '', 'coding': 'NON_CODING', 'gene_symbol': 'DDX12P', 'gene_biotype': 'unprocessed_pseudogene', 'classic_effect_type': 'INTRON', 'exon': '19', 'amino_acid_length': '', 'effect_impact': 'MODIFIER'}
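To make the field-to-key mapping concrete, here is a small standalone sketch that reproduces the dictionary from the example for the classic (non-hgvs) case. It is not the biofx source: the name parse_classic_effect and the FIELD_KEYS tuple are hypothetical, with key names and order inferred from the EFF format line and the example output. The genotype number and any trailing ERRORS/WARNINGS fields are dropped, since the example output does not include them.

```python
# Keys in EFF field order; inferred from the documented format and example.
FIELD_KEYS = (
    "effect_impact",
    "functional_class",
    "codon_change",
    "classic_protein_sequence_change",
    "amino_acid_length",
    "gene_symbol",
    "gene_biotype",
    "coding",
    "transcript",
    "exon",
)


def parse_classic_effect(effect):
    """Map an (effect_type, "(a|b|...)") tuple to a dictionary."""
    effect_type, description = effect
    # Drop the surrounding parentheses, then split the pipe-separated fields.
    fields = description.strip().lstrip("(").rstrip(")").split("|")
    # zip stops at the shorter sequence, so extra trailing fields are ignored.
    eff = dict(zip(FIELD_KEYS, fields))
    eff["classic_effect_type"] = effect_type
    return eff
```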
-
biofx.parser.SnpEffparser.
parse_snpeff
(raw_text)[source]¶ Parses snpeff string
Parameters: - raw_text (string) – a string in the format of
Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_change| Amino_Acid_length | Gene_Name | Gene_BioType | Coding | Transcript | Exon | GenotypeNum [ | ERRORS | WARNINGS ] )
generated by snpeff 3.3/4.1. Multiple effects are comma-separated.
Returns: a list of tuples of effects
Return type: m (list)
Raises: AssertionError
– This parser takes EFF formatted annotations only
Notes
In SnpEff 4.* LOF and NMD predictions are added by default. They are separated by semicolons. See “Loss of function (LOF) and nonsense-mediated decay (NMD) predictions” for more info. The LOF and NMD tags are ignored for now.
Multiple ‘+’-separated effects are also allowed.
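As a rough illustration of the splitting step, the sketch below turns an EFF string into a list of (effect type, description) tuples. It is an assumption-laden stand-in, not the library's parser: the name split_eff and the regular expression are hypothetical, an optional leading "EFF=" is tolerated, everything after the first semicolon (such as LOF/NMD tags) is ignored, and descriptions are assumed to contain no nested parentheses.

```python
import re

# Effect type (may contain '+') followed by a parenthesised description.
EFFECT_RE = re.compile(r"([A-Za-z0-9_+]+)(\([^()]*\))")


def split_eff(raw_text):
    """Split an EFF string into (effect_type, description) tuples."""
    # Keep only the part before the first semicolon (drops LOF/NMD tags).
    eff_field = raw_text.split(";")[0].strip()
    if eff_field.startswith("EFF="):
        eff_field = eff_field[len("EFF="):]
    effects = EFFECT_RE.findall(eff_field)
    assert effects, "expected an EFF-formatted annotation"
    return effects
```

Each returned tuple has the shape that parse_effect() is documented to accept.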
-
biofx.parser.SnpEffparser.
select_eff_by_gene_symbol
(gene_symbol, eff_maps, multiple=False)[source]¶ Parameters: - gene_symbol (string) – gene symbol (or any other id used for snpeff annotations)
- eff_maps (list) – list of eff maps (see output of parse_effect())
- multiple (bool) – True to return multiple eff. Otherwise the first one seen is selected when multiple selections are found. False by default.
Returns: tuple containing:
- selected_eff (dict/list): if multiple True, returns a list of selected eff with matching transcript
- alternative_eff (dict)
Return type: (tuple)
Raises: ValueError
– Gene symbol provided should be annotated in eff_maps; there is nothing to select otherwise.
RuntimeError
– genes before and after selection should be the same
-
biofx.parser.SnpEffparser.
select_eff_by_transcript
(transcript_id, eff_maps, multiple=False, partial=False, alt=None, random=False, sort_order=True)[source]¶ Parameters: - transcript_id (string) – ensembl transcript id (or any other id used for snpeff annotations)
- eff_maps (list) – list of eff maps (see output of parse_effect())
- multiple (bool) – True to return multiple eff. Otherwise the first one seen is selected when multiple selections are found. False by default.
- partial (bool) – True to match the transcript ID partially. Not recommended.
- alt (string) – alt allele
- random (bool) – if True and multiple is False, select an eff at random. Typically used when multiple alt alleles are seen. Supersedes the alt rule.
Returns: tuple containing:
- selected_eff (dict/list): if multiple True, returns a list of selected eff with matching transcript
- alternative_eff (dict)
Return type: (tuple)
Raises: AssertionError
– transcripts before and after selection should add up
-
biofx.parser.SnpEffparser.
verify_effects
(eff_a, eff_b)[source]¶ Compare snpeff effects, generally between Sequence Ontology effects and classic effects type as documented in the Snpeff manual under section ‘Effect prediction details’.
Parameters: - eff_a (dict) – output of
parse_effect()
- eff_b (dict) – output of
parse_effect()
Returns: True if both effects are equivalent.
Return type: (bool)
Raises: AssertionError
– more than one equivalent effect
References
Snpeff http://snpeff.sourceforge.net/SnpEff_manual.html (Retrieved on June 10 2015)
Notes
Effect types in v4.* are sometimes concatenated. i.e. there can be multiple effects on a single transcript. This happens at splice sites. For example:
>>> EFF=missense_variant(MODERATE|MISSENSE|Cgc/Tgc|p.Arg323Cys/c.967C>T|1013|GARNL3|protein_coding|CODING|ENST00000373387|11|T),missense_variant(MODERATE|MISSENSE|Cgc/Tgc|p.Arg301Cys/c.901C>T|991|GARNL3|protein_coding|CODING|ENST00000435213|12|T),missense_variant(MODERATE|MISSENSE|Cgc/Tgc|p.Arg323Cys/c.967C>T|820|GARNL3|protein_coding|CODING|ENST00000314904|11|T),splice_region_variant+non_coding_exon_variant(LOW|||n.736C>T||GARNL3|retained_intron|CODING|ENST00000495172|8|T),downstream_gene_variant(MODIFIER||3135|c.*905C>T|266|GARNL3|protein_coding|CODING|ENST00000439286||T|WARNING_TRANSCRIPT_INCOMPLETE),intron_variant(MODIFIER|||n.425-3424C>T||GARNL3|processed_transcript|CODING|ENST00000464616|5|T),non_coding_exon_variant(MODIFIER|||n.1140C>T||GARNL3|retained_intron|CODING|ENST00000485331|9|T),non_coding_exon_variant(MODIFIER|||n.913C>T||GARNL3|nonsense_mediated_decay|CODING|ENST00000373386|11|T)
Currently the output will be the concatenated effects; code will generate warnings when impact of the different effect types differ (as suggested in the mapping table provided in the Effect prediction details section in the manual). In the example above, ENST00000495172 has a splice_region_variant + non_coding_exon_variant. Splice regions have “low” impact and exon variants are “modifiers”. But we are ignoring the differences and taking the higher snpeff impact (LOW, in this case) as the bona fide impact.
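The "take the higher impact" rule from the notes can be sketched as below. This is only an illustration: the function name highest_impact is hypothetical, and the EFFECT_IMPACT table contains just the two effect types whose impacts are stated in the notes above (the full mapping is in the SnpEff manual and is not reproduced here).

```python
# Rank of each SnpEff impact; a larger number means a higher impact.
IMPACT_RANK = {"HIGH": 3, "MODERATE": 2, "LOW": 1, "MODIFIER": 0}

# Partial, illustrative mapping; only the two types named in the notes.
EFFECT_IMPACT = {
    "splice_region_variant": "LOW",
    "non_coding_exon_variant": "MODIFIER",
}


def highest_impact(effect_type):
    """Return the highest impact among '+'-concatenated effect types."""
    impacts = [EFFECT_IMPACT[part] for part in effect_type.split("+")]
    return max(impacts, key=IMPACT_RANK.__getitem__)
```

For the ENST00000495172 annotation in the example, highest_impact("splice_region_variant+non_coding_exon_variant") returns "LOW", matching the impact reported in the EFF string.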