use RNA; $seq = "CGCAGGGAUACCCGCG"; ($struct, $mfe) = RNA::fold($seq); #predict mfe structure of $seq RNA::PS_rna_plot($seq, $struct, "rna.ps"); # write PS plot to rna.ps $F = RNA::pf_fold($seq); # compute partition function and pair pobabilities RNA::PS_dot_plot($seq, "dot.ps"); # write dot plot to dot.ps ...
The RNA.pm package gives access to almost all functions in the libRNA.a library of the Vienna RNA PACKAGE. The Perl wrapper is generated using SWIG http://www.swig.org/ with relatively little manual intervention. For each C function in the library the perl package provides a function of the same name and calling convention (with few exceptions). For detailed information you should therefore also consult the documentation of the library (info RNAlib).
Note that in general C arrays are wrapped into opaque objects that can only be accessed via helper functions. SWIG provides a couple of general purpose helper functions, see the section at the end of this file. C structures are wrapped into Perl objects using SWIG's shadow class mechanism, resulting in a tied hash with keys named after the structure members.
For the interrested reader we list for each scalar type of the corepsonding C variable in brackets, and point out the header files containing the C declaration.
Minimum free Energy Folding (from fold.h)
computes the minimum free energy structure of the string SEQUENCE and returns the predicted structure and energy, e.g.
($structure, $mfe) = RNA::fold("UGUGUCGAUGUGCUAU");
If a second argument is supplied and $fold_constrained==1 the CONSTRAINTS string is used to specify constraints on the predicted structure. The characters '|', 'x', '<', '>' mark bases that are paired, unpaired, paired upstream, or downstream, respectively; matching brackets "( )" denote base pairs, dots '.' are used for unconstrained bases.
In the two argument version the CONSTRAINTS string is modified and holds the predicted structure upon return. This is done for backwards compatibility only, and might change in future versions.
returns the energy of SEQUENCE on STRUCTURE (in kcal/mol). The string structure must hold a valid secondary structure in bracket notation.
recalculate the pair matrix and energy parameters after a change in folding
parameters. In many cases (such as changes to
$temperature) the fold()
routine will call
update_fold_params automatically when necessary.
frees memory allocated internally when calling fold.
works as fold, but SEQUENCE may be the concatenation of two RNAs in order compute their hybridization structure. E.g.:
$seq1 ="CGCAGGGAUACCCGCG"; $seq2 ="GCGCCCAUAGGGACGC"; $RNA::cut_point = length($seq1)+1; ($costruct, $comfe) = RNA::cofold($seq1 . $seq2);
compute the structure upon hybridization of SEQ1 and SEQ2. In contrast to cofold only intra-molecular pairs are allowed. Thus, the algorithm runs in O(n1*n2) time where n1 and n2 are the lengths of the sequences. The result is returned in a C struct containing the innermost base pair (i,j) the structure and energy. E.g:
$seq1 ="CGCAGGGAUACCCGCG"; $seq2 ="GCGCCCAUAGGGACGC"; $dup = RNA::duplexfold($seq1, $seq2); print "Region ", $dup->{i}+1-length($seq1), " to ", $dup->{i}, " of seq1 ", "pairs up with ", $dup->{j}, " to ", $dup->{j}+length($dup->{structure})-length($seq1)-2, " of seq2\n";
Partition function Folding (from part_func.h)
calculates the partition function over all possible secondary structures and the matrix of pair probabilities for SEQUENCE and returns a two element list consisting of a string summarizing possible structures. See below on how to access the pair probability matrix. As with fold the second argument can be used to specify folding constraints. Constraints are implemented by excluding base pairings that contradict the constraint, but without bonus energies. Constraints of type '|' (paired base) are ignored. In the two argument version CONSTRAINTS is modified to contain the structure string on return (obsolete feature, for backwards compatibility only)
After calling pf_fold
the global C variable pr
points to the
computed pair probabilities. Perl access to the C is facilitated by
the get_pr
helper function that looks up and returns the
probability of the pair (I,J).
frees memory allocated for pf_fold
recalculate energy parameters for pf_fold. In most cases (such as
simple changes to $temperature) pf_fold
will take appropriate action automatically.
return a random structure chosen according to it's Boltzmann probability. Use to produce samples representing the thermodynamic ensemble of structures.
RNA::pf_fold($sequence); for (1..1000) { push @sample, RNA::pbacktrack($sequence); }
calculates the partition function over all possible secondary structures and the matrix of pair probabilities for SEQUENCE. SEQUENCE is a concatenation of two sequences (see cofold). Returns a five element list consisting of a string summarizing possible structures as first element. The second element is the Gibbs free energy of Sequence 1 (as computed also with pf_fold), the third element the Gibbs free energy of Sequence 2. The fourth element is the Gibbs free energy of all structures that have INTERmolecular base pairs, and finally the fifth element is the Gibbs free energy of the whole ensemble (dimers as well as monomers). See above on how to access the pair probability matrix. As with fold the second argument can be used to specify folding constraints. Constraints are implemented by excluding base pairings that contradict the constraint, but without bonus energies. Constraints of type '|' (paired base) are ignored. In the two argument version CONSTRAINTS is modified to contain the structure string on return (obsolete feature, for backwards compatibility only)
frees memory allocated for co_pf_fold
recalculate energy parameters for co_pf_fold. In most cases (such as
simple changes to $temperature) co_pf_fold
will take appropriate action automatically.
calculates equilibrium concentrations of the three dimers AB, AA, and BB, as well as the two monomers A and B out of the free energies of the duplexes (FdAB, FdAA, FdBB, these are the fourth elements returned by co_pf_fold), the monomers (FA, FB (e.g. the second and third elements returned by co_pf_fold with sequences AB) and the start concentrations of A and B. It returns as first element the concentration of AB dimer, than AA and BB dimer, as fourth element the A monomer concentration, and as fifth and last element the B monomer concentration. So, to compute concentrations, you first have to run 3 co_pf_folds (with sequences AB, AA and BB).
Suboptimal Folding (from subopt.h)
compute all structures of SEQUENCE within DELTA*0.01 kcal/mol of the optimum. If specified, results are written to FILEHANDLE and nothing is returned. Else, the C function returnes a list of C structs of type SOLUTION. The list is wrapped by SWIG as a perl object that can be accesses as follows:
$solution = subopt($seq, undef, 500); for (0..$solution->size()-1) { printf "%s %6.2f\n", $solution->get($_)->{structure}, $solution->get($_)->{energy}; }
Alignment Folding (from alifold.h)
similar to fold()
but compute the consensus structure for a set of aligned
sequences. E.g.:
@align = ("GCCAUCCGAGGGAAAGGUU", "GAUCGACAGCGUCU-AUCG", "CCGUCUUUAUGAGUCCGGC"); ($consens_struct, $consens_en) = RNA::alifold(\@align);
compute a simple consensus sequence or "most informative sequence" form an alignment. The simple consensus returns the most frequent character for each column, the MIS uses the IUPAC symbol that contains all characters that are overrepresented in the column.
$mis = consensus_mis(\@align);
Inverse Folding (from inverse.h)
find a sequence that folds into structure TARGET, by optimizing the sequence until its mfe structure (as returned by fold) is TARGET. Startpoint of the optimization is the sequence START. Returns a list containing the sequence found and the final value of the cost function, i.e. 0 if the search was successful. A random start sequence can be generated using random_string.
optimizes a sequence (beginning with START) by maximising the
frequency of the structure TARGET in the thermodynamic ensemble
of structures. Returns a list containing the optimized sequence and
the final value of the cost function. The cost function is given by
energy_of_struct(seq, TARGET) - pf_fold(seq)
, i.e.-RT*log(p(TARGET))
holds the value of the cost function where the optimization in
inverse_pf_fold
should stop. For values <=0 the optimization will
only terminate at a local optimimum (which might take very long to reach).
the string symbolset holds the allowed characters to be used by
inverse_fold
and inverse_pf_fold
, the default alphabet is "AUGC"
If non-zero stop optimization when its clear that no exact solution can be found. Else continue and eventually return an approximate solution. Default 0.
Cofolding of two RNA molecules (from cofold.h)
Global Variables to Modify Folding (from fold_vars.h)
Do not allow GU pairs to form, default 0.
allow GU only inside stacks, default 0.
Fold with specially stable 4-loops, default 1.
0 = BP; 1=any mit GC; 2=any mit AU-parameter, default 0.
How to compute dangling ends. 0: no dangling end energies, 1: "normal" dangling ends (default), 2: simplified dangling ends, 3: "normal" + co-axial stacking. Note that pf_fold treats cases 1 and 3 as 2. The same holds for the main computation in subopt, however subopt will re-evalute energies using energy_of_struct for cases 1 and 3. See the more detailed discussion in RNAlib.texinfo.
contains allowed non standard bases, default empty string ""
temperature in degrees Celsius for rescaling parameters, default 37C.
use logarithmic multiloop energy function in energy_of_struct, default 0.
consider only structures without isolated base pairs (helices of length 1). For pf_fold only eliminates pairs that can only occur as isolated pairs. Default 0.
list of base pairs from last call to fold. Better use the structure string returned by fold.
scaling factor used by pf_fold to avoid overflows. Should be set to exp(-F/(RT*length)) where F is a guess for the ensmble free energy (e.g. use the mfe).
apply constraints in the folding algorithms, default 0.
If 0 do not compute the pair probabilities in pf_fold (only the partition function). Default 1.
usually 'F'; 'C' require (1,N) to be bonded; 'M' backtrack as if the sequence was part of a multi loop. Used by inverse_fold
the base pairing prob. matrix computed by pf_fold.
Array of indices for moving withing the pr
array. Better use
get_pr.
from RNAstruct.h: these functions convert between strings representating secondary structures with various levels of coarse graining. See the documentation of the C library for details
Full -> HIT [incl. root]
Full -> Coarse [incl. root]
Full -> weighted Shapiro [i.r.]
{Tree} -> ({Tree}R)
add S for stacks to coarse struct
Full -> FFull
FFull -> Full
remove weights from coarse struct
computes structure statistics, and fills the following global variables:
$loops [int] number of loops (and stacks) $unpaired [int] number of unpaired positions $pairs [int] number of paired positions $loop_size[int *] holds all loop sizes $loop_degree[int *] holds all loop degrees $helix_size[int *] holds all helix lengths
from treedist.h: routines for computing tree-edit distances between structures
convert a structure string as produced by the expand_... functions to a Tree, useable as input to tree_edit_distance.
compare to structures using tree editing. T1
, T2
must have been
created using tree_edit_distance
mainly for debugging
free space allocated by make_tree
from stringdist.h routines to compute structure distances via string-editing
[ returns swString * ] make input for string_edit_distance
[ returns float ]
compare to structures using string alignment. S1
, S2
should be
created using Make_swString
from profiledist
[ returns (float *) ]
condense pair probability matrix pr
into a vector containing
probabilities for upstream paired, downstream paired and
unpaired. This resulting probability profile is used as input for
profile_edit_distance
[ returns float ]
align two probability profiles produced by Make_bp_profile
[ returns void ] print string representation of probability profile
[ returns void ] free space allocated in Make_bp_profile
Global variables for computing structure distances
set to 1 if you want backtracking
containes alignmed structures after computing structure distance with
edit_backtrack==1
0 usual costs (default), 1 Shapiro's costs
allocate memory from C. Usually not needed in Perl
die with error message. Better use Perl's die
libRNA uses the rand48 48bit random number generator if available, the current random number is always stored in $xsubi.
initialize the $xsubi random number from current time
returns a random number between 0 and 1 using the random number generator from the RNA library.
returns random integer in the range [FROM..TO]
current date in a string. In perl you might as well use locatime
returns a string of length LENGTH using characters from the string SYMBOLS
calculate hamming distance of the strings S1
and S2
.
pack secondary structure, using a 5:1 compression via 3 encoding. Returns the packed string.
unpacks a secondary structure packed with pack_structure
returns a pair table as a newly allocated (short *) C array, such that: table[i]=j if (i.j) pair or 0 if i is unpaired, table[0] contains the length of the structure.
returns the base pair distance of the two STRUCTURES. dist = {number of base pairs in one structure but not in the other} same as edit distance with open-pair close-pair as move-set
from PS_plot.h
write PostScript drawing of structure to FILENAME. Returns 1 on sucess, 0 else.
write PostScript drawing of structure to FILENAME. The strings PRE and POST contain PostScript code that is included verbatim in the plot just before (after) the data. Returns 1 on sucess, 0 else.
write structure drawing in gml (Graph Meta Language) to FILENAME. OPTION should be a single character. If uppercase the gml output will include the SEQUENCE as node labels. IF OPTION equal 'x' or 'X' write graph with coordinates (else only connectivity information). Returns 1 on sucess, 0 else.
write structure drfawing as coord file for SStructView Returns 1 on sucess, 0 else.
write structure drawing as ".ss" file for further editing in XRNA. Returns 1 on sucess, 0 else.
write a PostScript dot plot of the pair probability matix to FILENAME. Returns 1 on sucess, 0 else.
Select layout algorithm for structure drawings. Currently available 0= simple coordinates, 1= naview, default 1.
from read_epars.c
read energy parameters from FILENAME
write energy parameters to FILENAME
The package includes generic helper functions to access C arrays
of type int
, float
and double
, such as:
return the element INDEX from the array
set element INDEX to VALUE
allocate a new C array of integers with NELEM elements and return the pointer
deletes the C array by calling free()
substituting intP
with floatP
, doubleP
, ushortP
,
shortP
, gives the corresponding functions for arrays of float or
double, unsigned short, and short. You need to know the correct C
type however, and the functions work only for arrays of simple types.
Note, that the shortP... functions were used for unsigned short in previous
versions, while starting with v1.8.3 it can only access signed short arrays.
On the lowest level the cdata
function gives direct access to any data
in the form of a Perl string.
copies SIZE bytes at POINTER to a Perl string (with binary data)
copies the (binary) string STRING to the memory location pointed to by POINTER. Note: memmove is broken in current swig versions (e.g. 1.3.31)
In combination with Perl's unpack
this provides a generic way to convert
C data structures to Perl. E.g.
RNA::parse_structure($structure); # fills the $RNA::loop_degree array @ldegrees = unpack "I*", RNA::cdata($RNA::loop_degree, ($RNA::loops+1)*4);
Warning: using these functions with wrong arguments will corrupt your memory and lead to a segmentation fault.
Ivo L. Hofacker <ivo@tbi.univie.ac.at>