PSP format

This page describes the file format for Position-Specific Prior (PSP) scores assigned to the DNA positions of the sequences in an input file.
All required and optional fields are covered in case you need to supply such a file as input to MotifSuite.

In MotifSuite, a PSP file can optionally be supplied as input to MotifSampler, PhyloMotifSampler (PHMS) and NOrthoMotifSampler (NOMS).
Page contents:
    File format
    Conversion requirements
    Example



PSP file format

- PSP (position-specific prior) format is a text-based file format that assigns a numerical value to each DNA position, reflecting the a priori belief that the position is involved in active regulation. The file follows the same format rules as the input DNA sequences (FASTA format), except that each single-letter code [A, C, G, T or N] is replaced by a numerical value, and the numerical values are separated by whitespace. Each value is a non-negative integer or decimal score (use a dot as the decimal separator, not a comma) resulting from an in vivo, in vitro or computational predictive experiment.
- For input processing reasons, the identifiers (string following the '>' or '>>' symbols) in the PSP file must be identical to those used in the input FASTA file. The number of numerical values for a given sequence in the PSP file (on the lines following the identifier lines with symbol '>') must equal the number of single-letter codes of that sequence in the input FASTA file.
- Note that not every sequence in the FASTA file needs to be described in the PSP file. For a FASTA sequence without a PSP entry, our software internally sets all PSP scores to zero, so only the uniform prior applies during motif search (each position in the sequence is as likely to be involved in active regulation as not). Conversely, a PSP file may contain PSP information for sequences that are not described in the FASTA file. This PSP information is simply skipped during input processing.
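The rules above can be sketched in Python as a small consistency check between a FASTA file and a PSP file. This is a minimal illustration, not the parser used by MotifSuite: the function names (read_fasta, read_psp, align_psp) are hypothetical, identifiers are taken as the text after the leading '>' characters up to the first tab, and records introduced by '>' and '>>' are treated alike, which is an assumption.

```python
def _identifier(header_line):
    # Text after the leading '>' characters, up to the first tab (assumption).
    return header_line.lstrip(">").split("\t")[0].strip()


def read_fasta(text):
    """Return {identifier: sequence string} from FASTA-formatted text."""
    seqs, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = _identifier(line)
            seqs[current] = ""
        elif current is not None:
            seqs[current] += line
    return seqs


def read_psp(text):
    """Return {identifier: list of float scores} from PSP-formatted text."""
    scores, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # lines starting with '#' are skipped
        if line.startswith(">"):
            current = _identifier(line)
            scores[current] = []
        elif current is not None:
            # float() raises ValueError for entries such as "0,5"
            values = [float(token) for token in line.split()]
            if any(v < 0 for v in values):
                raise ValueError(f"negative PSP score for {current}")
            scores[current].extend(values)
    return scores


def align_psp(seqs, scores):
    """Pair each FASTA sequence with its PSP scores.

    Sequences without a PSP entry get all-zero scores; PSP entries
    without a matching FASTA sequence are ignored.
    """
    aligned = {}
    for name, seq in seqs.items():
        if name not in scores:
            aligned[name] = [0.0] * len(seq)
        elif len(scores[name]) != len(seq):
            raise ValueError(
                f"{name}: {len(scores[name])} scores for {len(seq)} positions"
            )
        else:
            aligned[name] = scores[name]
    return aligned
```

Running such a check before a motif search catches mismatched identifiers or score counts early; the identifiers and scores you pass in are your own data, and nothing here depends on MotifSuite itself.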



Conversion requirements

- There is no standard file extension for a text file containing PSP scores. Some examples of common file extensions are '.psp', '.prior' or simply '.txt'.
- No space is allowed between the '>' or '>>' symbol and the first character of the descriptive identifier. The '>' symbol and the descriptive identifier are required. A tab-separated description following the identifier is optional (you can use it to clarify your input).
- When a PSP file is loaded by our software, lines starting with the symbol '#' are skipped. This is useful if you want to (re)run our software using the PSP for only a subset of the input FASTA sequences. Note: when excluding the PSP information for a particular target gene or species using '#', you must mask both the identifier line and the PSP score lines describing that target gene or species. Failing to do this for every line involved may result in PSP data being wrongly assigned to the preceding (target gene or species) identifier.
- Numerical entries in the PSP file may be zero and are not limited to a scaled range of [0,1]. In other words, quantitative values resulting from an experiment (even with a threshold signal cutoff applied) can be provided directly as PSP input to our software. For more information, see Impact and Use of a PSP.
- The file should end with a blank line (a final line return) to ensure that the last sequence in the dataset is also loaded by the program.
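The '#' masking rule can be illustrated with a short Python sketch that comments out complete records, so that the identifier line and all of its score lines are masked together. The function name mask_records is hypothetical, the identifier is assumed to be the text after the leading '>' characters up to the first tab, and the two-level '>' / '>>' layout is not handled specially. The sketch also ends its output with a line return, reading the "blank line return" requirement as a terminating newline, which is an assumption.

```python
def mask_records(psp_text, identifiers):
    """Prefix '#' to every line of the records named in `identifiers`."""
    masked_lines = []
    masking = False
    for line in psp_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(">"):
            # A new record starts: decide whether it must be masked.
            name = stripped.lstrip(">").split("\t")[0].strip()
            masking = name in identifiers
        if masking and stripped and not stripped.startswith("#"):
            line = "#" + line
        masked_lines.append(line)
    # Terminate the output with a line return.
    return "\n".join(masked_lines) + "\n"
```

Masking whole records this way avoids the failure described above, where score lines left unmasked would be attached to the preceding identifier.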



Example

    For coregulated DNA sequences from one species (input for MotifSampler):
[Figure: EvgAc_psp.png]

    For orthologous DNA sequences from multiple species (input for PHMS and NOMS):
[Figure: Urs1h_psp.png]



Feedback

Contact us if you have comments, questions or suggestions, or simply want to respond to the contents of this guideline. Thank you.