PWM format

This page describes the file format for the PWM representation of one or more motifs.
All required and optional fields are explained in case you need to supply such a file as input to MotifSuite.

In MotifSuite, a file in PWM format is reported as output by MotifSampler, MotifRanking and FuzzyClustering.
In MotifSuite, a file in PWM format is supplied as input for MotifRanking, MotifComparison and MotifLocator.
Page contents :
    File format
    Conversion requirements
    Example



PWM file format

PWM stands for Position Weight Matrix and describes the probability of finding each of the nucleotides A, C, G and T at each position of a motif.

The file starts with a comment line (#INCLUSive Motif Model) that refers to our program and serves as a file signature by which our applications recognize a PWM file that is loaded as input.

Next follows the PWM description of a first motif, starting with some comment lines.
The first comment line describes a unique motif identifier (#ID). The second comment line shows a motif score (#Score) which can be a score that is computed from the PWM or any other score that reflects the importance of the motif being described. The following two lines give the PWM length (#W) and a consensus description (#Consensus) of the motif. A consensus description is derived from the information available in the PWM ; it is a string-based sequence representation of the motif in IUPAC code symbols (A,C,G,T,n,s,w) that describes the most likely nucleotide(s) on each position in the motif (n = any of A,C,G,T, s = C or G, w = A or T).

The comment lines are immediately followed by the values that make up the PWM (matrix) : each line describes the tab-separated probabilities (Pr) for nucleotide A, C, G and T on a given position in the motif. The number of lines must equal the length of the motif (#W).
The probabilities described in a PWM can be frequencies (normalized values between 0 and 1 and the sum of a row equals 1), or they can be represented as counts (values can be higher than 1 and zeros are also common). Our applications in MotifSuite have been designed to work with (and report) PWMs that describe frequencies.
! MARK : decimal numbers in a PWM must be written with a DOT (not a comma), e.g. 0.54 (not 0,54).

        Pr(A,1)     Pr(C,1)     Pr(G,1)     Pr(T,1)
        Pr(A,2)     Pr(C,2)     Pr(G,2)     Pr(T,2)
        ...
        Pr(A,W)     Pr(C,W)     Pr(G,W)     Pr(T,W)

The motif description ends with a blank line. The second and following motifs are described in exactly the same way, each separated from the previous one by a blank line. The end of the file is recognized by the final blank line. Note that there is no explicit numbering of the motifs in the file.
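The layout described above can be read with a short parser. The following is a minimal sketch, not the loader used by MotifSuite: the function names are hypothetical, and the comment-line layout '#Key = value' is an assumption based on the #ID example given on this page.

```python
from typing import Dict, List


def parse_pwm_file(text: str) -> List[Dict]:
    """Parse PWM-format text into a list of motif dictionaries (illustrative)."""
    lines = text.splitlines()
    # The first line is the file signature checked by the loading applications.
    if not lines or not lines[0].startswith("#INCLUSive Motif Model"):
        raise ValueError("missing '#INCLUSive Motif Model' header line")

    motifs: List[Dict] = []
    current = None
    for line in lines[1:]:
        stripped = line.strip()
        if not stripped:
            # A blank line closes the motif that is currently being read.
            if current is not None:
                motifs.append(_finish(current))
                current = None
            continue
        if current is None:
            current = {"fields": {}, "matrix": []}
        if stripped.startswith("#"):
            # Assumed comment-line layout: '#Key = value'.
            key, _, value = stripped[1:].partition("=")
            current["fields"][key.strip()] = value.strip()
        else:
            # Matrix row: probabilities for A, C, G, T at one motif position.
            # float() rejects a decimal comma such as '0,54'.
            row = [float(x) for x in stripped.split()]
            if len(row) != 4:
                raise ValueError(f"expected 4 values per row, got {len(row)}")
            current["matrix"].append(row)
    if current is not None:
        # This sketch tolerates a missing final blank line.
        motifs.append(_finish(current))
    return motifs


def _finish(motif: Dict) -> Dict:
    """Check that the number of matrix rows equals the declared length #W."""
    width = int(motif["fields"]["W"])
    if len(motif["matrix"]) != width:
        raise ValueError(f"#W = {width} but {len(motif['matrix'])} rows found")
    return motif
```

Each returned dictionary holds the comment fields under "fields" and the matrix rows under "matrix", in the order in which the motifs appear in the file.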

Some comment lines may display typical information when reported by a specific application in MotifSuite (this section assumes you have read and understood the guidelines of the respective applications in MotifSuite) :

- #ID in MotifSampler : the motif identifier typically starts with 'box', followed by underscore-separated information consisting of the number of the MotifSampler run, the number of the detected motif within this run and a consensus representation of the motif.
Example : #ID = box_3_1_ATTCCTACnnnTGTArGA.

- #ID in FuzzyClustering : the motif identifier consists of underscore('_')-separated fields : '#id: box', the number of the reported motif in the file (simple sequential numbering), and the symbol 'CS' or 'W', indicating whether the reported PWM is the representation with the highest consensus score (CS) or with the longest motif length (W), followed in [brackets] by the length, consensus description and consensus score of the motif being described. The end may carry additional information : (%InstanceCut) the fractional threshold that was used to remove unreliable instance predictions of this motif, (%DataCut) the fractional threshold that was used to remove data-motifs that do not represent this motif, (nbrSeq) the number of sequences with at least one instance of the motif, (nbrInst) the number of instances of the motif and (nbrData) the number of data-motifs that also detected this motif. Note : data-motifs are the motifs described in the input file supplied to FuzzyClustering ; each data-motif represents a motif detected by one run of a motif finder.
Example : #id: box_1_CS[18,ATTCCTACnnnTGTArGA,1.44974]_%InstanceCut=0.34_%DataCut=0.41_nbrSeq=9_nbrInst=9_nbrData=21.

- #Score in MotifSampler : the reported value is the log-likelihood score of the motif being described. The log-likelihood score is not a score that is computed from the PWM. It is a score that is maximized during the motif detection process to find the motif in the dataset that optimally balances motif conservation with the number of motif instances.
- #Score in MotifRanking : the reported score is the score that was used for sorting the candidate motifs in descending motif score order. The type of score that was used is specified with parameter -m in MotifRanking.
- #Score in FuzzyClustering : the reported score is the consensus score (CS) of the PWM, which is a measure for the conservation of the motif.
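
As an illustration of the MotifSampler identifier layout described above, the following sketch splits such an identifier into its parts. The function name is hypothetical, and the sketch assumes that the identifier follows the typical 'box_<run>_<motif>_<consensus>' pattern ; identifiers that deviate from it are rejected.

```python
def parse_motifsampler_id(motif_id: str) -> dict:
    """Split 'box_<run>_<motif>_<consensus>' into its parts (illustrative)."""
    parts = motif_id.split("_", 3)
    if len(parts) != 4 or parts[0] != "box":
        raise ValueError(f"not a typical MotifSampler identifier: {motif_id!r}")
    _, run, motif_number, consensus = parts
    return {"run": int(run), "motif": int(motif_number), "consensus": consensus}


info = parse_motifsampler_id("box_3_1_ATTCCTACnnnTGTArGA")
# info["run"] is 3, info["motif"] is 1, info["consensus"] is the consensus string
```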



Conversion requirements

- When a PWM file is used as input in MotifSuite, '#INCLUSive Motif Model', #ID and #W are always required. #Consensus is never required ; it is only reported as informational output by MotifSuite. #Score is only required for MotifRanking.

- Always make sure your file ends with a blank line. If not, the last line will not be loaded by the program and loading errors will occur !

- Public databases may provide PWM descriptions as counts instead of frequencies. When a PWM (with frequencies or with counts) is loaded by a MotifSuite application, the entries on each row are always (re-)normalized into frequencies after first adding a small pseudocount to each entry of the PWM. Adding pseudocounts is done to avoid the occurrence of (almost) zero values in the PWM. Such zero or extremely low probabilities are unlikely in the framework of a PWM (there is never 100% certainty about the (non-)occurrence of a given nucleotide) and they may confound the results in our MotifSuite applications.
The pseudocount in all MotifSuite applications that work with a PWM has been empirically set to 0.0001, assuming that a PWM is described by frequencies. When your PWM is however described by counts and it has many zero values, this default pseudocount may not be sufficient to avoid extremely low frequencies in the internally normalized matrix. We therefore recommend in general that you change the PWM entries into frequencies BEFORE loading the PWM file into MotifSuite. This can simply be done by summing the nucleotide counts for A,C,G,T on one motif position and dividing each entry on this position by the summed count. You repeat this for every position in the PWM. Zero counts will stay zero frequencies, but this is fine as they will be handled by the empirically set pseudocount in our application when loading the PWM file.
Note that the same pseudocount 0.0001 is applied when applications in MotifSuite write a PWM to an output file. If absolute zeros are displayed in a PWM reported by our applications, then this is only because of truncated numbers in the reported precision format (in that case the total sum of a row may not perfectly equal 1).
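
The conversion from counts to frequencies described above can be sketched as follows. The function names are hypothetical. The second function only illustrates the idea of adding the pseudocount 0.0001 to every entry and renormalizing each row ; the exact formula used inside MotifSuite is not given on this page, so that part is an assumption.

```python
from typing import List

PSEUDOCOUNT = 0.0001  # value stated in this guideline


def counts_to_frequencies(counts: List[List[float]]) -> List[List[float]]:
    """Divide each entry by the summed count of its motif position (row)."""
    freqs = []
    for row in counts:
        total = sum(row)
        if total <= 0:
            raise ValueError("each position needs at least one counted nucleotide")
        freqs.append([c / total for c in row])
    return freqs


def add_pseudocount(freqs: List[List[float]],
                    pseudocount: float = PSEUDOCOUNT) -> List[List[float]]:
    """Add a pseudocount to every entry and renormalize each row (assumed formula)."""
    smoothed = []
    for row in freqs:
        total = sum(row) + len(row) * pseudocount
        smoothed.append([(p + pseudocount) / total for p in row])
    return smoothed


# Hypothetical count matrix : one row per motif position, columns A, C, G, T.
counts = [[8, 0, 2, 0],
          [0, 10, 0, 0]]
freqs = counts_to_frequencies(counts)  # zero counts stay zero frequencies
smoothed = add_pseudocount(freqs)      # no entry is exactly zero any more
```

Only the first step is needed before supplying the file to MotifSuite, since the pseudocount is applied by the applications themselves when the file is loaded.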

- Public databases may provide the PWM in a transposed format, consisting of 4 lines where each line describes the probability for respectively A,C,G,T on each motif position. You need to change the orientation of the PWM to the layout described above (one line per motif position, with four values per line). A script (e.g. in Perl) can be used to automatically transpose the matrix rows into matrix columns.
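
The transposition mentioned in the last item can be sketched in a few lines. This example uses Python rather than Perl ; the function name is hypothetical, and the input is assumed to be four equally long rows in the order A, C, G, T.

```python
from typing import List


def transpose_pwm(rows_by_nucleotide: List[List[float]]) -> List[List[float]]:
    """Turn 4 rows (A, C, G, T) of W values into W rows of 4 values."""
    if len(rows_by_nucleotide) != 4:
        raise ValueError("expected exactly 4 rows, one per nucleotide A, C, G, T")
    if len({len(row) for row in rows_by_nucleotide}) != 1:
        raise ValueError("all 4 rows must have the same number of positions")
    # zip(*rows) groups the values that belong to the same motif position.
    return [list(position) for position in zip(*rows_by_nucleotide)]


# Hypothetical transposed matrix for a motif of length 2.
transposed = [[0.7, 0.1],   # A
              [0.1, 0.7],   # C
              [0.1, 0.1],   # G
              [0.1, 0.1]]   # T
pwm = transpose_pwm(transposed)
```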



Example

! MARK : decimal numbers in a PWM must be written with a DOT (not a comma), e.g. 0.54 (not 0,54).

[Figure : pwmfile.png, example of a PWM file]
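
A file in this format can also be produced programmatically. The sketch below is only an illustration under stated assumptions : the function name and dictionary keys are hypothetical, the matrix values are invented for the example and are not MotifSuite output, the '#Key = value' layout of #Score and #W is assumed from the #ID example on this page, and the first motif is assumed to follow the signature line directly. The optional #Consensus line is left out.

```python
from typing import Dict, List


def format_pwm_file(motifs: List[Dict]) -> str:
    """Build the text of a PWM file from a list of motif dictionaries."""
    out = ["#INCLUSive Motif Model"]
    for motif in motifs:
        matrix = motif["matrix"]
        out.append(f"#ID = {motif['id']}")
        if "score" in motif:
            out.append(f"#Score = {motif['score']}")
        out.append(f"#W = {len(matrix)}")
        for row in matrix:
            # Tab-separated values ; 'f' formatting always uses a decimal dot.
            out.append("\t".join(f"{p:.4f}" for p in row))
        out.append("")  # blank line that closes this motif
    return "\n".join(out) + "\n"


example = format_pwm_file([
    {"id": "box_1_1_AC", "score": 1.5,
     "matrix": [[0.7, 0.1, 0.1, 0.1],
                [0.1, 0.7, 0.1, 0.1]]},
])
print(example, end="")
```

The returned text ends with a blank line, as required for loading, and the number of matrix rows written always matches the reported #W.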



Feedback

Contact us if you have comments, questions or suggestions, or if you simply want to respond to the contents of this guideline. Thank you.