MotifLocator Guidelines

Go to Evaluation of the output if you need quick help on running MotifLocator optimally.
If you have not used MotifLocator before, go through the following sections:
    Introduction
    MotifLocator Algorithm
    Output to the user
    Evaluation of the output



Introduction

MotifLocator performs a genome-wide screening of DNA sequences for instances of a given motif.
In what follows, we give a comprehensive description of MotifLocator and provide guidelines on selecting program parameter settings and evaluating the output. If you want to use this application, go back to the applications webpage or set up the MotifLocator command line in case of standalone use. Note: all program parameters described below have default settings in MotifLocator unless stated otherwise.



MotifLocator Algorithm

The input for MotifLocator is:
- a file (parameter -f) in Fasta format, describing the DNA sequences in which you want to search for motif instances,
- a file (parameter -m) in PWM format, containing a list of given motifs (further called prior motifs) for which you want to search instances,
- a file (parameter -b) in background model format, describing the background model for the genome in which you want to search for motif instances (see CreateBackgroundModel for the selection or creation of a suitable background model).
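
For standalone use, the three input files above, together with the output file (-o) and the threshold (-t) described further below, can be assembled into a command line. The following is a minimal sketch only: the executable name and all file names are hypothetical placeholders, and only flags mentioned in this guideline are used.

```python
def build_motiflocator_command(fasta, matrix, background, output,
                               threshold=None, executable="MotifLocator"):
    """Assemble an argument list from the flags documented in this guideline.

    ASSUMPTION: the executable name/path is a placeholder; adjust it to
    your own installation. File names are chosen by the caller.
    """
    cmd = [executable,
           "-f", fasta,        # Fasta file with the sequences to screen
           "-m", matrix,       # PWM file with the prior motifs
           "-b", background,   # background model file
           "-o", output]       # text output file with the reported instances
    if threshold is not None:
        cmd += ["-t", str(threshold)]  # threshold on the rescaled score
    return cmd

# Hypothetical file names, for illustration only.
cmd = build_motiflocator_command("sequences.fasta", "motifs.pwm",
                                 "background.bg", "instances.txt",
                                 threshold=0.9)
print(" ".join(cmd))
```

The returned list can be passed to subprocess.run once the executable path matches your installation.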

MotifLocator starts by screening the first sequence in the Fasta file for instances of the first prior motif in the matrix file. Each segment in this sequence that has the same length as the prior motif (called a segment for short) is scored by its probability of being a motif divided by its probability of being non-functional background (Fig.1).

Fig.1 : Computation of segment score W in sequence S (figure segmentscore2.png).

The numerator is computed by multiplying, over all positions in the segment, the frequency of the nucleotide observed at each position according to the PWM of the prior motif. Similarly, the denominator multiplies, over all positions in the segment, the frequency of the nucleotide observed at each position according to the background model.
In default strand mode (parameter -s = both strands), MotifLocator also scores all possible segments on the reverse complement of the sequence.
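
The segment scoring described above can be sketched in a few lines of Python. This is an illustration, not MotifLocator's implementation: it assumes a zero-order background (one frequency per nucleotide), a toy 3-position PWM, and it reports minus-strand segments at their forward-strand start position, which is a convention chosen here for simplicity.

```python
from math import prod

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def segment_score(segment, pwm, background):
    """W = product of PWM frequencies / product of background frequencies."""
    numerator = prod(pwm[j][base] for j, base in enumerate(segment))
    denominator = prod(background[base] for base in segment)  # zero-order assumption
    return numerator / denominator

def score_sequence(seq, pwm, background, both_strands=True):
    """Yield (start, strand, segment, W) for every segment of motif length."""
    width = len(pwm)
    for start in range(len(seq) - width + 1):
        segment = seq[start:start + width]
        yield start + 1, "+", segment, segment_score(segment, pwm, background)
        if both_strands:
            rc = reverse_complement(segment)
            yield start + 1, "-", rc, segment_score(rc, pwm, background)

# Toy PWM (frequencies per position) and uniform background, for illustration.
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

hits = list(score_sequence("TACGT", pwm, background))
best = max(hits, key=lambda h: h[3])
print(best)
```

Scoring the reverse complement of each segment is equivalent to scoring every segment of the reverse-complemented sequence.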

The higher the segment score, the more likely the segment is a true motif instance. However, a false motif instance that differs strongly from its background can score higher than a true motif instance that differs less from its respective background, which is why true motif instances cannot be retrieved by comparing the raw segment score against a default threshold. Instead, all segment scores are rescaled to values between 0 and 1. To do this, the segment scores Wmax and Wmin are computed for the most likely and the least likely instance, respectively, that can be derived from the prior PWM. For each segment in the sequence, the rescaled score is then:

Fig.2 : Rescaled segment score Ws (figure rescaledscore.png).

Finally, the segments that have a rescaled score Ws higher than the predefined threshold (parameter -t) are reported as instances for the prior motif in the selected sequence. The threshold has been empirically set to 0.85 in default mode. The output file reports the computed score of the segments that were selected as being motif instances of the prior motif. With parameter -a you can choose to display the computed value W of the segment score or the rescaled value Ws (default) that was used to compare against the score threshold.
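
The rescaling and thresholding step can be illustrated as follows. This is a sketch under explicit assumptions: the exact formula is the one given in Fig.2, so the min-max rescaling below (optionally on log-transformed scores) is only one plausible reading that maps Wmin to 0 and Wmax to 1. It also assumes a zero-order background, so that Wmax and Wmin can be found position by position, and a PWM without zero frequencies.

```python
from math import log, prod

def extreme_scores(pwm, background):
    """Return (W_max, W_min) by picking, per position, the base with the
    largest / smallest PWM-to-background ratio (valid for a zero-order background)."""
    ratios = [[column[b] / background[b] for b in "ACGT"] for column in pwm]
    return prod(max(r) for r in ratios), prod(min(r) for r in ratios)

def rescale(w, w_min, w_max, use_log=False):
    """ASSUMPTION: min-max rescaling to [0, 1]; whether MotifLocator applies it
    to raw or log-transformed scores is not confirmed here (see Fig.2)."""
    if use_log:  # requires strictly positive scores
        w, w_min, w_max = log(w), log(w_min), log(w_max)
    return (w - w_min) / (w_max - w_min)

def select_instances(scores, w_min, w_max, threshold=0.85, use_log=False):
    """Keep the segments whose rescaled score exceeds the threshold (-t)."""
    rescaled = {seg: rescale(w, w_min, w_max, use_log) for seg, w in scores.items()}
    return {seg: ws for seg, ws in rescaled.items() if ws > threshold}

# Toy PWM and uniform background, for illustration only.
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

w_max, w_min = extreme_scores(pwm, background)
# Raw segment scores W for the consensus "ACG" and a one-mismatch segment "ACT".
scores = {"ACG": 0.7 * 0.7 * 0.7 / 0.25 ** 3, "ACT": 0.7 * 0.7 * 0.1 / 0.25 ** 3}
selected = select_instances(scores, w_min, w_max)
print(sorted(selected))
```

Either variant is monotonic in W, so the ranking of segments is the same for W and Ws; only the position of the threshold on the score scale differs.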

The search for instances in the sequence at hand is repeated for every prior motif listed in the matrix file (parameter -m). If you only want this process to be repeated for a subset of the prior motifs listed in the matrix file, you can list the identifiers of the motifs of interest in a separate file (parameter -l). Note that the motif identifiers in the -l file must be written in exactly the same way as in the '#ID ='-field of the -m file.

The whole procedure of computing scores, selecting and reporting instances for every prior motif is repeated for every sequence listed in the Fasta file.



Output to the user

MotifLocator reports one text output file -o. For each sequence, the selected segments for each prior motif are listed in order of positional occurrence in the sequence. The format of the file is described in instances format. In short, each line describes
- the sequence identifier
followed by
- features of the selected motif instance (start and end position, score (W or Ws) and strand, i.e. orientation in the sequence)
to end with
- the identifier of the prior motif and a nucleotide description of the selected motif instance.

For example : sequenceXYZ   MotifLocator   misc_feature   274   283   0.888901   + . id "MA0001.1";   site "CATTAATTAG";
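
For downstream processing, a line like the example above can be split into its fields. The sketch below is based only on that example line; the field names are illustrative, and it assumes whitespace-separated columns followed by key "value"; attribute pairs.

```python
import re

# Matches attribute pairs such as: id "MA0001.1";
ATTRIBUTE = re.compile(r'(\w+)\s+"([^"]*)"')

def parse_instance_line(line):
    """Split one output line into named fields (names chosen for illustration)."""
    # Eight whitespace-separated columns, then the attribute string.
    seq_id, source, feature, start, end, score, strand, frame, attributes = \
        line.split(None, 8)
    attrs = dict(ATTRIBUTE.findall(attributes))
    return {
        "sequence": seq_id,
        "start": int(start),
        "end": int(end),
        "score": float(score),   # W or Ws, depending on parameter -a
        "strand": strand,
        "motif_id": attrs.get("id"),
        "site": attrs.get("site"),
    }

example = ('sequenceXYZ   MotifLocator   misc_feature   274   283   '
           '0.888901   + . id "MA0001.1";   site "CATTAATTAG";')
hit = parse_instance_line(example)
print(hit["motif_id"], hit["start"], hit["end"], hit["score"], hit["site"])
```

In this example the span from start to end (274 to 283) covers 10 positions, matching the length of the reported site.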



Evaluation of the output

- The number of reported instances grows exponentially as the threshold decreases and shrinks exponentially as it increases. As a consequence, the number of reported instances is sensitive to minor changes in the threshold setting (relative to the default setting). The threshold must be between 0 and 1; the closer it is to 1, the more stringent the screening and the fewer instances are reported.

MotifLocator's default threshold (0.85) has been set empirically based on tests with de novo detected motifs, which in general do not have extremely low or extremely high PWM values. Extreme PWM values tend to give overall higher rescaled segment scores, resulting in an increased number of reported instances. Therefore, a higher threshold is advised when performing a genome-wide scan for instances of well-described (database) motifs (for which the probability of observing a specific nucleotide is very high at many positions). If your PWM is described by counts (values can be higher than 1 and absolute zeros are also common) instead of frequencies (normalized values between 0 and 1), we recommend normalizing these counts into frequencies before supplying the file to MotifLocator (conversion guidelines are described in pwm format).
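
As an illustration of the count-to-frequency conversion mentioned above, the sketch below normalizes each PWM position so that its values sum to 1. The pseudocount used to avoid absolute zeros, and its value of 0.5, are assumptions made for this example; the conversion guidelines in pwm format remain the reference.

```python
def counts_to_frequencies(count_matrix, pseudocount=0.5):
    """Convert per-position nucleotide counts into frequencies.

    ASSUMPTION: a small pseudocount is added to every count so that no
    frequency is exactly zero; the value 0.5 is illustrative only.
    """
    frequencies = []
    for column in count_matrix:
        total = sum(column[b] for b in "ACGT") + 4 * pseudocount
        frequencies.append({b: (column[b] + pseudocount) / total for b in "ACGT"})
    return frequencies

# Toy count matrix with absolute zeros, for illustration only.
counts = [
    {"A": 8, "C": 0, "G": 2, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]
freqs = counts_to_frequencies(counts)
for column in freqs:
    print({base: round(value, 3) for base, value in column.items()})
```

Setting pseudocount=0 gives plain normalization, at the cost of keeping zero frequencies for unobserved nucleotides.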

Note that the threshold setting only determines the number of instances shown to the user; it does not interfere with the ranking of the reported instances by likelihood of being true motif instances.

The number of reported motif instances also depends on the length of the sequence that is being scanned. Each segment in the sequence is scored individually, so a longer sequence is expected to have more potential motif instances than a shorter one. On average, the number of reported instances grows linearly with sequence length.

- The choice of a correct background model (-b) not only influences the number of motif instances reported, but also the reported segment scores (which indicate the likelihood of the reported instances to be true instances of the given motif). Motifs differing largely in nucleotide composition from the background will be assigned a higher score. The choice of a background model that correctly represents the non-functional, non-coding sequences of the organism at hand is therefore crucial. Otherwise motifs that differ largely from the incorrect background will be falsely prioritized or true motifs that look similar to the incorrect background might be falsely downweighted. If one is not sure about the background model, it is safer to use a neutral zero-order background model (all single nucleotide frequencies are 0.25) as this better preserves the ranking of motif instances based on the scoring with the PWM (the background contribution in the segment score is then a constant factor for all segments).
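
The effect of the background model on the ranking can be shown with a small example. The sketch below assumes zero-order backgrounds and uses a toy 2-position PWM and a made-up AT-rich background; all values are illustrative. With the neutral background the denominator is the same constant (0.25 per position) for every segment, so the ranking equals the ranking by PWM probability alone, whereas a skewed background can reorder the segments.

```python
from math import prod

def segment_score(segment, pwm, background):
    """W = PWM probability of the segment / background probability of the segment."""
    numerator = prod(pwm[j][base] for j, base in enumerate(segment))
    denominator = prod(background[base] for base in segment)
    return numerator / denominator

def rank(segments, pwm, background):
    """Return the segments ordered from highest to lowest score W."""
    return sorted(segments, key=lambda s: segment_score(s, pwm, background),
                  reverse=True)

# Toy PWM and backgrounds, for illustration only.
pwm = [
    {"A": 0.5, "C": 0.1, "G": 0.3, "T": 0.1},
    {"A": 0.5, "C": 0.1, "G": 0.3, "T": 0.1},
]
neutral = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}   # hypothetical skewed background

segments = ["AA", "AG", "GG"]
neutral_ranking = rank(segments, pwm, neutral)
at_rich_ranking = rank(segments, pwm, at_rich)
print(neutral_ranking, at_rich_ranking)
```

Under the AT-rich background the G-containing segments are favoured because they are rare in the background, which is the kind of prioritization that becomes misleading when the background does not match the organism at hand.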

- If the reported rescaled scores Ws are of the same order of magnitude (e.g. all between 0.85 and 0.90), we recommend rerunning MotifLocator in mode -a = 1. The ranking of the reported motif instances will be exactly the same, only this time the reported scores are the absolute segment scores (W), which are numerically better suited to judging the relative importance of the reported motif instances. In general, we only consider motif instances with an absolute segment score W on the order of a thousand as reliable.



Feedback

Contact us if you have comments, questions or suggestions, or simply want to respond to the contents of this guideline. Thank you.