Guidelines: De-novo step-by-step approach

This step-by-step approach offers an integrated use of the different MotifSuite applications to optimally detect an overrepresented motif signal in a given sequence set (Fig. 1).

Step 1) run MotifSampler at default program settings (search for mainly one instance per sequence). Optionally, you can use a higher biasing prior (search for 1-to-2 or 2-to-3 instances per sequence); this is not recommended unless you have firm evidence that the motif is well represented in each sequence of the dataset.
=> if the number of reported solutions (reported as MotifSampler's convergence rate, Conv) is lower than 50 (or 70 to be on the safe side), repeat MotifSampler and search for an exact number of instances per sequence (set parameter -p to search for exactly 1 instance per sequence).
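The decision rule in step 1 can be sketched as a small helper. This is only an illustration: the function and argument names are hypothetical, and it assumes the number of reported solutions (Conv) has already been read from the MotifSampler output. The thresholds 50 and 70 are the ones given above.

```python
def needs_exact_one_rerun(num_solutions: int, safe: bool = False) -> bool:
    """Return True if MotifSampler should be repeated with exactly
    one instance per sequence (hypothetical helper, not part of MotifSuite).

    num_solutions: number of solutions reported by MotifSampler (Conv).
    safe: use the stricter threshold of 70 instead of 50.
    """
    threshold = 70 if safe else 50
    # "lower than 50 (or 70)" is read here as a strict comparison.
    return num_solutions < threshold
```

For example, a run that reports 60 solutions passes the default check but would be repeated when `safe=True` is used.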

The list of multiple solutions reported by MotifSampler is now sufficiently long to be suitable for assessment of the significance of the solutions:
Step 2) run MotifRanking at default program settings on the multiple solutions (in PWM format) reported by MotifSampler.

You now have a list of different motifs with their significance (measured by return ratio RR).
On this list:
Step 3) remove motifs with a return ratio lower than 10% (or 20% to be on the safe side) as unreliable predictions.
Step 4) only retain truly different motifs by running MotifComparison with the p-BLiC metric to identify motifs that describe the same underlying motif signal (supply the PWM output file of MotifRanking as both query and database file to MotifComparison). If similar motifs are identified, only retain the motif with the highest LL score.
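Steps 3 and 4 amount to a filter followed by a deduplication, which can be sketched as below. This is a minimal illustration under stated assumptions: the `Motif` record, the function name, and the `similar_pairs` input are hypothetical. The sketch assumes that the return ratio and LL score of each motif have already been parsed from the MotifRanking output, and that the pairs of motifs judged similar have been taken from the MotifComparison output. Treating similarity as transitive (if A matches B and B matches C, all three form one group) is a choice made for this sketch, not something the tools prescribe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Motif:
    motif_id: str
    return_ratio: float  # RR as a fraction, e.g. 0.25 for 25%
    ll_score: float


def filter_and_deduplicate(motifs, similar_pairs, min_rr=0.10):
    """Drop low-RR motifs, then keep the highest-LL motif per similarity group."""
    # Step 3: remove motifs whose return ratio is below the threshold.
    kept = {m.motif_id: m for m in motifs if m.return_ratio >= min_rr}

    # Step 4: group similar motifs with a small union-find structure.
    parent = {motif_id: motif_id for motif_id in kept}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in similar_pairs:
        if a in kept and b in kept:
            parent[find(a)] = find(b)

    # Within each group, retain only the motif with the highest LL score.
    best = {}
    for m in kept.values():
        root = find(m.motif_id)
        if root not in best or m.ll_score > best[root].ll_score:
            best[root] = m
    return sorted(best.values(), key=lambda m: m.ll_score, reverse=True)
```

Passing `min_rr=0.20` applies the stricter threshold mentioned in step 3.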

At this point, you have found a (set of) overrepresented motif(s) for the given sequence set that are well described by their PWM representation reported by MotifRanking. The higher the LL score and the return ratio of a given detected motif, the more likely it describes a biologically true signal in the dataset.

Step 5) the most accurate description of the true occurrences of a detected motif in the sequence set (also called instances) is obtained as follows: (*)
- run FuzzyClustering at default program settings on the multiple solutions (in instances format) reported by MotifSampler.
(Optionally:) on the list of motifs reported by FuzzyClustering:
- run MotifComparison with the p-BLiC metric to cross-validate the motifs that describe the same significant motifs retained by MotifRanking (obtained at the end of step 4) (supply the PWM output files of MotifRanking and FuzzyClustering as query and database file to MotifComparison).
REMARK: The more degenerate the detected motif (i.e. the lower its consensus score), the more we recommend using the motif model (PWM) reported by FuzzyClustering in further motif assessment applications such as MotifComparison (to search for similar motifs in curated databases) and MotifLocator (to scan for motif instances in genome-wide DNA sequences).

Step 6) evaluate whether another motif search attempt is of interest:
- If no instances were detected in a substantial fraction (say 30%) of the sequences in the sequence set (you can verify this in the text output file of FuzzyClustering), remove those sequences from the sequence set and repeat motif detection (starting again at step 1) on the reduced dataset, where the motif signal may now be easier to detect (the signal-to-noise ratio is higher).
- The following indicators suggest that the true number of motif instances may be underestimated: the detected motif model (reported by MotifRanking) has a high return ratio (RR>40%) and/or all motif instance scores (scores reported by MotifSampler or memberships reported by FuzzyClustering) are of the same order.
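The first check in step 6 can be sketched as follows. This is an illustration only: the function name and inputs are hypothetical, and it assumes the identifiers of the sequences in which instances were found have already been extracted from the FuzzyClustering text output. Reading "say 30%" as "30% or more" is also an assumption of this sketch.

```python
def skim_sequence_set(all_ids, ids_with_instances, min_missing_fraction=0.30):
    """Decide whether to repeat motif detection on a reduced sequence set.

    all_ids: identifiers of all sequences in the set (assumed unique).
    ids_with_instances: identifiers of sequences with at least one instance.
    Returns (repeat, sequence_ids), where sequence_ids is the reduced set
    when repeat is True and the unchanged set otherwise.
    """
    all_ids = list(all_ids)
    if not all_ids:
        raise ValueError("the sequence set is empty")

    hit = set(ids_with_instances)
    kept = [seq_id for seq_id in all_ids if seq_id in hit]

    # Fraction of sequences in which no instance was detected.
    missing_fraction = (len(all_ids) - len(kept)) / len(all_ids)
    repeat = missing_fraction >= min_missing_fraction
    return repeat, (kept if repeat else all_ids)
```

If `repeat` is True, the returned identifiers define the reduced dataset on which step 1 is started again.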

Step 7) optimize the final description of a detected motif:
- if you want to avoid reporting false instances (possibly at the cost of missing some true ones): repeat MotifSampler with a uniform prior (-p no bias), followed by steps 1-5 above. Optionally, you may first want to remove sequences from the sequence set in which no instances were detected at this stage. If the same motif is found as in the first search attempt (you can check this using MotifComparison), the motif model and instances description reported by FuzzyClustering are in general the best representation that MotifSampler can obtain for this motif.
- if you want the flexibility to find more weakly conserved instances (possibly at the cost of including some false instances): run MotifLocator with the earlier detected motif model to screen for more instances in each sequence. The threshold score cutoff in MotifLocator lets you find only strongly conserved instances (high threshold) or also more weakly conserved ones (default or lower threshold). This method does not update the earlier obtained motif model.


(*) Instead of step 5, you can also retrieve the instances description reported by MotifSampler for each significant motif reported by MotifRanking (the motif identifier is the same in the output files).
In that case (and especially if the motif was obtained by enforced allocation in each sequence, i.e. MotifSampler parameter -p = exactly 1), false positives must be removed by evaluating the instance score against a self-defined threshold (instances with a lower score than other instances of this motif are less reliable). We nevertheless recommend using FuzzyClustering, because it removes false positives with a threshold-free method and can also identify true instances that were not reported in the MotifSampler solution selected by MotifRanking.
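The threshold-based alternative described in this note can be sketched as a simple filter. This is an illustration only: the function name and the tuple layout are hypothetical, it assumes the instances have already been parsed from the MotifSampler output, and the threshold value remains your own choice.

```python
def filter_instances(instances, min_score):
    """Keep only instances whose score reaches a self-defined threshold.

    instances: iterable of (sequence_id, position, score) tuples
               (hypothetical layout).
    min_score: user-chosen cutoff; lower-scoring instances are dropped.
    """
    return [inst for inst in instances if inst[2] >= min_score]
```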


