MotifSuite

Page contents :
Step 1 : Fundamental input: promoter sequences and genome-specific background model
Step 2 : Probabilistic de novo motif detection: MotifSampler
Step 3 : Summarizing motif detection results: MotifRanking and FuzzyClustering
Step 4 : Comparing motif models: MotifComparison
Step 5 : Genome-wide screening with a motif model: MotifLocator




Step 1 : Fundamental input: promoter sequences and genome-specific background model

As coregulated genes share some similarities in their regulatory mechanism, their promoter regions might contain common motif instances that are binding sites for transcription factors (TFs). A sensible approach to detecting these regulatory elements is to search for overrepresented instances in the promoter regions of a set of coregulated genes. Successful motif detection requires that the overrepresentation signal of the set of motif instances in the sequence set is sufficiently strong compared to the surrounding non-functional nucleotides, also called the background. To obtain a strong signal-to-noise ratio, you should ideally supply only those sequences for which you have sufficient evidence of coregulation, coming from e.g. cDNA microarrays, ChIP-chip or ChIP-seq.

Fig.1 : (left) A regulatory motif is overrepresented relative to the surrounding background DNA
in the promoter regions of co-expressed genes, identified by e.g. cDNA microarrays (right).

Probabilistic motif detection tools describe the sequence set using a motif model and a background model. The motif model is a position weight matrix (PWM) that gives, for each position of the overrepresented motif, the probability of observing each of the nucleotides A, C, G and T. The background model gives the probability that nucleotides belong to the remainder of the sequence set, which is assumed to be non-functional background data. It is the task of the motif detection tool to find the motif model that differs most significantly from the background model and is therefore a putative functional motif in the input sequence set.
In MotifSampler, the background model is kept fixed during the motif finding procedure and can be computed beforehand.
We supply precompiled background models (Marchal et al. 2003) for a list of prokaryotes and eukaryotes. You can also compile your own background model by running CreateBackgroundModel on your own set of carefully selected intergenic sequences.
Read more on CreateBackgroundModel Guidelines.
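
As a minimal sketch of the two models described above, the following Python snippet estimates a PWM from a few aligned motif instances and a background model from intergenic sequences. The function names, the pseudocount and the toy sequences are illustrative assumptions. The background here is the simplest possible (zero-order) model and does not reproduce the precompiled models or the output of CreateBackgroundModel.

```python
from collections import Counter

ALPHABET = "ACGT"

def build_pwm(instances, pseudocount=1.0):
    """Column-wise nucleotide probabilities from equal-length aligned instances."""
    width = len(instances[0])
    total = len(instances) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(width):
        counts = Counter(inst[pos] for inst in instances)
        pwm.append({nt: (counts[nt] + pseudocount) / total for nt in ALPHABET})
    return pwm

def build_background(sequences, pseudocount=1.0):
    """Zero-order background: overall nucleotide frequencies (an assumption)."""
    counts = Counter()
    for seq in sequences:
        counts.update(nt for nt in seq if nt in ALPHABET)
    total = sum(counts.values()) + pseudocount * len(ALPHABET)
    return {nt: (counts[nt] + pseudocount) / total for nt in ALPHABET}

# Toy data, invented for illustration only.
instances = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
intergenic = ["ATATTTAGCGATATTAAT", "TTTAAGCATTATAAGCAT"]

pwm = build_pwm(instances)
background = build_background(intergenic)
```

Each PWM column and the background are probability distributions over A, C, G and T. A motif model is interesting when its columns deviate clearly from the background frequencies.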



Step 2 : Probabilistic de novo motif detection: MotifSampler

MotifSampler is a probabilistic motif detection tool that searches the space of all possible overrepresented short sequences in the input sequence set (the search space) for the most overrepresented (most frequently conserved) motif. The search space may, however, contain several local optima, only some of which represent a true motif. Reducing the search space by setting the program parameters in a biologically relevant way decreases the program runtime and helps guide the algorithm towards those local optima that most likely correspond to true motifs. Important program parameters include the motif width, the number of different motifs, and information (e.g. prior distribution and maximum) on the number of instances of each motif to search for per sequence.
Because the motif detection algorithm in MotifSampler relies on a stochastic optimization scheme (a Gibbs sampling approach), it may report a different result (converge to a slightly different local optimum) each time a run is repeated on the same input sequence set, even with the same program parameter settings. Running a Gibbs sampler properly therefore means intentionally running it several times on the same input sequence set with the same (or slightly different) parameter settings, following the idea that the most pronounced motif in the set will be detected repeatedly. Summarizing (see step 3) the detection frequency information contained in the list of multiple solutions makes it possible to approach the global optimum, which hopefully corresponds to a true motif in the set.
Each solution (candidate motif) reported by MotifSampler is represented by a position weight matrix (PWM) giving the probability of the nucleotides A, C, G and T at each position, and by a list of instances annotated in the input sequence set that contributed to the PWM.
Read more on MotifSampler Guidelines.
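
The idea of repeating a stochastic run and tallying the solutions can be sketched in Python as follows. This is a deliberately simplified Gibbs site sampler, not the MotifSampler algorithm: it assumes exactly one instance per sequence, a single motif, a uniform zero-order background, and toy sequences invented for illustration. All function and parameter names are hypothetical.

```python
import random
from collections import Counter

ALPHABET = "ACGT"

def column_probs(sites, width, pseudocount=0.5):
    """PWM (list of dicts) estimated from the currently selected sites."""
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        pwm.append({nt: (counts[nt] + pseudocount) / total for nt in ALPHABET})
    return pwm

def gibbs_run(sequences, width, background, iterations=200, seed=None):
    """One stochastic run with exactly one instance per sequence (a simplification)."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        for i, seq in enumerate(sequences):
            # Re-estimate the PWM from all sequences except the current one.
            others = [s[p:p + width]
                      for j, (s, p) in enumerate(zip(sequences, starts)) if j != i]
            pwm = column_probs(others, width)
            # Weight each candidate start by its motif/background likelihood ratio.
            weights = []
            for p in range(len(seq) - width + 1):
                ratio = 1.0
                for k, nt in enumerate(seq[p:p + width]):
                    ratio *= pwm[k][nt] / background[nt]
                weights.append(ratio)
            # Sample (rather than maximize) the new start position.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    sites = [s[p:p + width] for s, p in zip(sequences, starts)]
    pwm = column_probs(sites, width)
    consensus = "".join(max(col, key=col.get) for col in pwm)
    return consensus, starts

background = {nt: 0.25 for nt in ALPHABET}
sequences = [
    "ACGTTGACTCAGGCT",
    "GGTGACTCATTACGA",
    "CCATGTTGACTCAAC",
    "TTGACTCAGGATCCA",
]

# Repeat the run with different seeds and tally the consensus of each solution.
tally = Counter(gibbs_run(sequences, 7, background, seed=s)[0] for s in range(10))
```

Individual runs may end in different local optima. The tally over runs shows which solution is found most consistently, which is the information that step 3 summarizes.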

Fig.2 (left) : MotifSampler applies program parameters (1) and multiple runs (2),
shaping a solution space from which the most likely true motifs have to be extracted in a next step (3).
Fig.3 (right) : PWM (1) and instances (2) representation of a solution reported by MotifSampler.



Step 3 : Summarizing motif detection results: MotifRanking and FuzzyClustering

MotifRanking and FuzzyClustering are two complementary methods to summarize a list of multiple motif detection solutions reported for a given sequence set. Both methods follow the principle that only motifs that are found consistently in several motif detection runs are statistically pronounced and will most likely approximate a true motif.

MOTIFRANKING uses the PWM format of the solutions reported by MotifSampler. The PWM output file from MotifSampler can be used directly as input file for MotifRanking.

Fig.4 : MotifRanking prioritizes motifs (PWMs) based on motif score and motif detection frequency.

MotifRanking sorts the solutions (motifs) reported by MotifSampler in descending order of their motif score (by default, the LogLikelihood score computed by MotifSampler) and groups solutions that represent the same motif using a matrix (PWM) comparison strategy. Grouping makes it possible to identify the number of different motifs and to count how many times each of these motifs was detected by MotifSampler (thus assessing their significance). The motif with the highest motif score has the highest probability of being a functional regulatory motif, provided it was detected with a minimal frequency among the multiple solutions reported by MotifSampler.
Read more on MotifRanking Guidelines.
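
The sort-then-group principle can be sketched in Python as follows. This is a hedged illustration, not the MotifRanking implementation: the PWM comparison is replaced by a simple mean total-variation distance between aligned columns, shifts and reverse complements are ignored, and the scores, threshold and helper names are invented for the example.

```python
def pwm_distance(p, q):
    """Mean total-variation distance between aligned PWM columns (a stand-in metric)."""
    per_column = [0.5 * sum(abs(cp[nt] - cq[nt]) for nt in "ACGT")
                  for cp, cq in zip(p, q)]
    return sum(per_column) / len(per_column)

def rank_motifs(solutions, threshold=0.25):
    """solutions: list of (score, pwm). Returns groups in descending score order."""
    groups = []
    for score, pwm in sorted(solutions, key=lambda s: s[0], reverse=True):
        for group in groups:
            # Same motif as an earlier, higher-scoring solution: count it.
            if pwm_distance(group["pwm"], pwm) < threshold:
                group["count"] += 1
                break
        else:
            # No similar group yet: this solution starts a new motif group.
            groups.append({"pwm": pwm, "score": score, "count": 1})
    return groups

def col(nt, p=0.85):
    """Toy PWM column dominated by one nucleotide."""
    rest = (1.0 - p) / 3
    return {n: (p if n == nt else rest) for n in "ACGT"}

motif_a = [col(n) for n in "TGACTCA"]
motif_a_variant = [col(n, 0.80) for n in "TGACTCA"]
motif_b = [col(n) for n in "CCGGAAT"]

# Hypothetical scores from four MotifSampler runs.
solutions = [(41.2, motif_a_variant), (55.0, motif_a), (30.5, motif_b), (47.9, motif_a)]
ranked = rank_motifs(solutions)
```

In this toy input the three TGACTCA-like solutions collapse into one group with a detection count of 3, ranked above the single CCGGAAT-like solution.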

FUZZYCLUSTERING is an alternative approach that evaluates the multiple solutions reported by MotifSampler at their instance level. The instances output file from MotifSampler can be used directly as input file for FuzzyClustering.

Fig.5 : FuzzyClustering extracts ensemble motifs by evaluating detected motifs at their instance level.
Membership scores make it possible to 1) prioritize the instances and motifs that correspond to the same ensemble motif,
2) calculate a PWM representation focused on the easily detectable (more reliable) instances of the ensemble motif
and 3) compute the ensemble motif detection frequency in the sequence set.

The multiple detected motifs and their instances listed in the input file are represented in a matrix (the matrix entries describe which instances belong to which detected motifs). FuzzyClustering uses a (spectral graph based) clustering technique to iteratively extract cohesive clusters (submatrices consisting mainly of non-zero values) from this matrix. Each extracted cluster stands for a set of instances and a set of detected motifs that all correspond to the same motif (the ensemble motif). Each instance and each motif in the cluster is assigned a membership score that reflects how well the instance (motif) represents the ensemble motif. This makes it possible to filter out instances (motifs) reported by MotifSampler for which there is not enough evidence that they correspond to an overrepresented motif.
The instance memberships are used to weight the contribution of each instance when calculating the PWM representation of the ensemble motif. The total number of detected motifs that correspond to the ensemble motif determines the motif detection frequency of the ensemble motif in the sequence set. Based on user-defined (or default) thresholds, FuzzyClustering only reports relevant ensemble motifs, i.e. those that have instances in a sufficiently high fraction of the given sequence set, a minimal motif detection frequency and a minimal PWM consensus score.
Read more on FuzzyClustering Guidelines.
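
How instance memberships can weight a PWM is sketched below in Python. The membership values are assumed to be given (the spectral clustering step itself is not reproduced), and the function name, the pseudocount and the membership cut-off are illustrative assumptions rather than FuzzyClustering parameters.

```python
ALPHABET = "ACGT"

def weighted_pwm(instances, memberships, min_membership=0.0, pseudocount=0.1):
    """PWM in which each instance contributes in proportion to its membership."""
    kept = [(seq, w) for seq, w in zip(instances, memberships) if w >= min_membership]
    width = len(kept[0][0])
    total = sum(w for _, w in kept) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(width):
        column = {nt: pseudocount for nt in ALPHABET}
        for seq, w in kept:
            column[seq[pos]] += w
        pwm.append({nt: column[nt] / total for nt in ALPHABET})
    return pwm

# Toy instances of one ensemble motif with hypothetical membership scores.
instances = ["TGACTCA", "TGACTCA", "TGAGTCA", "CGTCACA"]
memberships = [0.95, 0.90, 0.70, 0.10]

pwm_all = weighted_pwm(instances, memberships)
pwm_filtered = weighted_pwm(instances, memberships, min_membership=0.5)
```

Dropping the low-membership instance sharpens the columns of the PWM, which matches the higher degree of conservation expected for a membership-based representation.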

MOTIFRANKING VERSUS FUZZYCLUSTERING:
- MotifRanking generally has short running times, while running FuzzyClustering takes on the order of minutes.
- Both tools are equally easy to use (they have the same number of parameters), but FuzzyClustering, which is based on a more complex algorithm, might be less biologically intuitive than MotifRanking.
- The number of instances will in general be lower for a motif reported by FuzzyClustering than for the same motif reported by MotifRanking, because less reliable instances (with low instance membership) are removed from a cluster.
- The easier to detect, more conserved instances of a motif will in general have a higher membership than less conserved instances of the same motif. As a result, FuzzyClustering's PWM representation will most likely show a higher degree of conservation (in the form of a higher PWM consensus score) than the PWM for the same motif reported by MotifRanking. In that case, the PWM from FuzzyClustering better represents the underlying conserved signal of a true motif in the sequence set, which makes this PWM particularly suitable for use in MotifComparison and MotifLocator (see further).
- As a guideline for de novo motif detection, use MotifRanking to answer the question whether the sequence set at hand contains any motif at all, and use FuzzyClustering in a second step to retrieve the more reliable instances of a detected motif.
- Because the output format of both methods is the same (a PWM and a list of instances), their results can easily be compared to test to what extent both tools confirm each other's results, e.g. by using MotifComparison to compare the PWMs of the detected motifs or by comparing the instances of the reported motif in the input sequence set.



Step 4 : Comparing motif models: MotifComparison

You have now detected an overrepresented motif in a given sequence set, represented by a PWM (the instances format of the motif is not used in this step). MotifComparison tells you whether your motif (also called the query motif) corresponds to any previously described motif reported in curated databases, or to a motif you detected yourself in previous analyses.

Fig.6 : MotifComparison compares a similarity metric against a threshold to decide whether two PWMs are similar.
The similarity metric can be the KL distance (top) or the p-value of the BLiC score (bottom).

To do this, MotifComparison computes a similarity metric between the PWMs of the motifs to be compared and evaluates it against a threshold. The first metric is the widely used Kullback-Leibler distance (KL, based on the relative entropy between the PWMs being compared), with a threshold on KL deciding on similarity. The second metric is the more advanced BLiC similarity score (Bayesian Likelihood 2-Component, Habib et al. 2008), which compares the PWMs of the two motifs not only with a hypothetical common motif, but also with the background distribution of the genome to which the motif applies. In the BLiC score, motif positions where the nucleotide distribution (as described in the PWM) is similar to the background distribution are considered less relevant for motif similarity, as they do not contribute to the sequence-specific binding of the motif. In addition, there must be significant evidence that a high BLiC score indeed means that the two motifs are similar. This is assessed by computing the p-value of the BLiC score against a distribution of BLiC scores of randomly obtained non-similar motifs, and setting a threshold on that p-value.
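
The KL-based decision can be illustrated with a short Python sketch. It assumes two PWMs of equal width that are already aligned, uses the symmetric form of the Kullback-Leibler distance averaged over positions, and applies a hypothetical threshold. The exact formula, alignment handling and default thresholds of MotifComparison are not reproduced here, and BLiC is not implemented.

```python
import math

def kl_distance(pwm_a, pwm_b):
    """Symmetric Kullback-Leibler distance averaged over aligned positions.

    Requires strictly positive probabilities (e.g. PWMs built with pseudocounts).
    """
    total = 0.0
    for col_a, col_b in zip(pwm_a, pwm_b):
        for nt in "ACGT":
            total += 0.5 * (col_a[nt] - col_b[nt]) * math.log(col_a[nt] / col_b[nt])
    return total / len(pwm_a)

def col(nt, p=0.85):
    """Toy PWM column dominated by one nucleotide."""
    rest = (1.0 - p) / 3
    return {n: (p if n == nt else rest) for n in "ACGT"}

query = [col(n) for n in "TGACTCA"]
same_motif = [col(n, 0.80) for n in "TGACTCA"]
other_motif = [col(n) for n in "CCGGAAT"]

THRESHOLD = 0.4  # hypothetical cut-off, not the MotifComparison default
similar = kl_distance(query, same_motif) < THRESHOLD
different = kl_distance(query, other_motif) < THRESHOLD
```

A distance of zero means identical PWMs. The smaller the distance, the more similar the two motif models, so a query is reported as matching a database motif when the distance falls below the threshold.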

KL VERSUS BLiC:
The p-BLiC metric is particularly useful for comparing the conserved motif signal of PWMs in which multiple positions are degenerate (such motifs may be judged similar by KL only because of many meaningless similar degenerate positions), or when only one of the two PWMs being compared is very well described on multiple positions (when using KL, the non-similarity on such positions may mask the more meaningful similarity on other positions). The KL metric, however, requires fewer program parameters, and at the default MotifComparison threshold parameters it will in general only report similarity for highly similar PWMs. Both computations have a short runtime and can easily be used together to confirm or question each other's results.

The motifs to be compared by MotifComparison are listed in two files: an input file with query motifs and a database file with curated motifs. The motifs in both files must be in the PWM format as reported by our MotifSampler (ask us for a correct conversion of your own database file if needed). We supply the following precompiled databases of curated motifs: JASPAR, RegulonDB and TRANSFAC. Feel free to ask us to add more databases of interest.
Read more on MotifComparison Guidelines.



Step 5 : Genome-wide screening with a motif model: MotifLocator

To avoid too many false positive motif predictions, it is best to tune the parameters of de novo motif detection tools towards a regime that favors a high positive predictive value (PPV, i.e. few false positive motif instances among the predictions) at the expense of a lower sensitivity (i.e. some true motif instances are missed). To compensate for this low sensitivity, you can use MotifLocator in a final step to screen your sequence set for missed instances.

Fig.7 : MotifLocator computes a segment score (W) for each possible instance in the genome.

MotifLocator calculates, for each possible instance in the sequence set, how well it fits a given motif model (PWM) versus the genome-specific background model. A threshold on this background-corrected, PWM-based score determines whether a candidate instance in the sequence set is reported as a motif instance or not.
Read more on MotifLocator Guidelines.
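
A minimal Python sketch of such a screening is given below. It scores every window with a plain log-odds ratio of the PWM against a zero-order background and keeps windows above a threshold. The score definition, the background frequencies, the threshold and the toy sequence are assumptions made for illustration; the actual MotifLocator score may be defined and normalized differently.

```python
import math

def scan(sequence, pwm, background, threshold):
    """Return (start, segment, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        segment = sequence[start:start + width]
        # Log-odds of the segment under the motif model versus the background.
        score = sum(math.log(pwm[k][nt] / background[nt])
                    for k, nt in enumerate(segment))
        if score >= threshold:
            hits.append((start, segment, score))
    return hits

def col(nt, p=0.85):
    """Toy PWM column dominated by one nucleotide."""
    rest = (1.0 - p) / 3
    return {n: (p if n == nt else rest) for n in "ACGT"}

pwm = [col(n) for n in "TGACTCA"]
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # hypothetical AT-rich genome
sequence = "ACGTTGACTCAGGCTTGAGTCAAT"

hits = scan(sequence, pwm, background, threshold=5.0)
```

With these toy values the scan reports two windows: the exact consensus TGACTCA at position 4 and the one-mismatch variant TGAGTCA at position 15. Raising the threshold would keep only the exact match, which illustrates the trade-off between sensitivity and false positives.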



Feedback

Contact us if you have comments, questions or suggestions, or simply want to respond to the contents of this guideline. Thank you.