MotifComparison Guidelines

If you need quick help on running MotifComparison optimally, go directly to Evaluation of the output.
If you have not used MotifComparison before, go through the following steps :
Introduction
MotifComparison Algorithm
Output to the user
Evaluation of the output

Introduction

MotifComparison is designed to compute the similarity between motifs coming from two input files (a query file and a database file). MotifComparison can be used on any two lists of motifs as long as the motifs to be compared are in the PWM format used throughout MotifSuite.

In what follows, we give a comprehensive description of MotifComparison and we provide guidelines on selecting program parameter settings and evaluating the output. If you want to use this application, go back to the applications webpage or set up the MotifComparison command line in case of standalone use. Note : all program parameters described below have default settings in MotifComparison unless stated otherwise.

MotifComparison Algorithm

The inputs for MotifComparison are two files, each containing one or more motifs represented by a Position Weight Matrix (PWM). A PWM is a matrix describing the probability of finding each of the nucleotides A,C,G,T at each position of the motif. The entries are preferably frequencies (values between 0 and 1), but counts (numbers higher than 1, with zeros being common) are also allowed (PWM format describes conversion requirements in case your input file was generated outside MotifSuite). In what follows, we call the 'unknown' motifs listed in the query file (parameter -m) the query motifs. The database file (parameter -d) lists the motifs coming from a public or personal database, further called database motifs. When you use precompiled database files, you must verify that both files use the same type of PWM entries (either frequencies or counts).
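As an illustration of the difference between count and frequency entries, the following minimal Python sketch converts a count matrix into frequencies. It is not part of MotifSuite: the function names, the in-memory matrix layout (one row per motif position, columns in the order A,C,G,T) and the optional pseudocount are assumptions made for this example only.

```python
def counts_to_frequencies(pwm, pseudocount=0.0):
    """Convert each PWM row (A, C, G, T) into frequencies that sum to 1."""
    freqs = []
    for row in pwm:
        total = sum(row) + 4 * pseudocount
        if total <= 0:
            raise ValueError("PWM row has no counts")
        freqs.append([(v + pseudocount) / total for v in row])
    return freqs


def looks_like_counts(pwm, tol=1e-6):
    """Heuristic (an assumption): rows that do not sum to 1 hold counts."""
    return any(abs(sum(row) - 1.0) > tol for row in pwm)


# Hypothetical three-position motif given as nucleotide counts.
count_pwm = [[8, 0, 1, 1], [0, 10, 0, 0], [2, 2, 3, 3]]
freq_pwm = counts_to_frequencies(count_pwm)
```

A check such as looks_like_counts can be run on both input files before a comparison to confirm that they hold the same type of entries.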

MotifComparison starts by comparing the first motif from the query file with the first motif from the database file, computing a similarity metric between their respective PWMs. Two types of similarity metric are implemented in MotifComparison (parameter -l): the Kullback-Leibler distance (KL, default mode) and the p-value of the Bayesian Likelihood 2-Component score (p-BLiC). If you want to use both the KL and p-BLiC metrics, you have to run MotifComparison once with each metric.

The KL metric computes the relative entropy (Kullback-Leibler divergence) between the query and database motif at each aligned position. The relative entropy at a given aligned position directly compares the nucleotide distributions (as described in the PWM) of the query and database motif at this position (Fig.1).

Fig.1 : Computation of the KL distance at 1 aligned position of two PWMs being compared.
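The per-position computation can be sketched in Python as follows. This is a hedged illustration that assumes the textbook definition of the Kullback-Leibler divergence, D(p||q) = sum over A,C,G,T of p * log(p/q). The natural logarithm and the small smoothing constant eps (used to avoid taking the logarithm of zero) are assumptions and may differ from what MotifComparison does internally.

```python
import math


def kl_position(p, q, eps=1e-6):
    """Relative entropy D(p || q) between two nucleotide distributions."""
    # eps is an assumed smoothing constant that avoids log(0) for zero entries.
    p_s = [v + eps for v in p]
    q_s = [v + eps for v in q]
    p_tot, q_tot = sum(p_s), sum(q_s)
    return sum(
        (a / p_tot) * math.log((a / p_tot) / (b / q_tot))
        for a, b in zip(p_s, q_s)
    )


# Hypothetical aligned position: an A-rich query column against a uniform column.
query_col = [0.7, 0.1, 0.1, 0.1]
db_col = [0.25, 0.25, 0.25, 0.25]
forward = kl_position(query_col, db_col)   # query compared with database
backward = kl_position(db_col, query_col)  # database compared with query
```

The two directions give different values, which illustrates that the KL distance is not symmetric.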

Each aligned position individually contributes to the KL distance between query and database motif. To compensate for differences in length of the query and database motif, the sum of the KL distances computed at each aligned position is averaged over the length of the overlapping part between the two aligned motifs. As different alignments between two motifs may give a different KL distance, all possible alignments that respect the minimum required overlap (parameter -x) and the maximal allowed shift (parameter -s) are evaluated (Fig.2).

Fig.2 : Multiple possible alignments when comparing the similarity between two PWMs. The rectangles represent the query motif (green) and database motif (blue), both of length w, visualized in 3 shifted alignments : maximal shift(s) to the left, zero shift, and maximal shift to the right. The grey area visualizes the overlap (x) between the query and database motif in a particular alignment.
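The effect of the two constraints can be explored with a small Python sketch that lists the alignments admitted for a given pair of motif lengths. The definition of the shift used here (offset of the database motif start relative to the query motif start, with zero meaning that both motifs start at the same position) is an assumption for illustration. The function name is hypothetical, and alignments with the reverse complement are not counted.

```python
def admitted_shifts(len_query, len_db, max_shift, min_overlap):
    """List (shift, overlap) pairs allowed by max_shift (-s) and min_overlap (-x)."""
    admitted = []
    for shift in range(-max_shift, max_shift + 1):
        start = max(0, shift)                  # first overlapping query position
        end = min(len_query, shift + len_db)   # one past the last overlapping position
        overlap = end - start
        if overlap > 0 and overlap >= min_overlap:
            admitted.append((shift, overlap))
    return admitted


# Two motifs of length 8 with a maximal shift of 1 and a minimal overlap of 6.
narrow = admitted_shifts(8, 8, max_shift=1, min_overlap=6)
# Same motifs with an unrestricting shift and an overlap of half the motif length.
wide = admitted_shifts(8, 8, max_shift=8, min_overlap=4)
```

In this sketch the first setting admits three alignments, whereas the second admits nine, because only the overlap constraint remains active.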

In default program mode, -s allows an alignment shift of only 1 nucleotide and -x requires an alignment overlap of at least 6 nucleotides. The two parameters -s and -x are introduced mainly to handle comparisons between motifs of different lengths : allowing for sufficient shifts while conserving a minimal overlap between aligned regions accommodates the comparison of longer motifs (for which many shifts might be needed to find the optimal alignment), while the minimal overlap is needed to constrain the allowed shifts for shorter motifs. When all input motifs have the same length, only one parameter is needed to restrict the admitted alignments: the minimum overlap constraint (-x) is typically set to half of the (fixed) motif length and the maximal shift (-s) to an unrestricting high value (e.g. equal to the motif length). Note that the evaluated alignments also include alignments of the query motif with the reverse complement of the database motif, as the query motif may occur in the opposite orientation to the one described in the input file.
Eventually, the best alignment (i.e. the one with the lowest KL distance) is retained to judge the similarity of the two motifs at hand. Because the KL distance is not symmetric, the same KL computation is repeated using the database motif as 'query' and the query motif as 'database motif'. The results of both computations are averaged into the final KL distance for the two motifs at hand (Fig.3).

Fig.3 : KL distance between two motifs being compared, w is the overlapping part of the two aligned motifs.
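The complete procedure (averaging over the overlap, searching the admitted alignments on both strands, and averaging the two directions) can be sketched in Python as below. This is an illustration under stated assumptions and not the MotifComparison implementation: the per-position formula, the smoothing constant eps, the shift convention and all function names are hypothetical. Each direction keeps its own best alignment before the two values are averaged.

```python
import math


def kl_position(p, q, eps=1e-6):
    """Relative entropy D(p || q) for one aligned position (assumed formula)."""
    p_s = [v + eps for v in p]
    q_s = [v + eps for v in q]
    p_tot, q_tot = sum(p_s), sum(q_s)
    return sum((a / p_tot) * math.log((a / p_tot) / (b / q_tot))
               for a, b in zip(p_s, q_s))


def reverse_complement(pwm):
    """Reverse the positions and swap A<->T and C<->G (columns are A, C, G, T)."""
    return [row[::-1] for row in reversed(pwm)]


def best_directed_kl(a, b, max_shift, min_overlap):
    """Lowest overlap-averaged KL of motif a against motif b over all alignments."""
    best = math.inf
    for candidate in (b, reverse_complement(b)):
        for shift in range(-max_shift, max_shift + 1):
            start = max(0, shift)
            end = min(len(a), shift + len(candidate))
            overlap = end - start
            if overlap < max(min_overlap, 1):
                continue  # alignment violates the minimal overlap (-x)
            total = sum(kl_position(a[i], candidate[i - shift])
                        for i in range(start, end))
            best = min(best, total / overlap)
    return best


def symmetric_kl(query, db, max_shift=1, min_overlap=6):
    """Average of the two directed distances, as described in the text."""
    return 0.5 * (best_directed_kl(query, db, max_shift, min_overlap)
                  + best_directed_kl(db, query, max_shift, min_overlap))


def column(base):
    """Hypothetical helper: a position that strongly prefers one nucleotide."""
    return [0.7 if b == base else 0.1 for b in "ACGT"]


motif = [column(b) for b in "ACGTAAC"]
other = [column(b) for b in "TTTTTTT"]
same = symmetric_kl(motif, motif)
flipped = symmetric_kl(motif, reverse_complement(motif))
different = symmetric_kl(motif, other)
```

In this sketch a motif compared with itself or with its own reverse complement gives a distance of practically zero, while the two unrelated example motifs stay above the 0.4 threshold.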

The more similar the PWMs of the query and database motif, the lower the KL distance between these two motifs. The KL distance is zero if the two motifs being compared have identical PWMs. Finally, two motifs are judged similar if the KL distance between their PWMs is below a given similarity threshold (parameter -t). In MotifComparison, the default KL threshold has been empirically set to 0.4.

The BLiC score in the p-BLiC metric has two aspects. First, it compares the PWMs of the query and database motif with a PWM that represents a 'common' motif to which both motifs would correspond if they were similar. Second, it assigns less importance to similar motif positions that do not differ significantly from the genome-specific background distribution, as such positions are assumed not to contribute to the sequence-specific binding of the motif. These two aspects are translated into two terms in the BLiC score (Fig.4).

Fig.4 : Computation of the BLiC score at 1 aligned position of two PWMs being compared.

The first term of the BLiC score computes the ratio of the probability that the query and database motif are each described by a common PWM versus the probability that the two motifs are each described by their own PWM. The common PWM is unknown and is therefore estimated from the nucleotide counts of the query and database motifs and a Dirichlet prior pseudocount for each nucleotide A,C,G,T. These prior counts are described in a file (parameter -p) starting with the symbol '>' followed by underscore-separated values for the prior counts on A,C,G,T, i.e. >countA_countC_countG_countT. The second term in the BLiC score computes the ratio of the probability that the query and database motif both correspond to the common motif described above versus the probability that the two motifs being compared correspond to the background. The background is described by the single nucleotide distribution (#snf) in the genome-specific background model (parameter -b).
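To make the role of the prior counts concrete, the Python sketch below parses a prior line in the >countA_countC_countG_countT format and estimates one position of the common PWM. The estimator shown (pooled counts plus prior counts, then normalized, i.e. the posterior mean under a Dirichlet prior) is an assumption chosen for illustration, and the function names are hypothetical. The exact estimator used by MotifComparison may differ.

```python
def parse_prior_line(line):
    """Parse '>countA_countC_countG_countT' into four prior counts."""
    line = line.strip()
    if not line.startswith(">"):
        raise ValueError("prior line must start with '>'")
    values = [float(v) for v in line[1:].split("_")]
    if len(values) != 4:
        raise ValueError("expected four prior counts (A, C, G, T)")
    return values


def common_position(query_counts, db_counts, prior):
    """Assumed estimator: normalized sum of both motif counts and the prior."""
    pooled = [q + d + a for q, d, a in zip(query_counts, db_counts, prior)]
    total = sum(pooled)
    return [v / total for v in pooled]


# Hypothetical aligned position given as nucleotide counts (A, C, G, T).
query_counts = [8, 0, 1, 1]
db_counts = [6, 2, 1, 1]

low_prior = parse_prior_line(">0.01_0.01_0.01_0.01")
high_prior = parse_prior_line(">100_100_100_100")
with_low = common_position(query_counts, db_counts, low_prior)
with_high = common_position(query_counts, db_counts, high_prior)
```

With low prior counts the estimated common position follows the motif data closely, whereas large prior counts pull it towards the prior distribution and suppress the information in the two motifs.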

Each aligned position individually contributes to the BLiC score between query and database motif. The BLiC score is computed for all possible alignments (as admitted by parameter -x and -s) between the query and database motif and the alignment with the highest BLiC score is retained for further evaluation. The BLiC score is symmetric by design so swapped computation is not needed.
The higher the BLiC score, the more two PWMs represent the same motif that differs from the background. The BLiC score of non-similar motifs that differ significantly from the background may however be higher than the BLiC score of similar motifs that differ less from their background, which explains why the similarity of two motifs cannot be judged by evaluating the BLiC score against a default BLiC threshold (as was done with KL). Instead, the p-value of the BLiC score (p-BLiC) makes it possible to assess whether the BLiC score is significantly higher than a set of BLiC scores computed for a negative control set (i.e. a set of motif pairs that are non-similar and differ from the background to the same extent as the query and database motif). The computation of the p-value is illustrated in Fig.5 and reflects the probability that two motifs are falsely classified as similar motifs given that the motif pairs in a negative control set are non-similar motifs.

Fig.5 : Theoretical illustration of computing the p-value of BLiC (p-BLiC). BLiC(1,2) is the BLiC score computed for a given query and database motif. The dots represent how frequently each value for BLiC was obtained amongst the BLiC scores computed for the query-database motif and the motif pairs of the negative control set. The grey surface area represents how likely it is to obtain a BLiC value higher than BLiC(1,2) in the computed set of BLiC scores given all motifs in the negative control set are non-similar.

The negative control set for a given query and database motif is obtained by repeatedly (parameter -n times) shuffling the positions in the query and database motif in a random way and computing the BLiC score for the shuffled motifs after each shuffling. The query and database motif are classified as similar motifs if the p-value of their BLiC score is below the p-value threshold (parameter -t, default set to 0.001).
[Note: the p-BLiC metric is based on -but not identical to- the BLiC design described in Habib et al. (2008). Read here details on the differences between the two methods.]
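The shuffling procedure can be sketched in Python as follows. This is only a schematic illustration: the score function used here is a simple stand-in and NOT the BLiC score, and estimating the p-value as the fraction of shuffled scores that reach the observed score is an assumption about how the null distribution is used. All names are hypothetical.

```python
import random


def toy_score(a, b):
    """Stand-in similarity score (NOT BLiC): negative mean squared difference."""
    n = min(len(a), len(b))
    total = sum((x - y) ** 2 for i in range(n) for x, y in zip(a[i], b[i]))
    return -total / n


def shuffled(pwm, rng):
    """Return a copy of the motif with its positions in random order."""
    rows = list(pwm)
    rng.shuffle(rows)
    return rows


def empirical_p_value(query, db, score, n_shuffles=100, seed=0):
    """Fraction of shuffled motif pairs scoring at least as high as the real pair."""
    rng = random.Random(seed)
    observed = score(query, db)
    null = [score(shuffled(query, rng), shuffled(db, rng))
            for _ in range(n_shuffles)]
    exceed = sum(1 for s in null if s >= observed)
    avg = sum(null) / len(null)
    stdev = (sum((s - avg) ** 2 for s in null) / len(null)) ** 0.5
    return exceed / n_shuffles, avg, stdev


def column(base):
    """Hypothetical helper: a position that strongly prefers one nucleotide."""
    return [0.7 if b == base else 0.1 for b in "ACGT"]


query = [column(b) for b in "ACGTAC"]
database = [column(b) for b in "ACGTAC"]
p_value, avg, stdev = empirical_p_value(query, database, toy_score, n_shuffles=200)
```

Fixing the seed makes a run reproducible. Without a fixed seed, the average, standard deviation and p-value vary slightly between runs, and more shuffles reduce that variation.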

Fig.6 summarizes the differences between the two similarity metrics that are implemented in MotifComparison.

Fig.6 : Parameter -l sets the type of similarity metric used in MotifComparison, and consequently how the judgment on similarity is done and the number of program parameters used by MotifComparison.

In both cases, whether you choose the KL or the BLiC score, the comparison procedure is repeated for every possible pair of query and database motifs present in the two files supplied to MotifComparison. Note that the admitted set of alignments (determined by parameters -s and -x) between two motifs being compared depends on the length of both motifs and is recomputed for every combination of two motifs that are being compared from the two input files. So at all times, make sure you have reasonable entries for both the shift (-s) and overlap (-x) parameters, as these influence the set of evaluated alignments significantly (as demonstrated in Fig.7).

Fig.7 : Example of increasing the minimal required overlap (-x) without adjusting the maximal allowed shift (-s).

Output to the user

The output file (parameter -o) of MotifComparison prints the result of comparing each query motif with each database motif on separate lines, or (if you choose 'do not print ## lines') only prints the results for query and database motifs that were found similar.

The similarity of a query and database motif is visualized by the first symbol on the line: only lines that do NOT start with the symbol '##' describe similar motifs.
Each line has the following tab- or ';'-separated contents (further discussed below in 'Evaluation of the output'):
- query:query_consensus : the identifier and consensus description of the query motif (corresponding to the motif description in the input file -m)
- match:match_consensus : the identifier and consensus description of the database motif (corresponding to the motif description in the input file -d)
- score : the computed similarity score (i.e. the KL distance or BLiC score, depending on your choice for parameter -l)
- shift : the shift of the database motif in the most-similar alignment to the query motif. For KL (which is a non-symmetrical metric) the second value reports the shift of the query motif in the most-similar alignment to the database motif.

The following fields are only reported for the p-BLiC metric :
- overlap : length of the overlapping region of the most-similar alignment of the query and database motif
- strand : strand of the database motif in the most-similar alignment to the query motif, '+' is the orientation of the database motif as it was present in the database file, '-' means the reverse complement of the database motif is most similar to the query motif
- p-value : p-value of the BLiC score of the query and database motif, this value is compared with the threshold set by parameter -t to judge on similarity
- common : the number of motif pairs in the negative control set that have the same most-similar shift/overlap/strand as the query and database motif
- (n:avg,stdev) : n = number of shuffles; avg and stdev = average and standard deviation of the BLiC scores of the negative control set.
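When post-processing an output file, the '##' convention can be used to keep only the similar motif pairs. The Python sketch below shows one possible way to do this. The example lines are invented for illustration and do not come from a real MotifComparison run, and the separator detection is an assumption based on the 'tab- or ;-separated' description above.

```python
def similar_lines(lines):
    """Keep result lines for motif pairs judged similar (no leading '##')."""
    kept = []
    for line in lines:
        text = line.strip()
        if not text or text.startswith("##"):
            continue  # skip empty lines and non-similar pairs
        kept.append(text)
    return kept


def split_fields(line):
    """Split one result line on ';' if present, otherwise on tabs (assumed)."""
    separator = ";" if ";" in line else "\t"
    return [field.strip() for field in line.split(separator)]


# Invented example lines: query, match, score, shift.
example = [
    "query1:ACGTAC\tdbA:ACGTAA\t0.21\t0,0",
    "##query1:ACGTAC\tdbB:TTTTTT\t1.35\t1,-1",
]
hits = [split_fields(line) for line in similar_lines(example)]
```

The first two fields of each retained line can then be split on ':' to separate the motif identifier from its consensus.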

Evaluation of the output

- In the KL metric, the same score is computed for equally similar aligned positions regardless of whether they are informative or not, whereas the BLiC score assigns less importance to non-informative than to informative aligned positions (Fig.8-a). To compensate for overestimating the similarity of two motifs that both have non-informative aligned positions, the default threshold for KL has been set to a stringent value (default -t = 0.4) in MotifComparison. Using a stringent threshold makes the KL metric specific in detecting true similarities, possibly at the cost of missing some less prominent similarities. The default p-BLiC threshold (-t = 0.001) has a statistical meaning and, in our experience, is less stringent than the KL threshold. Repeating MotifComparison with the p-BLiC metric thus makes it possible to pick up candidate similarities that were missed by the KL metric, possibly at the expense of recovering some false similarities (e.g. if the dissimilarity of non-informative positions is underestimated, Fig.8-b). When multiple query motifs were found similar to a given database motif, the magnitude of the BLiC score is most suitable to prioritize the query motifs by their similarity on informative positions to the database motif.

Fig.8 (based on Habib et al. (2008)) : Distinguishing between informative and non-informative positions: a) two pairs of aligned motifs are shown, both of which have three identical positions and two different ones. The identical positions in the left pair are non-informative, the identical positions in the right pair are informative. The BLiC score can distinguish between these two types of similarities and assign a higher score to the right pair. b) in a third alignment, the database motif is informative and the query motif is non-informative on the centered aligned positions. The BLiC score may assign a higher similarity to the centered aligned positions than the KL score would do. The graph visualizes the KL and BLiC similarity scores (normalized to scale [-1,+1]) computed for the three alignments.

Only for p-BLiC:
- Because the shuffling process to create the negative control set is performed randomly, the BLiC score distribution of the negative control set (characterized by its average ('avg') and standard deviation ('stdev')), and thus also the reported p-value, might deviate slightly if you repeat MotifComparison on the same query and database motif with the same program parameter settings. However, the overall result should remain the same (the query and database motif being classified as either similar or non-similar). If this is not the case, we advise increasing the number of shuffles (parameter -n) to decrease the effect of the random sampling on the negative control set.
- The statistical assessment based on a p-value assumes that the negative control set used to calculate the null distribution represents a true distribution of BLiC scores of non-similar motifs. The random shuffling process that is used in MotifComparison to create the negative control set could however, by chance, fail to introduce enough dissimilarity in the given query and database motifs. Therefore, the number of shuffles (parameter -n) must be sufficiently higher than the maximal motif length in the input files, or at least equal to the default value (20), so that there are enough BLiC scores available to build a distribution from. As a rough indicator of the suitability of a negative control set, the number of motif pairs in the negative control set that have the same shift/overlap/strand as the given query and database motif (described by 'common') should be approximately zero. Note that increasing the number of shuffles does not bias the classification towards similarity; it only seeks stronger evidence that the reported similarity classification is correct. For palindromic motifs (that have the same consensus in either direction) it might be more difficult to create a reliable negative control set by random shuffling. In those cases, the KL metric might be more suitable.

- In default mode, the Dirichlet counts (parameter -p) are set to a uniform distribution with a low count (i.e. 0.01) for A,C,G,T. The Dirichlet pseudocount allows introducing some prior knowledge on the consensus of the common PWM with which the query and database motif are compared. The higher these Dirichlet counts, the more similar the query and database motif should be to the prior nucleotide distribution you provide in order to be considered mutually similar. An example of its use is to enter higher Dirichlet counts for G and C than for A and T when you expect to have a common GC-rich motif. It is advised to keep the absolute count lower than 1; otherwise the query and database motif information is suppressed by the Dirichlet counts in the construction of the common PWM.

- For the background model (parameter -b), the uniform nucleotide distribution (#snf = 0.25 0.25 0.25 0.25) is in general a good choice, as the user is in most cases not interested in the similarity of such undefined positions in a motif. An exception is, for example, the evaluation of motifs with a variable spacer, for which the KL metric might be more suitable.

Feedback

Contact us if you have comments, questions or suggestions, or simply want to react to the contents of this guideline. Thank you.