Case study conservationPSP

    Setup of the test dataset
    Performance of MotifSampler with a conservation-based PSP



Setup of the test dataset

The use of a position-specific prior (PSP), and of different types of conservation-based PSPs, has been studied extensively in combination with PRIORITY (Gordan et al., 2009). PRIORITY is conceptually similar to MotifSampler in that both are Gibbs sampling, overrepresentation-based motif finders. In this case study, we demonstrate that integrating a PSP into MotifSampler yields the same performance increase as demonstrated for PRIORITY on 62 yeast datasets. At the same time, we demonstrate that a conservation-based PSP is a powerful way to integrate phylogenetic motif conservation into our Gibbs sampling algorithm.

We use yeast datasets derived from ChIP-chip experiments, each measuring the binding of a single transcription factor (TF) in S. cerevisiae (Harbison et al., 2004). These data have been used extensively in case studies of many motif discovery algorithms, which allows for a straightforward benchmark. We retain the 62 datasets for which Gordan et al. (2009) created PSPs based on an alignment-free conservation score (PSP-C) and on a discriminative conservation-based score (PSP-DC). The known motif in each of these datasets is 6 to 8 positions long, is not too degenerate and has at least 10 TFBS in forward or reverse orientation in the dataset. The conservation score is computed by counting how many times a DNA site is conserved in an orthologous dataset of closely related species, irrespective of its location or orientation in these sequences. For the discriminative score, the same conservation score is also computed on a dataset from the ChIP-chip experiments in which the selected TF did not bind. The discriminative score at each position in the sequence is then computed as the ratio of the conservation score in the bound dataset to the sum of the conservation scores in the bound and unbound datasets. Intuitively, the discriminative score distinguishes sites that are highly conserved in the bound dataset from sites that are highly conserved throughout the genome, which makes the score more selective for the specific TF.
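The two scores described above can be illustrated with a minimal Python sketch. It is a simplification for illustration only, not the scoring procedure of Gordan et al. (2009): the function names are hypothetical, a site counts as conserved in an ortholog when it occurs there exactly in either orientation, and a zero denominator is mapped to a score of 0, which is an assumption of this sketch.

```python
def reverse_complement(site):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(site))


def conservation_score(site, orthologs):
    """Count the orthologous sequences that contain the site anywhere,
    in forward or reverse orientation (alignment-free)."""
    rc = reverse_complement(site)
    return sum(1 for seq in orthologs if site in seq or rc in seq)


def position_scores(sequence, orthologs, width):
    """Conservation score of the site starting at each position of a sequence."""
    return [
        conservation_score(sequence[i:i + width], orthologs)
        for i in range(len(sequence) - width + 1)
    ]


def discriminative_score(bound_score, unbound_score):
    """Ratio of the bound score to the sum of bound and unbound scores.
    Returning 0.0 for a zero denominator is an assumption of this sketch."""
    total = bound_score + unbound_score
    return bound_score / total if total > 0 else 0.0


# Toy (made-up) orthologous sequences, not real yeast data.
toy_orthologs = ["TTAACG", "CGTTCC", "GGGG"]
toy_scores = position_scores("AACG", toy_orthologs, 3)
```

A site that is equally conserved in the bound and unbound datasets gets a discriminative score of 0.5, whereas a site conserved only in the bound dataset gets a score of 1.0.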



Performance of MotifSampler with a conservation-based PSP

We run MotifSampler (parameter settings in appendix-1) on each of the 62 datasets in 3 different trials. The first trial does not use any PSP, the second trial uses PSP-C, the alignment-free conservation-based prior, and the last trial uses the discriminative conservation-based prior PSP-DC.

Fig.1 shows the number of datasets (out of 62) in which MotifSampler successfully found the literature motif in any of the trials (last column) and in each of the 3 trials (MS, MS-C, MS-DC). The table also includes results for PRIORITY and 7 other motif finders (taken from Gordan et al., 2009). Details on the successes in the individual datasets are available in appendix-2.

Fig.1 : Number of successes obtained by MotifSampler (MS) with different PSPs on 62 datasets.
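As an illustration of how the columns of Fig.1 relate to each other, the per-trial counts and the "any trial" count can be tallied as in the sketch below. The function name and the dataset labels are hypothetical placeholders; they are not results from this study.

```python
def tally_successes(results):
    """results maps a trial name to the set of datasets in which the
    target motif was found. Returns per-trial counts and the number of
    datasets solved by at least one trial (the union over all trials)."""
    per_trial = {trial: len(found) for trial, found in results.items()}
    solved_in_any = set().union(*results.values())
    return per_trial, len(solved_in_any)


# Made-up dataset labels, for illustration only.
example_results = {
    "MS": {"TF_A"},
    "MS-C": {"TF_A", "TF_B"},
    "MS-DC": {"TF_B", "TF_C"},
}
per_trial, any_trial = tally_successes(example_results)
```

Because a dataset solved in several trials is counted only once in the union, the "any trial" count can be smaller than the sum of the per-trial counts.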

The results show a low number of successes for detection with MotifSampler without a PSP (MS). We add two remarks to show that this low performance does not diminish the value of MotifSampler as such. First, only the highest-scoring significant motif was retained for validation. We cannot exclude that the target motif for a given dataset was reported as a lower-scoring motif, nor did we verify whether the retained motif matched another true motif (other than the target literature motif). Secondly, two other motif finders (PhyloCon and PhyME) reach a performance of only the same order, although they use an extra level of information (conservation in orthologs) in their motif finding algorithms. The low number of successes of MotifSampler is therefore attributed to the abundant presence of locally optimal solutions that confound the detection of the target motif.
Using a conservation-based prior (MS-C) proves to be a good remedy against convergence to solutions that do not have many occurrences in related species (the target motif is now found in more than 10 additional datasets). Using a discriminative criterion (MS-DC) adds another 10 successes, demonstrating the TF-selective power of the discriminative PSP. Overall, the target motif is found by MotifSampler (with or without prior) in 45 datasets, confirming that our detector is competitive with other motif finders.

All of the 7 other motif finders are alignment-based and have incorporated orthology into their motif detection algorithms with a varying degree of complexity, which does not necessarily correlate with the number of successes in the results table. For example, Converge and PhyME are both EM-based and use evolutionary distances in their model, yet their overall performance on the 62 datasets differs significantly. PhyloGibbs and CompareProspector both use Gibbs sampling and bias the search towards conserved windows, again with different overall success. Our other phylogenetic motif finders (PHMS and NOMS) also use rather advanced approaches to incorporate orthology, and it would be interesting to add their performance to the above table. However, NOMS and PHMS can only be run in the combined coregulation-orthology space, whereas PRIORITY and MotifSampler are run in the coregulation space only (note that we could not determine from Gordan et al. (2009) in which space some or all of the 7 other motif finders were run). In the coregulation space, the number of sequences is significantly lower, which likely implies a stronger motif-to-background signal ratio and gives an intrinsic advantage to trials that only search in the coregulation space. The performance of particular motif finders may also depend strongly on the characteristics of the datasets, such as the number of orthologs involved, the dependency (phylogenetic proximity) between the orthologs or the quality of the pre-alignment. We therefore anticipate that a quantitative assessment of the number of successes of NOMS and PHMS on the 62 yeast datasets would show neither superior nor inferior performance. Nevertheless, this study does confirm that using a conservation-based (discriminative) PSP is a valuable approach to incorporate orthology in motif detection.


References

- Gordan, R., Narlikar, L., & Hartemink, A. (2009). Finding regulatory DNA motifs using alignment-free evolutionary conservation information. Nucleic Acids Research, 38(6):e90.
- Harbison, C., Gordon, D., Lee, T., Rinaldi, N., Macisaac, K., Danford, T., et al. (2004). Transcriptional regulatory code of a eukaryotic genome. Nature, 431:99-104.