PhD achievements

Fig. 1: The basic pipeline (MotifSampler-MotifRanking) is extended with a set of complementary functionalities
that optimize the input for motif detection, improve the quality of the output of motif detection,
and open up the search space for less well explored types of motifs.

This page summarizes the MotifSuite pipelines developed in a recent PhD (Claeys, 2014).
Justification
(2012) Search for more weakly overrepresented motifs
(2012) Refine significance assessment at the motif instance level
(NEW 2014) Search for motifs phylogenetically conserved in related species
(NEW 2014) Integrate any type of position-specific regulatory prior information (PSP)

Justification:

MotifSampler was developed in a previous PhD (Thijs, 2003) and belongs to the class of Gibbs sampling, overrepresentation-based tools intended for use in the coregulation space. The tool was included in benchmark studies (Tompa et al., 2005; Hu et al., 2005; Singh et al., 2008), forms part of ensemble platforms (Tmod (Sun et al., 2010), MotifLab (Klepper and Drablos, 2013)) and can be regarded as a competitive player in the field of computational motif detection. MotifSampler is, however, prone to reporting local optima (overrepresented solutions that are not involved in regulation): in in-house usage, motifs are typically retained at a significance of the order of 10 to 20% (as assessed by MotifRanking), which does not indicate a convincing, strongly overrepresented motif signal.

Next-generation sequencing generates new sets of unexplored sequences daily for more than 260000 genomes, and high-throughput in vivo and in vitro protein-DNA binding experiments produce large amounts of regulatory evidence (in general context-specific and at low nucleotide resolution). The goal of our recent work (Claeys, 2014) is to employ these new information sources to make MotifSampler more resistant to reporting locally optimal solutions, while extending its application domain in line with new regulatory insights, so that it remains competitive for the prediction of motifs that can further increase our understanding of transcriptional regulatory pathways. The complementary functionalities of the pipelines developed are summarized in Figure 1.

Search for more weakly overrepresented motifs:

The basic MotifSampler (Thijs, 2003) is equipped with a probabilistic estimation of the number of instances to be annotated per sequence during Gibbs sampling. It uses a prior parameter that biases the motif search towards mainly one instance in each sequence of the dataset. Although a higher number is possible, this ‘mainly one’ setting proved to be (and still is) the best for finding the core of a motif. Other occurrences are typically retrieved with a motif scanning tool (MotifLocator) using the detected core motif.
We made two modifications to the probabilistic framework for estimating the number of instances per sequence. First, we introduced a stochastic element that gives the Gibbs sampler the chance to pick up more or fewer than the estimated (~prioritized) number of instances of a detected motif during consecutive iteration steps. Second, we offer a wider range of settings for the prior parameter, allowing the user to bias the number towards a setting other than ‘mainly one’. The ‘uniform’ prior does not bias towards any particular number of instances in any sequence, whereas the ‘fixed’ prior overrides the probabilistic estimation and demands exactly the same number of annotations in each sequence. The impact on motif detection performance was evaluated on a benchmark E. coli dataset.
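
The interplay between the prior setting and the stochastic element can be illustrated with a minimal Python sketch. It is not MotifSampler's implementation: the function names, the probability mass assigned to ‘mainly one’ (p_one) and the example likelihoods are all assumptions made for illustration. The sketch builds a prior over the number of instances in one sequence, combines it with hypothetical per-count likelihoods, and then draws a count at random instead of always taking the most probable one.

```python
import random

def instance_count_prior(kind, max_n, fixed_n=1, p_one=0.9):
    """Prior over 0..max_n instances in one sequence (illustrative values only)."""
    if max_n < 1:
        raise ValueError("max_n must be at least 1")
    if kind == "uniform":
        # No preference for any particular number of instances.
        return [1.0 / (max_n + 1)] * (max_n + 1)
    if kind == "fixed":
        # All mass on one count: the estimation is overridden.
        return [1.0 if n == fixed_n else 0.0 for n in range(max_n + 1)]
    if kind == "mainly_one":
        # Most mass on one instance, the rest spread evenly (assumed shape).
        rest = (1.0 - p_one) / max_n
        return [p_one if n == 1 else rest for n in range(max_n + 1)]
    raise ValueError("unknown prior: " + kind)

def posterior(prior, likelihoods):
    """Combine the prior with hypothetical per-count likelihoods and normalize."""
    weights = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(weights)
    if total == 0:
        raise ValueError("all counts have zero weight")
    return [w / total for w in weights]

def sample_count(post, rng):
    """Stochastic element: draw a count rather than taking the maximum."""
    return rng.choices(range(len(post)), weights=post, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    likelihoods = [0.2, 0.3, 0.4, 0.1]  # made-up evidence for 0..3 instances
    for kind in ("mainly_one", "uniform", "fixed"):
        post = posterior(instance_count_prior(kind, 3), likelihoods)
        print(kind, [round(p, 3) for p in post], sample_count(post, rng))
```

With the ‘fixed’ prior the sampled count is always the requested number, whereas with the ‘uniform’ prior the draw follows the likelihoods alone.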

The stochastic element and the additional prior settings clearly enable MotifSampler to pick up multiple occurrences of a detected motif de novo, removing the absolute need for an additional scanning step. The uniform prior setting enables MotifSampler to annotate different numbers of instances in the individual sequences of the dataset, provided the motif overrepresentation signal is sufficiently strong. The combination of the stochastic element and the uniform prior may even help the sampler switch to previously unexplored motifs (other optimal solutions). The fixed prior proves beneficial for reducing MotifSampler’s runtime and for finding motif instances that resemble the core motif less closely (the sampler is forced to pick up the most similar sites available in every sequence).
In conclusion, the applied modifications enable MotifSampler (release 3.1.5 and higher) to better explore the motif solution space for less conserved motif instances or for other, less conserved and hopefully true, motifs.

Refine significance assessment at the motif instance level:

MotifSampler reports a list of motif predictions obtained from multiple repeats of the motif search initiated at random seeds. A significance assessment of this output is typically done at the motif matrix level using MotifRanking: if the PWM of a high-scoring motif is similar to a minimum number of other motif matrices in the list, the motif is considered a significant prediction. Besides its matrix representation, a predicted solution is also reported as the list of sites from which the PWM is constructed. As the input sequences for motif detection generally also contain some noisy sequences (in which the motif is not present) and different sequences do not necessarily contain an equal number of instances, the reported site list likely still contains some spurious instances that do not belong to the motif. This is especially true if MotifSampler is run with the fixed prior for the number of instances per sequence.
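
The relation between a site list and its matrix can be made concrete with a short Python sketch. It assumes equal-length DNA sites and a simple additive pseudocount. The function name and the pseudocount value are illustrative choices, not MotifSampler's own matrix estimation.

```python
def build_frequency_matrix(sites, pseudocount=1.0, alphabet="ACGT"):
    """Column-wise base frequencies estimated from a list of aligned sites."""
    if not sites:
        raise ValueError("at least one site is required")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for pos in range(width):
        # Start every base at the pseudocount so unseen bases keep non-zero mass.
        counts = {base: pseudocount for base in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({base: counts[base] / total for base in alphabet})
    return matrix

if __name__ == "__main__":
    # Hypothetical sites; a single spurious site shifts every column it differs in.
    sites = ["ACGT", "ACGA", "ACGT"]
    for column in build_frequency_matrix(sites):
        print({base: round(freq, 2) for base, freq in column.items()})
```

Because every site contributes equally to each column, a spurious instance in the list directly dilutes the matrix, which is why filtering at the instance level matters.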
We developed a complementary tool, called FuzzyClustering, to assess the significance of MotifSampler’s output at the instance level. In this tool, subsets of instances that were frequently detected together in a subset of multiple motif detection repeats (‘runs’) are grouped together into clusters by a graph spectral method. A cluster thus represents an ensemble motif from which spurious instance predictions have been removed and in which instances and runs are prioritized according to their membership score. The method is heuristic-free and does not depend on the motif properties (such as the consensus conservation). We evaluated the performance of both FuzzyClustering and MotifRanking in assessing the significance of motifs reported by MotifSampler in a benchmark E. coli dataset.
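
The input to such an instance-level analysis can be pictured as a co-occurrence graph, as in the Python sketch below. The sketch only counts how often two instances are reported in the same run and keeps the pairs seen in at least a chosen fraction of runs. This thresholding is a plain stand-in and does not reproduce the graph spectral clustering or the membership scores of FuzzyClustering. The instance identifiers, function names and the min_fraction parameter are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(runs):
    """Count, for every pair of instances, the number of runs reporting both."""
    counts = Counter()
    for run in runs:
        for a, b in combinations(sorted(run), 2):
            counts[(a, b)] += 1
    return counts

def frequent_pairs(runs, min_fraction=0.5):
    """Keep pairs found together in at least min_fraction of the runs."""
    threshold = min_fraction * len(runs)
    return {pair for pair, n in cooccurrence(runs).items() if n >= threshold}

if __name__ == "__main__":
    # Each run is the set of sites it reported, written as "sequence:position".
    runs = [
        {"s1:10", "s2:33", "s3:5"},
        {"s1:10", "s2:33"},
        {"s1:10", "s2:33", "s4:70"},
    ]
    print(cooccurrence(runs))
    print(frequent_pairs(runs))
```

In this toy input, the sites s3:5 and s4:70 each appear in a single run only, so they drop out as likely spurious, while the pair s1:10 and s2:33 is retained.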

The study showed the complementary value of FuzzyClustering and MotifRanking. The latter method is suitable for quickly assessing whether a dataset contains any significant motifs, whereas FuzzyClustering performs better in retrieving more reliable instances of a detected motif. Alternatively, FuzzyClustering can be used to summarize the results of running MotifSampler at different parameter settings. This scenario is, for example, well suited to finding motifs that have different unknown lengths in different sequences (FuzzyClustering will report instances of different lengths and build a single consensus PWM of the optimal motif length). The clustering and prioritizing technique in FuzzyClustering does not intrinsically rely on the motif definition, so FuzzyClustering can be used in combination with any motif finder (after conversion to our input format).
It even applies beyond motif detection: to date, our tool is used as a summary procedure to distinguish significant regulatory modules discovered with ProBic-II in a large-scale E. coli compendium. ProBic-II uses known motifs from RegulonDB as seeds to learn regulatory modules (genes co-expressed under certain conditions and bound by common TFs) in a query-driven way. These results are rewritten into the FuzzyClustering input format as a listing of co-expressed genes grouped per set of common TFs that regulate the genes. By reading ‘motif instance’ and ‘run’ as ‘gene’ and ‘TFs’ respectively, the output of FuzzyClustering prioritizes the core genes (~instances with high membership score) regulated by significant sets of TFs (~runs with high membership score). This study demonstrates the dual-clustering power of our new method.

Search for motifs phylogenetically conserved in related species:

MotifSampler is designed for motif detection in the coregulation space of one species. If it is run separately on coregulated sequences from different species, each output will be confounded by local optima that do not represent motifs of the TF common to the related species. Searching in the combined coregulation-orthology space is one remedy: it filters out solutions that are not overrepresented in each of the species involved.
We extended the algorithm in MotifSampler in two different ways to take the phylogenetic proximities between orthologs (provided as input in the form of a Newick tree) into account during motif detection. The first application, PhyloMotifSampler (PHMS), uses an explicit motif evolution model to quantify the expected sequence similarity of orthologous motif instances. The evolution model expresses that instances in closely related orthologs show a higher sequence similarity than those in distant orthologs. PHMS applies to datasets from species related by a star topology and detects a common consensus motif. The second method, NOrthoMotifSampler (NOMS), assumes that motifs have evolved at a slower rate than the surrounding background sequences. In sequence terms, this means that motif instances from related species must show a higher similarity to each other than the surrounding background sequences of these species do. NOMS can handle datasets of any rooted tree topology, and a detected motif is reported as a common consensus PWM as well as a set of species-specific motifs. A third approach is the use of a conservation-based position-specific prior (PSP) in MotifSampler. For each sequence position in a reference species, the PSP expresses the prior belief that this site is conserved in other species. Regions of high PSP probability in a given reference species are favoured during motif detection, guiding the Gibbs sampler towards motifs that also have conserved occurrences in other species. All three methods were assessed for their motif detection performance on synthetic and real datasets.

From this study, we learned that the performance of orthology-based tools depends on the setup of the dataset (the range of phylogenetic proximities and the number of orthologs involved), and we learned how our tools can be tuned to operate best on particular datasets. The extensions applied to the Gibbs sampling scheme are sound and theoretically well founded, which contributes to a good understanding of the expected performance of our algorithms. Based on the design of the modifications and on the results of the preliminary assessment in our work, we are confident that each of our three extensions of MotifSampler is valuable in its own, complementary application field.
In conclusion: MotifSampler-PSP is best used for the detection of well conserved sequence motifs (in closely related species), PHMS operates well on datasets with intermediate proximities in which motifs have evolved to some extent, and NOMS is capable of detecting motifs that have mutated in distantly related species. The latter method in particular is innovative for the computational discovery of motifs in species distantly related to well-studied model species.

Integrate any type of position-specific regulatory prior information (PSP):

While present-day computational motif detection algorithms can accurately predict in vitro binding, their solutions represent the in vivo reality more accurately when used in concert with additional knowledge. Integrating additional knowledge into the Gibbs sampling scheme can be challenging because of computational or analytical complexity and uncontrolled outcomes. Using a PSP, by contrast, to guide the detection towards prioritized DNA regions without excluding any region outright, is fast, easy and efficient.
We have integrated this option in MotifSampler (discussed above for the conservation-based PSP) as well as in PHMS and NOMS. Especially in the latter applications, an interesting option is to generate a PSP on a well-studied reference species to create good initialization seeds for finding motifs in unexplored related species.
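
One way to picture how a PSP steers a Gibbs sampler is the Python sketch below. It assumes that the prior is multiplied into the per-position sampling weights, which is a common construction but is not confirmed here as the exact scheme used in MotifSampler, PHMS or NOMS. The function names and all numeric values are illustrative.

```python
import random

def site_sampling_weights(likelihood_ratios, psp):
    """Weight each start position by motif/background ratio times its prior."""
    if len(likelihood_ratios) != len(psp):
        raise ValueError("one prior value is needed per candidate position")
    weights = [ratio * prior for ratio, prior in zip(likelihood_ratios, psp)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("no position has positive weight")
    return [w / total for w in weights]

def sample_site(likelihood_ratios, psp, rng):
    """Draw one start position from the prior-weighted distribution."""
    weights = site_sampling_weights(likelihood_ratios, psp)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

if __name__ == "__main__":
    # Positions 1 and 2 fit the motif equally well (made-up ratios), but the
    # hypothetical PSP favours position 2; no position has a zero prior.
    ratios = [1.0, 4.0, 4.0, 1.0]
    psp = [0.1, 0.1, 0.7, 0.1]
    print([round(w, 3) for w in site_sampling_weights(ratios, psp)])
    print(sample_site(ratios, psp, random.Random(1)))
```

Because every position keeps a non-zero prior, low-priority regions remain reachable by the sampler and are only visited less often.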

References:

Claeys, M. (2014). PhD: Probabilistic algorithm for finding motifs in sets of orthologous sequences. Leuven: KUL Department Molecular and Microbial Systems.
Hu, J., Li, B., & Kihara, D. (2005). Limitations and potentials of current motif discovery algorithms. Nucleic Acids Research, 33(15):4899-4913.
Klepper, K., & Drablos, F. (2013). MotifLab: a tool and data integration workbench for motif discovery and regulatory sequence analysis. BMC Bioinformatics, 14:9.
Singh, C., Khan, F., Mishra, B., & Chauhan, D. (2008). Performance evaluation of DNA motif discovery programs. Bioinformation, 3(5):205-212.
Sun, H., Yuan, Y., Liu, H., Liu, J., & Xie, H. (2010). Tmod: toolbox for motif discovery. Bioinformatics, 26(3):405-407.
Thijs, G. (2003). PhD: Probabilistic methods to search for regulatory elements in sets of coregulated genes. Leuven: KUL Department Electrotechniek.
Tompa, M., et al. (2005). Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology, 23:137-144.