CreateConservationPSP Guidelines

If you need quick help on running CreateConservationPSP optimally, go to Evaluation of the output.
If you have not used CreateConservationPSP before, go through the following steps:
Introduction
Theoretical background of the algorithm
Input for the algorithm
CreateConservationPSP Algorithm
Output to the user
Evaluation of the output
Optionally, also read a discussion on the results of using a conservation-based PSP in MotifSampler:
Case study: Improved motif detection performance of MotifSampler on real benchmark datasets when using a conservation-based PSP for the detection of motifs well conserved in close orthologs.

Introduction

Related species show similarities in transcriptional regulation, so their regulatory sequences are expected to contain the same motifs. Adding orthologous sequences to the motif search space is a common practice to avoid reporting (suboptimal) solutions that likely have no biological function because the predictions are not sufficiently conserved in related species.
More often than not, however, implementing a cross-species conservation criterion in the underlying model of the algorithm increases the computational complexity and runtime of the motif detector. An alternative approach is to simply bias the motif search towards DNA regions that are a priori more likely to contain functional regulatory sites. This can be done by using a 'conservation-based PSP' (position-specific prior) as prior input to the motif detector, as explained in Use a PSP: special case.

Different conservation-based score definitions can be used to generate a conservation-based PSP file. In what follows, we give a comprehensive description of our application CreateConservationPSP and we provide guidelines on the program parameter settings and evaluation of the output.
If you want to use this application, go back to the applications webpage. Note: all program parameters described below have default settings in CreateConservationPSP unless stated otherwise.

Theoretical background of the algorithm

Gordan et al. (2009) defined three conservation-based scores to compute position-specific priors:
- The conservation-based prior score XC is based on counting how many times a DNA segment is conserved in orthologous sequences, irrespective of location or orientation in these sequences.
- The conservation-based prior score XA is based on counting how many times a DNA segment is conserved in an aligned set of orthologous sequences.
- The conservation-based prior score XT is based on the conservation track computed by Siepel et al. (2005) from multiple alignments of seven yeast species, which is available in the UCSC Genome Browser (Kent et al., 2002).

The XA and XT scores are based on alignments of orthologous sequences. Alignments may, however, incorrectly insert gaps in orthologous motif occurrences, or non-functional regions that are conserved in closely related organisms may prevent a correct alignment of TF binding sites (Gordan et al., 2009). Also, binding sites sometimes change orientation or their position relative to each other (Ludwig, 2002). Not only may motifs occur outside aligned orthologous regions; promoters of related species may also simply not align well, especially when sequences are very divergent, or their alignment may depend strongly on the exact alignment algorithm used (Stark et al., 2000). When the alignment is incorrect, the binding site locations will not score high in an alignment-based PSP and are almost certain to be missed by motif finders that use such alignment-based prior information.

The XC score does not rely on multiple or pairwise alignment of orthologous sequences. The definition of evolutionary conservation in XC is more relaxed in that it allows DNA sites to occur anywhere in an orthologous sequence, irrespective of orientation. We have chosen XC as the basis for the development of CreateConservationPSP.

Input for the algorithm

- a file (parameter -f) containing a set of DNA sequences in FASTA format. CreateConservationPSP is designed to work in combination with MotifSampler. It does so by prioritizing, among the MotifSampler DNA input sequences, those short DNA segments that have multiple occurrences in other sequences that are expected to be regulated by the same motif. Good candidates for those 'other' sequences are sequences from different species that originated by vertical descent from the same common ancestor as the species of interest for MotifSampler (further called the 'reference species'). Such orthologous sequences in phylogenetically related species can be found in (public) databases.

By design, the CreateConservationPSP algorithm assumes that the first species encountered in the input file is the reference species. All sequences that follow this reference species are treated as its orthologous sequences until the end of the file or until an explicit group separator is encountered. You can provide an explicit grouping of sequences by inserting a line starting with the symbol '>>' (optionally followed by a description) between separate groups. The FASTA format of sequences in the dual coregulation-orthologous space that is used in our dual space motif detectors (NOMS and PHMS) is an example of such a layout (e.g. Urs1h example). Note that when an explicit '>>' is used, the Xc scores are computed by counting the similarity of DNA segments in the sequences within one orthologous group only, and not across the total set of sequences listed in the input file.
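The grouping rule described above can be illustrated with a minimal Python sketch. It is not the parser used by CreateConservationPSP; the function name parse_grouped_fasta is hypothetical, and the sketch simply assumes that the first record of each group is the reference sequence and that a line starting with '>>' opens a new group.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (sequence_id, sequence)


def parse_grouped_fasta(text: str) -> List[List[Record]]:
    """Split FASTA text into ortholog groups separated by '>>' lines.

    Hypothetical helper: the first record of each returned group is
    taken to be the reference sequence of that group.
    """
    groups: List[List[List[str]]] = [[]]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">>"):
            # Group separator: open a new group unless the current one is empty.
            if groups[-1]:
                groups.append([])
        elif line.startswith(">"):
            # Sequence header: start a new record in the current group.
            groups[-1].append([line[1:].strip(), ""])
        elif groups[-1]:
            # Sequence line: append to the most recent record.
            groups[-1][-1][1] += line.upper()
    return [[(sid, seq) for sid, seq in g] for g in groups if g]


demo = ">>group 1\n>ref1\nACGT\nAC\n>orthA\nacgtt\n>>group 2\n>ref2\nGGGG\n>orthB\nGGGA\n"
for group in parse_grouped_fasta(demo):
    reference, orthologs = group[0], group[1:]
    print(reference[0], "has", len(orthologs), "ortholog(s)")
```

In a file without any '>>' line, the sketch returns a single group, which matches the default behaviour described above.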

CreateConservationPSP Algorithm

The formula in Fig. 2 describes the computation of the Xc score at each position (p) in a reference sequence (Sr), given a set of M unaligned orthologous sequences (Sm). The use of suffix trees in CreateConservationPSP allows a particularly fast implementation of the string-based computation of Xc, even for large datasets.

Fig. 2: Computation of the alignment-free conservation-based score XC in CreateConservationPSP. L = sequence length. δ = Kronecker delta function. The function δ equals 1 if the site of length W starting at position p in Sr (or its reverse complement) equals the site of length W starting at position p’ in Sm. If the sites are not identical, the function is zero.
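The figure itself is not reproduced in this text. Based on the caption alone, the score can plausibly be written as a double sum over the orthologous sequences and the positions within them. The notation below is an assumption: L_m denotes the length of S_m, and S[p : p+W-1] denotes the site of length W starting at position p. Any normalization used in the original figure is not recoverable from the caption and is therefore omitted.

```latex
X_C(p) \;=\; \sum_{m=1}^{M} \; \sum_{p'=1}^{L_m - W + 1}
  \delta\!\left( S_r[p : p+W-1],\; S_m[p' : p'+W-1] \right),
\qquad
\delta(a, b) =
\begin{cases}
  1 & \text{if } a = b \text{ or } \operatorname{revcomp}(a) = b, \\
  0 & \text{otherwise.}
\end{cases}
```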

In default mode, Xc scores are computed only for the positions in the reference species. The suffix tree algorithm is repeated for each group of orthologous sequences independently.
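As an illustration of the counting idea, the following Python sketch computes such a score for one reference sequence and its orthologs. It deliberately swaps the suffix tree for a plain W-mer counter, which is simpler but less memory-efficient, and it is not the CreateConservationPSP implementation. The names xc_scores and reverse_complement are hypothetical, and the scores are raw counts without any normalization.

```python
from collections import Counter
from typing import List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(site: str) -> str:
    """Reverse complement of a DNA string (characters other than ACGT are kept)."""
    return site.translate(_COMPLEMENT)[::-1]


def xc_scores(reference: str, orthologs: List[str], w: int,
              both_strands: bool = True) -> List[int]:
    """Count, for each position p in the reference, the exact occurrences of
    the W-mer starting at p (and optionally its reverse complement) in the
    orthologous sequences."""
    # Index every W-mer occurrence in the orthologs once.
    counts: Counter = Counter()
    for seq in orthologs:
        for i in range(len(seq) - w + 1):
            counts[seq[i:i + w]] += 1

    scores = []
    for p in range(len(reference) - w + 1):
        site = reference[p:p + w]
        score = counts[site]
        rc = reverse_complement(site)
        # A palindromic site equals its reverse complement: count it once.
        if both_strands and rc != site:
            score += counts[rc]
        scores.append(score)
    return scores


print(xc_scores("ACGTTT", ["TTACGT", "AAACGG"], w=4))
print(xc_scores("ACGTTT", ["TTACGT", "AAACGG"], w=4, both_strands=False))
```

With explicit '>>' groups, a function like this would be called once per group, using only the sequences of that group as orthologs.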

Parameter -r sets the name of the reference species for which the Xc scores are computed. By default, this is the first species listed in the input file, on the assumption that it is the reference species of interest for motif detection in MotifSampler. If you choose another explicit species name, make sure it is written exactly as the sequence_id description in the FASTA file ('>sequence_id'; you do not need to add the '>' symbol). Optionally, use 'all' to compute the Xc scores in each sequence of the orthologous group.

Parameter -w sets the length of the DNA segments that are compared during counting. It is recommended to set this length equal to or slightly higher than the motif width (also parameter -w) in MotifSampler. The resulting PSP will then help MotifSampler to pick up the best conserved seeds, which may be present in only a few of MotifSampler's input sequences. A significantly higher length setting will likely result in many low Xc scores within a sequence, and a significantly lower setting in many high Xc scores. This may be of interest when you want to use CreateConservationPSP to prioritize the search in phylogenetically more conserved sequences or regions among the MotifSampler input sequences, rather than to provide potential motif seeds.

Parameter -s sets the strand (direction of transcription) of the sequences that could contain a common conserved motif. With the default setting, 'both strands', the program also counts conserved segments in the reverse orientation (reverse complement), covering the possibility that the motif is located on the strand opposite to the one supplied in the input file.

Parameter -x: Sometimes, not all positions in the (orthologous) sequences are identified, and such nucleotides are denoted by 'N'. It is not possible to validate the conservation between segments at such 'N' positions. Parameter -x sets the maximum number of unidentified positions allowed in the segments being compared. Keep the setting low.
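The effect of such a threshold can be sketched in Python as a simple filter on the segments that enter the comparison. This is only an assumption about how a limit like -x might act: the helper name comparable_segments is hypothetical, and how the program treats the remaining 'N' positions inside an accepted segment is not described here.

```python
from typing import Iterator, Tuple


def comparable_segments(seq: str, w: int, max_n: int) -> Iterator[Tuple[int, str]]:
    """Yield (position, segment) for every segment of length w that contains
    at most max_n unidentified ('N') positions."""
    seq = seq.upper()
    for p in range(len(seq) - w + 1):
        site = seq[p:p + w]
        if site.count("N") <= max_n:
            yield p, site


# With max_n=1, the two middle segments (two 'N' each) are skipped.
print(list(comparable_segments("ACNNGT", w=3, max_n=1)))
```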

Output to the user

The algorithm reports one file (parameter -o) with Xc scores in PSP format. The sequence identifier lines (starting with the symbol '>' and/or ">>") in the FASTA file are copied in the PSP file, making the PSP file immediately available for use in MotifSampler.

Parameter -r sets the name of the species for which the Xc scores are reported in this output file (-o). By default, this is the first species listed in the input file, on the assumption that it is the reference species of interest for motif detection by MotifSampler. You can choose to report the Xc scores for another explicit species name or for all of the sequences listed in the input file.

Evaluation of the output

Before using the PSP output file as input in MotifSampler, it is recommended to first carefully evaluate the Xc scores in this file. If most Xc scores are high or most Xc scores are low, the PSP is not very informative for use in MotifSampler. Also, if the Xc scores are almost all zero for one sequence but not for the others, it may be that the orthologous sequences for the low PSP-scoring sequence are more distant or less well defined than the orthologous sequences for the higher PSP-scoring sequences. To avoid an unfair down-prioritization of the low PSP-scoring sequence in MotifSampler, it is better to first delete the PSP information for that sequence from the PSP file before proceeding with Gibbs sampling based motif detection in MotifSampler.
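A quick check of this kind can be sketched in Python, assuming the scores have already been loaded into a dictionary that maps each sequence identifier to its list of Xc scores (reading the PSP file is left out). The function names and the 5% and 95% thresholds are arbitrary assumptions for illustration, not recommendations from the program.

```python
from typing import Dict, List


def summarize_scores(scores_by_seq: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Per-sequence summary: number of positions, fraction of non-zero
    scores and maximum score."""
    summary = {}
    for seq_id, scores in scores_by_seq.items():
        n = len(scores)
        nonzero = sum(1 for s in scores if s > 0)
        summary[seq_id] = {
            "positions": n,
            "fraction_nonzero": nonzero / n if n else 0.0,
            "max": max(scores) if scores else 0.0,
        }
    return summary


def flag_uninformative(summary: Dict[str, Dict[str, float]],
                       low: float = 0.05, high: float = 0.95) -> List[str]:
    """Identifiers whose scores are almost all zero or almost all non-zero
    (thresholds are illustrative assumptions)."""
    return [sid for sid, s in summary.items()
            if s["fraction_nonzero"] <= low or s["fraction_nonzero"] >= high]


scores = {"seqA": [0, 0, 0, 0], "seqB": [0, 2, 0, 1], "seqC": [1, 1, 1, 1]}
print(flag_uninformative(summarize_scores(scores)))
```

Sequences flagged as almost all zero are the candidates for which removing the PSP information, as suggested above, may be worth considering.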

The selection of a proper set of orthologous sequences as input to CreateConservationPSP is thus important. Orthologs that are too closely related generate many high Xc scores, while distant orthologs in general do not. The number of orthologs added to the dataset also influences the Xc score, as the counts of conserved DNA segments are reported in absolute numbers. Therefore, an input FASTA file with dual space input sequences should preferably contain the same number of orthologs, at the same or a similar phylogenetic distance, for each reference sequence in the coregulation space.
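Because the counts are absolute, it can help to verify that every orthologous group contains the same number of orthologs before running the program. The Python sketch below assumes the groups are already available as lists of (sequence_id, sequence) pairs with the reference sequence first; both helper names are hypothetical.

```python
from typing import Dict, List, Tuple


def ortholog_counts(groups: List[List[Tuple[str, str]]]) -> Dict[str, int]:
    """Map the reference id of each group (its first record) to the number
    of orthologs that follow it."""
    return {group[0][0]: len(group) - 1 for group in groups if group}


def is_unbalanced(counts: Dict[str, int]) -> bool:
    """True if the groups do not all contain the same number of orthologs."""
    return len(set(counts.values())) > 1


groups = [
    [("ref1", "ACGT"), ("orthA", "ACGA"), ("orthB", "ACGG")],
    [("ref2", "TTGA"), ("orthC", "TTGC")],
]
counts = ortholog_counts(groups)
print(counts, "unbalanced:", is_unbalanced(counts))
```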

Adding distant orthologs to the input dataset and checking whether the Xc scores are high in these distant orthologs (by setting the program parameter -r to 'all') is still a valuable action. High-scoring DNA segments are less expected in distant orthologs than in close orthologs. So if they occur, it is likely that the conserved DNA segment is actively involved in (hopefully) transcriptional regulation and is not just a random local optimum.

A different setting for the DNA segment length (-w) in CreateConservationPSP will in general generate a different PSP, as longer sites likely have fewer exact (or reverse-complement) occurrences than shorter sites. Using a higher or lower segment length than the motif length used for motif detection hence makes it possible to generate different Xc-based priors that narrow or relax the prioritized motif detection search space. A good strategy is to run the MotifSampler/MotifRanking pipeline with different PSP files and compare the detected motifs using MotifComparison. If the same motif is detected repeatedly, it is more likely to be a true positive.
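The dependence on segment length can be seen in a toy Python example that counts how many reference positions have at least one exact match in the orthologs for several values of W. The sequences are made up, the function name matched_positions is hypothetical, and only the forward strand is considered to keep the sketch short.

```python
from typing import List


def matched_positions(reference: str, orthologs: List[str], w: int) -> int:
    """Number of reference positions whose W-mer occurs exactly in at least
    one ortholog (forward strand only)."""
    seen = {seq[i:i + w] for seq in orthologs for i in range(len(seq) - w + 1)}
    return sum(reference[p:p + w] in seen for p in range(len(reference) - w + 1))


reference = "ACGTACGA"
orthologs = ["TTACGTAA"]
for w in (3, 5, 7):
    # Longer segments leave fewer positions with an exact match.
    print(w, matched_positions(reference, orthologs, w))
```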

Although it is beyond the scope of this application, we would like to draw your attention to the possibility of converting the Xc scores in the output PSP file into discriminative Xdc scores that are more selective for a specific transcription factor. You can read more on this subject in this case study.

References

- Bailey, T., Boden, M., Whitington, T., & Machanick, P. (2010). The value of position-specific priors in motif discovery using MEME. BMC bioinformatics, 11:179.
- Cuellar-Partida, G., Buske, F., McLeay, R., Whitington, T., Noble, W., & Bailey, T. (2011). Epigenetic priors for identifying active transcription factor binding sites. Bioinformatics, 28(1):56-62.
- Ernst, J., Plasterer, H., Simon, I., & Bar-Joseph, Z. (2010). Integrating multiple evidence sources to predict transcription factor binding in the human genome. Genome Research, 20:526-536.
- Gordan, R., Narlikar, L., & Hartemink, A. (2009). Finding regulatory DNA motifs using alignment-free evolutionary conservation information. Nucleic Acids Research, 38(6):e90.
- Kent, W., Sugnet, C., Furey, T., Roskin, K., Pringle, T., & Zahler, A. (2002). The human genome browser at UCSC. Genome Research, 12(6):996-1006.
- Klepper, K. and Drablos, F. (2013). MotifLab: a tool and data integration workbench for motif discovery and regulatory sequence analysis. BMC Bioinformatics, 14:9.
- Ludwig, M. (2002). Functional evolution of noncoding DNA. Curr. Opin. Genet. Dev., 12(6):634-639.
- Narlikar, L., Gordan, R., & Hartemink, A. (2007). A nucleosome-guided map of transcription factor binding sites in Yeast. PLoS Computational Biology, 3(11):e215.
- Pique-Regi, R., Degner, J., Pai, A., Gaffney, D., Gilad, Y., & Pritchard, J. (2011). Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Research, 21:447-455.
- Siepel, A., et al. (2005). Evolutionarily conserved elements in vertebrate, insect, worm and yeast genomes. Genome Research, 15:1034-1050.
- Stark, A., et al. (2000). Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature, 450:184-185.
- Tharakaraman, K., Bodenreider, O., Landsman, D., Spouge, J., & Marino-Ramirez, L. (2008). The biological function of some human transcription factor binding motifs varies with position relative to the transcription start site. Nucleic Acids Research, 36(8):2777-2786.

Feedback

Contact us if you have comments, questions or suggestions, or simply want to respond to the contents of this guideline. Thank you.
