Parameter settings

For the selection of pairwise conserved sequences (AVID-VISTA), the parameters L and C define the minimal length and the minimal conservation a region must have to be considered pairwise 'conserved'. Lowering the parameter setting resulted in the detection of several smaller subpieces matching the same longer region, instead of one large conserved region. We tested different parameter settings and selected the one that resulted in the longest collinear regions, as this reduces the complexity of the clustering in the subsequent step of the analysis flow.
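The effect of the length parameter can be illustrated with a minimal sketch. It is not the AVID-VISTA implementation: it assumes a simple sliding window over a pairwise alignment represented as a list of per-column match flags, and the names `conserved_regions` and `pick_setting`, the toy alignment and the tested settings are all hypothetical.

```python
def conserved_regions(matches, min_length, min_identity):
    """Return merged (start, end) spans (end exclusive) covered by windows of
    min_length columns whose percent identity is at least min_identity.

    Assumption: a plain sliding window stands in for the VISTA selection step.
    """
    if min_length <= 0:
        raise ValueError("min_length must be positive")
    # Prefix sums give the number of matching columns in any window in O(1).
    prefix = [0]
    for m in matches:
        prefix.append(prefix[-1] + (1 if m else 0))
    regions = []
    for start in range(0, len(matches) - min_length + 1):
        end = start + min_length
        identity = 100.0 * (prefix[end] - prefix[start]) / min_length
        if identity >= min_identity:
            if regions and start <= regions[-1][1]:
                # Window overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


def pick_setting(matches, settings):
    """Return the (L, C) setting that yields the longest conserved region."""
    def longest(setting):
        regions = conserved_regions(matches, *setting)
        return max((end - start for start, end in regions), default=0)
    return max(settings, key=longest)


# Hypothetical toy alignment: two conserved stretches separated by 5 mismatches.
matches = [True] * 30 + [False] * 5 + [True] * 30
settings = [(20, 90.0), (50, 90.0)]  # hypothetical (L, C) pairs
for L, C in settings:
    print((L, C), conserved_regions(matches, L, C))
print("selected:", pick_setting(matches, settings))
```

On this toy input the shorter window splits the alignment into two subpieces, while the longer window reports one region spanning both, which is the behavior the selection criterion above favors.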

For the clustering step, we assessed the effect of the parameters I and P on the composition of the resulting clusters. The inflation parameter I regulates the granularity of the clustering and thereby determines the size of the clusters, i.e. the number of sequences within each cluster. Changing its value did not have a major effect on the resulting clusters, so I was fixed to 4 in all runs. The other clustering parameter, P, is a similarity parameter: it sets the minimal similarity (i.e. the percent identity between two sequences) required for clustering. To optimize this parameter, we performed runs with different P settings. We adapted P to the degree of phylogenetic relatedness between the organisms from which the pairwise compared sequences originated: clustering was performed using either the same similarity threshold as in the preceding VISTA selection step, C, or less stringent criteria, obtained by subtracting 5 or 10 from the conservation selection criterion C.

For a given data set, these tests resulted in three sets of clusters: one for similarity (P) score C%, one for C-5% and one for C-10% (I = 4). Thus, for each benchmark data set, multiple sets of clusters were obtained, depending on the parameter settings used for the clustering. A cluster set that corresponds to one parameter setting consists of different clusters, each of which corresponds to a conserved region. Depending on the parameters used, a cluster set consisted of either many small (more tightly related) clusters or a few large clusters. The large clusters contained more weakly conserved subsequences, linked by less stringent relations between the mammalian subsequences (e.g. lower percent identity). Because these clusters also contained subsequences that were not significantly homologous, the cluster set resulting in the smallest clusters was selected as input for motif detection.
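The threshold scan described above can be sketched as follows. This is an illustration under stated assumptions, not the clustering used in the analysis: connected components over a thresholded similarity graph stand in for the inflation-based clustering (so the inflation parameter I = 4 has no counterpart here), "smallest clusters" is interpreted as the smallest mean cluster size, and all function names, sequence names and identity values are hypothetical.

```python
def threshold_settings(c, offsets=(0, 5, 10)):
    """Similarity thresholds P derived from the conservation criterion C."""
    return [c - off for off in offsets]


def cluster(pairs, sequences, min_identity):
    """Stand-in clustering: connected components over all pairs whose
    percent identity is at least min_identity (union-find)."""
    parent = {s: s for s in sequences}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, identity in pairs:
        if identity >= min_identity:
            parent[find(a)] = find(b)
    groups = {}
    for s in sequences:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


def select_smallest(cluster_sets):
    """Return the threshold whose cluster set has the smallest mean size
    (assumed reading of 'the cluster set resulting in the smallest clusters')."""
    def mean_size(p):
        clusters = cluster_sets[p]
        return sum(len(c) for c in clusters) / len(clusters)
    return min(cluster_sets, key=mean_size)


# Hypothetical pairwise identities between conserved subsequences.
sequences = ["hs_1", "mm_1", "rn_1", "hs_2", "mm_2"]
pairs = [("hs_1", "mm_1", 82.0), ("mm_1", "rn_1", 90.0),
         ("hs_2", "mm_2", 78.0), ("rn_1", "hs_2", 71.0)]
C = 80  # hypothetical conservation criterion from the selection step
cluster_sets = {p: cluster(pairs, sequences, p) for p in threshold_settings(C)}
for p, clusters in cluster_sets.items():
    print(p, clusters)
print("selected P:", select_smallest(cluster_sets))
```

In this toy example the least stringent threshold merges everything into one large cluster through a single weak link, which mirrors the reason given above for preferring the most stringent cluster set.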

For running BlockSampler we used its default values, except for the threshold on the consensus score (which lies between 0 and 2). We fixed this threshold to 1.2 to ensure that only initially well-conserved motifs are extended in length. Users interested in longer, less conserved blocks can lower this threshold; users interested in shorter, more strongly conserved blocks can raise it. To select the most promising hits from the output of BlockSampler, we designed a score that increases with the degree of conservation of the motifs but is at the same time length independent. The consensus score is appropriate because it describes the degree of conservation of the motif, but since it is length dependent (short motifs have a higher chance of obtaining a high consensus score), it has to be normalized. Normalization was done by recalculating the consensus score according to the following formula:

Where L is the length of the conserved block, E is an empirical factor and Cs is the consensus score. We tested different empirical factors on different data sets, and a value of 5 appeared to give the best balance between motif length and conservation. Again, depending on the interest of the user, the empirical factor can be increased to favor longer blocks.
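The consensus-score threshold of 1.2 can be illustrated with a short sketch. The text does not define the consensus score, so the definition used here is an assumption: 2 plus the mean over motif positions of the sum of f * log2(f) over the observed base frequencies f, which ranges from 0 (uniform columns) to 2 (perfectly conserved columns). The function names and the example motif instances are hypothetical, and the length normalization with the empirical factor E is deliberately not implemented because its formula is not given here.

```python
import math
from collections import Counter


def consensus_score(instances):
    """Assumed consensus score of aligned motif instances, between 0 and 2.

    Assumption: Cs = 2 + (1/W) * sum over positions of sum_b f_b * log2(f_b),
    with f_b the observed frequency of base b at that position.
    """
    width = len(instances[0])
    if any(len(s) != width for s in instances):
        raise ValueError("all motif instances must have the same length")
    total = 0.0
    for j in range(width):
        counts = Counter(s[j] for s in instances)
        n = sum(counts.values())
        # Negative entropy of the column (0 for a perfectly conserved column).
        total += sum((c / n) * math.log2(c / n) for c in counts.values())
    return 2.0 + total / width


def passes_threshold(instances, threshold=1.2):
    """True if the motif is conserved enough under the assumed score."""
    return consensus_score(instances) >= threshold


# Hypothetical motif instances from three sequences.
well_conserved = ["ACGTGA", "ACGTGA", "ACGTGC"]
poorly_conserved = ["ACGT", "CGTA", "GTAC", "TACG"]
print(consensus_score(well_conserved), passes_threshold(well_conserved))
print(consensus_score(poorly_conserved), passes_threshold(poorly_conserved))
```

Lowering the `threshold` argument admits less conserved motifs, and raising it restricts the selection to more strongly conserved ones, in line with the tuning advice above.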
