Supplementary information
BENCHMARK DATA SETS
Each benchmark data set consists of orthologs from the following species: human, chimp, mouse, rat and Fugu.
The composition of the initial data sets is given in table 1.
Input data files can be downloaded from these links:
cfos_int_SINFRUG18.fasta
cfos_int_SINFRUG19.fasta
cfos_int_SINFRUG87.fasta
hoxb2intergenic.fasta
pax6intergenic.fasta
sclintergenicpaper.fasta
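As a quick check after downloading, the composition of an input file can be inspected with a few lines of Python. This is a minimal sketch that assumes only the standard FASTA layout (a header line starting with `>` followed by one or more sequence lines). The function name `read_fasta` and the toy records are illustrative and are not part of the original pipeline.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            # sequences may be wrapped over several lines; join them
            records[header] += line
    return records


# Toy input: the record names and sequences are made up for illustration.
example = """>human
ACGTACGT
ACGT
>fugu
TTGACA
""".splitlines()

records = read_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

To apply this to a downloaded file, pass an open file handle instead of the toy lines, for example `read_fasta(open("hoxb2intergenic.fasta"))`.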
ADDITIONAL DATA SETS
Six additional data sets were included in the analysis: ERG3, GSH1, HIV-EP1, HOXB5, Meis2 and PCDH8. These data sets contain more than one distantly related organism (compared to mammals). They consist of different combinations of human, chimp, mouse, rat, dog, chicken, Fugu, Tetraodon and zebrafish.
The composition of these data sets is given in table 2 and the input data files can be downloaded from these links:
egr3intergenic.fasta
GSH1intergenic.fasta
HIVEP1intergenic.fasta
HOXB5intergenic.fasta
MEIS2intergenic.fasta
PCDH8intergenic.fasta
DATA REDUCTION
Data reduction makes use of two algorithms, namely AVID (Bray et al., 2003) and TribeMCL (Enright et al., 2002). These can be downloaded here:
AVID
TribeMCL
Concerning the benchmark data sets:
Table 3 lists the clusters that contain at least one subsequence from each mammalian ortholog. When a cluster contained more than one subsequence from a single ortholog, it was divided into subclusters [see figure]. Each subcluster is represented by a profile consisting of its constituent subsequence IDs (one per ortholog), as given in table 3. Table 4 gives an overview of the generated subclusters.
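The cluster selection and subcluster generation described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure as implemented for the article: it assumes that a cluster is available as a list of (species, subsequence ID) pairs, and that subclusters are formed by enumerating every combination that takes exactly one subsequence per ortholog. The names `mammalian_clusters`, `subcluster_profiles` and all IDs in the toy data are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

# Mammalian species of the benchmark data sets.
MAMMALS = ("human", "chimp", "mouse", "rat")

Cluster = List[Tuple[str, str]]  # (species, subsequence ID) pairs


def mammalian_clusters(clusters: Dict[str, Cluster]) -> Dict[str, Cluster]:
    """Keep clusters holding at least one subsequence from every mammal."""
    kept = {}
    for name, members in clusters.items():
        species_present = {species for species, _ in members}
        if all(m in species_present for m in MAMMALS):
            kept[name] = members
    return kept


def subcluster_profiles(members: Cluster) -> List[Dict[str, str]]:
    """Enumerate profiles with exactly one subsequence ID per ortholog.

    Assumption: every combination of one subsequence per species is a
    separate subcluster.
    """
    by_species: Dict[str, List[str]] = {}
    for species, subseq_id in members:
        by_species.setdefault(species, []).append(subseq_id)
    order = sorted(by_species)
    combos = product(*(by_species[s] for s in order))
    return [dict(zip(order, combo)) for combo in combos]


# Toy clusters with made-up subsequence IDs.
clusters = {
    "cluster_a": [("human", "h1"), ("chimp", "c1"),
                  ("mouse", "m1"), ("mouse", "m2"), ("rat", "r1")],
    "cluster_b": [("human", "h2"), ("mouse", "m3")],  # lacks chimp and rat
}

kept = mammalian_clusters(clusters)
profiles = subcluster_profiles(kept["cluster_a"])
for profile in profiles:
    print(profile)
```

In the toy data, `cluster_b` is discarded because it lacks chimp and rat subsequences, and `cluster_a` yields two profiles because mouse contributes two subsequences.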
BLOCKSAMPLER
The stand-alone version of BlockSampler can be downloaded here.
This HELP file describes in detail the use of BlockSampler by means of an example data set.
Benchmark data sets
Using the two-step procedure, we detected 8 significant blocks for hoxb2, 13 for pax6, 1 for scl and none for the cfos data set (see article). To validate these blocks, we checked whether they contained transcription factor binding sites: we looked for previously described motifs (Göttgens et al., 2002; Kammandel et al., 1999; Scemama et al., 2002) and also screened the blocks against the Transfac database of vertebrate transcription profiles (Wingender et al., 2001). The results are summarized in the article; a more detailed description of the regulatory motifs recovered in the detected blocks can be found here.
Additional data sets
More information on the additional data sets we analyzed and the conserved blocks we identified can be found here.
PARAMETER SETTINGS
Our analysis flow consists of three major algorithms (AVID, TribeMCL and BlockSampler), each with its own set of parameters. These parameters were fine-tuned on the basis of multiple test runs with several benchmark data sets and different parameter settings. More information can be found here.
| Evaluation of developed procedure |
|---|
| Blocks containing previously described motifs |
| Newly identified conserved blocks |
references