- Required arguments
- Optional arguments
- Output definitions
- Background model description
- Example
The BlockSampler is used to find conserved blocks in the upstream region of sets of orthologous genes.
Some basic remarks on the program:
- The program should be started from the command line. A full description of the required and optional arguments can be found below.
- The final results are printed either on STDOUT or in a file in GFF format.
- The block models found can be saved in a separate file using the -m switch. To analyze these matrices further, you can download MotifComparison and MotifRanking from the download page.
- You can monitor the progress of the program on STDERR.
Required arguments:

| Switch | Argument | Description |
|---|---|---|
| -f | file | Input sequences in fasta format. There should be at least 2 sequences in the file. |
| -b | file | File containing a list of sequence ids and the related background models file. Format description of this file and related background model files can be found below. |
| -i | value | Defines the root sequence of the data set. This value should be similar to one of the identifiers of the sequences in the fasta file. |
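Since the value passed to -i has to correspond to an identifier in the fasta file, a quick check before starting a run can avoid a wasted job. The following is a minimal sketch and not part of BlockSampler: `has_root_id` is a hypothetical helper, it assumes that the identifier is the first token after '>' on a header line (the manual does not say how headers are parsed), and the two toy sequences are made up.

```shell
# has_root_id FASTA ID: succeed if ID is the first token of some header line.
# Assumption: the identifier is the text between '>' and the first whitespace.
has_root_id() {
    awk -v id="$2" '
        /^>/ { name = substr($1, 2); if (name == id) found = 1 }
        END  { exit(found ? 0 : 1) }
    ' "$1"
}

# Toy fasta file; the identifiers come from the example data set,
# the sequences themselves are invented for illustration.
tmp_fasta=$(mktemp)
cat > "$tmp_fasta" <<'EOF'
>ENSG00000007372 human
ACGTACGTAC
>SINFRUG00000121553 fugu
ACGTTCGTAC
EOF

if has_root_id "$tmp_fasta" SINFRUG00000121553; then
    echo "root id found"
else
    echo "root id missing"
fi
rm -f "$tmp_fasta"
```

If the check reports a missing id, compare the header lines of the fasta file with the value you intend to pass to -i before running the sampler.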
Optional arguments:

| Switch | Argument | Description |
|---|---|---|
| -s | 0\|1 | Select strand (default: plus strand). 0 uses only the input sequences, 1 also includes the reverse complement. |
| -p | value | Sets the prior probability of finding 1 instance of the block. This value allows the user to define the type of block to search for. If the prior is set close to 0, more conserved blocks are retrieved; increasing the prior introduces more degeneracy into the block model. If the prior is set too small, it is possible that no block is found. Default = 0.2. |
| -w | value | Sets the length of the motif; motif = initial seed of the block (default 8). |
| -r | value | Sets the number of times the BlockSampler should be repeated (default = 100). When using this option it is best to define a matrix file with the '-m' switch to store the generated block models. This file can later be used to analyze the block models and select the best scoring models. |
| -t | value | Sets the threshold to extend the block length (a value between 0 and 2, default = 1.0). |
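Because the prior (-p) and the extension threshold (-t) together determine what kind of block is found, it can be useful to try a small grid of both. The sketch below only prints the command lines and does not run anything. The value grids, the hypothetical helper name `print_grid` and the output file names are illustrative assumptions, not recommended settings.

```shell
# print_grid FASTA BGLIST ROOT: print (do not run) one BlockSampler command
# per combination of prior (-p) and extension threshold (-t).
# The grids below and the blocks_p*_t* file names are assumptions.
print_grid() {
    for p in 0.1 0.2 0.3; do
        for t in 1.0 1.2 1.5; do
            echo "BlockSampler -f $1 -b $2 -i $3 -p $p -t $t -r 100 -m blocks_p${p}_t${t}.mtrx -o blocks_p${p}_t${t}.gff"
        done
    done
}

print_grid example.fasta example.bg SINFRUG00000121553
```

Writing each combination to its own matrix and GFF file keeps the runs separate, so the block models can be compared afterwards.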
Output definitions:

| Switch | Argument | Description |
|---|---|---|
| -o | file | Sets the output file to save the results. The block instances found are written to this file in GFF format. By default the results are written to STDOUT. |
| -m | file | Sets the file name of the matrix file to store the retrieved block models. If not provided, the matrices are not saved. This matrix file can be used with the MotifScanner to screen DNA sequences for instances of the retrieved blocks. If you have done multiple runs (switch '-r'), you should use this matrix file to further analyze the results. |
Background Model Description
In BlockSampler each orthologous intergenic sequence in the input data set is scored with its appropriate species-specific background model (structure is given below). To specify which sequence is scored with which background model, a file linking the sequences to their background models is required (the file passed with -b). Each line of this file holds a sequence identifier followed by the path to the corresponding background model file:
ENSG00000173917 /path/to/your/backgrounddir/homo_sapiens_order.bg
ENSPTRG00000009352 /path/to/your/backgrounddir/pan_troglodytes_order.bg
ENSMUSG00000047830 /path/to/your/backgrounddir/mus_musculus_order.bg
ENSRNOG00000008365 /path/to/your/backgrounddir/rattus_norvegicus_order.bg
SINFRUG00000136637 /path/to/your/backgrounddir/fugu_rubripes_order.bg
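Every sequence in the fasta file needs its own line in this list, so it is worth checking for missing entries before a run. The following is a rough sketch under two assumptions that the manual does not state explicitly: the two columns are separated by whitespace, and the fasta identifier is the first token after '>'. `missing_bg_ids` is a hypothetical helper name and the sequences are placeholders.

```shell
# missing_bg_ids FASTA LIST: print fasta identifiers that have no entry in LIST.
# Assumptions: LIST holds "<id> <path>" per line, separated by whitespace,
# and the fasta identifier is the first token after '>'.
missing_bg_ids() {
    awk '
        FILENAME == ARGV[1] { if (NF >= 2) listed[$1] = 1; next }  # the list file
        /^>/ { id = substr($1, 2); if (!(id in listed)) print id }  # fasta headers
    ' "$2" "$1"
}

tmp_list=$(mktemp)
tmp_fa=$(mktemp)
cat > "$tmp_list" <<'EOF'
ENSG00000173917 /path/to/your/backgrounddir/homo_sapiens_order.bg
ENSMUSG00000047830 /path/to/your/backgrounddir/mus_musculus_order.bg
EOF
cat > "$tmp_fa" <<'EOF'
>ENSG00000173917
ACGT
>ENSMUSG00000047830
ACGT
>SINFRUG00000136637
ACGT
EOF

missing_bg_ids "$tmp_fa" "$tmp_list"    # prints SINFRUG00000136637
rm -f "$tmp_list" "$tmp_fa"
```

An empty result means that every sequence has a background model assigned. The check does not test whether the listed background files actually exist.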
Each background model is stored as an ASCII text file in a well-defined format. To create such a background model file from a set of input sequences you should use the program CreateBackgroundModel from the INCLUSive website.
Below you can find an example of the first-order Homo sapiens background model file. The file should always start with the word #INCLUSive at the first position of the file. Next come several lines describing the organism, data set and order of the background model, followed by the data themselves.
--
#INCLUSive Background Model v1.0
#
#Order = 1
#Organism = human
#Sequences = d:\sae\projects\sista.sequence\sequenceviewer\bgModels\epd_homo_sapiens_499_chromgenes_non_split.tfa
#Path =
#
#snf
0.2570 0.2534 0.2465 0.2432
#oligo frequency
0.2570
0.2534
0.2465
0.2432
#transition matrix
0.3121 0.1944 0.2751 0.2184
0.2751 0.3014 0.1547 0.2688
0.2400 0.2718 0.2943 0.1939
0.1970 0.2469 0.2637 0.2924
--
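A simple sanity check on such a file is that each row of probabilities adds up to 1, within rounding. The sketch below applies that check to an abridged copy of the example above. It assumes that every non-comment line with more than one value is a probability distribution, which holds for this first-order example but has not been verified for other orders. `check_rows` is a hypothetical helper name.

```shell
# check_rows FILE: report every multi-value row that does not sum to about 1.
# Assumption: each non-comment row with 2+ values is a probability distribution.
check_rows() {
    awk '
        /^#/ || NF < 2 { next }      # skip header lines and single-value lines
        {
            sum = 0
            for (i = 1; i <= NF; i++) sum += $i
            diff = sum - 1
            if (diff < 0) diff = -diff
            if (diff > 0.001) { printf "line %d sums to %.4f\n", NR, sum; bad = 1 }
        }
        END { exit(bad ? 1 : 0) }
    ' "$1"
}

tmp_bg=$(mktemp)
cat > "$tmp_bg" <<'EOF'
#INCLUSive Background Model v1.0
#Order = 1
#snf
0.2570 0.2534 0.2465 0.2432
#transition matrix
0.3121 0.1944 0.2751 0.2184
0.2751 0.3014 0.1547 0.2688
0.2400 0.2718 0.2943 0.1939
0.1970 0.2469 0.2637 0.2924
EOF

if check_rows "$tmp_bg"; then echo "all rows sum to 1"; fi
rm -f "$tmp_bg"
```

The tolerance of 0.001 allows for the four-decimal rounding in the file. For instance, the single nucleotide frequencies above add up to 1.0001.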
You can get some pre-compiled background models at our Background Model download page.
To create your own background model you can use the program CreateBackgroundModel which you can find on the INCLUSive website.
Here is a step-by-step example of how to use the BlockSampler. Currently only a Linux version is available; other versions will be made available in the near future. To make sure that all the file specifications are clear, an example data set is provided as an additional data file, together with the background model files needed.
- 1. Software installation
The first step is the installation of the program. Download our software here. After saving it, make it executable (chmod 755 BlockSampler) and make sure that the program is included in your path. You can test whether it works by typing BlockSampler at the prompt without any options.
The output should look like this:
ssh|rvanhell>BlockSampler
Seed = 750317702
Usage: BlockSampler
Required Arguments
-f Sequences in FASTA format
-b File containing a list of sequence ids and background model file names.
-i Defines the root sequence of the data set. should be similar to the identifier of the sequence in the fasta file.
Optional Arguments
-s <0|1> Select strand. (default plus strand)
0 is only input sequences, 1 include reverse complement.
-p Sets prior probability of 1 motif copy. (default 0.2).
-w Sets length of the motif (default 8).
-r Set number of times the MotifSampler should be repeated
(default = 1).
-t Sets threshold to extend motif length (default = 1.0).
Output formatting Arguments
-o Output file to write results (default stdout).
-m Output file to write retrieved motif models.
Version 3.1 -- the bug fix release
Questions and Remarks: gert.thijs@esat.kuleuven.ac.be
- 2. Input Sequences
Input sequences should be in fasta format. An example can be downloaded here.
- 3. Background Model
For this example we use third-order background models of several vertebrate organisms (these can be found here). The sequence with identifier ENSG00000007372 is derived from Homo sapiens and is thus scored with a Homo sapiens-specific background model, namely homo_sapiens_3.bg. In the same way, each ortholog is scored with its species-specific background model. How to download or create background models is explained above.
- 4. Do a single run
First, we do a simple test of one set of parameters in a single run. We use the default parameters except for:
- -t 1.2 Here we raise the threshold of the consensus score. This allows the algorithm to find more strongly conserved blocks.
- -r 1 We perform only one BlockSampler run.
Command line: BlockSampler -f example.fasta -b example.bg -i SINFRUG00000121553 -t 1.2 -r 1 2>error1.log
Note that in this example the output is written to STDOUT and STDERR is redirected to 'error1.log' (the 2> form applies to Bourne-type shells such as sh and bash).
#INCLUSive GFF File
#id: block_SINFRUG00000121553_1 consensus: CATTATTGTTGCCAGCACGAAGCATCACAATCAATCATAAG sequences: 5 instances: 5 cs: 1.51 ic: 1.50 ll: 264.17
ENSRNOG00000004410 BlockSampler misc_feature 320 360 2.08702e+23 + . id "block_SINFRUG00000121553_1"; site "CATTATTGTTGCCAGCACGAAGCATCACAATCAATCATAAG";
ENSMUSG00000027168 BlockSampler misc_feature 315 355 3.07531e+23 + . id "block_SINFRUG00000121553_1"; site "CATTATTGTTGCCAGCACGAAGCATCACAATCAATCATAAG";
ENSPTRG00000003474 BlockSampler misc_feature 950 990 1.7212e+23 + . id "block_SINFRUG00000121553_1"; site "CATTATTGTTGCCAGCACGAAGCATCACAATCAATCATAAG";
ENSG00000007372 BlockSampler misc_feature 949 989 2.94992e+23 + . id "block_SINFRUG00000121553_1"; site "CATTATTGTTGCCAGCACGAAGCATCACAATCAATCATAAG";
SINFRUG00000121553 BlockSampler misc_feature 10098 10138 1.64152e+21 + . id "block_SINFRUG00000121553_1"; site "CATTATTGTTGCCAACACGAAGCATCAGAATCAATCACGAG";
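To get a compact overview of where the block lies in each ortholog, the instance lines can be reduced to a small table. The following sketch assumes the usual GFF column order (sequence, source, feature, start, end, score, strand, frame, attributes) and whitespace-separated columns, as in the output above. `block_table` is a hypothetical helper name.

```shell
# block_table GFF: print "sequence start end length strand" for each instance.
# Comment lines (starting with '#') are skipped.
block_table() {
    awk '!/^#/ && NF >= 8 { print $1, $4, $5, $5 - $4 + 1, $7 }' "$1"
}

# Two instance lines copied from the example output above.
tmp_gff=$(mktemp)
cat > "$tmp_gff" <<'EOF'
#INCLUSive GFF File
ENSRNOG00000004410 BlockSampler misc_feature 320 360 2.08702e+23 + . id "block_SINFRUG00000121553_1"; site "CATTATTGTTGCCAGCACGAAGCATCACAATCAATCATAAG";
SINFRUG00000121553 BlockSampler misc_feature 10098 10138 1.64152e+21 + . id "block_SINFRUG00000121553_1"; site "CATTATTGTTGCCAACACGAAGCATCAGAATCAATCACGAG";
EOF

block_table "$tmp_gff"
# ENSRNOG00000004410 320 360 41 +
# SINFRUG00000121553 10098 10138 41 +
rm -f "$tmp_gff"
```

The length column (end - start + 1) equals the length of the site string, 41 bases for this block.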
- 5. Do a batch run and store motif models
Once we have tested a few parameter settings in single runs, it is time to do some more extensive tests. Here we repeat the same experiment 50 times and save the matrices found to a separate file. You can try the following parameter settings:
- -p 0.3 The prior probability of finding one copy of the block is set to 0.3.
- -t 1.5 Here we raise the threshold of the consensus score even further. This allows the algorithm to find more strongly conserved blocks.
- -r 50 The test is repeated 50 times.
- -o example50.gff The output is written to a GFF file.
- -m example50.mtrx The block models are written to a matrix file.
Command line: BlockSampler -f example.fasta -b example.bg -i SINFRUG00000121553 -t 1.5 -r 50 -p 0.3 -o example50.gff -m example50.mtrx 2>error50.log
Note that in this example the output is written to 2 files, one GFF file and one matrix file. STDERR is redirected to 'error50.log' (the 2> form applies to Bourne-type shells such as sh and bash).
Take a look at the files 'example50.gff' and 'example50.mtrx'. The results file should look more or less like this.
- 6. Post-processing steps
After a batch run, the next step is to look at the results and select the best block.
To order the blocks, you can use the comment lines in the GFF file. Such a comment line starts with a '#' and contains the following information:
- identifier
- consensus
- number of sequences that contain the block
- number of instances found of the block in all sequences
- consensus score
- information content
- loglikelihood score
Here are a few examples:
#id: block_SINFRUG00000121553_1 consensus: GAATCCTT sequences: 5 instances: 5 cs: 1.57 ic: 1.56 ll: 52.35
#id: block_SINFRUG00000121553_13 consensus: ATCTTTCCGCTCATTGCCCATTCAAATACAATTGTAGATCGAAGCCGGCCTTGTCAsGTTGAGAAAAAGTGAATTTCTAACATC sequences: 5 instances: 5 cs: 1.47 ic: 1.46 ll: 526.27
#id: block_SINFRUG00000121553_49 consensus: ATCTTTCCGCTCATTGCCCATTCAAATACAATTGTAGATCGAA sequences: 5 instances: 5 cs: 1.49 ic: 1.48 ll: 279.10
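Because each value in a comment line follows its label, the scores can also be looked up by label instead of by column position. The sketch below does this for the three example lines. `block_scores` is a hypothetical helper name, and the approach assumes that labels and values are separated by whitespace, as shown above.

```shell
# block_scores GFF: print "id cs ic ll" for every '#id:' comment line,
# looking each value up by its label rather than by its column number.
block_scores() {
    awk '/^#id:/ {
        for (i = 1; i < NF; i++) val[$i] = $(i + 1)
        print val["#id:"], val["cs:"], val["ic:"], val["ll:"]
    }' "$1"
}

# The three comment lines from the examples above.
tmp_gff=$(mktemp)
cat > "$tmp_gff" <<'EOF'
#id: block_SINFRUG00000121553_1 consensus: GAATCCTT sequences: 5 instances: 5 cs: 1.57 ic: 1.56 ll: 52.35
#id: block_SINFRUG00000121553_13 consensus: ATCTTTCCGCTCATTGCCCATTCAAATACAATTGTAGATCGAAGCCGGCCTTGTCAsGTTGAGAAAAAGTGAATTTCTAACATC sequences: 5 instances: 5 cs: 1.47 ic: 1.46 ll: 526.27
#id: block_SINFRUG00000121553_49 consensus: ATCTTTCCGCTCATTGCCCATTCAAATACAATTGTAGATCGAA sequences: 5 instances: 5 cs: 1.49 ic: 1.48 ll: 279.10
EOF

# Block with the highest information content (third output column).
block_scores "$tmp_gff" | LC_ALL=C sort -k3,3n | tail -n 1
# block_SINFRUG00000121553_1 1.57 1.56 52.35
rm -f "$tmp_gff"
```

Note that the short block ranks first on information content while the longest block has by far the highest loglikelihood, so the choice of sort column matters.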
To select all the comment lines from the GFF file you can use the 'grep' tool. To sort the blocks you can use the 'sort' tool. Here is an example of how to sort all the blocks in our example GFF file by their information content, which is the 12th whitespace-separated field of a comment line (the value after 'ic:').
grep '^#' example50.gff | sort -b -g -k12,12
The best scoring block is at the bottom of the sorted output. It is possible that the same block is found in many runs, which is an indication that it is a strong block.
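To see how often the same block comes back across runs, the consensus strings can be counted. The sketch below treats two blocks as the same only when their consensus strings are identical. That is a simplification, since blocks from different runs may overlap while differing in length. `consensus_counts` is a hypothetical helper name, and the input lines, including their ids, are invented for illustration.

```shell
# consensus_counts GFF: print "count consensus", most frequent first.
# Assumption: the consensus is the 4th whitespace-separated field of a
# '#id:' comment line, and only identical strings count as the same block.
consensus_counts() {
    awk '/^#id:/ { print $4 }' "$1" | LC_ALL=C sort | uniq -c |
        LC_ALL=C sort -k1,1nr -k2,2 | awk '{ print $1, $2 }'
}

# Made-up comment lines in the format described above.
tmp_gff=$(mktemp)
cat > "$tmp_gff" <<'EOF'
#id: block_demo_1 consensus: GAATCCTT sequences: 5 instances: 5 cs: 1.57 ic: 1.56 ll: 52.35
#id: block_demo_2 consensus: ACGTACGT sequences: 5 instances: 5 cs: 1.40 ic: 1.39 ll: 48.10
#id: block_demo_3 consensus: GAATCCTT sequences: 5 instances: 5 cs: 1.57 ic: 1.56 ll: 52.35
EOF

consensus_counts "$tmp_gff"
# 2 GAATCCTT
# 1 ACGTACGT
rm -f "$tmp_gff"
```

A consensus that appears with a high count and also has a high score is a good candidate for further analysis with the matrix file.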