Read More background model

Page contents :
Introduction
CreateBackgroundModel Algorithm
Output to the user
Evaluation of the output



Introduction

CreateBackgroundModel creates a genome-specific higher-order background model from a set of non-coding DNA sequences. A genome-specific background model describes the distribution of the nucleotides A, C, G and T in non-functional non-coding data, also called background data, for the genome of interest. Building and using the right background model has a major impact on the performance of MotifSampler and MotifLocator. The CreateBackgroundModel method is independent of the application in which the model is used afterwards, so it can be applied to any list of sequences whose nucleotide distribution you want to model.

When you want to use this application, go back to the applications webpage or, for standalone use, set up the CreateBackgroundModel command line.
Note : all program parameters described below have default settings in CreateBackgroundModel unless stated otherwise.



CreateBackgroundModel Algorithm

Supply sequences (parameter -f) :
When you merely want to model the nucleotide distribution of a set of DNA sequences of interest, you simply provide your sequence dataset in FASTA format. In the context of MotifSuite (MotifSampler or MotifLocator), such a sequence dataset should ideally reflect the nucleotide distribution of non-functional non-coding data typical for the organism of interest. We have designed such sequence sets for a list of prokaryotes and eukaryotes (Arabidopsis thaliana, Caenorhabditis elegans, Saccharomyces cerevisiae, Drosophila melanogaster, Bos taurus, Gallus gallus, Homo sapiens, Mus musculus, Rattus norvegicus, Xenopus laevis, Arthropods, Echinoderms, Nematodes, Plants and Vertebrates), and the background models computed from them are available on our server.

[Figure: CBgM_interreg.png]

Fig.1 : Three possible configurations of the intergenic region between two consecutive genes in the genome.

You can design such a sequence dataset for your organism of interest yourself by carefully selecting the non-coding regions between two consecutive genes (intergenic regions) available in published databases. The intergenic region can however be long and can vary in composition between organisms, so you need to evaluate which subregion is most suitable. The direction of transcription of the two consecutive genes is also important (fig.1). When both genes are transcribed in the same direction, the intergenic region may contain motif instances that control transcription of the second of the two genes. When the transcription directions point away from each other, the intergenic region most likely contains motif instances controlling the transcription of both genes. When the transcription directions point towards each other, the motif instances are most likely located outside the intergenic region. These last sequences should be excluded from the analysis, as they may not correctly represent the non-coding nucleotide distribution of the background in which the motif signal is hidden. Finally, the total length of all sequences should be much larger than the expected number and length of hidden motif instances, so that the background model is not biased towards the motif signal (which would make it more difficult for MotifSampler or MotifLocator to distinguish that motif signal from the background in which it is submerged).
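
The selection rule described above can be sketched in a few lines of Python. This is only an illustration and not part of CreateBackgroundModel: the Gene record, the function names and the labels "tandem", "divergent" and "convergent" are assumptions introduced here, and gene coordinates are assumed to be 1-based and inclusive on a single chromosome.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    """Hypothetical gene record; coordinates are 1-based and inclusive."""
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def classify_intergenic(left: Gene, right: Gene) -> str:
    """Classify the region between two consecutive genes by orientation."""
    if left.strand == right.strand:
        return "tandem"      # both genes transcribed in the same direction
    if left.strand == "-" and right.strand == "+":
        return "divergent"   # transcription points away from each other
    return "convergent"      # transcription points towards each other


def usable_regions(genes: List[Gene]) -> List[Tuple[int, int, str]]:
    """Return (start, end, kind) of intergenic regions, skipping convergent pairs."""
    ordered = sorted(genes, key=lambda g: g.start)
    regions = []
    for left, right in zip(ordered, ordered[1:]):
        if right.start - left.end <= 1:
            continue  # overlapping or adjacent genes: no intergenic region
        kind = classify_intergenic(left, right)
        if kind != "convergent":
            regions.append((left.end + 1, right.start - 1, kind))
    return regions
```

Choosing a suitable subregion within each retained region, and checking that the total length is large compared with the expected motif content, remain manual decisions that this sketch does not cover.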

The order of the background model (parameter -o) :
The order sets the level of detail of the background model. The simplest model is the zero-order or single nucleotide frequency (snf) model, which merely counts the occurrences (normalized into frequencies) of each of the nucleotides A,C,G,T in the sequence dataset. A higher-order model assumes that the probability of observing a certain nucleotide in a sequence depends on the preceding nucleotides in the sequence. In our program, a background model of order 'o' is represented by a transition matrix in which each entry describes the probability of finding the respective nucleotide given the 'o' preceding nucleotides in the dataset. To construct a transition matrix for a background model of order 'o', we count all oligonucleotides of length 'o' + 1 in the sequence dataset. We arrange the counts in a matrix of dimension power(4,'o') x 4, such that all oligonucleotides in a row share the same first 'o' nucleotides while each column corresponds to the last nucleotide of the oligonucleotides. Next, a small pseudocount is added and each row is normalized into frequencies.
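
The construction described above can be illustrated with a minimal Python sketch. It is not the CreateBackgroundModel implementation: the function name is hypothetical, the pseudocount value of 0.01 is an assumption (the text only says "a small pseudocount"), and oligonucleotides containing characters other than A,C,G,T are simply skipped here.

```python
from itertools import product

ALPHABET = "ACGT"


def transition_matrix(sequences, order, pseudocount=0.01):
    """Return {context: [Pr(A|ctx), Pr(C|ctx), Pr(G|ctx), Pr(T|ctx)]}.

    Contexts are listed in the order AA, AC, AG, AT, CA, ... (for order 2).
    The pseudocount default is an illustrative assumption.
    """
    contexts = ["".join(p) for p in product(ALPHABET, repeat=order)]
    column = {base: i for i, base in enumerate(ALPHABET)}
    counts = {ctx: [0.0] * 4 for ctx in contexts}

    # Count every oligonucleotide of length order + 1.
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - order):
            ctx, nxt = seq[i:i + order], seq[i + order]
            if ctx in counts and nxt in column:
                counts[ctx][column[nxt]] += 1

    # Add the pseudocount and normalize each row into frequencies.
    matrix = {}
    for ctx in contexts:
        row = [c + pseudocount for c in counts[ctx]]
        total = sum(row)
        matrix[ctx] = [v / total for v in row]
    return matrix
```

Calling transition_matrix(sequences, 2) yields 16 rows, one per pair of preceding nucleotides, each summing to 1. With order 0 the single row (empty context) is the snf model, apart from the pseudocount.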

[Figure: CBgM_transmatrix.png]

Fig.2 : Second-order transition matrix. Symbol b represents a nucleotide A,C,G or T.

A background model of order 2, for example (fig.2), will have as first entry on the first row Pr(A|AA), the probability of having nucleotide A given that the two preceding nucleotides in the sequence are AA, followed by Pr(C|AA), Pr(G|AA) and Pr(T|AA). The second row describes Pr(A|AC), Pr(C|AC), Pr(G|AC), Pr(T|AC) and so on until the 16th row describing Pr(A|TT), Pr(C|TT), Pr(G|TT), Pr(T|TT).
Assuming that the input sequence dataset supplied to CreateBackgroundModel is also intended for motif detection (MotifSampler or MotifLocator), the default order of the background model is 1 and the maximal order is limited to 4. The default order 1 gives a more accurate description than the more commonly available snf model. Models of order 2 or higher are only beneficial for motif detection if the sequence dataset supplied to CreateBackgroundModel is sufficiently long. The reason is that short sequences may contain only a subset of the oligonucleotides spanned by the higher-order transition matrix, resulting in many zero matrix entries. In that case, the higher-order transition matrix does not accurately describe a genome's non-functional code and might even be biased towards an overrepresented functional or coding signal, if one is present in the short sequence dataset. In abstract numbers, N sequences of length L contain N x (L - 'o') oligonucleotides of length 'o' + 1. To obtain a transition matrix with reliable values, the number of oligonucleotides should be substantially higher than 4 x power(4,'o'), which is the number of values in the transition matrix that models the background.
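
The comparison between the number of available oligonucleotides and the number of matrix values can be worked out with a short Python sketch. The function names are hypothetical, and the ratio returned is only a rough indicator: the text does not define what counts as "substantially higher".

```python
def observed_oligos(lengths, order):
    """Number of oligonucleotides of length order + 1 in sequences of the given lengths."""
    return sum(max(0, length - order) for length in lengths)


def matrix_values(order):
    """Number of values in a transition matrix of the given order: 4 x power(4, order)."""
    return 4 * 4 ** order


def observations_per_value(lengths, order):
    """Average number of observed oligonucleotides per transition matrix value."""
    return observed_oligos(lengths, order) / matrix_values(order)
```

For example, 100 sequences of 500 nucleotides give 100 x (500 - 4) = 49600 oligonucleotides at order 4 against 4 x power(4,4) = 1024 matrix values, roughly 48 observations per value on average. The same dataset at order 1 gives 49900 oligonucleotides against only 16 values. Because oligonucleotides are not evenly distributed, individual matrix entries can be supported by far fewer observations than this average suggests.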

The last parameter, -n, sets the name of the organism to which the background model applies. This name serves only as a label for your own information when you store the computed model for later use. The name is copied into the #Organism field of the output file.



Output to the user

The background model is written into a file in the following format : read format details.
Regardless of the setting of the background model order, the file always describes :
- the zero-order background model (#snf),
- the oligo frequencies (#oligo) of all possible combinations of preceding nucleotides of length 'o' (as listed in 'bb' in fig.2),
- the transition matrix of the background model of requested order.
The top lines of this file contain textual information about how the background model was created, such as the order (#Order), the name of the organism (#Organism) and the path from which you uploaded your input sequence dataset (#Sequences).



Evaluation of the output

In all cases, the values in #oligo should be well above zero, as they represent the counts of preceding oligonucleotides needed to compute the transition matrix. A subset of very low values in #oligo means that your input sequence dataset is not long enough to correctly model the varying composition of the genome-specific background. The background model will then be over-ordered and biased towards the input sequence dataset. This results in a lower (motif) signal to (background) noise ratio during motif detection. In those cases we recommend repeating CreateBackgroundModel with a lower value for the order setting. So the higher the order of the background model, the more important it is to select the right set of input sequences for constructing the background model.
If the single nucleotide frequencies (#snf) are of similar magnitude to the frequencies in the corresponding columns of the transition matrix, the background composition is not variable along the genome. The less variable the background composition, the lower the added value of using a higher-order model during motif detection. However, as long as the model is not over-ordered, using a higher-order background model does no harm.
If you do not have a good resource for selecting suitable background sequences for your organism of interest, you can also choose background sequences of a phylogenetically closely related organism, as the background models of such organisms generally show a high degree of correlation. Here too, do not set the order of the model too high, so that the small but genome-specific differences in sequence composition do not bias the background model towards the other organism.
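
The check on low #oligo values described above can be approximated before running CreateBackgroundModel with a minimal Python sketch. It is an illustration only: the function names are hypothetical, and the threshold min_count=50 is an arbitrary assumption, not a value recommended by this guideline. The limit of 4 for the order is the one stated earlier.

```python
from collections import Counter
from itertools import product


def context_counts(sequences, order):
    """Count each preceding oligonucleotide of length `order` that is followed by A,C,G or T."""
    counts = Counter({"".join(p): 0 for p in product("ACGT", repeat=order)})
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - order):
            ctx = seq[i:i + order]
            if ctx in counts and seq[i + order] in "ACGT":
                counts[ctx] += 1
    return counts


def highest_supported_order(sequences, max_order=4, min_count=50):
    """Return the highest order for which every context occurs at least min_count times.

    min_count is an illustrative threshold (assumption). Returns 0 if even
    order 1 is not supported.
    """
    best = 0
    for order in range(max_order + 1):
        if min(context_counts(sequences, order).values()) >= min_count:
            best = order
        else:
            break
    return best
```

If the order returned is lower than the order you intended to use, this suggests either lowering the order setting or extending the input sequence dataset.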



Feedback

Contact us if you have comments, questions or suggestions, or simply want to respond to the contents of this guideline. Thank you.