The file starts with a comment line (#INCLUSive GFF File) that refers to our program and serves as a file recognition for our applications that load an instances file as input.
For MotifSampler and FuzzyClustering, the motif description starts with a comment line that describes a unique motif identifier (#id:). For MotifLocator, the motif identifier is printed as part of the reported instance description (see further). The comment line typically looks as follows:
- in an outputfile of MotifSampler: the motif identifier typically starts with '#id: box', followed by underscore-separated information consisting of the number of MotifSampler run, the number of detected motif in this run and a consensus representation of the detected motif. The next fields in the motif identifier describe (sequences:) the number of sequences that have at least one instance and (instances:) the total number of motif instances, followed by 3 motif scores internally computed by MotifSampler: (cs:) consensus score, (ic:) information content and (ll:) log-likelihood score.
Example: #id: box_24_2_TCATCrrTAyAATmnATGA   consensus: TCATCrrTAyAATmnATGA   sequences: 5   instances: 5   cs: 0.99   ic: 0.88   ll: 71.35
- in an outputfile of FuzzyClustering: the motif identifier consists of underscore('_')-separated fields starting with '#id: box', the number of the ensemble motif in the file (simple sequential numbering), the symbol 'CS' and in [brackets] the length, consensus description and consensus score of the ensemble motif being described. The end may describe additional information such as (%InstanceCut) the fractional threshold that was used to remove unreliable instance predictions of the ensemble motif, (%MotifCut) the fractional threshold that was used to remove motifs that do not sufficiently well correspond to the ensemble motif, (nbrSeq) the number of sequences with at least one instance of the ensemble motif, (nbrInst) the number of instances of the ensemble motif and (nbrMotifs) the number of motifs (detected by different motif detection runs) that corresponds to the ensemble motif.
Example: #id: box_1_CS[18,ATTCCTACnnnTGTArGA,1.44974]_%InstanceCut=0.34_%MotifCut=0.41_nbrSeq=9_nbrInst=9_nbrMotifs=21.
Next, all instances that belong to the motif are described on separate lines. The minimal information that is described for one instance on one line is:
1) the sequence identifier of the sequence where the instance is annotated in,
2) the start and
3) end position of the instance in the sequence (the annotation is relative to the first nucleotide in the given sequence),
4) the strand (+/- for forwarded/reversed direction of transcription) of the instance in the sequence and
5) the string description of the instance (also called 'site') based on the {a,c,g,t,n,A,C,G,T,N} alphabet.
Mark that the length of the site description must equal {'end' - 'start' + 1}.
Example (minimal format): yneN 122 139 + CTGCCTACAGCTGTAAGA
- in an outputfile of MotifSampler and MotifLocator, the fields are tab- or white space separated and includes some additional text-markers, the instance score computed by the program (following 'misc_feature') and the motif identifier to which the instance belongs (following 'id').
Example: yneN MotifSampler misc_feature 122 139 7.78063e+06 + . id "box_5_1_ATTCCTACnnnTGTArGA"; site "CTGCCTACAGCTGTAAGA";
- in an outputfile of FuzzyClustering, the fields are underscore('_')-separated and end with a (bracketed) description of (shift) the shift of the instance when all instances are aligned to reconstruct a PWM and (occ) the number of times the instance (or a shifted version) occurred in the inputfile supplied to FuzzyClustering. After the last bracket follows (separated by a tab) the membership score of this instance in the cluster as computed by FuzzyClustering.
Example: emrK_59_77_+_TAATCCTACAGGCGTAAGA_(shift=0,occ=55)   0.447
The second and following motifs are described in exactly the same way. For MotifSampler and FuzzyClustering the different motifs are clearly separated by the above described comment line starting with '#id:'. For MotifLocator, when it searches for instances of multiple prior motifs, it is the 'id' field that describes the prior motif identifier to which the reported instance belongs. The end of the file is recognized by the last blank line return.
|