Instances format

This page describes the instances file format, which describes one or more motifs by the positions where they occur in a given sequence dataset.
All required and optional fields are explained, in case you need to supply such a file as input to MotifSuite.

In MotifSuite, a file in instances format is reported as output by MotifSampler, MotifLocator and FuzzyClustering, and can be supplied as input to FuzzyClustering.

Page contents :
   File format
   Conversion requirements

Instances file format

The file starts with the comment line '#INCLUSive GFF File', which refers to our program and allows our applications to recognize the file when it is loaded as an instances file.

For MotifSampler and FuzzyClustering, the motif description starts with a comment line that describes a unique motif identifier (#id:). For MotifLocator, the motif identifier is printed as part of the reported instance description (see below). The comment line typically looks as follows:

- in an output file of MotifSampler: the motif identifier typically starts with '#id: box', followed by underscore-separated information: the number of the MotifSampler run, the number of the detected motif within this run and a consensus representation of the detected motif. The next fields repeat the consensus (consensus:) and describe (sequences:) the number of sequences that have at least one instance and (instances:) the total number of motif instances, followed by 3 motif scores internally computed by MotifSampler: (cs:) consensus score, (ic:) information content and (ll:) log-likelihood score.
Example: #id: box_24_2_TCATCrrTAyAATmnATGA   consensus: TCATCrrTAyAATmnATGA   sequences: 5   instances: 5   cs: 0.99   ic: 0.88   ll: 71.35

- in an output file of FuzzyClustering: the motif identifier consists of underscore('_')-separated fields starting with '#id: box', the number of the ensemble motif in the file (simple sequential numbering), the symbol 'CS' and in [brackets] the length, consensus description and consensus score of the ensemble motif being described. The end may describe additional information such as (%InstanceCut) the fractional threshold that was used to remove unreliable instance predictions of the ensemble motif, (%MotifCut) the fractional threshold that was used to remove motifs that do not correspond sufficiently well to the ensemble motif, (nbrSeq) the number of sequences with at least one instance of the ensemble motif, (nbrInst) the number of instances of the ensemble motif and (nbrMotifs) the number of motifs (detected by different motif detection runs) that correspond to the ensemble motif.
Example: #id: box_1_CS[18,ATTCCTACnnnTGTArGA,1.44974]_%InstanceCut=0.34_%MotifCut=0.41_nbrSeq=9_nbrInst=9_nbrMotifs=21.
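The two identifier styles above can be read with a few regular expressions. The following is only a minimal Python sketch based on the two example lines; the function name parse_id_line and the returned dictionary keys are illustrative assumptions, not part of MotifSuite, and all values are kept as strings.

```python
import re

def parse_id_line(line):
    """Parse a '#id:' comment line into a dict of string fields (illustrative helper)."""
    if not line.startswith("#id:"):
        raise ValueError("not a motif identifier line")
    # MotifSampler style: 'key: value' pairs such as 'sequences: 5' or 'cs: 0.99'.
    fields = dict(re.findall(r"(\w+):\s*(\S+)", line))
    ident = fields.get("id", "")
    # FuzzyClustering style: CS[length,consensus,score] is packed into the identifier.
    cs = re.search(r"CS\[(\d+),([^,\]]+),([^\]]+)\]", ident)
    if cs:
        fields["length"], fields["consensus"], fields["cs"] = cs.groups()
        # Optional trailing key=value items such as %InstanceCut=0.34 or nbrSeq=9.
        fields.update(re.findall(r"(%?[A-Za-z]+)=(\d+(?:\.\d+)?)", ident))
    return fields
```

For the MotifSampler example this gives, among others, fields["sequences"] == "5"; for the FuzzyClustering example it gives fields["length"] == "18" and fields["nbrMotifs"] == "21".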

Next, all instances that belong to the motif are described on separate lines. The minimal information that is described for one instance on one line is:
1) the identifier of the sequence in which the instance is annotated,
2) the start and
3) end position of the instance in the sequence (the annotation is relative to the first nucleotide in the given sequence),
4) the strand (+/- for forward/reverse direction of transcription) of the instance in the sequence and
5) the string description of the instance (also called 'site') based on the {a,c,g,t,n,A,C,G,T,N} alphabet.
Note that the length of the site description must equal {'end' - 'start' + 1}.
Example (minimal format): yneN 122 139 + CTGCCTACAGCTGTAAGA
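The five fields and the length rule can be checked mechanically. Below is a minimal Python sketch, assuming whitespace-separated fields as in the example line; the names Instance and parse_minimal are illustrative and do not exist in MotifSuite.

```python
from typing import NamedTuple

class Instance(NamedTuple):
    seq_id: str
    start: int
    end: int
    strand: str
    site: str

ALPHABET = set("acgtnACGTN")

def parse_minimal(line):
    """Parse one minimal-format instance line and validate its fields."""
    seq_id, start, end, strand, site = line.split()
    inst = Instance(seq_id, int(start), int(end), strand, site)
    if inst.strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if not set(inst.site) <= ALPHABET:
        raise ValueError("site contains characters outside {a,c,g,t,n,A,C,G,T,N}")
    # The site length must equal end - start + 1.
    if len(inst.site) != inst.end - inst.start + 1:
        raise ValueError("site length does not match end - start + 1")
    return inst
```

For the example line above, the site has 18 characters and 139 - 122 + 1 = 18, so the line passes the check.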

- in an output file of MotifSampler and MotifLocator, the fields are tab- or white space-separated and include some additional text markers, the instance score computed by the program (after 'misc_feature' and the start and end positions) and the motif identifier to which the instance belongs (following 'id').
Example: yneN MotifSampler misc_feature 122 139 7.78063e+06 + . id "box_5_1_ATTCCTACnnnTGTArGA"; site "CTGCCTACAGCTGTAAGA";

- in an output file of FuzzyClustering, the fields are underscore('_')-separated and end with a (bracketed) description of (shift) the shift of the instance when all instances are aligned to reconstruct a PWM and (occ) the number of times the instance (or a shifted version) occurred in the input file supplied to FuzzyClustering. After the last bracket follows (separated by a tab) the membership score of this instance in the cluster as computed by FuzzyClustering.
Example: emrK_59_77_+_TAATCCTACAGGCGTAAGA_(shift=0,occ=55)   0.447
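Both extended line styles can be split into named fields. The sketch below is written in Python against the two example lines only; the function names and dictionary keys are illustrative assumptions, and the column layout of the MotifSampler/MotifLocator line (sequence, program, 'misc_feature', start, end, score, strand, '.', attributes) is inferred from the example, not from a formal specification.

```python
import re

def parse_gff_instance(line):
    """Parse an extended MotifSampler/MotifLocator instance line."""
    # Eight leading columns, then the attribute text (id "..."; site "...";).
    cols = line.split(None, 8)
    attrs = dict(re.findall(r'(\w+) "([^"]*)"', cols[8]))
    return {
        "seq_id": cols[0], "program": cols[1],
        "start": int(cols[3]), "end": int(cols[4]),
        "score": float(cols[5]), "strand": cols[6],
        "id": attrs["id"], "site": attrs["site"],
    }

FUZZY_RE = re.compile(
    r"^(.+)_(\d+)_(\d+)_([+-])_([acgtnACGTN]+)"
    r"_\(shift=(-?\d+),occ=(\d+)\)\s+(\S+)$"
)

def parse_fuzzy_instance(line):
    """Parse a FuzzyClustering instance line (underscore-separated fields)."""
    m = FUZZY_RE.match(line.strip())
    if m is None:
        raise ValueError("not a FuzzyClustering instance line")
    seq_id, start, end, strand, site, shift, occ, membership = m.groups()
    return {
        "seq_id": seq_id, "start": int(start), "end": int(end),
        "strand": strand, "site": site, "shift": int(shift),
        "occ": int(occ), "membership": float(membership),
    }
```

The leading (.+) in the FuzzyClustering pattern is greedy on purpose, so that sequence identifiers that themselves contain underscores are still kept together.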

The second and following motifs are described in exactly the same way. For MotifSampler and FuzzyClustering, the different motifs are clearly separated by the comment line starting with '#id:' described above. For MotifLocator, when it searches for instances of multiple prior motifs, the 'id' field describes the prior motif identifier to which the reported instance belongs. The end of the file is recognized by a final blank line.
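As an illustration of this layout, the following Python sketch splits the lines of a MotifSampler or FuzzyClustering instances file into motifs, using the '#id:' comment lines as separators. The function name split_motifs is an assumption made for this example; instance lines are returned unparsed.

```python
def split_motifs(lines):
    """Group instance lines under the '#id:' comment line that precedes them.

    Returns a list of (identifier_text, [instance_lines]) tuples in file order.
    """
    motifs = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#id:"):
            # A new motif starts; keep the rest of the comment line as its identifier text.
            motifs.append((line[len("#id:"):].strip(), []))
        elif not line.strip() or line.startswith("#"):
            # Skip blank lines and other comments such as '#INCLUSive GFF File'.
            continue
        elif motifs:
            motifs[-1][1].append(line)
        # Instance lines that appear before any '#id:' line are ignored in this sketch.
    return motifs
```

Because MotifLocator output carries the motif identifier in the 'id' field of each instance line and not in a separate comment line, this sketch does not apply to it directly.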

Conversion requirements

- Although it is an instances file, the output file of MotifLocator cannot be used directly as input for FuzzyClustering. This is because MotifLocator reports the annotated instances of multiple prior motifs per sequence of the dataset and not per prior motif. A script (for example in Perl) could be used to automatically group all instances with the same 'id' field together, preceded by a comment line that separates the different motifs from each other.
- FuzzyClustering requires that the start of a motif is indicated by a line starting with '>' or with '#id:'. It is optional for the user to explicitly describe the motif identifier after the marking symbol ('>' or '#id:'). If not described, the program will internally identify the different motifs by a sequential numbering 'motif_x' with x = the order of the motif in the file.
- FuzzyClustering can read both the minimal and extended format of the instances reported by MotifSampler and MotifLocator. The minimal format is the tab-separated format as described above (fields 1->5).
- FuzzyClustering can optionally also load an instance weight that reflects the importance of the instance in the motif. In that case the instance line format should be exactly as described for the minimal format, and the weight must follow, separated by a tab.
Example: yneN 122 139 + CTGCCTACAGCTGTAAGA     0.95
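The regrouping of MotifLocator output mentioned in the first point can be sketched as follows. This is a hedged illustration in Python (the text above suggests Perl; Python is used here only for the example). The function name regroup_by_motif is hypothetical, the column layout is inferred from the extended example line, and the result has not been validated against FuzzyClustering itself.

```python
import re

def regroup_by_motif(lines):
    """Regroup MotifLocator instance lines per motif 'id'.

    Returns the lines of an instances file in minimal (tab-separated) format,
    with one '#id:' comment line in front of each group of instances.
    """
    groups = {}  # motif id -> minimal-format lines, in order of first appearance
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split(None, 8)
        attrs = dict(re.findall(r'(\w+) "([^"]*)"', cols[8]))
        # Minimal format: sequence id, start, end, strand, site.
        minimal = "\t".join([cols[0], cols[3], cols[4], cols[6], attrs["site"]])
        groups.setdefault(attrs["id"], []).append(minimal)
    out = ["#INCLUSive GFF File"]
    for motif_id, instances in groups.items():
        out.append("#id: " + motif_id)
        out.extend(instances)
    return out
```

When writing the result to disk, one would join the returned lines with newlines and end the file with a blank line, in line with the end-of-file convention described earlier. An instance weight, if available, could be appended to each line after a further tab.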


Figure (instancesfile.png): example output by MotifLocator (top), output by FuzzyClustering (middle), output by MotifSampler (bottom)


Contact us if you have comments, questions or suggestions, or simply want to respond to the contents of this guideline. Thank you.