FASTA format

This page describes the FASTA file format used to represent a set of DNA sequences.
It covers all required and optional fields, in case you need to supply such a file as input to MotifSuite.

In MotifSuite, a FASTA file is supplied as input to MotifSampler, MotifLocator and CreateBackgroundModel.

Page contents:
    FASTA file format
    Conversion requirements

FASTA file format

In bioinformatics, FASTA format is a text-based format for representing DNA sequences, in which nucleotides are represented using a single-letter code [A,C,G,T,N] where A=Adenine, C=Cytosine, G=Guanine, T=Thymine and N=any of A,C,G,T. The format also allows sequence names and comments to precede the sequences.
A sequence in FASTA format begins with a single-line identifier description, followed by lines of DNA sequence data. The identifier description line is distinguished from the sequence data by a greater-than ('>') symbol in the first column. The word following the '>' symbol is the identifier of the sequence, and the rest of the line is an optional description, separated from the identifier by a space or tab. The sequence data starts on the line after the identifier description line and ends when another line starting with '>' appears, which marks the start of the next sequence.

        >sequence1-id     description(optional)
        lines of {ACGTNacgtn}
        >sequence2-id     description(optional)
        lines of {ACGTNacgtn}
        ...
        >sequenceEnd-id     description(optional)
        lines of {ACGTNacgtn}
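The layout above can be read with a few lines of Python. The following is only a minimal sketch of the rules described in this section, not the parser used by MotifSuite; the names `FastaRecord` and `parse_fasta` are hypothetical and chosen for illustration. It assumes that blank lines carry no sequence data and that any text before the first '>' line can be ignored.

```python
from typing import Iterator, NamedTuple


class FastaRecord(NamedTuple):
    """One FASTA entry (hypothetical helper type, not part of MotifSuite)."""
    identifier: str
    description: str
    sequence: str


def parse_fasta(text: str) -> Iterator[FastaRecord]:
    """Yield one record for every '>' line found in `text`."""
    identifier = None
    description = ""
    chunks = []
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue  # assumption: blank lines carry no sequence data
        if line.startswith(">"):
            # A new '>' line closes the previous record, if there was one.
            if identifier is not None:
                yield FastaRecord(identifier, description, "".join(chunks))
            # First word is the identifier, the rest is the description.
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line.strip())
    # The last record has no following '>' line to close it.
    if identifier is not None:
        yield FastaRecord(identifier, description, "".join(chunks))
```

Calling `list(parse_fasta(open("input.fasta").read()))` on a file (the file name here is a placeholder) gives one record per sequence, with multi-line sequence data joined into a single string.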

Conversion requirements

- There is no standard file extension for a text file containing FASTA-formatted sequences. Some examples of widely used file extensions are '.fasta', '.fna' or simply '.txt'.
- There should be no space between the '>' and the first letter of the sequence identifier. The '>' symbol and the sequence identifier are required; the description is not (you can use it to clarify your input). The sequence identifier (not necessarily numeric) is used to report sequence-related results in output files generated by MotifSuite (e.g. the location of motif instances in a sequence). It is recommended that the full identifier description line be shorter than 80 characters.
- When a FASTA file is loaded by MotifSampler, lines starting with the symbol '#' are skipped. This can be useful if you want to (re)run MotifSampler without some of the DNA sequences. To exclude a particular sequence, you need to mask every line describing this sequence with '#', including its sequence data lines. Failure to do so may result in sequence data being assigned to the preceding sequence identifier.
- The sequence data alphabet is {A,C,G,T,N,a,c,g,t,n}. MotifSuite makes no distinction between upper-case and lower-case letters. Gaps or insertions (white space or '-') are treated as N. The file should end with a line break (newline) to ensure that the last sequence in the dataset is also loaded by the program.
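The requirements above can be checked before a file is handed to MotifSuite. The sketch below mirrors the rules as stated in this list; it is an illustration, not MotifSuite's own loader, and the name `load_sequences` is hypothetical. It assumes that leading and trailing white space on a sequence line is not part of the sequence, and that any character outside the alphabet should be reported as an error.

```python
import re

VALID_LETTERS = set("ACGTN")


def load_sequences(lines):
    """Return {identifier: sequence} following the rules listed above."""
    sequences = {}
    current = None
    for raw in lines:
        if raw.startswith("#"):
            continue  # masked line: skipped entirely
        line = raw.strip()
        if not line:
            continue
        if raw.startswith(">"):
            # No space is allowed between '>' and the identifier.
            if len(line) < 2 or line[1].isspace():
                raise ValueError("identifier must directly follow '>'")
            current = line[1:].split(None, 1)[0]
            sequences[current] = ""
        elif current is not None:
            # Case is ignored; inner white space and '-' count as N.
            seq = re.sub(r"[\s-]", "N", line.upper())
            unexpected = set(seq) - VALID_LETTERS
            if unexpected:
                raise ValueError(f"unexpected characters: {sorted(unexpected)}")
            sequences[current] += seq
    return sequences
```

Note what happens when only the identifier line of a sequence is masked: with the input lines `>s1`, `acgt`, `#>s2`, `GG-A`, the data `GG-A` is appended to `s1`, which is the pitfall described in the third item of the list.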


[Figure: Fasta.png]


Contact us if you have comments, questions or suggestions, or simply want to respond to the contents of this guideline. Thank you.