FASTA format

This page describes the FASTA format for a file containing a set of DNA sequences.
All required and optional fields are covered, in case you need to supply such a file as input to MotifSuite.

In MotifSuite, a FASTA file is supplied as input to MotifSampler, MotifLocator and CreateBackgroundModel.
Page contents:
File format
Conversion requirements
Example

FASTA file format

In bioinformatics, FASTA format is a text-based format for representing DNA sequences, in which nucleotides are represented using a single-letter code [A,C,G,T,N], where A = adenine, C = cytosine, G = guanine, T = thymine and N = any of A, C, G, T. The format also allows sequence names and comments to precede the sequences.
A sequence in FASTA format begins with a single-line identifier description, followed by lines of DNA sequence data. The identifier description line is distinguished from the sequence data by a greater-than ('>') symbol in the first column. The word following the '>' symbol is the identifier of the sequence, and the rest of the line is an optional description, separated from the identifier by a space or tab. The sequence data starts on the line after the identifier description line and ends at the next line starting with '>', which marks the start of another sequence, or at the end of the file.

>sequence1-id description(optional)
lines of {ACGTNacgtn}
>sequence2-id description(optional)
....
>sequenceEnd-id description(optional)
lines of {ACGTNacgtn}
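
The layout above can be read with a few lines of code. The following is a minimal Python sketch, not the loader used by MotifSuite; the function name `parse_fasta` and the sample text are hypothetical and serve only to illustrate how the identifier, the optional description and the sequence data are separated.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str, str]]:
    """Return one (identifier, description, sequence) tuple per record."""
    records: List[Tuple[str, str, str]] = []
    seq_id = None          # identifier of the record being read
    description = ""       # optional text after the identifier
    chunks: List[str] = []  # sequence data lines of the current record

    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue  # ignore empty lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if seq_id is not None:
                records.append((seq_id, description, "".join(chunks)))
            # First word is the identifier, the rest is the description.
            header = line[1:].split(None, 1)
            seq_id = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)

    # The last record is closed by the end of the input.
    if seq_id is not None:
        records.append((seq_id, description, "".join(chunks)))
    return records


# Hypothetical input, made up for this illustration.
sample = ">seq1 first example\nACGT\nacgtn\n>seq2\nNNAC\n"
for seq_id, description, sequence in parse_fasta(sample):
    print(seq_id, repr(description), sequence)
```

Sequence lines that belong to the same record are concatenated, so line breaks inside the sequence data carry no meaning in this sketch.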

Conversion requirements

- There is no standard file extension for a text file containing FASTA formatted sequences. Some examples of widely used file extensions are '.fasta', '.fna' or simply '.txt'.
- There should be no space between the '>' and the first character of the sequence identifier. The '>' symbol and the sequence identifier are required; the description is optional (you can use it to clarify your input). The sequence identifier (not necessarily numeric) is used to report sequence-related results in output files generated by MotifSuite (e.g. the location of motif instances in a sequence). It is recommended that the full identifier description line be shorter than 80 characters.
- When a FASTA file is loaded by MotifSampler, lines starting with the symbol '#' are skipped. This can be useful if you want to (re)run MotifSampler without some of the DNA sequences. To exclude a particular sequence, you need to mask every line describing this sequence (the identifier description line and all of its sequence data lines) with '#'. Failing to do so may result in sequence data being assigned to the preceding sequence identifier.
- The sequence data alphabet is {A,C,G,T,N,a,c,g,t,n}. MotifSuite makes no distinction between upper-case and lower-case letters. Gaps or insertions (white space or '-') are treated as N. The file should end with a blank line (a final line return) to ensure that the last sequence in the dataset is also loaded by the program.
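
The requirements above can be checked before a file is passed to MotifSuite. The following Python sketch is one possible way to do so and is not part of MotifSuite; the function name `check_fasta` and the wording of the messages are hypothetical. It assumes that '#' lines are skipped as described for MotifSampler, and that "end with a blank line" means the text ends with a line return.

```python
from typing import List

ALLOWED = set("ACGTNacgtn")   # sequence data alphabet
GAP_CHARS = set(" \t-")       # white space or '-', described above as treated as N


def check_fasta(text: str) -> List[str]:
    """Return a list of messages, one per violated requirement."""
    problems: List[str] = []
    seen_header = False

    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith("#"):
            continue  # masked line, skipped on loading
        if not line.strip():
            continue  # empty line
        if line.startswith(">"):
            seen_header = True
            # The identifier must follow '>' without a space.
            if len(line) == 1 or line[1].isspace():
                problems.append(f"line {number}: no identifier directly after '>'")
            # Recommendation only: keep the header under 80 characters.
            if len(line) >= 80:
                problems.append(
                    f"line {number}: header is {len(line)} characters (under 80 recommended)"
                )
            continue
        if not seen_header:
            problems.append(f"line {number}: sequence data before the first '>' line")
        bad = sorted(set(line) - ALLOWED - GAP_CHARS)
        if bad:
            problems.append(f"line {number}: unexpected characters {''.join(bad)!r}")

    if text and not text.endswith("\n"):
        problems.append("file does not end with a line return")
    return problems


# Hypothetical input: the second sequence is fully masked with '#'.
masked = ">seq1 demo\nACGT-N\n#>seq2\n#ACGT\n"
print(check_fasta(masked))
```

A sketch like this reports problems only; it does not rewrite the file, so the decision on how to repair a sequence stays with the user.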

Example

Feedback

Contact us if you have comments, questions or suggestions, or simply want to respond to the contents of this guideline. Thank you.
