FASTA format

This page describes the FASTA file format used to supply a set of DNA sequences from one species, or sets of orthologous sequences from multiple species, as input to MotifSuite.
All required and optional fields are explained below.

In MotifSuite, a FASTA file is supplied as input for MotifSampler, PhyloMotifSampler, NOrthoMotifSampler, MotifLocator and CreateBackgroundModel.
Page contents:
File format
Conversion requirements
Example

FASTA file format

In bioinformatics, FASTA is a text-based format for representing DNA sequences, in which nucleotides are represented by a single-letter code [A,C,G,T,N], where A = adenine, C = cytosine, G = guanine, T = thymine and N = any of A, C, G, T. The format also allows sequence names and comments to precede the sequences.
A sequence in FASTA format begins with a single-line identifier description, followed by lines of DNA sequence data. The identifier line is distinguished from the sequence data by a greater-than ('>') symbol in the first column. The word following the '>' symbol is the identifier of the sequence; the rest of the line is an optional description, separated from the identifier by a space or tab. The sequence data starts on the line after the identifier line and ends when another line starting with '>' appears, which marks the start of the next sequence.

For DNA sequences from one species:
The sequence identifier typically refers to the target gene regulated by the hidden motif in the dataset:

>target_gene1 description(optional)
lines of {ACGTNacgtn}
>target_gene2 description(optional)
....
>target_geneEnd description(optional)
lines of {ACGTNacgtn}
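
The single-species layout above can be read with a few lines of Python. The following is a minimal sketch, not the loader used by MotifSuite: the function name parse_fasta and the identifiers geneA and geneB are hypothetical, and skipping blank lines is an assumption. It does not handle '>>' group lines.

```python
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    current = None  # [identifier, description, list of sequence lines]
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue  # assumption: blank lines carry no sequence data
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current is not None:
                records.append((current[0], current[1], "".join(current[2])))
            # First word is the identifier, the rest is the optional description.
            fields = line[1:].split(None, 1)
            identifier = fields[0] if fields else ""
            description = fields[1] if len(fields) > 1 else ""
            current = [identifier, description, []]
        elif current is not None:
            current[2].append(line)
    if current is not None:
        records.append((current[0], current[1], "".join(current[2])))
    return records


# Hypothetical two-sequence input.
example = """>geneA upstream region (hypothetical)
ACGTNACGT
acgtn
>geneB
GGGTTTAAA
"""

records = parse_fasta(example)
for identifier, description, sequence in records:
    print(identifier, repr(description), sequence)
```

Sequence lines belonging to one identifier are concatenated, so line length inside a record does not matter to the result.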

For orthologous DNA sequences from multiple species (input for PHMS and NOMS):
Sets of orthologous sequences (regulating the same target gene in different species) are separated by a line with a double greater-than ('>>') symbol in the first column. The word following the '>>' symbol is called the group identifier; it can be followed by a tab-separated, more detailed description. The group ('>>') identifier typically refers to the target gene regulated by the hidden motif in the dataset. The sequence ('>') identifiers that follow refer to the species genome of the respective orthologous sequences:

>>target_gene1 description(optional)
>species1 description(optional)
lines of {ACGTNacgtn}
>species2 description(optional)
....
>>target_gene2 description(optional)
>species1 description(optional)
lines of {ACGTNacgtn}
>species2 description(optional)
....
>>target_geneEnd description(optional)
....
>speciesEnd description(optional)
lines of {ACGTNacgtn}
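
The grouped layout above maps naturally onto a nested dictionary: one entry per '>>' group, holding one sequence per '>' species identifier. The sketch below illustrates this under stated assumptions; parse_ortholog_fasta, geneA, geneB, species1 and species2 are hypothetical names, descriptions are discarded, and this is not the parser used by PHMS or NOMS.

```python
def parse_ortholog_fasta(text):
    """Return {group_id: {species_id: sequence}} in file order."""
    groups = {}
    group = None    # identifier of the current '>>' group
    species = None  # identifier of the current '>' sequence
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue  # assumption: blank lines carry no sequence data
        if line.startswith(">>"):
            fields = line[2:].split(None, 1)
            if not fields:
                raise ValueError("'>>' line without a group identifier")
            group = fields[0]
            groups[group] = {}
            species = None
        elif line.startswith(">"):
            if group is None:
                raise ValueError("'>' header before any '>>' group line")
            fields = line[1:].split(None, 1)
            if not fields:
                raise ValueError("'>' line without a sequence identifier")
            species = fields[0]
            groups[group][species] = ""
        else:
            if species is None:
                raise ValueError("sequence data before any '>' header")
            groups[group][species] += line
    return groups


# Hypothetical input with two target genes.
example = """>>geneA\tfirst target gene (hypothetical)
>species1
ACGTACGT
>species2
ACGTTCGT
ACG
>>geneB
>species1
GGGNNNCC
"""

groups = parse_ortholog_fasta(example)
for group_id, members in groups.items():
    for species_id, sequence in members.items():
        print(group_id, species_id, sequence)
```

Raising an error for a '>' header that appears before any '>>' line is a design choice of this sketch; it makes a single-species file passed in by mistake fail early rather than be silently misread.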

Conversion requirements

- There is no standard file extension for a text file containing FASTA formatted sequences. Some examples of widely used file extensions are '.fasta', '.fna' or simply '.txt'.
- There should be no space between the '>' or '>>' symbol and the first letter of the identifier. The '>' symbol and the identifier that follows it are required; the tab-separated description is optional (you can use it to clarify your input). The identifier (which does not need to be a number) is used to report sequence-related results in the output files generated by MotifSuite (e.g. the location of motif instances in a sequence for a given species). It is recommended that the full identifier line be shorter than 80 characters.
- When a FASTA file is loaded by our software, lines starting with the symbol '#' are skipped. This can be useful if you want to (re)run our software without some of the DNA sequences. To exclude the sequences of a particular target gene or species, mask every line describing that target gene or species with '#'. If only some of the lines are masked, the remaining sequence data may be assigned to the preceding (target gene or species) identifier.
- The sequence data alphabet is {A,C,G,T,N,a,c,g,t,n}. Our software makes no distinction between upper-case and lower-case letters. Gaps or insertions (white space or '-') are treated as N/n. The file should end with a blank line (a final line return) to ensure that the last sequence in the dataset is also loaded by the program.
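
The requirements above can be checked before a file is handed to MotifSuite. The following is a minimal sketch of such a pre-check, not part of MotifSuite itself: prepare_fasta and the sample identifiers are hypothetical, and dropping blank lines and trailing white space is an assumption made here. It removes '#'-masked lines, replaces gaps with N, rejects characters outside the alphabet, and guarantees a final line return.

```python
import re

GAP = re.compile(r"[\s\-]")               # white space or '-' inside sequence data
INVALID = re.compile(r"[^ACGTNacgtn]")    # anything outside the sequence alphabet


def prepare_fasta(text):
    """Return cleaned FASTA text that ends with a line return."""
    cleaned = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith("#"):
            continue  # masked line, skipped as described above
        line = line.rstrip()  # assumption: trailing white space is not a gap
        if not line:
            continue  # assumption: blank lines are dropped
        if line.startswith(">"):
            cleaned.append(line)  # '>' and '>>' header lines are kept unchanged
            continue
        line = GAP.sub("N", line)
        bad = INVALID.search(line)
        if bad:
            raise ValueError(f"line {number}: unexpected character {bad.group()!r}")
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


# Hypothetical input: geneB is fully masked, geneA contains two gaps,
# and the text lacks a final line return.
raw = ">geneA\nACG-T ac\n#>geneB\n#TTTT\n>geneC\nggg"
print(prepare_fasta(raw), end="")
```

Note that both the header and the sequence line of geneB are masked in the example; masking only the header would leave TTTT to be appended to geneA.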

Example

For DNA sequences from one species:
[Figure: Fasta.png]

For orthologous DNA sequences from multiple species (input for PHMS and NOMS):
[Figure: NOrthoFasta.png]

Feedback

Contact us if you have comments, questions or suggestions, or simply want to respond to the contents of this guideline. Thank you.
