FASTA format

This page describes the FASTA file format used to supply either a set of DNA sequences from one species or sets of orthologous sequences from multiple species.
All required and optional fields are explained, in case you need to supply such a file as input to MotifSuite.

In MotifSuite, a FASTA file is supplied as input to MotifSampler, PhyloMotifSampler, NOrthoMotifSampler, MotifLocator and CreateBackgroundModel.
Page contents :
    File format
    Conversion requirements
    Example



FASTA file format

In bioinformatics, FASTA is a text-based format for representing DNA sequences, in which each nucleotide is represented by a single-letter code [A,C,G,T,N], where A = adenine, C = cytosine, G = guanine, T = thymine and N = any of A, C, G, T. The format also allows sequence names and comments to precede the sequences.
A sequence in FASTA format begins with a single identifier line, followed by lines of DNA sequence data. The identifier line is distinguished from the sequence data by a greater-than ('>') symbol in the first column. The word following the '>' symbol is the identifier of the sequence; the rest of the line is an optional description, separated from the identifier by a space or tab. The sequence data starts on the line after the identifier line and ends when another line starting with '>' appears, which marks the start of the next sequence.

For DNA sequences from one species :
The sequence identifier typically refers to the target gene regulated by the hidden motif in the dataset:

        >target_gene1     description(optional)
        lines of {ACGTNacgtn}
        >target_gene2     description(optional)
        ....
        >target_geneEnd     description(optional)
        lines of {ACGTNacgtn}
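
As a rough illustration of the single-species layout above, the following Python sketch reads FASTA text into (identifier, description, sequence) tuples. The function name parse_fasta and the sample identifiers are hypothetical and not part of MotifSuite; the sketch deliberately ignores the '#' masking and gap rules listed under Conversion requirements.

```python
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples.

    Illustrative sketch only: this is not the parser used by MotifSuite.
    """
    records = []  # each entry: [identifier, description, list of sequence lines]
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # First word after '>' is the identifier, the rest is the description.
            parts = line[1:].split(None, 1)
            identifier = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            records.append([identifier, description, []])
        elif records:
            # Sequence data belongs to the most recent identifier line.
            records[-1][2].append(line)
    return [(ident, desc, "".join(chunks)) for ident, desc, chunks in records]


# Hypothetical input: two sequences, the first with a tab-separated description.
sample = ">geneA\tupstream region\nACGT\nacgn\n>geneB\nTTGA\n"
for identifier, description, sequence in parse_fasta(sample):
    print(identifier, description, sequence)
```

Sequence lines are concatenated in file order, so a sequence wrapped over several lines comes back as one string.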

For orthologous DNA sequences from multiple species (input for PhyloMotifSampler (PHMS) and NOrthoMotifSampler (NOMS)):
Sets of orthologous sequences (regulating the same target gene in different species) are separated by a line with a double greater-than ('>>') symbol in the first column. The word following the '>>' symbol is called the group identifier; it may be followed by a more detailed description, separated by a tab. The group ('>>') identifier typically refers to the target gene regulated by the hidden motif in the dataset. The sequence ('>') identifiers that follow refer to the species whose genome contains the respective orthologous sequence:

        >>target_gene1     description(optional)
        >species1     description(optional)
        lines of {ACGTNacgtn}
        >species2     description(optional)
        ....
        >>target_gene2     description(optional)
        >species1     description(optional)
        lines of {ACGTNacgtn}
        >species2     description(optional)
        ....
        >>target_geneEnd     description(optional)
        ....
        >speciesEnd     description(optional)
        lines of {ACGTNacgtn}
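
The grouped layout above can be read in much the same way, provided '>>' is tested before '>'. The Python sketch below returns a nested dictionary {group identifier: {species identifier: sequence}}. The function name parse_grouped_fasta and the sample identifiers are hypothetical; this is an illustration under those assumptions, not the loader used by PhyloMotifSampler or NOrthoMotifSampler.

```python
def parse_grouped_fasta(text):
    """Return {group_id: {species_id: sequence}} for '>>'-grouped FASTA text.

    Illustrative sketch only: descriptions are discarded and '#' lines
    are not handled here.
    """
    groups = {}
    group_id = None
    species_id = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith(">>"):
            # Group header: must be checked before the single '>' case.
            fields = line[2:].split(None, 1)
            if not fields:
                raise ValueError("'>>' line without a group identifier")
            group_id = fields[0]
            groups[group_id] = {}
            species_id = None
        elif line.startswith(">"):
            if group_id is None:
                raise ValueError("'>' header found before any '>>' group header")
            fields = line[1:].split(None, 1)
            if not fields:
                raise ValueError("'>' line without a sequence identifier")
            species_id = fields[0]
            groups[group_id][species_id] = ""
        elif species_id is not None:
            # Sequence data for the current species within the current group.
            groups[group_id][species_id] += line
    return groups


# Hypothetical input: two target genes, with orthologs from one or two species.
sample = ">>geneA\tdesc\n>human\nACGT\nAC\n>mouse\nTTGA\n>>geneB\n>human\nGGCC\n"
print(parse_grouped_fasta(sample))
```

Keying the inner dictionary by species identifier assumes each species occurs at most once per group; a repeated species identifier would overwrite the earlier entry.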



Conversion requirements

- There is no standard file extension for a text file containing FASTA formatted sequences. Some examples of widely used file extensions are '.fasta', '.fna' or simply '.txt'.
- There should be no space between the '>' or '>>' symbol and the first letter of the identifier. The '>' symbol and the identifier that follows are required; the tab-separated description is optional (you can use it to clarify your input). The identifier (not necessarily numeric) is used to report sequence-related results in the output files generated by MotifSuite (e.g. the location of motif instances in a sequence for a given species). It is recommended that the full identifier line be shorter than 80 characters.
- When a FASTA file is loaded by our software, lines starting with the symbol '#' are skipped. This can be useful if you want to (re)run our software without some of the DNA sequences. To exclude the sequences of a particular target gene or species, you need to mask every line describing that target gene or species with '#'. Failure to do so may result in sequence data being assigned to the preceding (target gene or species) identifier.
- The sequence data alphabet is {A,C,G,T,N,a,c,g,t,n}. Our software makes no distinction between upper-case and lower-case letters. Gaps or insertions (white space or '-') are treated as N. The file should end with a blank line (a final line return) to ensure that the last sequence in the dataset is also loaded by the program.
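
The rules above can be checked before a file is handed to MotifSuite. The Python sketch below is one possible pre-check, not MotifSuite code: check_header and normalize_sequence_line are hypothetical helper names, converting to upper case is only a convenience (the software itself is case-insensitive), and stripping trailing white space before mapping gaps to N is an assumption of this sketch.

```python
ALLOWED = set("ACGTN")


def check_header(line):
    """Return the identifier of a '>' or '>>' line, or raise ValueError."""
    marker = ">>" if line.startswith(">>") else ">"
    if not line.startswith(marker):
        raise ValueError("not an identifier line")
    body = line[len(marker):]
    # No space is allowed between the marker and the identifier.
    if not body or body[0].isspace():
        raise ValueError("identifier must directly follow '" + marker + "'")
    return body.split(None, 1)[0]


def normalize_sequence_line(line):
    """Return the cleaned sequence line, or None if the line is masked with '#'."""
    if line.startswith("#"):
        return None  # masked lines are skipped
    cleaned = []
    for ch in line.rstrip():  # assumption: trailing white space is not a gap
        if ch in " \t-":
            cleaned.append("N")  # gaps or insertions are treated as N
        elif ch.upper() in ALLOWED:
            cleaned.append(ch.upper())
        else:
            raise ValueError("unexpected character %r in sequence data" % ch)
    return "".join(cleaned)


print(check_header(">>geneA\tdescription"))
print(normalize_sequence_line("ac-gt n"))
print(normalize_sequence_line("#ACGT"))
```

Raising an error on characters outside the alphabet is a design choice of this sketch; the page does not state how the software itself reacts to such characters.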



Example

    For DNA sequences from one species :

[Figure: Fasta.png]

    For orthologous DNA sequences from multiple species (input for PHMS and NOMS):

[Figure: NOrthoFasta.png]



Feedback

Contact us if you have comments, questions or suggestions, or simply want to respond to the contents of this guideline. Thank you.