TREE format

This page describes the format of a file that contains a rooted phylogenetic tree with branch lengths in NEWICK format.
We comment on all optional and required fields in case you need to supply such a file as input to MotifSuite.

In MotifSuite, a tree file is supplied as input for OrthoMotifSampler.
Page contents:
    File format
    Conversion requirements
    Example



TREE file format

In bioinformatics, NEWICK is a text-based format for representing trees in computer-readable form using (nested) parentheses and commas. For detailed information on how to build such a nested-parentheses tree, we refer to ~/phylip/newicktree.html.

The Newick tree occupies a single line. The line starts with a greater-than symbol ('>') in the first column, followed by a tree-recognition string (in our case 'Tree' or 'Star'), which is immediately followed by the (nested) parentheses describing the relations between the organisms involved in your study:
e.g.
        >Star(organism1:branchlength1,organism2:branchlength2,organism3:branchlength3);
or
        >Tree(organism1:branchlength1,(organism2:branchlength2,organism3:branchlength3):node_branchlength1);
or with internal node identification (optional):
        >Tree(organism1:branchlength1,(organism2:branchlength2,organism3:branchlength3)NODE1:node_branchlength1);
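
The layout of the tree line can be checked before the file is passed to the software. The following is a minimal sketch, not MotifSuite's own loader: the function name splitTreeLine is hypothetical, and the requirement of a closing semicolon is an assumption taken from the examples above.

```cpp
#include <string>

// Hypothetical helper (not part of MotifSuite): splits a tree line such as
// ">Tree(a:1,b:2);" into the tree-recognition string ("Tree" or "Star") and
// the Newick description "(a:1,b:2);". Returns false if the line does not
// follow the layout described above.
bool splitTreeLine(const std::string& line, std::string& tag, std::string& newick) {
    // The '>' symbol must be in the first column.
    if (line.empty() || line[0] != '>') return false;
    // The Newick description starts at the first opening parenthesis.
    const std::string::size_type open = line.find('(');
    if (open == std::string::npos) return false;
    // Everything between '>' and '(' is the tree-recognition string.
    const std::string candidate = line.substr(1, open - 1);
    if (candidate != "Tree" && candidate != "Star") return false;
    // Assumption: as in the examples, the description ends with ';'.
    if (line[line.size() - 1] != ';') return false;
    tag = candidate;
    newick = line.substr(open);
    return true;
}
```

For the line ">Star(a:0.1,b:0.2);" this sketch yields the tag "Star" and the description "(a:0.1,b:0.2);"; a line without the leading '>' or with another recognition string is rejected.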

The organism names must match exactly the organism identifiers used in your sequences FASTA file (in that file, the organism names also follow the greater-than symbol '>' in the sequence identifier lines, see Fasta format). A branchlength consists of digits and, for decimal numbers, a dot (not a comma). Branchlengths are always preceded by a colon ':'. Internal nodes may, but need not, be identified by a string; our software does not use such node strings, so 'NODE1' in the example above is optional. The whole line is loaded as one unit, so no white space or tabs are allowed in any identifier or number, nor before or after parentheses, commas or colons. A tree-recognition string ('Tree' or 'Star') is required, but the choice between the two is not strict: either string can be used for either a star or a tree topology, and the choice does not influence how the root, node, ancestor and branchlength information is loaded.
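
To compare the organism names in a tree with the identifiers in a FASTA file, the leaves and their branchlengths can be extracted from the Newick description (the part after 'Tree' or 'Star'). The sketch below is illustrative only: leafBranchLengths is a hypothetical name, and the checks merely mirror the rules stated above (no white space, digits with at most one dot, a colon before each branchlength).

```cpp
#include <cctype>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns organism name -> branchlength for every leaf.
// Internal node labels (such as NODE1) are skipped.
std::map<std::string, double> leafBranchLengths(const std::string& newick) {
    typedef std::string::size_type Pos;
    // The whole line is one unit: reject any white space or tab.
    for (Pos k = 0; k < newick.size(); ++k)
        if (std::isspace(static_cast<unsigned char>(newick[k])))
            throw std::invalid_argument("white space is not allowed");

    std::map<std::string, double> leaves;
    Pos i = 0;
    while (i < newick.size()) {
        const char c = newick[i++];
        // A leaf starts right after '(' or ','; text after ')' labels an internal node.
        if (c != '(' && c != ',') continue;
        if (i < newick.size() && newick[i] == '(') continue;  // nested group, not a leaf
        const Pos colon = newick.find(':', i);
        if (colon == std::string::npos)
            throw std::invalid_argument("missing ':' before branchlength");
        const std::string name = newick.substr(i, colon - i);
        if (name.empty() || name.find_first_of("(),;") != std::string::npos)
            throw std::invalid_argument("bad organism name: " + name);
        const Pos end = newick.find_first_of(",)", colon + 1);
        if (end == std::string::npos)
            throw std::invalid_argument("unterminated branchlength");
        const std::string number = newick.substr(colon + 1, end - colon - 1);
        // Only digits and at most one dot are accepted (no comma).
        int dots = 0;
        bool hasDigit = false;
        for (Pos k = 0; k < number.size(); ++k) {
            if (number[k] == '.') ++dots;
            else if (std::isdigit(static_cast<unsigned char>(number[k]))) hasDigit = true;
            else throw std::invalid_argument("bad branchlength: " + number);
        }
        if (!hasDigit || dots > 1)
            throw std::invalid_argument("bad branchlength: " + number);
        leaves[name] = std::stod(number);
        i = end;  // continue at the ',' or ')' that closed this leaf
    }
    return leaves;
}
```

The keys of the returned map can then be compared with the organism identifiers found in the FASTA file. A branchlength written with a decimal comma, as in "(org1:0,5,org2:1);", makes the sketch throw std::invalid_argument.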

(to do:)(expert use)
if (pInput[0] == "GeneBias") StoreGeneMutationRates(pInput);
else if (pInput[0] == "CoregBias") StoreCoregulationWeights(pInput);
else if (pInput[0] == "UnitSubstRate") { istringstream istr(pInput[1]);



Conversion requirements

- There is no standard file extension for a text file containing a Newick tree. We propose the use of '.tree' or simply '.txt'.
- When a file is loaded by our software, lines starting with the symbol '#' are skipped. You can use such lines to add notes for your own reference.
- IMPORTANT: In case of a non-star topology, organisms that are not involved in your motif detection study (i.e. there are no sequences for these organisms in your FASTA file) cannot simply be deleted from the Newick tree description: the parentheses may no longer nest correctly, which obstructs the definition of internal nodes in our software. In such a case, supply the full tree; our software loads it and internally removes the branches that describe non-involved organisms, as well as internal nodes that only connect to non-involved organisms. Note that we choose not to remove internal nodes that connect to only a single involved organism (we do not merge the branchlengths). Alternatively, you can redesign the tree to describe only the involved organisms and provide the corresponding Newick description as input for our software.
- The file should end with a blank line (a final line return) to ensure that all information is loaded by the program.
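
The removal of non-involved organisms described above can be illustrated as follows. This is a sketch of the described behaviour under our own assumptions, not the actual MotifSuite implementation: the names Node, parseSubtree, prune, toNewick and pruneNewick are hypothetical, the input is the Newick description without the leading '>Tree' or '>Star', and branchlengths are kept as text.

```cpp
#include <set>
#include <string>
#include <vector>

// Illustrative tree node (hypothetical, not a MotifSuite type).
struct Node {
    std::string name;            // organism name or optional internal node label
    std::string length;          // branchlength text, empty for the root
    std::vector<Node> children;  // empty for a leaf
};

// Parses one subtree starting at text[pos] and advances pos past it.
Node parseSubtree(const std::string& text, std::string::size_type& pos) {
    Node node;
    if (pos < text.size() && text[pos] == '(') {
        do {
            ++pos;  // skip '(' or ','
            node.children.push_back(parseSubtree(text, pos));
        } while (pos < text.size() && text[pos] == ',');
        if (pos < text.size() && text[pos] == ')') ++pos;
    }
    // Optional name, then optional ":branchlength".
    std::string::size_type end = text.find_first_of(":,);", pos);
    if (end == std::string::npos) end = text.size();
    node.name = text.substr(pos, end - pos);
    pos = end;
    if (pos < text.size() && text[pos] == ':') {
        end = text.find_first_of(",);", pos + 1);
        if (end == std::string::npos) end = text.size();
        node.length = text.substr(pos + 1, end - pos - 1);
        pos = end;
    }
    return node;
}

// Removes non-involved leaves and internal nodes left without any involved
// organism. Returns false if nothing below this node is involved.
// Internal nodes with a single remaining child are kept (no merging).
bool prune(Node& node, const std::set<std::string>& involved) {
    if (node.children.empty()) return involved.count(node.name) > 0;
    std::vector<Node> kept;
    for (std::vector<Node>::size_type i = 0; i < node.children.size(); ++i)
        if (prune(node.children[i], involved)) kept.push_back(node.children[i]);
    node.children = kept;
    return !node.children.empty();
}

// Writes a (sub)tree back in Newick notation, without the final ';'.
std::string toNewick(const Node& node) {
    std::string out;
    if (!node.children.empty()) {
        out += '(';
        for (std::vector<Node>::size_type i = 0; i < node.children.size(); ++i) {
            if (i > 0) out += ',';
            out += toNewick(node.children[i]);
        }
        out += ')';
    }
    out += node.name;
    if (!node.length.empty()) out += ':' + node.length;
    return out;
}

// Convenience wrapper: parse, prune, and write back with the closing ';'.
std::string pruneNewick(const std::string& newick, const std::set<std::string>& involved) {
    std::string::size_type pos = 0;
    Node root = parseSubtree(newick, pos);
    prune(root, involved);
    return toNewick(root) + ";";
}
```

For the description "(a:1,(b:2,c:3)N1:4,(d:5,e:6):7);" with only a and b involved, the sketch returns "(a:1,(b:2)N1:4);": the node N1 is kept with its single child b and its own branchlength, and the group holding d and e disappears entirely.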



Example

[Figure: Sacc_NewickTree.png]

This file describes the Newick tree for 5 Saccharomyces organisms related as shown in:
[Figure: Sacc_star_tree.png]



Feedback

Contact us if you have comments, questions or suggestions, or simply want to respond to the contents of this guideline. Thank you.