Required Data and File Formats
Interaction Network Data
Key to the use of IAMBEE is an interaction network. The interaction network is a compilation of the available interactomics data of the organism under research.
The interaction network file consists of two parts:
- The definition of the interaction types (Header).
- The list of interactions (Body).
The different types of interactions that will be used are defined in the file header. Each interaction definition line in the header must start with a "%", followed by the name of the interaction type (free text can be used) and an indication of whether the interaction is regulatory, using the fixed keywords "regulatory" or "non-regulatory". Fields must be separated from each other by a space. Below is an example in which different interaction types are defined.
The main body of the interaction network file consists of the interactions. Each interaction line contains the identifier of the start gene, the identifier of the end gene, the type of the interaction (corresponding to one of the interaction types defined in the header) and the direction of the interaction, given by the keyword "directed" or "undirected". All fields must be separated by a comma. Below is an example of some interaction lines.
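As an illustration only (this is not IAMBEE's own code or sample data), the two-part layout described above can be checked with a short Python sketch. The function name parse_network, the type names "pp" and "tf" and the gene identifiers are hypothetical. Because the description does not say whether a space follows the "%", the sketch accepts both forms.

```python
def parse_network(text):
    """Split a network file into header types and body interactions."""
    types = {}   # interaction type name -> True if regulatory
    edges = []   # (start gene, end gene, type, True if directed)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("%"):
            # Header line: "%", type name, fixed keyword (space-separated).
            parts = line[1:].split()
            if len(parts) < 2 or parts[-1] not in ("regulatory", "non-regulatory"):
                raise ValueError(f"bad header line: {raw!r}")
            # The name is free text, so everything before the keyword is kept.
            types[" ".join(parts[:-1])] = parts[-1] == "regulatory"
        else:
            # Body line: start gene, end gene, type, direction (comma-separated).
            fields = [f.strip() for f in line.split(",")]
            if len(fields) != 4:
                raise ValueError(f"expected 4 fields: {raw!r}")
            start, end, itype, direction = fields
            if itype not in types:
                raise ValueError(f"type not defined in header: {itype!r}")
            if direction not in ("directed", "undirected"):
                raise ValueError(f"bad direction keyword: {direction!r}")
            edges.append((start, end, itype, direction == "directed"))
    return types, edges


# Hypothetical file content, made up for this sketch.
network_text = """\
% pp non-regulatory
%tf regulatory
1001,1002,pp,undirected
1003,1001,tf,directed
"""
network_types, network_edges = parse_network(network_text)
```

Running such a check before submission catches interaction types that are used in the body but never declared in the header.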
IAMBEE analyses sequencing data from populations that were adapted to the same focal endpoint. It combines the search for recurrently mutated sub-networks with information on the functional impact scores of the occurring mutations and, if available, with the relative frequency increase during a selective sweep. The latter information must be provided in the mutation file.
The mutation data should be in comma-delimited format. The file header is also comma-delimited but starts with the "#" symbol. There is only one header line, and the fields in the header should be spelled exactly as shown in the example.
This file can be created by the user, or generated by preprocessing the VCF files provided by the user on the submission page. The specifications of the file are provided below.
The fields of each mutation must be given in a fixed, predefined order.
The header describes the information contained in each of the columns of the data file. It contains the following information:
- Position: the position of the mutation (mandatory)
- Gene identifier: the identifier of the gene carrying the mutation (mandatory; must match the ID used in the network file)
- Reference: the reference allele (optional, but the header field is mandatory)
- Alternative: the alternative allele (optional, but the header field is mandatory)
- Freq increase: the frequency increase of the mutation in the population during a selective sweep (optional)
- Functional score: the functional impact score of the mutation in the population (optional)
- Condition: identifies the population in which the mutation occurs (mandatory)
- synonymous: annotation of the mutation (optional)
The fields position, gene identifier, reference and alternative are mandatory in the header. However, if the data are not available, an "NA" value can be used. A value for the gene identifier is always required. The position and the reference and alternative alleles are meant for annotation purposes only and are not used during calculations.
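The rules above (a single "#"-prefixed header, comma-delimited rows, "NA" for unavailable values, real values for the gene identifier and condition) can be sketched as a small Python check. This is a minimal sketch, not IAMBEE's implementation. The header spellings in the sample are placeholders, since the exact spelling must be taken from the official example, and the function name read_mutations and all sample values are hypothetical.

```python
def read_mutations(text, mandatory_values):
    """Read a mutation file and check columns that need a real value.

    mandatory_values lists the header names whose value may not be
    empty or "NA" (here: the gene identifier and the condition).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("#"):
        raise ValueError("first line must be the '#'-prefixed header")
    # Drop the leading "#" and split the single header line on commas.
    header = [h.strip() for h in lines[0][1:].split(",")]
    rows = []
    for n, line in enumerate(lines[1:], start=1):
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(header):
            raise ValueError(f"row {n}: expected {len(header)} fields, got {len(values)}")
        row = dict(zip(header, values))
        for col in mandatory_values:
            if row.get(col, "") in ("", "NA"):
                raise ValueError(f"row {n}: column {col!r} needs a real value")
        rows.append(row)
    return header, rows


# Placeholder header names and made-up values, for illustration only.
mutation_text = """\
#position,gene,reference,alternative,freq_increase,functional_score,condition,synonymous
NA,1001,NA,NA,0.4,0.01,pop1,NA
2345,1002,A,G,NA,NA,pop2,NA
"""
mutation_header, mutation_rows = read_mutations(
    mutation_text, mandatory_values=("gene", "condition")
)
```

Optional columns keep "NA" as a plain string here, so any downstream step has to decide how to treat missing scores.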
- Source of Genome: NCBI or Ensembl can be used. Keep in mind that the gene IDs must match the network IDs. Additionally, adding functional scores (SIFT scores) to your variants requires SIFT4G, and creating the SIFT4G prediction database requires a GTF file with start/stop codon annotations, which NCBI does not provide. Optionally, a database can be created through the SIFT4G website.
- Variant Calling: Evolution experiments require all variants to be taken into account, including low-frequency variants that can become established during a sweep. The LoFreq variant caller was used to analyze the provided sample data and extract the information required for IAMBEE.
- Frequency Increase: As mentioned previously, SIFT4G was used to annotate the variants.
As in a standard VCF file, each line of the file header starts with "##".
In contrast to the file header, there is a second header row that starts with a single "#" and lists the columns of the VCF.
Adding SIFT scores modifies the "INFO" field of the VCF file, appending extra information about each variant.
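To make the three VCF parts concrete, the following Python sketch separates the "##" file header, the single-"#" column row and the variant records, and splits an INFO value into its key=value entries. It is an illustration, not the preprocessing code used by IAMBEE. The sample record is made up, and the key EXTRA is a hypothetical stand-in for whatever an annotation tool appends to INFO.

```python
def split_vcf(text):
    """Separate "##" header lines, the "#" column row and the records."""
    meta, columns, records = [], None, []
    for line in text.splitlines():
        if line.startswith("##"):
            meta.append(line)
        elif line.startswith("#"):
            # Column row: strip the "#" and split on tabs.
            columns = line[1:].split("\t")
        elif line.strip():
            if columns is None:
                raise ValueError("record found before the column row")
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, records


def parse_info(info):
    """Turn an INFO string such as "DP=30;AF=0.12" into a dict."""
    entries = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        # Flags without "=" are stored as True.
        entries[key] = value if sep else True
    return entries


# Made-up VCF content; EXTRA is a hypothetical annotation key.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "##source=example\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=30;AF=0.12;EXTRA=x|y|z\n"
)
vcf_meta, vcf_columns, vcf_records = split_vcf(vcf_text)
vcf_info = parse_info(vcf_records[0]["INFO"])
```

Comparing the INFO keys before and after annotation is a quick way to confirm that the functional scores were actually added.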
Gene-Names Mapping File
We provide the possibility to upload a file in which the gene IDs used in the network/mutation file (generally numeric values) can be mapped to a more interpretable nomenclature (e.g. common names).
The mapping file must be comma-delimited. The file header is also comma-delimited but starts with a "#" symbol.
The information for each gene must be defined using a from -> to structure. To distinguish the header from the other comma-separated rows, a "#" symbol must be added, without spaces, at the start of the header line. The header contains the following elements:
The from column must match the identifiers of the network file and mutation file.
Example of mapping file lines. Each line starts with the gene ID used in the previous files, followed by the new ID to be used.
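The from -> to structure amounts to a simple lookup table. A minimal Python sketch, assuming a header spelled "#from,to" and using made-up identifiers and gene names (the function name load_mapping is likewise hypothetical):

```python
def load_mapping(text):
    """Read a comma-delimited mapping file into a dict (from -> to)."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "#"-prefixed header line.
        if not line or line.startswith("#"):
            continue
        source, target = [f.strip() for f in line.split(",", 1)]
        mapping[source] = target
    return mapping


# Made-up mapping content; the header spelling is an assumption.
mapping_text = """\
#from,to
1001,geneA
1002,geneB
"""
gene_names = load_mapping(mapping_text)

# Identifiers without an entry are left unchanged.
renamed = [gene_names.get(g, g) for g in ["1001", "1003"]]
```

Falling back to the original identifier keeps unmapped genes visible in the results instead of dropping them.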