Required Data and File Formats

Interaction Network Data

Key to the use of IAMBEE is an interaction network. The interaction network is a compilation of the available interactomics data of the organism under research.

The interaction network file consists of two parts:

  • The definition of the interaction types (Header).
  • The list of interactions (Body).

File Header

The different types of interactions that will be used are defined in the file header. Each interaction definition line in the header must start with a "%", followed by the name of the interaction type (free text can be used) and an indication of whether the interaction is regulatory, given with one of the fixed keywords "regulatory" or "non-regulatory". Each field must be separated from the next by a space. Below is an example in which different interaction types are defined.

Important:

For IAMBEE the following interactions can be defined in the network

The header lines and the interaction types in your network file must exactly match the interaction definitions listed above. Otherwise, the visualization will not work properly.

File Body

The main body of the interaction network file lists the interactions. Each interaction line contains the identifier of the start gene, the identifier of the end gene, the type of the interaction (which must correspond to one of the interaction types defined in the file header) and the direction of the interaction, given with the keyword "directed" or "undirected". All fields must be separated by a comma. Below is an example of some interaction lines.
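As a rough illustration of the two-part layout described above, the following Python sketch reads a network file into its interaction-type definitions and its interactions. It is not part of IAMBEE: the function name parse_network, the Interaction tuple and the type names "typeA" and "typeB" are hypothetical placeholders, and the sketch assumes that the type name is everything between the "%" and the final keyword.

```python
from typing import NamedTuple


class Interaction(NamedTuple):
    source: str
    target: str
    kind: str
    directed: bool


def parse_network(text: str):
    """Split a network file into type definitions (header) and interactions (body)."""
    types = {}          # interaction type name -> True if regulatory
    interactions = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith("%"):
            # Header line: "%", the type name, then the fixed keyword (space-separated).
            name, _, keyword = line[1:].strip().rpartition(" ")
            if not name.strip() or keyword not in ("regulatory", "non-regulatory"):
                raise ValueError(f"line {lineno}: bad interaction definition: {raw!r}")
            types[name.strip()] = keyword == "regulatory"
            continue
        # Body line: start gene, end gene, interaction type, direction (comma-separated).
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 4:
            raise ValueError(f"line {lineno}: expected 4 fields, got {len(fields)}")
        source, target, kind, direction = fields
        if kind not in types:
            raise ValueError(f"line {lineno}: undefined interaction type {kind!r}")
        if direction not in ("directed", "undirected"):
            raise ValueError(f"line {lineno}: bad direction {direction!r}")
        interactions.append(Interaction(source, target, kind, direction == "directed"))
    return types, interactions


# Placeholder content; real type names must follow the definitions accepted by IAMBEE.
example = """% typeA regulatory
% typeB non-regulatory
g1,g2,typeA,directed
g2,g3,typeB,undirected
"""
types, interactions = parse_network(example)
```

Running such a check before submission can catch interaction lines that use a type not declared in the header or a misspelled direction keyword.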

If you do not have a preferred network of your own, you can also generate a network using the STRING database.

Mutation Data

IAMBEE analyses sequencing data from populations that were adapted to the same focal endpoint. It combines the search for recurrently mutated sub-networks with information on the functional impact scores of the occurring mutations and, if available, with the relative frequency increase during a selective sweep. The latter information must be provided in the mutation file.

The mutation data should be in comma-delimited format. The file header is also comma-delimited but starts with the "#" symbol. There is only one header line, and its fields must be spelled exactly as shown in the example.

This file can be created by the user, or preprocessed from the VCF files provided by the user on the submission page. The specifications of the file are provided below.

File Header

The information for each mutation must be given in a fixed, predefined order.

The header describes the information contained in each of the columns of the data file. It contains the following information:

  • Position: the position of the mutation (mandatory)
  • Gene identifier: the identifier of the mutated gene, which must match the ID used in the network file (mandatory)
  • Reference: the reference allele (optional, but header mandatory)
  • Alternative: the alternative allele (optional, but header mandatory)
  • Freq increase: the frequency increase of the mutation in the population during a selective sweep (optional)
  • Functional score: the functional impact score of the mutation in the population (optional)
  • Condition: identifies the population in which the mutation occurs (mandatory)
  • Synonymous: annotation of whether the mutation is synonymous (optional)

The fields position, gene identifier, reference and alternative are mandatory in the file header. If the position or the reference and alternative alleles are not available, an ‘NA’ value can be used; the gene identifier, however, is always required. The position and the reference and alternative alleles serve annotation purposes only and are not used during calculations.
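A minimal Python sketch of how such a file could be read is given here, assuming only what is stated above: a single "#"-prefixed, comma-delimited header line followed by comma-delimited rows, with ‘NA’ marking unavailable values. The function name read_mutations and the column names in the sample are hypothetical placeholders; the real header fields must be spelled as in the official example.

```python
import csv
import io


def read_mutations(text: str):
    """Return (columns, rows); each row is a dict keyed by the header fields."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith("#"):
        raise ValueError("the first line must be the '#'-prefixed header")
    # Drop the leading '#' and split the header into column names.
    columns = [name.strip() for name in lines[0][1:].split(",")]
    rows = []
    for record in csv.reader(io.StringIO("\n".join(lines[1:]))):
        if len(record) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(record)}: {record}")
        # 'NA' marks an unavailable value; represent it as None.
        rows.append({
            column: (None if value.strip() == "NA" else value.strip())
            for column, value in zip(columns, record)
        })
    return columns, rows


# Placeholder header names, for illustration only.
sample = """#position,gene,ref,alt,condition
1204,g1,A,T,pop1
NA,g2,NA,NA,pop2
"""
columns, rows = read_mutations(sample)
```

Keeping ‘NA’ values as None rather than as strings makes it harder to mistake a missing position or allele for real data in later steps.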

  1. Freq increase and functional score are important fields but not required. If the information is not available, the corresponding columns and header fields should be omitted from the input file. Information on how to obtain these values can be found in the Relevance Scores tab of our documentation. The information is optional, but using it can help reduce the search space: it will reduce computational time and steer the solution towards a more biologically relevant one (especially if few independent samples are available). Frequency increases should be given in % (ranging from 0 to 100).
  2. An indication of whether each mutation is synonymous can be provided. If it is provided, synonymous mutations are ignored during analysis; in its absence, all mutations are considered. Omitting the synonymous mutations reduces the search space, decreases computational time and might result in more relevant solutions (especially if the number of independent populations is small).
  3. The mapping of the mutated positions to genes can be obtained from an annotation tool. Only mutations that are mapped to gene IDs (the same gene IDs as used in the network file) will be used in the network analysis; for example, a mutation in an intergenic region that was not allocated to a gene cannot be mapped to a node in the network and will automatically be discarded from the analysis. Note that if an INDEL encompasses multiple genes, the INDEL should be added multiple times to the input file, each time mapped to a different gene that can be associated with the INDEL. Incorporating large INDELs that cover multiple genes can increase the signal/noise in the dataset.
  4. See an example of how to annotate INDELS:
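As a rough illustration of points 1 and 3 (and not the official annotated example), the Python sketch below duplicates an INDEL once per overlapping gene that is present in the network, and checks that a frequency increase lies in the 0 to 100 range. The function names, the "gene" key and the sample identifiers are hypothetical; how the overlapping genes are determined is left to the annotation tool.

```python
def expand_to_genes(mutation: dict, overlapping_genes: list, network_genes: set) -> list:
    """Return one copy of the mutation per overlapping gene found in the network."""
    rows = []
    for gene in overlapping_genes:
        # Genes that are not nodes of the network cannot be used in the analysis.
        if gene in network_genes:
            row = dict(mutation)   # copy, so the input mutation is left untouched
            row["gene"] = gene
            rows.append(row)
    return rows


def check_freq_increase(value: float) -> float:
    """Frequency increases are expressed in %, so they must lie between 0 and 100."""
    if not 0 <= value <= 100:
        raise ValueError(f"frequency increase out of range: {value}")
    return value


# A hypothetical deletion spanning three annotated genes, two of which are in the network.
indel = {"position": "5000", "ref": "ACGT", "alt": "A", "condition": "pop1"}
rows = expand_to_genes(indel, ["g1", "g2", "g9"], {"g1", "g2", "g3"})
```

In this sketch the INDEL yields two rows, one for g1 and one for g2, while g9 is dropped because it is not a node of the network.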

File Body

The analysis of the raw reads (FastQ) can be performed according to user preferences. However, we suggest performing the following steps on your files.

  1. Source of Genome: NCBI or Ensembl can be used. Keep in mind that the gene IDs must match the network IDs. Additionally, adding functional scores (SIFT scores) to your variants requires SIFT4G, and creating the prediction database requires a GTF file with start/end codon annotations, which NCBI does not provide. Optionally, your own database can be created via the SIFT4G website.
  2. Variant Calling: Evolution experiments require all variants to be taken into account, including low-frequency variants that can settle during a sweep. The LoFreq variant caller was used to analyze the provided sample data and extract the information required for IAMBEE.
  3. Frequency Increase: As mentioned previously, SIFT4G was used to annotate the variants.
    Some technicalities:
    • To build the SIFT database, the Ensembl genome was used because the additional files required (GTF, peptide FASTA) have the correct format.
    • Compiling the SIFT scripts required GCC v4.9 or later; this avoids the following error: what(): regex_error.
    • For more information, see the SIFT4G website.

File Header

As in a normal VCF file, each line of the header starts with "##".

File Body

Unlike the file header lines, the body begins with a second header row that starts with a single "#", followed by the names of the columns in the VCF.

The addition of SIFT scores modifies the "INFO" field inside the VCF file, adding extra information about each variant.
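The layout described above ("##" header lines, one "#" column row, then tab-separated records with a semicolon-separated INFO field) can be sketched in Python as follows. This is a simplified reader for illustration, not the preprocessing used by IAMBEE; the function names are hypothetical, the sample record is made up, and no assumption is made about the exact key that SIFT4G adds to INFO.

```python
def parse_info(info: str) -> dict:
    """Split an INFO string such as 'DP=100;AF=0.25;INDEL' into a dict."""
    if info in ("", "."):
        return {}
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Entries without '=' are flags; store them as True.
        parsed[key] = value if sep else True
    return parsed


def parse_vcf(text: str):
    """Return (meta lines, column names, records) from VCF text."""
    meta, columns, records = [], [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("##"):
            meta.append(line[2:])             # file header lines
        elif line.startswith("#"):
            columns = line[1:].split("\t")    # single column header row
        else:
            if not columns:
                raise ValueError("data line found before the '#' column row")
            record = dict(zip(columns, line.split("\t")))
            record["INFO"] = parse_info(record.get("INFO", "."))
            records.append(record)
    return meta, columns, records


# Made-up record for illustration only.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t1204\t.\tA\tT\t50\tPASS\tDP=100;AF=0.25;INDEL\n"
)
meta, columns, records = parse_vcf(vcf_text)
```

Any extra annotation written into INFO appears as one more key in the parsed dictionary, so it can be inspected without changing the reader.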

Gene-Names Mapping File

We provide the possibility to upload a file in which the gene ids used in the network/mutation file (generally numeric values) can be mapped to a more interpretable nomenclature (e.g. common names).

The mapping file must be in comma-delimited format. The file header is also comma-delimited but starts with a "#" symbol.

File Header

The information for each gene must be defined using a from -> to structure. To distinguish the header from the other comma-separated rows, a "#" symbol must be added, without spaces, at the start of the header line. The header contains the following elements:

The from column must match the identifiers of the network file and mutation file.

File Body

Example of mapping file lines: each line starts with the gene id used in the previous files, followed by the new id to be used.

This file is optional; if you do not want to change the ids, skip this step.
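A minimal Python sketch of reading and applying such a mapping is shown here, assuming a "#"-prefixed header and two comma-separated columns per row. The function names, the header spelling "#from,to" and the sample ids are assumptions made for illustration.

```python
def read_mapping(text: str) -> dict:
    """Read 'from,to' rows into a dict, skipping the '#'-prefixed header."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # First field: id used in the network/mutation files; second: new name.
        old_id, new_id = [field.strip() for field in line.split(",", 1)]
        mapping[old_id] = new_id
    return mapping


def rename(gene_id: str, mapping: dict) -> str:
    """Return the mapped name, or the original id when no mapping exists."""
    return mapping.get(gene_id, gene_id)


# Hypothetical ids and names.
mapping = read_mapping("#from,to\n1001,geneA\n1002,geneB\n")
```

Falling back to the original id for unmapped genes keeps a partial mapping file usable.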