Required Data and File Formats

Interaction Network Data

Key to the use of IAMBEE is an interaction network. The interaction network is a compilation of the available interactomics data of the organism under research.

The interaction network file consists of two parts:

  • The definition of the interaction types (Header).
  • The list of interactions (Body).

File Header

The interaction types to be used are defined in the file header. Each definition line in the header must start with a "%", followed by the name of the interaction type (free text) and one of the fixed keywords "regulatory" or "non-regulatory", which indicates whether the interaction is regulatory. Each field must be separated from the next by a space. Below is an example in which different interaction types are defined.
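
As a minimal sketch of how such header lines might be read, the Python snippet below turns them into a mapping from interaction type to a regulatory flag. The type names ("protein-protein", "transcriptional") and the function name parse_network_header are illustrative assumptions and not part of IAMBEE. The sketch also assumes that the keyword is the last space-separated field on the line.

```python
def parse_network_header(lines):
    """Collect interaction type definitions from lines that start with '%'."""
    keywords = {"regulatory": True, "non-regulatory": False}
    types = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("%"):
            continue  # not a header line; body lines are handled elsewhere
        # Drop the '%' marker, then split off the final keyword field.
        parts = line[1:].strip().rsplit(" ", 1)
        if len(parts) != 2 or parts[1] not in keywords:
            raise ValueError(f"malformed interaction definition: {line!r}")
        name, keyword = parts
        types[name.strip()] = keywords[keyword]
    return types


# Hypothetical header lines, for illustration only.
example_header = [
    "% protein-protein non-regulatory",
    "% transcriptional regulatory",
]
interaction_types = parse_network_header(example_header)
print(interaction_types)
```

Splitting from the right keeps the keyword separate even if the marker and the name are written without a space between them.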

File Body

The main body of the interaction network file lists the interactions. Each interaction line contains the identifier of the start gene, the identifier of the end gene, the type of the interaction (which must correspond to one of the interaction types defined in the file header) and the direction of the interaction, given by the keyword "directed" or "undirected". All fields must be separated by a comma. Below is an example of some interaction lines.
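
The body format can be illustrated with a short Python sketch that parses one interaction line and checks it against the types declared in the header. The gene identifiers, type names and the Interaction record are hypothetical. Stripping whitespace around fields is an assumption made for convenience, not a documented IAMBEE behaviour.

```python
from typing import NamedTuple


class Interaction(NamedTuple):
    source: str
    target: str
    kind: str
    directed: bool


def parse_interaction(line, known_types):
    """Parse one comma-separated interaction line and validate its fields."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    source, target, kind, direction = fields
    if kind not in known_types:
        raise ValueError(f"interaction type not defined in header: {kind!r}")
    if direction not in ("directed", "undirected"):
        raise ValueError(f"unknown direction keyword: {direction!r}")
    return Interaction(source, target, kind, direction == "directed")


# Hypothetical gene identifiers and interaction types, for illustration only.
known_types = {"protein-protein", "transcriptional"}
example_body = [
    "1001,1002,protein-protein,undirected",
    "1003,1001,transcriptional,directed",
]
interactions = [parse_interaction(line, known_types) for line in example_body]
for interaction in interactions:
    print(interaction)
```

Checking each line against the declared types catches spelling mismatches between header and body before the file is submitted.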

Mutation Data

IAMBEE analyses sequencing data from populations that were adapted to the same focal endpoint. It combines the search for recurrently mutated subnetworks with information on the functional impact scores of the occurring mutations and, if available, with their relative frequency increase during a selective sweep. The latter information must be provided in the mutation file.

The mutation data should be in comma-delimited format. The file header is also comma-delimited but starts with the "#" symbol. There is only one header line, and the fields in the header must be spelled exactly as shown in the example.

This file can be created by the user or generated by preprocessing the VCF files that the user provides on the submission page. The specifications of the file are provided below.

File Header

The information of each mutation must be defined in a fixed predefined order.

The header describes the information contained in each of the columns of the data file. It contains the following information:

  • Position: the position of the mutation (mandatory)
  • Gene identifier: the identifier of the mutated gene (mandatory; this identifier should match the ID used in the network file)
  • Reference: the reference allele (mandatory)
  • Alternative: the alternative allele (mandatory)
  • Freq increase: the frequency increase of the mutation in the population during a selective sweep (optional)
  • Functional score: the functional impact score of the mutation in the population (optional)
  • Condition: identifies the population in which the mutation occurs (mandatory)
  • Synonymous: annotation of the mutation as synonymous or not (optional)

Freq increase and functional score are important fields but not required. Although both are optional, the results improve considerably when they are provided. The default value is set to true. If the information is not available, the columns and the corresponding fields in the header should be omitted from the input file.

The absence of these columns can be specified in the form. Note that the input file cannot have missing values. Information on how to obtain these additional values can be found in the Relevance Scores section of our documentation. An indication as to whether the mutations are synonymous or not can also be provided. If provided, synonymous mutations are ignored during analysis. In the absence of this information, all mutations are considered.
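
The rules above (a single "#" header line, comma-delimited fields, mandatory columns and no missing values) can be checked before submission with a small script. The following Python sketch is one possible approach. The column names are taken from the list above, but their exact required spelling is an assumption, as are the function name read_mutations and the sample values.

```python
import csv
import io

# Assumed column names; the exact spelling required by IAMBEE may differ.
MANDATORY = ["Position", "Gene identifier", "Reference", "Alternative", "Condition"]


def read_mutations(text):
    """Read a comma-delimited mutation file whose header line starts with '#'."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]  # skip blank lines
    if not rows or not rows[0][0].startswith("#"):
        raise ValueError("the first line must be a header starting with '#'")
    header = [name.strip() for name in rows[0]]
    header[0] = header[0][1:].strip()  # drop the leading '#'
    missing = [name for name in MANDATORY if name not in header]
    if missing:
        raise ValueError(f"missing mandatory columns: {missing}")
    records = []
    for number, row in enumerate(rows[1:], start=1):
        # Every data row needs one non-empty value per header field.
        if len(row) != len(header) or any(not value.strip() for value in row):
            raise ValueError(f"data row {number}: missing value")
        records.append(dict(zip(header, (value.strip() for value in row))))
    return records


# Hypothetical data without the optional "Freq increase" column.
example = (
    "#Position,Gene identifier,Reference,Alternative,Functional score,Condition\n"
    "10234,1001,A,G,0.02,pop1\n"
    "55871,1003,C,T,0.40,pop2\n"
)
mutations = read_mutations(example)
print(len(mutations), "mutations read")
```

In this sketch an optional column is either present for every row or left out of the file entirely, which mirrors the requirement that the input file contains no missing values.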

File Body

The analysis of the raw reads (FastQ) can be performed according to user preferences. However, we suggest performing the following steps on your files.

  1. Source of Genome: NCBI or Ensembl can be used. Keep in mind that the gene IDs must match the network IDs. Additionally, adding functional scores (SIFT scores) to your variants requires SIFT4G. Creating the prediction database requires a GTF file with start/stop codon annotations, which NCBI does not provide. Optionally, you can create your own database through the SIFT4G website.
  2. Variant Calling: Evolution experiments require all variants to be taken into account, including low-frequency variants that can become established during a sweep. The LoFreq variant caller was used to analyze the provided sample data and to extract the information required by IAMBEE.
  3. Frequency Increase: As mentioned previously, SIFT4G was used to annotate the variants.
    Some technicalities:
    • The Ensembl genome was used to build the SIFT database, because the additional files required (GTF, peptide FASTA) have the correct format.
    • Compiling the SIFT scripts requires GCC v4.9 or later; this avoids the following error: what(): regex_error.
    • For more information, see the SIFT4G website.

File Header

As in a normal VCF file, each line of the header starts with "##".

File Body

Unlike the file header lines, the body begins with a second header row that starts with a single "#", followed by the names of the columns in the VCF.

The addition of SIFT scores modifies the "INFO" field inside the VCF file, adding extra information about each variant.
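
To make the VCF layout concrete, the Python sketch below separates the "##" header lines from the "#" column row and the variant records, and splits an INFO field into key/value pairs. The function names, the chromosome name and the values are hypothetical. DP and AF are standard VCF INFO keys used here only as placeholders, and the keys actually written by SIFT4G are not shown.

```python
def read_vcf(text):
    """Split VCF text into meta lines ('##'), column names ('#') and records."""
    meta, columns, records = [], [], []
    for line in text.splitlines():
        if line.startswith("##"):
            meta.append(line)
        elif line.startswith("#"):
            columns = line[1:].split("\t")
        elif line.strip():
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, records


def parse_info(info):
    """Turn 'KEY=value;FLAG' into a dict; flags without '=' map to True."""
    result = {}
    for entry in info.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        result[key] = value if sep else True
    return result


# Hypothetical single-variant VCF, for illustration only.
example_vcf = (
    "##fileformat=VCFv4.2\n"
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw Depth">\n'
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t10234\t.\tA\tG\t87\tPASS\tDP=120;AF=0.15\n"
)
meta, columns, records = read_vcf(example_vcf)
info = parse_info(records[0]["INFO"])
print(columns)
print(info)
```

Any annotation appended to INFO by a downstream tool would appear as an additional key in the resulting dictionary.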

Gene-Names Mapping File

We provide the possibility to upload a file in which the gene IDs used in the network and mutation files (generally numeric values) are mapped to a more interpretable nomenclature (e.g. common names).

The mapping file must be comma-delimited. The file header is also comma-delimited but starts with a "#" symbol.

File Header

The information for each gene must be defined using a from -> to structure. To distinguish the header from the other comma-separated rows, a "#" symbol must be added, without spaces, at the start of the header line. The header contains the following elements:

The from column must match the identifiers of the network file and mutation file.

File Body

Each line of the mapping file starts with the gene ID used in the previous files, followed by the new ID to be used.

This file is optional; if you do not want to change the IDs, skip this step.
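
As a rough illustration of the mapping file, the Python sketch below reads the from/to pairs and applies them to a list of gene IDs. The header text "#from,to", the gene IDs, the gene names and the function name read_gene_mapping are assumptions made for this example only.

```python
import csv
import io


def read_gene_mapping(text):
    """Read 'old ID,new name' pairs, skipping the header line that starts with '#'."""
    mapping = {}
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 2:
            raise ValueError(f"expected 2 fields, got {len(row)}: {row!r}")
        old_id, new_name = (value.strip() for value in row)
        mapping[old_id] = new_name
    return mapping


# Hypothetical mapping from numeric IDs to readable names.
example_mapping = "#from,to\n1001,geneA\n1003,geneB\n"
mapping = read_gene_mapping(example_mapping)

# IDs without an entry in the mapping keep their original identifier.
ids = ["1001", "1002", "1003"]
renamed = [mapping.get(gene_id, gene_id) for gene_id in ids]
print(renamed)
```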