Extended Help File

This file contains details on how the input data are internally processed by IAMBEE. Full details can be found in the following paper:

Network-Based Identification of Adaptive Pathways in Evolved Ethanol-Tolerant Bacterial Populations.
Toon Swings, Bram Weytjens, Thomas Schalck, Camille Bonte, Natalie Verstraeten, Jan Michiels, and Kathleen Marchal
Mol Biol Evol. 2017 Nov 1;34(11):2927-2943. doi: 10.1093/molbev/msx228.

Calculating Relevance Scores

Below is described how IAMBEE uses internally the frequency increase and functional impact scores to derive the relevance scores

The relevance score of a mutation consists of two components, a contribution of the frequency increase and of the functional impact of the mutation.

$$relevance(S,n) = (1 - eCDF_{fun}(Functional Score(S, n) * (eCDN_{freq.n}(Frequency Increase(S, n)$$

Frequency Increase of the mutation during selection sweep

The frequency increase of a mutation is equal to the difference of its frequency in the population just after and just before the sweep. The contribution of the frequency increase to the relevance score is a value that expresses the relative importance of a mutation’s frequency in a population. It is derived by first estimating the distribution of the frequency increase for all mutations in the evolved population using a nonparametric cumulative distribution. Hereto only mutations that increase in frequency during a selective sweep are considered. Hence prior to estimating the distribution, all mutations with a decrease in frequency of at least 2% are discarded.

The contribution of the frequency increase to the relevance score of a mutation is derived from this non parametric cumulative distribution of the frequency increase as follows.

$$(eCDN_{freq,n} (Frequency Increase (S, n))$$

With $$eCDF…$$ being the value of the cumulative distribution function in population n of the frequency increases for the mutation in gene S. If the same mutation occurs in multiple populations, the $$eCDF…$$ is derived from the population in which the mutation underwent the largest frequency increase.

Functional-impact Score of the Mutations

Analogously to the frequency increase of a mutation during a sweep also the functional impact score contributes to the mutation’s relevance score. The relevance score contribution of the functional impact score is derived from the cumulative distribution of the functional impact scores of all mutations, irrespective of the evolved population.

$$1 - eCDF_{fun}(Functional Score (S,n)$$

The value of the cumulative distribution function of the functional impact scores for the mutation in gene S in population n with the most deleterious functional impact score (note the $$1-eCDF_{fun}$$ as we used SIFT scores and a low SIFT score corresponds to a high functional impact).

Combining the contribution of the frequency increase and functional impact score results in the of a mutation S in population n $$relevance(S,n)=(1−eCDF_{fun}(Functional Score(S,n)∗(eCDN_{freq.n}(FrequencyIncrease(S,n)$$

The relevance score is a value between 0 (gene S is unlikely to be relevant towards adaptation in population n) and 1 (gene S is very likely to be relevant towards adaptation in population n). Genes without mutations are assigned the mean functional impact score and frequency increase when calculating their relevance.

Correction for mutation rate

Populations with a mutation rate that is significantly higher than the mutation rates of the other populations can bias the search and the results. Therefore IAMBEE allows reducing the impact of the population’s mutation rate on the search. Hereto a correction factor is calculated for each population. The user can choose whether or not to use this correction.

If this option is switched on IAMBEE first identifies populations with significantly higher mutation rates using the modified Z-score for outlier detection (see Swings et al). From this modified Z-score a population specific correction factor n is calculated. Due to the modified Z-score, the correction factor intrinsically assigns a relatively lower value to outlier populations if a larger number of independent populations are available, hereby largely reducing the effects of populations with high mutation rates to reduce noise when a large number of independent populations is present. When only a limited number of independent populations is available, the correction factor will be relatively higher as in that case also the populations with larger mutation rates are needed to exploit parallelism (as so few populations are available).

The relevance score and the correction factor are integrated into a single score for every mutated gene in every population. This is implemented as follows:

$$correctedRelevance(S,n) = relevance(S,n) * correction(n).$$

So mutations in a population with high mutation rate will have a lower impact on the path-finding.

Subnetwork Inference

Topology Weighed Interaction Network

A topology based weighting of the interaction network is used to reduce the effect of hubs in the subsequent analysis steps i.e. to down weight interactions originating from large hubs while avoiding the penalization of nodes with low out-degrees.

Hereto a power law distribution is estimated based on the out-degrees of the nodes in the interaction network. Using this distribution the out-degree that corresponds to the 90th percentile is used as inflection point in a sigmoidal function that determines the edge weight between two adjacent nodes i and j in the interaction network.

$$weight_{i,j} = {1 \over 1 + e^{({outDegree(i)-inflectionPoint \over {inflectionPoint \over 6}})}}$$

Pathfinding Step

IAMBEE proceeds in two steps: a pathfinding step and an optimization step.

All genes with at least one mutation in any independent population are mapped on the topology-weighted interaction network. One network model exists for the complete cohort of populations. Subsequently, all possible paths originating from a mutated gene in a population and ending in any other gene which is mutated in another population, are enumerated. A path is defined as a series of consecutive edges in the interaction network. Paths between mutated genes in the same populations are excluded, reasoning that because of the clonality a single mutation in a pathway will confer most of its fitness advantage.

Each path is assigned a probability which reflects the degree of belief that the path is associated with the adaptive phenotype under study. This probability takes into account the weights of the edges which make up the path. These edge weights are calculated based on the network topology and the relevance scores from both the start gene and the terminal gene of the path (see above). For each path its probability is calculated.

As enumerating all possible paths is computationally expensive, the following heuristics are used:

1. The maximum path length is set to 4 (hard coded).
2. From all possible paths originating from a mutated gene in a specific population, only the N paths with highest probabilities are retained (N best paths parameter). Increasing the number of best paths allows for a more accurate estimation of the probability of a path but is computationally more expensive (takes longer).

Optimization step

The final step of the analysis is the inference of a subnetwork containing the molecular pathways responsible for the adaptive phenotype. This subnetwork consists of a subset of the paths selected in the previous step. This subset of paths is obtained by optimizing the following function:

$$S(K) = \sum_{n \epsilon R} ( \sum_{S \epsilon Q_n} (P(path(muts_{S,n},mut_{all}) | probabilities ))) - | K | * x_e$$

Where $$S(K)$$ is the score of the selected subnetwork and needs to be maximized, $$K_j$$ is the number of edges selected, $$x_e$$ is the imposed cost for each edge, R is the collection of populations used in the experiment, $$Q_n$$ is the collection of mutated genes from population n and $$P(path(mutS,n, mutall)j_{probabilities})$$ is the probability that there exists a path relevant to the observed phenotype between a mutated gene $$S$$ in population $$n$$ and any other mutated gene in any other population, given the degrees of belief (probabilities) of all found paths in the pathfinding step.

The algorithm searches for the collection of high probable paths (subnetwork) in the weighed interaction network that connect as many as possible mutations occurring in the different populations with the least number of edges. The latter is imposed by the cost term e. By imposing this cost term the algorithm is forced to select paths with overlapping edges and hence focuses on regions in the interaction network that are enriched in mutations. The value $$x_e$$ determines the size of the network. If set too high (stringent) the selected subnetwork will be small and multiple causal molecular pathways are likely missed. Conversely if $$x_e$$ is set too low the subnetwork will grow too large and start including many false positives (passengers rather than drivers). The user can choose to run the algorithm for one specific default edge cost or perform a parameter sweep over the xe parameter. In the later case a subnetwork will be calculated for each cost parameter and results will be merged to derive a single network visualization (see below).

To solve the optimization, a greedy hill-climbing heuristic is used in which the previously found paths get sampled pseudo-randomly based on the overlap the paths have with each other: Overlapping paths are more likely to be sampled together. As this procedure is stochastic in nature, the procedure is repeated multiple times (number of repeats parameter) and the best solution with respect to the optimization score $$S(K)$$ is used as the solution. Increasing the number of repeats will contribute to the accuracy of the results (allows the algorithm to better approximate the true optimal) but this comes at the expense of a higher computational cost.

Visualization

If a selection sweep is performed over the cost parameter, a single network will be drawn that is the merge of the results obtained at each cost parameter explored during the sweep. Hereto a single network is drawn that is the union of all recovered subnetworks. In general edges that are recovered at a stringent edges cost will also be found at less stringent edges costs. These edges are most supported by the data and hence most reliable. This property is reflected by assigning a weight to the edges based on the maximum edge cost for which they are still included in an optimal subnetwork. Edges with a higher weight are recovered at the more stringent cost parameters and hence more reliable.