Extraction of over-represented alleles in Bulk Segregant Analysis

How to use this tool?

Bulk segregant analysis (BSA) coupled to high throughput sequencing is a powerful method to map genomic regions related with phenotypes of interest. It relies on crossing two parents, one inferior and one superior for a trait of interest. Segregants displaying the trait of the superior parent are pooled, the DNA extracted and sequenced. Genomic regions linked to the trait of interest are identified by searching the pool for overrepresented alleles that normally originate from the superior parent. BSA data analysis is non-trivial due to sequencing, alignment and screening errors.

Create an account

First, you can create your acount or login using the buttons on the top right corner:

If you do not want to create an account, don't worry, it is not required. But you will need to store the results URL to be able to access them again.

Create a new Experiment

Step 1: Experiment Information

After cliking the "New Experiment" button on the main page, a form will be displayed to add the information of the BSA experiment. A name, a description, the recombination rate of the organism and the number of segregants:

  • Recombination Rate: The known or expected rate of recombination in the chromosomes (e.g. 0.0000035 for Yeast)
  • No. of Segregants: The number of segregants sequenced on the Pool (e.g. 30 for the test data)

After adding the experiment click the "Next" button to continue to the next step.

Step 2: Upload count data

Variant Call Format (vcf):

You can upload a standard .vcf file and enter the name of the sample that want to analysed. If the name is empty, the first sample will be analysed.

A file like this can be generated with NGSEP or GATK. A tutorial for NGSEP can be found in this link.

You can see an example downloading this file.

In the test .vcf there is only one sample (i.e. Pool1), but it could contain information about more samples (e.g. parent genotyping or other pools data). If your have a .vcf file with information from multiple experiments, this fields becomes useful.

Tab Delimited Format:

You can upload a tab delimited file with the following format:

Chromosome    position    totalCount    altCount

You can see an example downloading this file. We normally use alt counts based on the assumption that that the reference used for the DNA mapping is the inferior parent. Therefore, alt counts are the supperior parent counts at the marker sites.

You can easily generate this file using the Gatk VariantsToTable tool

java -jar GenomeAnalysisTK.jar -R reference.fasta -T VariantsToTable -V file.vcf -F CHROM -F POS -GF DP -GF AD -o results.counts

Step 3: Select EXPLoRA parameters

EXPLoRA is a Hidden Markov Model (HMM) that models the effects of linkage disequilibrium to explain the dependencies between neighboring variant frequencies. EXPLoRA models the relationship between a genomic variant and the phenotype of interest as a hidden state and use beta-binomial distributions to calculate emission probabilities.

A: marker sites are modeled to be in a neutral state (N-state, blue circles) or in a state of being linked to the phenotype of interest (P-state, orange circles) based on its relative variant frequency in the pool of segregants. B: emission probabilities for the neutral (blue curve) and the phenotype-linked states (orange line) as a function of the variant frequencies, modeled by a beta-binomial distribution with parameters α and β. C: transition probability as a function of the physical distance between neighboring marker sites.

In this step you need to select different values to the parameters of the beta-distributions that model the emission probabilities. As default parameters we use three different values:
α= 5, β=1 is meant to have high sensitivity.
α=10, β=1 is meant to be a middle ground between sensitivity and specificity.
α=30, β=1 is meant to have high specificty.

The ratio between alfa and beta: You can run the tool for different ratio's. The higher alfa becomes, the more stringent the search i.e. the more the effect of the QTL needs to be (the more overrepresented the number of superior allele counts versus the total number of allele counts at a marker site. Regions that will be detected with the most stringent setting will also be detected with less stringent settings. The more stringent the setting at which a region can still be detected, the more reliably the QTL association is. Note also that increasing the stringency of the setting allows to decrease the region that associates: because of LD, the number of markers that correlate with the truly associating marker site decreases with the distance from the truly associating site. The more stringent the setting the less 'correlating markers' still meet the criteria to be selected.

Additional parameters can be added. The parameters can be easily modified by moving the horizontal sliding bars.

Changing the parameters allow to see how they relate to the background frequency distribution (shown in Black)

You should select parameters so that the mass of the distribution representing the neutral emission probability is leftsided from the mass represented by the distribution describing the emission probabilities in the phenotype linked sites. The smaller the intersection between the neutral and the phenotype linked emission probability distributions, the more stringent the selection criteria. Check these simulations results (here and here) for more information about the performance of the methods with different parameters and scenarios.

Undesrtanding the parameters

EXPLoRA models for each marker site, two possible states: one state (P-state) expresses that the variants in the pool at that marker site originate predominantly (but not always in all segregants) from the superior parent and are thus linked to the phenotype of interest. A second state (N-state) models that the variants in the pool at a given marker site originate to an equal extent from either parent, in which case the marker site is assumed to be located in a neutral region not linked to the phenotype of interest.

The effect of linkage disequilibrium is modeled by the transition probabilities t between two neighboring marker sites. The transition probability t models the chance that a neighboring site remains in the same state as its preceding site state. Its distribution is described by a negative exponential model as a function of the recombination rate r and the physical distance between neighboring marker sites. The probability to change states upon transition from one marker site to its direct neighboring marker site (from a neutral N-state to a phenotype-linked P-state or vice versa) is then described by 1-t and takes into account the true distance between them (i.e. no distance binning is involved). The model captures the fact that marker sites located in each other's physical neighborhood are likely to be in linkage disequilibrium and less likely to change their state (from P to N or from N to P). Optimally the parameter r should approximate the true recombination rate in the organism of interest. When r is overestimated marker phenotype associations are treated indepdently and the effect of exploiting LD to increase the power of the analysis is lost.

Each state in the model emits a random variable nA, corresponding to the number of variant counts at a given marker site originating from the superior parent. nA ranges from 0 to n, with n being equal to the (known) total variant count for the marker site. nA is described by a beta binomial distribution, which allows capturing different emission probabilities in phenotype linked versus neutral states by choosing different α and β parameters for their corresponding distributions. We modeled all neutral states with the parameters αN and βN, and all phenotype-linked states with the parameters αP and βP. While for the neutral states αN should almost equal βN to make values of nA closer to n/2 more likely to be sampled, for the phenotype-linked states αP should be much larger than βP to make values of nA close to n more likely to be sampled. In the application the parameters αN and βN of the emission probabilities in the neutral state can be estimated directly from the observed variant frequencies as described in Pulido et al. The ratio between αP and βP defines the degree to which the relative variant frequency at a marker site needs to differ from the one obtained through random inheritance for it to be called linked to the phenotype (stringency of the method). Changing the ratio affects the probability with which an observed relative variant frequency is interpreted by the model as a phenotype linked region. In the application the ratio between αP and βP can be altered by fixing βP equal to 1 and testing different values of αP.

After the parameters have been selected, submit the experiment for processing.

Results

The results of EXPLoRA is shown as a posterior probability of being a phenotype-linked marked. The results are presented for each set of parameters (α β ratio) selected in Step 3 in a different color.

Notice that for parameter settings too stringent the results will be a posterior probability of zero in all marker sites. If this is happening in your sample, please use the Edit Experiment button where new sets of more relaxed parameters can be selected.

The figure show in the X-axis the position in the chrmosome and in the Y-axis the posterior probability.

The results can be downloaded in two different formats:

  • Marker site information: as a coma separated (.csv) file listing the posterior distributions per marker and parameter setting
  • Detected regions: a BED file showing the regions linked to a phenotype detected. The BED file can easily be used to further analysis by importing it into other tools like the Integrative Genomic Viewer

Running EXPLoRA standalone

Download EXPLoRA file and execute the following comand java -jar EXPLoRA_1.1.0.jar

This will open a GUI that will allow you to analyse a .vcf file with multiple samples and obtain the posterior probabilities for each marker site.