Ensemble biclustering approach

Introduction

Query-based biclustering techniques allow interrogating a gene expression compendium with a given gene or gene list. They do so by searching the compendium for genes whose profile is close to the average expression profile of the genes in the query-list. Because it often cannot be guaranteed that all genes in a long query-list are mutually coexpressed, it is advisable to use each gene separately as a query. This approach, however, leaves the user with the tedious post-processing of partially redundant biclustering results. The redundancy problem is compounded by the fact that, for each query-gene, multiple parameter settings need to be tested in order to detect the ‘optimal bicluster size’. To aid with this post-processing, we introduce here an ensemble approach to be used in combination with query-based biclustering. The method takes as input the outcome of a query-based biclustering algorithm applied to each gene of the query-list separately. The output consists of non-redundant consensus biclusters that maximally reflect the information contained within the original query-based biclustering results.

Glossary

Query-gene: a gene of interest to a certain researcher, for instance a gene that was identified in a certain experiment.
Query-based biclustering: query-based biclustering algorithms search for genes that are coexpressed in a condition-dependent way with a predefined set of genes (the query-genes). The output consists of the coexpressed genes (including the query) and the selected conditions under which they exhibited coexpression.
Resolution sweep: as it is often not known in advance what level of coexpression results in the optimal biclustering solutions from a biological viewpoint, several query-based biclustering approaches include a resolution sweep. In this approach a resolution parameter controls the bicluster size (or tightness of gene coexpression), and multiple biclustering solutions are output for a whole range of values of this parameter (i.e. a sweep over parameter values is performed). The optimal biclustering solution can be selected a posteriori.
Gene/condition scores: gene and condition scores are calculated by query-based biclustering methods and represent the strength of association of a gene/condition to a certain bicluster. These scores can reflect the probability of a gene/condition to belong to the bicluster [1] or represent the correlation amongst the genes within the bicluster [2]. Alternatively, a query-based biclustering method might just output a list of genes and conditions [3]; in this case the gene and condition scores are binary (‘1’ if the gene/condition belongs to the bicluster and ‘0’ otherwise).

How to get the software?

The software can be downloaded as a compiled MATLAB package. This package runs on Unix systems and does not require MATLAB to be installed on your system. Running the software does, however, require the MATLAB Compiler Runtime (MCR) to be installed. In case you already have the MCR (preferably version 7.7) installed on your computer, download and unpack this zip-file (130 KB). Alternatively, download and unpack this zip-file (210 MB), which includes a binary for MCR installation.
Then perform the following steps:

Unzip the file in your directory of choice
Install the MCR into a separate directory (for instance mcr_dir) by executing the following command: ./MCRInstaller.bin. This will open a GUI for installation of the MCR. Follow the steps in the MCR Installer wizard. Note that this step is only required if you do not already have the MCR installed on your computer.
Run the ensemble-code: ./ensembleQDB args. Here args designates the input arguments for the software (see below for a description).

How to use the software?

Step 1
Compile a list of query-genes. Save this list of query-genes in the ‘QueryFile’. This file contains one column with the locus tags of the query-genes. See here for an example of such a query-file for E. coli.
Step 2
Apply query-based biclustering to each query-gene from the ‘QueryFile’ separately. You can choose to use a resolution sweep approach or not. Examples of query-based biclustering algorithms are: query-driven biclustering (QDB) [1], the Signature Algorithm [2] and ProBic [3].
Step 3
Convert the output of query-based biclustering to a format that can be read in by the ensemble software. Because the output of query-based biclustering consists of biclusters, two different files are constructed for each query-gene: one that contains the gene scores (‘geneScoreFile’) and one that contains the condition scores (‘conditionScoreFile’) (see Glossary).
The ‘geneScoreFile’ contains in the first row all the gene locus tags present in the expression compendium used. The remainder of the file consists of either a matrix (in case of a resolution sweep) or a single row vector (no resolution sweep), with each row representing the gene scores obtained for a certain value of the resolution parameter. In case your query-based biclustering algorithm just outputs a list of genes and no gene scores, this matrix is binary, with ‘1’ indicating that the gene belonged to the bicluster and ‘0’ indicating that it did not. An example of this ‘geneScoreFile’ can be obtained here.
The ‘conditionScoreFile’ is constructed in a similar way. The first row of the ‘conditionScoreFile’ contains the labels of all conditions in the compendium. Again, this row is followed by a matrix that contains the condition scores. If no resolution sweep was used, this matrix is replaced by a single row vector containing the output of the query-based biclustering run. An example of this ‘conditionScoreFile’ can be obtained here. Both gene and condition scores must be between 0 and 1.
Naming conventions for these files are the following: ‘[query-gene]_geneS.txt’ for the geneScoreFiles and ‘[query-gene]_condS.txt’ for the conditionScoreFiles. Here [query-gene] refers to the respective query-genes saved in the ‘QueryFile’. All gene score and condition score files should be saved into the same folder, further referred to as ‘qdbResultDir’.
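As an illustration of the file layout described in this step, the following Python sketch writes the two score files for one query-gene. It is not part of the ensemble software. The tab separator, the function names and the locus tags in the usage note are assumptions made for the example; the description above does not state which delimiter the software expects, so compare against the example files before relying on it.

```python
from pathlib import Path


def format_score_table(labels, score_rows, sep="\t"):
    """Return the text of a score file: a header row of labels,
    followed by one row of scores per resolution value."""
    # ASSUMPTION: tab-separated columns; the expected delimiter is not documented.
    lines = [sep.join(labels)]
    for row in score_rows:
        if len(row) != len(labels):
            raise ValueError("each score row needs exactly one score per label")
        if any(not 0.0 <= score <= 1.0 for score in row):
            raise ValueError("scores must lie between 0 and 1")
        lines.append(sep.join(f"{score:g}" for score in row))
    return "\n".join(lines) + "\n"


def write_query_results(result_dir, query_gene, genes, gene_rows,
                        conditions, cond_rows):
    """Write [query-gene]_geneS.txt and [query-gene]_condS.txt into result_dir."""
    out = Path(result_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{query_gene}_geneS.txt").write_text(
        format_score_table(genes, gene_rows))
    (out / f"{query_gene}_condS.txt").write_text(
        format_score_table(conditions, cond_rows))
```

Without a resolution sweep, gene_rows and cond_rows each hold a single row; with a sweep they hold one row per value of the resolution parameter. For a method that only returns gene and condition lists, the rows contain 1 for members and 0 for all other labels.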
Step 4
Run the ensemble software. The software itself runs through the unix commandline in the following manner:
./ensembleQDB queryFile qdbResultDir 'resolution' F 'scoreThr' 0 'qualityCheck' T 'filter' none

Required input arguments:

queryFile: path and name of the ‘queryFile’
qdbResultDir: contains the path towards the folder with the ‘geneScoreFiles’ and ‘conditionScoreFiles’

Optional input arguments:

‘resolution’: has as value either F (false) (DEFAULT) or T (true). F denotes that the query-based biclustering method used does not use a resolution sweep; T denotes that it does.
‘scoreThr’: has a value between 0 (DEFAULT) and 1. This variable represents the threshold that the gene and condition scores need to exceed for the genes/conditions to belong to the bicluster. This is especially relevant for non-binary gene and condition scores. In case of binary scores, the default setting for this parameter is recommended.
‘filter’: has as value either 'none' (DEFAULT), 'disparity' or 'TOM'. Refers to a possible transformation of the consensus matrix before graph clustering is applied. With 'none', no transformation is applied and the plain consensus matrix is used for graph clustering. With 'TOM', the topological overlap matrix is used instead. With 'disparity', insignificant consensus scores are set to zero using the disparity filter.
‘qualityCheck’: has as value either F (false) or T (true) (DEFAULT). If the value for this parameter is T (true), the qualities of the individual query-based biclustering outcomes are verified before applying the ensemble approach. The quality check entails checking whether biclustering outcomes contain more genes than just the query-gene and whether biclustering outcomes contain a minimal number of conditions (fixed to 10). Biclustering solutions not fulfilling any of these 2 criteria are ignored.
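
The effect of ‘scoreThr’ and ‘qualityCheck’ on a single biclustering outcome can be pictured with the following Python sketch. It only mirrors the description above and is not the code of the ensemble software. The function names are hypothetical, 'a minimal number of conditions' is read here as 'at least 10', and an outcome is assumed to be discarded as soon as it fails one of the two criteria.

```python
MIN_CONDITIONS = 10  # the description states that this minimum is fixed to 10


def select_members(labels, scores, score_thr=0.0):
    """Keep the labels whose score strictly exceeds the threshold."""
    return [label for label, score in zip(labels, scores) if score > score_thr]


def passes_quality_check(query_gene, genes, conditions):
    """ASSUMED reading of the quality check: the bicluster needs at least
    one gene besides the query-gene and at least MIN_CONDITIONS conditions."""
    other_genes = [gene for gene in genes if gene != query_gene]
    return len(other_genes) > 0 and len(conditions) >= MIN_CONDITIONS
```

With the default threshold of 0, binary scores behave as expected under this sketch: a score of 1 exceeds the threshold and a score of 0 does not, which is consistent with the recommendation to keep the default for binary scores.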

Output: the output of the ensemble approach consists of two files. ‘consensusBicluster_genes.txt’ is a tab-delimited two-column file in which the first column contains the number of the consensus bicluster and the second column contains the locus tag of a gene belonging to that bicluster. A similar file, ‘consensusBicluster_conds.txt’, is constructed for the condition content of the consensus biclusters. Both files are saved into the ‘qdbResultDir’ directory.
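
For downstream analysis, the two output files can be read back into a mapping from bicluster number to members. The Python sketch below is one possible way to do this, based only on the two-column tab-delimited layout described above. The function names are hypothetical, and it is assumed that the files have no header row and that the bicluster numbers are integers.

```python
from collections import defaultdict


def parse_consensus_lines(lines):
    """Group the second column (gene or condition) by the bicluster
    number found in the first column."""
    biclusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # ASSUMPTION: no header row, integer bicluster numbers.
        number, member = line.split("\t")[:2]
        biclusters[int(number)].append(member)
    return dict(biclusters)


def read_consensus_file(path):
    """Read consensusBicluster_genes.txt or consensusBicluster_conds.txt."""
    with open(path) as handle:
        return parse_consensus_lines(handle)
```

Calling read_consensus_file on both files yields two dictionaries with matching keys, so the genes and conditions of a given consensus bicluster can be looked up by its number.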

Examples: here we suppose that the ‘queryFile’ is saved in the following location: /home/usr/ensemble/query/queryFile.txt and that the path to the ‘qdbResultDir’ is /home/usr/ensemble/qdbResultDir/

Running the ensemble software with default settings:
./ensembleQDB /home/usr/ensemble/query/queryFile.txt /home/usr/ensemble/qdbResultDir
Running the ensemble software on query-based biclustering output with a resolution sweep approach:
./ensembleQDB /home/usr/ensemble/query/queryFile.txt /home/usr/ensemble/qdbResultDir 'resolution' T
Running the ensemble software on query-based biclustering output with a resolution sweep, only considering gene and condition scores exceeding 0.5:
./ensembleQDB /home/usr/ensemble/query/queryFile.txt /home/usr/ensemble/qdbResultDir 'resolution' T 'scoreThr' 0.5
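
When the software has to be run for many parameter settings, the command lines above can be assembled from a script. The following Python sketch is one way to do so and is not shipped with the software. The function names are hypothetical, and it assumes that the quotes around the option names are ordinary shell quoting, so that the program receives the bare words (resolution, scoreThr, qualityCheck, filter).

```python
import subprocess


def build_command(query_file, result_dir, resolution=None, score_thr=None,
                  quality_check=None, filter_name=None,
                  executable="./ensembleQDB"):
    """Assemble the argument list; options left as None are omitted,
    so the documented defaults apply."""
    cmd = [executable, query_file, result_dir]
    # ASSUMPTION: option names are passed as bare words, without quote characters.
    if resolution is not None:
        cmd += ["resolution", "T" if resolution else "F"]
    if score_thr is not None:
        if not 0 <= score_thr <= 1:
            raise ValueError("scoreThr must lie between 0 and 1")
        cmd += ["scoreThr", str(score_thr)]
    if quality_check is not None:
        cmd += ["qualityCheck", "T" if quality_check else "F"]
    if filter_name is not None:
        if filter_name not in ("none", "disparity", "TOM"):
            raise ValueError("filter must be 'none', 'disparity' or 'TOM'")
        cmd += ["filter", filter_name]
    return cmd


def run_ensemble(*args, **kwargs):
    """Run the compiled program; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_command(*args, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting altogether. For example, build_command with resolution=True and score_thr=0.5 reproduces the arguments of the last example above.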

References

[1] Dhollander,T. et al. (2007) Query-driven module discovery in microarray data. Bioinformatics, 23, 2573-2580.

[2] Ihmels,J. et al. (2002) Revealing modular organization in the yeast transcriptional network. Nat. Genet., 31, 370-377.

[3] Zhao,H. et al. (2011) Query-based biclustering of gene expression data using Probabilistic Relational Models. BMC Bioinformatics, 12, S37.

Citation

When using this software for your own work please cite:
De Smet, R. and Marchal, K. (2011). An ensemble biclustering approach for querying gene expression compendia with experimental lists. Bioinformatics (revised version submitted).

Contact

Kathleen Marchal