Introduction

Query-based biclustering techniques allow a gene expression compendium to be interrogated with a given gene or gene list. They do so by searching the compendium for genes whose profile is close to the average expression profile of the genes in the query-list. Because it often cannot be guaranteed that the genes in a long query-list are all mutually coexpressed, it is advisable to use each gene separately as a query. This approach, however, leaves the user with the tedious post-processing of partially redundant biclustering results. The redundancy is compounded by the fact that, for each query-gene, multiple parameter settings need to be tested in order to find the optimal bicluster size. To aid with this post-processing, we introduce here an ensemble approach to be used in combination with query-based biclustering. The method takes as input the outcome of a query-based biclustering algorithm applied to each gene of the query-list separately. The output consists of non-redundant consensus biclusters that maximally reflect the information contained within the original query-based biclustering results.

Glossary

Query-gene: a gene of interest to a certain researcher, for instance a gene that was identified in a certain experiment.
Query-based biclustering: query-based biclustering algorithms search for genes that are coexpressed in a condition-dependent way with a predefined set of genes (the query-genes). The output consists of the coexpressed genes (including the query) and the selected conditions under which they exhibited coexpression.
Resolution sweep: as it is often not known in advance what level of coexpression yields the best biclustering solution from a biological viewpoint, several query-based biclustering approaches include a resolution sweep. In this approach a resolution parameter controls the bicluster size (or tightness of gene coexpression), and one biclustering solution is output for each value in a range of values for this parameter (i.e. a sweep over parameter values is performed). The best biclustering solution can then be selected a posteriori.
Gene/condition scores: gene and condition scores are calculated by query-based biclustering methods and represent the strength of association of a gene/condition with a certain bicluster. These scores can reflect the probability that a gene/condition belongs to the bicluster [1] or represent the correlation amongst the genes within the bicluster [2]. Alternatively, a query-based biclustering method might just output a list of genes and conditions [3]; in this case the gene and condition scores are binary (‘1’ if the gene/condition belongs to the bicluster and ‘0’ otherwise).

How to get the software?

The software can be downloaded as a compiled MATLAB package. This package runs on Unix systems and does not require MATLAB to be installed on your system. Running the software does, however, require the MATLAB Compiler Runtime (MCR) to be installed. In case you already have the MATLAB Compiler Runtime (preferably version 7.7) installed on your computer, download and unpack this zip-file (130 KB). Alternatively, download and unpack this zip-file (210 MB), which includes a binary for MCR installation.
Then perform the following steps:

How to use the software?

Step 1
Compile a list of query-genes. Save this list of query-genes in the ‘QueryFile’. This file contains one column with the locus tags of the query-genes. See here for an example of such a query-file for E. coli.
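As a minimal sketch of Step 1, the following Python snippet writes and reads a single-column query-file. The helper names and the example locus tags are hypothetical and only illustrate the one-tag-per-line layout described above.

```python
from pathlib import Path


def write_query_file(path, locus_tags):
    """Write one locus tag per line (single-column QueryFile)."""
    tags = [t.strip() for t in locus_tags if t.strip()]
    Path(path).write_text("\n".join(tags) + "\n")
    return tags


def read_query_file(path):
    """Read the query-genes back, skipping empty lines."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


if __name__ == "__main__":
    # Illustrative locus tags only; replace them with your own query-genes.
    write_query_file("queryFile.txt", ["b0001", "b0002", "b0003"])
    print(read_query_file("queryFile.txt"))
```

Reading the file back with the same helper is a cheap way to check that no empty lines or stray whitespace ended up in the list.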
Step 2
Apply query-based biclustering to each query-gene from the ‘QueryFile’ separately. You can choose whether or not to use a resolution sweep approach. Examples of query-based biclustering algorithms are: query-driven biclustering (QDB) [1], the Signature Algorithm [2] and ProBic [3].
Step 3
Convert the output of query-based biclustering to a format that can be read by the ensemble software. Because the output of query-based biclustering consists of biclusters, two files are constructed for each query-gene: one that contains the gene scores (‘geneScoreFile’) and one that contains the condition scores (‘conditionScoreFile’) (see Glossary).
The ‘geneScoreFile’ contains in the first row the locus tags of all genes present in the expression compendium used. The remainder of the file consists either of a matrix (in case of a resolution sweep), with each row representing the gene scores obtained for a certain value of the resolution parameter, or of a single row vector (no resolution sweep). In case your query-based biclustering algorithm just outputs a list of genes and no gene scores, this matrix is binary, with ‘1’ indicating that the gene belonged to the bicluster and ‘0’ indicating that it did not. An example of this ‘geneScoreFile’ can be obtained here.
The ‘conditionScoreFile’ is constructed in a similar way. The first row of the ‘conditionScoreFile’ contains the labels of all conditions in the compendium. Again this row is followed by a matrix that contains the condition scores. If no resolution sweep was used, this matrix is replaced by a single row vector containing the output of the query-based biclustering run. An example of this ‘conditionScoreFile’ can be obtained here. Both gene and condition scores must be between 0 and 1.
Naming conventions for these files are the following: ‘[query-gene]_geneS.txt’ for the geneScoreFiles and ‘[query-gene]_condS.txt’ for the conditionScoreFiles. Here [query-gene] refers to the respective query-genes saved in the ‘QueryFile’. All gene score and condition score files should be saved into the same folder, further referred to as ‘qdbResultDir’.
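The conversion in Step 3 could be sketched in Python as follows. This is an illustration under stated assumptions, not the format specification: the function names are hypothetical, and the tab delimiter is an assumption, since the description above does not state which separator the ensemble software expects (compare with the example files before relying on it).

```python
from pathlib import Path


def binary_scores(all_labels, members):
    """Turn a plain list of bicluster members into 0/1 scores."""
    member_set = set(members)
    return [1 if label in member_set else 0 for label in all_labels]


def write_score_file(path, labels, score_rows, sep="\t"):
    """Write a header row of labels followed by one row of scores per
    resolution value (a single row when no resolution sweep was used).
    The tab separator is an assumption."""
    for row in score_rows:
        if len(row) != len(labels):
            raise ValueError("each score row needs one value per label")
        if any(not 0 <= s <= 1 for s in row):
            raise ValueError("scores must lie between 0 and 1")
    lines = [sep.join(labels)]
    lines += [sep.join(str(s) for s in row) for row in score_rows]
    Path(path).write_text("\n".join(lines) + "\n")


def score_file_paths(result_dir, query_gene):
    """Apply the [query-gene]_geneS.txt / [query-gene]_condS.txt naming."""
    d = Path(result_dir)
    return d / f"{query_gene}_geneS.txt", d / f"{query_gene}_condS.txt"
```

For an algorithm that only reports bicluster membership, `binary_scores` produces the single binary row, which is then passed as `[row]` to `write_score_file`; for a resolution sweep, pass one row per value of the resolution parameter.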
Step 4
Run the ensemble software. The software itself runs through the unix commandline in the following manner:
./ensembleQDB queryFile qdbResultDir 'resolution' F 'scoreThr' 0 'qualityCheck' T 'filter' none

Required input arguments:

Optional input arguments:

Output: the output of the ensemble approach consists of two files: ‘consensusBicluster_genes.txt’ is a tab-delimited two-column file with the first column containing the number of the consensus bicluster and the second column containing the gene locus tag of the genes belonging to the bicluster. A similar file, ‘consensusBicluster_conds.txt’ is constructed for the condition content of the consensus biclusters. Both files are saved into the ‘qdbResultDir’ directory.
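The two output files can be loaded for downstream analysis with a few lines of Python. The sketch below assumes that the files contain no header row and keeps the bicluster numbers as strings; the function name is hypothetical.

```python
from collections import defaultdict


def read_consensus_file(path):
    """Map each consensus bicluster number to its list of members.

    Works for both 'consensusBicluster_genes.txt' and
    'consensusBicluster_conds.txt' (tab-delimited, two columns).
    Assumes there is no header row.
    """
    biclusters = defaultdict(list)
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip empty lines
            number, member = line.split("\t")[:2]
            biclusters[number.strip()].append(member.strip())
    return dict(biclusters)
```

Calling `read_consensus_file` on the gene file and on the condition file gives two dictionaries with the same bicluster numbers as keys, which makes it straightforward to pair the gene and condition content of each consensus bicluster.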

Examples: here we suppose that the ‘queryFile’ is saved in the following location /home/usr/ensemble/query/queryFile.txt and that the path to the ‘qdbResultDir’ is /home/usr/ensemble/qdbResultDir/
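With these two paths, the call from Step 4 could be assembled from Python as sketched below. The option values simply repeat those shown in the Step 4 command line and are not a recommendation; the helper names are hypothetical, and the executable is assumed to sit in the current working directory.

```python
import subprocess


def build_ensemble_command(query_file, qdb_result_dir, executable="./ensembleQDB"):
    """Assemble the argument list for the call shown in Step 4."""
    return [
        executable, query_file, qdb_result_dir,
        "resolution", "F",
        "scoreThr", "0",
        "qualityCheck", "T",
        "filter", "none",
    ]


def run_ensemble(command):
    """Run the command; needs the unpacked package and the MCR installed."""
    return subprocess.run(command, check=True)


cmd = build_ensemble_command(
    "/home/usr/ensemble/query/queryFile.txt",
    "/home/usr/ensemble/qdbResultDir/",
)
print(" ".join(cmd))
```

Passing the arguments as a list means no shell quoting is needed around the option names; `run_ensemble(cmd)` is deliberately not called here, so the snippet only prints the command.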

References

[1] Dhollander,T. et al. (2007) Query-driven module discovery in microarray data. Bioinformatics, 23, 2573-2580.

[2] Ihmels,J. et al. (2002) Revealing modular organization in the yeast transcriptional network. Nat. Genet., 31, 370-377.

[3] Zhao,H. et al. (2011) Query-based biclustering of gene expression data using Probabilistic Relational Models. BMC Bioinformatics, 12, S37.

Citation

When using this software for your own work please cite:
De Smet, R. and Marchal, K. (2011). An ensemble biclustering approach for querying gene expression compendia with experimental lists. Bioinformatics (revised version submitted).

Contact

Kathleen Marchal