Condition dependency and combinatorial regulation of the Escherichia coli transcriptional network

K. Lemmens, T. De Bie, T. Dhollander, S. De Keersmaecker, I. Thijs, G. Schoofs, A. De Weerdt, B. De Moor, J. Collado-Vides, K. Engelen, K. Marchal



We developed a method to assess the condition-specific combinatorial regulation of transcriptome networks based on independent data sources. Our data integration framework DISTILLER (Data Integration System To Identify Links in Expression Regulation) searches for transcriptional modules by combining expression data with information on the direct interaction between a regulator and its corresponding target genes. The inferred modules contain a minimal number of genes that must be co-expressed in a sufficiently large number of conditions and share motif instances for the same regulator(s).

The framework (software available upon request) builds upon advanced itemset mining approaches that have been designed to have good scalability, efficient memory use, and a small number of user parameters. It includes a condition selection or bicluster strategy in which co-expression of genes is required in only a significant subset of the complete condition set. By including this condition selection we can apply the algorithm to large expression compendia where interesting genes are not necessarily co-expressed in all measured conditions. Our approach also makes it straightforward to include any number of data sources related to transcriptional interactions such as additional microarrays, ChIP-chip or motif data.

Our methodology consists of three steps: (1) the identification of seed modules; (2) the reduction of the set of all seed modules to a manageable set of non-redundant and significant seed modules; and (3) the extension of the thus obtained seed modules with additional genes.
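The constraints on a seed module described above (a minimal number of genes, co-expression in enough conditions, and at least one shared regulator motif) can be illustrated with a toy brute-force search. This is only a sketch under stated assumptions, not the DISTILLER itemset mining algorithm: the function names, the toy data, and the co-expression criterion (all genes lying within a fixed `bandwidth` of each other in a condition) are hypothetical choices made for illustration.

```python
from itertools import combinations

def shared_motifs(genes, motif_hits):
    """Regulators whose motif is present for every gene in `genes`."""
    return set.intersection(*(motif_hits[g] for g in genes))

def coexpressed_conditions(genes, expression, bandwidth):
    """Indices of conditions where all genes lie within `bandwidth` of each other.

    Assumption: this bandwidth rule is an illustrative stand-in for a
    co-expression criterion, not necessarily the one used by DISTILLER.
    """
    n_conditions = len(next(iter(expression.values())))
    selected = []
    for c in range(n_conditions):
        values = [expression[g][c] for g in genes]
        if max(values) - min(values) <= bandwidth:
            selected.append(c)
    return selected

def seed_modules(motif_hits, expression, min_genes, min_conditions, bandwidth):
    """Enumerate every gene set that satisfies all three constraints (brute force)."""
    found = []
    genes = sorted(motif_hits)
    for size in range(min_genes, len(genes) + 1):
        for subset in combinations(genes, size):
            motifs = shared_motifs(subset, motif_hits)
            conditions = coexpressed_conditions(subset, expression, bandwidth)
            if motifs and len(conditions) >= min_conditions:
                found.append((subset, sorted(motifs), conditions))
    return found

# Hypothetical toy data: four genes, two regulators (M1, M2), four conditions.
motif_hits = {"g1": {"M1"}, "g2": {"M1"}, "g3": {"M1", "M2"}, "g4": {"M2"}}
expression = {
    "g1": [1.0, 1.1, 0.2, -1.0],
    "g2": [0.9, 1.0, 0.3, 0.5],
    "g3": [1.1, 0.9, 0.1, 2.0],
    "g4": [-1.0, 2.0, 0.2, 1.9],
}

modules = seed_modules(motif_hits, expression,
                       min_genes=3, min_conditions=3, bandwidth=0.3)
print(modules)
```

Exhaustive enumeration grows exponentially with the number of genes, which is why a scalable itemset mining approach is needed for a real compendium; the sketch only makes the constraints concrete.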

In this study, we applied DISTILLER to simultaneously analyze two distinct data sources: a cross-platform expression compendium consisting of 870 microarrays and a regulatory motif compendium consisting of both predicted and experimentally verified motif instances. E. coli microarray experiments were collected from the three major microarray databases (Parkinson et al, 2007; Barrett et al, 2007; Demeter et al, 2007) and combined into this cross-platform compendium. From RegulonDB (Salgado et al, 2006), we collected the binding site models (weight matrices) of 67 regulators. These were used to screen E. coli upstream sequences in order to predict potential regulator-target interactions (Hertzberg et al, 2005). Supplementing these predictions with the known gene-motif interactions from RegulonDB resulted in the total interaction matrix.
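Screening an upstream sequence with a weight matrix can be sketched as a sliding-window log-odds scan. The example below is a generic illustration and not the screening tool cited above: the count matrix, the uniform background, the pseudocount, and the score cutoff are all hypothetical values chosen for the example.

```python
import math

def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into log2-odds scores.

    Assumption: uniform background and a flat pseudocount, for illustration only.
    """
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix

def best_hit(sequence, matrix):
    """Return (score, offset) of the highest-scoring window in `sequence`."""
    width = len(matrix)
    best = None
    for offset in range(len(sequence) - width + 1):
        score = sum(matrix[j][sequence[offset + j]] for j in range(width))
        if best is None or score > best[0]:
            best = (score, offset)
    return best

# Hypothetical 4-bp binding site model built from 8 aligned sites, all "TATA".
counts = [{"T": 8}, {"A": 8}, {"T": 8}, {"A": 8}]
pwm = log_odds_matrix(counts)

upstream = "GGCTATAGC"   # hypothetical upstream sequence
cutoff = 5.0             # hypothetical score cutoff

score, offset = best_hit(upstream, pwm)
is_predicted_target = score >= cutoff
print(score, offset, is_predicted_target)
```

Applying such a scan to every gene and every regulator yields one score per gene-regulator pair; combined with the known interactions, these scores form an interaction matrix of the kind described above.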

In summary, DISTILLER identifies modules that consist of genes that are co-expressed in a subset of conditions, together with the controlling regulators. Modules can overlap in regulator, gene and condition content. This allows for identifying condition-dependent combinatorial regulation: in the example shown, gene 5 is regulated by both M1 and M2, but under different sets of conditions.

Microarray compendium

Our cross-platform compendium contains publicly available microarray data from the three major microarray databases: Stanford Microarray Database (Demeter et al, 2007), Gene Expression Omnibus (Barrett et al, 2007) and ArrayExpress (Parkinson et al, 2007). Additionally, we included four microarray experiments described in the literature. After filtering out the redundant information in the microarray databases, we obtained a total of 870 microarrays (data available upon request). Conditional categories were assigned based on manual curation: descriptions of each experiment as available in the database were combined with those of the corresponding publications. A full description of the microarray compendium can be found in tables 1a and 1b.

Table 1a Experiment info: html

Table 1a Experiment info: txt

Table 1b Array info: html

Table 1b Array info: txt

Download the data


Supplementary files

The supplementary file containing additional information on the methodology can be found here.

Identified modules

DISTILLER allows integrating interaction data with expression information to obtain condition-dependent modules. Each module is composed of true and/or predicted interactions that are functionally active under conditions that were selected from the expression compendium. The 150 statistically significant modules recovered by DISTILLER confirmed 454 (62%) of 736 previously described interactions in RegulonDB.

A detailed description of the 150 modules can be found here.

An overview of the number of known interactions per regulator that were identified, as well as the number of new, predicted interactions, can be found here.

Flat files containing information about the modules can be found below. For each module (column 1), the gene / regulator / condition content (column 2) is indicated.

Novel predictions

In addition to the confirmation of 454 previously described interactions, about 278 novel interactions were predicted. Table 2 shows a complete list of the novel interactions.
Note that what we call "novel" means "not present in RegulonDB"; this does not exclude the possibility that an interaction has already been reported in recent literature.

Table 2: html

Table 2: txt

Conditional Dependency

Arrays were grouped into conditional categories depending on the major cue that was changed in the experiments of the compendium (see table 1b). For instance, the category aerobic-anaerobic groups all arrays in which the effect of changing the oxygen level on gene expression was measured. Enrichment of a conditional category thus implies that the target genes of a particular regulator are mainly co-expressed in conditions belonging to the enriched category and this indirectly gives information on the conditions where a particular regulator is active.
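Enrichment of a conditional category among the conditions of a module can be quantified, for example, with a one-sided hypergeometric test. The sketch below shows that generic test only; the statistical test actually used for the analysis is not stated in this section, and all counts other than the compendium size of 870 arrays are hypothetical.

```python
from math import comb

def hypergeom_enrichment(total_arrays, category_arrays, module_arrays, overlap):
    """P(X >= overlap) when `module_arrays` arrays are drawn without replacement
    from `total_arrays`, of which `category_arrays` belong to the category."""
    upper = min(category_arrays, module_arrays)
    favourable = sum(
        comb(category_arrays, k)
        * comb(total_arrays - category_arrays, module_arrays - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(total_arrays, module_arrays)

# 870 arrays in the compendium (from the text); the other numbers are hypothetical:
# a category of 60 arrays, a module selecting 40 conditions, 15 of them in the category.
p_value = hypergeom_enrichment(total_arrays=870, category_arrays=60,
                               module_arrays=40, overlap=15)
print(p_value)
```

A small p-value indicates that the module's conditions fall in the category more often than expected by chance, which is the sense in which a category is called enriched above.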

A detailed description of the conditional dependency can be found here.


DISTILLER 1.0 and 2.0 can be downloaded here.
DISTILLER 2.0 is also available as a web service.

Features of DISTILLER 2.0:
  • Query-driven function was added
  • Out of heap problem was solved
  • Much faster than before
  • Can handle both "binary p-value" and "binary threshold" parameters
    When the gene-motif/regulatory matrix is not binary, a threshold is needed to discretize it into a binary matrix:
      • Binary p-value is a parameter for choosing a threshold using a randomization procedure
      • Binary threshold is a parameter that can be used directly as the threshold to discretize the matrix

  • Contributors:
    Tijl De Bie, Karen Lemmens, Hong Sun, Qiang Fu, Riet De Smet, Kristof Engelen, Kathleen Marchal
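
The difference between the "binary threshold" and "binary p-value" parameters listed under the DISTILLER 2.0 features can be sketched as follows. This is a minimal illustration and not DISTILLER's code: the function names are hypothetical, and the randomization procedure is assumed here to supply a list of null scores (for instance, scores obtained on shuffled data), which is only one possible reading of the description.

```python
from bisect import bisect_left

def binarize(matrix, threshold):
    """Discretize a real-valued gene-by-regulator matrix: 1 if score >= threshold."""
    return [[1 if value >= threshold else 0 for value in row] for row in matrix]

def threshold_from_p_value(null_scores, p_value):
    """Smallest null score t such that the fraction of null scores >= t is <= p_value.

    Assumption: `null_scores` come from some randomization of the data; how
    DISTILLER generates them is not specified in this document.
    """
    ordered = sorted(null_scores)
    n = len(ordered)
    for t in sorted(set(ordered)):
        tail_fraction = (n - bisect_left(ordered, t)) / n
        if tail_fraction <= p_value:
            return t
    return float("inf")  # no score is extreme enough; nothing passes

# Hypothetical scores for two genes (rows) and two regulators (columns).
scores = [[3, 9], [10, 1]]

# Option 1: "binary threshold" used directly.
direct = binarize(scores, threshold=9)

# Option 2: "binary p-value" converted to a threshold via null scores.
null_scores = list(range(1, 11))          # hypothetical null distribution
derived_threshold = threshold_from_p_value(null_scores, p_value=0.2)
via_p_value = binarize(scores, derived_threshold)
print(direct, derived_threshold, via_p_value)
```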