### abstract ###
While combinatorial models of transcriptional regulation can be inferred for metazoan systems from a priori biological knowledge, validation requires extensive and time-consuming experimental work.
Thus, there is a need for computational methods that can evaluate hypothesized cis regulatory codes before the difficult task of experimental verification is undertaken.
We have developed a novel computational framework that integrates transcription factor binding site (TFBS) and gene expression information to evaluate whether a hypothesized transcriptional regulatory model (TRM) is likely to target a given set of co-expressed genes.
Our basic approach is to simultaneously predict cis regulatory modules (CRMs) associated with a given gene set and quantify the enrichment, within these predicted CRMs, for combinatorial subsets of the TFBS motifs comprising the hypothesized TRM.
As a model system, we have examined a TRM experimentally demonstrated to drive the expression of two genes in a sub-population of cells in the developing Drosophila mesoderm, the somatic muscle founder cells.
This TRM was previously hypothesized to be a general mode of regulation for genes expressed in this cell population.
In contrast, the present analyses suggest that a modified form of this cis regulatory code applies to only a subset of founder cell genes, those whose gene expression responds to specific genetic perturbations in a similar manner to the gene on which the original model was based.
We have confirmed this hypothesis by experimentally discovering six new CRMs driving expression in the embryonic mesoderm, four of which drive expression in founder cells.
### introduction ###
A central challenge in determining the structure of genetic regulatory networks is the development of systematic methods for assessing whether a set of transcription factors (TFs) co-regulates a given set of co-expressed genes.
Although classical genetics approaches allow the identification of key regulating TFs and the determination of their approximate ordering within the genetic hierarchy, demonstrating that a collection of TFs forms a combinatorial code acting to directly drive gene expression has required laborious experimental identification and perturbation of numerous individual cis regulatory modules (CRMs).
To speed this process, several groups have recently demonstrated that computational approaches can rapidly identify CRMs with considerable accuracy CITATION, CITATION, especially when the searches use a collection of TFs known a priori to co-regulate.
This is perhaps best exemplified by the dramatic progress made by several groups in discovering CRMs for genes expressed during segmentation of the Drosophila melanogaster embryo CITATION, CITATION, CITATION, CITATION, a system where years of genetic screens have identified the regulating TFs CITATION.
In most biological systems, however, such a set of co-regulating TFs is either merely hypothesized or entirely unknown.
Therefore, for these in silico approaches to effectively identify the cis component of regulation in novel biological systems, additional computational methods are needed that can identify the trans component of regulation.
To address this need in metazoan systems, we have developed an initial statistical framework for evaluating hypothesized transcriptional regulatory models (TRMs).
As a model system, we have examined the regulation of a class of Drosophila myoblast genes for which a regulatory model has been previously hypothesized CITATION, CITATION and for which extensive transcriptional profiling datasets have been generated CITATION.
Muscle founder cells (FCs) are a sub-population of mononucleate myoblasts that are specified by the Wingless (Wg), Decapentaplegic (Dpp), and Ras signal transduction cascades acting in combination within the somatic mesoderm CITATION, CITATION.
Prior experimental work using the gene even-skipped (eve) to mark a single FC in each embryonic hemisegment provided a detailed model for the integration of these three signaling pathways at the transcriptional level: the TFs activated by the Wg, Dpp, and Ras pathways (T cell factor [dTCF], Mothers against dpp [Mad], and Pointed [Pnt], respectively) were demonstrated to bind a transcriptional enhancer driving expression of eve within dorsal FCs CITATION, CITATION, CITATION, CITATION.
Additional tissue specificity was shown to be provided by two mesodermal selector TFs, Twist (Twi) and Tinman (Tin).
Thus, from this single enhancer, a combinatorial model of transcriptional regulation for genes expressed in FCs was hypothesized, where exogenous signaling cues and endogenous tissue-specific TFs jointly establish the appropriate expression domain.
Guided by this genetic analysis of eve expression, a series of gene expression profiles has been determined for purified embryonic myoblasts by Estrada et al. CITATION.
In addition to profiling wild-type cells, these investigators performed expression array analyses of myoblasts in which the Wg, Dpp, Ras, and Notch pathways were variably perturbed by 12 informative gain-of-function and loss-of-function genetic manipulations.
Each of these 12 genetic perturbations was predicted, based on the example of eve, to increase or decrease expression of those genes with localized expression in FCs.
These 12 expression arrays were then combined into a single weighted ranking, which was used to predict additional FC genes.
Estrada et al. CITATION performed over 200 in situ hybridizations on predicted FC genes from the top of this composite FC ordering, and their experiments yielded a list of 159 validated FC genes.
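The idea of combining several perturbation arrays into one weighted ranking can be illustrated with a minimal sketch. This is not the procedure of Estrada et al., whose weighting scheme is not described here. It assumes each array yields a log fold change per gene, that each perturbation has an expected direction of change for FC genes (+1 or -1), and that per-array weights are supplied by the user. The function name, gene names, and numbers are all hypothetical.

```python
from typing import Dict, List, Sequence


def composite_ranking(
    fold_changes: Dict[str, Sequence[float]],
    expected_direction: Sequence[int],
    weights: Sequence[float],
) -> List[str]:
    """Rank genes by a weighted, direction-adjusted sum of log fold changes.

    Assumption (not from the source): a gene's score is
    sum_i weights[i] * expected_direction[i] * fold_changes[gene][i],
    so genes that move in the expected direction on heavily weighted
    arrays rise to the top of the ordering.
    """
    scores = {
        gene: sum(w * d * v for w, d, v in zip(weights, expected_direction, values))
        for gene, values in fold_changes.items()
    }
    # Highest composite score first.
    return sorted(scores, key=lambda gene: scores[gene], reverse=True)


# Hypothetical toy data: three perturbations rather than twelve.
toy_fold_changes = {
    "geneA": [1.5, -2.0, 0.5],
    "geneB": [-0.5, 0.2, 0.1],
    "geneC": [0.8, -0.3, 1.0],
}
toy_direction = [1, -1, 1]    # expected sign of change for an FC gene
toy_weights = [1.0, 0.5, 2.0]  # illustrative per-array weights

toy_ranking = composite_ranking(toy_fold_changes, toy_direction, toy_weights)
```

In this toy setting, genes near the top of `toy_ranking` would be the candidates to examine first by in situ hybridization.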
In the present work, we utilize the expression data of Estrada et al. CITATION to evaluate the roles of dTCF/Mad/Pnt/Twi/Tin as generalized regulators of FC gene expression.
A previous computational scan for windows of sequence containing binding sites for these five TFs successfully identified an additional enhancer, for the gene heartbroken, that drove expression in dorsal FCs and contained matches to these five transcription factor binding site (TFBS) motifs, demonstrating that the example of eve was not unique CITATION.
However, the generality of the model could not be established by those two examples alone, and we therefore developed a method of quantifying enrichment for these five TFBS motifs in localized windows of non-coding sequences flanking or intronic to FC genes.
Importantly, this approach, which we term CodeFinder, quantifies the relevance of not only each TF individually, but also of all combinations of the given set of TFs.
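To make "all combinations of the given set of TFs" concrete, the following sketch enumerates every non-empty subset of the five TFs (31 in total) and scores each one for enrichment among a foreground gene set. It is not the CodeFinder statistic, which is not specified in this section. As a stand-in it uses a one-sided hypergeometric test, and it assumes a gene counts as a hit for a subset when at least one of its sequence windows contains motif matches for every TF in that subset. The input layout, function names, and toy genes are hypothetical.

```python
from itertools import combinations
from math import comb
from typing import Dict, FrozenSet, Iterator, List, Sequence, Set

TFS = ("dTCF", "Mad", "Pnt", "Twi", "Tin")


def tf_subsets(tfs: Sequence[str]) -> Iterator[FrozenSet[str]]:
    """Yield every non-empty combination of the given TFs."""
    for size in range(1, len(tfs) + 1):
        for subset in combinations(tfs, size):
            yield frozenset(subset)


def hypergeom_pvalue(hits_fg: int, n_fg: int, hits_total: int, n_total: int) -> float:
    """P(X >= hits_fg) when n_fg genes are drawn from n_total, hits_total of which are hits."""
    upper = min(n_fg, hits_total)
    favorable = sum(
        comb(hits_total, k) * comb(n_total - hits_total, n_fg - k)
        for k in range(hits_fg, upper + 1)
    )
    return favorable / comb(n_total, n_fg)


def subset_enrichment(
    gene_windows: Dict[str, List[Set[str]]],
    foreground: Set[str],
    tfs: Sequence[str] = TFS,
) -> Dict[FrozenSet[str], float]:
    """Hypergeometric p-value for each TF subset (a stand-in statistic, see lead-in).

    gene_windows maps a gene to a list of windows; each window is the set of
    TFs with at least one motif match in that window (assumed input layout).
    """
    fg = foreground & set(gene_windows)
    results = {}
    for subset in tf_subsets(tfs):
        # A gene is a hit if any single window contains every TF in the subset.
        hit_genes = {
            gene
            for gene, windows in gene_windows.items()
            if any(subset <= window for window in windows)
        }
        results[subset] = hypergeom_pvalue(
            len(hit_genes & fg), len(fg), len(hit_genes), len(gene_windows)
        )
    return results


# Hypothetical toy input: four genes, two of them in the foreground set.
toy_windows = {
    "g1": [{"Pnt", "Twi", "Tin"}],
    "g2": [{"Pnt", "Twi", "Tin", "Mad"}],
    "g3": [{"Twi"}],
    "g4": [{"dTCF"}, {"Mad"}],
}
toy_pvalues = subset_enrichment(toy_windows, foreground={"g1", "g2"})
```

Because 31 subsets are tested per gene set, any real analysis along these lines would also need a multiple-testing correction or an empirical background, which this sketch omits.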
From this analysis, we hypothesized that the eve TRM is unlikely to apply to all FC genes.
Rather, we found that three TFs (Pnt, Twi, and Tin) are likely to regulate a specific subset of FC genes that share characteristic changes in their gene expression profiles in response to the genetic perturbations used by Estrada et al. CITATION.
Thus, by combining TFBS and gene expression data, our analysis allows a refinement of the initial model such that a subset of the original TFs appears to regulate a subset of FC genes.
As a test of this hypothesis, we have empirically validated four candidate FC enhancers that conform to our modified TRM.
