### abstract ###
An important step in understanding gene regulation is to identify the DNA binding sites recognized by each transcription factor (TF).
Conventional approaches to prediction of TF binding sites involve the definition of consensus sequences or position-specific weight matrices and rely on statistical analysis of DNA sequences of known binding sites.
Here, we present a method called SiteSleuth in which DNA structure prediction, computational chemistry, and machine learning are applied to develop models for TF binding sites.
In this approach, binary classifiers are trained to discriminate between true and false binding sites based on the sequence-specific chemical and structural features of DNA.
These features are determined via molecular dynamics calculations in which we consider each base in different local neighborhoods.
For each of 54 TFs in Escherichia coli, for which at least five DNA binding sites are documented in RegulonDB, the TF binding sites and portions of the non-coding genome sequence are mapped to feature vectors and used in training.
According to cross-validation analysis and a comparison of computational predictions against ChIP-chip data available for the TF Fis, SiteSleuth outperforms three conventional approaches: Match, MATRIX SEARCH, and the method of Berg and von Hippel.
SiteSleuth also outperforms QPMEME, a method similar to SiteSleuth in that it involves a learning algorithm.
The main advantage of SiteSleuth is a lower false positive rate.
### introduction ###
An important step in characterizing the genetic regulatory network of a cell is to identify the DNA binding sites recognized by each transcription factor (TF) protein encoded in the genome.
A TF typically activates and/or represses genes by associating with specific DNA sequences.
Although other factors, such as metabolite binding partners and protein-protein interactions, can affect gene expression CITATION, identifying the sequences directly recognized by TFs as completely as possible is important for understanding which genes are controlled by which TFs.
A better understanding of gene regulation, which plays a central role in cellular responses to environmental changes, is a key to manipulating cellular behavior for a variety of useful purposes, as in metabolic engineering applications CITATION.
A number of computational methods have been developed for predicting TF binding sites given a set of known binding sites CITATION CITATION.
Commonly used methods involve the definition of a consensus sequence or the construction of a position-specific weight matrix (PWM), where DNA binding sites are represented as letter sequences from the alphabet {A, T, C, G}.
More sophisticated approaches further constrain the set of potential binding sites for a given TF by considering, in addition to PWMs, the contribution of each nucleotide to the free energy of protein binding CITATION and additional biologically relevant information, such as nucleotide correlation between different positions of a sequence CITATION or sequence-specific binding energies CITATION.
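As a point of reference for the PWM approach described above, the following is a minimal sketch of how a log-odds weight matrix might be built from aligned sites and used to scan a sequence. The function names (`build_pwm`, `score`, `scan`), the pseudocount, the uniform background, and the toy sites are all illustrative assumptions; this is not the implementation used by Match, MATRIX SEARCH, or any other method discussed here.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=None):
    """Build a log-odds weight matrix (one dict per position) from aligned sites."""
    if background is None:
        # Assumption: uniform background base frequencies.
        background = {b: 0.25 for b in BASES}
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        # Start every count at the pseudocount so unseen bases get a finite score.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background[b]) for b in BASES})
    return pwm

def score(pwm, seq):
    """Sum the per-position log-odds weights for a sequence of matrix width."""
    return sum(column[base] for column, base in zip(pwm, seq))

def scan(pwm, sequence, threshold):
    """Return (start, window, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        s = score(pwm, window)
        if s >= threshold:
            hits.append((start, window, s))
    return hits

# Hypothetical aligned sites, for illustration only (not taken from RegulonDB).
toy_sites = ["TTGACA", "TTGACT", "TTTACA", "TAGACA", "TTGATA"]
pwm = build_pwm(toy_sites)
# The consensus sequence is the highest-weight base at each position.
consensus = "".join(max(column, key=column.get) for column in pwm)
hits = scan(pwm, "GGTTGACAGG", threshold=5.0)
```

Note that each position contributes to the score independently, so a representation of this kind cannot express correlations between positions or the structural context of a base.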
Although not as widely used as sequence analysis, structural data have also been employed for predicting TF binding sites CITATION CITATION.
Most of these methods use protein-DNA structures rather than DNA by itself.
Acquiring training sets large enough to be useful is problematic for even well-studied TFs, for which only small sets of known binding sites are typically available CITATION.
New high-throughput technologies have been used to identify large numbers of binding sites for particular TFs CITATION CITATION, but there remains a need for methods that predict TF binding sites given a small number of positive examples.
Such methods can be used, for example, to complement analysis of high-throughput data.
Binding sites detected by high-throughput in vitro methods, such as protein-binding microarrays CITATION, can be compared with predicted binding sites to prioritize studies aimed at confirming the importance of sites in regulating gene expression in vivo.
The fine three-dimensional structure of DNA is sequence dependent, and TF-DNA interactions depend on various physicochemical parameters, such as contacts between nucleotides and amino acid residues and base pair geometry CITATION.
These parameters are not accounted for by conventional methods for predicting TF binding sites, which rely on sequence information alone.
Letter representations of DNA sequences do not capture the biophysics underlying TF-DNA interactions.
Given that a TF does not read off letters from a DNA sequence, but interacts with a particular sequence because of its chemical and structural features, we hypothesized that better predictions of TF binding sites might be generated by explicitly accounting for these features in an algorithm for predicting TF binding sites.
The mechanisms by which TFs recognize DNA sequences can be divided into two classes: indirect readout and direct readout CITATION.
For indirect readout, a TF recognizes a DNA sequence via the conformation of the sequence, which is determined by the local geometry of base pair steps, the distortion flexibility of the DNA sequence, and protein-DNA interactions CITATION, CITATION.
For direct readout, a TF recognizes a DNA sequence through direct contacts between specific bases of the sequence and amino acid residues of the TF CITATION, CITATION.
These two classes of recognition mechanisms are not mutually exclusive.
In this study, we introduce a method, SiteSleuth, for predicting TF binding sites on the basis of sequence-dependent structural and chemical features of short DNA sequences.
By using molecular dynamics methods to calculate these features, we can map a set of known or potential binding sites for a given TF to vectors of structural and chemical features.
We use features of positive and negative examples of TF binding sites to train a support vector machine to discriminate between true and false binding sites.
Negative examples are derived from randomly selected non-coding DNA sequences.
Positive examples are taken from RegulonDB CITATION, which collects information about TFs in Escherichia coli.
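The training setup described above (map each sequence to a feature vector, then fit a binary classifier on positive and negative examples) can be sketched roughly as follows. Everything concrete here is an assumption made for illustration. The `TOY_DESCRIPTORS` table holds placeholder numbers, not the molecular-dynamics-derived features used by SiteSleuth, and it ignores the local neighborhood of each base. The sequences are invented rather than taken from RegulonDB or the E. coli genome. The learner is a margin perceptron (hinge-loss sub-gradient updates with no regularization term and no kernel), a simplified stand-in for a support vector machine.

```python
# Placeholder per-base descriptors. These are NOT values from the paper;
# they only give each base a distinct numeric signature.
TOY_DESCRIPTORS = {
    "A": (1.0, 0.0),
    "C": (0.0, 1.0),
    "G": (0.0, -1.0),
    "T": (-1.0, 0.0),
}

def to_features(seq):
    """Concatenate the placeholder descriptors of every base in the sequence."""
    vec = []
    for base in seq:
        vec.extend(TOY_DESCRIPTORS[base])
    return vec

def decision(w, b, x):
    """Signed distance-like score of x relative to the separating hyperplane."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_margin_perceptron(X, y, lr=0.1, max_epochs=1000):
    """Update on every example whose margin y * f(x) is below 1.

    Stops early once a full pass makes no update. This is a stand-in for an
    SVM: it uses the hinge-loss update but omits the regularization term.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        updated = False
        for xi, yi in zip(X, y):
            if yi * decision(w, b, xi) < 1.0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
                updated = True
        if not updated:
            break
    return w, b

def predict(w, b, x):
    """Label +1 (putative binding site) or -1 (non-site)."""
    return 1 if decision(w, b, x) >= 0 else -1

# Invented examples: "known sites" and "non-coding background" windows.
positives = ["TTGACA", "TTGACT", "TTTACA", "TAGACA", "TTGATA"]
negatives = ["GCCGGC", "CGCGTG", "ACCGGT", "GGCCAG", "CCACGG"]
X = [to_features(s) for s in positives + negatives]
y = [1] * len(positives) + [-1] * len(negatives)
w, b = train_margin_perceptron(X, y)
```

In a realistic setting the descriptor table would be replaced by the computed structural and chemical features, and the learner by a proper SVM implementation. The sketch is only meant to show how sequences, feature vectors, and labels fit together.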
Classifiers for E. coli TFs developed through the SiteSleuth approach are evaluated by cross-validation, and the classifier for Fis is tested against chromatin immunoprecipitation (ChIP)-chip assays of Fis binding sites CITATION.
Combining ChIP with microarray technology, ChIP-chip assays provide information about DNA-protein binding in vivo on a genome-wide scale CITATION.
We also evaluate the performance of SiteSleuth against four other computational methods: the method of Berg and von Hippel (BvH) CITATION, MATRIX SEARCH CITATION, Match CITATION, and QPMEME CITATION.
The BvH, MATRIX SEARCH, and Match methods rely on the PWM approach to capture TF preferences for binding sites.
The QPMEME method is similar to SiteSleuth in that it employs a learning algorithm.
In the case of Fis, we show that SiteSleuth generates significantly fewer estimated false positives and provides higher prediction accuracy than the other computational approaches.
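For readers unfamiliar with the quantities being compared, the sketch below computes a false positive rate and an accuracy from true and predicted labels. It assumes the standard textbook definitions and a +1/-1 label encoding. The way false positives are estimated from ChIP-chip data in this study may differ, and the label lists are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for labels encoded as +1 / -1."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == -1 and pred == 1:
            fp += 1
        elif truth == -1 and pred == -1:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def false_positive_rate(y_true, y_pred):
    """Fraction of true non-sites that were predicted to be sites."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return fp / (fp + tn) if (fp + tn) else 0.0

def accuracy(y_true, y_pred):
    """Fraction of all examples that were labeled correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0

# Made-up labels: three true sites and five non-sites.
y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1, -1, -1]
fpr = false_positive_rate(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```

Because candidate non-sites in a genome vastly outnumber true sites, even a small false positive rate can translate into many spurious predictions, which is why this quantity matters when comparing methods.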
