### abstract ###
Protein function is mediated by different amino acid residues, both their positions and types, in a protein sequence.
Some amino acids are responsible for the stability or overall shape of the protein, playing an indirect role in protein function.
Others play a functionally important role as part of active or binding sites of the protein.
For a given protein sequence, the residues and their degree of functional importance can be thought of as a signature representing the function of the protein.
We have developed a combination of knowledge- and biophysics-based function prediction approaches to elucidate the relationships between the structural and the functional roles of individual residues and positions.
Such a meta-functional signature (MFS), which is a collection of continuous values representing the functional significance of each residue in a protein, may be used to study proteins of known function in greater detail and to aid in experimental characterization of proteins of unknown function.
We demonstrate the superior performance of MFS in predicting protein functional sites and present four real-world examples that apply MFS in a wide range of settings to elucidate protein sequence-structure-function relationships.
Our results indicate that the MFS approach, which can combine multiple sources of information and also give biological interpretation to each component, greatly facilitates the understanding and characterization of protein function.
### introduction ###
Vast amounts of sequence and structural data are being generated by high-throughput technologies.
Functional annotations of the uncharacterized sequences and structures are significantly lagging.
The time and cost of experimental techniques required to probe the function of all uncharacterized proteins are prohibitive.
Therefore, computational methods have become increasingly useful and popular for predicting and annotating functions for the huge amount of sequence and structure data CITATION, CITATION.
However, protein function prediction is itself a hard problem to formulate, because function is difficult to define CITATION, CITATION.
Various functional definition schemes have been developed over the years and have addressed various aspects of protein function.
Instead of adopting an existing functional definition scheme, we proposed to probe the role of individual amino acid residues in protein function, regardless of the functional definition schemes that are used.
In such cases, the protein function can be represented simply as a series of quantitative values, each of which indicates the functional importance of the corresponding amino acid residue in the protein sequence or structure.
To calculate the quantitative values for each residue, we used a combined approach, the meta-functional signature (MFS), which takes into account the individual scores from various function prediction algorithms and generates a composite score for each amino acid residue in a given protein.
Currently our signature generation protocol consists of the following four types of scores for four different types of information: sequence conservation, evolutionary conservation, structural stability, and amino acid type.
All these scores are generated via conceptually simple and easily implementable algorithms, and their combined use outperforms sophisticated algorithms that use only one source of information.
Sequence conservation is one of the most utilized methods for measuring the functional importance of individual amino acids.
Amino acid residues with more conservative variation patterns are usually more important for the preservation of protein function.
This concept is often used to identify the functional regions of proteins by building multiple alignments between the target sequence and all its sequence homologues, and then analyzing the degree of sequence conservation at each alignment site.
Various measures of sequence conservation have been proposed over the years, with differing complexity and sophistication CITATION.
The simplest measures of sequence conservation are the entropy score and its variants CITATION CITATION.
More complicated measures CITATION CITATION incorporate other information, such as amino acid pairwise similarity, physicochemical properties, and theoretical sequence profiles, into the scoring schemes.
The AL2CO program package incorporates nine different scoring schemes, but these scores tend to correlate with each other CITATION.
Recently it was also shown that a Jensen-Shannon divergence measure improves the prediction of functionally important residues, and that considering conservation in sequentially neighboring sites further improves accuracy CITATION.
We previously demonstrated that a relative entropy measure, which incorporates amino acid background frequencies, can better predict functional sites than simple entropy measures CITATION.
Furthermore, we found that incorporating amino acid frequencies estimated by hidden Markov models (HMMs) further improves the performance of the relative entropy measure CITATION.
In the current study, we use a sequence conservation measure derived from HMMs as one component of our meta-functional signature generation protocol.
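The conservation measures named above can be illustrated with a minimal sketch. It computes Shannon entropy, relative entropy, and Jensen-Shannon divergence for single alignment columns. The uniform background distribution, the toy alignment, and all function names are assumptions made for illustration; this is not the HMM-derived measure used in this study, which would replace the raw column frequencies with HMM-estimated ones.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: a uniform background. Real use would substitute measured
# background frequencies for the 20 amino acids.
UNIFORM_BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def column_frequencies(column):
    """Observed amino acid frequencies in one alignment column, ignoring gaps."""
    residues = [c for c in column if c in AMINO_ACIDS]
    counts = Counter(residues)
    total = len(residues)
    return {aa: counts[aa] / total for aa in counts}


def shannon_entropy(freqs):
    """Shannon entropy in bits; zero for a fully conserved column."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def relative_entropy(freqs, background=UNIFORM_BACKGROUND):
    """Kullback-Leibler divergence of the column from the background, in bits."""
    return sum(p * math.log2(p / background[aa])
               for aa, p in freqs.items() if p > 0)


def jensen_shannon_divergence(freqs, background=UNIFORM_BACKGROUND):
    """Jensen-Shannon divergence between column and background, in bits."""
    total = 0.0
    for aa in AMINO_ACIDS:
        p = freqs.get(aa, 0.0)
        q = background[aa]
        m = 0.5 * (p + q)  # mixture distribution
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total


# Toy aligned sequences, invented purely for illustration.
alignment = ["MKHC", "MRHC", "MKHS", "MQHC"]
columns = ["".join(seq[i] for seq in alignment)
           for i in range(len(alignment[0]))]
for index, column in enumerate(columns):
    freqs = column_frequencies(column)
    print(index, column,
          round(shannon_entropy(freqs), 3),
          round(relative_entropy(freqs), 3),
          round(jensen_shannon_divergence(freqs), 3))
```

Low entropy indicates a conserved column, whereas high relative entropy or Jensen-Shannon divergence indicates a column whose composition departs strongly from the background.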
In addition to sequence conservation, we also incorporate evolutionary conservation information in the meta-functional signature.
Many studies have shown that using phylogenetic relationships among a group of evolutionarily related sequences helps predict functional sites accurately.
The Evolutionary Trace method, one of the first and the most successful of such methods, analyzes residue variation patterns within and between protein subfamilies from multiple alignments, maps important residues to protein structure, and quantitatively ranks residue importance CITATION, CITATION.
A further development of the Evolutionary Trace method allows quantitative ranking of residue importance, by combining the use of evolutionary information and the entropy measures CITATION, CITATION.
Similarly, the ConSurf method constructs phylogenetic relationships from a group of similar sequences, calculates the conservation score by a Bayesian or a maximum likelihood method, and maps the conservation information to the protein surface CITATION, CITATION.
Further, a study by Soyer et al. used site-specific evolutionary models that assumed a different substitution matrix for each site, for detecting protein functional sites CITATION.
La et al. used evolutionary relationships among sequence fragments to infer protein functional sites CITATION.
del Sol Mesa et al. presented several automated methods that divide a given protein family into subfamilies and search for residues that determine specificity CITATION.
The commonality among all these methods is that sequence relationships are analyzed based on the topology of an evolutionary tree, thus providing an additional level of information instead of relying on multiple sequence alignments alone.
Here, we propose a novel method, called the state to step ratio (SSR) score, for measuring evolutionary conservation.
Based on given multiple alignments, we construct a maximum parsimony tree and analyze the variation patterns from the root of the tree to its leaves to create a score for each amino acid residue.
The SSR score is a simple yet effective way of measuring evolutionary conservation.
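The exact SSR formula is not given in this section, so the following is only a sketch of one plausible reading of the name: the number of distinct amino acid states at a site divided by the minimum number of substitution steps needed to explain them on a tree. The step count uses Fitch's small-parsimony algorithm on a fixed rooted binary tree. The tree topology is assumed to be given (building the maximum parsimony tree is outside the sketch), and the nested-tuple tree format, the function names, and the handling of fully conserved sites are all hypothetical.

```python
def fitch_steps(tree, leaf_states):
    """Fitch small parsimony on a rooted binary tree.

    `tree` is either a leaf name (str) or a 2-tuple of subtrees.
    Returns (candidate_state_set, minimum_number_of_steps).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_steps(left, leaf_states)
    right_set, right_steps = fitch_steps(right, leaf_states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra step needed.
        return common, left_steps + right_steps
    # The children disagree: take the union and count one substitution.
    return left_set | right_set, left_steps + right_steps + 1


def state_to_step_ratio(tree, leaf_states):
    """Hypothetical SSR: distinct states divided by parsimony steps.

    Returns None for a fully conserved site, where no steps occur and
    the ratio is undefined under this reading.
    """
    n_states = len(set(leaf_states.values()))
    _, steps = fitch_steps(tree, leaf_states)
    if steps == 0:
        return None
    return n_states / steps


# Toy four-taxon tree and two alignment sites, invented for illustration.
tree = (("seq1", "seq2"), ("seq3", "seq4"))
clade_specific = {"seq1": "K", "seq2": "K", "seq3": "R", "seq4": "R"}
scattered = {"seq1": "K", "seq2": "R", "seq3": "K", "seq4": "R"}
print(state_to_step_ratio(tree, clade_specific))
print(state_to_step_ratio(tree, scattered))
```

Under this reading, a site whose variation follows the tree (one change separating two clades) needs fewer steps per state than a site where the same two residues recur independently across clades.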
Functional signature scores can also be derived from biophysics-based methods, using experimentally determined or computationally predicted protein structures.
For example, a recent study demonstrated that destabilizing regions in protein structures can often be used to provide valuable information for functional inference and functional site identification CITATION.
For a given structure and a given position, we propose to mutate the wild-type residue to each of the 19 other amino acids and calculate the structural stability score of each mutant; these scores can in turn be used to assign a score to each residue in a protein.
Hence, these scores can also serve as a component of protein function prediction.
We previously developed a residue-specific all-atom probability discriminatory function (RAPDF) CITATION that compiles statistics from a database of experimental structures to score and pick decoy structures that are more likely to be similar to experimentally derived structures.
The RAPDF has been optimized and enhanced in recent years for protein structure prediction CITATION CITATION.
Here, we further expanded the RAPDF to score residue mutations on a per-residue basis.
Each residue in a given protein was mutated to each of the 19 alternative amino acids, producing new structures that were further optimized for topology and maximized for stability.
In our current MFS generation protocol, we used two RAPDF-based scoring functions: one measures how all mutated structures deviate from each other, and the other measures how the experimentally determined structure differs from the mutated structures.
These two scores represent the potential impact on stability for the position and for the naturally occurring residue, respectively.
Together, they help separate residues conserved for structure from those conserved for function.
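The two per-position summaries can be sketched as follows, assuming a structure score has already been computed for the wild-type residue and for each mutant at one position. The formulas for the two scores are not given in this section, so the population standard deviation and the wild-type-minus-mean difference used here are stand-ins chosen for illustration, and the function name and the example numbers are hypothetical.

```python
import statistics


def position_stability_scores(scores_by_residue, wild_type):
    """Summarize a mutation scan at one position.

    `scores_by_residue` maps each amino acid (one-letter code) to the
    structure score of the model carrying that residue at this position.
    Returns (spread_among_mutants, wild_type_minus_mutant_mean).
    """
    mutant_scores = [score for aa, score in scores_by_residue.items()
                     if aa != wild_type]
    # How much the mutated structures deviate from each other.
    spread = statistics.pstdev(mutant_scores)
    # How the experimentally determined structure differs from the mutants.
    wild_type_gap = scores_by_residue[wild_type] - statistics.mean(mutant_scores)
    return spread, wild_type_gap


# Invented scores for three residues only; a real scan would cover all 20.
example_scores = {"A": -10.0, "G": -8.0, "V": -6.0}
print(position_stability_scores(example_scores, wild_type="A"))
```

A large spread would suggest that the position is sensitive to substitution in general, while a large gap would suggest that the naturally occurring residue in particular matters for stability.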
An additional component of the meta-functional signature is the amino acid type: some amino acids, such as histidine and cysteine, are more likely to be located in functional sites than others.
However, such prior probability for a functional site is not explicitly modeled and incorporated by most current functional site prediction algorithms.
In our MFS generation protocol, we used 19 binary variables to represent the amino acid identity for each position in a given protein.
We also examined whether the explicit use of amino acid information, as opposed to the implicit use, could provide additional information and better performance.
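One common way to represent 20 categories with 19 binary variables is indicator (dummy) coding with a reference category, sketched below. Which amino acid serves as the reference is not stated in this section; alanine is used here purely as an assumption, and the names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
REFERENCE = "A"  # Assumption: the reference category is not given in the text.
INDICATOR_ORDER = [aa for aa in AMINO_ACIDS if aa != REFERENCE]


def encode_residue(aa):
    """Return 19 binary indicators; the reference residue encodes as all zeros."""
    if aa not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid: {aa!r}")
    return [1 if aa == other else 0 for other in INDICATOR_ORDER]


print(encode_residue("H"))
print(encode_residue("A"))
```

Dropping one category keeps the indicators from being perfectly collinear with the intercept of a regression model, so each remaining coefficient is read relative to the reference residue.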
Given the complexity of defining and identifying protein functional sites, clearly no single method will always work to capture all protein functional site information.
Therefore, several groups have begun to incorporate information from various sources, especially structure-derived information, to give more accurate predictions.
Work by Chelliah et al. has shown that distinguishing the structural and functional constraints for amino acid residues leads to better prediction of protein interaction sites CITATION.
We have shown that by considering both structural and functional constraints on protein evolution, we can better identify functional sites and signatures CITATION, CITATION.
Recently, Petrova et al. showed that integration of seven selected sequence and structure features into a support vector machine framework can improve identification of catalytic sites CITATION.
Furthermore, Fischer et al. integrated sequence conservation, amino acid distribution, predicted secondary structure and relative solvent accessibility into a probability density framework, and showed that at 20 percent sensitivity the integrated method leads to a 10 percent increase in precision over non-integrated methods for predicting catalytic residues from the Catalytic Site Atlas and PDB SITE records CITATION.
Youn et al. investigated the various features for discriminating catalytic from noncatalytic residues in novel structural folds, and showed that a measure of sequence conservation, a measure of structural conservation, a degree of uniqueness of a residue's structural environment, solvent accessibility, and residue hydrophobicity are the best predictors of catalytic sites CITATION.
Other similar studies also incorporated dozens to hundreds of features into a machine-learning framework for catalytic site identification CITATION, CITATION.
Altogether, the previous work suggests great value in using several complementary sequence and structure components for scoring catalytic sites.
Unlike these approaches, which were largely based on machine-learning algorithms, the current study combines several sources of information on sequence, structure, evolution, and amino acid type via a simple logistic regression model for function prediction, covering both catalytic sites and binding sites.
The major advantage of the regression model is that each component can be associated with a biologically meaningful interpretation, and that individual scores for a protein can be manually studied to gain additional insights into different aspects of protein function, which are not available when many components are thrown into a sophisticated machine-learning framework.
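The interpretability argument can be made concrete with a minimal sketch of how a logistic regression model turns component scores into a per-residue probability. The feature names, coefficient values, and intercept below are placeholders invented for illustration, not fitted values from this study.

```python
import math


def logistic(z):
    """Standard logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def functional_site_probability(features, coefficients, intercept):
    """Combine per-residue component scores through a logistic model.

    `features` and `coefficients` are dicts keyed by component name.
    """
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return logistic(z)


# Placeholder coefficients; in practice they would be fitted to residues
# with known functional annotation.
coefficients = {
    "sequence_conservation": 1.2,
    "evolutionary_conservation": 0.8,
    "stability_position": -0.5,
    "stability_wild_type": 0.3,
}
residue = {
    "sequence_conservation": 2.0,
    "evolutionary_conservation": 1.5,
    "stability_position": 0.4,
    "stability_wild_type": -0.2,
}
print(functional_site_probability(residue, coefficients, intercept=-3.0))
```

Because the model is linear on the log-odds scale, each coefficient can be read as the change in log-odds of a residue being functional per unit increase in that component, which is the kind of per-component interpretation described above.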
We compare the MFS approach with several other functional site prediction algorithms, propose enhancements to our approach, exemplify the wide definition of function assessed by MFS, and discuss how different components of MFS can be used to understand biological function via four real-world examples.
