Modeling Variant Gal4 Transcriptional Activation: A Structure-Based Approach

Masso, Majid

Background: Gal4 is a transcription factor that promotes expression of genes encoding the enzymes that convert galactose to glucose. The Gal4 protein from Saccharomyces cerevisiae (baker’s yeast) has become a model for studying eukaryotic transcriptional activation in general because its regulatory properties mirror those of several other eukaryotic organisms. Methods: The structure of Gal4 was modeled by applying Delaunay tessellation, a computational geometry technique that identifies quadruplets of nearest neighbor residues in the folded protein. An in silico mutagenesis methodology that combines this approach with a four-body knowledge-based energy function was used to empirically characterize the effect of every possible single residue substitution on the Gal4 structure. In this way, each Gal4 variant was assigned both an overall scalar value quantifying the structural change and a more comprehensive feature vector of structure-based attributes. A recent study also investigated the functional impact of a diverse set of single residue Gal4 variants by measuring changes in their transcriptional activation relative to wild type. Results: As testament to the validity of the variant Gal4 computational and experimental data, and given that protein structure determines function, a significant correlation was observed between the computed (scalar) structural data and the measured activity values for the collection of single residue Gal4 variants. Consequently, the Gal4 variants were instead represented by their more informative structure-based attribute vectors, and supervised classification and regression machine learning algorithms were used to train predictive models of variant Gal4 activity. All of the models performed well under cross-validation testing, with balanced accuracy reaching 86% among the classification models, and with actual and predicted activity values displaying a correlation as high as r = 0.67 for the regression models. Conclusions: These models can be used to predict transcriptional activation levels for all Gal4 variants that have yet to be studied, simply by submitting their respective structure-based feature vectors to the trained models as a separate test set. In doing so, researchers may be able to narrow the focus and reduce the cost of mutagenesis experiments designed to achieve desired Gal4 transcriptional activation levels through single residue modifications.
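For readers unfamiliar with the two performance measures reported above, the following is a minimal sketch of how balanced accuracy and a correlation coefficient r can be computed from actual and predicted values. It assumes that r denotes the Pearson correlation coefficient, and the class labels ("active", "reduced") and all numbers are hypothetical toy values chosen for illustration. They are not data or results from this study, and the functions are not the software used in it.

```python
from math import sqrt


def balanced_accuracy(actual, predicted):
    """Mean of the per-class recalls.

    For two classes this equals (sensitivity + specificity) / 2, so it is
    not inflated when one activity class is much larger than the other.
    """
    recalls = []
    for cls in sorted(set(actual)):
        total = sum(1 for a in actual if a == cls)
        correct = sum(
            1 for a, p in zip(actual, predicted) if a == cls and p == cls
        )
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical class labels for six variants (illustration only).
actual_labels = ["active", "active", "active", "active", "reduced", "reduced"]
predicted_labels = ["active", "active", "active", "reduced", "reduced", "reduced"]
ba = balanced_accuracy(actual_labels, predicted_labels)

# Hypothetical measured vs. predicted activity values (illustration only).
measured = [1.0, 2.0, 3.0, 4.0]
predicted_activity = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(measured, predicted_activity)
```

In the toy classification data, recall is 3/4 for the first class and 2/2 for the second, so the balanced accuracy is their mean. Plain accuracy would weight the larger class more heavily, which is why balanced accuracy is the more informative summary when activity classes are unevenly populated.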