Genome-wide Prediction of Splice Sites using Maximal Dependence Decomposition
Eukaryotic mRNA splicing is a key step in gene expression and plays a vital role in producing functional proteins. It is carried out by a complex called the spliceosome, which ensures that introns are removed and exons are joined correctly prior to translation. The spliceosome recognizes specific sequence motifs called splice sites within a pre-mRNA sequence, two of which are the 5' and 3' sites that flank every intron. Mutations within splice sites can disrupt the normal splicing process, leading to malignant diseases. The aim of this work is to predict splice sites in the human genome using higher-order position weight matrices (PWMs). A PWM is a popular computational representation of splice sites or of any other sequence motif. It is constructed by recording how often each nucleotide occurs at each position of a set of aligned sequences. Simple PWMs performed fairly well in classifying known 5' and 3' splice sites and in predicting cryptic splice sites in human genes. However, they make the strong assumption that nucleotide positions, whether adjacent or non-adjacent, are independent of one another. We therefore developed a method that identifies statistically significant splice sites by incorporating the maximal dependence decomposition (MDD) algorithm. Methods: MDD operates by constructing a classification tree on aligned sequences. Unlike simple PWMs, it takes into account the interdependency between nucleotide positions; the degree of dependence is evaluated with chi-squared tests. The leaves of the classification tree represent distinct groups of sequences, each with its own PWM, and are used to cluster unknown test sequences and classify unknown splice sites. Results: We performed 10-fold cross-validation of the MDD algorithm on authentic human 5' and 3' splice sites from the HS3D database and observed areas under the receiver operating characteristic (ROC) curve of 0.97 and 0.93, respectively.
Similarly, we classified putative 5' and 3' cryptic splice sites in the beta-globin and breast cancer type 1 susceptibility protein (BRCA1) genes and observed areas under the ROC curve of 0.76 and 0.70. Our algorithm is efficient: scoring each human chromosome takes 5 minutes on average. Conclusion: We are currently implementing a graphical user interface for our method to facilitate data input and output and to visualize the predicted splice sites in whole-genome sequences. The tool can easily be extended to rapidly predict any DNA sequence motif.
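
The PWM construction and the chi-squared-based tree split described in the Methods can be illustrated with a minimal Python sketch. It is not the implementation evaluated above. All function names are hypothetical, and the pseudocount of 1, the uniform background of 0.25, and the splitting rule (choose the position whose consensus indicator has the largest summed chi-squared statistic against all other positions, then split on consensus match) are assumptions based on how MDD is commonly described. Input sequences are assumed to be aligned, equal-length, uppercase strings over A, C, G, T.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def build_pwm(seqs, pseudocount=1.0, background=0.25):
    """Log-odds PWM: one dict per position mapping base -> log2(freq / background)."""
    total = len(seqs) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        pwm.append({b: math.log2((counts[b] + pseudocount) / total / background)
                    for b in ALPHABET})
    return pwm

def score(pwm, seq):
    """Sum of per-position log-odds; positions are treated as independent."""
    return sum(col[b] for col, b in zip(pwm, seq))

def consensus(seqs):
    """Most frequent base at each position (ties go to the base seen first)."""
    return [Counter(s[i] for s in seqs).most_common(1)[0][0]
            for i in range(len(seqs[0]))]

def chi_squared(seqs, i, j, cons_base):
    """Chi-squared statistic for the 2x4 table:
    (position i matches its consensus?) x (base at position j)."""
    table = {(m, b): 0 for m in (True, False) for b in ALPHABET}
    for s in seqs:
        table[(s[i] == cons_base, s[j])] += 1
    n = len(seqs)
    row = {m: sum(table[(m, b)] for b in ALPHABET) for m in (True, False)}
    col = {b: table[(True, b)] + table[(False, b)] for b in ALPHABET}
    stat = 0.0
    for (m, b), observed in table.items():
        expected = row[m] * col[b] / n
        if expected > 0:  # skip cells whose row or column is empty
            stat += (observed - expected) ** 2 / expected
    return stat

def most_dependent_position(seqs):
    """Position i maximizing the sum over j != i of chi_squared(i, j)."""
    cons = consensus(seqs)
    length = len(seqs[0])
    sums = [sum(chi_squared(seqs, i, j, cons[i]) for j in range(length) if j != i)
            for i in range(length)]
    best = max(range(length), key=sums.__getitem__)
    return best, cons[best], sums

def mdd_split(seqs):
    """One tree level: split on consensus match at the most dependent position."""
    pos, base, _ = most_dependent_position(seqs)
    match = [s for s in seqs if s[pos] == base]
    rest = [s for s in seqs if s[pos] != base]
    return pos, base, match, rest

if __name__ == "__main__":
    # Toy alignment (made up for illustration): positions 0 and 2 co-vary.
    toy = ["GAT", "GCT", "GAT", "GCT", "CAA", "CCA", "CAA", "CCA"]
    pos, base, match, rest = mdd_split(toy)
    leaf_pwms = [build_pwm(match), build_pwm(rest)]
    print(pos, base, [round(score(p, "GAT"), 2) for p in leaf_pwms])
```

A full MDD tree would apply `mdd_split` recursively to each subset, stopping when no chi-squared statistic is significant or a subset becomes too small to estimate a reliable PWM. Those stopping thresholds are not given in this abstract and are omitted from the sketch.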