Abstract: A Novel Single Linkage Clustering Algorithm used to Identify Relationships within Gene Families (2012 AAAS Annual Meeting (16-20 February 2012))

Sunday, February 19, 2012

Exhibit Hall A-B1 (VCC West Building)

Ben Busby, NCBI, NLM, NIH, Bethesda, MD

Jeffrey Robinson, Department of Biological Sciences, Dartmouth College, Hanover, NH

Colleen Bollin, NCBI, NLM, NIH, Bethesda, MD

DEAD-box helicases are ubiquitously involved in cellular processing of coding and noncoding RNA. These paralogous families are involved in transcriptional, post-transcriptional, and translational regulation of the cell cycle, chromosome segregation, cellular differentiation, viral replication, RNA interference, and microRNA processing. While there are several published phylogenies for these gene families, they are generally biased toward vertebrate and metazoan taxa, and do not provide a full, unbiased catalog of their distribution across the eukaryotic and prokaryotic domains of life. After an exhaustive BLAST search of DEAD-box helicase subfamilies, sequences are clustered to reduce intersequence informational bias, and extraordinarily long or short sequences are removed. Following iterative alignment of whole proteins, sequences containing the fewest blocks of highly conserved sequence are removed to eliminate intrasequence informational bias. A hierarchical single-linkage clustering algorithm is then applied iteratively, and maximum-likelihood trees are constructed using FastTree. Here we show a tree of all canonical DEAD-box protein families and their interrelationships. The algorithm naively self-segregates DEAD-box protein clusters from similar protein superfamilies such as DEAH helicases. The proteins that belong to one DEAD-box family of clusters, p68/DDX5, are involved in many canonical DEAD-box processes, and have been shown to act as cofactors in p53-mediated co- and post-transcriptional processing of miRNAs. Within the p68/DDX5 family, we demonstrate that differences between this gene family tree and established species trees can help delineate the positions of amino acids that may be important in these processes. This tree also shows the phyletic distribution of DEAD-box families within and across kingdoms.
The deep conservation and position-specific variance of the p68/DDX5 family of DEAD-box helicases indicate that they perform essential and potentially conserved roles in RNA processing related to the eukaryotic cell cycle. Analysis of these specific positions may yield mechanistic insights into the relative functional roles of different amino acids within and between DEAD-box families. Applying the original algorithm to other sets of gene families, such as all genes in multiply sequenced bacterial genomes, may further demonstrate the utility of its self-segregation properties and the functional insights that stem from its application.
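The abstract does not specify the distance measure, the linkage threshold, or the iteration scheme, so the following is only a minimal sketch of generic single-linkage clustering, not the authors' implementation. Two items are placed in the same cluster whenever a chain of pairwise links at or below a cutoff connects them. The function names, the toy mismatch distance, the example sequences, and the 0.2 cutoff are all hypothetical; a real analysis would presumably use alignment-based scores.

```python
from itertools import combinations


def single_linkage_clusters(items, distance, threshold):
    """Merge items into clusters by single linkage at a fixed cutoff.

    Any pair with distance <= threshold is linked, and clusters are the
    connected components of those links (tracked with union-find).
    """
    parent = {item: item for item in items}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if distance(a, b) <= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return sorted(sorted(members) for members in clusters.values())


def mismatch_fraction(a, b):
    """Toy distance (assumption): fraction of differing positions
    between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Made-up 8-residue strings, used only to illustrate the chaining behavior.
toy_sequences = ["DEADAAAA", "DEADAAAC", "DEADAACC", "DEAHGGGG", "DEAHGGGT"]

for cluster in single_linkage_clusters(toy_sequences, mismatch_fraction, 0.2):
    print(cluster)
```

In this toy input the first and third strings differ at 2 of 8 positions, which is above the cutoff, yet they end up in the same cluster because each is within the cutoff of the second string. That chaining is the defining property of single linkage, and it is one reason a cutoff can separate groups such as the DEAD-like and DEAH-like toy strings without any labels being supplied.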

See more of: AAAS General Poster Session
See more of: Poster Sessions

