Our lab develops novel computational methods to study cellular biological systems from a global and data-driven perspective. We seek to exploit diverse high-throughput functional and genomic data to understand the molecular networks underlying fundamental cellular processes, including regulation of transcription, pre-mRNA processing, signaling, and post-transcriptional gene silencing. Our algorithmic methods draw on machine learning, a computational field concerned with learning accurate, predictive models from noisy and high-dimensional data.
Transcriptional regulatory networks
We are interested in learning gene regulatory programs that accurately predict genome-wide differential mRNA expression under different cellular conditions, and in extracting testable hypotheses about transcriptional regulatory networks from them. Our algorithmic approach integrates promoter sequence, mRNA expression data, and ChIP-on-chip binding data to learn gene regulatory programs and discover transcription factor binding motifs. We are using this approach, called the MEDUSA algorithm, to model the oxygen and heme regulatory network in yeast and, with our experimental collaborator, are biochemically validating the predicted oxygen regulators.
Gene silencing by microRNAs
Most current computational methods for predicting microRNA targets rely on scoring possible hybridizations between microRNA and target site sequences and on cross-species comparisons. As a complement to these sequence-based efforts, we are developing integrative models of mammalian microRNA regulation that incorporate tissue-specific microRNA expression data and different sources of gene expression data. We are using these models to evaluate statistical evidence for competing hypotheses about microRNA silencing mechanisms.
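To make the sequence-based starting point concrete, the sketch below scans a 3' UTR for perfect matches to a microRNA "seed". It is a deliberately simplified illustration and not one of the integrative models described above: it assumes the common convention that the seed is nucleotides 2-8 of the microRNA, it ignores hybridization energy and cross-species conservation, and the function names and the toy UTR are hypothetical.

```python
# Minimal seed-match scan; a simplified stand-in for sequence-based
# microRNA target prediction, not a full hybridization scoring method.

RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement_rna(seq):
    """Return the reverse complement of an RNA sequence (A, C, G, U only)."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(seq))


def seed_match_positions(mirna, utr, seed_start=2, seed_end=8):
    """Return 0-based UTR positions of perfect seed matches.

    seed_start and seed_end are 1-based, inclusive positions in the
    microRNA (assumed convention: nucleotides 2-8).
    """
    seed = mirna[seed_start - 1:seed_end]
    site = reverse_complement_rna(seed)  # what the seed pairs with in the mRNA
    width = len(site)
    return [i for i in range(len(utr) - width + 1) if utr[i:i + width] == site]


# let-7a mature sequence (5' to 3'); the UTR is a toy sequence for illustration.
let7a = "UGAGGUAGUAGGUUGUAUAGUU"
toy_utr = "AAAACUACCUCAAAA"
positions = seed_match_positions(let7a, toy_utr)
```

A scan like this yields only candidate sites. Expression data of the kind mentioned above would then be one way to weigh which candidates are plausibly functional in a given tissue.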
Remote protein homology detection
An Interview with Christina Leslie
"Coming from pure mathematics into this very interdisciplinary field has been a revelation"
Recognizing a protein's fold from its primary sequence of amino acids is a long-standing problem in computational biology. Traditional approaches use pairwise sequence comparison or protein family models based on multiple sequence alignments to infer structural relationships from sequence similarity. However, these methods may not perform well in the remote homology detection setting, where the protein sequence to be classified is only remotely homologous to known protein families. Our lab introduced the use of biologically motivated, k-mer-based "string kernels" for support vector machine (SVM) classification of protein sequences into structural categories, achieving state-of-the-art performance for remote homology detection.
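The idea behind a k-mer-based string kernel can be sketched in a few lines. The example below implements the simplest variant, usually called the spectrum kernel: each sequence is represented by its counts of length-k substrings, and the kernel value is the inner product of two such count vectors. This is a minimal sketch, not necessarily the exact kernel used in the lab's work; the function names, the default k = 3, and the cosine-style normalization are assumptions made for this example.

```python
# Minimal spectrum (k-mer count) string kernel for protein sequences.
from collections import Counter
import math


def kmer_counts(seq, k):
    """Count every contiguous length-k substring of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k=3):
    """Inner product of the k-mer count vectors of x and y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    if len(cx) > len(cy):
        cx, cy = cy, cx  # iterate over the smaller set of k-mers
    return sum(count * cy[kmer] for kmer, count in cx.items())


def normalized_kernel(x, y, k=3):
    """Spectrum kernel scaled so that a sequence has similarity 1 with itself."""
    denom = math.sqrt(spectrum_kernel(x, x, k) * spectrum_kernel(y, y, k))
    return spectrum_kernel(x, y, k) / denom if denom else 0.0


def gram_matrix(seqs, k=3):
    """Pairwise kernel matrix over a list of sequences."""
    return [[normalized_kernel(a, b, k) for b in seqs] for a in seqs]


# Toy sequences for illustration only.
toy_seqs = ["MKVLAAGIV", "MKVLSAGIV", "GGHPWQRST"]
K = gram_matrix(toy_seqs, k=3)
```

A matrix such as `K` can be handed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`. Because the kernel compares k-mer content rather than alignments, it can be computed for sequences that align poorly, which is the situation in remote homology detection.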