We developed DisHet, a Bayesian hierarchical model for the dissection of heterogeneous bulk tumors, to evaluate the tumor microenvironment in renal cell carcinoma (RCC). DisHet was used to separate the normal, tumor, and immune/stromal components from RNA-sequencing (RNA-seq) data. DisHet analyses uncovered 610 genes not previously linked to the RCC tumor microenvironment and showed that half of the previously designated immune signature genes are not expressed in the RCC tumor microenvironment. The RCC-specific immune signature genes defined by DisHet analyses were termed eTME. Together with data from The Cancer Genome Atlas, the DisHet and eTME analyses characterized a highly inflamed RCC subtype (termed IS) that exhibited enrichment of regulatory T cells, natural killer cells, Th1 cells, neutrophils, macrophages, B cells, and CD8+ T cells. The IS subtype was associated with aggressive disease, including BAP1-deficient clear-cell RCC and type 2 papillary tumors, and predicted poor survival in patients with RCC. These findings provide a missing link between tumor cells, the tumor microenvironment, and systemic factors.
We developed SCINA, a semi-supervised cell type assignment tool for single-cell RNA-Seq and CyTOF/FACS data. One feature that distinguishes SCINA from previously used approaches is its use of prior knowledge as a form of supervision. The prior knowledge is supplied as a list of signature genes for each cell type. SCINA searches for a segregation of the pool of profiled cells such that the cells assigned to each type highly express the signature genes specified for that type by the researcher. Cells that do not highly express any of the signature genes are designated as cells of unknown type. SCINA is also general and can be applied to other data of similar format, such as patient bulk RNA-Seq data. In our validation datasets, SCINA demonstrated superior performance to unsupervised approaches such as t-SNE and K-means clustering. Overall, SCINA, which represents a “signature-to-category” approach, addresses a critical research need that has previously been neglected. It is nevertheless synergistic with traditional unsupervised “category-to-signature” approaches.
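The “signature-to-category” idea described above can be illustrated with a minimal Python sketch. It is not SCINA's actual algorithm or interface: it assumes a deliberately simplified rule in which each gene is standardized across cells, each cell is scored by its mean standardized expression over a type's signature genes, and cells whose best score falls below a cutoff are labeled unknown. The function names, the `min_score` cutoff, and the toy expression values are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def zscore_genes(expr):
    """Standardize each gene across cells. expr maps gene -> list of values per cell."""
    scaled = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        scaled[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return scaled


def assign_cell_types(expr, signatures, min_score=0.5):
    """Label each cell with the type whose signature genes it expresses most highly.

    Simplified scoring rule for illustration only. Cells whose best score is
    below min_score (a hypothetical cutoff) are labeled "unknown".
    """
    scaled = zscore_genes(expr)
    n_cells = len(next(iter(expr.values())))
    labels = []
    for i in range(n_cells):
        # Mean standardized expression of each type's signature genes in cell i
        scores = {
            cell_type: mean(scaled[g][i] for g in genes)
            for cell_type, genes in signatures.items()
        }
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] >= min_score else "unknown")
    return labels


# Toy data (made up): 4 genes measured in 5 cells
expr = {
    "CD3E": [9.0, 8.5, 0.2, 0.1, 0.3],
    "CD2": [8.0, 9.0, 0.1, 0.3, 0.2],
    "CD19": [0.1, 0.2, 9.0, 8.0, 0.2],
    "MS4A1": [0.2, 0.1, 8.5, 9.5, 0.1],
}
# Researcher-supplied prior knowledge: signature genes per cell type
signatures = {"T cell": ["CD3E", "CD2"], "B cell": ["CD19", "MS4A1"]}

labels = assign_cell_types(expr, signatures)
print(labels)
```

In this toy input the first two cells express the T cell signature, the next two express the B cell signature, and the last cell expresses neither, so it falls below the cutoff and is left unassigned rather than forced into a category.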
I created the Linkage Analysis software for statistical mapping of phenotype-genotype associations in mouse forward genetic screening data. Linkage Analysis is run daily for the Mutagenetix consortium led by Dr. Bruce Beutler, a large-scale screening project focused on identifying immune-response genes in ENU-mutated mice. Using the large volume of data collected in this project (more than 39,000 mutations × more than 40 phenotypes), I am extending this work to characterize the differential damaging effects of various types of missense and loss-of-function mutations.
I have also developed machine learning models to address real-world biomedical questions. For example, I co-led a team that won the highly competitive NIEHS-NCATS-UNC DREAM Toxicogenetics Challenge, an international competition for the estimation of drug treatment effects using genomic and chemical data (published in Nature Biotechnology). I also participated in and won several other DREAM challenges. Recently, I co-organized the Prostate Cancer DREAM Challenge, which aimed to predict the prognosis of prostate cancer patients using commonly available clinical variables (published in The Lancet Oncology).