Statistics, Big Data Analysis, and Applications

Biology, medicine, energy, and the environment (mathematics of planet Earth).

Machine learning.

Research topics

  • Bayesian methods
  • Big data analytics
  • Biostatistics
  • Causal inference
  • Empirical likelihood
  • Extreme-value analysis
  • Infectious disease modeling
  • Lifetime data analysis
  • Methods for high-dimensional data
  • Pseudo and composite likelihood methods
  • Semiparametric and nonparametric models
  • Statistics
  • Statistical learning and data mining
  • Survival analysis

Research Spotlight: Dr. Qingrun Zhang

I am developing machine learning algorithms for high-dimensional data, with applications to the integration of genomics and gene expression data to predict the risk and occurrence of diseases, as well as drug responses. In collaboration with colleagues in the Cumming School of Medicine, we develop computational tools tailored to different diseases, including cancers and neurodevelopmental disorders.

Our research addresses the following questions: 


(1) How to learn sensible representations of high-dimensional data. Representation learning is an emerging technique, largely based on unsupervised learning, that learns the internal structure of the data. The outcome is a new, usually lower-dimensional, representation of the data that enables more effective downstream machine learning tasks.


(2) How to identify association and causality using data mining techniques. Identifying association and causality is a long-standing problem in statistics with broad applications in genomics and precision medicine. A specific focus of our research is identifying them reliably in the presence of noise and complicated correlation structures.


(3) How to carry out statistical inference in multi-scale omics data (e.g., single-cell sequencing). Integrating multiple layers of data may represent the current trend in statistical inference. We focus on how to integrate -omics data, including genomics, transcriptomics, and epigenomics, to form predictors. In particular, we are interested in analyzing single-cell -omics data to infer within-tissue and within-individual dynamics at single-cell resolution.
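
As a small illustration of question (1), the sketch below uses principal component analysis, one of the simplest unsupervised representation learners, to compress synthetic high-dimensional data into a few coordinates. It is not the lab's method: the function name `pca_embed`, the simulated data, and all dimensions are assumptions made for this example.

```python
import numpy as np


def pca_embed(X, n_components):
    """Project the rows of X onto the top principal components.

    Returns the low-dimensional coordinates and the fraction of
    total variance that those components explain.
    """
    Xc = X - X.mean(axis=0)  # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = U[:, :n_components] * S[:n_components]  # sample coordinates
    explained = (S[:n_components] ** 2).sum() / (S ** 2).sum()
    return Z, explained


# Hypothetical data: 50 samples, 200 features, driven by 3 latent
# factors plus a small amount of noise (fewer samples than features).
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 3))
loadings = rng.normal(size=(3, 200))
X = latent @ loadings + 0.01 * rng.normal(size=(50, 200))

Z, explained = pca_embed(X, n_components=3)
print(Z.shape, round(float(explained), 4))
```

Here the 200 measured features collapse to three coordinates per sample with almost no loss, which is the sense in which a learned representation can make downstream tasks easier. Real genomic data rarely have such a clean low-rank structure.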


For more information, see my research page and my lab page.

Research Spotlight: Dr. Hua Shen

My research interests lie in methodology development and statistical analysis for complex and imperfect data arising from public health and medical research. Key areas of focus include the analysis of survival data, recurrent event data, and longitudinal data, which often arise in both clinical trials and observational studies. The complications that I deal with include incomplete data, misclassification and measurement error, latent heterogeneity, joint outcomes, high dimensionality, and hierarchical structures. The aim is to identify significant factors, quantify associations, and make valid inferences.

My current research focuses on developing statistical methods to

  1. analyze lifetime data involving latent processes, where the underlying disease may resolve while some covariates are incompletely observed or subject to misclassification, so as to avoid ignoring patient heterogeneity and to prevent biased estimates and invalid inference,
  2. develop joint models for classification and prediction based on mixed measurements involving surrogate classifiers or observations subject to measurement error, to achieve higher accuracy and precision in subgroup attribution or diagnostic testing,
  3. conduct causal inference using advanced statistical learning methods that address missing and/or misclassified confounders, to produce unbiased estimates of treatment effects,
  4. propose advanced and adaptive methods for variable selection and group-variable selection in recurrent event analysis and survival analysis, and investigate their oracle properties, and
  5. jointly model longitudinal and survival data, or multivariate lifetime data, and propose computationally efficient methods for algorithm implementation and statistical inference.

I am keen on supporting medical research through transdisciplinary partnerships. My collaborators include epidemiologists, oncologists, radiologists, medical physicists, gastroenterologists, cardiologists, and rheumatologists. Researchers in other areas are also welcome to contact me about prospective collaborations.

I am interested in working with students at both the graduate and undergraduate levels. Students with a good work ethic, strong interest, and a solid background in statistics, biostatistics, applied mathematics, computer science, or other related areas are welcome to inquire about graduate studies or postdoctoral positions.

Research Spotlight: Dr. Thierry Chekouo

My research interests are in developing new statistical frameworks for analyzing datasets characterized by high dimensionality and complex structures. I am particularly interested in the development of novel Bayesian methodologies motivated by real problems in integromics, imaging genetics and genomics. Many of these methods have incorporated biological and external knowledge through prior distributions.

A. Development of innovative Bayesian statistical methods for biclustering.

I have developed innovative statistical methods for biclustering, which aims to cluster the rows and columns of a data matrix simultaneously. In (A1), we proposed a Bayesian biclustering model that, when applied to gene expression data, incorporates gene-gene relationships (using gene ontologies) through prior distributions. We developed a hybrid MCMC (Markov chain Monte Carlo) procedure that mixes the Metropolis–Hastings sampler with a variant of the Wang–Landau algorithm, and we provided a theoretical proof of the convergence of this algorithm. In (A2), we shed light on the statistical models behind existing biclustering algorithms. It turns out that most of the known techniques have a hidden Bayesian flavor. We then proposed a Bayesian biclustering model that controls the degree of overlap between biclusters. We applied our methods to gene expression databases, with the aim of confirming known subnetworks of proteins/genes associated with disease or finding novel ones.

(A1) Chekouo T, Murua A and Raffelsberger W. The Gibbs-plaid biclustering model. The Annals of Applied Statistics. 2016; 9:1643–1670.

(A2) Chekouo T and Murua A. The Penalized Biclustering Model and Related Algorithms. Journal of Applied Statistics. 2015; 42(6):1255–1277.
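
For readers unfamiliar with the Metropolis–Hastings sampler mentioned above, the following is a minimal random-walk sketch on a toy one-dimensional target. It shows only the generic accept/reject step. It is not the hybrid sampler of (A1), it omits the Wang–Landau component entirely, and the target density, step size, and function names are assumptions chosen for illustration.

```python
import numpy as np


def metropolis_hastings(log_target, x0, n_steps, step_size, rng):
    """Random-walk Metropolis-Hastings for a one-dimensional target.

    log_target returns the log density up to an additive constant.
    """
    samples = np.empty(n_steps)
    x = x0
    log_p = log_target(x)
    for t in range(n_steps):
        proposal = x + step_size * rng.normal()  # symmetric proposal
        log_p_new = log_target(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if np.log(rng.uniform()) < log_p_new - log_p:
            x, log_p = proposal, log_p_new
        samples[t] = x  # a rejected move repeats the current state
    return samples


def log_std_normal(x):
    """Toy target (an assumption): standard normal, unnormalized."""
    return -0.5 * x * x


rng = np.random.default_rng(42)
samples = metropolis_hastings(log_std_normal, x0=0.0, n_steps=20000,
                              step_size=1.0, rng=rng)
kept = samples[2000:]  # discard an initial burn-in period
print(kept.mean(), kept.std())
```

After burn-in, the sample mean and standard deviation should be close to 0 and 1, the moments of the toy target. In a biclustering model the state would instead be the set of bicluster memberships and effects, which is where a plain sampler can mix poorly and a hybrid scheme becomes attractive.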

B. Development of innovative statistical methods for integration of multiplatform -omic data.

I have developed innovative statistical methods for jointly analyzing multiple types of genomic data, such as mRNA expression, DNA methylation, and miRNA expression. The methods are capable of identifying a small set of prognostic markers that are associated with clinical outcomes (e.g., survival data). In (B1), our approach is built in a Bayesian framework and incorporates the complex dependence between data types through prior distributions. In (B2), the novel integrative Bayesian approach fully exploits the information available across platforms and does not exclude any of the subjects from the analysis. By applying our methods to kidney cancer data, we were able to identify and validate some biomarkers that are predictive of the disease.

(B1) Chekouo T, Stingo FC, Doecke JD and Do KA. miRNA-target gene regulatory networks: A Bayesian integrative approach to biomarker selection with application to kidney cancer. Biometrics. 2015 Jun;71(2):428–38. PubMed PMID: 25639276; PubMed Central PMCID: PMC4499566.

(B2) Chekouo T, Stingo FC, Doecke JD and Do KA. A Bayesian integrative approach for multiplatform genomic data: A kidney cancer case study. Biometrics. 2017 Jun;73(2):615–624. PubMed PMID: 27669160.

C. Development of innovative statistical methods for imaging-genetics.

I have developed an innovative statistical method for jointly analyzing imaging and genetic data. In (C1), we proposed an integrative Bayesian risk prediction model that combines single nucleotide polymorphism (SNP) arrays and functional magnetic resonance imaging (fMRI). By incorporating the dependence between imaging and genetic data, the method allows us to discriminate between individuals with schizophrenia and healthy controls, based on a sparse set of discriminatory regions of interest and SNPs. In terms of prediction and feature selection, we found that our approach outperforms competing methods that do not use the fMRI-SNP dependence when selecting discriminatory markers.

(C1) Chekouo T, Stingo F, Guindani M and Do K. A Bayesian predictive model for imaging genetics with application to schizophrenia. The Annals of Applied Statistics. 2016; 10(3):1547–1571.

D. Development of innovative statistical methods for simultaneous clustering and variable selection.

In (D1), inspired by the plaid biclustering model, we proposed a model that simultaneously performs clustering and variable selection. Unlike conventional clustering, this model allows an observation to be explained by several clusters. This characteristic makes it especially suitable for gene expression data, where genes may participate in multiple biological pathways. Parameter estimation is performed with the Monte Carlo expectation maximization algorithm and importance sampling. An application of our approach to gene expression data from kidney renal cell carcinoma validates some previously identified cancer biomarkers.

(D1) Chekouo T and Murua A. High-dimensional variable selection with the plaid mixture model for clustering. Computational Statistics. 2018; 33(3):1475–1496.


E. Bayesian approaches for detecting patterns of markers for extremely small sample sizes.

In (E1), we have developed an innovative Bayesian approach that efficiently identifies patterns of markers with similar patterns of biological relevance. Motivated by the availability of ion mobility mass spectrometry data on cell line experiments in myelodysplastic syndromes and acute myeloid leukemia from the "Moon Shots" Program at MD Anderson Cancer Center, our methodology can identify protein markers that follow biologically meaningful trends. Extensive simulation studies demonstrate the good performance of the proposed method even in the presence of relatively small treatment effects and sample sizes.

(E1) Chekouo T, Stingo F, Class C, Yan Y, Bohannan Z, Wei Y, Garcia Manero G, Hanash S and Do K. Investigating protein patterns in human leukemia cell line experiments: A Bayesian approach for extremely small sample sizes. Statistical Methods in Medical Research. 2019. DOI: 10.1177/0962280219852721.