High-Dimensional Data and Machine Learning

Motivated by the data explosion in multiple scientific areas, such as genomics and proteomics, high-dimensional statistics deals with data where the number of variables p is large compared with the sample size n. In contrast to the low-dimensional regime, where p is fixed, high-dimensional statistics often assumes that p grows with n, which brings new challenges for estimation and inference. Over the past two decades, rapid progress in computation, methodology, and theory for high-dimensional statistics has given rise to the fast-growing areas of selective inference, post-selection inference, and multiple testing.

Machine learning (ML) is an emerging area in statistics and computer science that aims to develop algorithms for data mining tasks such as classification, prediction, and clustering. Statisticians play important roles in ML, not only in developing novel algorithms and applying them to real data challenges, but also in providing theoretical guarantees on the statistical and computational properties of those algorithms. Machine learning has several subareas:

Faculty in our department (Drs. R. Dai, H. Dai, Dong, Zhang, and Zheng) have conducted innovative methodological research in this area.

  1. Zhang, H., Zheng, Y., Hou, L., Zheng, C., and Liu, L. (2021) Mediation analysis for survival data with high-dimensional mediators. Bioinformatics. doi: 10.1093/bioinformatics/btab564.
  2. Dai, R., Song, H., Raskutti, G., and Barber, RF. (2020) The bias of isotonic regression. Electronic Journal of Statistics. 14: 801-874.
  3. Song, H., Dai, R., Barber, RF., and Raskutti, G. (2020) Convex and non-convex approaches for statistical inference with noisy labels. Journal of Machine Learning Research. 21: 1-58.
  4. Wu, L., Jin, Q., Chen, J., He, J., and Dong, J. (2020) Diagnostic accuracy of chest computed tomography scans for suspected patients with COVID-19: receiver operating characteristic curve analysis. JMIR Public Health and Surveillance. 6(4): e19424. doi: 10.2196/19424.
  5. Liu, Y. and Zheng, C. (2019) Deep latent variable models for generating knockoffs. STAT. 8: e260.
  6. Dong, J., Wang, L., Gill, J., and Cao, J. (2017) Functional principal component analysis of GFR curves after kidney transplant. Statistical Methods in Medical Research. 27(12): 3785-3796.
  7. Dai, H., Wu, G., Wu, M., and Zhi, D. (2016) An optimal Bahadur-efficient method in detection of sparse signals with applications to pathway analysis in sequencing association studies. PLoS ONE. doi: 10.1371/journal.pone.0152667.
  8. Su, X., Wijayasinghe, CS., Fan, J., and Zhang, Y. (2016) Sparse estimation of proportional hazards models via approximated information criteria. Biometrics. 72: 751-759.
  9. Jiang, DF., Huang, J., and Zhang, Y. (2013) The cross-validated AUC for MCP-logistic regression with high-dimensional data. Statistical Methods in Medical Research. 22(5): 505-518.