Workshop in Semi/Nonparametric Statistical Learning


Conference Schedule







Yuhong Yang
University of Minnesota

Treatment Allocations Based on Multi-Armed Bandit Strategies


Runze Li
The Pennsylvania State University

Testing of Significance for High-Dimensional Longitudinal Data


Break (group photo)


Jianguo (Tony) Sun
University of Missouri

Joint Analysis of Interval-Censored Failure Time Data and Panel Count Data


Lunch (second-floor hall, Liuyuan Restaurant on campus)


Heping Zhang
Yale University

Modeling Hybrid Dependent Responses


Wei Pan
University of Minnesota

An Empirical Comparison of Deep Learning and Other Methods for Prediction of Protein Subcellular Localization with Microscopy Images


Break


Fang Liu
The University of Notre Dame

Noise Injection Regularization in Large Models with Applications to Neural Networks and Graphical Models


Gang Li
University of California at Los Angeles

Prediction Accuracy Measures for a Nonlinear Model and for Right-Censored Time-to-Event Data



Location: Conference Room 408, Hongyuan Building, SWUFE (Southwestern University of Finance and Economics)




Time: Friday, July 6, 2018, 9:00-15:20

Location: Conference Room 408, Hongyuan Building

Topic 1: Treatment Allocations Based on Multi-Armed Bandit Strategies




Yuhong Yang received his Ph.D. in statistics from Yale in 1996. He then joined the Department of Statistics at Iowa State University and moved to the University of Minnesota in 2004, where he has been a full professor since 2007. His research interests include model selection, multi-armed bandit problems, forecasting, high-dimensional data analysis, and machine learning. He has published in journals in several fields, including Annals of Statistics, IEEE Transactions on Information Theory, Journal of Econometrics, Journal of Approximation Theory, Journal of Machine Learning Research, and International Journal of Forecasting. He is a Fellow of the Institute of Mathematical Statistics.



In medical practice, multiple treatments are often available for an individual patient. Identifying the best treatment for a specific patient is very challenging because of patient inhomogeneity. The multi-armed bandit with covariates provides a framework for designing effective treatment allocation rules that integrate learning from experimentation with maximizing the benefits to patients along the way.

In this talk, we present new strategies to achieve asymptotically efficient or minimax optimal treatment allocations. Many nonparametric and parametric methods in supervised learning may be applied to estimate the mean treatment outcome functions (in terms of the covariates), but guidance on how to choose among them is generally unavailable. We therefore propose a model-combining allocation strategy for adaptive performance and show its strong consistency. When the mean treatment outcome functions are smooth, rates of convergence can be studied to quantify the effectiveness of a treatment allocation rule in terms of the overall benefits the patients have received. A multi-stage randomized allocation with arm elimination algorithm is proposed to combine flexibility in treatment outcome function modeling with a theoretical guarantee on the overall treatment benefits. Numerical results are given to demonstrate the performance of the new strategies.

The talk is based on joint work with Wei Qian.
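As a rough illustration of what "arm elimination" means, the sketch below implements the textbook successive-elimination scheme for a bandit without covariates. It is not the multi-stage algorithm of the talk: the Bernoulli outcomes, the Hoeffding-style confidence radius, and the names `successive_elimination`, `true_means`, `horizon`, and `delta` are all assumptions made for this example.

```python
import numpy as np

def successive_elimination(true_means, horizon, delta=0.05, seed=0):
    """Textbook successive elimination with Bernoulli outcomes (illustrative only)."""
    true_means = np.asarray(true_means, dtype=float)
    k = true_means.size
    if horizon < k:
        raise ValueError("horizon must allow at least one pull per arm")
    rng = np.random.default_rng(seed)
    counts = np.zeros(k, dtype=int)   # number of patients allocated to each arm
    sums = np.zeros(k)                # total observed outcome per arm
    active = np.arange(k)             # arms (treatments) still under consideration
    t = 0
    # Run full rounds while the budget allows one pull of every surviving arm.
    while t + active.size <= horizon:
        for arm in rng.permutation(active):   # randomized order within a round
            sums[arm] += float(rng.random() < true_means[arm])
            counts[arm] += 1
        t += active.size
        n = counts[active]
        mean = sums[active] / n
        # Hoeffding-style confidence radius (an assumed, conservative choice).
        radius = np.sqrt(np.log(4.0 * k * n**2 / delta) / (2.0 * n))
        # Drop every arm whose upper bound falls below the best lower bound.
        keep = mean + radius >= np.max(mean - radius)
        active = active[keep]
    # Spend any leftover budget on the empirically best surviving arm.
    best = active[np.argmax(sums[active] / counts[active])]
    for _ in range(horizon - t):
        sums[best] += float(rng.random() < true_means[best])
        counts[best] += 1
    return active, counts, sums

active, counts, sums = successive_elimination([0.2, 0.5, 0.8], horizon=3000)
print("surviving arms:", active, "allocations:", counts)
```

The arm with the largest lower confidence bound always survives a round, so the active set never becomes empty. Allowing for covariates, as in the talk, would replace the per-arm averages with estimated outcome functions.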

Topic 2: Testing of Significance for High-Dimensional Longitudinal Data




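Runze Li is an endowed chair professor in the Department of Statistics at The Pennsylvania State University. His research areas include variable selection and feature screening for high-dimensional data, as well as modeling and statistical inference for nonparametric and semiparametric models. He has also carried out a series of studies in applied statistics. He has served as an associate editor and as editor of the Annals of Statistics, and he currently serves as an associate editor of JASA. He is a Fellow of the IMS, ASA, and AAAS.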



This paper concerns statistical inference for longitudinal data with ultrahigh dimensional covariates. We first study the problem of constructing confidence intervals and hypothesis tests for a low dimensional parameter of interest. The major challenge is how to construct an optimal test statistic in the presence of high dimensional nuisance parameters and sophisticated dependence among measurements. To deal with this challenge, we propose a novel quadratic decorrelated inference function approach, which simultaneously removes the impact of nuisance parameters and incorporates the correlation to enhance the efficiency of the estimation procedure. We prove that the proposed estimator is asymptotically normal and attains the semiparametric information bound, based on which we can construct an optimal test statistic for the parameter of interest. We then study how to control the false discovery rate (FDR) when a vector of high-dimensional regression parameters is of interest. We prove that applying Storey's (2002) procedure to the proposed test statistics for each regression parameter controls the FDR asymptotically in longitudinal data. We conduct simulation studies to assess the finite sample performance of the proposed procedures. Our simulation results indicate that the newly proposed procedure can control both the Type I error for testing a low dimensional parameter of interest and the FDR in the multiple testing problem. Finally, we apply the proposed procedure to a real data example.
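For readers unfamiliar with Storey-type FDR control, here is a minimal sketch of one common form of the procedure. It estimates the proportion of true nulls as pi0 = #{p_i > lambda} / (m(1 - lambda)) and rejects hypotheses whose estimated q-value is at most alpha. The function name `storey_reject`, the tuning value `lam=0.5`, and the toy p-values are assumptions for illustration. The talk applies the procedure to its own test statistics, which are not reproduced here.

```python
import numpy as np

def storey_reject(pvals, alpha=0.05, lam=0.5):
    """Storey-type FDR control via q-values (illustrative sketch)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    # Estimated proportion of true null hypotheses, capped at 1.
    pi0 = min(1.0, np.sum(p > lam) / (m * (1.0 - lam)))
    order = np.argsort(p)
    sorted_p = p[order]
    # Estimated FDR when rejecting the k smallest p-values.
    fdr_hat = pi0 * m * sorted_p / np.arange(1, m + 1)
    # q-value: smallest estimated FDR over all thresholds at least this large.
    q_sorted = np.minimum.accumulate(fdr_hat[::-1])[::-1]
    q_sorted = np.minimum(q_sorted, 1.0)
    q = np.empty(m)
    q[order] = q_sorted
    return q <= alpha, q, pi0

# Toy p-values: four strong signals followed by six unremarkable tests.
pvals = [0.001, 0.002, 0.003, 0.004, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9]
reject, qvals, pi0 = storey_reject(pvals, alpha=0.05, lam=0.5)
print("pi0 estimate:", pi0, "rejected:", int(reject.sum()))
```

With pi0 fixed at 1 this reduces to the Benjamini-Hochberg procedure. An estimate below 1 makes the rule less conservative.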

Topic 3: Joint Analysis of Interval-Censored Failure Time Data and Panel Count Data




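Jianguo Sun is a professor in the Department of Statistics at the University of Missouri. He received his Ph.D. from the University of Waterloo in 1992. His research interests include biostatistics, survival analysis, longitudinal data analysis, and chemometrics. He is a Fellow of the Institute of Mathematical Statistics, a member of the International Statistical Institute, and a Fellow of the ASA. In 2018 he received a graduate mentor award from MU.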



Interval-censored failure time data and panel count data are two types of incomplete data that commonly occur in event history studies, and many methods have been developed for analyzing each separately (Sun, 2006; Sun and Zhao, 2013). Sometimes one may be interested in, or need to conduct, their joint analysis, such as in clinical trials with composite endpoints, for which there does not seem to be an established approach in the literature. This talk will discuss this problem and present a sieve maximum likelihood approach. Some simulation results and an application will also be provided.

Topic 4: Modeling Hybrid Dependent Responses




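Dr. Heping Zhang is the Susan Dwight Bliss Professor of Biostatistics, Professor of Statistics and Data Science, and Professor in the Child Study Center at Yale University. He founded and directs the Collaborative Center for Statistics in Science at Yale. He is also an honorary professor at the University of Hong Kong, a scholar of China's National Thousand Talents Program, a Changjiang Chair Professor, and President-Elect of the International Chinese Statistical Association. He received his bachelor's degree in mathematics from Jiangxi Normal University in 1982 and his Ph.D. in statistics from Stanford University in 1991, with additional study in computer science.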

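He is the founding Editor-in-Chief of the journal Statistics and Its Interface. He currently serves on the editorial boards of the Journal of the American Statistical Association (JASA), Genetic Epidemiology, and a journal on reproduction and infertility research (生殖与不育专题研究). He will serve as Editor of JASA (ACS) in 2019.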

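Professor Zhang was named the 2008 Myrto Lefkopoulou Distinguished Lecturer by the Harvard School of Public Health and delivered an IMS Medallion Lecture in 2011. He received the 2011 Royan International Research Award for reproductive health research, a scientific paper award from the American Society for Reproductive Medicine in 2013, the 2014 March of Dimes award for best research in prematurity, and a 2017 outstanding paper award from the American Journal of Obstetrics and Gynecology.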

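His research interests include nonparametric methods, longitudinal data, statistical genetics and bioinformatics, clinical trials, statistical modeling of epidemiological data, brain imaging analysis, statistical computing, and statistical methods for behavioral science. He is the author of the book Recursive Partitioning and Its Applications, published by Springer, and has published more than 280 papers in high-impact journals in statistics, genetics, epidemiology, and psychiatry, including Annals of Statistics, Annals of Applied Statistics, Biometrika, JASA, JRSSB, American Journal of Human Genetics, American Journal of Psychiatry, PNAS, Science, JAMA, and the New England Journal of Medicine.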



I will present a novel multivariate model for analyzing hybrid traits and identifying genetic factors for comorbid conditions. Comorbidity is a common phenomenon in mental health in which an individual suffers from multiple disorders simultaneously. For example, in the Study of Addiction: Genetics and Environment (SAGE), alcohol and nicotine addiction were recorded through multiple assessments that we refer to as hybrid traits. Statistical inference for studying the genetic basis of hybrid traits has not been well developed. Recent rank-based methods have been used for association analyses of hybrid traits, but they do not convey the strength or direction of effects. To overcome this limitation, a parametric modeling framework is imperative. Although such parametric frameworks have been proposed in theory, they are neither well developed nor extensively used in practice, because they rely on complicated likelihood functions with high computational complexity. Many existing parametric frameworks instead use pseudo-likelihoods to reduce the computational burden. Here, we develop a model fitting algorithm for the full likelihood. Our extensive simulation studies demonstrate that inference based on the full likelihood controls the type I error rate, gains power, and improves effect size estimation when compared with several existing methods for hybrid models. These advantages remain even if the distribution of the latent variables is misspecified. Analyzing the SAGE data, we identify three genetic variants (rs7672861, rs958331, rs879330) that are significantly associated with the comorbidity of alcohol and nicotine addiction at the chromosome-wide level. Moreover, our approach has greater power in this analysis than several existing methods for hybrid traits. Although the analysis of the SAGE data motivated us to develop the model, it can be broadly applied to analyze any hybrid responses.

Topic 5: An Empirical Comparison of Deep Learning and Other Methods for Prediction of Protein Subcellular Localization with Microscopy Images







We compare the performance of deep learning methods and more traditional machine learning methods in predicting protein subcellular localization based on a large dataset of single-cell microscopy images. Specifically, we show that various VGG-type Convolutional Neural Networks (CNNs) and residual CNNs (ResNets) perform better than random forests and gradient boosting. We also demonstrate the use of CNNs for transfer learning and feature extraction.

Topic 6: Noise Injection Regularization in Large Models with Applications to Neural Networks and Graphical Models

Speaker 6: Associate Professor Fang Liu, University of Notre Dame



Prof. Fang Liu is currently an Associate Professor and the Director of Graduate Studies in the Department of Applied and Computational Mathematics and Statistics at the University of Notre Dame. She obtained her Ph.D. degree in Biostatistics from the University of Michigan, Ann Arbor in 2003 and worked as a Biostatistician at Merck Research Labs from 2003 to 2011. Prof. Liu’s research interests include the development of statistical methods for protecting data privacy, missing data analysis, Bayesian methods and modelling, statistical learning and regularization of complex models, and the application of statistics to biological and social science data. Prof. Liu has published 40+ peer-reviewed journal articles and is the sole PI on two NSF grants on data privacy. She is also the lead biostatistician on several large multinational studies on malaria prevention.



The noise injection regularization technique (NIRT) is an approach to mitigating over-fitting in large models. In this talk, I will demonstrate applications of the NIRT in two scenarios of learning large models: Neural Networks (NNs) and Graphical Models (GMs). For NNs, we develop a NIRT called whiteout that injects adaptive Gaussian noise during the training of NNs. We show that the optimization objective function associated with whiteout in generalized linear models has a closed-form penalty term that is connected to a wide range of regularizations and includes the bridge, lasso, ridge, and elastic net penalties as special cases; it can also be extended to offer regularizations similar to the adaptive lasso and group lasso. For GMs, we develop an AdaPtive Noisy Data Augmentation regularization (PANDA) approach to promote sparsity in estimating individual graphical models and similarity among multiple graphs through the training of generalized linear models. On the algorithmic level, PANDA can be implemented in a straightforward manner by iteratively solving for MLEs without constrained optimization. For both the NN and PANDA approaches, we use simulated and real-life data to demonstrate their applications and to show that they are superior or comparable to existing methods.
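The link between injected noise and a penalty term can be seen in the simplest case, linear regression with squared-error loss. Adding zero-mean noise of variance sigma^2 to every entry of an n-row design matrix adds n * sigma^2 * ||beta||^2 to the expected loss, which is a ridge penalty. The sketch below checks this through data augmentation. It is a simplified illustration, not whiteout or PANDA, and the helper names and the antithetic, orthogonalized noise (chosen so the identity holds exactly rather than only in expectation) are assumptions of this example.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate (X'X + lam I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ols_on_noise_augmented(X, y, lam, rng):
    """Ordinary least squares on two noise-perturbed copies of the data."""
    n, p = X.shape
    # Orthogonalize Gaussian noise so that E'E equals lam * I exactly
    # (for raw Gaussian noise this holds only in expectation, lam = n * sigma^2).
    Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
    E = np.sqrt(lam) * Q
    # Antithetic copies X + E and X - E cancel the cross terms X'E.
    X_aug = np.vstack([X + E, X - E])
    y_aug = np.concatenate([y, y])
    beta, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)

beta_ridge = ridge(X, y, lam=5.0)
beta_aug = ols_on_noise_augmented(X, y, lam=5.0, rng=rng)
print(np.allclose(beta_ridge, beta_aug))
```

No constrained optimization is needed: the penalized estimate comes from an unpenalized fit to augmented data. The abstract describes the same kind of mechanism, at a much more general level, for PANDA.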

Topic 7: A New Joint Screening Method for Right-Censored Time-to-Event Data with Ultrahigh Dimensional Covariates




Dr. Gang Li obtained his Ph.D. degree in Statistics from Florida State University in 1992. He is Professor of Biostatistics and Biomathematics at the University of California at Los Angeles (UCLA) and Director of UCLA’s Jonsson Comprehensive Cancer Center Biostatistics Shared Resource. He has published over 110 peer-reviewed articles in statistical research and applied work in the areas of survival analysis, longitudinal data analysis, high dimensional data analysis, clinical trials, and evaluation of biomarkers. He has co-authored or co-edited three statistical research monographs, including a recent CRC Chapman & Hall book entitled "Joint Modeling of Longitudinal and Time-to-Event Data". Dr. Li is an Elected Fellow of the Institute of Mathematical Statistics, an Elected Fellow of the American Statistical Association, an Elected Member of the International Statistical Institute, and an Elected Fellow of the Royal Statistical Society. He has served on the editorial boards of multiple statistics journals. Dr. Li has been active in collaborating with researchers in basic science, translational research, and clinical trials, and has been a statistics principal investigator for multiple large cancer studies.



In an ultrahigh dimensional setting with a huge number of covariates, variable screening is useful for dimension reduction before a more refined variable selection and parameter estimation method is applied. This paper proposes a new sure joint screening procedure for right-censored time-to-event data based on a sparsity-restricted semiparametric accelerated failure time model. Our method, referred to as Buckley-James assisted sure screening (BJASS), consists of an initial screening step using a sparsity-restricted least-squares estimate based on a synthetic time variable and a refinement screening step using a sparsity-restricted least-squares estimate with the Buckley-James imputed event times. The refinement step may be repeated several times to obtain more stable results. We show that with any fixed number of refinement steps, the BJASS procedure retains all important variables with probability tending to 1. Simulation results are presented to illustrate its performance in comparison with some marginal screening methods. A real data example is provided using a diffuse large-B-cell lymphoma (DLBCL) dataset. We have implemented the BJASS method in Matlab and R; the code is available to readers upon request.

(This talk is based on joint work with Yi Liu and Xiaolin Chen.)