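By: School of Statistics, Southwestern University of Finance and Economics  Published: 2016-05-09
Topic: Automated feature extraction and selection for high-throughput phenotyping
Venue: Room TB404, Tongbo Building
Organizers: Center of Statistical Research, School of Statistics, Office of Scientific Research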
Dr. Yu Sheng is an assistant professor of statistics in the Center for Statistical Science at Tsinghua University. He received his BS and MA degrees in statistics from Nankai University and the University of Michigan, respectively, and his PhD in systems engineering (operations research) from the George Washington University. He began working in medical informatics during his research at Harvard University. His current research interests include deep understanding of medical language with machine learning methods, internet-based and data-driven knowledge extraction, and supervised and unsupervised EMR analysis.
With the rapid adoption of electronic medical records (EMR), medicine and healthcare have become one of the most important fields for big data applications. One important application in medical research is EMR-based phenotyping, which uses machine learning algorithms to identify patients with certain phenotypes. The conventional procedure for designing a phenotyping algorithm requires medical experts to work with statisticians and medical informaticians to decide which variables to use and which medical terms to search for, and a single algorithm typically takes months to finalize. We propose a data-driven method that automates the algorithm design process and can achieve higher accuracy than expert-designed algorithms. We use publicly available knowledge sources, such as Wikipedia, to collect an initial set of candidate features. The billing codes and the natural language variable of the target phenotype are used to create surrogates of the gold-standard labels. Penalized logistic regression models are then trained repeatedly on bootstrap samples to predict the surrogates, which measures how informative each candidate feature is. Only a succinct set of highly informative features passes this data-driven screening and enters the final model, which predicts the true gold-standard labels. The method has been used in the development of large-scale biobanks at top-ranked hospitals in the U.S.
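The screening step described in the abstract can be illustrated with a minimal sketch. This is not the speaker's implementation: the data are synthetic, the function names (`fit_l1_logistic`, `screen_features`) are hypothetical, and the surrogate rule (billing-code count plus text-mention count of at least 2), the L1 penalty value, and the 50% selection-frequency cutoff are all assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(w, t):
    # Proximal operator of the L1 norm.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def fit_l1_logistic(X, y, lam=0.1, n_iter=200):
    """L1-penalized logistic regression via proximal gradient descent.
    The intercept is not penalized."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    # Step size no larger than 1/L, where L bounds the Lipschitz
    # constant of the gradient of the mean logistic loss.
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2 + n)
    for _ in range(n_iter):
        z = np.clip(X @ w + b, -30.0, 30.0)
        resid = 1.0 / (1.0 + np.exp(-z)) - y
        w = soft_threshold(w - step * (X.T @ resid) / n, step * lam)
        b -= step * resid.mean()
    return w, b

def screen_features(X, surrogate, n_boot=30, lam=0.1, keep_freq=0.5, seed=0):
    """Refit the penalized model on bootstrap samples and keep features
    whose coefficient is nonzero in at least keep_freq of the fits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # standardize
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        w, _ = fit_l1_logistic(Xs[idx], surrogate[idx], lam=lam)
        counts += (w != 0)
    freq = counts / n_boot
    return np.flatnonzero(freq >= keep_freq), freq

# Synthetic example (assumption): 10 candidate features, of which
# only features 0 and 1 depend on the latent phenotype.
rng = np.random.default_rng(42)
n, p = 500, 10
phenotype = rng.random(n) < 0.3  # true status, unobserved in practice
X = rng.poisson(0.3, size=(n, p)).astype(float)
X[:, :2] += rng.poisson(3.0 * phenotype[:, None], size=(n, 2))

# Surrogate label built from two noisy proxies of the target phenotype:
# a billing-code count and a text-mention count (both simulated here).
icd_count = rng.poisson(0.2 + 3.0 * phenotype)
nlp_count = rng.poisson(0.2 + 3.0 * phenotype)
surrogate = ((icd_count + nlp_count) >= 2).astype(float)

selected, freq = screen_features(X, surrogate)
print("selected feature indices:", selected)
print("selection frequencies:", np.round(freq, 2))
```

In a real workflow the surviving features would then be used to fit a final model on a small chart-reviewed set with gold-standard labels; that step is omitted here because the sketch has no such labels.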