Colloquium - Simon Du, "Pre-Training Data Selection for Representation Learning"
From cs-speakerseries
Abstract:
Pre-training datasets are a critical component in recent breakthroughs in artificial intelligence. However, their design has not received the same level of research attention as model architectures or training algorithms. In this presentation, I will discuss our recent work on pre-training data selection for representation learning in the contexts of multi-modal contrastive learning and multi-task representation learning. For multi-modal contrastive learning, we propose a new notion, the Variance Alignment Score (VAS). We demonstrate that by maximizing the VAS as a data selection strategy, we can achieve superior performance on dataset selection benchmarks. For multi-task representation learning, we explore how to select the most relevant pre-training tasks for a target downstream task. We introduce a metric to characterize task relevance and design a new method for actively selecting the most pertinent tasks.
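To make the score-then-select pattern the abstract describes concrete, here is a minimal, hypothetical Python sketch. The exact VAS definition is given in the speaker's paper; the score below is only a stand-in: it measures how well each candidate sample's (image, text) embedding pair aligns with a cross-covariance matrix estimated from a small proxy target set, and the pool is filtered by keeping the top-scoring fraction. All function names, shapes, and the 30% keep rate are illustrative assumptions, not the authors' implementation.

import numpy as np

def cross_covariance(img_emb, txt_emb):
    """Cross-covariance of L2-normalized image/text embeddings, shape (n, d), from a proxy target set."""
    return img_emb.T @ txt_emb / len(img_emb)

def alignment_scores(img_emb, txt_emb, sigma):
    """Per-sample score v_i^T Sigma t_i: a VAS-style alignment of each pair with the target statistics."""
    return np.einsum("nd,de,ne->n", img_emb, sigma, txt_emb)

def select_top_fraction(scores, frac=0.3):
    """Return indices of the top `frac` of the pool by score (selection maximizes the score)."""
    k = max(1, int(frac * len(scores)))
    return np.argsort(scores)[-k:]

# Toy usage: random vectors stand in for CLIP-style embeddings.
rng = np.random.default_rng(0)
d = 16
def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)
proxy_img, proxy_txt = unit(rng.normal(size=(100, d))), unit(rng.normal(size=(100, d)))
pool_img, pool_txt = unit(rng.normal(size=(1000, d))), unit(rng.normal(size=(1000, d)))

sigma = cross_covariance(proxy_img, proxy_txt)
keep = select_top_fraction(alignment_scores(pool_img, pool_txt, sigma), frac=0.3)
print(f"Selected {len(keep)} of {len(pool_img)} pool samples.")

The same skeleton applies to the multi-task setting sketched in the abstract, with per-task relevance scores replacing per-sample scores.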
Bio:
Simon S. Du is an assistant professor in the Paul G. Allen School of Computer Science & Engineering at the University of Washington. His research interests lie broadly in machine learning, including deep learning, representation learning, and reinforcement learning. Prior to starting as faculty, he was a postdoc at the Institute for Advanced Study. He completed his Ph.D. in Machine Learning at Carnegie Mellon University. Simon's research has been recognized by a Samsung AI Researcher of the Year Award, an NSF CAREER award, an Intel Rising Star Faculty Award, an Nvidia Pioneer Award, an AAAI New Faculty Highlights selection, and a Distinguished Dissertation Award honorable mention from CMU, among other honors.