April is AI Month at Penn! Check out the calendar for a full list of events across campus.

FOLDS Seminar featuring Weijie Su (University of Pennsylvania)

04-09-26
12:00 pm

Title: Surrogate-Model Approaches to Optimizers for LLM Training

Abstract: The recent empirical success of the Muon optimizer in training large language models has outpaced the theoretical understanding of its matrix-gradient orthogonalization design. To bridge this gap, this talk introduces surrogate-model approaches that analyze and systematically improve deep learning optimization over a single iteration. We first present the isotropic curvature model, a convex program assuming curvature isotropy across perturbation directions, which reveals that optimal update matrices achieve a more homogeneous spectrum. This approach demonstrates that while Muon's gradient orthogonalization is directionally correct, it is only strictly optimal under specific curvature phase transitions. Building on this theoretical foundation, we introduce a second, quadratic surrogate model that approximates the loss using the gradient, an output-space curvature matrix, and the input data matrix. By minimizing this surrogate under an isotropic weight assumption, we derive Newton-Muon. This finding implies that standard Muon is an implicit Newton-type method that neglects the right preconditioning induced by the input second moment. Empirically, Newton-Muon accelerates GPT-2 pretraining, reaching the target validation loss in 6% fewer iterations and reducing wall-clock training time by roughly 4%, illustrating the efficacy of principled surrogate models in designing LLM optimizers.
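For readers unfamiliar with the orthogonalization step the abstract refers to, a minimal sketch is given below. It computes the polar factor of a matrix gradient exactly via SVD, which replaces all singular values with 1 and thus produces the fully homogeneous spectrum the talk discusses; Muon itself approximates this factor with a Newton-Schulz iteration for efficiency. The function name and matrix shapes here are illustrative, not from the talk.

```python
import numpy as np

def orthogonalize(G: np.ndarray) -> np.ndarray:
    """Return the polar factor of G: with G = U S V^T, output U V^T.

    This equalizes the singular values of the update matrix to 1,
    the idealized version of Muon's orthogonalized gradient step.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

# Illustrative matrix gradient for a 4x3 weight matrix.
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
O = orthogonalize(G)

# The columns of O are orthonormal: O^T O = I.
print(np.allclose(O.T @ O, np.eye(3)))  # → True
```

In an actual Muon update the resulting matrix `O` would be scaled by the learning rate and applied to the weights in place of the raw gradient; the Newton-Muon variant described in the abstract additionally preconditions this step using curvature and input-second-moment information.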

Speaker

Weijie Su, Associate Professor, Wharton Statistics and Data Science; Departments of Computer and Information Science, Biostatistics, Epidemiology and Informatics, and Mathematics, University of Pennsylvania; Co-Director, Penn Research in Machine Learning

Weijie Su is an Associate Professor in the Wharton Statistics and Data Science Department and holds appointments in the Departments of Computer and Information Science, Biostatistics, Epidemiology and Informatics, and Mathematics at the University of Pennsylvania. He is Co-Director of Penn Research in Machine Learning (PRiML). His research focuses on mathematical and compute-efficient approaches to understanding deep learning and AI. His current interests include the statistical foundations of large language models, privacy-preserving machine learning, high-dimensional statistics, mathematical optimization, and phenomenological deep learning theory.