Title: Surrogate-Model Approaches to Optimizers for LLM Training
Abstract: The recent empirical success of the Muon optimizer in training large language models has outpaced the theoretical understanding of its matrix-gradient orthogonalization design. To bridge this gap, this talk introduces surrogate-model approaches that analyze and systematically improve deep learning optimization over a single iteration. We first present the isotropic curvature model, a convex program assuming curvature isotropy across perturbation directions, which reveals that optimal update matrices have a more homogeneous spectrum. This approach demonstrates that while Muon's gradient orthogonalization is directionally correct, it is strictly optimal only under specific curvature phase transitions. Building on this theoretical foundation, we introduce a second, quadratic surrogate model that approximates the loss using the gradient, an output-space curvature matrix, and the input data matrix. By minimizing this surrogate under an isotropic weight assumption, we derive Newton-Muon. This finding implies that standard Muon is an implicit Newton-type method that neglects the right-side preconditioning induced by the second moment of the input data. Empirically, Newton-Muon accelerates GPT-2 pretraining, reaching the target validation loss in 6% fewer iterations and reducing wall-clock training time by roughly 4%, illustrating the efficacy of principled surrogate models in designing LLM optimizers.
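The gradient orthogonalization at the heart of Muon can be sketched in a few lines: the matrix gradient is replaced by its polar factor, which sets every singular value to one and thus homogenizes the update's spectrum, in line with the isotropic curvature analysis above. The sketch below uses an exact SVD for clarity; it is an illustration of the idea, not Muon's production implementation (which typically approximates the polar factor with a Newton-Schulz iteration), and the function name is mine.

```python
import numpy as np

def orthogonalize(G):
    """Return the polar factor of the matrix gradient G.

    Replaces each singular value of G with 1, keeping only the
    singular directions -- the orthogonalized update Muon applies.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

# Example: a random "gradient" with a skewed spectrum.
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
O = orthogonalize(G)

# All singular values of the orthogonalized update are 1.
print(np.linalg.svd(O, compute_uv=False))
```

Under this view, Newton-Muon would additionally right-multiply the update by a preconditioner built from the input second moment, rather than using the bare polar factor.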
Speaker: Weijie Su
Bio: Weijie Su is an Associate Professor in the Wharton Statistics and Data Science Department and holds appointments in the Departments of Computer and Information Science; Biostatistics, Epidemiology and Informatics; and Mathematics at the University of Pennsylvania. He is Co-Director of Penn Research in Machine Learning (PRiML). His research focuses on mathematical and compute-efficient approaches to understanding deep learning and AI. His current interests include the statistical foundations of large language models, privacy-preserving machine learning, high-dimensional statistics, mathematical optimization, and phenomenological deep learning theory.