SWAN: Preprocessing SGD Enables Adam-Level Performance On LLM Training With Significant Memory Reduction
