
A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training
