
Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling
