[1] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. PaLM 2 Technical Report. arXiv preprint arXiv:2305.10403.
[2] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. GPT-NeoX-20B: An Open-Source Autoregressive Language Model. arXiv preprint arXiv:2204.06745.
[3] Burton H Bloom. 1970. Space/time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM, 13(7):422–426.
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language Models are Few-shot Learners. Advances in neural information processing systems, 33:1877–1901.
[5] Moses S Charikar. 2002. Similarity Estimation Techniques from Rounding Algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 380–388.
[6] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374.
[7] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An Open-source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://vicuna.lmsys.org (accessed 14 April 2023).
[8] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.
[9] Together Computer. 2023. RedPajama: An Open Dataset for Training Large Language Models. https://github.com/togethercomputer/RedPajama-Data.
[10] OpenCompass Contributors. 2023. OpenCompass: A Universal Evaluation Platform for Foundation Models.
[11] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems, 35:16344–16359.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1), pages 4171–4186. Association for Computational Linguistics.
[13] Shizhe Diao, Jiaxin Bai, Yan Song, Tong Zhang, and Yonggang Wang. 2020. ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4729–4740.
[14] Zhenyi Fan, Chenghao Lu, and Jie Tian. 2023. Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model.
[15] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. 2020. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv preprint arXiv:2101.00027.
[16] Conghui He, Zhenjiang Jin, Chao Xu, Jiantao Qiu, Bin Wang, Wei Li, Hang Yan, JiaQi Wang, and Dahua Lin. 2023. WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models. arXiv preprint arXiv:2308.10755.
[17] Kenneth Heafield. 2011. KenLM: Faster and Smaller Language Model Queries. In Proceedings of the sixth workshop on statistical machine translation, pages 187–197.
[18] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring Massive Multitask Language Understanding. arXiv preprint arXiv:2009.03300.
[19] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring Mathematical Problem Solving with the MATH Dataset. arXiv preprint arXiv:2103.03874.
[20] Yongfeng Huang, Yanyang Li, Yichong Xu, Lin Zhang, Ruyi Gan, Jiaxing Zhang, and Liwei Wang. 2023a. MVP-Tuning: Multi-View Knowledge Retrieval with Prompt Tuning for Commonsense Reasoning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13417–13432, Toronto, Canada.
[21] Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, et al. 2023b. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. arXiv preprint arXiv:2305.08322.
[22] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Transactions of the Association for Computational Linguistics, 8:64–77.
[23] Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2021. Deduplicating Training Data Makes Language Models Better. arXiv preprint arXiv:2107.06499.
[24] Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. 2023. CMMLU: Measuring Massive Multitask Language Understanding in Chinese. arXiv preprint arXiv:2306.09212.
[25] Miaofeng Liu, Yan Song, Hongbin Zou, and Tong Zhang. 2019a. Reinforced Training Data Selection for Domain Adaptation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1957–1968, Florence, Italy.
[26] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
[27] Ilya Loshchilov and Frank Hutter. 2017. Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101.
[28] Junyu Lu, Ping Yang, Ruyi Gan, Jing Yang, and Jiaxing Zhang. 2022. Unified BERT for Few-shot Natural Language Understanding. arXiv preprint arXiv:2206.12094.
[29] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. 2017. Mixed Precision Training. arXiv preprint arXiv:1710.03740.
[30] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
[31] OpenAI. 2022. Introducing ChatGPT.
[32] OpenAI. 2023. GPT-4 Technical Report.
[33] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
[34] Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. arXiv preprint arXiv:2306.01116.
[35] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar.
[36] Yang Ping, JunYu Lu, Ruyi Gan, Junjie Wang, Yuxiang Zhang, Pingjian Zhang, and Jiaxing Zhang. 2023. UniEX: An Effective and Efficient Framework for Unified Information Extraction via a Span-extractive Perspective. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16424–16440, Toronto, Canada.
[37] Han Qin, Yuanhe Tian, and Yan Song. 2021. Relation Extraction with Word Graphs from N-grams. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2860–2868, Online and Punta Cana, Dominican Republic.
[38] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language Models are Unsupervised Multitask Learners. OpenAI blog, 1(8):9.
[39] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-text Transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
[40] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory Optimizations toward Training Trillion Parameter Models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.
[41] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural Machine Translation of Rare Words with Subword Units. arXiv preprint arXiv:1508.07909.
[42] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-billion Parameter Language Models using Model Parallelism. arXiv preprint arXiv:1909.08053.
[43] Yan Song, Chia-Jung Lee, and Fei Xia. 2017. Learning Word Representations with Regularization from Prior Knowledge. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 143–152.
[44] Yan Song, Shuming Shi, and Jing Li. 2018. Joint Learning Embeddings for Chinese Words and Their Components via Ladder Structured Networks. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 4375–4381.
[45] Yan Song, Tong Zhang, Yonggang Wang, and Kai-Fu Lee. 2021. ZEN 2.0: Continue Training and Adaption for N-gram Enhanced Text Encoders. arXiv preprint arXiv:2105.01279.
[46] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2021. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv preprint arXiv:2104.09864.
[47] Yuanhe Tian, Weidong Chen, Bo Hu, Yan Song, and Fei Xia. 2023. End-to-end Aspect-based Sentiment Analysis with Combinatory Categorial Grammar. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13597–13609, Toronto, Canada.
[48] Yuanhe Tian, Yan Song, and Fei Xia. 2022. Improving Relation Extraction through Syntax-induced Pretraining with Dependency Masking. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.
[49] Kushal Tirumala, Daniel Simig, Armen Aghajanyan, and Ari S Morcos. 2023. D4: Improving LLM Pretraining via Document De-duplication and Diversification. arXiv preprint arXiv:2308.12284.
[50] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
[51] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. LLaMA 2: Open Foundation and Fine-tuned Chat Models. arXiv preprint arXiv:2307.09288.
[52] Junjie Wang, Yuxiang Zhang, Ping Yang, and Ruyi Gan. 2022. Towards No. 1 in CLUE Semantic Matching Challenge: Pre-trained Language Model Erlangshen with Propensity-Corrected Loss. arXiv preprint arXiv:2208.02959.
[53] Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2019. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. arXiv preprint arXiv:1911.00359.
[54] Shaohua Wu, Xudong Zhao, Tong Yu, Rongguo Zhang, Chong Shen, Hongli Liu, Feng Li, Hong Zhu, Jiangang Luo, Liang Xu, et al. 2021. Yuan 1.0: Large-scale Pre-trained Language Model in Zero-shot and Few-shot Learning. arXiv preprint arXiv:2110.04725.
[55] Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, et al. 2023. Baichuan 2: Open Large-scale Language Models. arXiv preprint arXiv:2309.10305.
[56] Ping Yang, Junjie Wang, Ruyi Gan, Xinyu Zhu, Lin Zhang, Ziwei Wu, Xinyu Gao, Jiaxing Zhang, and Tetsuya Sakai. 2022. Zero-Shot Learners for Natural Language Understanding via a Unified Multiple Choice Perspective. arXiv preprint arXiv:2210.08590.
[57] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2023. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. arXiv preprint arXiv:2309.12284.
[58] Sha Yuan, Hanyu Zhao, Zhengxiao Du, Ming Ding, Xiao Liu, Yukuo Cen, Xu Zou, Zhilin Yang, and Jie Tang. 2021. WuDaoCorpora: A Super Large-scale Chinese Corpora for Pre-training Language Models. AI Open, 2:65–68.
[59] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. GLM-130B: An Open Bilingual Pre-trained Model. arXiv preprint arXiv:2210.02414.
[60] Biao Zhang and Rico Sennrich. 2019. Root Mean Square Layer Normalization. Advances in Neural Information Processing Systems, 32.
[61] Jiaxing Zhang, Ruyi Gan, Junjie Wang, Yuxiang Zhang, Lin Zhang, Ping Yang, Xinyu Gao, Ziwei Wu, Xiaoqun Dong, Junqing He, et al. 2022. Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence. arXiv preprint arXiv:2209.02970.