PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model

June 19, 2024
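  • Abstract
    Pathogen identification is pivotal in diagnosing, treating, and preventing diseases, and is crucial for controlling infections and safeguarding public health. Traditional alignment-based methods, though widely used, are computationally intensive and depend on extensive reference databases, and they often fail to detect novel pathogens because of their low sensitivity and specificity. Similarly, conventional machine learning techniques, while promising, require large annotated datasets and extensive feature engineering, and are prone to overfitting. To address these challenges, we introduce PathoLM, a cutting-edge pathogen language model optimized for identifying pathogenicity in bacterial and viral sequences. Leveraging the strengths of pre-trained DNA models such as the Nucleotide Transformer, PathoLM requires minimal data for fine-tuning, thereby enhancing pathogen detection capabilities. It effectively captures a broader genomic context, significantly improving the identification of novel and divergent pathogens. We developed a comprehensive dataset comprising approximately 30 species of viruses and bacteria, including the ESKAPEE pathogens, seven bacterial strains notable for their antibiotic resistance. Additionally, we curated a species classification dataset centered on the ESKAPEE group. In comparative assessments, PathoLM significantly outperforms existing models such as DciPatho, demonstrating robust zero-shot and few-shot capabilities. Furthermore, we extended PathoLM-Sp to ESKAPEE species classification, where it showed superior performance compared with other advanced deep learning methods despite the complexity of the task.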
  • Problem addressed
    Pathogen identification is crucial for diagnosing, treating, and preventing diseases, but traditional methods are computationally intensive and often fail to detect novel pathogens. This paper introduces a pathogen language model optimized for pathogenicity identification that requires only minimal data for fine-tuning.
  • Key idea
    The key idea of this paper is to leverage pre-trained DNA models to develop PathoLM, a pathogen language model that captures a broader genomic context and significantly improves the identification of novel and divergent pathogens. PathoLM requires minimal data for fine-tuning and outperforms existing models like DciPatho, demonstrating robust zero-shot and few-shot capabilities.
  • Other highlights
    The paper introduces PathoLM, a cutting-edge pathogen language model optimized for the identification of pathogenicity in bacterial and viral sequences. The model is trained on a comprehensive dataset comprising approximately 30 species of viruses and bacteria, including ESKAPEE pathogens, and a curated species classification dataset centered specifically on the ESKAPEE group. PathoLM outperforms existing models like DciPatho, demonstrating robust zero-shot and few-shot capabilities. Additionally, the paper expands PathoLM-Sp for ESKAPEE species classification, where it showed superior performance compared to other advanced deep learning methods.
  • Related work
    Recent related studies in this field include 'DciPatho: A Deep Learning Framework for Pathogenicity Prediction of Bacterial DNA Sequences' and 'DeepMicrobes: Classifying Microorganisms Using Deep Learning on Single-Cell Sequencing Data'.
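
The key idea above, fine-tuning a pre-trained DNA language model on pathogen sequences, starts with turning raw nucleotides into tokens. The sketch below is only an illustration under stated assumptions: it mimics a non-overlapping 6-mer scheme of the kind used by models such as the Nucleotide Transformer, falling back to single-base tokens where a full valid 6-mer is not available. The summary does not describe PathoLM's actual tokenizer or input length, so the function names (`kmer_tokenize`, `chunk_tokens`), the fallback rule, and the `max_len` parameter are all hypothetical.

```python
from typing import List

VALID_BASES = set("ACGT")


def kmer_tokenize(seq: str, k: int = 6) -> List[str]:
    """Split a DNA string into non-overlapping k-mers (assumed scheme).

    Where a full k-mer of unambiguous bases cannot be formed (sequence tail,
    or an ambiguous base such as N inside the window), emit one single-base
    token and advance by one position.
    """
    seq = seq.upper()
    tokens: List[str] = []
    i = 0
    while i < len(seq):
        window = seq[i:i + k]
        if len(window) == k and set(window) <= VALID_BASES:
            tokens.append(window)
            i += k
        else:
            tokens.append(seq[i])
            i += 1
    return tokens


def chunk_tokens(tokens: List[str], max_len: int) -> List[List[str]]:
    """Cut a token list into consecutive pieces of at most max_len tokens,
    so that a long genome fits a model's fixed input size (hypothetical)."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


if __name__ == "__main__":
    toks = kmer_tokenize("ACGTACGTACGTACGTAC")
    print(toks)
    print(chunk_tokens(toks, max_len=2))
```

In a real fine-tuning setup, each chunk would be mapped to token IDs by the pre-trained model's own tokenizer and paired with a pathogenic or non-pathogenic label; that step is omitted here because it depends on the specific model checkpoint.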