On July 16, 2023, Forbes published "The Next Frontier For Large Language Models Is Biology," an article exploring the prospects for applying large language models in biology.

Large language models like GPT-4 have taken the world by storm thanks to their astonishing command of natural language. Yet the most significant long-term opportunity for LLMs will entail an entirely different type of language: the language of biology.

One striking theme has emerged from the long march of research progress across biochemistry, molecular biology and genetics over the past century: it turns out that biology is a decipherable, programmable, in some ways even digital system.

DNA encodes the complete genetic instructions for every living organism on earth using just four variables—A (adenine), C (cytosine), G (guanine) and T (thymine). Compare this to modern computing systems, which use two variables—0 and 1—to encode all the world's digital electronic information. One system is binary and the other is quaternary, but the two have a surprising amount of conceptual overlap; both systems can properly be thought of as digital.
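
To make the overlap concrete, here is a minimal Python sketch (purely illustrative, not from the article) that packs a DNA string into ordinary binary, two bits per base:

```python
# Each of the four bases carries exactly two bits of information,
# so any DNA sequence maps losslessly onto conventional binary.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode(dna: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in dna:
        value = (value << 2) | BASE_TO_BITS[base]
    return value

print(bin(encode("GATTACA")))  # 0b10001111000100
```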

To take another example, every protein in every living being consists of and is defined by a one-dimensional string of amino acids linked together in a particular order. Proteins range from a few dozen to several thousand amino acids in length, with 20 different amino acids to choose from.

This, too, represents an eminently computable system, one that language models are well-suited to learn.
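
As a toy illustration of that computability (my sketch, not the article's), a protein sequence is simply a string over a 20-letter alphabet:

```python
# The 20 canonical amino acids in the standard one-letter code.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def is_valid_protein(seq: str) -> bool:
    """A protein sequence is a linear string over a 20-symbol alphabet."""
    return len(seq) > 0 and all(residue in AMINO_ACIDS for residue in seq)

# Example: the 21-residue A-chain of human insulin.
print(is_valid_protein("GIVEQCCTSICSLYQLENYCN"))  # True
```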

As DeepMind CEO/cofounder Demis Hassabis put it: "At its most fundamental level, I think biology can be thought of as an information processing system, albeit an extraordinarily complex and dynamic one. Just as mathematics turned out to be the right description language for physics, biology may turn out to be the perfect type of regime for the application of AI."

Large language models are at their most powerful when they can feast on vast volumes of signal-rich data, inferring latent patterns and deep structure that go well beyond the capacity of any human to absorb. They can then use this intricate understanding of the subject matter to generate novel, breathtakingly sophisticated output.

By ingesting all of the text on the internet, for instance, tools like ChatGPT have learned to converse with thoughtfulness and nuance on any imaginable topic. By ingesting billions of images, text-to-image models like Midjourney have learned to produce creative original imagery on demand.

Pointing large language models at biological data—enabling them to learn the language of life—will unlock possibilities that will make natural language and images seem almost trivial by comparison.

What, concretely, will this look like?

In the near term, the most compelling opportunity to apply large language models in the life sciences is to design novel proteins.

Proteins 101

Proteins are at the center of life itself. As prominent biologist Arthur Lesk put it, "In the drama of life at a molecular scale, proteins are where the action is."

Proteins are involved in virtually every important activity that happens inside every living thing: digesting food, contracting muscles, moving oxygen throughout the body, attacking foreign viruses. Your hormones are made out of proteins; so is your hair.

Proteins are so important because they are so versatile. They are able to undertake a vast array of different structures and functions, far more than any other type of biomolecule. This incredible versatility is a direct consequence of how proteins are built.

As mentioned above, every protein consists of a string of building blocks known as amino acids strung together in a particular order. Based on this one-dimensional amino acid sequence, proteins fold into complex three-dimensional shapes that enable them to carry out their biological functions.

A protein's shape relates closely to its function. To take one example, antibody proteins fold into shapes that enable them to precisely identify and target foreign bodies, like a key fitting into a lock. As another example, enzymes—proteins that speed up biochemical reactions—are specifically shaped to bind with particular molecules and thus catalyze particular reactions. Understanding the shapes that proteins fold into is thus essential to understanding how organisms function, and ultimately how life itself works.

Determining a protein's three-dimensional structure based solely on its one-dimensional amino acid sequence has stood as a grand challenge in the field of biology for over half a century. Referred to as the "protein folding problem," it has stumped generations of scientists. One commentator in 2007 described the protein folding problem as "one of the most important yet unsolved issues of modern science."

Deep Learning And Proteins: A Match Made In Heaven

In late 2020, in a watershed moment in both biology and computing, an AI system called AlphaFold produced a solution to the protein folding problem. Built by Alphabet's DeepMind, AlphaFold correctly predicted proteins' three-dimensional shapes to within the width of about one atom, far outperforming any other method that humans had ever devised.

It is hard to overstate AlphaFold's significance. Long-time protein folding expert John Moult summed it up well: "This is the first time a serious scientific problem has been solved by AI."

Yet when it comes to AI and proteins, AlphaFold was just the beginning.

AlphaFold was not built using large language models. It relies on an older bioinformatics construct called multiple sequence alignment (MSA), in which a protein's sequence is compared to evolutionarily similar proteins in order to deduce its structure.

MSA can be powerful, as AlphaFold made clear, but it has limitations.

For one, it is slow and compute-intensive because it needs to reference many different protein sequences in order to determine any one protein's structure. More importantly, because MSA requires the existence of numerous evolutionarily and structurally similar proteins in order to reason about a new protein sequence, it is of limited use for so-called "orphan proteins"—proteins with few or no close analogues. Such orphan proteins represent roughly 20% of all known protein sequences.

Recently, researchers have begun to explore an intriguing alternative approach: using large language models, rather than multiple sequence alignment, to predict protein structures.

"蛋白质语言模型"不是根据英语单词,而是根据蛋白质序列训练出来的,它展现出了惊人的能力,能够直观地发现蛋白质序列、结构和功能之间的复杂模式和相互关系:例如,改变蛋白质序列某些部分的某些氨基酸会如何影响蛋白质的折叠形状。可以说,蛋白质语言模型能够学习蛋白质的语法或语言学。

 

The idea of a protein language model dates back to the 2019 UniRep work out of George Church's lab at Harvard (though UniRep used LSTMs rather than today's state-of-the-art transformer models).

In late 2022, Meta debuted ESM-2 and ESMFold, one of the largest and most sophisticated protein language models published to date, weighing in at 15 billion parameters. (ESM-2 is the LLM itself; ESMFold is its associated structure prediction tool.)

ESM-2/ESMFold is about as accurate as AlphaFold at predicting proteins' three-dimensional structures. But unlike AlphaFold, it is able to generate a structure based on a single protein sequence, without requiring any structural information as input. As a result, it is up to 60 times faster than AlphaFold. When researchers are looking to screen millions of protein sequences at once in a protein engineering workflow, this speed advantage makes a huge difference. ESMFold can also produce more accurate structure predictions than AlphaFold for orphan proteins that lack evolutionarily similar analogues.

Language models' ability to develop a generalized understanding of the "latent space" of proteins opens up exciting possibilities in protein science.

But an even more powerful conceptual advance has taken place in the years since AlphaFold.

In short, these protein models can be inverted: rather than predicting a protein's structure based on its sequence, models like ESM-2 can be reversed and used to generate totally novel protein sequences that do not exist in nature based on desired properties.
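
Conceptually, generation is just next-token sampling over amino acids rather than words. The sketch below is a stand-in toy (the probability function is fake; a real model such as ProGen conditions on learned weights and desired properties):

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def next_residue_probs(prefix: str) -> list[float]:
    # Toy stand-in for a trained protein LM's next-token distribution;
    # a real model would condition on the prefix and target properties.
    rng = random.Random(prefix)
    weights = [rng.random() for _ in AMINO_ACIDS]
    total = sum(weights)
    return [w / total for w in weights]

def generate(length: int = 30) -> str:
    """Grow a brand-new sequence one sampled residue at a time."""
    sequence = ""
    for _ in range(length):
        probs = next_residue_probs(sequence)
        sequence += random.choices(AMINO_ACIDS, weights=probs)[0]
    return sequence

print(generate())  # a novel (toy) protein sequence
```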

Inventing New Proteins

All the proteins that exist in the world today represent but an infinitesimally tiny fraction of all the proteins that could theoretically exist. Herein lies the opportunity.

To give some rough numbers: the total set of proteins that exist in the human body—the so-called "human proteome"—is estimated to number somewhere between 80,000 and 400,000 proteins. Meanwhile, the number of proteins that could theoretically exist is in the neighborhood of 10^1,300—an unfathomably large number, many times greater than the number of atoms in the universe. (To be clear, not all of these 10^1,300 possible amino acid combinations would result in biologically viable proteins. Far from it. But some subset would.)
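
Where does a figure like 10^1,300 come from? A protein of length L over 20 amino acids has 20^L possible sequences, so a 1,000-residue protein alone admits roughly 10^1,301 of them, as a line of arithmetic confirms (a worked example, not the article's own derivation):

```python
import math

length = 1_000                        # residues in a (large) protein
digits = length * math.log10(20)      # log10 of 20**1000
print(f"20^{length} is about 10^{digits:.0f}")  # -> about 10^1301
```
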
Over many millions of years, the meandering process of evolution has stumbled upon tens or hundreds of thousands of these viable combinations. But this is merely the tip of the iceberg.

In the words of Molly Gibson, cofounder of leading protein AI startup Generate Biomedicines: "The amount of sequence space that nature has sampled through the history of life would equate to almost just a drop of water in all of Earth's oceans."

An opportunity exists for us to improve upon nature. After all, as powerful a force as it is, evolution by natural selection is not all-seeing; it does not plan ahead; it does not reason or optimize in top-down fashion. It unfolds randomly and opportunistically, propagating combinations that happen to work.

Using AI, we can for the first time systematically and comprehensively explore the vast uncharted realms of protein space in order to design proteins unlike anything that has ever existed in nature, purpose-built for our medical and commercial needs.

We will be able to design new protein therapeutics to address the full gamut of human illness—from cancer to autoimmune diseases, from diabetes to neurodegenerative disorders. Looking beyond medicine, we will be able to create new classes of proteins with transformative applications in agriculture, industrials, materials science, environmental remediation and beyond.

Some early efforts to use deep learning for de novo protein design have not made use of large language models.

One prominent example is ProteinMPNN, which came out of David Baker's world-renowned lab at the University of Washington. Rather than using LLMs, the ProteinMPNN architecture relies heavily on protein structure data in order to generate novel proteins.

The Baker lab more recently published RFdiffusion, a more advanced and generalized protein design model. As its name suggests, RFdiffusion is built using diffusion models, the same AI technique that powers text-to-image models like Midjourney and Stable Diffusion. RFdiffusion can generate novel, customizable protein "backbones"—that is, proteins' overall structural scaffoldings—onto which sequences can then be layered.

Structure-focused models like ProteinMPNN and RFdiffusion are impressive achievements that have advanced the state of the art in AI-based protein design. Yet we may be on the cusp of a new step-change in the field, thanks to the transformative capabilities of large language models.

Why are language models such a promising path forward compared to other computational approaches to protein design? One key reason: scaling.

Scaling Laws

One of the key forces behind the dramatic recent progress in artificial intelligence is so-called "scaling laws": the fact that almost unbelievable improvements in performance result from continued increases in LLM parameter count, training data and compute.
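
As a hedged illustration of what such a law looks like (the functional form and constants are borrowed from Kaplan et al.'s 2020 study of text models, not from this article), test loss falls as a power law in parameter count:

```python
# Illustrative scaling law in the spirit of Kaplan et al. (2020):
# loss(N) = (N_c / N) ** alpha. Constants are their empirical
# estimates for text models, used here purely for illustration.
ALPHA = 0.076     # exponent for (non-embedding) parameter count
N_C = 8.8e13      # normalizing constant

def loss(n_params: float) -> float:
    return (N_C / n_params) ** ALPHA

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {n:.0e}  ->  predicted loss {loss(n):.2f}")
```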

At each order-of-magnitude increase in scale, language models have demonstrated remarkable, unexpected, emergent new capabilities that transcend what was possible at smaller scales.

It is OpenAI's commitment to the principle of scaling, more than anything else, that has catapulted the organization to the forefront of the field of artificial intelligence in recent years. As they moved from GPT-2 to GPT-3 to GPT-4 and beyond, OpenAI has built larger models, deployed more compute and trained on larger datasets than any other group in the world, unlocking stunning and unprecedented AI capabilities.

How are scaling laws relevant in the realm of proteins?

Thanks to scientific breakthroughs that have made gene sequencing vastly cheaper and more accessible over the past two decades, the amount of DNA and thus protein sequence data available to train AI models is growing exponentially, far outpacing protein structure data.

Protein sequence data can be tokenized and for all intents and purposes treated as textual data; after all, it consists of linear strings of amino acids in a certain order, like words in a sentence. Large language models can be trained solely on protein sequences to develop a nuanced understanding of protein structure and biology.
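
A minimal sketch of that tokenization step (the vocabulary and ids below are illustrative, not ESM-2's actual mapping):

```python
# One integer id per residue, plus special tokens -- exactly the
# pre-processing a text LLM applies to words. Ids are illustrative.
VOCAB = {aa: i + 2 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}
VOCAB.update({"<cls>": 0, "<eos>": 1})

def tokenize(sequence: str) -> list[int]:
    return [VOCAB["<cls>"]] + [VOCAB[aa] for aa in sequence] + [VOCAB["<eos>"]]

print(tokenize("MKTAYIAK"))  # [0, 12, 10, 18, 2, 21, 9, 2, 10, 1]
```
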
This domain is thus ripe for massive scaling efforts powered by LLMs—efforts that may result in astonishing emergent insights and capabilities in protein science.

The first work to use transformer-based LLMs to design de novo proteins was ProGen, published by Salesforce Research in 2020. The original ProGen model was 1.2 billion parameters.

Ali Madani, the lead researcher on ProGen, has since founded a startup named Profluent Bio to advance and commercialize the state of the art in LLM-driven protein design.

While he pioneered the use of LLMs for protein design, Madani is also clear-eyed about the fact that, by themselves, off-the-shelf language models trained on raw protein sequences are not the most powerful way to tackle this challenge. Incorporating structural and functional data is essential.

"The greatest advances in protein design will be at the intersection of careful data curation from diverse sources and versatile modeling that can flexibly learn from that data," Madani said. "This entails making use of all high-signal data at our disposal—including protein structures and functional information derived from the laboratory."

Another intriguing early-stage startup applying LLMs to design novel protein therapeutics is Nabla Bio. Spun out of George Church's lab at Harvard and led by the team behind UniRep, Nabla is focused specifically on antibodies. Given that 60% of all protein therapeutics today are antibodies and that the two highest-selling drugs in the world are antibody therapeutics, it is hardly a surprising choice.

Nabla has decided not to develop its own therapeutics but rather to offer its cutting-edge technology to biopharma partners as a tool to help them develop their own drugs.

Expect to see much more startup activity in this area in the months and years ahead as the world wakes up to the fact that protein design represents a massive and still underexplored field to which to apply large language models' seemingly magical capabilities.

The Road Ahead

In her acceptance speech for the 2018 Nobel Prize in Chemistry, Frances Arnold said: "Today we can for all practical purposes read, write, and edit any sequence of DNA, but we cannot compose it. The code of life is a symphony, guiding intricate and beautiful parts performed by an untold number of players and instruments. Maybe we can cut and paste pieces from nature's compositions, but we do not know how to write the bars for a single enzymic passage."

As recently as five years ago, this was true.

But AI may give us the ability, for the first time in the history of life, to actually compose entirely new proteins (and their associated genetic code) from scratch, purpose-built for our needs. It is an awe-inspiring possibility.

These novel proteins will serve as therapeutics for a wide range of human illnesses, from infectious diseases to cancer; they will help make gene editing a reality; they will transform materials science; they will improve agricultural yields; they will neutralize pollutants in the environment; and so much more that we cannot yet even imagine.

The field of AI-powered—and especially LLM-powered—protein design is still nascent and unproven. Meaningful scientific, engineering, clinical and business obstacles remain. Bringing these new therapeutics and products to market will take years.

Yet over the long run, few market applications of AI hold greater promise.

In future articles, we will delve deeper into LLMs for protein design, including exploring the most compelling commercial applications for the technology as well as the complicated relationship between computational outcomes and real-world wet lab experiments.

Let's end by zooming out. De novo protein design is not the only exciting opportunity for large language models in the life sciences.

Language models can be used to generate other classes of biomolecules, notably nucleic acids. A buzzy startup named Inceptive, for example, is applying LLMs to generate novel RNA therapeutics.

Other groups have even broader aspirations, aiming to build generalized "foundation models for biology" that can fuse diverse data types spanning genomics, protein sequences, cellular structures, epigenetic states, cell images, mass spectrometry, spatial transcriptomics and beyond.

The ultimate goal is to move beyond modeling an individual molecule like a protein to modeling proteins' interactions with other molecules, then to modeling whole cells, then tissues, then organs—and eventually entire organisms.

The idea of building an artificial intelligence system that can understand and design every intricate detail of a complex biological system is mind-boggling. In time, this will be within our grasp.

The twentieth century was defined by fundamental advances in physics: from Albert Einstein's theory of relativity to the discovery of quantum mechanics, from the nuclear bomb to the transistor. As many modern observers have noted, the twenty-first century is shaping up to be the century of biology. Artificial intelligence and large language models will play a central role in unlocking biology's secrets and unleashing its possibilities in the decades ahead.

Buckle up.

https://www.forbes.com/sites/robtoews/2023/07/16/the-next-frontier-for-large-language-models-is-biology/
