
Dear friends,

Internalizing this mental framework has made me a more efficient machine learning engineer: Most of the work of building a machine learning system is debugging rather than development.

This idea will likely resonate with machine learning engineers who have worked on supervised learning or reinforcement learning projects for years. It also applies to the emerging practice of prompt-based AI development.

When you’re building a traditional software system, it’s common practice to write a product spec, then write code to that spec, and finally spend time debugging the code and ironing out the kinks. But when you’re building a machine learning system, it’s frequently better to build an initial prototype quickly and use it to identify and fix issues. This is particularly true for applications built around tasks that humans can do well, such as processing unstructured data like images, audio, or text.

- Build a simple system quickly to see how well it does.
- Figure out where it falls short (via error analysis or other techniques), and iteratively try to close the gap between what the system does and what a human (such as you, the developer, or a domain expert) would do given the same data. A rough sketch of this kind of error analysis appears below.
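To make the second step concrete, here is a minimal sketch of what error analysis might look like in code. The data format, the `predict` function, and the error tags are hypothetical placeholders chosen for illustration, not any particular tool’s API.

```python
# Minimal error-analysis sketch (hypothetical data format and model).
# Tally the tags of examples the system gets wrong, so you can see
# where it falls furthest short of what a human would do.
from collections import Counter

def error_analysis(examples, predict):
    """examples: list of dicts with 'input', 'human_label', and a manual
    'tag' (e.g., 'blurry image', 'very long email') added while reviewing
    mistakes. predict: the system being debugged."""
    error_tags = Counter()
    for ex in examples:
        if predict(ex["input"]) != ex["human_label"]:
            error_tags[ex["tag"]] += 1
    # The most common tags suggest where to focus the next iteration.
    return error_tags.most_common()
```

The useful output is the ranked list of error categories: it tells you which kind of mistake to attack first.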

Machine learning software often has to carry out a sequence of steps; such systems are called pipelines or cascades. Say you want to build a system that routes an ecommerce site’s customer emails to the appropriate department (apparel, electronics, and so on), then retrieves relevant product information using semantic search, and finally drafts a response for a human representative to edit. Each of these steps could be done by a human. By examining them individually and seeing where the system falls short of human-level performance, you can decide where to focus your attention.
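As an illustration only, here is a rough sketch of how such a pipeline might be structured, with each stage as its own function so it can be inspected and compared against human performance separately. The function names `classify_department`, `search_products`, and `draft_reply` are hypothetical placeholders, not real APIs.

```python
# Hypothetical sketch of the email-handling pipeline described above.
# Keeping each stage separate makes it easy to evaluate them one at a time.

def classify_department(email_text: str) -> str:
    """Route the email to 'apparel', 'electronics', etc. (placeholder)."""
    ...

def search_products(email_text: str, department: str) -> list[str]:
    """Retrieve relevant product info via semantic search (placeholder)."""
    ...

def draft_reply(email_text: str, products: list[str]) -> str:
    """Draft a response for a human representative to edit (placeholder)."""
    ...

def handle_email(email_text: str) -> str:
    department = classify_department(email_text)
    products = search_products(email_text, department)
    return draft_reply(email_text, products)
```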

While debugging a system, I frequently have a “hmm, that looks strange” moment that suggests what to try next. For example, I’ve experienced each of the following many times:

- The learning curve doesn’t quite look right.
- The system performs worse on what you think are the easier examples.
- The loss function outputs values that are higher or lower than you think they should be.
- Adding a feature that you thought would help performance actually hurt it.
- Performance on the test set is better than seems reasonable.
- An LLM’s output is inconsistently formatted; for example, it includes extraneous text. (A minimal sketch of catching this appears below.)
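For the last item, a small amount of defensive parsing often makes the failure mode visible instead of silently breaking downstream code. This is a minimal sketch, assuming the model was asked to return a single JSON object; the fallback regex is a heuristic, not a robust parser.

```python
import json
import re

def parse_llm_json(raw: str):
    """Try to parse an LLM response as JSON; if that fails, attempt to
    extract the first {...} block in case the model wrapped its answer
    in extraneous text, and log that it happened."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match:
            print("warning: extraneous text around JSON:", raw[:40], "...")
            return json.loads(match.group(0))
        raise
```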

When it comes to noticing things like this, experience working with multiple projects is helpful. Machine learning systems have a lot of moving parts. When you have seen many learning curves, you start to hone your instincts about what’s normal and what’s anomalous; or when you have prompted a large language model (LLM) to output JSON many times, you start to get a sense of the most common error modes. These days, I frequently play with building different small LLM-based applications on weekends just for fun. Seeing how they behave (as well as consulting with friends on their projects) is helping me to hone my own instincts about when such applications go wrong, and what are plausible solutions.

Understanding how the algorithms work really helps, too. Thanks to development tools like TensorFlow and PyTorch, you can implement a neural network in just a few lines of code — that’s great! But what if (or when!) you find that your system doesn’t work well? Taking courses that explain the theory that underlies various algorithms is useful. If you understand at a technical level how a learning algorithm works, you’re more likely to spot unexpected behavior, and you’ll have more options for debugging it.
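For instance, a small network really can be just a few lines of PyTorch. This generic sketch (layer sizes chosen arbitrarily) only shows how little code is involved; knowing what each layer and the loss are doing is what helps when training misbehaves.

```python
import torch.nn as nn

# A small fully connected classifier in a few lines of PyTorch.
model = nn.Sequential(
    nn.Linear(784, 128),   # e.g., a flattened 28x28 input
    nn.ReLU(),
    nn.Linear(128, 10),    # 10 output classes
)
loss_fn = nn.CrossEntropyLoss()
```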

The notion that much of machine learning development is akin to debugging arises from this observation: When we start a new machine learning project, we don’t know what strange and wonderful things we’ll find in the data. With prompt-based development, we also don’t know what strange and wonderful things a generative model will produce. This is why machine learning development is much more iterative than traditional software development: We’re embarking on a journey to discover these things. Building a system quickly and then spending most of your time debugging it is a practical way to get such systems working.

Keep learning!

Andrew
