AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents

简介

AI代理人通过将基于文本的推理与外部工具调用相结合来解决复杂任务。不幸的是，AI代理人容易受到提示注入攻击的影响，其中由外部工具返回的数据劫持代理人执行恶意任务。为了衡量AI代理人的对抗鲁棒性，我们引入了AgentDojo，这是一个用于执行不受信任数据上的代理工具的评估框架。为了捕捉攻击和防御的不断演变，AgentDojo不是一个静态测试套件，而是一个可扩展的环境，用于设计和评估新的代理任务、防御和自适应攻击。我们在该环境中提供了97个现实任务（如管理电子邮件客户端、导航电子银行网站或进行旅行预订）、629个安全测试用例以及来自文献的各种攻击和防御范例。我们发现，AgentDojo对于攻击和防御都是一个挑战：最先进的LLM在许多任务上失败（即使没有攻击），而现有的提示注入攻击破坏了一些安全属性，但并非全部。我们希望AgentDojo能够促进对解决常见任务的AI代理人的新设计原则的研究。我们在https://github.com/ethz-spylab/agentdojo上发布了AgentDojo的代码。
图表
解决问题

AgentDojo: An Evaluation Framework for Robustness of AI Agents to Prompt Injection Attacks
关键思路

AgentDojo is an extensible evaluation framework for measuring the adversarial robustness of AI agents that execute tools over untrusted data. It includes realistic tasks, security test cases, and various attack and defense paradigms from the literature.
其它亮点

AgentDojo poses a challenge for both attacks and defenses, and can foster research on new design principles for AI agents that solve common tasks in a reliable and robust manner. The code for AgentDojo is released at https://github.com/ethz-spylab/agentdojo.
相关研究

Recent related work includes 'Adversarial Attacks on Neural Networks for Graph Data: A Survey' by Zheng et al., 'Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning' by Wang et al., and 'Robustness to Adversarial Examples through an Ensemble of Specialists' by Tramèr et al.

AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents

评论