Efficient Detection of Toxic Prompts in Large Language Models

Introduction

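Large language models (LLMs) such as ChatGPT and Gemini have significantly advanced natural language processing, enabling applications such as chatbots and automated content generation. However, these models can be exploited by malicious individuals who craft toxic prompts to elicit harmful or unethical responses. These individuals often use jailbreaking techniques to bypass safety mechanisms, which highlights the need for robust toxic prompt detection methods. Existing detection techniques, both blackbox and whitebox, face challenges related to the diversity of toxic prompts, scalability, and computational efficiency. We therefore propose ToxicDetector, a lightweight greybox method designed to efficiently detect toxic prompts in LLMs. ToxicDetector leverages LLMs to create toxic concept prompts, uses embedding vectors to form feature vectors, and employs a Multi-Layer Perceptron (MLP) classifier for prompt classification. Our evaluation on various versions of the LLama models, Gemma-2, and multiple datasets shows that ToxicDetector achieves an accuracy of up to 96.39% and a low false positive rate of 2.00%, outperforming existing state-of-the-art methods. In addition, ToxicDetector takes 0.0780 seconds to process each prompt, which makes it well suited to real-time applications. ToxicDetector combines high accuracy, efficiency, and scalability, making it a practical method for toxic prompt detection in LLMs.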
Problem addressed

The paper addresses the efficient detection of toxic prompts in large language models and proposes ToxicDetector, a lightweight greybox method, for this task.
Key idea

ToxicDetector leverages LLMs to create toxic concept prompts, uses embedding vectors to form feature vectors, and employs a Multi-Layer Perceptron (MLP) classifier for prompt classification.
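
The pipeline described above (concept prompts, embedding-based features, MLP classifier) can be illustrated with a minimal sketch. This summary does not say how the embeddings are combined into features, so the sketch assumes one cosine similarity per toxic concept prompt. The embeddings here are hand-made toy vectors rather than real LLM hidden states, and `cosine`, `feature_vector`, and `TinyMLP` are hypothetical names introduced only for illustration, not the authors' implementation.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def feature_vector(prompt_emb, concept_embs):
    """Assumed feature construction: one similarity score per concept prompt."""
    return [cosine(prompt_emb, c) for c in concept_embs]

class TinyMLP:
    """One-hidden-layer perceptron (tanh hidden, sigmoid output), trained by SGD."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2
        return h, 1.0 / (1.0 + math.exp(-z))

    def predict_proba(self, x):
        """Probability that the feature vector x belongs to a toxic prompt."""
        return self._forward(x)[1]

    def fit(self, X, y, epochs=1000, lr=0.2):
        for _ in range(epochs):
            for x, t in zip(X, y):
                h, p = self._forward(x)
                dz = p - t  # cross-entropy gradient w.r.t. the output logit
                for j, hj in enumerate(h):
                    dh = dz * self.w2[j] * (1.0 - hj * hj)  # back through tanh
                    self.w2[j] -= lr * dz * hj
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * dh * xi
                    self.b1[j] -= lr * dh
                self.b2 -= lr * dz

# Toy stand-ins for embeddings of two toxic concept prompts (assumption).
concepts = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

# Toy prompt embeddings: the first three are "toxic", the last three "benign".
prompts = [
    [0.9, 0.1, 0.0, 0.1], [0.1, 0.8, 0.1, 0.0], [0.7, 0.6, 0.0, 0.1],
    [0.0, 0.1, 0.9, 0.2], [0.1, 0.0, 0.2, 0.9], [0.0, 0.0, 0.7, 0.7],
]
labels = [1, 1, 1, 0, 0, 0]

X = [feature_vector(p, concepts) for p in prompts]
clf = TinyMLP(n_in=len(concepts), n_hidden=4)
clf.fit(X, labels)

# Score an unseen prompt embedding that lies close to the first concept.
score = clf.predict_proba(feature_vector([0.8, 0.2, 0.0, 0.0], concepts))
print("toxic" if score > 0.5 else "benign")
```

In a real greybox setting the vectors would presumably come from the target LLM's own embeddings for the user prompt and the concept prompts. The small classifier is what keeps the per-prompt cost low.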
Other highlights

ToxicDetector achieves a high accuracy of 96.39% and a low false positive rate of 2.00%, outperforming state-of-the-art methods. It has a processing time of 0.0780 seconds per prompt, making it highly suitable for real-time applications. The evaluation was done on various versions of the LLama models, Gemma-2, and multiple datasets.
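
For readers unfamiliar with the two reported metrics, the following sketch shows how accuracy and false positive rate are conventionally computed from binary labels (1 = toxic, 0 = benign). The function name and the toy labels are assumptions for illustration; the paper's evaluation data is not reproduced here.

```python
def accuracy_and_fpr(y_true, y_pred):
    """Return (accuracy, false positive rate) for binary labels, 1 = toxic."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # FPR: share of benign prompts that were wrongly flagged as toxic.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr

# Toy example: 3 toxic and 5 benign prompts, one miss and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
acc, fpr = accuracy_and_fpr(y_true, y_pred)
print(acc, fpr)  # 0.75 0.2
```

A low false positive rate matters for a deployed filter because every false positive is a benign user prompt that gets blocked.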
Related work

Related work includes blackbox and whitebox methods for toxic prompt detection in LLMs, which face challenges related to the diversity of toxic prompts, scalability, and computational efficiency.
