Reposted from today's 爱可可 AI frontier recommendations

[CL] Precise Zero-Shot Dense Retrieval without Relevance Labels

L Gao, X Ma, J Lin, J Callan
[CMU & University of Waterloo]

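Key points:

  1. Hypothetical Document Embeddings (HyDE) can be used to build an effective zero-shot dense retrieval system without relevance labels;
  2. HyDE outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers across a variety of tasks and languages;
  3. HyDE can serve as a backend to which less common or emerging queries are routed, while more common queries are served by a supervised dense retriever.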
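
The routing idea in point 3 could be sketched roughly as follows. This is only an illustration: the summary does not say how "common" queries are identified, so the query-count table, the `min_count` threshold, and the names `make_router`, `supervised`, and `hyde` are all hypothetical.

```python
from typing import Callable, Dict, List

# A retriever maps a query string to a ranked list of document ids or texts.
Retriever = Callable[[str], List[str]]


def make_router(
    supervised: Retriever,
    hyde: Retriever,
    query_counts: Dict[str, int],
    min_count: int = 5,
) -> Retriever:
    """Build a router between two retrievers.

    Assumption (not from the paper): a query counts as "common" when it has
    been seen at least `min_count` times in a hypothetical query log.
    """

    def route(query: str) -> List[str]:
        seen = query_counts.get(query.strip().lower(), 0)
        if seen >= min_count:
            # Frequent query: use the supervised dense retriever.
            return supervised(query)
        # Rare or emerging query: fall back to the zero-shot HyDE backend.
        return hyde(query)

    return route


# Stub retrievers that only label which path handled the query.
supervised_stub: Retriever = lambda q: ["supervised:" + q]
hyde_stub: Retriever = lambda q: ["hyde:" + q]

route = make_router(supervised_stub, hyde_stub, {"weather today": 120})
common_result = route("Weather Today")
rare_result = route("brand new topic")
```

In practice the routing signal could be anything that separates well-covered queries from new ones; the frequency threshold here is only the simplest choice to illustrate the split.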

Abstract:
While dense retrieval has been shown to be effective and efficient across tasks and languages, it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance label is available. In this paper, we recognize the difficulty of zero-shot learning and encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings (HyDE). Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g. InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is unreal and may contain false details. Then, an unsupervised contrastively learned encoder (e.g. Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, where similar real documents are retrieved based on vector similarity. This second step grounds the generated document to the actual corpus, with the encoder's dense bottleneck filtering out the incorrect details. Our experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers, across various tasks (e.g. web search, QA, fact verification) and languages (e.g. sw, ko, ja).
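
The two steps described in the abstract (generate a hypothetical document, then encode it and search the corpus by vector similarity) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `fake_generator` is a hypothetical stub standing in for an instruction-following language model, and `toy_encode` is a normalised bag-of-words vector standing in for a learned encoder such as Contriever.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def toy_encode(text: str) -> Vector:
    """Stand-in encoder (NOT Contriever): L2-normalised bag of words."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def inner_product(a: Vector, b: Vector) -> float:
    """Similarity between two sparse vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def hyde_search(
    query: str,
    generate: Callable[[str], str],
    encode: Callable[[str], Vector],
    corpus: List[str],
    k: int = 1,
) -> List[Tuple[float, str]]:
    # Step 1: ask the generator for a hypothetical document answering the query.
    hypothetical_doc = generate(query)
    # Step 2: encode the hypothetical document (not the raw query) ...
    search_vector = encode(hypothetical_doc)
    # ... and rank the real corpus documents by similarity to that vector.
    scored = [(inner_product(search_vector, encode(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def fake_generator(query: str) -> str:
    # Hypothetical stub for the language model. The canned text deliberately
    # contains a false detail (the population figure) to mimic an "unreal"
    # hypothetical document.
    return "Paris is the capital of France and has a population of 90 million."


corpus = [
    "Paris is the capital and largest city of France.",
    "Dense retrieval encodes queries and documents as vectors.",
    "Swahili, Korean and Japanese are human languages.",
]

results = hyde_search("what is the capital of France", fake_generator, toy_encode, corpus)
top_score, top_doc = results[0]
```

In this toy setup the real document about Paris is ranked first even though the hypothetical document contains a wrong number, because only real corpus documents are ever returned. A real system would swap in an actual language model and a learned dense encoder with a vector index.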

Paper link: http://boston.lti.cs.cmu.edu/luyug/HyDE/HyDE.pdf
