- Introduction: The context windows of large language models (LLMs) are growing rapidly, producing large variance in resource usage across different requests and across the phases of a single request. Constrained by static parallelism strategies, existing LLM serving systems cannot efficiently use the underlying resources to serve variable-length requests in different phases. To address this, we propose a new parallelism paradigm, elastic sequence parallelism (ESP), to elastically adapt to the variance between requests and phases. Based on ESP, we design and build LoongServe, an LLM serving system that (1) improves computation efficiency by elastically adjusting the degree of parallelism in real time, (2) improves communication efficiency by reducing key-value cache migration overhead and overlapping partial decoding communication with computation, and (3) improves GPU memory efficiency by reducing key-value cache fragmentation across instances. Evaluation on diverse real-world datasets shows that LoongServe improves maximum throughput by up to 3.85x over chunked prefill and 5.81x over prefill-decoding disaggregation.
- Problem solved: LoongServe: Elastic Sequence Parallelism for Efficient Large Language Model Serving
- Key idea: LoongServe proposes elastic sequence parallelism (ESP) to adapt to the variance in resource usage between different requests and phases, improving computation, communication, and GPU memory efficiency in large language model serving.
- Other highlights: LoongServe improves maximum throughput by up to 3.85x compared to chunked prefill and 5.81x compared to prefill-decoding disaggregation. It reduces key-value cache migration overhead, overlaps partial decoding communication with computation, and reduces key-value cache fragmentation across instances. The evaluation is conducted on diverse real-world datasets.
- Related work includes existing LLM serving systems that are restricted by static parallelism strategies, and other parallelism approaches like pipeline parallelism and tensor parallelism.
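To make the key idea concrete, here is a toy sketch of elastic parallelism-degree scheduling: a long prompt grabs many instances for the compute-heavy prefill phase, then releases most of them when it enters the one-token-at-a-time decode phase. All names and thresholds below are hypothetical illustrations, not the LoongServe implementation.

```python
# Toy sketch of elastic sequence parallelism (ESP) scheduling.
# Function name and tokens_per_instance threshold are hypothetical,
# chosen only to illustrate scale-out on prefill / scale-in on decode.

def esp_degree(seq_len: int, phase: str, total_instances: int,
               tokens_per_instance: int = 4096) -> int:
    """Pick a degree of parallelism for a request based on its phase.

    Prefill is compute-bound and scales with prompt length, so long
    prompts get more instances; decode generates one token per step,
    so it typically needs only enough instances to hold the KV cache.
    """
    if phase == "prefill":
        # Scale out with prompt length (ceiling division), capped by the pool.
        need = -(-seq_len // tokens_per_instance)
        return max(1, min(need, total_instances))
    if phase == "decode":
        # Scale in: keep only enough instances for the resident KV cache
        # (assume each instance can hold 4x more cached tokens than it
        # can prefill per batch -- an arbitrary toy ratio).
        need = -(-seq_len // (tokens_per_instance * 4))
        return max(1, min(need, total_instances))
    raise ValueError(f"unknown phase: {phase}")

# A 32K-token request uses the whole 8-instance pool for prefill,
# then shrinks to 2 instances for decode, freeing 6 for other requests.
print(esp_degree(32_768, "prefill", total_instances=8))  # -> 8
print(esp_degree(32_768, "decode", total_instances=8))   # -> 2
```

A static strategy would have to fix one degree for both phases, either starving long prefills or idling GPUs during decode; the elastic policy sketched here avoids both, at the cost of the KV-cache migration that LoongServe's mechanisms aim to make cheap.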