- Introduction: Inference serving is critical for deploying machine learning models in real-world applications, ensuring that inference requests are handled efficiently and answered quickly. However, managing resources in these systems is challenging, especially when it comes to maintaining performance under varying and unpredictable workloads. The two main scaling strategies, horizontal and vertical scaling, offer different advantages and limitations: horizontal scaling adds more instances to handle increased load, but can suffer from cold-start problems and added management complexity; vertical scaling increases the capacity of existing instances, allowing faster reaction, but is limited by hardware and by the model's ability to parallelize. This paper introduces Themis, an inference serving system designed to combine the advantages of horizontal and vertical scaling. Themis adopts a two-stage autoscaling strategy: it first uses in-place vertical scaling to absorb workload surges, then switches to horizontal scaling to optimize resource efficiency once the workload stabilizes. The system profiles the processing latency of deep learning models, computes queuing delays, and applies different dynamic programming algorithms to optimally solve the joint horizontal and vertical scaling problem depending on the workload situation. Extensive evaluations on real-world workload traces show that Themis reduces SLO violations by more than 10x compared to state-of-the-art horizontal or vertical autoscaling approaches, while maintaining resource efficiency when the workload is stable.
- Figures
- Problem addressed: Themis: Optimizing Resource Efficiency for Distributed Deep Learning Inference Serving
- Key idea: Themis employs a two-stage autoscaling strategy that leverages both horizontal and vertical scaling to optimize resource efficiency while maintaining performance under varying and unpredictable workloads.
- Other highlights: The system profiles the processing latency of deep learning models, calculates queuing delays, and employs different dynamic programming algorithms to solve the joint horizontal and vertical scaling problem optimally based on the workload situation. Extensive evaluations with real-world workload traces demonstrate over $10\times$ SLO violation reduction compared to state-of-the-art horizontal or vertical autoscaling approaches, while maintaining resource efficiency when the workload is stable.
- Related work includes 'DeepRM: A Deep Reinforcement Learning Framework for Resource Management in Cloud Computing' and 'AutoScale: Dynamic, Robust Capacity Management for Multi-Tier Data Centers'.
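The two-stage policy summarized above (vertical scale-up to absorb surges, horizontal repacking once the workload stabilizes) can be illustrated with a minimal sketch. This is not Themis's actual algorithm: the class, the per-instance `cores` model, and the `stable_windows` threshold are all hypothetical simplifications of the idea described in the abstract.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    cores: int  # vertical resource allocation of this instance

class TwoStageAutoscaler:
    """Toy sketch of a two-stage policy (all names hypothetical):
    in-place vertical scaling absorbs surges without cold starts; once the
    workload has been stable for a few windows, horizontal scaling repacks
    capacity onto evenly right-sized replicas for resource efficiency."""

    def __init__(self, max_cores=8, stable_windows=3):
        self.max_cores = max_cores          # vertical headroom per instance
        self.stable_windows = stable_windows
        self.calm = 0                       # consecutive non-surge windows
        self.replicas = [Replica(cores=1)]

    def step(self, demand_cores):
        """Observe one monitoring window's demand; return total capacity."""
        capacity = sum(r.cores for r in self.replicas)
        if demand_cores > capacity:
            self.calm = 0
            # Stage 1: grow existing replicas in place (fast, no cold start).
            for r in self.replicas:
                grow = min(self.max_cores - r.cores, demand_cores - capacity)
                r.cores += grow
                capacity += grow
            # Vertical headroom exhausted: add new replicas as a last resort.
            while capacity < demand_cores:
                add = min(self.max_cores, demand_cores - capacity)
                self.replicas.append(Replica(cores=add))
                capacity += add
        else:
            self.calm += 1
            if self.calm >= self.stable_windows and demand_cores > 0:
                # Stage 2: workload is stable; switch to horizontal scaling
                # with the fewest evenly sized replicas covering the demand.
                n = -(-demand_cores // self.max_cores)   # ceil division
                per = -(-demand_cores // n)
                self.replicas = [Replica(cores=per) for _ in range(n)]
        return sum(r.cores for r in self.replicas)
```

A surge is served immediately by growing existing instances, and only after several calm windows does the controller trade the over-provisioned vertical capacity for a compact horizontal layout.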
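The highlights mention profiling model latency, computing queuing delays, and solving the joint horizontal/vertical scaling problem. As a stand-in for Themis's dynamic programming formulation (which the summary does not detail), the sketch below searches configurations `(replicas, cores)` for the cheapest one whose estimated latency meets the SLO, using a hypothetical profiled service-rate curve and a simple M/M/1 sojourn-time approximation `W = 1 / (mu - lambda/n)` for queuing plus processing delay.

```python
def service_rate(cores, base_rate=50.0):
    """Hypothetical latency profile: requests/s served by one replica as a
    function of its cores, with diminishing returns from parallelization."""
    return base_rate * (1 + 0.6 * (cores - 1))

def best_config(arrival_rate, slo_s, max_replicas=16, max_cores=8):
    """Return (replicas, cores_per_replica) minimizing total cores such
    that the M/M/1-estimated latency meets the SLO, or None if infeasible.
    Requests are assumed to be load-balanced evenly across replicas."""
    best = None
    for n in range(1, max_replicas + 1):
        lam = arrival_rate / n            # per-replica arrival rate
        for c in range(1, max_cores + 1):
            mu = service_rate(c)
            if mu <= lam:
                continue                  # queue is unstable at this size
            latency = 1.0 / (mu - lam)    # queuing + processing estimate
            if latency <= slo_s:
                cost = n * c
                if best is None or cost < best[0]:
                    best = (cost, n, c)
                break                     # more cores only adds cost here
    return (best[1], best[2]) if best else None
```

With homogeneous replicas this brute-force search is tractable; the paper's dynamic programming presumably handles richer configuration spaces than this toy model does.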