- Introduction: In policy optimization algorithms, simulating a single trajectory of a dynamical system under a state-dependent policy is a core bottleneck: each simulation must perform many inherently serial policy evaluations, and these account for most of the cost. For example, when policy optimization is applied to supply chain optimization (SCO) problems, simulating one month of a supply chain can take several hours. We propose an iterative algorithm for policy simulation called Picard Iteration. The scheme assigns policy evaluation tasks to independent processes; within an iteration, each process evaluates the policy only on its assigned tasks while assuming "cached" evaluations for the remaining tasks, and the cache is updated at the end of the iteration. Implemented on a GPU, the scheme allows batched evaluation of the policy on a single trajectory. We prove that the structure offered by many SCO problems allows convergence in a small number of iterations, independent of the horizon. We demonstrate practical speedups of up to 400x on large-scale SCO problems even with a single GPU, and also demonstrate practical efficacy in other reinforcement learning environments.
- Problem addressed: "Picard Iteration: A Unified GPU Architecture for Accelerating Policy Iteration"
- Key idea: The Picard Iteration algorithm assigns policy evaluation tasks to independent processes, with each process relying on cached evaluations for the tasks it does not own; the cache is updated at the end of each iteration. This allows batched evaluation of the policy on a single trajectory. The algorithm is implemented on GPUs and is proven to converge in a small number of iterations for many supply chain optimization problems (see the sketch after this list).
- Other highlights: The Picard Iteration algorithm yields practical speedups of up to 400x on large-scale supply chain optimization problems, even with a single GPU, and is also effective in other reinforcement learning environments. The experiments were run on GPUs, and the authors provide open-source code. Future work could explore applications of the algorithm in domains beyond supply chain optimization.
- Related work in this field includes 'A Survey on Parallelization of Deep Reinforcement Learning Algorithms' and 'Efficient Parallel Methods for Deep Reinforcement Learning'.
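To make the caching scheme concrete, here is a minimal NumPy sketch of Picard-style policy simulation, written under stated assumptions rather than from the authors' code: the dynamics step `f(x, a)` is assumed cheap relative to the policy, the policy is assumed to accept a whole stack of states at once (the batched, GPU-friendly call), and the names `picard_simulate`, `f`, and `policy` are all illustrative.

```python
import numpy as np

def picard_simulate(f, policy, x0, horizon, n_iters=50, tol=1e-8):
    """Picard-style policy simulation (illustrative sketch, not the paper's code).

    f(x, a)   -> next state; assumed cheap relative to the policy.
    policy(X) -> actions for a whole stack of states X of shape (T, d),
                 i.e. one batched (GPU-friendly) call instead of T serial ones.
    """
    X = np.tile(np.asarray(x0, dtype=float), (horizon, 1))  # cached guesses for x_0..x_{T-1}
    for _ in range(n_iters):
        A = policy(X)                    # batched policy evaluation on the cached states
        X_new = np.empty_like(X)
        X_new[0] = x0
        for t in range(horizon - 1):     # cheap serial rollout using the cached actions
            X_new[t + 1] = f(X_new[t], A[t])
        if np.max(np.abs(X_new - X)) < tol:
            return X_new, policy(X_new)  # fixed point reached: trajectory is exact
        X = X_new                        # update the cache for the next iteration
    return X, policy(X)

# Toy usage: stable linear dynamics with a linear "policy".
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A_dyn = 0.5 * np.eye(3)
    K = 0.1 * rng.standard_normal((3, 3))
    f = lambda x, a: A_dyn @ x + a
    policy = lambda X: X @ K.T           # evaluates every time step in one call
    X, actions = picard_simulate(f, policy, np.ones(3), horizon=100)
    print(X[-1])
```

The loop caches a guess for every state along the trajectory, evaluates the policy on all of them in one batched call, re-rolls the cheap dynamics with those actions, and stops once the trajectory is a fixed point; for contractive dynamics like the toy example above, this takes far fewer iterations than the horizon length.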