- Introduction: In policy optimization algorithms, simulating a single trajectory of a dynamical system under a state-dependent policy is a core bottleneck: each simulation must perform many inherently serial policy evaluations, and these account for most of the cost. For example, when policy optimization is applied to supply chain optimization (SCO) problems, simulating one month of a supply chain can take several hours. We propose an iterative algorithm for policy simulation called Picard Iteration. The scheme assigns policy evaluation tasks to independent processes; within an iteration, each process evaluates the policy only on its assigned tasks while assuming "cached" evaluations for the remaining tasks, and the cache is updated at the end of the iteration. Implemented on a GPU, the scheme allows batched evaluation of the policy on a single trajectory. We prove that the structure offered by many SCO problems allows convergence in a small number of iterations, independent of the horizon. We demonstrate practical speedups of up to 400x on large-scale SCO problems even with a single GPU, and also demonstrate practical efficacy in other reinforcement learning environments.
- Problem addressed: "Picard Iteration: A Unified GPU Architecture for Accelerating Policy Iteration"
- Key idea: The Picard Iteration algorithm assigns policy evaluation tasks to independent processes, with each process relying on cached evaluations for the tasks it does not own; the cache is updated at the end of each iteration. This allows batched evaluation of the policy on a single trajectory. The algorithm is implemented on GPUs and is proven to converge in a small number of iterations for many supply chain optimization problems (see the sketch after this list).
- Other highlights: The Picard Iteration algorithm yields practical speedups of up to 400x on large-scale supply chain optimization problems, even with a single GPU, and is also effective in other reinforcement learning environments. The experiments were run on GPUs, and the authors provide open-source code. Future work could explore applications of the algorithm in domains beyond supply chain optimization.
- Related work in this field includes 'A Survey on Parallelization of Deep Reinforcement Learning Algorithms' and 'Efficient Parallel Methods for Deep Reinforcement Learning'.
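To make the caching scheme concrete, here is a minimal NumPy sketch of Picard-style policy simulation, written under stated assumptions rather than from the authors' code: the dynamics step `f(x, a)` is assumed cheap relative to the policy, the policy is assumed to accept a whole stack of states at once (the batched, GPU-friendly call), and the names `picard_simulate`, `f`, and `policy` are all illustrative.

```python
import numpy as np

def picard_simulate(f, policy, x0, horizon, n_iters=50, tol=1e-8):
    """Picard-style policy simulation (illustrative sketch, not the paper's code).

    f(x, a)   -> next state; assumed cheap relative to the policy.
    policy(X) -> actions for a whole stack of states X of shape (T, d),
                 i.e. one batched (GPU-friendly) call instead of T serial ones.
    """
    X = np.tile(np.asarray(x0, dtype=float), (horizon, 1))  # cached guesses for x_0..x_{T-1}
    for _ in range(n_iters):
        A = policy(X)                    # batched policy evaluation on the cached states
        X_new = np.empty_like(X)
        X_new[0] = x0
        for t in range(horizon - 1):     # cheap serial rollout using the cached actions
            X_new[t + 1] = f(X_new[t], A[t])
        if np.max(np.abs(X_new - X)) < tol:
            return X_new, policy(X_new)  # fixed point reached: trajectory is exact
        X = X_new                        # update the cache for the next iteration
    return X, policy(X)

# Toy usage: stable linear dynamics with a linear "policy".
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A_dyn = 0.5 * np.eye(3)
    K = 0.1 * rng.standard_normal((3, 3))
    f = lambda x, a: A_dyn @ x + a
    policy = lambda X: X @ K.T           # evaluates every time step in one call
    X, actions = picard_simulate(f, policy, np.ones(3), horizon=100)
    print(X[-1])
```

The loop caches a guess for every state along the trajectory, evaluates the policy on all of them in one batched call, re-rolls the cheap dynamics with those actions, and stops once the trajectory is a fixed point; for contractive dynamics like the toy example above, this takes far fewer iterations than the horizon length.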