Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization