Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs