
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
