Reactive Machines
Advancing Egocentric Video Question Answering with Multimodal Large Language Models

Egocentric video question answering (QA) requires models to handle long-horizon temporal reasoning, first-person perspectives, and specialized challenges such as frequent camera movement. This paper systematically evaluates multimodal large language models (among them GPT-4o and Gemini-1.5) in zero-shot and fine-tuned settings, for both OpenQA and CloseQA. We introduce QaEgo4Dv2 to reduce the annotation noise in QaEgo4D, enabling more reliable comparisons. Our results indicate that fine-tuned Video-LLaVA-7B and Qwen2-VL-Instruct achieve new state-of-the-art performance, surpassing previous benchmarks.



