RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning

Dense image captions are important for cross-modal alignment in vision-language pre-training and image-to-text generation, but acquiring expert-quality annotations is expensive. Distilling from strong vision-language models (VLMs) is a viable alternative, yet supervised distillation tends to produce limited output diversity and weak generalization. Reinforcement learning (RL) can overcome these limitations, but its successes so far have centered on verifiable domains with deterministic checks – a luxury unavailable in open-ended captioning. We address this limitation with RubiCap, a novel RL framework that derives sample-specific reward signals from LLM-written rubrics. RubiCap begins by assembling a diverse committee of candidate captions, then uses an LLM rubric writer to distill their consensus strengths and expose the shortcomings of the current policy. These findings are converted into explicit evaluation criteria, allowing the LLM judge to move beyond absolute quality ratings and replace noisy scalar rewards with systematic, multi-faceted assessments. Across extensive benchmarks, RubiCap achieves the highest win rates on CapArena, surpassing supervised distillation, prior RL methods, human expert annotations, and even GPT-4V. On CaptionQA it shows superior downstream performance: our 7B model matches Qwen2.5-VL-32B-Instruct, and our 3B model outperforms its 7B counterpart. Notably, using the compact RubiCap-3B as a captioner produces stronger pre-trained VLMs than those trained on captions from proprietary models.
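The abstract outlines a three-stage reward loop: sample a committee of candidate captions, have an LLM rubric writer turn their consensus strengths and recurring weaknesses into explicit criteria, then have an LLM judge score each rollout against those criteria to produce the scalar reward for RL. The sketch below illustrates that structure only; all function names are hypothetical, and the LLM rubric writer and judge are stubbed with trivial string heuristics so the example runs standalone — the paper's actual prompts and aggregation scheme are not specified here.

```python
# Illustrative sketch of a rubric-guided reward loop (hypothetical names;
# the LLM rubric writer and judge are replaced by toy string heuristics).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    description: str                  # human-readable rubric item
    check: Callable[[str], bool]      # stand-in for an LLM judge call

def write_rubric(candidates: List[str]) -> List[Criterion]:
    """Stand-in for the LLM rubric writer: derive explicit criteria from
    the candidate committee. A real system would prompt an LLM; here we
    use fixed heuristics so the sketch is runnable."""
    criteria = [
        Criterion("mentions a salient object",
                  lambda c: any(w in c for w in ("dog", "ball", "park"))),
        Criterion("is a complete sentence",
                  lambda c: c.strip().endswith(".")),
    ]
    # Encode a shortcoming shared by the whole committee, e.g. captions
    # too short to count as "dense".
    if all(len(c.split()) < 12 for c in candidates):
        criteria.append(Criterion("contains 12+ words (density)",
                                  lambda c: len(c.split()) >= 12))
    return criteria

def rubric_reward(caption: str, rubric: List[Criterion]) -> float:
    """Aggregate per-criterion pass/fail judgments into a scalar reward
    in [0, 1] for the RL update."""
    if not rubric:
        return 0.0
    return sum(crit.check(caption) for crit in rubric) / len(rubric)

committee = ["A dog.", "A dog with a ball"]       # candidate committee
rubric = write_rubric(committee)                  # sample-specific rubric
rollout = ("A brown dog chases a red ball across the grassy park "
           "on a sunny afternoon.")
print(round(rubric_reward(rollout, rubric), 2))   # prints 1.0
```

The key design point the abstract emphasizes is that the rubric is written per sample, so the reward reflects multi-faceted, instance-specific criteria rather than a single absolute quality score.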
