MetaphorStar is the first end-to-end visual reinforcement learning (RL) framework specifically designed for image implication understanding.
Metaphorical comprehension in images remains a critical challenge for AI systems. While Multimodal Large Language Models (MLLMs) excel at basic VQA, they struggle with nuanced cultural, emotional, and contextual implications.
To address this, we propose MetaphorStar, the first end-to-end visual reinforcement learning (RL) framework for image implication tasks. Our framework introduces the TFQ (True-False Question) paradigm to convert subjective interpretations into verifiable binary judgments, enabling stable RL optimization.
Our open-source MetaphorStar family (3B, 7B, 32B), trained using TFQ-GRPO, achieves significant performance improvements and state-of-the-art results.
Understanding visual metaphors requires complex cognitive chains:
Visual Elements → Symbolic Recognition → Metaphorical Mapping → Cultural Context → Deep Implication.
Standard Supervised Fine-Tuning (SFT) is insufficient for teaching this process.
We leverage Reinforcement Learning (RL) to optimize the reasoning process itself. However, applying RL to subjective visual interpretation is challenging due to the lack of "ground truth". We solve this with the True-False Question (TFQ) Paradigm:
Instead of scoring free-form interpretations, the model reasons explicitly inside <think>...</think> tags and then commits to a verifiable True/False judgment.
Existing question formats such as MCQ (Multiple-Choice Question) and OSQ (Open-Style Question) have limitations: MCQ is stable to evaluate but only moderately difficult, while OSQ is challenging yet hard to evaluate and optimize reliably.
We introduce the True-False Question (TFQ) as a fine-grained foundation: it converts subjective interpretations into verifiable binary judgments, combining the evaluation stability of MCQ with coverage of nuanced implications, and thereby provides a reliable reward signal for RL optimization.
We construct a large-scale dataset from 1,434 high-quality metaphorical images in II-Bench. We ensure quality through human-in-the-loop prompting with expert-validated ground-truth implications and human-authored reference examples, which align the model's outputs with human-logical constraints; the generated data is then manually verified.
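To make the data format concrete, here is a hypothetical TFQ-Data record; the field names and content below are illustrative assumptions for exposition, not the released schema:

```python
# Hypothetical TFQ-Data record. Field names and values are illustrative
# assumptions, not the released schema.
tfq_example = {
    "image": "ii_bench/0421.jpg",        # metaphorical image drawn from II-Bench
    "statement": (
        "The image implies that constant notifications fragment "
        "a person's attention."
    ),                                   # candidate implication to judge
    "label": True,                       # expert-verified True/False answer
    "implication": "Digital overload erodes sustained focus.",  # ground-truth implication the statement was derived from
}
```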
Generation Pipeline & Design Principles:
TFQ-GRPO (Group Relative Policy Optimization for True-False Questions) is our specialized visual RL algorithm.
Structured Output Format: We enforce a strict structure separating reasoning from judgment:
<think> [reasoning] </think> <answer> [True/False] </answer>
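As an illustration of how this structure can be checked, the following is a minimal parsing sketch; the regex and function name are our own assumptions, not the released implementation:

```python
import re

# Minimal sketch of validating the enforced <think>/<answer> structure.
STRUCTURE = re.compile(
    r"^<think>(?P<think>.+?)</think>\s*<answer>(?P<answer>True|False)</answer>\s*$",
    re.DOTALL,
)

def parse_response(text: str):
    """Return (reasoning, answer_is_true) for well-formed responses, else None."""
    match = STRUCTURE.match(text.strip())
    if match is None:
        return None
    return match.group("think").strip(), match.group("answer") == "True"
```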
Multi-Component Reward Function:
\[R_{\text{total}} = R_{\text{accuracy}} + \lambda_{\text{format}} \cdot R_{\text{format}}\]
Correctness is rewarded based on the binary answer, while format rewards ensure structural compliance.
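Putting the two components together, a minimal reward sketch might look like the following (it builds on the parse_response sketch above; the reward values and the \(\lambda_{\text{format}}\) default are illustrative placeholders):

```python
def tfq_reward(response: str, label: bool, lambda_format: float = 0.5) -> float:
    """R_total = R_accuracy + lambda_format * R_format (illustrative values)."""
    parsed = parse_response(response)                # from the sketch above
    r_format = 1.0 if parsed is not None else 0.0    # structural compliance
    r_accuracy = 0.0
    if parsed is not None:
        _, predicted_true = parsed
        r_accuracy = 1.0 if predicted_true == label else 0.0  # binary correctness
    return r_accuracy + lambda_format * r_format
```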
Group Relative Optimization: We sample \(K\) diverse outputs for each question and optimize the policy based on the relative advantage of each output compared to the group average.
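A common GRPO-style way to compute this relative advantage is sketched below for illustration; the exact normalization used in TFQ-GRPO may differ:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Advantage of each of the K sampled responses relative to its group.

    Uses the common GRPO normalization: reward minus the group mean,
    divided by the group standard deviation.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: rewards for K = 4 sampled answers to the same TFQ item.
advantages = group_relative_advantages(torch.tensor([1.5, 0.5, 1.5, 0.0]))
```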
We introduce the MetaphorStar family, which comprises three sizes: 3B, 7B, and 32B. We use the Qwen2.5-VL series as the base models.
| Model | Base Model | Size | Link |
|---|---|---|---|
| MetaphorStar-3B | Qwen2.5-VL-3B | 3B | HuggingFace |
| MetaphorStar-7B | Qwen2.5-VL-7B | 7B | HuggingFace |
| MetaphorStar-32B | Qwen2.5-VL-32B | 32B | HuggingFace |
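For reference, here is a minimal inference sketch, assuming the checkpoints load through the standard Qwen2.5-VL path in a recent version of transformers; the repo id below is a placeholder for the HuggingFace links in the table above:

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Placeholder repo id; substitute the actual HuggingFace link from the table above.
MODEL_ID = "your-org/MetaphorStar-7B"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

image = Image.open("example_metaphor.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": (
            "Statement: The melting clock implies that time feels "
            "meaningless to the figure in the image. True or False?"
        )},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # expected to contain <think>...</think> <answer>True/False</answer>
```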
To gain insight into the internal reasoning mechanisms of our model, we analyze its token-level generation entropy. Figure 3 provides a visualization of this entropy as MetaphorStar-7B generates responses for the TFQ, MCQ, and OSQ tasks.
Our analysis reveals that high-entropy tokens, representing the points of highest uncertainty for the model, are not randomly distributed. This aligns with recent findings that a "high-entropy minority" of tokens is critical for complex reasoning. In the context of image implication, we observe that these spikes in uncertainty consistently occur at crucial semantic and logical junctions.
Specifically, the model exhibits high entropy when generating logical connectors (e.g., "therefore", "thus", "but") that pivot the argument or establish a causal link. We also note high entropy for key function words (e.g., "the", "is"), quantifiers, and pronouns, suggesting that the model's core cognitive effort is concentrated on making definitive logical leaps and structuring the relationship between concepts. Conversely, low-entropy (high-confidence) tokens are typically associated with reproducing factual details from the image or completing deterministic phrasal structures.
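For readers who want to reproduce this kind of analysis, a minimal sketch of per-token entropy extraction with transformers' generate() is shown below; it illustrates the idea rather than the exact analysis pipeline:

```python
import torch

def token_entropies(model, inputs, max_new_tokens=256):
    """Per-token entropy (in nats) of the generation distribution.

    Relies on the scores returned by generate(); one entropy value is
    computed for each generated token.
    """
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        output_scores=True,
        return_dict_in_generate=True,
    )
    entropies = []
    for step_logits in out.scores:                       # one tensor per generated token
        probs = torch.softmax(step_logits[0].float(), dim=-1)
        entropies.append(-(probs * torch.log(probs + 1e-12)).sum().item())
    return out.sequences, entropies
```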
State-of-the-Art Performance: MetaphorStar-32B achieves SOTA on all major benchmarks.
Does learning metaphor help general vision? YES.
We evaluated MetaphorStar on general benchmarks (MMBench, MathVista, MMVet). The results show that training on implication tasks also improves performance on general complex reasoning tasks, suggesting that "Image Metaphor Understanding" serves as a high-level cognitive workout for MLLMs.
We conduct comprehensive ablation studies to validate our design choices and understand the factors contributing to MetaphorStar's success. Our analysis covers four critical dimensions: model scale, data scale, architectural choices, and training strategies.
Model Parameter Scaling: We observed consistent performance gains scaling from 3B to 32B. Larger models benefit more significantly from the RL stage, showing emergent reasoning capabilities with longer CoT paths.
Training Data Scaling: Scaling TFQ-Data from 1k to 14k samples shows a log-linear improvement in reasoning accuracy. The diversity of the dataset (covering politics, art, humor) is crucial for preventing overfitting to specific visual styles.
We validated our framework across LLaVA architectures. TFQ-GRPO proves to be model-agnostic, consistently improving the reasoning baseline of different backbones.
We explore the impact of different training strategies by comparing three approaches: TFQ-SFT (SFT only), TFQ-SFT + TFQ-GRPO (SFT warmup + RL), and TFQ-GRPO (End-to-end RL).
Counterintuitively, SFT warmup actively harms performance. End-to-end RL (TFQ-GRPO) achieves the best results on TFQ and MCQ, while SFT-involving strategies cause a catastrophic collapse on MCQ (46% → 28%), indicating severe generalization damage from the "SFT Curse" and entropy collapse.
The "SFT Curse" in Visual Metaphor Reasoning
Our analysis reveals a critical finding for reasoning-heavy visual tasks: SFT warmup is not only unnecessary but actively detrimental.
Conclusion: End-to-end RL (TFQ-GRPO) leverages high initial entropy for global optimization, proving superior for open-ended reasoning tasks.
To conclude, we present MetaphorStar, a pioneering framework that introduces visual reinforcement learning to the domain of image metaphor understanding. By establishing the TFQ paradigm and the TFQ-GRPO algorithm, we successfully bridge the gap between subjective visual interpretation and objective RL optimization.
Our release includes the MetaphorStar model family, the large-scale TFQ-Data, and the rigorous TFQ-Bench.
Crucially, our findings demonstrate that learning to reason about metaphors serves as a high-level cognitive workout, enhancing general visual reasoning capabilities. We hope our open-source contribution will inspire further research into reasoning-based visual learning and the cognitive depths of MLLMs.
@article{zhang2026metaphorstar,
title={MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning},
  author={Chenhao Zhang and Yazhe Niu and Hongsheng Li},
journal={arXiv preprint arXiv:xxx},
year={2026}
}
This website is adapted from VISIONx @NYU, licensed under Creative Commons Attribution-ShareAlike 4.0 International License.