MetaphorStar

Image Metaphor Understanding and Reasoning
with End-to-End Visual Reinforcement Learning

MetaphorStar is the first end-to-end visual reinforcement learning (RL) framework specifically designed for image implication understanding.

First Visual RL Framework for Image Metaphors: Integrating Chain-of-Thought reasoning with Group Relative Policy Optimization (GRPO).
Large-Scale Dataset & Benchmark: TFQ-Data with 14k high-quality training samples + TFQ-Bench for rigorous evaluation.
State-of-the-Art Performance: Outperforms Gemini-3.0-Pro on multiple benchmarks.
82.6% Improvement: Massive accuracy gains on TFQ tasks compared to base models.
Enhanced General Reasoning: Improves performance on general vision benchmarks (MMBench, MathVista, MMVet).
MetaphorStar Teaser

Metaphorical comprehension in images remains a critical challenge for AI systems. While Multimodal Large Language Models (MLLMs) excel at basic VQA, they struggle with nuanced cultural, emotional, and contextual implications.

To address this, we propose MetaphorStar, the first end-to-end visual reinforcement learning (RL) framework for image implication tasks. Our framework introduces the TFQ (True-False Question) paradigm to convert subjective interpretations into verifiable binary judgments, enabling stable RL optimization. Our open-source MetaphorStar family (3B, 7B, 32B), trained using TFQ-GRPO, achieves significant performance improvements and state-of-the-art results.



Motivation

Understanding visual metaphors requires complex cognitive chains: Visual Elements → Symbolic Recognition → Metaphorical Mapping → Cultural Context → Deep Implication. Standard Supervised Fine-Tuning (SFT) is insufficient for teaching this process.

We leverage Reinforcement Learning (RL) to optimize the reasoning process itself. However, applying RL to subjective visual interpretation is challenging due to the lack of "ground truth". We solve this with the True-False Question (TFQ) Paradigm, described in the Method below.

Method

1. Why TFQ?

Current benchmark formats such as MCQ (Multiple-Choice Question) and OSQ (Open-Style Question) have limitations: MCQ is stable to evaluate but only of medium difficulty, while OSQ is challenging but difficult to evaluate and optimize reliably.

We introduce the True-False Question (TFQ) as a fine-grained foundation: each question reduces a subjective interpretation to a verifiable binary judgment, yielding an automatically checkable signal that is stable enough to drive RL optimization.

2. TFQ-Data & TFQ-Bench

We construct a large-scale dataset from 1,434 high-quality metaphorical images in II-Bench. We ensure high quality through human-in-the-loop prompting: expert-validated ground-truth implications and high-quality, human-authored reference examples align the generated questions with human logical constraints, and all generated data is additionally verified by hand.

Generation Pipeline & Design Principles

TFQ Data and Bench
Figure 1: Overview of TFQ-Data and TFQ-Bench splits. TFQ-Data-Full contains ~14k questions for training, while TFQ-Bench provides rigorous evaluation sets strictly disjoint from training data.
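
For concreteness, a single TFQ-Data sample can be pictured roughly as below. The field names and values are illustrative assumptions for this sketch, not the released schema.

```python
# Illustrative sketch of one TFQ-Data training sample.
# Field names and values are hypothetical; consult the released dataset for the actual schema.
tfq_sample = {
    "image": "ii_bench/0042.jpg",  # metaphorical image drawn from II-Bench
    "statement": "The wilting flower symbolizes the decline of traditional craftsmanship.",
    "answer": "True",              # expert-validated binary label
    "implication": "The image contrasts mass production with fading handmade craft.",
}
```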

3. TFQ-GRPO: Visual Reinforcement Learning

TFQ-GRPO (Group Relative Policy Optimization for True-False Questions) is our specialized visual RL algorithm.

Structured Output Format: We enforce a strict structure separating reasoning from judgment:
<think> [reasoning] </think> <answer> [True/False] </answer>
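
As a minimal sketch (assuming a standard Python regex check; the exact parser used in training is not specified here), the enforced structure can be validated and split like this:

```python
import re

# Minimal sketch of parsing the enforced <think>...</think> <answer>...</answer> format.
PATTERN = re.compile(
    r"^\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>True|False)</answer>\s*$",
    re.DOTALL,
)

def parse_response(text: str):
    """Return (reasoning, answer) if the output is well-formed, otherwise None."""
    match = PATTERN.match(text)
    if match is None:
        return None
    return match.group("think").strip(), match.group("answer")
```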

Multi-Component Reward Function:
\[R_{\text{total}} = R_{\text{accuracy}} + \lambda_{\text{format}} \cdot R_{\text{format}}\]
Correctness is rewarded based on the binary answer, while the format reward ensures structural compliance.
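
A hedged sketch of how this reward could be computed is shown below; the weight on the format term and the 0/1 scoring are assumptions, not the exact values used in training.

```python
# Sketch of the multi-component TFQ reward; LAMBDA_FORMAT and the 0/1 scoring are assumptions.
LAMBDA_FORMAT = 0.5  # hypothetical value of lambda_format

def tfq_reward(parsed, gold_answer: str) -> float:
    """parsed: (reasoning, answer) from a format parser such as parse_response above, or None."""
    r_format = 1.0 if parsed is not None else 0.0
    r_accuracy = 1.0 if parsed is not None and parsed[1] == gold_answer else 0.0
    return r_accuracy + LAMBDA_FORMAT * r_format
```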

Group Relative Optimization: We sample \(K\) diverse outputs for each question and optimize the policy based on the relative advantage of each output compared to the group average.
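
As in standard GRPO, the advantage of each sampled output is its reward standardized against the group statistics; a minimal sketch (the policy-gradient update itself is omitted):

```python
import statistics

def group_relative_advantages(rewards):
    """Standardize each reward against the group mean and std (standard GRPO advantage)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-6) for r in rewards]

# Example: K = 4 sampled answers for one TFQ, with rewards from tfq_reward above.
advantages = group_relative_advantages([1.5, 0.5, 1.5, 0.0])
```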

MetaphorStar Model Family

We introduce the MetaphorStar family, which comprises three sizes: 3B, 7B, and 32B, all built on the Qwen2.5-VL series as base models.

Model Base Model Size Link
MetaphorStar-3B Qwen2.5-VL-3B 3B HuggingFace
MetaphorStar-7B Qwen2.5-VL-7B 7B HuggingFace
MetaphorStar-32B Qwen2.5-VL-32B 32B HuggingFace
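
Assuming the released checkpoints keep the standard Qwen2.5-VL architecture on HuggingFace (the repository ID below is a placeholder, not the actual release name), a model can be loaded roughly as follows:

```python
# Hypothetical loading sketch; "ORG/MetaphorStar-7B" is a placeholder repo ID, and this
# assumes the checkpoints keep the standard Qwen2.5-VL architecture and processor.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "ORG/MetaphorStar-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
```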

Analyzing Token Entropy in Reasoning

To gain insight into the internal reasoning mechanisms of our model, we analyze its token-level generation entropy. Figure 3 provides a visualization of this entropy as MetaphorStar-7B generates responses for the TFQ, MCQ, and OSQ tasks.

Token Entropy Visualization
Figure 3: The visualization of token entropy for MetaphorStar-7B on TFQ, MCQ, and OSQ. High-entropy (red) indicates high uncertainty, while low-entropy (blue) indicates high confidence.
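
For reference, per-token generation entropy of this kind can be computed from the model's per-step logits; a rough sketch (assuming a HuggingFace-style generate() call with output_scores=True, not necessarily the exact procedure used for Figure 3):

```python
import torch

def token_entropies(step_scores):
    """Per-token entropy (in nats) from a list of [batch, vocab] logit tensors, one per generated token."""
    ents = []
    for logits in step_scores:
        probs = torch.softmax(logits.float(), dim=-1)
        ents.append(-(probs * probs.clamp_min(1e-12).log()).sum(dim=-1))
    return torch.stack(ents, dim=1)  # [batch, num_generated_tokens]
```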

Our analysis reveals that high-entropy tokens, the points of highest uncertainty for the model, are not randomly distributed. This aligns with recent findings that a "high-entropy minority" of tokens is critical for complex reasoning. In the context of image implication, we observe that these spikes in uncertainty consistently occur at crucial semantic and logical junctions.

Specifically, the model exhibits high entropy when generating logical connectors (e.g., "therefore", "thus", "but") that pivot the argument or establish a causal link. We also note high entropy for key function words (e.g., "the", "is"), quantifiers, and pronouns, suggesting that the model's core cognitive effort is concentrated on making definitive logical leaps and structuring the relationship between concepts. Conversely, low-entropy (high-confidence) tokens are typically associated with reproducing factual details from the image or completing deterministic phrasal structures.

Experiments & Results

Main Results on Metaphor Benchmarks

State-of-the-Art Performance: MetaphorStar-32B achieves SOTA on all major benchmarks.

Main Results
Figure 2: Main performance comparison. MetaphorStar-32B outperforms Gemini-3.0-Pro on several metrics.

General Visual Reasoning

Does learning metaphor help general vision? YES.
We evaluated MetaphorStar on general benchmarks (MMBench, MathVista, MMVet). The results show that training on implication tasks also improves performance on general complex reasoning tasks, suggesting that "Image Metaphor Understanding" serves as a high-level cognitive workout for MLLMs.

General Results
Figure 4: Performance on general vision benchmarks.

Ablation Study

We conduct comprehensive ablation studies to validate our design choices and understand the factors contributing to MetaphorStar's success. Our analysis covers four critical dimensions: model scale, data scale, architectural choices, and training strategies.

1. Model & Data Scaling

Model Parameter Scaling: We observed consistent performance gains scaling from 3B to 32B. Larger models benefit more significantly from the RL stage, showing emergent reasoning capabilities with longer CoT paths.

Model Scaling
Figure 5: Model parameter scaling from 3B to 32B.

Training Data Scaling: Scaling TFQ-Data from 1k to 14k samples shows a log-linear improvement in reasoning accuracy. The diversity of the dataset (covering politics, art, humor) is crucial for preventing overfitting to specific visual styles.

Data Scaling
Figure 6: Training data scaling from 1k to 14k TFQ samples.

2. Different Model Architecture

We validated our framework across LLaVA architectures. TFQ-GRPO proves to be model-agnostic, consistently improving the reasoning baseline of different backbones.

Different Architecture
Figure 7: Ablation on Different Model Architectures.

3. Different Training Strategy

We explore the impact of different training strategies by comparing three approaches: TFQ-SFT (SFT only), TFQ-SFT + TFQ-GRPO (SFT warmup + RL), and TFQ-GRPO (End-to-end RL).
Counterintuitively, SFT warmup actively harms performance. End-to-end RL (TFQ-GRPO) achieves the best results on both TFQ and MCQ, while SFT-involving strategies cause a catastrophic collapse on MCQ (46% → 28%), indicating severe generalization damage from the "SFT Curse" and entropy collapse.

Different Training Strategy
Figure 8: Comparison of Different Training Strategies.

Discussion & Key Insights

The "SFT Curse" in Visual Metaphor Reasoning

Our analysis reveals a critical finding for reasoning-heavy visual tasks: SFT warmup is not only unnecessary but actively detrimental.

Conclusion: End-to-end RL (TFQ-GRPO) leverages high initial entropy for global optimization, proving superior for open-ended reasoning tasks.

Conclusion

To conclude, we present MetaphorStar, a pioneering framework that introduces visual reinforcement learning to the domain of image metaphor understanding. By establishing the TFQ paradigm and the TFQ-GRPO algorithm, we successfully bridge the gap between subjective visual interpretation and objective RL optimization.

Our release includes the MetaphorStar model family, the large-scale TFQ-Data, and the rigorous TFQ-Bench. Crucially, our findings demonstrate that learning to reason about metaphors serves as a high-level cognitive workout, enhancing general visual reasoning capabilities. We hope our open-source contribution will inspire further research into reasoning-based visual learning and the cognitive depths of MLLMs.

Citation

@article{zhang2026metaphorstar,
  title={MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning}, 
  author={Chenhao Zhang and Yazhe Niu and Hongsheng Li},
  journal={arXiv preprint arXiv:xxx},
  year={2026}
}
            

This website is adapted from VISIONx @NYU, licensed under Creative Commons Attribution-ShareAlike 4.0 International License.