Flow-GRPO successfully applies reinforcement learning to flow models, but uses uniform credit assignment across all steps. This ignores the temporal structure of diffusion generation: early steps determine composition and content (low-frequency structure), while late steps resolve details and textures (high-frequency details). Moreover, assigning uniform credit based solely on the final image can inadvertently reward suboptimal intermediate steps, especially when errors are corrected later in the diffusion trajectory. We propose Stepwise-Flow-GRPO, which assigns credit based on each step's reward improvement. By leveraging Tweedie's formula to obtain intermediate reward estimates and introducing gain-based advantages, our method achieves superior sample efficiency and faster convergence. We also introduce a DDIM-inspired SDE that improves reward quality while preserving stochasticity for policy gradients.
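The two ingredients above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the common rectified-flow convention `x_t = (1 - t) * x0 + t * eps` with velocity target `v = eps - x0` (under which the one-step denoised estimate plays the role of Tweedie's formula), and the helper names `x0_from_velocity` and `gain_advantages` are hypothetical.

```python
import numpy as np

def x0_from_velocity(x_t, v, t):
    # Flow-matching analogue of Tweedie's formula (assumed rectified-flow
    # convention): with x_t = (1 - t) * x0 + t * eps and v = eps - x0,
    # the one-step denoised estimate is x0_hat = x_t - t * v.
    return x_t - t * v

def gain_advantages(rewards_per_step):
    # rewards_per_step: (num_samples, num_steps + 1) array of intermediate
    # reward estimates r(x0_hat) evaluated along each denoising trajectory.
    # Each step is credited with its reward *gain*, then gains are
    # normalized across the sample group, GRPO-style.
    gains = np.diff(rewards_per_step, axis=1)            # (N, num_steps)
    mean = gains.mean(axis=0, keepdims=True)
    std = gains.std(axis=0, keepdims=True) + 1e-8
    return (gains - mean) / std
```

Feeding `x0_from_velocity` outputs through the reward model at each step, then differencing with `gain_advantages`, replaces Flow-GRPO's single trajectory-level advantage with one advantage per step.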
Stepwise-Flow-GRPO consistently outperforms Flow-GRPO in reward per training step across all settings.
Despite the additional computation required for intermediate denoising estimates, our method still converges faster in wall-clock time.
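The DDIM-inspired SDE mentioned above trades off sample quality against the stochasticity that policy gradients require. A standard way to expose that trade-off is DDIM's η knob, sketched below in VP-diffusion notation; this illustrates the mechanism only and is not the paper's exact flow-model SDE (the function name `ddim_style_step` and the toy schedule values are assumptions).

```python
import numpy as np

def ddim_style_step(x_t, x0_hat, alpha_t, alpha_prev, eta, rng):
    # DDIM-style update with stochasticity knob eta:
    #   eta = 0 -> deterministic DDIM/ODE step (highest sample quality),
    #   eta = 1 -> ancestral DDPM-like SDE step (maximal exploration noise).
    # Intermediate eta preserves enough stochasticity for policy gradients
    # while keeping samples close to the deterministic trajectory.
    sigma = eta * np.sqrt((1 - alpha_prev) / (1 - alpha_t)) \
                * np.sqrt(1 - alpha_t / alpha_prev)
    eps_hat = (x_t - np.sqrt(alpha_t) * x0_hat) / np.sqrt(1 - alpha_t)
    dir_xt = np.sqrt(np.maximum(1 - alpha_prev - sigma**2, 0.0)) * eps_hat
    noise = sigma * rng.standard_normal(x_t.shape)
    return np.sqrt(alpha_prev) * x0_hat + dir_xt + noise
```

With η = 0 the step is fully deterministic, so repeated rollouts collapse to one trajectory and the policy gradient has no signal; any η > 0 restores a proper log-likelihood for each transition.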
| Model | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Bind. |
|---|---|---|---|---|---|---|---|
| Pretrained Models | | | | | | | |
| SD3.5-M (cfg=1.0) | 0.28 | 0.71 | 0.23 | 0.15 | 0.45 | 0.05 | 0.08 |
| SD3.5-M (cfg=4.5) | 0.63 | 0.98 | 0.78 | 0.50 | 0.81 | 0.24 | 0.52 |
| Standard Training Duration | | | | | | | |
| Flow-GRPO (cfg=1.0, PickScore) | 0.60 | 0.96 | 0.73 | 0.67 | 0.67 | 0.21 | 0.35 |
| Ours (cfg=1.0, PickScore) | 0.60 | 0.96 | 0.75 | 0.67 | 0.67 | 0.21 | 0.34 |
| Flow-GRPO (cfg=4.5, PickScore) | 0.68 | 0.98 | 0.82 | 0.64 | 0.82 | 0.24 | 0.59 |
| Ours (cfg=4.5, PickScore) | 0.71 | 0.98 | 0.85 | 0.70 | 0.82 | 0.29 | 0.59 |
| Extended Training | | | | | | | |
| Flow-GRPO (cfg=4.5, GenEval, 400 GPU hrs) | 0.72 | — | — | — | — | — | — |
| Ours (cfg=4.5, UnifiedReward, 60 GPU hrs) | 0.74 | 0.99 | 0.89 | 0.73 | 0.83 | 0.34 | 0.66 |
| Ours (cfg=4.5, GenEval, 400 GPU hrs) | 0.87 | 0.99 | 0.93 | 0.89 | 0.87 | 0.73 | 0.80 |
| Reference: State-of-the-art Models | | | | | | | |
| Janus-Pro-7B | 0.80 | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 |
| SANA-1.5 4.8B | 0.81 | 0.99 | 0.93 | 0.86 | 0.84 | 0.59 | 0.65 |
| GPT-4o | 0.84 | 0.99 | 0.92 | 0.85 | 0.92 | 0.75 | 0.61 |
@inproceedings{savani2026stepwise,
title = {Stepwise Credit Assignment for GRPO on Flow-Matching Models},
author = {Savani, Yash and Kveton, Branislav and Liu, Yuchen and Wang, Yilin and Shi, Jing and Mukherjee, Subhojyoti and Vlassis, Nikos and Singh, Krishna Kumar},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}