Reconstruction Alignment Improves Unified Multimodal Models
Abstract
Reconstruction Alignment (RecA) is a resource-efficient post-training method that improves unified multimodal models by using visual understanding embeddings as dense prompts for self-supervised image reconstruction, boosting generation and editing fidelity.
Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73 → 0.90) and DPGBench (80.93 → 88.15), while also boosting editing benchmarks (ImgEdit 3.38 → 3.75, GEdit 6.94 → 7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
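For readers who want the mechanics at a glance, below is a minimal PyTorch sketch of the RecA training loop. All names (`UnderstandingEncoder`, `Generator`, `reca_step`), the plain MSE reconstruction loss, and the choice to freeze the encoder are illustrative assumptions, not the paper's released code; a real UMM would plug its own understanding encoder and generative objective (e.g., diffusion or token prediction) into the same loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnderstandingEncoder(nn.Module):
    """Stand-in for a frozen visual understanding encoder (e.g., a CLIP/SigLIP-style ViT)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.patchify(images)              # (B, D, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)    # (B, N, D) dense patch embeddings


class Generator(nn.Module):
    """Toy decoder standing in for the UMM's generation branch."""
    def __init__(self, dim: int = 256, patch: int = 16):
        super().__init__()
        self.patch = patch
        self.to_pixels = nn.Linear(dim, patch * patch * 3)

    def forward(self, cond_tokens: torch.Tensor) -> torch.Tensor:
        B, N, _ = cond_tokens.shape
        side = int(N ** 0.5)                       # assume a square patch grid
        p = self.patch
        patches = self.to_pixels(cond_tokens).view(B, side, side, 3, p, p)
        return patches.permute(0, 3, 1, 4, 2, 5).reshape(B, 3, side * p, side * p)


def reca_step(encoder, generator, images, optimizer):
    """One RecA-style post-training step: reconstruct the input image conditioned on
    its own understanding embeddings (the dense "prompt") instead of a caption."""
    with torch.no_grad():
        dense_prompt = encoder(images)             # understanding embeddings; no text is used
    recon = generator(dense_prompt)                # generation branch conditioned on embeddings
    loss = F.mse_loss(recon, images)               # placeholder self-supervised reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    encoder, generator = UnderstandingEncoder(), Generator()
    optimizer = torch.optim.AdamW(generator.parameters(), lr=1e-4)
    images = torch.rand(2, 3, 64, 64)              # stand-in image batch
    print(reca_step(encoder, generator, images, optimizer))
```

The key point the sketch illustrates is that the conditioning signal is the model's own dense visual embeddings rather than a caption, so the reconstruction objective requires no text annotations.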
Community
BAGEL demo https://huggingface.co/spaces/sanaka87/BAGEL-RecA
Current unified multimodal models (UMMs) are trained on image–text pairs, which often leads to generation lagging behind understanding. We introduce Reconstruction Alignment (RecA), which directly leverages dense features from the model’s own visual understanding encoder as prompts, enabling the UMM to self-supervise through image reconstruction and thereby align understanding and generation at the semantic level.
With only 27 GPU hours, RecA substantially improves UMMs such as BAGEL and Harmon across autoregressive, masked-autoregressive, and diffusion frameworks.
- For the fine-tuned Harmon model, GenEval improves from 0.73 to 0.90 and DPGBench from 80.93 to 88.15.
- For the fine-tuned BAGEL model, image editing benchmarks also improve significantly: ImgEdit from 3.38 to 3.75 and GEdit from 6.94 to 7.25.
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing (2025)
- UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding (2025)
- Visual Representation Alignment for Multimodal Large Language Models (2025)
- Interleaving Reasoning for Better Text-to-Image Generation (2025)
- Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation (2025)
- Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents (2025)
- Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model (2025)