arxiv:2509.07295

Reconstruction Alignment Improves Unified Multimodal Models

Published on Sep 8 · Submitted by sanaka87 on Sep 10
Authors: Ji Xie, et al.

AI-generated summary

Reconstruction Alignment (RecA) is a post-training method that enhances multimodal models by using visual embeddings as dense prompts, improving image generation and editing fidelity.

Abstract

Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details--even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73 → 0.90) and DPGBench (80.93 → 88.15), while also boosting editing benchmarks (ImgEdit 3.38 → 3.75, GEdit 6.94 → 7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
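Schematically (the notation below is ours, not the paper's): if $E_{\text{und}}$ denotes the UMM's visual understanding encoder and $G$ its generation pathway conditioned on prompt embeddings, RecA post-trains the model on an objective of the form

$$\mathcal{L}_{\text{RecA}}(x) \;=\; \ell\big(G(E_{\text{und}}(x)),\, x\big),$$

where $x$ is an input image and $\ell$ is the self-supervised reconstruction loss appropriate to the UMM's generation head (e.g. a token-level or diffusion loss).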

Community

Paper author and submitter

BAGEL demo: https://huggingface.co/spaces/sanaka87/BAGEL-RecA

Current unified multimodal models (UMMs) are trained on image–text pairs, which often leads to generation lagging behind understanding. We introduce Reconstruction Alignment (RecA), which directly leverages dense features from the model’s own visual understanding encoder as prompts, enabling the UMM to self-supervise through image reconstruction and thereby align understanding and generation at the semantic level.
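To make the mechanism concrete, here is a minimal, self-contained PyTorch sketch of a RecA-style reconstruction step. The `UnderstandingEncoder` and `Generator` classes, the toy 32×32 image size, and the plain MSE loss are illustrative placeholders standing in for the UMM's actual understanding encoder, generation pathway, and generation-specific reconstruction objective; this is not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnderstandingEncoder(nn.Module):
    """Toy stand-in for the UMM's visual understanding encoder:
    maps an image to a sequence of dense embeddings."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=8, stride=8)  # 32x32 -> 4x4 = 16 tokens

    def forward(self, img):                        # img: (B, 3, 32, 32)
        feat = self.proj(img)                      # (B, dim, 4, 4)
        return feat.flatten(2).transpose(1, 2)     # (B, 16, dim) dense "prompt" tokens

class Generator(nn.Module):
    """Toy stand-in for the generation pathway:
    reconstructs an image from prompt embeddings."""
    def __init__(self, dim=256):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(),
            nn.Linear(dim, 3 * 8 * 8),             # each token -> one 8x8 image patch
        )

    def forward(self, prompt_tokens):              # (B, 16, dim)
        B, N, _ = prompt_tokens.shape
        patches = self.decode(prompt_tokens)       # (B, 16, 192)
        patches = patches.view(B, 4, 4, 3, 8, 8)
        return patches.permute(0, 3, 1, 4, 2, 5).reshape(B, 3, 32, 32)

encoder = UnderstandingEncoder().eval()            # understanding side kept frozen
for p in encoder.parameters():
    p.requires_grad_(False)
generator = Generator()
opt = torch.optim.AdamW(generator.parameters(), lr=1e-4)

# One RecA-style post-training step: condition generation on the model's own
# understanding embeddings (used as a dense prompt) and reconstruct the input.
images = torch.rand(4, 3, 32, 32)                  # placeholder batch
with torch.no_grad():
    dense_prompt = encoder(images)                 # visual embeddings as the "text prompt"
recon = generator(dense_prompt)
loss = F.mse_loss(recon, images)                   # self-supervised reconstruction loss
opt.zero_grad()
loss.backward()
opt.step()
```

In the actual method, the generator is the full UMM generation pathway and the loss is the one native to its decoder (autoregressive, masked-autoregressive, or diffusion), with the dense understanding embeddings taking the place of caption-derived text embeddings.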

With only 27 GPU hours, RecA substantially improves UMMs such as BAGEL and Harmon across autoregressive, masked-autoregressive, and diffusion frameworks.

  • For the fine-tuned Harmon model, GenEval improves from 0.73 → 0.90 and DPGBench from 80.93 → 88.15.
  • For the fine-tuned BAGEL model, image editing benchmarks also improve significantly: ImgEdit: 3.38 → 3.75, GEdit: 6.94 → 7.25.
