Reconstruction Alignment Improves Unified Multimodal Models
Abstract
Reconstruction Alignment (RecA) is a resource-efficient post-training method that improves unified multimodal models by using visual understanding embeddings as dense prompts for self-supervised image reconstruction, boosting generation and editing fidelity.
Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73 → 0.90) and DPGBench (80.93 → 88.15), while also boosting editing benchmarks (ImgEdit 3.38 → 3.75, GEdit 6.94 → 7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
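For readers who want the mechanics at a glance, below is a minimal PyTorch sketch of the RecA training loop. All names (`UnderstandingEncoder`, `Generator`, `reca_step`), the plain MSE reconstruction loss, and the choice to freeze the encoder are illustrative assumptions, not the paper's released code; a real UMM would plug its own understanding encoder and generative objective (e.g., diffusion or token prediction) into the same loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnderstandingEncoder(nn.Module):
    """Stand-in for a frozen visual understanding encoder (e.g., a CLIP/SigLIP-style ViT)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.patchify(images)              # (B, D, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)    # (B, N, D) dense patch embeddings


class Generator(nn.Module):
    """Toy decoder standing in for the UMM's generation branch."""
    def __init__(self, dim: int = 256, patch: int = 16):
        super().__init__()
        self.patch = patch
        self.to_pixels = nn.Linear(dim, patch * patch * 3)

    def forward(self, cond_tokens: torch.Tensor) -> torch.Tensor:
        B, N, _ = cond_tokens.shape
        side = int(N ** 0.5)                       # assume a square patch grid
        p = self.patch
        patches = self.to_pixels(cond_tokens).view(B, side, side, 3, p, p)
        return patches.permute(0, 3, 1, 4, 2, 5).reshape(B, 3, side * p, side * p)


def reca_step(encoder, generator, images, optimizer):
    """One RecA-style post-training step: reconstruct the input image conditioned on
    its own understanding embeddings (the dense "prompt") instead of a caption."""
    with torch.no_grad():
        dense_prompt = encoder(images)             # understanding embeddings; no text is used
    recon = generator(dense_prompt)                # generation branch conditioned on embeddings
    loss = F.mse_loss(recon, images)               # placeholder self-supervised reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    encoder, generator = UnderstandingEncoder(), Generator()
    optimizer = torch.optim.AdamW(generator.parameters(), lr=1e-4)
    images = torch.rand(2, 3, 64, 64)              # stand-in image batch
    print(reca_step(encoder, generator, images, optimizer))
```

The key point the sketch illustrates is that the conditioning signal is the model's own dense visual embeddings rather than a caption, so the reconstruction objective requires no text annotations.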
Community
BAGEL demo https://huggingface.co/spaces/sanaka87/BAGEL-RecA
Current unified multimodal models (UMMs) are trained on image–text pairs, which often leads to generation lagging behind understanding. We introduce Reconstruction Alignment (RecA), which directly leverages dense features from the model’s own visual understanding encoder as prompts, enabling the UMM to self-supervise through image reconstruction and thereby align understanding and generation at the semantic level.
With only 27 GPU hours, RecA substantially improves UMMs such as BAGEL and Harmon across autoregressive, masked-autoregressive, and diffusion frameworks.
- For the fine-tuned Harmon model, GenEval improves from 0.73 to 0.90 and DPGBench from 80.93 to 88.15.
- For the fine-tuned BAGEL model, image editing benchmarks also improve significantly: ImgEdit from 3.38 to 3.75 and GEdit from 6.94 to 7.25.
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing (2025)
- UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding (2025)
- Visual Representation Alignment for Multimodal Large Language Models (2025)
- Interleaving Reasoning for Better Text-to-Image Generation (2025)
- Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation (2025)
- Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents (2025)
- Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model (2025)