Universal Few-Shot Spatial Control for Diffusion Models (UFC)

This repository presents Universal Few-Shot Spatial Control for Diffusion Models (UFC), a versatile few-shot control adapter for generalizing to novel spatial conditions in text-to-image diffusion models. Our method is applicable to both UNet and DiT diffusion backbones.

The model was presented in the paper Universal Few-Shot Spatial Control for Diffusion Models.

Official code and more details can be found at the GitHub repository.

Abstract

Spatial conditioning in pretrained text-to-image diffusion models has significantly improved fine-grained control over the structure of generated images. However, existing control adapters exhibit limited adaptability and incur high training costs when encountering novel spatial control conditions that differ substantially from the training tasks. To address this limitation, we propose Universal Few-Shot Control (UFC), a versatile few-shot control adapter capable of generalizing to novel spatial conditions. Given a few image-condition pairs of an unseen task and a query condition, UFC leverages the analogy between query and support conditions to construct task-specific control features, instantiated by a matching mechanism and an update on a small set of task-specific parameters. Experiments on six novel spatial control tasks show that UFC, fine-tuned with only 30 annotated examples of novel tasks, achieves fine-grained control consistent with the spatial conditions. Notably, when fine-tuned with 0.1% of the full training data, UFC achieves competitive performance with the fully supervised baselines in various control tasks. We also show that UFC is applicable agnostically to various diffusion backbones and demonstrate its effectiveness on both UNet and DiT architectures.

Model Checkpoints

This project provides various model checkpoints for UFC with both UNet and DiT backbones, fine-tuned for different few-shot tasks.

Checkpoints using UNet backbone

| Few-shot Task | Few-shot (30-shot) Fine-tuned Model | Base Meta-trained Model | Description |
|---|---|---|---|
| Canny | UNet_canny | UNet_taskgr23 | The base model is meta-trained on 4 tasks: Depth, Normal, Pose, Densepose |
| HED | UNet_hed | UNet_taskgr23 | The base model is meta-trained on 4 tasks: Depth, Normal, Pose, Densepose |
| Depth | UNet_depth | UNet_taskgr13 | The base model is meta-trained on 4 tasks: Canny, HED, Pose, Densepose |
| Normal | UNet_normal | UNet_taskgr13 | The base model is meta-trained on 4 tasks: Canny, HED, Pose, Densepose |
| Pose | UNet_pose | UNet_taskgr12 | The base model is meta-trained on 4 tasks: Canny, HED, Depth, Normal |
| Densepose | UNet_densepose | UNet_taskgr12 | The base model is meta-trained on 4 tasks: Canny, HED, Depth, Normal |

Checkpoints using DiT backbone

| Few-shot Task | Few-shot (30-shot) Fine-tuned Model | Base Meta-trained Model | Description |
|---|---|---|---|
| Canny | DiT_canny | DiT_taskgr23 | The base model is meta-trained on 4 tasks: Depth, Normal, Pose, Densepose |
| HED | DiT_hed | DiT_taskgr23 | The base model is meta-trained on 4 tasks: Depth, Normal, Pose, Densepose |
| Depth | DiT_depth | DiT_taskgr13 | The base model is meta-trained on 4 tasks: Canny, HED, Pose, Densepose |
| Normal | DiT_normal | DiT_taskgr13 | The base model is meta-trained on 4 tasks: Canny, HED, Pose, Densepose |
| Pose | DiT_pose | DiT_taskgr12 | The base model is meta-trained on 4 tasks: Canny, HED, Depth, Normal |
| Densepose | DiT_densepose | DiT_taskgr12 | The base model is meta-trained on 4 tasks: Canny, HED, Depth, Normal |
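The checkpoints above follow a leave-one-group-out pattern: the six tasks appear to form three pairs, with the `taskgr` suffix naming the two pairs the base model was meta-trained on, so each few-shot task's base model excludes the pair containing that task. A minimal sketch of this naming scheme (the group numbering — 1: Canny/HED, 2: Depth/Normal, 3: Pose/Densepose — is inferred from the checkpoint names; this helper is illustrative and not part of the released code):

```python
# Inferred task-pair groupings behind the "taskgr" checkpoint suffixes.
# This mapping is an assumption reconstructed from the tables above.
TASK_GROUPS = {
    1: ["Canny", "HED"],
    2: ["Depth", "Normal"],
    3: ["Pose", "Densepose"],
}

def base_checkpoint(task: str, backbone: str = "UNet") -> str:
    """Return the base meta-trained checkpoint name for a few-shot task.

    The group containing `task` is held out; the suffix concatenates the
    remaining group numbers, e.g. base_checkpoint("Canny") -> "UNet_taskgr23".
    """
    held_out = next(g for g, tasks in TASK_GROUPS.items() if task in tasks)
    kept = "".join(str(g) for g in sorted(TASK_GROUPS) if g != held_out)
    return f"{backbone}_taskgr{kept}"
```

For example, `base_checkpoint("Depth", backbone="DiT")` yields `DiT_taskgr13`, matching the Depth row of the DiT table.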