Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps Paper • 2501.09732 • Published Jan 16, 2025 • 72
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks Paper • 2410.10563 • Published Oct 14, 2024 • 39
Subject-driven Text-to-Image Generation via Apprenticeship Learning Paper • 2304.00186 • Published Apr 1, 2023
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? Paper • 2406.13121 • Published Jun 19, 2024 • 2
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions Paper • 2403.19651 • Published Mar 28, 2024 • 23
UniIR: Training and Benchmarking Universal Multimodal Information Retrievers Paper • 2311.17136 • Published Nov 28, 2023 • 7
From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces Paper • 2306.00245 • Published May 31, 2023
Instruct-Imagen: Image Generation with Multi-modal Instruction Paper • 2401.01952 • Published Jan 3, 2024 • 32
Gemini: A Family of Highly Capable Multimodal Models Paper • 2312.11805 • Published Dec 19, 2023 • 47
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding Paper • 2210.03347 • Published Oct 7, 2022 • 3
Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities Paper • 2302.11154 • Published Feb 22, 2023 • 1
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? Paper • 2302.11713 • Published Feb 23, 2023 • 1
PaLI-X: On Scaling up a Multilingual Vision and Language Model Paper • 2305.18565 • Published May 29, 2023 • 3