hexianghu (Hexiang Hu)

authored a paper 8 months ago

Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

Paper • 2501.09732 • Published Jan 16, 2025 • 72

authored a paper 11 months ago

MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

Paper • 2410.10563 • Published Oct 14, 2024 • 39

authored 3 papers about 1 year ago

authored 7 papers over 1 year ago

MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions

Paper • 2403.19651 • Published Mar 28, 2024 • 23

Re-Imagen: Retrieval-Augmented Text-to-Image Generator

Paper • 2209.14491 • Published Sep 29, 2022

UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

Paper • 2311.17136 • Published Nov 28, 2023 • 7

From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces

Paper • 2306.00245 • Published May 31, 2023

PreSTU: Pre-Training for Scene-Text Understanding

Paper • 2209.05534 • Published Sep 12, 2022

Instruct-Imagen: Image Generation with Multi-modal Instruction

Paper • 2401.01952 • Published Jan 3, 2024 • 32

Gemini: A Family of Highly Capable Multimodal Models

Paper • 2312.11805 • Published Dec 19, 2023 • 47

authored 3 papers almost 2 years ago

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

Paper • 2210.03347 • Published Oct 7, 2022 • 3

Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities

Paper • 2302.11154 • Published Feb 22, 2023 • 1

Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?

Paper • 2302.11713 • Published Feb 23, 2023 • 1

authored a paper over 2 years ago

PaLI-X: On Scaling up a Multilingual Vision and Language Model

Paper • 2305.18565 • Published May 29, 2023 • 3

Subject-driven Text-to-Image Generation via Apprenticeship Learning

Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

Imagen 3
