---
title: OWLv2 Visual Prompting
short_description: OWLv2 zero-shot detection with visual prompt
emoji: 👀
sdk: gradio
app_file: app.py
models:
  - google/owlv2-base-patch16-ensemble
  - google/owlv2-large-patch14-ensemble
---

# OWLv2: Zero-Shot Object Detection with Visual Prompting

This demo showcases the OWLv2 model's ability to perform zero-shot object detection using both text and visual prompts. More importantly, it compares different approaches for selecting a query embedding from a visual prompt. The method used in Hugging Face's `transformers` library often underperforms because of how the visual prompt embedding is selected.

## The Problem with the Default Method

The standard implementation in `transformers` (via `model.image_guided_detection`) selects an embedding from the prompt image using the `embed_image_query` heuristic: it favors boxes whose IoU with the full prompt image area is high, and among those candidates picks the embedding farthest from the mean of all box embeddings.
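
For reference, a minimal sketch of the stock pipeline; the file paths are placeholders, and the threshold values are illustrative:

```python
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

target = Image.open("target.jpg").convert("RGB")  # image to search in (placeholder)
prompt = Image.open("prompt.jpg").convert("RGB")  # visual prompt (placeholder)

# The processor resizes and pads both images to squares before encoding.
inputs = processor(images=target, query_images=prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.image_guided_detection(**inputs)

# Rescale the predicted boxes to the target image's (height, width).
target_sizes = torch.tensor([target.size[::-1]])
results = processor.post_process_image_guided_detection(
    outputs=outputs, threshold=0.9, nms_threshold=0.3, target_sizes=target_sizes
)
print(results[0]["boxes"], results[0]["scores"])
```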

However, this selection heuristic does not account for padding: since the prompt image is padded to a square before encoding, the box with the highest IoU against the full image area is often a large box that spans the padded background rather than the object of interest. The result is an irrelevant query embedding and, consequently, poor detection performance on the target image.
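
The sketch below is a simplified paraphrase of that heuristic, not the library's exact code; it assumes normalized xyxy boxes and per-box class embeddings for the prompt image:

```python
import torch
from torchvision.ops import box_iou

def default_style_selection(boxes_xyxy, embeddings, iou_threshold=0.65):
    """Simplified paraphrase of embed_image_query's selection heuristic."""
    # Score each box by overlap with the FULL normalized canvas [0, 0, 1, 1],
    # which includes the padded background.
    full_canvas = torch.tensor([[0.0, 0.0, 1.0, 1.0]])
    ious = box_iou(boxes_xyxy, full_canvas).squeeze(-1)  # (N,)

    candidates = (ious > iou_threshold).nonzero(as_tuple=True)[0]
    if candidates.numel() == 0:
        candidates = ious.argmax().unsqueeze(0)  # fall back to the best overlap

    # Among the candidates, pick the embedding least similar to the mean of
    # all embeddings, i.e. the most "unusual" box.
    mean_embed = embeddings.mean(dim=0)
    sims = embeddings[candidates] @ mean_embed
    return embeddings[candidates[sims.argmin()]]
```

A box covering the whole padded canvas trivially maximizes the IoU term, which is exactly the failure mode described above.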

## An Alternative Approach: Objectness × IoU

This demo implements and compares an alternative method for selecting the query embedding: it picks the box that maximizes the product of the objectness score (predicted by the model) and the box's IoU with the non-padded area of the prompt image. The selected embedding therefore tends to represent the most distinct and prominent object in the prompt image while excluding the padded background.
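
A sketch of this selection under the same assumptions as before (normalized xyxy boxes, per-box objectness logits from the model's outputs, per-box class embeddings), plus the fact that OWLv2's preprocessing pads the shorter side to a square, so the real content occupies the top-left rectangle:

```python
import torch
from torchvision.ops import box_iou

def objectness_iou_selection(boxes_xyxy, objectness_logits, embeddings,
                             prompt_width, prompt_height):
    # The (w, h) prompt image is padded to an s x s square with s = max(w, h),
    # so in normalized coordinates the non-padded area is [0, 0, w/s, h/s].
    s = max(prompt_width, prompt_height)
    nonpad = torch.tensor([[0.0, 0.0, prompt_width / s, prompt_height / s]])

    ious = box_iou(boxes_xyxy, nonpad).squeeze(-1)  # overlap with real content only
    scores = objectness_logits.sigmoid() * ious     # objectness x IoU
    return embeddings[scores.argmax()]
```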

## Results

This space compares the results from both methods. The examples clearly demonstrate that this alternative embedding selection approach provides significantly more accurate and reliable results, often performing on par with text-based prompting.