---
title: OWLv2 Visual Prompting
short_description: OWLv2 zero-shot detection with visual prompt
emoji: 🦉
sdk: gradio
app_file: app.py
models:
  - google/owlv2-base-patch16-ensemble
  - google/owlv2-large-patch14-ensemble
---
# OWLv2: Zero-Shot Object Detection with Visual Prompting
This demo showcases the OWLv2 model's ability to perform zero-shot object detection using both text and visual prompts. More importantly, it compares different approaches for selecting a query embedding from a visual prompt, since the method used in Hugging Face's `transformers` library often underperforms because of how that embedding is selected.
## The Problem with the Default Method
The standard implementation in `transformers` (`model.image_guided_detection`, which calls `embed_image_query` internally) selects an embedding from the prompt image by favoring predicted boxes with high IoU against the full prompt image area and embeddings that are far from the average of the other candidates.

However, the OWLv2 processor pads images to a square before inference, and this selection heuristic does not account for that padding: it often picks the largest box, which can span the padded background. The result is an irrelevant query embedding and, consequently, poor detection performance on the target image.
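For reference, this is roughly how the default path is invoked through the public `transformers` API (a minimal sketch; the file names and thresholds are placeholders):

```python
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

query_image = Image.open("query.jpg")    # image containing the visual prompt
target_image = Image.open("target.jpg")  # image to run detection on

inputs = processor(images=target_image, query_images=query_image, return_tensors="pt")

with torch.no_grad():
    # Internally calls embed_image_query(), which applies the IoU/dissimilarity
    # heuristic described above to pick a single query embedding.
    outputs = model.image_guided_detection(**inputs)

target_sizes = torch.tensor([target_image.size[::-1]])  # (height, width)
results = processor.post_process_image_guided_detection(
    outputs=outputs, threshold=0.9, nms_threshold=0.3, target_sizes=target_sizes
)[0]
print(results["boxes"], results["scores"])
```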
## An Alternative Approach: Objectness × IoU
This demo implements and compares an alternative method for selecting the query embedding: it maximizes the product of the objectness score (predicted by the model) and the box's IoU with the non-padded area of the prompt image. The selected embedding therefore tends to represent the largest, most distinct object in the prompt image while excluding the padded background.
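Below is a minimal sketch of this selection strategy, using only the public outputs of `Owlv2ForObjectDetection` (`objectness_logits`, `pred_boxes`, `class_embeds`). The helper name `select_query_embedding` and the dummy text query are illustrative assumptions, not the exact code used in this Space:

```python
import torch
from PIL import Image
from torchvision.ops import box_iou
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

def select_query_embedding(query_image: Image.Image) -> torch.Tensor:
    """Pick one embedding from the prompt image by maximizing
    objectness * IoU with the non-padded region (illustrative sketch)."""
    # A dummy text label is passed only because the forward pass requires
    # input_ids; its embedding plays no role in the selection below.
    inputs = processor(text=[["object"]], images=query_image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    boxes = outputs.pred_boxes[0]        # (num_patches, 4), cxcywh, normalized to the padded square
    objectness = outputs.objectness_logits[0].sigmoid()  # (num_patches,)
    embeds = outputs.class_embeds[0]     # (num_patches, embed_dim)

    # The OWLv2 processor pads the image to a square (content in the
    # top-left), so the real content covers only part of the unit square.
    w, h = query_image.size
    side = max(w, h)
    content_box = torch.tensor([[0.0, 0.0, w / side, h / side]])  # xyxy

    # Convert cxcywh -> xyxy and score every box against the content area.
    xyxy = torch.cat([boxes[:, :2] - boxes[:, 2:] / 2,
                      boxes[:, :2] + boxes[:, 2:] / 2], dim=-1)
    ious = box_iou(xyxy, content_box).squeeze(-1)  # (num_patches,)

    best = (objectness * ious).argmax()
    return embeds[best]
```

The returned embedding can then be scored against the target image's patch embeddings, just as `image_guided_detection` does with the query it selects.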
## Results
This Space compares the results from both methods side by side. The examples demonstrate that the alternative embedding selection yields significantly more accurate and reliable detections, often performing on par with text-based prompting.