---
title: OWLv2 Visual Prompting
short_description: OWLv2 zero-shot detection with visual prompt
emoji: 🦉
sdk: gradio
app_file: app.py
models:
  - google/owlv2-base-patch16-ensemble
  - google/owlv2-large-patch14-ensemble
---
# OWLv2: Zero-Shot Object Detection with Visual Prompting
This demo showcases the OWLv2 model's ability to perform zero-shot object detection using both text and visual prompts. More importantly, it compares different approaches for selecting a query embedding from a visual prompt, since the method used in Hugging Face's `transformers` library often underperforms because of how that embedding is selected.
## The Problem with the Default Method
The standard implementation in `transformers` (`model.image_guided_detection`, which calls `embed_image_query` internally) selects an embedding from the prompt image by favoring predicted boxes with high IoU against the full prompt image area and embeddings that are far from the average of the other candidates.

However, the OWLv2 processor pads images to a square before inference, and this selection heuristic does not account for that padding: it often picks the largest box, which can span the padded background. The result is an irrelevant query embedding and, consequently, poor detection performance on the target image.
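For reference, this is roughly how the default path is invoked through the public `transformers` API (a minimal sketch; the file names and thresholds are placeholders):

```python
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

query_image = Image.open("query.jpg")    # image containing the visual prompt
target_image = Image.open("target.jpg")  # image to run detection on

inputs = processor(images=target_image, query_images=query_image, return_tensors="pt")

with torch.no_grad():
    # Internally calls embed_image_query(), which applies the IoU/dissimilarity
    # heuristic described above to pick a single query embedding.
    outputs = model.image_guided_detection(**inputs)

target_sizes = torch.tensor([target_image.size[::-1]])  # (height, width)
results = processor.post_process_image_guided_detection(
    outputs=outputs, threshold=0.9, nms_threshold=0.3, target_sizes=target_sizes
)[0]
print(results["boxes"], results["scores"])
```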
## An Alternative Approach: Objectness × IoU
This demo implements and compares an alternative method for selecting the query embedding: it maximizes the product of the objectness score (predicted by the model) and the box's IoU with the non-padded area of the prompt image. The selected embedding therefore tends to represent the largest, most distinct object in the prompt image while excluding the padded background.
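Below is a minimal sketch of this selection strategy, using only the public outputs of `Owlv2ForObjectDetection` (`objectness_logits`, `pred_boxes`, `class_embeds`). The helper name `select_query_embedding` and the dummy text query are illustrative assumptions, not the exact code used in this Space:

```python
import torch
from PIL import Image
from torchvision.ops import box_iou
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

def select_query_embedding(query_image: Image.Image) -> torch.Tensor:
    """Pick one embedding from the prompt image by maximizing
    objectness * IoU with the non-padded region (illustrative sketch)."""
    # A dummy text label is passed only because the forward pass requires
    # input_ids; its embedding plays no role in the selection below.
    inputs = processor(text=[["object"]], images=query_image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    boxes = outputs.pred_boxes[0]        # (num_patches, 4), cxcywh, normalized to the padded square
    objectness = outputs.objectness_logits[0].sigmoid()  # (num_patches,)
    embeds = outputs.class_embeds[0]     # (num_patches, embed_dim)

    # The OWLv2 processor pads the image to a square (content in the
    # top-left), so the real content covers only part of the unit square.
    w, h = query_image.size
    side = max(w, h)
    content_box = torch.tensor([[0.0, 0.0, w / side, h / side]])  # xyxy

    # Convert cxcywh -> xyxy and score every box against the content area.
    xyxy = torch.cat([boxes[:, :2] - boxes[:, 2:] / 2,
                      boxes[:, :2] + boxes[:, 2:] / 2], dim=-1)
    ious = box_iou(xyxy, content_box).squeeze(-1)  # (num_patches,)

    best = (objectness * ious).argmax()
    return embeds[best]
```

The returned embedding can then be scored against the target image's patch embeddings, just as `image_guided_detection` does with the query it selects.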
## Results
This Space compares the results from both methods side by side. The examples demonstrate that the alternative embedding selection yields significantly more accurate and reliable detections, often performing on par with text-based prompting.