Spaces: Running on Zero
interesting findings
- README.md +10 -18
- app.py +193 -70
- test-data/prompt5.jpg +3 -0
- test-data/prompt6.jpg +3 -0
- test-data/target5.jpg +3 -0
- test-data/target6.jpg +3 -0
README.md
CHANGED
@@ -1,5 +1,5 @@
 ---
-title: OWLv2 Visual
+title: OWLv2 Visual Prompting
 short_description: OWLv2 zero-shot detection with visual prompt
 emoji: π
 sdk: gradio
@@ -9,28 +9,20 @@ models:
 - google/owlv2-large-patch14-ensemble
 ---
 
-# OWLv2: Zero-
-
-This demo showcases the OWLv2 model's ability to perform zero-shot object detection using
-
-```python
-processor = AutoProcessor.from_pretrained("google/owlv2-base-patch16-ensemble")
-model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")
-
-prompt_image = Image.open(...)
-inputs = processor(images=target_image, query_images=prompt_image, return_tensors="pt")
-
-with torch.no_grad():
-    outputs = model.image_guided_detection(**inputs)
-
-```
+# OWLv2: Zero-Shot Object Detection with Visual Prompting
+
+This demo showcases the OWLv2 model's ability to perform zero-shot object detection using both text and visual prompts. More importantly, it compares different approaches for selecting a query embedding from a visual prompt. The method used in Hugging Face's `transformers` library often underperforms because of how the visual prompt embedding is selected.
+
+## The Problem with the Default Method
+
+The standard implementation in `transformers` (using `model.image_guided_detection`) selects an embedding from the prompt image by maximizing its box's IoU with the full prompt image area and its distance from the average of other embeddings (`embed_image_query`).
+
+However, this selection heuristic does not account for padding and often selects the largest box, which may also span the padded background. This leads to selecting an irrelevant embedding and, consequently, poor detection performance in the target image.
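As a rough illustration, consider a hypothetical 640×480 prompt image (the numbers and names below are illustrative only and do not appear in the code): OWLv2 preprocessing pads the image to a square canvas, so a quarter of the canvas is padding, and any box scored by IoU against the full canvas is pulled toward covering that padding.

```python
# Hypothetical 640x480 prompt image padded to a 640x640 square canvas.
prompt_w, prompt_h = 640, 480
max_side = max(prompt_w, prompt_h)

# Fraction of the padded canvas occupied by real image content.
content_area = (prompt_w / max_side) * (prompt_h / max_side)   # 0.75

# Best IoU a box can reach against the full canvas [0, 0, 1, 1] while staying
# on the real content: it is capped at the content fraction, so a larger box
# that also spans the padded strip scores higher.
print(content_area)  # 0.75
```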
+## An Alternative Approach: Objectness × IoU
+
+This demo implements and compares an alternative method for selecting the query embedding. This method works by maximizing a combination of the objectness score (predicted by the model) and the box's IoU score with the non-padded area of the prompt image. The selected embedding, therefore, tends to represent the most distinct and largest object on the prompt image while excluding any padded areas.
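For reference, the selection step that `app.py` (further down) implements boils down to roughly the following sketch; `model`, `prompt_image`, `query_image_feats` (prompt-image patch features), and `pred_boxes` (prompt-image box predictions in normalized center format) are assumed to have been computed as in that file:

```python
import torch
from transformers.models.owlv2.modeling_owlv2 import box_iou, center_to_corners_format

# Objectness score and class embedding for every patch of the prompt image.
objectness = torch.sigmoid(model.objectness_predictor(query_image_feats))   # (1, num_patches)
_, class_embeds = model.class_predictor(query_image_feats)                  # (1, num_patches, dim)

# Box covering only the real content of the padded square prompt image.
pw, ph = prompt_image.size
max_side = max(pw, ph)
content_box = torch.tensor(
    [[0.0, 0.0, pw / max_side, ph / max_side]], device=pred_boxes.device
)

# Combine objectness with IoU against the non-padded area and pick the best box.
ious, _ = box_iou(content_box, center_to_corners_format(pred_boxes)[0])     # (1, num_patches)
best_idx = torch.argmax(objectness * ious, dim=-1)
query_embed = class_embeds[0][best_idx]   # used as the visual query embedding
```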
+
+## Results
+
+This space compares the results from both methods. The examples clearly demonstrate that this alternative embedding selection approach provides significantly more accurate and reliable results, often performing on par with text-based prompting.
app.py
CHANGED
@@ -8,7 +8,10 @@ import torch
 import gradio as gr
 import supervision as sv
 import spaces
-from
+from PIL import Image
+from transformers import AutoProcessor, Owlv2ForObjectDetection, Owlv2Processor
+from transformers.models.owlv2.modeling_owlv2 import Owlv2ImageGuidedObjectDetectionOutput, center_to_corners_format, box_iou
+#from transformers.models.owlv2.image_processing_owlv2
 
 DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
 
@@ -18,14 +21,23 @@ def init_model(model_id):
     model = Owlv2ForObjectDetection.from_pretrained(model_id)
     model.eval()
     model.to(DEVICE)
+    image_size = tuple(processor.image_processor.size.values())
+    image_mean = torch.tensor(
+        processor.image_processor.image_mean, device=DEVICE
+    ).view(1, 3, 1, 1)
+    image_std = torch.tensor(
+        processor.image_processor.image_std, device=DEVICE
+    ).view(1, 3, 1, 1)
+
+    return processor, model, image_size, image_mean, image_std
 
 @spaces.GPU
 def inference(prompts, target_image, model_id, conf_thresh, iou_thresh, prompt_type):
-    processor, model = init_model(model_id)
+    processor, model, image_size, image_mean, image_std = init_model(model_id)
 
+    annotated_image_my = None
+    annotated_image_hf = None
+    annotated_prompt_image = None
 
     if prompt_type == "Text":
         inputs = processor(
@@ -36,40 +48,128 @@ def inference(prompts, target_image, model_id, conf_thresh, iou_thresh, prompt_type):
 
         with torch.no_grad():
             outputs = model(**inputs)
+        target_sizes = torch.tensor([target_image.size[::-1]])
+        result = processor.post_process_grounded_object_detection(
+            outputs=outputs,
+            target_sizes=target_sizes,
+            threshold=conf_thresh
+        )[0]
         class_names = {k: v for k, v in enumerate(prompts["texts"])}
+        # annotate the target image
+        annotated_image_hf = annotate_image(result, class_names, target_image)
 
     elif prompt_type == "Visual":
+        prompt_image = prompts["images"]
         inputs = processor(
             images=target_image,
-            query_images=
+            query_images=prompt_image,
            return_tensors="pt"
         ).to(DEVICE)
         with torch.no_grad():
+            query_feature_map = model.image_embedder(pixel_values=inputs.query_pixel_values)[0]
+
+            feature_map = model.image_embedder(pixel_values=inputs.pixel_values)[0]
+            batch_size, num_patches_height, num_patches_width, hidden_dim = feature_map.shape
+            image_feats = torch.reshape(feature_map, (batch_size, num_patches_height * num_patches_width, hidden_dim))
+
+            batch_size, num_patches_height, num_patches_width, hidden_dim = query_feature_map.shape
+            query_image_feats = torch.reshape(query_feature_map, (batch_size, num_patches_height * num_patches_width, hidden_dim))
+
+            # Select using hf method
+            query_embeds2, box_indices, pred_boxes = model.embed_image_query(
+                query_image_features=query_image_feats,
+                query_feature_map=query_feature_map
+            )
+
+            # Select top object from prompt image * iou
+            objectnesses = torch.sigmoid(model.objectness_predictor(query_image_feats))
+            _, source_class_embeddings = model.class_predictor(query_image_feats)
+
+            # identify the box that covers only the prompt image area excluding padding
+            pw, ph = prompt_image.size
+            max_side = max(pw, ph)
+            each_query_box = torch.tensor([[0, 0, pw/max_side, ph/max_side]], device=DEVICE)
+
+            pred_boxes_as_corners = center_to_corners_format(pred_boxes)
+            each_query_pred_boxes = pred_boxes_as_corners[0]
+            ious, _ = box_iou(each_query_box, each_query_pred_boxes)
+            comb_score = objectnesses * ious
+            top_obj_idx = torch.argmax(comb_score, dim=-1)
+            query_embeds = source_class_embeddings[0][top_obj_idx]
+
+            # Predict object boxes
+            target_pred_boxes = model.box_predictor(image_feats, feature_map)
+
+            # Predict for prompt: my method
+            (pred_logits, class_embeds) = model.class_predictor(image_feats=image_feats, query_embeds=query_embeds)
+            outputs = Owlv2ImageGuidedObjectDetectionOutput(
+                logits=pred_logits,
+                target_pred_boxes=target_pred_boxes,
+            )
+        # Post-process results
+        target_sizes = torch.tensor([target_image.size[::-1]])
+        result = processor.post_process_image_guided_detection(
+            outputs=outputs,
+            target_sizes=target_sizes,
+            threshold=conf_thresh,
+            nms_threshold=iou_thresh
+        )[0]
+        # prepare for supervision: add 0 label for all boxes
+        result['labels'] = torch.zeros(len(result['boxes']), dtype=torch.int64)
+        class_names = {0: "object"}
+        # annotate the target image
+        annotated_image_my = annotate_image(result, class_names, pad_to_square(target_image))
+
+        # Predict for prompt: hf method
+        (pred_logits, class_embeds) = model.class_predictor(image_feats=image_feats, query_embeds=query_embeds2)
+        # Predict object boxes
+        outputs = Owlv2ImageGuidedObjectDetectionOutput(
+            logits=pred_logits,
+            target_pred_boxes=target_pred_boxes,
+        )
+        # Post-process results
+        target_sizes = torch.tensor([target_image.size[::-1]])
+        result = processor.post_process_image_guided_detection(
+            outputs=outputs,
+            target_sizes=target_sizes,
+            threshold=conf_thresh,
+            nms_threshold=iou_thresh
+        )[0]
+        # prepare for supervision: add 0 label for all boxes
+        result['labels'] = torch.zeros(len(result['boxes']), dtype=torch.int64)
+        class_names = {0: "object"}
+        # annotate the target image
+        annotated_image_hf = annotate_image(result, class_names, pad_to_square(target_image))
+
+        # Render selected prompt embedding
+        query_pred_boxes = pred_boxes[0, [top_obj_idx, box_indices[0]]].unsqueeze(0)
+        query_logits = torch.reshape(objectnesses[0, [top_obj_idx, box_indices[0]]], (1, 2, 1))
+        query_outputs = Owlv2ImageGuidedObjectDetectionOutput(
+            logits=query_logits,
+            target_pred_boxes=query_pred_boxes,
+        )
+        query_result = processor.post_process_image_guided_detection(
+            outputs=query_outputs,
+            target_sizes=torch.tensor([prompt_image.size[::-1]]),
+            threshold=0.0,
+            nms_threshold=1.0
+        )[0]
+        query_result['labels'] = torch.Tensor([0, 1])
+
+        # Annotate the prompt image
+        query_class_names = {0: "my", 1: "hf"}
+        annotated_prompt_image = annotate_image(query_result, query_class_names, pad_to_square(prompt_image))
+
+    return annotated_image_my, annotated_image_hf, annotated_prompt_image
+
 
+def annotate_image(result, class_names, image):
     detections = sv.Detections.from_transformers(result, class_names)
 
-    resolution_wh =
+    resolution_wh = image.size
     thickness = sv.calculate_optimal_line_thickness(resolution_wh=resolution_wh)
     text_scale = sv.calculate_optimal_text_scale(resolution_wh=resolution_wh)
 
@@ -79,7 +179,7 @@ def inference(prompts, target_image, model_id, conf_thresh, iou_thresh, prompt_type):
         in zip(detections['class_name'], detections.confidence)
     ]
 
-    annotated_image =
+    annotated_image = image.copy()
     annotated_image = sv.BoxAnnotator(color_lookup=sv.ColorLookup.INDEX, thickness=thickness).annotate(
         scene=annotated_image, detections=detections)
     annotated_image = sv.LabelAnnotator(color_lookup=sv.ColorLookup.INDEX, text_scale=text_scale, smart_position=True).annotate(
@@ -87,36 +187,28 @@ def inference(prompts, target_image, model_id, conf_thresh, iou_thresh, prompt_type):
 
     return annotated_image
 
+def pad_to_square(image, background_color=(128, 128, 128)):
+    width, height = image.size
+    max_side = max(width, height)
+    result = Image.new(image.mode, (max_side, max_side), background_color)
+    result.paste(image, (0, 0))
+    return result
 
 def app():
     with gr.Blocks():
         with gr.Row():
             with gr.Column():
+                target_image = gr.Image(type="pil", label="Target Image", visible=True, interactive=True)
 
                 detect_button = gr.Button(value="Detect Objects")
                 prompt_type = gr.Textbox(value='Visual', visible=False) # Default prompt type
 
                 with gr.Tab("Visual") as visual_tab:
+                    prompt_image = gr.Image(type="pil", label="Prompt Image", visible=True, interactive=True)
 
                 with gr.Tab("Text") as text_tab:
                     texts = gr.Textbox(label="Input Texts", value='', placeholder='person,bus', visible=True, interactive=True)
 
-                visual_tab.select(
-                    fn=lambda: ("Visual", gr.update(visible=True)),
-                    inputs=None,
-                    outputs=[prompt_type, prompt_image]
-                )
-
-                text_tab.select(
-                    fn=lambda: ("Text", gr.update(value=None, visible=False)),
-                    inputs=None,
-                    outputs=[prompt_type, prompt_image]
-                )
-
                 model_id = gr.Dropdown(
                     label="Model",
                     choices=[
@@ -133,7 +225,7 @@ def app():
                     value=0.25,
                 )
                 iou_thresh = gr.Slider(
-                    label="
+                    label="NMS Threshold",
                     minimum=0.0,
                     maximum=1.0,
                     step=0.05,
@@ -141,8 +233,32 @@ def app():
                 )
 
             with gr.Column():
+                output_image_hf_gr = gr.Group()
+                with output_image_hf_gr:
+                    gr.Markdown("### Annotated Image (HF default)")
+                    output_image_hf = gr.Image(type="numpy", visible=True, show_label=False)
+
+                output_image_my_gr = gr.Group()
+                with output_image_my_gr:
+                    gr.Markdown("### Annotated Image (Objectness × IoU variant)")
+                    output_image_my = gr.Image(type="numpy", visible=True, show_label=False)
+
+                annotated_prompt_image_gr = gr.Group()
+                with annotated_prompt_image_gr:
+                    gr.Markdown("### Prompt Image with Selected Embeddings and Objectness Score")
+                    annotated_prompt_image = gr.Image(type="numpy", visible=True, show_label=False)
+
+        visual_tab.select(
+            fn=lambda: ("Visual", gr.update(visible=True), gr.update(visible=True), gr.update(visible=True)),
+            inputs=None,
+            outputs=[prompt_type, prompt_image, output_image_my_gr, annotated_prompt_image_gr]
+        )
+
+        text_tab.select(
+            fn=lambda: ("Text", gr.update(value=None, visible=False), gr.update(visible=False), gr.update(visible=False)),
+            inputs=None,
+            outputs=[prompt_type, prompt_image, output_image_my_gr, annotated_prompt_image_gr]
+        )
 
         def run_inference(prompt_image, target_image, texts, model_id, conf_thresh, iou_thresh, prompt_type):
             # add text/built-in prompts
@@ -162,7 +278,7 @@ def app():
         detect_button.click(
            fn=run_inference,
            inputs=[prompt_image, target_image, texts, model_id, conf_thresh, iou_thresh, prompt_type],
-            outputs=[
+            outputs=[output_image_my, output_image_hf, annotated_prompt_image],
         )
 
         ###################### Examples ##########################
@@ -193,6 +309,20 @@ def app():
                 "google/owlv2-base-patch16-ensemble",
                 0.9,
                 0.3,
+            ],
+            [
+                "test-data/target5.jpg",
+                "test-data/prompt5.jpg",
+                "google/owlv2-base-patch16-ensemble",
+                0.9,
+                0.3,
+            ],
+            [
+                "test-data/target6.jpg",
+                "test-data/prompt6.jpg",
+                "google/owlv2-base-patch16-ensemble",
+                0.9,
+                0.3,
             ]
         ]
 
@@ -216,7 +346,20 @@ def app():
                 "test-data/target4.jpg",
                 "cat",
                 "google/owlv2-base-patch16-ensemble",
-                0.3
+                0.3
+            ],
+            [
+                "test-data/target5.jpg",
+                "lemon,straw",
+                "google/owlv2-base-patch16-ensemble",
+                0.3
+            ],
+            [
+                "test-data/target6.jpg",
+                "beer logo",
+                "google/owlv2-base-patch16-ensemble",
+                0.3
+            ]
             ],
            inputs=[target_image, texts, model_id, conf_thresh],
            visible=False, cache_examples=False, label="Text Prompt Examples")
@@ -255,28 +398,8 @@ with gradio_app:
     """)
     gr.Markdown("""
     This demo showcases the OWLv2 model's ability to perform zero-shot object detection using visual and text prompts.
     You can either provide a text prompt or an image as a visual prompt to detect objects in the target image.
-
-    For visual prompting, following sample code is used, taken from the HF documentation:
-    ```python
-    processor = AutoProcessor.from_pretrained("google/owlv2-base-patch16-ensemble")
-    model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")
-
-    target_image = Image.open(...)
-    prompt_image = Image.open(...)
-    inputs = processor(images=target_image, query_images=prompt_image, return_tensors="pt")
-
-    # forward pass
-    with torch.no_grad():
-        outputs = model.image_guided_detection(**inputs)
-
-    target_sizes = torch.Tensor([image.size[::-1]])
-
-    results = processor.post_process_image_guided_detection(outputs=outputs, threshold=0.9, nms_threshold=0.3, target_sizes=target_sizes)
-    ```
-
-    For some reason, visual prompt works much worse than text, perhaps it's HF implementation issue.
+    Additionally, it compares different approaches for selecting a query embedding from a visual prompt. The method used by default in Hugging Face's `transformers` often underperforms because of how the visual prompt embedding is selected (see README.md for more details).
     """)
 
     with gr.Row():
test-data/prompt5.jpg
ADDED (Git LFS)

test-data/prompt6.jpg
ADDED (Git LFS)

test-data/target5.jpg
ADDED (Git LFS)

test-data/target6.jpg
ADDED (Git LFS)