---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: image-feature-extraction
library_name: transformers
tags:
- text-generation-inference
- vision_encoder
---

> [!Important]
> This is the vision encoder component of the [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) model. For more details, visit the original model page or refer to the Qwen-VL technical reports published by Qwen.

**Qwen3-VL-4B Vision Encoder**: `strangeropshf/qwen3-vl-4b-vision_encoder` is the vision encoder extracted from Qwen3-VL-4B-Instruct and packaged as a standalone model. It uses a **DeepStack multi-level Vision Transformer (ViT)** architecture, which fuses feature maps from multiple ViT layers so that fine-grained detail and global context are both preserved. It employs **Interleaved-MRoPE** positional embeddings for full-frequency spatial-temporal encoding across the width, height, and time dimensions. This supports high-resolution images (up to 896×896, 256 tokens/image) and long videos, with **text-timestamp alignment** for second-level event localization. The encoder works at native resolution: **NaViT-style** dynamic tiling handles arbitrary aspect ratios without fixed-size cropping. These properties underpin the spatial reasoning, 2D/3D grounding, and video understanding that the full model's agentic capabilities rely on.
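
To make the native-resolution behavior more concrete, the sketch below estimates how an image size maps to a patch grid and a visual token count. It is only an approximation under stated assumptions: a patch size of 16 and a 2×2 spatial merge are assumed here (neither value is stated on this card, so check the loaded vision config), each side is rounded to a multiple of `patch_size * merge_size`, and the image processor's minimum and maximum pixel limits are ignored. `estimate_visual_tokens` is a hypothetical helper, not part of `transformers`.

```py
from typing import Tuple


def estimate_visual_tokens(
    height: int,
    width: int,
    patch_size: int = 16,   # assumption: verify against the vision config
    merge_size: int = 2,    # assumption: 2x2 patches merged into one token
) -> Tuple[Tuple[int, int, int], int]:
    """Estimate (grid_thw, token_count) for a single still image."""
    factor = patch_size * merge_size
    # Round each side to the nearest multiple of `factor`, keeping at
    # least one merged cell per side.
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    grid_h, grid_w = h // patch_size, w // patch_size
    patches = grid_h * grid_w            # a still image has one temporal step
    tokens = patches // (merge_size ** 2)
    return (1, grid_h, grid_w), tokens


print(estimate_visual_tokens(480, 640))  # ((1, 30, 40), 300)
```

The returned grid has the same layout as the `image_grid_thw` tensor produced by the image processor, so the estimate can be compared with the real value after preprocessing.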

## Quick Start with Transformers

### Install the required packages

```txt
gradio  # tested with gradio 6.3.0
torch==2.8.0
torchvision
transformers==4.57.6
accelerate
```

### Usage

```py
import torch
import gradio as gr
from transformers import AutoProcessor
from transformers.models.qwen3_vl import Qwen3VLVisionModel
from PIL import Image

MODEL_ID = "strangeropshf/qwen3-vl-4b-vision_encoder"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

print("Loading image processor...")
full_processor = AutoProcessor.from_pretrained(
    MODEL_ID, trust_remote_code=True
)
image_processor = full_processor.image_processor
print("Image processor loaded.")

print("Loading Qwen3-VL vision encoder...")
vision_model = Qwen3VLVisionModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=dtype
).to(device).eval()
print("Vision encoder ready.")

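# Optional sanity check: print the forward() signature so the expected
# argument names can be confirmed for the installed transformers version.
import inspect
sig = inspect.signature(vision_model.forward)
print(f"forward() signature: {sig}")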


@torch.inference_mode()
def run_vision_encoder(image_path: str):
    if image_path is None:
        return "No image provided."

    image = Image.open(image_path).convert("RGB")

    inputs = image_processor(images=[image], return_tensors="pt")

    pixel_values = inputs["pixel_values"].to(device=device, dtype=dtype)
    grid_thw = inputs["image_grid_thw"].to(device=device)

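    # Qwen3VLVisionModel.forward(hidden_states, grid_thw):
    # "hidden_states" is the flattened patch tensor, i.e. pixel_values.
    # Depending on the transformers version, the call may return a plain
    # tensor, a tuple (merged features first), or an output object.
    feats = vision_model(
        hidden_states=pixel_values,
        grid_thw=grid_thw
    )

    # Normalize the return value to a single feature tensor.
    if hasattr(feats, "last_hidden_state"):
        feats = feats.last_hidden_state
    elif isinstance(feats, (tuple, list)):
        feats = feats[0]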

    shape_str = f"Output shape: {tuple(feats.shape)}"

    if feats.dim() == 3:        # (batch, seq, hidden)
        sample = feats[0, 0, :8]
    elif feats.dim() == 2:      # (total_patches, hidden)
        sample = feats[0, :8]
    else:
        sample = feats.flatten()[:8]

    sample_np = sample.detach().cpu().float().numpy()

    return (
        f"{shape_str}\n"
        f"dtype: {feats.dtype}\n"
        f"device: {feats.device}\n"
        f"grid_thw: {grid_thw.cpu().tolist()}\n\n"
        f"Sample values (first token, first 8 dims):\n"
        f"{sample_np}"
    )

with gr.Blocks() as demo:
    gr.Markdown("## Qwen3-VL Vision Encoder")
 
    with gr.Row():
        image_input = gr.Image(type="filepath", label="Input Image")
        output_text = gr.Textbox(label="Vision Encoder Output", lines=10)

    run_btn = gr.Button("Run Vision Encoder")
    run_btn.click(
        fn=run_vision_encoder,
        inputs=image_input,
        outputs=output_text
    )

demo.launch(debug=True)
```
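
When several images are passed in one call, the encoder output is typically a single packed tensor of shape `(total_tokens, hidden)` rather than one tensor per image. The sketch below shows one way to recover per-image spans from `grid_thw`. It assumes a 2×2 spatial merge, so that each image contributes `t * h * w // 4` rows; confirm the merge factor on the loaded model before relying on it. `per_image_slices` is a hypothetical helper, not part of `transformers`.

```py
from typing import List, Sequence


def per_image_slices(
    grid_thw: Sequence[Sequence[int]],
    merge_size: int = 2,  # assumption: verify against the vision config
) -> List[slice]:
    """Return one slice per image into the packed feature rows."""
    slices, start = [], 0
    for t, h, w in grid_thw:
        # grid_thw counts patches; merging reduces them by merge_size**2.
        n_tokens = (t * h * w) // (merge_size ** 2)
        slices.append(slice(start, start + n_tokens))
        start += n_tokens
    return slices


# Two images with patch grids 1x32x32 and 1x16x24.
grid = [[1, 32, 32], [1, 16, 24]]
slices = per_image_slices(grid)

# A plain list stands in for the packed feature tensor here.
packed = list(range(352))
print([len(packed[s]) for s in slices])  # [256, 96]
```

With the tensors from the script above, the same slices can be applied as `feats[s]` for each `s` in `per_image_slices(grid_thw.tolist())`, and each span can then be mean-pooled if a single embedding per image is needed.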