---
library_name: transformers
tags:
- vision
license: apache-2.0
pipeline_tag: zero-shot-object-detection
---

# LLMDet (base variant)

The [LLMDet](https://arxiv.org/abs/2501.18954) model was proposed in [LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models](https://arxiv.org/abs/2501.18954) by Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, Wei-Shi Zheng.

LLMDet improves upon [MM Grounding DINO](https://huggingface.co/docs/transformers/model_doc/mm-grounding-dino) and [Grounding DINO](https://huggingface.co/docs/transformers/model_doc/grounding-dino) by co-training the detector with a large language model.

You can find all the LLMDet checkpoints in the [LLMDet](https://huggingface.co/collections/rziga/llmdet-68398b294d9866c16046dcdd) collection. Note that these checkpoints are inference-only -- they do not include the LLM that was used during training. Inference is identical to that of [MM Grounding DINO](https://huggingface.co/docs/transformers/model_doc/mm-grounding-dino).

## Intended uses

You can use the raw model for zero-shot object detection: given an image and a set of free-form text labels, the model predicts bounding boxes for the objects that match those labels.

Here's how to use the model for zero-shot object detection:

```py
import torch
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from transformers.image_utils import load_image

# Prepare processor and model
model_id = "iSEE-Laboratory/llmdet_base"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

# Prepare inputs
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(image_url)
text_labels = [["a cat", "a remote control"]]
inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Postprocess outputs
results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.4,
    target_sizes=[(image.height, image.width)]
)

# Retrieve the results for the first (and only) image
result = results[0]
for box, score, label in zip(result["boxes"], result["scores"], result["labels"]):
    box = [round(x, 2) for x in box.tolist()]
    print(f"Detected {label} with confidence {round(score.item(), 3)} at location {box}")
```
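
To sanity-check the detections visually, you can draw the predicted boxes on the image with Pillow. The snippet below is a minimal sketch (not part of the original card) that reuses the `image` and `result` objects from the example above; the output filename is just an example.

```py
from PIL import ImageDraw

# Draw each detected box and its label on a copy of the input image
annotated = image.copy()
draw = ImageDraw.Draw(annotated)
for box, score, label in zip(result["boxes"], result["scores"], result["labels"]):
    x_min, y_min, x_max, y_max = box.tolist()
    draw.rectangle((x_min, y_min, x_max, y_max), outline="red", width=3)
    draw.text((x_min, max(y_min - 12, 0)), f"{label}: {score.item():.2f}", fill="red")

annotated.save("llmdet_detections.png")  # example output path
```
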
## Training Data

This model was trained on:

- [Objects365v1](https://www.objects365.org/overview.html)
- [GOLD-G](https://arxiv.org/abs/2104.12763)
- [V3Det](https://github.com/V3Det/V3Det)
- [GroundingCap-1M](https://arxiv.org/abs/2501.18954)

## Evaluation results

Here's a table of LLMDet models and their performance on LVIS (results from the [official repo](https://github.com/iSEE-Laboratory/LLMDet)):

| Model                                                     | Pre-Train Data                               | MiniVal AP | MiniVal APr | MiniVal APc | MiniVal APf | Val1.0 AP | Val1.0 APr | Val1.0 APc | Val1.0 APf |
| --------------------------------------------------------- | -------------------------------------------- | ---------- | ----------- | ----------- | ----------- | --------- | ---------- | ---------- | ---------- |
| [llmdet_tiny](https://huggingface.co/rziga/llmdet_tiny)   | (O365,GoldG,GRIT,V3Det) + GroundingCap-1M    | 44.7       | 37.3        | 39.5        | 50.7        | 34.9      | 26.0       | 30.1       | 44.3       |
| [llmdet_base](https://huggingface.co/rziga/llmdet_base)   | (O365,GoldG,V3Det) + GroundingCap-1M         | 48.3       | 40.8        | 43.1        | 54.3        | 38.5      | 28.2       | 34.3       | 47.8       |
| [llmdet_large](https://huggingface.co/rziga/llmdet_large) | (O365V2,OpenImageV6,GoldG) + GroundingCap-1M | 51.1       | 45.1        | 46.1        | 56.6        | 42.0      | 31.6       | 38.8       | 50.2       |

## BibTeX entry and citation info

```bibtex
@article{fu2025llmdet,
  title={LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models},
  author={Fu, Shenghao and Yang, Qize and Mo, Qijie and Yan, Junkai and Wei, Xihan and Meng, Jingke and Xie, Xiaohua and Zheng, Wei-Shi},
  journal={arXiv preprint arXiv:2501.18954},
  year={2025}
}
```