LLMDet (base variant)
The LLMDet model was proposed in LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models by Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, and Wei-Shi Zheng.
LLMDet improves upon MM Grounding DINO and Grounding DINO by co-training the detector with a large language model.
You can find all the LLMDet checkpoints under the LLMDet collection. Note that these checkpoints are inference-only: they do not include the large language model used during training. Inference is identical to that of MM Grounding DINO.
Intended uses
You can use the raw model for zero-shot object detection.
Here's how to use the model for zero-shot object detection:
import torch
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from transformers.image_utils import load_image

# Prepare processor and model
model_id = "iSEE-Laboratory/llmdet_base"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

# Prepare inputs
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(image_url)
text_labels = [["a cat", "a remote control"]]
inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Postprocess outputs
results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.4,
    target_sizes=[(image.height, image.width)],
)

# Retrieve the result for the first (and only) image
result = results[0]
for box, score, label in zip(result["boxes"], result["scores"], result["labels"]):
    box = [round(x, 2) for x in box.tolist()]
    print(f"Detected {label} with confidence {round(score.item(), 3)} at location {box}")
Training Data
This model was trained on O365, GoldG, and V3Det, combined with GroundingCap-1M, the grounding-with-captions dataset introduced in the LLMDet paper (see the Pre-Train Data column in the table below).
Evaluation results
Here's a table of the LLMDet models and their performance on LVIS (results from the official repo):

| Model | Pre-Train Data | MiniVal AP | MiniVal APr | MiniVal APc | MiniVal APf | Val1.0 AP | Val1.0 APr | Val1.0 APc | Val1.0 APf |
|---|---|---|---|---|---|---|---|---|---|
| llmdet_tiny | (O365,GoldG,GRIT,V3Det) + GroundingCap-1M | 44.7 | 37.3 | 39.5 | 50.7 | 34.9 | 26.0 | 30.1 | 44.3 |
| llmdet_base | (O365,GoldG,V3Det) + GroundingCap-1M | 48.3 | 40.8 | 43.1 | 54.3 | 38.5 | 28.2 | 34.3 | 47.8 |
| llmdet_large | (O365V2,OpenImageV6,GoldG) + GroundingCap-1M | 51.1 | 45.1 | 46.1 | 56.6 | 42.0 | 31.6 | 38.8 | 50.2 |
BibTeX entry and citation info
@article{fu2025llmdet,
  title={LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models},
  author={Fu, Shenghao and Yang, Qize and Mo, Qijie and Yan, Junkai and Wei, Xihan and Meng, Jingke and Xie, Xiaohua and Zheng, Wei-Shi},
  journal={arXiv preprint arXiv:2501.18954},
  year={2025}
}