LLMDet (base variant)

The LLMDet model was proposed in the paper LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models by Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, and Wei-Shi Zheng.

LLMDet improves upon MM Grounding DINO and Grounding DINO by co-training the detector with a large language model.

You can find all the LLMDet checkpoints under the LLMDet collection. Note that these checkpoints are inference-only -- they do not include the LLM that was used during training. Inference is identical to that of MM Grounding DINO.

Intended uses

You can use the raw model for zero-shot object detection.

Here's how to use the model for zero-shot object detection:

import torch
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from transformers.image_utils import load_image


# Prepare processor and model
model_id = "iSEE-Laboratory/llmdet_base"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

# Prepare inputs
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(image_url)
text_labels = [["a cat", "a remote control"]]
inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Postprocess outputs
results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.4,
    target_sizes=[(image.height, image.width)]
)

# Retrieve the first image result
result = results[0]
for box, score, label in zip(result["boxes"], result["scores"], result["labels"]):
    box = [round(x, 2) for x in box.tolist()]
    print(f"Detected {label} with confidence {round(score.item(), 3)} at location {box}")

Training Data

This model (llmdet_base) was trained on:

  • Objects365 (O365)
  • GoldG
  • V3Det
  • GroundingCap-1M

Evaluation results

  • Here's a table of LLMDet models and their performance on LVIS (results from official repo):

    | Model | Pre-Train Data | MiniVal AP | MiniVal APr | MiniVal APc | MiniVal APf | Val1.0 AP | Val1.0 APr | Val1.0 APc | Val1.0 APf |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | llmdet_tiny | (O365,GoldG,GRIT,V3Det) + GroundingCap-1M | 44.7 | 37.3 | 39.5 | 50.7 | 34.9 | 26.0 | 30.1 | 44.3 |
    | llmdet_base | (O365,GoldG,V3Det) + GroundingCap-1M | 48.3 | 40.8 | 43.1 | 54.3 | 38.5 | 28.2 | 34.3 | 47.8 |
    | llmdet_large | (O365V2,OpenImageV6,GoldG) + GroundingCap-1M | 51.1 | 45.1 | 46.1 | 56.6 | 42.0 | 31.6 | 38.8 | 50.2 |

BibTeX entry and citation info

@article{fu2025llmdet,
  title={LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models},
  author={Fu, Shenghao and Yang, Qize and Mo, Qijie and Yan, Junkai and Wei, Xihan and Meng, Jingke and Xie, Xiaohua and Zheng, Wei-Shi},
  journal={arXiv preprint arXiv:2501.18954},
  year={2025}
}