fisheye8k_facebook_deformable-detr-box-supervised

This model is a fine-tuned version of facebook/deformable-detr-box-supervised on the Fisheye8K dataset. It was developed within the framework of the Mcity Data Engine project.

The Mcity Data Engine provides modules for the complete data-based development cycle for AI algorithms, especially focusing on identifying rare and novel classes through an open-vocabulary data selection process within Intelligent Transportation Systems (ITS). This model is a practical application of the data engine for improving object detection of vulnerable road users and other transportation-related entities.

Paper: Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection
Project Page: Mcity Data Engine Documentation
GitHub Repository: mcity/mcity_data_engine

The model achieves the following result on the evaluation set:

Loss: 3.5085

Model description

This model is designed for object detection in traffic scenarios, particularly for identifying classes like Bus, Bike, Car, Pedestrian, and Truck in fisheye camera imagery. It leverages the Deformable DETR architecture and is fine-tuned using the iterative data improvement methodology proposed in the Mcity Data Engine project. Its goal is to improve the detection of long-tail and novel classes in large amounts of unlabeled data, which is especially challenging in Intelligent Transportation Systems.

Intended uses & limitations

This model is intended for research and development in autonomous driving and intelligent transportation systems, specifically for improving the detection of long-tail and rare classes within the Mcity Data Engine's iterative model improvement pipeline.

The model was trained only on fisheye camera data, so it may generalize poorly to other camera types or environments without further fine-tuning. Because the training process focuses on open-vocabulary data selection, its performance on common, standard objects may be no better than that of other models; its intended strength is in identifying more challenging or rare instances.

Training and evaluation data

The model was trained on the Voxel51/fisheye8k dataset. This dataset is used as part of the Mcity Data Engine's workflow, specifically for demonstrating "Embedding Selection" to determine both representative and rare samples for iterative model improvement. More details about the data curation and selection process can be found in the associated paper and the Mcity Data Engine GitHub repository.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 1
eval_batch_size: 8
seed: 0
optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
lr_scheduler_type: cosine
num_epochs: 36
mixed_precision_training: Native AMP
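
To make the cosine schedule concrete, here is a minimal sketch of how the learning rate would decay under these settings. It assumes the usual half-cosine form with zero warmup steps (no warmup is listed above) and takes 5288 optimizer steps per epoch from the Step column of the training results; the function name `cosine_lr` is illustrative and not part of any library.

```python
import math

BASE_LR = 5e-5          # learning_rate from the hyperparameter list
STEPS_PER_EPOCH = 5288  # taken from the Step column of the training results
NUM_EPOCHS = 36         # num_epochs from the hyperparameter list
TOTAL_STEPS = STEPS_PER_EPOCH * NUM_EPOCHS


def cosine_lr(step: int, total_steps: int = TOTAL_STEPS, base_lr: float = BASE_LR) -> float:
    """Half-cosine decay from base_lr to 0 (assumption: no warmup phase)."""
    # Clamp progress to [0, 1] so out-of-range steps do not extrapolate.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print the scheduled learning rate at the start of a few epochs.
for epoch in (0, 10, 18, 36):
    print(f"epoch {epoch:>2}: lr = {cosine_lr(epoch * STEPS_PER_EPOCH):.3e}")
```

Under these assumptions the rate starts at 5e-05, reaches half that value at the midpoint of the schedule, and ends at zero.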

Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 2.551 | 1.0 | 5288 | 2.9515 |
| 2.4989 | 2.0 | 10576 | 2.9100 |
| 2.2642 | 3.0 | 15864 | 2.9280 |
| 5.2218 | 4.0 | 21152 | 7.3972 |
| 3.69 | 5.0 | 26440 | 2.8083 |
| 3.3462 | 6.0 | 31728 | 5.0976 |
| 2.5944 | 7.0 | 37016 | 4.1669 |
| 2.5709 | 8.0 | 42304 | 3.6812 |
| 2.6956 | 9.0 | 47592 | 4.0466 |
| 2.5195 | 10.0 | 52880 | 3.5085 |
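
The validation loss in the table above is not monotonic, so the last logged epoch is not necessarily the best one. The short sketch below copies the rows from the table and compares the lowest validation loss with the final one. It only inspects the logged numbers; it does not imply which checkpoint was actually published.

```python
# (epoch, training loss, validation loss), copied from the results table.
RESULTS = [
    (1, 2.5510, 2.9515),
    (2, 2.4989, 2.9100),
    (3, 2.2642, 2.9280),
    (4, 5.2218, 7.3972),
    (5, 3.6900, 2.8083),
    (6, 3.3462, 5.0976),
    (7, 2.5944, 4.1669),
    (8, 2.5709, 3.6812),
    (9, 2.6956, 4.0466),
    (10, 2.5195, 3.5085),
]

# Row with the lowest validation loss, and the last logged row.
best_epoch, _, best_val = min(RESULTS, key=lambda row: row[2])
last_epoch, _, last_val = RESULTS[-1]

print(f"lowest validation loss: {best_val} (epoch {best_epoch})")
print(f"final validation loss:  {last_val} (epoch {last_epoch})")
```

The final value matches the evaluation loss reported at the top of this card.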

Framework versions

Transformers 4.48.3
Pytorch 2.5.1+cu124
Datasets 3.2.0
Tokenizers 0.21.0

Sample Usage

You can use this model directly with the Hugging Face transformers library for object detection:

from transformers import AutoImageProcessor, DeformableDetrForObjectDetection
import torch
from PIL import Image
import requests

# Load image (replace with your own image path or URL).
# The COCO example below is not a fisheye traffic scene, so it may yield few or no detections.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Load the image processor and model
image_processor = AutoImageProcessor.from_pretrained("mcity-data-engine/fisheye8k_facebook_deformable-detr-box-supervised")
model = DeformableDetrForObjectDetection.from_pretrained("mcity-data-engine/fisheye8k_facebook_deformable-detr-box-supervised")

# Prepare inputs
inputs = image_processor(images=image, return_tensors="pt")

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)

# You can further process the outputs (logits, boxes, etc.) for visualization or evaluation.
# For example, to get predicted bounding boxes:
target_sizes = torch.tensor([image.size[::-1]])
results = image_processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.5)[0]

print(f"Detected objects for image of size {image.size}:")
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f"  Detected {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
    )
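
For downstream evaluation it can help to convert the post-processed boxes, which to our understanding are absolute pixel coordinates in (x_min, y_min, x_max, y_max) order, into (x, y, width, height) and to count detections per class. The sketch below uses plain Python lists so it runs without the model. The helper names and the sample detections are hypothetical and not produced by the model; only the class names come from the model description.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def xyxy_to_xywh(box: Sequence[float]) -> List[float]:
    """Convert (x_min, y_min, x_max, y_max) to (x, y, width, height)."""
    x_min, y_min, x_max, y_max = box
    return [x_min, y_min, x_max - x_min, y_max - y_min]


def count_by_class(
    detections: Sequence[Tuple[str, float, Sequence[float]]],
    min_score: float = 0.5,
) -> Dict[str, int]:
    """Count detections per class name, keeping only scores >= min_score."""
    return dict(Counter(name for name, score, _ in detections if score >= min_score))


# Hypothetical detections as (class name, score, xyxy box); values are made up.
sample = [
    ("Car", 0.91, [10.0, 20.0, 110.0, 70.0]),
    ("Car", 0.42, [200.0, 40.0, 260.0, 90.0]),
    ("Pedestrian", 0.77, [300.0, 50.0, 320.0, 110.0]),
]

for name, score, box in sample:
    print(name, score, xyxy_to_xywh(box))
print(count_by_class(sample))
```

With real outputs, the tensors in `results["boxes"]` would first be converted with `.tolist()`, as in the loop above.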

Acknowledgements

Mcity would like to thank Amazon Web Services (AWS) for their pivotal role in providing the cloud infrastructure on which the Data Engine depends. We couldn’t have done it without their tremendous support!

Citation

If you use the Mcity Data Engine in your research, feel free to cite the project:

@article{bogdoll2025mcitydataengine,
  title={Mcity Data Engine},
  author={Bogdoll, Daniel and Anata, Rajanikant Patnaik and Stevens, Gregory},
  journal={GitHub. Note: https://github.com/mcity/mcity_data_engine},
  year={2025}
}
