Ryan Pfister committed · Commit 58343f2 · Parent(s): 7c97aba
Add YOLO12l-seg person segmentation model with documentation and example code

Files changed:
- README.md +139 -0
- requirements.txt +5 -0
- sample_inference.py +124 -0
- yolo12l-person-seg.pt +3 -0
README.md
ADDED
@@ -0,0 +1,139 @@
---
license: apache-2.0
tags:
- yolo
- yolo12
- segmentation
- object-detection
- person-detection
- instance-segmentation
- pytorch
- ultralytics
- computer-vision
datasets:
- coco
---

# YOLO12-seg Person Segmentation Model

A YOLO12-large (YOLO12l) instance segmentation model trained specifically for detecting and segmenting people with high precision.

## Model Description

This model is a fine-tuned YOLO12-seg model optimized exclusively for person segmentation. It uses the large (L) scale configuration of YOLO12, with 28.76M parameters and 510 layers at depth and width multipliers of 1.0.
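
If you want to verify these architecture figures for the checkpoint you download, the Ultralytics API can report them. A minimal sketch (the exact numbers printed may vary slightly across `ultralytics` versions):

```python
from ultralytics import YOLO

# Load the checkpoint and print a summary of layers, parameters, and GFLOPs
model = YOLO('yolo12l-person-seg.pt')
model.info()
```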

### Key Features

- **Single-Class Focus**: Specialized in detecting only people
- **Detailed Segmentation**: Provides detailed, pixel-level segmentation masks
- **High Throughput**: Optimized for processing hundreds of images per minute
- **Quality-Optimized**: Trained specifically for accurate boundary delineation
- **GPU-Optimized**: The Large (L) model is designed for GPU deployment, not edge devices or mobile phones

## Training

The model was trained on a filtered version of the COCO dataset containing only images with people:

- **Training Images**: 64,114 images containing people
- **Validation Images**: 2,693 images containing people
- **Training Details** (a comparable launch is sketched after this list):
  - Initially trained for 100 epochs
  - Extended training for an additional 200 epochs (300 total)
  - Input resolution: 640×640
  - Class-focused optimization with `single_cls=True` and `classes=0`
  - Optimized for segmentation with `overlap_mask=True` and `mask_ratio=4`
  - Extended training with a cosine learning rate schedule and `patience=20`
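
A comparable run could be launched with the Ultralytics Python API. This is a minimal sketch under stated assumptions, not the exact command used for this checkpoint: `yolo12l-seg.pt` as the starting weights and `person-coco.yaml` as the dataset config for the filtered COCO split are both hypothetical names.

```python
from ultralytics import YOLO

# Assumed pretrained base; swap in whatever YOLO12l-seg weights you start from
model = YOLO('yolo12l-seg.pt')

model.train(
    data='person-coco.yaml',  # hypothetical config for the person-only COCO split
    epochs=300,               # 100 initial + 200 extended epochs
    imgsz=640,                # input resolution
    single_cls=True,          # treat the dataset as a single class
    classes=[0],              # person is class 0 in COCO
    overlap_mask=True,        # allow overlapping instance masks during training
    mask_ratio=4,             # mask downsampling ratio
    cos_lr=True,              # cosine learning rate schedule
    patience=20,              # early-stopping patience
)
```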

## Performance

The model achieves the following metrics on the COCO person validation set:

| Metric              | Value |
| ------------------- | ----- |
| Box mAP50-95 (COCO) | 0.628 |
| Box mAP50 (COCO)    | 0.840 |
| Mask mAP50-95       | 0.524 |
| Mask mAP50          | 0.821 |
| Box Precision       | 0.835 |
| Box Recall          | 0.745 |
| Mask Precision      | 0.843 |
| Mask Recall         | 0.723 |

These metrics were computed on a validation set of 5,000 images containing 10,777 person instances.
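
These numbers can be re-checked with Ultralytics' built-in validation. A minimal sketch; `person-coco.yaml` is again a hypothetical dataset config, and the exact values will depend on the split and package version:

```python
from ultralytics import YOLO

model = YOLO('yolo12l-person-seg.pt')

# Run validation; for a segmentation model this reports both box and mask metrics
metrics = model.val(data='person-coco.yaml', imgsz=640, split='val')
print(metrics.box.map, metrics.box.map50)  # box mAP50-95 and mAP50
print(metrics.seg.map, metrics.seg.map50)  # mask mAP50-95 and mAP50
```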

## Use Cases

This model is ideal for applications requiring precise person segmentation:

- Human-centric image editing
- Background removal focused on people (see the sketch after this list)
- Virtual try-on applications
- People counting and crowd analysis
- Smart surveillance systems
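
As an illustration of the background-removal use case, the predicted polygons can be rasterized into a keep-mask and applied to the image. A hedged sketch; file names are placeholders, and `masks.xy` polygons are in original image coordinates:

```python
import cv2
import numpy as np
from ultralytics import YOLO

model = YOLO('yolo12l-person-seg.pt')
image = cv2.imread('photo.jpg')  # placeholder input path

result = model(image)[0]
if result.masks is not None:
    # Rasterize every person polygon into a single keep-mask
    keep = np.zeros(image.shape[:2], dtype=np.uint8)
    for polygon in result.masks.xy:
        if len(polygon):
            cv2.fillPoly(keep, [polygon.astype(np.int32)], 255)

    # Zero out everything outside the person masks
    cutout = cv2.bitwise_and(image, image, mask=keep)
    cv2.imwrite('people_only.png', cutout)
```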

## Usage

The model can be used directly with the Ultralytics Python package:

```python
from ultralytics import YOLO

# Load the model
model = YOLO('path/to/yolo12l-person-seg.pt')

# Perform inference
results = model('image.jpg')

# Process results (segmentation masks and bounding boxes)
for result in results:
    boxes = result.boxes  # Bounding boxes; tensor operations can be performed on them
    masks = result.masks  # Segmentation masks (None if nothing was detected)

    if masks is not None:
        # Process masks
        for mask in masks:
            # Use the mask for your application
            pass
```

For segmentation visualization:

```python
import cv2
import numpy as np
from ultralytics import YOLO

# Load the model and image
model = YOLO('path/to/yolo12l-person-seg.pt')
image = cv2.imread('image.jpg')

# Perform inference
results = model(image)

# Process and visualize the first result
result = results[0]
if result.masks is not None:
    masks = result.masks.data.cpu().numpy()
    for i, mask in enumerate(masks):
        # Masks are returned at the model's inference resolution; resize to the image
        mask = cv2.resize(mask, (image.shape[1], image.shape[0]))
        # Create a colored overlay for each mask
        color = [np.random.randint(0, 255) for _ in range(3)]
        mask_image = np.zeros_like(image, dtype=np.uint8)
        mask_image[mask.astype(bool)] = color
        image = cv2.addWeighted(image, 1.0, mask_image, 0.5, 0)

# Display or save the image
cv2.imwrite('segmented_image.jpg', image)
```

## Limitations

- This model is optimized for person segmentation only and will not detect other classes
- Performance may be reduced in extreme lighting conditions
- Occluded people may have incomplete segmentation masks
- Small or distant people might not be detected as reliably as those in the foreground
- **GPU Recommended**: As a Large (L) model, real-time inference performance benefits from a dedicated GPU
- **Edge Device Limitations**: Not optimized for mobile or edge deployment (consider YOLO12n or YOLO12s for those use cases)

## License

This model is available under the Apache 2.0 license.
requirements.txt
ADDED
@@ -0,0 +1,5 @@
ultralytics>=8.3.0
torch>=2.0.0
opencv-python>=4.7.0
numpy>=1.22.0
Pillow>=9.5.0
sample_inference.py
ADDED
@@ -0,0 +1,124 @@
# sample_inference.py
import argparse
import torch
from ultralytics import YOLO
import cv2
import numpy as np
import json
from PIL import Image

def main():
    parser = argparse.ArgumentParser(description='Run person segmentation with YOLO12l-seg model')
    parser.add_argument('--model', type=str, default='yolo12l-person-seg.pt', help='Model path')
    parser.add_argument('--image', type=str, required=True, help='Image path for inference')
    parser.add_argument('--output', type=str, default='output.jpg', help='Output visualization image path')
    parser.add_argument('--json', type=str, default='detections.json', help='JSON output file for detection data')
    parser.add_argument('--conf', type=float, default=0.5, help='Confidence threshold')
    args = parser.parse_args()

    # Load the model
    model = YOLO(args.model)

    # Move to the appropriate device if available; half precision only on CUDA
    if torch.cuda.is_available():
        print(f"Using CUDA device: {torch.cuda.get_device_name(0)}")
        model.to('cuda')
        device = 'cuda'
        use_half = True
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        print("Using Apple Silicon MPS")
        model.to('mps')
        device = 'mps'
        use_half = False
    else:
        print("Using CPU")
        device = None
        use_half = False

    # Load and check input image
    try:
        img = Image.open(args.image)
        img_width, img_height = img.size
        print(f"Image dimensions: {img_width}x{img_height}")
    except Exception as e:
        print(f"Error opening image: {e}")
        return

    # Run inference (classes=0 keeps only the person class)
    if device == 'cuda':
        results = model(args.image, classes=0, conf=args.conf, device=device, half=use_half)
    elif device == 'mps':
        results = model(args.image, classes=0, conf=args.conf, device=device)
    else:
        results = model(args.image, classes=0, conf=args.conf)

    # Process results
    detections = []
    person_count = 0
    visualization_img = cv2.imread(args.image)

    for result in results:
        masks = result.masks
        boxes = result.boxes

        if boxes is None or len(boxes) == 0:
            print("No people detected in the image")
            return

        person_count = len(boxes)
        print(f"Detected {person_count} people")

        # Visualize and extract data (masks.xy are polygons in original image coordinates)
        if masks is not None:
            for i, (mask, box) in enumerate(zip(masks.xy, boxes)):
                confidence = float(box.conf[0])
                x1, y1, x2, y2 = map(int, box.xyxy[0])

                # Extract mask points
                polygon_points = mask.tolist()

                # Calculate percentages of image dimensions
                x_coords = [point[0] for point in polygon_points]
                y_coords = [point[1] for point in polygon_points]
                min_x, max_x = min(x_coords), max(x_coords)
                min_y, max_y = min(y_coords), max(y_coords)
                width_pct = (max_x - min_x) / img_width
                height_pct = (max_y - min_y) / img_height

                # Create detection record
                detection = {
                    "id": i,
                    "confidence": confidence,
                    "box": [x1, y1, x2, y2],
                    "points": polygon_points,
                    "width_pct": width_pct,
                    "height_pct": height_pct,
                }
                detections.append(detection)

                # Draw bounding box
                cv2.rectangle(visualization_img, (x1, y1), (x2, y2), (0, 255, 0), 2)
                cv2.putText(visualization_img, f'Person: {confidence:.2f}', (x1, y1 - 10),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

                # Draw segmentation mask
                color_mask = np.zeros_like(visualization_img, dtype=np.uint8)
                mask_points = np.array(polygon_points, dtype=np.int32)
                cv2.fillPoly(color_mask, [mask_points], (0, 0, 255))

                # Blend the mask with the original image
                visualization_img = cv2.addWeighted(visualization_img, 1.0, color_mask, 0.5, 0)

    # Save visualization
    cv2.imwrite(args.output, visualization_img)
    print(f"Visualization saved to {args.output}")

    # Save detection data to JSON
    with open(args.json, 'w') as f:
        json.dump({
            "person_count": person_count,
            "detections": detections
        }, f, indent=4)
    print(f"Detection data saved to {args.json}")

if __name__ == "__main__":
    main()
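A typical invocation of the script above (file names are placeholders): `python sample_inference.py --image photo.jpg --output seg.jpg --json people.json --conf 0.5` writes the mask overlay to `seg.jpg` and the per-person polygons to `people.json`.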
yolo12l-person-seg.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:abc090155cfe7a883fcc613868f482fa7db04ea67a6b4366c58c07deaa4c2ba1
size 58148802