	---
	library_name: transformers
	license: mit
	tags:
	- image-segmentation
	- semantic-segmentation
	- segformer
	- facade
	- building
	- mixed-rectification
	- unrectified
	- vision
	pipeline_tag: image-segmentation
	datasets:
	- Xpitfire/cmp_facade
	- merve/scene_parse_150
	metrics:
	- mean_iou
	model-index:
	- name: segformer-b0-facade-mixed
	  results:
	  - task:
	      type: image-segmentation
	      name: Semantic Segmentation
	    dataset:
	      type: mixed
	      name: CMP Facade + ADE20K filtered
	      split: validation
	    metrics:
	    - type: mean_iou
	      value: 0.0
	      name: Mean IoU
	---

	# SegFormer-B0 — Facade Segmentation (Mixed Rectified + Unrectified)

	> Status: The training script is ready, but the model still has to be trained. Run `train.py` to produce the weights (see How to Train below).

	A SegFormer-B0 model trained on a mix of rectified and unrectified facade data, intended for a two-pass pipeline:

	1. Pass 1: run the model on the raw street photo (unrectified, with perspective) to detect the dominant facade wall.
	2. Rectify the detected facade with a homography.
	3. Pass 2: run the model on the rectified crop to extract windows, doors, and balconies cleanly.

	| Property | Value |
	|---|---|
	| Architecture | SegFormer-B0 (Mix Transformer encoder + all-MLP decoder) |
	| Parameters | ~3.7 M |
	| Input | RGB image, any resolution (resized to 512×512) |
	| Output | 6-class pixel mask |
	| Format | SafeTensors |
	| Base model | [Marco333/segformer-b0-facade-cmp](https://huggingface.co/Marco333/segformer-b0-facade-cmp) |

	## 6-Class Taxonomy

	| ID | Class | Function | Pass |
	|:--:|-------|----------|:--:|
	| 0 | `background` | Sky, ground, non-facade regions | Both |
	| 1 | `facade_wall` | Main wall surface (merged: facade, molding, cornice, pillar, sill, deco) | Both |
	| 2 | `window` | Windows + blinds + shopfronts | Both |
	| 3 | `door` | Doors + shopfronts | Both |
	| 4 | `balcony` | Balconies | Both |
	| 5 | `vegetation_occluder` | Trees, plants occluding facade | Both |
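
The taxonomy above can be written down as the `id2label` / `label2id` mappings that `transformers` segmentation configs accept. This is a minimal sketch: the constant names `ID2LABEL`, `LABEL2ID`, and `NUM_LABELS` are illustrative and are not taken from `train.py`.

```python
# Class IDs and names copied from the taxonomy table above.
ID2LABEL = {
    0: "background",
    1: "facade_wall",
    2: "window",
    3: "door",
    4: "balcony",
    5: "vegetation_occluder",
}

# Inverse mapping, useful when remapping source datasets by class name.
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

NUM_LABELS = len(ID2LABEL)
```

These dictionaries could then be passed as `id2label=ID2LABEL, label2id=LABEL2ID` when loading the base checkpoint, so that predictions are reported with readable class names.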

	## Two-Pass Pipeline

	```python
	import torch
	import torch.nn.functional as F
	from PIL import Image
	from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

	processor = SegformerImageProcessor.from_pretrained("Marco333/segformer-b0-facade-mixed")
	model = SegformerForSemanticSegmentation.from_pretrained("Marco333/segformer-b0-facade-mixed")
	model.eval()

	def segment(image):
	    """Return an (H, W) tensor of class IDs at the image's original resolution."""
	    inputs = processor(images=image, return_tensors="pt")
	    with torch.no_grad():  # inference only, no gradients needed
	        logits = model(**inputs).logits
	    # PIL reports (width, height); F.interpolate expects (height, width).
	    logits = F.interpolate(logits, size=image.size[::-1],
	                           mode="bilinear", align_corners=False)
	    return logits.argmax(dim=1)[0]

	# Pass 1: raw street photo
	image = Image.open("street_photo.jpg").convert("RGB")
	mask = segment(image)

	# Find biggest facade_wall blob (class 1), compute homography, rectify...
	# Then Pass 2 on rectified crop:
	rectified = Image.open("rectified_facade.jpg").convert("RGB")
	mask2 = segment(rectified)
	```
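
The step between the two passes (largest `facade_wall` blob, then a homography) is only a comment in the snippet above. The sketch below shows one possible way to fill it in with NumPy alone. It is not the method used by this repo: the helper names `largest_component`, `homography_from_corners`, and `apply_homography` are hypothetical, the blob search is a plain 4-connected flood fill, and the four facade corners are assumed to be supplied by some separate corner-detection step that is not shown.

```python
from collections import deque

import numpy as np


def largest_component(binary):
    """Return a boolean mask of the largest 4-connected region of True pixels."""
    binary = np.asarray(binary, dtype=bool)
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    best = []
    for sy, sx in zip(*np.nonzero(binary)):
        if seen[sy, sx]:
            continue
        seen[sy, sx] = True
        queue, region = deque([(sy, sx)]), []
        while queue:  # breadth-first flood fill from the seed pixel
            y, x = queue.popleft()
            region.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        if len(region) > len(best):
            best = region
    out = np.zeros((h, w), dtype=bool)
    if best:
        ys, xs = zip(*best)
        out[list(ys), list(xs)] = True
    return out


def homography_from_corners(src, dst):
    """Solve for the 3x3 homography mapping four (x, y) src points onto dst.

    Fixes the bottom-right entry to 1 and solves the resulting 8x8 system.
    """
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.extend([u, v])
    h = np.linalg.solve(np.array(rows, dtype=np.float64),
                        np.array(rhs, dtype=np.float64))
    return np.append(h, 1.0).reshape(3, 3)


def apply_homography(H, pts):
    """Map an (N, 2) array of (x, y) points through H."""
    pts = np.asarray(pts, dtype=np.float64)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]


# Toy example: a fake class mask with two facade_wall (class 1) regions.
toy_mask = np.zeros((5, 6), dtype=np.uint8)
toy_mask[0, 0] = 1
toy_mask[2:5, 2:6] = 1
facade = largest_component(toy_mask == 1)

# Hypothetical facade corners (x, y) in the photo, mapped to an upright rectangle.
corners = [(10, 20), (200, 40), (210, 300), (5, 280)]
target = [(0, 0), (255, 0), (255, 383), (0, 383)]
H = homography_from_corners(corners, target)
```

With the matrix `H` in hand, an image-warping routine such as OpenCV's `cv2.warpPerspective` could resample the photo into the rectified crop that Pass 2 consumes.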

	## How to Train

	This repo contains the training script. Run it on any GPU (T4 or better):

	```bash
	pip install transformers datasets evaluate accelerate torch torchvision Pillow numpy
	python train.py
	```

	What the script does:

	1. Loads the CMP Facade dataset (~492 rectified images, 12 classes) and remaps it to the 6-class taxonomy.
	2. Loads ADE20K scene_parse_150 (~20K images), keeps only scenes that contain buildings, and remaps them to the same 6 classes.
	3. Applies perspective augmentation (`RandomPerspective`, p=0.4) during training to simulate oblique camera angles.
	4. Starts from the existing CMP-only model (`segformer-b0-facade-cmp`, 48.56% mIoU).
	5. Trains for 80 epochs and pushes the best model to this Hub repo.

	Training time: ~4-6h on T4 GPU
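
The remapping in step 1 can be sketched as a lookup table indexed by raw class ID. Everything dataset-specific here is an assumption: the raw class order in `CMP_RAW_CLASSES` must be checked against the dataset's own label list, and `shop` is sent to `window` only because the taxonomy lists shopfronts under both `window` and `door`. The names `CMP_NAME_TO_TARGET`, `CMP_LUT`, and `remap_mask` are illustrative, not taken from `train.py`.

```python
import numpy as np

# Hypothetical raw CMP class order; verify against the dataset before use.
CMP_RAW_CLASSES = [
    "background", "facade", "window", "door", "cornice", "sill",
    "balcony", "blind", "deco", "molding", "pillar", "shop",
]

# Target IDs follow the 6-class taxonomy of this model card.
CMP_NAME_TO_TARGET = {
    "background": 0,
    "facade": 1, "molding": 1, "cornice": 1, "pillar": 1, "sill": 1, "deco": 1,
    "window": 2, "blind": 2,
    "shop": 2,  # assumption: the card lists shopfronts under window and door
    "door": 3,
    "balcony": 4,
}

# One entry per raw class ID, holding the target class ID.
CMP_LUT = np.array([CMP_NAME_TO_TARGET[name] for name in CMP_RAW_CLASSES],
                   dtype=np.uint8)


def remap_mask(raw_mask, lut=CMP_LUT):
    """Map an integer array of raw class IDs to target IDs by table lookup."""
    return lut[np.asarray(raw_mask)]


raw = np.array([[0, 1, 2], [3, 6, 11]])
remapped = remap_mask(raw)
```

A lookup table keeps the remap to a single vectorized indexing operation per mask. The same pattern would apply to ADE20K with a 150-entry table.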

	## Data Sources

	| Dataset | Type | Images | Geometry | Classes (raw) | Classes (remapped) |
	|---------|------|--------|----------|--------------|-------------------|
	| [CMP Facade](https://huggingface.co/datasets/Xpitfire/cmp_facade) | Primary | ~492 | Rectified | 12 | 6 (background, wall, window, door, balcony, ignore) |
	| [ADE20K scene_parse_150](https://huggingface.co/datasets/merve/scene_parse_150) | Augmentation | ~5K filtered | Unrectified (perspective) | 150 | 6 (same taxonomy) |

	### Why mix these?

	- CMP alone performs well on rectified facades but fails on street-view images with perspective distortion.
	- ADE20K adds natural perspective building scenes (wall, building, house, skyscraper classes).
	- Perspective augmentation (`RandomPerspective`, distortion=0.3, p=0.4) helps close the geometric domain gap.

	The literature reports this gap: [Texture2LoD3](https://huggingface.co/papers/2504.05249) measured that SegFormer loses about 10 pp IoU on unrectified facades compared with rectified ones. Perspective augmentation during training is a practical way to reduce it.

	## Training Configuration

	| Parameter | Value |
	|-----------|-------|
	| Base checkpoint | `Marco333/segformer-b0-facade-cmp` |
	| Optimizer | AdamW |
	| Learning rate | 6 × 10⁻⁵ |
	| LR schedule | Polynomial decay |
	| Warmup | 10% of steps |
	| Weight decay | 0.01 |
	| Effective batch size | 8 (4 per device × 2 gradient accumulation steps) |
	| Resolution | 512 × 512 |
	| Precision | FP16 |
	| Epochs | 80 |
	| Augmentation | ColorJitter + RandomPerspective (p=0.4, distortion=0.3) |
	| Selection metric | Highest mean IoU on validation |
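
To make the schedule rows concrete, the sketch below computes the learning rate at a given optimizer step for linear warmup over the first 10% of steps followed by polynomial decay. The table does not state the decay power or the final learning rate, so `power=1.0` and `end_lr=0.0` are assumptions, and `lr_at_step` is a hypothetical helper rather than a function from `train.py`.

```python
def lr_at_step(step, total_steps, base_lr=6e-5, warmup_ratio=0.10,
               end_lr=0.0, power=1.0):
    """Learning rate with linear warmup followed by polynomial decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    if step >= total_steps:
        return end_lr
    # Fraction of the decay phase that has elapsed, in [0, 1).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return (base_lr - end_lr) * (1.0 - progress) ** power + end_lr


# Effective batch size from the table: per-device batch times accumulation steps.
EFFECTIVE_BATCH_SIZE = 4 * 2

# Example with a hypothetical run length of 1000 optimizer steps.
schedule_samples = [lr_at_step(s, 1000) for s in (0, 50, 100, 550, 1000)]
```

With `power=1.0` the decay phase is a straight line from the peak rate down to `end_lr`. Larger powers would drop the rate faster early in the decay phase.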

	## Expected Improvements vs. CMP-Only Model

	| Capability | CMP-only (baseline) | Mixed (this model) |
	|---|---|---|
	| Rectified facades | ✅ 48.6% mIoU | ✅ Likely 55-70% mIoU (estimate, not yet measured; more data + transfer) |
	| Unrectified street photos | ❌ Untrained | ✅ Trained on ADE20K perspective scenes |
	| Perspective robustness | ~10pp IoU drop | Gap expected to narrow via augmentation |
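
The mIoU figures in this table are per-class intersection-over-union values averaged across classes. A minimal NumPy sketch of that computation follows. The `ignore_index=255` convention and the choice to average only over classes that appear in the prediction or the ground truth are assumptions, and the training script may rely on a library metric instead.

```python
import numpy as np


def mean_iou(pred, target, num_classes=6, ignore_index=255):
    """Mean IoU over classes present in the prediction or the ground truth."""
    pred = np.asarray(pred).ravel().astype(np.int64)
    target = np.asarray(target).ravel().astype(np.int64)
    keep = target != ignore_index  # drop pixels marked as ignore
    pred, target = pred[keep], target[keep]
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(target * num_classes + pred,
                       minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    present = union > 0
    if not present.any():
        return float("nan")
    return float((intersection[present] / union[present]).mean())


example_pred = np.array([[0, 1], [1, 1]])
example_target = np.array([[0, 1], [0, 255]])
example_score = mean_iou(example_pred, example_target)
```

In the example, one pixel is ignored, and classes 0 and 1 each reach an IoU of 0.5, so the mean is 0.5.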

	## Citation

	CMP Facade:
	```bibtex
	@INPROCEEDINGS{Tylecek13,
	author = {Radim Tyle{\v c}ek and Radim {\v S}{\' a}ra},
	title = {Spatial Pattern Templates for Recognition of Objects with Regular Structure},
	booktitle = {Proc. GCPR},
	year = {2013},
	}
	```

	ADE20K:
	```bibtex
	@inproceedings{zhou2017scene,
	title={Scene Parsing through ADE20K Dataset},
	author={Zhou, Bolei and Zhao, Hang and Puig, Xavier and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
	booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
	year={2017}
	}
	```

	SegFormer:
	```bibtex
	@article{xie2021segformer,
	title={SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers},
	author={Xie, Enze and Wang, Wenhai and Yu, Zhiding and Anandkumar, Anima and Alvarez, Jose M and Luo, Ping},
	journal={arXiv preprint arXiv:2105.15203},
	year={2021}
	}
	```