---
library_name: transformers
license: mit
tags:
- image-segmentation
- semantic-segmentation
- segformer
- facade
- building
- mixed-rectification
- unrectified
- vision
pipeline_tag: image-segmentation
datasets:
- Xpitfire/cmp_facade
- merve/scene_parse_150
metrics:
- mean_iou
model-index:
- name: segformer-b0-facade-mixed
  results:
  - task:
      type: image-segmentation
      name: Semantic Segmentation
    dataset:
      type: mixed
      name: CMP Facade + ADE20K filtered
      split: validation
    metrics:
    - type: mean_iou
      value: 0.0
      name: Mean IoU
---
# SegFormer-B0: Facade Segmentation (Mixed Rectified + Unrectified)

> **Status:** Training script ready. Run `train.py` to produce this model. See **How to Train** below.

A **SegFormer-B0** model trained on **mixed rectified and unrectified facade data** for a two-pass pipeline:

1. **Pass 1:** run on the raw (unrectified, perspective) street photo to detect the dominant facade wall.
2. **Rectify** the detected facade via a homography.
3. **Pass 2:** run on the rectified crop to extract windows, doors, and balconies cleanly.

| | |
|---|---|
| **Architecture** | SegFormer-B0 (Mix Transformer encoder + all-MLP decoder) |
| **Parameters** | ~3.7 M |
| **Input** | RGB image, any resolution (resized to 512×512) |
| **Output** | 6-class pixel mask |
| **Format** | SafeTensors |
| **Base model** | [Marco333/segformer-b0-facade-cmp](https://huggingface.co/Marco333/segformer-b0-facade-cmp) |

## 6-Class Taxonomy

| ID | Class | Function | Pass |
|:--:|-------|----------|:--:|
| 0 | `background` | Sky, ground, non-facade regions | Both |
| 1 | `facade_wall` | Main wall surface (merged: facade, molding, cornice, pillar, sill, deco) | Both |
| 2 | `window` | Windows + blinds + shopfronts | Both |
| 3 | `door` | Doors + shopfronts | Both |
| 4 | `balcony` | Balconies | Both |
| 5 | `vegetation_occluder` | Trees, plants occluding facade | Both |
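
The merges listed in the table can be written as a name-based lookup. This is a minimal sketch, not the mapping used in `train.py`: the raw names follow the 12-class CMP Facade label set, `TARGET_IDS` mirrors the table, and sending `shop` to `window` is an assumption, since the table lists shopfronts under both `window` and `door`. `remap_cmp_label` is a hypothetical helper.

```python
# Target 6-class taxonomy, with IDs as listed in the taxonomy table.
TARGET_IDS = {
    "background": 0,
    "facade_wall": 1,
    "window": 2,
    "door": 3,
    "balcony": 4,
    "vegetation_occluder": 5,
}

# Raw CMP class name -> target class name.
# ASSUMPTION: "shop" is sent to "window"; the table lists shopfronts
# under both window and door, so the real script may differ.
CMP_TO_TARGET = {
    "background": "background",
    "facade": "facade_wall",
    "molding": "facade_wall",
    "cornice": "facade_wall",
    "pillar": "facade_wall",
    "sill": "facade_wall",
    "deco": "facade_wall",
    "window": "window",
    "blind": "window",
    "shop": "window",
    "door": "door",
    "balcony": "balcony",
}


def remap_cmp_label(name: str) -> int:
    """Map a raw CMP class name to its 6-class target ID."""
    return TARGET_IDS[CMP_TO_TARGET[name]]
```

Working with names rather than raw integer IDs avoids depending on the label order of a particular dataset export.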

## Two-Pass Pipeline

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

processor = SegformerImageProcessor.from_pretrained("Marco333/segformer-b0-facade-mixed")
model = SegformerForSemanticSegmentation.from_pretrained("Marco333/segformer-b0-facade-mixed")
model.eval()

# Pass 1: raw street photo
image = Image.open("street_photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Upsample logits to the original (height, width); PIL's .size is (width, height)
mask = F.interpolate(outputs.logits, size=image.size[::-1],
                     mode="bilinear", align_corners=False).argmax(dim=1)[0]

# Find the biggest facade_wall blob (class 1), compute a homography, rectify...

# Pass 2: rectified crop
rectified = Image.open("rectified_facade.jpg").convert("RGB")
inputs2 = processor(images=rectified, return_tensors="pt")
with torch.no_grad():
    outputs2 = model(**inputs2)
mask2 = F.interpolate(outputs2.logits, size=rectified.size[::-1],
                      mode="bilinear", align_corners=False).argmax(dim=1)[0]
```
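
The "find the biggest facade_wall blob" step, left as a comment in the snippet above, could look like the following. It is a sketch under stated assumptions: `largest_blob_bbox` is a hypothetical helper, it takes the mask as a plain nested list (for example `mask.tolist()`), and it uses 4-connectivity. Homography estimation is not shown.

```python
from collections import deque

FACADE_WALL = 1  # class ID from the taxonomy table


def largest_blob_bbox(mask, target=FACADE_WALL):
    """Return the inclusive (top, left, bottom, right) box of the largest
    4-connected blob of `target` pixels, or None if there is none."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    best_size, best_bbox = 0, None

    for r in range(height):
        for c in range(width):
            if mask[r][c] != target or seen[r][c]:
                continue
            # Breadth-first flood fill over one connected component.
            queue = deque([(r, c)])
            seen[r][c] = True
            size = 0
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                size += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and not seen[ny][nx] and mask[ny][nx] == target:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if size > best_size:
                best_size, best_bbox = size, (top, left, bottom, right)
    return best_bbox
```

A pure-Python flood fill is slow on full-resolution masks; running it on a downsampled mask, or swapping in a connected-components routine from an image library, would be the practical choice.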

## How to Train

This repo contains the training script. Run it on any GPU (T4 or better):

```bash
pip install transformers datasets evaluate accelerate torch torchvision Pillow numpy
python train.py
```

**What the script does:**

1. Loads the **CMP Facade** dataset (~492 rectified images, 12 classes) and remaps it to the 6-class taxonomy.
2. Loads **ADE20K scene_parse_150** (~20K images), filters it to building-containing scenes, and remaps it to the same 6 classes.
3. Applies **perspective augmentation** (`RandomPerspective`, p=0.4) during training to simulate oblique camera angles.
4. Starts from the existing CMP-only model (`segformer-b0-facade-cmp`, 48.56% mIoU).
5. Trains for 80 epochs and pushes the best model to this Hub repo.

**Training time:** ~4-6 h on a T4 GPU
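
The building filter in step 2 can be pictured as a coverage test on each ADE20K annotation. The sketch below is illustrative only: `building_fraction`, `keep_scene`, and the 5% default threshold are assumptions rather than values taken from `train.py`, and the caller must supply the label IDs of the building-like classes, because those depend on how the dataset encodes its 150 classes.

```python
def building_fraction(mask, building_ids):
    """Fraction of pixels in a nested-list mask whose label is building-like."""
    total = 0
    hits = 0
    for row in mask:
        for label in row:
            total += 1
            if label in building_ids:
                hits += 1
    return hits / total if total else 0.0


def keep_scene(mask, building_ids, min_fraction=0.05):
    """Keep a scene only if enough of it is covered by building-like classes.

    ASSUMPTION: the 5% default is a placeholder, not the script's setting.
    """
    return building_fraction(mask, building_ids) >= min_fraction


# Usage with a hypothetical label ID (7) standing in for "building".
example_mask = [[7, 7, 0, 0], [0, 0, 0, 0]]
print(keep_scene(example_mask, building_ids={7}))
```

A coverage threshold, rather than a simple presence check, keeps out scenes where a building is only a few pixels in the distance.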

## Data Sources

| Dataset | Type | Images | Geometry | Classes (raw) | Classes (remapped) |
|---------|------|--------|----------|--------------|-------------------|
| [CMP Facade](https://huggingface.co/datasets/Xpitfire/cmp_facade) | Primary | ~492 | Rectified | 12 | 6 (background, wall, window, door, balcony, ignore) |
| [ADE20K scene_parse_150](https://huggingface.co/datasets/merve/scene_parse_150) | Augmentation | ~5K filtered | Unrectified (perspective) | 150 | 6 (same taxonomy) |

### Why mix these?

- **CMP alone** works well on rectified facades but fails on street-view perspective.
- **ADE20K** adds natural perspective building scenes (wall, building, house, skyscraper classes).
- **Perspective augmentation** (`RandomPerspective`, distortion=0.3, p=0.4) is intended to close the geometric domain gap.

The literature reports this gap: [Texture2LoD3](https://huggingface.co/papers/2504.05249) measured a ~10 pp IoU drop for SegFormer on unrectified versus rectified facades. Perspective augmentation during training is a practical way to address it.
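
For segmentation, the image and its label mask need the same warp, so the corner displacements are sampled once and reused for both. The mask should then be resampled with nearest-neighbour interpolation so that class IDs are not blended. The sketch below only samples the corners. `sample_perspective_corners` and `maybe_sample_warp` are hypothetical helpers, loosely modelled on the idea behind `RandomPerspective` rather than copied from torchvision, and may not match what `train.py` does.

```python
import random


def sample_perspective_corners(width, height, distortion=0.3, rng=None):
    """Return (startpoints, endpoints): four (x, y) corners each, ordered
    top-left, top-right, bottom-right, bottom-left."""
    rng = rng or random.Random()
    # Each corner may move inward by up to `distortion` of the half-size.
    max_dx = int(distortion * (width // 2))
    max_dy = int(distortion * (height // 2))
    start = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    end = [
        (rng.randint(0, max_dx), rng.randint(0, max_dy)),
        (width - 1 - rng.randint(0, max_dx), rng.randint(0, max_dy)),
        (width - 1 - rng.randint(0, max_dx), height - 1 - rng.randint(0, max_dy)),
        (rng.randint(0, max_dx), height - 1 - rng.randint(0, max_dy)),
    ]
    return start, end


def maybe_sample_warp(width, height, p=0.4, distortion=0.3, rng=None):
    """With probability p return corner pairs for one warp, otherwise None."""
    rng = rng or random.Random()
    if rng.random() >= p:
        return None
    return sample_perspective_corners(width, height, distortion, rng)
```

The returned corner pairs can be passed to any perspective-warp routine, once for the image and once for the mask.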

## Training Configuration

| Parameter | Value |
|-----------|-------|
| Base checkpoint | `Marco333/segformer-b0-facade-cmp` |
| Optimizer | AdamW |
| Learning rate | 6 × 10⁻⁵ |
| LR schedule | Polynomial decay |
| Warmup | 10% of steps |
| Weight decay | 0.01 |
| Effective batch size | 8 (4 per device × 2 gradient accumulation steps) |
| Resolution | 512 × 512 |
| Precision | FP16 |
| Epochs | 80 |
| Augmentation | ColorJitter + RandomPerspective (p=0.4, distortion=0.3) |
| Selection metric | Highest mean IoU on validation |
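
To make the schedule rows concrete, here is a small sketch of linear warmup followed by polynomial decay, plus the effective batch size arithmetic. The base learning rate and warmup fraction come from the table. The decay power of 1.0 and the end learning rate of 0 are assumptions, since the table only says "Polynomial decay", and `lr_at_step` is a hypothetical helper rather than the scheduler used in `train.py`.

```python
BASE_LR = 6e-5          # from the table
WARMUP_FRACTION = 0.10  # from the table


def lr_at_step(step, total_steps, base_lr=BASE_LR,
               warmup_fraction=WARMUP_FRACTION, power=1.0, end_lr=0.0):
    """Learning rate at `step`. ASSUMPTION: power=1.0 and end_lr=0.0."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / warmup_steps
    if step >= total_steps:
        return end_lr
    # Fraction of the decay phase still remaining, in (0, 1].
    remaining = 1.0 - (step - warmup_steps) / (total_steps - warmup_steps)
    return (base_lr - end_lr) * remaining ** power + end_lr


# Effective batch size on a single GPU: per-device batch x accumulation steps.
per_device_batch = 4
grad_accum_steps = 2
effective_batch = per_device_batch * grad_accum_steps
```

With 1000 total steps, the rate peaks at step 100 and, under the linear assumption, falls to half the base rate at step 550.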

## Expected Improvements vs. CMP-Only Model

| Capability | CMP-only (baseline) | Mixed (this model) |
|---|---|---|
| Rectified facades | 48.6% mIoU | Likely 55-70% mIoU (more data + transfer); not yet measured |
| Unrectified street photos | Untrained | Trained on ADE20K perspective scenes |
| Perspective robustness | ~10 pp IoU drop | Gap expected to narrow via augmentation |
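
For reference, mean IoU averages the per-class intersection-over-union. The sketch below computes it on nested-list masks. The `mean_iou` function here is a hypothetical helper, not the `evaluate` metric used for model selection, and skipping classes that appear in neither mask is an assumption about how the average is taken.

```python
def mean_iou(pred, target, num_classes=6, ignore_index=None):
    """Mean of per-class IoU over classes present in pred or target."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == ignore_index:
                continue
            if p == t:
                intersection[t] += 1
                union[t] += 1
            else:
                # A mismatch adds one pixel to the union of both classes.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

The result is a fraction in [0, 1]; multiply by 100 to compare with the percentages in the table.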

## Citation

CMP Facade:

```bibtex
@INPROCEEDINGS{Tylecek13,
  author = {Radim Tyle{\v c}ek and Radim {\v S}{\' a}ra},
  title = {Spatial Pattern Templates for Recognition of Objects with Regular Structure},
  booktitle = {Proc. GCPR},
  year = {2013},
}
```

ADE20K:

```bibtex
@inproceedings{zhou2017scene,
  title={Scene Parsing through ADE20K Dataset},
  author={Zhou, Bolei and Zhao, Hang and Puig, Xavier and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2017}
}
```

SegFormer:

```bibtex
@article{xie2021segformer,
  title={SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers},
  author={Xie, Enze and Wang, Wenhai and Yu, Zhiding and Anandkumar, Anima and Alvarez, Jose M and Luo, Ping},
  journal={arXiv preprint arXiv:2105.15203},
  year={2021}
}
```