Faster Segment Anything: Towards Lightweight SAM for Mobile Applications
MobileSAM performs on par with the original SAM (at least visually) and keeps exactly the same pipeline, except for a change to the image encoder. Specifically, we replace the original heavyweight ViT-H encoder (632M parameters) with a much smaller Tiny-ViT (5M parameters). On a single GPU, MobileSAM takes around 12 ms per image: 8 ms for the image encoder and 4 ms for the mask decoder.
The comparison of the ViT-based image encoders is summarized as follows:
| Image Encoder | Original SAM | MobileSAM |
|---|---|---|
| Parameters | 611M | 5M |
| Speed | 452ms | 8ms |
Original SAM and MobileSAM have exactly the same prompt-guided mask decoder:
| Mask Decoder | Original SAM | MobileSAM |
|---|---|---|
| Parameters | 3.876M | 3.876M |
| Speed | 4ms | 4ms |
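Because the mask decoder is identical in both models, the encoder dominates the per-image cost. The sketch below estimates total latency for one image with several prompts, using only the timings from the tables above. It assumes the image embedding is computed once and reused for every prompt, and that encoder and decoder times simply add up; `total_latency_ms` is a hypothetical helper for illustration, not part of any released code.

```python
# Per-component latencies in milliseconds, copied from the tables above.
ENCODER_MS = {"Original SAM": 452.0, "MobileSAM": 8.0}
DECODER_MS = 4.0  # the mask decoder is identical in both models


def total_latency_ms(model: str, num_prompts: int) -> float:
    """Estimate the time for one image and `num_prompts` prompts.

    Assumption: the image is encoded once and the embedding is reused,
    so only the decoder cost scales with the number of prompts.
    """
    if num_prompts < 0:
        raise ValueError("num_prompts must be non-negative")
    return ENCODER_MS[model] + DECODER_MS * num_prompts


if __name__ == "__main__":
    for n in (1, 10, 100):
        sam = total_latency_ms("Original SAM", n)
        mobile = total_latency_ms("MobileSAM", n)
        print(f"{n:>3} prompts: SAM {sam:.0f} ms, MobileSAM {mobile:.0f} ms, "
              f"speedup {sam / mobile:.1f}x")
```

Under these assumptions, the relative advantage of the lighter encoder is largest for single-prompt use and shrinks as more prompts share one embedding, since the decoder cost is the same for both models.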
The comparison of the whole pipeline is summarized as follows:
| Whole Pipeline (Enc+Dec) | Original SAM | MobileSAM |
|---|---|---|
| Parameters | 615M | 9.66M |
| Speed | 456ms | 12ms |
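To make the reductions explicit, the short sketch below derives the parameter and latency ratios directly from the tabulated figures. The dictionary names and the `ratio` helper are illustrative only. Note that the per-component parameter counts appear to be rounded, so the encoder and decoder rows do not sum exactly to the MobileSAM pipeline total; the ratios should be read as approximate.

```python
# (Original SAM, MobileSAM) pairs copied from the tables above.
PARAMS_M = {"encoder": (611.0, 5.0), "pipeline": (615.0, 9.66)}
LATENCY_MS = {"encoder": (452.0, 8.0), "pipeline": (456.0, 12.0)}


def ratio(pair: tuple[float, float]) -> float:
    """Return how many times larger the first value is than the second."""
    original, mobile = pair
    return original / mobile


if __name__ == "__main__":
    for part in ("encoder", "pipeline"):
        print(f"{part}: {ratio(PARAMS_M[part]):.1f}x fewer parameters, "
              f"{ratio(LATENCY_MS[part]):.1f}x faster")
```

With the numbers as reported, the whole pipeline comes out 38x faster and roughly 64x smaller than the original SAM.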
@article{kirillov2023segany,
title={Segment Anything},
author={Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C. and Lo, Wan-Yen and Doll{\'a}r, Piotr and Girshick, Ross},
journal={arXiv:2304.02643},
year={2023}
}
@InProceedings{tiny_vit,
title={TinyViT: Fast Pretraining Distillation for Small Vision Transformers},
author={Wu, Kan and Zhang, Jinnian and Peng, Houwen and Liu, Mengchen and Xiao, Bin and Fu, Jianlong and Yuan, Lu},
booktitle={European Conference on Computer Vision (ECCV)},
year={2022}
}
BibTeX:
@article{mobile_sam,
title={Faster Segment Anything: Towards Lightweight SAM for Mobile Applications},
author={Zhang, Chaoning and Han, Dongshen and Qiao, Yu and Kim, Jung Uk and Bae, Sung Ho and Lee, Seungkyu and Hong, Choong Seon},
journal={arXiv preprint arXiv:2306.14289},
year={2023}
}