🚀 Mobile VLA: Vision-Language-Action Model for Mobile Robots

📋 Model Description

Mobile VLA is a Vision-Language-Action model for mobile robots, built on Kosmos-2B. It predicts continuous 3D actions in obstacle-avoidance scenarios.

🎯 Key Features

  • Vision-Language-Action: takes an image and a text instruction and predicts a robot action
  • 3D continuous control: a continuous action space of the form [linear_x, linear_y, angular_z]
  • Obstacle avoidance: learns left/right avoidance strategies in 1-box and 2-box scenarios
  • Real-time processing: fast inference through efficient vision-only processing

🔧 Technical Specifications

  • Backbone model: microsoft/kosmos-2-patch14-224
  • Input: RGB image (224x224) + text instruction
  • Output: 3D continuous action vector
  • Training method: regression with Huber loss
  • Data: 72 real-robot episodes
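
The specification names Huber loss as the regression objective. As a minimal sketch of what that objective computes for a single 3D action, the plain-Python function below applies the standard Huber definition per component. The `delta=1.0` threshold and the sample values are illustrative assumptions; the card does not state the training configuration.

```python
def huber(pred, target, delta=1.0):
    """Mean Huber loss over the components of one action vector.

    delta is an assumed threshold; the model card does not specify it.
    """
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err * err              # quadratic for small errors
        else:
            total += delta * (err - 0.5 * delta)  # linear for large errors
    return total / len(pred)


# Hypothetical prediction and ground truth: [linear_x, linear_y, angular_z]
pred = [0.5, 0.0, 0.0]
target = [1.0, 2.0, 0.0]
loss = huber(pred, target)
print(f"Huber loss: {loss:.4f}")
```

The linear branch is what keeps a single large error (here on linear_y) from dominating the loss the way it would under a squared error.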

📊 Performance Metrics

Overall Performance

  • Overall MAE: 0.285
  • Threshold accuracy (0.1): 37.5%
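
The card does not define how these two numbers are computed. The sketch below shows one plausible reading: MAE averaged over every action component, and threshold accuracy as the fraction of components whose absolute error is below 0.1. Both definitions and the sample values are assumptions made for illustration.

```python
def abs_errors(preds, targets):
    """Flatten per-component absolute errors across all samples."""
    return [abs(p - t)
            for pred_vec, target_vec in zip(preds, targets)
            for p, t in zip(pred_vec, target_vec)]


def mae(preds, targets):
    errors = abs_errors(preds, targets)
    return sum(errors) / len(errors)


def threshold_accuracy(preds, targets, threshold=0.1):
    # Assumed definition: share of components with error below the threshold.
    errors = abs_errors(preds, targets)
    return sum(1 for e in errors if e < threshold) / len(errors)


# Hypothetical predictions and ground truth: [linear_x, linear_y, angular_z]
preds = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
targets = [[1.0, 0.5, 0.0], [0.5, 0.0, 0.25]]

overall_mae = mae(preds, targets)
accuracy = threshold_accuracy(preds, targets)
print(f"MAE: {overall_mae:.3f}, threshold accuracy: {accuracy:.1%}")
```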

Per-Action Performance

Action      MAE     R² Score   Description
linear_x    0.243   0.354      Forward/backward (excellent)
linear_y    0.550   0.293      Left/right movement (fair)
angular_z   0.062   0.000      Rotation (low)
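
For readers interpreting the R² column, the sketch below computes the coefficient of determination from its usual definition, 1 - SS_res / SS_tot. It also shows that a predictor that always outputs the target mean scores exactly 0. The values are made up, and the card does not say whether this is what happens for angular_z.

```python
def r2_score(preds, targets):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined (division by zero) when all targets are identical.
    """
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for p, t in zip(preds, targets))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


# Hypothetical ground-truth values for one action dimension
targets = [0.0, 1.0, 2.0, 3.0]

r2_mean = r2_score([1.5] * 4, targets)   # always predict the mean
r2_perfect = r2_score(targets, targets)  # predict every value exactly
print(r2_mean, r2_perfect)
```

A low MAE can therefore coexist with an R² near zero when the target itself varies very little.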

Per-Scenario Performance

Scenario               MAE     Grade   Description
1box_right_vertical    0.217   B+      Excellent
1box_left_horizontal   0.303   B       Good
2box_left_vertical     0.322   B       Good
1box_left_vertical     0.337   B-      Fair

🚀 Usage

Installation

pip install transformers torch pillow numpy

Basic Usage

from mobile_vla import MobileVLAModel
from PIL import Image
import torch

# Load the model
model = MobileVLAModel.from_pretrained("minuum/mobile-vla")

# Prepare the image and the task instruction (the model expects RGB input)
image = Image.open("robot_camera.jpg").convert("RGB")
task = "Navigate around obstacles to track the target cup"

# Predict without tracking gradients
with torch.no_grad():
    actions = model.predict(image, task)

print(f"Predicted actions: {actions}")
# Output: [linear_x, linear_y, angular_z]

Advanced Usage

# Batch processing over a sequence of 8 frames
images = [Image.open(f"frame_{i}.jpg") for i in range(8)]
actions = model.predict_sequence(images, task)

# Real-time control
# camera_stream and robot are placeholders for your own camera and robot interfaces
for frame in camera_stream:
    action = model.predict(frame, task)
    robot.execute(action)

🏗️ Model Architecture

[RGB Images] → [Kosmos-2B Vision] → [Action Head] → [3D Actions]
     ↓              ↓                    ↓             ↓
   224x224    Image Features         Regression    [x, y, θ]

Core Components

  1. Kosmos-2B Vision Model: extracts image features
  2. Action Head: 3D regression head (512 → 3*chunk_size)
  3. Window/Chunk: observes 8 frames → predicts 2 frames
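
To make the head's output shape concrete, the sketch below maps a 512-dimensional feature vector to 3*chunk_size values with a single linear layer and then splits the result into one [linear_x, linear_y, angular_z] triple per predicted frame. The single-layer head, the random weights, and the frame-major output layout are all assumptions for illustration; the card gives only the input and output sizes.

```python
import random

FEATURE_DIM = 512  # feature size stated in the card
ACTION_DIM = 3     # [linear_x, linear_y, angular_z]
CHUNK_SIZE = 2     # frames predicted per step, per the card


def linear(features, weights, bias):
    """Plain linear layer: one dot product per output unit."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def split_into_chunk(flat, action_dim=ACTION_DIM):
    """Reshape a flat output into a list of per-frame action vectors."""
    return [flat[i:i + action_dim] for i in range(0, len(flat), action_dim)]


rng = random.Random(0)
out_dim = ACTION_DIM * CHUNK_SIZE

# Stand-in weights and features; a trained model would supply real ones.
weights = [[rng.uniform(-0.01, 0.01) for _ in range(FEATURE_DIM)]
           for _ in range(out_dim)]
bias = [0.0] * out_dim
features = [rng.random() for _ in range(FEATURE_DIM)]

flat = linear(features, weights, bias)
actions = split_into_chunk(flat)
print(len(flat), [len(a) for a in actions])
```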

📈 Comparison with RoboVLMs

Item               RoboVLMs                       Mobile VLA
Data requirement   Millions of demonstrations     72 episodes
Action space       7-DOF discrete                 3D continuous
Inference speed    Mixed                          Fast
Specialization     General-purpose manipulation   Mobile robots
Evaluation         Success rate                   Multi-dimensional regression metrics

🎯 Key Improvements

  • Data efficiency: practical performance with 1000x less data
  • Real-time performance: fast inference through vision-only processing
  • Continuous control: precise 3D action prediction
  • Scenario specialization: optimized specifically for obstacle avoidance

📚 Training Data

  • Number of episodes: 72
  • Scenarios: 1box/2box × left/right × vertical/horizontal
  • Actions: continuous [linear_x, linear_y, angular_z] values
  • Images: RGB from a real robot camera (224x224)
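
The scenario grid above can be enumerated directly. The sketch below builds names in the same `1box_right_vertical` style used in the per-scenario results; whether all eight combinations are present in the 72 episodes is not stated in the card.

```python
from itertools import product

box_counts = ["1box", "2box"]
sides = ["left", "right"]
orientations = ["vertical", "horizontal"]

# Cartesian product of the three factors, joined with underscores
scenarios = ["_".join(parts)
             for parts in product(box_counts, sides, orientations)]

for name in scenarios:
    print(name)
```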

🔬 Research Background

This model keeps the Window/Chunk mechanism of RoboVLMs while adding features specific to mobile robots:

  1. Window/Chunk retained: observe 8 frames → predict 2 frames
  2. Kosmos-2B integration: uses a Vision-Language backbone
  3. Continuous control: switches the action space from discrete to continuous
  4. Real robot data: data collected on a real robot, stored in HDF5 format
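
The Window/Chunk structure can be sketched as a sliding window over one episode: each sample pairs 8 observed frames with the 2 actions that follow. The function name `make_samples` is hypothetical, and the assumption that the predicted chunk starts immediately after the window is not confirmed by the card.

```python
def make_samples(frames, actions, window=8, chunk=2):
    """Slice one episode into (observation window, action chunk) pairs.

    Assumes the chunk covers the steps right after the window.
    Returns an empty list if the episode is shorter than window + chunk.
    """
    samples = []
    last_start = len(frames) - window - chunk
    for start in range(last_start + 1):
        obs = frames[start:start + window]
        future = actions[start + window:start + window + chunk]
        samples.append((obs, future))
    return samples


# Hypothetical 12-step episode: frame indices stand in for images
frames = list(range(12))
actions = [[0.1 * i, 0.0, 0.0] for i in range(12)]

samples = make_samples(frames, actions)
print(f"{len(samples)} samples from a {len(frames)}-step episode")
```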

📄 Citation

@misc{mobile_vla_2024,
  title={Mobile VLA: Vision-Language-Action Model for Mobile Robot Navigation},
  author={Mobile VLA Team},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/minuum/mobile-vla}
}

🤝 Contributing

This model was developed on top of the RoboVLMs framework and is released to support the mobile robot community.

📞 Contact


Generated on 2025-08-21
