MolmoAct: Action Reasoning Models that can Reason in Space
Paper: [arXiv:2508.07917](https://arxiv.org/abs/2508.07917)
FP8 per-channel quantized version of allenai/MolmoAct-7B-D-0812.
- Scheme: FP8 (float-quantized) with per-channel weight scales of shape `[out_features, 1]`.
- Kept in original precision: `lm_head`, `vision_backbone`, `wte`.
- Evaluated against the BF16 baseline on identical inputs (a 3-view robot manipulation task).
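As a rough illustration of the scheme above (not the exact recipe used to produce this checkpoint), per-channel FP8 quantization computes one scale per output row, which is why the scale tensor has shape `[out_features, 1]`:

```python
# Illustrative sketch of per-channel FP8 (e4m3) weight scaling.
# The helper name and recipe here are assumptions, not this repo's code.
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3

def per_channel_scales(weight):
    """weight: nested list of shape [out_features, in_features].
    Returns one scale per output channel, shape [out_features, 1]."""
    return [[max(abs(w) for w in row) / FP8_E4M3_MAX] for row in weight]

W = [[0.5, -2.0, 1.0],
     [0.1,  0.3, -0.2]]
scales = per_channel_scales(W)  # [[2.0/448], [0.3/448]]

# Dividing each row by its scale keeps every entry within the FP8 range:
W_scaled = [[w / s[0] for w in row] for row, s in zip(W, scales)]
assert all(abs(w) <= FP8_E4M3_MAX for row in W_scaled for w in row)
```

Real FP8 kernels additionally round the scaled values to the e4m3 grid; this sketch only shows where the `[out_features, 1]` scale shape comes from.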
The only differing dimension is dim 2 (z-axis translation):
| Metric | Value |
|---|---|
| BF16 value | 0.000457 |
| FP8 value | 0.000307 |
| Absolute error | 0.000150 |
| Error / std (0.006416) | 2.3% |
| Error / operating range (q01–q99 = 0.038) | 0.4% |
| Dims with exact match | 6 / 7 |
This is a single-bin difference in the 256-bin action discretization — sub-millimeter at typical robot step sizes.
| Dim | Axis | BF16 | FP8 | Match |
|---|---|---|---|---|
| 0 | x | -0.000833 | -0.000833 | exact |
| 1 | y | 0.004786 | 0.004786 | exact |
| 2 | z | 0.000457 | 0.000307 | 1 bin off |
| 3 | roll | 0.002318 | 0.002318 | exact |
| 4 | pitch | 0.005848 | 0.005848 | exact |
| 5 | yaw | 0.009253 | 0.009253 | exact |
| 6 | gripper | 0.996078 | 0.996078 | exact |
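The single-bin claim checks out arithmetically. Assuming the 256 bins uniformly cover the q01–q99 operating range of 0.038 (an assumption about the binning, which the base model may implement differently), one bin is about 0.000148, almost exactly the observed z-axis error:

```python
# Reproduce the table's z-axis error figures from the raw values.
# Assumes 256 uniform bins over the q01-q99 operating range (0.038).
bf16_z, fp8_z = 0.000457, 0.000307
abs_err = bf16_z - fp8_z      # 0.000150
bin_width = 0.038 / 256       # ~0.000148, i.e. one bin ~= the error

print(f"abs error:   {abs_err:.6f}")
print(f"bin width:   {bin_width:.6f}")
print(f"error/std:   {abs_err / 0.006416:.1%}")  # ~2.3%
print(f"error/range: {abs_err / 0.038:.1%}")     # ~0.4%
```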
Requires the MolmoAct vLLM integration (`molmoact_v2.py`).
```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint; compute dtype for non-quantized ops stays BF16.
model = LLM(
    model="<this-repo>",
    trust_remote_code=True,
    dtype="bfloat16",
)

# Greedy decoding for deterministic action outputs.
sampling_params = SamplingParams(max_tokens=256, temperature=0)

# MolmoAct's step-by-step reasoning prompt: depth map -> trajectory -> action.
instruction = "close the box"
prompt = (
    f"The task is {instruction}. "
    "What is the action that the robot should take. "
    f"To figure out the action that the robot should take to {instruction}, "
    "let's think through it step by step. "
    "First, what is the depth map for the first image? "
    "Second, what is the trajectory of the end effector in the first image? "
    "Based on the depth map of the first image and the trajectory of the end effector in the first image, "
    "along with other images from different camera views as additional information, "
    "what is the action that the robot should take?"
)

# Pass the multi-view images (e.g. PIL.Image objects) and generate.
outputs = model.generate(
    {"prompt": prompt, "multi_modal_data": {"image": [img1, img2, img3]}},
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```
License: Apache 2.0. See the base model card for full details.
```bibtex
@misc{molmoact2025,
  title={MolmoAct: Action Reasoning Models that can Reason in Space},
  author={Jason Lee and Jiafei Duan and Haoquan Fang and Yuquan Deng and Shuo Liu and Boyang Li and Bohan Fang and Jieyu Zhang and Yi Ru Wang and Sangho Lee and Winson Han and Wilbert Pumacay and Angelica Wu and Rose Hendrix and Karen Farley and Eli VanderBilt and Ali Farhadi and Dieter Fox and Ranjay Krishna},
  year={2025},
  eprint={2508.07917},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2508.07917}
}
```