
MolmoAct 7B-D FP8

FP8 per-channel quantized version of allenai/MolmoAct-7B-D-0812.

Quantization Details

  • Method: FP8 E4M3 per-channel (weight-only, W8A16)
  • Format: compressed-tensors (float-quantized)
  • Scale: one scale per output channel ([out_features, 1])
  • Excluded: lm_head, vision_backbone, wte (kept in original precision)
  • Tool: QStream
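The per-channel scheme above can be sketched in a few lines. This is an illustrative NumPy sketch of weight-only per-channel FP8 scaling, not QStream's actual implementation; it computes the `[out_features, 1]` scales and clips to the E4M3 range, but skips the final rounding onto the FP8 grid (NumPy has no native FP8 dtype).

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_per_channel_quantize(weight: np.ndarray):
    """Weight-only per-channel FP8 quantization sketch (W8A16).

    weight: [out_features, in_features]
    Returns (quantized_weight, scale), scale shaped [out_features, 1].
    """
    # One scale per output channel: map each row's max magnitude to the FP8 range.
    absmax = np.abs(weight).max(axis=1, keepdims=True)        # [out, 1]
    scale = np.maximum(absmax, 1e-12) / FP8_E4M3_MAX          # avoid divide-by-zero
    q = np.clip(weight / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)  # values stored as FP8
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # At inference (W8A16) the FP8 weight is upcast and rescaled per channel,
    # while activations stay in 16-bit precision.
    return q * scale

W = np.random.randn(4, 8).astype(np.float32)
q, s = fp8_per_channel_quantize(W)
W_hat = dequantize(q, s)
```

Because `lm_head`, `vision_backbone`, and `wte` are excluded, only the LLM's linear weights go through this path.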

Quality

Compared against the BF16 baseline on identical inputs (a 3-view robot manipulation task).

Action error analysis

The only differing dimension is dim 2 (z-axis translation):

| Metric | Value |
|---|---|
| BF16 value | 0.000457 |
| FP8 value | 0.000307 |
| Absolute error | 0.000150 |
| Error / std (0.006416) | 2.3% |
| Error / operating range (q01–q99 = 0.038) | 0.4% |
| Dims with exact match | 6 / 7 |

This is a single-bin difference in the 256-bin action discretization — sub-millimeter at typical robot step sizes.
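A back-of-envelope check of the single-bin claim. The uniform 256-bin discretization over the q01–q99 operating range is an assumption for illustration; the model's actual binning statistics may differ.

```python
# Hypothetical uniform binning over the q01-q99 operating range reported above.
N_BINS = 256
OPERATING_RANGE = 0.038              # q01-q99 width for the z axis
bin_width = OPERATING_RANGE / N_BINS  # ~1.48e-4 per bin

bf16_z, fp8_z = 0.000457, 0.000307
error = abs(bf16_z - fp8_z)          # 1.5e-4 absolute error
bins_off = round(error / bin_width)  # how many bins apart the two outputs are
```

The 1.5e-4 error is almost exactly one bin width, consistent with the "1 bin off" entry in the table below.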

Per-dimension action comparison

| Dim | Axis | BF16 | FP8 | Match |
|---|---|---|---|---|
| 0 | x | -0.000833 | -0.000833 | exact |
| 1 | y | 0.004786 | 0.004786 | exact |
| 2 | z | 0.000457 | 0.000307 | 1 bin off |
| 3 | roll | 0.002318 | 0.002318 | exact |
| 4 | pitch | 0.005848 | 0.005848 | exact |
| 5 | yaw | 0.009253 | 0.009253 | exact |
| 6 | gripper | 0.996078 | 0.996078 | exact |

Depth map and trajectory

  • Depth map: ~10 of 100 tokens differ vs BF16, all in adjacent depth bins
  • Trajectory: close to BF16; 1 of 5 points is an exact match
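The depth-map comparison amounts to counting token-level disagreements between the two runs. A small sketch of that measurement (the token IDs below are made up for illustration; MolmoAct's actual depth-token vocabulary differs):

```python
# Hypothetical depth-token ID sequences from BF16 and FP8 runs of the same input.
bf16_tokens = [12, 45, 45, 46, 90, 91, 13, 13, 77, 78]
fp8_tokens  = [12, 45, 46, 46, 90, 91, 13, 14, 77, 78]

diffs = [(a, b) for a, b in zip(bf16_tokens, fp8_tokens) if a != b]
# If quantization noise flips a token, it should land in an adjacent depth
# bin, i.e. a neighbouring ID under an assumed monotone bin-to-ID mapping.
all_adjacent = all(abs(a - b) == 1 for a, b in diffs)
```

Under that assumption, "~10/100 tokens differ, adjacent bins" means the depth estimate shifts by at most one quantization step anywhere it differs.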

Usage (vLLM)

Requires the MolmoAct vLLM integration (molmoact_v2.py).

```python
from vllm import LLM, SamplingParams

model = LLM(
    model="<this-repo>",
    trust_remote_code=True,
    dtype="bfloat16",
)

sampling_params = SamplingParams(max_tokens=256, temperature=0)

instruction = "close the box"
prompt = (
    f"The task is {instruction}. "
    "What is the action that the robot should take. "
    f"To figure out the action that the robot should take to {instruction}, "
    "let's think through it step by step. "
    "First, what is the depth map for the first image? "
    "Second, what is the trajectory of the end effector in the first image? "
    "Based on the depth map of the first image and the trajectory of the end effector in the first image, "
    "along with other images from different camera views as additional information, "
    "what is the action that the robot should take?"
)

# Pass multi-view images (e.g. PIL images from the robot's cameras) and generate
outputs = model.generate(
    {"prompt": prompt, "multi_modal_data": {"image": [img1, img2, img3]}},
    sampling_params=sampling_params,
)
```

Model Details

  • Architecture: Qwen2.5-7B LLM + SigLip2 vision backbone
  • LLM: 28 layers, hidden_size=3584, 28 attention heads, 4 KV heads, head_dim=128
  • Vision: 27-layer ViT, hidden_size=1152, patch_size=14, input 378x378
  • Action space: 7-DoF (x, y, z, roll, pitch, yaw, gripper), 256 bins per dimension
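The listed dimensions are internally consistent, which a little arithmetic confirms (the derived values below are computed from the numbers above, not quoted from the model config):

```python
# Sanity arithmetic on the architecture numbers listed above.
hidden_size, n_heads, n_kv_heads = 3584, 28, 4
head_dim = hidden_size // n_heads            # 3584 / 28 = 128, as listed
gqa_group = n_heads // n_kv_heads            # 7 query heads share each KV head

image_size, patch_size = 378, 14
patches_per_side = image_size // patch_size  # 378 / 14 = 27
vision_tokens = patches_per_side ** 2        # 729 patch tokens per camera view
```

The 4 KV heads imply grouped-query attention with 7 query heads per KV head, which shrinks the KV cache 7x relative to full multi-head attention.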

License

Apache 2.0. See the base model for full details.

Citation

```bibtex
@misc{molmoact2025,
      title={MolmoAct: Action Reasoning Models that can Reason in Space},
      author={Jason Lee and Jiafei Duan and Haoquan Fang and Yuquan Deng and Shuo Liu and Boyang Li and Bohan Fang and Jieyu Zhang and Yi Ru Wang and Sangho Lee and Winson Han and Wilbert Pumacay and Angelica Wu and Rose Hendrix and Karen Farley and Eli VanderBilt and Ali Farhadi and Dieter Fox and Ranjay Krishna},
      year={2025},
      eprint={2508.07917},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2508.07917}
}
```