
MolmoAct 7B-D FP8

FP8 per-channel quantized version of allenai/MolmoAct-7B-D-0812.

Quantization Details

  • Method: FP8 E4M3 per-channel (weight-only, W8A16)
  • Format: compressed-tensors (float-quantized)
  • Scale: one scale per output channel ([out_features, 1])
  • Excluded: lm_head, vision_backbone, wte (kept in original precision)
  • Tool: QStream
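The per-channel scheme above can be sketched in a few lines. This is an illustrative NumPy sketch of weight-only per-channel FP8 scaling, not QStream's actual implementation; it computes the `[out_features, 1]` scales and clips to the E4M3 range, but skips the final rounding onto the FP8 grid (NumPy has no native FP8 dtype).

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_per_channel_quantize(weight: np.ndarray):
    """Weight-only per-channel FP8 quantization sketch (W8A16).

    weight: [out_features, in_features]
    Returns (quantized_weight, scale), scale shaped [out_features, 1].
    """
    # One scale per output channel: map each row's max magnitude to the FP8 range.
    absmax = np.abs(weight).max(axis=1, keepdims=True)        # [out, 1]
    scale = np.maximum(absmax, 1e-12) / FP8_E4M3_MAX          # avoid divide-by-zero
    q = np.clip(weight / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)  # values stored as FP8
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # At inference (W8A16) the FP8 weight is upcast and rescaled per channel,
    # while activations stay in 16-bit precision.
    return q * scale

W = np.random.randn(4, 8).astype(np.float32)
q, s = fp8_per_channel_quantize(W)
W_hat = dequantize(q, s)
```

Because `lm_head`, `vision_backbone`, and `wte` are excluded, only the LLM's linear weights go through this path.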

Quality

Compared against the BF16 baseline on identical inputs (a 3-view robot manipulation task).

Action error analysis

The only differing dimension is dim 2 (z-axis translation):

| Metric | Value |
|---|---|
| BF16 value | 0.000457 |
| FP8 value | 0.000307 |
| Absolute error | 0.000150 |
| Error / std (0.006416) | 2.3% |
| Error / operating range (q01–q99 = 0.038) | 0.4% |
| Dims with exact match | 6 / 7 |

This is a single-bin difference in the 256-bin action discretization — sub-millimeter at typical robot step sizes.
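A back-of-envelope check of the single-bin claim. The uniform 256-bin discretization over the q01–q99 operating range is an assumption for illustration; the model's actual binning statistics may differ.

```python
# Hypothetical uniform binning over the q01-q99 operating range reported above.
N_BINS = 256
OPERATING_RANGE = 0.038              # q01-q99 width for the z axis
bin_width = OPERATING_RANGE / N_BINS  # ~1.48e-4 per bin

bf16_z, fp8_z = 0.000457, 0.000307
error = abs(bf16_z - fp8_z)          # 1.5e-4 absolute error
bins_off = round(error / bin_width)  # how many bins apart the two outputs are
```

The 1.5e-4 error is almost exactly one bin width, consistent with the "1 bin off" entry in the table below.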

Per-dimension action comparison

| Dim | Axis | BF16 | FP8 | Match |
|---|---|---|---|---|
| 0 | x | -0.000833 | -0.000833 | exact |
| 1 | y | 0.004786 | 0.004786 | exact |
| 2 | z | 0.000457 | 0.000307 | 1 bin off |
| 3 | roll | 0.002318 | 0.002318 | exact |
| 4 | pitch | 0.005848 | 0.005848 | exact |
| 5 | yaw | 0.009253 | 0.009253 | exact |
| 6 | gripper | 0.996078 | 0.996078 | exact |

Depth map and trajectory

  • Depth map: ~10 of 100 tokens differ vs BF16, all in adjacent depth bins
  • Trajectory: close to BF16; 1 of 5 points is an exact match
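The depth-map comparison amounts to counting token-level disagreements between the two runs. A small sketch of that measurement (the token IDs below are made up for illustration; MolmoAct's actual depth-token vocabulary differs):

```python
# Hypothetical depth-token ID sequences from BF16 and FP8 runs of the same input.
bf16_tokens = [12, 45, 45, 46, 90, 91, 13, 13, 77, 78]
fp8_tokens  = [12, 45, 46, 46, 90, 91, 13, 14, 77, 78]

diffs = [(a, b) for a, b in zip(bf16_tokens, fp8_tokens) if a != b]
# If quantization noise flips a token, it should land in an adjacent depth
# bin, i.e. a neighbouring ID under an assumed monotone bin-to-ID mapping.
all_adjacent = all(abs(a - b) == 1 for a, b in diffs)
```

Under that assumption, "~10/100 tokens differ, adjacent bins" means the depth estimate shifts by at most one quantization step anywhere it differs.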

Usage (vLLM)

Requires the MolmoAct vLLM integration (molmoact_v2.py).

```python
from vllm import LLM, SamplingParams

model = LLM(
    model="<this-repo>",
    trust_remote_code=True,
    dtype="bfloat16",
)

sampling_params = SamplingParams(max_tokens=256, temperature=0)

instruction = "close the box"
prompt = (
    f"The task is {instruction}. "
    "What is the action that the robot should take. "
    f"To figure out the action that the robot should take to {instruction}, "
    "let's think through it step by step. "
    "First, what is the depth map for the first image? "
    "Second, what is the trajectory of the end effector in the first image? "
    "Based on the depth map of the first image and the trajectory of the end effector in the first image, "
    "along with other images from different camera views as additional information, "
    "what is the action that the robot should take?"
)

# Pass multi-view images (e.g. PIL images from the robot's cameras) and generate
outputs = model.generate(
    {"prompt": prompt, "multi_modal_data": {"image": [img1, img2, img3]}},
    sampling_params=sampling_params,
)
```

Model Details

  • Architecture: Qwen2.5-7B LLM + SigLip2 vision backbone
  • LLM: 28 layers, hidden_size=3584, 28 attention heads, 4 KV heads, head_dim=128
  • Vision: 27-layer ViT, hidden_size=1152, patch_size=14, input 378x378
  • Action space: 7-DoF (x, y, z, roll, pitch, yaw, gripper), 256 bins per dimension
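The listed dimensions are internally consistent, which a little arithmetic confirms (the derived values below are computed from the numbers above, not quoted from the model config):

```python
# Sanity arithmetic on the architecture numbers listed above.
hidden_size, n_heads, n_kv_heads = 3584, 28, 4
head_dim = hidden_size // n_heads            # 3584 / 28 = 128, as listed
gqa_group = n_heads // n_kv_heads            # 7 query heads share each KV head

image_size, patch_size = 378, 14
patches_per_side = image_size // patch_size  # 378 / 14 = 27
vision_tokens = patches_per_side ** 2        # 729 patch tokens per camera view
```

The 4 KV heads imply grouped-query attention with 7 query heads per KV head, which shrinks the KV cache 7x relative to full multi-head attention.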

License

Apache 2.0. See the base model for full details.

Citation

```bibtex
@misc{molmoact2025,
      title={MolmoAct: Action Reasoning Models that can Reason in Space},
      author={Jason Lee and Jiafei Duan and Haoquan Fang and Yuquan Deng and Shuo Liu and Boyang Li and Bohan Fang and Jieyu Zhang and Yi Ru Wang and Sangho Lee and Winson Han and Wilbert Pumacay and Angelica Wu and Rose Hendrix and Karen Farley and Eli VanderBilt and Ali Farhadi and Dieter Fox and Ranjay Krishna},
      year={2025},
      eprint={2508.07917},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2508.07917}
}
```