I have produced a quantized exllamav2 version: 6 GB and much faster inference

#3
by sujitvasanth - opened

Hi. I was able to run your 7B Image-Text-to-Text model on exllamav2, which now supports qwen2, kimi, and the vision tower.
I had to build a custom architecture configuration for exllamav2 to support your model.
It allows for much faster inference with much lower VRAM use.
https://huggingface.co/sujitvasanth/OpenCUA-7B-exl2
I'm just uploading it now.
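For anyone who wants to try it, loading the quant follows the usual exllamav2 dynamic-generator pattern. The sketch below is only a rough outline, not the repo's actual inference script: the vision-tower calls (`ExLlamaV2VisionTower`, `get_image_embeddings`, `text_alias`) follow my memory of the upstream multimodal example and may differ by exllamav2 version, and the plain-text prompt is just a placeholder rather than the model's real chat template.

```python
from PIL import Image
from exllamav2 import (
    ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache,
    ExLlamaV2Tokenizer, ExLlamaV2VisionTower,
)
from exllamav2.generator import ExLlamaV2DynamicGenerator

model_dir = "OpenCUA-7B-exl2"                 # local clone of the quantized repo
config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache, progress=True)    # split across available GPUs
tokenizer = ExLlamaV2Tokenizer(config)

# vision tower (per the exllamav2 multimodal example; names may vary by version)
vision = ExLlamaV2VisionTower(config)
vision.load(progress=True)
image = Image.open("screenshot.png")
emb = vision.get_image_embeddings(model=model, tokenizer=tokenizer, image=image)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
output = generator.generate(
    prompt=emb.text_alias + "\nDescribe the current screen.",  # placeholder prompt, not the real chat template
    max_new_tokens=256,
    add_bos=True,
    embeddings=[emb],
)
print(output)
```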

XLang NLP Lab org

Wow! That's cool! I can help add the link to the OpenCUA-7B readme, if you are OK with it.

Best,
Xinyuan

A problem... on deeper testing, the quantised model is showing inconsistent visual understanding... I will need to look at the exllama custom model structure.

Still struggling with a proper implementation on exllamav3... do you have a version of 7B that uses the standard qwen2.5vl architecture?
It would also help to know how the 2D RoPE is transformed to 1D... my model is seeing the image in out-of-sync patches, as far as I can tell from how they are ordered.
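To show what I mean by out-of-sync patches, here is a toy numpy illustration (not OpenCUA's actual layout, just the kind of mismatch I suspect): if one side flattens the 2D patch grid row-major and the other side column-major, or applies a 2x2 spatial-merge grouping (as I understand qwen2.5vl does) before flattening, the same 1D positions end up pointing at different patches.

```python
import numpy as np

# 4x6 grid of patch indices, purely illustrative
rows, cols = 4, 6
grid = np.arange(rows * cols).reshape(rows, cols)

row_major = grid.reshape(-1)       # plain row-major scan of the patch grid
col_major = grid.T.reshape(-1)     # column-major scan -> patches interleaved "out of sync"

# 2x2 spatial-merge style grouping: neighbouring patches become contiguous,
# which reorders them again relative to a plain row-major scan
m = 2
merged = grid.reshape(rows // m, m, cols // m, m).transpose(0, 2, 1, 3).reshape(-1)

print(row_major[:8])   # [0 1 2 3 4 5 6 7]
print(col_major[:8])   # [ 0  6 12 18  1  7 13 19]
print(merged[:8])      # [0 1 6 7 2 3 8 9]
```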

Hi @Xinyuan, yes, it's all working now - I had to adjust the Python inference script to get the image embeds aligned properly.
The working repo (same weights, updated inference script) is available at https://huggingface.co/sujitvasanth/OpenCUA-7B-exl2
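For anyone hitting the same issue, the essence of the fix was making sure every image-placeholder position in the token sequence receives the matching vision embedding, in the order the tower emits the patches. A toy torch sketch of that invariant (dummy ids and tensors only, not the actual script or OpenCUA's real special tokens):

```python
import torch

hidden, n_patches = 8, 4
image_token_id = 151655                        # stand-in placeholder id, not OpenCUA's real one
input_ids = torch.tensor([[1, 151655, 151655, 151655, 151655, 2]])
inputs_embeds = torch.zeros(1, input_ids.shape[1], hidden)   # would come from the embedding layer
image_embeds = torch.randn(n_patches, hidden)                # would come from the vision tower

mask = input_ids == image_token_id
assert mask.sum().item() == n_patches, "placeholder count must match patch count"
inputs_embeds[mask] = image_embeds             # misordered or offset patches here -> scrambled image
```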
regarding "I can help add the link to the OpenCUA-7B readme" - yes please do this.

I have also developed a much lower-resource pipeline for deploying on local computers (Ubuntu, Windows); it uses VNC or RDP to self-host the environment without the need for a full virtual machine.
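The core loop is simply: grab a frame over VNC, ask the model where to act, and send the input back. A bare-bones sketch of that idea using vncdotool (illustrative only; `predict_click` is a hypothetical stand-in for the actual OpenCUA inference call, and the host/password are examples):

```python
from vncdotool import api

def predict_click(image_path: str) -> tuple[int, int]:
    # hypothetical stand-in: run OpenCUA on the screenshot and return target x, y
    return 100, 100

client = api.connect("127.0.0.1::5900", password="secret")   # VNC server exposing the desktop
client.captureScreen("frame.png")    # capture the current screen to a file
x, y = predict_click("frame.png")    # model decides where to click
client.mouseMove(x, y)
client.mousePress(1)                 # button 1 = left click
api.shutdown()
```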


XLang NLP Lab org

Thank you! Great to hear it’s working now!

I have added the link to the OpenCUA GitHub repo and the OpenCUA-7B README.

Also very nice idea on the lightweight local deployment — that could be quite useful for community users.

Best,
Xinyuan

xywang626 changed discussion status to closed
