I have produced a quantized exllamav2 version: 6 GB and much faster inference
Hi. I was able to run your 7B image-text-to-text model on exllamav2, which now supports Qwen2, Kimi, and a vision tower.
I had to build a custom architecture configuration for exllamav2 to support your model.
It allows much faster inference with much lower VRAM use.
https://huggingface.co/sujitvasanth/OpenCUA-7B-exl2
I'm uploading it now.
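For anyone who wants to try it once the upload finishes, loading the quant follows exllamav2's standard pattern. This is a rough sketch rather than the exact script in the repo, and the "OpenCUA-7B-exl2" path is just wherever you downloaded the weights; exact arguments depend on your exllamav2 version and GPU:

```python
# Sketch of loading the exl2 quant with exllamav2's standard API
# (not the exact script shipped in the repo).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

model_dir = "OpenCUA-7B-exl2"             # local path to the quantized weights
config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # lazy cache so autosplit can size it
model.load_autosplit(cache)               # spread layers across available VRAM
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Hello", max_new_tokens=32))
```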
Wow! That's cool! I can help add the link to the OpenCUA-7B readme, if you are OK with it.
Best,
Xinyuan
Problem: on deeper testing, the quantised model is showing inconsistency in its visual understanding. I will need to look at the exllamav2 custom model structure.
Still struggling with a proper implementation on exllamav3. Do you have a version of the 7B that uses the standard Qwen2.5-VL architecture?
It would also help to know how the 2D RoPE is transformed to 1D: as far as I can tell from how the patches are ordered, my model is seeing the image as out-of-sync patches.
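To make the patch-ordering question concrete, here is a toy PyTorch sketch (my illustration, not OpenCUA's or exllamav2's actual code) of the consistency I'm trying to verify: the 1-D sequence order of the patch embeddings has to match the order in which the 2-D (row, col) rotary positions are assigned, otherwise the language model receives scrambled patches:

```python
# Toy illustration of where 2-D -> 1-D patch ordering can go out of sync.
import torch

def patch_positions_row_major(grid_h: int, grid_w: int) -> torch.Tensor:
    """(row, col) position ids for each patch, flattened in row-major order."""
    rows = torch.arange(grid_h).repeat_interleave(grid_w)  # 0,0,...,1,1,...
    cols = torch.arange(grid_w).repeat(grid_h)             # 0,1,...,0,1,...
    return torch.stack([rows, cols], dim=-1)               # [grid_h*grid_w, 2]

def flatten_patches_row_major(patch_embeds: torch.Tensor) -> torch.Tensor:
    """[grid_h, grid_w, dim] -> [grid_h*grid_w, dim] in the same row-major order."""
    return patch_embeds.reshape(-1, patch_embeds.shape[-1])

# Consistency check: position id i must describe the patch at sequence index i.
# If the flattening order and the position assignment disagree (e.g. row-major
# vs. a window/merge order), this check fails and the image looks scrambled.
grid_h, grid_w, dim = 3, 4, 8
embeds = torch.randn(grid_h, grid_w, dim)
pos = patch_positions_row_major(grid_h, grid_w)
seq = flatten_patches_row_major(embeds)
assert torch.equal(seq[5], embeds[pos[5, 0], pos[5, 1]])
```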
Hi @Xinyuan, yes, it's all working now. I had to adjust the Python inference script to get the image embeds aligned properly (rough sketch at the end of this post).
The working repo (same weights, updated inference script) is available at https://huggingface.co/sujitvasanth/OpenCUA-7B-exl2
Regarding "I can help add the link to the OpenCUA-7B readme": yes, please do.
I have also developed a much lower-resource pipeline for deploying on local computers (Ubuntu, Windows); it uses VNC or RDP to self-host without needing a full virtual machine.
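For reference, the alignment fix is along these lines. This is a trimmed sketch based on exllamav2's multimodal example (ExLlamaV2VisionTower and its text_alias placeholder), reusing the config/model/tokenizer/generator from the loading snippet earlier in the thread, so names may differ slightly from the actual script in the repo:

```python
# Sketch of passing image embeddings to the generator so they stay aligned
# with the prompt (based on exllamav2's multimodal example, not the exact
# script in the repo). Assumes config/model/tokenizer/generator from above.
from PIL import Image
from exllamav2 import ExLlamaV2VisionTower

vision = ExLlamaV2VisionTower(config)  # vision tower support in recent exllamav2
vision.load()

image = Image.open("screenshot.png")
emb = vision.get_image_embeddings(model=model, tokenizer=tokenizer, image=image)

# The prompt must contain the embedding's text alias exactly where the image
# tokens belong; putting it in the wrong place (or reordering multiple
# embeddings) shifts every image token, which shows up as scrambled patches.
prompt = f"{emb.text_alias}\nWhere is the Save button on this screen?"
output = generator.generate(prompt=prompt, embeddings=[emb], max_new_tokens=128)
print(output)
```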

