Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up
JonnaMatΒ 
posted an update 13 days ago
Post
5924
πŸš€ FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference

πŸ”Ž Check out our latest FlashHead-enabled model: embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead

🧩 Seamless integration with vllm:
docker run --rm -it \
  --network host \
  --shm-size=8g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --runtime=nvidia \
  --name=vllm-serve \
  -e HF_TOKEN=hf_*** \
  -e HF_HOME=/root/.cache/huggingface \
  embedl/vllm:latest-jetson-orin-flashhead \
  vllm serve "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead" \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.75 \
    --max-num-seqs 2 \
    --trust-remote-code


πŸ€“ Want to learn more about FlashHead? Check out this blog post: https://huggingface.co/blog/JonnaMat/flashhead

In this post