Generated with Optimum. The model was exported to OpenVINO IR with int4 weight compression using the following optimum-cli command:

optimum-cli export openvino \
  --model ibm-granite/granite-3.2-2b-instruct \
  --task text-generation-with-past \
  --weight-format int4 \
  --trust-remote-code \
  --group-size 128 \
  --quant-mode int4_f8e4m3 \
  ov
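
As a quick sanity check after export (an illustrative sketch, not part of the original card, assuming optimum-intel and transformers are installed and that the tokenizer files were saved alongside the IR, which optimum-cli does by default), the int4 model in the ov folder can be loaded back and queried once:

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

# Load the exported OpenVINO IR (openvino_model.xml/.bin) and its tokenizer
model = OVModelForCausalLM.from_pretrained("ov")
tokenizer = AutoTokenizer.from_pretrained("ov")

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))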

Usage with openvino_genai

pip install optimum-intel@git+https://github.com/huggingface/optimum-intel.git
pip install openvino-genai fastapi uvicorn huggingface_hub 
pip install nncf 
import huggingface_hub as hf_hub
import openvino_genai as ov_genai

model_id = "hsuwill000/granite-3.2-2b-instruct_int4_ov"
model_path = "ov"

hf_hub.snapshot_download(model_id, local_dir=model_path)
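
# Optional sanity check (illustrative addition, not in the original card): make sure the
# OpenVINO IR files were actually downloaded before building the pipeline. The filename
# below is the one optimum-cli normally produces; adjust it if your export differs.
from pathlib import Path
assert (Path(model_path) / "openvino_model.xml").exists(), "OpenVINO IR not found"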

# Build an LLM pipeline on CPU from the downloaded OpenVINO model
pipe = ov_genai.LLMPipeline(model_path, "CPU")

# Streamer callback: prints each generated subword as soon as it is produced
def streamer(subword):
    print(subword, end='', flush=True)
    # The return value tells the pipeline whether generation should continue or stop.
    return ov_genai.StreamingStatus.RUNNING
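
# Returning ov_genai.StreamingStatus.STOP instead ends generation early. A hedged,
# illustrative variant (not in the original card) that stops once a marker substring
# shows up in the streamed chunk:
# def stopping_streamer(subword):
#     print(subword, end='', flush=True)
#     if '###' in subword:
#         return ov_genai.StreamingStatus.STOP
#     return ov_genai.StreamingStatus.RUNNING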

pipe.start_chat()
while True:
    try:
        prompt = input('question:\n')
    except EOFError:
        break
    gen_result = pipe.generate([prompt], streamer=streamer, max_new_tokens=32768)
    # Report decoding throughput from the pipeline's performance metrics
    print(f"\n--- TPS --- {gen_result.perf_metrics.get_throughput().mean:.2f} tokens/s")
    print('\n----------\n')
pipe.finish_chat()
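
Sampling parameters can also be set once on the pipeline rather than on every generate call. Below is a minimal, illustrative sketch (not part of the original card) that assumes openvino_genai's GenerationConfig together with get_generation_config/set_generation_config, and reads the generated text and throughput from the returned results:

import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("ov", "CPU")

# Start from the pipeline's default generation config and tweak it
config = pipe.get_generation_config()
config.max_new_tokens = 256
config.do_sample = True        # sample instead of greedy decoding
config.temperature = 0.7
pipe.set_generation_config(config)

result = pipe.generate(["Explain int4 weight-only quantization in one sentence."])
print(result.texts[0])
print(f"TPS: {result.perf_metrics.get_throughput().mean:.2f} tokens/s")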