Gemma 4 E2B GPTQ INT4 on AXERA NPU

Ready-to-run deployment package for google/gemma-4-E2B-it on AX650 / NPU3.

  • This release packages the w4a16 AXERA NPU runtime.
  • Compatible with Pulsar2 5.2 and later.
  • Includes the tokenizer/config files required at runtime.
  • Includes compiled Gemma 4 text .axmodel files, Vision .axmodel files, and fixed-duration Audio .axmodel files.
  • Supports text-only chat, single-image multimodal inference, and fixed-duration audio transcription in the legacy Python demo flow.
  • Supports text, image, audio, and video understanding in the axllm interactive and OpenAI-compatible service flows.

Supported Platform

  • AX650 / NPU3

Validated Devices

This package has been validated on the following AX650-based devices:

Performance

All measurements below were taken on AX650 / NPU3. TTFT stands for time to first token.

  • w8a16: TTFT is approximately 2175 ms (1152 tokens), with a decode throughput of approximately 7.99 tokens/s (theoretical maximum).
  • w4a16: TTFT is approximately 1568 ms (1152 tokens), with a decode throughput of approximately 12.41 tokens/s (theoretical maximum).

The packaged text runtime in this release is the w4a16 build. Its text runtime files are packaged at the repository root. The w8a16 numbers are provided for reference only.
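The figures above also imply an average prefill rate and a relative decode speed-up. The sketch below derives both from the quoted numbers only; it treats TTFT as pure prefill time over the 1152 input tokens, which is a simplifying assumption, and the results are arithmetic rather than additional measurements.

```python
# Figures quoted in the Performance section (AX650 / NPU3).
PROMPT_TOKENS = 1152
BUILDS = {
    "w8a16": {"ttft_ms": 2175.0, "decode_tps": 7.99},
    "w4a16": {"ttft_ms": 1568.0, "decode_tps": 12.41},
}


def prefill_tokens_per_second(ttft_ms: float, tokens: int = PROMPT_TOKENS) -> float:
    """Average prefill rate implied by a TTFT measured on `tokens` input tokens."""
    return tokens / (ttft_ms / 1000.0)


def decode_speedup(fast: str = "w4a16", slow: str = "w8a16") -> float:
    """Ratio of the quoted decode throughputs (fast build over slow build)."""
    return BUILDS[fast]["decode_tps"] / BUILDS[slow]["decode_tps"]


for name, build in BUILDS.items():
    rate = prefill_tokens_per_second(build["ttft_ms"])
    print(f"{name}: ~{rate:.0f} prefill tokens/s")
print(f"w4a16 decode speed-up over w8a16: ~{decode_speedup():.2f}x")
```

On these numbers the w4a16 build prefills at roughly 735 tokens/s versus roughly 530 tokens/s for w8a16, and decodes about 1.55x faster.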

Startup Runtime Footprint

| Item | Value |
| --- | --- |
| Flash total (all 41 axmodels) | 2.92 GiB (2988.27 MiB) |
| Runtime CMM total (default config) | 2.87 GiB (2943.48 MiB) |

Vision Encoder Latency

| Model | Resolution | Soft Tokens | Time (ms) |
| --- | --- | --- | --- |
| gemma4_vision_h336_w480_t70.axmodel | 336x480 | 70 | 87.966 |
| gemma4_vision_h480_w672_t140.axmodel | 480x672 | 140 | 258.329 |
| gemma4_vision_h672_w960_t280.axmodel | 672x960 | 280 | 750.429 |

Audio Encoder Latency and Accuracy

Validated on AX650 / NPU3 with the packaged WAV sample clips and the legacy Python demo flow.

| Model | Audio Duration | Audio Tokens | Encoder Cosine vs ONNX | Board-side Transcription vs ONNX |
| --- | --- | --- | --- | --- |
| gemma4_audio_5s.axmodel | 5s | 125 | 0.996173 | character-for-character match |
| gemma4_audio_30s.axmodel | 30s | 750 | 0.998999 / 0.999799 (chunk0 / chunk1) | character-for-character match |

Single-run latency (ax_run_model --warmup 1 --repeat 5, NPU3 single-core affinity):

| Model | CMM Size | Min | Max | Avg |
| --- | --- | --- | --- | --- |
| gemma4_audio_5s.axmodel | ~310 MiB | 28.905 ms | 28.955 ms | 28.930 ms |
| gemma4_audio_30s.axmodel | ~359 MiB | 170.848 ms | 171.166 ms | 170.978 ms |
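The "Encoder Cosine vs ONNX" column compares the board-side encoder output with an ONNX reference. This README does not state exactly how that figure was computed; a common approach is cosine similarity over the flattened output tensors. The sketch below shows that calculation in plain Python, with `cosine_similarity` as a hypothetical helper name.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened tensors of equal length."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return dot / (norm_a * norm_b)
```

A value close to 1.0, such as the figures in the table, means the two outputs point in nearly the same direction; the metric ignores overall scale.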

Package Layout

.
├── README.md
├── config.json
├── post_config.json
├── infer_axmodel.py
├── gradio_demo.py
├── assets/
├── gemma_4_e2b_it_tokenizer/
├── gemma4_tokenizer.txt
├── gemma4_text_p128_l*.axmodel
├── gemma4_text_post.axmodel
├── gemma4_audio_5s.axmodel
├── gemma4_audio_30s.axmodel
├── gemma4_vision_h336_w480_t70.axmodel
├── gemma4_vision_h480_w672_t140.axmodel
├── gemma4_vision_h672_w960_t280.axmodel
├── model.embed_tokens_per_layer.weight.npy
├── model.embed_tokens.weight.bfloat16.bin
├── model.per_layer_model_projection.weight.npy
├── model.per_layer_projection_norm.weight.npy
├── vit_models/
└── utils/

This package uses a hybrid layout: the tokenizer stays in a subdirectory, the packaged text runtime files plus the Vision and Audio .axmodel files live at the repository root, and vit_models/ keeps the accompanying Vision metadata JSON files.

The Python demo scripts auto-detect the packaged paths above. If you keep this layout unchanged, you can run the Python examples later in this README without passing extra path arguments.
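Because the demo scripts rely on this layout, it can help to confirm that a download is complete before running anything. The following is a minimal sketch, not part of the package: `missing_entries` is a hypothetical helper, and the `REQUIRED` list is a subset copied from the layout listed above.

```python
from pathlib import Path

# Entries copied from the package layout; glob patterns are allowed.
REQUIRED = [
    "config.json",
    "post_config.json",
    "infer_axmodel.py",
    "gemma_4_e2b_it_tokenizer",
    "gemma4_text_p128_l*.axmodel",
    "gemma4_text_post.axmodel",
    "gemma4_vision_h336_w480_t70.axmodel",
    "gemma4_audio_5s.axmodel",
    "gemma4_audio_30s.axmodel",
    "model.embed_tokens.weight.bfloat16.bin",
]


def missing_entries(root: str, required=REQUIRED) -> list:
    """Return the required names or patterns that match nothing under `root`."""
    base = Path(root)
    return [pattern for pattern in required if not any(base.glob(pattern))]
```

Calling `missing_entries(".")` from the package root should return an empty list when the layout is intact.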

Sample Image

Both the axllm flow and the legacy Python demo flow below can use the packaged sample image: assets/sample.png

Sample Audio

The package also includes three packaged WAV clips for board-side audio validation:

  • assets/gemma4_audio_test_5s.wav
  • assets/gemma4_audio_test_chunk0_30s.wav
  • assets/gemma4_audio_test_chunk1_30s.wav

Sample Video

The package also includes one packaged MP4 clip for axllm-side video validation:

  • assets/red-panda-openai.mp4

For axllm run, extract the clip into a frame directory and pass video:<frames_dir>. For axllm serve, you can use the same frame directory with the OpenAI-compatible video example below.

Direct Inference with axllm

The axllm workflow is still being refined. The instructions below reflect the current validated flow and may be adjusted as the packaging continues to evolve.

Download the Model Package

Download the release package from Hugging Face:

mkdir -p AXERA-TECH/gemma-4-E2B-it-GPTQ-INT4
cd AXERA-TECH/gemma-4-E2B-it-GPTQ-INT4
hf download AXERA-TECH/gemma-4-E2B-it-GPTQ-INT4 --local-dir .

Install axllm

Option 1: clone the repository and run the installer:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 2: install with a one-line command (default branch: axllm):

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

Option 3: download the prebuilt binary from GitHub Actions CI:

If you do not have a local build environment, download the latest CI-generated axllm binary from GitHub Actions: https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm

Then run:

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Run on the Board

The package root is already arranged for axllm, so no extra runtime path arguments are required.

Note: the command below assumes you run it from the parent directory of AXERA-TECH/gemma-4-E2B-it-GPTQ-INT4. If you are already inside the package directory, use axllm run . instead.

For multimodal testing, you can use the sample image shown above: ./assets/sample.png.

axllm run AXERA-TECH/gemma-4-E2B-it-GPTQ-INT4

Example session (startup log followed by sample prompts):

10:23:02.461 INF Init:1072 | LLM init start
10:23:02.461 INF Init:1087 | shared kv enabled: num_kv_shared_layers=20
10:23:02.461 INF Init:1103 | attention config: layers=35 sliding=28 full=7 linear=0 sliding_window=512 ref_full_layer_idx=0
tokenizer_type = 3
huggingface tokenizer mode = space_replace_bpe
 31% | ##########                       |  12 /  38 [6.94s<21.96s, 1.73 count/s] init 10 axmodel ok,remain_cmm(7489 MB
 34% | ##########                       |  13 /  38 [7.34s<21.45s, 1.77 count/s] init 11 axmodel ok,remain_cmm(7452 MB
 36% | ###########                      |  14 /  38 [7.79s<21.14s, 1.80 count/s] init 12 axmodel ok,remain_cmm(7416 MB
 39% | ############                     |  15 /  38 [8.18s<20.73s, 1.83 count/s] init 13 axmodel ok,remain_cmm(7379 MB
 42% | #############                    |  16 /  38 [8.68s<20.61s, 1.84 count/s] init 14 axmodel ok,remain_cmm(7331 MB
 44% | ##############                   |  17 /  38 [9.25s<20.67s, 1.84 count/s] init 15 axmodel ok,remain_cmm(7279 MB
 47% | ###############                  |  18 /  38 [9.86s<20.81s, 1.83 count/s] init 16 axmodel ok,remain_cmm(7226 MB
 50% | ################                 |  19 /  38 [10.45s<20.90s, 1.82 count/s] init 17 axmodel ok,remain_cmm(7173 M
 52% | ################                 |  20 /  38 [11.09s<21.06s, 1.80 count/s] init 18 axmodel ok,remain_cmm(7120 M
 55% | #################                |  21 /  38 [11.82s<21.38s, 1.78 count/s] init 19 axmodel ok,remain_cmm(7057 M
 57% | ##################               |  22 /  38 [12.41s<21.44s, 1.77 count/s] init 20 axmodel ok,remain_cmm(7004 M
 60% | ###################              |  23 /  38 [12.99s<21.46s, 1.77 count/s] init 21 axmodel ok,remain_cmm(6951 M
 63% | ####################             |  24 /  38 [13.57s<21.49s, 1.77 count/s] init 22 axmodel ok,remain_cmm(6898 M
 65% | #####################            |  25 /  38 [14.17s<21.54s, 1.76 count/s] init 23 axmodel ok,remain_cmm(6845 M
 68% | #####################            |  26 /  38 [14.94s<21.84s, 1.74 count/s] init 24 axmodel ok,remain_cmm(6782 M
 71% | ######################           |  27 /  38 [15.44s<21.73s, 1.75 count/s] init 25 axmodel ok,remain_cmm(6729 M
 73% | #######################          |  28 /  38 [16.01s<21.72s, 1.75 count/s] init 26 axmodel ok,remain_cmm(6677 M
 76% | ########################         |  29 /  38 [16.60s<21.76s, 1.75 count/s] init 27 axmodel ok,remain_cmm(6624 M
 78% | #########################        |  30 /  38 [17.20s<21.79s, 1.74 count/s] init 28 axmodel ok,remain_cmm(6571 M
 81% | ##########################       |  31 /  38 [18.59s<22.79s, 1.67 count/s] init 29 axmodel ok,remain_cmm(6508 M
 84% | ##########################       |  32 /  38 [19.49s<23.14s, 1.64 count/s] init 30 axmodel ok,remain_cmm(6455 M
 86% | ###########################      |  33 /  38 [20.10s<23.14s, 1.64 count/s] init 31 axmodel ok,remain_cmm(6402 M
 89% | ############################     |  34 /  38 [20.47s<22.88s, 1.66 count/s] init 32 axmodel ok,remain_cmm(6349 M
 92% | #############################    |  35 /  38 [20.88s<22.67s, 1.68 count/s] init 33 axmodel ok,remain_cmm(6296 M
 94% | ##############################   |  36 /  38 [21.31s<22.49s, 1.69 count/s] init 34 axmodel ok,remain_cmm(6233 M
 97% | ###############################  |  37 /  38 [24.98s<25.65s, 1.48 count/s] init post axmodel ok,remain_cmm(5813 MB)
10:23:27.441 INF Init:1249 | max_token_len : 2047
10:23:27.441 INF Init:1252 | kv_cache_size : 256, kv_cache_num: 2047
10:23:27.441 INF init_groups_from_model:755 | prefill_token_num : 128
10:23:27.441 INF init_groups_from_model:969 | decode grp: 0, gid: 0, max_token_len : 2047
10:23:27.441 INF init_groups_from_model:973 | prefill grp: 0, gid: 1, history_cap: 0, total_cap: 128, symbolic_cap: 1
10:23:27.441 INF init_groups_from_model:973 | prefill grp: 1, gid: 2, history_cap: 128, total_cap: 256, symbolic_cap: 128
10:23:27.441 INF init_groups_from_model:973 | prefill grp: 2, gid: 3, history_cap: 256, total_cap: 384, symbolic_cap: 256
10:23:27.441 INF init_groups_from_model:973 | prefill grp: 3, gid: 4, history_cap: 384, total_cap: 512, symbolic_cap: 384
10:23:27.441 INF init_groups_from_model:973 | prefill grp: 4, gid: 5, history_cap: 512, total_cap: 640, symbolic_cap: 512
10:23:27.441 INF init_groups_from_model:973 | prefill grp: 5, gid: 6, history_cap: 640, total_cap: 768, symbolic_cap: 640
10:23:27.441 INF init_groups_from_model:973 | prefill grp: 6, gid: 7, history_cap: 768, total_cap: 896, symbolic_cap: 768
10:23:27.441 INF init_groups_from_model:973 | prefill grp: 7, gid: 8, history_cap: 896, total_cap: 1024, symbolic_cap: 896
10:23:27.441 INF init_groups_from_model:973 | prefill grp: 8, gid: 9, history_cap: 1024, total_cap: 1152, symbolic_cap: 1024
10:23:27.441 INF init_groups_from_model:980 | prefill_max_token_num : 1152
10:23:27.441 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  38 /  38 [24.98s<24.98s, 1.52 count/s] embed_selector init ok
10:23:27.464 INF Init:475 | Gemma4 per-layer helper enabled: vocab=262144 hidden=1536 layers=35 per_layer=256 pad=0
10:23:29.513 INF init_audio_profile:555 | Gemma4 audio profile init ok: path=../../gemma-4-E2B-it-GPTQ-INT4/gemma4_audio_5s.axmodel duration=5.0s mel_frames=499 tokens=125 out_dtype=fp32
10:23:29.818 INF init_audio_profile:555 | Gemma4 audio profile init ok: path=../../gemma-4-E2B-it-GPTQ-INT4/gemma4_audio_30s.axmodel duration=30.0s mel_frames=2999 tokens=750 out_dtype=fp32
10:23:29.818 INF Init:1161 | Gemma4 video config: num_frames=32 do_sample_frames=1
10:23:29.818 INF Init:1232 | Gemma4-VL token ids: image_pad=258880 video_pad=258884 audio_pad=258881
10:23:29.818 INF Init:1239 | VisionModule init ok: type=Gemma4VL, tokens_per_block=70, embed_size=1536, out_dtype=fp32
10:23:29.818 WRN Init:1248 | Vision preprocess backend: SimpleCV (OpenCV not found at build time; minor differences vs OpenCV are possible)
10:23:29.820 INF load_config:444 | load config:
10:23:29.820 INF load_config:444 | {
10:23:29.820 INF load_config:444 |     "enable_repetition_penalty": false,
10:23:29.820 INF load_config:444 |     "enable_temperature": false,
10:23:29.820 INF load_config:444 |     "enable_top_k_sampling": false,
10:23:29.820 INF load_config:444 |     "enable_top_p_sampling": false,
10:23:29.820 INF load_config:444 |     "penalty_window": 64,
10:23:29.820 INF load_config:444 |     "repetition_penalty": 1.0,
10:23:29.820 INF load_config:444 |     "temperature": 1.0,
10:23:29.820 INF load_config:444 |     "top_k": 64,
10:23:29.820 INF load_config:444 |     "top_p": 0.95
10:23:29.820 INF load_config:444 | }
10:23:29.820 INF Init:1348 | LLM init ok
Commands:
  /q, /exit  exit
  /reset     reset the KV cache
  /dd        delete the last conversation turn
  /pp        print the conversation history
Ctrl+C: stop the current generation
VLM enabled: after each prompt, input media path (empty = text-only). Use "video:<frames_dir>" for video, "audio:<file>" for audio.
----------------------------------------
prompt >> Could you give me a rundown of your core capabilities?
media >>
10:32:29.187 INF SetKVCache:1662 | decode_grpid:0 prefill_grpid:1 history_cap:0 total_cap:128 symbolic_cap:1 precompute_len:0 input_num_token:184 prefer_symbolic_group:0
10:32:29.187 INF SetKVCache:1684 | current prefill_max_token_num:1152
10:32:29.227 INF SetKVCache:1700 | first run
10:32:29.282 INF Run:1812 | input token num : 184, prefill_split_num : 2
10:32:30.166 INF Run:1895 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=128
10:32:30.166 INF Run:1919 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
10:32:30.358 INF Run:1895 | prefill chunk p=1 history_len=128 grpid=2 kv_cache_num=128 input_tokens=56
10:32:30.358 INF Run:1919 | prefill indices shape: p=1 idx_elems=128 idx_rows=1 pos_rows=0
10:32:30.586 INF Run:2104 | ttft: 1303.76 ms
I am Gemma 4, a Large Language Model developed by Google DeepMind. I am an open weights model.

Here is a rundown of my core capabilities:

*   **Text Understanding and Processing:** I can understand, process, and interpret a wide variety of text inputs.
*   **Text Generation:** I can generate coherent, contextually relevant, and high-quality text in response to prompts. This includes answering questions, summarizing information, writing creative content (like stories or poems), translating languages, and generating code snippets.
*   **Instruction Following:** I excel at following complex instructions and completing tasks as specified in the prompt.
*   **Multimodality (Input):** I can understand and process **text and image** inputs.
*   **Audio Capabilities:** Depending on the specific version of the Gemma 4 family being run (e.g., the 2B or 4B models), I may also have the capability to process **audio** input.
*   **Output Generation:** I generate **text only**. I cannot generate images.
*   **Knowledge Base:** My knowledge is based on the massive dataset I was trained on, with a knowledge cutoff of **January 2025**. I rely on this training data for factual knowledge and reasoning.
*   **Tool Use:** I can use tools only if specific endpoints are provided to me in the context. By default, I **do not have access to Google Search or other external tools**.

**In summary, I am designed to be a versatile language processor capable of complex reasoning, detailed text generation, and understanding multimodal input (text and image).**

How can I help you today? Feel free to give me a task!

10:33:06.379 NTC Run:2472 | hit eos,decode avg 9.83 token/s
10:33:06.822 INF GetKVCache:1627 | precompute_len:537, remaining:615 (tracked)
prompt >> Okay, I got it.
media >>
10:33:56.942 INF SetKVCache:1662 | decode_grpid:0 prefill_grpid:6 history_cap:640 total_cap:768 symbolic_cap:640 precompute_len:537 input_num_token:16 prefer_symbolic_group:0
10:33:56.942 INF SetKVCache:1684 | current prefill_max_token_num:512
10:33:56.947 INF Run:1812 | input token num : 16, prefill_split_num : 1
10:33:56.998 INF Run:1895 | prefill chunk p=0 history_len=537 grpid=6 kv_cache_num=640 input_tokens=16
10:33:56.998 INF Run:1919 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
10:33:57.337 INF Run:2104 | ttft: 389.81 ms
Great! If you have any questions, need help with writing, summarizing information, brainstorming ideas, or anything else that involves text, feel free to ask. I'm here to assist! 😊

10:34:01.313 NTC Run:2472 | hit eos,decode avg 9.56 token/s
10:34:01.731 INF GetKVCache:1627 | precompute_len:592, remaining:560 (tracked)
prompt >> Describe the visual elements of this image.
media >> /root/sample.png
10:36:40.945 INF EncodeForContent:1782 | vision cache store: /root/sample.png
10:36:41.395 INF SetKVCache:1662 | decode_grpid:0 prefill_grpid:7 history_cap:768 total_cap:896 symbolic_cap:768 precompute_len:592 input_num_token:92 prefer_symbolic_group:1
10:36:41.395 INF SetKVCache:1684 | current prefill_max_token_num:512
10:36:41.412 INF Run:1812 | input token num : 92, prefill_split_num : 1
10:36:41.678 INF Run:1895 | prefill chunk p=0 history_len=592 grpid=7 kv_cache_num=768 input_tokens=92
10:36:41.679 INF Run:1919 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
10:36:42.057 INF Run:2104 | ttft: 645.14 ms
Please provide the image you are referring to. I need an image to describe it for you.

It looks like you might have intended to upload an image, but it didn't come through in your message.

**Once you provide the image, I will gladly describe the visual elements for you!**

10:36:48.275 NTC Run:2472 | hit eos,decode avg 9.81 token/s
10:36:48.715 INF GetKVCache:1627 | precompute_len:746, remaining:406 (tracked)
prompt >> Describe the visual elements of this image.
media >> /root/sample.png
10:37:17.470 INF EncodeForContent:1685 | vision cache hit (mem): /root/sample.png
10:37:17.471 INF EncodeForContent:1685 | vision cache hit (mem): /root/sample.png
10:37:17.884 INF SetKVCache:1662 | decode_grpid:0 prefill_grpid:8 history_cap:896 total_cap:1024 symbolic_cap:896 precompute_len:746 input_num_token:91 prefer_symbolic_group:1
10:37:17.884 INF SetKVCache:1684 | current prefill_max_token_num:384
10:37:17.885 INF Run:1812 | input token num : 91, prefill_split_num : 1
10:37:18.178 INF Run:1895 | prefill chunk p=0 history_len=746 grpid=8 kv_cache_num=896 input_tokens=91
10:37:18.178 INF Run:1919 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
10:37:18.598 INF Run:2104 | ttft: 712.84 ms
Since you have provided an image, here is a detailed description of the visual elements:

**Overall Impression:**
The image is a vibrant, cartoonish illustration of a stylized, angry-looking red lobster. It has a bold, energetic, and slightly aggressive design, typical of a mascot or a character design.

**The Lobster:**
*   **Color:** The main body of the lobster is a bright, saturated red.
*   **Shape and Form:** It is depicted in a dynamic pose, suggesting energy or aggression. Its body is segmented, showing claws, a body, and a tail.
*   **Details:** The lobster features prominent features:
    *   **Eyes:** Large, round eyes that suggest intensity or anger.
    *   **Antennae/Claws:** The appendages are stylized and pointed, adding to the aggressive look.
    *   **Texture:** The lines are bold, and there is a strong use of highlights (white or lighter red) to give the illustration a glossy, three-dimensional, and cartoonish pop.

**The Style and Composition:**
*   **Outline:** The entire character is outlined with a thick, black border, which makes the image stand out sharply against any background.
*   **Style:** The art style is flat yet dynamic, relying heavily on solid blocks of color and sharp lines rather than subtle shading. This makes it perfect for a sticker or a logo.
*   **Text/Borders:** There appears to be some stylized text or a logo element around the character, though the focus is entirely on the lobster itself.

**In summary, the image is a high-energy, cartoon illustration of a red lobster, designed to look fierce and appealing, likely intended for merchandise or a bold graphic.**

10:37:54.858 NTC Run:2472 | hit eos,decode avg 10.04 token/s
10:37:55.695 INF GetKVCache:1627 | precompute_len:1202, remaining:-50 (tracked)
prompt >>

/reset is supported in the interactive path. After resetting the KV cache, you can continue with a new text-only or image prompt.
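In the session log above, each `GetKVCache` line reports a `remaining` value that matches `prefill_max_token_num` (1152) minus `precompute_len`, and the last turn ends at `remaining:-50`. The sketch below reproduces that arithmetic. The formula is inferred from the logged numbers rather than documented, and the `should_reset` rule is a hypothetical heuristic for deciding when to issue /reset.

```python
PREFILL_MAX_TOKENS = 1152  # prefill_max_token_num reported in the init log


def remaining_context(precompute_len: int, budget: int = PREFILL_MAX_TOKENS) -> int:
    """Tokens left in the prefill budget after `precompute_len` cached tokens."""
    return budget - precompute_len


def should_reset(precompute_len: int, next_prompt_tokens: int,
                 budget: int = PREFILL_MAX_TOKENS) -> bool:
    """Hypothetical rule: reset when the next prompt no longer fits the budget."""
    return remaining_context(precompute_len, budget) < next_prompt_tokens


# Values taken from the log: precompute_len -> remaining
for cached in (537, 592, 746, 1202):
    print(cached, "->", remaining_context(cached))
```

Under this reading, the conversation above has used up its budget after the fourth turn, which is a natural point to type /reset before continuing.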

Current axllm multimodal scope:

  • image: validated single-image understanding
  • audio:<file>: validated audio understanding / transcription with the packaged 5s and 30s audio encoders
  • video:<frames_dir>: validated video or multi-frame understanding; frames are read in filename order
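Since frames are read in filename order, the naming scheme decides the temporal order the model sees. The sketch below assumes "filename order" means a plain lexicographic sort, which this README does not spell out; `frames_in_filename_order` is a hypothetical helper for previewing that order.

```python
from pathlib import Path


def frames_in_filename_order(frames_dir: str, suffix: str = ".png") -> list:
    """List frame file names in lexicographic order, ignoring other files."""
    return sorted(
        p.name for p in Path(frames_dir).iterdir()
        if p.is_file() and p.suffix.lower() == suffix
    )


# Without zero padding, frame_10 sorts before frame_2 in a lexicographic sort.
print(sorted(["frame_2.png", "frame_10.png"]))
# With zero padding, the order matches the frame index.
print(sorted(["frame_002.png", "frame_010.png"]))
```

Zero-padded names such as frame_001.png keep the sorted order aligned with the frame index.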

In this package, audio and video inputs are exposed through the axllm flow. The legacy Python demo flow below covers text, image, and audio.

Serve with axllm

To launch the packaged model through the local axllm service:

Note: the command below assumes you run it from the parent directory of AXERA-TECH/gemma-4-E2B-it-GPTQ-INT4. If you are already inside the package directory, use axllm serve . --port 8000 instead.

axllm serve AXERA-TECH/gemma-4-E2B-it-GPTQ-INT4 --port 8000
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models
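The same two checks can be scripted from Python using only the standard library. This is a minimal sketch: `service_urls` and `fetch` are hypothetical helper names, and the only endpoints used are the two shown in the curl commands above.

```python
import urllib.request


def service_urls(base_url: str) -> dict:
    """Build the health and model-list URLs for an axllm serve instance."""
    base = base_url.rstrip("/")
    return {"health": f"{base}/health", "models": f"{base}/v1/models"}


def fetch(url: str, timeout: float = 5.0) -> str:
    """Return the response body as text; raises urllib.error.URLError on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `fetch(service_urls("http://127.0.0.1:8000")["health"])` performs the first curl check once the service is running.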

OpenAI-Compatible Multimodal Examples

If you cloned AXERA-TECH/ax-llm, you can reuse its example clients for multimodal requests. scripts/openai_demo.py covers image and audio; scripts/openai_video_demo.py covers video.

Example: audio transcription through axllm serve

python3 scripts/openai_demo.py \
  --model AXERA-TECH/gemma-4-E2B-it-GPTQ-INT4 \
  --api_url http://<board-ip>:8000/v1 \
  --audio /path/to/gemma-4-E2B-it-GPTQ-INT4/assets/gemma4_audio_test_5s.wav \
  --prompt "Transcribe the speech in its original language. Output only the transcription."

Example: video understanding through axllm serve

mkdir -p /tmp/red-panda-frames
ffmpeg -i /path/to/gemma-4-E2B-it-GPTQ-INT4/assets/red-panda-openai.mp4 -vf fps=1 /tmp/red-panda-frames/frame_%03d.png
python3 scripts/openai_video_demo.py \
  --model AXERA-TECH/gemma-4-E2B-it-GPTQ-INT4 \
  --api_url http://<board-ip>:8000/v1 \
  --frames_dir /tmp/red-panda-frames \
  --prompt "Describe this video."

If the client machine does not share the same filesystem with the board, add --mode base64 to scripts/openai_video_demo.py so the selected frames are uploaded in the request payload instead of referenced by server-side path.
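To illustrate what uploading frames in the payload involves, the sketch below encodes one frame file as a base64 data URI. It shows the general technique only: the exact request structure that scripts/openai_video_demo.py sends is not described here, and `frame_to_data_uri` is a hypothetical helper name.

```python
import base64
from pathlib import Path


def frame_to_data_uri(path: str, mime: str = "image/png") -> str:
    """Encode a frame file as a data URI that can be embedded in a JSON request."""
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

Base64 adds roughly one third to the size of each frame, so requests with many frames are noticeably larger than path-based ones.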

Browser UI with lite_webui

If you want a browser UI for the OpenAI-compatible service started by axllm serve, use AXERA-TECH/lite_webui on GitHub or AXERA-TECH/lite_webui on Hugging Face.

Set the OpenAI base URL to http://<board-ip>:8000 and the model name to AXERA-TECH/gemma-4-E2B-it-GPTQ-INT4.

Python Runtime Requirements

Install the following packages on the AX board:

  • pyaxengine
  • transformers>=5.5.0
  • numpy
  • ml_dtypes
  • pillow
  • torch
  • gradio for the web demo only

Before running any Python demo command in this package, make sure the Python dependency overlay is visible in PYTHONPATH:

export PYTHONPATH=/path/to/your/gemma4_pydeps:$PYTHONPATH

If your board image ships with an older transformers stack, this pure-Python overlay is the recommended way to supply the required runtime dependencies.
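After setting PYTHONPATH, a quick import check can confirm that the listed packages are visible to the interpreter. The sketch below is not part of the package. It assumes the usual import names `axengine` for pyaxengine and `PIL` for pillow, and `missing_packages` is a hypothetical helper.

```python
import importlib.util

# pip package name -> import name. "axengine" and "PIL" are assumed import names.
REQUIRED_MODULES = {
    "pyaxengine": "axengine",
    "transformers": "transformers",
    "numpy": "numpy",
    "ml_dtypes": "ml_dtypes",
    "pillow": "PIL",
    "torch": "torch",
}


def missing_packages(modules=REQUIRED_MODULES) -> list:
    """Return the pip names whose import module cannot be found on sys.path."""
    return [
        pip_name for pip_name, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]
```

An empty result from `missing_packages()` means every listed module can be located; it does not check the transformers>=5.5.0 version requirement.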

Legacy Python Demo Flow

Enter the package directory on the board:

cd /path/to/your/gemma-4-E2B-it-GPTQ-INT4

Text-Only Inference

Run the following command:

python3 infer_axmodel.py \
  --prompt "What is the capital of the United States?" \
  --max_new_tokens 256

A typical output looks like this:

[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Init InferenceSession: 100%|██████████████████████████████████████████████████████████| 35/35 [00:02<00:00, 12.73it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Model loaded successfully!
slice_indices: [0]
Slice prefill done: 0
answer >> The capital of the United States is **Washington, D.C.**

Multimodal Inference

Use the sample image shown above: assets/sample.png

Recommended profile: 70 soft tokens at 336x480.

python3 infer_axmodel.py \
  --image_path ./assets/sample.png \
  --prompt "Describe this image in detail." \
  --system_prompt "" \
  --max_new_tokens 1024

A typical output looks like this:

Init InferenceSession: 100%|██████████████████████████████████████████████████████████| 35/35 [00:01<00:00, 18.79it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Model loaded successfully!
slice_indices: [0]
Slice prefill done: 0
answer >> This image is a close-up illustration of a **cartoon character**, likely an **octopus or a similar cephalopod**, rendered in a vibrant, bold style.

Here's a detailed description:

* **Subject:** The image features a single character, which appears to be an octopus or a similar sea creature.
* **Color and Style:** The character is predominantly **red** with black outlines, giving it a very energetic and striking appearance. It uses a flat, graphic style typical of modern cartoons or sticker designs.
* **Form and Details:**
    * The creature has a rounded, bulbous body.
    * It possesses multiple tentacles, which are depicted as thick, muscular appendages.
    * The overall impression is one of energy and perhaps a slightly mischievous or aggressive character, given the bright red color.
    * There are prominent outlines defining the shape of the body and the tentacles.
* **Composition:** The image is a close-up, focusing entirely on the character. It is presented against a plain white background, making the character stand out sharply.
* **Mood:** The illustration is dynamic and bold, designed to catch the viewer's eye.

**In summary:** it is a high-energy red cartoon illustration of a cephalopod, likely intended for use as a sticker, icon, or character graphic.

In addition to the default t70 profile, the package also includes two higher-resolution Vision models:

| VIT file | Resolution | Soft tokens |
| --- | --- | --- |
| gemma4_vision_h336_w480_t70.axmodel | 336x480 | 70 |
| gemma4_vision_h480_w672_t140.axmodel | 480x672 | 140 |
| gemma4_vision_h672_w960_t280.axmodel | 672x960 | 280 |

To use a different profile, pass --vit_model_path explicitly. The runtime will infer the matching soft-token count from the filename:

python3 infer_axmodel.py \
  --image_path ./assets/sample.png \
  --prompt "Describe this image in detail." \
  --system_prompt "" \
  --vit_model_path ./gemma4_vision_h480_w672_t140.axmodel \
  --max_new_tokens 256

python3 infer_axmodel.py \
  --image_path ./assets/sample.png \
  --prompt "Describe this image in detail." \
  --system_prompt "" \
  --vit_model_path ./gemma4_vision_h672_w960_t280.axmodel \
  --max_new_tokens 256
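The Vision filenames encode height, width, and soft-token count, which is what lets the runtime infer the token count from the path. The sketch below shows one way such a filename could be parsed. It illustrates the naming convention and is not the package's actual parsing code; `parse_vision_profile` is a hypothetical name.

```python
import re
from pathlib import Path

# Matches the suffix pattern used by the packaged files, e.g. _h480_w672_t140.axmodel
_PROFILE_RE = re.compile(r"_h(\d+)_w(\d+)_t(\d+)\.axmodel$")


def parse_vision_profile(path: str) -> tuple:
    """Return (height, width, soft_tokens) parsed from a Vision axmodel filename."""
    match = _PROFILE_RE.search(Path(path).name)
    if match is None:
        raise ValueError(f"unrecognized Vision model filename: {path}")
    height, width, soft_tokens = (int(g) for g in match.groups())
    return height, width, soft_tokens
```

If you rename a Vision model file, keep the `_h…_w…_t…` suffix so that filename-based detection has the information it needs.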

Example output with the 672x960 / t280 profile:

Init InferenceSession: 100%|██████████████████████████████████████████████████████████| 35/35 [00:05<00:00,  6.63it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Model loaded successfully!
[WARN] Image token block (group_id=0, pos 5-284) spans 3 prefill slices. Bidirectional attention within earlier slices is partial (chunked prefill limitation).
slice_indices: [0, 1, 2]
Slice prefill done: 0
Slice prefill done: 1
Slice prefill done: 2
answer >> This is a vibrant illustration of a **cartoon-style red crab**.

Here's a detailed description:

* **Subject:** The central focus is a stylized crab, rendered in bright red with black outlines, giving it a bold, comic-book or sticker-like appearance.
* **Color Palette:** The primary colors are vivid red for the body, claws, and legs, with black used for outlines and shading to define the shape and add depth.
* **Details of the Crab:**
    * **Body:** The main body is curved and segmented, showing detailed features like antennae, legs, and a segmented tail.
    * **Claws (Chelipeds):** The crab has prominent claws that are depicted in a dynamic pose. One claw is raised high, and the other is clenched in a fist, suggesting aggression or excitement.
    * **Face:** The crab has a cheerful expression, indicated by large, smiling eyes and a wide, open mouth, suggesting it is friendly or energetic.
    * **Style:** The illustration is highly stylized with thick black outlines, which is characteristic of modern cartoon or mascot art.
* **Composition:** The crab is presented in a dynamic pose, filling a significant portion of the frame, suggesting energy and action.
* **Background:** The background is plain white, which makes the red crab stand out sharply and emphasizes the character itself.

**Overall Impression:** The image is a high-energy, bold, and appealing piece of character art, likely intended for use as a sticker, logo, or character graphic. It conveys a sense of excitement, boldness, and playful aggression.

Audio Inference

The package includes two fixed-duration audio encoders:

  • gemma4_audio_5s.axmodel for 5s / 125 audio tokens
  • gemma4_audio_30s.axmodel for 30s / 750 audio tokens

For board-side validation, use the packaged WAV clips listed in the sample-audio section above.
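Both encoders produce 25 audio tokens per second of input (125 / 5 and 750 / 30). The sketch below picks the smaller profile that covers a clip. The selection rule is an assumption for illustration: this README does not describe how shorter clips are padded or how clips longer than 30s should be split.

```python
# Profiles copied from the list above.
AUDIO_PROFILES = [
    {"model": "gemma4_audio_5s.axmodel", "duration_sec": 5, "tokens": 125},
    {"model": "gemma4_audio_30s.axmodel", "duration_sec": 30, "tokens": 750},
]


def pick_audio_profile(clip_seconds: float) -> dict:
    """Return the shortest packaged profile whose duration covers the clip."""
    for profile in AUDIO_PROFILES:
        if clip_seconds <= profile["duration_sec"]:
            return profile
    raise ValueError("clip is longer than 30s; split it into chunks first")


def tokens_per_second(profile: dict) -> float:
    """Audio tokens produced per second of input for a profile."""
    return profile["tokens"] / profile["duration_sec"]
```

The chosen profile supplies the values for --audio_model_path, --audio_duration_sec, and --audio_tokens in the commands that follow.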

Example: 5s profile

python3 infer_axmodel.py \
  --audio_path ./assets/gemma4_audio_test_5s.wav \
  --audio_model_path ./gemma4_audio_5s.axmodel \
  --audio_duration_sec 5 \
  --audio_tokens 125 \
  --system_prompt "" \
  --prompt "Transcribe the speech in its original language. Output only the transcription." \
  --max_new_tokens 128

Example: 30s profile

python3 infer_axmodel.py \
  --audio_path ./assets/gemma4_audio_test_chunk0_30s.wav \
  --audio_model_path ./gemma4_audio_30s.axmodel \
  --audio_duration_sec 30 \
  --audio_tokens 750 \
  --system_prompt "" \
  --prompt "Transcribe the speech in its original language. Output only the transcription." \
  --max_new_tokens 256

Notes:

  • The two commands above are the recommended packaged commands for this GPTQ release; latency and accuracy results for these runs are not recorded here and are left for you to fill in.
  • The 30s / 750-token profile spans multiple 128-token prefill slices. The runtime may print a warning about partial bidirectional attention across earlier slices inside the same multimodal block. This is expected for chunked prefill.
  • The Python demo loader handles WAV directly with the Python standard library. For mp3 / flac / m4a / ogg, install librosa on the board.
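The slice arithmetic behind the chunked-prefill warning is simple to reproduce. The sketch below assumes fixed 128-token slices, as reported in the logs in this README. The function names are hypothetical, and the start position used for the audio example is illustrative only.

```python
PREFILL_SLICE = 128  # prefill_token_num reported by the runtime


def slices_spanned(first_pos: int, last_pos: int, slice_len: int = PREFILL_SLICE) -> list:
    """Indices of the prefill slices touched by tokens first_pos..last_pos (inclusive)."""
    return list(range(first_pos // slice_len, last_pos // slice_len + 1))


def prefill_chunks(num_tokens: int, slice_len: int = PREFILL_SLICE) -> list:
    """Split a prompt of num_tokens into slice-sized chunks; the last may be short."""
    full, rest = divmod(num_tokens, slice_len)
    return [slice_len] * full + ([rest] if rest else [])


print(slices_spanned(5, 284))          # the t280 image block from the Vision example
print(len(slices_spanned(5, 5 + 749)))  # a 750-token block starting at position 5
print(prefill_chunks(184))             # the 184-token prompt from the axllm session
```

A 750-token audio block cannot fit in fewer than six 128-token slices, so the warning is unavoidable for the 30s profile.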

Gradio Demo

python3 gradio_demo.py \
  --host 0.0.0.0 \
  --port 7860

After the server starts, open http://<board-ip>:7860 in your browser.

Packaged Python Runtime Paths

The Python demo scripts use the following default paths:

  • Tokenizer and config: ./gemma_4_e2b_it_tokenizer
  • Text LLM runtime root: ./
  • Vision axmodels: ./
  • Audio axmodels: ./

If you move any of these directories, pass the new values with --hf_model, --axmodel_path, --vit_model_path, and --audio_model_path.

For the Python demo flow, --axmodel_path should point to the directory that contains the text runtime files such as gemma4_text_p128_l*.axmodel, gemma4_text_post.axmodel, model.embed_tokens.weight.bfloat16.bin, and the model.*per_layer*.npy files.

These path arguments apply to the Python demo flow only. The axllm flow reads the same root-level runtime files packaged in this repository.
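When the files have been moved, it can be convenient to assemble the command line programmatically. The sketch below builds an argument list from the four override flags named above. `build_infer_command` is a hypothetical helper, and any paths passed to it are placeholders.

```python
import shlex


def build_infer_command(prompt, hf_model=None, axmodel_path=None,
                        vit_model_path=None, audio_model_path=None) -> list:
    """Build an infer_axmodel.py command, adding only the overrides that are set."""
    cmd = ["python3", "infer_axmodel.py", "--prompt", prompt]
    overrides = (
        ("--hf_model", hf_model),
        ("--axmodel_path", axmodel_path),
        ("--vit_model_path", vit_model_path),
        ("--audio_model_path", audio_model_path),
    )
    for flag, value in overrides:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Placeholder path for illustration only.
print(shlex.join(build_infer_command("Hello", axmodel_path="/opt/models/gemma4")))
```

The resulting list can be passed to `subprocess.run` directly; flags left unset fall back to the default paths listed above.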

The current text runtime package contains 35 decoder layers and kv_cache_len=2047. The packaged runtime already includes the embedding and per-layer weight files needed by Gemma 4, so the original model.safetensors files are not required for board-side inference.

Conversion References

If you need the original model files or want to rebuild the deployment artifacts, start with:

Discussion

  • GitHub Issues
  • QQ group: 139953715