Update README.md
**This model was converted from [MiniCPM-S-1B-sft](https://huggingface.co/openbmb/MiniCPM-S-1B-sft/) into the LLaMA format to make it more convenient to use.**
### Chat Template
To get well-formed responses from the model, it is recommended to use the standard chat prompt:
```
<用户>{prompt}<AI>
```
where `prompt` is the query text, while `<用户>` and `<AI>` are prompt tokens.
Also, make sure to include **a BOS token `<s>` at the beginning of every input**, or the model may occasionally behave improperly.
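
For reference, here is a minimal usage sketch, assuming the converted checkpoint loads with the standard `transformers` auto classes (which the LLaMA-format conversion suggests); the model path below is a placeholder, not the actual repo id.

```python
# Minimal usage sketch. Assumptions: the checkpoint loads via the standard
# transformers auto classes, and MODEL_PATH is a placeholder for the real repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/MiniCPM-S-1B-sft-llama-format"  # hypothetical path

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

query = "Why is the sky blue?"
prompt = f"<用户>{query}<AI>"  # standard chat template from this README

# LLaMA-style tokenizers usually prepend the BOS token <s> automatically;
# verify it, since the model can misbehave without it.
inputs = tokenizer(prompt, return_tensors="pt")
assert inputs["input_ids"][0, 0] == tokenizer.bos_token_id

outputs = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, dropping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```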
### Introduction
Exploiting activation sparsity, i.e., the existence of a considerable share of weakly contributing elements in activation outputs, is a promising approach to accelerating inference in large language models (LLMs) ([Liu et al., 2023](https://proceedings.mlr.press/v202/liu23am/liu23am.pdf); [Song et al., 2023](https://arxiv.org/pdf/2312.12456.pdf)). Concretely, acceleration methods based on activation sparsity achieve higher inference speed through wiser resource allocation and computation policies that avoid wasting resources on these weakly contributing elements.
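
As a rough illustration of the idea (not the cited papers' implementation), the PyTorch snippet below shows how a ReLU-style FFN produces exactly-zero hidden activations, so the corresponding columns of the down-projection can be skipped without changing the output; the dimensions and random weights are made up for the example.

```python
# Illustrative sketch of activation sparsity in a ReLU-style FFN: many hidden
# activations are exactly zero (about half here with random weights; far more
# in trained ReLU LLMs), so their down-projection columns contribute nothing.
import torch

d_model, d_ff = 1024, 4096
x = torch.randn(d_model)
w_up = torch.randn(d_ff, d_model)
w_down = torch.randn(d_model, d_ff)

h = torch.relu(w_up @ x)          # sparse hidden activations
active = h != 0
print(f"sparsity: {(1 - active.float().mean()).item():.2%}")

# Dense computation vs. computing only the columns for active elements:
y_dense = w_down @ h
y_sparse = w_down[:, active] @ h[active]  # skips zero activations entirely
print(torch.allclose(y_dense, y_sparse, atol=1e-5))  # True: same output
```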