Commit 51c8f82 (verified) by munish0838 and nielsr (HF Staff) · 1 parent: 62d01a8

Improve model card: Add GGUF usage, paper link, and correct metadata (#1)


- Improve model card: Add GGUF usage, paper link, and correct metadata (56bb08bc63a21db12db076d595dedadb87af3fe9)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1): README.md (+74, -33)
README.md CHANGED
@@ -1,24 +1,26 @@
  ---
- language:
- - fi
- license: apache-2.0
- tags:
- - finnish
- - llama
  datasets:
  - Finnish-NLP/CulturaX_fi_cleaned
  - Finnish-NLP/HPLT_1.2_fi_cleaned
  - Finnish-NLP/wikipedia_20231101_fi_cleaned
  - Finnish-NLP/Reddit_fi_2006_2022
  - intfloat/multilingual_cc_news
- inference: false
- base_model: Finnish-NLP/Ahma-3B
  pipeline_tag: text-generation
-
  ---

  # QuantFactory/Ahma-3B-GGUF
- This is quantized version of [Finnish-NLP/Ahma-3B](https://huggingface.co/Finnish-NLP/Ahma-3B) created using llama.cpp

  # Ahma-3B for Finnish

@@ -39,9 +41,46 @@ There are two different sized Ahma models, all pretrained from scratch for 139B

  This model was pretrained only in a self-supervised way, without any supervised training. You can use this model for text generation or fine-tune it for a downstream task. This model followed a 2-stage pretraining approach where single-turn instruction-following examples were mixed in with the other training data in the second stage (explained more later in this readme). Thanks to this approach, this pretrained model is already capable of instruction following, but you might get even better results if you specifically fine-tune it for instruction following or other use cases. For instruction-following fine-tuning, you should use the same prompt format showcased below.

- ### How to use

- If you want to use this model for instruction-following, you need to use the same prompt format we used in the second stage of the pretraining (basically the same format what Meta used in their Llama2 models). **Note: do not use "LlamaTokenizer" from transformers library but always use the AutoTokenizer instead, or use the plain sentencepiece tokenizer.** Here is an example using the instruction-following prompt format, with some generation arguments you can modify for your use:

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -50,7 +89,11 @@ system_prompt = "Olet tekoälyavustaja. Vastaat aina mahdollisimman avuliaasti.


  def format_prompt(prompt: str) -> str:
-     prompt = f" [INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{prompt.strip()} [/INST] "
      return prompt


@@ -128,27 +171,27 @@ The final training dataset had 23 billion words (calculated with regex "\w+") an
  The first stage:
  |Dataset | Words | Ratio |
  |:-----------------------------|:------------|:-------------|
- |CulturaX | 12.820B | 59.88\% |
- |HPLT v1.2 | 5.034B | 23.51\% |
- |Suomi24 | 3.018B | 14.09\% |
- |Reddit | 0.141B | 0.66\% |
- |CC-News | 0.311B | 1.45\% |
- |FI news corpus | 0.004B | 0.02\% |
- |Project Lönnrot | 0.083B | 0.39\% |
- |**TOTAL** | **21.410B** | **100.0\%** |


  The second stage:
  |Dataset | Words | Ratio |
  |:--------------------------------------------------------------|:------------|:------------|
- |CulturaX (cleaner sample using KenLM perplexity score) | 2.252B | 55.48\% |
- |Wikipedia | 0.095B | 2.34\% |
- |STT | 0.253B | 6.23\% |
- |Yle | 0.212B | 5.22\% |
- |Finnish parliament speeches | 0.021B | 0.52\% |
- |Finnish higher education public theses | 0.855B | 21.07\% |
- |Finnish instruction-following datasets (note: 2X upsampled) | 0.371B | 9.14\% |
- |**TOTAL** | **4.059B** | **100.0\%** |

  ## Training procedure

@@ -163,7 +206,7 @@ The model was trained on TPUv4-32 VM, sponsored by the [Google TPU Research Clou

  The 2-stage pretraining approach was inspired by [MiniCPM](https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20) findings. For the first stage (85% of the entire training), we used noisier web-scraped datasets. For the second stage (15% of the entire training), we primarily used cleaner datasets and instruction-following datasets shuffled together, like in MiniCPM. The learning rate schedule for the 2-stage pretraining was Warmup-Stable-Decay (WSD). During the first stage, the learning rate schedule had a linear warmup for about 8 billion tokens to a peak learning rate of 1e-4 (note: with the Lion optimizer, the learning rate had to be about 10 times smaller than with the commonly used AdamW), followed by a stable phase where the rate of 1e-4 was kept constant. During the second stage, the learning rate schedule had a linear decay from 1e-4 to 1e-5 for the first 13 billion tokens, followed by a stable phase for the remaining tokens.

- In the first stage, the model was trained for 118 billion tokens, which is about three epochs of the first-stage training data, inspired by the findings of [this paper](https://arxiv.org/abs/2305.16264). In the second stage, the model was trained for 21 billion tokens, which is about three epochs of the second-stage training data.

  Thanks to the WSD learning rate schedule, you can more easily experiment with different first-stage model checkpoints. For example, you could apply the second-stage training on an earlier checkpoint or continue pretraining further before the second stage. Model checkpoints were pushed to this repository every 100,000 training steps (approximately 13 billion tokens).

@@ -250,6 +293,4 @@ As we can see, Ahma 3B model struggles with multi-turn examples, as expected, si
  ## Acknowledgements

  This project would not have been possible without compute generously provided by Google through the
- [TPU Research Cloud](https://sites.research.google/trc/).
-
-
  ---
+ base_model: Finnish-NLP/Ahma-3B
  datasets:
  - Finnish-NLP/CulturaX_fi_cleaned
  - Finnish-NLP/HPLT_1.2_fi_cleaned
  - Finnish-NLP/wikipedia_20231101_fi_cleaned
  - Finnish-NLP/Reddit_fi_2006_2022
  - intfloat/multilingual_cc_news
+ language:
+ - fi
+ license: apache-2.0
  pipeline_tag: text-generation
+ tags:
+ - finnish
+ - llama
+ - gguf
+ library_name: llama.cpp
  ---

  # QuantFactory/Ahma-3B-GGUF
+ This is a GGUF quantized version of [Finnish-NLP/Ahma-3B](https://huggingface.co/Finnish-NLP/Ahma-3B), created using [llama.cpp](https://github.com/ggerganov/llama.cpp).
+
+ The training of the underlying `Finnish-NLP/Ahma-3B` model was inspired by the findings in the paper [Scaling Data-Constrained Language Models](https://huggingface.co/papers/2305.16264).

  # Ahma-3B for Finnish


  This model was pretrained only in a self-supervised way, without any supervised training. You can use this model for text generation or fine-tune it for a downstream task. This model followed a 2-stage pretraining approach where single-turn instruction-following examples were mixed in with the other training data in the second stage (explained more later in this readme). Thanks to this approach, this pretrained model is already capable of instruction following, but you might get even better results if you specifically fine-tune it for instruction following or other use cases. For instruction-following fine-tuning, you should use the same prompt format showcased below.

+ ### GGUF Usage (via llama.cpp or llama-cpp-python)
+
+ To use this GGUF file, you can use `llama.cpp` or its Python bindings, `llama-cpp-python`. First, ensure you have `llama-cpp-python` installed:
+ ```bash
+ pip install llama-cpp-python
+ ```
+
+ Then, you can load the GGUF model and generate text:
+
+ ```python
+ from llama_cpp import Llama
+
+ # Make sure the GGUF file (e.g., ahma-3b-q4_0.gguf) is downloaded into the same directory, or provide the full path.
+ # You can find the GGUF files in the "Files and versions" tab of this model repo.
+ llm = Llama(model_path="./ahma-3b-q4_0.gguf", n_ctx=2048)
+
+ # The original system prompt for Ahma models (Finnish: be a helpful assistant, avoid harmful or unethical content, and do not make up answers)
+ system_prompt = "Olet tekoälyavustaja. Vastaat aina mahdollisimman avuliaasti. Vastauksesi eivät saa sisältää mitään haitallista, epäeettistä, rasistista, seksististä, vaarallista tai laitonta sisältöä. Jos kysymyksessä ei ole mitään järkeä tai se ei ole asiasisällöltään johdonmukainen, selitä miksi sen sijaan, että vastaisit jotain väärin. Jos et tiedä vastausta kysymykseen, älä kerro väärää tietoa."
+
+ user_prompt = "Mitä hyötyjä pienet avoimen lähdekoodin kielimallit tuovat?"  # "What benefits do small open-source language models bring?"
+
+ # Format the prompt using the Ahma model's expected instruction format
+ prompt = f" [INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{user_prompt.strip()} [/INST] "
+
+ output = llm(
+     prompt,
+     max_tokens=512,
+     stop=["</s>", "[/INST]"],  # "[/INST]" stops generation if the model starts repeating the prompt format
+     echo=True,  # echo the prompt back to see the full generation
+ )
+ print(output["choices"][0]["text"])
+ ```
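
If you prefer to fetch the GGUF file programmatically rather than downloading it by hand, here is a minimal sketch using `huggingface_hub`; the filename below is an assumption, so pick an actual file from this repository's "Files and versions" tab:

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download one of the GGUF files from this repo into the local Hugging Face cache.
# The filename is illustrative only; use one that actually exists in the repo.
model_path = hf_hub_download(
    repo_id="QuantFactory/Ahma-3B-GGUF",
    filename="Ahma-3B.Q4_K_M.gguf",  # assumed filename, verify in the repo
)
llm = Llama(model_path=model_path, n_ctx=2048)
```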
+
+ ### Original Model Usage (via Transformers library)
+
+ If you want to use the original `Finnish-NLP/Ahma-3B` model (not this GGUF quantized version) for instruction-following, you need to use the same prompt format we used in the second stage of the pretraining (basically the same format that Meta used in their Llama2 models). **Note: do not use "LlamaTokenizer" from the transformers library; always use AutoTokenizer instead, or use the plain sentencepiece tokenizer.** Here is an example using the instruction-following prompt format, with some generation arguments you can modify for your use:

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM



  def format_prompt(prompt: str) -> str:
+     prompt = f" [INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{prompt.strip()} [/INST] "
      return prompt

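# Illustrative continuation (not from the original model card): one way the pieces
# above can be used together. Load the tokenizer and model with AutoTokenizer and
# AutoModelForCausalLM, format the prompt, and generate; the generation arguments
# below are placeholders to tune for your use case.
tokenizer = AutoTokenizer.from_pretrained("Finnish-NLP/Ahma-3B")
model = AutoModelForCausalLM.from_pretrained("Finnish-NLP/Ahma-3B")

inputs = tokenizer(
    format_prompt("Mitä hyötyjä pienet avoimen lähdekoodin kielimallit tuovat?"),
    return_tensors="pt",
)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.6,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))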
 
 
  The first stage:
  |Dataset | Words | Ratio |
  |:-----------------------------|:------------|:-------------|
+ |CulturaX | 12.820B | 59.88% |
+ |HPLT v1.2 | 5.034B | 23.51% |
+ |Suomi24 | 3.018B | 14.09% |
+ |Reddit | 0.141B | 0.66% |
+ |CC-News | 0.311B | 1.45% |
+ |FI news corpus | 0.004B | 0.02% |
+ |Project Lönnrot | 0.083B | 0.39% |
+ |**TOTAL** | **21.410B** | **100.0%** |


  The second stage:
  |Dataset | Words | Ratio |
  |:--------------------------------------------------------------|:------------|:------------|
+ |CulturaX (cleaner sample using KenLM perplexity score) | 2.252B | 55.48% |
+ |Wikipedia | 0.095B | 2.34% |
+ |STT | 0.253B | 6.23% |
+ |Yle | 0.212B | 5.22% |
+ |Finnish parliament speeches | 0.021B | 0.52% |
+ |Finnish higher education public theses | 0.855B | 21.07% |
+ |Finnish instruction-following datasets (note: 2X upsampled) | 0.371B | 9.14% |
+ |**TOTAL** | **4.059B** | **100.0%** |
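
The word counts in these tables use the simple regex mentioned earlier in this card ("\w+"); a minimal sketch of that counting, for reference:

```python
import re

def count_words(text: str) -> int:
    # Word count as used for the dataset tables: number of regex "\w+" matches.
    return len(re.findall(r"\w+", text))

print(count_words("Tämä on esimerkkilause."))  # 3
```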
 
  ## Training procedure


  The 2-stage pretraining approach was inspired by [MiniCPM](https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20) findings. For the first stage (85% of the entire training), we used noisier web-scraped datasets. For the second stage (15% of the entire training), we primarily used cleaner datasets and instruction-following datasets shuffled together, like in MiniCPM. The learning rate schedule for the 2-stage pretraining was Warmup-Stable-Decay (WSD). During the first stage, the learning rate schedule had a linear warmup for about 8 billion tokens to a peak learning rate of 1e-4 (note: with the Lion optimizer, the learning rate had to be about 10 times smaller than with the commonly used AdamW), followed by a stable phase where the rate of 1e-4 was kept constant. During the second stage, the learning rate schedule had a linear decay from 1e-4 to 1e-5 for the first 13 billion tokens, followed by a stable phase for the remaining tokens.

+ In the first stage, the model was trained for 118 billion tokens, which is about three epochs of the first-stage training data, inspired by the findings of [this paper](https://huggingface.co/papers/2305.16264). In the second stage, the model was trained for 21 billion tokens, which is about three epochs of the second-stage training data.
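
As a rough illustration only (not the actual training code), the Warmup-Stable-Decay schedule described above can be sketched as a function of tokens seen:

```python
def wsd_learning_rate(billion_tokens_seen: float) -> float:
    """Sketch of the WSD schedule described above (rates are for the Lion optimizer)."""
    peak_lr, final_lr = 1e-4, 1e-5
    warmup_end = 8.0    # ~8B tokens of linear warmup
    stage1_end = 118.0  # first stage: 118B tokens in total
    decay_end = 131.0   # second stage: linear decay over its first 13B tokens

    t = billion_tokens_seen
    if t < warmup_end:                 # linear warmup to the peak rate
        return peak_lr * t / warmup_end
    if t < stage1_end:                 # stable phase at 1e-4
        return peak_lr
    if t < decay_end:                  # linear decay from 1e-4 to 1e-5
        frac = (t - stage1_end) / (decay_end - stage1_end)
        return peak_lr + frac * (final_lr - peak_lr)
    return final_lr                    # stable tail for the remaining second-stage tokens
```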
 
  Thanks to the WSD learning rate schedule, you can more easily experiment with different first-stage model checkpoints. For example, you could apply the second-stage training on an earlier checkpoint or continue pretraining further before the second stage. Model checkpoints were pushed to this repository every 100,000 training steps (approximately 13 billion tokens).

  ## Acknowledgements

  This project would not have been possible without compute generously provided by Google through the
+ [TPU Research Cloud](https://sites.research.google/trc/).