Improve model card: Add GGUF usage, paper link, and correct metadata (#1)
Commit 56bb08bc63a21db12db076d595dedadb87af3fe9
Co-authored-by: Niels Rogge <[email protected]>
README.md
CHANGED
@@ -1,24 +1,26 @@
 ---
-language:
-- fi
-license: apache-2.0
-tags:
-- finnish
-- llama
 datasets:
 - Finnish-NLP/CulturaX_fi_cleaned
 - Finnish-NLP/HPLT_1.2_fi_cleaned
 - Finnish-NLP/wikipedia_20231101_fi_cleaned
 - Finnish-NLP/Reddit_fi_2006_2022
 - intfloat/multilingual_cc_news
-…
-…
 pipeline_tag: text-generation
-…
 ---
 
 # QuantFactory/Ahma-3B-GGUF
-This is quantized version of [Finnish-NLP/Ahma-3B](https://huggingface.co/Finnish-NLP/Ahma-3B) created using llama.cpp
 
 # Ahma-3B for Finnish
 
@@ -39,9 +41,46 @@ There are two different sized Ahma models, all pretrained from scratch for 139B
 
 This model was pretrained only in a self-supervised way, without any supervised training. You can use this model for text generation or fine-tune it for a downstream task. This model followed a 2-stage pretraining approach where single-turn instruction-following examples were mixed in with the other training data in the second stage (explained more later in this readme). Thanks to this approach, this pretrained model is already capable of instruction following, but you might get even better results if you specifically fine-tune it for instruction following or other use cases. For instruction-following fine-tuning, you should use the same prompt format showcased below.
 
-### …
 
-…
 
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -50,7 +89,11 @@ system_prompt = "Olet tekoälyavustaja. Vastaat aina mahdollisimman avuliaasti.
 
 
 def format_prompt(prompt: str) -> str:
-    prompt = f" [INST] <<SYS…
     return prompt
 
 
@@ -128,27 +171,27 @@ The final training dataset had 23 billion words (calculated with regex "\w+") an
 The first stage:
 |Dataset | Words | Ratio |
 |:-----------------------------|:------------|:-------------|
-|CulturaX | 12.820B | 59.88
-|HPLT v1.2 | 5.034B | 23.51
-|Suomi24 | 3.018B | 14.09
-|Reddit | 0.141B | 0.66
-|CC-News | 0.311B | 1.45
-|FI news corpus | 0.004B | 0.02
-|Project Lönnrot | 0.083B | 0.39
-|**TOTAL** | **21.410B** | **100.0
 
 
 The second stage:
 |Dataset | Words | Ratio |
 |:--------------------------------------------------------------|:------------|:------------|
-|CulturaX (cleaner sample using KenLM perplexity score) | 2.252B | 55.48
-|Wikipedia | 0.095B | 2.34
-|STT | 0.253B | 6.23
-|Yle | 0.212B | 5.22
-|Finnish parliament speeches | 0.021B | 0.52
-|Finnish higher education public theses | 0.855B | 21.07
-|Finnish instruction-following datasets (note: 2X upsampled) | 0.371B | 9.14
-|**TOTAL** | **4.059B** | **100.0
 
 ## Training procedure
 
@@ -163,7 +206,7 @@ The model was trained on TPUv4-32 VM, sponsored by the [Google TPU Research Clou
 
 The 2-stage pretraining approach was inspired by [MiniCPM](https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20) findings. For the first stage (85% of the entire training), we used noisier web-scraped datasets. For the second stage (15% of the entire training), we primarily used cleaner datasets and instruction-following datasets shuffled together, like in MiniCPM. The learning rate schedule for the 2-stage pretraining was Warmup-Stable-Decay (WSD). During the first stage, the learning rate schedule had a linear warmup for about 8 billion tokens to a peak learning rate of 1e-4 (note: with the Lion optimizer, the learning rate had to be about 10 times smaller than with the commonly used AdamW), followed by a stable phase where the rate of 1e-4 was kept constant. During the second stage, the learning rate schedule had a linear decay from 1e-4 to 1e-5 for the first 13 billion tokens, followed by a stable phase for the remaining tokens.
 
-In the first stage, the model was trained for 118 billion tokens, which is about three epochs of the first-stage training data, inspired by the findings of [this paper](https://…
 
 Thanks to the WSD learning rate schedule, you can more easily experiment with different first-stage model checkpoints. For example, you could apply the second-stage training on an earlier checkpoint or continue pretraining further before the second stage. Model checkpoints were pushed to this repository every 100,000 training steps (approximately 13 billion tokens).
 
@@ -250,6 +293,4 @@ As we can see, Ahma 3B model struggles with multi-turn examples, as expected, si
 ## Acknowledgements
 
 This project would not have been possible without compute generously provided by Google through the
-[TPU Research Cloud](https://sites.research.google/trc/).
-
-
 ---
+base_model: Finnish-NLP/Ahma-3B
 datasets:
 - Finnish-NLP/CulturaX_fi_cleaned
 - Finnish-NLP/HPLT_1.2_fi_cleaned
 - Finnish-NLP/wikipedia_20231101_fi_cleaned
 - Finnish-NLP/Reddit_fi_2006_2022
 - intfloat/multilingual_cc_news
+language:
+- fi
+license: apache-2.0
 pipeline_tag: text-generation
+tags:
+- finnish
+- llama
+- gguf
+library_name: llama.cpp
 ---
 
 # QuantFactory/Ahma-3B-GGUF
+This is a GGUF quantized version of [Finnish-NLP/Ahma-3B](https://huggingface.co/Finnish-NLP/Ahma-3B), created using [llama.cpp](https://github.com/ggerganov/llama.cpp).
+
+The training of the underlying `Finnish-NLP/Ahma-3B` model was inspired by the findings in the paper [Scaling Data-Constrained Language Models](https://huggingface.co/papers/2305.16264).
 
 # Ahma-3B for Finnish
 
 
 This model was pretrained only in a self-supervised way, without any supervised training. You can use this model for text generation or fine-tune it for a downstream task. This model followed a 2-stage pretraining approach where single-turn instruction-following examples were mixed in with the other training data in the second stage (explained more later in this readme). Thanks to this approach, this pretrained model is already capable of instruction following, but you might get even better results if you specifically fine-tune it for instruction following or other use cases. For instruction-following fine-tuning, you should use the same prompt format showcased below.
 
+### GGUF Usage (via llama.cpp or llama-cpp-python)
+
+To use this GGUF file, you can utilize `llama.cpp` or its Python bindings `llama-cpp-python`. First, ensure you have `llama-cpp-python` installed:
+```bash
+pip install llama-cpp-python
+```
 
+Then, you can load the GGUF model and generate text:
+
+```python
+from llama_cpp import Llama
+
+# Make sure the GGUF file (e.g., ahma-3b-q4_0.gguf) is downloaded in the same directory or provide the full path.
+# You can find the GGUF files in the "Files and versions" tab of this model repo.
+llm = Llama(model_path="./ahma-3b-q4_0.gguf", n_ctx=2048)
+
+# The original prompt format for Ahma models
+system_prompt = "Olet tekoälyavustaja. Vastaat aina mahdollisimman avuliaasti. Vastauksesi eivät saa sisältää mitään haitallista, epäeettistä, rasistista, seksististä, vaarallista tai laitonta sisältöä. Jos kysymyksessä ei ole mitään järkeä tai se ei ole asiasisällöltään johdonmukainen, selitä miksi sen sijaan, että vastaisit jotain väärin. Jos et tiedä vastausta kysymykseen, älä kerro väärää tietoa."
+
+user_prompt = "Mitä hyötyjä pienet avoimen lähdekoodin kielimallit tuovat?"
+
+# Format the prompt using the Ahma model's expected instruction format
+prompt = f""" [INST] <<SYS>>
+{system_prompt.strip()}
+<</SYS>>
+
+{user_prompt.strip()} [/INST] """
+
+output = llm(
+    prompt,
+    max_tokens=512,
+    stop=["</s>", "[/INST]"], # Add [/INST] to stop generation after the model's response if it repeats
+    echo=True, # Echo the prompt back to see the full generation
+)
+print(output["choices"][0]["text"])
+```
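
If you would rather fetch the GGUF file from the Hub programmatically than download it by hand, here is a minimal sketch using `huggingface_hub` (the filename `ahma-3b-q4_0.gguf` is the same assumed example name as above; check the "Files and versions" tab for the filenames that actually exist in this repo):

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download one quantization from this repo and point llama-cpp-python at it.
# "ahma-3b-q4_0.gguf" is an assumed example filename, not a guaranteed one.
model_path = hf_hub_download(
    repo_id="QuantFactory/Ahma-3B-GGUF",
    filename="ahma-3b-q4_0.gguf",
)
llm = Llama(model_path=model_path, n_ctx=2048)
```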
+
+### Original Model Usage (via Transformers library)
+
+If you want to use the original `Finnish-NLP/Ahma-3B` model (not this GGUF quantized version) for instruction following, you need to use the same prompt format we used in the second stage of the pretraining (basically the same format that Meta used in their Llama 2 models). **Note: do not use "LlamaTokenizer" from the transformers library; always use AutoTokenizer instead, or use the plain sentencepiece tokenizer.** Here is an example using the instruction-following prompt format, with some generation arguments you can modify for your use:
 
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
 
 def format_prompt(prompt: str) -> str:
+    prompt = f""" [INST] <<SYS>>
+{system_prompt.strip()}
+<</SYS>>
+
+{prompt.strip()} [/INST] """
     return prompt
 
 
 The first stage:
 |Dataset | Words | Ratio |
 |:-----------------------------|:------------|:-------------|
+|CulturaX | 12.820B | 59.88% |
+|HPLT v1.2 | 5.034B | 23.51% |
+|Suomi24 | 3.018B | 14.09% |
+|Reddit | 0.141B | 0.66% |
+|CC-News | 0.311B | 1.45% |
+|FI news corpus | 0.004B | 0.02% |
+|Project Lönnrot | 0.083B | 0.39% |
+|**TOTAL** | **21.410B** | **100.0%** |
 
 
 The second stage:
 |Dataset | Words | Ratio |
 |:--------------------------------------------------------------|:------------|:------------|
+|CulturaX (cleaner sample using KenLM perplexity score) | 2.252B | 55.48% |
+|Wikipedia | 0.095B | 2.34% |
+|STT | 0.253B | 6.23% |
+|Yle | 0.212B | 5.22% |
+|Finnish parliament speeches | 0.021B | 0.52% |
+|Finnish higher education public theses | 0.855B | 21.07% |
+|Finnish instruction-following datasets (note: 2X upsampled) | 0.371B | 9.14% |
+|**TOTAL** | **4.059B** | **100.0%** |
 
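
The Ratio column is each dataset's share of the stage total, so the percentages can be re-derived directly from the word counts; a quick sketch using the second-stage numbers above:

```python
# Second-stage word counts in billions of words, taken from the table above.
stage2_words = {
    "CulturaX (cleaner sample)": 2.252,
    "Wikipedia": 0.095,
    "STT": 0.253,
    "Yle": 0.212,
    "Finnish parliament speeches": 0.021,
    "Finnish higher education public theses": 0.855,
    "Finnish instruction-following datasets (2x upsampled)": 0.371,
}

total = sum(stage2_words.values())  # ~4.059B words
for name, words in stage2_words.items():
    print(f"{name}: {100 * words / total:.2f}%")  # e.g. instruction-following ≈ 9.14%
```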
 ## Training procedure
 
 
 The 2-stage pretraining approach was inspired by [MiniCPM](https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20) findings. For the first stage (85% of the entire training), we used noisier web-scraped datasets. For the second stage (15% of the entire training), we primarily used cleaner datasets and instruction-following datasets shuffled together, like in MiniCPM. The learning rate schedule for the 2-stage pretraining was Warmup-Stable-Decay (WSD). During the first stage, the learning rate schedule had a linear warmup for about 8 billion tokens to a peak learning rate of 1e-4 (note: with the Lion optimizer, the learning rate had to be about 10 times smaller than with the commonly used AdamW), followed by a stable phase where the rate of 1e-4 was kept constant. During the second stage, the learning rate schedule had a linear decay from 1e-4 to 1e-5 for the first 13 billion tokens, followed by a stable phase for the remaining tokens.
 
+In the first stage, the model was trained for 118 billion tokens, which is about three epochs of the first-stage training data, inspired by the findings of [this paper](https://huggingface.co/papers/2305.16264). In the second stage, the model was trained for 21 billion tokens, which is about three epochs of the second-stage training data.
 
 Thanks to the WSD learning rate schedule, you can more easily experiment with different first-stage model checkpoints. For example, you could apply the second-stage training on an earlier checkpoint or continue pretraining further before the second stage. Model checkpoints were pushed to this repository every 100,000 training steps (approximately 13 billion tokens).
 
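
The Warmup-Stable-Decay curve described above is easy to sketch; here is a minimal illustration using the token counts quoted in this section (the function name and the 118-billion-token boundary for the start of the second stage come from the surrounding text, and this is an illustration rather than the exact training code):

```python
def wsd_learning_rate(tokens_seen: float, stage2_start: float = 118e9) -> float:
    """Approximate Warmup-Stable-Decay schedule, parameterized in tokens."""
    peak_lr, final_lr = 1e-4, 1e-5
    warmup_tokens, decay_tokens = 8e9, 13e9

    if tokens_seen < warmup_tokens:
        # Linear warmup to the peak rate over roughly the first 8B tokens.
        return peak_lr * tokens_seen / warmup_tokens
    if tokens_seen < stage2_start:
        # Stable phase at 1e-4 for the rest of the first stage.
        return peak_lr
    # Second stage: linear decay from 1e-4 to 1e-5 over the first 13B tokens,
    # then hold the final rate for the remaining tokens.
    decay_progress = min((tokens_seen - stage2_start) / decay_tokens, 1.0)
    return peak_lr + (final_lr - peak_lr) * decay_progress
```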
 ## Acknowledgements
 
 This project would not have been possible without compute generously provided by Google through the
+[TPU Research Cloud](https://sites.research.google/trc/).