Improve model card: add `library_name` and primary paper link
This PR improves the model card by:
- Adding `library_name: transformers` to the metadata, which enables the "How to use" widget on the model page and ensures better integration with the Hugging Face ecosystem.
- Adding a prominent link to the paper [Scaling Data-Constrained Language Models](https://huggingface.co/papers/2305.16264), where this model was explicitly presented. This provides crucial context for users visiting the model page.
README.md CHANGED

```diff
@@ -1,23 +1,25 @@
 ---
-language:
-- fi
-license: apache-2.0
-tags:
-- finnish
-- llama
 datasets:
 - Finnish-NLP/CulturaX_fi_cleaned
 - Finnish-NLP/HPLT_1.2_fi_cleaned
 - Finnish-NLP/wikipedia_20231101_fi_cleaned
 - Finnish-NLP/Reddit_fi_2006_2022
 - intfloat/multilingual_cc_news
-
+language:
+- fi
+license: apache-2.0
 pipeline_tag: text-generation
-
+tags:
+- finnish
+- llama
+inference: false
+library_name: transformers
 ---
 
 # Ahma-3B for Finnish
 
+This model was presented in the paper [Scaling Data-Constrained Language Models](https://huggingface.co/papers/2305.16264).
+
 Ahma-3B is 3B parameter decoder-only transformer model based on Meta's Llama (v1) architecture pretrained from scratch on Finnish language. Original Llama model architecture was introduced in
 [this paper](https://arxiv.org/abs/2302.13971)
 and first released at [this page](https://github.com/facebookresearch/llama).
```
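Declaring `library_name: transformers` lets the Hub surface the standard Transformers loading snippet for this repository. As a minimal, illustrative sketch (the repository id `Finnish-NLP/Ahma-3B` is assumed here for illustration; it is not stated in this diff):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id assumed for illustration; use the actual model repo id.
model_id = "Finnish-NLP/Ahma-3B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```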
```diff
@@ -62,7 +64,11 @@ system_prompt = "Olet tekoälyavustaja. Vastaat aina mahdollisimman avuliaasti.
 
 
 def format_prompt(prompt: str) -> str:
-    prompt = f" [INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{prompt.strip()} [/INST] "
+    prompt = f" [INST] <<SYS>>
+{system_prompt.strip()}
+<</SYS>>
+
+{prompt.strip()} [/INST] "
     return prompt
 
 
```
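For context, the function touched by this hunk builds the Llama-style chat prompt used in the model card's usage example. Below is a minimal end-to-end sketch of how that template is typically applied, using the single-line (`\n`-escaped) form of the f-string; the repository id, the (truncated) system prompt text, the example Finnish prompt, and the generation settings are illustrative assumptions, not part of this PR:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Finnish-NLP/Ahma-3B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# System prompt as in the card; its full text is truncated in the hunk header above.
system_prompt = "Olet tekoälyavustaja. Vastaat aina mahdollisimman avuliaasti."

def format_prompt(prompt: str) -> str:
    # Llama-style [INST] / <<SYS>> template from the model card.
    prompt = f" [INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{prompt.strip()} [/INST] "
    return prompt

# Example Finnish prompt ("Tell me briefly about Finland").
inputs = tokenizer(format_prompt("Kerro lyhyesti Suomesta."), return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```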
```diff
@@ -144,27 +150,27 @@ The final training dataset had 23 billion words (calculated with regex "\w+") an
 The first stage:
 |Dataset | Words | Ratio |
 |:-----------------------------|:------------|:-------------|
-|CulturaX | 12.820B | 59.88 |
-|HPLT v1.2 | 5.034B | 23.51 |
-|Suomi24 | 3.018B | 14.09 |
-|Reddit | 0.141B | 0.66 |
-|CC-News | 0.311B | 1.45 |
-|FI news corpus | 0.004B | 0.02 |
-|Project Lönnrot | 0.083B | 0.39 |
-|**TOTAL** | **21.410B** | **100.0** |
+|CulturaX | 12.820B | 59.88% |
+|HPLT v1.2 | 5.034B | 23.51% |
+|Suomi24 | 3.018B | 14.09% |
+|Reddit | 0.141B | 0.66% |
+|CC-News | 0.311B | 1.45% |
+|FI news corpus | 0.004B | 0.02% |
+|Project Lönnrot | 0.083B | 0.39% |
+|**TOTAL** | **21.410B** | **100.0%** |
 
 
 The second stage:
 |Dataset | Words | Ratio |
 |:--------------------------------------------------------------|:------------|:------------|
-|CulturaX (cleaner sample using KenLM perplexity score) | 2.252B | 55.48 |
-|Wikipedia | 0.095B | 2.34 |
-|STT | 0.253B | 6.23 |
-|Yle | 0.212B | 5.22 |
-|Finnish parliament speeches | 0.021B | 0.52 |
-|Finnish higher education public theses | 0.855B | 21.07 |
-|Finnish instruction-following datasets (note: 2X upsampled) | 0.371B | 9.14 |
-|**TOTAL** | **4.059B** | **100.0** |
+|CulturaX (cleaner sample using KenLM perplexity score) | 2.252B | 55.48% |
+|Wikipedia | 0.095B | 2.34% |
+|STT | 0.253B | 6.23% |
+|Yle | 0.212B | 5.22% |
+|Finnish parliament speeches | 0.021B | 0.52% |
+|Finnish higher education public theses | 0.855B | 21.07% |
+|Finnish instruction-following datasets (note: 2X upsampled) | 0.371B | 9.14% |
+|**TOTAL** | **4.059B** | **100.0%** |
 
 ## Training procedure
 
```
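As a side note on the tables in this hunk, the Ratio column is each dataset's share of the stage's total word count. A quick, illustrative sanity check of the first-stage figures, using the values from the table above:

```python
# Word counts in billions, taken from the first-stage table above.
words = {
    "CulturaX": 12.820,
    "HPLT v1.2": 5.034,
    "Suomi24": 3.018,
    "Reddit": 0.141,
    "CC-News": 0.311,
    "FI news corpus": 0.004,
    "Project Lönnrot": 0.083,
}

total = sum(words.values())  # ~21.411B; the table rounds the total to 21.410B
for name, count in words.items():
    # e.g. CulturaX: 12.820 / 21.411 is about 59.88%, matching the table
    # (other rows agree up to small rounding differences).
    print(f"{name}: {100 * count / total:.2f}%")
```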