nielsr (HF Staff) committed · Commit 7a79027 (verified) · 1 parent: 0b51e96

Improve model card: add `library_name` and primary paper link


This PR improves the model card by:
- Adding `library_name: transformers` to the metadata, which enables the "How to use" widget on the model page and ensures better integration with the Hugging Face ecosystem.
- Adding a prominent link to the paper [Scaling Data-Constrained Language Models](https://huggingface.co/papers/2305.16264), where this model was explicitly presented. This provides crucial context for users visiting the model page.
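
One way to confirm the metadata change locally is to list the top-level keys of the README front matter. The following is a minimal sketch, assuming the front matter is the `---`-delimited block at the top of the file. `front_matter_keys` is a hypothetical helper, not part of any Hugging Face library, and it does naive line matching rather than real YAML parsing.

```python
def front_matter_keys(readme_text: str) -> list[str]:
    """Return top-level keys of a '---'-delimited front-matter block.

    Hypothetical helper for illustration; naive line matching, not a YAML parser.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the front-matter block
        # Top-level keys start in column 0 and are not list items.
        if line and not line[0].isspace() and not line.startswith("-") and ":" in line:
            keys.append(line.split(":", 1)[0])
    return keys


# Front matter as it reads after this commit (dataset list shortened).
sample = """---
datasets:
- Finnish-NLP/CulturaX_fi_cleaned
language:
- fi
license: apache-2.0
pipeline_tag: text-generation
tags:
- finnish
- llama
inference: false
library_name: transformers
---

# Ahma-3B for Finnish
"""

keys = front_matter_keys(sample)
```

With the sample above, `"library_name" in keys` holds, and the same check fails on the pre-commit front matter.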

Files changed (1)
  1. README.md +31 -25
README.md CHANGED
@@ -1,23 +1,25 @@
  ---
- language:
- - fi
- license: apache-2.0
- tags:
- - finnish
- - llama
  datasets:
  - Finnish-NLP/CulturaX_fi_cleaned
  - Finnish-NLP/HPLT_1.2_fi_cleaned
  - Finnish-NLP/wikipedia_20231101_fi_cleaned
  - Finnish-NLP/Reddit_fi_2006_2022
  - intfloat/multilingual_cc_news
- inference: false
+ language:
+ - fi
+ license: apache-2.0
  pipeline_tag: text-generation
-
+ tags:
+ - finnish
+ - llama
+ inference: false
+ library_name: transformers
  ---

  # Ahma-3B for Finnish

+ This model was presented in the paper [Scaling Data-Constrained Language Models](https://huggingface.co/papers/2305.16264).
+
  Ahma-3B is 3B parameter decoder-only transformer model based on Meta's Llama (v1) architecture pretrained from scratch on Finnish language. Original Llama model architecture was introduced in
  [this paper](https://arxiv.org/abs/2302.13971)
  and first released at [this page](https://github.com/facebookresearch/llama).
@@ -62,7 +64,11 @@ system_prompt = "Olet tekoälyavustaja. Vastaat aina mahdollisimman avuliaasti.


  def format_prompt(prompt: str) -> str:
-     prompt = f" [INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{prompt.strip()} [/INST] "
+     prompt = f" [INST] <<SYS>>
+ {system_prompt.strip()}
+ <</SYS>>
+
+ {prompt.strip()} [/INST] "
      return prompt


@@ -144,27 +150,27 @@ The final training dataset had 23 billion words (calculated with regex "\w+") an
  The first stage:
  |Dataset | Words | Ratio |
  |:-----------------------------|:------------|:-------------|
- |CulturaX | 12.820B | 59.88\% |
- |HPLT v1.2 | 5.034B | 23.51\% |
- |Suomi24 | 3.018B | 14.09\% |
- |Reddit | 0.141B | 0.66\% |
- |CC-News | 0.311B | 1.45\% |
- |FI news corpus | 0.004B | 0.02\% |
- |Project Lönnrot | 0.083B | 0.39\% |
- |**TOTAL** | **21.410B** | **100.0\%** |
+ |CulturaX | 12.820B | 59.88% |
+ |HPLT v1.2 | 5.034B | 23.51% |
+ |Suomi24 | 3.018B | 14.09% |
+ |Reddit | 0.141B | 0.66% |
+ |CC-News | 0.311B | 1.45% |
+ |FI news corpus | 0.004B | 0.02% |
+ |Project Lönnrot | 0.083B | 0.39% |
+ |**TOTAL** | **21.410B** | **100.0%** |


  The second stage:
  |Dataset | Words | Ratio |
  |:--------------------------------------------------------------|:------------|:------------|
- |CulturaX (cleaner sample using KenLM perplexity score) | 2.252B | 55.48\% |
- |Wikipedia | 0.095B | 2.34\% |
- |STT | 0.253B | 6.23\% |
- |Yle | 0.212B | 5.22\% |
- |Finnish parliament speeches | 0.021B | 0.52\% |
- |Finnish higher education public theses | 0.855B | 21.07\% |
- |Finnish instruction-following datasets (note: 2X upsampled) | 0.371B | 9.14\% |
- |**TOTAL** | **4.059B** | **100.0\%** |
+ |CulturaX (cleaner sample using KenLM perplexity score) | 2.252B | 55.48% |
+ |Wikipedia | 0.095B | 2.34% |
+ |STT | 0.253B | 6.23% |
+ |Yle | 0.212B | 5.22% |
+ |Finnish parliament speeches | 0.021B | 0.52% |
+ |Finnish higher education public theses | 0.855B | 21.07% |
+ |Finnish instruction-following datasets (note: 2X upsampled) | 0.371B | 9.14% |
+ |**TOTAL** | **4.059B** | **100.0%** |

  ## Training procedure

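
Taken literally, the five added lines in the `format_prompt` hunk place physical line breaks inside a single-quoted f-string, which Python rejects as an unterminated string literal. The sketch below is one runnable form that builds the same text as the removed line by keeping the `\n` escapes. The `system_prompt` value here is a shortened stand-in, since the diff shows only the beginning of the real one.

```python
# Shortened stand-in: the diff truncates the real system prompt.
system_prompt = "Olet tekoälyavustaja."


def format_prompt(prompt: str) -> str:
    """Wrap a user message in the [INST] / <<SYS>> template from the README."""
    # Escaped newlines keep each f-string literal on one physical line.
    return (
        f" [INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n"
        f"{prompt.strip()} [/INST] "
    )


if __name__ == "__main__":
    # repr() makes the embedded newlines and surrounding spaces visible.
    print(repr(format_prompt("Hei!")))
```

A triple-quoted f-string would also accept the physical line breaks, at the cost of making the leading and trailing spaces of the template harder to see.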