---
language:
- en
- ko
license: cc-by-nc-4.0
tags:
- dnotitia
- nlp
- llm
- slm
- conversation
- chat
base_model:
- meta-llama/Meta-Llama-3.1-8B
library_name: transformers
pipeline_tag: text-generation
---

# DNA 1.0 8B Instruct

<p align="center">
<img src="assets/dna-logo.png" width="400" style="margin: 40px auto;">
</p>

**DNA 1.0 8B Instruct** is a <u>state-of-the-art (**SOTA**)</u> bilingual language model based on the Llama architecture, optimized for Korean language understanding and generation while maintaining strong English capabilities. The model was developed through model merging with Llama 3.1 8B Instruct via spherical linear interpolation (**SLERP**) and knowledge distillation (**KD**) using Llama 3.1 405B as the teacher model, followed by extensive continual pre-training (**CPT**) on a high-quality Korean dataset. The training pipeline was completed with supervised fine-tuning (**SFT**) and direct preference optimization (**DPO**) to align the model with human preferences and enhance its instruction-following abilities.

DNA 1.0 8B Instruct was fine-tuned on approximately 10B tokens of carefully curated data and has undergone extensive instruction tuning to enhance its ability to follow complex instructions and engage in natural conversations.

- **Developed by:** Dnotitia Inc.
- **Supported Languages:** Korean, English
- **Vocab Size:** 128,256
- **Context Length:** 131,072 tokens
- **License:** CC BY-NC 4.0

## Training Procedure

<p align="center">
<img src="assets/training-procedure.png" width="600" style="margin: 40px auto;">
</p>
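As a rough illustration of the SLERP merge step described above (combining our SFT model with Llama 3.1 8B Instruct), the sketch below shows spherical linear interpolation between two weight tensors. It is a minimal example only; the actual merge recipe, per-layer treatment, and interpolation factor used for DNA 1.0 are not specified here, so the `slerp` function and the factor `t` are illustrative assumptions.

```python
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors (illustrative sketch)."""
    a = w_a.flatten().float()
    b = w_b.flatten().float()
    a_dir = a / (a.norm() + eps)
    b_dir = b / (b.norm() + eps)
    # Angle between the two weight vectors
    omega = torch.acos(torch.clamp(torch.dot(a_dir, b_dir), -1.0, 1.0))
    if omega.abs() < eps:
        # Nearly parallel weights: fall back to plain linear interpolation
        return ((1.0 - t) * a + t * b).reshape(w_a.shape)
    sin_omega = torch.sin(omega)
    merged = (torch.sin((1.0 - t) * omega) / sin_omega) * a + (torch.sin(t * omega) / sin_omega) * b
    return merged.reshape(w_a.shape)

# Hypothetical usage: merge corresponding tensors from two checkpoints halfway
# merged_weight = slerp(sft_model_weight, llama_instruct_weight, t=0.5)
```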

## Evaluation

We evaluated DNA 1.0 8B Instruct against other prominent language models of similar size across various benchmarks, including Korean-specific tasks and general language understanding metrics. More details will be provided in the upcoming <u>Technical Report</u>.

| Language | Benchmark  | **dnotitia/DNA-1.0-8B-Instruct** | LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct | yanolja/EEVE-Korean-Instruct-10.8B-v1.0 | meta-llama/Llama-3.1-8B-Instruct | mistralai/Mistral-7B-Instruct-v0.3 | NCSOFT/Llama-VARCO-8B-Instruct | upstage/SOLAR-10.7B-Instruct-v1.0 | 
|----------|------------|----------------------------------|--------------------------------------|-----------------------------------------|----------------------------------|------------------------------------|--------------------------------|-----------------------------------|
| Korean   | KMMLU      | **53.26** (1st)                  | <u>45.28</u>                         | 42.17                                   | 41.66                            | 31.45                              | 38.49                          | 41.50                             | 
|          | KMMLU-hard | **29.46** (1st)                  | 20.78                                | 19.25                                   | 20.49                            | 17.86                              | 19.83                          | 20.61                             |
|          | KoBEST     | **83.40** (1st)                  | 80.13                                | <u>81.67</u>                            | 67.56                            | 63.77                              | 72.99                          | 73.26                             |
|          | Belebele   | **57.99** (1st)                  | 45.11                                | 49.40                                   | <u>54.70</u>                     | 40.31                              | 53.17                          | 48.68                             | 
|          | CSATQA     | **43.32** (1st)                  | 34.76                                | <u>39.57</u>                            | 36.90                            | 27.27                              | 32.62                          | 34.22                             | 
| English  | MMLU       | <u>66.59</u> (2nd)               | 64.32                                | 63.63                                   | **68.26**                        | 62.04                              | 63.25                          | 65.30                             | 
|          | MMLU-Pro   | **43.05** (1st)                  | 38.90                                | 32.79                                   | <u>40.92</u>                     | 33.49                              | 37.11                          | 30.25                             | 
|          | GSM8K      | **80.52** (1st)                  | <u>80.06</u>                         | 56.18                                   | 75.82                            | 49.66                              | 64.14                          | 69.22                             | 

- The highest scores are shown in **bold**, and the second-highest scores are <u>underlined</u>.
- These results were obtained using a 5-shot evaluation setting.
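The evaluation harness behind these numbers is not specified here. As one hypothetical way to reproduce a 5-shot run, the sketch below uses EleutherAI's lm-evaluation-harness (`pip install lm-eval`); the task names, dtype, and batch size are assumptions rather than the authors' published protocol.

```python
# Hypothetical 5-shot evaluation sketch with EleutherAI's lm-evaluation-harness.
# Task names and settings are assumptions; the official setup may differ.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=dnotitia/DNA-1.0-8B-Instruct,dtype=bfloat16",
    tasks=["kmmlu", "mmlu", "gsm8k"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```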


## Quickstart

This model requires `transformers >= 4.43.0`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Load the tokenizer and model (device_map='auto' places weights on available devices)
tokenizer = AutoTokenizer.from_pretrained('dnotitia/DNA-1.0-8B-Instruct')
model = AutoModelForCausalLM.from_pretrained('dnotitia/DNA-1.0-8B-Instruct', device_map='auto')
streamer = TextStreamer(tokenizer, skip_prompt=True)  # stream generated tokens to stdout

conversation = [
    {"role": "system", "content": "You are a helpful assistant, Dnotitia DNA."},
    {"role": "user", "content": "너의 이름은?"},  # "What is your name?"
]
# Apply the chat template and move the encoded prompt to the model's device
inputs = tokenizer.apply_chat_template(conversation,
                                       add_generation_prompt=True,
                                       return_dict=True,
                                       return_tensors="pt").to(model.device)
_ = model.generate(**inputs, streamer=streamer)
```
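By default, `generate` uses the model's generation config. For more control over output length and sampling, decoding parameters can be passed explicitly; the values below are illustrative, not settings recommended by the model authors.

```python
outputs = model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=512,   # cap the response length (illustrative value)
    do_sample=True,       # sample instead of greedy decoding
    temperature=0.7,      # illustrative sampling settings, not official recommendations
    top_p=0.9,
)
# Decode only the newly generated tokens (everything after the prompt)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
```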

## Limitations

While DNA 1.0 8B Instruct demonstrates strong performance, users should be aware of the following limitations:

- The model may occasionally generate biased or inappropriate content
- Responses are based on training data and may not reflect current information
- The model may sometimes produce factually incorrect or inconsistent answers
- Performance may vary depending on the complexity and domain of the task
- Generated content should be reviewed for accuracy and appropriateness

## License

This model is released under CC BY-NC 4.0 license. For commercial usage inquiries, please [contact us](https://www.dnotitia.com/contact/post-form).

## Appendix

DNA 1.0 8B Instruct model architecture <sup>[1]</sup>:
<img src="assets/model-architecture.png" width="500" style="margin: 40px auto;">

[1]: <https://www.linkedin.com/posts/sebastianraschka_the-llama-32-1b-and-3b-models-are-my-favorite-activity-7248317830943686656-yyYD/>

The median percentage difference in model weights before and after the merge (our SFT model + Llama 3.1 8B Instruct):
<img src="assets/ours-vs-merged.png" width="100%" style="margin: 40px auto;">

## Citation

If you use or discuss this model in your academic research, please cite the project to help spread awareness:

```
@article{dnotitiadna2024,
  title = {Dnotitia DNA 1.0 8B Instruct},
  author = {Jungyup Lee and Jemin Kim and Sang Park and Seungjae Lee},
  year = {2024},
  url = {https://huggingface.co/dnotitia/DNA-1.0-8B-Instruct},
  version = {1.0},
}
```