arxiv:2404.05829

SambaLingo: Teaching Large Language Models New Languages

Published on Apr 8, 2024
· Submitted by akhaliq on Apr 10, 2024
AI-generated summary

This investigation explores comprehensive techniques for adapting large language models to new languages, addressing vocabulary extension, preference optimization, and data scarcity issues across multiple languages and model sizes.

Abstract

Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.
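As a rough illustration of the vocabulary-extension step mentioned in the abstract, the sketch below shows how a base checkpoint's tokenizer and embedding matrix might be grown before continued pretraining on target-language text. This is not the authors' code: the base model name, the example tokens, and the setup are placeholders, using the Hugging Face transformers API.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint (gated on the Hub)
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    # Hypothetical subword tokens learned from target-language text.
    new_tokens = ["▁merhaba", "▁дүние", "▁สวัสดี"]
    num_added = tokenizer.add_tokens(new_tokens)

    # Grow the input (and tied output) embeddings to cover the added tokens;
    # the new rows start from a fresh initialization and are trained during
    # continued pretraining on the new language.
    model.resize_token_embeddings(len(tokenizer))
    print(f"added {num_added} tokens, vocab size is now {len(tokenizer)}")

In the recipe described in the abstract, continued pretraining and direct preference optimization then follow on the expanded model; the exact data mixtures and hyperparameters are detailed in the paper itself.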


Models citing this paper 34

Datasets citing this paper 0

No datasets link to this paper yet.

Spaces citing this paper 16

Collections including this paper 8