Draft Models
Tiny "draft" models for speculative decoding.
A 0.6B parameter draft (speculative decoding) model for use with DeepSeek-V3-0324 and DeepSeek-V3. See DeepSeek-V3-DRAFT-0.6B-v2.0-GGUF for the models in GGUF format for use with llama.cpp.
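For context, the sketch below shows roughly how a draft model like this can be used for speculative decoding via the Hugging Face transformers assisted-generation API (`assistant_model`). It is an illustration only: the draft-model path is a placeholder, and loading the full DeepSeek-V3 target this way requires substantial hardware; with llama.cpp you would instead pass the GGUF draft model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths -- substitute the actual target and draft checkpoints.
target_id = "deepseek-ai/DeepSeek-V3-0324"
draft_id = "./DeepSeek-V3-DRAFT-0.6B"

tokenizer = AutoTokenizer.from_pretrained(target_id, trust_remote_code=True)
target = AutoModelForCausalLM.from_pretrained(
    target_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
# The draft model shares the target's (transplanted) vocabulary, so it can be
# passed directly as the assistant model for speculative decoding.
draft = AutoModelForCausalLM.from_pretrained(
    draft_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Write a quicksort function in Python.", return_tensors="pt").to(target.device)
output = target.generate(**inputs, assistant_model=draft, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```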
The current config.json is set for a context length of up to 32k tokens. Add the "rope_scaling" section to config.json to enable YaRN, e.g. for a 64k context:

"max_position_embeddings": 65536,
...
"rope_scaling": {
  "factor": 2.0,
  "original_max_position_embeddings": 32768,
  "type": "yarn"
},

or for a 128k context:

"max_position_embeddings": 131072,
...
"rope_scaling": {
  "factor": 4.0,
  "original_max_position_embeddings": 32768,
  "type": "yarn"
},
NOTE: Because llama.cpp uses "static-YaRN", the scaling factor remains constant regardless of input length! Only add the rope_scaling configuration when processing long contexts is required.
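Since the factor is simply the target context length divided by the original 32768, config.json can also be patched programmatically. The snippet below is a hypothetical convenience helper (`enable_yarn` is not part of this repository), shown only to make the relationship between the fields explicit:

```python
import json

def enable_yarn(config_path: str, target_ctx: int, original_ctx: int = 32768) -> None:
    """Enable (static) YaRN in a config.json for contexts longer than original_ctx."""
    with open(config_path) as f:
        config = json.load(f)

    # factor = target / original, e.g. 65536/32768 = 2.0 or 131072/32768 = 4.0
    config["max_position_embeddings"] = target_ctx
    config["rope_scaling"] = {
        "factor": target_ctx / original_ctx,
        "original_max_position_embeddings": original_ctx,
        "type": "yarn",
    }

    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)

# Example (hypothetical path): enable a 64k context
# enable_yarn("DeepSeek-V3-DRAFT-0.6B/config.json", 65536)
```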
python ./transplant_vocab.py \
./Qwen2.5-Coder-0.5B-Instruct \
./DeepSeek-V3-0324-BF16 \
./DeepSeek-V3-DRAFT-0.6B-UNTRAINED \
--override "<|▁pad▁|>" "<|endoftext|>" \
--override "<|fim▁hole|>" "<|fim_middle|>" \
--override "<|fim▁begin|>" "<|fim_prefix|>" \
--override "<|fim▁end|>" "<|fim_suffix|>" \
--override "<|User|>" "<|im_start|>user\\n" \
--override "<|Assistant|>" "<|im_start|>assistant\\n" \
--override "<|EOT|>" "<|endoftext|>" \
--override "<|tool▁calls▁begin|>" "<tool_call>" \
--override "<|tool▁call▁begin|>" "<tool_call>" \
--override "<|tool▁outputs▁begin|>" "<tool_call>" \
--override "<|tool▁output▁begin|>" "<tool_call>" \
--override "<|tool▁calls▁end|>" "</tool_call>" \
--override "<|tool▁call▁end|>" "</tool_call>" \
--override "<|tool▁outputs▁end|>" "</tool_call>" \
--override "<|tool▁output▁end|>" "</tool_call>" \
--override "<|tool▁sep|>" "</tool_call>"
The training data was formatted just between <|end▁of▁sentence|> tags.
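As a minimal sketch of that formatting, assuming each raw-text sample is simply wrapped between <|end▁of▁sentence|> tags with no chat template (the helper name is an assumption for illustration):

```python
EOS = "<|end▁of▁sentence|>"

def format_sample(text: str) -> str:
    # Wrap each raw-text sample between end-of-sentence tags only -- no chat
    # template, no system prompt, just the raw document text.
    return f"{EOS}{text}{EOS}"

# Example:
# format_sample("fn main() {}")  ->  "<|end▁of▁sentence|>fn main() {}<|end▁of▁sentence|>"
```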
# ==============================
# MODEL AND OUTPUT CONFIGURATION
# ==============================
model_dir = 'models/DeepSeek-V3-DRAFT-0.6B-UNTRAINED'
output_dir = 'finetuned'
# ===========================
# TRAINING TYPE CONFIGURATION
# ===========================
full_fine_tune = true
# =======================
# OPTIMIZER CONFIGURATION
# =======================
lr = 2e-5
# ======================
# TRAINING CONFIGURATION
# ======================
sequence_len = 32768
gradient_accumulation_steps = 5 # 5×6 = batch size 30, 5×6×32768 = ~1M tokens per step
# =====================
# DATASET CONFIGURATION
# =====================
[[datasets]]
dataset_path = 'datasets/common-crawl-sample/*.json'
drop_tails = true
[[datasets]]
dataset_path = 'datasets/the-stack-smol-xl/*.jsonl'
drop_tails = true
I used six RTX A6000 GPUs over three nodes, hence the batch size of 30 (6 GPUs x 5 gradient accumulation steps = 30).
Base model: Qwen/Qwen2.5-0.5B