
A 0.6B parameter draft (speculative decoding) model for use with DeepSeek-V3-0324 and DeepSeek-V3.

See DeepSeek-V3-DRAFT-0.6B-v2.0-GGUF for the models in GGUF format for use with llama.cpp.


Extending the context above 32k

The current config.json is set for a context length of up to 32k tokens. To enable YaRN for longer contexts, add a "rope_scaling" section to config.json and raise "max_position_embeddings" to match, e.g.:

To extend the context to 64k:

  "max_position_embeddings": 65536,
  ...
  "rope_scaling": {
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  },

To extend the context to 128k:

  "max_position_embeddings": 131072,
  ...
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  },

NOTE: Because llama.cpp uses "static-YaRN", the scaling factor remains constant regardless of input length! Only add the rope_scaling configuration when you actually need to process long contexts.
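
The two config.json edits above follow the same pattern, so they can be scripted. The following is a minimal sketch, not part of the original workflow: the helper names add_yarn_scaling and patch_config_file are hypothetical, and the rule factor = target context / 32768 is inferred from the two examples above (2.0 for 64k, 4.0 for 128k).

```python
import json
from pathlib import Path

ORIGINAL_CONTEXT = 32768  # context length the current config.json is set for


def add_yarn_scaling(config: dict, target_context: int) -> dict:
    """Return a copy of config with YaRN settings for target_context tokens.

    Assumption: factor = target_context / 32768, matching the 64k and 128k
    examples in this section.
    """
    if target_context <= ORIGINAL_CONTEXT:
        raise ValueError("rope_scaling is only needed above 32k tokens")
    updated = dict(config)
    updated["max_position_embeddings"] = target_context
    updated["rope_scaling"] = {
        "factor": target_context / ORIGINAL_CONTEXT,
        "original_max_position_embeddings": ORIGINAL_CONTEXT,
        "type": "yarn",
    }
    return updated


def patch_config_file(path: str, target_context: int) -> None:
    """Rewrite a config.json in place (hypothetical helper)."""
    config_path = Path(path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    patched = add_yarn_scaling(config, target_context)
    config_path.write_text(json.dumps(patched, indent=2) + "\n", encoding="utf-8")
```

For example, patch_config_file("config.json", 65536) would produce the 64k settings shown above. Keep a backup of the original config.json so the change can be reverted when long contexts are not needed.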


How this model was created

1. The initial model was created from Qwen2.5-Coder-0.5B-Instruct using transplant-vocab:

python ./transplant_vocab.py \
    ./Qwen2.5-Coder-0.5B-Instruct \
    ./DeepSeek-V3-0324-BF16 \
    ./DeepSeek-V3-DRAFT-0.6B-UNTRAINED \
    --override "<|▁pad▁|>" "<|endoftext|>" \
    --override "<|fim▁hole|>" "<|fim_middle|>" \
    --override "<|fim▁begin|>" "<|fim_prefix|>" \
    --override "<|fim▁end|>" "<|fim_suffix|>" \
    --override "<|User|>" "<|im_start|>user\\n" \
    --override "<|Assistant|>" "<|im_start|>assistant\\n" \
    --override "<|EOT|>" "<|endoftext|>" \
    --override "<|tool▁calls▁begin|>" "<tool_call>" \
    --override "<|tool▁call▁begin|>" "<tool_call>" \
    --override "<|tool▁outputs▁begin|>" "<tool_call>" \
    --override "<|tool▁output▁begin|>" "<tool_call>" \
    --override "<|tool▁calls▁end|>" "</tool_call>" \
    --override "<|tool▁call▁end|>" "</tool_call>" \
    --override "<|tool▁outputs▁end|>" "</tool_call>" \
    --override "<|tool▁output▁end|>" "</tool_call>" \
    --override "<|tool▁sep|>" "</tool_call>"

2. A fine-tuning dataset of ~1.6B tokens was created from the datasets listed in the training configuration in step 3 (common-crawl-sample and the-stack-smol-xl), with the text formatted just between <|end▁of▁sentence|> tags.
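
As a rough illustration of that formatting, the sketch below wraps a raw document in the end-of-sentence tag with no chat template. This is an assumption about what "just between" means here, not the author's actual preprocessing code, and wrap_document is a hypothetical helper.

```python
# Tag string copied from the text above.
EOS = "<|end▁of▁sentence|>"


def wrap_document(text: str) -> str:
    """Place raw text between two end-of-sentence tags (assumed format)."""
    return f"{EOS}{text}{EOS}"


sample = wrap_document("def add(a, b):\n    return a + b\n")
```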

3. The model was then trained using qlora-pipe-lite for 1 epoch with a batch size of 30 and a sequence length of 32k (~1M tokens per step):

# ==============================
# MODEL AND OUTPUT CONFIGURATION
# ==============================

model_dir = 'models/DeepSeek-V3-DRAFT-0.6B-UNTRAINED'
output_dir = 'finetuned'

# ===========================
# TRAINING TYPE CONFIGURATION
# ===========================

full_fine_tune = true

# =======================
# OPTIMIZER CONFIGURATION
# =======================

lr = 2e-5

# ======================
# TRAINING CONFIGURATION
# ======================

sequence_len = 32768

gradient_accumulation_steps = 5  # 5×6 = batch size 30, 5×6×32768 = ~1M tokens per step

# =====================
# DATASET CONFIGURATION
# =====================

[[datasets]]
dataset_path = 'datasets/common-crawl-sample/*.json'
drop_tails = true

[[datasets]]
dataset_path = 'datasets/the-stack-smol-xl/*.jsonl'
drop_tails = true

I used six RTX A6000 GPUs over three nodes, hence the batch size of 30 (6 GPUs x 5 gradient accumulation steps = 30).
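
The batch arithmetic can be checked with a few lines of Python. This is only a sketch: it assumes one 32k sequence per GPU per micro-step (as the 6 x 5 = 30 figure implies), and the step count is a rough estimate because the dataset size is approximate and drop_tails discards some tokens.

```python
NUM_GPUS = 6                 # six RTX A6000 GPUs
GRAD_ACCUM_STEPS = 5         # gradient_accumulation_steps in the config
SEQUENCE_LEN = 32768         # sequence_len in the config
DATASET_TOKENS = 1.6e9       # approximate size of the fine-tuning dataset


def tokens_per_step(num_gpus: int, grad_accum_steps: int, sequence_len: int) -> int:
    """Tokens consumed by one optimizer step, assuming one sequence per GPU per micro-step."""
    return num_gpus * grad_accum_steps * sequence_len


batch_size = NUM_GPUS * GRAD_ACCUM_STEPS                               # 30
step_tokens = tokens_per_step(NUM_GPUS, GRAD_ACCUM_STEPS, SEQUENCE_LEN)  # 983040, i.e. ~1M
approx_steps = DATASET_TOKENS / step_tokens                            # roughly 1,600 steps for 1 epoch

print(f"batch size: {batch_size}")
print(f"tokens per step: {step_tokens}")
print(f"approximate steps per epoch: {approx_steps:.0f}")
```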

