---
license: mit
datasets:
- allura-org/Celeste-Filtered
- allura-org/neon-41k
- EVA-UNIT-01/Lilith-v0.2
language:
- en
base_model:
- THUDM/GLM-4-32B-0414
library_name: transformers
---
<img src="image_28.png">
<small>Image by CalamitousFelicitousness</small>
---
# GLM-4-32B-0414 Neon v2
RP finetune of GLM-4-32B-0414. Feels nice, lots of personality, lots of variety, if a bit quirky sometimes. Pretty smart, but sometimes plays dumb for a swipe, just let it be itself. Nice prose, not too Claude-ish or Gemini-ish. A bit of structural repetition happens sometimes, but that's how modern LLMs are so ¯\\_(ツ)_/¯. Seems to like JSON formatted system prompts.
Model was trained by Auri.
---
**Training notes**
Model was trained on a dataset consisting of 77M tokens of synthetic RP and short story gen data for one epoch. Training took around 28 hours on a 4xRTX 3090 workstation, generously provided by [OwenArli](https://huggingface.co/OwenArli). Went with some sane defaults for training config; QLoRA plus CCE and sequence parallelism allowed 16k context to fit in 96GB. Overall, it trained smoother than 9B. I still have the issue with NaN Eval/Loss, still not sure of the reason why.
Huge thanks to [ArliAI](https://www.arliai.com/) for providing compute and collaborating on this run!
**Format**
Model responds to GLM4 instruct formatting, exactly like its base model. Backends struggle to add the BOS token automatically, so you'll need to do it yourself. The Jinja template should work for chat completions.
```
[gMASK]<sop><|system|>
{system_prompt}<|user|>
{prompt}<|assistant|>
```
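If you are sending raw text completions and have to assemble the prompt yourself, the format above can be sketched in plain Python. This is a minimal sketch, not part of the model's tooling: the function name `build_glm4_prompt` and the message dict layout are assumptions for illustration, and it mirrors the Jinja template from the training config (unknown roles are skipped, exactly as the template does).

```python
def build_glm4_prompt(messages, add_generation_prompt=True):
    """Assemble a GLM4-style prompt string, including the [gMASK]<sop> prefix.

    `messages` is a list of {"role": ..., "content": ...} dicts
    (a hypothetical layout chosen for this sketch).
    """
    parts = ["[gMASK]<sop>"]  # BOS prefix that backends may not add for you
    for msg in messages:
        role = msg["role"]
        if role not in ("system", "user", "assistant"):
            continue  # the card's template has no branch for other roles
        # Role tag, newline, then the message body with no trailing newline.
        parts.append(f"<|{role}|>\n{msg['content']}")
    if add_generation_prompt:
        parts.append("<|assistant|>")
    return "".join(parts)


prompt = build_glm4_prompt([
    {"role": "system", "content": "You are a narrator."},
    {"role": "user", "content": "Describe the tavern."},
])
print(prompt)
```

If your backend does add the prefix on its own, drop it here so it is not duplicated.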
**Recommended Samplers**
Nothing special, just classics.
```
Temperature - 1
Min-P - 0.1
Repetition Penalty - 1.03
```
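For anyone unsure what Min-P 0.1 does at Temperature 1, here is a rough pure-Python sketch of the idea: tokens whose probability is below 10% of the top token's probability get cut, and the rest are renormalized. The function name `min_p_filter` is made up for this example, the order of temperature and Min-P differs between backends, and Repetition Penalty is not covered here.

```python
import math


def min_p_filter(logits, temperature=1.0, min_p=0.1):
    """Return renormalized probabilities after temperature scaling and Min-P.

    Illustrative only; real backends do this on tensors and may apply
    the samplers in a different order.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep only tokens with at least min_p times the top token's probability.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


# Three toy tokens with probabilities 0.60, 0.35 and 0.05.
toy_logits = [math.log(0.60), math.log(0.35), math.log(0.05)]
print(min_p_filter(toy_logits))
```

In the toy example the cutoff is 0.1 * 0.60 = 0.06, so the 0.05 token is removed and the other two share the probability mass.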
[Example master import for SillyTavern (using Shingane-v1 system prompt by Steelskull)](https://huggingface.co/allura-org/GLM4-9B-Neon-v2/blob/main/GLM-Shingane-v1.json)
**Running on KoboldCPP and other backends**
To run GGUFs correctly, you need the most recent version of KoboldCPP, and to pass `--overridekv glm4.rope.dimension_count=int:64` to the CLI command or put `glm4.rope.dimension_count=int:64` into the overridekv box in the GUI (under the Tokens tab at the very bottom).
Thanks to DaringDuck and tofumagnate for info on how to apply this fix.
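If you launch KoboldCPP from a script, the override can be passed like any other argument. A small sketch, with assumptions: the helper name `koboldcpp_args` and the GGUF filename are hypothetical, and `--model` is assumed to be how your KoboldCPP build takes the model path. Only the `--overridekv` value comes from the instructions above.

```python
def koboldcpp_args(gguf_path):
    """Build the argument list for KoboldCPP with the GLM-4 RoPE override."""
    return [
        "--model", gguf_path,  # assumption: model path passed via --model
        "--overridekv", "glm4.rope.dimension_count=int:64",
    ]


# Hypothetical filename, substitute the quant you actually downloaded.
args = koboldcpp_args("GLM-4-32B-0414-Neon-v2.Q4_K_M.gguf")
print(" ".join(args))
```

Append the resulting list to whatever command you normally use to start KoboldCPP, for example through `subprocess.run`.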
~~To run this model on vLLM, you'll need to build it from source from the git repo, full GLM4 support hasn't reached release yet.~~ Should work OOTB on vLLM >=0.8.5.
ExLLaMAv2 currently doesn't properly support GLM-4-32B, unlike 9B. EXL3 should work, but it's untested.
Latest versions of llama.cpp server should also allow running GGUFs out-of-the-box.
---
**Special Thanks**
Once again, huge kudos to OwenArli for providing compute and helping with tuning along the way!
Big thanks to Artus for providing free inference for pre-release showcase of this model!
And big thanks to BeaverAI community for giving feedback and helping to figure out optimal settings!
---
**Training config**
<details><summary>See Axolotl config</summary>
```yaml
# Model
base_model: /home/owen/models/GLM-4-32B-0414
strict: false
model_type: AutoModelForCausalLM
# Liger Kernels and CCE (optimization)
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: false
liger_rms_norm: false
liger_glu_activation: false
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true
# Output and HuggingFace
output_dir: ./GLM-32B-Neon-v2
hub_model_id: AuriAetherwiing/GLM-32B-Neon-v2-LoRA
hf_use_auth_token: true
hub_strategy: "all_checkpoints"
# WandB
wandb_project: allura-org
wandb_entity:
wandb_name: GLM-32B-Neon-v2
# Data
#chat_template: chatml
#train_on_inputs: false
group_by_length: false
datasets:
  - path: ./Neon/neon.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
    train_on_eos: all
  - path: ./Neon/S2.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
    train_on_eos: all
  - path: ./Neon/SystemChat_subset_filtered_sharegpt_utf8fix.jsonl
    type: chat_template
    field_messages: conversations
    message_field_role: from
    message_field_content: value
    train_on_eos: all
dataset_prepared_path: ./lora_last_run_prepared
chat_template: jinja
chat_template_jinja: |
  [gMASK]<sop>{%- for msg in messages %}{%- if msg.role == 'system' %}<|system|>
  {{ msg.content }}{%- elif msg.role == 'user' %}<|user|>
  {{ msg.content }}{%- elif msg.role == 'assistant' %}<|assistant|>
  {{ msg.content }}{%- endif %}{%- endfor %}{% if add_generation_prompt %}<|assistant|>{% endif %}
## Evaluation
val_set_size: 0.005
evals_per_epoch: 8
eval_table_size:
eval_max_new_tokens: 128
# Technical aspects
sequence_len: 16384
save_safetensors: true
saves_per_epoch: 4
logging_steps: 1
#special_tokens:
#  pad_token: <pad>
# Quantization
bf16: auto
fp16:
tf32: false
## For LoRA
load_in_8bit: false
load_in_4bit: true
# LoRA
peft_use_rslora: false
peft_use_dora: false # better but slower
adapter: qlora # lora or qlora
lora_model_dir:
lora_r: 64 # 64 is optimal for most trains on instruct
lora_alpha: 64
lora_dropout: 0.1
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
# loraplus_lr_ratio: 8 # works to converge faster but is kinda cancer bc makes model unstable
#loraplus_lr_embedding:
# Training hyperparameters
# max_steps:
num_epochs: 1
# Anti Overfit and Stability
weight_decay: 0.01
max_grad_norm: 1.0
## Learning Rate
warmup_ratio: 0.05
learning_rate: 1e-5
lr_scheduler: rex
#lr_scheduler_kwargs:
#    min_lr: 0.0000024
optimizer: adamw_torch # usually adamw_torch or paged_adamw_8bit
## Batch Size
gradient_accumulation_steps: 32     # More effective batch size - stabler train, usually. MBS also speeds it up.
micro_batch_size: 1          # Batch size per gpu = micro_batch_size * gradient_accumulation_steps
eval_batch_size: 1
# Optimizations
pad_to_sequence_len: true
sample_packing: true
eval_sample_packing: false
flash_attention: true
xformers_attention:
gradient_checkpointing:
gradient_checkpointing_kwargs:
   use_reentrant: false
   
# Set to a divisor (> 1) of the number of GPUs available
sequence_parallel_degree: 4  # Split sequences across 4 GPUs
# Optional; strides across the key dimension. Larger values use more memory but should make training faster.
heads_k_stride: 1
# Optional; one of "varlen_llama3", "batch_ring", "batch_zigzag", "batch_stripe". Defaults to
# "varlen_llama3" when `sample_packing: true`, and "batch_ring" otherwise.
ring_attn_func:
   
# deepspeed: /home/owen/axolotl/deepspeed_configs/zero3_bf16_cpuoffload_all.json
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: false
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Glm4DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_activation_checkpointing: true
```
</details>
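
For a rough feel of what the batch settings in the config add up to, here is a back-of-the-envelope sketch. It assumes that with sequence parallelism the GPUs in one group share a single sequence, so the data-parallel size is the GPU count divided by `sequence_parallel_degree`; it also assumes perfectly packed 16k sequences and ignores the held-out eval split, so treat the numbers as estimates. The function name is made up for this example.

```python
import math


def effective_batch(num_gpus, sequence_parallel_degree, micro_batch_size,
                    gradient_accumulation_steps, sequence_len):
    """Estimate sequences and tokens per optimizer step (upper bound)."""
    if num_gpus % sequence_parallel_degree != 0:
        raise ValueError("sequence_parallel_degree must divide num_gpus")
    # GPUs inside one sequence-parallel group work on the same sequence.
    data_parallel = num_gpus // sequence_parallel_degree
    sequences = micro_batch_size * gradient_accumulation_steps * data_parallel
    return sequences, sequences * sequence_len


# Values taken from the config above, on the 4-GPU workstation.
seqs, tokens = effective_batch(
    num_gpus=4,
    sequence_parallel_degree=4,
    micro_batch_size=1,
    gradient_accumulation_steps=32,
    sequence_len=16384,
)
print(seqs, tokens)                     # sequences and tokens per step
print(math.ceil(77_000_000 / tokens))   # approximate steps for 77M tokens
```

Under those assumptions, one step covers 32 packed sequences (about 524k tokens), and the 77M-token epoch works out to roughly 147 optimizer steps.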
