distily_bench_gpt2_attn

This student model was distilled from the teacher model gpt2. The training dataset is unspecified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set (the values match the step 9000 row of the Eval-Phase Metrics table, not the final step 12375):

  • eval_enwikippl: 208.9635
  • eval_frwikippl: 1351.4938
  • eval_zhwikippl: 781.2166
  • eval_loss: 19.7940
  • eval_runtime: 17.3332
  • eval_samples_per_second: 57.693
  • eval_steps_per_second: 7.212
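
As a rough consistency check, the throughput figures above imply the size of the evaluation set. The sketch below assumes that eval_runtime is reported in seconds (the usual Hugging Face Trainer convention) and that evaluation ran on a single device. The variable names are illustrative and are not part of Distily.

```python
# Values copied from the results list above.
eval_runtime = 17.3332             # assumed to be in seconds
eval_samples_per_second = 57.693
eval_steps_per_second = 7.212

# Throughput multiplied by runtime gives the approximate totals.
eval_samples = round(eval_samples_per_second * eval_runtime)
eval_steps = round(eval_steps_per_second * eval_runtime)

print(eval_samples, eval_steps)      # 1000 125
print(eval_samples / eval_steps)     # 8.0
```

Under these assumptions the evaluation set holds about 1000 samples processed in 125 steps, which is 8 samples per step and agrees with the eval_batch_size of 8 listed under the training hyperparameters.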

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=2.0, loss_fn=ce, layer_mapper=None, projector=None))
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
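
The distillation_objective above combines a KL loss on logits (weight 1), a disabled hidden-state term (weight 0), and a cross-entropy loss on attentions (weight 2.0). The following is a minimal pure-Python sketch of how such a weighted sum could be formed. It is not Distily's implementation: the helper names, the KL direction (teacher relative to student), and the toy inputs are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student); the direction is an assumption."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)

def cross_entropy(teacher_probs, student_probs):
    """Cross-entropy of the student distribution against the teacher's."""
    return -sum(t * math.log(s)
                for t, s in zip(teacher_probs, student_probs) if t > 0)

def combined_loss(components):
    """Weighted sum of (weight, value) pairs; zero-weight terms are skipped."""
    return sum(w * v for w, v in components if w != 0)

# Toy inputs (hypothetical, not taken from the model).
teacher_logits, student_logits = [2.0, 1.0, 0.1], [1.5, 1.2, 0.3]
teacher_attn, student_attn = [0.7, 0.2, 0.1], [0.6, 0.3, 0.1]

logits_loss = kl_divergence(softmax(teacher_logits), softmax(student_logits))
attn_loss = cross_entropy(teacher_attn, student_attn)

# Weights mirror the objective above: logits=1, hs=0 (disabled), attn=2.0.
total = combined_loss([(1, logits_loss), (0, None), (2.0, attn_loss)])
print(total)
```

Because the hidden-state component has weight 0 and no loss function, the sketch skips it entirely rather than evaluating it.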

Resource Usage

Peak GPU Memory: 8.2195 GB

Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 30.2086 | 57.2728 | | | | | 18.1784 |
| 0 | 0 | 55429.6875 | 57698.8047 | 24.5150 | 17.4179 | 57.412 | 7.177 | 56988.9141 |
| 1000 | 0.0808 | 702.9320 | 4403.8062 | 20.5050 | 17.3512 | 57.633 | 7.204 | 19095.4688 |
| 2000 | 0.1616 | 507.8192 | 3252.5339 | 20.3170 | 17.3451 | 57.653 | 7.207 | 2454.8054 |
| 3000 | 0.2424 | 418.1162 | 2743.4949 | 20.2070 | 17.3188 | 57.741 | 7.218 | 1193.5658 |
| 4000 | 0.3232 | 372.6640 | 2567.2002 | 20.1200 | 17.2361 | 58.018 | 7.252 | 1026.6641 |
| 5000 | 0.4040 | 320.0249 | 2154.7588 | 20.0340 | 17.3151 | 57.753 | 7.219 | 1183.4081 |
| 6000 | 0.4848 | 278.3867 | 1778.2332 | 19.9610 | 17.3435 | 57.658 | 7.207 | 869.0625 |
| 7000 | 0.5657 | 251.7534 | 1568.9419 | 19.9040 | 17.4023 | 57.464 | 7.183 | 807.5215 |
| 8000 | 0.6465 | 230.5502 | 1399.7903 | 19.8380 | 17.3855 | 57.519 | 7.19 | 816.4125 |
| 9000 | 0.7273 | 208.9635 | 1351.4938 | 19.7940 | 17.3332 | 57.693 | 7.212 | 781.2166 |
| 10000 | 0.8081 | 192.8560 | 1211.9225 | 19.7530 | 17.3032 | 57.793 | 7.224 | 608.5041 |
| 11000 | 0.8889 | 179.3916 | 1140.7820 | 19.6930 | 17.2721 | 57.897 | 7.237 | 624.0573 |
| 12000 | 0.9697 | 161.3999 | 997.4732 | 19.6480 | 17.21 | 58.106 | 7.263 | 560.2280 |
| 12375 | 1.0 | 158.3214 | 948.9705 | 19.6380 | 17.3071 | 57.78 | 7.222 | 575.3149 |
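
The step and epoch columns can be cross-checked against the training hyperparameters, and the final row can be compared with the teacher. This is a small arithmetic sketch, assuming a single device with no gradient accumulation and assuming the first value in the teacher row is enwikippl. The variable names are illustrative.

```python
# Values copied from the table and the hyperparameter list.
total_steps = 12375        # final step, at epoch 1.0
train_batch_size = 8

# Epoch fraction reached at step 1000; the table reports 0.0808.
epoch_at_step_1000 = round(1000 / total_steps, 4)

# Training samples seen in one epoch (single device, no accumulation assumed).
samples_per_epoch = total_steps * train_batch_size

# Gap between the student's final enwikippl and the teacher's.
teacher_enwikippl = 30.2086
student_enwikippl = 158.3214
ppl_ratio = student_enwikippl / teacher_enwikippl

print(epoch_at_step_1000)    # 0.0808
print(samples_per_epoch)     # 99000
print(round(ppl_ratio, 2))
```

Under these assumptions, one epoch covers 99000 training samples, and the student's final English perplexity remains roughly five times the teacher's.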

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • PyTorch 2.3.0
  • Datasets 2.20.0

Model size: 0.1B params (Safetensors)
Tensor type: BF16