Checkpoints for the main experiments in "Forgetting Transformer: Softmax Attention with a Forget Gate" (https://arxiv.org/abs/2503.02130).
- zhixuan-lin/fox-pro-760m-longcrawl64-48b (Text Generation, 0.8B)
- zhixuan-lin/transformer-pro-760m-longcrawl64-48b (Text Generation, 0.8B)
- zhixuan-lin/fox-llama-760m-longcrawl64-48b (Text Generation, 0.8B)
- zhixuan-lin/transformer-llama-760m-longcrawl64-48b (Text Generation, 0.8B)