Hi @sirluk , thanks for the great post. Do you know whether the above masking technique works with some attention implementations but is incompatible with others? For example, would the above masking work with SDPA, flash_attention_2, and eager (each of these implementations is handled a bit differently in https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/modeling_mistral.py#L666, for example)?
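For reference, here is a minimal sketch of the kind of block-diagonal, per-document causal mask the post describes; the helper name `packed_causal_mask` and the example lengths are my own, not from the post:

```python
import torch

def packed_causal_mask(seq_lens, dtype=torch.float32):
    """Additive 4D attention mask of shape (1, 1, L, L) for one row that
    packs several documents: causal within each document, fully blocked
    across document boundaries.

    seq_lens -- lengths of the packed documents, e.g. [3, 5] for L = 8.
    """
    total = sum(seq_lens)
    # Which document each token position belongs to.
    doc_ids = torch.cat([torch.full((n,), i, dtype=torch.long)
                         for i, n in enumerate(seq_lens)])
    # A query may attend to a key only if both positions are in the same
    # document and the key is not in the future (causal).
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    causal = torch.tril(torch.ones(total, total, dtype=torch.bool))
    allowed = same_doc & causal
    # Additive mask: 0 where attention is allowed, large negative elsewhere.
    mask = torch.full((total, total), torch.finfo(dtype).min, dtype=dtype)
    mask[allowed] = 0.0
    return mask.unsqueeze(0).unsqueeze(0)  # (1, 1, L, L)

# Example: two documents of length 3 and 5 packed into one sequence of 8.
mask = packed_causal_mask([3, 5])
print(mask.shape)  # torch.Size([1, 1, 8, 8])
```

As far as I understand, the eager and SDPA paths can consume an additive 4D mask like this, while the flash_attention_2 kernels do not take arbitrary 4D masks, so packed sequences there are typically handled via position_ids / variable-length attention instead; happy to be corrected.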
Article: Efficient LLM Pretraining: Packed Sequences and Masked Attention (Oct 7, 2024)