JohannesGaessler
CUDA: optimize FA for GQA + large batches (llama/12014)
6662d54