new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jul 29

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Efficiently serving large language models (LLMs) requires batching many requests together to reduce the cost per request. Yet, the key-value (KV) cache, which stores attention keys and values to avoid re-computations, significantly increases memory demands and becomes the new bottleneck in speed and memory usage. This memory demand increases with larger batch sizes and longer context lengths. Additionally, the inference speed is limited by the size of KV cache, as the GPU's SRAM must load the entire KV cache from the main GPU memory for each token generated, causing the computational core to be idle during this process. A straightforward and effective solution to reduce KV cache size is quantization, which decreases the total bytes taken by KV cache. However, there is a lack of in-depth studies that explore the element distribution of KV cache to understand the hardness and limitation of KV cache quantization. To fill the gap, we conducted a comprehensive study on the element distribution in KV cache of popular LLMs. Our findings indicate that the key cache should be quantized per-channel, i.e., group elements along the channel dimension and quantize them together. In contrast, the value cache should be quantized per-token. From this analysis, we developed a tuning-free 2bit KV cache quantization algorithm, named KIVI. With the hardware-friendly implementation, KIVI can enable Llama (Llama-2), Falcon, and Mistral models to maintain almost the same quality while using 2.6times less peak memory usage (including the model weight). This reduction in memory usage enables up to 4times larger batch size, bringing 2.35times sim 3.47times throughput on real LLM inference workload. The source code is available at https://github.com/jy-yuan/KIVI.

The X-ray Integral Field Unit at the end of the Athena reformulation phase

The Athena mission entered a redefinition phase in July 2022, driven by the imperative to reduce the mission cost at completion for the European Space Agency below an acceptable target, while maintaining the flagship nature of its science return. This notably called for a complete redesign of the X-ray Integral Field Unit (X-IFU) cryogenic architecture towards a simpler active cooling chain. Passive cooling via successive radiative panels at spacecraft level is now used to provide a 50 K thermal environment to an X-IFU owned cryostat. 4.5 K cooling is achieved via a single remote active cryocooler unit, while a multi-stage Adiabatic Demagnetization Refrigerator ensures heat lift down to the 50 mK required by the detectors. Amidst these changes, the core concept of the readout chain remains robust, employing Transition Edge Sensor microcalorimeters and a SQUID-based Time-Division Multiplexing scheme. Noteworthy is the introduction of a slower pixel. This enables an increase in the multiplexing factor (from 34 to 48) without compromising the instrument energy resolution, hence keeping significant system margins to the new 4 eV resolution requirement. This allows reducing the number of channels by more than a factor two, and thus the resource demands on the system, while keeping a 4' field of view (compared to 5' before). In this article, we will give an overview of this new architecture, before detailing its anticipated performances. Finally, we will present the new X-IFU schedule, with its short term focus on demonstration activities towards a mission adoption in early 2027.