Ring-flash-linear-2.0: A Highly Efficient Hybrid Architecture for Test-Time Scaling
Test-Time Scaling has emerged as a crucial trend for raising the capability ceiling of large models. However, the computational and memory overhead of ultra-long-context reasoning grows steeply with sequence length, with attention computation scaling quadratically and KV-cache IO growing with every generated token.
To break this bottleneck, we built the highly efficient Ling 2.0 Linear hybrid architecture on top of the Ling 2.0 architecture, integrating a high-sparsity MoE structure with hybrid linear attention.
Today, we are officially open-sourcing two highly efficient reasoning models: Ring-flash-linear-2.0 and Ring-mini-linear-2.0. We are also simultaneously releasing two self-developed high-performance fused operators: the FP8 Fused Operator and the Linear Attention Inference Fused Operator.
Thanks to the synergy between architectural optimization and high-performance operators, the inference cost of these two models in deep reasoning scenarios is only about 1/10 that of similarly sized dense models, and more than 50% lower than that of the original Ring series. Strong operator alignment between the training and inference engines also enables long-horizon, stable, and highly efficient optimization during the reinforcement learning phase, allowing the models to maintain SOTA performance on multiple high-difficulty complex reasoning benchmarks.
Ring-flash-linear-2.0 Performance on High-Difficulty Reasoning Benchmarks
Ring-mini-linear-2.0 Performance on High-Difficulty Reasoning Benchmarks
The following shows the generation speed comparison between the Ring-mini-linear-2.0 (which uses the hybrid linear architecture) and Ring-mini-2.0 (which uses the standard attention mechanism).
Demo video of the comparison: https://vimeo.com/1125491539
Testing conditions: single H20 GPU, batch size = 256, generation length around 16k tokens. The video shows the streamed output of the last request (the same math problem for both models). Ring-mini-linear-2.0 reduces end-to-end generation time by 40% compared to Ring-mini-2.0.
Ling 2.0 Linear: A More Efficient Hybrid Linear Architecture
Ling 2.0 Linear Architecture Diagram
The Ling 2.0 Linear hybrid linear architecture is specifically designed for two future trends in Large Language Models: Context Length Scaling and Test-Time Scaling.
The Ling 2.0 Linear architecture uses a hybrid structure of linear attention and standard attention, overcoming the weakness of poor recall capability found in purely linear attention mechanisms. By increasing the proportion of linear attention, the model achieves near-linear computational complexity, which significantly reduces the training and inference compute cost in high-concurrency and long-context scenarios.
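For intuition, here is a minimal PyTorch sketch of such a hybrid stack: most layers use an O(T) linear-attention mixer, while every few layers a standard softmax-attention layer preserves recall. The interleaving ratio, feature map, and omitted norms/MLPs are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def causal_linear_attention(q, k, v):
    """Causal linear attention with a positive feature map (Katharopoulos-style)."""
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0                       # phi(x) = elu(x) + 1
    kv = torch.einsum("bhtd,bhte->bhtde", k, v).cumsum(dim=2)   # prefix sums of k v^T
    z = k.cumsum(dim=2)                                         # prefix sums of k
    out = torch.einsum("bhtd,bhtde->bhte", q, kv)
    den = torch.einsum("bhtd,bhtd->bht", q, z).clamp_min(1e-6)
    return out / den.unsqueeze(-1)

class HybridStack(nn.Module):
    """Mostly linear-attention layers, with softmax attention every few layers."""
    def __init__(self, n_layers=8, full_attn_every=4, d=64, h=4):
        super().__init__()
        self.h, self.hd = h, d // h
        self.qkv = nn.ModuleList(nn.Linear(d, 3 * d) for _ in range(n_layers))
        self.is_full = [(i + 1) % full_attn_every == 0 for i in range(n_layers)]

    def forward(self, x):                                  # x: (B, T, d)
        B, T, _ = x.shape
        for qkv, full in zip(self.qkv, self.is_full):
            q, k, v = qkv(x).chunk(3, dim=-1)
            q, k, v = (t.view(B, T, self.h, self.hd).transpose(1, 2) for t in (q, k, v))
            if full:   # quadratic softmax attention on a minority of layers
                o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            else:      # linear attention on the majority of layers
                o = causal_linear_attention(q, k, v)
            x = x + o.transpose(1, 2).reshape(B, T, -1)    # residual; norms/MLP omitted
        return x

print(HybridStack()(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```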
Through systematic experiments, we also introduced several improvements to the linear attention layers, such as:
- Adding Rotary Positional Embedding (RoPE) to the q and k inputs of the linear attention.
- Adopting a grouped and non-shared RMSNorm for the output of the linear attention.
Experiments show that these seemingly subtle changes yield higher training stability and better length extrapolation; a minimal sketch of both changes follows.
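In the sketch below, `apply_rope` uses the interleaved-pair RoPE variant, and `GroupedRMSNorm` keeps one independent (non-shared) gain vector per head; the exact rotary parameterization and norm grouping in the released models may differ.

```python
import torch
import torch.nn as nn

def apply_rope(x, theta=10000.0):
    """Rotary positional embedding on (B, H, T, D) tensors; D must be even."""
    B, H, T, D = x.shape
    freqs = 1.0 / theta ** (torch.arange(0, D, 2, device=x.device) / D)
    ang = torch.arange(T, device=x.device)[:, None] * freqs[None, :]  # (T, D/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                               # interleaved pairs
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)

class GroupedRMSNorm(nn.Module):
    """RMSNorm over each head's output with a separate gain per head (non-shared)."""
    def __init__(self, n_heads, head_dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(n_heads, head_dim))
        self.eps = eps

    def forward(self, x):                                  # x: (B, H, T, D)
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight[None, :, None, :]
```

In use, `apply_rope` would be applied to the q and k inputs before the linear-attention feature map, and `GroupedRMSNorm` to the per-head outputs before the output projection.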
The Ling 2.0 Linear architecture inherits the efficient MoE design of Ling 2.0, achieving more than sevenfold architectural performance leverage through optimizations such as the 1/32 expert activation ratio and the MTP layer. This means Ring-flash-linear-2.0, with only 6.1B active parameters, can rival the performance of dense models under 40B parameters.
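To make the sparsity concrete, here is the back-of-the-envelope arithmetic behind a 1/32 activation ratio. The expert count, top-k, and per-expert size below are hypothetical illustrations, not the released configuration.

```python
# Hypothetical MoE sizing arithmetic; numbers are illustrative, not the released config.
n_experts, top_k = 256, 8                 # routing 8 of 256 experts -> 1/32 activation
activation_ratio = top_k / n_experts
assert activation_ratio == 1 / 32

expert_params = 30e6                      # assumed parameters per expert FFN
moe_total = n_experts * expert_params     # parameters stored in the MoE layers
moe_active = top_k * expert_params        # parameters actually touched per token
print(f"activation ratio: {activation_ratio:.5f}")
print(f"MoE params: {moe_total / 1e9:.1f}B total, {moe_active / 1e9:.2f}B active/token")
```

The leverage claim follows from this gap: per-token compute tracks the active parameters, while model capacity tracks the total.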
High-Performance Fused Operators
In recent years, FP8 mixed-precision training has garnered widespread attention. However, during model training we found that most existing FP8 training solutions focus primarily on saving VRAM and bring no significant improvement in actual computational efficiency. To address this, we developed a more efficient FP8 fused operator through fine-grained operator fusion and adaptive re-quantization, substantially improving the computational efficiency of FP8 mixed-precision training and achieving speedups of 1.57x and 1.77x on the two models, respectively.
Left to right: baseline, mixed precision, acceleration ratio
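The fused kernel itself is not shown here, but the sketch below illustrates the kind of fine-grained quantization such an operator builds on: block-wise FP8 (e4m3) quantization with one scale per block of columns, using PyTorch's `torch.float8_e4m3fn` dtype (PyTorch >= 2.1). The block size and layout are assumptions; the actual operator fuses scale computation, casting, and the GEMM into far fewer kernel launches.

```python
import torch

def quantize_fp8_blockwise(x: torch.Tensor, block: int = 128):
    """Block-wise FP8 (e4m3) quantization: one scale per `block` columns.

    Fine-grained scales limit FP8's dynamic-range loss; a fused kernel would
    compute scales, cast, and run the matmul in one pass instead of launching
    separate quantize / GEMM / dequantize kernels.
    """
    rows, cols = x.shape
    assert cols % block == 0
    xb = x.view(rows, cols // block, block)
    amax = xb.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    scale = amax / torch.finfo(torch.float8_e4m3fn).max      # per-block scale
    q = (xb / scale).to(torch.float8_e4m3fn)
    return q.view(rows, cols), scale.squeeze(-1)

x = torch.randn(4, 256)
q, scale = quantize_fp8_blockwise(x)
x_hat = q.view(4, 2, 128).to(torch.float32) * scale.unsqueeze(-1)  # dequantize
print((x - x_hat.view(4, 256)).abs().max())                        # round-trip error
```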
On the inference side, although linear attention is inherently more computationally efficient, existing implementations are often poorly supported in high-efficiency inference frameworks such as SGLang and vLLM v1, and the decode phase is frequently split across multiple kernels, which further degrades efficiency. We have therefore adapted the released models to frameworks such as SGLang and vLLM v1, and developed a more efficient Linear Attention Fused Operator (see PR: https://github.com/sgl-project/sglang/pull/10917) that supports more inference modes and further boosts engine throughput.
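The decode-side win is easiest to see in the recurrent form of gated linear attention: each new token updates a fixed-size state instead of attending over a growing KV cache, so the decay, outer-product update, and readout can be fused into a single kernel launch. Below is a minimal sketch of one decode step; the gating scheme and state layout are illustrative, not the actual fused operator.

```python
import torch

def linear_attn_decode_step(state, q, k, v, gate):
    """One decode step of gated linear attention.

    state: (H, Dk, Dv) fixed-size recurrent state (replaces the KV cache)
    q, k:  (H, Dk) query / key for the new token;  v: (H, Dv) its value
    gate:  (H, 1, 1) per-head decay in [0, 1]

    Naive implementations split this into separate decay, outer-product, and
    readout kernels; a fused operator performs all three in one launch.
    """
    state = gate * state + k.unsqueeze(-1) * v.unsqueeze(-2)   # S <- g*S + k v^T
    out = torch.einsum("hd,hdv->hv", q, state)                 # o = q^T S
    return out, state

H, Dk, Dv = 4, 16, 16
state = torch.zeros(H, Dk, Dv)
for _ in range(8):           # per-token cost is constant in sequence length
    q, k, v = torch.randn(H, Dk), torch.randn(H, Dk), torch.randn(H, Dv)
    out, state = linear_attn_decode_step(state, q, k, v, torch.rand(H, 1, 1))
print(out.shape)             # torch.Size([4, 16])
```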
Thanks to the Ling 2.0 Linear architecture and the high-performance fused operators described above, Ring-mini-linear-2.0 and Ring-flash-linear-2.0 achieve near-linear time complexity and constant space complexity, maximizing inference efficiency. In both the prefill and decode stages, their advantage over the previous standard-attention (GQA) models grows rapidly as input and output lengths increase, with the cost of ultra-long output dropping to only 1/10 that of dense models.
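A back-of-the-envelope comparison makes the space claim concrete. The sketch below contrasts a pure GQA KV cache, which grows linearly with sequence length, against a pure linear-attention state, which is fixed; a hybrid model sits between the two, since its minority of standard-attention layers still keeps a cache. All dimensions are illustrative.

```python
# Illustrative dimensions only; not the released model configuration.
layers, kv_heads, head_dim, bytes_per = 32, 4, 128, 2   # bf16 storage

def gqa_kv_cache_bytes(seq_len):
    # K and V cached for every layer and token: grows linearly with length.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per

def linear_state_bytes():
    # One (head_dim x head_dim) state per head per layer: constant in length.
    return layers * kv_heads * head_dim * head_dim * bytes_per

for n in (4_096, 65_536, 1_048_576):
    print(f"{n:>9} tokens: KV cache {gqa_kv_cache_bytes(n) / 2**20:8.1f} MiB"
          f" vs fixed state {linear_state_bytes() / 2**20:.1f} MiB")
```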
Ring-flash-linear-2.0 Prefill Throughput (batch size = 1)
Ring-flash-linear-2.0 Decode Throughput (batch size = 64)
Testing conditions: 4 H20 GPUs, TP4, SGLang framework version v0.5.2, MTP disabled, tested using SGLang's built-in sglang.bench_offline_throughput script.
Ring-mini-linear-2.0 Prefill Throughput (batch size = 1)
Ring-mini-linear-2.0 Decode Throughput (batch size = 64)
Testing conditions: 1 H20 GPU, TP1, SGLang framework version v0.5.2, MTP disabled, tested using SGLang's built-in sglang.bench_offline_throughput script.
More Stable RL Training
Unlike pre-training and supervised fine-tuning, reinforcement learning for LLMs relies on both a training engine and an inference (rollout) engine. However, even standard components such as RMSNorm and RoPE often have slightly different implementations across common training frameworks (e.g., Megatron, FSDP) and inference frameworks (e.g., vLLM, SGLang). These differences accumulate and amplify layer by layer, causing the training-side and rollout-side outputs to diverge, and the problem is further exacerbated by the dynamic expert routing in MoE models. This training-inference (TI) discrepancy means the theoretical on-policy assumption does not hold in practice, resulting in unstable RL training and a lower performance ceiling.
While this issue has recently received attention in the research community, most proposed solutions try to mitigate it at the algorithmic level. We have not only improved RL stability algorithmically but have also tackled the TI discrepancy at its root, at the framework level. To ensure TI consistency, we align the training and inference frameworks along three dimensions: identical logic implementation, appropriate numerical precision, and the elimination of non-determinism.
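One concrete way to monitor the discrepancy (the same metric reported in the second figure below) is to compare, token by token, the probability the training engine assigns to each sampled token against the probability the rollout engine recorded. A minimal sketch, assuming per-token log-probs can be extracted from both engines:

```python
import torch

def ti_discrepancy_stats(train_logprobs, rollout_logprobs, threshold=0.8):
    """Fraction of sampled tokens whose train/inference probabilities diverge.

    train_logprobs, rollout_logprobs: (N,) log-probs that the training and
    rollout engines assign to the same sampled tokens. With perfect TI
    alignment the difference is exactly zero for every token.
    """
    diff = (train_logprobs.exp() - rollout_logprobs.exp()).abs()
    return {
        "mean_abs_prob_diff": diff.mean().item(),
        f"frac_abs_diff_gt_{threshold}": (diff > threshold).float().mean().item(),
    }
```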
RL Long-Run Curve After Fixing TI Discrepancy Issues in Different Modules
As shown above, aligning each component leads to a more stable final training curve. Furthermore, we found that once TI alignment is achieved, no extra algorithmic modification is needed; the optimization can directly use rollout probabilities. The figure below demonstrates that this approach effectively enhances both RL training efficiency and stability.
Comparison of PPO clip using rollout probs vs. training probs after TI alignment. Top: reward. Bottom: percentage of tokens where the absolute train-inference probability difference exceeds 0.8.
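To make "directly using rollout probabilities" concrete: once TI alignment holds, the rollout engine's log-probs can serve as the old-policy term in the PPO importance ratio, with no recomputation by the training engine and no algorithmic correction. A minimal sketch of the clipped objective under that assumption (hyperparameters illustrative):

```python
import torch

def ppo_clip_loss(train_logprobs, rollout_logprobs, advantages, clip_eps=0.2):
    """PPO clipped surrogate where pi_old comes straight from the rollout engine.

    train_logprobs:   (N,) current-policy log-probs from the training engine
    rollout_logprobs: (N,) behavior-policy log-probs recorded at rollout time
    advantages:       (N,) per-token advantage estimates
    """
    ratio = (train_logprobs - rollout_logprobs).exp()
    clipped = ratio.clamp(1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```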
Use Case Demos
Sudoku Game
Write a web application for a Sudoku game.
Demo Video https://vimeo.com/1125491509
Tank Battle
Use Python to create a simplified tank battle game. The user controls a tank's free movement with the up, down, left, and right arrow keys and fires bullets with the spacebar to defeat enemy tanks in the game scene. The scene contains five freely moving enemy tanks, each firing bullets in the direction it is currently moving. Each defeated enemy tank awards one point and causes a new enemy tank to spawn at a random location. The game ends when the user's tank is hit by an enemy tank.
Demo Video https://vimeo.com/1125491475
Stock Trading Software
Please generate a page for a simulated stock trading software. The data can be randomly generated, and the page should include five sections:
1. Intraday second-level data, which needs to update every second and be displayed as a line chart.
2. Daily K-line chart, which can display 60 days of OHLC data using a candlestick chart (red for up, green for down).
3. Real-time trading volume, also updating every second, displayed as a number.
4. Daily trading volume data, displayed as a bar chart.
5. Company introduction, which can be randomly generated.
Key requirements:
1. Please use Canvas to draw all curves and candlestick charts, ensuring image clarity and preparation for high-resolution devices.
2. The Canvas window should be able to automatically adjust its size based on the browser window size.
3. Use native JavaScript and HTML5 attributes without external libraries.
4. Please ensure all randomly generated price data is usable.
Demo Video https://vimeo.com/1125491443
Where to Find Us
We welcome everyone to visit our open-source repositories to download and use the models, and to share feedback through discussions, issues, or PRs!
Ring-flash-linear-2.0
🤗 Hugging Face https://huggingface.co/inclusionAI/Ring-flash-linear-2.0
🤖 ModelScope https://modelscope.cn/models/inclusionAI/Ring-flash-linear-2.0
Ring-mini-linear-2.0
🤗 Hugging Face https://huggingface.co/inclusionAI/Ring-mini-linear-2.0
🤖 ModelScope https://modelscope.cn/models/inclusionAI/Ring-mini-linear-2.0/
GitHub (Hybrid Linear Code)
https://github.com/inclusionAI/Ring-V2/tree/main/hybrid_linear
SGLang PR (Linear Attention Fused Operator)
https://github.com/sgl-project/sglang/pull/10917