Zhiyuan Tan
BHbean
AI & ML interests
None yet
Organizations
None yet
LLM Training Systems
MoE LLM Systems
New LLM Algorithms
Prompt Engineering
KV Cache Compression
Papers regarding KV cache compression
- Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
  Paper • 2504.06261 • Published • 110
- RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
  Paper • 2505.02922 • Published • 28
- InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
  Paper • 2506.15745 • Published • 13
- Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction
  Paper • 2508.02558 • Published • 10
Speculative Decoding
- FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
  Paper • 2502.14856 • Published • 8
- From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens
  Paper • 2502.18890 • Published • 30
- DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting
  Paper • 2503.00784 • Published • 13
- Scaling Speculative Decoding with Lookahead Reasoning
  Paper • 2506.19830 • Published • 12
OS for LLM
Survey
LLM resource-constrained Inference
LLM Internal Mechanism
parallelism
LLM reasoning systems
- Efficiently Serving LLM Reasoning Programs with Certaindex
  Paper • 2412.20993 • Published • 37
- Efficient Inference for Large Reasoning Models: A Survey
  Paper • 2503.23077 • Published • 46
- Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence
  Paper • 2503.20533 • Published • 12
LoRA