Zhiyuan Tan
BHbean
AI & ML interests
None yet
Organizations
None yet
LLM Training Systems
MoE LLM Systems
New LLM Algorithms
Prompt Engineering
KV Cache Compression
Papers regarding KV cache compression
- Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
  Paper • 2504.06261 • Published • 110
- RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
  Paper • 2505.02922 • Published • 28
- InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
  Paper • 2506.15745 • Published • 13
- Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction
  Paper • 2508.02558 • Published • 10
Speculative Decoding
- FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
  Paper • 2502.14856 • Published • 8
- From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens
  Paper • 2502.18890 • Published • 30
- DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting
  Paper • 2503.00784 • Published • 13
- Scaling Speculative Decoding with Lookahead Reasoning
  Paper • 2506.19830 • Published • 12
OS for LLM
Survey
LLM resource-constrained Inference
LLM Internal Mechanism
parallelism
LLM reasoning systems
- Efficiently Serving LLM Reasoning Programs with Certaindex
  Paper • 2412.20993 • Published • 37
- Efficient Inference for Large Reasoning Models: A Survey
  Paper • 2503.23077 • Published • 46
- Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence
  Paper • 2503.20533 • Published • 12
LoRA