stereoplegic's Collections: Inference
• S^{3}: Increasing GPU Utilization during Generative Inference for Higher Throughput (arXiv:2306.06000)
• Fast Distributed Inference Serving for Large Language Models (arXiv:2305.05920)
• Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline (arXiv:2305.13144)
• Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference (arXiv:2303.06182)
• Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers (arXiv:2305.15805)
• FlashDecoding++: Faster Large Language Model Inference on GPUs (arXiv:2311.01282)
• S-LoRA: Serving Thousands of Concurrent LoRA Adapters (arXiv:2311.03285)
• Fast Inference from Transformers via Speculative Decoding (arXiv:2211.17192)
• Prompt Cache: Modular Attention Reuse for Low-Latency Inference (arXiv:2311.04934)
• RecycleGPT: An Autoregressive Language Model with Recyclable Module (arXiv:2308.03421)
• Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens (arXiv:2305.04241)
• Latency Adjustable Transformer Encoder for Language Understanding (arXiv:2201.03327)
• Punica: Multi-Tenant LoRA Serving (arXiv:2310.18547)
• Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity (arXiv:2309.10285)
• Distributed Inference and Fine-tuning of Large Language Models Over The Internet (arXiv:2312.08361)
• SparQ Attention: Bandwidth-Efficient LLM Inference (arXiv:2312.04985)
• Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws (arXiv:2401.00448)
• Fast Inference of Mixture-of-Experts Language Models with Offloading (arXiv:2312.17238)
• Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference (arXiv:2401.08383)
• Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models (arXiv:2402.07033)
• IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (arXiv:2405.02842)