VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models • arXiv:2409.17066 • Published Sep 25, 2024
SpaceEvo: Hardware-Friendly Search Space Design for Efficient INT8 Inference • arXiv:2303.08308 • Published Mar 15, 2023
ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices • arXiv:2303.09730 • Published Mar 17, 2023
Accelerating In-Browser Deep Learning Inference on Diverse Edge Clients through Just-in-Time Kernel Optimizations • arXiv:2309.08978 • Published Sep 16, 2023
Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference • arXiv:2306.14393 • Published Jun 26, 2023
Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference • arXiv:2308.12066 • Published Aug 23, 2023
BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation • arXiv:2402.10631 • Published Feb 16, 2024
T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge • arXiv:2407.00088 • Published Jun 25, 2024