Hogwild! Inference: Parallel LLM Generation via Concurrent Attention • arXiv:2504.06261 • Published Apr 8, 2025
Sparse Finetuning for Inference Acceleration of Large Language Models • arXiv:2310.06927 • Published Oct 10, 2023
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression • arXiv:2306.03078 • Published Jun 5, 2023
Extreme Compression of Large Language Models via Additive Quantization • arXiv:2401.06118 • Published Jan 11, 2024
Accurate Neural Network Pruning Requires Rethinking Sparse Optimization • arXiv:2308.02060 • Published Aug 3, 2023
PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression • arXiv:2405.14852 • Published May 23, 2024
Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis • arXiv:2412.01819 • Published Dec 2, 2024
EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search • arXiv:2410.14649 • Published Oct 18, 2024
Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization • arXiv:2409.00492 • Published Aug 31, 2024