CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
Abstract
CUDA-L1, an automated reinforcement learning framework, significantly improves CUDA optimization across various GPU architectures, achieving substantial speedups without human expertise.
The exponential growth in demand for GPU computing resources, driven by the rapid advancement of Large Language Models, has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models (e.g. R1, o1) achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization. CUDA-L1 achieves substantial performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of x17.7 across all 250 CUDA kernels of KernelBench, with peak speedups reaching x449. Furthermore, the model also demonstrates excellent portability across GPU architectures, achieving average speedups of x17.8 on H100, x19.0 on RTX 3090, x16.5 on L40, x14.7 on H800, and x13.9 on H20 despite being optimized specifically for A100. Beyond these benchmark results, CUDA-L1 demonstrates several remarkable properties: 1) it discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) it uncovers fundamental principles of CUDA optimization; 3) it identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that harm performance. These capabilities demonstrate that reinforcement learning can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. More importantly, the trained RL model extends its acquired reasoning abilities to new kernels. This paradigm opens possibilities for automated optimization of CUDA operations and holds promise for substantially improving GPU efficiency and alleviating the rising pressure on GPU computing resources.
Community
We would like to introduce CUDA-L1, an LLM trained with contrastive reinforcement learning to generate optimized CUDA kernel code.
📈 Performance:
- CUDA-L1 delivers an average speedup of 3.12× (median 1.42×) across all 250 CUDA kernels in KernelBench.
- CUDA-L1 generalizes across GPU architectures, delivering, for example, a 3.12× speedup on L40, 2.39× on H100, and 2.37× on H20.
🌟CUDA-L1 Highlights:
- It discovers a variety of CUDA optimization techniques and learns to combine them strategically.
- It uncovers fundamental principles of CUDA optimization, such as the multiplicative nature of optimizations (see the toy illustration after this list).
- It identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that actually harm performance.
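As a loose illustration of that multiplicative principle (a toy example; the technique names and speedup factors below are hypothetical, not numbers from the paper): when independent optimizations each cut runtime by a constant factor, their effects compound as a product rather than a sum.

```python
# Toy illustration only: per-technique speedups are hypothetical, not from the paper.
speedups = {"memory coalescing": 1.8, "shared-memory tiling": 2.5, "stream overlap": 1.3}

combined = 1.0
for technique, factor in speedups.items():
    combined *= factor  # independent constant-factor wins compound multiplicatively

print(f"combined speedup ~ {combined:.2f}x")  # 1.8 * 2.5 * 1.3 = 5.85x
```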
Thanks for your time and your contribution to the community.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs (2025)
- Kevin: Multi-Turn RL for Generating CUDA Kernels (2025)
- ChipSeek-R1: Generating Human-Surpassing RTL with LLM via Hierarchical Reward-Driven Reinforcement Learning (2025)
- RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models (2025)
- A Technical Survey of Reinforcement Learning Techniques for Large Language Models (2025)
- GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning (2025)
- MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation (2025)
Hi @xxiaoyali ! I was interested in this paper when I heard about it a few days/weeks ago (it was trending on X). It is great work attempting to use RL for kernel generation; however, many results leave us enthusiasts skeptical.
A group of individuals, including me, independently tested and found severe benchmarking errors and poor baselines for all the problems we took a look at. The main complaints we have against the results presented in the paper can be summarized as:
- Poor baselines
- Measuring timings without proper synchronization when using non-default streams
- Device caches not cleared between each repeated run of the benchmark
- Cached results from warmup passing the correctness tests, while the actual kernel outputs after warmup are incorrect (due to not handling synchronization)
- Usage of CUDA graphs to showcase speedup. Applying a feature of the existing system to only one side skews the benchmarks in favor of your paper. The proper comparison, at least in my opinion, would be PyTorch eager + CUDA graph vs. your optimized kernel + CUDA graph, or not applying it in either case.
- Usage of custom torch cuda/cudnn backend flags to showcase speedup. Again, features available in the PyTorch framework should be used in both the baseline and your optimized kernel, or in neither; otherwise the baseline presented is artificially weak.
- Speedups in the roughly 1×–3× range can usually be attributed to configurable flags and features already available in the PyTorch framework. Some problem solutions do not seem to use any novel ideas, yet still show a speedup for this reason. Even if the generated kernels are better for specific problem shapes, they fall behind expert-optimized kernels in the general case.
- Some problems have no custom generated CUDA code at all and just make calls to PyTorch APIs
- (There were a few other problems reported but I fail to recall them at this time)
- Additionally, comparing the generated kernels with the Triton programs generated by Inductor (`torch.compile`) would be good practice to assess the true innovations discovered by applying RL for kernel generation; a minimal sketch of such a comparison is included below.
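For concreteness, here is a minimal sketch (my own, not code from the paper or its repository) of the kind of timing harness the points above argue for: CUDA-event timing with explicit synchronization, fresh inputs after warmup, and the same framework-level features (backend flags, `torch.compile` with CUDA graphs) applied to both sides. `baseline` and `candidate` below are placeholder stand-ins for a KernelBench reference model and its optimized counterpart.

```python
# Sketch of a fairer timing harness; `baseline`/`candidate` are placeholders.
import torch

def bench_ms(fn, make_inputs, warmup=10, iters=100):
    """Mean latency in ms using CUDA events, with fresh inputs per iteration so
    results cached during warmup cannot mask incorrect post-warmup outputs."""
    for _ in range(warmup):
        fn(*make_inputs())
    torch.cuda.synchronize()                      # drain all streams before timing
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    total = 0.0
    for _ in range(iters):
        args = make_inputs()                      # fresh random inputs each run
        torch.cuda.synchronize()
        start.record()
        fn(*args)
        end.record()
        end.synchronize()                         # wait for the kernel, not just the launch
        total += start.elapsed_time(end)
    return total / iters

if __name__ == "__main__":
    make_inputs = lambda: (torch.randn(4096, 4096, device="cuda"),)
    baseline = torch.nn.Linear(4096, 4096).cuda()     # stand-in for the KernelBench reference
    candidate = torch.nn.Linear(4096, 4096).cuda()    # stand-in for the RL-optimized module

    # Framework-level features must be applied to BOTH sides (or to neither):
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.benchmark = True
    baseline_cg = torch.compile(baseline, mode="reduce-overhead")    # Inductor + CUDA graphs
    candidate_cg = torch.compile(candidate, mode="reduce-overhead")

    b, c = bench_ms(baseline_cg, make_inputs), bench_ms(candidate_cg, make_inputs)
    print(f"baseline {b:.3f} ms | candidate {c:.3f} ms | speedup {b / c:.2f}x")
```

This sketch does not flush the GPU's L2 cache between iterations; to also address the cache-clearing point above, a common approach is to overwrite a scratch buffer larger than the L2 cache before each timed run.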
Personally, I only tested 5 randomly chosen problems and found little to no speedup (<1.5×) after fixing the benchmarking issues and removing the CUDA graph usage, PyTorch backend flags, and other framework-provided optimizations. This is not to say that the work is anything short of amazing, but the methodology definitely raises questions.
That said, I want to give the project another try to understand the significant optimizations that were discovered and applied by RL. Is it safe to assume that the latest version of the github code and paper have been corrected with proper benchmarking methodology?
Congratulations on your work and thank you for open-sourcing it! I'm looking forward to more amazing work in this area of research by your team.