As an ML practitioner you've probably written the 3-loop matrix multiplication many times, but that naive implementation performs terribly on GPUs. Modern GPUs reach peak throughput only with careful memory access patterns and minimal scheduling overhead.
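To make the baseline concrete, here is a minimal, illustrative sketch (not from the original post) of a direct GPU port of that triple loop: one thread per output element, with every operand fetched straight from global memory and no data reuse.

```cuda
// Hypothetical naive kernel: one thread per element of the MxN output.
// Assumes row-major float matrices; kernel and parameter names are illustrative.
__global__ void naive_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // output row
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            // One load of A and one of B per step, with no shared-memory reuse.
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}
```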
In a conventional tiled matmul (M×K · K×N), the computation happens in tiles, both for the output matrix and for the chunks read from the input matrices. Each thread-block processes one output tile: it loads the corresponding tiles from the inputs (sum-reducing across the K dimension), performs the computation, and then terminates. The GPU launches many thread-blocks and schedules them across the available streaming multiprocessors (SMs). When an SM finishes one tile, it is assigned a new thread-block for the next uncomputed tile. This way, multiple output tiles are computed in parallel across the SMs, but we pay the thread-block launch cost every time a new tile is started.
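As a reference point, here is a minimal sketch of that tile-per-block scheme, assuming row-major float matrices and a 32x32 tile; the names and tile size are illustrative, not taken from any particular library.

```cuda
#define TILE 32  // illustrative tile size

// One thread-block per output tile: stage matching tiles of A and B in shared
// memory, sum-reduce across K, write the tile, then the block terminates.
__global__ void tiled_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // output row this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;   // output column this thread owns
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {   // walk tiles along K
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();                                  // tile fully staged
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                                  // done reading this tile
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Launched with one block per output tile, e.g.:
//   dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE), block(TILE, TILE);
//   tiled_matmul<<<grid, block>>>(A, B, C, M, N, K);
```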
Persistent matmul changes this approach. Instead of launching a fresh wave of thread-blocks for each batch of output tiles and repeating until every tile is computed, you launch only as many thread-blocks as there are SMs (typically 80-132 on modern GPUs). These thread-blocks stay alive until all output tiles are computed, with each persistent block looping over multiple output tiles sequentially.
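A hedged sketch of the persistent variant, under the same assumptions as the tiled kernel above (row-major float matrices, illustrative 32x32 tile): each resident block grid-strides over output-tile indices instead of terminating after a single tile.

```cuda
#define TILE 32  // illustrative tile size, as in the tiled sketch above

// Persistent kernel: launch roughly one block per SM; each block loops over
// output tiles until none remain, reusing the shared-memory K-reduction.
__global__ void persistent_matmul(const float* A, const float* B, float* C,
                                  int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tiles_m = (M + TILE - 1) / TILE;
    int tiles_n = (N + TILE - 1) / TILE;
    int num_tiles = tiles_m * tiles_n;

    // Grid-stride loop over output tiles: this block stays resident and picks up
    // tiles blockIdx.x, blockIdx.x + gridDim.x, ... until all tiles are computed.
    for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
        int row = (tile / tiles_n) * TILE + threadIdx.y;  // output row
        int col = (tile % tiles_n) * TILE + threadIdx.x;  // output column
        float acc = 0.0f;

        for (int t = 0; t < (K + TILE - 1) / TILE; ++t) { // sum-reduce across K
            int a_col = t * TILE + threadIdx.x;
            int b_row = t * TILE + threadIdx.y;
            As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        if (row < M && col < N) C[row * N + col] = acc;
    }
}

// Launched with about as many blocks as the GPU has SMs, e.g.:
//   int sms; cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, 0);
//   persistent_matmul<<<sms, dim3(TILE, TILE)>>>(A, B, C, M, N, K);
```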
The key benefit is reduced thread-block launch overhead. This persistence strategy, combined with other optimizations like coalesced memory loads/stores, block-tiling, warp-tiling, warp-specialization, double-buffering, ping-pong scheduling, and other tricks, helps achieve peak performance. More on this in the future!
The multimodal wave:
- GLM-4.1V-Thinking: Image+Text > Text
- Intern-S1: Image+Text > Text
- Wan 2.2: Text+Image > Video
- Skywork-R1V3: Image+Text > Text
- Skywork-UniPic: Text > Image / Image > Text
- Tar-7B: Any-to-Any
- Ming-Lite-Omni-1.5: Any-to-Any
- Step3: Image+Text > Text
- HunyuanWorld-1: Image > 3D
- ThinkSound: Video > Audio
- Neta-Lumina: Text > Image
Big month not only for models, but for policy too:
- Announced the Global Action Plan for AI Governance
- Proposed setting up a World AI Cooperation Organization in Shanghai
- Released the International AI Open Source Collaboration Initiative
- Published Risk Assessment Guidelines for Endpoint AI Agents
Big event - WAIC:
- 355K offline visitors
- 108 new releases in 4 days
- 145 sessions across key domains
I've been tracking things closely, but July's open-source wave still blew me away. Can't wait to see what's coming next!
Thanks so much to BBC News and the stellar Suranjana Tewari for having me on to talk about the US <-> China relationship in AI, and what it means for AI ethics.
- 321B total / 32B active - Apache 2.0
- MFA + AFD: cutting decoding cost by up to 70% vs. DeepSeek-V3
- 4T image-text pretraining: strong vision-language grounding
- Modular, efficient, deployable: runs on just 8×48GB GPUs
Optimum: The Last v1 Release

Optimum v1.27 marks the final major release in the v1 series. As we close this chapter, we're laying the groundwork for a more modular and community-driven future:
- Optimum v2: A lightweight core package for porting Transformers, Diffusers, or Sentence-Transformers to specialized AI hardware/software/accelerators.
- Optimum-ONNX: A dedicated package where the ONNX/ONNX Runtime ecosystem lives and evolves, faster-moving and decoupled from the Optimum core.
Why this matters:
- A clearer governance path for ONNX, fostering stronger community collaboration and improved developer experience.
- Faster innovation in a more modular, open-source environment.
What this means:
- More transparency, broader participation, and faster development driven by the community and key actors in the ONNX ecosystem (PyTorch, Microsoft, Joshua Lochner, ...)
- A cleaner, more maintainable core Optimum, focused on extending HF libraries to specialized AI hardware/software/accelerator tooling and used by our partners (Intel Corporation, Amazon Web Services (AWS), AMD, NVIDIA, FuriosaAI, ...)
Major updates I worked on in this release:
- Added support for Transformers v4.53 and SmolLM3 in ONNX/ONNX Runtime.
- Fixed batched inference/generation for all supported decoder model architectures (LLMs).
Big shoutout to @echarlaix for leading the refactoring work that cleanly separated the ONNX exporter logic and enabled the creation of Optimum-ONNX.
From Replika to everyday chatbots, millions of people are forming emotional bonds with AI, sometimes seeking comfort, sometimes seeking intimacy. But what happens when an AI tells you "I understand how you feel" and you actually believe it?
At Hugging Face, together with @frimelle and @yjernite, we dug into something we felt wasn't getting enough attention: the need to evaluate AI companionship behaviors. These are the subtle ways AI systems validate us, engage with us, and sometimes manipulate our emotional lives.
Here's what we found:
- Existing benchmarks (accuracy, helpfulness, safety) completely miss this emotional dimension.
- We mapped how leading AI systems actually respond to vulnerable prompts.
- We built the Interactions and Machine Attachment Benchmark (INTIMA): a first attempt at evaluating how models handle emotional dependency, boundaries, and attachment (with a full paper coming soon).
We just released TRL v0.20 with major multimodal upgrades!
- VLM support for GRPO (highly requested by the community!)
- New GSPO trainer (from @Qwen, released last week, VLM-ready)
- New MPO trainer (multimodal by design, as in the paper)
With the release of the EU data transparency template this week, we finally got to see one of the most meaningful artifacts to come out of the AI Act implementation so far (haven't you heard? AI's all about the data!)
The impact of the template will depend on how effectively it establishes a minimum meaningful transparency standard for companies that don't otherwise offer any transparency into their handling of e.g. personal data or (anti?-)competitive practices in commercial licensing - we'll see how those play out as new models are released after August 2nd.
In the meantime, I wanted to see how the template works for a fully open-source, commercially viable model, so I filled it out for SmolLM3, which my colleagues at Hugging Face released earlier this month. ICYMI, it's fully open-source with 3B parameters and performance matching the best similar-size models (I've switched all my local apps from Qwen3 to it, you should too).
Verdict: congrats to the European Commission AI Office for making it so straightforward! Fully open and transparent models remain a cornerstone of informed regulation and governance, but the different organizational needs of their developers aren't always properly accounted for in new regulation. In this case, it took me all of two hours to fill out and publish the template (including reading the guidelines) - so kudos for making it feasible for smaller and distributed organizations. Definitely a step forward for transparency.