Thank you for the great work! Can I suggest a few things? 🤗
- IMHO, the plots (which look great, btw) would be less confusing if your original dataset were shuffled, e.g. the figure in "Greedy packing" doesn't look like what you'd get in practice.
- On top of balancing the number of images in Stage 5, balancing the number of examples would also help training. For example, looking at your figure in Stage 4, there is a large difference in the number of examples per sequence (what you refer to as a batch) between sequence 1 and the last sequence. NVIDIA's EAGLE 2 paper (Li et al., 2025, https://arxiv.org/abs/2501.14818) shows that using a balanced version of knapsack helps training (see Figures 9 and 10). Just thought it'd be nice to share this technique with the community!
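
To make the second point a bit more concrete, here is a rough sketch of the kind of thing I mean. To be clear, this is not the algorithm from the EAGLE 2 paper, just a simple greedy heuristic I'd try first: place the longest samples first, and always put the next sample into the sequence that currently holds the fewest examples (ties go to the one with the fewest tokens). The names `balanced_pack`, `num_bins` and `capacity` are made up for illustration.

```python
def balanced_pack(lengths, num_bins, capacity):
    """Greedy sketch: pack samples into `num_bins` sequences of at most
    `capacity` tokens while keeping the number of examples per sequence even.

    Returns a list of bins, each a list of sample indices.
    Illustrative heuristic only, not the EAGLE 2 implementation.
    """
    bins = [[] for _ in range(num_bins)]
    used = [0] * num_bins  # tokens already placed in each bin

    # Longest samples first, so the hard-to-place ones are handled early.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

    for i in order:
        # Bins that still have room for this sample.
        candidates = [b for b in range(num_bins) if used[b] + lengths[i] <= capacity]
        if not candidates:
            raise ValueError(f"sample {i} (length {lengths[i]}) does not fit in any bin")
        # Prefer the bin with the fewest examples, then the fewest tokens.
        target = min(candidates, key=lambda b: (len(bins[b]), used[b]))
        bins[target].append(i)
        used[target] += lengths[i]

    return bins


lengths = [8, 7, 2, 2, 1, 1, 1, 1]
bins = balanced_pack(lengths, num_bins=2, capacity=12)
counts = [len(b) for b in bins]
totals = [sum(lengths[i] for i in b) for b in bins]
print(counts, totals)  # → [4, 4] [12, 11]
```

A plain "fill the first sequence, then the next" pass over the same sorted lengths would put the two short-but-many groups together and leave the example counts lopsided, which is the effect visible in the Stage 4 figure.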

But again, really nice work on this blog post and on Picotron, guys!