Tony Wu

tonywu71

AI & ML interests

LLM, Multimodal, Agents, Information Retrieval, RAG, Speech

Recent Activity

upvoted 2 articles 17 days ago

Article: You could have designed state of the art positional encoding • Nov 25, 2024 • 331

Article: Merge Large Language Models with mergekit • Jan 9, 2024 • 130

commented on Efficient MultiModal Data Pipeline 20 days ago

My hands are full at the moment, so I'll have to pass, sorry @ariG23498!
But I'll be more than happy to further discuss VLM-related research and training tricks on X (I think we already follow each other anyway 😉).

commented on Efficient MultiModal Data Pipeline 22 days ago

Thank you for the great work! Can I suggest a few things? 🤗

Imho, the plots (which look great, btw) would be less confusing if your original dataset was shuffled, e.g. the figure in "Greedy packing" doesn't look like what you'd get in practice.
On top of balancing the number of images in Stage 5, balancing the number of examples would also help training. For example, looking at your figure in Stage 4, there is a large difference in the number of examples per sequence (what you refer to as a batch) between sequence 1 and the last sequence. NVIDIA's EAGLE 2 paper (Li et al., 2025, https://arxiv.org/abs/2501.14818) shows that using a balanced version of knapsack helps training! (see figures 9 and 10). Just thought it'd be nice to share this technique with the community!
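
To make the second suggestion concrete, here is a rough sketch of the imbalance and one simple way to reduce it. It is not the article's pipeline and not the exact algorithm from the EAGLE 2 paper. `pack_greedy` and `pack_balanced` are hypothetical helpers, the token lengths are made up, and every sample is assumed to fit within `max_len` on its own. The greedy version is plain first-fit decreasing; the balanced version keeps the same number of sequences but sends each sample to the fitting sequence that currently holds the fewest examples.

```python
from typing import List


def pack_greedy(lengths: List[int], max_len: int) -> List[List[int]]:
    """First-fit decreasing: put each sample in the first sequence with room."""
    bins: List[List[int]] = []
    used: List[int] = []
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for b in range(len(bins)):
            if used[b] + lengths[idx] <= max_len:
                bins[b].append(idx)
                used[b] += lengths[idx]
                break
        else:
            # No existing sequence has room: open a new one.
            bins.append([idx])
            used.append(lengths[idx])
    return bins


def pack_balanced(lengths: List[int], max_len: int) -> List[List[int]]:
    """Illustrative heuristic: among sequences with room, pick the one
    holding the fewest examples (ties broken by fewest tokens used)."""
    n_bins = len(pack_greedy(lengths, max_len))  # reuse the greedy sequence count
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    used = [0] * n_bins
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        fits = [b for b in range(len(bins)) if used[b] + lengths[idx] <= max_len]
        if not fits:
            # Fall back to opening an extra sequence if nothing has room.
            bins.append([])
            used.append(0)
            fits = [len(bins) - 1]
        target = min(fits, key=lambda b: (len(bins[b]), used[b]))
        bins[target].append(idx)
        used[target] += lengths[idx]
    return bins


# Made-up token lengths: one long sample, two medium, four short.
lengths = [8, 2, 2, 2, 2, 4, 4]
max_len = 12
greedy = pack_greedy(lengths, max_len)
balanced = pack_balanced(lengths, max_len)
print("greedy examples per sequence:  ", [len(b) for b in greedy])
print("balanced examples per sequence:", [len(b) for b in balanced])
```

On this toy input both variants fill the same number of sequences, but the balanced one spreads the examples more evenly across them, which is the property the comment above is pointing at.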