view article Article NVIDIA Releases 3 Million Sample Dataset for OCR, Visual Question Answering, and Captioning Tasks By nvidia and 4 others • Aug 11 • 73
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion Paper • 2503.11576 • Published Mar 14 • 116
view article Article SmolVLM2: Bringing Video Understanding to Every Device By orrzohar and 6 others • Feb 20 • 300
view article Article Open-source DeepResearch – Freeing our search agents By m-ric and 4 others • Feb 4 • 1.3k
view article Article Timm ❤️ Transformers: Use any timm model with transformers By ariG23498 and 4 others • Jan 16 • 51
MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild Paper • 2411.11098 • Published Nov 17, 2024 • 1
MobileNetV4 pretrained weights Collection Weights for MobileNet-V4 pretrained in timm • 17 items • Updated Aug 1 • 20
Transformer Explainer: Interactive Learning of Text-Generative Models Paper • 2408.04619 • Published Aug 8, 2024 • 170
🍃 MINT-1T Collection Data for "MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens" • 13 items • Updated Jul 24, 2024 • 62
Searching for Better ViT Baselines Collection Exploring ViT hparams and model shapes for the GPU poor (between tiny and base). • 28 items • Updated Aug 1 • 18
view article Article Introducing Idefics2: A Powerful 8B Vision-Language Model for the community By Leyo and 2 others • Apr 15, 2024 • 187
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer Paper • 2403.10301 • Published Mar 15, 2024 • 54