Bagel‑Zebra‑CoT
A vision–language model fine‑tuned on the Zebra‑CoT dataset to generate high-quality interleaved visual chain‑of‑thought reasoning.
Model Description
Bagel‑Zebra‑CoT is fine-tuned from Bagel‑7B on the Zebra‑CoT dataset. The model is trained to generate interleaved text and image traces natively as part of its own reasoning process.
Usage
For interleaved text and image inference and training with our model, please refer to our GitHub repository.
For general information and other details, please refer to the official Bagel GitHub repository.
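Alongside the repository, the checkpoint itself can be fetched from the Hugging Face Hub. A minimal sketch, assuming the `huggingface_hub` package is installed; the local directory name is an arbitrary choice, not prescribed by the repository:

```python
from huggingface_hub import snapshot_download

REPO_ID = "multimodal-reasoning-lab/Bagel-Zebra-CoT"

def fetch_checkpoint(local_dir: str = "./Bagel-Zebra-CoT") -> str:
    """Download the full model snapshot from the Hub and return its local path."""
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)

if __name__ == "__main__":
    # Triggers a (large) download on first run; subsequent runs reuse the cache.
    print(fetch_checkpoint())
```

The downloaded directory can then be pointed at by the inference scripts in the GitHub repository.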
Dataset
- Zebra‑CoT: 182,384 interleaved text‑image reasoning samples across 18 sub‑tasks in 4 categories (2D visual, 3D visual, scientific reasoning, visual logic & strategic games).
License
Bagel‑Zebra‑CoT is licensed under the Apache 2.0 license. It is finetuned from ByteDance-Seed/BAGEL-7B-MoT, which was finetuned from Qwen2.5-7B-Instruct and siglip-so400m-14-384-flash-attn2 model, and uses the FLUX.1-schnell VAE model, all under Apache 2.0.
Citation
If you use this model, please cite:
@misc{li2025zebracot,
title={Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning},
author={Ang Li and Charles Wang and Kaiyu Yue and Zikui Cai and Ollie Liu and Deqing Fu and Peng Guo and Wang Bill Zhu and Vatsal Sharan and Robin Jia and Willie Neiswanger and Furong Huang and Tom Goldstein and Micah Goldblum},
year={2025},
eprint={2507.16746},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.16746},
}
Links
- Project Page: https://multimodal-reasoning-lab.github.io/Zebra-CoT/
- Model on Hugging Face: https://huggingface.co/multimodal-reasoning-lab/Bagel-Zebra-CoT
- Dataset on Hugging Face: https://huggingface.co/datasets/multimodal-reasoning-lab/Zebra-CoT
- Code on GitHub: https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT