This model has been pushed to the Hub using the PyTorchModelHubMixin integration:

  • Library: llip-vitb-14-224
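
For context, the PyTorchModelHubMixin integration means the checkpoint was saved with huggingface_hub's mixin rather than as a native transformers architecture. The snippet below only sketches the general mixin pattern; LlipModel is a placeholder class name, and loading the real checkpoint requires the actual model class from the Llip codebase.

# General PyTorchModelHubMixin pattern (placeholder class, NOT the real Llip model):
# subclassing the mixin gives the class from_pretrained() / push_to_hub() for free.
from huggingface_hub import PyTorchModelHubMixin
import torch.nn as nn

class LlipModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        return self.proj(x)

# Loading would then look like this (the class definition must match the checkpoint):
# model = LlipModel.from_pretrained("lavoies/llip-vitb-14-224")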

Modeling Caption Diversity in Contrastive Vision-Language Pretraining

Samuel Lavoie, Polina Kirichenko*, Mark Ibrahim*, Mido Assran, Andrew Gordon Wilson, Aaron Courville, Nicolas Ballas

* Equal contribution

[paper][bibtex]

PyTorch implementation and pre-trained models for Llip. Llip produces strong image-text retrieval models as well as image and text encoders. The models are pre-trained on a dataset of 2.5B image-caption pairs and can contextualize their visual features on target captions.
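
The contextualization step is the core idea: instead of a single image embedding, the visual encoder emits several mixture tokens that are pooled with weights derived from the caption. The sketch below is a minimal illustration of that idea, not the released implementation; the class name ContextualPooling and the attention-style weighting details are assumptions for exposition.

# Minimal sketch of text-conditioned pooling over visual mixture tokens.
# Illustrative only, NOT the released Llip code; names and details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualPooling(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # projects the caption feature into a query
        self.key = nn.Linear(dim, dim)    # projects each visual mixture token into a key

    def forward(self, mixture_tokens: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # mixture_tokens: (batch, K, dim) visual mixture tokens from the image encoder
        # text_feat:      (batch, dim)    caption embedding from the text encoder
        q = self.query(text_feat).unsqueeze(1)                           # (batch, 1, dim)
        k = self.key(mixture_tokens)                                     # (batch, K, dim)
        attn = F.softmax((q * k).sum(-1) / k.shape[-1] ** 0.5, dim=-1)   # (batch, K)
        # Weighted average of the mixture tokens -> caption-contextualized image feature
        return (attn.unsqueeze(-1) * mixture_tokens).sum(dim=1)          # (batch, dim)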

Pretrained models

| Backbone | # Mixture tokens | Avg. zero-shot acc. (%) | HF model id |
|----------|------------------|-------------------------|-------------|
| ViT-B/16 | 32 | 69.6 | lavoies/llip-vitb-14-224 |
| ViT-G/14 | 64 | 79.3 | lavoies/llip-vitG-14-224 |

Loading the Hugging Face model:

>>> from transformers import AutoModel
>>> model = AutoModel.from_pretrained('lavoies/llip-vitG-14-224')
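
After loading, the intended use is CLIP-style zero-shot retrieval or classification. The following is only a hedged sketch of what that could look like: the encode_image/encode_text method names, the preprocessing, and the dummy inputs are assumptions rather than the documented Llip interface; see the official repository for the real pipeline.

# Hedged usage sketch: encode_image / encode_text and the inputs below are assumed
# method names and placeholder data, not the documented Llip interface.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained('lavoies/llip-vitG-14-224')
model.eval()

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)          # placeholder for a preprocessed 224x224 image
    img_feat = model.encode_image(image)         # assumed method name
    txt_feat = model.encode_text(["a photo of a dog", "a photo of a cat"])  # assumed method name
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)  # L2-normalize
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    scores = img_feat @ txt_feat.T               # cosine similarities, shape (1, 2)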

Citing Llip

If you find this repository useful in your research, please consider giving it a star ⭐ and a citation:

@inproceedings{lavoie2024modeling,
  title={Modeling Caption Diversity in Contrastive Vision-Language Pretraining},
  author={Samuel Lavoie and Polina Kirichenko and Mark Ibrahim and Mido Assran and Andrew Gordon Wilson and Aaron Courville and Nicolas Ballas},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
  url={https://openreview.net/forum?id=iaV2fU6Dif}
}