This model has been pushed to the Hub using the PyTorchModelHubMixin integration:

  • Library: llip-vitb-14-224
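
For context, the PyTorchModelHubMixin integration means the checkpoint was saved with huggingface_hub's mixin rather than as a native transformers architecture. The snippet below only sketches the general mixin pattern; LlipModel is a placeholder class name, and loading the real checkpoint requires the actual model class from the Llip codebase.

# General PyTorchModelHubMixin pattern (placeholder class, NOT the real Llip model):
# subclassing the mixin gives the class from_pretrained() / push_to_hub() for free.
from huggingface_hub import PyTorchModelHubMixin
import torch.nn as nn

class LlipModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        return self.proj(x)

# Loading would then look like this (the class definition must match the checkpoint):
# model = LlipModel.from_pretrained("lavoies/llip-vitb-14-224")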

Modeling Caption Diversity in Contrastive Vision-Language Pretraining

Samuel Lavoie, Polina Kirichenko*, Mark Ibrahim*, Mido Assran, Andrew Gordon Wilson, Aaron Courville, Nicolas Ballas

* Equal contribution

[paper][bibtex]

PyTorch implementation and pre-trained models for Llip. Llip produces strong image-text retrieval models as well as image and text encoders. The models are pre-trained on a dataset of 2.5B image-caption pairs and can contextualize their visual features on target captions.
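
The contextualization step is the core idea: instead of a single image embedding, the visual encoder emits several mixture tokens that are pooled with weights derived from the caption. The sketch below is a minimal illustration of that idea, not the released implementation; the class name ContextualPooling and the attention-style weighting details are assumptions for exposition.

# Minimal sketch of text-conditioned pooling over visual mixture tokens.
# Illustrative only, NOT the released Llip code; names and details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualPooling(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # projects the caption feature into a query
        self.key = nn.Linear(dim, dim)    # projects each visual mixture token into a key

    def forward(self, mixture_tokens: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # mixture_tokens: (batch, K, dim) visual mixture tokens from the image encoder
        # text_feat:      (batch, dim)    caption embedding from the text encoder
        q = self.query(text_feat).unsqueeze(1)                           # (batch, 1, dim)
        k = self.key(mixture_tokens)                                     # (batch, K, dim)
        attn = F.softmax((q * k).sum(-1) / k.shape[-1] ** 0.5, dim=-1)   # (batch, K)
        # Weighted average of the mixture tokens -> caption-contextualized image feature
        return (attn.unsqueeze(-1) * mixture_tokens).sum(dim=1)          # (batch, dim)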

Pretrained models

| Backbone | # Mixture tokens | Avg. zero-shot acc. (%) | HF model id |
|----------|------------------|-------------------------|-------------|
| ViT-B/16 | 32 | 69.6 | lavoies/llip-vitb-14-224 |
| ViT-G/14 | 64 | 79.3 | lavoies/llip-vitG-14-224 |

Loading the Hugging Face model:

>>> from transformers import AutoModel
>>> model = AutoModel.from_pretrained('lavoies/llip-vitG-14-224')
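
After loading, the intended use is CLIP-style zero-shot retrieval or classification. The following is only a hedged sketch of what that could look like: the encode_image/encode_text method names, the preprocessing, and the dummy inputs are assumptions rather than the documented Llip interface; see the official repository for the real pipeline.

# Hedged usage sketch: encode_image / encode_text and the inputs below are assumed
# method names and placeholder data, not the documented Llip interface.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained('lavoies/llip-vitG-14-224')
model.eval()

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)          # placeholder for a preprocessed 224x224 image
    img_feat = model.encode_image(image)         # assumed method name
    txt_feat = model.encode_text(["a photo of a dog", "a photo of a cat"])  # assumed method name
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)  # L2-normalize
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    scores = img_feat @ txt_feat.T               # cosine similarities, shape (1, 2)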

Citing Llip

If you find this repository useful in your research, please consider giving it a star ⭐ and a citation:

@inproceedings{lavoie2024modeling,
  title={Modeling Caption Diversity in Contrastive Vision-Language Pretraining},
  author={Samuel Lavoie and Polina Kirichenko and Mark Ibrahim and Mido Assran and Andrew Gordon Wilson and Aaron Courville and Nicolas Ballas},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
  url={https://openreview.net/forum?id=iaV2fU6Dif}
}