This model has been pushed to the Hub using the PyTorchModelHubMixin integration:
- Library: llip-vitb-16-224
# Modeling Caption Diversity in Contrastive Vision-Language Pretraining
Samuel Lavoie, Polina Kirichenko*, Mark Ibrahim*, Mido Assran, Andrew Gordon Wilson, Aaron Courville, Nicolas Ballas
* Equal contribution
PyTorch implementation and pre-trained models for Llip. Llip produces strong image-text retrieval models as well as image and text encoders. The models are pre-trained on a dataset of 2.5B image-caption pairs and can contextualize visual features on target captions.
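The sketch below is only an illustration of this contextualization idea, not the Llip implementation from this repository: it assumes a hypothetical `MixtureTokenPooling` module in which a caption embedding attends over K visual mixture tokens to produce a caption-conditioned image feature. All names, dimensions, and the single-head attention are assumptions made for readability.

```python
# Illustrative sketch only: a minimal, hypothetical module showing how visual
# features could be contextualized on a caption. The real Llip code may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureTokenPooling(nn.Module):
    """Pool K visual mixture tokens into one feature, weighted by a caption embedding."""

    def __init__(self, dim: int):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)  # caption embedding -> attention query
        self.key_proj = nn.Linear(dim, dim)    # mixture tokens -> attention keys

    def forward(self, mixture_tokens: torch.Tensor, text_embed: torch.Tensor) -> torch.Tensor:
        # mixture_tokens: (batch, K, dim) visual mixture tokens from the image encoder
        # text_embed:     (batch, dim) caption embedding from the text encoder
        q = self.query_proj(text_embed).unsqueeze(1)                               # (batch, 1, dim)
        k = self.key_proj(mixture_tokens)                                          # (batch, K, dim)
        attn = F.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)       # (batch, 1, K)
        visual_feature = (attn @ mixture_tokens).squeeze(1)                        # (batch, dim)
        return F.normalize(visual_feature, dim=-1)  # caption-conditioned image feature


# Example shapes: 32 mixture tokens and a 512-d embedding space (both assumed here).
pool = MixtureTokenPooling(dim=512)
tokens = torch.randn(4, 32, 512)
caption = torch.randn(4, 512)
print(pool(tokens, caption).shape)  # torch.Size([4, 512])
```

In the released checkpoints, ViT-B/16 uses 32 mixture tokens and ViT-G/14 uses 64 (see the table below).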
## Pre-trained models
| Backbone | # Mixture tokens | Avg. zero-shot acc. (%) | HF model id |
|---|---|---|---|
| ViT-B/16 | 32 | 69.6 | `lavoies/llip-vitb-16-224` |
| ViT-G/14 | 64 | 79.3 | `lavoies/llip-vitG-14-224` |
Loading the Hugging Face model with `transformers`:
>>> from transformers import AutoModel
>>> model = AutoModel.from_pretrained('lavoies/llip-vitb-16-224')
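A slightly fuller loading sketch is given below; only `from_pretrained` behaviour standard to `transformers` is assumed, and the comments mark everything else as hypothetical. How images and captions are encoded afterwards depends on the interface exposed by the Llip model class, so consult the repository code for the exact calls.

```python
from transformers import AutoModel

# Either checkpoint from the table above loads the same way.
model = AutoModel.from_pretrained("lavoies/llip-vitb-16-224")    # ViT-B/16, 32 mixture tokens
# model = AutoModel.from_pretrained("lavoies/llip-vitG-14-224")  # ViT-G/14, 64 mixture tokens

model.eval()  # switch to inference mode
num_params = sum(p.numel() for p in model.parameters())
print(f"Loaded Llip model with {num_params / 1e6:.1f}M parameters")
```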
## Citing Llip
If you find this repository useful in your research, please consider giving it a star ⭐ and a citation:
@inproceedings{lavoie2024modeling,
title={Modeling Caption Diversity in Contrastive Vision-Language Pretraining},
author={Samuel Lavoie and Polina Kirichenko and Mark Ibrahim and Mido Assran and Andrew Gordon Wilson and Aaron Courville and Nicolas Ballas},
booktitle={Forty-first International Conference on Machine Learning},
year={2024},
url={https://openreview.net/forum?id=iaV2fU6Dif}
}