---
license: mit
datasets:
  - HuggingFaceFW/fineweb-edu
---

# SparseModernBERT α=2.0 Model Card

## Model Overview

**SparseModernBERT-alpha2.0** is a masked language model based on ModernBERT that replaces standard softmax attention with AdaSplash, an adaptive sparse attention mechanism implemented with Triton kernels.

The sparsity parameter α = 2.0 yields highly sparse attention patterns, improving efficiency while maintaining performance.
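For intuition, α-entmax with α = 2.0 reduces to sparsemax, which projects attention scores onto the probability simplex and assigns exactly zero weight to low-scoring tokens, whereas softmax is always dense. The snippet below is a minimal reference sketch of that projection for illustration only; it is not the Triton kernel that AdaSplash actually uses.

```python
import torch

def sparsemax(scores: torch.Tensor) -> torch.Tensor:
    """Sparsemax (alpha-entmax with alpha = 2): a sparse alternative to softmax.

    Illustrative reference implementation, not the AdaSplash Triton kernel.
    """
    # Sort scores in descending order along the last dimension.
    z, _ = torch.sort(scores, dim=-1, descending=True)
    k = torch.arange(1, scores.size(-1) + 1, device=scores.device, dtype=scores.dtype)
    # Support: sorted entries that receive nonzero probability.
    support = 1 + k * z > z.cumsum(dim=-1)
    support_size = support.sum(dim=-1, keepdim=True).to(scores.dtype)
    # Threshold tau chosen so the clipped output sums to 1 over the support.
    tau = ((z * support).sum(dim=-1, keepdim=True) - 1) / support_size
    return torch.clamp(scores - tau, min=0.0)

scores = torch.tensor([1.5, 1.0, 0.1, -1.0])
print(torch.softmax(scores, dim=-1))  # dense: every weight > 0
print(sparsemax(scores))              # sparse: tensor([0.7500, 0.2500, 0.0000, 0.0000])
```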

Key features:

- **Sparsity (α):** 2.0
- **Tokenization:** same as ModernBERT
- **Pretraining:** masked language modeling on a large web corpus (FineWeb-Edu)

## Usage

This model requires the custom modeling code from https://github.com/deep-spin/SparseModernBERT; install it before loading the checkpoint.

```python
from transformers import AutoTokenizer
from sparse_modern_bert import CustomModernBertModel

model_id = "sardinelab/SparseModernBERT-alpha2.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = CustomModernBertModel.from_pretrained(model_id, trust_remote_code=True)
```
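Once loaded, the model can be used like any Hugging Face encoder. A minimal forward pass, assuming the standard `last_hidden_state` output of a base model, looks like:

```python
inputs = tokenizer("AdaSplash makes attention adaptively sparse.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, seq_len, hidden_size)
```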

## Citation

If you use this model in your work, please cite:

```bibtex
@article{goncalves2025adasplash,
  title={AdaSplash: Adaptive Sparse Flash Attention},
  author={Gon\c{c}alves, Nuno and Treviso, Marcos and Martins, Andr\'e F. T.},
  journal={arXiv preprint arXiv:2502.12082},
  year={2025}
}
```