AbLangPDB1: Contrastive-Learned Antibody Embeddings for General Epitope-Overlap Predictions

This repository contains the model, code, and tokenizers for AbLangPDB1.

Model Description

AbLangPDB1 is a fine-tuned antibody language model that generates embeddings of antibodies for finding epitope- and antigen-specificity matches to reference antibodies.

The model was developed using contrastive learning on paired heavy and light chain sequences, as described in our paper:

Contrastive Learning Enables Epitope Overlap Predictions for Targeted Antibody Discovery. [bioRxiv], Clinton M. Holt, Alexis K. Janke, Parastoo Amlashi, Parker J. Jamieson, Toma M. Marinov, Ivelin S. Georgiev. 2025. https://doi.org/10.1101/2025.02.25.640114

Model Architecture

Heavy Chain Seq -> [AbLang Heavy] -> 768-dim -> |
                                                | -> [Concatenate] -> [Mixer Network] -> 1536-dim Paired Embedding
Light Chain Seq -> [AbLang Light] -> 768-dim -> |

The AbLangPDB1 model uses the AbLangPaired architecture, a custom class that processes the heavy and light chains of an antibody independently using the pre-trained AbLang models before fusing their embeddings. The resulting embeddings from the two AbLang models are concatenated and passed through a custom Mixer network (six fully connected feed-forward layers) to produce a final, unified 1536-dimensional embedding for the paired antibody.
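Below is a minimal, illustrative sketch of that concatenate-then-mix flow in PyTorch. The layer widths, activations, and hidden sizes inside the Mixer are assumptions for illustration only; the actual architecture is defined by the AbLangPaired class shipped in this repository.

import torch
import torch.nn as nn

class MixerSketch(nn.Module):
    """Illustrative stand-in for the Mixer network (widths/activations assumed)."""
    def __init__(self, in_dim=1536, hidden_dim=1536, out_dim=1536, n_layers=6):
        super().__init__()
        layers = []
        dim = in_dim
        for _ in range(n_layers - 1):
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
            dim = hidden_dim
        layers.append(nn.Linear(dim, out_dim))  # final 1536-dim paired embedding
        self.net = nn.Sequential(*layers)

    def forward(self, heavy_emb, light_emb):
        # heavy_emb, light_emb: (batch, 768) mean-pooled AbLang chain embeddings
        paired = torch.cat([heavy_emb, light_emb], dim=-1)  # (batch, 1536)
        return self.net(paired)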

The pretrained heavy model is AbLang_heavy and the pretrained light model is AbLang_light. In brief, these use the RoBERTa architecture pretrained with the masked language modeling objective. Each model has 12 transformer blocks with 12 attention heads, an inner hidden size of 3072, and a hidden size of 768, and uses learned positional embeddings specific to antibodies with a maximum length of 160. The 768-dimensional embedding from each model is generated by mean pooling over all residue-level embeddings.
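A minimal sketch of that mean-pooling step, assuming a (batch, seq_len, 768) hidden-state tensor and a 0/1 attention mask from the tokenizer:

import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)    # sum over non-padding residues
    counts = mask.sum(dim=1).clamp(min=1e-9)      # number of residues per sequence
    return summed / counts                        # (batch, 768) per-chain embedding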

During training, these pretrained models were kept frozen and a QLoRA adapter was added.
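For illustration only, the snippet below shows how a LoRA-style adapter can be attached to a frozen transformer with the peft library. The rank, alpha, target modules, and quantization settings used for AbLangPDB1 are not specified in this card, so the values here are assumptions.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                               # adapter rank (assumed)
    lora_alpha=32,                      # scaling factor (assumed)
    target_modules=["query", "value"],  # RoBERTa-style attention projections (assumed)
    lora_dropout=0.05,
)

# base_model = <a pre-trained AbLang RoBERTa-style model loaded elsewhere>
# peft_model = get_peft_model(base_model, lora_config)  # base weights stay frozen; only adapters train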

Intended uses & limitations

The model is intended to generate epitope-information-rich embeddings of antibodies, although a prediction head could be added on top of the embeddings to predict properties such as neutralization capacity. Expect accuracy to be significantly better when comparing antibodies to those within the PDB.
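As an illustration of that last point, here is a hedged sketch of an optional prediction head on top of the 1536-dimensional embedding; the head size, activation, and training setup are assumptions, and no such head ships with this repository.

import torch.nn as nn

class PredictionHead(nn.Module):
    """Illustrative binary-property head (e.g. neutralization) on a 1536-dim embedding."""
    def __init__(self, emb_dim=1536, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single logit for a binary label
        )

    def forward(self, emb):
        return self.net(emb).squeeze(-1)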

  1. Epitope Classification: Antibodies with unknown epitopes can be embedded and compared against a reference database of antibodies with known epitopes. The reference antibody with the highest cosine similarity is predicted to bind the epitope most similar to that of the query antibody. Limitation: mouse BCRs are unlikely to perform well here, and BCRs that do not bind a Pfam domain used in training are likely to have reduced classification accuracy.

  2. Antibody Search: A reference antibody sequence can be embedded along with a large search database. Antibodies in the search database with high cosine similarity to the reference can be assumed to target similar epitopes and can be clustered by embedding similarity. Representative candidates can then be chosen from each cluster for downstream characterization. A minimal similarity-ranking sketch for both use cases follows this list.
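The sketch below shows the core operation behind both use cases: ranking a reference set by cosine similarity to a query embedding. The function and variable names are illustrative; embeddings are assumed to come from AbLangPDB1 as shown in the How to Use section.

import torch
import torch.nn.functional as F

def rank_by_cosine(query_emb: torch.Tensor, reference_embs: torch.Tensor) -> torch.Tensor:
    # query_emb: (1536,); reference_embs: (n_refs, 1536)
    sims = F.cosine_similarity(query_emb.unsqueeze(0), reference_embs, dim=-1)
    return sims.argsort(descending=True)  # reference indices, most similar first

# Example with random stand-in embeddings:
# order = rank_by_cosine(torch.randn(1536), torch.randn(100, 1536))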

Training data

For AbLangPDB1, we curated 1,909 non-redundant human antibodies from the Structural Antibody Database (SAbDab) with a February 19, 2024 cutoff date. Antigen domains were assigned using the pfam_scan software, and two antibodies containing at least one shared Pfam domain were considered to be in the same category. For partitioning antibodies between training (80%), validation (10%), and test (10%) splits, antibodies sharing both heavy and light V-genes and having >70% CDRH3 amino acid identity were assigned to the same clone group, and clone groups were distributed such that no group appeared in both the training and test sets. Additionally, pairs with >92.5% sequence identity in either chain were excluded to maintain diversity.
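A hedged sketch of the clone-grouping rule described above: two antibodies fall in the same clone group when they share heavy and light V-genes and their CDRH3 amino acid identity exceeds 70%. The identity metric here (position-wise matches over the longer sequence, without alignment) and the field names are simplifying assumptions.

def cdrh3_identity(a: str, b: str) -> float:
    # Fraction of matching positions, normalized by the longer CDRH3 (assumed metric)
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def same_clone_group(ab1: dict, ab2: dict) -> bool:
    return (
        ab1["heavy_v_gene"] == ab2["heavy_v_gene"]
        and ab1["light_v_gene"] == ab2["light_v_gene"]
        and cdrh3_identity(ab1["cdrh3"], ab2["cdrh3"]) > 0.70
    )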

Training Procedure

The AbLangPDB1 model was trained with a mean squared error loss between the cosine similarity of a pair of antibody embeddings and the ground-truth epitope overlap of that pair. The overlap labels place antibodies that bind the same antigen protein family in the same general vicinity of embedding space, while pushing antibodies that bind overlapping epitopes progressively closer together.
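A minimal sketch of that objective: mean squared error between the cosine similarity of two paired embeddings and the epitope-overlap label for the pair. Batch construction and the exact label values are not shown and are assumptions.

import torch
import torch.nn.functional as F

def overlap_mse_loss(emb_a: torch.Tensor, emb_b: torch.Tensor, overlap: torch.Tensor) -> torch.Tensor:
    # emb_a, emb_b: (batch, 1536) paired embeddings; overlap: (batch,) target overlap label
    cos = F.cosine_similarity(emb_a, emb_b, dim=-1)
    return F.mse_loss(cos, overlap)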

How to Use

1. Setup

First, clone this repository and install the required libraries.

# Clone the repository to get the model script, weights, and tokenizers
git clone https://huggingface.co/clint-holt/AbLangPDB1
cd AbLangPDB1

# Install dependencies
pip install torch pandas "transformers>=4.30.0" safetensors

2. Generate Embeddings

Then run the following Python code:


import torch
import pandas as pd
from transformers import AutoTokenizer

# Import the custom model class and config from the cloned repository
from ablangpaired_model import AbLangPaired, AbLangPairedConfig

# 1. Load Model and Tokenizers
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_dir = "." # Assumes you are running this script from the cloned directory

# Configure the model to load the local weights
# The AbLangPairedConfig specifies the base AbLang models and the local checkpoint file
model_config = AbLangPairedConfig(checkpoint_filename=f"{model_dir}/ablangpdb_model.safetensors")
model = AbLangPaired(model_config, device).to(device)
model.eval()

# Tokenizers are stored in subdirectories
heavy_tokenizer = AutoTokenizer.from_pretrained(f"{model_dir}/heavy_tokenizer")
light_tokenizer = AutoTokenizer.from_pretrained(f"{model_dir}/light_tokenizer")

# 2. Prepare Antibody Sequences
data = {
    'HC_AA': ["EVQLVESGGGLVQPGGSLRLSCAASGFNLYYYSIHWVRQAPGKGLEWVASISPYSSSTSYADSVKGRFTISADTSKNTAYLQMNSLRAEDTAVYYCARGRWYRRALDYWGQGTLVTVSS"],
    'LC_AA': ["DIQMTQSPSSLSASVGDRVTITCRASQSVSSAVAWYQQKPGKAPKLLIYSASSLYSGVPSRFSGSRSGTDFTLTISSLQPEDFATYYCQQYPYYSSLITFGQGTKVEIK"]
}
df = pd.DataFrame(data)

# Pre-process sequences by adding spaces between amino acids
df["PREPARED_HC_SEQ"] = df["HC_AA"].apply(lambda x: " ".join(list(x)))
df["PREPARED_LC_SEQ"] = df["LC_AA"].apply(lambda x: " ".join(list(x)))

# 3. Tokenize and Embed
h_tokens = heavy_tokenizer(df["PREPARED_HC_SEQ"].tolist(), padding='longest', return_tensors="pt")
l_tokens = light_tokenizer(df["PREPARED_LC_SEQ"].tolist(), padding='longest', return_tensors="pt")

with torch.no_grad():
    embeddings = model(
        h_input_ids=h_tokens['input_ids'].to(device),
        h_attention_mask=h_tokens['attention_mask'].to(device),
        l_input_ids=l_tokens['input_ids'].to(device),
        l_attention_mask=l_tokens['attention_mask'].to(device)
    )

print("Embedding generation complete! โœ…")
print("Shape of embeddings tensor:", embeddings.shape)
# Expected output shape: (1, 1536)

Citation

If you use this model or code in your research, please cite our paper:


@article {Holt2025.02.25.640114,
    author = {Holt, Clinton M. and Janke, Alexis K. and Amlashi, Parastoo and Jamieson, Parker J. and Marinov, Toma M. and Georgiev, Ivelin S.},
    title = {Contrastive Learning Enables Epitope Overlap Predictions for Targeted Antibody Discovery},
    elocation-id = {2025.02.25.640114},
    year = {2025},
    doi = {10.1101/2025.02.25.640114},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/04/01/2025.02.25.640114},
    eprint = {https://www.biorxiv.org/content/early/2025/04/01/2025.02.25.640114.full.pdf},
    journal = {bioRxiv}
}