BM-MAE: Multimodal Masked Autoencoder Pre-training for 3D MRI-Based Brain Tumor Analysis with Missing Modalities
Here you'll find pre-trained weights for a ViT encoder trained on anatomical brain tumor MRI. The model, BM-MAE, was presented in Multimodal Masked Autoencoder Pre-training for 3D MRI-Based Brain Tumor Analysis with Missing Modalities. It introduces a masked image modeling pre-training strategy tailored to multimodal MRI data that adapts to any combination of available modalities and can efficiently reconstruct missing ones.
Code: https://github.com/Lucas-rbnt/BM-MAE
Please note that the entire model is supplied here; in most fine-tuning use cases, you'll only need the encoder.
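If you prefer to keep only the encoder parameters, one option is to filter the full checkpoint by the encoder's own parameter names before loading. The snippet below is a minimal sketch: it assumes the checkpoint path used in the Quickstart below and an encoder built exactly as shown there.
import torch

# Sketch only: `encoder` is assumed to be the ViTEncoder built in the Quickstart below,
# and the full checkpoint is assumed to live at 'pretrained_models/bmmae.pth'.
full_state_dict = torch.load('pretrained_models/bmmae.pth', map_location='cpu')
encoder_keys = set(encoder.state_dict().keys())
encoder_state_dict = {k: v for k, v in full_state_dict.items() if k in encoder_keys}

missing, unexpected = encoder.load_state_dict(encoder_state_dict, strict=False)
print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")  # sanity check, ideally both 0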
Abstract
Multimodal magnetic resonance imaging (MRI) constitutes the first line of investigation for clinicians in the care of brain tumors, providing crucial insights for surgery planning, treatment monitoring, and biomarker identification. Pre-training on large datasets has been shown to help models learn transferable representations and adapt with minimal labeled data. This behavior is especially valuable in medical imaging, where annotations are often scarce. However, applying this paradigm to multimodal medical data introduces a challenge: most existing approaches assume that all imaging modalities are available during both pre-training and fine-tuning. In practice, missing modalities often occur due to acquisition issues, specialist unavailability, or specific experimental designs on small in-house datasets. Consequently, a common approach involves training a separate model for each desired modality combination, making the process both resource-intensive and impractical for clinical use. Therefore, we introduce BM-MAE, a masked image modeling pre-training strategy tailored for multimodal MRI data. The same pre-trained model seamlessly adapts to any combination of available modalities, extracting rich representations that capture both intra- and inter-modal information. This allows fine-tuning on any subset of modalities without requiring architectural changes, while still benefiting from a model pre-trained on the full set of modalities. Extensive experiments show that the proposed pre-training strategy outperforms or remains competitive with baselines that require separate pre-training for each modality subset, while substantially surpassing training from scratch on several downstream tasks. Additionally, it can quickly and efficiently reconstruct missing modalities, highlighting its practical value.
Framework
BM-MAE relies solely on transformer blocks to extract multimodal anatomical MRI features that can later be used for fine-tuning. The anatomical modalities considered are T1, T1c, FLAIR, and T2.
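For intuition about sequence lengths: with 128x128x128 volumes and 16x16x16 patches, each modality contributes 8^3 = 512 tokens, so the full four-modality input yields 2048 patch tokens plus an optional CLS token. A quick sanity check in Python (the numbers mirror the Quickstart configuration below):
img_size, patch_size = 128, 16
tokens_per_modality = (img_size // patch_size) ** 3  # 8**3 = 512
print(tokens_per_modality)                           # 512
print(4 * tokens_per_modality + 1)                   # 2049 tokens with all four modalities + CLS
print(2 * tokens_per_modality + 1)                   # 1025, matching the two-modality Quickstart output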

Quickstart
The simplest way to use BM-MAE is to extract a relevant representation through the ViT encoder. Suppose, for a patient, you only have two modalities: T1 and T2.
import torch

from bmmae.model import ViTEncoder
from bmmae.tokenizers import MRITokenizer

# Build one tokenizer per available modality
modalities = ['t1', 't2']
tokenizers = {
    modality: MRITokenizer(
        patch_size=(16, 16, 16),
        img_size=(128, 128, 128),
        hidden_size=768,
    )
    for modality in modalities
}

encoder = ViTEncoder(
    modalities=modalities,
    tokenizers=tokenizers,
    cls_token=True,
)

# Load pre-trained weights from a local path (assuming 'pretrained_models/bmmae.pth' exists).
# strict=False lets the encoder skip the decoder weights contained in the full checkpoint.
state_dict = torch.load('pretrained_models/bmmae.pth', map_location='cpu')
encoder.load_state_dict(state_dict, strict=False)

inputs = {'t1': torch.randn(1, 1, 128, 128, 128), 't2': torch.randn(1, 1, 128, 128, 128)}
outputs = encoder(inputs)  # shape [1, 1025, 768]: 2 x 512 patch tokens + 1 CLS token
print(f"Output shape from local weights: {outputs.shape}")
Hugging Face Hub Loading
You can also load the models directly from the Hugging Face Hub:
from bmmae.model import BMMAE, ViTEncoder
# Load the full BMMAE model from the Hub
model = BMMAE.from_pretrained("luklebigbosse/BM-MAE")
print("BMMAE model loaded from Hugging Face Hub.")
# Load only the encoder from the Hub
encoder_only = ViTEncoder.from_pretrained("luklebigbosse/BM-MAE")
print("ViTEncoder loaded from Hugging Face Hub.")
Citation
If you find this repository helpful, please cite our paper:
@misc{robinet2025multimodalmaskedautoencoderpretraining,
      title={Multimodal Masked Autoencoder Pre-training for 3D MRI-Based Brain Tumor Analysis with Missing Modalities},
      author={Lucas Robinet and Ahmad Berjaoui and Elizabeth Cohen-Jonathan Moyal},
      year={2025},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}