Llama4-Maverick-Eagle3-Speculators
Model Description
⚠️ Development Reference Model: This model was converted as a reference for vLLM development. Once development is complete, it can be served using:
vllm serve nm-testing/Llama4-Maverick-Eagle3-Speculators
This is a manually converted Eagle3 speculator model based on NVIDIA's Llama-4-Maverick-17B-128E-Eagle3, reformatted for compatibility with the Speculators library and vLLM speculative decoding.
Development Status
🚧 Reference Implementation for vLLM Development
- This model serves as a reference implementation for vLLM Eagle3 support
- Contains non-standard features (auxiliary hidden states) that require vLLM extensions
- Once vLLM development is complete, it will support direct serving
Key Features
- Architecture: Eagle3 speculator with Llama3-based draft head
- Target Verifier: Llama 4 Maverick 17B 128E (quantized w4a16)
- Vocabulary Size: 202,048 tokens (unusually large for a draft model)
- Special Feature: Uses auxiliary hidden states from verifier layers [1, 23, 44]
Configuration Details
This model represents a unique hybrid configuration:
- Draft Model: Llama3-based Eagle3 head (single transformer layer)
- Verifier Model: RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16
- Architecture Class: Llama4ForConditionalGeneration for the verifier
Non-Standard Features
This model includes several non-standard Eagle3 features preserved from the NVIDIA checkpoint:
- Auxiliary hidden state layers from positions [1, 23, 44]
- Custom layer normalization configurations
- Large vocabulary matching the target model
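For intuition, the usual Eagle3 pattern is to concatenate the hidden states captured at those verifier layers and project them down to the draft head's width before the single draft layer runs. Below is a minimal sketch of that pattern; the module name and hidden size are illustrative assumptions, not this checkpoint's actual parameter names:

```python
import torch

AUX_LAYER_IDS = [1, 23, 44]   # verifier layers tapped for auxiliary states
HIDDEN_SIZE = 5120            # assumed hidden size, for illustration only

class AuxHiddenStateProjector(torch.nn.Module):
    """Fuses per-layer auxiliary hidden states into one draft-head input."""

    def __init__(self, hidden_size: int, num_aux_layers: int):
        super().__init__()
        # Concatenated aux states -> draft-head hidden size
        self.fc = torch.nn.Linear(hidden_size * num_aux_layers, hidden_size, bias=False)

    def forward(self, aux_hidden_states):
        # aux_hidden_states: one [batch, seq, hidden] tensor per tapped layer
        return self.fc(torch.cat(aux_hidden_states, dim=-1))

projector = AuxHiddenStateProjector(HIDDEN_SIZE, len(AUX_LAYER_IDS))
fused = projector([torch.randn(1, 8, HIDDEN_SIZE) for _ in AUX_LAYER_IDS])
print(fused.shape)  # torch.Size([1, 8, 5120])
```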
Usage
With vLLM (After Development Complete)
# Once vLLM development is complete, serve directly:
vllm serve nm-testing/Llama4-Maverick-Eagle3-Speculators
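Once the server is up, it should be queryable like any other vLLM deployment through the OpenAI-compatible API. A sketch, assuming vLLM's default port and that the served model name matches the repo id:

```python
# Sketch of a client request against the vLLM server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="nm-testing/Llama4-Maverick-Eagle3-Speculators",
    prompt="Speculative decoding speeds up inference by",
    max_tokens=64,
)
print(completion.choices[0].text)
```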
With Speculators Library
from speculators import SpeculatorModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the speculator
speculator = SpeculatorModel.from_pretrained("nm-testing/Llama4-Maverick-Eagle3-Speculators")

# Load and attach the verifier
verifier = AutoModelForCausalLM.from_pretrained(
    "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
    trust_remote_code=True,
)
speculator.attach_verifier(verifier)

# Tokenize a prompt with the verifier's tokenizer, then generate
tokenizer = AutoTokenizer.from_pretrained("RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16")
input_ids = tokenizer("Speculative decoding", return_tensors="pt").input_ids
outputs = speculator.generate(input_ids, max_length=100)
Configuration Structure
The model uses the Speculators Eagle3 format with additional fields for NVIDIA-specific features:
{
  "speculators_model_type": "eagle3",
  "architectures": ["Eagle3Speculator"],
  "draft_vocab_size": 202048,
  "transformer_layer_config": {
    "rope_scaling": {
      "rope_type": "llama3" // Confirms Llama3 architecture
    }
  },
  "eagle_aux_hidden_state_layer_ids": [1, 23, 44],
  "use_aux_hidden_state": true
}
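To verify these fields on the published checkpoint, the config can be pulled and inspected directly (a sketch using huggingface_hub):

```python
import json
from huggingface_hub import hf_hub_download

# Download only config.json and check the Eagle3-specific fields
path = hf_hub_download("nm-testing/Llama4-Maverick-Eagle3-Speculators", "config.json")
with open(path) as f:
    config = json.load(f)

print(config["speculators_model_type"])            # expected: "eagle3"
print(config["draft_vocab_size"])                  # expected: 202048
print(config["eagle_aux_hidden_state_layer_ids"])  # expected: [1, 23, 44]
```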
Performance Notes
- Vocabulary Size: The 202K vocabulary is unusually large and may impact memory usage
- Auxiliary Hidden States: May require custom Eagle3Speculator extensions for full functionality
- Acceptance Rate: Expected ~2-3 tokens per forward pass based on NVIDIA benchmarks
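As a rough way to read the acceptance figure: if each verifier forward pass yields k accepted tokens on average, and the draft head costs some fraction of a verifier pass per token, the ideal end-to-end speedup is roughly k / (1 + overhead · k). A back-of-the-envelope sketch, where the overhead value is an assumption rather than a measurement:

```python
def ideal_speedup(tokens_per_pass: float, draft_overhead: float = 0.1) -> float:
    """Rough speedup vs. plain autoregressive decoding.

    Assumes each accepted token cost one draft step at `draft_overhead`
    of a verifier forward pass (illustrative value, not a measurement).
    """
    return tokens_per_pass / (1.0 + draft_overhead * tokens_per_pass)

for k in (2.0, 2.5, 3.0):  # the ~2-3 tokens/pass range cited above
    print(f"{k:.1f} tokens/pass -> ~{ideal_speedup(k):.2f}x")
```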
Model Weights
- Format: SafeTensors
- Precision: bfloat16
- Size: ~3.2GB
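Tensor names, dtypes, and shapes can be checked without loading the weights, using the safetensors lazy-loading API. A sketch; the shard filename is an assumption and may differ on the hub:

```python
from huggingface_hub import hf_hub_download
from safetensors import safe_open

# "model.safetensors" is an assumed filename; check the repo file list
path = hf_hub_download("nm-testing/Llama4-Maverick-Eagle3-Speculators", "model.safetensors")
with safe_open(path, framework="pt") as f:
    for name in list(f.keys())[:5]:  # peek at the first few tensors
        sl = f.get_slice(name)
        print(name, sl.get_dtype(), sl.get_shape())
```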
Citation
If you use this model, please cite both the original NVIDIA model and the Speculators library:
@misc{nvidia2025llama4maverick,
  title={Llama 4 Maverick 17B Eagle3},
  author={NVIDIA Corporation},
  year={2025},
  publisher={Hugging Face}
}

@misc{speculators2024,
  title={Speculators: A Unified Library for Speculative Decoding},
  author={Neural Magic},
  year={2024},
  url={https://github.com/neuralmagic/speculators}
}
License
This model is subject to the NVIDIA Open Model License. Please review the license terms before use.
Acknowledgments
- Original model by NVIDIA Corporation
- Conversion and formatting for Speculators/vLLM compatibility
- Based on Eagle3 architecture with Llama3 draft head targeting Llama4 verifier