Llama4-Maverick-Eagle3-Speculators
Model Description
⚠️ Development Reference Model: This model was converted as a reference for vLLM development. Once development is complete, it can be served using:
vllm serve nm-testing/Llama4-Maverick-Eagle3-Speculators
This is a manually converted Eagle3 speculator model based on NVIDIA's Llama-4-Maverick-17B-128E-Eagle3, reformatted for compatibility with the Speculators library and vLLM speculative decoding.
Development Status
🚧 Reference Implementation for vLLM Development
- This model serves as a reference implementation for vLLM Eagle3 support
- Contains non-standard features (auxiliary hidden states) that require vLLM extensions
- Once vLLM development is complete, it will support direct serving
Key Features
- Architecture: Eagle3 speculator with Llama3-based draft head
- Target Verifier: Llama 4 Maverick 17B 128E (quantized w4a16)
- Vocabulary Size: 202,048 tokens (unusually large for a draft model)
- Special Feature: Uses auxiliary hidden states from verifier layers [1, 23, 44]
Configuration Details
This model represents a unique hybrid configuration:
- Draft Model: Llama3-based Eagle3 head (single transformer layer)
- Verifier Model: RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16
- Architecture Class: Llama4ForConditionalGeneration for the verifier
Non-Standard Features
This model includes several non-standard Eagle3 features preserved from the NVIDIA checkpoint:
- Auxiliary hidden state layers from positions [1, 23, 44]
- Custom layer normalization configurations
- Large vocabulary matching the target model
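For intuition, the usual Eagle3 pattern is to concatenate the hidden states captured at those verifier layers and project them down to the draft head's width before the single draft layer runs. Below is a minimal sketch of that pattern; the module name and hidden size are illustrative assumptions, not this checkpoint's actual parameter names:

```python
import torch

AUX_LAYER_IDS = [1, 23, 44]   # verifier layers tapped for auxiliary states
HIDDEN_SIZE = 5120            # assumed hidden size, for illustration only

class AuxHiddenStateProjector(torch.nn.Module):
    """Fuses per-layer auxiliary hidden states into one draft-head input."""

    def __init__(self, hidden_size: int, num_aux_layers: int):
        super().__init__()
        # Concatenated aux states -> draft-head hidden size
        self.fc = torch.nn.Linear(hidden_size * num_aux_layers, hidden_size, bias=False)

    def forward(self, aux_hidden_states):
        # aux_hidden_states: one [batch, seq, hidden] tensor per tapped layer
        return self.fc(torch.cat(aux_hidden_states, dim=-1))

projector = AuxHiddenStateProjector(HIDDEN_SIZE, len(AUX_LAYER_IDS))
fused = projector([torch.randn(1, 8, HIDDEN_SIZE) for _ in AUX_LAYER_IDS])
print(fused.shape)  # torch.Size([1, 8, 5120])
```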
Usage
With vLLM (After Development Complete)
# Once vLLM development is complete, serve directly:
vllm serve nm-testing/Llama4-Maverick-Eagle3-Speculators
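Once the server is up, it should be queryable like any other vLLM deployment through the OpenAI-compatible API. A sketch, assuming vLLM's default port and that the served model name matches the repo id:

```python
# Sketch of a client request against the vLLM server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="nm-testing/Llama4-Maverick-Eagle3-Speculators",
    prompt="Speculative decoding speeds up inference by",
    max_tokens=64,
)
print(completion.choices[0].text)
```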
With Speculators Library
from speculators import SpeculatorModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the speculator
speculator = SpeculatorModel.from_pretrained("nm-testing/Llama4-Maverick-Eagle3-Speculators")

# Load and attach the verifier
verifier = AutoModelForCausalLM.from_pretrained(
    "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16",
    trust_remote_code=True,
)
speculator.attach_verifier(verifier)

# Tokenize a prompt with the verifier's tokenizer, then generate
tokenizer = AutoTokenizer.from_pretrained("RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16")
input_ids = tokenizer("Speculative decoding", return_tensors="pt").input_ids
outputs = speculator.generate(input_ids, max_length=100)
Configuration Structure
The model uses the Speculators Eagle3 format with additional fields for NVIDIA-specific features:
{
  "speculators_model_type": "eagle3",
  "architectures": ["Eagle3Speculator"],
  "draft_vocab_size": 202048,
  "transformer_layer_config": {
    "rope_scaling": {
      "rope_type": "llama3" // Confirms Llama3 architecture
    }
  },
  "eagle_aux_hidden_state_layer_ids": [1, 23, 44],
  "use_aux_hidden_state": true
}
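To verify these fields on the published checkpoint, the config can be pulled and inspected directly (a sketch using huggingface_hub):

```python
import json
from huggingface_hub import hf_hub_download

# Download only config.json and check the Eagle3-specific fields
path = hf_hub_download("nm-testing/Llama4-Maverick-Eagle3-Speculators", "config.json")
with open(path) as f:
    config = json.load(f)

print(config["speculators_model_type"])            # expected: "eagle3"
print(config["draft_vocab_size"])                  # expected: 202048
print(config["eagle_aux_hidden_state_layer_ids"])  # expected: [1, 23, 44]
```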
Performance Notes
- Vocabulary Size: The 202K vocabulary is unusually large and may impact memory usage
- Auxiliary Hidden States: May require custom Eagle3Speculator extensions for full functionality
- Acceptance Rate: Expected ~2-3 tokens per forward pass based on NVIDIA benchmarks
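As a rough way to read the acceptance figure: if each verifier forward pass yields k accepted tokens on average, and the draft head costs some fraction of a verifier pass per token, the ideal end-to-end speedup is roughly k / (1 + overhead · k). A back-of-the-envelope sketch, where the overhead value is an assumption rather than a measurement:

```python
def ideal_speedup(tokens_per_pass: float, draft_overhead: float = 0.1) -> float:
    """Rough speedup vs. plain autoregressive decoding.

    Assumes each accepted token cost one draft step at `draft_overhead`
    of a verifier forward pass (illustrative value, not a measurement).
    """
    return tokens_per_pass / (1.0 + draft_overhead * tokens_per_pass)

for k in (2.0, 2.5, 3.0):  # the ~2-3 tokens/pass range cited above
    print(f"{k:.1f} tokens/pass -> ~{ideal_speedup(k):.2f}x")
```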
Model Weights
- Format: SafeTensors
- Precision: bfloat16
- Size: ~3.2GB
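Tensor names, dtypes, and shapes can be checked without loading the weights, using the safetensors lazy-loading API. A sketch; the shard filename is an assumption and may differ on the hub:

```python
from huggingface_hub import hf_hub_download
from safetensors import safe_open

# "model.safetensors" is an assumed filename; check the repo file list
path = hf_hub_download("nm-testing/Llama4-Maverick-Eagle3-Speculators", "model.safetensors")
with safe_open(path, framework="pt") as f:
    for name in list(f.keys())[:5]:  # peek at the first few tensors
        sl = f.get_slice(name)
        print(name, sl.get_dtype(), sl.get_shape())
```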
Citation
If you use this model, please cite both the original NVIDIA model and the Speculators library:
@misc{nvidia2025llama4maverick,
  title={Llama 4 Maverick 17B Eagle3},
  author={NVIDIA Corporation},
  year={2025},
  publisher={Hugging Face}
}

@misc{speculators2024,
  title={Speculators: A Unified Library for Speculative Decoding},
  author={Neural Magic},
  year={2024},
  url={https://github.com/neuralmagic/speculators}
}
License
This model is subject to the NVIDIA Open Model License. Please review the license terms before use.
Acknowledgments
- Original model by NVIDIA Corporation
- Conversion and formatting for Speculators/vLLM compatibility
- Based on Eagle3 architecture with Llama3 draft head targeting Llama4 verifier