Neuron Model Cache
Why Use the Cache?
Problem: Neuron compilation takes 30-60 minutes for large models
Solution: Download pre-compiled models in seconds
The cache system stores compiled Neuron models on HuggingFace Hub, eliminating recompilation time for your team. When you train or load a model, the system automatically checks for cached versions before starting the expensive compilation process.
Key Benefits:
- Time savings: download compiled models in seconds vs. hours of compilation
- Team collaboration: share compiled models across team members and instances
- Cost reduction: avoid repeated compilation costs on cloud instances
- Automatic operation: works transparently with existing code
Quick Start
Training
from optimum.neuron import NeuronTrainer
# Cache works automatically - no configuration needed
trainer = NeuronTrainer(model=model, args=training_args)
trainer.train() # Downloads cached models if available
Inference
from optimum.neuron import NeuronModelForCausalLM
# Cache works automatically
model = NeuronModelForCausalLM.from_pretrained("model_id")
That’s it! The cache works automatically for supported model classes.
Supported Models
| Model Class | Cache Support | Use Case | Notes |
|---|---|---|---|
| NeuronTrainer | ✅ Full | Training | Auto download + upload during training |
| NeuronModelForCausalLM | ✅ Full | Inference | Auto download for inference |
| Other NeuronModelForXXX | ❌ None | Inference | Different export mechanism, no cache integration |
Important Limitation: Models like NeuronModelForSequenceClassification, NeuronModelForQuestionAnswering, etc. use a different compilation path that doesn't integrate with the cache system. Only NeuronModelForCausalLM and training workflows support caching.
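These other classes are instead compiled ahead of time through the standard export workflow, for example with the export CLI. The command below is a sketch with a placeholder model, shapes, and output directory; exact flags may differ across Optimum Neuron versions:
# Ahead-of-time export for a non-cached class (no Hub cache involved)
optimum-cli export neuron \
  --model distilbert-base-uncased-finetuned-sst-2-english \
  --batch_size 1 \
  --sequence_length 128 \
  distilbert_neuron/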
How It Works
The cache system operates on two levels to minimize compilation time:
Cache Priority (fastest to slowest):
- Local cache → instant access from /var/tmp/neuron-compile-cache
- Hub cache → download in seconds from HuggingFace Hub
- Compile from scratch → 30-60 minutes for large models
What Gets Cached: the system caches NEFF files (Neuron Executable File Format) - the compiled binary artifacts that run on Neuron cores, not the original model files.
Cache Identification: each cached compilation gets a unique hash based on:
- Model factors: architecture, precision (fp16/bf16), input shapes, task type
- Compilation factors: NeuronX compiler version, number of cores, optimization flags
- Environment factors: model checkpoint revision, Optimum Neuron version
This means even small changes to your setup may require recompilation, but identical configurations will always hit the cache.
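For example, on the training side these factors surface through the training arguments. The following is a minimal sketch assuming the NeuronTrainingArguments API, with placeholder values and the dataset arguments omitted as in the quick start above; changing the batch size or precision would produce a different cache hash:
from transformers import AutoModelForCausalLM
from optimum.neuron import NeuronTrainer, NeuronTrainingArguments

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder checkpoint

training_args = NeuronTrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,  # input shape factor
    bf16=True,                      # precision factor
)

# Identical factors, compiler version, and checkpoint revision -> cache hit
trainer = NeuronTrainer(model=model, args=training_args)
trainer.train()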
Private Cache Setup
The default public cache (aws-neuron/optimum-neuron-cache) is read-only for users - you can download cached models but cannot upload your own compilations. This public cache only contains models compiled by the Optimum team for common configurations.
For most use cases, you’ll want to create a private cache repository where you can store your own compiled models.
Why private cache?
- Upload your compilations: store models you compile for team reuse
- Private models: keep proprietary model compilations secure
- Team collaboration: share compiled artifacts across team members and CI/CD
- Custom configurations: cache models with your specific batch sizes, sequence lengths, etc.
Method 1: CLI Setup (Recommended)
# Create private cache repository
optimum-cli neuron cache create
# Set as default cache
optimum-cli neuron cache set your-org/your-cache-name
Method 2: Environment Variable
# Use for single training run
CUSTOM_CACHE_REPO="your-org/your-cache" python train.py
# Or export for session
export CUSTOM_CACHE_REPO="your-org/your-cache"
Prerequisites:
- Login: huggingface-cli login
- write access to the cache repository
CLI Commands
# Create new cache repository
optimum-cli neuron cache create [-n NAME] [--public]
# Set default cache repository
optimum-cli neuron cache set REPO_NAME
# Search for cached models
optimum-cli neuron cache lookup MODEL_ID
# Sync local cache with Hub
optimum-cli neuron cache synchronize
Advanced Usage
Use the Cache in Training Loops
If you do not use the NeuronTrainer class, you can still leverage the cache system in your custom training loops. This is useful when you need more control over the training process or when integrating with custom training frameworks while still benefiting from cached compilations.
When to use this approach:
- custom training loops that don't fit the NeuronTrainer pattern
- advanced optimization scenarios requiring fine-grained control
Note: For most use cases, NeuronTrainer handles caching automatically and is the recommended approach.
from optimum.neuron.cache import hub_neuronx_cache, synchronize_hub_cache
from optimum.neuron.cache.entries import SingleModelCacheEntry
from optimum.neuron.cache.training import patch_neuron_cc_wrapper
# Create cache entry
cache_entry = SingleModelCacheEntry(model_id, task, config, neuron_config)
# The NeuronX compiler will use the Hugging Face Hub cache system
with patch_neuron_cc_wrapper():
    # The compiler will check the specified remote cache for pre-compiled NEFF files
    with hub_neuronx_cache(entry=cache_entry, cache_repo_id="my-org/cache"):
        model = training_loop()  # Will use specified cache
# Synchronize local cache with Hub
synchronize_hub_cache(cache_repo_id="my-org/cache")
Cache Lookup
The inference cache includes a registry that lets you search for compatible pre-compiled models before attempting compilation. This is especially useful for inference where you want to avoid compilation altogether.
optimum-cli neuron cache lookup meta-llama/Llama-2-7b-chat-hf
Example output:
*** 1 entries found in cache ***
task: text-generation
batch_size: 1, sequence_length: 2048
num_cores: 24, precision: fp16
compiler_version: 2.12.54.0
checkpoint_revision: c1b0db933684edbfe29a06fa47eb19cc48025e93
Important: Finding entries doesn’t guarantee cache hits. Your exact configuration must match the cached parameters, including compiler version and model revision.
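For example, to reuse the entry above, the export parameters must mirror it exactly. A sketch follows; the parameter names assume the NeuronModelForCausalLM export API, and the compiler version and checkpoint revision must also line up:
from optimum.neuron import NeuronModelForCausalLM

# Parameters mirroring the cached entry listed above
model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export=True,
    batch_size=1,
    sequence_length=2048,
    num_cores=24,
    auto_cast_type="fp16",
)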
CI/CD Integration
The cache system works seamlessly in automated environments:
Environment Variables: use CUSTOM_CACHE_REPO to specify the cache repository in CI workflows
# In your CI configuration
CUSTOM_CACHE_REPO="your-org/your-cache" python train.py
Authentication: ensure your CI environment has access to your private cache repository (see the sketch after this list):
- Set the HF_TOKEN environment variable with appropriate read/write permissions
- For GitHub Actions, store the token as a repository secret
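A minimal sketch of such a CI step, assuming a shell-based job; the secret variable name is a placeholder:
# Authenticate and point the run at the private cache
export HF_TOKEN="$CI_HF_TOKEN"                  # injected from a CI secret (placeholder name)
export CUSTOM_CACHE_REPO="your-org/your-cache"
python train.py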
Best Practices:
- use separate cache repositories for different environments (dev/staging/prod)
- consider cache repository permissions when setting up automated workflows
- monitor cache repository size in long-running CI workflows
Troubleshooting
“Cache repository does not exist”
Fix: Check repository name and login status → huggingface-cli login → Verify repo format: org/repo-name
“Graph will be recompiled”
Cause: No cached model matches your exact configuration
Fix: Use lookup to find compatible configurations → optimum-cli neuron cache lookup MODEL_ID
Cache not uploading during training
Cause: No write permissions to cache repository
Fix: Verify access and authentication → huggingface-cli whoami → Check cache repo permissions
Slow downloads
Cause: Large compiled models (GBs) downloading
Fix: Ensure good internet connection → Monitor logs for download progress
Clear corrupted local cache
rm -rf /var/tmp/neuron-compile-cache/*