optimum-neuron plugin for vLLM
The `optimum-neuron` package includes a vLLM plugin that registers an `optimum-neuron` vLLM platform, specifically designed to ease the deployment of models hosted on the Hugging Face hub to AWS Trainium and Inferentia.
This platform supports two modes of operation:
- it can be used to run inference on pre-exported Neuron models directly from the hub,
- it also allows the simplified deployment of vanilla models, without recompilation, using cached artifacts.
Notes
- only a relevant subset of all possible configurations for a given model is cached,
- you can use the `optimum-cli` to list the cached configurations for each model, as shown below,
- to deploy models that are not cached on the Hugging Face hub, you need to export them beforehand.
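As an illustration, the commands below sketch how this can be done for the model used later in this page: `optimum-cli neuron cache lookup` lists the cached configurations, and `optimum-cli export neuron` produces an export when the configuration you need is not cached. The exact arguments may differ depending on your `optimum-neuron` version, so check `--help` before running them:

```bash
# List the configurations cached on the Hugging Face hub for this model
optimum-cli neuron cache lookup unsloth/Llama-3.2-1B-Instruct

# Export the model yourself if the configuration you need is not cached
# (the output directory name is arbitrary)
optimum-cli export neuron \
  --model unsloth/Llama-3.2-1B-Instruct \
  --batch_size 4 \
  --sequence_length 4096 \
  --num_cores 2 \
  llama-3.2-1b-neuron/
```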
Setup
The easiest way to use the `optimum-neuron` vLLM platform is to launch an Amazon EC2 instance using the Hugging Face Neuron Deep Learning AMI. If you prefer not to use the Hugging Face Neuron Deep Learning AMI, you can install this functionality into your Neuron environment with `pip install optimum-neuron[neuronx,vllm]`.
Note: Trn2 instances are not supported by the `optimum-neuron` platform yet.
- After launching the instance, follow the instructions in Connect to your instance to connect to the instance.
- Once inside your instance, activate the pre-installed `optimum-neuron` virtual environment by running `source /opt/aws_neuronx_venv_pytorch_2_7/bin/activate`.
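Optionally, you can verify that the instance's Neuron devices are visible before going further, using the Neuron SDK's `neuron-ls` tool, which ships with the AMI:

```bash
# List the Neuron devices and cores available on the instance
neuron-ls
```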
Generating content programmatically
The easiest way to test a model is to use the Python API:
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="unsloth/Llama-3.2-1B-Instruct",
          max_num_seqs=4,
          max_model_len=4096,
          tensor_parallel_size=2)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
Serving a model
The easiest way to serve a model is to use the `optimum-cli`:
```bash
optimum-cli serve \
  --model="unsloth/Llama-3.2-1B-Instruct" \
  --batch_size=4 \
  --sequence_length=4096 \
  --tensor_parallel_size=2 \
  --port=8080
```
You can also launch an OpenAI-compatible inference server directly using the vLLM entry points:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model="unsloth/Llama-3.2-1B-Instruct" \
  --max-num-seqs=4 \
  --max-model-len=4096 \
  --tensor-parallel-size=2 \
  --port=8080
```
Use the following command to test the model:
```bash
curl 127.0.0.1:8080/v1/completions \
  -H 'Content-Type: application/json' \
  -X POST \
  -d '{"prompt":"One of my fondest memory is", "temperature": 0.8, "max_tokens":128}'
```
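Since the server exposes the standard OpenAI-compatible routes, you can also query the chat completions endpoint. The request below is only a sketch: it assumes the served model ships a chat template (true for instruct models such as the one above) and passes the model id explicitly:

```bash
curl 127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -X POST \
  -d '{
        "model": "unsloth/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "temperature": 0.8,
        "max_tokens": 128
      }'
```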