optimum-neuron plugin for vLLM

The optimum-neuron package includes a vLLM plugin that registers an "optimum-neuron" vLLM platform, specifically designed to ease the deployment of models hosted on the Hugging Face hub to AWS Trainium and Inferentia.

This platform supports two modes of operation:

  • it can run inference on pre-exported Neuron models fetched directly from the hub,
  • it also allows the simplified deployment of vanilla models, without recompilation, using cached artifacts.

Notes

  • only a relevant subset of all possible configurations for a given model is cached,
  • you can use the optimum-cli to list all cached configurations for each model (see the example below),
  • to deploy models that are not cached on the Hugging Face hub, you need to export them beforehand.
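
For example, the commands below sketch both steps. The model id and flag values are placeholders, and the exact option names may differ between optimum-neuron versions, so check optimum-cli neuron cache lookup --help and optimum-cli export neuron --help:

# List the cached Neuron configurations for a model
optimum-cli neuron cache lookup unsloth/Llama-3.2-1B-Instruct

# Export a model beforehand if the configuration you need is not cached
optimum-cli export neuron \
    --model unsloth/Llama-3.2-1B-Instruct \
    --batch_size 4 \
    --sequence_length 4096 \
    --num_cores 2 \
    --auto_cast_type bf16 \
    llama-3.2-1b-neuron/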

Setup

The easiest way to use the optimum-neuron vLLM platform is to launch an Amazon EC2 instance using the Hugging Face Neuron Deep Learning AMI. If you prefer not to use the Hugging Face Neuron Deep Learning AMI, you can install this functionality into your existing Neuron environment with pip install optimum-neuron[neuronx,vllm].

Note: Trn2 instances are not supported by the optimum-neuron platform yet.

  • After launching the instance, follow the instructions in Connect to your instance to connect to it,
  • Once inside your instance, activate the pre-installed optimum-neuron virtual environment by running:
source /opt/aws_neuronx_venv_pytorch_2_7/bin/activate
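
Optionally, you can check that the Neuron devices are visible before going further. This assumes the AWS Neuron system tools (aws-neuronx-tools) are installed, as they are on the Hugging Face Neuron Deep Learning AMI:

# List the available Neuron devices on the instance
neuron-ls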

Generating content programmatically

The easiest way to test a model is to use the Python API:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# These sizes should correspond to a cached (or pre-exported) Neuron
# configuration for the model, so that no recompilation is required.
llm = LLM(model="unsloth/Llama-3.2-1B-Instruct",
          max_num_seqs=4,
          max_model_len=4096,
          tensor_parallel_size=2)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Serving a model

The easiest way to serve a model is to use the optimum-cli:

optimum-cli serve \
    --model="unsloth/Llama-3.2-1B-Instruct" \
    --batch_size=4 \
    --sequence_length=4096 \
    --tensor_parallel_size=2 \
    --port=8080

You can also launch an OpenAI-compatible inference server directly using the vLLM entry points:

python -m vllm.entrypoints.openai.api_server \
    --model="unsloth/Llama-3.2-1B-Instruct" \
    --max-num-seqs=4 \
    --max-model-len=4096 \
    --tensor-parallel-size=2 \
    --port=8080

Use the following command to test the model:

curl 127.0.0.1:8080/v1/completions \
    -H 'Content-Type: application/json' \
    -X POST \
    -d '{"prompt":"One of my fondest memory is", "temperature": 0.8, "max_tokens":128}'