Deploy Large Language Models (LLMs) on Azure AI
This example showcases how to deploy a Large Language Model (LLM) from the Hugging Face Collection in Azure AI Foundry Hub as an Azure ML Managed Online Endpoint. It also showcases how to run inference with both the Azure ML Python SDK and the OpenAI Python SDK, and even how to locally run a Gradio application for chat completion.
Note that this example goes through the programmatic deployment with the Python SDK / Azure CLI. If you’d rather use the one-click deployment experience, please check One-click deployments from the Hugging Face Hub on Azure ML. Keep in mind that when deploying from the Hugging Face Hub, the endpoint + deployment will be created within Azure ML instead of within Azure AI Foundry, whereas this example focuses on Azure AI Foundry Hub deployments (also made available on Azure ML, but not the other way around).
TL;DR Azure AI Foundry provides a unified platform for enterprise AI operations, model builders, and application development. Azure Machine Learning is a cloud service for accelerating and managing the machine learning (ML) project lifecycle.
This example will specifically deploy Qwen/Qwen2.5-32B-Instruct from the Hugging Face Hub (or see it on AzureML or on Azure AI Foundry) as an Azure ML Managed Online Endpoint on Azure AI Foundry Hub.
Qwen2.5 is one of the latest series of Qwen large language models, bringing the following improvements over Qwen2:
- Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to the Qwen team’s specialized expert models in these domains.
- Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots.
- Long-context support for up to 128K tokens, with generation of up to 8K tokens.
- Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.
For more information, make sure to check their model card on the Hugging Face Hub.
Note that you can select any LLM available on the Hugging Face Hub with the “Deploy to AzureML” option enabled, or directly select any of the LLMs available in either the Azure ML or Azure AI Foundry Hub Model Catalog under the “HuggingFace” collection (note that for Azure AI Foundry the Hugging Face Collection will only be available for Hub-based projects).
Pre-requisites
To run the following example, you will need to meet the following pre-requisites. Alternatively, you can also read more about them in the Azure Machine Learning Tutorial: Create resources you need to get started.
- An Azure account with an active subscription.
- The Azure CLI installed and logged in.
- The Azure Machine Learning extension for the Azure CLI.
- An Azure Resource Group.
- A project based on an Azure AI Foundry Hub.
For more information, please go through the steps in Configure Microsoft Azure for Azure AI.
Setup and installation
In this example, the Azure Machine Learning SDK for Python will be used to create the endpoint and the deployment, as well as to invoke the deployed API. Along with it, you will also need to install azure-identity to authenticate with your Azure credentials via Python.
%pip install azure-ai-ml azure-identity --upgrade --quiet
More information at Azure Machine Learning SDK for Python.
Then, for convenience, setting the following environment variables is recommended, as they will be used throughout the example for the Azure ML Client. Make sure to update those values according to your Microsoft Azure account and resources.
%env LOCATION eastus
%env SUBSCRIPTION_ID <YOUR_SUBSCRIPTION_ID>
%env RESOURCE_GROUP <YOUR_RESOURCE_GROUP>
%env AI_FOUNDRY_HUB_PROJECT <YOUR_AI_FOUNDRY_HUB_PROJECT>
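A variable that is unset, or still holds its `<...>` placeholder, may only surface later as a less obvious error, so it can help to check them up front. The following is a minimal sketch and not part of the Azure SDK; `missing_variables` is a hypothetical helper name introduced here for illustration.

```python
import os
from typing import List, Mapping, Sequence

# Names of the environment variables set above via %env
REQUIRED_VARIABLES = ("LOCATION", "SUBSCRIPTION_ID", "RESOURCE_GROUP", "AI_FOUNDRY_HUB_PROJECT")


def missing_variables(names: Sequence[str], environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset, empty, or still hold a <PLACEHOLDER> value."""
    missing = []
    for name in names:
        value = environ.get(name, "").strip()
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


# Prints an empty list when every variable has been set to a real value
print(missing_variables(REQUIRED_VARIABLES))
```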
Finally, you also need to define both the endpoint and deployment names, as those will be used throughout the example too:
Note that endpoint names must be globally unique per region, i.e., even if you don’t have any endpoint with that name running under your subscription, if the name is reserved by another Azure customer, then you won’t be able to use it. Adding a timestamp or a custom identifier is recommended to prevent running into HTTP 400 validation issues when trying to deploy an endpoint with an already locked / reserved name. Also, the endpoint name must be between 3 and 32 characters long.
import os
from uuid import uuid4
os.environ["ENDPOINT_NAME"] = f"qwen-endpoint-{str(uuid4())[:8]}"
os.environ["DEPLOYMENT_NAME"] = f"qwen-deployment-{str(uuid4())[:8]}"
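If you pick a different prefix, the 3 to 32 character limit on endpoint names is easy to exceed without noticing. As a rough sketch, the naming can be wrapped in a small helper that fails early; `unique_name` is a hypothetical function introduced here, and it only checks the length constraint mentioned above, not any other naming rule Azure may enforce.

```python
from uuid import uuid4


def unique_name(prefix: str, suffix_length: int = 8) -> str:
    """Append a random hexadecimal suffix to `prefix` and check the 3-32 character limit."""
    name = f"{prefix}-{uuid4().hex[:suffix_length]}"
    if not 3 <= len(name) <= 32:
        raise ValueError(f"'{name}' has {len(name)} characters, expected between 3 and 32")
    return name


# "qwen-endpoint" (13 characters) + "-" + 8 random characters = 22 characters
print(unique_name("qwen-endpoint"))
```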
Authenticate to Azure ML
First, you need to authenticate into the Azure AI Foundry Hub via Azure ML with the Azure ML Python SDK, which will be used later to deploy Qwen/Qwen2.5-32B-Instruct as an Azure ML Managed Online Endpoint in your Azure AI Foundry Hub.
On standard Azure ML deployments you’d need to create the MLClient using the Azure ML Workspace as the workspace_name, whereas for Azure AI Foundry you need to provide the Azure AI Foundry Hub name as the workspace_name instead, which will deploy the endpoint under Azure AI Foundry too.
import os

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id=os.getenv("SUBSCRIPTION_ID"),
    resource_group_name=os.getenv("RESOURCE_GROUP"),
    workspace_name=os.getenv("AI_FOUNDRY_HUB_PROJECT"),
)
Create and Deploy Azure AI Endpoint
Before creating the Managed Online Endpoint, you need to build the model URI, which is formatted as azureml://registries/HuggingFace/models/<MODEL_ID>/labels/latest, where the MODEL_ID won’t be the Hugging Face Hub ID but rather its name on Azure, as follows:
model_id = "Qwen/Qwen2.5-32B-Instruct"
model_uri = (
    f"azureml://registries/HuggingFace/models/{model_id.replace('/', '-').replace('_', '-').lower()}/labels/latest"
)
model_uri
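To reuse the same mapping for other models, the snippet above can be wrapped in a function. This is only a sketch: `build_model_uri` is a hypothetical helper, its `registry` and `label` parameters are assumptions added for illustration, and a well-formed URI does not guarantee that the model is actually available in the catalog.

```python
def build_model_uri(model_id: str, registry: str = "HuggingFace", label: str = "latest") -> str:
    """Map a Hugging Face Hub ID to a model URI, using the same name mapping as above."""
    # Slashes and underscores become hyphens, and the name is lowercased
    azure_name = model_id.replace("/", "-").replace("_", "-").lower()
    return f"azureml://registries/{registry}/models/{azure_name}/labels/{label}"


print(build_model_uri("Qwen/Qwen2.5-32B-Instruct"))
# azureml://registries/HuggingFace/models/qwen-qwen2.5-32b-instruct/labels/latest
```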
To check if a model from the Hugging Face Hub is available in Azure, you should read about it in Supported Models. If not, you can always Request a model addition in the Hugging Face collection on Azure.
Then you need to create the ManagedOnlineEndpoint via the Azure ML Python SDK as follows.
Every model in the Hugging Face Collection is powered by an efficient inference backend, and each of those can run on a wide variety of instance types (as listed in Supported Hardware). Since the models and inference engines generally require a GPU-accelerated instance, you might need to request a quota increase as per Manage and increase quotas and limits for resources with Azure Machine Learning.
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment

endpoint = ManagedOnlineEndpoint(name=os.getenv("ENDPOINT_NAME"))

deployment = ManagedOnlineDeployment(
    name=os.getenv("DEPLOYMENT_NAME"),
    endpoint_name=os.getenv("ENDPOINT_NAME"),
    model=model_uri,
    instance_type="Standard_NC40ads_H100_v5",
    instance_count=1,
)

client.begin_create_or_update(endpoint).wait()
In Azure AI Foundry the endpoint will only be listed within the “My assets -> Models + endpoints” tab once the deployment is created, not before as in Azure ML where the endpoint is shown even if it doesn’t contain any active or in-progress deployments.
client.online_deployments.begin_create_or_update(deployment).wait()
Note that whilst the Azure AI Endpoint creation is relatively fast, the deployment will take longer since it needs to allocate the resources on Azure. Expect it to take ~10-15 minutes, though it could take longer depending on the instance provisioning and availability.
Once deployed, via either the Azure AI Foundry or the Azure ML Studio you’ll be able to inspect the endpoint details, the real-time logs, how to consume the endpoint, and even use the monitoring feature, which is still in preview. Find more information about it at Azure ML Managed Online Endpoints.
Send requests to the Azure AI Endpoint
Finally, now that the Azure AI Endpoint is deployed, you can send requests to it. In this case, since the task of the model is text-generation (also known as chat-completion), you can either use the default scoring endpoint, /generate, which is the standard text generation endpoint without chat capabilities (such as applying the chat template or exposing an OpenAI-compatible OpenAPI interface), or benefit from the fact that the inference engine serving the model exposes OpenAI-compatible routes such as /v1/chat/completions.
Note that only some of the options are listed below. You can send requests to the deployed endpoint as long as the HTTP requests include the azureml-model-deployment header set to the name of the Azure AI Deployment (not the Endpoint) and the necessary authentication token / key for the given endpoint. With those in place, you can send HTTP requests to all the routes that the backend engine exposes, not only to the scoring route.
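As an illustration of those two requirements, the sketch below builds (but does not send) a plain HTTP request with the Python standard library only. `build_chat_request` is a hypothetical helper, and the URL, key, and deployment name in the usage example are placeholders rather than real values.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(
    base_url: str,
    api_key: str,
    deployment_name: str,
    messages: List[Dict[str, str]],
    max_tokens: int = 128,
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/chat/completions route."""
    body = json.dumps({"messages": messages, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # endpoint key or token
            "Content-Type": "application/json",
            "azureml-model-deployment": deployment_name,  # deployment name, not endpoint name
        },
        method="POST",
    )


request = build_chat_request(
    base_url="https://my-endpoint.eastus.inference.ml.azure.com",  # placeholder URL
    api_key="my-api-key",  # placeholder key
    deployment_name="my-deployment",  # placeholder deployment name
    messages=[{"role": "user", "content": "What is Deep Learning?"}],
)
print(request.get_method(), request.full_url)

# Sending the request requires a live endpoint:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```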
Azure Python SDK
You can invoke the Azure AI Endpoint on the scoring route, in this case /generate (more information about it in the Qwen/Qwen2.5-32B-Instruct page in either AzureML or Azure AI Foundry catalogs), via the Azure Python SDK with the previously instantiated azure.ai.ml.MLClient (or instantiate a new one if working from a different session).
import json
import os
import tempfile

# The SDK reads the payload from a JSON file, so write it to a temporary one first
with tempfile.NamedTemporaryFile(mode="w+", delete=True, suffix=".json") as tmp:
    json.dump({"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 128}}, tmp)
    tmp.flush()  # make sure the payload is on disk before the SDK reads it

    response = client.online_endpoints.invoke(
        endpoint_name=os.getenv("ENDPOINT_NAME"),
        deployment_name=os.getenv("DEPLOYMENT_NAME"),
        request_file=tmp.name,
    )

print(json.loads(response))
Note that the Azure ML Python SDK requires a path to a JSON file when invoking the endpoints, meaning that whatever payload you want to send to the endpoint first needs to be written to a JSON file. This only applies to requests sent via the Azure ML Python SDK.
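If you send several payloads, the write-then-invoke pattern can be factored into a small context manager. The following is a sketch using only the standard library; `payload_file` is a hypothetical helper, not an Azure SDK function. It closes the file before yielding its path, which should also help on platforms where a temporary file that is still open cannot be reopened by name.

```python
import json
import os
import tempfile
from contextlib import contextmanager
from typing import Any, Dict, Iterator


@contextmanager
def payload_file(payload: Dict[str, Any]) -> Iterator[str]:
    """Write `payload` to a temporary JSON file, yield its path, and delete it afterwards."""
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False)
    try:
        json.dump(payload, tmp)
        tmp.close()  # flush and close so that the file can be read by path
        yield tmp.name
    finally:
        tmp.close()  # no-op if already closed
        os.unlink(tmp.name)


with payload_file({"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 128}}) as path:
    with open(path) as f:
        print(f.read())
    # With a live endpoint, `path` would be passed as `request_file=path`
    # to `client.online_endpoints.invoke(...)`
```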
OpenAI Python SDK
Since the inference engine serving the model exposes OpenAI-compatible routes, you can also leverage the OpenAI Python SDK to send requests to the deployed Azure AI Endpoint.
%pip install openai --upgrade --quiet
To use the OpenAI Python SDK with Azure ML Managed Online Endpoints, you first need to retrieve:
- api_url, the base URL of the endpoint, to which the /v1 route is appended (it contains the v1/chat/completions endpoint that the OpenAI Python SDK will send requests to)
- api_key, which is the API Key in Azure AI or the primary key in Azure ML (unless a dedicated Azure ML Token is used instead)
from urllib.parse import urlsplit
api_key = client.online_endpoints.get_keys(os.getenv("ENDPOINT_NAME")).primary_key
url_parts = urlsplit(client.online_endpoints.get(os.getenv("ENDPOINT_NAME")).scoring_uri)
api_url = f"{url_parts.scheme}://{url_parts.netloc}"
Alternatively, you can also build the API URL manually as follows, since the URIs are globally unique per region, meaning that there will only be one endpoint with a given name within the same region:
api_url = f"https://{os.getenv('ENDPOINT_NAME')}.{os.getenv('LOCATION')}.inference.ml.azure.com"
Or just retrieve it from either the Azure AI Foundry or the Azure ML Studio.
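To make the relation between the two approaches concrete, the sketch below derives the base URL (scheme and host, without any route) both from a scoring URI and from the endpoint name and region, and checks that they agree. Both function names are hypothetical, and the scoring URI is a made-up example that assumes the default /generate scoring route.

```python
from urllib.parse import urlsplit


def base_url_from_scoring_uri(scoring_uri: str) -> str:
    """Keep only the scheme and host of the scoring URI, dropping the scoring route."""
    parts = urlsplit(scoring_uri)
    return f"{parts.scheme}://{parts.netloc}"


def base_url_from_name(endpoint_name: str, location: str) -> str:
    """Build the base URL from the endpoint name and its region."""
    return f"https://{endpoint_name}.{location}.inference.ml.azure.com"


# Hypothetical scoring URI, for illustration only
scoring_uri = "https://qwen-endpoint-1a2b3c4d.eastus.inference.ml.azure.com/generate"

assert base_url_from_scoring_uri(scoring_uri) == base_url_from_name("qwen-endpoint-1a2b3c4d", "eastus")

# The OpenAI-compatible routes live under /v1 on top of the base URL
print(f"{base_url_from_scoring_uri(scoring_uri)}/v1")
```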
Then you can use the OpenAI Python SDK normally, making sure to include the extra azureml-model-deployment header that contains the Azure AI / ML Deployment name.
Via the OpenAI Python SDK it can either be set within each call to chat.completions.create via the extra_headers parameter as commented below, or via the default_headers parameter when instantiating the OpenAI client (which is the recommended approach since the header needs to be present on each request, so setting it just once is preferred).
import os

from openai import OpenAI

openai_client = OpenAI(
    base_url=f"{api_url}/v1",
    api_key=api_key,
    default_headers={"azureml-model-deployment": os.getenv("DEPLOYMENT_NAME")},
)

completion = openai_client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct",
    messages=[
        {"role": "system", "content": "You are an assistant that responds like a pirate."},
        {
            "role": "user",
            "content": "What is Deep Learning?",
        },
    ],
    max_tokens=128,
    # extra_headers={"azureml-model-deployment": os.getenv("DEPLOYMENT_NAME")},
)
print(completion)
cURL
Alternatively, you can also just use cURL to send requests to the deployed endpoint, with the api_url and api_key values programmatically retrieved in the OpenAI snippet and now set as environment variables so that cURL can use them, as follows:
os.environ["API_URL"] = api_url
os.environ["API_KEY"] = api_key
!curl -sS $API_URL/v1/chat/completions \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-H "azureml-model-deployment: $DEPLOYMENT_NAME" \
-d '{ \
"messages":[ \
{"role":"system","content":"You are an assistant that replies like a pirate."}, \
{"role":"user","content":"What is Deep Learning?"} \
], \
"max_tokens":128 \
}' | jq
Alternatively, you can also just go to the Azure AI Endpoint in either the Azure AI Foundry under “My assets -> Models + endpoints” or in the Azure ML Studio via “Endpoints”, and retrieve both the URL (note that it will default to the /generate endpoint, but to use the OpenAI-compatible layer you need to use the /v1/chat/completions endpoint instead) and the API Key values, as well as the Azure AI Deployment name for the given model.
Gradio
Gradio is the fastest way to demo your machine learning model with a friendly web interface so that anyone can use it. You can also leverage the OpenAI Python SDK to build a simple ChatInterface that you can use within the Jupyter Notebook cell where you are running it.
Ideally you could deploy the Gradio Chat Interface connected to your Azure ML Managed Online Endpoint as an Azure Container App as described in Tutorial: Build and deploy from source code to Azure Container Apps. If you’d like us to show you how to do it for Gradio in particular, feel free to open an issue requesting it.
%pip install gradio --upgrade --quiet
See below an example of how to leverage Gradio’s ChatInterface, or find more information about it at Gradio ChatInterface Docs.
import os
from typing import Dict, Iterator, List, Literal

import gradio as gr
from openai import OpenAI

openai_client = OpenAI(
    base_url=f"{api_url}/v1",  # the OpenAI-compatible routes live under /v1
    api_key=api_key,
    default_headers={"azureml-model-deployment": os.getenv("DEPLOYMENT_NAME")},
)


def predict(message: str, history: List[Dict[Literal["role", "content"], str]]) -> Iterator[str]:
    # Append the new user message to the conversation so far
    history.append({"role": "user", "content": message})

    stream = openai_client.chat.completions.create(
        model="Qwen/Qwen2.5-32B-Instruct",
        messages=history,
        stream=True,
    )

    # Yield the accumulated text so that the UI updates as tokens arrive
    chunks = []
    for chunk in stream:
        chunks.append(chunk.choices[0].delta.content or "")
        yield "".join(chunks)


demo = gr.ChatInterface(predict, type="messages")
demo.launch()
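The streaming loop in `predict` yields the full text generated so far on every step, rather than only the newest fragment, which is what lets the chat window grow progressively. The sketch below reproduces that accumulation without a live endpoint: `accumulate` and `fake_chunk` are hypothetical helpers, the fake chunks only mimic the `choices[0].delta.content` shape used above, and the guard against empty `choices` is a defensive assumption rather than something the example above requires.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator, Optional


def accumulate(stream: Iterable) -> Iterator[str]:
    """Yield the text generated so far after each streamed chunk."""
    text = ""
    for chunk in stream:
        if not chunk.choices:  # defensive: skip chunks that carry no choices
            continue
        text += chunk.choices[0].delta.content or ""
        yield text


def fake_chunk(content: Optional[str]) -> SimpleNamespace:
    """Build a stand-in object with the same attribute path as a streamed chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=content))])


fake_stream = [fake_chunk("Deep "), fake_chunk(None), fake_chunk("Learning")]
print(list(accumulate(fake_stream)))
# ['Deep ', 'Deep ', 'Deep Learning']
```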
Release resources
Once you are done using the Azure AI Endpoint / Deployment, you can delete the resources as follows, meaning that you will stop paying for the instance on which the model is running and all the attached costs will stop.
client.online_endpoints.begin_delete(name=os.getenv("ENDPOINT_NAME")).result()
Conclusion
Throughout this example you learnt how to create and configure your Azure account for Azure ML and Azure AI Foundry, how to then create a Managed Online Endpoint running an open model from the Hugging Face Collection in the Azure ML / Azure AI Foundry model catalog, how to send inference requests to it afterwards with different alternatives, how to build a simple Gradio chat interface around it, and finally, how to stop and release the resources.
If you have any doubt, issue or question about this example, feel free to open an issue and we’ll do our best to help!
📍 Find the complete example on GitHub here!