
Training using Jobs

Jobs lets you run training scripts on fully managed infrastructure (no need to handle GPUs, dependencies, or environment setup locally). This makes it easy to scale and monitor your experiments directly from the Hub.

In this guide, you’ll learn how to:

  • Run TRL training scripts using Jobs.
  • Configure hardware, timeouts, environment variables, and secrets.
  • Monitor and manage jobs from the CLI or Python.

When a model is trained using TRL + Jobs, a tag is automatically added to the model card.
You can explore models trained with this method on the Hugging Face Hub.

Requirements

To use Jobs, you need a Hugging Face account on a paid plan (Pro, Team, or Enterprise) and a valid authentication token, for example obtained via hf auth login, so that it can be passed to the job as the HF_TOKEN secret.

Preparing your Script

You can launch Jobs using either the hf jobs CLI or the Python API. A convenient option is to use UV scripts, which package all dependencies directly in a single Python file. You can run them like this:

hf jobs uv run --flavor a100-large --secrets HF_TOKEN "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" --model_name_or_path Qwen/Qwen2-0.5B --dataset_name trl-lib/Capybara
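The same job can be launched from Python. Here's a minimal sketch, assuming the Jobs API (run_uv_job) available in recent huggingface_hub releases:

import os
from huggingface_hub import run_uv_job

# Run the remote UV script on an a100-large instance and forward
# HF_TOKEN to the job as a secret.
job = run_uv_job(
    "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py",
    script_args=["--model_name_or_path", "Qwen/Qwen2-0.5B", "--dataset_name", "trl-lib/Capybara"],
    flavor="a100-large",
    secrets={"HF_TOKEN": os.environ["HF_TOKEN"]},
)
print(job.id)  # keep the id around to inspect or cancel the job later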

The script can also be a local file:

hf jobs uv run --flavor a100-large --secrets HF_TOKEN trl/scripts/sft.py --model_name_or_path Qwen/Qwen2-0.5B --dataset_name trl-lib/Capybara

Jobs run inside a Docker image (from Hugging Face Spaces or Docker Hub), which you can also specify with the --image option:

hf jobs uv run --flavor a100-large --secrets HF_TOKEN --image <docker-image> trl/scripts/sft.py --model_name_or_path Qwen/Qwen2-0.5B --dataset_name trl-lib/Capybara

You can also run jobs without UV:


In this case, we pass the Docker image to the CLI and run the command directly:

hf jobs run --flavor a100-large pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel python -c "import torch; print(torch.cuda.get_device_name())"
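The Python equivalent (again a sketch, assuming huggingface_hub's run_job):

from huggingface_hub import run_job

# Run an arbitrary command inside a Docker image on managed hardware.
job = run_job(
    image="pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
    command=["python", "-c", "import torch; print(torch.cuda.get_device_name())"],
    flavor="a100-large",
)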

Adding Dependencies with UV

All example scripts in TRL are compatible with uv, allowing seamless execution with Jobs. You can check the full list of examples in Maintained examples.

Dependencies are specified at the top of the script using this structure:

# /// script
# dependencies = [
#     "trl @ git+https://github.com/huggingface/trl.git",
#     "peft",
# ]
# ///

When you run the UV script, these dependencies are automatically installed. In the example above, trl and peft would be installed before the script runs.

You can also provide dependencies directly in the uv run command:


Using the --with flag:

hf jobs uv run \
    --flavor a100-large \
    --secrets HF_TOKEN \
    --with transformers \
    --with torch \
    "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" \
    --model_name_or_path Qwen/Qwen2-0.5B  \
    --dataset_name trl-lib/Capybara
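From Python, extra dependencies can be requested the same way (the dependencies parameter of run_uv_job is an assumption based on recent huggingface_hub releases):

import os
from huggingface_hub import run_uv_job

# Extra packages are installed before the script runs, on top of any
# dependencies declared in the script header.
job = run_uv_job(
    "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py",
    script_args=["--model_name_or_path", "Qwen/Qwen2-0.5B", "--dataset_name", "trl-lib/Capybara"],
    dependencies=["transformers", "torch"],
    flavor="a100-large",
    secrets={"HF_TOKEN": os.environ["HF_TOKEN"]},
)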

Hardware and Timeout Settings

Jobs lets you select a specific hardware configuration using the --flavor flag. As of August 2025, the available options are:

  • CPU: cpu-basic, cpu-upgrade
  • GPU: t4-small, t4-medium, l4x1, l4x4, a10g-small, a10g-large, a10g-largex2, a10g-largex4, a100-large
  • TPU: v5e-1x1, v5e-2x2, v5e-2x4

You can always check the latest list of supported hardware flavors in the Spaces config reference.

By default, jobs have a 30-minute timeout, after which they will automatically stop. For long-running tasks like training, you can increase the timeout as needed. Supported time units are:

  • s: seconds
  • m: minutes
  • h: hours
  • d: days

Example with a 2-hour timeout:


Using the --timeout flag:

hf jobs uv run \
    --timeout 2h \
    --flavor a100-large \
    --secrets HF_TOKEN \
    --with transformers \
    --with torch \
    "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" \
    --model_name_or_path Qwen/Qwen2-0.5B  \
    --dataset_name trl-lib/Capybara
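In Python, the timeout is a parameter of the job call (a sketch; passing "2h" as a string mirrors the CLI and is an assumption about run_uv_job's timeout argument):

import os
from huggingface_hub import run_uv_job

# Give the training run up to 2 hours before it is stopped automatically.
job = run_uv_job(
    "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py",
    script_args=["--model_name_or_path", "Qwen/Qwen2-0.5B", "--dataset_name", "trl-lib/Capybara"],
    flavor="a100-large",
    timeout="2h",
    secrets={"HF_TOKEN": os.environ["HF_TOKEN"]},
)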

Environment Variables, Secrets, and Token

You can pass environment variables, secrets, and your auth token to your jobs.


Using the --env, --secrets, and/or --token options. Note that these options must come before the script path; anything after it is forwarded to the script as arguments.

hf jobs uv run \
    --flavor a100-large \
    --env FOO=foo \
    --env BAR=bar \
    --secrets HF_TOKEN \
    --secrets MY_SECRET=password \
    --token hf... \
    trl/scripts/sft.py
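In the Python API, environment variables and secrets are plain dictionaries, and the token is a keyword argument (a sketch assuming run_uv_job's env, secrets, and token parameters):

import os
from huggingface_hub import run_uv_job

job = run_uv_job(
    "trl/scripts/sft.py",
    flavor="a100-large",
    env={"FOO": "foo", "BAR": "bar"},  # visible to the job as environment variables
    secrets={"HF_TOKEN": os.environ["HF_TOKEN"], "MY_SECRET": "password"},
    token="hf...",  # token used to authenticate the request itself
)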

Training and Evaluating a Model with Jobs

TRL example scripts are fully UV-compatible, allowing you to run a complete training workflow directly on Jobs. You can customize the training by providing the usual script arguments, along with hardware specifications and secrets.

To evaluate your training runs, in addition to reviewing the job logs, you can use Trackio, a lightweight experiment tracking library. Trackio enables end-to-end experiment management on the Hugging Face Hub. All TRL example scripts already support reporting to Trackio via the report_to argument. Using this feature saves your experiments in an interactive HF Space, making it easy to monitor metrics, compare runs, and track progress over time.

hf jobs uv run \
    --flavor a100-large \
    --secrets HF_TOKEN \
    "trl/scripts/sft.py" \
    --model_name_or_path Qwen/Qwen2-0.5B \
    --dataset_name trl-lib/Capybara \
    --learning_rate 2.0e-5 \
    --num_train_epochs 1 \
    --packing \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --eos_token '<|im_end|>' \
    --eval_strategy steps \
    --eval_steps 100 \
    --output_dir Qwen2-0.5B-SFT \
    --report_to trackio \
    --push_to_hub
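The same run expressed with the Python API (a sketch assuming run_uv_job, as above):

import os
from huggingface_hub import run_uv_job

job = run_uv_job(
    "trl/scripts/sft.py",
    script_args=[
        "--model_name_or_path", "Qwen/Qwen2-0.5B",
        "--dataset_name", "trl-lib/Capybara",
        "--learning_rate", "2.0e-5",
        "--num_train_epochs", "1",
        "--packing",
        "--per_device_train_batch_size", "2",
        "--gradient_accumulation_steps", "8",
        "--eos_token", "<|im_end|>",
        "--eval_strategy", "steps",
        "--eval_steps", "100",
        "--output_dir", "Qwen2-0.5B-SFT",
        "--report_to", "trackio",
        "--push_to_hub",
    ],
    flavor="a100-large",
    secrets={"HF_TOKEN": os.environ["HF_TOKEN"]},
)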

Monitoring and Managing Jobs

After launching a job, you can track its progress on the Jobs page. Additionally, Jobs provides CLI and Python commands to check status, view logs, or cancel a job.

# List your jobs
hf jobs ps -a

# List your running jobs
hf jobs ps 

# Inspect the status of a job
hf jobs inspect job_id

# View logs from a job
hf jobs logs job_id

# Cancel a job
hf jobs cancel job_id
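The Python API exposes the same operations (a sketch assuming huggingface_hub's list_jobs, inspect_job, fetch_job_logs, and cancel_job):

from huggingface_hub import cancel_job, fetch_job_logs, inspect_job, list_jobs

# List your jobs and grab the id of one of them
jobs = list_jobs()
job_id = jobs[0].id

# Inspect the status of a job
print(inspect_job(job_id=job_id))

# View logs from a job
for log in fetch_job_logs(job_id=job_id):
    print(log)

# Cancel a job
cancel_job(job_id=job_id)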

Best Practices and Tips

  • Choose hardware that fits the size of your model and dataset for optimal performance.
  • Training jobs can be long-running. Consider increasing the default timeout.
  • Reuse training and evaluation scripts whenever possible to streamline workflows.