---
license: mit
tags:
- RLinf
language:
- en
metrics:
- accuracy
base_model:
- gen-robot/openvla-7b-rlvla-warmup
pipeline_tag: reinforcement-learning
model-index:
- name: RLinf-openvla-maniskill3-ppo
  results:
  - task:
      type: VLA
    dataset:
      type: maniskill-vision
      name: maniskill-vision
    metrics:
    - type: accuracy
      value: 82.0
  - task:
      type: VLA
    dataset:
      type: maniskill-semantic
      name: maniskill-semantic
    metrics:
    - type: accuracy
      value: 80.6
  - task:
      type: VLA
    dataset:
      type: maniskill-position
      name: maniskill-position
    metrics:
    - type: accuracy
      value: 89.3
---
|
|
|
<div align="center"> |
|
<img src="logo.svg" alt="RLinf-logo" width="500"/> |
|
</div> |
|
|
|
|
|
<div align="center"> |
|
<!-- <a href="TODO"><img src="https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv"></a> --> |
|
<!-- <a href="TODO"><img src="https://img.shields.io/badge/HuggingFace-yellow?logo=huggingface&logoColor=white" alt="Hugging Face"></a> --> |
|
<a href="https://github.com/RLinf/RLinf"><img src="https://img.shields.io/badge/Github-blue"></a> |
|
<a href="https://rlinf.readthedocs.io/en/latest/"><img src="https://img.shields.io/badge/Documentation-Purple?color=8A2BE2&logo=readthedocs"></a> |
|
<!-- <a href="TODO"><img src="https://devin.ai/assets/deepwiki-badge.png" alt="Ask DeepWiki.com" style="height:20px;"></a> |
|
<a href="TODO"><img src="https://img.shields.io/badge/微信-green?logo=wechat&"></a> --> |
|
</div> |
|
|
|
<h1 align="center">RLinf: Reinforcement Learning Infrastructure for Agentic AI</h1> |
|
|
|
[RLinf](https://github.com/RLinf/RLinf) is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development. |
|
|
|
|
|
<div align="center"> |
|
<img src="overview.png" alt="RLinf-overview" width="600"/> |
|
</div> |
|
|
|
## Model Description |
|
This model was trained from the ``gen-robot/openvla-7b-rlvla-warmup`` base model with Proximal Policy Optimization (PPO) in the ManiSkill simulator.
|
|
|
## Full Out-of-Distribution (OOD) Evaluation Results
|
### Overall OOD Eval Results |
|
Note: *rl4vla* refers to the paper "What Can RL Bring to VLA Generalization? An Empirical Study" (VLA-RL-Study).
|
| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | __PPO-openvla (this model)__ | GRPO-openvla |
|---------------|-----------|-----------------|----------------|------------------------------|---------------|
| Avg results | 0.7608 | 0.61484375 | 0.6453125 | **0.822135417** | 0.7546875 |
|
### OOD Eval on Vision |
|
|
|
| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | __PPO-openvla (this model)__ | GRPO-openvla |
|---------------|-----------|-----------------|----------------|------------------------------|---------------|
| vision avg | 0.7656 | 0.846875 | 0.80546875 | **0.8203125** | 0.746875 |
| unseen table | 0.844 | 0.9140625 | 0.9453125 | **0.95703125** | 0.8984375 |
| dynamic texture (weak) | 0.833 | **0.91015625** | 0.82421875 | 0.85546875 | 0.7890625 |
| dynamic texture (strong) | 0.63 | **0.7734375** | 0.625 | 0.72265625 | 0.65625 |
| dynamic noise (weak) | 0.854 | 0.89453125 | **0.8984375** | 0.87109375 | 0.796875 |
| dynamic noise (strong) | 0.667 | **0.7421875** | 0.734375 | 0.6953125 | 0.59375 |
|
|
|
### OOD Eval on Semantic |
|
| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | __PPO-openvla (this model)__ | GRPO-openvla |
|---------------|-----------|-----------------|----------------|------------------------------|---------------|
| object avg | 0.754 | 0.516113281 | 0.56640625 | **0.805664063** | 0.744140625 |
| train setting | 0.938 | 0.94140625 | 0.91796875 | **0.9609375** | 0.84375 |
| unseen objects | 0.714 | 0.8046875 | 0.77734375 | **0.81640625** | 0.765625 |
| unseen receptacles | 0.75 | 0.7421875 | 0.78125 | **0.8125** | 0.734375 |
| unseen instructions | 0.891 | 0.6796875 | 0.68359375 | **0.9453125** | 0.890625 |
| multi-object (both seen) | 0.75 | 0.3515625 | 0.4296875 | **0.84375** | 0.7578125 |
| multi-object (both unseen) | 0.578 | 0.3046875 | 0.38671875 | **0.62890625** | 0.578125 |
| distractive receptacle | 0.812 | 0.1875 | 0.31640625 | **0.828125** | 0.78125 |
| multi-receptacle (both unseen) | 0.599 | 0.1171875 | 0.23828125 | **0.609375** | 0.6015625 |
|
|
|
### OOD Eval on Position |
|
| Description | rl4vla | GRPO-openvlaoft | PPO-openvlaoft | __PPO-openvla (this model)__ | GRPO-openvla |
|---------------|-----------|-----------------|----------------|------------------------------|---------------|
| position avg | 0.776 | 0.4296875 | 0.560546875 | **0.892578125** | 0.81640625 |
| unseen position (object & receptacle) | 0.807 | 0.40234375 | 0.50390625 | **0.86328125** | 0.75 |
| mid-episode object reposition | 0.745 | 0.45703125 | 0.6171875 | **0.921875** | 0.8828125 |
|
|
|
## How to Use |
|
Please integrate the provided model with the [RLinf](https://github.com/RLinf/RLinf) codebase. To do so, modify the following parameters in the configuration file ``examples/embodiment/config/maniskill_ppo_openvla.yaml``: |
|
|
|
- Set ``actor.checkpoint_load_path``, ``actor.tokenizer.tokenizer_model``, and ``rollout.model_dir`` to the path of the model checkpoint. |
|
|
|
Note: If you intend to evaluate the model directly, make sure to set ``actor.model.is_lora`` to ``false``. |
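For reference, a minimal sketch of the relevant fields is shown below. The nesting is inferred from the dotted parameter names above and ``/path/to/RLinf-openvla-maniskill3-ppo`` is a placeholder for your local checkpoint directory; consult the actual ``maniskill_ppo_openvla.yaml`` for the full and authoritative structure.

```yaml
# Sketch of the fields to edit in examples/embodiment/config/maniskill_ppo_openvla.yaml.
# Nesting is assumed from the dotted parameter names; keep the rest of the file unchanged.
actor:
  checkpoint_load_path: /path/to/RLinf-openvla-maniskill3-ppo   # placeholder checkpoint path
  tokenizer:
    tokenizer_model: /path/to/RLinf-openvla-maniskill3-ppo      # same checkpoint directory
  model:
    is_lora: false          # set to false when evaluating the checkpoint directly
rollout:
  model_dir: /path/to/RLinf-openvla-maniskill3-ppo              # placeholder checkpoint path
```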
|
|
|
## License |
|
This code repository and the model weights are licensed under the MIT License. |
|
|