Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings
Abstract
A multi-agent system using an omni-modal evaluation metric improves JavaScript game and animation generation but struggles to leverage custom assets and audio-visual feedback.
While AI excels at generating text, audio, images, and videos, creating interactive audio-visual content such as video games remains challenging. Current LLMs can generate JavaScript games and animations, but the field lacks automated evaluation metrics, and models struggle with complex content that normally requires teams of humans working for many months in multi-shot, multi-agent workflows, using assets made by artists. To tackle these issues, we built a new metric and a multi-agent system. We propose AVR-Eval, a relative metric for multimedia content quality based on Audio-Visual Recordings (AVRs). An omni-modal model (processing text, video, and audio) compares the AVRs of two pieces of content, and a text model reviews the evaluations to determine which is superior. We show that AVR-Eval reliably distinguishes good content from broken or mismatched content. We built AVR-Agent, a multi-agent system that generates JavaScript code from a bank of multimedia assets (audio, images, 3D models). The coding agent selects relevant assets, generates multiple initial code versions, uses AVR-Eval to identify the best version, and iteratively improves it using omni-modal agent feedback derived from the AVR. We run experiments on games and animations, evaluated with AVR-Eval as the win rate of content A against content B. We find that content generated by AVR-Agent has a significantly higher win rate than content made through one-shot generation. However, models struggle to leverage custom assets and AVR feedback effectively, showing no higher win rate when given them. This reveals a critical gap: while humans benefit from high-quality assets and audio-visual feedback, current coding models do not seem to utilize these resources as effectively, highlighting fundamental differences between human and machine approaches to content creation.
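Concretely, the generate-evaluate-refine loop described above can be pictured roughly as follows. This is a minimal, hypothetical sketch based only on the abstract: every helper name (`record_avr`, `omni_modal_compare`, `omni_modal_feedback`, `generate_code`, `revise_code`) and the default candidate/round counts are illustrative assumptions, not the authors' actual interfaces or implementation.

```python
# Hypothetical sketch of the AVR-Agent loop, with AVR-Eval used as a pairwise judge.
# All helpers below are illustrative stand-ins for the paper's components.
from dataclasses import dataclass


@dataclass
class Candidate:
    code: str       # JavaScript source for the game/animation
    avr_path: str   # path to its audio-visual recording (video + audio)


def record_avr(code: str) -> str:
    """Run the JavaScript content and capture an audio-visual recording (AVR)."""
    raise NotImplementedError  # placeholder, e.g. headless browser + A/V capture


def omni_modal_compare(avr_a: str, avr_b: str) -> bool:
    """AVR-Eval sketch: an omni-modal model compares two AVRs; a text model
    reviews the evaluations and returns True if A is judged superior to B."""
    raise NotImplementedError  # placeholder for the model calls


def omni_modal_feedback(avr: str) -> str:
    """Ask the omni-modal agent for textual feedback on a single recording."""
    raise NotImplementedError


def generate_code(prompt: str, assets: list[str]) -> str:
    """Coding agent drafts JavaScript using the selected multimedia assets."""
    raise NotImplementedError


def revise_code(code: str, feedback: str) -> str:
    """Coding agent revises the JavaScript given the omni-modal feedback."""
    raise NotImplementedError


def avr_agent(prompt: str, asset_bank: list[str],
              n_initial: int = 4, n_rounds: int = 3) -> str:
    # Asset selection step (the paper's agent picks relevant assets from the bank).
    assets = asset_bank

    # 1. Generate several initial candidates and keep the one AVR-Eval prefers.
    candidates = []
    for _ in range(n_initial):
        code = generate_code(prompt, assets)
        candidates.append(Candidate(code, record_avr(code)))
    best = candidates[0]
    for cand in candidates[1:]:
        if omni_modal_compare(cand.avr_path, best.avr_path):
            best = cand

    # 2. Iteratively improve the best candidate using feedback from its AVR.
    for _ in range(n_rounds):
        feedback = omni_modal_feedback(best.avr_path)
        new_code = revise_code(best.code, feedback)
        new_cand = Candidate(new_code, record_avr(new_code))
        if omni_modal_compare(new_cand.avr_path, best.avr_path):
            best = new_cand
    return best.code
```

The key design point the abstract implies is that evaluation is relative rather than absolute: candidates are only ever compared pairwise via their recordings, both when selecting among initial generations and when deciding whether a revision actually improved the content.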
Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- PresentAgent: Multimodal Agent for Presentation Video Generation (2025)
- AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation (2025)
- UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025)
- Fly, Fail, Fix: Iterative Game Repair with Reinforcement Learning and Large Multimodal Models (2025)
- ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing (2025)
- Audio-Sync Video Generation with Multi-Stream Temporal Control (2025)
- video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models (2025)