Pair a vision grounding model with a reasoning LLM using Cua
Cua just shipped v0.4 of the Cua Agent framework with Composite Agents - you can now pair a vision/grounding model with a reasoning LLM using a simple modelA+modelB syntax. Best clicks + best plans.
The problem: every GUI model speaks a different dialect.
• some want pixel coordinates
• others want percentages
• a few spit out cursed tokens like <|loc095|>
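To illustrate the dialect problem, here's a minimal sketch of the kind of normalization a universal interface has to do. The format patterns and the helper below are hypothetical, written for illustration — they are not Cua's actual internals:

```python
# Hypothetical sketch: normalizing different model "dialects" to pixel coords.
# The three formats and this helper are illustrative, not Cua's real code.
import re

def to_pixels(output: str, width: int, height: int) -> tuple[int, int]:
    """Convert a model's click output to absolute pixel coordinates."""
    # Dialect 1: special location tokens like <|loc095|> (assume a 0-999 grid)
    m = re.fullmatch(r"<\|loc(\d+)\|><\|loc(\d+)\|>", output)
    if m:
        x, y = int(m.group(1)), int(m.group(2))
        return round(x / 999 * width), round(y / 999 * height)
    # Dialect 2: percentages like "50%, 25%"
    if "%" in output:
        x, y = (float(p.strip().rstrip("%")) for p in output.split(","))
        return round(x / 100 * width), round(y / 100 * height)
    # Dialect 3: raw pixel coordinates like "640, 180"
    x, y = (int(p.strip()) for p in output.split(","))
    return x, y

print(to_pixels("50%, 25%", 1280, 720))  # → (640, 180)
```

The point of an interface like Cua's is that the agent loop never sees these dialects — every model's output arrives pre-normalized.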
We built a universal interface that works the same across Anthropic, OpenAI, Hugging Face, etc.:
agent = ComputerAgent(
    model="anthropic/claude-3-5-sonnet-20241022",
    tools=[computer]
)
But here’s the fun part: you can combine models by specialization.
Grounding model (sees + clicks) + Planning model (reasons + decides) →
agent = ComputerAgent(
    model="huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-4o",
    tools=[computer]
)
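The composite syntax is just two model identifiers joined by "+". A rough sketch of how such a string could be split into roles — the role names and parsing logic here are assumptions for illustration, not Cua's source code:

```python
# Illustrative sketch of parsing a "grounding+planning" composite model string.
# The role names and this parsing are assumptions, not Cua's actual internals.

def split_composite(model: str) -> dict[str, str]:
    """Split "modelA+modelB" into grounding (clicks) and planning (reasoning)."""
    grounding, _, planning = model.partition("+")
    if not planning:
        # A single model handles both seeing/clicking and reasoning.
        return {"grounding": model, "planning": model}
    return {"grounding": grounding, "planning": planning}

roles = split_composite("huggingface-local/HelloKKMe/GTA1-7B+openai/gpt-4o")
print(roles["grounding"])  # → huggingface-local/HelloKKMe/GTA1-7B
print(roles["planning"])   # → openai/gpt-4o
```

With the roles separated, the framework can route screenshots and click actions to the grounding model while the planning model drives the high-level loop.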
This gives GUI skills to models that were never built for computer use. One handles the eyes/hands, the other the brain. Think driver + navigator working together.
Two specialists beat one generalist. We’ve got a ready-to-run notebook demo - curious what combos you all will try.
GitHub: https://github.com/trycua/cua
Blog: https://www.trycua.com/blog/composite-agents