Picking the wrong model costs more than tokens
Every developer who has routed prompts through an AI gateway knows the pattern: you pick a model, send a prompt, get a response. If the model has a blind spot — a weakness in math, reasoning, or a particular domain — that is the only answer you get, and you do not know what you missed.
The standard fix is to throw a frontier model at everything. Fable 5 gets you 65.3% on DRACO deep research, but it costs $15/M tokens and can be banned overnight. Opus 4.8 is cheaper at $10/M but scores 58.8% — a 6.5-point gap for a 33% discount.
OpenRouter Fusion collapses that trade-off. Instead of asking one expert, you ask three in parallel, compare what they say, and synthesize the best of all of them — all in a single API call. A budget panel of Gemini 3 Flash ($0.15/M), Kimi K2.6 ($0.35/M), and DeepSeek V4 Pro ($0.87/M) scored 64.7% — within 1 point of Fable 5 at roughly one-tenth the cost.
This is not a new model. It is a new way to call existing models.
What Fusion actually does

OpenRouter Fusion: one API call, multi-model deliberation.
Fusion is OpenRouter’s implementation of the Mixture-of-Agents architecture (based on the MoA paper from Together AI, June 2024). A multi-model deliberation pipeline runs server-side and returns a single response:
- Fan-out. Your prompt is sent in parallel to a panel of models — by default 3–4, with web search and web fetch automatically enabled.
- Judge reads. A judge model (you pick which one) reads every response and produces a structured analysis: consensus points, contradictions, partial coverage, unique insights, blind spots.
- Final answer. The judge writes the final response grounded in that combined analysis.
It is not a majority vote. It is not averaging outputs. The judge compares, critiques, and synthesizes — and that distinction matters.
The DRACO benchmark that matters
AICodeKing puts Fusion through real-world coding and reasoning tasks — and finds the gap between the benchmark and practical results.
OpenRouter tested Fusion on 100 deep research tasks from the DRACO benchmark — complex questions requiring multi-source synthesis across 10 domains (academic research, finance, law, medicine, tech, UX, product comparisons — the kind of work you do).
| Configuration | Score |
|---|---|
| Fusion: Fable 5 + GPT-5.5 → Opus 4.8 (judge) | 69.0% |
| Fusion: Opus 4.8 + GPT-5.5 + Gemini 3.1 Pro → Opus 4.8 | 68.3% |
| Fusion: Opus 4.8 + GPT-5.5 → Opus 4.8 | 67.6% |
| Fusion: Opus 4.8 + Opus 4.8 → Opus 4.8 | 65.5% |
| Solo: Claude Fable 5 | 65.3% |
| Fusion: Gemini 3 Flash + Kimi K2.6 + DeepSeek V4 Pro → Opus 4.8 | 64.7% |
| Solo: DeepSeek V4 Pro | 60.3% |
| Solo: GPT-5.5 | 60.0% |
| Solo: Claude Opus 4.8 | 58.8% |
| Solo: Kimi K2.6 | 53.7% |
| Solo: Gemini 3.1 Pro | 45.4% |
| Solo: Gemini 3 Flash | 43.1% |
Two results stand out:
Frontier Fusion beats every individual model. A Fable 5 + GPT-5.5 panel judged by Opus 4.8 hit 69.0% — nearly 4 points above Fable 5 alone. Fusing two frontier models outperformed either one in isolation.
The budget panel is the real story. Three open/mid-range models (Gemini 3 Flash, Kimi K2.6, DeepSeek V4 Pro) scored 64.7%, within 1% of Fable 5 at roughly half the token cost. No single model in that panel cracked 55% alone.
Caveat: DRACO tests deep research — multi-source synthesis, citation quality, breadth. On practical coding tasks (simulators, SVG generation, math, structured output), independent testers found Fusion underperforms a single strong model. The benchmark result is real, but it lives in the research domain, not every domain.
How it works: the API
The simplest path is a drop-in API call with the openrouter/fusion model slug:
import openai
client = openai.OpenAI(
base_url="https://openrouter.ai/api/v1",
api_key="sk-or-..."
)
response = client.chat.completions.create(
model="openrouter/fusion",
messages=[
{"role": "user", "content": "Compare carbon tax vs cap-and-trade. Where do economists disagree?"}
]
)
print(response.choices[0].message.content)
Custom panel
Want different models in your panel? Pass a fusion plugin with your own analysis_models and model (judge):
{
"model": "openrouter/fusion",
"messages": [
{ "role": "user", "content": "..." }
],
"plugins": [{
"id": "fusion",
"model": "~anthropic/claude-opus-latest",
"analysis_models": [
"google/gemini-3-flash-preview",
"moonshotai/kimi-k2.7-code",
"deepseek/deepseek-v4-pro"
]
}]
}
Server tool
Fusion is also available as an openrouter:fusion server tool, which lets your outer model decide when to invoke multi-model deliberation:
{
"model": "anthropic/claude-fable-5",
"tools": [{ "type": "openrouter:fusion" }]
}
The tool’s description tells the model to call fusion only for tasks where multiple perspectives genuinely help — research, expert critique, compare-and-contrast, anything where being wrong is expensive. Set tool_choice: "required" to force it on every request.
Two presets
| Preset | Default Panel | Judge | Best For |
|---|---|---|---|
| Quality | Claude Opus, GPT latest, Gemini Pro latest | Outer model | Maximum accuracy, research-heavy work |
| Budget | Gemini 3 Flash, Kimi K2.6, DeepSeek V4 Pro | Outer model | Cost-sensitive, still beats most solo models |
Switch with a single param, or override the panel entirely.
When Fusion helps — and when it doesn’t
Reach for Fusion when:
- Deep research. Multi-source analysis where breadth of perspective catches blind spots.
- Architecture decisions. A coding model benefits from multiple opinions on system design.
- Cost-sensitive production. The budget panel beats GPT-5.5 and Opus 4.8 at a fraction of the price.
Skip Fusion when:
- Latency matters. Fusion runs 2–3× longer than a single model call. Not for chatbots or real-time.
- Simple deterministic tasks. Extraction, classification, formatting — one model does fine.
- Structured JSON output. Multiple models in the panel may produce diverging formats that confuse the synthesis.
The self-fusion surprise
The surprise from OpenRouter’s testing: Opus 4.8 fused with itself scored 65.5%, compared to 58.8% solo — a 6.7-point jump from running the same model twice with different reasoning paths and a judge reconciling them.
Much of Fusion’s lift comes from the synthesis step itself, not just model diversity. Running one strong model through two parallel reasoning passes and a comparison beat running it once by a wide margin.
Why this matters for the open-source moment
Fusion arrives the same month GLM-5.2, Kimi K2.7-Code, and Mellum2 all went open-weight. The combination changes the math:
- Open-weight models provide diverse panel members at low cost (Gemini Flash at $0.15/M, DeepSeek V4 Pro at $0.87/M).
- Fusion makes that diversity useful without complex orchestration on your side.
- The budget panel proves you do not need frontier models to get frontier-adjacent results — you just need three moderately good models and a judge that knows how to compare.
Related reading
- Multi-Provider AI Gateways: Fallback Routing, Cost Optimization, and Provider Diversity — How to route between providers when one goes down
- The Fable 5 Incident: Multi-Agent Jailbreaks and Sovereign Safety — Why open-weight redundancy matters
- GLM-5.2 vs Kimi K2.7-Code: Which Open-Weight Coding Model Should You Use? — The two models that make budget Fusion panels viable
Sources
- OpenRouter Blog — Surpassing Frontier Performance with Fusion
- OpenRouter Docs — Fusion Server Tool
- OpenRouter — Fusion Model Page
- MindStudio — What Is OpenRouter Fusion?
- Pulse 2.0 — OpenRouter Introduces Fusion
- Together AI — Mixture-of-Agents (arxiv:2406.04692)
- DRACO Benchmark (arxiv:2602.11685)
About the author
Charles Jasthyn De La Cueva is a full-stack developer and the founder of Open TechStack. He writes about AI engineering, developer tools, and practical model evaluation — grounded in real workflows, not press releases.