Picking the wrong model costs more than tokens

Every developer who has routed prompts through an AI gateway knows the pattern: you pick a model, send a prompt, get a response. If the model has a blind spot — a weakness in math, reasoning, or a particular domain — that is the only answer you get, and you do not know what you missed.

The standard fix is to throw a frontier model at everything. Fable 5 gets you 65.3% on DRACO deep research, but it costs $15/M tokens and can be banned overnight. Opus 4.8 is cheaper at $10/M but scores 58.8% — a 6.5-point gap for a 33% discount.

OpenRouter Fusion collapses that trade-off. Instead of asking one expert, you ask three in parallel, compare what they say, and synthesize the best of all of them — all in a single API call. A budget panel of Gemini 3 Flash ($0.15/M), Kimi K2.6 ($0.35/M), and DeepSeek V4 Pro ($0.87/M) scored 64.7% — within 1 point of Fable 5 at roughly one-tenth the cost.

This is not a new model. It is a new way to call existing models.


What Fusion actually does

OpenRouter Fusion page showing the model description with Quality and Budget preset labels, alongside the Quick Start API code.

OpenRouter Fusion: one API call, multi-model deliberation.

Fusion is OpenRouter’s implementation of the Mixture-of-Agents architecture (based on the MoA paper from Together AI, June 2024). A multi-model deliberation pipeline runs server-side and returns a single response:

  1. Fan-out. Your prompt is sent in parallel to a panel of models — by default 3–4, with web search and web fetch automatically enabled.
  2. Judge reads. A judge model (you pick which one) reads every response and produces a structured analysis: consensus points, contradictions, partial coverage, unique insights, blind spots.
  3. Final answer. The judge writes the final response grounded in that combined analysis.

It is not a majority vote. It is not averaging outputs. The judge compares, critiques, and synthesizes — and that distinction matters.


The DRACO benchmark that matters

AICodeKing puts Fusion through real-world coding and reasoning tasks — and finds the gap between the benchmark and practical results.

OpenRouter tested Fusion on 100 deep research tasks from the DRACO benchmark — complex questions requiring multi-source synthesis across 10 domains (academic research, finance, law, medicine, tech, UX, product comparisons — the kind of work you do).

ConfigurationScore
Fusion: Fable 5 + GPT-5.5 → Opus 4.8 (judge)69.0%
Fusion: Opus 4.8 + GPT-5.5 + Gemini 3.1 Pro → Opus 4.868.3%
Fusion: Opus 4.8 + GPT-5.5 → Opus 4.867.6%
Fusion: Opus 4.8 + Opus 4.8 → Opus 4.865.5%
Solo: Claude Fable 565.3%
Fusion: Gemini 3 Flash + Kimi K2.6 + DeepSeek V4 Pro → Opus 4.864.7%
Solo: DeepSeek V4 Pro60.3%
Solo: GPT-5.560.0%
Solo: Claude Opus 4.858.8%
Solo: Kimi K2.653.7%
Solo: Gemini 3.1 Pro45.4%
Solo: Gemini 3 Flash43.1%

Two results stand out:

Frontier Fusion beats every individual model. A Fable 5 + GPT-5.5 panel judged by Opus 4.8 hit 69.0% — nearly 4 points above Fable 5 alone. Fusing two frontier models outperformed either one in isolation.

The budget panel is the real story. Three open/mid-range models (Gemini 3 Flash, Kimi K2.6, DeepSeek V4 Pro) scored 64.7%, within 1% of Fable 5 at roughly half the token cost. No single model in that panel cracked 55% alone.

Caveat: DRACO tests deep research — multi-source synthesis, citation quality, breadth. On practical coding tasks (simulators, SVG generation, math, structured output), independent testers found Fusion underperforms a single strong model. The benchmark result is real, but it lives in the research domain, not every domain.


How it works: the API

The simplest path is a drop-in API call with the openrouter/fusion model slug:

import openai

client = openai.OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-..."
)

response = client.chat.completions.create(
    model="openrouter/fusion",
    messages=[
        {"role": "user", "content": "Compare carbon tax vs cap-and-trade. Where do economists disagree?"}
    ]
)

print(response.choices[0].message.content)

Custom panel

Want different models in your panel? Pass a fusion plugin with your own analysis_models and model (judge):

{
  "model": "openrouter/fusion",
  "messages": [
    { "role": "user", "content": "..." }
  ],
  "plugins": [{
    "id": "fusion",
    "model": "~anthropic/claude-opus-latest",
    "analysis_models": [
      "google/gemini-3-flash-preview",
      "moonshotai/kimi-k2.7-code",
      "deepseek/deepseek-v4-pro"
    ]
  }]
}

Server tool

Fusion is also available as an openrouter:fusion server tool, which lets your outer model decide when to invoke multi-model deliberation:

{
  "model": "anthropic/claude-fable-5",
  "tools": [{ "type": "openrouter:fusion" }]
}

The tool’s description tells the model to call fusion only for tasks where multiple perspectives genuinely help — research, expert critique, compare-and-contrast, anything where being wrong is expensive. Set tool_choice: "required" to force it on every request.


Two presets

PresetDefault PanelJudgeBest For
QualityClaude Opus, GPT latest, Gemini Pro latestOuter modelMaximum accuracy, research-heavy work
BudgetGemini 3 Flash, Kimi K2.6, DeepSeek V4 ProOuter modelCost-sensitive, still beats most solo models

Switch with a single param, or override the panel entirely.


When Fusion helps — and when it doesn’t

Reach for Fusion when:

  • Deep research. Multi-source analysis where breadth of perspective catches blind spots.
  • Architecture decisions. A coding model benefits from multiple opinions on system design.
  • Cost-sensitive production. The budget panel beats GPT-5.5 and Opus 4.8 at a fraction of the price.

Skip Fusion when:

  • Latency matters. Fusion runs 2–3× longer than a single model call. Not for chatbots or real-time.
  • Simple deterministic tasks. Extraction, classification, formatting — one model does fine.
  • Structured JSON output. Multiple models in the panel may produce diverging formats that confuse the synthesis.

The self-fusion surprise

The surprise from OpenRouter’s testing: Opus 4.8 fused with itself scored 65.5%, compared to 58.8% solo — a 6.7-point jump from running the same model twice with different reasoning paths and a judge reconciling them.

Much of Fusion’s lift comes from the synthesis step itself, not just model diversity. Running one strong model through two parallel reasoning passes and a comparison beat running it once by a wide margin.


Why this matters for the open-source moment

Fusion arrives the same month GLM-5.2, Kimi K2.7-Code, and Mellum2 all went open-weight. The combination changes the math:

  • Open-weight models provide diverse panel members at low cost (Gemini Flash at $0.15/M, DeepSeek V4 Pro at $0.87/M).
  • Fusion makes that diversity useful without complex orchestration on your side.
  • The budget panel proves you do not need frontier models to get frontier-adjacent results — you just need three moderately good models and a judge that knows how to compare.


Sources


About the author

Charles Jasthyn De La Cueva is a full-stack developer and the founder of Open TechStack. He writes about AI engineering, developer tools, and practical model evaluation — grounded in real workflows, not press releases.