How Do You Design a Multi-Agent AI System That Actually Converges?

Kaisen Yao. MIDS. Class of 2026 | Why a supervisor architecture can be better than a swarm

A technical reference for this project is available at Fincom.

When does adding more AI agents make a system smarter, and when does it make it collapse into endless disagreement?

Many teams assume that multi-agent systems are inherently superior. More agents. More perspectives. More debate. Surely that must mean better reasoning.

In our capstone project, we decided to test that assumption. We built Agent-Z, a multi-agent AI “investment committee” composed of specialized agents for research, quantitative analysis, and risk management. The intuition was straightforward: finance is inherently multi-disciplinary, so an AI system should mirror that structure.

It sounded elegant. It turned out to be wrong.

Multi-agent systems do not fail because the agents are weak. They fail because governance is missing.

This post explains what we learned about designing multi-agent debate systems, why swarm-style architectures often collapse in practice, and how structural constraints, especially a Supervisor layer, can transform chaos into something reliable.

1. The Seduction Of Swarm Intelligence#

The fully connected “swarm” architecture sounds elegant.

Every agent can:

Talk to every other agent
Critique reasoning
Propose revisions
Negotiate consensus

In theory, disagreement sharpens reasoning. In practice, disagreement compounds uncertainty.

Here is what tends to happen in unconstrained debate systems:

A research agent emphasizes breadth and context.
A quant agent emphasizes numerical precision.
A risk agent emphasizes downside exposure.

Each critiques the others based on different objectives. Without explicit arbitration or termination rules, the debate continues:

Agents respond to critiques.
New arguments introduce new uncertainties.
Revisions create additional disagreement.

The system does not converge. It drifts.

What looks like “more thinking” becomes:

Longer responses
Higher latency
Higher cost
Lower reproducibility

The key realization was this: debate is not self-regulating.

You can think of swarm-style systems as a feedback loop without damping:

more disagreement
  -> more revision
  -> more uncertainty
  -> more disagreement

Without structural damping, such as arbitration, turn limits, and termination rules, the system behaves like an unstable control loop: instead of converging, it drifts or oscillates. Multi-agent systems require governance.
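To make the control-system analogy concrete, here is a toy numerical sketch. It is not a model of Agent-Z: the linear update rule and the `gain` and `damping` parameters are illustrative assumptions, chosen only to show how a loop with no damping grows while a damped one settles.

```python
def simulate_disagreement(gain: float, damping: float,
                          rounds: int = 10, start: float = 1.0) -> list[float]:
    """Toy model of debate rounds (an assumption, not a measured process).

    Each round, disagreement is amplified by `gain` (critiques trigger
    revisions) and reduced by a `damping` fraction (arbitration).
    """
    level = start
    history = [level]
    for _ in range(rounds):
        level = level * gain * (1.0 - damping)
        history.append(level)
    return history


# No arbitration: every round amplifies disagreement.
swarm = simulate_disagreement(gain=1.3, damping=0.0)
# Arbitration removes half of the disagreement each round.
supervised = simulate_disagreement(gain=1.3, damping=0.5)

print(f"swarm, final level:      {swarm[-1]:.2f}")
print(f"supervised, final level: {supervised[-1]:.4f}")
```

The point of the sketch is only the qualitative behavior: when the per-round factor `gain * (1 - damping)` is above 1 the level grows, and when it is below 1 the level decays toward zero.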

2. Why Supervisor Architectures Often Work Better#

We found that a Supervisor-based architecture, where a central coordinating agent routes tasks and synthesizes outputs, produces more stable results than a fully decentralized swarm.

This is not about hierarchy for its own sake. It is about structure.

Explicit Task Decomposition#

Instead of free-form debate, tasks are decomposed into:

Research
Quantitative analysis
Risk evaluation

Each agent handles a bounded responsibility.

Centralized Synthesis#

Rather than hoping consensus emerges, the Supervisor:

Aggregates outputs
Resolves conflicts
Produces a unified final answer

Controlled Termination#

The Supervisor decides when the system is “done.” Without this role, termination becomes ambiguous, and ambiguous termination is the root cause of endless loops.

Multi-agent design is not about removing structure. It is about structured delegation.

3. A Minimal Implementation Blueprint#

If you were to implement a convergent multi-agent system from scratch, the minimal architecture would include:

A task decomposer
Specialized agents with bounded roles
A centralized synthesizer
An explicit termination condition

A simplified loop might look like this:

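def run_committee(supervisor, specialists, user_query, max_rounds=3):
    # Sketch only: `specialists` is assumed to map a role name to an agent,
    # and each subtask is assumed to expose a `role` attribute.
    subtasks = supervisor.decompose(user_query)

    # The hard cap on rounds guarantees that the loop terminates.
    for _ in range(max_rounds):
        for task in subtasks:
            output = specialists[task.role].run(task)
            supervisor.collect(output)

        if supervisor.is_sufficient():
            break
        # Assumed here: refine() returns only the follow-up subtasks.
        subtasks = supervisor.refine()

    return supervisor.synthesize()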

The crucial distinction from swarm systems:

Agents do not negotiate freely with each other.
All coordination is mediated.
That mediation is what stabilizes the system.
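The loop above leaves the supervisor and specialist objects undefined. The following is one hypothetical way to fill them in so the control flow can be run end to end. Everything here is an assumption made for illustration: the `Task` and `Finding` types, the `StubSpecialist` class that returns canned text instead of calling a model, the "every role has reported" sufficiency rule, and the example query.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    role: str      # which specialist should handle this subtask
    prompt: str


@dataclass
class Finding:
    role: str
    text: str


class StubSpecialist:
    """Stand-in for an LLM-backed agent; returns a canned finding."""

    def __init__(self, role: str) -> None:
        self.role = role

    def run(self, task: Task) -> Finding:
        return Finding(self.role, f"[{self.role}] findings for: {task.prompt}")


@dataclass
class Supervisor:
    roles: tuple = ("research", "quant", "risk")
    query: str = ""
    findings: dict = field(default_factory=dict)

    def decompose(self, query: str) -> list:
        self.query = query
        return [Task(role, query) for role in self.roles]

    def collect(self, finding: Finding) -> None:
        self.findings[finding.role] = finding.text

    def is_sufficient(self) -> bool:
        # Toy rule: done once every role has reported.
        return all(role in self.findings for role in self.roles)

    def refine(self) -> list:
        # Re-issue only the subtasks that are still unanswered.
        return [Task(r, self.query) for r in self.roles if r not in self.findings]

    def synthesize(self) -> str:
        return "\n".join(self.findings[r] for r in self.roles if r in self.findings)


def run_committee(supervisor, specialists, user_query, max_rounds=3):
    subtasks = supervisor.decompose(user_query)
    for _ in range(max_rounds):
        for task in subtasks:
            # Specialists never message each other; all output goes to the supervisor.
            supervisor.collect(specialists[task.role].run(task))
        if supervisor.is_sufficient():
            break
        subtasks = supervisor.refine()
    return supervisor.synthesize()


specialists = {role: StubSpecialist(role) for role in ("research", "quant", "risk")}
report = run_committee(Supervisor(), specialists, "Is ACME overvalued?")
print(report)
```

Note that no specialist holds a reference to another specialist. The only shared state is what the supervisor has collected, which is what "all coordination is mediated" means in code.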

4. Guardrails: How To Prevent Endless Debate Loops#

After experiencing unresolved disagreement in swarm-style setups, we identified several structural guardrails that are essential for multi-agent debate systems.

Hard Constraints#

Maximum number of debate turns
Maximum tool calls
Bounded message length

These caps prevent runaway loops. Without them, systems often trade coherence for verbosity.
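These three caps can be enforced by a small budget object that every agent call passes through. This is a minimal sketch, not the Agent-Z implementation: the class name `DebateBudget`, the exception `BudgetExceeded`, and the default limits are all hypothetical, and real limits would need tuning per task.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a hard cap is hit, so the caller can stop and synthesize."""


@dataclass
class DebateBudget:
    max_turns: int = 6             # illustrative default, not a tuned value
    max_tool_calls: int = 20       # illustrative default
    max_message_chars: int = 2000  # illustrative default
    turns: int = 0
    tool_calls: int = 0

    def start_turn(self) -> None:
        if self.turns >= self.max_turns:
            raise BudgetExceeded("turn limit reached")
        self.turns += 1

    def record_tool_call(self) -> None:
        if self.tool_calls >= self.max_tool_calls:
            raise BudgetExceeded("tool-call limit reached")
        self.tool_calls += 1

    def clip(self, message: str) -> str:
        # Bound message length by truncation.
        return message[: self.max_message_chars]


budget = DebateBudget(max_turns=2, max_message_chars=10)
completed = 0
try:
    while True:  # an unbounded debate loop, stopped only by the budget
        budget.start_turn()
        completed += 1
except BudgetExceeded as exc:
    print(f"stopped after {completed} turns: {exc}")
```

Raising an exception rather than returning a flag is a design choice: it makes it hard for a caller to ignore the cap by accident.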

Structured Workflow Loops#

Instead of allowing arbitrary peer-to-peer interaction, we introduced more constrained workflows.

For example:

Research produces context.
Quant analyzes numerical patterns.
Risk evaluates downside exposure.
Supervisor synthesizes.

Debate can still exist, but only within a controlled sequence. Because every run follows the same order of steps, outputs vary much less from run to run.
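The sequence above can be sketched as a fixed pipeline in which each stage reads what earlier stages wrote and adds its own contribution. The stage functions below are hypothetical stubs that return placeholder strings; in a real system each would wrap a model call.

```python
from typing import Callable

# A stage reads the shared state and returns its own contribution.
Stage = Callable[[dict], str]


def research(state: dict) -> str:
    return f"context for '{state['query']}'"


def quant(state: dict) -> str:
    return f"numerical patterns given ({state['research']})"


def risk(state: dict) -> str:
    return f"downside exposure given ({state['quant']})"


def supervisor(state: dict) -> str:
    return " | ".join(state[name] for name in ("research", "quant", "risk"))


# The order is fixed in one place, so every run follows the same path.
PIPELINE: list[tuple[str, Stage]] = [
    ("research", research),
    ("quant", quant),
    ("risk", risk),
    ("supervisor", supervisor),
]


def run_pipeline(query: str) -> dict:
    state = {"query": query}
    for name, stage in PIPELINE:
        state[name] = stage(state)
    return state


state = run_pipeline("Is ACME overvalued?")
print(state["supervisor"])
```

A bounded debate step, for example letting the risk stage send one critique back to the quant stage, could be added as an extra entry in `PIPELINE` without reopening free-form peer-to-peer messaging.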

Explicit Termination Mechanisms#

One of the most important insights was this:

If your system has no clear termination rule, it does not have a stable equilibrium.

We experimented with using a supervising agent as a judge to determine when:

Key analytical components are covered
No major contradictions remain
The output meets task requirements

Even if the evaluation rubric is evolving, having an explicit termination trigger is far superior to open-ended debate.
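As a rough sketch, the three checks can be combined into a single verdict, with a round limit as a fallback trigger. This is a rule-based stand-in, not the judge we used: in practice a supervising LLM would produce the inputs, and the names `Verdict`, `judge`, `contradictions`, and `meets_requirements` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    done: bool
    reasons: list


def judge(draft: dict, required: set, contradictions: list,
          meets_requirements: bool, round_no: int, max_rounds: int = 3) -> Verdict:
    """Return whether the committee should stop, and why."""
    problems = []
    missing = sorted(required - set(draft))
    if missing:
        problems.append(f"missing components: {missing}")
    if contradictions:
        problems.append(f"{len(contradictions)} unresolved contradiction(s)")
    if not meets_requirements:
        problems.append("task requirements not met")

    if not problems:
        return Verdict(True, ["all checks passed"])
    if round_no >= max_rounds:
        # Fallback trigger: stop anyway and surface the open problems.
        return Verdict(True, problems + ["round limit reached"])
    return Verdict(False, problems)


required = {"research", "quant", "risk"}
early = judge({"research": "..."}, required, [], True, round_no=1)
final = judge({"research": "...", "quant": "...", "risk": "..."},
              required, [], True, round_no=2)
print(early.done, early.reasons)
print(final.done, final.reasons)
```

Returning the reasons alongside the decision keeps the rubric inspectable while it is still evolving.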

Without governance, multi-agent systems tend to:

Drift into recursive refinement
Inflate inference cost
Produce inconsistent outputs across runs
Optimize verbosity instead of clarity

Most dangerously, they create the illusion of intelligence rather than measurable improvement.

5. Evaluation: If You Cannot Measure Improvement, You Are Guessing#

Another hard lesson:

If you do not define evaluation early, improvement becomes subjective.

At first, we evaluated outputs informally:

Does this look more complete?
Does it feel smarter?

That approach does not scale.

We shifted toward structured evaluation using financial QA benchmarks and LLM-based judging frameworks.

Key elements included:

Predefined financial question sets
Expected key elements in answers
Checks for contradiction and completeness
Comparative testing across architectures

Even when the judging rubric is still being refined, having a benchmark dataset anchors development in something measurable. Otherwise, you risk optimizing aesthetics rather than performance.
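A minimal harness for this kind of comparison might look like the sketch below. It scores only key-element coverage by substring match, which is a deliberately crude stand-in for an LLM judge. The benchmark questions, expected elements, and the two stub systems are invented for illustration, and the printed scores come from canned strings, not from any measured result.

```python
from statistics import mean

# Hypothetical benchmark items: a question plus the elements a good answer should mention.
BENCHMARK = [
    {"question": "What drove the change in gross margin?",
     "expected": ["gross margin", "cost"]},
    {"question": "What are the main downside risks?",
     "expected": ["risk", "leverage"]},
]


def coverage(answer: str, expected: list[str]) -> float:
    """Fraction of expected key elements that appear in the answer."""
    text = answer.lower()
    return sum(key.lower() in text for key in expected) / len(expected)


def evaluate(system, benchmark: list[dict]) -> float:
    """Mean coverage of `system` (a callable: question -> answer) over the benchmark."""
    return mean(coverage(system(item["question"]), item["expected"])
                for item in benchmark)


# Canned stand-ins for two architectures under comparison.
def terse_stub(question: str) -> str:
    return "Gross margin moved."


def thorough_stub(question: str) -> str:
    return "Gross margin fell as input cost rose; the key risk is leverage."


for name, system in [("terse_stub", terse_stub), ("thorough_stub", thorough_stub)]:
    print(f"{name}: {evaluate(system, BENCHMARK):.2f}")
```

Because `evaluate` accepts any callable, the same question set can be run against a swarm and a supervisor configuration without changing the scoring code.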

6. From AI Committee To Financial Terminal#

Another major realization was product-related. A multi-agent group chat is not the end product. It is an engine.

In finance, real workflows look like:

Research a stock
Generate structured reports
Make trade decisions
Evaluate portfolio risk

The value of multi-agent reasoning is highest when embedded inside a structured decision environment, not when presented as an open-ended conversation.

In other words:

The committee is not the product. The committee is the reasoning layer inside a decision terminal.

Once multi-agent systems are placed inside clearly defined modules, such as Research, Trade, and Portfolio, their outputs become actionable rather than abstract. Structure improves not just reasoning stability, but user utility.

7. Key Takeaways#

If you are designing a multi-agent debate system, here are the core lessons:

Swarm architectures are fragile without arbitration and termination.
Supervisor architectures improve coherence and reproducibility.
Hard constraints, such as turn limits and tool caps, prevent runaway loops.
Structured workflows outperform free-form debate.
Evaluation must be defined early, or improvement becomes subjective.
Multi-agent systems create value when embedded in real workflows, not when left as open chat.

Most importantly: multi-agent systems are not a shortcut to intelligence. They are governance systems layered on top of probabilistic models. Design them like control systems, not like conversations.