Five Multi-Agent Coordination Patterns That Actually Work in Enterprise
Most multi-agent tutorials show toy examples. Enterprise deployments require coordination patterns that handle partial failures, maintain state across restarts, and scale to concurrent agent populations. Here are the five patterns we deploy most.
The five enterprise multi-agent coordination patterns are: Supervisor-Worker (central coordinator delegates to specialists), Peer-to-Peer Collaboration (agents negotiate via shared message bus), Pipeline (linear handoff transformations), Hierarchical Tree (multi-level supervision for complex process maps), and Event-Driven Pub/Sub (reactive agents with no central coordinator). Pattern selection is primarily driven by fault tolerance requirements and whether coordination logic should be centralised or distributed.
Why Pattern Selection Matters More Than Framework Selection
Teams spend too much time debating LangGraph versus CrewAI versus AutoGen and not enough time thinking about the coordination topology their use case actually requires. The framework is an implementation detail. The coordination pattern is an architectural decision that determines fault tolerance, observability, and scalability characteristics.
A Supervisor-Worker system and a Peer-to-Peer system implemented in the same framework will have completely different failure modes. The supervisor is a single point of failure; the peer network has partial failure complexity. Getting the pattern wrong means you will be fighting the architecture every time you scale or add resilience.
These five patterns cover 95% of enterprise multi-agent use cases we have seen. They are named, bounded, and have well-understood trade-offs. Treat them as a design vocabulary — when a new use case arrives, map it to the closest pattern first before reaching for a custom topology.
The Five Coordination Patterns
Pattern 1: Supervisor-Worker
Architecture: A central supervisor agent receives the top-level goal, decomposes it into subtasks, delegates each subtask to a specialist worker agent, and aggregates results. Workers report back to the supervisor only — they do not communicate with each other.
Best for: Tasks with clear decomposition into specialist domains — research + synthesis, data extraction + analysis + formatting. Works well when the supervisor can reliably assess worker output quality.
Failure modes: Supervisor is a single point of failure. If the supervisor's decomposition is wrong, all workers execute on a faulty plan. Supervisor token costs are high because it sees all intermediate results.
Implementation notes: Use LangGraph's hierarchical graph or CrewAI's crew with a manager. Always implement a worker timeout with supervisor fallback. Log supervisor decomposition decisions for debugging.
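The worker-timeout-with-supervisor-fallback advice can be sketched framework-agnostically. This is a minimal illustration, not LangGraph/CrewAI API: `worker_fn`, `task`, and the fallback value are hypothetical stand-ins for whatever your orchestration layer supplies.

```python
import concurrent.futures

def run_worker_with_timeout(worker_fn, task, timeout_s=30.0, fallback=None):
    """Run a worker callable with a hard timeout; on timeout or error,
    return the supervisor-supplied fallback instead of blocking the run."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(worker_fn, task)
        try:
            return {"status": "success", "result": future.result(timeout=timeout_s)}
        except concurrent.futures.TimeoutError:
            future.cancel()  # best effort; a running thread cannot be killed
            return {"status": "timeout", "result": fallback}
        except Exception as exc:
            return {"status": "error", "result": fallback, "error": str(exc)}
```

The supervisor then treats `timeout`/`error` statuses as inputs to its re-planning decision rather than letting one stalled worker hang the whole graph.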
Pattern 2: Peer-to-Peer Collaboration
Architecture: Agents operate as peers, communicating via a shared message bus or structured debate protocol. No central coordinator — agents propose, critique, and refine each other's outputs through multiple rounds until convergence or timeout.
Best for: Tasks requiring diverse perspectives or adversarial validation — code review (generator vs. reviewer), document drafting (author vs. editor), risk assessment (proposer vs. critic).
Failure modes: Convergence is not guaranteed — peers can deadlock in disagreement. Significantly more expensive than Supervisor-Worker because every agent sees all messages. Difficult to debug: tracing which agent caused a quality regression requires full conversation replay.
Implementation notes: Always set a maximum round limit (3-5 rounds covers 90% of cases). Define explicit convergence criteria. Route unresolved debates to a human reviewer rather than forcing LLM resolution.
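The round-limit and convergence-criteria advice reduces to a small control loop. A minimal sketch, assuming hypothetical `propose`, `critique`, and `converged` callables that wrap your actual agents:

```python
def run_debate(propose, critique, initial, max_rounds=5, converged=None):
    """Peer-to-peer debate loop: critique and propose alternate until the
    explicit convergence predicate passes or the round budget runs out.
    Unresolved debates return resolved=False for human-review routing."""
    draft = initial
    for round_no in range(1, max_rounds + 1):
        feedback = critique(draft)
        if converged and converged(draft, feedback):
            return {"result": draft, "rounds": round_no, "resolved": True}
        draft = propose(draft, feedback)
    return {"result": draft, "rounds": max_rounds, "resolved": False}
```

The key design choice is that the loop never forces an LLM to declare victory: exhausting `max_rounds` is a first-class outcome that your pipeline routes to a human, per the note above.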
Pattern 3: Pipeline
Architecture: Agents are arranged in a linear sequence. Each agent receives the output of the previous agent, applies a transformation or enrichment, and passes the result forward. No agent communicates with any other agent except its immediate neighbours.
Best for: Document processing workflows with clear sequential stages — extract → normalise → enrich → classify → store. Also appropriate for content generation pipelines — draft → review → edit → format.
Failure modes: A failure at any stage blocks the entire pipeline. Errors propagate forward — a misclassification at stage 2 corrupts all subsequent stages. Pipeline latency is the sum of all stage latencies, which compounds quickly.
Implementation notes: Implement per-stage output validation before passing to the next stage. Use Temporal or Prefect for durable pipeline execution with checkpoint-restart. Store intermediate outputs so failed pipelines can restart from the last successful stage.
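Per-stage validation plus checkpoint-restart can be illustrated with a file-backed sketch; in production a durable engine like Temporal or Prefect replaces the JSON checkpoint, and the `(name, transform, validate)` stage tuples here are a hypothetical convention:

```python
import json
import os

def run_pipeline(stages, payload, checkpoint_path="pipeline_state.json"):
    """Linear pipeline with per-stage output validation and checkpoint-restart.
    Each stage is a (name, transform_fn, validate_fn) tuple; intermediate
    outputs are persisted so a failed run resumes from the last good stage."""
    start = 0
    if os.path.exists(checkpoint_path):  # resume from a prior failed run
        with open(checkpoint_path) as f:
            saved = json.load(f)
        start, payload = saved["next_stage"], saved["payload"]
    for i in range(start, len(stages)):
        name, transform, validate = stages[i]
        result = transform(payload)
        if not validate(result):  # stop errors propagating to later stages
            raise ValueError(f"Stage '{name}' produced invalid output")
        payload = result
        with open(checkpoint_path, "w") as f:
            json.dump({"next_stage": i + 1, "payload": payload}, f)
    os.remove(checkpoint_path)  # clean completion
    return payload
```

Validation failing fast at stage boundaries is what prevents the "misclassification at stage 2 corrupts everything downstream" failure mode described above.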
Pattern 4: Hierarchical Tree
Architecture: Multi-level supervision where top-level supervisors manage mid-level coordinators, which in turn manage specialist workers. Mirrors complex enterprise organisational charts. Each level only communicates with its immediate parent and children.
Best for: Large-scale enterprise process automation where a single top-level goal requires coordination across multiple domains — e.g., a due diligence agent that coordinates a legal review team, a financial analysis team, and a market research team simultaneously.
Failure modes: Coordination overhead scales with tree depth. Bugs in mid-level coordinator logic are hard to detect because they are shielded from both the top-level supervisor and the worker level. Tree depth beyond 3 levels becomes extremely expensive and difficult to debug.
Implementation notes: Keep tree depth to 2-3 levels maximum. Each level should have a well-defined scope that does not overlap with adjacent levels. LangGraph's multi-graph compilation is the cleanest implementation for this pattern.
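The depth limit can be enforced structurally rather than by convention. A toy sketch (not LangGraph code) where nodes are plain dicts and the `delegate` helper is hypothetical:

```python
def delegate(node, goal, depth=1, max_depth=3):
    """Depth-limited hierarchical delegation: coordinator nodes fan work out
    to their children, leaf nodes execute directly. The max_depth guard
    enforces the 2-3 level recommendation at runtime."""
    if depth > max_depth:
        raise RuntimeError("Tree deeper than max_depth; flatten the hierarchy")
    if not node.get("children"):  # leaf: specialist worker executes
        return f"[{node['name']}] handled: {goal}"
    results = [delegate(child, f"{goal} / {child['name']}", depth + 1, max_depth)
               for child in node["children"]]
    return f"[{node['name']}] aggregated {len(results)} results"
```

Raising rather than silently recursing makes an over-deep hierarchy a loud design error during development instead of a quiet 10x cost multiplier in production.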
Pattern 5: Event-Driven Publish-Subscribe
Architecture: Agents subscribe to event types on a shared bus. When an event is emitted (by a user action, system trigger, or another agent), all subscribed agents process it independently. No central coordinator — coordination emerges from event flow.
Best for: Reactive systems where multiple processes need to respond to the same events — an order placed event triggers inventory check, fraud scoring, notification, and analytics simultaneously. Also appropriate for continuous monitoring agents that react to data changes.
Failure modes: Event ordering is not guaranteed — agents processing the same event sequence may do so in different orders. Debugging requires event log reconstruction. Runaway event loops (agent A emits event X, which triggers agent B, which emits event X again) can be catastrophic without circuit breakers.
Implementation notes: Use a durable message broker (Kafka, Redis Streams, or Google Pub/Sub) — not an in-memory queue. Implement idempotency keys so agents can deduplicate replayed events. Add circuit breakers at the event emission layer.
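Idempotency keys and emission circuit breakers are simple to express. An in-process sketch only, a durable broker replaces this class in production, and `EventBus` is a hypothetical name:

```python
from collections import defaultdict

class EventBus:
    """Pub/sub sketch with per-subscriber idempotency keys and a circuit
    breaker capping how many times one event id may be emitted, which
    stops runaway agent-to-agent event loops."""
    def __init__(self, max_emissions=5):
        self.subscribers = defaultdict(list)
        self.seen = defaultdict(set)         # per-subscriber idempotency keys
        self.emit_counts = defaultdict(int)  # circuit-breaker counters
        self.max_emissions = max_emissions

    def subscribe(self, event_type, name, handler):
        self.subscribers[event_type].append((name, handler))

    def emit(self, event_type, event_id, payload):
        self.emit_counts[event_id] += 1
        if self.emit_counts[event_id] > self.max_emissions:
            raise RuntimeError(f"Circuit breaker tripped for event {event_id}")
        for name, handler in self.subscribers[event_type]:
            if event_id in self.seen[name]:  # deduplicate replayed events
                continue
            self.seen[name].add(event_id)
            handler(payload)
```

With Kafka or Redis Streams the `seen` set becomes a persisted idempotency store (consumer offsets alone are not enough, because replays are the point), but the dedup-and-breaker logic is the same.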
LangGraph Supervisor-Worker Implementation with Error Recovery
```python
from typing import TypedDict, Annotated, Literal
import operator
import json

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver

# --- State definition ---
class WorkflowState(TypedDict):
    goal: str
    plan: list[dict]  # Supervisor's task decomposition
    completed_tasks: list[dict]
    failed_tasks: list[dict]
    worker_results: Annotated[list, operator.add]
    final_output: str
    retry_count: int

# --- LLM setup ---
supervisor_llm = ChatOpenAI(model="gpt-4o", temperature=0)
worker_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)

# --- Supervisor node: decompose goal into tasks ---
def supervisor_plan(state: WorkflowState) -> dict:
    response = supervisor_llm.invoke([
        SystemMessage(content="""You are a task decomposition supervisor.
Break the user's goal into 3-5 specialist tasks.
Respond with JSON: {"tasks": [{"id": "t1", "specialist": "researcher|analyst|writer", "instruction": "..."}]}"""),
        HumanMessage(content=f"Goal: {state['goal']}")
    ])
    try:
        plan = json.loads(response.content)
        return {"plan": plan["tasks"], "retry_count": 0}
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed decomposition: clear the plan and count the retry
        return {"plan": [], "retry_count": state.get("retry_count", 0) + 1}

# --- Worker nodes: execute assigned tasks ---
SPECIALIST_PROMPTS = {
    "researcher": "You are a research specialist. Gather and summarise relevant information.",
    "analyst": "You are an analyst. Analyse the provided information and identify key insights.",
    "writer": "You are a writer. Produce clear, structured written output based on provided inputs.",
}

def execute_worker_tasks(state: WorkflowState) -> dict:
    results = []
    failed = []
    for task in state["plan"]:
        specialist = task.get("specialist", "analyst")
        system_prompt = SPECIALIST_PROMPTS.get(specialist, SPECIALIST_PROMPTS["analyst"])
        context = ""
        if results:
            context = "\nPrevious results:\n" + "\n".join(
                f"- {r['task_id']}: {r['result'][:200]}" for r in results
            )
        try:
            response = worker_llm.invoke([
                SystemMessage(content=system_prompt),
                HumanMessage(content=f"Task: {task['instruction']}{context}")
            ])
            results.append({"task_id": task["id"], "specialist": specialist,
                            "result": response.content, "status": "success"})
        except Exception as e:
            failed.append({"task_id": task["id"], "error": str(e)})
    return {"worker_results": results, "failed_tasks": failed}

# --- Supervisor aggregation node ---
def supervisor_aggregate(state: WorkflowState) -> dict:
    results_text = "\n\n".join(
        f"[{r['specialist'].upper()}]\n{r['result']}"
        for r in state.get("worker_results", [])
    )
    response = supervisor_llm.invoke([
        SystemMessage(content="Synthesise the worker outputs into a single coherent response."),
        HumanMessage(content=f"Goal: {state['goal']}\n\nWorker results:\n{results_text}")
    ])
    return {"final_output": response.content}

# --- Routing logic ---
def route_after_planning(state: WorkflowState) -> Literal["execute", "retry", "abort"]:
    if not state["plan"]:
        # Planning produced no valid tasks: retry up to twice, then abort
        return "abort" if state.get("retry_count", 0) >= 2 else "retry"
    return "execute"

def route_after_execution(state: WorkflowState) -> Literal["aggregate", "retry"]:
    success_count = len(state.get("worker_results", []))
    total_tasks = len(state.get("plan", []))
    if success_count < total_tasks * 0.5:  # more than half failed -> re-plan
        return "retry"
    return "aggregate"

# --- Build graph ---
def build_supervisor_worker_graph():
    builder = StateGraph(WorkflowState)
    builder.add_node("plan", supervisor_plan)
    builder.add_node("execute", execute_worker_tasks)
    builder.add_node("aggregate", supervisor_aggregate)
    builder.set_entry_point("plan")
    builder.add_conditional_edges("plan", route_after_planning,
                                  {"execute": "execute", "retry": "plan", "abort": END})
    builder.add_conditional_edges("execute", route_after_execution,
                                  {"aggregate": "aggregate", "retry": "plan"})
    builder.add_edge("aggregate", END)
    return builder.compile(checkpointer=MemorySaver())

if __name__ == "__main__":
    graph = build_supervisor_worker_graph()
    config = {"configurable": {"thread_id": "run-001"}}
    result = graph.invoke(
        {"goal": "Analyse Q3 2025 SaaS market trends and produce an executive summary.",
         "plan": [], "completed_tasks": [], "failed_tasks": [],
         "worker_results": [], "final_output": "", "retry_count": 0},
        config=config
    )
    print(result["final_output"])
```

LangGraph Supervisor-Worker implementation with conditional routing for error recovery. The route_after_execution function implements the key resilience logic: if more than 50% of worker tasks fail, the graph routes back to the supervisor for re-planning rather than attempting aggregation on partial results. The route_after_planning function likewise retries a failed decomposition up to twice before aborting. MemorySaver provides checkpoint-restart capability for long-running workflows.
The Hierarchical Tree pattern is frequently over-applied. Teams map their org chart onto their agent architecture and end up with 4-5 levels of supervision that cost 10x more than necessary and are nearly impossible to debug. For 90% of enterprise use cases, Supervisor-Worker (2 levels) or Pipeline is sufficient. Only reach for the Hierarchical Tree when you have genuinely independent domain teams that require simultaneous coordination — not just because the process diagram looks hierarchical.
Pattern Selection Guide
| Pattern | Best For | Main Failure Mode | Relative Cost |
|---|---|---|---|
| Supervisor-Worker | Clear specialist decomposition | Supervisor single point of failure | Medium |
| Peer-to-Peer | Adversarial validation, debate | Deadlock, high token cost | High |
| Pipeline | Sequential document processing | Error propagation, stage blocking | Low |
| Hierarchical Tree | Cross-domain enterprise processes | Coordination overhead, debugging difficulty | Very High |
| Event-Driven Pub/Sub | Reactive, concurrent processing | Event loops, ordering issues | Low–Medium |
Inductivee's Pattern Recommendations in Practice
When a new client engagement starts with 'we want a multi-agent system,' our first question is always: 'what happens if one agent fails halfway through?' The answer almost always determines the correct coordination pattern before any other analysis is done.
If partial completion is worse than no completion (a purchase-order amendment system where only half the line items are processed), you need Pipeline with checkpoint-restart or Supervisor-Worker with atomic commit semantics. If partial completion is acceptable (a research report missing one section is still valuable), Peer-to-Peer or Supervisor-Worker with partial aggregation works fine.
Of the five patterns, Pipeline is significantly underused in enterprise contexts because it feels 'too simple.' Teams reach for Supervisor-Worker when a sequential Pipeline would be more reliable, cheaper, and easier to debug. If your workflow has a natural sequential order — and most document processing workflows do — start with Pipeline and upgrade to Supervisor-Worker only when you need dynamic task decomposition that cannot be pre-specified.
Frequently Asked Questions
What are the main multi-agent coordination patterns?
The five covered above: Supervisor-Worker, Peer-to-Peer Collaboration, Pipeline, Hierarchical Tree, and Event-Driven Pub/Sub. Selection is driven primarily by fault tolerance requirements and whether coordination logic should be centralised or distributed.
When should I use LangGraph for multi-agent systems?
LangGraph fits Supervisor-Worker and Hierarchical Tree topologies particularly well: its graphs, conditional edges, and checkpointers map directly onto those patterns, as the implementation above shows.
How do multi-agent systems handle partial failures?
It depends on the pattern: Pipeline restarts from the last checkpointed stage, Supervisor-Worker routes back to re-planning when too many workers fail, and Pub/Sub relies on idempotency keys and durable brokers to replay events safely.
What is the Supervisor-Worker pattern in multi-agent AI?
A central supervisor decomposes the top-level goal into subtasks, delegates each to a specialist worker, and aggregates the results; workers communicate only with the supervisor, never with each other.
How many agents should a multi-agent system have?
As few as the decomposition requires. In practice, a supervisor decomposing a goal into 3-5 specialist tasks covers most cases, and supervision hierarchies should stay at 2-3 levels.
Written By
Inductivee Team
Agentic AI Engineering Team
The Inductivee engineering team — a remote-first group of multi-agent orchestration specialists, RAG pipeline architects, and data liquidity engineers who have shipped 40+ agentic deployments across 25+ enterprises since 2012. Our writing is grounded in what we actually build, break, and operate in production.
Inductivee is a remote-first agentic AI engineering firm. Our engineering content is written by active practitioners and technically reviewed before publication. Compliance: SOC2 Type II, HIPAA, GDPR, ISO 27001.
Engineer This With Inductivee
The engineering patterns in this article are what our team builds into production every day. Explore the related service to see how we deliver this capability at enterprise scale.
Agentic Custom Software Engineering
We engineer autonomous agentic systems that orchestrate enterprise workflows and unlock the hidden liquidity of your proprietary data.
Autonomous Agentic SaaS
Agentic SaaS development and autonomous platform engineering — we build SaaS products whose core loop is powered by LangGraph and CrewAI agents that execute workflows, not just manage them.
Related Articles
What Is Agentic AI? A Practical Guide for Enterprise Engineering Teams
CrewAI Tutorial: Enterprise Production Deployment Patterns and Hard-Won Lessons
LangGraph Multi-Agent Workflows: Production Patterns for Complex Stateful Orchestration
Ready to Build This Into Your Enterprise?
Inductivee engineers agentic systems, RAG pipelines, and enterprise data liquidity solutions. Let's scope your project.
Start a Project