June 7, 2026·11 min read

Right-sizing your agent swarm without chasing ghosts.

A straight-line CPU climb and a coder worker stuck near 1.1 GB looked like production problems. They were metric interpretation traps. Here are the sizing numbers we actually run.

Tags: container sizing, SigNoz, self-hosting, AI agents, observability
[Figure: Agent Swarm container CPU and memory sizing field notes. First verify the metric. Then change the container size.]

We had the kind of graph that makes infrastructure teams reach for a rollback: container CPU climbing in a perfect straight line after deploy. At the same time, a coder worker looked stuck above 1 GB of RAM while idle.

Both looked like production problems. Both turned out to be measurement artifacts. The CPU line was an average over a cumulative counter. The memory plateau was page cache plus a long-lived Bun/Node high-water mark. The swarm was fine; the dashboard was asking the wrong question.

The rule is simple: do not resize a swarm because a graph looks scary. First prove the graph is measuring the thing you think it is measuring.

The numbers we actually run

Agent containers do not have one resource profile. A lead agent mostly coordinates. A content worker may sit close to idle. A coding worker can read a repository, spawn a harness process, run TypeScript, execute tests, and stream tool calls for an hour.

Heavy worker: >=2 vCPU / 2-3 GB. Observed peaks around 1.4 CPU cores and 2.0 GB RAM during active coding.

Lead: ~1 vCPU / 1 GB. Observed peaks around 0.8 CPU cores and 830 MB while orchestrating.

Light worker: 0.5 vCPU / 512 MB-1 GB. Idle specialists sit around 3% CPU and 200-290 MB RAM.

That is the practical sizing split: implementation workers get headroom, leads get enough room to coordinate without pressure, and light specialists stay small until their actual workload says otherwise. Role names are a starting point, not a law. If a tester runs browser automation or repo-wide builds, size it like a heavy worker.

Kubernetes pod sizing

In Kubernetes, we set requests to the steady-state budget the scheduler should reserve and limits to the burst ceiling each pod can use during an active session. Heavy workers need the widest gap because build tools, language servers, test runners, and provider subprocesses can spike together.

Kubernetes is stricter about memory bursts than the generated Docker Compose setup we use for single-host deployments. A container that exceeds limits.memory is OOM-killed, and pods running above requests.memory are the first eviction candidates under node memory pressure. Compose sets no hard memory ceiling by default, so numbers that were comfortable there need rechecking: Kubernetes requests should sit near the real working set, and limits need real headroom. For a heavy coder, we budget the limit as MAX_CONCURRENT_TASKS * per-session peak (~2 GB) + page-cache headroom. At the default MAX_CONCURRENT_TASKS=1, that is one active task peaking around 2 GB; raising the value multiplies the peak. For the lead, setting request equal to limit for both CPU and memory gives Guaranteed QoS and puts the orchestrator last in line for memory-pressure eviction. Its observed peak is around 830 MB, so a modest equal request and limit is enough.
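The limit formula above is simple enough to sketch in a few lines. This is a minimal illustration, not the project's own tooling: the function name is hypothetical, the per-session peak is the roughly 2 GB figure from our observations rounded to 2048 MiB, and the 512 MiB page-cache headroom is an assumed placeholder you should replace with your own measurement.

```python
def heavy_worker_memory_limit_mib(
    max_concurrent_tasks: int = 1,
    per_session_peak_mib: int = 2048,    # ~2 GB observed peak, rounded (assumption)
    page_cache_headroom_mib: int = 512,  # assumed headroom, not a measured value
) -> int:
    """limit = MAX_CONCURRENT_TASKS * per-session peak + page-cache headroom."""
    if max_concurrent_tasks < 1:
        raise ValueError("max_concurrent_tasks must be at least 1")
    return max_concurrent_tasks * per_session_peak_mib + page_cache_headroom_mib


if __name__ == "__main__":
    for tasks in (1, 2):
        limit = heavy_worker_memory_limit_mib(max_concurrent_tasks=tasks)
        print(f"MAX_CONCURRENT_TASKS={tasks}: limit about {limit} MiB")
```

With these assumed inputs, one task lands between the 2 GiB request and 3 GiB limit we suggest for heavy workers, while two tasks per pod push the budget well past 4 GiB. That is the arithmetic behind scaling by pods rather than by per-pod concurrency.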

| Pod type | Suggested replicas | CPU request | CPU limit | Memory request | Memory limit | Notes |
|---|---|---|---|---|---|---|
| API server | 1-2 | 500m | 1 CPU | 512 MiB | 1 GiB | Run 2 replicas when the database and storage layer support the deployment topology. |
| Lead agent | 1 | 750m | 1.5 CPU | 768 MiB | 1.5 GiB | Coordination-heavy pods benefit from low latency more than high memory. |
| Heavy worker | 1 per pod | 1 CPU | 2 CPU | 2 GiB | 3 GiB | Best default for coding agents, repo-wide checks, and tool-heavy implementation. |
| Light worker | 1 per pod | 250m | 750m | 512 MiB | 1 GiB | Suitable for review, content, triage, and low-build inspection tasks. |
| Browser / E2E worker | 1 per pod | 1 CPU | 2-3 CPU | 2 GiB | 4 GiB | Browser automation and test fixtures need memory headroom beyond normal worker sizing. |

| Worker class | Recommended pod shape | Concurrency per pod (MAX_CONCURRENT_TASKS) | Scaling rule |
|---|---|---|---|
| Heavy coding | 1 agent per pod | 1 active task (default) | Scale by adding pods, not by packing multiple coding sessions into one container. |
| Light specialist | 1 agent per pod | 1 active task (default) | Scale horizontally when queue latency matters. |
| Thin relay / managed-provider worker | 1 agent per pod | 1-2 active tasks | Increase only if the provider runtime executes outside the worker and local tool use is light. |
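Kubernetes assigns a QoS class from the request and limit values, so it is worth checking which class a given row actually produces. The sketch below applies the standard rules to a single-container pod: Guaranteed needs CPU and memory limits with requests equal to them, BestEffort means nothing is set, and everything else is Burstable. The `Resources` type and `qos_class` helper are hypothetical names for illustration; multi-container pods, where every container must satisfy the rule, are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resources:
    cpu_request_m: Optional[int] = None    # millicores
    cpu_limit_m: Optional[int] = None
    mem_request_mib: Optional[int] = None
    mem_limit_mib: Optional[int] = None


def _pinned(request: Optional[int], limit: Optional[int]) -> bool:
    # A limit must be set; an omitted request defaults to the limit.
    return limit is not None and (request is None or request == limit)


def qos_class(r: Resources) -> str:
    """QoS class for a single-container pod (simplified sketch)."""
    fields = (r.cpu_request_m, r.cpu_limit_m, r.mem_request_mib, r.mem_limit_mib)
    if all(v is None for v in fields):
        return "BestEffort"
    if _pinned(r.cpu_request_m, r.cpu_limit_m) and _pinned(r.mem_request_mib, r.mem_limit_mib):
        return "Guaranteed"
    return "Burstable"


# Lead row exactly as listed in the table: request below limit.
lead_as_listed = Resources(750, 1500, 768, 1536)
# Lead with request == limit, using the ~1 vCPU / 1 GB lead shape.
lead_pinned = Resources(1000, 1000, 1024, 1024)
```

Here `qos_class(lead_as_listed)` returns "Burstable" and `qos_class(lead_pinned)` returns "Guaranteed". If you want the eviction protection described for the lead, collapse the lead row's request and limit to one value for both CPU and memory.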

Docker Compose on a single host

On a single VPS or bare-metal host, we leave reserve capacity for the database, Docker, the kernel page cache, logs, and deploy-time overlap. The practical rule: allocate only 70-80% of host memory to steady-state containers and keep at least 1-2 vCPU uncommitted on busy boxes.

| Service | CPU budget | RAM budget | Replicas | Notes |
|---|---|---|---|---|
| API server | 1 vCPU | 1 GiB | 1 | Keep close to the database. Increase CPU if API latency rises during task churn. |
| Lead agent | 1 vCPU | 1 GiB | 1 | Usually one lead is enough for a small to medium swarm. |
| Heavy worker | 2 vCPU burst | 2-3 GiB | By host capacity | Count each active coding worker as the main unit of capacity. |
| Light worker | 0.5-1 vCPU | 512 MiB-1 GiB | By host capacity | Good filler capacity after heavy workers are reserved. |
| Observability / proxy / support services | 0.5-1 vCPU | 512 MiB-2 GiB | 1 each | Include these before calculating worker slots. |

| Example host class | Approx. host resources | Recommended swarm shape | Notes |
|---|---|---|---|
| Small VPS | 4 vCPU / 8 GiB RAM | API + lead + 1 heavy worker + 1-2 light workers | Good for evaluation, demos, and low-concurrency self-hosting. |
| Medium VPS | 8 vCPU / 16 GiB RAM | API + lead + 3 heavy workers + 2-4 light workers | Practical baseline for a small production team. |
| Large VPS / small dedicated host | 16 vCPU / 32 GiB RAM | API + lead + 6-8 heavy workers + 4-8 light workers | Keep 6-8 GiB free for OS cache, logs, deploy overlap, and occasional spikes. |
| Dedicated build host | 32 vCPU / 64 GiB RAM | API + lead + 12-16 heavy workers + light workers as needed | Useful when many workers run tests or builds locally. Split database/storage if API latency or disk I/O becomes noisy. |
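To sanity-check a proposed swarm shape against the 70-80% memory rule, a small calculator helps. This is a rough sketch, not a scheduler: the names are hypothetical, the defaults take the low end of each RAM budget range above (2 GiB per heavy worker, 0.5 GiB per light worker), and a single 0.5 GiB support service is assumed. It checks steady-state memory only and ignores CPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budgets:
    api_gib: float = 1.0
    lead_gib: float = 1.0
    support_gib: float = 0.5   # assumed: one small support service
    heavy_gib: float = 2.0     # low end of the 2-3 GiB range
    light_gib: float = 0.5     # low end of the 512 MiB-1 GiB range


def committed_ram_gib(heavy: int, light: int, b: Budgets = Budgets()) -> float:
    """Steady-state RAM committed to the swarm for a given worker mix."""
    fixed = b.api_gib + b.lead_gib + b.support_gib
    return fixed + heavy * b.heavy_gib + light * b.light_gib


def fits_host(host_ram_gib: float, heavy: int, light: int,
              max_fraction: float = 0.8, b: Budgets = Budgets()) -> bool:
    """True when the mix stays within max_fraction of host memory."""
    return committed_ram_gib(heavy, light, b) <= host_ram_gib * max_fraction


if __name__ == "__main__":
    for host, heavy, light in ((8, 1, 2), (16, 3, 4), (32, 8, 8)):
        used = committed_ram_gib(heavy, light)
        print(f"{host} GiB host, {heavy} heavy + {light} light: "
              f"{used} GiB committed, fits={fits_host(host, heavy, light)}")
```

Rerun the check with `Budgets(heavy_gib=3.0, light_gib=1.0)` before committing to the upper end of a recommended range, since the high-end budgets leave far less slack than the defaults.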

Concurrency is the real multiplier

The swarm scales most predictably when each local-runtime worker runs one active task at a time. The knob is MAX_CONCURRENT_TASKS; the generated Docker Compose default is MAX_CONCURRENT_TASKS=1, which means one active task per worker. Increasing per-worker concurrency can work for thin relay workers or low-tool tasks, but it multiplies memory peaks and makes local builds fight inside the same container.

| Concurrency profile | Worker count | MAX_CONCURRENT_TASKS per worker | Total active tasks | Recommended host or cluster budget | When to use |
|---|---|---|---|---|---|
| Evaluation | 1 lead + 1 heavy worker | 1 | 1 | 2-4 vCPU, 4-8 GiB RAM | Trial deployments and occasional coding tasks. |
| Small team | 1 lead + 2-3 heavy workers + 2 light workers | 1 | 4-5 | 8 vCPU, 16 GiB RAM | Several independent tasks per day with room for reviews and content work. |
| Busy team | 1 lead + 6-8 heavy workers + 4 light workers | 1 | 10-12 | 16 vCPU, 32 GiB RAM | Regular parallel implementation, review, and QA loops. |
| High throughput | 1-2 leads + 12-16 heavy workers + 8+ light workers | 1 | 20+ | 32+ vCPU, 64+ GiB RAM or Kubernetes node pool | Sustained task queues where horizontal scale matters more than single-host simplicity. |
| Thin relay workers | Depends on provider | 2 | Varies | Add 512 MiB-1 GiB RAM per extra active task | Only for providers where execution happens outside the worker and local tooling is minimal. |
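The multiplier effect is easy to see with a little arithmetic. The sketch below uses hypothetical helper names and the roughly 2 GB per-session peak we observed for heavy workers; the totals in the table match heavy plus light workers, so the lead is not counted as a task slot here.

```python
PER_SESSION_PEAK_GB = 2.0  # observed heavy-worker peak, rounded


def total_active_tasks(heavy: int, light: int, max_concurrent_tasks: int = 1) -> int:
    """Task slots across all workers; the lead is not a task slot."""
    return (heavy + light) * max_concurrent_tasks


def worst_case_heavy_peak_gb(heavy: int, max_concurrent_tasks: int = 1) -> float:
    """Memory needed if every heavy-worker slot peaks at the same time."""
    return heavy * max_concurrent_tasks * PER_SESSION_PEAK_GB


if __name__ == "__main__":
    for knob in (1, 2):
        print(f"MAX_CONCURRENT_TASKS={knob}: "
              f"{total_active_tasks(8, 4, knob)} slots, "
              f"{worst_case_heavy_peak_gb(8, knob)} GB worst-case heavy peak")
```

For the busy-team shape with 8 heavy workers, doubling the knob doubles the worst-case heavy peak from 16 GB to 32 GB, which would consume the whole 32 GiB host on its own.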

Trap one: CPU that climbs because the query climbs

The CPU graph that triggered the investigation was suspicious because it was too clean. Real CPU usage jitters. This line rose with mechanical precision and reset on deploy.

The problem was the panel, not the container. It plotted container.cpu.utilization with average time aggregation, treating a cumulative-since-boot counter like an instantaneous usage gauge. Averaging a monotonic counter produces a fake climb by design.

# Wrong mental model:
avg(container.cpu.utilization)

# Better instantaneous CPU:
rate(container.cpu.usage.total)

We fixed the SigNoz Container CPU Percent panel by switching the time aggregation from average to rate. The straight-line climb disappeared. The actual instantaneous CPU was flat.
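The artifact is easy to reproduce offline. The following sketch simulates a container burning a constant 0.25 cores, sampled every 60 seconds, and compares the two aggregations. The numbers and function names are illustrative assumptions, not data from our dashboards.

```python
from itertools import accumulate

INTERVAL_SECONDS = 60
CPU_SECONDS_PER_INTERVAL = 15.0  # constant 0.25 cores for 60 s (illustrative)

# Cumulative CPU-seconds since boot: 15, 30, 45, ... always increasing.
counter = list(accumulate([CPU_SECONDS_PER_INTERVAL] * 10))


def windowed_average(samples, window):
    """Average of raw counter values per window: the wrong aggregation."""
    return [
        sum(samples[i:i + window]) / window
        for i in range(0, len(samples) - window + 1, window)
    ]


def rate(samples, interval_seconds):
    """Per-second increase between consecutive samples: the right one."""
    return [(b - a) / interval_seconds for a, b in zip(samples, samples[1:])]


averages = windowed_average(counter, window=5)
rates = rate(counter, INTERVAL_SECONDS)

if __name__ == "__main__":
    print("average of counter:", averages)  # grows from window to window
    print("rate of counter:   ", rates)     # constant
```

The averaged series rises with every window even though the workload never changes, while the rate stays at 0.25 cores throughout. A restart resets the counter to zero, which is why the fake climb also resets on deploy.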

Trap two: memory that is high but reclaimable

The memory graph was more subtle. Our coder worker plateaued around 1,145 MB at idle, peaked at 1,979 MB during a heavy coding session, and reset to about 470 MB after the next redeploy. That shape can look like a leak if you only stare at container.memory.usage.total.

It was mostly two normal effects stacked together. First, cgroup page cache: coding sessions read and write a lot of files, and Linux keeps file-backed pages around until pressure forces eviction. Second, the high-water mark of a long-lived Bun/Node runner: heap can be reusable inside the runtime without RSS immediately returning to the OS.

# Better leak-triage panel:
working set = usage.total - inactive_file

Total memory still matters for OOM risk. It is not the same as a leak signal. For leak triage, graph working set and look for growth that survives restarts. If the plateau clears on every redeploy, start with page cache and runtime high-water marks before you go hunting through application code.
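For a quick check outside the dashboard, the same working-set subtraction can be done from cgroup data. This sketch assumes cgroup v2, where `memory.stat` exposes an `inactive_file` line and `memory.current` holds total usage. The helper names are hypothetical, and the sample `inactive_file` value is invented for illustration; only the 1,145 MB plateau comes from our measurements.

```python
MIB = 1024 * 1024


def parse_memory_stat(text: str) -> dict[str, int]:
    """Parse 'key value' lines as found in a cgroup v2 memory.stat file."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(usage_total: int, inactive_file: int) -> int:
    """working set = usage.total - inactive_file, floored at zero."""
    return max(usage_total - inactive_file, 0)


# Illustrative sample: the inactive_file figure is made up for the example.
sample_stat = f"anon {400 * MIB}\nfile {700 * MIB}\ninactive_file {600 * MIB}\n"
usage_total = 1145 * MIB  # the idle plateau from the graph

stats = parse_memory_stat(sample_stat)
working_set = working_set_bytes(usage_total, stats["inactive_file"])

if __name__ == "__main__":
    print(f"usage.total: {usage_total // MIB} MiB")
    print(f"working set: {working_set // MIB} MiB")
```

In this invented sample, more than half of the plateau is reclaimable file cache. Whatever the real split is on your workers, the working-set number is the one to track across restarts.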

Harness choice is part of sizing

The harness provider changes the operational risk profile. claude is our most reliable general harness. codex is stable and good for deterministic implementation, review, and structured-output work. pi is fine for content and QA. opencode has produced intermittent session errors within roughly 10 seconds of task start, so we avoid it for determinism-critical workflow nodes.

That is not just reliability bookkeeping. It affects how much headroom you give a container and what work you route there. A Codex or Claude implementation worker gets heavy-worker resources. A content worker can stay light until the workflow adds builds, browser work, or large local processing.

Fix real leaks, then fix the dashboard

The same operational thread did uncover real cleanup work. PR #675, “Fix runner and MCP transport leaks,” bounded runner task-keyed bookkeeping and closed idle MCP owner/user transports that survived unclean disconnects. That was real.

The CPU climb after that was not. That distinction is the whole lesson: production agent infrastructure needs both code fixes and measurement discipline. If you mix them up, you either ignore a real leak or waste a morning tuning containers around a dashboard artifact.

The full reference guide, including role-by-role sizing and metric panel guidance, is in Performance & Resource Sizing.
