June 7, 2026 · 11 min read

Right-sizing your agent swarm without chasing ghosts.

A straight-line CPU climb and a coder worker stuck near 1.1 GB looked like production problems. They were metric interpretation traps. Here are the sizing numbers we actually run.

container sizing · SigNoz · self-hosting · AI agents · observability

[Figure: Agent Swarm container CPU and memory sizing field notes. First verify the metric, then change the container size.]

We had the kind of graph that makes infrastructure teams reach for a rollback: container CPU climbing in a perfect straight line after deploy. At the same time, a coder worker looked stuck above 1 GB of RAM while idle.

Both looked like production problems. Both turned out to be measurement artifacts. The CPU line was an average over a cumulative counter. The memory plateau was page cache plus a long-lived Bun/Node high-water mark. The swarm was fine; the dashboard was asking the wrong question.

The rule is simple: do not resize a swarm because a graph looks scary. First prove the graph is measuring the thing you think it is measuring.

The numbers we actually run

Agent containers do not have one resource profile. A lead agent mostly coordinates. A content worker may sit close to idle. A coding worker can read a repository, spawn a harness process, run TypeScript, execute tests, and stream tool calls for an hour.

Role	Suggested size	Observed behavior
Heavy worker	>=2 vCPU / 2-3 GB	Peaks around 1.4 CPU cores and 2.0 GB RAM during active coding.
Lead	~1 vCPU / 1 GB	Peaks around 0.8 CPU cores and 830 MB while orchestrating.
Light worker	0.5 vCPU / 512 MB-1 GB	Idle specialists sit around 3% CPU and 200-290 MB RAM.

That is the practical sizing split: implementation workers get headroom, leads get enough room to coordinate without pressure, and light specialists stay small until their actual workload says otherwise. Role names are a starting point, not a law. If a tester runs browser automation or repo-wide builds, size it like a heavy worker.

Kubernetes pod sizing

In Kubernetes, we set requests to the steady-state budget the scheduler should reserve and limits to the burst ceiling each pod can use during an active session. Heavy workers need the widest gap because build tools, language servers, test runners, and provider subprocesses can spike together.

Kubernetes is stricter about memory bursts than the generated Docker Compose setup we use for single-host deployments. A container that exceeds limits.memory is OOM-killed, and pods running above requests.memory are the first eviction candidates under node memory pressure. Compose sets no hard memory ceiling by default, so numbers that were only soft budgets there become hard boundaries here: requests should sit near the real working set, and limits need real headroom.

For a heavy coder, we budget the limit as MAX_CONCURRENT_TASKS * per-session peak (~2 GB) + page-cache headroom. At the default MAX_CONCURRENT_TASKS=1, that is one active task peaking around 2 GB; raising the setting multiplies the peak.

For the lead, setting request == limit for both CPU and memory gives Guaranteed QoS and puts the orchestrator last in line for memory-pressure eviction. Its observed peak is around 830 MB, so a modest equal request/limit is enough. The lead row in the table below is a burstable shape with room to spike; collapse each request/limit pair to a single value if you want Guaranteed QoS instead.
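The heavy-worker limit formula above is simple enough to keep as a helper next to your manifests. This is a minimal sketch, not part of the swarm tooling: the function name is hypothetical, and the 1 GiB page-cache headroom default is an assumption chosen so that the default concurrency lands on a 3 GiB limit.

```python
def heavy_worker_memory_limit_gib(
    max_concurrent_tasks: int = 1,
    per_session_peak_gib: float = 2.0,     # observed peak per active coding session
    page_cache_headroom_gib: float = 1.0,  # assumption: tune for your repositories
) -> float:
    """Return limit = MAX_CONCURRENT_TASKS * per-session peak + page-cache headroom."""
    if max_concurrent_tasks < 1:
        raise ValueError("max_concurrent_tasks must be at least 1")
    return max_concurrent_tasks * per_session_peak_gib + page_cache_headroom_gib


default_limit = heavy_worker_memory_limit_gib()       # one active task
doubled_limit = heavy_worker_memory_limit_gib(2)      # two tasks in one pod
print(f"MAX_CONCURRENT_TASKS=1 -> {default_limit} GiB")
print(f"MAX_CONCURRENT_TASKS=2 -> {doubled_limit} GiB")
```

Under these assumptions, doubling concurrency adds a full session peak to the limit, which is why adding pods is usually cheaper to reason about than packing sessions into one container.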

Pod type	Suggested replicas	CPU request	CPU limit	Memory request	Memory limit	Notes
API server	1-2	500m	1 CPU	512 MiB	1 GiB	Run 2 replicas when the database and storage layer support the deployment topology.
Lead agent	1	750m	1.5 CPU	768 MiB	1.5 GiB	Coordination-heavy pods benefit from low latency more than high memory.
Heavy worker	1 per pod	1 CPU	2 CPU	2 GiB	3 GiB	Best default for coding agents, repo-wide checks, and tool-heavy implementation.
Light worker	1 per pod	250m	750m	512 MiB	1 GiB	Suitable for review, content, triage, and low-build inspection tasks.
Browser / E2E worker	1 per pod	1 CPU	2-3 CPU	2 GiB	4 GiB	Browser automation and test fixtures need memory headroom beyond normal worker sizing.
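To check which QoS class a row in the table would get, the Kubernetes rules for a single-container pod can be written down in a few lines. This is a simplified sketch: the function name is hypothetical, values are plain integers (CPU in millicores, memory in MiB) rather than Kubernetes quantity strings, and the "pinned" lead shape is an illustrative value, not a number from our manifests.

```python
def qos_class(requests: dict, limits: dict) -> str:
    """Classify a single-container pod the way Kubernetes assigns QoS classes.

    Guaranteed: CPU and memory both have limits, and each request equals
    its limit (a missing request defaults to the limit).
    BestEffort: no requests and no limits at all.
    Burstable: everything else.
    """
    if not requests and not limits:
        return "BestEffort"
    guaranteed = all(
        resource in limits
        and requests.get(resource, limits[resource]) == limits[resource]
        for resource in ("cpu", "memory")
    )
    return "Guaranteed" if guaranteed else "Burstable"


# Lead agent row from the table: 750m / 1.5 CPU, 768 MiB / 1.5 GiB.
table_lead = qos_class({"cpu": 750, "memory": 768}, {"cpu": 1500, "memory": 1536})

# Illustrative pinned shape: request == limit for both resources.
pinned_lead = qos_class({"cpu": 1000, "memory": 1024}, {"cpu": 1000, "memory": 1024})

print("table lead:", table_lead)
print("pinned lead:", pinned_lead)
```

The table shape trades eviction priority for burst room; the pinned shape does the opposite.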

Worker class	Recommended pod shape	Concurrency per pod (`MAX_CONCURRENT_TASKS`)	Scaling rule
Heavy coding	1 agent per pod	1 active task (default)	Scale by adding pods, not by packing multiple coding sessions into one container.
Light specialist	1 agent per pod	1 active task (default)	Scale horizontally when queue latency matters.
Thin relay / managed-provider worker	1 agent per pod	1-2 active tasks	Increase only if the provider runtime executes outside the worker and local tool use is light.

Docker Compose on a single host

On a single VPS or bare-metal host, we leave reserve capacity for the database, Docker, the kernel page cache, logs, and deploy-time overlap. The practical rule: allocate only 70-80% of host memory to steady-state containers and keep at least 1-2 vCPU uncommitted on busy boxes.

Service	CPU budget	RAM budget	Replicas	Notes
API server	1 vCPU	1 GiB	1	Keep close to the database. Increase CPU if API latency rises during task churn.
Lead agent	1 vCPU	1 GiB	1	Usually one lead is enough for a small to medium swarm.
Heavy worker	2 vCPU burst	2-3 GiB	By host capacity	Count each active coding worker as the main unit of capacity.
Light worker	0.5-1 vCPU	512 MiB-1 GiB	By host capacity	Good filler capacity after heavy workers are reserved.
Observability / proxy / support services	0.5-1 vCPU	512 MiB-2 GiB	1 each	Include these before calculating worker slots.

Example host class	Approx. host resources	Recommended swarm shape	Notes
Small VPS	4 vCPU / 8 GiB RAM	API + lead + 1 heavy worker + 1-2 light workers	Good for evaluation, demos, and low-concurrency self-hosting.
Medium VPS	8 vCPU / 16 GiB RAM	API + lead + 3 heavy workers + 2-4 light workers	Practical baseline for a small production team.
Large VPS / small dedicated host	16 vCPU / 32 GiB RAM	API + lead + 6-8 heavy workers + 4-8 light workers	Keep 6-8 GiB free for OS cache, logs, deploy overlap, and occasional spikes.
Dedicated build host	32 vCPU / 64 GiB RAM	API + lead + 12-16 heavy workers + light workers as needed	Useful when many workers run tests or builds locally. Split database/storage if API latency or disk I/O becomes noisy.
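The 70-80% rule and the per-service budgets can be combined into a rough slot estimate. The sketch below is memory-only and deliberately crude: the function name and defaults are hypothetical, it ignores CPU and the observability/support services, and it assumes every heavy worker gets the top of its 2-3 GiB range.

```python
import math


def heavy_worker_slots(
    host_ram_gib: float,
    *,
    ram_fraction: float = 0.75,  # middle of the 70-80% steady-state rule
    fixed_gib: float = 2.0,      # API server (1 GiB) + lead agent (1 GiB)
    heavy_gib: float = 3.0,      # upper end of the heavy-worker RAM budget
    light_workers: int = 0,
    light_gib: float = 1.0,      # upper end of the light-worker RAM budget
) -> int:
    """Estimate how many heavy workers fit in the steady-state RAM budget."""
    if not 0 < ram_fraction <= 1:
        raise ValueError("ram_fraction must be in (0, 1]")
    budget = host_ram_gib * ram_fraction
    remaining = budget - fixed_gib - light_workers * light_gib
    return max(0, math.floor(remaining / heavy_gib))


small = heavy_worker_slots(8, light_workers=1)
large = heavy_worker_slots(32, light_workers=4)
print("8 GiB host:", small, "heavy worker(s)")
print("32 GiB host:", large, "heavy worker(s)")
```

With these defaults, an 8 GiB host with one light worker yields one heavy slot, in line with the Small VPS row, and a 32 GiB host with four light workers yields six, the low end of the 6-8 range. Budgeting heavy workers closer to 2 GiB is what gets you to the upper end.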

Concurrency is the real multiplier

The swarm scales most predictably when each local-runtime worker runs one active task at a time. The knob is MAX_CONCURRENT_TASKS; the generated Docker Compose default is MAX_CONCURRENT_TASKS=1, which means one active task per worker. Increasing per-worker concurrency can work for thin relay workers or low-tool tasks, but it multiplies memory peaks and makes local builds fight inside the same container.

Concurrency profile	Worker count	`MAX_CONCURRENT_TASKS` per worker	Total active tasks	Recommended host or cluster budget	When to use
Evaluation	1 lead + 1 heavy worker	1	1	2-4 vCPU, 4-8 GiB RAM	Trial deployments and occasional coding tasks.
Small team	1 lead + 2-3 heavy workers + 2 light workers	1	4-5	8 vCPU, 16 GiB RAM	Several independent tasks per day with room for reviews and content work.
Busy team	1 lead + 6-8 heavy workers + 4 light workers	1	10-12	16 vCPU, 32 GiB RAM	Regular parallel implementation, review, and QA loops.
High throughput	1-2 leads + 12-16 heavy workers + 8+ light workers	1	20+	32+ vCPU, 64+ GiB RAM or Kubernetes node pool	Sustained task queues where horizontal scale matters more than single-host simplicity.
Thin relay workers	Depends on provider	2	Varies	Add 512 MiB-1 GiB RAM per extra active task	Only for providers where execution happens outside the worker and local tooling is minimal.

Trap one: CPU that climbs because the query climbs

The CPU graph that triggered the investigation was suspicious because it was too clean. Real CPU usage jitters. This line rose with mechanical precision and reset on deploy.

The problem was the panel, not the container. It plotted container.cpu.utilization with average time aggregation, treating a cumulative-since-boot counter like an instantaneous usage gauge. Averaging a monotonic counter produces a fake climb by design.

# Wrong mental model:
avg(container.cpu.utilization)

# Better instantaneous CPU:
rate(container.cpu.usage.total)

We fixed the SigNoz Container CPU Percent panel by switching the time aggregation from average to rate. The straight-line climb disappeared. The actual instantaneous CPU was flat.
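The effect is easy to reproduce with synthetic data. The sketch below is not SigNoz's implementation; it assumes a container burning a constant 0.25 cores, a 10-second scrape interval, and a counter of cumulative CPU-seconds, then applies both aggregations to the same samples.

```python
INTERVAL_S = 10    # assumed scrape interval in seconds
CORES_USED = 0.25  # assumed constant CPU load in cores

# Cumulative CPU-seconds since container start: 0.0, 2.5, 5.0, ...
counter = [CORES_USED * INTERVAL_S * i for i in range(13)]


def average_per_window(values, size):
    """Average the raw counter values in each window (the broken panel)."""
    return [
        sum(values[i:i + size]) / size
        for i in range(0, len(values) - size + 1, size)
    ]


def rate_per_window(values, size, interval_s):
    """Per-second increase of the counter across each window (the fixed panel)."""
    return [
        (values[i + size] - values[i]) / (size * interval_s)
        for i in range(0, len(values) - size, size)
    ]


averages = average_per_window(counter, 4)
rates = rate_per_window(counter, 4, INTERVAL_S)
print("avg of counter: ", averages)  # keeps climbing although load is constant
print("rate of counter:", rates)     # flat at the real usage in cores
```

The averaged series climbs in a straight line because the counter itself does; the rate series recovers the constant 0.25 cores. A redeploy resets the counter, which is why the fake climb also reset on deploy.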

Trap two: memory that is high but reclaimable

The memory graph was more subtle. Our coder worker plateaued around 1,145 MB at idle, peaked at 1,979 MB during a heavy coding session, and reset to about 470 MB after the next redeploy. That shape can look like a leak if you only stare at container.memory.usage.total.

It was mostly two normal effects stacked together. First, cgroup page cache: coding sessions read and write a lot of files, and Linux keeps file-backed pages around until pressure forces eviction. Second, the high-water mark of a long-lived Bun/Node runner: heap can be reusable inside the runtime without RSS immediately returning to the OS.

# Better leak-triage panel:
working set = usage.total - inactive_file

Total memory still matters for OOM risk. It is not the same as a leak signal. For leak triage, graph working set and look for growth that survives restarts. If the plateau clears on every redeploy, start with page cache and runtime high-water marks before you go hunting through application code.
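For a spot check outside the dashboard, the same subtraction can be done from raw cgroup data. The sketch below assumes a cgroup v2 host, where total usage comes from memory.current and inactive_file is a line in memory.stat; the function names are hypothetical, and the sample inactive_file figure is invented for illustration rather than measured.

```python
MIB = 1024 * 1024


def parse_memory_stat(text: str) -> dict[str, int]:
    """Parse 'key value' lines in the style of a cgroup memory.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def working_set_bytes(usage_bytes: int, memory_stat_text: str) -> int:
    """working set = total usage - inactive_file, clamped at zero."""
    inactive_file = parse_memory_stat(memory_stat_text).get("inactive_file", 0)
    return max(0, usage_bytes - inactive_file)


# Illustrative sample: total usage at the idle plateau, invented page-cache split.
SAMPLE_STAT = f"""anon {400 * MIB}
file {745 * MIB}
inactive_file {600 * MIB}
active_file {145 * MIB}
"""

working_set = working_set_bytes(1145 * MIB, SAMPLE_STAT)
print(f"working set: {working_set // MIB} MiB")
```

In this made-up sample, more than half of the plateau is reclaimable file cache, which is the pattern to rule out before treating a high total as a leak.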

Harness choice is part of sizing

The harness provider changes the operational risk profile. claude is our most reliable general harness. codex is stable and good for deterministic implementation, review, and structured-output work. pi is fine for content and QA. opencode has produced intermittent session errors within roughly 10 seconds of task start, so we avoid it for determinism-critical workflow nodes.

That is not just reliability bookkeeping. It affects how much headroom you give a container and what work you route there. A Codex or Claude implementation worker gets heavy-worker resources. A content worker can stay light until the workflow adds builds, browser work, or large local processing.

Fix real leaks, then fix the dashboard

The same operational thread did uncover real cleanup work. PR #675, “Fix runner and MCP transport leaks,” bounded runner task-keyed bookkeeping and closed idle MCP owner/user transports that survived unclean disconnects. That was real.

The CPU climb after that was not. That distinction is the whole lesson: production agent infrastructure needs both code fixes and measurement discipline. If you mix them up, you either ignore a real leak or waste a morning tuning containers around a dashboard artifact.

The full reference guide, including role-by-role sizing and metric panel guidance, is in Performance & Resource Sizing.
