Server memory grows unbounded on the admin VM (10 to 14 GB RSS), starving the machine #2

Open
opened 2026-06-12 00:51:20 +00:00 by mik-tf · 1 comment
Owner

On the sandbox admin VM (16 GB), hero_embedder_provider_server grew to 10.3 GB RSS over about two weeks of uptime, and an earlier instance was OOM-killed by the kernel at 14.1 GB (dmesg has the record). Tonight the resulting memory pressure cascaded into stalls of unrelated daemons on the same VM: supervisor RPCs wedged, the router stopped answering on its sockets, and one compute daemon had lost its socket earlier in the evening. A plain service restart dropped the whole VM from 13.5 GB used back to 6 GB, so the growth is reclaimable state (an unbounded cache or a leak), not working-set need. Could the server cap its cache or recycle memory periodically? Until then we treat a periodic embedder restart as an operational workaround on the admin VM.

Signed-by: mik-tf <mik-tf@noreply.invalid>
Author
Owner

Root cause identified, and a fix is live on the sandbox admin VM.

The ONNX sessions were created with onnxruntime's defaults: the CPU arena allocator (which grows to fit each new peak batch shape and never returns memory to the OS) plus memory-pattern caching (which accumulates per input shape). With dynamic batch sizes from tester ingests, RSS only ever goes up, which matches the observed 10 to 14 GB after weeks.

Fix on the integration branch (commit 9d3c8fe): sessions are now built with the CPU execution provider's arena disabled and memory pattern off, for both the embedder and the reranker. Deployed on the admin VM tonight (local build, backup kept as hero_embedder_provider_server.bak-s231-arena): service healthy, models loaded, overlay auth gate intact, RSS baseline about 3.0 GB with all models.

Whether RSS now stays flat needs a few days of tester load to confirm. As a belt-and-braces measure, the admin VM also has a watchdog cron (/etc/cron.d/hero-embedder-watchdog) that restarts the engine if RSS exceeds 6 GB and logs to /var/log/hero-embedder-watchdog.log. If RSS stays flat over the coming days, this issue can close after the change is promoted and published through the normal release flow.
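
For readers who want to see what the session change described above amounts to, here is a minimal sketch using the onnxruntime Python bindings. It is not the code from commit 9d3c8fe, which is not shown in this thread and may use a different language binding. The helper names `apply_low_growth_settings` and `make_session` and the `model_path` argument are hypothetical; `enable_cpu_mem_arena` and `enable_mem_pattern` are the corresponding `SessionOptions` attributes in the Python API.

```python
def apply_low_growth_settings(opts):
    """Switch off the two onnxruntime defaults named above on a SessionOptions object."""
    # Disable the CPU memory arena, which otherwise keeps its peak-sized allocation.
    opts.enable_cpu_mem_arena = False
    # Disable memory-pattern planning, which otherwise caches a plan per input shape.
    opts.enable_mem_pattern = False
    return opts


def make_session(model_path):
    """Build a CPU-only session with those settings (model_path is a placeholder)."""
    # Imported lazily so the helper above can be exercised without onnxruntime installed.
    import onnxruntime as ort

    opts = apply_low_growth_settings(ort.SessionOptions())
    return ort.InferenceSession(
        model_path,
        sess_options=opts,
        providers=["CPUExecutionProvider"],
    )
```

The same options would be applied to every model the server loads (here, the embedder and the reranker). Disabling the arena can cost some allocation throughput on hot paths, so it is worth comparing latency before and after as well as RSS.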

Signed-by: mik-tf <mik-tf@noreply.invalid>
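
The contents of the watchdog cron job mentioned in the comment above are not shown in this thread. As an illustration of the idea only, the sketch below reads a process's resident set size from `/proc/<pid>/status` on Linux and runs a restart command when it exceeds a limit. Everything here is an assumption: the function names, the reading of "6 GB" as 6 GiB, and the `restart_cmd` argument, which stands in for whatever command actually restarts the service on the admin VM.

```python
import re
import subprocess

# Threshold from the comment above, interpreted here as 6 GiB and expressed in kB.
RSS_LIMIT_KB = 6 * 1024 * 1024


def parse_vm_rss_kb(status_text):
    """Return VmRSS in kB from the text of /proc/<pid>/status, or None if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def should_restart(rss_kb, limit_kb=RSS_LIMIT_KB):
    """True only when an RSS reading exists and is above the limit."""
    return rss_kb is not None and rss_kb > limit_kb


def read_rss_kb(pid):
    """Read the current RSS of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vm_rss_kb(status_file.read())


def check_and_restart(pid, restart_cmd):
    """Run restart_cmd (a hypothetical argv list) if the process is over the limit."""
    if should_restart(read_rss_kb(pid)):
        subprocess.run(restart_cmd, check=True)
        return True
    return False
```

Keeping the parsing and the threshold decision in separate pure functions makes the check easy to test without a live process; a cron entry would only need to resolve the PID and call `check_and_restart`.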
Reference
lhumina_code/hero_embedder_provider#2