Server memory grows unbounded on the admin VM (10 to 14 GB RSS), starving the machine #2
On the sandbox admin VM (16 GB), hero_embedder_provider_server grew to 10.3 GB RSS over about two weeks of uptime, and an earlier instance was OOM-killed by the kernel at 14.1 GB (dmesg has the record). Tonight the resulting memory pressure cascaded into stalls of unrelated daemons on the same VM: supervisor RPCs wedged, the router stopped answering on its sockets, and one compute daemon lost its socket earlier in the evening. A plain service restart dropped the whole VM from 13.5 GB used back to 6 GB, so the growth is reclaimable state (an unbounded cache or a leak), not working-set need. Could the server cap its cache or recycle memory periodically? Until then we treat a periodic embedder restart as an operational workaround on the admin VM.
Signed-by: mik-tf mik-tf@noreply.invalid
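As an illustration of how the restart workaround could be driven by measured RSS rather than a fixed schedule, here is a minimal Python sketch. It assumes a Linux host where /proc/<pid>/status exposes a VmRSS line. The function names and the limit parameter are hypothetical and not taken from the server or its tooling. The limit is expressed in GiB, which may differ slightly from the GB figures quoted above.

```python
import re
from typing import Optional


def parse_vm_rss_kib(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in KiB) from the text of /proc/<pid>/status.

    Returns None when no VmRSS line is present (for example, kernel threads).
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid: int) -> Optional[int]:
    """Read the current RSS of a live process; Linux-only."""
    with open(f"/proc/{pid}/status", encoding="ascii") as fh:
        return parse_vm_rss_kib(fh.read())


def exceeds_limit(rss_kib: Optional[int], limit_gib: float) -> bool:
    """Return True when the measured RSS is above the operator-chosen limit."""
    if rss_kib is None:
        return False  # no measurement available: do not trigger a restart
    return rss_kib > limit_gib * 1024 * 1024
```

A cron job or supervisor hook could call read_rss_kib with the server's PID and restart the service only when exceeds_limit returns True. How the PID is found and how the restart is issued depend on the service manager in use and are left out here.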
Root cause identified and a fix is live on the sandbox admin VM. The ONNX sessions were created with onnxruntime's defaults: the CPU arena allocator (which grows to each new peak batch shape and never returns memory to the OS) plus memory-pattern caching (which accumulates per input shape). With dynamic batch sizes from tester ingests, RSS only ever goes up; that is exactly the observed 10 to 14 GB after weeks.

Fix on the integration branch (commit 9d3c8fe): sessions are now built with the CPU execution provider's arena disabled and memory pattern off, for both the embedder and the reranker. Deployed on the admin VM tonight (local build, backup kept as hero_embedder_provider_server.bak-s231-arena): service healthy, models loaded, overlay auth gate intact, RSS baseline about 3.0 GB with all models.

Whether RSS now stays flat needs a few days of tester load to confirm, so as a belt-and-braces measure the admin VM also has a watchdog cron (/etc/cron.d/hero-embedder-watchdog) that restarts the engine if RSS exceeds 6 GB and logs to /var/log/hero-embedder-watchdog.log. If RSS stays flat over the coming days, this issue can close after the change is promoted and published through the normal release flow.
Signed-by: mik-tf mik-tf@noreply.invalid
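For readers unfamiliar with the two settings named above, the following sketch shows how they look in onnxruntime's Python binding. This is an assumption made for illustration only: the thread does not say which language or binding the server uses, and commit 9d3c8fe is not reproduced here. The function name and model_path are hypothetical.

```python
def build_arena_free_session(model_path: str):
    """Create a CPU-only ONNX Runtime session with the arena and memory pattern off.

    The import is kept inside the function so that this module can be loaded
    on machines where the third-party onnxruntime package is not installed.
    """
    import onnxruntime as ort

    opts = ort.SessionOptions()
    # Default is True: the CPU arena keeps the memory of the largest
    # allocation it has seen instead of handing it back to the OS.
    opts.enable_cpu_mem_arena = False
    # Default is True: allocation plans are cached per input shape,
    # which adds up when batch sizes vary.
    opts.enable_mem_pattern = False

    return ort.InferenceSession(
        model_path,
        sess_options=opts,
        providers=["CPUExecutionProvider"],
    )
```

Disabling the arena generally trades some allocation speed for memory that is returned after each run, so throughput may be worth re-measuring alongside RSS once the change is under load.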