lhumina_code/hero_proc

All service.* RPCs wedge when one supervised service stops answering its socket #150

Open

opened 2026-06-12 00:51:20 +00:00 by mik-tf · 0 comments

mik-tf commented

2026-06-12 00:51:20 +00:00

Owner

Reproduced tonight on the sandbox admin VM under memory pressure. hero_router kept its sockets open but stopped responding, and from that point every service.* RPC on hero_proc hung indefinitely: service.status for an unrelated healthy service, service.status_all, and service.restart alike. logs.* kept answering for a while, then the process wedged fully and needed a kill and relaunch; it also could not reap the router zombie.

The smell is a status probe to a child socket that has no timeout and runs while the registry lock is held, so one unresponsive child blocks the whole service surface.

Expected: per-probe timeouts and no cross-service lock held during probes, so one bad child degrades to one bad row in status_all instead of taking down the control plane.

Related to, but distinct from, the restart issue in #149.

Signed-by: mik-tf <mik-tf@noreply.invalid>
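
To make the expected behaviour concrete, here is a minimal sketch of the pattern, not hero_proc's actual code. It assumes a registry that maps service names to Unix socket paths and a line-based `ping` probe; the names `Registry`, `probe`, `status_all`, and `PROBE_TIMEOUT_S` and the wire format are all hypothetical. The registry lock is held only long enough to copy the map, and every probe carries its own timeout and runs in its own worker.

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor

PROBE_TIMEOUT_S = 0.5  # hypothetical per-probe budget, in seconds


class Registry:
    """Hypothetical registry: service name -> Unix socket path."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sockets = {}

    def register(self, name, path):
        with self._lock:
            self._sockets[name] = path

    def snapshot(self):
        # Hold the lock only while copying; never while doing I/O.
        with self._lock:
            return dict(self._sockets)


def probe(path, timeout=PROBE_TIMEOUT_S):
    """Probe one child socket; the timeout bounds connect, send, and recv."""
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            s.connect(path)
            s.sendall(b"ping\n")  # assumed probe message, not a real protocol
            return "ok" if s.recv(64) else "closed"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "unreachable"


def status_all(registry):
    """Probe every service concurrently, with no registry lock held."""
    targets = registry.snapshot()
    if not targets:
        return {}
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        results = pool.map(probe, targets.values())
        return dict(zip(targets.keys(), results))
```

With this shape, a child that accepts connections but never answers shows up as a single `timeout` row, and the whole call returns in roughly one probe budget rather than hanging.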
