All service.* RPCs wedge when one supervised service stops answering its socket #150

Open
opened 2026-06-12 00:51:20 +00:00 by mik-tf · 0 comments
Owner

Reproduced tonight on the sandbox admin VM under memory pressure: hero_router kept its sockets open but stopped responding. From that point on, every service.* RPC on hero_proc hung indefinitely: service.status for an unrelated healthy service, service.status_all, and service.restart alike. logs.* kept answering for a while, then the process wedged fully and needed a kill and relaunch; it also could not reap the router zombie.

The smell is a status probe to a child socket with no timeout, made while holding the registry lock, so one unresponsive child blocks the whole service surface.

Expected: per-probe timeouts and no cross-service lock held during probes, so one bad child degrades to one bad row in status_all instead of taking down the control plane.

Related to, but distinct from, the restart issue in #149.

Signed-by: mik-tf <mik-tf@noreply.invalid>
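
To make the expected behaviour concrete, here is a minimal sketch of the pattern the report asks for: copy the registry under the lock, release it, then probe each child with its own timeout. It is not hero_proc's code. The names `Registry`, `probe`, `status_all`, `PROBE_TIMEOUT_S` and the `ping`/`pong` exchange are all hypothetical, and the wedged child is simulated by a Unix socket that listens but never accepts.

```python
import os
import socket
import tempfile
import threading

PROBE_TIMEOUT_S = 0.5  # hypothetical per-probe budget, not a hero_proc setting


class Registry:
    """Hypothetical service registry: name -> Unix socket path."""

    def __init__(self):
        self._lock = threading.Lock()
        self._services = {}

    def register(self, name, path):
        with self._lock:
            self._services[name] = path

    def snapshot(self):
        # Hold the lock only long enough to copy the table.
        with self._lock:
            return dict(self._services)


def probe(path, timeout=PROBE_TIMEOUT_S):
    """Return 'ok', 'unresponsive' or 'down' for one child socket."""
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)  # bounds connect, send and recv
            s.connect(path)
            s.sendall(b"ping\n")  # assumed health-check message
            return "ok" if s.recv(16) else "down"
    except socket.timeout:
        return "unresponsive"
    except OSError:
        return "down"


def status_all(registry):
    # Probes run on a snapshot, so no registry lock is held while waiting.
    return {name: probe(path) for name, path in registry.snapshot().items()}


def serve_pong(listener):
    """Stand-in for a healthy child: answer every connection once."""
    while True:
        try:
            conn, _ = listener.accept()
        except OSError:
            return
        with conn:
            conn.recv(16)
            conn.sendall(b"pong\n")


def demo():
    tmp = tempfile.mkdtemp()
    healthy_path = os.path.join(tmp, "healthy.sock")
    wedged_path = os.path.join(tmp, "wedged.sock")

    good = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    good.bind(healthy_path)
    good.listen(1)
    threading.Thread(target=serve_pong, args=(good,), daemon=True).start()

    # Simulates the wedged router: socket open, nobody ever answers.
    bad = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    bad.bind(wedged_path)
    bad.listen(1)

    registry = Registry()
    registry.register("healthy_service", healthy_path)
    registry.register("hero_router", wedged_path)
    return status_all(registry)


if __name__ == "__main__":
    print(demo())
```

In this sketch the wedged child costs at most one timeout and shows up as a single `unresponsive` row, while the healthy service still reports `ok`. Probes run sequentially here for brevity, so the worst case scales with the number of stuck children; running them concurrently would be a natural refinement.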

Reference
lhumina_code/hero_proc#150