All service.* RPCs wedge when one supervised service stops answering its socket #150
Reference
lhumina_code/hero_proc#150
Reproduced tonight on the sandbox admin VM under memory pressure: hero_router kept its sockets open but stopped responding. From that point on, every service.* RPC on hero_proc hung indefinitely: service.status for an unrelated healthy service, service.status_all, and service.restart alike. logs.* kept answering for a while, then the process wedged fully and needed a kill and relaunch; it also could not reap the router zombie.

The smell is a status probe to a child socket that runs with no timeout while the registry lock is held, so one unresponsive child blocks the whole service surface.

Expected: per-probe timeouts and no cross-service lock held during probes, so one bad child degrades to one bad row in status_all instead of taking down the control plane.

Related to, but distinct from, the restart issue in #149.
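To make the expected behavior concrete, here is a minimal Python sketch of the pattern, not hero_proc's actual code. Everything named in it is hypothetical: the `Registry` class, the `probe` and `status_all` functions, the `ping` wire message, and the 0.5 s budget are all assumptions made for illustration. The registry lock is held only long enough to copy the socket paths, and each probe carries its own timeout, so a child that accepts connections but never answers should show up as a single `timeout` row.

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor

PROBE_TIMEOUT_S = 0.5  # hypothetical per-probe budget, not a hero_proc setting


class Registry:
    """Hypothetical registry mapping service name -> Unix socket path."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sockets = {}

    def register(self, name, path):
        with self._lock:
            self._sockets[name] = path

    def snapshot(self):
        # Hold the lock only for the copy; no socket I/O happens under it.
        with self._lock:
            return dict(self._sockets)


def probe(path, timeout=PROBE_TIMEOUT_S):
    """Probe one child socket; every blocking call is bounded by `timeout`."""
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            s.connect(path)
            s.sendall(b"ping\n")  # assumed health-check message
            return "ok" if s.recv(16) else "closed"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "unreachable"


def status_all(registry):
    """Probe every service concurrently, outside the registry lock."""
    targets = registry.snapshot()
    if not targets:
        return {}
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        results = pool.map(probe, targets.values())
        return dict(zip(targets.keys(), results))
```

Under these assumptions, a `status_all` call takes roughly one probe timeout in the worst case rather than blocking forever, and `register` calls for other services are never stuck behind a slow child because the lock is already released by the time any probe starts.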
Signed-by: mik-tf mik-tf@noreply.invalid