[P2] Redeploy fragility — multiple stray web procs; harden + document #46
Labels
No labels
prio_critical
prio_low
type_bug
type_contact
type_issue
type_lead
type_question
type_story
type_task
No milestone
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/hero_shrimp#46
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem
Restarting
hero_shrimp_webrepeatedly left several stray processes; the bound one wins nondeterministically and which binary servesweb.sockisn't obvious. Redeploy depends on a fragile manual procedure.Evidence
hero_shrimp_webPIDs after repeated relaunches this session.Proposed fix
A single supervised launcher (or systemd unit) that kills predecessors, verifies the socket owner, and refuses to run two instances.
Filed from a comparative audit of Hero Shrimp vs Qwen-Code / kimi-cli / picoclaw (2026-05-23). Severity in title: P0=correctness/trust, P1=reliability/UX, P2=cleanup.
Implementation Spec for Issue #46 — Redeploy fragility: stray web procs
Verified root cause
In
crates/hero_shrimp_web/src/lib.rs,run_ui()blindlyremove_file(web.sock)thenUnixListener::bind(...)with no liveness check, no pidfile, no killing of predecessors. Each relaunch unlinks the socket path and binds a fresh inode while the previous process keeps serving its now-orphaned inode — so repeated relaunches leak live processes (reproduced 3 simultaneous PIDs + a pre-existing stray, matching the issue's "5 concurrent PIDs").The server binary already solves this exact problem via
crates/hero_shrimp_server/src/rpc/pidfile.rs— aPidFileRAII guard that refuses to start if a live pid owns the pidfile, and withforce=truedoes SIGTERM→SIGKILL of the predecessor, clears the stale pidfile + adjacent socket, and takes over. The web binary has no equivalent.Objective
Make
hero_shrimp_webredeploy deterministic: on startup detect a live predecessor owning the web socket, take over deterministically (kill predecessor + clean socket), and never leave two instances racing. Reuse the server's provenPidFilerather than authoring a second copy.Approach (lowest churn)
Extract
PidFileinto the shared, dependency-lighthero_shrimp_typescrate (both server and web already depend on it; web must NOT depend on the heavyweight server crate). Parameterize the adjacent-socket-to-clear so it works for bothrpc.sockandweb.sock. Server re-exports from the shared location — single implementation, no duplication.Files to Modify/Create
crates/hero_shrimp_types/src/pidfile.rs— movePidFilehere; replace hard-codedrpc.sockcleanup with asocket_name: &strparameter; port the unit tests.crates/hero_shrimp_types/src/lib.rs— addpub mod pidfile;.crates/hero_shrimp_types/src/paths.rs— adddefault_web_pidfile()→<socket_dir>/web.pid(+ unit test). Server keeps its distinctdaemon.pid.crates/hero_shrimp_types/Cargo.toml— addanyhow+libc = "0.2"deps,tempfiledev-dep.crates/hero_shrimp_server/src/rpc/pidfile.rs— replace with a thin re-export shim of the shared impl (server call sites unchanged; pass"rpc.sock").crates/hero_shrimp_web/src/lib.rs— inrun_ui(), acquirePidFile::acquire_with_socket(&default_web_pidfile(), true, "web.sock")before binding, bind it to a function-scope_pidfileso Drop fires acrossaxum::serve; loud stderr banner on refusal; drop the blind pre-bind unlink (guard now does liveness-checked cleanup).docs/getting-started.md— listweb.pidin the paths table; add a "Redeploying the web UI" note (the binary self-evicts a live predecessor; plain hero_proc restart is now deterministic, no manualpkill).Implementation Plan
PidFileintohero_shrimp_types(parameterized socket name; ported tests; deps). — deps: nonedefault_web_pidfile()path helper + test. — deps: none"rpc.sock". — deps: 1run_ui(): acquire guard, hold across serve, remove blind unlink, loud refusal banner. — deps: 1, 2cargo build -p hero_shrimp_types -p hero_shrimp_server -p hero_shrimp_web;cargo test -p hero_shrimp_types. — deps: allAcceptance Criteria
PidFileinhero_shrimp_types::pidfile; server re-exports it (no duplicated logic).default_web_pidfile()returns<socket_dir>/web.pid, covered by a unit test.run_ui()acquires the guard before binding, holds it acrossaxum::serve, releases on clean exit and on SIGTERM.web.sock, then binds — one live PID, deterministic takeover.hero_shrimp_webPID (reproduce the issue scenario).daemon.pid,rpc.sock) unchanged; existing tests pass.cargo buildfor the three crates succeeds; ported pidfile + path tests pass.docs/getting-started.mddocuments the hardened redeploy and listsweb.pid.Notes
hero_shrimp_webtakes no lifecycle flags (perhero_proc_service_selfstart) and is only ever (re)started by the supervisor — exactly the redeploy path the issue describes — so defaulting to force-takeover matches the issue's "kills predecessors" requirement.kill_otherstays: the web action's supervisor-levelkill_other.socket=["web.sock"]is a heuristic that proved insufficient; the in-process pidfile guard is the deterministic backstop. Complementary, not redundant.O_CREAT|O_EXCLatomic create already provides the race guard. Minimal, proven fix.Test Results
Unit tests —
cargo test -p hero_shrimp_typespidfile::force_takeover_clears_named_web_socket— verifies aweb.pidtakeover clears the adjacentweb.sock(notrpc.sock)pidfile::*— the five guard behaviours (write/drop, stale reclaim, force takeover, loud refusal of a live owner) moved with the implementation and still passpaths::web_pidfile_sits_beside_web_socket—web.pidresolves next toweb.sockand is distinct fromdaemon.pidServer regression —
cargo test -p hero_shrimp_serverlibcdropped from the server crate (no longer used directly). No behavioural change todaemon.pid/rpc.sockhandling.End-to-end reproduction of the issue scenario
Launched
hero_shrimp_webthree times in a row against a test socket (the exact "repeated relaunch" path from the report):[pidfile] force: evicted incumbent pid ...logged per predecessorThe "5 concurrent PIDs" scenario now deterministically collapses to exactly one live instance per relaunch.
Implementation Summary
Root cause
hero_shrimp_web'srun_ui()blindly unlinkedweb.sockand bound a fresh inode with no liveness check — so every relaunch leaked the previous process (which kept serving its now-orphaned inode). The daemon already had aPidFileguard for this; the web binary had none.Fix
Extracted the daemon's proven
PidFileinto the shared, dependency-lighthero_shrimp_typescrate as a single implementation used by both binaries, then made the web binary acquire it before binding.Changes
crates/hero_shrimp_types/src/pidfile.rs(new) — sharedPidFileRAII guard, moved from the server. Generalised the adjacent-socket-to-clear into a parameter (acquire_with_socket(path, force, socket_name));acquire/acquire_withretained asrpc.sockdefaults so the daemon call site is unchanged.crates/hero_shrimp_types/src/lib.rs—pub mod pidfile;.crates/hero_shrimp_types/src/paths.rs— addeddefault_web_pidfile()(<socket_dir>/web.pid, distinct fromdaemon.pid) + unit test.crates/hero_shrimp_types/Cargo.toml— addedanyhow,libc, andtempfile(dev).crates/hero_shrimp_server/src/rpc/pidfile.rs— reduced to a thin re-export of the shared guard; dropped the now-unusedlibcdep from the server crate.crates/hero_shrimp_web/src/lib.rs—run_ui()now acquiresPidFile::acquire_with_socket(web.pid, force=true, "web.sock")before binding, holds it acrossaxum::serve(Drop releases on clean exit / SIGTERM), and refuses loudly if a predecessor can't be evicted. Addedweb_pidfile_path()so the guard sits beside the socket even underHERO_SHRIMP_WEB_SOCKEToverride.force=truebecause the binary takes no lifecycle flags and is only ever (re)started by the supervisor.docs/getting-started.md— documenteddaemon.pid/web.pidin the paths table and added a "Redeploying" note (no more manualpkillof stray PIDs).Verification
cargo test -p hero_shrimp_types: 24 + 6 doctests pass, including the new web-socket takeover and web-pidfile path tests.cargo buildfor all three crates succeeds; server pidfile behaviour unchanged.Notes / caveats
kill_other.socket=["web.sock"]inservice.rsis kept — it's a complementary heuristic; the in-process pidfile guard is the deterministic backstop.O_CREAT|O_EXCLatomic create is the race guard.Pull request opened: #115
This PR implements the changes discussed in this issue.