[P2] Redeploy fragility — multiple stray web procs; harden + document #46

Closed
opened 2026-05-23 21:52:28 +00:00 by thabeta · 4 comments
Owner

Problem
Restarting hero_shrimp_web repeatedly left several stray processes; the bound one wins nondeterministically and which binary serves web.sock isn't obvious. Redeploy depends on a fragile manual procedure.

Evidence

  • Observed 5 concurrent hero_shrimp_web PIDs after repeated relaunches this session.

Proposed fix
A single supervised launcher (or systemd unit) that kills predecessors, verifies the socket owner, and refuses to run two instances.


Filed from a comparative audit of Hero Shrimp vs Qwen-Code / kimi-cli / picoclaw (2026-05-23). Severity in title: P0=correctness/trust, P1=reliability/UX, P2=cleanup.

**Problem** Restarting `hero_shrimp_web` repeatedly left several stray processes; the bound one wins nondeterministically and which binary serves `web.sock` isn't obvious. Redeploy depends on a fragile manual procedure. **Evidence** - Observed 5 concurrent `hero_shrimp_web` PIDs after repeated relaunches this session. **Proposed fix** A single supervised launcher (or systemd unit) that kills predecessors, verifies the socket owner, and refuses to run two instances. --- _Filed from a comparative audit of Hero Shrimp vs Qwen-Code / kimi-cli / picoclaw (2026-05-23). Severity in title: P0=correctness/trust, P1=reliability/UX, P2=cleanup._
Member

Implementation Spec for Issue #46 — Redeploy fragility: stray web procs

Verified root cause

In crates/hero_shrimp_web/src/lib.rs, run_ui() blindly remove_file(web.sock) then UnixListener::bind(...) with no liveness check, no pidfile, no killing of predecessors. Each relaunch unlinks the socket path and binds a fresh inode while the previous process keeps serving its now-orphaned inode — so repeated relaunches leak live processes (reproduced 3 simultaneous PIDs + a pre-existing stray, matching the issue's "5 concurrent PIDs").

The server binary already solves this exact problem via crates/hero_shrimp_server/src/rpc/pidfile.rs — a PidFile RAII guard that refuses to start if a live pid owns the pidfile, and with force=true does SIGTERM→SIGKILL of the predecessor, clears the stale pidfile + adjacent socket, and takes over. The web binary has no equivalent.

Objective

Make hero_shrimp_web redeploy deterministic: on startup detect a live predecessor owning the web socket, take over deterministically (kill predecessor + clean socket), and never leave two instances racing. Reuse the server's proven PidFile rather than authoring a second copy.

Approach (lowest churn)

Extract PidFile into the shared, dependency-light hero_shrimp_types crate (both server and web already depend on it; web must NOT depend on the heavyweight server crate). Parameterize the adjacent-socket-to-clear so it works for both rpc.sock and web.sock. Server re-exports from the shared location — single implementation, no duplication.

Files to Modify/Create

  • CREATE crates/hero_shrimp_types/src/pidfile.rs — move PidFile here; replace hard-coded rpc.sock cleanup with a socket_name: &str parameter; port the unit tests.
  • MODIFY crates/hero_shrimp_types/src/lib.rs — add pub mod pidfile;.
  • MODIFY crates/hero_shrimp_types/src/paths.rs — add default_web_pidfile()<socket_dir>/web.pid (+ unit test). Server keeps its distinct daemon.pid.
  • MODIFY crates/hero_shrimp_types/Cargo.toml — add anyhow + libc = "0.2" deps, tempfile dev-dep.
  • MODIFY crates/hero_shrimp_server/src/rpc/pidfile.rs — replace with a thin re-export shim of the shared impl (server call sites unchanged; pass "rpc.sock").
  • MODIFY crates/hero_shrimp_web/src/lib.rs — in run_ui(), acquire PidFile::acquire_with_socket(&default_web_pidfile(), true, "web.sock") before binding, bind it to a function-scope _pidfile so Drop fires across axum::serve; loud stderr banner on refusal; drop the blind pre-bind unlink (guard now does liveness-checked cleanup).
  • MODIFY docs/getting-started.md — list web.pid in the paths table; add a "Redeploying the web UI" note (the binary self-evicts a live predecessor; plain hero_proc restart is now deterministic, no manual pkill).

Implementation Plan

  1. Extract shared PidFile into hero_shrimp_types (parameterized socket name; ported tests; deps). — deps: none
  2. Add default_web_pidfile() path helper + test. — deps: none
  3. Re-point server at shared impl via re-export shim; pass "rpc.sock". — deps: 1
  4. Harden run_ui(): acquire guard, hold across serve, remove blind unlink, loud refusal banner. — deps: 1, 2
  5. Documentation: paths table + "Redeploying the web UI" section. — deps: 1–4
  6. Verify: cargo build -p hero_shrimp_types -p hero_shrimp_server -p hero_shrimp_web; cargo test -p hero_shrimp_types. — deps: all

Acceptance Criteria

  • Single PidFile in hero_shrimp_types::pidfile; server re-exports it (no duplicated logic).
  • default_web_pidfile() returns <socket_dir>/web.pid, covered by a unit test.
  • run_ui() acquires the guard before binding, holds it across axum::serve, releases on clean exit and on SIGTERM.
  • With a live predecessor: new process kills it, clears stale pidfile + web.sock, then binds — one live PID, deterministic takeover.
  • If predecessor survives kill, startup refuses loudly instead of binding a fresh inode beside it.
  • Repeated relaunches collapse to exactly one live hero_shrimp_web PID (reproduce the issue scenario).
  • Server pidfile behavior (daemon.pid, rpc.sock) unchanged; existing tests pass.
  • cargo build for the three crates succeeds; ported pidfile + path tests pass.
  • docs/getting-started.md documents the hardened redeploy and lists web.pid.

Notes

  • force=true for web: hero_shrimp_web takes no lifecycle flags (per hero_proc_service_selfstart) and is only ever (re)started by the supervisor — exactly the redeploy path the issue describes — so defaulting to force-takeover matches the issue's "kills predecessors" requirement.
  • kill_other stays: the web action's supervisor-level kill_other.socket=["web.sock"] is a heuristic that proved insufficient; the in-process pidfile guard is the deterministic backstop. Complementary, not redundant.
  • Scope (P2): no new launcher binary, no CLI subcommands, no flock — the existing O_CREAT|O_EXCL atomic create already provides the race guard. Minimal, proven fix.
## Implementation Spec for Issue #46 — Redeploy fragility: stray web procs ### Verified root cause In `crates/hero_shrimp_web/src/lib.rs`, `run_ui()` blindly `remove_file(web.sock)` then `UnixListener::bind(...)` with **no liveness check, no pidfile, no killing of predecessors**. Each relaunch unlinks the socket path and binds a fresh inode while the previous process keeps serving its now-orphaned inode — so repeated relaunches leak live processes (reproduced 3 simultaneous PIDs + a pre-existing stray, matching the issue's "5 concurrent PIDs"). The **server** binary already solves this exact problem via `crates/hero_shrimp_server/src/rpc/pidfile.rs` — a `PidFile` RAII guard that refuses to start if a live pid owns the pidfile, and with `force=true` does SIGTERM→SIGKILL of the predecessor, clears the stale pidfile + adjacent socket, and takes over. The web binary has no equivalent. ### Objective Make `hero_shrimp_web` redeploy deterministic: on startup detect a live predecessor owning the web socket, take over deterministically (kill predecessor + clean socket), and never leave two instances racing. Reuse the server's proven `PidFile` rather than authoring a second copy. ### Approach (lowest churn) Extract `PidFile` into the shared, dependency-light `hero_shrimp_types` crate (both server and web already depend on it; web must NOT depend on the heavyweight server crate). Parameterize the adjacent-socket-to-clear so it works for both `rpc.sock` and `web.sock`. Server re-exports from the shared location — single implementation, no duplication. ### Files to Modify/Create - **CREATE** `crates/hero_shrimp_types/src/pidfile.rs` — move `PidFile` here; replace hard-coded `rpc.sock` cleanup with a `socket_name: &str` parameter; port the unit tests. - **MODIFY** `crates/hero_shrimp_types/src/lib.rs` — add `pub mod pidfile;`. - **MODIFY** `crates/hero_shrimp_types/src/paths.rs` — add `default_web_pidfile()` → `<socket_dir>/web.pid` (+ unit test). Server keeps its distinct `daemon.pid`. - **MODIFY** `crates/hero_shrimp_types/Cargo.toml` — add `anyhow` + `libc = "0.2"` deps, `tempfile` dev-dep. - **MODIFY** `crates/hero_shrimp_server/src/rpc/pidfile.rs` — replace with a thin re-export shim of the shared impl (server call sites unchanged; pass `"rpc.sock"`). - **MODIFY** `crates/hero_shrimp_web/src/lib.rs` — in `run_ui()`, acquire `PidFile::acquire_with_socket(&default_web_pidfile(), true, "web.sock")` before binding, bind it to a function-scope `_pidfile` so Drop fires across `axum::serve`; loud stderr banner on refusal; drop the blind pre-bind unlink (guard now does liveness-checked cleanup). - **MODIFY** `docs/getting-started.md` — list `web.pid` in the paths table; add a "Redeploying the web UI" note (the binary self-evicts a live predecessor; plain hero_proc restart is now deterministic, no manual `pkill`). ### Implementation Plan 1. Extract shared `PidFile` into `hero_shrimp_types` (parameterized socket name; ported tests; deps). — *deps: none* 2. Add `default_web_pidfile()` path helper + test. — *deps: none* 3. Re-point server at shared impl via re-export shim; pass `"rpc.sock"`. — *deps: 1* 4. Harden `run_ui()`: acquire guard, hold across serve, remove blind unlink, loud refusal banner. — *deps: 1, 2* 5. Documentation: paths table + "Redeploying the web UI" section. — *deps: 1–4* 6. Verify: `cargo build -p hero_shrimp_types -p hero_shrimp_server -p hero_shrimp_web`; `cargo test -p hero_shrimp_types`. — *deps: all* ### Acceptance Criteria - [ ] Single `PidFile` in `hero_shrimp_types::pidfile`; server re-exports it (no duplicated logic). - [ ] `default_web_pidfile()` returns `<socket_dir>/web.pid`, covered by a unit test. - [ ] `run_ui()` acquires the guard before binding, holds it across `axum::serve`, releases on clean exit and on SIGTERM. - [ ] With a live predecessor: new process kills it, clears stale pidfile + `web.sock`, then binds — one live PID, deterministic takeover. - [ ] If predecessor survives kill, startup refuses loudly instead of binding a fresh inode beside it. - [ ] Repeated relaunches collapse to exactly one live `hero_shrimp_web` PID (reproduce the issue scenario). - [ ] Server pidfile behavior (`daemon.pid`, `rpc.sock`) unchanged; existing tests pass. - [ ] `cargo build` for the three crates succeeds; ported pidfile + path tests pass. - [ ] `docs/getting-started.md` documents the hardened redeploy and lists `web.pid`. ### Notes - **force=true for web:** `hero_shrimp_web` takes no lifecycle flags (per `hero_proc_service_selfstart`) and is only ever (re)started by the supervisor — exactly the redeploy path the issue describes — so defaulting to force-takeover matches the issue's "kills predecessors" requirement. - **`kill_other` stays:** the web action's supervisor-level `kill_other.socket=["web.sock"]` is a heuristic that proved insufficient; the in-process pidfile guard is the deterministic backstop. Complementary, not redundant. - **Scope (P2):** no new launcher binary, no CLI subcommands, no flock — the existing `O_CREAT|O_EXCL` atomic create already provides the race guard. Minimal, proven fix.
Member

Test Results

Unit tests — cargo test -p hero_shrimp_types

  • Total: 24 passed, 0 failed (+ 6 doctests passed)
  • New coverage:
    • pidfile::force_takeover_clears_named_web_socket — verifies a web.pid takeover clears the adjacent web.sock (not rpc.sock)
    • pidfile::* — the five guard behaviours (write/drop, stale reclaim, force takeover, loud refusal of a live owner) moved with the implementation and still pass
    • paths::web_pidfile_sits_beside_web_socketweb.pid resolves next to web.sock and is distinct from daemon.pid

Server regression — cargo test -p hero_shrimp_server

  • Builds cleanly after the daemon was re-pointed at the shared guard; libc dropped from the server crate (no longer used directly). No behavioural change to daemon.pid / rpc.sock handling.

End-to-end reproduction of the issue scenario

Launched hero_shrimp_web three times in a row against a test socket (the exact "repeated relaunch" path from the report):

Before fix After fix
3 relaunches all 3 PIDs stayed alive (stray procs) only the last PID alive
pidfile n/a contains exactly the one survivor
eviction none [pidfile] force: evicted incumbent pid ... logged per predecessor

The "5 concurrent PIDs" scenario now deterministically collapses to exactly one live instance per relaunch.

## Test Results ### Unit tests — `cargo test -p hero_shrimp_types` - Total: 24 passed, 0 failed (+ 6 doctests passed) - New coverage: - `pidfile::force_takeover_clears_named_web_socket` — verifies a `web.pid` takeover clears the adjacent `web.sock` (not `rpc.sock`) - `pidfile::*` — the five guard behaviours (write/drop, stale reclaim, force takeover, loud refusal of a live owner) moved with the implementation and still pass - `paths::web_pidfile_sits_beside_web_socket` — `web.pid` resolves next to `web.sock` and is distinct from `daemon.pid` ### Server regression — `cargo test -p hero_shrimp_server` - Builds cleanly after the daemon was re-pointed at the shared guard; `libc` dropped from the server crate (no longer used directly). No behavioural change to `daemon.pid` / `rpc.sock` handling. ### End-to-end reproduction of the issue scenario Launched `hero_shrimp_web` three times in a row against a test socket (the exact "repeated relaunch" path from the report): | | Before fix | After fix | |---|---|---| | 3 relaunches | all 3 PIDs stayed alive (stray procs) | only the last PID alive | | pidfile | n/a | contains exactly the one survivor | | eviction | none | `[pidfile] force: evicted incumbent pid ...` logged per predecessor | The "5 concurrent PIDs" scenario now deterministically collapses to exactly one live instance per relaunch.
Member

Implementation Summary

Root cause

hero_shrimp_web's run_ui() blindly unlinked web.sock and bound a fresh inode with no liveness check — so every relaunch leaked the previous process (which kept serving its now-orphaned inode). The daemon already had a PidFile guard for this; the web binary had none.

Fix

Extracted the daemon's proven PidFile into the shared, dependency-light hero_shrimp_types crate as a single implementation used by both binaries, then made the web binary acquire it before binding.

Changes

  • crates/hero_shrimp_types/src/pidfile.rs (new) — shared PidFile RAII guard, moved from the server. Generalised the adjacent-socket-to-clear into a parameter (acquire_with_socket(path, force, socket_name)); acquire / acquire_with retained as rpc.sock defaults so the daemon call site is unchanged.
  • crates/hero_shrimp_types/src/lib.rspub mod pidfile;.
  • crates/hero_shrimp_types/src/paths.rs — added default_web_pidfile() (<socket_dir>/web.pid, distinct from daemon.pid) + unit test.
  • crates/hero_shrimp_types/Cargo.toml — added anyhow, libc, and tempfile (dev).
  • crates/hero_shrimp_server/src/rpc/pidfile.rs — reduced to a thin re-export of the shared guard; dropped the now-unused libc dep from the server crate.
  • crates/hero_shrimp_web/src/lib.rsrun_ui() now acquires PidFile::acquire_with_socket(web.pid, force=true, "web.sock") before binding, holds it across axum::serve (Drop releases on clean exit / SIGTERM), and refuses loudly if a predecessor can't be evicted. Added web_pidfile_path() so the guard sits beside the socket even under HERO_SHRIMP_WEB_SOCKET override. force=true because the binary takes no lifecycle flags and is only ever (re)started by the supervisor.
  • docs/getting-started.md — documented daemon.pid / web.pid in the paths table and added a "Redeploying" note (no more manual pkill of stray PIDs).

Verification

  • cargo test -p hero_shrimp_types: 24 + 6 doctests pass, including the new web-socket takeover and web-pidfile path tests.
  • cargo build for all three crates succeeds; server pidfile behaviour unchanged.
  • Reproduced the issue scenario: three back-to-back relaunches previously left all three alive; now they deterministically collapse to one live instance, with an eviction banner logged per predecessor.

Notes / caveats

  • The supervisor-level kill_other.socket=["web.sock"] in service.rs is kept — it's a complementary heuristic; the in-process pidfile guard is the deterministic backstop.
  • Scope kept minimal (P2): no new launcher binary, CLI flags, or flock — the existing O_CREAT|O_EXCL atomic create is the race guard.
## Implementation Summary ### Root cause `hero_shrimp_web`'s `run_ui()` blindly unlinked `web.sock` and bound a fresh inode with no liveness check — so every relaunch leaked the previous process (which kept serving its now-orphaned inode). The daemon already had a `PidFile` guard for this; the web binary had none. ### Fix Extracted the daemon's proven `PidFile` into the shared, dependency-light `hero_shrimp_types` crate as a single implementation used by both binaries, then made the web binary acquire it before binding. ### Changes - **`crates/hero_shrimp_types/src/pidfile.rs`** (new) — shared `PidFile` RAII guard, moved from the server. Generalised the adjacent-socket-to-clear into a parameter (`acquire_with_socket(path, force, socket_name)`); `acquire` / `acquire_with` retained as `rpc.sock` defaults so the daemon call site is unchanged. - **`crates/hero_shrimp_types/src/lib.rs`** — `pub mod pidfile;`. - **`crates/hero_shrimp_types/src/paths.rs`** — added `default_web_pidfile()` (`<socket_dir>/web.pid`, distinct from `daemon.pid`) + unit test. - **`crates/hero_shrimp_types/Cargo.toml`** — added `anyhow`, `libc`, and `tempfile` (dev). - **`crates/hero_shrimp_server/src/rpc/pidfile.rs`** — reduced to a thin re-export of the shared guard; dropped the now-unused `libc` dep from the server crate. - **`crates/hero_shrimp_web/src/lib.rs`** — `run_ui()` now acquires `PidFile::acquire_with_socket(web.pid, force=true, "web.sock")` before binding, holds it across `axum::serve` (Drop releases on clean exit / SIGTERM), and refuses loudly if a predecessor can't be evicted. Added `web_pidfile_path()` so the guard sits beside the socket even under `HERO_SHRIMP_WEB_SOCKET` override. `force=true` because the binary takes no lifecycle flags and is only ever (re)started by the supervisor. - **`docs/getting-started.md`** — documented `daemon.pid` / `web.pid` in the paths table and added a "Redeploying" note (no more manual `pkill` of stray PIDs). ### Verification - `cargo test -p hero_shrimp_types`: 24 + 6 doctests pass, including the new web-socket takeover and web-pidfile path tests. - `cargo build` for all three crates succeeds; server pidfile behaviour unchanged. - Reproduced the issue scenario: three back-to-back relaunches previously left all three alive; now they deterministically collapse to one live instance, with an eviction banner logged per predecessor. ### Notes / caveats - The supervisor-level `kill_other.socket=["web.sock"]` in `service.rs` is kept — it's a complementary heuristic; the in-process pidfile guard is the deterministic backstop. - Scope kept minimal (P2): no new launcher binary, CLI flags, or flock — the existing `O_CREAT|O_EXCL` atomic create is the race guard.
Member

Pull request opened: #115

This PR implements the changes discussed in this issue.

Pull request opened: https://forge.ourworld.tf/lhumina_code/hero_shrimp/pulls/115 This PR implements the changes discussed in this issue.
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_shrimp#46
No description provided.