[P1] ~45 unsupervised tokio::spawn sites can panic/hang silently #38

Closed
opened 2026-05-23 21:52:23 +00:00 by thabeta · 3 comments
Owner

Problem
Long-lived background tasks (integration gateways, SSE chat, scheduler, hooks, MCP) are spawned without JoinError monitoring, restart, or even logging. Only proposer.rs has panic-isolation + backoff-restart.

Evidence

  • ARCHITECTURE_CLEANUP_PLAN.md (P0.2); crates/hero_shrimp_engine/src/agent_core/agent/proposer.rs is the template not applied broadly.

Proposed fix
Wrap supervised spawns in a helper that logs JoinError, and restarts (with backoff) the ones that should be durable.


Filed from a comparative audit of Hero Shrimp vs Qwen-Code / kimi-cli / picoclaw (2026-05-23). Severity in title: P0=correctness/trust, P1=reliability/UX, P2=cleanup.

**Problem** Long-lived background tasks (integration gateways, SSE chat, scheduler, hooks, MCP) are spawned without JoinError monitoring, restart, or even logging. Only `proposer.rs` has panic-isolation + backoff-restart. **Evidence** - `ARCHITECTURE_CLEANUP_PLAN.md` (P0.2); `crates/hero_shrimp_engine/src/agent_core/agent/proposer.rs` is the template not applied broadly. **Proposed fix** Wrap supervised spawns in a helper that logs JoinError, and restarts (with backoff) the ones that should be durable. --- _Filed from a comparative audit of Hero Shrimp vs Qwen-Code / kimi-cli / picoclaw (2026-05-23). Severity in title: P0=correctness/trust, P1=reliability/UX, P2=cleanup._
rawan self-assigned this 2026-06-15 07:48:06 +00:00
Member

Implementation Spec for Issue #38 — Supervise remaining unsupervised tokio::spawn sites

Important finding (issue premise is partially stale)

The supervision helper already exists: spawn_supervised(name, factory) in crates/hero_shrimp_lib/crates/runtime/src/shrimp/runtime/supervise.rs (panic isolation, tracing::error! on panic/return, exponential backoff 1s→60s, resets after a 300s healthy run). The issue's reference to proposer.rs is stale — that path no longer exists; the template now lives in supervise.rs and is already applied to ~10 runtime workers and the 3 listener.rs tasks (job_scheduler, agent_mailbox, accept_task).

So this task is finishing the rollout, not building the helper.

Objective

Bring the remaining always-on / durable background tasks under spawn_supervised (or a log-only variant) so a panic or unexpected return is logged loudly and restarted with backoff, matching the protection runtime workers already have.

Spawn-site classification

Group A — durable, SHOULD be supervised (still raw tokio::spawn):

  • adapters/.../telegram.rs:464 — Telegram bot dispatcher polling loop (always-on gateway).
  • adapters/.../admin.rs:114 — Admin HTTP/SSE server (axum::serve) — always-on gateway.
  • runtime/.../cron.rs:169 — Cron scheduler per-job durable loop.
  • engine/.../integrations/lsp/client.rs:91 — LSP reader loop (parse frames forever).
  • engine/.../client/engine_client/rpc_client.rs:73 — RPC client reader/demux loop.

Group B — per-request / per-connection / job-scoped — leave (or optional log-only): SSE loops (events.rs:37, chat.rs:52), per-request queue/crew/proof_run spawns, job watchdog/heartbeat, agent-loop helpers, spawn_with_context wrappers.

Group C — test/stub spawns — out of scope: all #[cfg(test)] and spawn_stub* sites.

Files to Modify

  • crates/hero_shrimp_lib/crates/adapters/src/shrimp/adapters/telegram.rs (~464)
  • crates/hero_shrimp_lib/crates/adapters/src/shrimp/adapters/admin.rs (~114)
  • crates/hero_shrimp_lib/crates/runtime/src/shrimp/runtime/cron.rs (~169)
  • crates/hero_shrimp_lib/crates/engine/src/shrimp/core/integrations/lsp/client.rs (~91)
  • crates/hero_shrimp_lib/crates/engine/src/shrimp/core/client/engine_client/rpc_client.rs (~73)
  • (optional) crates/hero_shrimp_lib/crates/runtime/src/shrimp/runtime/supervise.rs — add spawn_logged log-only one-shot variant.

Implementation Plan

Step 0 (optional): Add spawn_logged(name, fut) to supervise.rs — inner-spawn, await JoinHandle, tracing::error! on JoinError, no restart. Needed only for reader loops / one-shots where restart is infeasible. Dependencies: none.

Step 1: Supervise Telegram gateway dispatcher (telegram.rs:464). Rebuild dispatcher/listener from cloned bot+handler inside the Fn factory; preserve record_channel_failure("telegram", …). Dependencies: none.

Step 2: Supervise Admin HTTP/SSE server (admin.rs:114). Rebuild listener+router from config/state inside the factory; preserve record_channel_failure("admin", …); ensure socket rebind is idempotent. Dependencies: none.

Step 3: Supervise cron durable loop (cron.rs:169, NOT the one-shot at :215). Use a static label (e.g. "cron_scheduled_job") and pass job id as a tracing field. Dependencies: none.

Step 4: LSP reader loop (lsp/client.rs:91). stdout/child are non-Clone; a dead process can't be re-read. Prefer spawn_logged (log a mid-frame panic) keeping EOF-exit terminal. Dependencies: Step 0.

Step 5: RPC client reader loop (rpc_client.rs:73). Same shape as Step 4 — non-Clone reader; use spawn_logged. Dependencies: Step 0.

Step 6 (optional): Convert selected Group B one-shots (lifecycle.rs:36, cron.rs:215, mcp.rs:473, telegram.rs:160) to spawn_logged. Dependencies: Step 0.

Step 7: Verification sweep — re-run grep -rn "tokio::spawn" and confirm every remaining raw site is deliberate (Group B/C).

Acceptance Criteria

  • All Group A sites are under spawn_supervised (restart) or spawn_logged (log-only), with the choice documented inline where restart was infeasible.
  • record_channel_failure side effects for telegram/admin preserved.
  • No already-supervised site (listener.rs, runtime workers) altered or regressed.
  • Per-request / per-connection / job-scoped / test spawns left unchanged.
  • Converted factories are valid Fn closures (or log-only FnOnce), non-Clone state rebuilt inside.
  • Workspace builds with 0 warnings; runtime/server/engine/adapters tests green.
  • Final grep confirms remaining raw sites are intentional.

Notes / scope boundaries

  • Do NOT re-create spawn_supervised — it exists and is applied to runtime workers + listener tasks.
  • name must be &'static str; pass dynamic ids (cron job id, server name) via tracing fields.
  • Re-runnability is the main hazard: spawn_supervised rebuilds the body each restart. Reader loops owning non-Clone I/O can't be restarted into a closed pipe — prefer spawn_logged there.
  • Do NOT supervise SSE per-connection loops, per-request spawns, job-scoped watchdog/heartbeat, or spawn_with_context wrappers — they self-terminate with their owning scope.
  • All test/stub spawns out of scope.
## Implementation Spec for Issue #38 — Supervise remaining unsupervised `tokio::spawn` sites ### Important finding (issue premise is partially stale) The supervision helper **already exists**: `spawn_supervised(name, factory)` in `crates/hero_shrimp_lib/crates/runtime/src/shrimp/runtime/supervise.rs` (panic isolation, `tracing::error!` on panic/return, exponential backoff 1s→60s, resets after a 300s healthy run). The issue's reference to `proposer.rs` is stale — that path no longer exists; the template now lives in `supervise.rs` and is **already applied** to ~10 runtime workers and the 3 `listener.rs` tasks (job_scheduler, agent_mailbox, accept_task). So this task is **finishing the rollout**, not building the helper. ### Objective Bring the remaining always-on / durable background tasks under `spawn_supervised` (or a log-only variant) so a panic or unexpected return is logged loudly and restarted with backoff, matching the protection runtime workers already have. ### Spawn-site classification **Group A — durable, SHOULD be supervised (still raw `tokio::spawn`):** - `adapters/.../telegram.rs:464` — Telegram bot dispatcher polling loop (always-on gateway). - `adapters/.../admin.rs:114` — Admin HTTP/SSE server (`axum::serve`) — always-on gateway. - `runtime/.../cron.rs:169` — Cron scheduler per-job durable loop. - `engine/.../integrations/lsp/client.rs:91` — LSP reader loop (parse frames forever). - `engine/.../client/engine_client/rpc_client.rs:73` — RPC client reader/demux loop. **Group B — per-request / per-connection / job-scoped — leave (or optional log-only):** SSE loops (`events.rs:37`, `chat.rs:52`), per-request queue/crew/proof_run spawns, job watchdog/heartbeat, agent-loop helpers, `spawn_with_context` wrappers. **Group C — test/stub spawns — out of scope:** all `#[cfg(test)]` and `spawn_stub*` sites. ### Files to Modify - `crates/hero_shrimp_lib/crates/adapters/src/shrimp/adapters/telegram.rs` (~464) - `crates/hero_shrimp_lib/crates/adapters/src/shrimp/adapters/admin.rs` (~114) - `crates/hero_shrimp_lib/crates/runtime/src/shrimp/runtime/cron.rs` (~169) - `crates/hero_shrimp_lib/crates/engine/src/shrimp/core/integrations/lsp/client.rs` (~91) - `crates/hero_shrimp_lib/crates/engine/src/shrimp/core/client/engine_client/rpc_client.rs` (~73) - (optional) `crates/hero_shrimp_lib/crates/runtime/src/shrimp/runtime/supervise.rs` — add `spawn_logged` log-only one-shot variant. ### Implementation Plan **Step 0 (optional):** Add `spawn_logged(name, fut)` to `supervise.rs` — inner-spawn, await JoinHandle, `tracing::error!` on JoinError, no restart. Needed only for reader loops / one-shots where restart is infeasible. Dependencies: none. **Step 1:** Supervise Telegram gateway dispatcher (`telegram.rs:464`). Rebuild dispatcher/listener from cloned `bot`+handler inside the `Fn` factory; preserve `record_channel_failure("telegram", …)`. Dependencies: none. **Step 2:** Supervise Admin HTTP/SSE server (`admin.rs:114`). Rebuild listener+router from config/state inside the factory; preserve `record_channel_failure("admin", …)`; ensure socket rebind is idempotent. Dependencies: none. **Step 3:** Supervise cron durable loop (`cron.rs:169`, NOT the one-shot at :215). Use a static label (e.g. `"cron_scheduled_job"`) and pass job id as a tracing field. Dependencies: none. **Step 4:** LSP reader loop (`lsp/client.rs:91`). `stdout`/`child` are non-`Clone`; a dead process can't be re-read. Prefer `spawn_logged` (log a mid-frame panic) keeping EOF-exit terminal. Dependencies: Step 0. **Step 5:** RPC client reader loop (`rpc_client.rs:73`). Same shape as Step 4 — non-`Clone` reader; use `spawn_logged`. Dependencies: Step 0. **Step 6 (optional):** Convert selected Group B one-shots (`lifecycle.rs:36`, `cron.rs:215`, `mcp.rs:473`, `telegram.rs:160`) to `spawn_logged`. Dependencies: Step 0. **Step 7:** Verification sweep — re-run `grep -rn "tokio::spawn"` and confirm every remaining raw site is deliberate (Group B/C). ### Acceptance Criteria - [ ] All Group A sites are under `spawn_supervised` (restart) or `spawn_logged` (log-only), with the choice documented inline where restart was infeasible. - [ ] `record_channel_failure` side effects for telegram/admin preserved. - [ ] No already-supervised site (listener.rs, runtime workers) altered or regressed. - [ ] Per-request / per-connection / job-scoped / test spawns left unchanged. - [ ] Converted factories are valid `Fn` closures (or log-only `FnOnce`), non-`Clone` state rebuilt inside. - [ ] Workspace builds with 0 warnings; runtime/server/engine/adapters tests green. - [ ] Final grep confirms remaining raw sites are intentional. ### Notes / scope boundaries - Do NOT re-create `spawn_supervised` — it exists and is applied to runtime workers + listener tasks. - `name` must be `&'static str`; pass dynamic ids (cron job id, server name) via tracing fields. - Re-runnability is the main hazard: `spawn_supervised` rebuilds the body each restart. Reader loops owning non-`Clone` I/O can't be restarted into a closed pipe — prefer `spawn_logged` there. - Do NOT supervise SSE per-connection loops, per-request spawns, job-scoped watchdog/heartbeat, or `spawn_with_context` wrappers — they self-terminate with their owning scope. - All test/stub spawns out of scope.
Member

Test Results

Ran cargo test on the changed crates after the conversions.

  • hero_shrimp_engine: 1733 passed, 0 failed, 10 ignored
  • hero_shrimp_runtime: 580 passed, 0 failed, 0 ignored (includes supervise::tests::restarts_after_panic and the new supervise::tests::logged_runs_future_to_completion)
  • hero_shrimp_adapters: no unit tests (0)

Totals: 2313 passed, 0 failed. Full workspace builds cleanly (cargo build --workspace) with no new warnings.

## Test Results Ran `cargo test` on the changed crates after the conversions. - `hero_shrimp_engine`: 1733 passed, 0 failed, 10 ignored - `hero_shrimp_runtime`: 580 passed, 0 failed, 0 ignored (includes `supervise::tests::restarts_after_panic` and the new `supervise::tests::logged_runs_future_to_completion`) - `hero_shrimp_adapters`: no unit tests (0) Totals: 2313 passed, 0 failed. Full workspace builds cleanly (`cargo build --workspace`) with no new warnings.
Member

Implementation Summary

Finished the rollout of supervised task spawning to the remaining durable/long-lived tokio::spawn sites. The spawn_supervised helper already existed and was applied to ~10 runtime workers plus the three listener.rs tasks; this change completes coverage of the always-on gateways, the scheduler loop, and the long-lived reader loops, and adds panic-logging for selected fire-and-forget one-shots.

New helper

  • supervise.rs: added spawn_logged(name, fut) — a log-only one-shot variant that inner-spawns a future and logs a tracing::error! JoinError on panic, with no restart loop. Used for tasks that cannot be safely restarted (reader loops over non-Clone I/O, fire-and-forget one-shots). Unit test added.

Restart-supervised (spawn_supervised)

  • adapters/.../telegram.rs — Telegram gateway dispatcher (telegram_gateway). Dispatcher + polling listener rebuilt inside the Fn factory each restart; record_channel_failure("telegram", ...) preserved.
  • adapters/.../admin.rs — Admin HTTP/SSE server (admin_gateway). Listener re-bound (idempotent socket cleanup) and router rebuilt per restart; record_channel_failure("admin", ...) preserved.
  • runtime/.../cron.rs — durable recurring-job loop (cron_scheduled_job). Static task name; per-job id emitted as a tracing field.

Log-only (spawn_logged)

  • engine/.../integrations/lsp/client.rs — LSP reader loop (lsp_reader); EOF/process-death stays terminal, panics now logged.
  • engine/.../client/engine_client/rpc_client.rs — RPC client reader/demux loop (rpc_client_reader); same rationale.
  • engine/.../agency/lifecycle.rs — SubagentStart hooks one-shot (subagent_start_hooks).
  • runtime/.../cron.rs — one-shot scheduled task (cron_one_shot).
  • engine/.../integrations/mcp.rs — per-request MCP handler (mcp_request_handler).
  • adapters/.../telegram.rs — reminder-fire one-shot (telegram_update_handler).

Left unchanged (intentional)

Per-request / per-connection / job-scoped / infrastructure spawns: SSE loops (events.rs, chat.rs), per-request queue/crew/proof_run spawns, job watchdog/heartbeat, agent-loop helpers, spawn_with_context wrappers, and all test/stub spawns. A final grep sweep confirmed every remaining raw tokio::spawn outside test code is a deliberate non-durable spawn.

Tests

2313 passed, 0 failed across engine + runtime; adapters has no unit tests. Workspace builds with no new warnings.

Notes

  • The issue's reference to proposer.rs as the template is stale — the panic-isolation/backoff template now lives in supervise.rs.
  • Reader loops are log-only rather than restart-supervised because their non-Clone I/O handles cannot be rebuilt after the underlying process/connection closes; restarting would only re-fail against a closed pipe.
## Implementation Summary Finished the rollout of supervised task spawning to the remaining durable/long-lived `tokio::spawn` sites. The `spawn_supervised` helper already existed and was applied to ~10 runtime workers plus the three `listener.rs` tasks; this change completes coverage of the always-on gateways, the scheduler loop, and the long-lived reader loops, and adds panic-logging for selected fire-and-forget one-shots. ### New helper - `supervise.rs`: added `spawn_logged(name, fut)` — a log-only one-shot variant that inner-spawns a future and logs a `tracing::error!` JoinError on panic, with no restart loop. Used for tasks that cannot be safely restarted (reader loops over non-`Clone` I/O, fire-and-forget one-shots). Unit test added. ### Restart-supervised (`spawn_supervised`) - `adapters/.../telegram.rs` — Telegram gateway dispatcher (`telegram_gateway`). Dispatcher + polling listener rebuilt inside the `Fn` factory each restart; `record_channel_failure("telegram", ...)` preserved. - `adapters/.../admin.rs` — Admin HTTP/SSE server (`admin_gateway`). Listener re-bound (idempotent socket cleanup) and router rebuilt per restart; `record_channel_failure("admin", ...)` preserved. - `runtime/.../cron.rs` — durable recurring-job loop (`cron_scheduled_job`). Static task name; per-job id emitted as a `tracing` field. ### Log-only (`spawn_logged`) - `engine/.../integrations/lsp/client.rs` — LSP reader loop (`lsp_reader`); EOF/process-death stays terminal, panics now logged. - `engine/.../client/engine_client/rpc_client.rs` — RPC client reader/demux loop (`rpc_client_reader`); same rationale. - `engine/.../agency/lifecycle.rs` — SubagentStart hooks one-shot (`subagent_start_hooks`). - `runtime/.../cron.rs` — one-shot scheduled task (`cron_one_shot`). - `engine/.../integrations/mcp.rs` — per-request MCP handler (`mcp_request_handler`). - `adapters/.../telegram.rs` — reminder-fire one-shot (`telegram_update_handler`). ### Left unchanged (intentional) Per-request / per-connection / job-scoped / infrastructure spawns: SSE loops (`events.rs`, `chat.rs`), per-request queue/crew/proof_run spawns, job watchdog/heartbeat, agent-loop helpers, `spawn_with_context` wrappers, and all test/stub spawns. A final `grep` sweep confirmed every remaining raw `tokio::spawn` outside test code is a deliberate non-durable spawn. ### Tests 2313 passed, 0 failed across engine + runtime; adapters has no unit tests. Workspace builds with no new warnings. ### Notes - The issue's reference to `proposer.rs` as the template is stale — the panic-isolation/backoff template now lives in `supervise.rs`. - Reader loops are log-only rather than restart-supervised because their non-`Clone` I/O handles cannot be rebuilt after the underlying process/connection closes; restarting would only re-fail against a closed pipe.
rawan closed this issue 2026-06-15 09:17:03 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_shrimp#38
No description provided.