[P1] ~45 unsupervised tokio::spawn sites can panic/hang silently #38
Labels
No labels
prio_critical
prio_low
type_bug
type_contact
type_issue
type_lead
type_question
type_story
type_task
No milestone
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/hero_shrimp#38
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem
Long-lived background tasks (integration gateways, SSE chat, scheduler, hooks, MCP) are spawned without JoinError monitoring, restart, or even logging. Only
proposer.rshas panic-isolation + backoff-restart.Evidence
ARCHITECTURE_CLEANUP_PLAN.md(P0.2);crates/hero_shrimp_engine/src/agent_core/agent/proposer.rsis the template not applied broadly.Proposed fix
Wrap supervised spawns in a helper that logs JoinError, and restarts (with backoff) the ones that should be durable.
Filed from a comparative audit of Hero Shrimp vs Qwen-Code / kimi-cli / picoclaw (2026-05-23). Severity in title: P0=correctness/trust, P1=reliability/UX, P2=cleanup.
Implementation Spec for Issue #38 — Supervise remaining unsupervised
tokio::spawnsitesImportant finding (issue premise is partially stale)
The supervision helper already exists:
spawn_supervised(name, factory)incrates/hero_shrimp_lib/crates/runtime/src/shrimp/runtime/supervise.rs(panic isolation,tracing::error!on panic/return, exponential backoff 1s→60s, resets after a 300s healthy run). The issue's reference toproposer.rsis stale — that path no longer exists; the template now lives insupervise.rsand is already applied to ~10 runtime workers and the 3listener.rstasks (job_scheduler, agent_mailbox, accept_task).So this task is finishing the rollout, not building the helper.
Objective
Bring the remaining always-on / durable background tasks under
spawn_supervised(or a log-only variant) so a panic or unexpected return is logged loudly and restarted with backoff, matching the protection runtime workers already have.Spawn-site classification
Group A — durable, SHOULD be supervised (still raw
tokio::spawn):adapters/.../telegram.rs:464— Telegram bot dispatcher polling loop (always-on gateway).adapters/.../admin.rs:114— Admin HTTP/SSE server (axum::serve) — always-on gateway.runtime/.../cron.rs:169— Cron scheduler per-job durable loop.engine/.../integrations/lsp/client.rs:91— LSP reader loop (parse frames forever).engine/.../client/engine_client/rpc_client.rs:73— RPC client reader/demux loop.Group B — per-request / per-connection / job-scoped — leave (or optional log-only): SSE loops (
events.rs:37,chat.rs:52), per-request queue/crew/proof_run spawns, job watchdog/heartbeat, agent-loop helpers,spawn_with_contextwrappers.Group C — test/stub spawns — out of scope: all
#[cfg(test)]andspawn_stub*sites.Files to Modify
crates/hero_shrimp_lib/crates/adapters/src/shrimp/adapters/telegram.rs(~464)crates/hero_shrimp_lib/crates/adapters/src/shrimp/adapters/admin.rs(~114)crates/hero_shrimp_lib/crates/runtime/src/shrimp/runtime/cron.rs(~169)crates/hero_shrimp_lib/crates/engine/src/shrimp/core/integrations/lsp/client.rs(~91)crates/hero_shrimp_lib/crates/engine/src/shrimp/core/client/engine_client/rpc_client.rs(~73)crates/hero_shrimp_lib/crates/runtime/src/shrimp/runtime/supervise.rs— addspawn_loggedlog-only one-shot variant.Implementation Plan
Step 0 (optional): Add
spawn_logged(name, fut)tosupervise.rs— inner-spawn, await JoinHandle,tracing::error!on JoinError, no restart. Needed only for reader loops / one-shots where restart is infeasible. Dependencies: none.Step 1: Supervise Telegram gateway dispatcher (
telegram.rs:464). Rebuild dispatcher/listener from clonedbot+handler inside theFnfactory; preserverecord_channel_failure("telegram", …). Dependencies: none.Step 2: Supervise Admin HTTP/SSE server (
admin.rs:114). Rebuild listener+router from config/state inside the factory; preserverecord_channel_failure("admin", …); ensure socket rebind is idempotent. Dependencies: none.Step 3: Supervise cron durable loop (
cron.rs:169, NOT the one-shot at :215). Use a static label (e.g."cron_scheduled_job") and pass job id as a tracing field. Dependencies: none.Step 4: LSP reader loop (
lsp/client.rs:91).stdout/childare non-Clone; a dead process can't be re-read. Preferspawn_logged(log a mid-frame panic) keeping EOF-exit terminal. Dependencies: Step 0.Step 5: RPC client reader loop (
rpc_client.rs:73). Same shape as Step 4 — non-Clonereader; usespawn_logged. Dependencies: Step 0.Step 6 (optional): Convert selected Group B one-shots (
lifecycle.rs:36,cron.rs:215,mcp.rs:473,telegram.rs:160) tospawn_logged. Dependencies: Step 0.Step 7: Verification sweep — re-run
grep -rn "tokio::spawn"and confirm every remaining raw site is deliberate (Group B/C).Acceptance Criteria
spawn_supervised(restart) orspawn_logged(log-only), with the choice documented inline where restart was infeasible.record_channel_failureside effects for telegram/admin preserved.Fnclosures (or log-onlyFnOnce), non-Clonestate rebuilt inside.Notes / scope boundaries
spawn_supervised— it exists and is applied to runtime workers + listener tasks.namemust be&'static str; pass dynamic ids (cron job id, server name) via tracing fields.spawn_supervisedrebuilds the body each restart. Reader loops owning non-CloneI/O can't be restarted into a closed pipe — preferspawn_loggedthere.spawn_with_contextwrappers — they self-terminate with their owning scope.Test Results
Ran
cargo teston the changed crates after the conversions.hero_shrimp_engine: 1733 passed, 0 failed, 10 ignoredhero_shrimp_runtime: 580 passed, 0 failed, 0 ignored (includessupervise::tests::restarts_after_panicand the newsupervise::tests::logged_runs_future_to_completion)hero_shrimp_adapters: no unit tests (0)Totals: 2313 passed, 0 failed. Full workspace builds cleanly (
cargo build --workspace) with no new warnings.Implementation Summary
Finished the rollout of supervised task spawning to the remaining durable/long-lived
tokio::spawnsites. Thespawn_supervisedhelper already existed and was applied to ~10 runtime workers plus the threelistener.rstasks; this change completes coverage of the always-on gateways, the scheduler loop, and the long-lived reader loops, and adds panic-logging for selected fire-and-forget one-shots.New helper
supervise.rs: addedspawn_logged(name, fut)— a log-only one-shot variant that inner-spawns a future and logs atracing::error!JoinError on panic, with no restart loop. Used for tasks that cannot be safely restarted (reader loops over non-CloneI/O, fire-and-forget one-shots). Unit test added.Restart-supervised (
spawn_supervised)adapters/.../telegram.rs— Telegram gateway dispatcher (telegram_gateway). Dispatcher + polling listener rebuilt inside theFnfactory each restart;record_channel_failure("telegram", ...)preserved.adapters/.../admin.rs— Admin HTTP/SSE server (admin_gateway). Listener re-bound (idempotent socket cleanup) and router rebuilt per restart;record_channel_failure("admin", ...)preserved.runtime/.../cron.rs— durable recurring-job loop (cron_scheduled_job). Static task name; per-job id emitted as atracingfield.Log-only (
spawn_logged)engine/.../integrations/lsp/client.rs— LSP reader loop (lsp_reader); EOF/process-death stays terminal, panics now logged.engine/.../client/engine_client/rpc_client.rs— RPC client reader/demux loop (rpc_client_reader); same rationale.engine/.../agency/lifecycle.rs— SubagentStart hooks one-shot (subagent_start_hooks).runtime/.../cron.rs— one-shot scheduled task (cron_one_shot).engine/.../integrations/mcp.rs— per-request MCP handler (mcp_request_handler).adapters/.../telegram.rs— reminder-fire one-shot (telegram_update_handler).Left unchanged (intentional)
Per-request / per-connection / job-scoped / infrastructure spawns: SSE loops (
events.rs,chat.rs), per-request queue/crew/proof_run spawns, job watchdog/heartbeat, agent-loop helpers,spawn_with_contextwrappers, and all test/stub spawns. A finalgrepsweep confirmed every remaining rawtokio::spawnoutside test code is a deliberate non-durable spawn.Tests
2313 passed, 0 failed across engine + runtime; adapters has no unit tests. Workspace builds with no new warnings.
Notes
proposer.rsas the template is stale — the panic-isolation/backoff template now lives insupervise.rs.CloneI/O handles cannot be rebuilt after the underlying process/connection closes; restarting would only re-fail against a closed pipe.