[P2] CommandSucceeds verify timeout (300s) is hardcoded, no circuit-breaker #44
Labels
No labels
prio_critical
prio_low
type_bug
type_contact
type_issue
type_lead
type_question
type_story
type_task
No milestone
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/hero_shrimp#44
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem
A hanging verify command blocks the whole job for 300s with no config override or early-out.
Evidence
crates/hero_shrimp_engine/src/verification/runner.rsCOMMAND_TIMEOUT_SECS = 300.Proposed fix
Make it configurable and add an early circuit-break / escalation path.
Filed from a comparative audit of Hero Shrimp vs Qwen-Code / kimi-cli / picoclaw (2026-05-23). Severity in title: P0=correctness/trust, P1=reliability/UX, P2=cleanup.
Implementation Spec for Issue #44
Objective
Make the
CommandSucceedsverify timeout configurable and ensure commands are always bounded — even when thetimeoutcoreutil is absent — fixing the hang risk described in this issue.Target file
crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rs(The cited path
crates/hero_shrimp_engine/src/verification/runner.rsis stale; the module moved to the path above. The hardcoded constant currently lives there asCOMMAND_TIMEOUT_SECS = 300.)Requirements
HERO_SHRIMP_VERIFY_TIMEOUT_SECSenv var overrides the 300s default.timeoututility is on PATH (in-process watchdog fallback that kills the process group, using the already-presentlibcdependency).Implementation Plan
Step 1: Configurable timeout resolution
Files:
crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rs300as a namedDEFAULT_COMMAND_TIMEOUT_SECSconstant.fn resolve_timeout_secs() -> u64that readsHERO_SHRIMP_VERIFY_TIMEOUT_SECS, parses it, and returns the parsed value only if> 0, otherwise the default.command_succeedsinstead of the hardcoded constant.Dependencies: none
Step 2: In-process timeout fallback
Files:
crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rstimeoutis not on PATH, replace the unboundedCommand::new("sh").output()with a bounded path: spawn the child in its own process group (CommandExt::process_group(0)), drain stdout/stderr on threads, polltry_wait()until the deadline, and on expiry sendSIGKILLto the process group vialibc::kill(-pid, SIGKILL), recording a timeout result.timeout-coreutil branch is unchanged.Dependencies: Step 1
Step 3: Unit tests
Files:
crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rsresolve_timeout_secs: env unset -> 300; valid"5"-> 5;"0"-> 300; garbage"abc"-> 300 (env mutation serialized in one test to avoid races).sleep 30command bounded to ~1s via the in-process path returns promptly with a failed outcome (gated onsleepavailability, matching existing toolchain-gated tests).Dependencies: Step 1, Step 2
Acceptance Criteria
HERO_SHRIMP_VERIFY_TIMEOUT_SECSoverrides the defaulttimeoututilityNotes
libcdependency.shchild, so a hung build subprocess cannot outlive the timeout.Test Results
cargo test -p hero_shrimp_engine verification::runnerIncludes the two new tests:
resolve_timeout_secs_reads_env_and_falls_back_on_bad_values— verifies env override and that unset/zero/garbage values revert to the 300s default.in_process_fallback_bounds_a_hung_command— calls the in-process watchdog directly with asleep 30bounded to 1s; the command is killed in ~1s with a non-zero status and a timeout note in stderr.Implementation Summary
The hardcoded 300s verify timeout is now configurable and always enforced.
Changes (single file)
crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rsRenamed the constant to
DEFAULT_COMMAND_TIMEOUT_SECS(still 300) and addedresolve_timeout_secs(), which readsHERO_SHRIMP_VERIFY_TIMEOUT_SECS. A valid positive integer overrides the default; unset, zero, and non-numeric values keep the 300s default so a misconfiguration can never remove the bound.command_succeedsnow calls the resolver instead of the constant.Added
run_shell_bounded()— an in-process watchdog used when thetimeoutcoreutil is absent (previously that path ran the shell unbounded). It spawns the shell in its own process group, drains stdout/stderr on threads, polls for completion against a deadline, and on expiry sendsSIGKILLto the whole process group vialibc::kill(-pgid, ...)so a hung build subprocess ofshis killed too, not justsh. Thetimeout-coreutil branch is unchanged.Added two unit tests covering env-var parsing/fallback and the in-process bound.
Acceptance criteria
HERO_SHRIMP_VERIFY_TIMEOUT_SECSoverrides the defaulttimeoututility existsNotes
libcdependency.SIGKILL), consistent with the rest of the runtime.