[P2] CommandSucceeds verify timeout (300s) is hardcoded, no circuit-breaker #44

Closed
opened 2026-05-23 21:52:27 +00:00 by thabeta · 3 comments
Owner

Problem
A hanging verify command blocks the whole job for 300s with no config override or early-out.

Evidence

  • crates/hero_shrimp_engine/src/verification/runner.rs COMMAND_TIMEOUT_SECS = 300.

Proposed fix
Make it configurable and add an early circuit-break / escalation path.


Filed from a comparative audit of Hero Shrimp vs Qwen-Code / kimi-cli / picoclaw (2026-05-23). Severity in title: P0=correctness/trust, P1=reliability/UX, P2=cleanup.

**Problem** A hanging verify command blocks the whole job for 300s with no config override or early-out. **Evidence** - `crates/hero_shrimp_engine/src/verification/runner.rs` `COMMAND_TIMEOUT_SECS = 300`. **Proposed fix** Make it configurable and add an early circuit-break / escalation path. --- _Filed from a comparative audit of Hero Shrimp vs Qwen-Code / kimi-cli / picoclaw (2026-05-23). Severity in title: P0=correctness/trust, P1=reliability/UX, P2=cleanup._
rawan self-assigned this 2026-06-14 12:58:41 +00:00
Member

Implementation Spec for Issue #44

Objective

Make the CommandSucceeds verify timeout configurable and ensure commands are always bounded — even when the timeout coreutil is absent — fixing the hang risk described in this issue.

Target file

crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rs

(The cited path crates/hero_shrimp_engine/src/verification/runner.rs is stale; the module moved to the path above. The hardcoded constant currently lives there as COMMAND_TIMEOUT_SECS = 300.)

Requirements

  • HERO_SHRIMP_VERIFY_TIMEOUT_SECS env var overrides the 300s default.
  • Invalid / zero / unparseable values fall back to 300s.
  • The command is bounded whether or not the timeout utility is on PATH (in-process watchdog fallback that kills the process group, using the already-present libc dependency).
  • Unit tests cover the env parsing and the fallback enforcement.

Implementation Plan

Step 1: Configurable timeout resolution

Files: crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rs

  • Keep 300 as a named DEFAULT_COMMAND_TIMEOUT_SECS constant.
  • Add fn resolve_timeout_secs() -> u64 that reads HERO_SHRIMP_VERIFY_TIMEOUT_SECS, parses it, and returns the parsed value only if > 0, otherwise the default.
  • Call it in command_succeeds instead of the hardcoded constant.
    Dependencies: none

Step 2: In-process timeout fallback

Files: crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rs

  • When timeout is not on PATH, replace the unbounded Command::new("sh").output() with a bounded path: spawn the child in its own process group (CommandExt::process_group(0)), drain stdout/stderr on threads, poll try_wait() until the deadline, and on expiry send SIGKILL to the process group via libc::kill(-pid, SIGKILL), recording a timeout result.
  • The timeout-coreutil branch is unchanged.
    Dependencies: Step 1

Step 3: Unit tests

Files: crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rs

  • resolve_timeout_secs: env unset -> 300; valid "5" -> 5; "0" -> 300; garbage "abc" -> 300 (env mutation serialized in one test to avoid races).
  • Fallback enforcement: a sleep 30 command bounded to ~1s via the in-process path returns promptly with a failed outcome (gated on sleep availability, matching existing toolchain-gated tests).
    Dependencies: Step 1, Step 2

Acceptance Criteria

  • HERO_SHRIMP_VERIFY_TIMEOUT_SECS overrides the default
  • Invalid/zero values revert to 300s
  • Commands bounded with or without the timeout utility
  • All tests pass

Notes

  • No new dependencies — uses the standard library plus the existing libc dependency.
  • The in-process fallback kills the whole process group, not just the direct sh child, so a hung build subprocess cannot outlive the timeout.
## Implementation Spec for Issue #44 ### Objective Make the `CommandSucceeds` verify timeout configurable and ensure commands are always bounded — even when the `timeout` coreutil is absent — fixing the hang risk described in this issue. ### Target file `crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rs` (The cited path `crates/hero_shrimp_engine/src/verification/runner.rs` is stale; the module moved to the path above. The hardcoded constant currently lives there as `COMMAND_TIMEOUT_SECS = 300`.) ### Requirements - `HERO_SHRIMP_VERIFY_TIMEOUT_SECS` env var overrides the 300s default. - Invalid / zero / unparseable values fall back to 300s. - The command is bounded whether or not the `timeout` utility is on PATH (in-process watchdog fallback that kills the process group, using the already-present `libc` dependency). - Unit tests cover the env parsing and the fallback enforcement. ### Implementation Plan #### Step 1: Configurable timeout resolution Files: `crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rs` - Keep `300` as a named `DEFAULT_COMMAND_TIMEOUT_SECS` constant. - Add `fn resolve_timeout_secs() -> u64` that reads `HERO_SHRIMP_VERIFY_TIMEOUT_SECS`, parses it, and returns the parsed value only if `> 0`, otherwise the default. - Call it in `command_succeeds` instead of the hardcoded constant. Dependencies: none #### Step 2: In-process timeout fallback Files: `crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rs` - When `timeout` is not on PATH, replace the unbounded `Command::new("sh").output()` with a bounded path: spawn the child in its own process group (`CommandExt::process_group(0)`), drain stdout/stderr on threads, poll `try_wait()` until the deadline, and on expiry send `SIGKILL` to the process group via `libc::kill(-pid, SIGKILL)`, recording a timeout result. - The `timeout`-coreutil branch is unchanged. Dependencies: Step 1 #### Step 3: Unit tests Files: `crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rs` - `resolve_timeout_secs`: env unset -> 300; valid `"5"` -> 5; `"0"` -> 300; garbage `"abc"` -> 300 (env mutation serialized in one test to avoid races). - Fallback enforcement: a `sleep 30` command bounded to ~1s via the in-process path returns promptly with a failed outcome (gated on `sleep` availability, matching existing toolchain-gated tests). Dependencies: Step 1, Step 2 ### Acceptance Criteria - [ ] `HERO_SHRIMP_VERIFY_TIMEOUT_SECS` overrides the default - [ ] Invalid/zero values revert to 300s - [ ] Commands bounded with or without the `timeout` utility - [ ] All tests pass ### Notes - No new dependencies — uses the standard library plus the existing `libc` dependency. - The in-process fallback kills the whole process group, not just the direct `sh` child, so a hung build subprocess cannot outlive the timeout.
Member

Test Results

cargo test -p hero_shrimp_engine verification::runner

  • Total: 12
  • Passed: 12
  • Failed: 0

Includes the two new tests:

  • resolve_timeout_secs_reads_env_and_falls_back_on_bad_values — verifies env override and that unset/zero/garbage values revert to the 300s default.
  • in_process_fallback_bounds_a_hung_command — calls the in-process watchdog directly with a sleep 30 bounded to 1s; the command is killed in ~1s with a non-zero status and a timeout note in stderr.
running 12 tests
test ...::resolve_timeout_secs_reads_env_and_falls_back_on_bad_values ... ok
test ...::in_process_fallback_bounds_a_hung_command ... ok
... (10 pre-existing runner tests) ...
test result: ok. 12 passed; 0 failed; 0 ignored; finished in 1.01s
## Test Results `cargo test -p hero_shrimp_engine verification::runner` - Total: 12 - Passed: 12 - Failed: 0 Includes the two new tests: - `resolve_timeout_secs_reads_env_and_falls_back_on_bad_values` — verifies env override and that unset/zero/garbage values revert to the 300s default. - `in_process_fallback_bounds_a_hung_command` — calls the in-process watchdog directly with a `sleep 30` bounded to 1s; the command is killed in ~1s with a non-zero status and a timeout note in stderr. ``` running 12 tests test ...::resolve_timeout_secs_reads_env_and_falls_back_on_bad_values ... ok test ...::in_process_fallback_bounds_a_hung_command ... ok ... (10 pre-existing runner tests) ... test result: ok. 12 passed; 0 failed; 0 ignored; finished in 1.01s ```
Member

Implementation Summary

The hardcoded 300s verify timeout is now configurable and always enforced.

Changes (single file)

crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rs

  1. Renamed the constant to DEFAULT_COMMAND_TIMEOUT_SECS (still 300) and added resolve_timeout_secs(), which reads HERO_SHRIMP_VERIFY_TIMEOUT_SECS. A valid positive integer overrides the default; unset, zero, and non-numeric values keep the 300s default so a misconfiguration can never remove the bound. command_succeeds now calls the resolver instead of the constant.

  2. Added run_shell_bounded() — an in-process watchdog used when the timeout coreutil is absent (previously that path ran the shell unbounded). It spawns the shell in its own process group, drains stdout/stderr on threads, polls for completion against a deadline, and on expiry sends SIGKILL to the whole process group via libc::kill(-pgid, ...) so a hung build subprocess of sh is killed too, not just sh. The timeout-coreutil branch is unchanged.

  3. Added two unit tests covering env-var parsing/fallback and the in-process bound.

Acceptance criteria

  • HERO_SHRIMP_VERIFY_TIMEOUT_SECS overrides the default
  • Invalid/zero values revert to 300s
  • Commands bounded whether or not the timeout utility exists
  • All tests pass (12/12)

Notes

  • No new dependencies — standard library plus the existing libc dependency.
  • Unix-targeted (process groups / SIGKILL), consistent with the rest of the runtime.
  • Changes are local only; not committed or pushed.
## Implementation Summary The hardcoded 300s verify timeout is now configurable and always enforced. ### Changes (single file) `crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rs` 1. Renamed the constant to `DEFAULT_COMMAND_TIMEOUT_SECS` (still 300) and added `resolve_timeout_secs()`, which reads `HERO_SHRIMP_VERIFY_TIMEOUT_SECS`. A valid positive integer overrides the default; unset, zero, and non-numeric values keep the 300s default so a misconfiguration can never remove the bound. `command_succeeds` now calls the resolver instead of the constant. 2. Added `run_shell_bounded()` — an in-process watchdog used when the `timeout` coreutil is absent (previously that path ran the shell unbounded). It spawns the shell in its own process group, drains stdout/stderr on threads, polls for completion against a deadline, and on expiry sends `SIGKILL` to the whole process group via `libc::kill(-pgid, ...)` so a hung build subprocess of `sh` is killed too, not just `sh`. The `timeout`-coreutil branch is unchanged. 3. Added two unit tests covering env-var parsing/fallback and the in-process bound. ### Acceptance criteria - [x] `HERO_SHRIMP_VERIFY_TIMEOUT_SECS` overrides the default - [x] Invalid/zero values revert to 300s - [x] Commands bounded whether or not the `timeout` utility exists - [x] All tests pass (12/12) ### Notes - No new dependencies — standard library plus the existing `libc` dependency. - Unix-targeted (process groups / `SIGKILL`), consistent with the rest of the runtime. - Changes are local only; not committed or pushed.
rawan closed this issue 2026-06-14 15:05:01 +00:00
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_shrimp#44
No description provided.