[P2] Verification requires a root-runnable command — nested-manifest builds false-negative #43

Closed
opened 2026-05-23 21:52:26 +00:00 by thabeta · 4 comments
Owner

Problem
The zero-hardcoding verdict runs the LLM's verify command from the run root. When the model nests the project in a subdir (e.g. crate in stackcrate/), cargo test at root exits non-zero → false negative. Accepted trade-off of "no hardcoding", but it still bites real runs (job-233/244).

Evidence

  • crates/hero_shrimp_engine/src/verification/runner.rs (command_succeeds runs from workspace root).

Proposed fix
Either nudge the executor to emit a root-runnable command, or let the contract carry a working-dir for the command without reintroducing language/manifest detection.


Filed from a comparative audit of Hero Shrimp vs Qwen-Code / kimi-cli / picoclaw (2026-05-23). Severity in title: P0=correctness/trust, P1=reliability/UX, P2=cleanup.

**Problem** The zero-hardcoding verdict runs the LLM's verify command from the run root. When the model nests the project in a subdir (e.g. crate in `stackcrate/`), `cargo test` at root exits non-zero → false negative. Accepted trade-off of "no hardcoding", but it still bites real runs (job-233/244). **Evidence** - `crates/hero_shrimp_engine/src/verification/runner.rs` (`command_succeeds` runs from workspace root). **Proposed fix** Either nudge the executor to emit a root-runnable command, or let the contract carry a working-dir for the command without reintroducing language/manifest detection. --- _Filed from a comparative audit of Hero Shrimp vs Qwen-Code / kimi-cli / picoclaw (2026-05-23). Severity in title: P0=correctness/trust, P1=reliability/UX, P2=cleanup._
Member

Implementation Spec for Issue #43

Objective

Eliminate the nested-manifest false-negative in deterministic verification. Today Check::CommandSucceeds always runs the LLM-authored verify command from the workspace root via run_shell(... .current_dir(workspace)). When the model nests its project in a subdirectory (e.g. a crate in stackcrate/), a bare root-relative command such as cargo test --tests exits non-zero at root, producing a false Verdict::Failed (jobs 233/244). Add an optional, contract-carried working_dir to Check::CommandSucceeds, thread it into the runner so the command runs from workspace.join(working_dir) (validated through the existing safe_join), and populate it deterministically from the deliverables' common subdirectory — never from language/manifest detection.

Requirements

  • Preserve the zero-hardcoding invariant: NO runner list, NO manifest map, NO language/output parsing is introduced. The verify command stays the author/LLM's verbatim string.
  • The new working_dir must be derived only from signals already present at freeze time (the declared/derived deliverable paths), not from probing the workspace (which is empty at freeze time anyway).
  • Backward compatible serde: existing frozen acceptance_contract.json files and existing JSONL ledgers (which contain kind: command_succeeds with only command) must still deserialize. New field is optional with a default.
  • Path-escape safety: working_dir must pass the same workspace-relative safety constraint as deliverable paths (safe_join / path_is_workspace_safe), validated at plan time.
  • When working_dir is absent or empty, behavior is byte-for-byte identical to today (run from workspace).
  • The fix must not regress the existing cd sub && ... author-authored escape hatch, which keeps working.

Files to Modify/Create

  • crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/check.rs — add optional working_dir field to Check::CommandSucceeds; update label(), validate_contract(), and all in-file test constructors.
  • crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rs — thread working_dir through run() -> command_succeeds() -> run_shell(); resolve the effective cwd via safe_join; add tests.
  • crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/derive.rs — populate working_dir from the deliverables' common leading subdirectory; add tests; update existing pattern-match tests that destructure CommandSucceeds.
  • crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/run.rs — update test constructors of CommandSucceeds (struct now has a new field).
  • crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/verdict.rs — update the test constructor at line ~159.

Design decision (which option, and why)

Recommended: Option (a) — add an optional working_dir to Check::CommandSucceeds, populated from the common deliverable subdirectory.

Justification:

  • There IS a clean, manifest-free source. At freeze time the only workspace signal is the set of deliverable paths (derive.rs::collect_deliverables). When the model declares its files under a single common leading directory (e.g. stackcrate/src/lib.rs, stackcrate/Cargo.toml, stackcrate/tests/...), that shared prefix (stackcrate) is exactly where a root-relative verify command should run — and computing it is pure string work over paths the system already has. It uses ZERO language knowledge: it never looks at file extensions, manifests, or runners. This is the direct structural analogue of the already-shipped find_file_by_suffix fix for FileExists (job-224), which tolerates the identical "model nested the project" situation. Mirroring that proven pattern keeps the two check families consistent.
  • It is deterministic and replayable (the verdict's core contract): the working_dir is frozen into acceptance_contract.json before the worker runs, so re-evaluation yields the identical verdict.
  • Option (b) (prompt-nudge the executor to emit a root-runnable command) is weaker on its own: the verify command is NOT emitted by an executor prompt in this codebase — it is extracted verbatim from the goal text by extract_verification_command (derive.rs ~140). The system explicitly "never invents a command." Nudging would require changing goal-authoring upstream of this repo and would still be a soft, probabilistic mitigation that false-negatives whenever the author writes a bare command — which is the common case. We keep the cd sub && ... escape hatch (the author's manual equivalent of option b) working, so authors who already write root-runnable commands are unaffected.

Honesty caveat encoded in code comments: the common-subdirectory heuristic only fires when there is exactly one non-trivial common leading directory across the deliverables. If deliverables live at the root, or split across multiple top-level dirs, working_dir is left None and behavior is unchanged (run from root) — the author's cd escape hatch remains the fallback. This is intentionally conservative: a wrong guess could mask a real failure, so we only set it when the signal is unambiguous.

Implementation Plan

Step 1: Add the optional working_dir field to the check vocabulary

Files: check.rs

  • Change the variant to: CommandSucceeds { command: String, #[serde(default, skip_serializing_if = "Option::is_none")] working_dir: Option<String> }.
    • #[serde(default)] makes old JSON/JSONL (no working_dir key) deserialize to None — backward compatible.
    • skip_serializing_if = "Option::is_none" keeps newly-frozen contracts byte-identical to old ones when no subdir is detected.
  • Update label(): when working_dir is Some(d), render `{command}` exits 0 (in `{d}`); else keep `{command}` exits 0.
  • Update validate_contract(): destructure { command, working_dir }; keep empty/destructive checks; if working_dir is Some(d) non-empty, require path_is_workspace_safe(d), else push an error. Empty/whitespace working_dir is treated as None.
  • Update all in-file test constructors to add working_dir: None. Add tests for working_dir validation (safe passes, ../escape rejected) and serde roundtrip (Some("stackcrate") <-> "working_dir":"stackcrate"; missing key -> None).

Dependencies: none.

Step 2: Thread working_dir into the runner and resolve the effective cwd

Files: runner.rs

  • run(): Check::CommandSucceeds { command, working_dir } => self.command_succeeds(command, working_dir.as_deref(), workspace).
  • command_succeeds(): add working_dir: Option<&str>. After the destructive-command gate, resolve the cwd: if Some(d) non-empty, safe_join(workspace, d) (Unrunnable on escape); if the resolved dir does not exist, fall back to workspace (never a false Fail from a stale guess); else use workspace. Pass the resolved path to run_shell.
  • run_shell() already pins .current_dir(workspace); just pass the resolved cwd as that argument. Reuse the existing safe_join helper.
  • Tests: command runs in working_dir when set (and fails at root with None); traversal refused (Unrunnable); missing working_dir falls back to root. Update existing test constructors to add working_dir: None.

Dependencies: Step 1.

Step 3: Populate working_dir deterministically from the deliverables' common subdirectory

Files: derive.rs

  • Add fn common_deliverable_subdir(deliverables: &[String]) -> Option<String>: longest common leading path-segment prefix across deliverables (drop final filename segment); return it only when there is a single unambiguous non-empty common subdir, else None. Pure string work; no extensions, no manifest names. Validate result with path_is_workspace_safe.
  • In derive_contract(), set working_dir: common_deliverable_subdir(&deliverables) when building Check::CommandSucceeds; keep command: cmd verbatim.
  • Update existing destructuring matchers to Check::CommandSucceeds { command, .. }.
  • Tests: nested deliverables -> Some("stackcrate"); root deliverables -> None; mixed top-level dirs -> None; cd backend && make check author case still working_dir: None.

Dependencies: Step 1.

Step 4: Fix remaining compile-breaks in sibling test modules

Files: run.rs, verdict.rs

  • Add working_dir: None to each Check::CommandSucceeds { command: ... } test literal. No behavioral change.

Dependencies: Step 1.

Step 5: Verify backward compatibility and full data flow

  • Confirm freeze_contract/load_contract roundtrip a contract with Some(working_dir) and with None.
  • Confirm the JSONL ledger is unaffected (it stores only the command string + exit code, not the Check).
  • Confirm proof_run.rs still compiles (never destructures CommandSucceeds).
  • Run cargo test for the engine verification module and the server proof/contract tests.

Dependencies: Steps 1-4.

Acceptance Criteria

  • Check::CommandSucceeds carries an optional working_dir: Option<String> with #[serde(default, skip_serializing_if = "Option::is_none")].
  • A frozen acceptance_contract.json produced before this change (no working_dir key) still deserializes (working_dir == None); a contract with no detected subdir serializes WITHOUT a working_dir key.
  • When working_dir is set, command_succeeds runs the verbatim command from workspace.join(working_dir) via safe_join; a ../absolute working_dir yields Unrunnable and is never executed.
  • A working_dir pointing at a non-existent subdir falls back to running at the workspace root (never a false Fail).
  • derive_contract sets working_dir to the deliverables' single common leading subdirectory when one unambiguously exists, and None otherwise.
  • The verify command remains verbatim — no command is invented or rewritten; the cd backend && make check author case still yields working_dir: None.
  • No language/runner/manifest list, no file-extension switch, and no output parsing is added anywhere in the change.
  • A regression test reproduces the issue: deliverables under stackcrate/ + a bare root-relative command FAIL at root but PASS with the derived working_dir.
  • All existing verification-module tests compile and pass after adding working_dir to their constructors.

Notes

  • The working file runner.rs in the current checkout contains in-progress timeout work (command_timeout_secs, run_shell_with_deadline). The implementing branch off integration will NOT have this; the spec is independent of it — the only runner change is WHICH path is passed as the cwd argument to run_shell/.current_dir(...).
  • Auto-retry from a discovered subdir in the runner was rejected as the primary mechanism: it could run the command multiple times and mask a genuine root-level failure, breaking the "verdict is a pure replayable function of contract + workspace" invariant. Freezing working_dir at derive time keeps the verdict deterministic.
  • The deliverable-prefix heuristic is deliberately conservative (single unambiguous common subdir only) — trading some recall for zero risk of masking a real failure, consistent with the module's "never a false pass" stance.
## Implementation Spec for Issue #43 ### Objective Eliminate the nested-manifest false-negative in deterministic verification. Today `Check::CommandSucceeds` always runs the LLM-authored verify command from the workspace root via `run_shell(... .current_dir(workspace))`. When the model nests its project in a subdirectory (e.g. a crate in `stackcrate/`), a bare root-relative command such as `cargo test --tests` exits non-zero at root, producing a false `Verdict::Failed` (jobs 233/244). Add an optional, contract-carried `working_dir` to `Check::CommandSucceeds`, thread it into the runner so the command runs from `workspace.join(working_dir)` (validated through the existing `safe_join`), and populate it deterministically from the deliverables' common subdirectory — never from language/manifest detection. ### Requirements - Preserve the zero-hardcoding invariant: NO runner list, NO manifest map, NO language/output parsing is introduced. The verify command stays the author/LLM's verbatim string. - The new `working_dir` must be derived only from signals already present at freeze time (the declared/derived deliverable paths), not from probing the workspace (which is empty at freeze time anyway). - Backward compatible serde: existing frozen `acceptance_contract.json` files and existing JSONL ledgers (which contain `kind: command_succeeds` with only `command`) must still deserialize. New field is optional with a default. - Path-escape safety: `working_dir` must pass the same workspace-relative safety constraint as deliverable paths (`safe_join` / `path_is_workspace_safe`), validated at plan time. - When `working_dir` is absent or empty, behavior is byte-for-byte identical to today (run from `workspace`). - The fix must not regress the existing `cd sub && ...` author-authored escape hatch, which keeps working. ### Files to Modify/Create - `crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/check.rs` — add optional `working_dir` field to `Check::CommandSucceeds`; update `label()`, `validate_contract()`, and all in-file test constructors. - `crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rs` — thread `working_dir` through `run()` -> `command_succeeds()` -> `run_shell()`; resolve the effective cwd via `safe_join`; add tests. - `crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/derive.rs` — populate `working_dir` from the deliverables' common leading subdirectory; add tests; update existing pattern-match tests that destructure `CommandSucceeds`. - `crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/run.rs` — update test constructors of `CommandSucceeds` (struct now has a new field). - `crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/verdict.rs` — update the test constructor at line ~159. ### Design decision (which option, and why) Recommended: **Option (a) — add an optional `working_dir` to `Check::CommandSucceeds`, populated from the common deliverable subdirectory.** Justification: - There IS a clean, manifest-free source. At freeze time the only workspace signal is the set of deliverable paths (`derive.rs::collect_deliverables`). When the model declares its files under a single common leading directory (e.g. `stackcrate/src/lib.rs`, `stackcrate/Cargo.toml`, `stackcrate/tests/...`), that shared prefix (`stackcrate`) is exactly where a root-relative verify command should run — and computing it is pure string work over paths the system already has. It uses ZERO language knowledge: it never looks at file extensions, manifests, or runners. This is the direct structural analogue of the already-shipped `find_file_by_suffix` fix for `FileExists` (job-224), which tolerates the identical "model nested the project" situation. Mirroring that proven pattern keeps the two check families consistent. - It is deterministic and replayable (the verdict's core contract): the working_dir is frozen into `acceptance_contract.json` before the worker runs, so re-evaluation yields the identical verdict. - Option (b) (prompt-nudge the executor to emit a root-runnable command) is weaker on its own: the verify command is NOT emitted by an executor prompt in this codebase — it is extracted verbatim from the goal text by `extract_verification_command` (derive.rs ~140). The system explicitly "never invents a command." Nudging would require changing goal-authoring upstream of this repo and would still be a soft, probabilistic mitigation that false-negatives whenever the author writes a bare command — which is the common case. We keep the `cd sub && ...` escape hatch (the author's manual equivalent of option b) working, so authors who already write root-runnable commands are unaffected. Honesty caveat encoded in code comments: the common-subdirectory heuristic only fires when there is exactly one non-trivial common leading directory across the deliverables. If deliverables live at the root, or split across multiple top-level dirs, `working_dir` is left `None` and behavior is unchanged (run from root) — the author's `cd` escape hatch remains the fallback. This is intentionally conservative: a wrong guess could mask a real failure, so we only set it when the signal is unambiguous. ### Implementation Plan #### Step 1: Add the optional `working_dir` field to the check vocabulary Files: `check.rs` - Change the variant to: `CommandSucceeds { command: String, #[serde(default, skip_serializing_if = "Option::is_none")] working_dir: Option<String> }`. - `#[serde(default)]` makes old JSON/JSONL (no `working_dir` key) deserialize to `None` — backward compatible. - `skip_serializing_if = "Option::is_none"` keeps newly-frozen contracts byte-identical to old ones when no subdir is detected. - Update `label()`: when `working_dir` is `Some(d)`, render `` `{command}` exits 0 (in `{d}`) ``; else keep `` `{command}` exits 0 ``. - Update `validate_contract()`: destructure `{ command, working_dir }`; keep empty/destructive checks; if `working_dir` is `Some(d)` non-empty, require `path_is_workspace_safe(d)`, else push an error. Empty/whitespace `working_dir` is treated as `None`. - Update all in-file test constructors to add `working_dir: None`. Add tests for working_dir validation (safe passes, `../escape` rejected) and serde roundtrip (`Some("stackcrate")` <-> `"working_dir":"stackcrate"`; missing key -> `None`). Dependencies: none. #### Step 2: Thread `working_dir` into the runner and resolve the effective cwd Files: `runner.rs` - `run()`: `Check::CommandSucceeds { command, working_dir } => self.command_succeeds(command, working_dir.as_deref(), workspace)`. - `command_succeeds()`: add `working_dir: Option<&str>`. After the destructive-command gate, resolve the cwd: if `Some(d)` non-empty, `safe_join(workspace, d)` (Unrunnable on escape); if the resolved dir does not exist, fall back to `workspace` (never a false Fail from a stale guess); else use `workspace`. Pass the resolved path to `run_shell`. - `run_shell()` already pins `.current_dir(workspace)`; just pass the resolved cwd as that argument. Reuse the existing `safe_join` helper. - Tests: command runs in working_dir when set (and fails at root with `None`); traversal refused (Unrunnable); missing working_dir falls back to root. Update existing test constructors to add `working_dir: None`. Dependencies: Step 1. #### Step 3: Populate `working_dir` deterministically from the deliverables' common subdirectory Files: `derive.rs` - Add `fn common_deliverable_subdir(deliverables: &[String]) -> Option<String>`: longest common leading path-segment prefix across deliverables (drop final filename segment); return it only when there is a single unambiguous non-empty common subdir, else `None`. Pure string work; no extensions, no manifest names. Validate result with `path_is_workspace_safe`. - In `derive_contract()`, set `working_dir: common_deliverable_subdir(&deliverables)` when building `Check::CommandSucceeds`; keep `command: cmd` verbatim. - Update existing destructuring matchers to `Check::CommandSucceeds { command, .. }`. - Tests: nested deliverables -> `Some("stackcrate")`; root deliverables -> `None`; mixed top-level dirs -> `None`; `cd backend && make check` author case still `working_dir: None`. Dependencies: Step 1. #### Step 4: Fix remaining compile-breaks in sibling test modules Files: `run.rs`, `verdict.rs` - Add `working_dir: None` to each `Check::CommandSucceeds { command: ... }` test literal. No behavioral change. Dependencies: Step 1. #### Step 5: Verify backward compatibility and full data flow - Confirm `freeze_contract`/`load_contract` roundtrip a contract with `Some(working_dir)` and with `None`. - Confirm the JSONL ledger is unaffected (it stores only the command string + exit code, not the `Check`). - Confirm `proof_run.rs` still compiles (never destructures `CommandSucceeds`). - Run `cargo test` for the engine verification module and the server proof/contract tests. Dependencies: Steps 1-4. ### Acceptance Criteria - [ ] `Check::CommandSucceeds` carries an optional `working_dir: Option<String>` with `#[serde(default, skip_serializing_if = "Option::is_none")]`. - [ ] A frozen `acceptance_contract.json` produced before this change (no `working_dir` key) still deserializes (`working_dir == None`); a contract with no detected subdir serializes WITHOUT a `working_dir` key. - [ ] When `working_dir` is set, `command_succeeds` runs the verbatim command from `workspace.join(working_dir)` via `safe_join`; a `..`/absolute `working_dir` yields `Unrunnable` and is never executed. - [ ] A `working_dir` pointing at a non-existent subdir falls back to running at the workspace root (never a false `Fail`). - [ ] `derive_contract` sets `working_dir` to the deliverables' single common leading subdirectory when one unambiguously exists, and `None` otherwise. - [ ] The verify command remains verbatim — no command is invented or rewritten; the `cd backend && make check` author case still yields `working_dir: None`. - [ ] No language/runner/manifest list, no file-extension switch, and no output parsing is added anywhere in the change. - [ ] A regression test reproduces the issue: deliverables under `stackcrate/` + a bare root-relative command FAIL at root but PASS with the derived `working_dir`. - [ ] All existing verification-module tests compile and pass after adding `working_dir` to their constructors. ### Notes - The working file `runner.rs` in the current checkout contains in-progress timeout work (`command_timeout_secs`, `run_shell_with_deadline`). The implementing branch off `integration` will NOT have this; the spec is independent of it — the only runner change is WHICH path is passed as the cwd argument to `run_shell`/`.current_dir(...)`. - Auto-retry from a discovered subdir in the runner was rejected as the primary mechanism: it could run the command multiple times and mask a genuine root-level failure, breaking the "verdict is a pure replayable function of contract + workspace" invariant. Freezing `working_dir` at derive time keeps the verdict deterministic. - The deliverable-prefix heuristic is deliberately conservative (single unambiguous common subdir only) — trading some recall for zero risk of masking a real failure, consistent with the module's "never a false pass" stance.
Member

Test Results

Full engine crate test suite (cargo test -p hero_shrimp_engine):

  • Total: 1742
  • Passed: 1742
  • Failed: 0
  • Ignored: 10

Verification module (cargo test -p hero_shrimp_engine verification): 80 passed, 0 failed.

New tests added by this change:

  • check::command_succeeds_working_dir_safe_path_passes_validation
  • check::command_succeeds_working_dir_traversal_fails_validation
  • check::command_succeeds_working_dir_serde_roundtrip (covers backward-compat: legacy JSON with no working_dir key deserializes to None)
  • runner::command_runs_in_working_dir_when_set
  • runner::command_working_dir_traversal_is_unrunnable
  • runner::command_missing_working_dir_falls_back_to_root
  • derive::nested_deliverables_set_command_working_dir
  • derive::root_deliverables_leave_working_dir_none
  • derive::mixed_top_level_dirs_leave_working_dir_none
  • derive::author_already_cds_yields_no_working_dir
  • derive::common_deliverable_subdir_cases

The regression case is covered: a command that passes only inside a nested stackcrate/ subdir fails at the workspace root with working_dir: None and passes once working_dir is derived/set.

## Test Results Full engine crate test suite (`cargo test -p hero_shrimp_engine`): - Total: 1742 - Passed: 1742 - Failed: 0 - Ignored: 10 Verification module (`cargo test -p hero_shrimp_engine verification`): 80 passed, 0 failed. New tests added by this change: - `check::command_succeeds_working_dir_safe_path_passes_validation` - `check::command_succeeds_working_dir_traversal_fails_validation` - `check::command_succeeds_working_dir_serde_roundtrip` (covers backward-compat: legacy JSON with no `working_dir` key deserializes to `None`) - `runner::command_runs_in_working_dir_when_set` - `runner::command_working_dir_traversal_is_unrunnable` - `runner::command_missing_working_dir_falls_back_to_root` - `derive::nested_deliverables_set_command_working_dir` - `derive::root_deliverables_leave_working_dir_none` - `derive::mixed_top_level_dirs_leave_working_dir_none` - `derive::author_already_cds_yields_no_working_dir` - `derive::common_deliverable_subdir_cases` The regression case is covered: a command that passes only inside a nested `stackcrate/` subdir fails at the workspace root with `working_dir: None` and passes once `working_dir` is derived/set.
Member

Pull request opened: #117

This PR implements the changes discussed in this issue (targets the integration branch).

Pull request opened: https://forge.ourworld.tf/lhumina_code/hero_shrimp/pulls/117 This PR implements the changes discussed in this issue (targets the `integration` branch).
Member

Implementation Summary

Approach

Option (a) from the issue: the contract now carries an optional working directory for the verify command, derived deterministically from the deliverables' layout. The alternative (prompt-nudging the executor) was rejected because the verify command is extracted verbatim from the goal text and the system never invents a command — a nudge would be a soft mitigation that still false-negatives on the common case. No language, runner, or manifest detection is introduced.

Files changed

  • crates/.../verification/check.rs — added working_dir: Option<String> to Check::CommandSucceeds with #[serde(default, skip_serializing_if = "Option::is_none")]; updated label() and validate_contract() (rejects unsafe working_dir paths).
  • crates/.../verification/runner.rscommand_succeeds resolves the effective cwd via the existing safe_join; path escape → Unrunnable; missing subdir → root fallback (never a false Fail).
  • crates/.../verification/derive.rscommon_deliverable_subdir populates working_dir only when the deliverables share one unambiguous common subdir.
  • crates/.../verification/run.rs, verdict.rs — test constructors updated.

Guarantees

  • Backward compatible: contracts/ledgers without a working_dir key still deserialize (None), and contracts with no detected subdir serialize identically to before.
  • Deterministic and replayable: working_dir is frozen into the contract at derive time.
  • The verify command remains verbatim; the cd sub && ... author escape hatch still works and yields working_dir: None.
  • Conservative: only fires on a single unambiguous common subdir, so it can never mask a real root-level failure.

Tests

cargo test -p hero_shrimp_engine: 1742 passed, 0 failed, 10 ignored. 11 new tests, including a regression reproducing the nested-stackcrate/ false-negative, path-traversal rejection, missing-subdir root fallback, and a legacy-format serde compatibility check.

Branch: integration_verify_working_dir → PR #117 (base integration).

## Implementation Summary ### Approach Option (a) from the issue: the contract now carries an optional working directory for the verify command, derived deterministically from the deliverables' layout. The alternative (prompt-nudging the executor) was rejected because the verify command is extracted verbatim from the goal text and the system never invents a command — a nudge would be a soft mitigation that still false-negatives on the common case. No language, runner, or manifest detection is introduced. ### Files changed - `crates/.../verification/check.rs` — added `working_dir: Option<String>` to `Check::CommandSucceeds` with `#[serde(default, skip_serializing_if = "Option::is_none")]`; updated `label()` and `validate_contract()` (rejects unsafe working_dir paths). - `crates/.../verification/runner.rs` — `command_succeeds` resolves the effective cwd via the existing `safe_join`; path escape → Unrunnable; missing subdir → root fallback (never a false Fail). - `crates/.../verification/derive.rs` — `common_deliverable_subdir` populates `working_dir` only when the deliverables share one unambiguous common subdir. - `crates/.../verification/run.rs`, `verdict.rs` — test constructors updated. ### Guarantees - Backward compatible: contracts/ledgers without a `working_dir` key still deserialize (`None`), and contracts with no detected subdir serialize identically to before. - Deterministic and replayable: `working_dir` is frozen into the contract at derive time. - The verify command remains verbatim; the `cd sub && ...` author escape hatch still works and yields `working_dir: None`. - Conservative: only fires on a single unambiguous common subdir, so it can never mask a real root-level failure. ### Tests `cargo test -p hero_shrimp_engine`: 1742 passed, 0 failed, 10 ignored. 11 new tests, including a regression reproducing the nested-`stackcrate/` false-negative, path-traversal rejection, missing-subdir root fallback, and a legacy-format serde compatibility check. Branch: `integration_verify_working_dir` → PR #117 (base `integration`).
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_shrimp#43
No description provided.