fix(verification): run verify command in deliverables' subdir to stop nested-manifest false-negatives #117

Merged
salmaelsoly merged 1 commit from integration_verify_working_dir into integration 2026-06-15 08:06:14 +00:00

Summary

The zero-hardcoding verdict ran the LLM-authored verify command from the workspace root. When the model nested its project in a subdirectory (e.g. a crate in `stackcrate/`), a bare root-relative command such as `cargo test --tests` exited non-zero at root and was recorded as a genuine failure, which is a false negative (jobs 233/244).

This adds an optional, contract-carried `working_dir` to `Check::CommandSucceeds`, derived deterministically at freeze time from the common leading subdirectory of the declared deliverables. The command then runs from that subdir. No language, runner, or manifest detection is introduced: the working dir is pure path-prefix inference over paths the author already declared, mirroring the existing `find_file_by_suffix` nesting tolerance for `FileExists`.
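The path-prefix inference described above can be sketched roughly as follows. This is a minimal illustration, not the code in `derive.rs`: the name `common_subdir_sketch` is hypothetical, and it assumes "common leading subdirectory" means the longest directory prefix shared by every declared deliverable, with root-level files or diverging top-level directories yielding `None`.

```rust
use std::path::{Component, Path};

/// Hypothetical sketch (not the real `common_deliverable_subdir`):
/// returns the longest directory prefix shared by every deliverable,
/// or `None` when there is no unambiguous common subdirectory.
fn common_subdir_sketch(deliverables: &[&str]) -> Option<String> {
    let mut common: Option<Vec<String>> = None;
    for d in deliverables {
        // Only the directory part matters; the file name is dropped.
        let parent = Path::new(d).parent()?;
        let mut dirs = Vec::new();
        for c in parent.components() {
            match c {
                Component::Normal(s) => dirs.push(s.to_str()?.to_string()),
                Component::CurDir => {}
                // Absolute paths and `..` are never treated as a subdir.
                _ => return None,
            }
        }
        common = Some(match common {
            None => dirs,
            // Keep only the leading components both paths agree on.
            Some(prev) => prev
                .into_iter()
                .zip(dirs)
                .take_while(|(a, b)| a == b)
                .map(|(a, _)| a)
                .collect(),
        });
    }
    let dirs = common?;
    if dirs.is_empty() {
        None
    } else {
        Some(dirs.join("/"))
    }
}
```

Under these assumptions, deliverables `stackcrate/Cargo.toml` and `stackcrate/src/lib.rs` give `Some("stackcrate")`, while a root-level `Cargo.toml` or a split such as `a/x.rs` plus `b/y.rs` gives `None`, so the command keeps running from the workspace root.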

Closes #43

Changes

  • `check.rs`: optional `working_dir: Option<String>` on `Check::CommandSucceeds` with `#[serde(default, skip_serializing_if = "Option::is_none")]` (backward-compatible with existing frozen contracts and JSONL ledgers); `label()` and `validate_contract()` updated, the latter rejecting unsafe `working_dir` paths.
  • `runner.rs`: `command_succeeds` resolves the effective cwd through the existing `safe_join` helper. A path escape returns `Unrunnable`; a not-yet-created subdir falls back to the workspace root so a stale guess never produces a false Fail.
  • `derive.rs`: new `common_deliverable_subdir` helper populates `working_dir` only when the deliverables share a single unambiguous common subdir; otherwise `None` (root files or split top-level dirs leave behavior unchanged). The verify command itself stays verbatim.
  • `run.rs`, `verdict.rs`: test constructors updated for the new field.

The verdict stays a pure, replayable function of the frozen contract plus the workspace. The author's `cd sub && ...` escape hatch is preserved and still yields `working_dir: None`.
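The cwd resolution rules from the `runner.rs` change (escape is `Unrunnable`, missing subdir falls back to root) can be illustrated with a std-only sketch. Everything named here is hypothetical: `safe_join_sketch` is a simplified stand-in for the project's `safe_join` helper, whose real signature is not shown in this PR, and `CwdSketch` is an illustrative enum, not the engine's outcome type. The stand-in only rejects absolute and `..` components and does not resolve symlinks.

```rust
use std::path::{Component, Path, PathBuf};

/// Illustrative outcome type; the engine's real types may differ.
#[derive(Debug, PartialEq)]
enum CwdSketch {
    Use(PathBuf),
    Unrunnable,
}

/// Simplified stand-in for `safe_join`: refuses anything that could
/// leave `root` lexically (absolute paths, `..`). Symlinks are ignored.
fn safe_join_sketch(root: &Path, rel: &str) -> Option<PathBuf> {
    let mut out = root.to_path_buf();
    for c in Path::new(rel).components() {
        match c {
            Component::Normal(s) => out.push(s),
            Component::CurDir => {}
            _ => return None,
        }
    }
    Some(out)
}

/// Picks the directory the verify command would run from.
fn effective_cwd_sketch(root: &Path, working_dir: Option<&str>) -> CwdSketch {
    match working_dir {
        // No hint in the contract: legacy behavior, run at root.
        None => CwdSketch::Use(root.to_path_buf()),
        Some(rel) => match safe_join_sketch(root, rel) {
            // Path escape: the check cannot be run safely.
            None => CwdSketch::Unrunnable,
            Some(dir) if dir.is_dir() => CwdSketch::Use(dir),
            // Subdir not created yet: fall back to root rather than fail.
            Some(_) => CwdSketch::Use(root.to_path_buf()),
        },
    }
}
```

The resulting directory would then be handed to something like `std::process::Command::current_dir` before spawning the verify command. Falling back to the root for a missing subdir keeps a stale `working_dir` guess from turning into a failure on its own.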

Test Results

`cargo test -p hero_shrimp_engine`: 1742 passed, 0 failed, 10 ignored. This PR adds 11 new tests, including a regression that fails at root and passes once `working_dir` is set, a path-traversal rejection, a missing-subdir root fallback, and a serde backward-compat check for the legacy on-disk format.

salmaelsoly force-pushed integration_verify_working_dir from 4e2f06f976 to 77865aa356 2026-06-15 08:04:08 +00:00 Compare
salmaelsoly merged commit c8e9c04e9a into integration 2026-06-15 08:06:14 +00:00
salmaelsoly deleted branch integration_verify_working_dir 2026-06-15 08:06:21 +00:00