[P2] Verification requires a root-runnable command — nested-manifest builds false-negative #43
Labels
No labels
prio_critical
prio_low
type_bug
type_contact
type_issue
type_lead
type_question
type_story
type_task
No milestone
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/hero_shrimp#43
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem
The zero-hardcoding verdict runs the LLM's verify command from the run root. When the model nests the project in a subdir (e.g. crate in
stackcrate/),cargo testat root exits non-zero → false negative. Accepted trade-off of "no hardcoding", but it still bites real runs (job-233/244).Evidence
crates/hero_shrimp_engine/src/verification/runner.rs(command_succeedsruns from workspace root).Proposed fix
Either nudge the executor to emit a root-runnable command, or let the contract carry a working-dir for the command without reintroducing language/manifest detection.
Filed from a comparative audit of Hero Shrimp vs Qwen-Code / kimi-cli / picoclaw (2026-05-23). Severity in title: P0=correctness/trust, P1=reliability/UX, P2=cleanup.
Implementation Spec for Issue #43
Objective
Eliminate the nested-manifest false-negative in deterministic verification. Today
Check::CommandSucceedsalways runs the LLM-authored verify command from the workspace root viarun_shell(... .current_dir(workspace)). When the model nests its project in a subdirectory (e.g. a crate instackcrate/), a bare root-relative command such ascargo test --testsexits non-zero at root, producing a falseVerdict::Failed(jobs 233/244). Add an optional, contract-carriedworking_dirtoCheck::CommandSucceeds, thread it into the runner so the command runs fromworkspace.join(working_dir)(validated through the existingsafe_join), and populate it deterministically from the deliverables' common subdirectory — never from language/manifest detection.Requirements
working_dirmust be derived only from signals already present at freeze time (the declared/derived deliverable paths), not from probing the workspace (which is empty at freeze time anyway).acceptance_contract.jsonfiles and existing JSONL ledgers (which containkind: command_succeedswith onlycommand) must still deserialize. New field is optional with a default.working_dirmust pass the same workspace-relative safety constraint as deliverable paths (safe_join/path_is_workspace_safe), validated at plan time.working_diris absent or empty, behavior is byte-for-byte identical to today (run fromworkspace).cd sub && ...author-authored escape hatch, which keeps working.Files to Modify/Create
crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/check.rs— add optionalworking_dirfield toCheck::CommandSucceeds; updatelabel(),validate_contract(), and all in-file test constructors.crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/runner.rs— threadworking_dirthroughrun()->command_succeeds()->run_shell(); resolve the effective cwd viasafe_join; add tests.crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/derive.rs— populateworking_dirfrom the deliverables' common leading subdirectory; add tests; update existing pattern-match tests that destructureCommandSucceeds.crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/run.rs— update test constructors ofCommandSucceeds(struct now has a new field).crates/hero_shrimp_lib/crates/engine/src/shrimp/core/verification/verdict.rs— update the test constructor at line ~159.Design decision (which option, and why)
Recommended: Option (a) — add an optional
working_dirtoCheck::CommandSucceeds, populated from the common deliverable subdirectory.Justification:
derive.rs::collect_deliverables). When the model declares its files under a single common leading directory (e.g.stackcrate/src/lib.rs,stackcrate/Cargo.toml,stackcrate/tests/...), that shared prefix (stackcrate) is exactly where a root-relative verify command should run — and computing it is pure string work over paths the system already has. It uses ZERO language knowledge: it never looks at file extensions, manifests, or runners. This is the direct structural analogue of the already-shippedfind_file_by_suffixfix forFileExists(job-224), which tolerates the identical "model nested the project" situation. Mirroring that proven pattern keeps the two check families consistent.acceptance_contract.jsonbefore the worker runs, so re-evaluation yields the identical verdict.extract_verification_command(derive.rs ~140). The system explicitly "never invents a command." Nudging would require changing goal-authoring upstream of this repo and would still be a soft, probabilistic mitigation that false-negatives whenever the author writes a bare command — which is the common case. We keep thecd sub && ...escape hatch (the author's manual equivalent of option b) working, so authors who already write root-runnable commands are unaffected.Honesty caveat encoded in code comments: the common-subdirectory heuristic only fires when there is exactly one non-trivial common leading directory across the deliverables. If deliverables live at the root, or split across multiple top-level dirs,
working_diris leftNoneand behavior is unchanged (run from root) — the author'scdescape hatch remains the fallback. This is intentionally conservative: a wrong guess could mask a real failure, so we only set it when the signal is unambiguous.Implementation Plan
Step 1: Add the optional
working_dirfield to the check vocabularyFiles:
check.rsCommandSucceeds { command: String, #[serde(default, skip_serializing_if = "Option::is_none")] working_dir: Option<String> }.#[serde(default)]makes old JSON/JSONL (noworking_dirkey) deserialize toNone— backward compatible.skip_serializing_if = "Option::is_none"keeps newly-frozen contracts byte-identical to old ones when no subdir is detected.label(): whenworking_dirisSome(d), render`{command}` exits 0 (in `{d}`); else keep`{command}` exits 0.validate_contract(): destructure{ command, working_dir }; keep empty/destructive checks; ifworking_dirisSome(d)non-empty, requirepath_is_workspace_safe(d), else push an error. Empty/whitespaceworking_diris treated asNone.working_dir: None. Add tests for working_dir validation (safe passes,../escaperejected) and serde roundtrip (Some("stackcrate")<->"working_dir":"stackcrate"; missing key ->None).Dependencies: none.
Step 2: Thread
working_dirinto the runner and resolve the effective cwdFiles:
runner.rsrun():Check::CommandSucceeds { command, working_dir } => self.command_succeeds(command, working_dir.as_deref(), workspace).command_succeeds(): addworking_dir: Option<&str>. After the destructive-command gate, resolve the cwd: ifSome(d)non-empty,safe_join(workspace, d)(Unrunnable on escape); if the resolved dir does not exist, fall back toworkspace(never a false Fail from a stale guess); else useworkspace. Pass the resolved path torun_shell.run_shell()already pins.current_dir(workspace); just pass the resolved cwd as that argument. Reuse the existingsafe_joinhelper.None); traversal refused (Unrunnable); missing working_dir falls back to root. Update existing test constructors to addworking_dir: None.Dependencies: Step 1.
Step 3: Populate
working_dirdeterministically from the deliverables' common subdirectoryFiles:
derive.rsfn common_deliverable_subdir(deliverables: &[String]) -> Option<String>: longest common leading path-segment prefix across deliverables (drop final filename segment); return it only when there is a single unambiguous non-empty common subdir, elseNone. Pure string work; no extensions, no manifest names. Validate result withpath_is_workspace_safe.derive_contract(), setworking_dir: common_deliverable_subdir(&deliverables)when buildingCheck::CommandSucceeds; keepcommand: cmdverbatim.Check::CommandSucceeds { command, .. }.Some("stackcrate"); root deliverables ->None; mixed top-level dirs ->None;cd backend && make checkauthor case stillworking_dir: None.Dependencies: Step 1.
Step 4: Fix remaining compile-breaks in sibling test modules
Files:
run.rs,verdict.rsworking_dir: Noneto eachCheck::CommandSucceeds { command: ... }test literal. No behavioral change.Dependencies: Step 1.
Step 5: Verify backward compatibility and full data flow
freeze_contract/load_contractroundtrip a contract withSome(working_dir)and withNone.Check).proof_run.rsstill compiles (never destructuresCommandSucceeds).cargo testfor the engine verification module and the server proof/contract tests.Dependencies: Steps 1-4.
Acceptance Criteria
Check::CommandSucceedscarries an optionalworking_dir: Option<String>with#[serde(default, skip_serializing_if = "Option::is_none")].acceptance_contract.jsonproduced before this change (noworking_dirkey) still deserializes (working_dir == None); a contract with no detected subdir serializes WITHOUT aworking_dirkey.working_diris set,command_succeedsruns the verbatim command fromworkspace.join(working_dir)viasafe_join; a../absoluteworking_diryieldsUnrunnableand is never executed.working_dirpointing at a non-existent subdir falls back to running at the workspace root (never a falseFail).derive_contractsetsworking_dirto the deliverables' single common leading subdirectory when one unambiguously exists, andNoneotherwise.cd backend && make checkauthor case still yieldsworking_dir: None.stackcrate/+ a bare root-relative command FAIL at root but PASS with the derivedworking_dir.working_dirto their constructors.Notes
runner.rsin the current checkout contains in-progress timeout work (command_timeout_secs,run_shell_with_deadline). The implementing branch offintegrationwill NOT have this; the spec is independent of it — the only runner change is WHICH path is passed as the cwd argument torun_shell/.current_dir(...).working_dirat derive time keeps the verdict deterministic.Test Results
Full engine crate test suite (
cargo test -p hero_shrimp_engine):Verification module (
cargo test -p hero_shrimp_engine verification): 80 passed, 0 failed.New tests added by this change:
check::command_succeeds_working_dir_safe_path_passes_validationcheck::command_succeeds_working_dir_traversal_fails_validationcheck::command_succeeds_working_dir_serde_roundtrip(covers backward-compat: legacy JSON with noworking_dirkey deserializes toNone)runner::command_runs_in_working_dir_when_setrunner::command_working_dir_traversal_is_unrunnablerunner::command_missing_working_dir_falls_back_to_rootderive::nested_deliverables_set_command_working_dirderive::root_deliverables_leave_working_dir_nonederive::mixed_top_level_dirs_leave_working_dir_nonederive::author_already_cds_yields_no_working_dirderive::common_deliverable_subdir_casesThe regression case is covered: a command that passes only inside a nested
stackcrate/subdir fails at the workspace root withworking_dir: Noneand passes onceworking_diris derived/set.Pull request opened: #117
This PR implements the changes discussed in this issue (targets the
integrationbranch).Implementation Summary
Approach
Option (a) from the issue: the contract now carries an optional working directory for the verify command, derived deterministically from the deliverables' layout. The alternative (prompt-nudging the executor) was rejected because the verify command is extracted verbatim from the goal text and the system never invents a command — a nudge would be a soft mitigation that still false-negatives on the common case. No language, runner, or manifest detection is introduced.
Files changed
crates/.../verification/check.rs— addedworking_dir: Option<String>toCheck::CommandSucceedswith#[serde(default, skip_serializing_if = "Option::is_none")]; updatedlabel()andvalidate_contract()(rejects unsafe working_dir paths).crates/.../verification/runner.rs—command_succeedsresolves the effective cwd via the existingsafe_join; path escape → Unrunnable; missing subdir → root fallback (never a false Fail).crates/.../verification/derive.rs—common_deliverable_subdirpopulatesworking_dironly when the deliverables share one unambiguous common subdir.crates/.../verification/run.rs,verdict.rs— test constructors updated.Guarantees
working_dirkey still deserialize (None), and contracts with no detected subdir serialize identically to before.working_diris frozen into the contract at derive time.cd sub && ...author escape hatch still works and yieldsworking_dir: None.Tests
cargo test -p hero_shrimp_engine: 1742 passed, 0 failed, 10 ignored. 11 new tests, including a regression reproducing the nested-stackcrate/false-negative, path-traversal rejection, missing-subdir root fallback, and a legacy-format serde compatibility check.Branch:
integration_verify_working_dir→ PR #117 (baseintegration).