[P1] 150s reconciler stale-kill regression has no test gate #39

Closed
opened 2026-05-23 21:52:24 +00:00 by thabeta · 1 comment
Owner

Problem
The reconciler once killed live-but-slow jobs at the 150s threshold; fixed by heartbeating the running job row, but no regression test fences the threshold/heartbeat cadence — a future "optimization" could silently reintroduce it.

Evidence

  • crates/hero_shrimp_server/src/rpc/methods/job/proof_run.rs (heartbeat); ARCHITECTURE_CLEANUP_PLAN.md.

Proposed fix
Add a test that simulates a slow job emitting heartbeats and asserts the reconciler does not mark it stale before the deadline.


Filed from a comparative audit of Hero Shrimp vs Qwen-Code / kimi-cli / picoclaw (2026-05-23). Severity in title: P0=correctness/trust, P1=reliability/UX, P2=cleanup.

**Problem** The reconciler once killed live-but-slow jobs at the 150s threshold; fixed by heartbeating the running job row, but no regression test fences the threshold/heartbeat cadence — a future "optimization" could silently reintroduce it. **Evidence** - `crates/hero_shrimp_server/src/rpc/methods/job/proof_run.rs` (heartbeat); `ARCHITECTURE_CLEANUP_PLAN.md`. **Proposed fix** Add a test that simulates a slow job emitting heartbeats and asserts the reconciler does not mark it stale before the deadline. --- _Filed from a comparative audit of Hero Shrimp vs Qwen-Code / kimi-cli / picoclaw (2026-05-23). Severity in title: P0=correctness/trust, P1=reliability/UX, P2=cleanup._
Member

Verified — closing as already-resolved

The fix and its regression test gate both already exist in the tree, so the P1 concern (heartbeat fix with no test guarding it) no longer applies.

Fix: 30s liveness heartbeat task in proof_run.rs (start_proof_run) bumps autonomy_jobs.updated_at via touch_autonomy_job_updated_at, keeping a slow-but-live run from tripping the reconciler.

Reconciler threshold: reconcile.rsPLANNING_STALL_SECONDS = 150, measured against updated_at (the heartbeat), not started_at.

Test gate (4 tests, 2 layers):

Reconciler decision — crates/hero_shrimp_server/src/rpc/jsonrpc/tests/mod.rs:

  • reconciler_does_not_reconcile_freshly_heartbeated_live_run — a fresh-heartbeat run with started_at years old stays running. This is exactly the test this issue proposed ("simulate a slow job emitting heartbeats and assert the reconciler does not mark it stale before the deadline").
  • reconciler_still_reconciles_live_run_when_heartbeat_stopped — guards against over-correction: a genuinely-dead run (heartbeat stopped, updated_at past 150s) still goes stale.

Heartbeat primitive — crates/hero_shrimp_lib/crates/runtime/src/shrimp/runtime/db/autonomy_jobs.rs:

  • touch_advances_updated_at_and_matches_exactly_one_row — proves the heartbeat advances updated_at and matches exactly one row (guards the silent no-op failure mode).
  • touch_is_a_no_op_for_unknown_artifact_job_id — proves the row keying is correct.

Minor residual gap (not blocking): no test drives the spawned 30s task itself (the tokio::spawn loop / 30s cadence). The DB primitive it calls and the reconciler's reaction to fresh-vs-stale updated_at are both covered, so the regression described here is gated. A literal end-to-end test of the emitter task could be a small follow-up if desired.

Closing as resolved.

## Verified — closing as already-resolved The fix **and** its regression test gate both already exist in the tree, so the P1 concern (heartbeat fix with no test guarding it) no longer applies. **Fix:** 30s liveness heartbeat task in `proof_run.rs` (`start_proof_run`) bumps `autonomy_jobs.updated_at` via `touch_autonomy_job_updated_at`, keeping a slow-but-live run from tripping the reconciler. **Reconciler threshold:** `reconcile.rs` — `PLANNING_STALL_SECONDS = 150`, measured against `updated_at` (the heartbeat), not `started_at`. **Test gate (4 tests, 2 layers):** Reconciler decision — `crates/hero_shrimp_server/src/rpc/jsonrpc/tests/mod.rs`: - `reconciler_does_not_reconcile_freshly_heartbeated_live_run` — a fresh-heartbeat run with `started_at` years old stays `running`. This is exactly the test this issue proposed ("simulate a slow job emitting heartbeats and assert the reconciler does not mark it stale before the deadline"). - `reconciler_still_reconciles_live_run_when_heartbeat_stopped` — guards against over-correction: a genuinely-dead run (heartbeat stopped, `updated_at` past 150s) still goes `stale`. Heartbeat primitive — `crates/hero_shrimp_lib/crates/runtime/src/shrimp/runtime/db/autonomy_jobs.rs`: - `touch_advances_updated_at_and_matches_exactly_one_row` — proves the heartbeat advances `updated_at` and matches exactly one row (guards the silent no-op failure mode). - `touch_is_a_no_op_for_unknown_artifact_job_id` — proves the row keying is correct. **Minor residual gap (not blocking):** no test drives the spawned 30s task itself (the `tokio::spawn` loop / 30s cadence). The DB primitive it calls and the reconciler's reaction to fresh-vs-stale `updated_at` are both covered, so the regression described here is gated. A literal end-to-end test of the emitter task could be a small follow-up if desired. Closing as resolved.
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
lhumina_code/hero_shrimp#39
No description provided.