[P1] 150s reconciler stale-kill regression has no test gate #39
Reference
lhumina_code/hero_shrimp#39
Problem
The reconciler once killed live-but-slow jobs at the 150s threshold. That was fixed by heartbeating the running job row, but no regression test fences the threshold/heartbeat cadence, so a future "optimization" could silently reintroduce the bug.
Evidence
crates/hero_shrimp_server/src/rpc/methods/job/proof_run.rs (heartbeat); ARCHITECTURE_CLEANUP_PLAN.md.
Proposed fix
Add a test that simulates a slow job emitting heartbeats and asserts the reconciler does not mark it stale before the deadline.
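A minimal sketch of what such a test could look like, using a simulated clock instead of the real database and task. JobRow, is_stale, and first_stale_second are hypothetical names invented for illustration. The 150s threshold and 30s cadence come from this thread, and the strict "greater than" comparison is an assumption about how the reconciler compares ages.

```rust
/// Stall threshold quoted in this thread (seconds).
const PLANNING_STALL_SECONDS: u64 = 150;
/// Heartbeat cadence quoted in this thread (seconds).
const HEARTBEAT_INTERVAL_SECONDS: u64 = 30;

/// Hypothetical stand-in for a running job row; times are plain seconds.
#[derive(Debug, Clone, Copy)]
struct JobRow {
    started_at: u64,
    updated_at: u64,
}

/// Hypothetical reconciler predicate: staleness is measured from the last
/// heartbeat (updated_at), not from started_at. The strict `>` is an assumption.
fn is_stale(job: &JobRow, now: u64) -> bool {
    now.saturating_sub(job.updated_at) > PLANNING_STALL_SECONDS
}

/// Steps a simulated clock one second at a time. The job heartbeats every
/// `heartbeat_every` seconds until `heartbeat_stops_at`, and the reconciler
/// checks on every tick. Returns the first second at which the job is judged
/// stale, or None if it never is.
fn first_stale_second(
    heartbeat_every: u64,
    heartbeat_stops_at: u64,
    run_until: u64,
) -> Option<u64> {
    let mut job = JobRow { started_at: 0, updated_at: 0 };
    for now in (job.started_at + 1)..=run_until {
        let elapsed = now - job.started_at;
        if now <= heartbeat_stops_at && elapsed % heartbeat_every == 0 {
            job.updated_at = now; // the heartbeat "touch"
        }
        if is_stale(&job, now) {
            return Some(now);
        }
    }
    None
}

fn main() {
    // Slow but live: heartbeats for the whole 600s run, never judged stale.
    assert_eq!(first_stale_second(HEARTBEAT_INTERVAL_SECONDS, 600, 600), None);
    // Heartbeat stops at 300s: stale once the last beat is more than 150s old.
    assert_eq!(first_stale_second(HEARTBEAT_INTERVAL_SECONDS, 300, 600), Some(451));
    // An "optimized" 180s cadence exceeds the threshold and kills a live job.
    assert_eq!(first_stale_second(180, 600, 600), Some(151));
    println!("all heartbeat/threshold checks passed");
}
```

The third case is the one that fences the cadence: any heartbeat interval longer than the stall threshold makes a live job look dead, which is the regression this issue is about.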
Filed from a comparative audit of Hero Shrimp vs Qwen-Code / kimi-cli / picoclaw (2026-05-23). Severity in title: P0=correctness/trust, P1=reliability/UX, P2=cleanup.
Verified — closing as already-resolved
The fix and its regression test gate both already exist in the tree, so the P1 concern (heartbeat fix with no test guarding it) no longer applies.
Fix: 30s liveness heartbeat task in proof_run.rs (start_proof_run) bumps autonomy_jobs.updated_at via touch_autonomy_job_updated_at, keeping a slow-but-live run from tripping the reconciler.
Reconciler threshold: reconcile.rs sets PLANNING_STALL_SECONDS = 150, measured against updated_at (the heartbeat), not started_at.
Test gate (4 tests, 2 layers):
Reconciler decision, in crates/hero_shrimp_server/src/rpc/jsonrpc/tests/mod.rs:
  - reconciler_does_not_reconcile_freshly_heartbeated_live_run: a fresh-heartbeat run with started_at years old stays running. This is exactly the test this issue proposed ("simulate a slow job emitting heartbeats and assert the reconciler does not mark it stale before the deadline").
  - reconciler_still_reconciles_live_run_when_heartbeat_stopped: guards against over-correction. A genuinely-dead run (heartbeat stopped, updated_at past 150s) still goes stale.
Heartbeat primitive, in crates/hero_shrimp_lib/crates/runtime/src/shrimp/runtime/db/autonomy_jobs.rs:
  - touch_advances_updated_at_and_matches_exactly_one_row: proves the heartbeat advances updated_at and matches exactly one row (guards the silent no-op failure mode).
  - touch_is_a_no_op_for_unknown_artifact_job_id: proves the row keying is correct.
Minor residual gap (not blocking): no test drives the spawned 30s task itself (the tokio::spawn loop / 30s cadence). The DB primitive it calls and the reconciler's reaction to fresh-vs-stale updated_at are both covered, so the regression described here is gated. A literal end-to-end test of the emitter task could be a small follow-up if desired.
Closing as resolved.
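If the follow-up is ever picked up, one possible shape is to separate the loop body from the timer so a test can drive ticks by hand. The sketch below is standard-library only and does not reflect the actual code in proof_run.rs. run_heartbeat_emitter and drive_emitter are hypothetical names, and the touch closure stands in for touch_autonomy_job_updated_at, returning the number of rows matched. It checks that each tick produces one touch and that the loop exits when the job ends. It does not check the 30s interval itself.

```rust
use std::sync::mpsc;

/// Hypothetical emitter loop: one `touch` call per tick, ending when the tick
/// sender is dropped (the job finished). Returns how many touches matched
/// exactly one row.
fn run_heartbeat_emitter<F>(ticks: mpsc::Receiver<()>, mut touch: F) -> u32
where
    F: FnMut() -> usize,
{
    let mut matched_one_row = 0;
    for _tick in ticks {
        if touch() == 1 {
            matched_one_row += 1;
        }
    }
    matched_one_row
}

/// Test driver: queues `n_ticks` ticks, closes the channel, then runs the
/// emitter with a fake touch that reports `rows_per_touch` rows matched.
/// Returns (touches that matched exactly one row, total touch calls).
fn drive_emitter(n_ticks: u32, rows_per_touch: usize) -> (u32, u32) {
    let (tx, rx) = mpsc::channel();
    for _ in 0..n_ticks {
        tx.send(()).expect("receiver is still alive");
    }
    drop(tx); // job finished: the emitter loop must terminate

    let mut touch_calls = 0u32;
    let matched = run_heartbeat_emitter(rx, || {
        touch_calls += 1;
        rows_per_touch
    });
    (matched, touch_calls)
}

fn main() {
    // Five ticks, each touch hits exactly one row.
    assert_eq!(drive_emitter(5, 1), (5, 5));
    // Silent no-op failure mode: touches happen but match zero rows.
    assert_eq!(drive_emitter(5, 0), (0, 5));
    // No ticks before the job ends: the loop exits without touching.
    assert_eq!(drive_emitter(0, 1), (0, 0));
    println!("emitter loop checks passed");
}
```

For the real tokio task, tokio's paused test clock (tokio::time::pause and tokio::time::advance, behind the test-util feature) is another option that avoids waiting 30 real seconds. Whether it fits depends on how the spawned loop is written.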