fix(proc): serve new RPC over multi-domain single-socket contract (SSE path + cleanup tests) #153

Merged
mahmoud merged 5 commits from development_proc_multidomain_verify into development 2026-06-15 09:00:23 +00:00

proc: serve new RPC over the multi-domain single-socket contract (server only)

Closes the verification + fixes tracked in #152. Scope is server RPC + multi-domain correctness only — admin crates (hero_proc_admin*), hero_router, and CI/release workflows are untouched. Ground truth was derived from a live build + infocheck + on-socket contract probe + full integration run on the latest development of hero_proc and hero_skills, against hero_lib development (rebuilt green after the lock float to 60867649).

What this fixes

1. SSE served at the wrong path (the one real contract gap).
web.rs mounted the job-log SSE handler at the top-level /events route only. The canonical multi-domain contract — and the generated logs spec's x-sse extension (endpoint=/events resolving per domain) — is GET /api/logs/events?job_sid=<sid>, which hero_router (PR #120/#121) now forwards verbatim and multiplexes from x-sse. The handler is now mounted at /api/logs/events, with /events kept as a back-compat alias. The 3 SSE integration tests + the 2 UC SSE tests go from fail→pass, and /api/logs/events?job_sid=… dispatches end-to-end through the router with no router change.
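For illustration, the dual mount can be sketched in plain Rust without a web framework. This is not the actual web.rs code: SseMount, sse_path and match_sse are hypothetical names, and the sketch assumes job_sid is required on both the canonical path and the alias.

```rust
/// Which mount point matched an incoming SSE request (hypothetical type).
#[derive(Debug, PartialEq)]
enum SseMount {
    /// Canonical per-domain path, e.g. /api/logs/events.
    Canonical,
    /// Legacy top-level /events, kept as a back-compat alias.
    LegacyAlias,
}

/// Resolve an x-sse endpoint such as "/events" against a domain name.
fn sse_path(domain: &str, endpoint: &str) -> String {
    format!("/api/{}/{}", domain, endpoint.trim_start_matches('/'))
}

/// Match a request target against both mounts and extract job_sid.
/// Returns None for any other path, or when job_sid is missing
/// (treating it as required is an assumption of this sketch).
fn match_sse(target: &str) -> Option<(SseMount, String)> {
    let (path, query) = target.split_once('?').unwrap_or((target, ""));
    let mount = if path == sse_path("logs", "/events") {
        SseMount::Canonical
    } else if path == "/events" {
        SseMount::LegacyAlias
    } else {
        return None;
    };
    let sid = query
        .split('&')
        .filter_map(|kv| kv.split_once('='))
        .find(|(k, _)| *k == "job_sid")
        .map(|(_, v)| v.to_string())?;
    Some((mount, sid))
}

fn main() {
    println!("{:?}", match_sse("/api/logs/events?job_sid=abc"));
    println!("{:?}", match_sse("/events?job_sid=abc"));
    println!("{:?}", match_sse("/api/jobs/events?job_sid=abc"));
}
```

Both mounts resolve to the same handler, so existing callers of /events keep working while the router can forward the canonical path unchanged.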

2. Cleanup tests were self-defeating / racy (test bug, not server).
clean_test_data_removes_everything deleted the scheduled actions via schedule_delete before calling clean_by_tag. schedule_delete removes the action but not its logs, so the subsequent tag sweep never saw those actions and their log entries stayed queryable indefinitely. The failure surfaced as: log entries still queryable after cleanup: [sched-… (logs.count=N)]. The server-side delete is correct (verified in isolation: a direct logs.delete(src) drops the count to 0, and it stays at 0). The test now uses schedule_disable to stop the engine from firing while leaving the actions in place, drains in-flight scheduled jobs to a terminal phase, and lets the async log flush settle, so that clean_by_tag removes each action together with its logs. clean_test_data_is_idempotent is now self-contained: it drains first, then asserts that a second clean deletes 0, and its protected-service checks are conditional on pre-cleanup presence. No assertions were weakened. (Related background: #126, #141.)
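The ordering problem can be reproduced with a toy in-memory model. This is a sketch only: Store, Action and the method bodies are hypothetical stand-ins named after the RPC calls mentioned above, not the server's real types, and the drain and log-flush steps are reduced to a comment.

```rust
use std::collections::HashMap;

/// A scheduled action in the toy model (hypothetical type).
struct Action {
    tag: String,
    enabled: bool,
}

/// Toy stand-in for the server store: actions and per-action log counts.
#[derive(Default)]
struct Store {
    actions: HashMap<String, Action>,
    logs: HashMap<String, usize>,
}

impl Store {
    fn add_scheduled(&mut self, name: &str, tag: &str, log_entries: usize) {
        let action = Action { tag: tag.to_string(), enabled: true };
        self.actions.insert(name.to_string(), action);
        self.logs.insert(name.to_string(), log_entries);
    }

    /// Removes the action only; its logs are left behind.
    fn schedule_delete(&mut self, name: &str) {
        self.actions.remove(name);
    }

    /// Stops firing but keeps the action visible to the tag sweep.
    fn schedule_disable(&mut self, name: &str) {
        if let Some(action) = self.actions.get_mut(name) {
            action.enabled = false;
        }
    }

    /// Number of actions the engine would still fire.
    fn firing(&self) -> usize {
        self.actions.values().filter(|a| a.enabled).count()
    }

    /// Deletes every action carrying `tag` together with its logs.
    /// Returns how many actions were deleted.
    fn clean_by_tag(&mut self, tag: &str) -> usize {
        let names: Vec<String> = self
            .actions
            .iter()
            .filter(|(_, a)| a.tag == tag)
            .map(|(name, _)| name.clone())
            .collect();
        for name in &names {
            self.actions.remove(name);
            self.logs.remove(name);
        }
        names.len()
    }

    fn log_count(&self, name: &str) -> usize {
        self.logs.get(name).copied().unwrap_or(0)
    }
}

/// Old test order: delete first, then sweep. Returns leftover log entries.
fn old_order() -> usize {
    let mut store = Store::default();
    store.add_scheduled("sched-a", "test", 3);
    store.schedule_delete("sched-a");
    store.clean_by_tag("test"); // sweep no longer sees sched-a
    store.log_count("sched-a")
}

/// New test order: disable, sweep, sweep again.
/// Returns (leftover log entries, deletions made by the second sweep).
fn new_order() -> (usize, usize) {
    let mut store = Store::default();
    store.add_scheduled("sched-a", "test", 3);
    store.schedule_disable("sched-a");
    assert_eq!(store.firing(), 0);
    // The real test also drains in-flight jobs and lets the log flush settle here.
    store.clean_by_tag("test");
    let second = store.clean_by_tag("test");
    (store.log_count("sched-a"), second)
}

fn main() {
    println!("old order leftover logs: {}", old_order());
    println!("new order (leftover, second clean): {:?}", new_order());
}
```

In this model the old order leaves the three log entries orphaned, while the new order leaves none and the second sweep deletes nothing, which is the property clean_test_data_is_idempotent asserts.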

3. Hygiene.

  • disabled = true on hero_proc_test/service.toml (it's a test runner, not a deployable service).
  • Removed the stale top-level /schema/ (*.oschema superseded by crates/hero_proc_server/oschema/, which the openrpc_server!/openrpc_client! macros read via spec = "oschema"; the top-level copies were unreferenced).
  • Rebased onto development after 953e752 (remove tracked Cargo.lock + gitignore it); the earlier lock-bump commit was dropped. With Cargo.lock untracked, hero_lib now resolves fresh from branch = "development" (currently 60867649).

Live multi-domain contract probe on hero_proc/rpc.sock

| Path | Result |
|---|---|
| GET /api/domains.json | 200: jobs, logs, secrets, system |
| GET /api/{domain}/openrpc.json ×4 | 200 |
| POST /api/{domain}/rpc | 200, real dispatch |
| GET /api/logs/events?job_sid=… | now served (was 404) |
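A probe of this kind can be approximated with the Rust standard library alone. The sketch below is an assumption about how such a check could be written, not the tooling actually used for the table; build_request, status_code and probe are hypothetical helpers, and the socket path is taken as written in the heading above and may need adjusting. It covers only the GET rows that return a complete response, since the SSE endpoint streams and would not reach EOF.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

/// Build a minimal HTTP/1.1 GET request. `Connection: close` asks the
/// server to close the stream so the caller can read until EOF.
fn build_request(path: &str) -> String {
    format!("GET {path} HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n")
}

/// Extract the numeric status code from the response's status line.
fn status_code(response: &str) -> Option<u16> {
    response.lines().next()?.split_whitespace().nth(1)?.parse().ok()
}

/// Send one GET over the Unix socket and return the status code, if any.
fn probe(socket: &str, path: &str) -> std::io::Result<Option<u16>> {
    let mut stream = UnixStream::connect(socket)?;
    stream.write_all(build_request(path).as_bytes())?;
    let mut raw = Vec::new();
    stream.read_to_end(&mut raw)?;
    Ok(status_code(&String::from_utf8_lossy(&raw)))
}

fn main() {
    // Socket path as named in this PR; adjust to the real location.
    let socket = "hero_proc/rpc.sock";
    println!("GET /api/domains.json -> {:?}", probe(socket, "/api/domains.json"));
    for domain in ["jobs", "logs", "secrets", "system"] {
        let path = format!("/api/{domain}/openrpc.json");
        println!("GET {path} -> {:?}", probe(socket, &path));
    }
}
```

If the socket is not present, each probe prints an Err value instead of a status, so the sketch is safe to run without a live server.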

Verification

  • cargo check --workspace · cargo clippy --workspace --all-targets -- -D warnings (0 warnings)
  • lab build --install (server + admin + cli; admin_dx* skipped, disabled)
  • Full integration suite --basic --functional --extended: 282 passed / 0 failed, run twice to confirm no flakiness (the cleanup test was previously timing-flaky).
  • Graceful SIGTERM shutdown (jobs drained, PID file + socket removed).

Downstream (listed, not modified)

hero_router development is already aligned (PR #120 8dcffe5 forward verbatim; PR #121 d1dfe88 derive SSE from x-sse). Once hero_proc serves /api/logs/events, SSE works end-to-end through the router with no router change. No other in-scope raw-RPC consumers are broken by the current wire.

Refresh the hero_lib git dependency (all 7 herolib_* / hero_lifecycle
crates) from the stale locked rev a9b14b60 to the current development
HEAD 925b3df9 so the build reflects latest development. No source
changes; cargo check/clippy/build remain green.

Refs #152

The job-log SSE handler was mounted at the top-level /events route only.
The multi-domain contract (and the generated logs spec's x-sse extension,
endpoint=/events resolving per-domain) is GET /api/logs/events?job_sid=<sid>,
which hero_router forwards verbatim. Mount the handler at the canonical
/api/logs/events path and keep /events as a back-compat alias.

clean_test_data_is_idempotent now drains test-tagged actions itself before
asserting a second clean_by_tag deletes 0, instead of depending on a sibling
subtest having cleaned first (which failed in isolation / on a persistent DB).
Protected-service assertions are conditional on pre-cleanup presence so the
test passes whether or not admin/router are running.

clean_test_data_removes_everything no longer deletes the scheduled actions
before cleaning. schedule_delete removes the action without its logs, which
orphaned every scheduled action's log subtree (the action was gone, so the
tag sweep never saw it and its log entries stayed queryable forever). It now
uses schedule_disable to stop the engine firing while leaving the actions in
place, then drains in-flight scheduled jobs to a terminal phase and lets the
async log flush settle, so clean_by_tag deletes each action together with its
logs and the store assertion is deterministic. Assertions are unchanged.

hero_proc_test is an integration-test runner, not a deployable service. Mark
it disabled so lab build/start skips it (matching the admin_dx convention).

These .oschema files were superseded by crates/hero_proc_server/oschema/, which
the openrpc_server!/openrpc_client! macros read (spec = "oschema"). The
top-level copies were unreferenced.

mahmoud force-pushed development_proc_multidomain_verify from 84465cf5b7 to a6a55df71a 2026-06-15 07:59:35 +00:00 Compare

The openrpc/*.json specs are generated by openrpc_server! (save_openrpc_dir).
Rebuilding against current hero_lib development regenerates them with each
method's description correctly aligned (an earlier codegen attached each
description to the following method). Pure description/summary churn — no
parameter, result, or x-sse changes.

mahmoud changed title from fix(proc): serve new RPC over multi-domain single-socket contract (SSE path + cleanup tests) to WIP fix(proc): serve new RPC over multi-domain single-socket contract (SSE path + cleanup tests) 2026-06-15 08:08:38 +00:00
mahmoud changed title from WIP fix(proc): serve new RPC over multi-domain single-socket contract (SSE path + cleanup tests) to WIP: fix(proc): serve new RPC over multi-domain single-socket contract (SSE path + cleanup tests) 2026-06-15 08:08:45 +00:00
mahmoud force-pushed development_proc_multidomain_verify from 7003a898ca to a6a55df71a 2026-06-15 08:12:47 +00:00 Compare

openrpc/*.json are emitted by openrpc_server! (save_openrpc_dir) on every
build; nothing reads them from disk (the server embeds the spec at compile
time). Tracking them only produced churn and let them drift out of date.
gitignore them and generate on build, same as Cargo.lock.

mahmoud changed title from WIP: fix(proc): serve new RPC over multi-domain single-socket contract (SSE path + cleanup tests) to fix(proc): serve new RPC over multi-domain single-socket contract (SSE path + cleanup tests) 2026-06-15 09:00:19 +00:00
mahmoud merged commit 8e7b38827c into development 2026-06-15 09:00:23 +00:00
mahmoud deleted branch development_proc_multidomain_verify 2026-06-15 09:00:30 +00:00
Reference
lhumina_code/hero_proc!153