Migrate hero_proc to the one-socket multi-domain model (domains as /api/{domain}/ paths) #148

Closed
opened 2026-06-10 11:13:16 +00:00 by timur · 3 comments
Owner

Summary

hero_lib has moved multi-domain services to a one-socket-per-service model (domains as URL path segments), but hero_proc is only half-migrated: its service.toml, its Askama admin, and lab's hero_proc SDK still expect the old per-domain sockets (hero_proc/rpc_jobs.sock, rpc_logs.sock, …). The result: on latest hero_lib, hero_proc_server binds only hero_proc/rpc.sock, so lab service hero_proc fails (socket not found … rpc_jobs.sock) and discovery/admin break.

This was misdiagnosed as a regression — it is an intentional architecture change that needs finishing across consumers.

The new model (authoritative)

hero_lifecycle::ServiceManifest::serve_rpc_domains_with_extra (hero_lib/crates/hero_lifecycle/src/manifest.rs:768) is explicit:

"ONE socket per service. Domains are path segments… NO control-plane socket and NO rpc_<domain>.sock fan-out."

Everything is served on <service>/rpc.sock:

  • GET /api/domains.json → the domain list
  • POST /api/{domain}/rpc → JSON-RPC per domain
  • GET /api/{domain}/openrpc.json → spec per domain
  • GET /heroservice.json, GET /health.json

The macros that generate this: herolib_macros::openrpc_from_oschema! (multi-domain) and openrpc_server! (single-domain). The model landed via hero_lib dev commits 537e2b9b (multi-domain proxy routing), edb04a99 (100%-compliant web routes), ee95c88d (dispatch+manifest), 277299dc (simplify openrpc_proxy+manifest).
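As a rough illustration of the route layout listed above, the sketch below builds the socket path and per-domain routes for a service. The helper names (`service_socket`, `domain_rpc_path`, `domain_spec_path`, `fetch_domain_spec`) are hypothetical and exist only for this example; the path shapes are taken from the list above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, route shapes come from the issue text.

# The one socket a service binds: <socket_dir>/<service>/rpc.sock
service_socket() {
  local socket_dir="$1" service="$2"
  printf '%s/%s/rpc.sock\n' "$socket_dir" "$service"
}

# JSON-RPC route for a domain: /api/{domain}/rpc
domain_rpc_path() {
  printf '/api/%s/rpc\n' "$1"
}

# OpenRPC spec route for a domain: /api/{domain}/openrpc.json
domain_spec_path() {
  printf '/api/%s/openrpc.json\n' "$1"
}

# Fetch one domain's spec over the unix socket (needs a running service).
fetch_domain_spec() {
  local socket_dir="$1" service="$2" domain="$3"
  curl --silent --fail \
    --unix-socket "$(service_socket "$socket_dir" "$service")" \
    "http://localhost$(domain_spec_path "$domain")"
}
```

With a running server, `fetch_domain_spec "$HERO_SOCKET_DIR" hero_proc jobs` would be the equivalent of the manual curl call for the jobs spec.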

What is stale and must be migrated

  1. hero_proc/crates/hero_proc_server/service.toml — still declares rpc_jobs.sock, rpc_logs.sock, rpc_secrets.sock, rpc_system.sock. Should declare a single hero_proc/rpc.sock (domains live under /api/{domain}/… on it). Verify the banner/--info matches what serve_rpc_domains_with_extra actually binds.
  2. lab's hero_proc SDK (hero_skills/crates/lab, via hero_proc_sdk) — the multi-domain client still resolves $HERO_SOCKET_DIR/hero_proc/rpc_<domain>.sock. It must call <service>/rpc.sock with /api/{domain}/rpc. This is why lab service hero_proc / status fails.
  3. hero_proc's own SDK (hero_proc/crates/hero_proc_sdk) — same per-domain-socket assumption.
  4. hero_router discovery — confirm it discovers a multi-domain service via /heroservice.json + /api/domains.json and caches each /api/{domain}/openrpc.json (today router.services reports methods=0, spec=null for hero_proc's domains — see the related rpc.discover fix hero_lib@599e2c37, but the durable path is domains.json).
  5. Existing Askama hero_proc_admin: keep it working (Timur wants it as the 1:1 reference for the Dioxus admin). Update its socket/path expectations to the new model rather than removing it.
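To make the consumer-side change in items 2 and 3 concrete, here is a minimal sketch of how a legacy per-domain socket path maps onto the new contract. The function name `legacy_to_one_socket` is hypothetical; the real resolution lives in the SDK code, not in shell.

```shell
#!/usr/bin/env bash
# Sketch only: translate a legacy per-domain socket path into
# "<one socket> <rpc route>". The function name is hypothetical.
# Prints the pair separated by a space; returns 1 for non-legacy paths.
legacy_to_one_socket() {
  local legacy="$1"
  local dir file domain
  dir="$(dirname "$legacy")"
  file="$(basename "$legacy")"
  case "$file" in
    rpc_*.sock)
      # Strip the "rpc_" prefix and ".sock" suffix to recover the domain.
      domain="${file#rpc_}"
      domain="${domain%.sock}"
      printf '%s/rpc.sock /api/%s/rpc\n' "$dir" "$domain"
      ;;
    *)
      return 1
      ;;
  esac
}
```

For example, `legacy_to_one_socket "$HERO_SOCKET_DIR/hero_proc/rpc_jobs.sock"` prints the shared `rpc.sock` path followed by `/api/jobs/rpc`.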

Plan

  • Migrate hero_proc_server/service.toml to one rpc.sock; confirm serve_domains_with binds exactly that + serves /api/domains.json + per-domain routes.
  • Update hero_proc_sdk + lab's hero_proc client to the <svc>/rpc.sock + /api/{domain}/rpc contract.
  • Keep the Askama hero_proc_admin running against the new model (reference UI).
  • Verify hero_router discovers + caches per-domain specs via domains.json.
  • lab service hero_proc --build --start clean; router.services shows hero_proc's domain methods > 0.

Validation / repro

PATH_CODE=<lhumina_code> lab service hero_proc --build --start   # was failing on rpc_jobs.sock
ls $HERO_SOCKET_DIR/hero_proc/                                   # latest build: only rpc.sock (correct, new model)
curl --unix-socket .../hero_proc/rpc.sock http://localhost/api/domains.json
curl --unix-socket .../hero_proc/rpc.sock http://localhost/api/jobs/openrpc.json

Note: the older hero_lib@0b06c634 (pre-refactor) + current hero_proc both bind the flat rpc_*.sock and work — that is the old model and is the thing being migrated away from.
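The `ls` check above can be scripted. The sketch below flags any leftover `rpc_<domain>.sock` entries next to `rpc.sock`; the function name `check_one_socket_layout` is hypothetical, and it tests for existence only (`-e`), not for the entry actually being a socket.

```shell
#!/usr/bin/env bash
# Sketch only: report stale legacy sockets in a service's socket directory.
# Returns 0 when only the one-socket layout is present, 1 otherwise.
check_one_socket_layout() {
  local service_dir="$1" stale=0 entry
  if [ ! -e "$service_dir/rpc.sock" ]; then
    echo "missing: $service_dir/rpc.sock"
    stale=1
  fi
  for entry in "$service_dir"/rpc_*.sock; do
    # An unmatched glob stays literal, so skip entries that do not exist.
    [ -e "$entry" ] || continue
    echo "stale: $entry"
    stale=1
  done
  return "$stale"
}
```

Usage: `check_one_socket_layout "$HERO_SOCKET_DIR/hero_proc"` exits non-zero on an old-model build that still binds the flat `rpc_*.sock` files.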

Related

  • Dioxus admin issue (consumer of /api/domains.json): see companion issue in hero_website_framework.
  • rpc.discover macro fix (single-domain discovery): hero_lib@599e2c37.

Filed by Claude (owner-mode work for Timur).

Author
Owner

The new one-socket model works end-to-end on the serve + admin side — the gap is purely stale consumers:

  • Built + ran new-model hero_proc (single hero_proc/rpc.sock). It correctly serves /api/domains.json, per-domain /api/<domain>/openrpc.json, and /api/<domain>/rpc.
  • The new Dioxus admin (hero_website_framework@5e9724f) consumes it live (domains jobs/logs/secrets/system; 88 jobs methods).
  • Confirmed the rpc.discover macro fix (hero_lib@599e2c37): each per-domain /api/<domain>/rpc answers rpc.discover with its spec, so hero_router can introspect over RPC.

Still stale / to migrate (this issue):

  • hero_proc_server/service.toml still declares rpc_jobs.sock / rpc_logs.sock / rpc_secrets.sock / rpc_system.sock.
  • The Askama hero_proc_admin + lab's hero_proc SDK still resolve rpc_<domain>.sock. lab service hero_proc fails its readiness check on rpc_jobs.sock — that is the migration target (keep the Askama admin working as the Dioxus reference).
Author
Owner

Done — hero_proc_sdk migrated to the one-socket model

Root cause

The server (hero_lifecycle::serve_rpc_domains_with_extra) had already moved every multi-domain service to one socket per service: a single <service>/rpc.sock serving each domain as a path segment (POST /api/{domain}/rpc, GET /api/{domain}/openrpc.json, GET /api/{domain}/events; discovery via GET /api/domains.json). But the typed client — generated by the openrpc_client! bundle macro — still dialed the old rpc_<domain>.sock fan-out, so every SDK consumer failed with socket not found at '.../rpc_jobs.sock'.

The per-domain socket resolution lives in the macro, not in hand-written SDK code, so the fix spans three repos:

Changes

hero_lib (a9b14b60) — the actual migration:

  • herolib_macros generate_client_bundle: connect() resolves the single <service>/rpc.sock and points each domain client at its /api/{domain}/rpc segment via a new per-domain connect_socket_at(path, rpc_path). connect_http/connect_http_router both reach <base>/api/{domain}/rpc via connect_http_at(rpc_url).
  • herolib_openrpc OpenRpcTransport::sse_subscribe: derives the SSE route from the transport's RPC endpoint (api-base + suffix) so a domain on the shared socket streams from /api/{domain}/events; single-domain /api/rpc → /api/events is unchanged. Bundle tests updated to assert the one-socket layout.
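The SSE route derivation described above (api-base + suffix) can be sketched as a plain string transformation. This is an illustration of the rule only, not the `sse_subscribe` implementation; `sse_route_for` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch only: derive the SSE route from an RPC route by swapping the
# trailing "rpc" segment for "events". Returns 1 if the route does not
# end in "/rpc".
sse_route_for() {
  local rpc_path="$1"
  case "$rpc_path" in
    */rpc) printf '%s/events\n' "${rpc_path%/rpc}" ;;
    *) return 1 ;;
  esac
}
```

Both cases from the change note fall out of the same rule: the multi-domain route `/api/jobs/rpc` maps to `/api/jobs/events`, and the single-domain `/api/rpc` still maps to `/api/events`.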

hero_proc (ce95f36, closes this issue):

  • service.toml: declares only the single rpc.sock (dropped the stale control-plane + rpc_<domain>.sock entries).
  • hero_proc_sdk docs (factory.rs, openrpc_client/mod.rs, README.md) + hero_proc_admin comments: aligned to the one-socket / path-segment model. Lockfile bumped to the hero_lib rev above.

hero_skills (c0c8ef22):

  • lab's raw hero_proc liveness probe (ping_hero_proc) POSTed to bare /rpc (now 404). Repointed to POST /api/system/rpc ping. (lab's SDK-based registration path — hero_proc_factory() — is fixed transitively by the SDK update.)
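For reference, a shell stand-in for the repointed liveness probe might look like the sketch below. It assumes a standard JSON-RPC 2.0 envelope with an empty `params` array and a method literally named `ping`; the helper names `jsonrpc_request` and `probe_system_ping` are hypothetical and not the lab code.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; the envelope shape is an
# assumption based on JSON-RPC 2.0.

# Build a JSON-RPC 2.0 request body with empty params.
jsonrpc_request() {
  local method="$1" id="${2:-1}"
  printf '{"jsonrpc":"2.0","id":%s,"method":"%s","params":[]}\n' "$id" "$method"
}

# POST a ping to the system domain on the shared socket (needs a running
# hero_proc). --fail makes curl exit non-zero on an HTTP error such as the
# 404 that bare /rpc now returns.
probe_system_ping() {
  local socket="$1"
  curl --silent --fail --unix-socket "$socket" \
    -H 'Content-Type: application/json' \
    -d "$(jsonrpc_request ping)" \
    http://localhost/api/system/rpc
}
```

Usage: `probe_system_ping "$HERO_SOCKET_DIR/hero_proc/rpc.sock"`.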

Validation (live stack: hero_proc_server + hero_router :9988)

  • cargo build of hero_proc_sdk + Askama hero_proc_admin: clean.
  • POST /hero_proc/rpc/api/jobs/rpc (via router) and per-domain calls direct on rpc.sock: OK.
  • SDK 01_connect example (system + jobs domains): connects, pings, lists services.
  • Askama admin services / jobs / logs panes load with no rpc_jobs.sock error (secrets pane surfaces a pre-existing corrupt secrets DB server error — database disk image is malformed — unrelated to routing; reproduces directly on the socket).
  • lab service hero_proc --status → ping: ok, server + admin running.
  • lab service hero_proc --start → hero_proc_server + hero_proc_admin register cleanly, no rpc_jobs.sock probe failure.

Out-of-scope follow-ups (noted, not fixed here)

  • lab's generic smoke_test_sockets still uses the pre-one-socket endpoint layout (/health, /openrpc.json, POST /rpc system.ping, /.well-known/heroservice.json). Failures are non-fatal warnings. Surfaced as a smoke-test fail on hero_proc_admin_dx's admindx.sock (the Dioxus SPA serves HTML at /.well-known/heroservice.json instead of a JSON manifest). Both belong to a separate hero_sockets/lab alignment task — and the Dioxus admin is explicitly off-limits here.

Closing — Askama admin + lab both work against the one-socket server.

timur closed this issue 2026-06-11 12:04:12 +00:00
Author
Owner

Follow-ups: secrets-store repair + smoke-test layout

The two items flagged when this issue closed are now addressed.

1. Secrets pane (database disk image is malformed)

Root cause was index corruption, not data loss — idx_secrets_context / sqlite_autoindex_secrets_1 had wrong entry counts while all 76 secret rows were intact. So the fix is a non-destructive REINDEX, not a wipe.

hero_proc (5ab563d):

  • SecretsApi::repair() runs REINDEX secrets + PRAGMA integrity_check (returns "ok"), rebuilding the indexes from the row data — secrets preserved.
  • New SecretService::repair() -> str RPC (regenerated openrpc spec) wired to it.
  • Askama admin: the secrets pane now renders list/connection failures inline (toolbar stays reachable) and adds a "Repair store" button (POST /secrets/repair) — available even when the store can't be listed, so a malformed store is self-serviceable from the UI.

Verified on the live store: secret_list_full → malformed → repair → "ok" → secret_list_full → 76 secrets (no data loss); the pane loads and the button round-trips (flash: "Secrets store repaired — integrity: ok"). The live store is repaired.
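The same repair can be reproduced by hand with the `sqlite3` CLI, which is useful for checking a copy of the store outside the service. This is a sketch under assumptions: the table name `secrets` comes from the statements above, while the helper names and the idea of running it manually are illustrative, not part of the commit.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# The two statements the repair runs: rebuild every index on the secrets
# table from its rows, then verify the database.
repair_statements() {
  printf '%s\n' 'REINDEX secrets;' 'PRAGMA integrity_check;'
}

# Run the repair against a database file (needs the sqlite3 CLI).
# PRAGMA integrity_check prints "ok" when no problems remain.
repair_store() {
  local db="$1"
  sqlite3 "$db" "$(repair_statements)"
}
```

Running `repair_store` on a copy of the file first is the cautious option, since `REINDEX` rewrites index pages in place.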

2. lab smoke-test endpoint layout

hero_skills (dc72a1bc): smoke_test_sockets asserted the pre-one-socket layout. Each check now tries an ordered list of candidates — one-socket path first, legacy path as fallback — passing on the first 2xx that matches (probe_check_blocking):

  • openrpc: /health.json → /health; /api/domains.json → /openrpc.json; /heroservice.json → /.well-known/heroservice.json; ping /api/ping → system-domain rpc → /api/rpc → /rpc.
  • http: /health(.json); manifest /.well-known/heroservice.json → /heroservice.json.
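The candidate-ordering idea can be sketched in a few lines: try each path in order and stop at the first one the probe accepts. This is an illustration of the pattern, not `probe_check_blocking` itself; `first_passing_candidate`, `probe_http_ok`, and the `SOCKET` variable are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: function and variable names are hypothetical.

# Try candidates in order; print the first one the probe accepts.
# $1 names a probe command that takes one path and returns 0 on success.
first_passing_candidate() {
  local probe="$1" candidate
  shift
  for candidate in "$@"; do
    if "$probe" "$candidate"; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Example probe: GET the path over the unix socket named by SOCKET.
# With --fail, curl exits non-zero when the server answers with an HTTP
# error status (400 or above).
probe_http_ok() {
  curl --silent --fail --output /dev/null \
    --unix-socket "$SOCKET" "http://localhost$1"
}
```

Usage: `SOCKET=.../hero_proc/rpc.sock first_passing_candidate probe_http_ok /health.json /health` prints whichever health route the service answers, with the one-socket path tried first.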

Verified live: hero_proc (one-socket) and hero_router (legacy) now pass all rpc + admin smoke checks (6 passed). The only remaining failure is *_admin_dx's manifest check — a real gap in the Dioxus admin (its SPA serves HTML, not a JSON discovery manifest), which is off-limits per the migration scope; left as a legitimate flag rather than silenced.


Reference
lhumina_code/hero_proc#148