Migrate hero_proc to the one-socket multi-domain model (domains as /api/{domain}/ paths) #148
Reference
lhumina_code/hero_proc#148
Summary
hero_lib has moved multi-domain services to a one-socket-per-service model (domains as URL path segments), but hero_proc is only half-migrated: its service.toml, its Askama admin, and lab's hero_proc SDK still expect the old per-domain sockets (hero_proc/rpc_jobs.sock, rpc_logs.sock, …). The result: on latest hero_lib, hero_proc_server binds only hero_proc/rpc.sock, so lab service hero_proc fails (socket not found … rpc_jobs.sock) and discovery/admin break.

This was misdiagnosed as a regression. It is an intentional architecture change that needs finishing across consumers.
The new model (authoritative)
hero_lifecycle::ServiceManifest::serve_rpc_domains_with_extra (hero_lib/crates/hero_lifecycle/src/manifest.rs:768) is explicit:

Everything is served on <service>/rpc.sock:
- GET /api/domains.json → the domain list
- POST /api/{domain}/rpc → JSON-RPC per domain
- GET /api/{domain}/openrpc.json → spec per domain
- GET /heroservice.json, GET /health.json

The macros that generate this: herolib_macros::openrpc_from_oschema! (multi-domain) and openrpc_server! (single-domain). The model landed via hero_lib dev commits 537e2b9b (multi-domain proxy routing), edb04a99 (100%-compliant web routes), ee95c88d (dispatch+manifest), 277299dc (simplify openrpc_proxy+manifest).

What is stale and must be migrated
- hero_proc/crates/hero_proc_server/service.toml: still declares rpc_jobs.sock, rpc_logs.sock, rpc_secrets.sock, rpc_system.sock. Should declare a single hero_proc/rpc.sock (domains live under /api/{domain}/… on it). Verify the banner/--info matches what serve_rpc_domains_with_extra actually binds.
- lab's hero_proc SDK (hero_skills/crates/lab, via hero_proc_sdk): the multi-domain client still resolves $HERO_SOCKET_DIR/hero_proc/rpc_<domain>.sock. It must call <service>/rpc.sock with /api/{domain}/rpc. This is why lab service hero_proc / status fails.
- hero_proc's own SDK (hero_proc/crates/hero_proc_sdk): same per-domain-socket assumption.
- hero_router discovery: confirm it discovers a multi-domain service via /heroservice.json + /api/domains.json and caches each /api/{domain}/openrpc.json (today router.services reports methods=0, spec=null for hero_proc's domains; see the related rpc.discover fix hero_lib@599e2c37, but the durable path is domains.json).
- hero_proc_admin: keep it working (Timur wants it as the 1:1 reference for the Dioxus admin). Update its socket/path expectations to the new model rather than removing it.

Plan
- hero_proc_server/service.toml to one rpc.sock; confirm serve_domains_with binds exactly that + serves /api/domains.json + per-domain routes.
- hero_proc_sdk + lab's hero_proc client to the <svc>/rpc.sock + /api/{domain}/rpc contract.
- hero_proc_admin running against the new model (reference UI).
- hero_router discovers + caches per-domain specs via domains.json.
- lab service hero_proc --build --start clean; router.services shows hero_proc's domain methods > 0.

Validation / repro
Note: the older hero_lib@0b06c634 (pre-refactor) + current hero_proc both bind the flat rpc_*.sock and work. That is the old model and is the thing being migrated away from.

Related
- /api/domains.json): see companion issue in hero_website_framework.
- rpc.discover macro fix (single-domain discovery): hero_lib@599e2c37.

Filed by Claude (owner-mode work for Timur).
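As a rough illustration of the contract this issue describes, the sketch below contrasts the legacy per-domain socket path with the single shared socket plus per-domain HTTP paths. The helper names (legacy_socket, service_socket, rpc_path, openrpc_path) and the socket directory are hypothetical and are not taken from hero_lib or hero_proc_sdk; only the path shapes come from the issue text.

```rust
use std::path::{Path, PathBuf};

/// Legacy layout (being migrated away from): one socket per domain,
/// e.g. <socket_dir>/hero_proc/rpc_jobs.sock.
fn legacy_socket(socket_dir: &Path, service: &str, domain: &str) -> PathBuf {
    socket_dir.join(service).join(format!("rpc_{}.sock", domain))
}

/// One-socket layout: a single <socket_dir>/<service>/rpc.sock for all domains.
fn service_socket(socket_dir: &Path, service: &str) -> PathBuf {
    socket_dir.join(service).join("rpc.sock")
}

/// HTTP path for JSON-RPC calls to one domain on the shared socket.
fn rpc_path(domain: &str) -> String {
    format!("/api/{}/rpc", domain)
}

/// HTTP path for the OpenRPC spec of one domain on the shared socket.
fn openrpc_path(domain: &str) -> String {
    format!("/api/{}/openrpc.json", domain)
}

fn main() {
    // Hypothetical socket directory, standing in for $HERO_SOCKET_DIR.
    let dir = Path::new("/tmp/hero_sockets");
    let shared = service_socket(dir, "hero_proc");
    for domain in ["jobs", "logs", "secrets", "system"].iter() {
        println!(
            "old: {}  new: {} + POST {} (spec: GET {})",
            legacy_socket(dir, "hero_proc", domain).display(),
            shared.display(),
            rpc_path(domain),
            openrpc_path(domain),
        );
    }
}
```

The point of the comparison is that the domain moves out of the socket file name and into the request path, so a client only ever needs to resolve one socket per service.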
The new one-socket model works end-to-end on the serve + admin side — the gap is purely stale consumers:
- hero_proc (single hero_proc/rpc.sock). It correctly serves /api/domains.json, per-domain /api/<domain>/openrpc.json, and /api/<domain>/rpc.
- hero_website_framework@5e9724f) consumes it live (domains jobs/logs/secrets/system; 88 jobs methods).
- rpc.discover macro fix (hero_lib@599e2c37): each per-domain /api/<domain>/rpc answers rpc.discover with its spec, so hero_router can introspect over RPC.

Still stale / to migrate (this issue):
- hero_proc_server/service.toml still declares rpc_jobs.sock / rpc_logs.sock / rpc_secrets.sock / rpc_system.sock.
- hero_proc_admin + lab's hero_proc SDK still resolve rpc_<domain>.sock.
- lab service hero_proc fails its readiness check on rpc_jobs.sock; that is the migration target (keep the Askama admin working as the Dioxus reference).

Done: hero_proc_sdk migrated to the one-socket model
Root cause
The server (hero_lifecycle::serve_rpc_domains_with_extra) had already moved every multi-domain service to one socket per service: a single <service>/rpc.sock serving each domain as a path segment (POST /api/{domain}/rpc, GET /api/{domain}/openrpc.json, GET /api/{domain}/events; discovery via GET /api/domains.json). But the typed client, generated by the openrpc_client! bundle macro, still dialed the old rpc_<domain>.sock fan-out, so every SDK consumer failed with socket not found at '.../rpc_jobs.sock'.

The per-domain socket resolution lives in the macro, not in hand-written SDK code, so the fix spans three repos:
Changes
hero_lib (a9b14b60), the actual migration:
- herolib_macros generate_client_bundle: connect() resolves the single <service>/rpc.sock and points each domain client at its /api/{domain}/rpc segment via a new per-domain connect_socket_at(path, rpc_path). connect_http / connect_http_router both reach <base>/api/{domain}/rpc via connect_http_at(rpc_url).
- herolib_openrpc OpenRpcTransport::sse_subscribe: derives the SSE route from the transport's RPC endpoint (api-base + suffix) so a domain on the shared socket streams from /api/{domain}/events; single-domain /api/rpc → /api/events is unchanged. Bundle tests updated to assert the one-socket layout.

hero_proc (ce95f36, closes this issue):
- service.toml: declares only the single rpc.sock (dropped the stale control-plane + rpc_<domain>.sock entries).
- hero_proc_sdk docs (factory.rs, openrpc_client/mod.rs, README.md) + hero_proc_admin comments: aligned to the one-socket / path-segment model. Lockfile bumped to the hero_lib rev above.

hero_skills (c0c8ef22):
- lab's raw hero_proc liveness probe (ping_hero_proc) POSTed to bare /rpc (now 404). Repointed to POST /api/system/rpc ping. (lab's SDK-based registration path, hero_proc_factory(), is fixed transitively by the SDK update.)

Validation (live stack: hero_proc_server + hero_router :9988)
- cargo build of hero_proc_sdk + Askama hero_proc_admin: clean.
- POST /hero_proc/rpc/api/jobs/rpc (via router) and per-domain calls direct on rpc.sock: OK.
- 01_connect example (system + jobs domains): connects, pings, lists services.
- rpc_jobs.sock error (secrets pane surfaces a pre-existing corrupt secrets DB server error, database disk image is malformed, unrelated to routing; reproduces directly on the socket).
- lab service hero_proc --status → ping: ok, server + admin running.
- lab service hero_proc --start → hero_proc_server + hero_proc_admin register cleanly, no rpc_jobs.sock probe failure.

Out-of-scope follow-ups (noted, not fixed here)
smoke_test_sockets still uses the pre-one-socket endpoint layout (/health, /openrpc.json, POST /rpc system.ping, /.well-known/heroservice.json). Failures are non-fatal warnings. Surfaced as a smoke-test fail on hero_proc_admin_dx's admindx.sock (the Dioxus SPA serves HTML at /.well-known/heroservice.json instead of a JSON discovery manifest). Both belong to a separate hero_sockets/lab alignment task, and the Dioxus admin is explicitly off-limits here.

Closing: Askama admin + lab both work against the one-socket server.
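The SSE route derivation mentioned in the hero_lib change (RPC endpoint in, events route out) could look roughly like the sketch below. This is an assumption about the shape of the logic, not the actual OpenRpcTransport::sse_subscribe code; events_route is a hypothetical helper, and only the two mappings /api/{domain}/rpc → /api/{domain}/events and /api/rpc → /api/events come from the comment above.

```rust
/// Derive the SSE events route from an RPC route by swapping the trailing
/// "/rpc" segment for "/events". Returns None when the route does not end
/// in "/rpc", so callers can decide how to handle unexpected endpoints.
fn events_route(rpc_route: &str) -> Option<String> {
    let api_base = rpc_route.strip_suffix("/rpc")?;
    Some(format!("{}/events", api_base))
}

fn main() {
    // Multi-domain service on the shared socket.
    println!("{:?}", events_route("/api/jobs/rpc"));
    // Single-domain service: same rule, no domain segment.
    println!("{:?}", events_route("/api/rpc"));
    // Not an RPC route: no events route is derived.
    println!("{:?}", events_route("/health.json"));
}
```

Deriving the events route from the RPC route, rather than hard-coding it, is what lets one rule cover both the multi-domain and the single-domain layout.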
Follow-ups: secrets-store repair + smoke-test layout
The two items flagged when this issue closed are now addressed.
1. Secrets pane (database disk image is malformed)

Root cause was index corruption, not data loss: idx_secrets_context / sqlite_autoindex_secrets_1 had wrong entry counts while all 76 secret rows were intact. So the fix is a non-destructive REINDEX, not a wipe.

hero_proc (5ab563d):
- SecretsApi::repair() runs REINDEX secrets + PRAGMA integrity_check (returns "ok"), rebuilding the indexes from the row data; secrets preserved.
- SecretService::repair() -> str RPC (regenerated openrpc spec) wired to it.
- POST /secrets/repair): available even when the store can't be listed, so a malformed store is self-serviceable from the UI.

Verified on the live store: secret_list_full → malformed → repair → "ok" → secret_list_full → 76 secrets (no data loss); the pane loads and the button round-trips (flash: "Secrets store repaired — integrity: ok"). The live store is repaired.

2. lab smoke-test endpoint layout
hero_skills (dc72a1bc): smoke_test_sockets asserted the pre-one-socket layout. Each check now tries an ordered list of candidates (one-socket path first, legacy path as fallback), passing on the first 2xx that matches (probe_check_blocking):
- /health.json → /health; /api/domains.json → /openrpc.json; /heroservice.json → /.well-known/heroservice.json; ping /api/ping → system-domain rpc → /api/rpc → /rpc.
- /health(.json); manifest /.well-known/heroservice.json → /heroservice.json.

Verified live: hero_proc (one-socket) and hero_router (legacy) now pass all rpc + admin smoke checks (6 passed). The only remaining failure is *_admin_dx's manifest check, a real gap in the Dioxus admin (its SPA serves HTML, not a JSON discovery manifest), which is off-limits per the migration scope; left as a legitimate flag rather than silenced.
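The ordered-candidate idea behind the smoke-test change can be sketched as follows. This is a minimal illustration under stated assumptions, not the real probe_check_blocking: first_ok and legacy_probe are hypothetical names, the probe is a stand-in closure instead of a real HTTP call over a Unix socket, and the sketch checks only the status code, whereas the real check may also inspect the response body.

```rust
/// Return the first candidate path whose probe answers with a 2xx status.
/// `probe` stands in for an HTTP request and yields None on transport errors.
fn first_ok<'a, F>(candidates: &[&'a str], mut probe: F) -> Option<&'a str>
where
    F: FnMut(&str) -> Option<u16>,
{
    for &path in candidates {
        if let Some(status) = probe(path) {
            if (200..300).contains(&status) {
                return Some(path);
            }
        }
    }
    None
}

/// Hypothetical legacy service: only the old /health route exists.
fn legacy_probe(path: &str) -> Option<u16> {
    if path == "/health" {
        Some(200)
    } else {
        Some(404)
    }
}

fn main() {
    // One-socket path first, legacy path as fallback.
    let health = ["/health.json", "/health"];
    match first_ok(&health, legacy_probe) {
        Some(path) => println!("health check passed via {}", path),
        None => println!("health check failed on all candidates"),
    }
}
```

Putting the one-socket path first means a migrated service passes on the first request, while a legacy service still passes through the fallback instead of producing a false failure.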