[feat] Deploy on any node generation: support zos-light (zmachine-light/network-light) workloads #135

Open
opened 2026-06-12 22:34:56 +00:00 by mik-tf · 1 comment
Owner

Some TFGrid nodes run the newer "zos light" generation and advertise only the light workload feature set (zmachine-light, network-light), visible on the Grid Proxy node features list. Our compute daemon always builds classic zmachine + network workloads, so on a light-only node the ZOS daemon accepts the contract and then rejects the deployment, leaving an orphan network contract to clean up by hand. The standardized new sandbox nodes are zos light (for example mainnet node 8072, features: zmachine-light, network-light, zdb, zmount, volume, qsfs, zlogs, mycelium, wireguard), so the daemon needs to deploy on any node generation rather than asking node operators for classic installs.

The upstream tfgrid-sdk-rust crate already exposes a first-class light path (client.deploy_vm_light taking a VmLightDeployment, with NetworkLightSpec and VmLightSpec), and it returns the same DeploymentOutcome as the classic deploy_vm, so the downstream mycelium-ip poll and SSH reachability probe stay unchanged. Plan:

  1. Capture the node's advertised features in the daemon's Grid Proxy node query (the node-info struct does not read the features array today) and derive a node generation: light when the features contain zmachine-light, classic otherwise.

  2. Branch the deploy path on generation. Classic keeps the existing deploy_vm. Light builds VmLightDeployment with NetworkLightSpec plus a VmLightSpec that mirrors the classic VmSpec field for field: the same cpu, memory, rootfs size, the /data volume, and the SSH_KEY env. One must-set detail: pass the real image flist explicitly on the light spec, because the SDK light builder defaults the flist to a generic base image rather than the Ubuntu image we deploy. Mycelium is auto-seeded on the light path, so no extra mycelium wiring is needed.

  3. Once the daemon deploys on both generations, the deployer placement no longer needs to block light nodes. Surface the node generation on the deployer Nodes page so an operator can see which generation each node runs. This folds in the deployer-side placement issue at lhumina_code/hero_os_tfgrid_deployer#25.
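Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the daemon's code: `NodeGeneration`, `VmParams`, `DeployPlan`, and `plan_deploy` are hypothetical local names, and the sketch deliberately stops short of calling the SDK, so the actual `deploy_vm` and `deploy_vm_light` signatures are not assumed here. Only the rule "light when the features contain zmachine-light, classic otherwise" and the "flist must be set explicitly" detail come from the plan.

```rust
/// Node generation derived from the features a node advertises on Grid Proxy.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeGeneration {
    Classic,
    Light,
}

impl NodeGeneration {
    /// Light when the features contain "zmachine-light", classic otherwise.
    pub fn from_features<S: AsRef<str>>(features: &[S]) -> Self {
        if features.iter().any(|f| f.as_ref() == "zmachine-light") {
            NodeGeneration::Light
        } else {
            NodeGeneration::Classic
        }
    }
}

/// Hypothetical stand-in for the fields both specs are said to share.
/// Field names and units are assumptions, not the SDK's types.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct VmParams {
    pub cpu: u32,
    pub memory_mb: u64,
    pub rootfs_mb: u64,
    pub data_volume_mb: u64,
    pub ssh_key: String,
    /// Image flist; never left to a builder default.
    pub flist: String,
}

/// Which deploy path the daemon would take for a node.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DeployPlan {
    /// Would be handed to the existing classic deploy_vm path.
    Classic(VmParams),
    /// Would be handed to the deploy_vm_light path.
    Light(VmParams),
}

/// Pick the deploy path from the node's features, refusing to plan a
/// deployment whose flist was left empty.
pub fn plan_deploy<S: AsRef<str>>(
    features: &[S],
    params: VmParams,
) -> Result<DeployPlan, String> {
    if params.flist.trim().is_empty() {
        return Err("flist must be set explicitly".to_string());
    }
    Ok(match NodeGeneration::from_features(features) {
        NodeGeneration::Classic => DeployPlan::Classic(params),
        NodeGeneration::Light => DeployPlan::Light(params),
    })
}
```

Carrying one parameter set into both branches keeps the light spec from drifting away from the classic one, and rejecting an empty flist up front means a generic default image cannot slip through unnoticed.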

Live validation gate: re-reserve node 8072 once it is re-flagged dedicated, then confirm the Ubuntu image boots under zmachine-light and is reachable over mycelium SSH end to end, since that is the one thing the SDK types cannot prove on their own.
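For the reachability half of that gate, a probe along these lines could be used. It is a sketch, not the daemon's existing SSH reachability probe: the function name is hypothetical, and the check assumes the SSH server sends its identification string (starting with `SSH-`) as the first bytes on the connection, which is the common case. It uses only the Rust standard library and takes a full socket address, so the VM's mycelium IP and port 22 would be supplied by the caller.

```rust
use std::io::Read;
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Returns true when a TCP connection to `addr` succeeds within `timeout`
/// and the peer's first four bytes are "SSH-".
pub fn ssh_banner_reachable(addr: SocketAddr, timeout: Duration) -> bool {
    // connect_timeout fails on refusal, unreachable routes, or a zero timeout.
    let mut stream = match TcpStream::connect_timeout(&addr, timeout) {
        Ok(s) => s,
        Err(_) => return false,
    };
    // Bound the banner read so a silent peer cannot hang the probe.
    if stream.set_read_timeout(Some(timeout)).is_err() {
        return false;
    }
    let mut buf = [0u8; 4];
    stream.read_exact(&mut buf).is_ok() && &buf == b"SSH-"
}
```

Checking the banner rather than only the TCP handshake distinguishes "something is listening" from "sshd inside the booted image is answering", which is closer to what the gate is meant to prove. It still does not replace an actual login with the deployed key.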

Author
Owner

Update from a live test. The generation-aware selection layer is on integration at hero_compute@e7c5723: the daemon now reads the node features, picks the light deploy path on a zmachine-light node, and leaves the classic path unchanged (unit tested, gated). The light deploy itself is not yet proven.

A live attempt on mainnet node 8072 (the only alive light node on the grid right now) hit a wait-network-workloads timeout: the daemon correctly selected the light path and submitted a single light deployment, but the network-light workload did not become ready in time. This is the same symptom the earlier abandoned attempt hit.

Two things are tangled and need separating before we can call it proven: whether our light workload construction matches what the dashboard submits, and the fact that 8072 is non-dedicated (shared), so a deployment on a node we have not rented is not prioritized like one on a dedicated rental. Every alive light node on the grid is non-dedicated, and every dedicated light node is currently offline, so proving this needs either non-dedicated deploy support in our stack or a dedicated light node actually brought online by the farm.
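To make the timeout symptom concrete, the general shape of a bounded readiness wait looks like the sketch below. This is not the SDK's or the daemon's actual wait-network-workloads loop, whose internals are not shown in this issue. `WaitOutcome` and `wait_until_ready` are hypothetical names, and the readiness check is a caller-supplied closure standing in for "is the network-light workload ready yet".

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Result of a bounded wait, including how many times readiness was polled.
#[derive(Debug, PartialEq, Eq)]
pub enum WaitOutcome {
    Ready { polls: u32 },
    TimedOut { polls: u32 },
}

/// Poll `is_ready` every `interval` until it returns true or `timeout` elapses.
/// The check always runs at least once, even with a zero timeout.
pub fn wait_until_ready<F: FnMut() -> bool>(
    mut is_ready: F,
    timeout: Duration,
    interval: Duration,
) -> WaitOutcome {
    let deadline = Instant::now() + timeout;
    let mut polls = 0;
    loop {
        polls += 1;
        if is_ready() {
            return WaitOutcome::Ready { polls };
        }
        // Stop if sleeping another interval would take us past the deadline.
        if Instant::now() + interval > deadline {
            return WaitOutcome::TimedOut { polls };
        }
        thread::sleep(interval);
    }
}
```

Recording the poll count alongside the outcome is one cheap way to tell "the workload was checked repeatedly and never became ready" apart from "the wait gave up almost immediately", which matters when separating a construction problem from a scheduling one.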

Reference
lhumina_code/hero_compute#135