[placement] Node feature set (zos light vs classic) is invisible to placement; light-only node accepted then rejects every deploy #25

Closed
opened 2026-06-12 20:24:13 +00:00 by mik-tf · 2 comments
Owner

Provisioning a tester VM onto mainnet node 8072 fails after contract submission. The node advertises only the light workload feature set (zmachine-light, network-light, per Grid Proxy /nodes/8072), while the compute daemon deploys classic network and zmachine workloads. The ZOS daemon therefore rejects the deployment, and the network contract is left for manual cancellation (contract 2097718, cancelled). The node registry and capacity board show such a node as healthy and fitting, so placement selects a node that can never run our workloads.

Suggested fix: the placement/registration guard should also compare the node's advertised features against the workload types the daemon deploys, refuse registration or placement with a clear message when they do not match, and surface the node generation on the Nodes page. Until then, the rented mainnet node 8072 cannot serve tester VMs: mainnet needs either a node with the classic feature set or light workload support in the compute daemon.
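A minimal sketch of the suggested guard, for discussion only. It assumes the node's advertised features are available as a set of workload type names; `NodeInfo`, `CLASSIC_WORKLOADS`, and `check_placement` are hypothetical names, not existing deployer or daemon code.

```python
from dataclasses import dataclass

# Workload types the compute daemon deploys today, per the description above.
CLASSIC_WORKLOADS = frozenset({"zmachine", "network"})


@dataclass(frozen=True)
class NodeInfo:
    """Hypothetical node record; `features` holds what the node advertises."""
    node_id: int
    features: frozenset


def check_placement(node, required=CLASSIC_WORKLOADS):
    """Return (ok, message); refuse nodes missing any required workload type."""
    missing = sorted(required - node.features)
    if missing:
        return False, (
            f"node {node.node_id} does not advertise required workload types: "
            f"{', '.join(missing)}"
        )
    return True, f"node {node.node_id} is compatible"


# A light-only node like 8072 is refused before any contract is submitted.
light_node = NodeInfo(8072, frozenset({"zmachine-light", "network-light"}))
ok, message = check_placement(light_node)
print(ok, message)
```

Running the check before contract submission would avoid leaving a network contract behind for manual cancellation.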
Author
Owner

Important reframe after talking to ops (https://forge.ourworld.tf/projectmycelium/circle_ops/issues/837): light nodes are not an anomaly to avoid, they are the direction. New ops-provisioned nodes for the sandbox are v3light on purpose, and a dashboard deployment on node 8072 succeeds because the dashboard submits the light workload types. So the resolution here is not "prefer classic nodes": the compute daemon needs to submit zmachine-light and network-light on nodes that advertise the light feature set. The feature-aware placement check described above is still wanted, but as a compatibility router (pick the workload family per node) rather than only a guard that refuses light nodes.
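The compatibility-router idea could look roughly like the sketch below. The function name, the `FAMILIES` table, and the default preference order are assumptions made for illustration; only the four workload type names come from this issue.

```python
# Workload families keyed by name; type names are the ones cited in this issue.
FAMILIES = {
    "classic": frozenset({"zmachine", "network"}),
    "light": frozenset({"zmachine-light", "network-light"}),
}


def pick_workload_family(features, preference=("classic", "light")):
    """Return the first family in `preference` that the node fully advertises.

    Returns None when no family is fully supported, so the caller can refuse
    placement with a clear message instead of submitting a contract.
    """
    advertised = set(features)
    for family in preference:
        if FAMILIES[family] <= advertised:
            return family
    return None


# A light-only node such as 8072 routes to the light family instead of failing.
print(pick_workload_family(["zmachine-light", "network-light"]))
```

The preference order only matters for a node that advertises both families; which one should win there is left open in this sketch.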
Author
Owner

Folded into the compute daemon issue at https://forge.ourworld.tf/lhumina_code/hero_compute/issues/135. The right fix is for the daemon to deploy on any node generation (detect the node's advertised features and use the light workload path on light-only nodes), after which the deployer placement no longer needs to block these nodes and just surfaces the node generation on the Nodes page. Closing here in favor of that issue.
Reference
lhumina_code/hero_os_tfgrid_deployer#25