deploy_vm hangs on dedicated zos-light nodes: it forces light workloads, but classic deploys fine #136

Open
opened 2026-06-15 02:11:05 +00:00 by mik-tf · 0 comments
Owner

On a dedicated zos-light node we rent on mainnet (node 8072), deploy_vm hangs in provisioning indefinitely and the VM never gets a mycelium IP. It creates the network-light and vm-light contracts, stops at "Submitting deployment to TFChain", and the workload never becomes ready.

The cause is the automatic generation selection in node_capacity.rs: is_light_generation returns true for any node advertising zmachine-light without a classic zmachine, so deploy_on_tfgrid always builds light workloads for these nodes.

The code comment there assumes a light node accepts a classic contract and then rejects the workloads, but that is not what happens on node 8072: deploying a classic full VM on the same node through the official ThreeFold dashboard comes up in about one minute with working mycelium SSH. So classic workloads do work on this node, and forcing the light path is what hangs.

Proposed fix: make the light vs classic choice flexible (an override, or a per-node check) instead of a hard rule from the feature flags, prefer classic where it works, and confirm by deploying a classic VM on node 8072 through our own daemon. This is likely the root cause of the slow first-boot reachability we have seen on fresh VMs.
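A minimal sketch of what the override could look like, to make the proposal concrete. It is not the actual code in node_capacity.rs: the `Generation` and `GenerationOverride` types, the `select_generation` function, the `&[&str]` feature list, and the exact feature strings `"zmachine-light"` and `"zmachine"` are assumptions for illustration. Only the rule inside `is_light_generation` follows the behaviour described above.

```rust
/// Workload generation to build for a node (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Generation {
    Classic,
    Light,
}

/// Hypothetical caller-supplied override for the automatic choice.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GenerationOverride {
    /// Keep the current behaviour: decide from the node's feature flags.
    Auto,
    /// Build classic workloads regardless of the advertised features.
    ForceClassic,
    /// Build light workloads regardless of the advertised features.
    ForceLight,
}

/// Mirrors the rule described in this issue: a node counts as light when it
/// advertises zmachine-light and does not advertise a classic zmachine.
/// The feature strings and the slice signature are assumptions.
fn is_light_generation(features: &[&str]) -> bool {
    features.contains(&"zmachine-light") && !features.contains(&"zmachine")
}

/// Picks the generation, letting an explicit override win over the
/// feature-flag rule. With `Auto` the result is unchanged from today.
fn select_generation(features: &[&str], choice: GenerationOverride) -> Generation {
    match choice {
        GenerationOverride::ForceClassic => Generation::Classic,
        GenerationOverride::ForceLight => Generation::Light,
        GenerationOverride::Auto => {
            if is_light_generation(features) {
                Generation::Light
            } else {
                Generation::Classic
            }
        }
    }
}
```

With something like this, confirming the fix on node 8072 would amount to passing `GenerationOverride::ForceClassic` for that deployment. A per-node check could later replace the `Auto` arm without changing callers.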

Signed-by: mik-tf <mik-tf@noreply.invalid>

Reference
lhumina_code/hero_compute#136