[deployer admin] Show the install log in the admin UI for troubleshooting #267

Open
opened 2026-06-07 16:24:01 +00:00 by mik-tf · 1 comment
Owner

When a tester install fails, the admin dashboard only shows a generic "Install failed on the VM" message. The operator cannot see why it failed without opening a shell on the admin machine and reading the deployer logs, which is exactly what was needed to diagnose the recent gate failure. We should add a button in the admin dashboard that displays the install output for a given tester, both the steps and any error, so failures can be understood in the browser. A clean implementation captures the install run output, stores it per tester in a small database column, exposes it through a read-only RPC, and surfaces it behind a "Show install log" button on the user detail page. Keep it simple at first and show only the deployer-side install output, which is enough to see which step failed and why; a richer live view of the steps as they happen can come later. This was scoped during the install reliability work but deferred so the reliability fix could land cleanly on its own.
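
The capture, store, and read-only RPC steps described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the `testers` table, its `install_log` column, and the function names `run_install` and `get_install_log` are hypothetical, SQLite stands in for whatever database the deployer actually uses, and the install is modelled as a plain subprocess.

```python
import sqlite3
import subprocess


def init_db(conn):
    # Hypothetical schema: one row per tester, with the last install output.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS testers ("
        "name TEXT PRIMARY KEY, install_state TEXT, install_log TEXT)"
    )
    conn.commit()


def run_install(conn, tester, cmd):
    """Run the install command and store its combined output for the tester."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep steps and errors in one stream
        text=True,
    )
    state = "installed" if proc.returncode == 0 else "failed"
    conn.execute(
        "INSERT OR REPLACE INTO testers (name, install_state, install_log) "
        "VALUES (?, ?, ?)",
        (tester, state, proc.stdout),
    )
    conn.commit()
    return state


def get_install_log(conn, tester):
    """Read-only handler: returns the stored state and log, never writes."""
    row = conn.execute(
        "SELECT install_state, install_log FROM testers WHERE name = ?",
        (tester,),
    ).fetchone()
    if row is None:
        return {"tester": tester, "state": None, "log": ""}
    return {"tester": tester, "state": row[0], "log": row[1] or ""}
```

The "Show install log" button would then only need to call the read-only handler for the tester shown on the user detail page and render the returned `log` text.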

Signed-by: mik-tf <mik-tf@noreply.invalid>

Author
Owner

Two concrete gaps confirmed live while watching a tester install (weynandsandbox).

  1. The install really is a black box in the UI: the user detail page shows a spinning "installing" badge and "(awaiting install)" with no way to see what step it is on or why it is stuck. This is exactly the Show install log button this issue asks for; the live-step view would be even better here.

  2. There is no way to re-run or retry the install from the UI. When an install is interrupted (in this case the deployer service was restarted mid-install), the VM is left running with the install state stuck on "installing", and the only action offered is Destroy. The operator cannot retry the install from the browser; today the options are to wait out the 30-minute stale lock or to trigger the install RPC by hand. We should add a "Retry install" (and "Reinstall") action on the user detail page that calls install_hero_stack for that VM, and ideally a way to clear a stuck "installing" state so a retry does not have to wait for the timeout.
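
The retry rule in the second point could be sketched as follows. This is a minimal illustration, not the deployer's actual logic: the record layout, `can_start_install`, `retry_install`, and the `force` flag are hypothetical, and `install_fn` is a stand-in for the install_hero_stack call, whose real signature is not given here. Only the 30-minute stale lock comes from the description above.

```python
from datetime import datetime, timedelta, timezone

# From the issue: a stuck "installing" state is released after 30 minutes.
STALE_LOCK = timedelta(minutes=30)


def can_start_install(state, started_at, now, force=False):
    """Decide whether a retry or reinstall may start right now."""
    if state != "installing":
        return True  # failed, installed, or never installed: nothing to wait for
    if force:
        return True  # operator explicitly cleared the stuck state
    return now - started_at >= STALE_LOCK  # otherwise wait out the stale lock


def retry_install(record, now, install_fn, force=False):
    """Mark the VM as installing again and trigger the install."""
    if not can_start_install(record["state"], record["started_at"], now, force):
        raise RuntimeError("install already in progress; lock is not stale yet")
    record["state"] = "installing"
    record["started_at"] = now
    install_fn(record["vm"])  # placeholder for the install_hero_stack call
    return record
```

With a rule like this, "Retry install" would map to the default call, and a "clear stuck state" action would map to `force=True`, so an interrupted install does not have to wait for the timeout.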

Signed-by: mik-tf <mik-tf@noreply.invalid>

mik-tf self-assigned this 2026-06-14 04:32:06 +00:00
Reference
lhumina_code/home#267