[deployer admin] Show the install log in the admin UI for troubleshooting #267
Labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
lhumina_code/home#267
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
When a tester install fails, the admin dashboard only shows a generic "Install failed on the VM" message, so the operator cannot see why without opening a shell on the admin machine to read the deployer logs (which is exactly what was needed to diagnose the recent gate failure). We should add a button in the admin dashboard that displays the install output for a given tester, both the steps and any error, so failures can be understood in the browser. A clean implementation captures the install run output, stores it per tester in a small database column, exposes it through a read only RPC, and surfaces it behind a Show install log button on the user detail page. Keep it simple first with the deployer side install output, which is what is needed to see which step failed and why; a richer live view of the steps as they happen can come later. This was scoped during the install reliability work but deferred so the reliability fix could land cleanly on its own.
Signed-by: mik-tf mik-tf@noreply.invalid
Two concrete gaps confirmed live while watching a tester install (weynandsandbox).
The install really is a black box in the UI: the user detail page shows a spinning "installing" badge and "(awaiting install)" with no way to see what step it is on or why it is stuck. This is exactly the Show install log button this issue asks for; the live-step view would be even better here.
There is no way to re-run or retry the install from the UI. When an install is interrupted (in this case the deployer service was restarted mid-install), the VM is left running with the install state stuck on installing, and the only action offered is Destroy. The operator cannot retry the install from the browser; today the options are to wait out the 30 minute stale-lock or trigger the install RPC by hand. We should add a "Retry install" (and "Reinstall") action on the user detail page that calls install_hero_stack for that VM, and ideally a way to clear a stuck installing state so a retry does not have to wait for the timeout.
Signed-by: mik-tf mik-tf@noreply.invalid