[lab] release publish skips re-uploading changed binaries, freezing rolling latest-* assets #323

Open

opened 2026-06-14 14:50:26 +00:00 by mik-tf · 1 comment

mik-tf commented

2026-06-14 14:50:26 +00:00

Owner

The CI release pipeline freezes a rolling `latest-*` release at its first-uploaded binaries: every subsequent push refreshes the release record and the rolling tag, but the binary assets keep their original upload date, so a downloaded binary can be weeks stale. On the deployer this shipped a pre-M13 binary that panicked on the migrated DB (lhumina_code/hero_os_tfgrid_deployer#29); on the cockpit it shipped an old UI.

`lab`'s current source (`upload_release_assets` in `crates/lab/src/repo/releases.rs`) deletes and then re-uploads each asset, which would refresh correctly. The `lab-builder` CI image therefore appears to run an older `lab` whose publish path skips re-uploading existing assets (the area touched by 0a2e235 'compare md5 before skipping release uploads'). The durable fix is to ensure the CI `lab-builder` image carries a `lab` that always replaces assets, after which the per-repo workaround below is unnecessary.

Interim workaround, already applied and verified on hero_os_tfgrid_deployer and hero_cockpit: the lab-release workflow deletes the tag's existing assets via the Forge API before `lab build --upload`, which forces a fresh upload (the deployer release now carries today-dated assets). The same one-block change can land in the canonical workflow template, or be dropped once the image-side fix is in.
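For reference, a minimal sketch of the delete-before-upload step, assuming the Forge exposes the Gitea-compatible REST API (`/api/v1/repos/{owner}/{repo}/releases/...`) and accepts a `token` Authorization header. The function names and the injectable `request` parameter are illustrative, not taken from the actual workflow block.

```python
import json
import urllib.parse
import urllib.request


def api_request(method, url, token):
    """Send one authenticated request; return parsed JSON, or None for an empty body."""
    req = urllib.request.Request(
        url, method=method, headers={"Authorization": f"token {token}"}
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    return json.loads(body) if body else None


def delete_release_assets(base_url, owner, repo, tag, token, request=api_request):
    """Delete every asset on the release for `tag`; return the deleted asset names.

    Assumes Gitea-style endpoints (an assumption, not confirmed for this Forge):
      GET    /repos/{owner}/{repo}/releases/tags/{tag}
      DELETE /repos/{owner}/{repo}/releases/{id}/assets/{attachment_id}
    """
    api = f"{base_url.rstrip('/')}/api/v1/repos/{owner}/{repo}"
    quoted_tag = urllib.parse.quote(tag, safe="")
    release = request("GET", f"{api}/releases/tags/{quoted_tag}", token)
    deleted = []
    for asset in release.get("assets", []):
        request("DELETE", f"{api}/releases/{release['id']}/assets/{asset['id']}", token)
        deleted.append(asset["name"])
    return deleted
```

Run before `lab build --upload`, this leaves the release with no assets, so even a name-based skip has nothing to match and every binary is uploaded fresh.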

Signed-by: mik-tf <mik-tf@noreply.invalid>

mik-tf referenced this issue from lhumina_code/hero_os_tfgrid_deployer

2026-06-14 14:52:17 +00:00

[deployer] latest-integration release binary is stale (pre-M13), self-upgrade panics on the live DB #29

mik-tf commented

2026-06-14 17:06:18 +00:00

Author

Owner

Diagnosed. The lab CODE is already fixed: commit 0a2e235 (Jun 11, on both integration and main) replaced the upload step's name-presence skip -- which froze every rolling release after its first publish, since rolling tags reuse asset names -- with an md5-match skip, so changed binaries re-upload.
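To illustrate the difference between the two skip rules, here is a small sketch. It is not `lab`'s actual code: the function names are hypothetical, and it assumes the remote side can be represented as a mapping from asset name to md5 hex digest.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the md5 digest of `data` as a hex string."""
    return hashlib.md5(data).hexdigest()


def skip_by_name(name: str, remote_assets: dict) -> bool:
    """Old rule: skip the upload whenever an asset with this name already exists."""
    return name in remote_assets


def skip_by_md5(name: str, local_bytes: bytes, remote_assets: dict) -> bool:
    """New rule: skip only when the existing asset has the same content hash."""
    return remote_assets.get(name) == md5_hex(local_bytes)


# A rolling tag reuses asset names, so the remote already holds "deployer".
remote = {"deployer": md5_hex(b"binary built from the first push")}
rebuilt = b"binary built from a later push"

print(skip_by_name("deployer", remote))          # True: the changed binary is never uploaded
print(skip_by_md5("deployer", rebuilt, remote))  # False: the changed binary is re-uploaded
```

With the name-presence rule the first upload wins forever on a rolling tag; with the md5 rule an upload is skipped only when the content is unchanged.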

The reason releases still freeze is that the fix never reached the CI image: the build-lab-builder build-push job (which builds + pushes the lab-builder image) has been failing consistently -- on 0a2e235 itself (integration), on 06baa09 (integration), and on a48b0cd (main). The release jobs succeed but run the stale pre-fix lab baked into the image; the last successful image push predates 0a2e235.

So the remaining fix is operational: make build-push succeed so the image carries the current lab. The job does a heavy docker build (apt, rustup 1.96, a musl.cc aarch64 cross-toolchain download, lab compile) followed by a 3-retry docker push; the likely flake points are the musl.cc download (external) or the registry push (the merge-base commit is literally 'ci: retry docker push of the lab-builder image'). The build-push job log should show which step is failing. Once the image rebuilds, the per-repo delete-before-upload workaround now on the deployer and cockpit can be dropped.
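If the job log points at the external download rather than the push, the same retry treatment could be applied there. A generic sketch of a bounded retry with exponential backoff follows; the helper name, the default delays, and the idea of wrapping the download step are assumptions, not part of the current job.

```python
import time


def retry(action, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call `action` up to `attempts` times, doubling the delay after each failure.

    Returns the first successful result; re-raises the last error if every
    attempt fails. `sleep` is injectable so the helper can be tested quickly.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as err:  # a real job would catch narrower error types
            last_error = err
            if attempt < attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    raise last_error
```

A step that fails twice and then succeeds would wait 2 s and 4 s before returning; a step that fails all three times surfaces its last error, so a persistent outage still fails the job visibly.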

Signed-by: mik-tf <mik-tf@noreply.invalid>
