geomind_code/my_hypervisor

Implement prepared shared rootfs reuse as the primary deduped runtime path #48

Open

opened 2026-03-19 19:45:43 +00:00 by thabeta · 3 comments

thabeta commented

2026-03-19 19:45:43 +00:00

Owner

Problem

This builds on the broader storage-deduplication work tracked in #35.

The current branch already caches pulled image contents, but it still materializes too much state per VM. The result is avoidable disk duplication and extra startup work when multiple VMs use the same image.

Current behavior

1. The cache stops at extracted rootfs

The image cache stores layers and can extract one shared rootfs per image digest, but there is no cached prepared runtime base for the shared-root path.

So the cache saves the pull and extraction work, but the preparation needed to run a VM from that image is still repeated for every VM.

2. Backend selection is not capability-driven

Storage backend selection currently resolves directly to block storage unless virtiofs is explicitly chosen.

So even when the runtime environment could support a shared-root design, the normal path still pushes users toward a heavier per-VM block-image workflow.

3. Storage preparation consumes the runtime rootfs source directly

The storage preparation path uses the stored rootfs source directly instead of resolving a separate immutable prepared base for image-backed VMs.

That keeps the design centered around "prepare per VM" rather than "prepare once per image, then reuse many times".

4. Shared-root plus extra filesystem mounts needs valid hypervisor argument assembly

The hypervisor argument builder emits one --fs entry for the root shared filesystem and then emits another --fs entry for each extra shared mount.

That is fragile: depending on how the hypervisor parses its arguments, repeated top-level --fs flags can break launch. A VM with a shared-root base plus one or more additional shared mounts needs a single, valid, combined filesystem-share argument structure.

5. First-time prepared-base creation needs serialization

Once a prepared shared rootfs cache exists, cold starts from the same image/variant must not race each other. Without a per-image/per-variant lock, two concurrent first boots can both try to create the same prepared base.

Why this matters

Disk usage grows much faster than necessary when the same image is launched many times.
Startup latency includes repeated per-VM preparation work that should have been amortized.
Shared-root should be the efficient first-class path on supported kernels, not an afterthought.
Concurrency bugs in first-time base preparation will only show up under realistic parallel starts, which makes them easy to miss and painful to debug later.

Suggested implementation

Prepared image base

Add a cached prepared rootfs per image digest and runtime variant.
Build it once and treat it as immutable shared input.
Include whatever static guest/runtime assets are required for the supported shared-root path.
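Keying the cache by (image digest, runtime variant) comes down to a deterministic directory layout. Here is a minimal sketch under stated assumptions. The `Variant` names and the on-disk layout are hypothetical; the issue only says the base is cached per digest and variant. Digest characters are sanitized so that a value like `sha256:abcd` becomes a single safe path component.

```rust
use std::path::{Path, PathBuf};

/// Storage variants that need different prepared bases (hypothetical names).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum Variant {
    VirtioFs,
    Block,
}

impl Variant {
    fn as_str(self) -> &'static str {
        match self {
            Variant::VirtioFs => "virtiofs",
            Variant::Block => "block",
        }
    }
}

/// Directory of the immutable prepared base for one (image digest, variant) pair.
/// Every non-alphanumeric character in the digest becomes '-', so "sha256:abcd"
/// yields the single component "sha256-abcd" and can never introduce "/" or "..".
pub fn prepared_base_dir(root: &Path, image_digest: &str, variant: Variant) -> PathBuf {
    let component: String = image_digest
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() { c } else { '-' })
        .collect();
    root.join(component).join(variant.as_str())
}
```

With this layout, all VMs from the same image and variant resolve to the same directory, so that directory can be built once and reused.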

Backend policy

Introduce an automatic storage mode that prefers the shared-root backend when the selected kernel supports it.
Keep block storage as an explicit or fallback compatibility path.
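The policy above fits in one small resolution function. This is a sketch: the enum names are hypothetical, and how kernel support is detected is left to the caller. An explicit choice is honored as written, while `Auto` prefers the shared-root backend and falls back to block storage.

```rust
/// Backend as written in configuration; `Auto` is the proposed default.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ConfiguredBackend {
    Auto,
    VirtioFs,
    Block,
}

/// Backend actually used for a given VM launch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ResolvedBackend {
    VirtioFs,
    Block,
}

/// Resolve the configured backend against what the selected kernel supports.
pub fn resolve_backend(configured: ConfiguredBackend, kernel_supports_virtiofs: bool) -> ResolvedBackend {
    match configured {
        // Explicit choices are passed through unchanged.
        ConfiguredBackend::VirtioFs => ResolvedBackend::VirtioFs,
        ConfiguredBackend::Block => ResolvedBackend::Block,
        // Auto: shared-root when possible, block as the compatibility fallback.
        ConfiguredBackend::Auto if kernel_supports_virtiofs => ResolvedBackend::VirtioFs,
        ConfiguredBackend::Auto => ResolvedBackend::Block,
    }
}
```

One alternative design would reject an explicit `VirtioFs` on an unsupported kernel with an error. The sketch passes it through and lets launch report the failure.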

Per-VM state model

Limit per-VM writable state to overlay upper/work data, sockets, logs, and metadata.
Do not rebuild or recopy the full runtime filesystem for every VM from the same image.
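With an overlayfs-style layout, the per-VM state model means the only per-VM rootfs directories are upper and work, while the shared prepared base serves as the read-only lower layer. This is a sketch under that assumption. The issue does not say whether the overlay is assembled on the host or in the guest, and `VmOverlayDirs` and its directory names are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// The only writable directories a VM gets for its root filesystem (hypothetical layout).
pub struct VmOverlayDirs {
    pub upper: PathBuf,
    pub work: PathBuf,
}

impl VmOverlayDirs {
    /// Place upper and work side by side under the VM's state directory; overlayfs
    /// requires workdir to be an empty directory on the same filesystem as upperdir.
    pub fn under(vm_dir: &Path) -> Self {
        VmOverlayDirs {
            upper: vm_dir.join("overlay").join("upper"),
            work: vm_dir.join("overlay").join("work"),
        }
    }
}

/// overlayfs mount options: the shared prepared base is the read-only lowerdir,
/// and all writes land in this VM's upperdir.
/// Paths containing ',' or ':' would need escaping; this sketch does not handle that.
pub fn overlay_mount_options(prepared_base: &Path, dirs: &VmOverlayDirs) -> String {
    format!(
        "lowerdir={},upperdir={},workdir={}",
        prepared_base.display(),
        dirs.upper.display(),
        dirs.work.display()
    )
}
```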

Hypervisor argument assembly

Build filesystem-share arguments in a form that supports the root shared filesystem plus additional shared mounts in a single valid launch configuration.
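Here is a minimal sketch of single-group assembly. It assumes the target hypervisor accepts one `--fs` flag followed by several `tag=...,socket=...` values; that multi-value form and the `FsShare` type are assumptions, so check the hypervisor's CLI before relying on them. The root share always comes first, and duplicate tags are rejected before launch.

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// One virtio-fs share (hypothetical type): the tag the guest mounts by and the
/// host socket of the daemon serving it.
#[derive(Clone, Debug)]
pub struct FsShare {
    pub tag: String,
    pub socket: PathBuf,
}

/// Build a single filesystem-share argument group: one `--fs` flag followed by
/// one `tag=...,socket=...` value per share, with the root share first.
pub fn fs_args(root: &FsShare, extra: &[FsShare]) -> Result<Vec<String>, String> {
    let mut seen = HashSet::new();
    let mut args = vec!["--fs".to_string()];
    for share in std::iter::once(root).chain(extra) {
        // Duplicate tags would make guest mounts ambiguous, so reject them early.
        if !seen.insert(share.tag.as_str()) {
            return Err(format!("duplicate filesystem tag: {}", share.tag));
        }
        args.push(format!("tag={},socket={}", share.tag, share.socket.display()));
    }
    Ok(args)
}
```

A VM with a root share and one data share then produces exactly one `--fs` flag carrying two values.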

Concurrency

Serialize first-time prepared-base creation with a per-image/per-variant lock.
Make later launches reuse the completed prepared base rather than rebuilding it.
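The serialization plus reuse can be sketched as a `get_or_create` function with a `.ready` marker. To stay within std, this sketch swaps in an exclusive-create lock file (`OpenOptions::create_new`) instead of an OS advisory lock such as flock(2). The caveat: a crash while holding the lock leaves a stale lock file, which an OS lock would release automatically. The function name and the file names are hypothetical.

```rust
use std::ffi::OsString;
use std::fs::{self, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};
use std::thread;
use std::time::Duration;

/// Return `base_dir`, running `build` at most once across concurrent callers.
/// The lock is `<base_dir>.lock`, created with `create_new` (atomic exclusive
/// create), so one caller builds while the others poll. A base is complete
/// only once its `.ready` marker exists.
pub fn get_or_create<F>(base_dir: &Path, build: F) -> io::Result<PathBuf>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    let ready = base_dir.join(".ready");
    let mut lock_name: OsString = base_dir.as_os_str().to_owned();
    lock_name.push(".lock");
    let lock = PathBuf::from(lock_name);
    if let Some(parent) = base_dir.parent() {
        fs::create_dir_all(parent)?;
    }
    loop {
        // Fast path: a completed base is immutable and reused directly.
        if ready.exists() {
            return Ok(base_dir.to_path_buf());
        }
        match OpenOptions::new().write(true).create_new(true).open(&lock) {
            Ok(_lock_file) => {
                // Re-check under the lock: another caller may have just finished.
                let result = if ready.exists() { Ok(()) } else { build_base(base_dir, &ready, build) };
                fs::remove_file(&lock)?;
                return result.map(|()| base_dir.to_path_buf());
            }
            // Another caller is building: wait, then re-check the marker.
            Err(e) if e.kind() == io::ErrorKind::AlreadyExists => thread::sleep(Duration::from_millis(10)),
            Err(e) => return Err(e),
        }
    }
}

fn build_base<F>(base_dir: &Path, ready: &Path, build: F) -> io::Result<()>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    // No marker means any existing directory is a leftover from an interrupted attempt.
    if base_dir.exists() {
        fs::remove_dir_all(base_dir)?;
    }
    fs::create_dir_all(base_dir)?;
    build(base_dir)?;
    // Written last, so no caller ever treats a half-built base as ready.
    fs::write(ready, b"")
}
```

This sketch does not fsync. A production version would likely sync the base and the marker before publishing it.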

Acceptance criteria

Two VMs from the same image can start concurrently without racing during first-time prepared-base creation.
Repeated VMs from the same image reuse one prepared shared base on disk.
Per-VM writable state is limited to overlay/runtime data rather than a full duplicated rootfs.
A VM using a shared-root base can also mount additional shared filesystem volumes successfully.

Related work

#35

rawdaGastan was assigned by thabeta

2026-03-19 19:56:30 +00:00

rawdaGastan commented

2026-04-09 07:27:32 +00:00

Owner

Implementation Spec for Issue #48: Prepared Shared Rootfs Reuse

Objective

Eliminate per-VM rootfs duplication by introducing a prepared shared rootfs base cached per image digest and storage variant. Multiple VMs share one immutable prepared base and keep only overlay/runtime data per VM. Backend selection becomes capability-driven, and concurrent first-boot preparation is serialized with per-image locks.

Requirements

Prepared base directory created once per (image digest, storage variant) pair, reused by all VMs
Prepared base contains extracted rootfs + injected binaries (init, busybox, mycelium, kernel modules), treated as immutable
Backend auto-selection prefers shared-root (virtiofs) when supported, falls back to block storage
Per-VM writable state limited to overlay upper/work, sockets, logs, metadata
VMs using shared-root can mount additional shared filesystem volumes without broken argument assembly
Concurrent first-time base creation serialized via per-image file lock

Files to Modify/Create

File	Action	Description
`storage/prepared_base.rs`	Create	PreparedBaseManager: per-image/variant cached bases with file locking
`storage/mod.rs`	Modify	Add `pub mod prepared_base`
`storage/traits.rs`	Modify	Add `prepared_base_path` and `image_digest` to PrepareConfig
`storage/virtiofs.rs`	Modify	Use prepared base as overlay lowerdir, skip per-VM injection
`storage/block.rs`	Modify	Use prepared base as ext4 image source
`storage/volmgr_backend.rs`	Modify	Snapshot from prepared base subvolume (CoW)
`storage/image.rs`	Modify	Consolidate injection into `inject_all_guest_binaries()`
`vm/manager.rs`	Modify	Integrate PreparedBaseManager, capability-driven backend selection
`paths.rs`	Modify	Add `prepared_bases` path
`config.rs`	Modify	Default backend → "auto"
`hypervisor/process.rs`	Modify	Add test for multiple --fs flags

Implementation Plan

Step 1: Add prepared bases directory to Paths

Files: paths.rs — Add prepared_bases field, helper method, ensure_dirs

Step 2: Consolidate binary injection helper

Files: storage/image.rs — Create inject_all_guest_binaries() centralizing repeated injection code
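A consolidated injection helper could look like the sketch below. Only the name `inject_all_guest_binaries()` comes from the spec. `GuestAsset`, its fields, and the signature are hypothetical. The sketch copies single files; a kernel-modules tree would need a recursive copy on top of this.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// A host file to place into the guest rootfs (hypothetical type).
pub struct GuestAsset {
    pub host_path: PathBuf,
    /// Destination relative to the rootfs root, e.g. "sbin/init".
    pub guest_path: PathBuf,
    pub mode: u32,
}

/// Copy every guest asset into `rootfs` in one place, so backends stop
/// duplicating injection logic. Intended to run once per prepared base.
pub fn inject_all_guest_binaries(rootfs: &Path, assets: &[GuestAsset]) -> io::Result<()> {
    for asset in assets {
        // An absolute guest path would make `join` escape the rootfs.
        if asset.guest_path.is_absolute() {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                format!("guest path must be relative: {}", asset.guest_path.display()),
            ));
        }
        let dest = rootfs.join(&asset.guest_path);
        if let Some(parent) = dest.parent() {
            fs::create_dir_all(parent)?;
        }
        fs::copy(&asset.host_path, &dest)?;
        set_mode(&dest, asset.mode)?;
    }
    Ok(())
}

#[cfg(unix)]
fn set_mode(path: &Path, mode: u32) -> io::Result<()> {
    use std::os::unix::fs::PermissionsExt;
    fs::set_permissions(path, fs::Permissions::from_mode(mode))
}

#[cfg(not(unix))]
fn set_mode(_path: &Path, _mode: u32) -> io::Result<()> {
    Ok(())
}
```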

Step 3: Create PreparedBaseManager module

Files: storage/prepared_base.rs, storage/mod.rs — get_or_create with file locking, .ready marker

Step 4: Update PrepareConfig

Files: storage/traits.rs — Add prepared_base_path and image_digest fields
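Here is a sketch of how the two new fields might drive the backward-compatible fallback described in the notes. `prepared_base_path` and `image_digest` come from the spec. `rootfs_source` and the helper methods are hypothetical, and the other `PrepareConfig` fields are omitted. A backend reads from the prepared base only when the VM has an image digest and a base was resolved; otherwise it uses the existing per-VM flow.

```rust
use std::path::{Path, PathBuf};

/// Partial sketch of PrepareConfig; unrelated fields omitted.
pub struct PrepareConfig {
    /// Raw rootfs source used by the existing per-VM flow (hypothetical name).
    pub rootfs_source: PathBuf,
    /// Digest of the image this VM was created from, if any.
    pub image_digest: Option<String>,
    /// Immutable prepared base resolved for (digest, variant), if any.
    pub prepared_base_path: Option<PathBuf>,
}

impl PrepareConfig {
    /// The prepared base applies only to image-backed VMs with a resolved base.
    pub fn prepared_base(&self) -> Option<&Path> {
        match (&self.image_digest, &self.prepared_base_path) {
            (Some(_), Some(base)) => Some(base.as_path()),
            _ => None,
        }
    }

    /// Where a backend should read the rootfs from.
    pub fn effective_source(&self) -> &Path {
        self.prepared_base().unwrap_or(self.rootfs_source.as_path())
    }

    /// Per-VM injection is needed only when no prepared base already holds the assets.
    pub fn needs_per_vm_injection(&self) -> bool {
        self.prepared_base().is_none()
    }
}
```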

Step 5: Refactor VirtioFsStorage

Files: storage/virtiofs.rs — Use prepared base as lowerdir, skip injection when available

Step 6: Refactor BlockStorage

Files: storage/block.rs — Use prepared base as source, skip injection
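One way to use a prepared base directory as an ext4 image source is e2fsprogs' `mke2fs -d`, which populates a new filesystem from a directory without loop-mounting it. This is a sketch under that assumption; whether block.rs builds its image this way is not stated. The function names are hypothetical, and paths are converted to `String` for simplicity, so non-UTF-8 paths are displayed lossily.

```rust
use std::fs::File;
use std::io;
use std::path::Path;
use std::process::Command;

/// Arguments for `mkfs.ext4` that create the filesystem and populate it from `base`.
pub fn mkfs_args(base: &Path, image: &Path) -> Vec<String> {
    vec![
        "-F".into(), // target is a regular file, not a block device
        "-q".into(), // quiet
        "-d".into(), // populate from the directory that follows
        base.display().to_string(),
        image.display().to_string(),
    ]
}

/// Create an image file of `size_bytes` and fill it from the prepared base.
pub fn build_ext4_image(base: &Path, image: &Path, size_bytes: u64) -> io::Result<()> {
    // Extending a new file with set_len writes no data, so it is typically sparse.
    File::create(image)?.set_len(size_bytes)?;
    let status = Command::new("mkfs.ext4").args(mkfs_args(base, image)).status()?;
    if !status.success() {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("mkfs.ext4 exited with {status}"),
        ));
    }
    Ok(())
}
```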

Step 7: Refactor VolmgrBackend

Files: storage/volmgr_backend.rs — Snapshot from prepared base subvolume

Step 8: Integrate into VmManager

Files: vm/manager.rs — Wire PreparedBaseManager into prepare_storage flow

Step 9: Capability-driven backend auto-selection

Files: vm/manager.rs, config.rs — "auto" default, capability probing
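Capability probing could take several forms. One hypothetical approach, assuming the selected kernel ships with its build `.config` text, is to check that virtio-fs is built in. virtio-fs belongs to the FUSE subsystem, so both `CONFIG_FUSE_FS` and `CONFIG_VIRTIO_FS` are checked. Modules (`=m`) are treated as unsupported, on the conservative assumption that the root filesystem must mount before any module loads. The host also needs a virtio-fs daemon, which this sketch does not check.

```rust
/// True if `option` is built in (`OPTION=y`) in a kernel `.config` text.
/// Exact-prefix matching avoids confusing CONFIG_VIRTIO_FS with longer symbols.
pub fn kernel_config_builtin(config: &str, option: &str) -> bool {
    config
        .lines()
        .map(str::trim)
        .any(|line| line.strip_prefix(option) == Some("=y"))
}

/// Shared-root via virtio-fs needs both FUSE and virtio-fs built in.
pub fn kernel_supports_virtiofs_root(config: &str) -> bool {
    ["CONFIG_FUSE_FS", "CONFIG_VIRTIO_FS"]
        .iter()
        .all(|opt| kernel_config_builtin(config, opt))
}
```

The result would feed the `kernel_supports_virtiofs` input of the `auto` backend resolution.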

Step 10: Concurrency and reuse tests

Files: storage/prepared_base.rs — Thread-safety tests, reuse verification

Step 11: Verify multiple --fs flag assembly

Files: hypervisor/process.rs — Test root + additional mounts produce valid args

Acceptance Criteria

Two VMs from same image start concurrently without racing
Repeated VMs reuse one prepared shared base on disk
Per-VM writable state limited to overlay/runtime data
Shared-root VM can mount additional shared filesystem volumes

Notes

Backward compatible: VMs without image_digest bypass prepared base entirely
Prepared base invalidation (after binary upgrades) is a follow-up
volmgr snapshot approach requires btrfs detection
Default change from "virtiofs" to "auto" is behaviorally identical for virtiofs-capable systems

rawdaGastan commented

2026-04-09 07:43:32 +00:00

Owner

Test Results

Total: 300
Passed: 300
Failed: 0

Suite	Tests	Passed	Failed
my_hypervisor (CLI)	38	38	0
my_hypervisor_init	14	14	0
my_hypervisor_lib	248	248	0
Doc-tests	0	0	0

✅ All tests pass.

rawdaGastan commented

2026-04-09 07:44:33 +00:00

Owner

Implementation Summary

Files Created

crates/my_hypervisor-lib/src/storage/prepared_base.rs — PreparedBaseManager with file-locked get_or_create, .ready marker pattern

Files Modified

crates/my_hypervisor-lib/src/paths.rs — Added prepared_bases path + helper method
crates/my_hypervisor-lib/src/storage/mod.rs — Registered prepared_base module
crates/my_hypervisor-lib/src/storage/traits.rs — Added prepared_base_path and image_digest to PrepareConfig
crates/my_hypervisor-lib/src/storage/image.rs — Consolidated inject_all_guest_binaries() helper
crates/my_hypervisor-lib/src/storage/virtiofs.rs — Uses prepared base as overlay lowerdir, skips per-VM injection
crates/my_hypervisor-lib/src/storage/block.rs — Uses prepared base as ext4 image source
crates/my_hypervisor-lib/src/storage/volmgr_backend.rs — Btrfs snapshot from prepared base with cp fallback
crates/my_hypervisor-lib/src/vm/manager.rs — Integrated PreparedBaseManager, capability-driven auto backend
crates/my_hypervisor-lib/src/config.rs — Default backend changed to "auto"
crates/my_hypervisor-lib/src/hypervisor/process.rs — Added multiple --fs flags test

Test Results

300 tests, 300 passed, 0 failed

Notes

Backward compatible: VMs without image_digest use the existing per-VM flow
Prepared base invalidation after binary upgrades is a follow-up
volmgr uses btrfs snapshot when available, falls back to cp --reflink=auto
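The snapshot-then-copy fallback can be sketched as an ordered list of commands tried until one succeeds. `btrfs subvolume snapshot` only works when the prepared base is itself a btrfs subvolume, which is why the fallback matters. With `cp --reflink=auto`, cp clones data blocks where the filesystem supports it and copies normally otherwise. The `-a` flag and the function names are assumptions, and `dest` must not exist beforehand.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Candidate commands for cloning a prepared base into a per-VM volume, in order.
pub fn clone_commands(base: &Path, dest: &Path) -> Vec<Vec<String>> {
    let (b, d) = (base.display().to_string(), dest.display().to_string());
    vec![
        // Writable CoW snapshot; fails if `base` is not a btrfs subvolume.
        vec!["btrfs".into(), "subvolume".into(), "snapshot".into(), b.clone(), d.clone()],
        // Reflink where supported, plain copy otherwise; -a preserves attributes.
        vec!["cp".into(), "-a".into(), "--reflink=auto".into(), b, d],
    ]
}

/// Run the candidates until one succeeds; return the last failure otherwise.
pub fn clone_prepared_base(base: &Path, dest: &Path) -> io::Result<()> {
    let mut last_err = io::Error::new(io::ErrorKind::Other, "no clone command tried");
    for argv in clone_commands(base, dest) {
        match Command::new(&argv[0]).args(&argv[1..]).status() {
            Ok(s) if s.success() => return Ok(()),
            Ok(s) => last_err = io::Error::new(io::ErrorKind::Other, format!("{} exited with {s}", argv[0])),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```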

rawdaGastan referenced this issue

2026-04-09 08:44:11 +00:00

feat: add prepared base caching to deduplicate rootfs across VMs and fix storage backend injection bugs #81