Implement prepared shared rootfs reuse as the primary deduped runtime path #48

Open
opened 2026-03-19 19:45:43 +00:00 by thabeta · 3 comments
Owner

Problem

This builds on the broader storage-deduplication work tracked in #35.

The current branch already caches pulled image contents, but it still materializes too much state per VM. The result is avoidable disk duplication and extra startup work when multiple VMs use the same image.

Current behavior

1. The cache stops at extracted rootfs

The image cache stores layers and can extract one shared rootfs per image digest, but there is no cached prepared runtime base for the shared-root path.

That means the cache helps with pulling and extracting, but not enough with the work required to actually run many VMs efficiently from the same image.

2. Backend selection is not capability-driven

Storage backend selection currently resolves directly to block storage unless virtiofs is explicitly chosen.

So even when the runtime environment could support a shared-root design, the normal path still pushes users toward a heavier per-VM block-image workflow.

3. Storage preparation consumes the runtime rootfs source directly

The storage preparation path uses the stored rootfs source directly instead of resolving a separate immutable prepared base for image-backed VMs.

That keeps the design centered around "prepare per VM" rather than "prepare once per image, then reuse many times".

4. Shared-root plus extra filesystem mounts needs valid hypervisor argument assembly

The hypervisor argument builder emits one --fs entry for the root shared filesystem and then emits another --fs entry for each extra shared mount.

That is fragile for hypervisor argument parsing. A VM with a shared-root base plus one or more additional shared mounts needs one valid combined filesystem-share argument structure, not repeated top-level flags assembled in a way that can break launch.

5. First-time prepared-base creation needs serialization

Once a prepared shared rootfs cache exists, cold starts from the same image/variant must not race each other. Without a per-image/per-variant lock, two concurrent first boots can both try to create the same prepared base.

Why this matters

  • Disk usage grows much faster than necessary when the same image is launched many times.
  • Startup latency includes repeated per-VM preparation work that should have been amortized.
  • Shared-root should be the efficient first-class path on supported kernels, not an afterthought.
  • Concurrency bugs in first-time base preparation will only show up under realistic parallel starts, which makes them easy to miss and painful to debug later.

Suggested implementation

Prepared image base

  1. Add a cached prepared rootfs per image digest and runtime variant.
  2. Build it once and treat it as immutable shared input.
  3. Include whatever static guest/runtime assets are required for the supported shared-root path.
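
One possible way to key the prepared base, as a minimal sketch: map each (image digest, variant) pair to its own cache directory. The `Variant` enum, its directory names, and `prepared_base_dir` are hypothetical names used only for illustration; the real variant set and on-disk layout are not specified in this issue.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical runtime variants; the real set is not defined in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Variant {
    SharedRoot,
    Block,
}

impl Variant {
    fn dir_name(self) -> &'static str {
        match self {
            Variant::SharedRoot => "shared-root",
            Variant::Block => "block",
        }
    }
}

/// Map (image digest, variant) to exactly one cache directory.
/// Non-alphanumeric characters in the digest (such as ':') are replaced with
/// '-' so the digest becomes a single, safe path component.
pub fn prepared_base_dir(cache_root: &Path, image_digest: &str, variant: Variant) -> PathBuf {
    let safe_digest: String = image_digest
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() { c } else { '-' })
        .collect();
    cache_root.join(safe_digest).join(variant.dir_name())
}
```

Because the path is a pure function of the digest and variant, every VM launched from the same image resolves to the same directory, which is what makes reuse possible.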

Backend policy

  1. Introduce an automatic storage mode that prefers the shared-root backend when the selected kernel supports it.
  2. Keep block storage as an explicit or fallback compatibility path.
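
The policy above could look roughly like the following sketch. `StorageMode`, `Backend`, `KernelCaps`, and `resolve_backend` are illustrative names, not existing types in the codebase, and how kernel support is actually probed is left out.

```rust
/// What the user asked for (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StorageMode {
    Auto,
    SharedRoot,
    Block,
}

/// What the VM will actually use.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    SharedRoot,
    Block,
}

/// Result of probing the selected kernel; the probing itself is out of scope here.
#[derive(Debug, Clone, Copy)]
pub struct KernelCaps {
    pub shared_root: bool,
}

pub fn resolve_backend(mode: StorageMode, caps: KernelCaps) -> Result<Backend, String> {
    match mode {
        // Explicit block storage is always honored.
        StorageMode::Block => Ok(Backend::Block),
        // Explicit shared-root fails loudly instead of silently degrading.
        StorageMode::SharedRoot if caps.shared_root => Ok(Backend::SharedRoot),
        StorageMode::SharedRoot => {
            Err("shared-root requested but the selected kernel does not support it".to_string())
        }
        // Auto prefers shared-root and falls back to block storage.
        StorageMode::Auto if caps.shared_root => Ok(Backend::SharedRoot),
        StorageMode::Auto => Ok(Backend::Block),
    }
}
```

Returning an error for an explicit but unsupported choice is a design assumption of this sketch; falling back with a warning would also satisfy the two points above.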

Per-VM state model

  1. Limit per-VM writable state to overlay upper/work data, sockets, logs, and metadata.
  2. Do not rebuild or recopy the full runtime filesystem for every VM from the same image.
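
As an illustration of this state model, the sketch below derives the per-VM overlay directories and the overlayfs mount options that stack them on top of the shared base. The `upper` and `work` directory names and the helper names are assumptions; only the `lowerdir`/`upperdir`/`workdir` option keys come from overlayfs itself.

```rust
use std::path::{Path, PathBuf};

/// Per-VM writable overlay directories (hypothetical layout).
pub struct VmOverlay {
    pub upper: PathBuf,
    pub work: PathBuf,
}

/// Derive the overlay directories from the VM's own state directory.
pub fn vm_overlay(vm_dir: &Path) -> VmOverlay {
    VmOverlay {
        upper: vm_dir.join("upper"),
        work: vm_dir.join("work"),
    }
}

/// Build overlayfs mount options with the shared prepared base as the
/// read-only lower layer. Paths containing ',' or ':' would need escaping,
/// which this sketch does not handle.
pub fn overlay_mount_options(prepared_base: &Path, overlay: &VmOverlay) -> String {
    format!(
        "lowerdir={},upperdir={},workdir={}",
        prepared_base.display(),
        overlay.upper.display(),
        overlay.work.display()
    )
}
```

Only `upper` and `work` live under the VM directory, so deleting a VM never touches the shared base.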

Hypervisor argument assembly

  1. Build filesystem-share arguments in a form that supports the root shared filesystem plus additional shared mounts in a single valid launch configuration.
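
A rough sketch of one such form is below. It assumes the hypervisor CLI accepts a single `--fs` flag followed by one value per share, and that each value uses a `tag=...,socket=...` syntax; both are assumptions that need to be checked against the actual hypervisor. `FsShare` and `fs_args` are illustrative names.

```rust
/// One shared filesystem: a guest-visible tag plus the host-side socket path.
pub struct FsShare {
    pub tag: String,
    pub socket: String,
}

/// Emit a single `--fs` flag followed by one value per share, root first.
/// ASSUMPTION: the hypervisor accepts multiple values after one `--fs`.
pub fn fs_args(root: &FsShare, extra: &[FsShare]) -> Vec<String> {
    let mut args = vec!["--fs".to_string()];
    for share in std::iter::once(root).chain(extra.iter()) {
        args.push(format!("tag={},socket={}", share.tag, share.socket));
    }
    args
}
```

Collecting every share before emitting the flag keeps the root filesystem and the extra mounts in one place, so a test can assert on the whole argument list at once.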

Concurrency

  1. Serialize first-time prepared-base creation with a per-image/per-variant lock.
  2. Make later launches reuse the completed prepared base rather than rebuilding it.
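
The two points above amount to a double-checked "get or create" keyed by image and variant. The sketch below shows the idea with an in-process lock registry. This only covers threads inside one process; serializing across processes would need a file lock instead. `BaseLocks` and its method names are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// (image digest, variant) identifies one prepared base.
type Key = (String, String);

/// Registry handing out one mutex per key (in-process only).
#[derive(Default)]
pub struct BaseLocks {
    locks: Mutex<HashMap<Key, Arc<Mutex<()>>>>,
}

impl BaseLocks {
    fn lock_for(&self, key: &Key) -> Arc<Mutex<()>> {
        let mut map = self.locks.lock().unwrap();
        map.entry(key.clone()).or_default().clone()
    }

    /// Run `build` at most once per key, assuming `is_ready` turns true
    /// once `build` has completed.
    pub fn get_or_create(&self, key: &Key, is_ready: impl Fn() -> bool, build: impl FnOnce()) {
        // Fast path: a completed base is reused without taking the lock.
        if is_ready() {
            return;
        }
        let lock = self.lock_for(key);
        let _guard = lock.lock().unwrap();
        // Re-check under the lock: another launch may have finished the build
        // while this one was waiting.
        if !is_ready() {
            build();
        }
    }
}
```

Locks are per key, so first boots of different images do not block each other.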

Acceptance criteria

  • Two VMs from the same image can start concurrently without racing during first-time prepared-base creation.
  • Repeated VMs from the same image reuse one prepared shared base on disk.
  • Per-VM writable state is limited to overlay/runtime data rather than a full duplicated rootfs.
  • A VM using a shared-root base can also mount additional shared filesystem volumes successfully.

Related work

  • #35
Owner

Implementation Spec for Issue #48: Prepared Shared Rootfs Reuse

Objective

Eliminate per-VM rootfs duplication by introducing a prepared shared rootfs base cached per image digest and storage variant. Multiple VMs share one immutable prepared base, and only overlay/runtime data is kept per VM. Backend selection becomes capability-driven, and concurrent first-boot preparation is serialized with per-image/per-variant locks.

Requirements

  • Prepared base directory created once per (image digest, storage variant) pair, reused by all VMs
  • Prepared base contains extracted rootfs + injected binaries (init, busybox, mycelium, kernel modules), treated as immutable
  • Backend auto-selection prefers shared-root (virtiofs) when supported, falls back to block storage
  • Per-VM writable state limited to overlay upper/work, sockets, logs, metadata
  • VMs using shared-root can mount additional shared filesystem volumes without broken argument assembly
  • Concurrent first-time base creation serialized via a per-image/per-variant file lock

Files to Modify/Create

| File | Action | Description |
|------|--------|-------------|
| storage/prepared_base.rs | Create | PreparedBaseManager: per-image/variant cached bases with file locking |
| storage/mod.rs | Modify | Add pub mod prepared_base |
| storage/traits.rs | Modify | Add prepared_base_path and image_digest to PrepareConfig |
| storage/virtiofs.rs | Modify | Use prepared base as overlay lowerdir, skip per-VM injection |
| storage/block.rs | Modify | Use prepared base as ext4 image source |
| storage/volmgr_backend.rs | Modify | Snapshot from prepared base subvolume (CoW) |
| storage/image.rs | Modify | Consolidate injection into inject_all_guest_binaries() |
| vm/manager.rs | Modify | Integrate PreparedBaseManager, capability-driven backend selection |
| paths.rs | Modify | Add prepared_bases path |
| config.rs | Modify | Default backend → "auto" |
| hypervisor/process.rs | Modify | Add test for multiple --fs flags |

Implementation Plan

Step 1: Add prepared bases directory to Paths

Files: paths.rs — Add prepared_bases field, helper method, ensure_dirs

Step 2: Consolidate binary injection helper

Files: storage/image.rs — Create inject_all_guest_binaries() centralizing repeated injection code

Step 3: Create PreparedBaseManager module

Files: storage/prepared_base.rs, storage/mod.rs — get_or_create with file locking, .ready marker
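
A minimal sketch of the `.ready` marker pattern, assuming the caller already holds the per-image/per-variant lock. The entry layout (`rootfs/` plus a sibling `.ready` file) and the names `ensure_prepared` and `is_ready` are assumptions for illustration, not the actual PreparedBaseManager API.

```rust
use std::fs;
use std::io;
use std::path::Path;

const READY_MARKER: &str = ".ready";
const ROOTFS_DIR: &str = "rootfs";

/// A cache entry is usable only once its marker file exists.
pub fn is_ready(entry: &Path) -> bool {
    entry.join(READY_MARKER).is_file()
}

/// Build the prepared base if needed. Returns Ok(true) when it was built now
/// and Ok(false) when an existing base was reused.
/// ASSUMPTION: the caller holds the lock for this entry.
pub fn ensure_prepared(
    entry: &Path,
    populate: impl FnOnce(&Path) -> io::Result<()>,
) -> io::Result<bool> {
    if is_ready(entry) {
        return Ok(false);
    }
    // An entry without a marker is a leftover from an interrupted build.
    if entry.exists() {
        fs::remove_dir_all(entry)?;
    }
    let rootfs = entry.join(ROOTFS_DIR);
    fs::create_dir_all(&rootfs)?;
    populate(&rootfs)?;
    // The marker is written last, so a crash mid-build never looks complete.
    fs::write(entry.join(READY_MARKER), b"")?;
    Ok(true)
}
```

Keeping the marker next to `rootfs/` rather than inside it means the marker never shows up in the guest's view of the filesystem.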

Step 4: Update PrepareConfig

Files: storage/traits.rs — Add prepared_base_path and image_digest fields
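
For illustration, the two new fields could be optional so that existing callers keep working. Only `prepared_base_path` and `image_digest` come from this spec; the `rootfs_source` field and both helper methods are hypothetical.

```rust
use std::path::{Path, PathBuf};

pub struct PrepareConfig {
    /// Hypothetical stand-in for the existing per-VM rootfs source.
    pub rootfs_source: PathBuf,
    /// Digest of the backing image, if the VM is image-backed.
    pub image_digest: Option<String>,
    /// Shared prepared base resolved for this image, if any.
    pub prepared_base_path: Option<PathBuf>,
}

impl PrepareConfig {
    /// Directory a backend should read from: the prepared base when present,
    /// otherwise the original rootfs source.
    pub fn lower_source(&self) -> &Path {
        self.prepared_base_path
            .as_deref()
            .unwrap_or(self.rootfs_source.as_path())
    }

    /// Per-VM binary injection is only needed when no prepared base exists.
    pub fn needs_injection(&self) -> bool {
        self.prepared_base_path.is_none()
    }
}
```

With both fields left as `None`, a backend behaves exactly as it did before the change.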

Step 5: Refactor VirtioFsStorage

Files: storage/virtiofs.rs — Use prepared base as lowerdir, skip injection when available

Step 6: Refactor BlockStorage

Files: storage/block.rs — Use prepared base as source, skip injection

Step 7: Refactor VolmgrBackend

Files: storage/volmgr_backend.rs — Snapshot from prepared base subvolume

Step 8: Integrate into VmManager

Files: vm/manager.rs — Wire PreparedBaseManager into prepare_storage flow

Step 9: Capability-driven backend auto-selection

Files: vm/manager.rs, config.rs — "auto" default, capability probing

Step 10: Concurrency and reuse tests

Files: storage/prepared_base.rs — Thread-safety tests, reuse verification

Step 11: Verify multiple --fs flag assembly

Files: hypervisor/process.rs — Test root + additional mounts produce valid args

Acceptance Criteria

  • Two VMs from same image start concurrently without racing
  • Repeated VMs reuse one prepared shared base on disk
  • Per-VM writable state limited to overlay/runtime data
  • Shared-root VM can mount additional shared filesystem volumes

Notes

  • Backward compatible: VMs without image_digest bypass prepared base entirely
  • Prepared base invalidation (after binary upgrades) is a follow-up
  • volmgr snapshot approach requires btrfs detection
  • Default change from "virtiofs" to "auto" is behaviorally identical for virtiofs-capable systems
Owner

Test Results

  • Total: 300
  • Passed: 300
  • Failed: 0

| Suite | Tests | Passed | Failed |
|---|---|---|---|
| my_hypervisor (CLI) | 38 | 38 | 0 |
| my_hypervisor_init | 14 | 14 | 0 |
| my_hypervisor_lib | 248 | 248 | 0 |
| Doc-tests | 0 | 0 | 0 |

All tests pass.

Owner

Implementation Summary

Files Created

  • crates/my_hypervisor-lib/src/storage/prepared_base.rs — PreparedBaseManager with file-locked get_or_create, .ready marker pattern

Files Modified

  • crates/my_hypervisor-lib/src/paths.rs — Added prepared_bases path + helper method
  • crates/my_hypervisor-lib/src/storage/mod.rs — Registered prepared_base module
  • crates/my_hypervisor-lib/src/storage/traits.rs — Added prepared_base_path and image_digest to PrepareConfig
  • crates/my_hypervisor-lib/src/storage/image.rs — Consolidated inject_all_guest_binaries() helper
  • crates/my_hypervisor-lib/src/storage/virtiofs.rs — Uses prepared base as overlay lowerdir, skips per-VM injection
  • crates/my_hypervisor-lib/src/storage/block.rs — Uses prepared base as ext4 image source
  • crates/my_hypervisor-lib/src/storage/volmgr_backend.rs — Btrfs snapshot from prepared base with cp fallback
  • crates/my_hypervisor-lib/src/vm/manager.rs — Integrated PreparedBaseManager, capability-driven auto backend
  • crates/my_hypervisor-lib/src/config.rs — Default backend changed to "auto"
  • crates/my_hypervisor-lib/src/hypervisor/process.rs — Added multiple --fs flags test

Test Results

300 tests, 300 passed, 0 failed

Notes

  • Backward compatible: VMs without image_digest use the existing per-VM flow
  • Prepared base invalidation after binary upgrades is a follow-up
  • volmgr uses btrfs snapshot when available, falls back to cp --reflink=auto
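
The snapshot-or-copy choice in the last note can be sketched as a function that picks the command to run. `clone_command` is a hypothetical helper, and how btrfs availability is detected is assumed to happen elsewhere.

```rust
use std::path::Path;
use std::process::Command;

/// Build (but do not run) the command that clones the prepared base for a VM.
/// ASSUMPTION: `btrfs_available` was determined by a separate detection step.
pub fn clone_command(base: &Path, dest: &Path, btrfs_available: bool) -> Command {
    if btrfs_available {
        // Copy-on-write snapshot of the prepared base subvolume.
        let mut cmd = Command::new("btrfs");
        cmd.args(["subvolume", "snapshot"]).arg(base).arg(dest);
        cmd
    } else {
        // Fallback: archive copy that uses reflinks where the filesystem allows.
        let mut cmd = Command::new("cp");
        cmd.arg("-a").arg("--reflink=auto").arg(base).arg(dest);
        cmd
    }
}
```

Returning the `Command` instead of executing it keeps the selection logic testable without touching the filesystem.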
Reference
geomind_code/my_hypervisor#48