Phase 5: Backup & disaster recovery (6 copies, 3-2-1 rule) #38

Closed
opened 2026-03-26 02:27:55 +00:00 by mik-tf · 1 comment

Goal

6 copies of all data across 3 storage types, matching freezone's backup architecture. Full cluster can be destroyed and rebuilt from backup.

Depends on

  • https://forge.ourworld.tf/mycelium_code/home/issues/37 (Phase 4: K3s HA cluster)

Architecture

Data copies:
  1-3: Kadalu GlusterFS Replica3 (live, 3 server nodes)
    4: Velero → MinIO (in-cluster, Kadalu PVC)
    5: Restic on backup VM 1 (independent TFGrid node)
    6: Restic on backup VM 2 (independent TFGrid node)

3-2-1 rule: 6 copies (3 required), 2 storage types (GlusterFS + Restic), 2 offsite copies on independent VMs (1 required).
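
As a rough sanity check, the copy inventory can be tested against the 3-2-1 thresholds. This is a minimal sketch only: the `Copy` type, its field names, and the choice to label copy 4 as GlusterFS-backed (because MinIO sits on a Kadalu PVC) are illustrative assumptions, not part of the deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    storage: str   # underlying storage type (label is an assumption)
    offsite: bool  # True if the copy lives outside the K3s cluster


# Inventory mirrors the "Data copies" list in this issue.
COPIES = [
    Copy("gluster-replica-1", "glusterfs", False),
    Copy("gluster-replica-2", "glusterfs", False),
    Copy("gluster-replica-3", "glusterfs", False),
    Copy("velero-minio", "glusterfs", False),  # MinIO data sits on a Kadalu PVC
    Copy("restic-vm-1", "restic", True),
    Copy("restic-vm-2", "restic", True),
]


def satisfies_3_2_1(copies):
    """Return True for >= 3 copies, >= 2 storage types and >= 1 offsite copy."""
    storage_types = {c.storage for c in copies}
    offsite = [c for c in copies if c.offsite]
    return len(copies) >= 3 and len(storage_types) >= 2 and len(offsite) >= 1


print(satisfies_3_2_1(COPIES))      # True
print(satisfies_3_2_1(COPIES[:4]))  # False: in-cluster copies only
```

The second call shows why the Restic VMs matter: without them, every copy shares the cluster's storage and fate.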

Tasks

Velero (operational recovery)

  • 5.1 Deploy MinIO in-cluster (Kadalu PVC, 5Gi)
  • 5.2 Install Velero with AWS/S3 plugin (MinIO backend)
  • 5.3 Create daily backup schedule — 2 AM UTC, 7-day retention
  • 5.4 Scope to marketplace namespace only
  • 5.5 Test backup + restore cycle
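
Tasks 5.3 and 5.4 could be expressed as a single Velero Schedule object. A minimal sketch, assuming Velero is installed in the `velero` namespace and that the schedule is called `marketplace-daily` (both names are assumptions, not taken from this issue). The Python below only emits the manifest as JSON, which `kubectl apply -f -` accepts alongside YAML.

```python
import json


def velero_schedule(name="marketplace-daily", namespace="marketplace"):
    """Build a Velero Schedule: daily at 02:00, 7-day retention, one namespace."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        # "velero" is the usual install namespace; adjust if it differs.
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {
            "schedule": "0 2 * * *",  # cron syntax: 02:00 every day
            "template": {
                "includedNamespaces": [namespace],  # task 5.4
                "ttl": "168h0m0s",                  # 7 days * 24 h (task 5.3)
            },
        },
    }


print(json.dumps(velero_schedule(), indent=2))
```

The cron expression is evaluated in the Velero server's timezone, which is typically UTC in a container; that is worth confirming before relying on the 2 AM UTC target.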

Restic Backup VM 1 (disaster recovery)

  • 5.6 Terraform: independent TFGrid VM (2 CPU, 2GB RAM, 50GB disk)
  • 5.7 Independent WireGuard network (NOT shared with K3s)
  • 5.8 Restic repo initialized on VM
  • 5.9 Cron: daily snapshot at 2 AM via kubectl exec + tar + restic backup --stdin
  • 5.10 Retention: 7 daily, 4 weekly, 3 monthly
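
The cron job from 5.9 and the retention from 5.10 might be assembled roughly as follows. This is a sketch under stated assumptions: the pod name, data path, repo path, and script path are hypothetical placeholders, and the helper only builds command strings rather than running them.

```python
import shlex

NAMESPACE = "marketplace"                    # from task 5.4
POD = "marketplace-0"                        # hypothetical pod name
DATA_DIR = "/data"                           # hypothetical data path in the pod
REPO = "/srv/restic-repo"                    # hypothetical repo path on the VM
SCRIPT = "/usr/local/bin/restic-backup.sh"   # hypothetical wrapper script


def backup_pipeline(namespace=NAMESPACE, pod=POD, data_dir=DATA_DIR, repo=REPO):
    """Return the shell pipeline that streams a tar of pod data into restic."""
    producer = ["kubectl", "exec", "-n", namespace, pod, "--",
                "tar", "czf", "-", data_dir]
    consumer = ["restic", "-r", repo, "backup", "--stdin",
                "--stdin-filename", f"{namespace}.tar.gz"]
    return f"{shlex.join(producer)} | {shlex.join(consumer)}"


def forget_command(repo=REPO):
    """Return the restic command applying 7 daily / 4 weekly / 3 monthly."""
    return shlex.join(["restic", "-r", repo, "forget",
                       "--keep-daily", "7", "--keep-weekly", "4",
                       "--keep-monthly", "3", "--prune"])


def cron_entry(hour, script=SCRIPT):
    """Return a crontab line that runs the script daily at the given hour."""
    return f"0 {hour} * * * {script}"


print(backup_pipeline())
print(forget_command())
print(cron_entry(2))
```

In a real wrapper script, `set -o pipefail` would be worth adding so that a failing `kubectl exec` is not hidden by a successful `restic` exit code. Restic also needs the repository password, for example through `RESTIC_PASSWORD_FILE`.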

Restic Backup VM 2 (second offsite)

  • 5.11 Terraform: second independent TFGrid VM (different node than VM 1)
  • 5.12 Cron: daily snapshot at 3 AM (offset from VM 1)
  • 5.13 Same retention policy
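
To get a feel for how many snapshots the shared retention policy (7 daily, 4 weekly, 3 monthly) leaves on each VM, here is a simplified model. It is not restic's implementation, only an approximation of the keep-daily/weekly/monthly idea, and the snapshot dates are made up for illustration.

```python
from datetime import date, timedelta


def keep(snapshots, daily=7, weekly=4, monthly=3):
    """Simplified keep-daily/weekly/monthly selection over snapshot dates.

    For each rule, walk snapshots newest first and keep the newest snapshot
    of every distinct day / ISO week / month until the rule's quota is used.
    """
    rules = [
        (daily, lambda d: d.toordinal()),
        (weekly, lambda d: tuple(d.isocalendar()[:2])),  # (ISO year, ISO week)
        (monthly, lambda d: (d.year, d.month)),
    ]
    kept = set()
    for quota, bucket_of in rules:
        seen = []
        for snap in sorted(snapshots, reverse=True):
            bucket = bucket_of(snap)
            if bucket in seen:
                continue  # a newer snapshot already covers this bucket
            if len(seen) == quota:
                break     # quota for this rule is exhausted
            seen.append(bucket)
            kept.add(snap)
    return sorted(kept)


# Hypothetical history: one snapshot per day for 120 days.
newest = date(2026, 3, 26)
history = [newest - timedelta(days=i) for i in range(120)]
kept = keep(history)
print(len(kept))  # 11
for snap in kept:
    print(snap.isoformat())
```

The rules overlap (the newest snapshot satisfies all three), so the steady-state count stays below the naive 7 + 4 + 3 = 14.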

Verification

  • 5.14 Document restore procedure (step-by-step)
  • 5.15 Test full restore: Velero restore to clean namespace
  • 5.16 Test full restore: Restic restore to new cluster
  • 5.17 Verify marketplace starts and serves data after restore
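
Step 5.17 could be automated with a small polling helper that waits for the marketplace to answer after a restore. A minimal sketch: the helper names are made up here, and any URL passed to `http_probe` is a placeholder, since this issue does not name the marketplace endpoint.

```python
import time
from urllib.request import urlopen


def wait_until(probe, timeout_s=300.0, interval_s=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # treat probe errors as "not ready yet"
        if clock() >= deadline:
            return False
        sleep(interval_s)


def http_probe(url):
    """Build a probe that succeeds when the URL answers with HTTP 200."""
    def probe():
        with urlopen(url, timeout=5) as resp:
            return resp.status == 200
    return probe


# Demo with a fake probe that becomes healthy on the third attempt.
answers = iter([False, False, True])
print(wait_until(lambda: next(answers), timeout_s=10, interval_s=0))  # True
```

A 200 response only shows that the service starts; checking that restored data is actually served would need a request against a known record.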

Reference

  • Freezone Velero: znzfreezone_deploy/k3s-v2/scripts/setup-velero.sh
  • Freezone Restic VMs: znzfreezone_deploy/tfgrid/backup1/tf/main.tf, backup2/tf/main.tf
  • Freezone architecture: znzfreezone_deploy/docs/architecture.md (backup section)

Recovery targets

| Scenario | Recovery method | RTO |
|----------|-----------------|-----|
| Pod crash | K8s auto-restart | < 1 min |
| Node failure | Pod reschedule + GlusterFS | < 5 min |
| Namespace deleted | Velero restore | < 15 min |
| Cluster destroyed | Restic restore to new cluster | < 60 min |
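
During the restore tests, measured recovery times can be compared against these targets. A minimal sketch; the scenario keys and the sample durations are illustrative, not measurements.

```python
from datetime import timedelta

# RTO targets from the recovery targets table.
RTO = {
    "pod crash": timedelta(minutes=1),
    "node failure": timedelta(minutes=5),
    "namespace deleted": timedelta(minutes=15),
    "cluster destroyed": timedelta(minutes=60),
}


def within_rto(scenario, measured):
    """Return True if the measured recovery time is strictly below the target."""
    return measured < RTO[scenario]


# Hypothetical drill results.
print(within_rto("namespace deleted", timedelta(minutes=12)))  # True
print(within_rto("cluster destroyed", timedelta(minutes=75)))  # False
```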

Signed-off-by: mik-tf


Consolidated into mycelium_code/home#55 (https://forge.ourworld.tf/mycelium_code/home/issues/55): backup & DR will follow once the K3s cluster is provisioned.

— mik-tf
