PR: Complete Document Processing Pipeline #5

Merged
mik-tf merged 9 commits from development_dioxus into development 2026-02-05 23:51:58 +00:00

Summary

Complete document processing pipeline with hash-based change detection, local ONNX embeddings, URL-based git processing, and UI fixes.

Changes

Hash-Based Change Detection:

  • Content hashes stored in .ai/*.toml for each processed file
  • Skip unchanged files on subsequent runs (fast incremental updates)
  • Demo scripts to test: run once → processes, run again → skips, delete .ai/ → reprocesses
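
The skip logic described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the in-memory `HashStore` stands in for the hashes persisted in `.ai/*.toml`, the names `content_hash` and `needs_processing` are hypothetical, and `DefaultHasher` is used only to keep the sketch dependency-free (a real pipeline would more likely persist a fixed algorithm such as SHA-256).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the hashes persisted in `.ai/*.toml`:
/// maps a file path to the content hash recorded on the last run.
type HashStore = HashMap<String, u64>;

/// Hash the file content. DefaultHasher is an assumption of this sketch;
/// its output is not guaranteed stable across Rust releases, so hashes
/// written to disk would need a fixed algorithm instead.
fn content_hash(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

/// Returns true when the file must be (re)processed and records the new hash.
fn needs_processing(store: &mut HashStore, path: &str, content: &str) -> bool {
    let hash = content_hash(content);
    if store.get(path) == Some(&hash) {
        return false; // unchanged since the last run: skip
    }
    store.insert(path.to_string(), hash);
    true
}

fn main() {
    let mut store = HashStore::new();
    // First run: nothing recorded yet, so the file is processed.
    assert!(needs_processing(&mut store, "intro.md", "# Intro"));
    // Second run with the same content: skipped.
    assert!(!needs_processing(&mut store, "intro.md", "# Intro"));
    // Edited content: processed again.
    assert!(needs_processing(&mut store, "intro.md", "# Intro v2"));
    // Deleting `.ai/` corresponds to clearing the store: everything reprocesses.
    store.clear();
    assert!(needs_processing(&mut store, "intro.md", "# Intro v2"));
}
```

The three demo scenarios (run once, run again, delete `.ai/`) map onto the assertions in `main`.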

hero_embedder Integration:

  • Integrated local ONNX embeddings (all-MiniLM-L6-v2, 384 dims)
  • Auto-fallback to OpenRouter API (1536 dims) if hero_embedder unavailable
  • Fixed port configuration (3752) in Makefile and config.rs
  • Dynamic dimension handling in generate_embeddings()
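
A rough sketch of the fallback and dynamic dimension handling, under stated assumptions: the `Embedder` trait, the `LocalOnnx` and `RemoteApi` types, and `embed_with_fallback` are hypothetical names, and both backends return zero vectors instead of calling hero_embedder or OpenRouter. The only point is that the dimension is read from the returned vector rather than hard-coded.

```rust
/// Hypothetical embedding backend; the real clients are not shown in this PR.
trait Embedder {
    /// Returns None when the backend is unreachable.
    fn embed(&self, text: &str) -> Option<Vec<f32>>;
}

/// Stands in for the local hero_embedder service (384 dims).
struct LocalOnnx {
    available: bool,
}

/// Stands in for the OpenRouter API fallback (1536 dims).
struct RemoteApi;

impl Embedder for LocalOnnx {
    fn embed(&self, _text: &str) -> Option<Vec<f32>> {
        if self.available {
            Some(vec![0.0; 384])
        } else {
            None
        }
    }
}

impl Embedder for RemoteApi {
    fn embed(&self, _text: &str) -> Option<Vec<f32>> {
        Some(vec![0.0; 1536])
    }
}

/// Try the local backend first, then the fallback. The dimension is taken
/// from the vector that actually came back, so callers never assume 384 or 1536.
fn embed_with_fallback(
    local: &dyn Embedder,
    fallback: &dyn Embedder,
    text: &str,
) -> Option<(Vec<f32>, usize)> {
    let vector = local.embed(text).or_else(|| fallback.embed(text))?;
    let dims = vector.len();
    Some((vector, dims))
}

fn main() {
    let remote = RemoteApi;
    for available in [true, false] {
        let local = LocalOnnx { available };
        let (_, dims) = embed_with_fallback(&local, &remote, "hello").unwrap();
        println!("local available: {available}, dims: {dims}");
    }
}
```

Vectors from the two backends have different lengths and are not comparable, so an index built with one dimension would presumably need to be rebuilt if the backend changes.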

URL-based Processing with Git Push:

  • Full cycle: Clone/Pull → Hash check → Process changed files → Push .ai/ → Upload to Redis
  • All demos push .ai/ metadata to remote (use SKIP_PUSH=1 to disable)
  • Fixed repo URLs: docs_geomind, docs_mycelium, docs_owh
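
The cycle above, including the `SKIP_PUSH=1` switch, could be modelled roughly as below. The `Step` enum and the `push_disabled` and `plan_cycle` functions are hypothetical, and treating any value other than `1` as "push enabled" is an assumption; the sketch only plans the steps and performs no git or Redis operations.

```rust
use std::env;

/// The stages of the URL-based cycle, in the order listed in the PR.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Step {
    CloneOrPull,
    HashCheck,
    ProcessChangedFiles,
    PushAiMetadata,
    UploadToRedis,
}

/// `SKIP_PUSH=1` disables the push. Any other value (or an unset variable)
/// leaves the push enabled; that interpretation is an assumption of this sketch.
fn push_disabled(skip_push: Option<&str>) -> bool {
    skip_push == Some("1")
}

/// Build the ordered list of steps, omitting the `.ai/` push when requested.
fn plan_cycle(skip: bool) -> Vec<Step> {
    let mut steps = vec![
        Step::CloneOrPull,
        Step::HashCheck,
        Step::ProcessChangedFiles,
    ];
    if !skip {
        steps.push(Step::PushAiMetadata);
    }
    steps.push(Step::UploadToRedis);
    steps
}

fn main() {
    let raw = env::var("SKIP_PUSH").ok();
    let steps = plan_cycle(push_disabled(raw.as_deref()));
    for (i, step) in steps.iter().enumerate() {
        println!("{}. {:?}", i + 1, step);
    }
}
```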

UI Fixes:

  • Page numbers now start from 1 (not 0)
  • MDX/JSX stripping: removes import statements and JSX components from markdown
  • Dark mode text visibility fixed in cards
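
To illustrate the first two fixes, here is a line-based sketch. It assumes a simple heuristic that may differ from the code in this PR: a line is dropped if it starts with `import ` or opens or closes a tag whose name begins with an uppercase letter. The function names are hypothetical, and the heuristic does not handle multi-line JSX or `import` lines inside code fences.

```rust
/// True when the line opens or closes a JSX component, recognised here by
/// an uppercase first letter in the tag name (e.g. `<Tabs>` or `</Tabs>`).
fn is_jsx_component_tag(line: &str) -> bool {
    let rest = line.strip_prefix("</").or_else(|| line.strip_prefix('<'));
    matches!(rest.and_then(|r| r.chars().next()), Some(c) if c.is_ascii_uppercase())
}

/// Drop MDX import statements and JSX component lines, keeping plain
/// markdown and lowercase HTML tags such as `<div>`.
fn strip_mdx(markdown: &str) -> String {
    markdown
        .lines()
        .filter(|line| {
            let trimmed = line.trim_start();
            !(trimmed.starts_with("import ") || is_jsx_component_tag(trimmed))
        })
        .collect::<Vec<_>>()
        .join("\n")
}

/// Convert a zero-based page index into the one-based number shown in the UI.
fn display_page_number(index: usize) -> usize {
    index + 1
}

fn main() {
    let source = "import Tabs from '@theme/Tabs';\n# Title\n<Tabs>\ntext\n</Tabs>";
    let cleaned = strip_mdx(source);
    assert_eq!(cleaned, "# Title\ntext");
    assert_eq!(display_page_number(0), 1);
    println!("{cleaned}");
}
```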

Demo Scripts:

  • demo_basic.rs - Local collection processing
  • demo_geomind.rs - geomind/docs_geomind with git push
  • demo_mycelium.rs - mycelium/docs_mycelium with git push
  • demo_ourworld.rs - ourworld/docs_owh with git push
  • demo_all.rs - Process all namespaces
  • Makefile targets: demo-basic, demo-geomind, demo-mycelium, demo-ourworld, demo-all, demo-reset

Documentation:

  • Created docs/specs.md with full technical specifications
  • Reorganized README: Quick Start at top for newcomers
  • Documented URL-based processing flow and hash-based change detection
  • Created TODO.md with roadmap (cross-reference, CI, Dioxus porting)

Quick Start

git checkout development_dioxus
cp .env.example .env  # Add GROQ_API_KEY
make run
# Open http://localhost:8883

Next Steps (in TODO.md)

  • Cross-reference support (ebooks linking to multiple repos)
  • Test full example suite
  • CI pipeline with binaries
  • Dioxus UI porting to Archipelagos
- Fix Cargo.toml dependencies to use git URLs instead of local paths
- Rename binaries to hero_books and hero_books_client
- Update port to 8883 per hero_ports registry
- Add context-based RPC endpoint /api/{context}/books/rpc for hero_osis-sdk
- Add local example collection (hero_books_docs) with 4 pages
- Add book definition (herobooks_guide.toml)
- Add .env.example with API key documentation
- Add herobooks_demo.rhai script for full pipeline demo
- Add process_herobooks.rs example for AI processing
- Add init_redis_db.rs for database initialization
- Update Makefile:
  - make run now does full demo with fresh data
  - Add check-env for graceful API key validation
  - Add clean-data for fresh start
  - Rename targets for better UX
- Update README with full pipeline table (10 steps)
- Use hero_redis from git (development branch)
- Fix server find_exported_book to search by title
- Add demo examples (demo_basic, demo_geomind, demo_mycelium, demo_ourworld, demo_all)
- Add docs/specs.md with full technical specifications
- Update README with collection structure and hash-based change detection
- Add Makefile targets for all demos (make demo-basic, etc.)
- Add scripts/check-env.sh for environment validation
- Add src/embedder/ module for embedding fallback
- Move book definitions to examples/ebooks/
- Update .gitignore and .env.example
- Various infrastructure improvements for Cargo git cache builds
- Page numbers now start from 1 instead of 0
- Strip MDX/JSX import statements and components from markdown
- Skip empty pages (embed-only) from book listings
- Fix dark mode text visibility in cards (titles, links, muted text)
- Add --port flag to hero_embedder startup commands so it uses EMBEDDER_PORT (3752) instead of the default (3380)
- demo_geomind: now uses forge.ourworld.tf/geomind/docs_geomind
- demo_mycelium: now uses forge.ourworld.tf/mycelium/docs_mycelium
- demo_ourworld: now uses forge.ourworld.tf/ourworld/docs_owh
- All demos now push .ai/ to remote after processing (use SKIP_PUSH=1 to skip)
- Remove redundant demo_docs_owh (same as demo_ourworld)
- Update TODO.md with priority tasks: cross-reference, test suite, CI, Dioxus
- Reorganize README.md: Quick Start at top for newcomers
- Document URL-based processing flow in README and specs.md
- Fix hero_embedder dimensions in specs.md (384, not 1536)
mik-tf force-pushed development_dioxus from 53b5227153 to 2c7183ad8d
Some checks failed
Test / test (pull_request) Failing after 2m48s
Test / test (push) Failing after 3m24s
2026-02-05 23:51:37 +00:00
mik-tf merged commit e4415d77f4 into development 2026-02-05 23:51:58 +00:00
mik-tf deleted branch development_dioxus 2026-02-05 23:51:58 +00:00
Reference
lhumina_code/hero_books!5