embedding loading broken, can't load more books/collections, libraries don't show up #23

Open
opened 2026-02-07 16:09:39 +00:00 by despiegk · 4 comments
Owner

need to do something, too slow

there is a new RPC endpoint for hero_embedder: it's a key/value store per namespace. We can use this to remember info we don't want to forget (instead of a default Redis), and since it is scoped per namespace, we can use it to check whether we need to process a page again, or for other config.

we should calculate a hash per collection (files sorted and hashed) or per file (choose/test which) and check it against this KVS to decide whether to process a page or not
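The hashing scheme above could be sketched like this. This is a minimal illustration, not the real implementation: the function names and the shape of the KVS lookup are assumptions.

```python
import hashlib

def page_hash(content: str) -> str:
    # Stable hash of a single page's content, suitable as a KVS value.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]

def collection_hash(pages: dict) -> str:
    # Hash a whole collection: page names sorted first, so the result
    # is independent of iteration order but sensitive to any content,
    # name, or page-count change.
    h = hashlib.sha256()
    for name in sorted(pages):
        h.update(name.encode("utf-8"))
        h.update(page_hash(pages[name]).encode("ascii"))
    return h.hexdigest()[:16]

def needs_processing(stored_hash, current_hash: str) -> bool:
    # Process only if the KVS has no stored hash or it differs.
    return stored_hash != current_hash
```

On startup the indexer would fetch the stored hash from the KVS, compare it to the freshly computed one, and skip the page or collection when they match.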

```bash
Indexing book: owh_investment_memo
  16 pages to process, 0 unchanged
  Processed: intro (52 Q&A pairs)
  Processed: executive_summary (41 Q&A pairs)
  Processed: company_overview (30 Q&A pairs)
  Processed: market_opportunity (36 Q&A pairs)
  Processed: product_and_technology (33 Q&A pairs)
  Processed: traction_and_milestones (24 Q&A pairs)
  Processed: financial_overview (16 Q&A pairs)
```

it's too slow: when we use the embedder directly it goes at 100+ embeddings per second, but here it's much slower. Why?

Author
Owner

```
RPC API: http://127.0.0.1:8883/rpc
OpenRPC: http://127.0.0.1:8883/rpc/discover
Books dir: examples/ebooks_local

Use books_client to interact with the API.
Starting Hero Books server at http://127.0.0.1:8883
Books directory: examples/ebooks_local
Hero Embedder: http://localhost:3752/rpc
Exporting 7 books for web serving...
Found collection: python_intro (namespace: coding)
Exported 'coding_python': 2 pages, 0 images (namespace: coding)
Found collection: rust_basics (namespace: coding)
Exported 'coding_rust': 2 pages, 0 images (namespace: coding)
Found collection: rust_basics (namespace: herobooks)
Found collection: scifi_classics (namespace: herobooks)
Found collection: python_intro (namespace: herobooks)
Found collection: hero_books_docs (namespace: herobooks)
Found collection: hero_redis_docs (namespace: herobooks)
Found collection: philosophy_intro (namespace: herobooks)
Found collection: hero_tools_docs (namespace: herobooks)
Exported 'herobooks_guide': 4 pages, 0 images (namespace: herobooks)
Found collection: hero_redis_docs (namespace: herobooks)
Exported 'herobooks_redis': 2 pages, 0 images (namespace: herobooks)
Found collection: hero_tools_docs (namespace: herobooks)
Exported 'herobooks_tools': 2 pages, 0 images (namespace: herobooks)
Found collection: philosophy_intro (namespace: literature)
Exported 'literature_philosophy': 2 pages, 0 images (namespace: literature)
Found collection: scifi_classics (namespace: literature)
Exported 'literature_scifi': 2 pages, 0 images (namespace: literature)
Export directory: /var/folders/c1/m7s3rsy512b7yf46z05hbcy80000gn/T/herotools_export
Indexing books for vector search...
Found 3 namespaces: ["herobooks", "coding", "literature"]
Indexing 3 books to namespace 'herobooks'
Indexing book: herobooks_redis
2 pages to process, 0 unchanged
Processed: overview (14 Q&A pairs) [Q&A cached]
Processed: vector_search (17 Q&A pairs) [Q&A cached]
Uploading 31 documents to hero_embedder...
Book 'herobooks_redis' indexed: 2 processed, 0 unchanged, 0 errors, 31 documents uploaded (31 from cache)
Indexed 'Hero Redis Guide': 2 pages, 31 documents
Indexing book: herobooks_guide
4 pages to process, 0 unchanged
Processed: 1_introduction (10 Q&A pairs) [Q&A cached]
Processed: 2_getting_started (20 Q&A pairs) [Q&A cached]
```

something is wrong with loading: I loaded the mycelium demo but it did not show up in books when I restarted the server
where do we keep track of which books we have???
and libraries?

maybe this needs to be in redis (or hero_redis)

despiegk changed title from embedding loading to embedding loading broken, can't load more books/collections, libraries don't show up 2026-02-07 16:13:58 +00:00
Author
Owner

I see there are multiple demos
but this should be more generic: all libraries should be remembered
no point having these multiple demos

Owner

Addressed in PR #24 (branch development-fixes-feb).

Commits: 08591c5, 23ea189

  • Added KVS methods to embedder client: kvs_set(), kvs_get(), kvs_delete(), kvs_list()
  • After successful export, book metadata stored in KVS: book:{name} → JSON with name, namespace, path, page_count, title
  • On startup, KVS is read to discover previously loaded books — skips re-export if unchanged
  • Per-page content hashes stored in KVS: page_hash:{namespace}:{book}:{page} → content hash
  • Pipeline restructure ensures Q&A is cached and reused, not re-extracted

Note: KVS requires the hero_embedder SDK change (namespace_create_with_quality()) which has been pushed to hero_embedder development. Until hero_embedder is rebuilt with these changes, KVS operations will log warnings and fall back to file-based state.
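The file-based fallback described in the note could look roughly like this. This is a hedged sketch: the `StateStore` class, the `kvs` client object, and the fallback file path are all hypothetical, not the real SDK surface (which exposes `kvs_set()`/`kvs_get()` per the list above).

```python
import json
import os
import tempfile

class StateStore:
    # Sketch of the described behavior: try the embedder KVS first;
    # when it is unavailable, log a warning and fall back to a local
    # JSON file. The `kvs` argument is a hypothetical client object.
    def __init__(self, kvs=None, path=None):
        self.kvs = kvs
        self.path = path or os.path.join(tempfile.gettempdir(), "herobooks_state.json")

    def set(self, key, value):
        if self.kvs is not None:
            try:
                self.kvs.kvs_set(key, value)
                return
            except Exception:
                print(f"warning: KVS unavailable, writing file state for {key}")
        state = self._load_file()
        state[key] = value
        with open(self.path, "w") as f:
            json.dump(state, f)

    def get(self, key):
        if self.kvs is not None:
            try:
                return self.kvs.kvs_get(key)
            except Exception:
                print("warning: KVS unavailable, reading file state")
        return self._load_file().get(key)

    def _load_file(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)
```

With this shape, book metadata would be written under keys like `book:{name}` regardless of which backend ends up holding it.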

Owner

KVS Persistence Complete

What was done

Write side (persist_books_state()) was already working — stores book metadata in hero_embedder KVS namespace herobooks_state.

Read side (load_books_state()) was missing — now implemented:

  1. compute_book_content_hash() — Computes a single hash per book by hashing all page contents + page names + page count. Detects any content change, page reorder, or page addition/removal.

  2. persist_books_state() — Enhanced to store content_hash alongside existing metadata (name, namespace, page_count, image_count, export_path).

  3. load_books_state() — New function that reads all book:* keys from KVS on startup, returns a map of book_name → (namespace, content_hash, page_count).

  4. index_books_for_search() — Modified to accept persisted state. For each exported book, computes current content hash and compares against KVS-stored hash. If they match, the book is skipped entirely (embeddings are already in hero_embedder).

  5. Startup flow updated: Step 3 loads KVS state, Step 4 indexes (skipping unchanged books), Step 5 persists updated state.
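The skip-unchanged decision in step 4 could be sketched as follows. The dict shapes are assumptions based on the description above: `exported` maps book name to its current content hash, and `persisted` maps book name to the `(namespace, content_hash, page_count)` tuple returned by `load_books_state()`.

```python
def books_to_index(exported: dict, persisted: dict) -> list:
    # Return the names of books whose current content hash differs from
    # the hash persisted in KVS; unchanged books are skipped entirely,
    # since their embeddings already live in hero_embedder.
    to_index = []
    for name, current_hash in exported.items():
        state = persisted.get(name)
        if state is not None and state[1] == current_hash:
            print(f"Skipping '{name}' - unchanged (hash {current_hash})")
            continue
        to_index.append(name)
    return to_index
```

A cold start (empty `persisted`) indexes everything; a warm restart with matching hashes indexes nothing.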

Hash-based change detection

Two layers of hash checking now work together:

  • Per-page hashes (.ai/*.toml metadata) — skip AI Q&A re-extraction for unchanged pages
  • Per-book KVS hashes — skip embedding upload entirely for unchanged books on restart

Test results

Cold start (no KVS state): All 3 books indexed normally, 26 pages processed, 667 documents uploaded. State persisted to KVS.

Warm restart (KVS state exists):

```
Loaded persisted state for 3 books from KVS
Skipping 'Geomind Tier-s Agentic Cloud' — unchanged (hash eb2152e3)
Skipping 'OurWorld Investment Memo' — unchanged (hash 4d6b9cf2)
Skipping 'Hero Books Guide' — unchanged (hash 7222ce54)
All 3 books unchanged — skipping indexing (fast restart)
```

Search works correctly after warm restart (embeddings persist in hero_embedder).

Also fixed

  • Relevance bar display — was showing 0% for all results due to broken score conversion. Now normalizes scores within result set (best=95%, worst=40%).
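The within-result-set normalization could be sketched like this; the function name is illustrative, only the 95%/40% mapping comes from the description above.

```python
def normalize_scores(scores: list) -> list:
    # Map raw similarity scores onto display percentages within one
    # result set: best score -> 95%, worst -> 40%, linear in between.
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Single result or identical scores: show the top value.
        return [95.0 for _ in scores]
    return [40.0 + (s - lo) / (hi - lo) * 55.0 for s in scores]
```

Note this is relative: a mediocre result set still shows a 95% bar for its best hit, which is a deliberate display choice rather than an absolute relevance measure.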

Branch: development-fixes-feb

Reference
lhumina_code/hero_books#23