Hero Books - Document Management System
A Rust-based document collection management system with CLI, library, and web interfaces. It processes markdown-based documentation with support for cross-collection references, link validation, and export to self-contained directories.
# Source your environment variables
source ~/.config/env.sh # or wherever you keep your secrets
# Build and run the web server
make run
# See all available commands
make help
Features
- Collection scanning: Automatically discover collections marked with `.collection` files
- Cross-collection references: Link between pages in different collections using `collection:page` syntax
- Include directives: Embed content from other pages with `!!include collection:page`
- Link validation: Detect broken links to pages, images, and files
- Export: Generate self-contained directories with all dependencies
- Access control: Group-based ACL via `.group` files
- Git integration: Automatically detect repository URLs
Installation
Install from Binaries
Download the pre-built binary from the Forge package registry:
mkdir -p ~/hero/bin
curl -fsSL -o ~/hero/bin/hero_books \
"https://forge.ourworld.tf/api/packages/lhumina_code/generic/hero_books/dev/hero_books-linux-amd64"
chmod +x ~/hero/bin/hero_books
Build from Source
git clone https://forge.ourworld.tf/lhumina_code/hero_books
cd hero_books
make build
Binaries will be at:
- target/release/books_client - CLI client
- target/release/books_server - Web server
Install to ~/hero/bin/:
make install
Run Different Documentation Sets
Run different documentation sets with isolated embeddings:
make run # Local books (7 books, 3 libraries, fast/offline)
make run-git # Git-based books (real content from forge repos)
make stop # Stop all services
Architecture & Concepts
Separation of Concerns
Hero Books separates content from presentation:
DocTree (Content):
- Manages markdown collections and pages
- Validates links and references
- Tracks files and images
- Processes include directives
- Enforces access control
Website (Presentation):
- Defines navigation structure
- Configures sidebars and menus
- Manages theming and styling
- Handles SEO metadata
- Provides plugin architecture
This separation allows flexible website layouts without changing content.
Key Concepts
Collections: Directories of markdown pages marked with .collection file
- Each collection is independently managed
- Collections can reference each other
- Access control per collection via ACL files
Pages: Individual markdown files with:
- Extracted title (from H1 heading)
- Description (from first paragraph)
- Parsed internal links and includes
- Optional front matter metadata
Links: References to pages, images, or files:
- Same collection: [text](page_name)
- Cross-collection: [text](collection:page)
- External: Automatic detection of HTTP(S) URLs
- Images: Identified by extension
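As an illustration of these rules, here is a minimal Rust sketch of link classification. The `LinkKind` enum and `classify_link` function are hypothetical names for illustration, not the crate's actual API:

```rust
// Hypothetical sketch of link classification; not the crate's actual API.
#[derive(Debug, PartialEq)]
enum LinkKind {
    External,
    Image,
    CrossCollection { collection: String, page: String },
    SameCollection(String),
}

const IMAGE_EXTS: &[&str] = &["png", "jpg", "jpeg", "gif", "svg", "webp", "bmp", "tiff", "ico"];

fn classify_link(target: &str) -> LinkKind {
    // External links: plain HTTP(S) URL detection
    if target.starts_with("http://") || target.starts_with("https://") {
        return LinkKind::External;
    }
    // Images: identified by file extension
    if let Some(ext) = target.rsplit('.').next() {
        if IMAGE_EXTS.contains(&ext.to_lowercase().as_str()) {
            return LinkKind::Image;
        }
    }
    // Cross-collection references use `collection:page` syntax
    if let Some((collection, page)) = target.split_once(':') {
        return LinkKind::CrossCollection {
            collection: collection.to_string(),
            page: page.to_string(),
        };
    }
    LinkKind::SameCollection(target.to_string())
}

fn main() {
    assert_eq!(classify_link("https://example.com"), LinkKind::External);
    assert_eq!(classify_link("img/logo.png"), LinkKind::Image);
    assert_eq!(classify_link("my_page"), LinkKind::SameCollection("my_page".into()));
}
```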
Groups: Access control lists defining user membership
- Grant read/write access to collections
- Support wildcards for email patterns
- Support group inclusion (nested groups)
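The wildcard rule can be sketched as follows; `matches_pattern` is an illustrative helper under the assumption that `*@domain` entries match any user at that domain, not the crate's actual API:

```rust
// Illustrative email-pattern matching for group ACLs; not the crate's API.
fn matches_pattern(pattern: &str, email: &str) -> bool {
    if let Some(domain) = pattern.strip_prefix("*@") {
        // Wildcard entry like `*@company.com` matches any user at that domain
        email
            .rsplit_once('@')
            .map_or(false, |(_, d)| d.eq_ignore_ascii_case(domain))
    } else {
        // Literal entry: exact (case-insensitive) email match
        pattern.eq_ignore_ascii_case(email)
    }
}

fn main() {
    assert!(matches_pattern("*@company.com", "alice@company.com"));
    assert!(!matches_pattern("*@company.com", "alice@other.com"));
    assert!(matches_pattern("user@example.com", "User@Example.com"));
}
```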
Export: Self-contained read-only directory:
- Pages and files organized by collection
- JSON metadata for each collection
- Suitable for static hosting or archival
Data Flow
Directory Scan
↓
Find Collections (.collection files)
↓
Parse Pages (extract metadata, parse links)
↓
Validate Links (check references exist)
↓
Process Includes (expand !!include directives)
↓
Enforce ACL (check group membership)
↓
Export (write to structured directory)
↓
Read Client (query exported collections)
Environment Variables
This project follows the env_secrets convention.
Source your env file before running — no .env files, no dotenv crates.
source ~/.config/env.sh # or wherever you keep your secrets
make run
Variables used by Hero Books
| Variable | Required | Purpose |
|---|---|---|
| GROQ_API_KEY | Yes | Groq API — Q&A extraction, AI summary, Whisper transcription |
| OPENROUTER_API_KEY | Yes | OpenRouter API — LLM fallback, embeddings |
| SAMBANOVA_API_KEY | No | SambaNova API — additional LLM provider |
| GIT_TOKEN | No | Personal access token for cloning private repos from Forge |
| HERO_EMBEDDER_URL | No | Override local embedder endpoint (default: http://localhost:3752/rpc) |
All variable names are canonical across the ecosystem — see the env_secrets skill for the full registry.
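A fail-fast startup check for these variables might look like the following sketch; `require_llm_key` is an illustrative helper, not part of Hero Books:

```rust
use std::env;

// Illustrative startup check for the env_secrets convention: fail fast if no
// LLM provider key is present. Variable names match the table above.
fn require_llm_key() -> Result<String, String> {
    for key in ["GROQ_API_KEY", "OPENROUTER_API_KEY", "SAMBANOVA_API_KEY"] {
        if let Ok(val) = env::var(key) {
            if !val.is_empty() {
                return Ok(key.to_string());
            }
        }
    }
    Err("set GROQ_API_KEY or OPENROUTER_API_KEY (see the env_secrets skill)".to_string())
}

fn main() {
    match require_llm_key() {
        Ok(key) => println!("using {key}"),
        Err(e) => eprintln!("missing credentials: {e}"),
    }
}
```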
Git Authentication (for private repos)
For private repositories on forge.ourworld.tf, set GIT_TOKEN in your env file
(personal access token from Forge: Settings > Applications > Generate Token).
Alternatively, use SSH keys — ensure your key is loaded in ssh-agent.
CLI Usage
The books_client CLI talks to a running books_server via OpenRPC.
Start the server first
books_server --port 8883 --books-dir /path/to/docs
Scan for collections
# Scan a local path (server must have access)
books_client scan --path /path/to/docs
# Scan from git repository
books_client scan --git-url https://github.com/user/docs.git
List and inspect collections
# List all collections
books_client list
# Get collection details
books_client get my-collection
# Get all pages in a collection
books_client get-pages my-collection
# Get a specific page
books_client get-page my-collection page-name
Process collections
# Process for Q&A extraction and embeddings
books_client process my-collection
# Force reprocessing
books_client process my-collection --force
Metadata management
# Get collection metadata
books_client get-metadata my-collection
# Set collection metadata
books_client set-metadata my-collection --json '{"key": "value"}'
Server health
# Check server health
books_client health
# View OpenRPC schema
books_client discover
Directory Structure
Source Structure
docs/
├── collection1/
│ ├── .collection # Marks as collection (optional: name:custom_name)
│ ├── read.acl # Optional: group names for read access
│ ├── write.acl # Optional: group names for write access
│ ├── page1.md
│ ├── subdir/
│ │ └── page2.md
│ └── img/
│ └── logo.png
├── collection2/
│ ├── .collection
│ └── intro.md
└── groups/ # Special collection for ACL groups
├── .collection
├── admins.group
└── editors.group
Export Structure
/tmp/books/
├── content/
│ └── collection_name/
│ ├── page1.md # Pages at root of collection dir
│ ├── page2.md
│ ├── img/ # All images in img/ subdirectory
│ │ └── logo.png
│ └── files/ # All other files in files/ subdirectory
│ └── document.pdf
└── meta/
└── collection_name.json # Collection metadata
File Formats
.collection
name:custom_collection_name
If the file is empty or no name is specified, the directory name is used.
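A minimal sketch of parsing this marker file, assuming a hypothetical `collection_name` helper:

```rust
use std::path::Path;

// Illustrative parse of a `.collection` marker file: an optional `name:` line
// overrides the directory name. Not the crate's actual implementation.
fn collection_name(marker_contents: &str, dir: &Path) -> String {
    marker_contents
        .lines()
        .find_map(|l| l.trim().strip_prefix("name:"))
        .map(|n| n.trim().to_string())
        .filter(|n| !n.is_empty())
        .unwrap_or_else(|| {
            // Fall back to the directory name when no `name:` line is present
            dir.file_name().unwrap_or_default().to_string_lossy().to_string()
        })
}

fn main() {
    let dir = Path::new("/docs/my_collection");
    assert_eq!(collection_name("name:custom_name", dir), "custom_name");
    assert_eq!(collection_name("", dir), "my_collection");
}
```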
.group
// Comments start with //
user@example.com
*@company.com
include:other_group
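A parser for this format might look like the following sketch; the `GroupEntry` enum and `parse_group` function are assumptions for illustration, not the crate's actual types:

```rust
// Illustrative parser for the `.group` format: `//` comments, literal emails,
// `*@domain` wildcards, and `include:` references to other groups.
#[derive(Debug, PartialEq)]
enum GroupEntry {
    Email(String),
    DomainWildcard(String),
    Include(String),
}

fn parse_group(contents: &str) -> Vec<GroupEntry> {
    contents
        .lines()
        .map(str::trim)
        // Skip blank lines and `//` comments
        .filter(|l| !l.is_empty() && !l.starts_with("//"))
        .map(|l| {
            if let Some(group) = l.strip_prefix("include:") {
                GroupEntry::Include(group.to_string())
            } else if let Some(domain) = l.strip_prefix("*@") {
                GroupEntry::DomainWildcard(domain.to_string())
            } else {
                GroupEntry::Email(l.to_string())
            }
        })
        .collect()
}

fn main() {
    let entries = parse_group("// team\nuser@example.com\n*@company.com\ninclude:other_group\n");
    assert_eq!(entries.len(), 3);
    assert_eq!(entries[2], GroupEntry::Include("other_group".into()));
}
```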
ACL files (read.acl, write.acl)
admins
editors
One group name per line.
Link Syntax
Page links
[text](page_name) # Same collection
[text](collection:page) # Cross-collection
Image links
 # Same collection
 # Cross-collection
Include directives
!!include page_name
!!include collection:page_name
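Include expansion can be sketched as a single pass that replaces each directive line with the referenced page's content; the non-recursive lookup, the `expand_includes` name, and the broken-include placeholder are simplifying assumptions:

```rust
use std::collections::HashMap;

// Illustrative expansion of `!!include` directives: each directive line is
// replaced by the referenced page's content from a lookup map.
fn expand_includes(page: &str, pages: &HashMap<String, String>) -> String {
    page.lines()
        .map(|line| match line.trim().strip_prefix("!!include ") {
            Some(target) => pages
                .get(target.trim())
                .cloned()
                // Leave a marker if the referenced page cannot be found
                .unwrap_or_else(|| format!("<!-- broken include: {target} -->")),
            None => line.to_string(),
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let mut pages = HashMap::new();
    pages.insert("docs:intro".to_string(), "# Intro".to_string());
    let out = expand_includes("before\n!!include docs:intro\nafter", &pages);
    assert_eq!(out, "before\n# Intro\nafter");
}
```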
Name Normalization
Page and collection names are normalized:
- Convert to lowercase
- Replace
-with_ - Replace
/with_ - Remove
.mdextension - Strip numeric prefix (e.g.,
03_page→page) - Remove special characters
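These rules can be expressed as a small function; this is an illustrative sketch, not the crate's actual implementation:

```rust
// Illustrative implementation of the normalization rules above.
fn normalize_name(raw: &str) -> String {
    let mut name = raw.to_lowercase();
    // Remove the `.md` extension
    if let Some(stripped) = name.strip_suffix(".md") {
        name = stripped.to_string();
    }
    // Replace `-` and `/` with `_`
    name = name.replace('-', "_").replace('/', "_");
    // Strip a numeric prefix (e.g. `03_page` -> `page`)
    if let Some((prefix, rest)) = name.split_once('_') {
        if !prefix.is_empty() && prefix.chars().all(|c| c.is_ascii_digit()) {
            name = rest.to_string();
        }
    }
    // Remove remaining special characters
    name.chars()
        .filter(|c| c.is_ascii_alphanumeric() || *c == '_')
        .collect()
}

fn main() {
    assert_eq!(normalize_name("03_My-Page.md"), "my_page");
    assert_eq!(normalize_name("Sub/Dir-Page"), "sub_dir_page");
}
```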
Supported Image Extensions
.png, .jpg, .jpeg, .gif, .svg, .webp, .bmp, .tiff, .ico
Service Management with Zinit
Hero Books can run as a Zinit-managed service with automatic restart, health checks, and port management.
Starting as a Zinit Service
# Start web server as Zinit-managed service
books_server --port 8883 --start
# Start with custom books directory
books_server --port 8883 --books-dir ./books --start
# Multi-instance support
books_server --port 8883 --start --instance prod
books_server --port 9568 --start --instance dev
Service Management
Once started with --start, services are managed by Zinit:
# View service status
zinit status books_server
# View service logs
zinit logs books_server
# Stop service
zinit stop books_server
# Restart service
zinit restart books_server
# Multi-instance commands
zinit status books_server_prod
zinit logs books_server_dev
Service Features
- Automatic Restart: Service restarts on failure with 5s delay
- Health Checks: TCP port health checks every 10s
- Max Restarts: Up to 5 restart attempts before stopping
- Logging: Full log history available via `zinit logs`
- Verification: Defensive self-test to verify successful startup
Error Handling
If service startup fails, Zinit will:
- Attempt TCP connection to verify port binding
- Check service state and PID
- Display recent logs on failure
- Clean up failed service registration
Detailed error messages provide diagnostic information:
- Port already in use
- Binary path incorrect
- Zinit server not running
- Permission denied
Development
Code Quality
This project maintains high code quality standards:
- Dead Code Cleanup: Unused code is either removed or marked with `#[allow(dead_code)]` with a clear justification:
  - `flatten_chapter_pages()` - Utility function kept for testing
  - `classify_topic()` - Public API method reserved for future use
  - `embeddings_from_cache` - Field used for statistics reporting
- Compiler Warnings: All compiler warnings in the `hero_books` crate are resolved; any remaining warnings come from external dependencies
Building
# Build release binaries
make build
# Build with debug info (dev mode)
cargo build
# Run tests
make test
# Run all tests including integration tests
make test-all
# Generate documentation
cargo doc --no-deps --open
# Check for compiler warnings
cargo check
Testing
# Run all tests
cargo test
# Run specific module tests
cargo test doctree
cargo test website
# Run with output
cargo test -- --nocapture
Code Organization
src/
├── lib.rs # Library exports
├── main.rs # Web server entry point (books_server binary)
├── bin/
│ ├── books.rs # CLI entry point (books_client binary)
│ └── cli.rs # Legacy CLI entry point
├── cli/ # CLI commands and handlers
├── doctree/ # Document management
├── ebook/ # Ebook parsing
├── ontology/ # AI-powered semantic extraction
├── vectorsdk/ # Vector search and embeddings
├── publishing/ # Publishing configuration
├── book/ # Book and PDF processing
├── web/ # HTTP API routes and handlers
└── website/ # Website configuration
Adding New Features
- New DocTree functionality: Add to src/doctree/
- New Website config: Add to src/website/
- New CLI commands: Add to src/cli/mod.rs
- New API endpoints: Add to src/web/mod.rs
Library Usage
Ontology Processing
The ontology processor uses AI to classify documents and extract semantic concepts/relationships.
use hero_books::ontology::{OntologyProcessor, ProcessorConfig, ONTOLOGIES};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Create processor with default config
let processor = OntologyProcessor::new();
let document = "Our SaaS product integrates with Slack...";
// Classification only (quick)
let matches = processor.classify(document).await?;
for m in &matches {
println!("{}: {} (primary: {})", m.topic, m.score, m.is_primary);
}
// Full processing (classification + extraction)
let result = processor.process(document).await?;
for sem in &result.semantics {
println!("{}: {} concepts, {} relationships",
sem.category, sem.concepts.len(), sem.relationships.len());
}
// Direct extraction for specific topics
let semantics = processor.extract(document, &["product", "technology"]).await?;
Ok(())
}
Available Topics: business, technology, product, commercial, people, news, legal, financial, health, education
Configuration:
let config = ProcessorConfig {
confidence_threshold: Some(8), // Min score to consider (default: 7)
max_topics: Some(3), // Limit topics processed
filter_topics: Some(vec!["product".into(), "technology".into()]),
temperature: Some(0.0), // LLM temperature
max_input_tokens: Some(60_000), // Chunk if larger
..Default::default()
};
let processor = OntologyProcessor::with_config(config);
Requirements: Source your env file with at least one API key set: GROQ_API_KEY (preferred), SAMBANOVA_API_KEY, or OPENROUTER_API_KEY.
See examples/src/ontology_processing.rs for a complete example.
DocTree
use doctree::{DocTree, ExportArgs};
fn main() -> doctree::Result<()> {
// Create and scan
let mut doctree = DocTree::new("mydocs");
doctree.scan(Path::new("/path/to/docs"), &[])?;
doctree.init_post()?; // Validate links
// Access pages
let page = doctree.page_get("collection:page")?;
let content = page.content()?;
// Export
doctree.export(ExportArgs {
destination: PathBuf::from("/tmp/books"),
reset: true,
include: false,
})?;
Ok(())
}
DocTreeClient (for reading exports)
use doctree::DocTreeClient;
fn main() -> doctree::Result<()> {
let client = DocTreeClient::new(Path::new("/tmp/books"))?;
// List collections
let collections = client.list_collections()?;
// Get page content
let content = client.get_page_content("collection", "page")?;
// Check existence
if client.page_exists("collection", "page") {
println!("Page exists!");
}
Ok(())
}
Directory Structure
hero_books/
├── examples/
│ ├── collections/ # Local test collections
│ │ └── hero_books_docs/ # Basic demo collection
│ ├── ebooks/ # Book definitions (TOML)
│ └── rhai/ # Demo scripts
├── src/
│ ├── lib.rs # Library entry point and module declarations
│ ├── main.rs # Web server binary entry point
│ ├── bin/
│ │ ├── books.rs # CLI binary entry point (books_client)
│ │ └── cli.rs # Legacy CLI entry point
│ ├── cli/ # CLI commands and handlers
│ ├── doctree/ # Document tree management
│ ├── ebook/ # Ebook parsing
│ ├── ontology/ # AI-powered semantic extraction
│ ├── vectorsdk/ # Vector search and embeddings
│ ├── publishing/ # Publishing configuration
│ ├── book/ # Book and PDF processing
│ ├── web/ # HTTP API routes and handlers
│ └── website/ # Website configuration
├── crates/
│ └── books-client/ # Rust client library for the API
├── Cargo.toml # Package configuration
├── Makefile # Build automation (make help to see all targets)
├── build.rs # Build script
├── README.md # This file
├── openrpc.json # OpenRPC 1.3.2 API specification
└── target/
├── debug/ # Debug builds
└── release/
├── books_client # CLI binary
└── books_server # Web server binary