Stateless OpenAI-compatible embeddings and reranking daemon using BGE ONNX models.

mik-tf 9d3c8fe67f fix(ml): disable the CPU arena and memory pattern for ONNX sessions. The default arena allocator never returns memory to the OS and grows with every new peak batch shape, and the memory-pattern cache accumulates per input shape. On the long-running shared engine that combination grew RSS past 10 GB over weeks until the kernel OOM killed the process. Non-arena allocation keeps RSS flat for a small per-request cost; memory patterns are pointless with dynamic batch sizes anyway. Signed-by: mik-tf <mik-tf@noreply.invalid>		2026-06-11 21:03:21 -04:00
.forgejo/workflows	fix(ci): drop --lib from gnu test (server crate is bin-only)	2026-06-11 15:27:41 +00:00
crates	fix(ml): disable the CPU arena and memory pattern for ONNX sessions	2026-06-11 21:03:21 -04:00
schema	chore: migrate to hero_lifecycle, herolib_derive/openrpc, and canonical service.toml keys	2026-05-31 23:28:54 +02:00
.gitignore	feat: initial scaffold — stateless OpenAI-compatible embedder daemon	2026-05-26 14:59:23 +02:00
Cargo.lock	fix(build): bump hero_rpc lock to df2f8b1	2026-06-02 16:53:11 +02:00
Cargo.toml	chore: bump rust-version to 1.96, drop hero_rpc_derive patch, update deps	2026-06-01 13:20:47 +02:00
Cargo.toml.hero_builder_backup	chore: switch to baso_info! macro, herolib_derive/openrpc, and patch hero_rpc_derive locally	2026-06-01 09:30:58 +02:00
LICENSE	feat: initial scaffold — stateless OpenAI-compatible embedder daemon	2026-05-26 14:59:23 +02:00
README.md	feat: add admin dashboard, reranker, status/mem_info/ort_info RPC methods, and install helper	2026-05-29 10:09:37 +02:00
rust-toolchain.toml	chore: switch to baso_info! macro, herolib_derive/openrpc, and patch hero_rpc_derive locally	2026-06-01 09:30:58 +02:00

README.md

hero_embedder_provider

Stateless OpenAI-compatible embeddings + reranking daemon backed by BGE ONNX models. No auth, no sessions, no database — models are loaded once at startup and kept in memory.

What it does

Embeddings — four BGE quality levels (INT8 fast → FP32 best), 384d or 768d
Reranking — BGE cross-encoder reranker scores (query, doc) pairs
OpenAI-compatible REST on rest.sock + TCP 8092 for drop-in client use
Hero-native JSON-RPC 2.0 on rpc.sock with full status / mem_info introspection
Admin dashboard on admin.sock — live status, embed/rerank playground, performance benchmarks

Crates

Crate	Role
`hero_embedder_provider_lib`	ONNX embedder + reranker, download, install
`hero_embedder_provider_server`	Axum server — REST, RPC, TCP 8092
`hero_embedder_provider_admin`	Admin dashboard web UI
`hero_embedder_provider_sdk`	Generated OpenRPC client for Rust consumers

Quality levels

Four quality levels are exposed on the embed call and as REST model IDs. Select by quality (1–4) in JSON-RPC, or by model in the REST API:

Level	Model ID	BGE model	Precision	Dimensions	Max tokens	Use case
Q1	`bge-small-q1`	bge-small-en-v1.5	INT8	384	128	Fast, low memory
Q2	`bge-small-q2`	bge-small-en-v1.5	FP32	384	256	Balanced
Q3	`bge-base-q3`	bge-base-en-v1.5	INT8	768	256	High quality
Q4	`bge-base-q4`	bge-base-en-v1.5	FP32	768	512	Best accuracy
—	`bge-reranker-base`	bge-reranker-base	FP32	n/a	512	Cross-encoder rerank

OpenAI alias IDs are also accepted on the REST endpoint:

text-embedding-3-small → bge-small-q2
text-embedding-3-large → bge-base-q4
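
The quality table and the alias list above can be captured in a small client-side lookup. This is a minimal sketch, not the server's actual code: the `QualitySpec` struct and the `resolve_model` and `spec_for_quality` functions are hypothetical names introduced for illustration, and only the values come from the table.

```rust
/// One embedding row of the quality table (hypothetical client-side type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct QualitySpec {
    pub level: u8,
    pub model_id: &'static str,
    pub precision: &'static str,
    pub dimensions: u16,
    pub max_tokens: u16,
}

/// The four embedding levels, copied from the table. The reranker is left out
/// because it has no quality level and no output dimension.
const LEVELS: [QualitySpec; 4] = [
    QualitySpec { level: 1, model_id: "bge-small-q1", precision: "int8", dimensions: 384, max_tokens: 128 },
    QualitySpec { level: 2, model_id: "bge-small-q2", precision: "fp32", dimensions: 384, max_tokens: 256 },
    QualitySpec { level: 3, model_id: "bge-base-q3", precision: "int8", dimensions: 768, max_tokens: 256 },
    QualitySpec { level: 4, model_id: "bge-base-q4", precision: "fp32", dimensions: 768, max_tokens: 512 },
];

/// Resolve a REST model ID or an OpenAI alias to its table row.
pub fn resolve_model(name: &str) -> Option<QualitySpec> {
    // Map the two documented aliases onto their local model IDs first.
    let canonical = match name {
        "text-embedding-3-small" => "bge-small-q2",
        "text-embedding-3-large" => "bge-base-q4",
        other => other,
    };
    LEVELS.iter().copied().find(|spec| spec.model_id == canonical)
}

/// Resolve the JSON-RPC `quality` parameter (1 to 4) to its table row.
pub fn spec_for_quality(quality: u8) -> Option<QualitySpec> {
    LEVELS.iter().copied().find(|spec| spec.level == quality)
}
```

A caller could use `resolve_model("text-embedding-3-large")` to learn that vectors will be 768-dimensional before allocating storage for them.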

Bind address

Interface	Address
TCP	`127.0.0.1:8092` (Linux: mycelium overlay if up)
REST socket	`$PATH_SOCKETS/hero_embedder_provider/rest.sock`
RPC socket	`$PATH_SOCKETS/hero_embedder_provider/rpc.sock`
Admin socket	`$PATH_SOCKETS/hero_embedder_provider/admin.sock`

On Linux the TCP listener probes the local mycelium daemon at startup; if a mycelium overlay address is found it binds there so other nodes on the overlay can reach it directly.
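
A client that talks to the Unix sockets needs to build the paths from the table above. The sketch below assumes `PATH_SOCKETS` is an ordinary environment variable holding the sockets root; the `Socket` enum and both functions are hypothetical helpers, not part of the SDK.

```rust
use std::path::{Path, PathBuf};

/// The three Unix sockets listed in the bind-address table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Socket {
    Rest,
    Rpc,
    Admin,
}

/// Build `<root>/hero_embedder_provider/<name>.sock` for a given sockets root.
pub fn socket_path(sockets_root: &Path, socket: Socket) -> PathBuf {
    let file = match socket {
        Socket::Rest => "rest.sock",
        Socket::Rpc => "rpc.sock",
        Socket::Admin => "admin.sock",
    };
    sockets_root.join("hero_embedder_provider").join(file)
}

/// Same as `socket_path`, reading the root from the PATH_SOCKETS environment
/// variable (assumed). Returns None when the variable is unset.
pub fn socket_path_from_env(socket: Socket) -> Option<PathBuf> {
    let root = std::env::var_os("PATH_SOCKETS")?;
    Some(socket_path(Path::new(&root), socket))
}
```

With a root of `/run/hero` (an example value only), `socket_path` yields `/run/hero/hero_embedder_provider/rpc.sock` for `Socket::Rpc`.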

REST API (OpenAI-compatible)

Method	Path	Description
`POST`	`/v1/embeddings`	Compute dense vectors for one or more inputs
`GET`	`/v1/models`	List available local model IDs
`GET`	`/health`	`{"status": "ok"}` once models are ready, `{"status": "starting"}` before that

Embeddings request

POST /v1/embeddings
Content-Type: application/json

{ "model": "bge-small-q1", "input": "Hello world" }

Batch input and the text-embedding-3-* aliases work the same way.
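
For clients that want to reach rest.sock without an HTTP library, the request above can be framed by hand. This is a sketch using only the standard library: it assumes batch input uses the OpenAI convention of a JSON array in `input`, and `json_string`, `embeddings_body`, `embeddings_request` and `post_embeddings` are illustrative names. It is Unix-only and returns the raw HTTP response, headers included.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::path::Path;

/// Encode a Rust string as a JSON string literal (quotes, backslashes and
/// control characters escaped).
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Build the JSON body for POST /v1/embeddings with one or more inputs.
pub fn embeddings_body(model: &str, inputs: &[&str]) -> String {
    let items: Vec<String> = inputs.iter().map(|s| json_string(s)).collect();
    format!("{{\"model\":{},\"input\":[{}]}}", json_string(model), items.join(","))
}

/// Frame the body as a minimal HTTP/1.1 request.
pub fn embeddings_request(body: &str) -> String {
    format!(
        "POST /v1/embeddings HTTP/1.1\r\nHost: localhost\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        body.len(),
        body
    )
}

/// Send the request over the REST Unix socket and read until the server
/// closes the connection. Parsing the response is left to the caller.
pub fn post_embeddings(socket: &Path, body: &str) -> std::io::Result<String> {
    let mut stream = UnixStream::connect(socket)?;
    stream.write_all(embeddings_request(body).as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}
```

In a real client an HTTP crate with Unix-socket support would normally replace the hand-written framing; the sketch only shows what goes over the wire.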

JSON-RPC 2.0 API (Hero-native)

All methods are served on rpc.sock at POST /rpc.

Method	Params	Description
`health`	—	Service status + `models_ready` boolean
`status`	—	Download progress, model load progress, models root path
`models.list`	—	List locally available model IDs
`mem_info`	—	Process RSS, system RAM, per-model disk sizes
`ort_info`	—	ONNX Runtime version, path, min-version check
`embed`	`texts: string[]`, `quality?: 1–4`	Compute embeddings; preserves INT8 quantization scale
`rerank`	`query`, `docs: [{id,text}]`, `top_k?`	BGE cross-encoder rerank

`embed` example

{ "jsonrpc": "2.0", "id": 1, "method": "embed",
  "params": [["Hello world", "Machine learning"], 1] }

Response includes embeddings, precision (int8/fp32), dimensions, quality, and model name. INT8 embeddings carry a scale field for dequantisation.
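
The README does not spell out the quantisation scheme, so the following is a sketch under a stated assumption: symmetric linear quantisation with one scale per vector, where each float is recovered as `q * scale`. The function names are illustrative and the response-parsing step is omitted.

```rust
/// Recover approximate f32 values from an INT8 embedding.
/// ASSUMPTION: symmetric linear quantisation, value = q * scale.
pub fn dequantise(quantised: &[i8], scale: f32) -> Vec<f32> {
    quantised.iter().map(|&q| f32::from(q) * scale).collect()
}

/// Cosine similarity between two vectors of equal length.
/// Returns None for mismatched lengths, empty input, or a zero vector.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

Because cosine similarity is invariant to a positive rescaling of either vector, the scale matters mainly when INT8 vectors are mixed with other arithmetic such as dot products or Euclidean distance.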

`rerank` example

{ "jsonrpc": "2.0", "id": 2, "method": "rerank",
  "params": ["what is ML?",
             [{"id":"d1","text":"ML is..."},{"id":"d2","text":"Weather..."}],
             5] }

Returns an array of {id, score} sorted by descending score.
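
The ordering contract described above (descending score, optionally cut to `top_k`) can be reproduced client-side, for example when merging results from several rerank calls. This is a minimal sketch; the `Hit` struct and `top_hits` function are hypothetical and say nothing about how the server implements it.

```rust
/// One rerank result, mirroring the documented `{id, score}` shape.
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub id: String,
    pub score: f32,
}

/// Sort hits by descending score and keep at most `top_k` of them.
/// `total_cmp` gives f32 a total order, so a NaN score cannot panic the sort.
pub fn top_hits(mut hits: Vec<Hit>, top_k: Option<usize>) -> Vec<Hit> {
    hits.sort_by(|a, b| b.score.total_cmp(&a.score));
    if let Some(k) = top_k {
        hits.truncate(k);
    }
    hits
}
```

Passing `None` for `top_k` keeps every hit, and a `top_k` larger than the number of hits leaves the list unchanged.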

Health during startup

Model loading takes seconds when the models are already on disk (CPU-only) and up to minutes on a cold first run that has to download them. The Unix sockets and the TCP listener are bound immediately at startup and return {"status": "starting"} until models are ready, so hero_proc probes never flap. During this window the status RPC method exposes per-file download progress and per-model load progress.
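
A client that starts alongside the daemon can wait for the `starting` to `ok` transition instead of failing on the first request. The sketch below is one way to do that with capped exponential backoff; the `Health` enum and `wait_until_ready` are hypothetical, and the probe closure stands in for whatever call fetches `/health` or the `health` RPC method.

```rust
use std::time::Duration;

/// The two documented health states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Starting,
    Ok,
}

/// Call `probe` until it reports `Health::Ok` or `max_attempts` is exhausted.
/// The delay doubles after each failed attempt, capped at `max_delay`.
/// A probe returning None (for example a connection error) counts as not ready.
/// Returns the 1-based attempt number that succeeded, or None on timeout.
pub fn wait_until_ready<P>(
    mut probe: P,
    max_attempts: u32,
    initial_delay: Duration,
    max_delay: Duration,
) -> Option<u32>
where
    P: FnMut() -> Option<Health>,
{
    let mut delay = initial_delay;
    for attempt in 1..=max_attempts {
        if probe() == Some(Health::Ok) {
            return Some(attempt);
        }
        // Do not sleep after the final attempt.
        if attempt < max_attempts {
            std::thread::sleep(delay);
            delay = (delay * 2).min(max_delay);
        }
    }
    None
}
```

Since a cold first run can take minutes, the attempt count and delay cap should be chosen to cover the download, not just the model load.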

ONNX Runtime

ONNX Runtime 1.25.0+ is required. At startup the server:

1. Checks for an existing installation (hero lib dir → Homebrew → system paths).
2. If not found, downloads the official GitHub release archive automatically.

To override detection, set ORT_DYLIB_PATH to the full path of the ONNX Runtime dylib before starting the server. The ort_info RPC method reports the detected path and version.
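
A minimum-version check like the one `ort_info` reports has to compare version components numerically, since a plain string comparison would rank "1.9.0" above "1.25.0". The following is a sketch of such a check, not the server's actual logic; `parse_version`, `meets_minimum` and `dylib_override` are illustrative names, and only the 1.25.0 minimum and the `ORT_DYLIB_PATH` variable come from the text above.

```rust
use std::path::PathBuf;

/// Minimum ONNX Runtime version stated in the README.
pub const MIN_ORT: (u32, u32, u32) = (1, 25, 0);

/// Parse "major.minor" or "major.minor.patch" into a numeric triple.
/// A missing patch component is treated as 0; anything else is rejected.
pub fn parse_version(s: &str) -> Option<(u32, u32, u32)> {
    let mut parts = s.trim().split('.');
    let major: u32 = parts.next()?.parse().ok()?;
    let minor: u32 = parts.next()?.parse().ok()?;
    let patch: u32 = parts.next().unwrap_or("0").parse().ok()?;
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}

/// True when `version` parses and is at least MIN_ORT.
/// Tuples compare lexicographically, which matches semantic version ordering
/// for plain numeric triples.
pub fn meets_minimum(version: &str) -> bool {
    parse_version(version).map_or(false, |v| v >= MIN_ORT)
}

/// Read the explicit override, if the user set one.
pub fn dylib_override() -> Option<PathBuf> {
    std::env::var_os("ORT_DYLIB_PATH").map(PathBuf::from)
}
```

Pre-release suffixes such as "1.25.0-rc1" are rejected by this parser; a production check would likely use a dedicated version-parsing crate.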

Build

cargo build --workspace --release

Use from Rust

use hero_embedder_provider_sdk::EmbedderProviderClient;

let client = EmbedderProviderClient::connect().await?;

// Embed at Q1 (fast INT8)
let resp = client.embed(vec!["hello world".into()], Some(1)).await?;
println!("{}d {}", resp.dimensions, resp.precision);

// Rerank
let hits = client.rerank(
    "what is ML?".into(),
    vec![("d1".into(), "ML is...".into()), ("d2".into(), "Weather".into())],
    Some(5),
).await?;

Or point any OpenAI client at http://127.0.0.1:8092/v1 with no auth.
