Stateless OpenAI-compatible embeddings and reranking daemon using BGE ONNX models.
  • Rust 66.2%
  • HTML 33.2%
  • CSS 0.4%
  • Makefile 0.2%
Find a file
mik-tf 9d3c8fe67f
All checks were successful
lab release / release (push) Successful in 27m7s
fix(ml): disable the CPU arena and memory pattern for ONNX sessions
The default arena allocator never returns memory to the OS and grows
with every new peak batch shape, and the memory-pattern cache
accumulates per input shape. On the long-running shared engine that
combination grew RSS past 10 GB over weeks until the kernel OOM killed
the process. Non-arena allocation keeps RSS flat for a small
per-request cost; memory patterns are pointless with dynamic batch
sizes anyway.

Signed-by: mik-tf <mik-tf@noreply.invalid>
2026-06-11 21:03:21 -04:00
.forgejo/workflows fix(ci): drop --lib from gnu test (server crate is bin-only) 2026-06-11 15:27:41 +00:00
crates fix(ml): disable the CPU arena and memory pattern for ONNX sessions 2026-06-11 21:03:21 -04:00
schema chore: migrate to hero_lifecycle, herolib_derive/openrpc, and canonical service.toml keys 2026-05-31 23:28:54 +02:00
.gitignore feat: initial scaffold — stateless OpenAI-compatible embedder daemon 2026-05-26 14:59:23 +02:00
Cargo.lock fix(build): bump hero_rpc lock to df2f8b1 2026-06-02 16:53:11 +02:00
Cargo.toml chore: bump rust-version to 1.96, drop hero_rpc_derive patch, update deps 2026-06-01 13:20:47 +02:00
Cargo.toml.hero_builder_backup chore: switch to baso_info! macro, herolib_derive/openrpc, and patch hero_rpc_derive locally 2026-06-01 09:30:58 +02:00
LICENSE feat: initial scaffold — stateless OpenAI-compatible embedder daemon 2026-05-26 14:59:23 +02:00
README.md feat: add admin dashboard, reranker, status/mem_info/ort_info RPC methods, and install helper 2026-05-29 10:09:37 +02:00
rust-toolchain.toml chore: switch to baso_info! macro, herolib_derive/openrpc, and patch hero_rpc_derive locally 2026-06-01 09:30:58 +02:00

hero_embedder_provider

Stateless OpenAI-compatible embeddings + reranking daemon backed by BGE ONNX models. No auth, no sessions, no database — models are loaded once at startup and kept in memory.

What it does

  • Embeddings — four BGE quality levels (INT8 fast → FP32 best), 384d or 768d
  • Reranking — BGE cross-encoder reranker scores (query, doc) pairs
  • OpenAI-compatible REST on rest.sock + TCP 8092 for drop-in client use
  • Hero-native JSON-RPC 2.0 on rpc.sock with full status / mem_info introspection
  • Admin dashboard on admin.sock — live status, embed/rerank playground, performance benchmarks

Crates

Crate Role
hero_embedder_provider_lib ONNX embedder + reranker, download, install
hero_embedder_provider_server Axum server — REST, RPC, TCP 8092
hero_embedder_provider_admin Admin dashboard web UI
hero_embedder_provider_sdk Generated OpenRPC client for Rust consumers

Quality levels

Four quality levels are exposed on the embed call and as REST model IDs. Select by quality (14) in JSON-RPC, or by model in the REST API:

Level Model ID BGE model Precision Dimensions Max tokens Use case
Q1 bge-small-q1 bge-small-en-v1.5 INT8 384 128 Fast, low memory
Q2 bge-small-q2 bge-small-en-v1.5 FP32 384 256 Balanced
Q3 bge-base-q3 bge-base-en-v1.5 INT8 768 256 High quality
Q4 bge-base-q4 bge-base-en-v1.5 FP32 768 512 Best accuracy
bge-reranker-base bge-reranker-base FP32 n/a 512 Cross-encoder rerank

OpenAI alias IDs are also accepted on the REST endpoint:

  • text-embedding-3-smallbge-small-q2
  • text-embedding-3-largebge-base-q4

Bind address

Interface Address
TCP 127.0.0.1:8092 (Linux: mycelium overlay if up)
REST socket $PATH_SOCKETS/hero_embedder_provider/rest.sock
RPC socket $PATH_SOCKETS/hero_embedder_provider/rpc.sock
Admin socket $PATH_SOCKETS/hero_embedder_provider/admin.sock

On Linux the TCP listener probes the local mycelium daemon at startup; if a mycelium overlay address is found it binds there so other nodes on the overlay can reach it directly.

REST API (OpenAI-compatible)

Method Path Description
POST /v1/embeddings Compute dense vectors for one or more inputs
GET /v1/models List available local model IDs
GET /health `{"status": "ok"

Embeddings request

POST /v1/embeddings
Content-Type: application/json

{ "model": "bge-small-q1", "input": "Hello world" }

Batch input and the text-embedding-3-* aliases work the same way.

JSON-RPC 2.0 API (Hero-native)

All methods are served on rpc.sock at POST /rpc.

Method Params Description
health Service status + models_ready boolean
status Download progress, model load progress, models root path
models.list List locally available model IDs
mem_info Process RSS, system RAM, per-model disk sizes
ort_info ONNX Runtime version, path, min-version check
embed texts: string[], quality?: 14 Compute embeddings; preserves INT8 quantization scale
rerank query, docs: [{id,text}], top_k? BGE cross-encoder rerank

embed example

{ "jsonrpc": "2.0", "id": 1, "method": "embed",
  "params": [["Hello world", "Machine learning"], 1] }

Response includes embeddings, precision (int8/fp16), dimensions, quality, and model name. INT8 embeddings carry a scale field for dequantisation.

rerank example

{ "jsonrpc": "2.0", "id": 2, "method": "rerank",
  "params": ["what is ML?",
             [{"id":"d1","text":"ML is..."},{"id":"d2","text":"Weather..."}],
             5] }

Returns an array of {id, score} sorted by descending score.

Health during startup

Model loading takes seconds (CPU-only) to minutes (cold first-run download). Both sockets and TCP are bound immediately at startup and return {"status": "starting"} until models are ready — hero_proc probes never flap. The status RPC method exposes per-file download progress and per-model load progress during this window.

ONNX Runtime

ONNX Runtime 1.25.0+ is required. At startup the server:

  1. Checks for an existing installation (hero lib dir → Homebrew → system paths).
  2. If not found, downloads the official GitHub release archive automatically.

To override: set ORT_DYLIB_PATH to the full path of the dylib before starting. The ort_info RPC method reports the detected path and version.

Build

cargo build --workspace --release

Use from Rust

use hero_embedder_provider_sdk::EmbedderProviderClient;

let client = EmbedderProviderClient::connect().await?;

// Embed at Q1 (fast INT8)
let resp = client.embed(vec!["hello world".into()], Some(1)).await?;
println!("{}d {}", resp.dimensions, resp.precision);

// Rerank
let hits = client.rerank(
    "what is ML?".into(),
    vec![("d1".into(), "ML is...".into()), ("d2".into(), "Weather".into())],
    Some(5),
).await?;

Or point any OpenAI client at http://127.0.0.1:8092/v1 with no auth.