Documentation

Get started with RigRun

Everything you need to deploy, configure, and integrate.

Quick Start

Up and running in three steps

01

System Requirements

Minimum (small models)

NVIDIA GPU with 24GB+ VRAM, 64GB system RAM, Ubuntu 22.04+, RHEL 8+ / Rocky Linux 8+, or Windows 11.

Recommended (122B model + KV compression)

RTX PRO 6000 Blackwell (96GB), 128GB RAM, Ubuntu 24.04.

Supported backends: vLLM, llama.cpp, Ollama.
02

Installation

Single static binary. Download, configure config.toml, run. No Docker is required for the base deployment; Docker is optional and used only for the code-interpreter sandbox.

Companion apps available: RigRun Desktop (Electron, pre-release) and RigRun Mobile (Flutter, not yet released).

$ ./rigrun serve
03

First Query

OpenAI-compatible API at /v1/chat/completions.

Drop-in replacement — change your base URL, keep your existing code. Supports streaming (SSE), tool calling, and structured output.


Topology

What's running on the box

A full RigRun deployment is several cooperating services on different ports. The Go server is the front door; the rest are optional but recommended. Auxiliary models run on CPU to keep VRAM free for the main inference engine.

:8787  RigRun — Go server: chat, tools, research, memory, audio, training control
:3100  Pyros — pure-Go safety engine: 7-pillar pipeline, optional but recommended
:8096  Rolling Memory — infinite-context proxy: speaks OpenAI + Anthropic, per-session disk store
:8081  llama-embed — CPU embedding service (Qwen3-Embedding-4B)
:8082  llama-rerank — CPU reranker (Qwen3-Reranker-0.6B)
:8083  llama-classify — CPU classifier for routing (Qwen3-0.6B)
:8084  vision — vision model (Qwen3-VL-4B), idle-unloaded to free VRAM

Memory layer

Rolling Memory v4 — infinite context proxy

A drop-in OpenAI/Anthropic-compatible reverse proxy on port 8096 that gives any LLM effectively unlimited context via per-session disk-backed verbatim, summary, and embedding tiers. Latency stays flat with depth because retrieval keeps each forward pass roughly the same size regardless of total context length.

10M characters tested · 2.5M tokens tested · 100% needle recall · ~10s latency at depth

Verified 2026-04-11: needle-in-haystack 100% pass through 10M chars (8.9–9.1s), 5/5 fact consistency, 10/10 multi-turn retrieval, 17/17 smoke tests passed. Includes per-CWD session isolation, dashboard at /memory/ui, Prometheus metrics, and a built-in /search endpoint backed by SearxNG.
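The tiering idea can be sketched as a toy model. This is an illustration of the concept only, not Rolling Memory's actual implementation: the budgets and the summary placeholders below are arbitrary stand-ins for the real per-session disk-backed tiers. The point is that each forward pass packs a fixed budget of recent verbatim turns plus retrieved summaries, so prompt size stays constant however long the session grows.

```python
# Toy sketch of tiered context assembly: constant prompt size at any depth.
# Illustrative only -- budgets and tier policy are arbitrary, not RigRun's values.

VERBATIM_BUDGET = 4    # most recent turns kept word-for-word
SUMMARY_BUDGET = 3     # older turns represented by summaries

def assemble_prompt(history: list[str]) -> list[str]:
    """Pack a bounded prompt from an unbounded history."""
    verbatim = history[-VERBATIM_BUDGET:]
    older = history[:-VERBATIM_BUDGET]
    # Stand-in for the summary/embedding tiers: in the real proxy these are
    # retrieved from a per-session disk store by relevance, not taken by recency.
    summaries = [f"[summary of turn {i}]" for i in range(len(older))][-SUMMARY_BUDGET:]
    return summaries + verbatim

short = assemble_prompt([f"turn {i}" for i in range(5)])
long = assemble_prompt([f"turn {i}" for i in range(5000)])
assert len(long) <= VERBATIM_BUDGET + SUMMARY_BUDGET
print(len(short), len(long))
```

Because the assembled prompt is bounded, the backend model does roughly the same amount of work per turn, which is why latency stays flat with depth.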

Reference

API Reference

Each service listens on its own port. Authentication uses an Authorization: Bearer header. RigRun exposes ~260 routes in total; selected groups are shown below.

RigRun

http://localhost:8787
POST /v1/chat/completions Chat inference (OpenAI-compatible)
POST /v1/messages Chat inference (Anthropic-compatible)
GET /v1/models List available models
POST /v1/audio/transcribe Audio transcription
POST /v1/tools/{name}/execute Tool execution with approval handshake
GET /v1/memory/search Semantic memory search
POST /v1/research/sessions Agentic research session (4-phase A-RAG)
GET /metrics Prometheus-format metrics

Pyros Safety Engine

http://localhost:3100
GET /v1/health Engine health + backend probe
GET /v1/layers Pillar status and per-layer state
GET /v1/stats Engine statistics, vitals, throughput
GET /v1/signals Active signal blackboard (gate signals)
GET /v1/breakers Circuit breaker states
GET /metrics Prometheus-format metrics

Rolling Memory v4

http://localhost:8096
POST /v1/chat/completions OpenAI-compatible chat with infinite context
POST /v1/messages Anthropic-compatible chat with infinite context
GET /memory/session/{id}/summaries Per-session summary tier
GET /memory/session/{id}/export Export full session for training/audit
DELETE /memory/session/{id} Wipe a session (cache + disk)
GET /memory/ui Live dashboard with semantic search
GET /search Built-in web search (SearxNG-backed)

Example Request

Streaming chat completion:
curl http://localhost:8787/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $RIGRUN_API_KEY" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
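With stream: true, the response arrives as Server-Sent Events, one JSON chunk per data: line, terminated by a [DONE] sentinel. A minimal parser, assuming the standard OpenAI chunk shape (which the drop-in compatibility claim above implies, though the sample stream here is constructed, not captured from RigRun):

```python
import json

def parse_sse(stream_text: str) -> str:
    """Collect content deltas from an OpenAI-style SSE stream into one string."""
    out = []
    for line in stream_text.splitlines():
        if not line.startswith("data: "):
            continue                      # skip blank keep-alives and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":           # end-of-stream sentinel
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        out.append(delta.get("content", ""))  # first chunk may carry only a role
    return "".join(out)

# Constructed sample stream in the standard OpenAI chunk format:
sample = "\n".join([
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
])
print(parse_sse(sample))  # Hello
```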

Setup

Configuration

All configuration lives in a single config.toml file. Sensible defaults ship out of the box — most deployments only need to set the local backend URL and model name.

[routing]
default_mode, max_tier, paranoid_mode, offline_mode

Controls how requests are routed between local models and optional cloud fallback. Set offline_mode = true to enforce zero external traffic. paranoid_mode enables full request/response auditing.

[local]
ollama_url, ollama_model

Points to your local inference backend. Supports Ollama, llama.cpp (OpenAI-compatible server), and vLLM. The model name here is what the router resolves when a request arrives.

[cloud]
openrouter_key, default_model

Optional cloud fallback for when local inference is unavailable or the request exceeds local model capability. Disabled by default. Only activates if explicitly configured and offline_mode is false.

[security]
classification, audit_enabled, session_timeout, mfa

Set the baseline classification level (UNCLASSIFIED, CUI, SECRET). When audit_enabled is true, every request and response is logged to an append-only audit trail. MFA gates the desktop app and API key management.

[training]
enabled, schedule, methods, scoring

Controls the self-improving overnight pipeline. Schedule accepts cron syntax. 11 training methods are supported (DPO, SimPO, ORPO, KTO, CPO, GRPO, IPO, SPPO, RPO, AERO, iterative DPO-VP), plus ZO2 Ultimate, a zeroth-order optimizer with 12 advanced techniques (sparse MeZO, HiZOO Hessian, curriculum, cosine epsilon annealing, cross-night EMA, etc.). Training pairs are dual-sourced from rolling-memory and legacy sessions, then scored three ways: by a Skywork reward model (runs locally, no external service dependency), by a sandboxed CodeRL+ execution scorer, and by the 122B model itself acting as judge.
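Putting the sections above together, a minimal config.toml might look like the following. The values are illustrative only: the key names come from the section summaries above, but the value formats, defaults, and mode names are assumptions.

```toml
[routing]
default_mode = "local"    # mode names are assumed; see your shipped defaults
offline_mode = true       # enforce zero external traffic
paranoid_mode = false     # full request/response auditing off

[local]
ollama_url = "http://localhost:11434"
ollama_model = "my-local-model"   # placeholder; use your pulled model name

[security]
classification = "UNCLASSIFIED"
audit_enabled = true

[training]
enabled = true
schedule = "0 2 * * *"    # cron: nightly at 02:00
```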

Internals

Pyros pipeline architecture

Pyros is the standalone safety engine that runs alongside RigRun (or any other LLM server) on port 3100. Every request flows through 7 pillars in middleware order — first listed is outermost wrapper. The pipeline is variable-width: easy queries skip expensive pillars based on Oracle's gate signals.

Pyros 7-pillar pipeline

I Oracle Predictive intel — UCB1 cascade routing, gate signals to skip easy work
II Tribunal Adversarial verification — 6-feature hallucination probe, courtroom debate
III Fortress Pre-inference defense — prompt injection blocks at conf ≥ 0.7, SmoothLLM, AIS detectors
IV MindEye Metacognition — knowledge graph injection, isotonic calibration, QualityScore
V Forge Self-improvement — EvoPrompt evolution, MDL prompt compression, DICE exemplar capture
VI Crucible Operational safety — circuit breakers, OTel spans, stigmergic blackboard
VII Singularity Homeostatic regulation — PID over runtime vitals, superposition, digital twin
LLM Inference Token generation via the configured backend (RigRun, vLLM, llama.cpp, OpenAI, Anthropic)
Response passes back through all pillars in reverse for post-processing

Note: Pyros runs as a separate process. RigRun has its own 11-layer Go middleware stack (CORS → Auth → SessionTimeout → JWT → Trace → Metrics → RateLimit → Logging → MaxBody → SecurityHeaders → Recovery) that runs before the request is forwarded to Pyros. The two engines are independent — either can run without the other.
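The onion ordering used by both stacks (request enters the first-listed layer, response unwinds back through the layers in reverse) can be sketched generically. The layer names below are illustrative; this is not RigRun's actual Go middleware, just the wrapping pattern it describes:

```python
# Generic onion-middleware sketch: the request passes through layers
# outermost-first, and the response passes back through them in reverse.
# Layer names are hypothetical, not RigRun's actual middleware.

calls = []

def make_layer(name, inner):
    def layer(req):
        calls.append(f"{name}:pre")      # request path (outer -> inner)
        resp = inner(req)
        calls.append(f"{name}:post")     # response path (inner -> outer)
        return resp
    return layer

def inference(req):
    calls.append("inference")            # token generation happens at the core
    return f"response to {req}"

handler = inference
for name in reversed(["auth", "ratelimit", "logging"]):  # first listed = outermost
    handler = make_layer(name, handler)

handler("hello")
print(calls)
```

Running it shows auth seeing the request first and the response last, which is exactly the "response passes back through all pillars in reverse" behavior described above.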

7 Pyros pillars · 11 RigRun middleware layers · ≥99% adversarial block rate · 0 external dependencies

Extensibility

Creating Custom Agents

Agents are created through the Agent Factory: describe what you need in plain English, and a 14-step pipeline handles design, testing, and deployment. Key steps include:

01

System prompt synthesis

Multi-pass generation with independent critique. A second model reviews and hardens the prompt before deployment.

02

Tool selection from typed registry

The factory auto-selects from a typed tool registry based on the agent's domain requirements. Each tool has input/output schemas and usage examples.

03

Knowledge base assembly

Vector store with semantic chunking. Documents are ingested, chunked by semantic boundaries (not token count), embedded, and indexed in ChromaDB.
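Chunking by semantic boundaries rather than token count can be sketched as splitting on paragraph breaks and merging undersized fragments forward. This is a simplified stand-in for whatever boundary detector the factory actually uses; real semantic chunking would also score topic shifts and headings:

```python
def semantic_chunks(text: str, min_chars: int = 40) -> list[str]:
    """Split on paragraph boundaries, merging fragments under min_chars.

    Simplified stand-in: boundaries here are blank lines, not learned
    semantic breaks, and min_chars is an arbitrary threshold.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    for para in paragraphs:
        if chunks and len(chunks[-1]) < min_chars:
            chunks[-1] = chunks[-1] + "\n\n" + para   # merge tiny fragment forward
        else:
            chunks.append(para)
    return chunks

doc = ("Short intro.\n\n"
       "A much longer paragraph that stands on its own as a chunk because "
       "it exceeds the minimum size.\n\n"
       "Another standalone paragraph with enough content to be indexed separately.")
for c in semantic_chunks(doc):
    print(len(c), repr(c[:30]))
```

Each resulting chunk is then embedded and indexed; keeping chunks aligned to meaning rather than a fixed token window is what makes retrieval return coherent passages.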

04

Adversarial stress testing

Five probe types: prompt injection, hallucination induction, scope violation, confidentiality extraction, and output format corruption.

05

Quality gate

Human approval is required before any agent goes live. Automated scoring requires a 100% happy-path pass rate and at least a 50% edge-case pass rate.
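The automated portion of the gate reduces to a simple check. A sketch with hypothetical field names, using the thresholds stated above (100% happy-path, at least 50% edge-case):

```python
def quality_gate(happy_passed: int, happy_total: int,
                 edge_passed: int, edge_total: int) -> bool:
    """Automated gate: 100% happy-path pass rate, >= 50% edge-case pass rate.

    Hypothetical signature. Human approval is still required even when
    this check passes.
    """
    happy_rate = happy_passed / happy_total
    edge_rate = edge_passed / edge_total
    return happy_rate == 1.0 and edge_rate >= 0.5

assert quality_gate(20, 20, 6, 10) is True    # meets both thresholds
assert quality_gate(19, 20, 10, 10) is False  # any happy-path failure blocks
print("gate checks pass")
```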

Help

Support

Founding Access

Direct line to the builder. Three founding seats, hand-selected. Roadmap input, locked-in pricing when paid tiers open, and your conversations preserved for your own eventual local copy.

License Customers

Pricing finalizes after the Founding Access program closes. Every RigRun license includes email support, configuration assistance, and upgrade guidance.

Enterprise

Dedicated technical support, on-premise installation assistance, and custom SLAs available. Backed by experience designing multi-node and classified environments.