// 05 · Agentic framework

How fast is the pipeline runner?

This page covers tool dispatch overhead, RAG retrieval latency, model-call wrapper cost, human-in-the-loop (HITL) round-trip distribution, and step-to-step pipeline orchestration. Everything is reproducible from a fresh clone via `pytest benches/`, following the same convention as the engine bench.

Engine vs framework

This page measures the Python agentic framework: the runner that orchestrates pipeline steps, dispatches scoped tool calls, manages RAG retrieval, gates writes through HITL, and wraps model calls. For the in-house Rust trading engine (state-cache writes at 310 ns, full pipeline at 14 µs), see engine latency.

What the runner gives you

// governance, not just speed

The latency below proves the runner is lean. These are the guarantees that make it safe to run agents for real clients. Ten are measured on this page; the rest are how the platform is built.

Scoped, governed tools

A crew only sees the tools you grant it. Ungranted tools never enter the model's schema, so permissions are the dispatch primitive, not an afterthought.

0.6 µs
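A minimal sketch of grant-scoped dispatch, assuming a hypothetical `ScopedRegistry` class (none of these names come from the actual codebase). It illustrates the idea that one grant set drives both the schema shown to the model and the permission check at call time, so an ungranted tool can be neither described nor invoked.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ScopedRegistry:
    """Hypothetical registry: tools are visible and callable only if granted."""

    tools: dict[str, Callable[..., Any]] = field(default_factory=dict)
    grants: dict[str, set[str]] = field(default_factory=dict)  # crew -> tool names

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self.tools[name] = fn

    def grant(self, crew: str, name: str) -> None:
        if name not in self.tools:
            raise KeyError(f"unknown tool: {name}")
        self.grants.setdefault(crew, set()).add(name)

    def schema_for(self, crew: str) -> list[str]:
        # Only granted tool names are ever exposed to the model.
        return sorted(self.grants.get(crew, set()))

    def dispatch(self, crew: str, name: str, **kwargs: Any) -> dict[str, Any]:
        # The same grant set gates the call itself.
        if name not in self.grants.get(crew, set()):
            raise PermissionError(f"{crew} has no grant for {name}")
        return {"tool": name, "result": self.tools[name](**kwargs)}


# Illustrative tools and crew name, not real platform identifiers.
registry = ScopedRegistry()
registry.register("get_quote", lambda symbol: {"symbol": symbol})
registry.register("create_order", lambda **params: {"accepted": True})
registry.grant("research_crew", "get_quote")
```

Here `research_crew` can call `get_quote`, while `create_order` is absent from its schema and raises `PermissionError` if dispatched.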

Human-in-the-loop writes

Every write passes through the enforcement gate (reactive-watcher state check, per-cycle write cap, per-tenant quota, cost cap), then waits for operator approval. Reads flow freely; writes are gated.

0.3 µs
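The gate described above could be sketched roughly as follows. `GateState`, `check_write`, and `submit_write` are hypothetical names, and the limits are placeholder values. The sketch covers only the synchronous checks; the human approval wait happens after the write is queued.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateState:
    """Hypothetical snapshot of the limits checked before a write is queued."""

    watcher_halted: bool = False      # reactive sidecar asked to stop the run
    writes_this_cycle: int = 0
    max_writes_per_cycle: int = 5     # placeholder limit
    tenant_writes_today: int = 0
    tenant_daily_quota: int = 100     # placeholder limit
    spent_usd: float = 0.0
    cost_cap_usd: float = 10.0        # placeholder limit


def check_write(state: GateState) -> Optional[str]:
    """Return a rejection reason, or None if the write may be queued."""
    if state.watcher_halted:
        return "halted_by_watcher"
    if state.writes_this_cycle >= state.max_writes_per_cycle:
        return "cycle_write_cap"
    if state.tenant_writes_today >= state.tenant_daily_quota:
        return "tenant_quota"
    if state.spent_usd >= state.cost_cap_usd:
        return "cost_cap"
    return None


def submit_write(state: GateState, approval_queue: list, payload: dict) -> str:
    """Run the synchronous checks, then park the write for operator approval."""
    reason = check_write(state)
    if reason is not None:
        return f"rejected:{reason}"
    state.writes_this_cycle += 1
    state.tenant_writes_today += 1
    approval_queue.append(payload)  # the human wait starts after this point
    return "pending_approval"
```

Keeping the checks as plain comparisons on already-loaded state is one way such a gate could stay in the sub-microsecond range.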

Per-workflow RAG

Each workflow gets its own isolated vector store with hybrid retrieval, so one client's documents never bleed into another's context.

0.28 ms
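To make the isolation idea concrete, here is a small sketch assuming a hypothetical `WorkflowStores` class. It uses dense-only brute-force cosine search in pure Python, not the hybrid retrieval or the production index mentioned above, and embeddings are passed in directly rather than computed.

```python
import heapq
import math
from collections import defaultdict


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class WorkflowStores:
    """Hypothetical per-workflow vector stores with brute-force top-k search."""

    def __init__(self) -> None:
        # workflow id -> list of (chunk_text, embedding)
        self._stores: dict[str, list[tuple[str, list[float]]]] = defaultdict(list)

    def add(self, workflow: str, text: str, embedding: list[float]) -> None:
        self._stores[workflow].append((text, embedding))

    def search(self, workflow: str, query: list[float], k: int = 5) -> list[str]:
        # Only this workflow's chunks are scanned, at O(N * D) cost per query.
        chunks = self._stores.get(workflow, [])
        scored = ((_cosine(query, emb), text) for text, emb in chunks)
        return [text for _, text in heapq.nlargest(k, scored)]
```

Because a query never touches another workflow's list, a chunk added under `client_b` cannot appear in results for `client_a`, whatever its similarity.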

Bring-your-own-model

20+ providers behind one wrapper (Anthropic, OpenAI, Gemini, Mistral, DeepSeek, Qwen, plus local Ollama and LM Studio), with one consistent shape across all.

1.6 µs

Cost & token accounting

Every model call is priced against a per-model table and aggregated into a running USD total, so you can bill clients and cap spend per tenant.

0.4 µs
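The accounting step amounts to a table lookup and two additions. A minimal sketch, assuming a hypothetical `CostTracker` and an invented price table (the model names and rates below are placeholders, not real provider prices):

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical prices in USD per million tokens: (input, output). Not real rates.
PRICE_TABLE = {
    "model-small": (0.50, 1.50),
    "model-large": (3.00, 15.00),
}


@dataclass
class CostTracker:
    total_usd: float = 0.0
    per_model_usd: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        """Price one call, update the running totals, and return its cost."""
        in_price, out_price = PRICE_TABLE[model]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        self.total_usd += cost
        self.per_model_usd[model] += cost
        return cost

    def over_cap(self, cap_usd: float) -> bool:
        # A spend cap is a comparison against the running total.
        return self.total_usd >= cap_usd
```

One tracker instance per tenant would give both the per-tenant bill and the value a spend cap is compared against.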

Full observability

An OpenTelemetry span per tool call, model invocation, and pipeline run, carrying cost, tokens, latency, and error reasons. Operate agents you can actually see.

0.3 µs

Static context assembly

The system prompt, granted knowledge docs, and tool schemas are packed into the context block each turn sends the model, kept separate from rolling history.

1.4 µs

Cross-run crew memory

A crew's working memory persists between runs and restores on the next one, so long-running agents keep their context across sessions.

53 µs
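A save-and-restore round trip of this kind can be sketched with the standard library alone. `InProcessSessionStore` is a hypothetical name, and JSON is an assumed serialization format; the source does not say which one the runner uses.

```python
import json


class InProcessSessionStore:
    """Hypothetical in-memory store; a real deployment would add a database."""

    def __init__(self) -> None:
        self._blobs: dict[str, str] = {}

    def save(self, crew_id: str, memory: list[dict]) -> None:
        # Serialize so the stored copy is independent of the live object.
        self._blobs[crew_id] = json.dumps(memory)

    def load(self, crew_id: str) -> list[dict]:
        # An unknown crew starts with empty memory.
        blob = self._blobs.get(crew_id)
        return json.loads(blob) if blob is not None else []
```

Serializing on save means later mutations of the live memory do not leak into the stored snapshot, which is the property a cross-run restore relies on.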

Agentic crews

Multi-persona crews (macro, technical, risk, execution) hand context persona to persona, with a risk veto and reactive sidecars that can halt the chain mid-run.

1.2 µs
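The hand-off with a veto can be pictured as a loop that stops at the first persona declining to continue. This is a sketch under assumptions: `run_crew` and the `Persona` callable shape are hypothetical, and reactive sidecars are left out.

```python
from typing import Callable, Optional

# A persona takes the shared context and returns updates, or None to veto.
Persona = Callable[[dict], Optional[dict]]


def run_crew(personas: list[tuple[str, Persona]], context: dict) -> dict:
    """Pass context persona to persona; stop at the first veto."""
    context = dict(context)  # work on a copy so the caller's dict is untouched
    for name, persona in personas:
        update = persona(context)
        if update is None:
            context["halted_by"] = name
            return context
        context.update(update)
    return context


# Illustrative personas with toy logic; real personas would wrap model calls.
crew = [
    ("macro", lambda ctx: {"bias": "long"}),
    ("technical", lambda ctx: {"entry": 100}),
    ("risk", lambda ctx: None if ctx["entry"] > 50 else {"approved": True}),
    ("execution", lambda ctx: {"order_sent": True}),
]
```

With this toy crew the risk persona vetoes, so the execution persona never runs and the returned context records `halted_by` as `risk`.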

Prompt-injection defense

Anything the agent reads from an untrusted source (retrieved documents, tool results, fetched web pages) is scanned for prompt-injection, jailbreak, and data-exfiltration patterns before the model can act on it. Each pattern carries a severity score, and the total decides the outcome: safe text passes through; mildly suspicious text is still passed but fenced off as data the model must not obey, and the event is logged; a clearly malicious signal, such as an attempt to leak a secret or hijack the conversation format, is dropped before the model ever sees it. How strict that cutoff is can be tuned per deployment.

17 µs
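The allow / flag / block decision described above can be sketched as a weighted pattern match. The patterns, weights, thresholds, and the `<untrusted_data>` fence below are all illustrative assumptions, not the platform's actual rule set.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns and severity weights, for illustration only.
PATTERNS = [
    (re.compile(r"ignore (all )?(previous|prior) instructions", re.I), 3),
    (re.compile(r"you are now\b", re.I), 2),
    (re.compile(r"(reveal|print|send) .{0,40}(api key|secret|password)", re.I), 6),
    (re.compile(r"<\|?(system|assistant)\|?>", re.I), 6),
]


@dataclass
class ScanResult:
    verdict: str  # "allow", "flag", or "block"
    score: int
    text: str     # what may be forwarded to the model ("" when blocked)


def scan(text: str, flag_at: int = 2, block_at: int = 6) -> ScanResult:
    """Sum the weights of matching patterns and map the total to a verdict."""
    score = sum(weight for pattern, weight in PATTERNS if pattern.search(text))
    if score >= block_at:
        return ScanResult("block", score, "")
    if score >= flag_at:
        # Flagged text is still forwarded, but fenced as data, not instructions.
        fenced = f"<untrusted_data>\n{text}\n</untrusted_data>"
        return ScanResult("flag", score, fenced)
    return ScanResult("allow", score, text)
```

Lowering `block_at` makes a deployment stricter: with `block_at=3`, the "ignore previous instructions" pattern alone is enough to drop the input.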

Credential isolation

Per-user encrypted vaults. Agents act through short-lived tickets and never touch raw API keys, so a client's secrets stay scoped to that client.

AES-256

Multi-tenant by design

Project-scoped roles and per-pipeline state isolation. One tenant's crew cannot read, halt, or spend against another's. Run many clients on one platform.

RBAC


What the framework benchmark suite covers

metric	status	p50 · p95 · p99 · n	measured on	note & reproduce
`tool_dispatch (0-arg)` per call	measured	0.60 µs p95 0.60 · p99 1.10 µs · n=10000	2026-06-20	Runner-side cost of dispatching a zero-arg scoped tool: name lookup, ToolResponse wrap. Excludes the tool's own work + any network. Reproduce: `pytest benches/bench_tool_dispatch.py::test_tool_dispatch_0arg -s`
`tool_dispatch (5-arg)` per call	measured	0.70 µs p95 0.70 · p99 0.90 µs · n=10000	2026-06-20	Same path, 5-element input dict, median operator-registered shape. Reproduce: `pytest benches/bench_tool_dispatch.py::test_tool_dispatch_5arg -s`
`tool_dispatch (20-arg)` per call	measured	1.00 µs p95 1.10 · p99 1.30 µs · n=10000	2026-06-20	Wide-arg dispatch, long-tail tools like melaya_create_order with all optional risk params filled in. Reproduce: `pytest benches/bench_tool_dispatch.py::test_tool_dispatch_20arg -s`
`pipeline_step_transition (linear)` per step, 10-step chain	measured	0.22 µs p95 0.23 · p99 0.24 µs · n=2000	2026-06-20	Time from one pipeline step completing to the next being invoked, in a linear chain. Pure runner overhead (graph walk + variable binding + await). Reproduce: `pytest benches/bench_pipeline_orchestration.py::test_pipeline_linear -s`
`pipeline_step_transition (parallel)` per step, 10-step fanout	measured	3.32 µs p95 3.64 · p99 5.88 µs · n=2000	2026-06-20	Same transition cost in a parallel fanout via asyncio.gather. Higher than linear here: at N=10 the gather’s scheduling setup dominates, and it only drops below linear once steps block on real I/O. Reproduce: `pytest benches/bench_pipeline_orchestration.py::test_pipeline_parallel -s`
`registry_boot` per cold boot · register-only	measured	4.36 ms p95 5.22 · p99 6.37 ms · n=30	2026-06-20	The runtime walks its tool + crew modules at boot. The bench measures the introspect+register step on 250 synthetic tools (production adds Python import time on top; this number is register-only). Reproduce: `pytest benches/bench_registry_boot.py -s`
`rag_retrieve (10k chunks)` per query, top-5	measured	281.10 µs p95 447.10 · p99 782.10 µs · n=2000	2026-06-20	embed(query) + brute-force kNN + chunk hydration over a 10k-chunk in-memory index. A production ANN index is 1.5-3× faster. Reproduce: `pytest benches/bench_rag_retrieval.py::test_rag_retrieval_10k -s`
`rag_retrieve (100k chunks)` per query, top-5	measured	5.52 ms p95 8.45 · p99 9.66 ms · n=2000	2026-06-20	Same path, 10× larger corpus. Brute force is O(N·D), so p50 should grow at least ~10× vs the 10k bench; the measured growth is about 20× (5.52 ms vs 281.10 µs). Reproduce: `pytest benches/bench_rag_retrieval.py::test_rag_retrieval_100k -s`
`model_wrapper_overhead` per LLM turn (network mocked)	measured	1.60 µs p95 2.00 · p99 2.80 µs · n=1000	2026-06-20	Runner overhead around a model API call: prompt assembly, message-history pack, post-response routing. Provider HTTP boundary mocked to isolate runner cost from network. Reproduce: `pytest benches/bench_model_wrapper_overhead.py -s`
`context_assembly` per turn	measured	1.40 µs p95 1.60 · p99 1.80 µs · n=5000	2026-06-20	Builds the static context block a turn sends the model: system prompt + granted knowledge docs + tool schemas. Distinct from rolling history (model_wrapper) and RAG retrieval. Reproduce: `pytest benches/bench_context_assembly.py -s`
`session_memory` per save + load	measured	53.00 µs p95 76.30 · p99 113.30 µs · n=5000	2026-06-20	Cross-run working-memory persistence: serialize a 50-turn crew memory to the session store and restore it on the next run. In-process store, so no DB latency is included. Reproduce: `pytest benches/bench_session_memory.py::test_session_memory_roundtrip -s`
`cost_tracking` per model call	measured	0.40 µs p95 0.40 · p99 0.60 µs · n=10000	2026-06-20	Records one model call's token usage against a price table and updates the running USD total plus per-model breakdown. This is what enables per-tenant billing and spend caps. Reproduce: `pytest benches/bench_cost_tracking.py -s`
`tracing_overhead` per span	measured	0.30 µs p95 1.10 · p99 1.40 µs · n=10000	2026-06-20	Per-span observability tax: open an OpenTelemetry-style span, stamp the gen_ai / cost / latency attributes, close, and hand to the exporter. What enabling tracing adds per traced operation. Reproduce: `pytest benches/bench_tracing_overhead.py -s`
`crew_orchestration` per 4-persona run	measured	1.20 µs p95 2.00 · p99 2.10 µs · n=2000	2026-06-20	A 4-persona crew (macro, technical, risk, execution) hands context persona to persona, with the risk persona armed to veto and halt the chain mid-run. Pure orchestration overhead. Reproduce: `pytest benches/bench_crew_orchestration.py -s`
`prompt_injection_scan` per untrusted input	measured	17.40 µs p95 26.40 · p99 30.50 µs · n=10000	2026-06-20	The prompt-injection scan run on untrusted content (RAG-retrieved docs, tool outputs) before it reaches the model: weighted pattern match against injection / jailbreak / exfiltration markers, then allow / flag / block. Wired into rag.py and the tool-output postprocess. Reproduce: `pytest benches/bench_prompt_injection.py -s`
`hitl_gate_overhead` per write attempt	measured	0.30 µs p95 0.40 · p99 0.40 µs · n=10000	2026-06-20	The synchronous safety checks run before every write is queued for approval: sidecar-state read (a reactive watcher that can halt a run), per-cycle write cap, per-tenant daily quota, running cost cap. This measures the trading-grade discipline machinery only, separately from the human wait below. Reproduce: `pytest benches/bench_hitl_gate_overhead.py -s`
`hitl_approval_round_trip` human-bound	method only	method documented	n/a	Time from 'approval requested' to 'approval received', median over real operator sessions. Dominated by human attention; cannot be benched synthetically. Methodology documented; awaiting a 30-day production telemetry cut. Reproduce: see `results/hitl_round_trip/methodology_only.json`
`concurrent_agent_executions` platform limit	config	50	n/a	Configurable per-workspace cap on simultaneous agent runs (default 50); backpressure queues the rest. A deployment config knob, not a measurement. Reproduce: n/a (configured in deployment, not benched)
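For readers who want to sanity-check the percentile columns, the following is a generic timing harness, not the code under `benches/`. The `bench` function and its defaults are assumptions for illustration. It times each call individually with `time.perf_counter_ns` and derives p50, p95, and p99 from the samples.

```python
import statistics
import time
from typing import Callable


def bench(fn: Callable[[], object], n: int = 10_000, warmup: int = 1_000) -> dict:
    """Time n calls of fn individually and report percentiles in microseconds."""
    for _ in range(warmup):
        fn()  # warm caches before sampling
    samples_ns = []
    for _ in range(n):
        start = time.perf_counter_ns()
        fn()
        samples_ns.append(time.perf_counter_ns() - start)
    # quantiles(n=100) returns 99 cut points: index 49 is p50, 94 is p95, 98 is p99.
    cuts = statistics.quantiles(samples_ns, n=100)
    return {
        "p50_us": cuts[49] / 1000,
        "p95_us": cuts[94] / 1000,
        "p99_us": cuts[98] / 1000,
        "n": n,
    }
```

Calling `bench(lambda: None)` gives a rough floor for the harness itself. On platforms with a coarse timer, sub-microsecond samples may be quantized to the timer's resolution, which is worth keeping in mind when reading values near the bottom of the table.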

Hardware-tier expectations

tier	hardware	config	tool_dispatch p50	pipeline_step p50	registry_boot	rag_retrieval_10k p50
A. Production	Xeon Plat 8369B (Ice Lake-SP)	Ubuntu 22.04, pinned core, perf gov, py3.12	awaiting	awaiting	awaiting	awaiting
B. Modern Linux server	Xeon Gold 6438 / EPYC 9354	Ubuntu 22/24, perf gov, py3.12	3-8 µs	10-25 µs	1-3 s	0.5-2 ms
C. Apple Silicon	M2 / M3 / M4 MacBook	macOS 14+, arm64, py3.12	2-6 µs	8-20 µs	0.8-2 s	0.4-1.5 ms
D. Modern desktop *	i9-13900H (Raptor Lake-H)	Win11, py3.12.4, unpinned	0.6-1.0 µs	0.22 µs	4.4 ms	0.28 ms

Memory footprint & capacity

scenario	isolated memory driver	peak RSS · commit	status
`Idle floor` s0_idle	runner floor + drift / orphan gate	40 MB 637 commit	measured
`LLM-agnostic baseline` s1_baseline	orchestration heap, real dispatch path	40 MB 637 commit	measured
`RAG (Qdrant) 10k` s2_rag_qdrant	Qdrant index in-runner; embedder remote	1254 MB 1925 commit	measured
`RAG 100k + doc ingest` s2b_rag_100k	doc-extraction transient peak	2056 MB 2926 commit	measured
`web_search fast path` s3a_websearch_fast	curl_cffi TLS, no browser (~0)	52 MB 647 commit	measured
`web_search stealth rescue` s3b_websearch_browser	Chromium tree (bimodal)	610 MB 1064 commit	measured
`aiml HF tool` s_aiml	in-process torch (the one local model load)	740 MB 2158 commit	measured
`WSS ingress watcher` s11_wss	in-runner ring buffers	45 MB 642 commit	measured
`python_repl code-exec` s12_coderepl	child interpreter + sci-stack (peak)	173 MB 2016 commit	measured
`remotion render` s13_render	Node + 2nd render-Chromium (peak)	910 MB 2032 commit	measured
`MCP stdio servers` s15_mcp	per-server child process	92 MB 680 commit	measured
`long-context compaction` s16_compaction	saw-tooth history; reclaim proof	42 MB 640 commit	measured
`streaming assembly` s17_streaming	SSE / token accumulation	41 MB 638 commit	measured
`huge tool output` s18_tooloutput	multi-MB transient (peak)	40 MB 642 commit	measured
`concurrency validation` s_conc	gates the capacity fit	140 MB 2534 commit	measured

Melaya — Build AI agents for any job. Agentic platform for research, ops, outreach, reporting — and the only one where agents can actually trade.
