Delfhos
Explanation

Understanding-oriented discussions of Delfhos concepts, design decisions, and internal architecture. Read this to build a mental model — not to accomplish a specific task.

The Two-Package Architecture

Delfhos is split into two Python packages: delfhos/ (the public API) and cortex/ (the internal engine). Users only ever import from delfhos.

Layer 1 — Public API: Agent · Chat · Memory · @tool
Layer 2 — Orchestrator: Code Generation · Error Recovery · Approval Gates
Layer 3 — Connections & Tool Runtime: Sandboxed Python · Security Gates · REST APIs
External Systems: LLMs · Provider APIs · External REST APIs
delfhos/ — Public API

Everything you import lives here. The Agent class is the primary entry point; Chat and Memory are pure data structures that you pass in. This layer is intentionally thin: it validates inputs, sets up types, and delegates to cortex.

cortex/ — Internal Engine

Contains the orchestrator, LLM integration, tool execution sandbox, approval manager, connection implementations, and the OpenAPI compiler for REST API tools. Users never import from cortex directly.
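
A minimal sketch of the import boundary, using the names from the layer diagram above (the exact export list, including the import name of the @tool decorator, is an assumption):

python
# All user-facing names come from the delfhos package (layer 1).
from delfhos import Agent, Chat, Memory, tool

# The internal engine is never imported in user code:
# from cortex import orchestrator   # not part of the public API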

How the Orchestration Loop Works

When you call agent.run(), Delfhos executes a deterministic 8-step pipeline. Understanding each step helps you reason about latency, cost, and failure points.

1. Memory Retrieval

If a Memory is attached, run semantic search against stored facts. Top-k facts are injected into the system prompt.

2. Tool Prefiltering

If enable_prefilter=True, the light_llm reads the task and available tools, then selects the relevant subset, reducing context size by ~60%.

3. Schema Loading (SQL only)

If a SQL connection is in the selected tools, the actual table schemas are fetched from the database and included in the code generation prompt.

4. Code Generation

The heavy_llm (or code_llm) receives the system prompt, memory facts, chat history, task, and tool API docs. It responds with a Python code block.

5. Approval Gate

If any tool has confirm=True and the generated code calls that action, execution pauses. The agent waits until a human approves or rejects.

6. Sandboxed Execution

Generated code runs in an isolated execution environment. If Docker is available, the code runs in a disposable container with outbound network access blocked, a read-only filesystem, memory/CPU caps, and zero Linux capabilities. All tool calls are proxied back to the host over a local RPC bridge — API credentials never enter the container.

7. Retry Loop

If execution raises an exception, the error is fed back to the LLM for corrected code generation. Repeats up to retry_count times.

8. Result Composition

Final output is collected. Token counts and cost are calculated. Result is added to Chat history (if enabled). A Response object is returned.
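
Most of these steps map directly onto parameters that appear elsewhere on this page. A hedged configuration sketch (the placement of memory= and confirm= is an assumption):

python
from delfhos import Agent, Memory, SQL, Gmail

agent = Agent(
    tools=[
        SQL(...),                   # step 3: table schemas fetched for SQL connections
        Gmail(..., confirm=True),   # step 5: pauses for approval on Gmail actions
    ],
    memory=Memory(...),             # step 1: semantic retrieval before each run
    enable_prefilter=True,          # step 2: light_llm narrows the tool set
    retry_count=3,                  # step 7: up to 3 corrected regenerations
)

response = agent.run("Email last week's signups to the growth team")  # steps 4, 6, 8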

Which steps are skippable?

  • Memory Retrieval: skipped when no Memory instance is mounted.
  • Prefiltering: skipped when enable_prefilter=False or fewer than 4 tools are registered.
  • Approval Gate: skipped when the generated code does not invoke any confirm-listed operation.
  • Error Recovery: skipped when the code executes without raising an exception.

How Multi-LLM Routing Reduces Costs

Delfhos allows you to assign different LLMs to different tasks in the execution pipeline. By using a cheap fast model for tool routing and a powerful model for code generation, you can reduce overall cost by 2–3×.

The core insight

1. Routing

Which tools are relevant to this task? This is simple work. A cheap, fast model can handle it in milliseconds.

2. Generation

What code should those tools execute? This requires deep reasoning. A powerful model is needed to handle complexity, edge cases, and error recovery.

By splitting these across two models, you pay frontier prices only for the work that needs it. The result: 60% fewer tokens in code generation, which translates to 2–3× lower cost for multi-tool agents.

Practical example: optimized multi-tool agent

python
from delfhos import Agent, SQL, Gmail, Sheets, Drive, WebSearch

agent = Agent(
    tools=[SQL(...), Gmail(...), Sheets(...), Drive(...), WebSearch(...)],
    light_llm="gemini-3.1-flash-lite-preview",
    heavy_llm="gpt-5.4",
    enable_prefilter=True,
)
# • light_llm processes all 5 tools, selects 2 relevant ones
# • heavy_llm only sees the 2 selected tools + full schemas
# • Heavy model prompt is ~70% smaller
# • Overall cost is 60–70% lower

Typical cost savings

Single LLM (10 tools)
text
light_llm: $0.00
heavy_llm: $0.020 (all 10 tools in prompt)
Total: $0.020

Dual-LLM (10 tools)
text
light_llm: $0.0005 (route 10 → 3)
heavy_llm: $0.006 (only 3 tools in prompt)
Total: $0.0065
Savings: ~68% per task. The extra light_llm call is cheap enough to pay for itself by reducing the heavy_llm prompt 3–4×.

How Memory Retrieval Works

Delfhos uses semantic search — not keyword search — to retrieve relevant facts. The embedding model runs locally; no API call is required.

Chat — Session Memory

Chat is an ordered list of (role, content) pairs. Cleared when the Python process exits.

  • Stored in RAM — zero I/O overhead
  • Preserves exact wording of prior turns
  • Suitable for chatbots, interactive CLIs
Memory — Persistent Store

Memory stores facts with sentence-transformer vector embeddings in a local SQLite database. Before each run, the top_k most semantically similar entries are retrieved and silently prepended to the prompt context.

  • Persists across Python sessions
  • Retrieved by semantic similarity, not recency
  • Suitable for agents with domain knowledge

How memory retrieval works

  1. The user task is encoded as a vector using the same sentence-transformer model used during storage.
  2. The database performs a cosine similarity search against all stored fact vectors.
  3. The top_k most similar facts are formatted as a context block and injected at the start of the LLM prompt.
  4. After a successful run, the agent may write new facts learned during execution back to the store.
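
The list above is a standard embed-and-rank recipe. A sketch of the same steps using the sentence-transformers library directly (illustrative only; the model name is an assumption and this is not Delfhos's internal code):

python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")        # assumed local embedding model
facts = [
    "Invoices live in the 'billing' schema.",
    "The sales team reports in UTC-5.",
    "Refunds above $500 need manager approval.",
]
fact_vecs = model.encode(facts)                        # computed once, at storage time

task = "Summarise last month's invoices"
task_vec = model.encode(task)                          # 1. encode the task
scores = util.cos_sim(task_vec, fact_vecs)[0]          # 2. cosine similarity vs every stored fact
top_k = 2
best = scores.argsort(descending=True)[:top_k]         # 3. keep the top_k most similar facts
context = "\n".join(facts[int(i)] for i in best)       # injected ahead of the LLM prompt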

How the Approval System Works

Approval is configured per-connection and operates at three levels: interactive terminal, custom callback, and programmatic API.

1 — Interactive

Default mode. When the agent pauses, a terminal prompt shows the tool name, action, and parameters. Developer selects Approve or Reject.

2 — Custom callback

on_confirm=fn lets you integrate with Slack, email, or a web dashboard. Return True to approve, False to reject, or None to fall back to the interactive prompt.

3 — Programmatic

Use run_async(), then poll get_pending_approvals() and call approve() or reject() from your web handler.

What the LLM sees: When a request is rejected, the rejection reason is fed back to the LLM as context so it can revise its approach.
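
A hedged sketch of the callback and programmatic modes (on_confirm, run_async, get_pending_approvals, approve, and reject are named above; the request-object fields and where each call lives are assumptions):

python
from delfhos import Agent, Gmail

# Level 2 (custom callback): return True to approve, False to reject,
# or None to defer to the interactive prompt.
def confirm_small_sends(request):
    # request.action / request.params are illustrative, not a documented schema
    if request.action == "send" and len(request.params.get("to", [])) < 5:
        return True
    return None

agent = Agent(tools=[Gmail(..., confirm=True)], on_confirm=confirm_small_sends)

# Level 3 (programmatic): start the run without blocking, then approve or reject
# from e.g. a web handler (whether run_async must be awaited is not specified here).
job = agent.run_async("Send the weekly digest to subscribers")
for pending in agent.get_pending_approvals():
    agent.approve(pending)   # or agent.reject(pending); the rejection reason is fed back to the LLM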

How Tool Code Generation Works

Delfhos does not use function-calling APIs. Instead, the LLM writes a short Python script and Delfhos executes it in a sandbox.

Composability

The LLM can write loops, conditionals, and multi-step logic combining multiple tools in a single generated script.

Transparency

The generated code is human-readable and can be inspected or logged. Set verbose=True to print it.

Retry with context

When code fails, the full error traceback is fed back to the LLM, which often generates a correct fix on the next attempt.

Sandbox isolation

Only the registered tool objects are in scope. Dangerous builtins, filesystem access, and arbitrary network calls are blocked.
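
To make "generated code" concrete, here is the kind of script the model might produce for a task like "remind every customer with an overdue invoice". It is entirely illustrative: the sql.query and gmail.send signatures are assumptions, and the script only runs inside the sandbox, where the registered tool objects (sql, gmail) are already in scope:

python
# Illustrative generated script; only registered tool objects are visible here.
overdue = sql.query("SELECT email, name, amount FROM invoices WHERE due_date < CURRENT_DATE")

for row in overdue:
    if row["amount"] > 0:                     # loops and conditionals compose freely
        gmail.send(
            to=row["email"],
            subject="Payment reminder",
            body=f"Hi {row['name']}, you have an outstanding balance of ${row['amount']:.2f}.",
        )

result = f"Sent reminders to {len(overdue)} customers"   # how the final output is collected is an assumption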

How the Execution Sandbox Works

Delfhos uses a layered sandbox with a pluggable backend. The SandboxExecutor selects the strongest isolation available at runtime.

Local backend — process-level

Used when Docker is not available. Generated code is compiled and exec()'d inside the host Python process with restricted __builtins__, import whitelist, and timeout enforcement via asyncio.wait_for().
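
In essence this backend is a restricted exec() bounded by a timeout. A generic sketch of that technique (not Delfhos's actual source; the allowlist contents and builtin set are illustrative):

python
import asyncio

ALLOWED_IMPORTS = {"json", "math", "datetime", "re"}       # illustrative allowlist

def _guarded_import(name, *args, **kwargs):
    if name.split(".")[0] not in ALLOWED_IMPORTS:
        raise ImportError(f"import of {name!r} is blocked in the sandbox")
    return __import__(name, *args, **kwargs)

async def run_locally(code: str, tool_scope: dict, timeout: float = 30.0) -> dict:
    # Restricted builtins: only what generated code legitimately needs.
    safe_builtins = {"len": len, "range": range, "print": print, "__import__": _guarded_import}
    scope = {"__builtins__": safe_builtins, **tool_scope}   # only registered tools are visible

    def _exec() -> dict:
        exec(compile(code, "<agent_code>", "exec"), scope)
        return scope

    # Timeout via asyncio.wait_for; note that the worker thread is not forcibly killed on timeout.
    return await asyncio.wait_for(asyncio.to_thread(_exec), timeout=timeout)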

Docker backend — container-level

Used when Docker is available. Generated code runs in a disposable container with full OS-level isolation. API credentials never enter the container. Tool calls travel over a TCP bridge (host.docker.internal) to the host process — the container receives only the serialized result. Outbound network access is blocked at the code level: the import allowlist rejects urllib, requests, httpx, socket, and similar modules.

text
┌─────────────────── Host Process ────────────────────────┐
│                                                          │
│  Orchestrator ──▶ SandboxExecutor ──▶ DockerSandbox     │
│       ↑                                      │          │
│       │                    TCP (RPC bridge)  │          │
│       │               host.docker.internal   │          │
│       │                                      ▼          │
│       │                  ┌── Docker Container ──────┐   │
│       │                  │  exec(agent_code)         │   │
│       │◄─── tool result  │  proxy.gmail.send() ─────┼───┘
│       │                  │  Code-level net block     │
│       │                  │  Read-only filesystem     │
│       │                  │  512 MB RAM cap           │
│       │                  └───────────────────────────┘
└──────────────────────────────────────────────────────────┘

How APITool Works

APITool connects any REST API to a Delfhos agent through a five-stage pipeline.

1. Compilation (always, no LLM)

The OpenAPICompiler reads the OpenAPI 3.x spec, resolves all $ref pointers, and transforms every operation into a Delfhos-native tool entry with a Python function signature, parameter descriptions, and compressed API docs.

2. LLM Enrichment (optional — enrich=True)

After compilation, an LLM rewrites endpoint descriptions to be more actionable and infers response schemas for undocumented endpoints. The enriched manifest is cached — on subsequent runs the LLM is never called again.

3. Registration

Compiled entries are registered into three internal stores: TOOL_REGISTRY (for the prefilter LLM), TOOL_ACTION_SUMMARIES (for prefilter ranking), and COMPRESSED_API_DOCS (for code generation prompts).

4. Execution

The APIExecutor receives calls from generated code and maps every argument to the correct HTTP location (path, query, header, or body). Before sending, it injects headers, params, and path_params.

5. Background Schema Sampling (optional — sample=True, default)

After each successful API call, a daemon thread infers the exact response schema from the real data and saves it to sampled_schemas.json. The agent's knowledge of each endpoint's output improves automatically with use — zero tokens, zero latency.
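
Putting the five stages together, a hedged usage sketch (the APITool constructor shape and the spec path are illustrative; enrich and sample are the flags described above):

python
from delfhos import Agent, APITool

crm = APITool(
    spec="specs/crm_openapi.yaml",   # illustrative path to an OpenAPI 3.x spec
    enrich=True,                     # stage 2: one-time LLM rewrite, cached afterwards
    sample=True,                     # stage 5: background response-schema sampling (default)
)

agent = Agent(tools=[crm], enable_prefilter=True)
agent.run("List CRM contacts created this week and flag any without an owner")
# Stage 1 compiled every operation up front; stages 3-5 run automatically as the agent works.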