Understanding-oriented discussions of Delfhos concepts, design decisions, and internal architecture. Read this to build a mental model — not to accomplish a specific task.
The Two-Package Architecture
Delfhos is split into two Python packages: delfhos/ (the public API) and cortex/ (the internal engine). Users only ever import from delfhos.
delfhos/ (the public API): Everything you import lives here. The Agent class is the primary entry point; Chat and Memory are pure data structures that you pass in. This layer is intentionally thin: it validates inputs, sets up types, and delegates to cortex.
cortex/ (the internal engine): Contains the orchestrator, LLM integration, tool execution sandbox, approval manager, connection implementations, and the OpenAPI compiler for REST API tools. Users never import from cortex directly.
How the Orchestration Loop Works
When you call agent.run(), Delfhos executes a deterministic 8-step pipeline. Understanding each step helps you reason about latency, cost, and failure points.
Memory Retrieval
If a Memory is attached, run semantic search against stored facts. Top-k facts are injected into the system prompt.
Tool Prefiltering
If enable_prefilter=True, the light_llm reads the task and available tools, then selects the relevant subset. Reduces context size by ~60%.
Schema Loading (SQL only)
If a SQL connection is in the selected tools, the actual table schemas are fetched from the database and included in the code generation prompt.
Code Generation
The heavy_llm (or code_llm) receives the system prompt, memory facts, chat history, task, and tool API docs. It responds with a Python code block.
Approval Gate
If any tool has confirm=True and the generated code calls that action, execution pauses. The agent waits until a human approves or rejects.
Sandboxed Execution
Generated code runs in an isolated execution environment. If Docker is available, the code runs in a disposable container with a read-only filesystem, memory/CPU caps, zero Linux capabilities, and blocked outbound network access. All tool calls are proxied back to the host process, so API credentials never enter the container.
Retry Loop
If execution raises an exception, the error is fed back to the LLM for corrected code generation. Repeats up to retry_count times.
Result Composition
Final output is collected. Token counts and cost are calculated. Result is added to Chat history (if enabled). A Response object is returned.
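The eight steps above can be condensed into a single loop. Every helper here (memory_search, prefilter, generate, execute) is an illustrative stand-in passed as a callable, not the real Delfhos internals:

```python
def run_pipeline(task, tools, generate, execute, memory_search=None,
                 enable_prefilter=True, prefilter=None, retry_count=3):
    """Illustrative sketch of the 8-step orchestration pipeline."""
    # 1. Memory retrieval: semantic search over stored facts (if mounted)
    facts = memory_search(task) if memory_search else []

    # 2. Tool prefiltering: skipped for small tool sets
    if enable_prefilter and prefilter and len(tools) >= 4:
        tools = prefilter(task, tools)

    # 3. Schema loading for SQL connections would happen here (omitted)

    # 4-7. Code generation + sandboxed execution, wrapped in a retry loop
    error = None
    for attempt in range(retry_count + 1):
        code = generate(task, facts, tools, error)   # 4. code generation
        # 5. the approval gate would pause here if a confirm=True tool is called
        try:
            result = execute(code, tools)            # 6. sandboxed execution
            break                                    # success: exit retry loop
        except Exception as exc:                     # 7. feed the error back
            error = str(exc)
    else:
        raise RuntimeError(f"failed after {retry_count + 1} attempts: {error}")

    # 8. Result composition (token counting and cost omitted)
    return result
```

The retry loop makes the failure mode explicit: each failed attempt feeds its error string back into the next generation call.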
Which steps are skippable?
| Step | Skipped when |
|---|---|
| Memory Retrieval | No Memory instance is mounted. |
| Tool Prefiltering | enable_prefilter=False or fewer than 4 tools registered. |
| Approval Gate | The generated code does not invoke any confirm-listed operation. |
| Retry Loop | Code executes without raising an exception. |
How Multi-LLM Routing Reduces Costs
Delfhos allows you to assign different LLMs to different tasks in the execution pipeline. By using a cheap fast model for tool routing and a powerful model for code generation, you can reduce overall cost by 2–3×.
The core insight
Which tools are relevant to this task? This is simple work; a cheap, fast model can handle it in a fraction of a second at negligible cost.
What code should those tools execute? This requires deep reasoning. A powerful model is needed to handle complexity, edge cases, and error recovery.
By splitting these across two models, you pay frontier prices only for the work that needs it. The result: 60% fewer tokens in code generation, which translates to 2–3× lower cost for multi-tool agents.
Practical example: optimized multi-tool agent
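A sketch of what such a configuration might look like. The constructor parameters (light_llm, heavy_llm, enable_prefilter, retry_count) follow the names used in this document, but the exact signature and model identifiers are assumptions, not verified API:

```python
from delfhos import Agent  # public API package

# Hypothetical configuration; parameter names follow this document,
# but the real constructor signature may differ.
agent = Agent(
    light_llm="gpt-4o-mini",   # cheap, fast: tool prefiltering
    heavy_llm="gpt-4o",        # frontier model: code generation
    enable_prefilter=True,     # prune irrelevant tools before codegen
    retry_count=3,             # error-recovery attempts
)
response = agent.run("Summarize last week's sales by region")
```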
Typical cost savings
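A back-of-envelope calculation shows where the savings come from. All prices and token counts below are illustrative, not measured Delfhos benchmarks:

```python
# Hypothetical per-token prices (dollars per token)
HEAVY = 10.00 / 1_000_000   # frontier model
LIGHT = 0.15 / 1_000_000    # cheap routing model

# Single-model agent: the frontier model reads the full tool docs.
single = 20_000 * HEAVY

# Multi-LLM routing: the cheap model reads the full docs to prefilter,
# and the frontier model sees a ~60%-smaller code-generation prompt.
routed = 20_000 * LIGHT + 8_000 * HEAVY

print(f"{single / routed:.1f}x cheaper")  # → 2.4x cheaper
```

Under these assumed prices the routed setup is roughly 2.4× cheaper; with larger tool catalogs or pricier frontier models the ratio moves toward 3×.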
How Memory Retrieval Works
Delfhos uses semantic search — not keyword search — to retrieve relevant facts. The embedding model runs locally; no API call is required.
Chat is an ordered list of (role, content) pairs. Cleared when the Python process exits.
- Stored in RAM — zero I/O overhead
- Preserves exact wording of prior turns
- Suitable for chatbots, interactive CLIs
Memory stores facts with sentence-transformer vector embeddings in a local SQLite database. Before each run, the top_k most semantically similar entries are retrieved and silently prepended to the prompt context.
- Persists across Python sessions
- Retrieved by semantic similarity, not recency
- Suitable for agents with domain knowledge
How memory retrieval works
1. The user task is encoded as a vector using the same sentence-transformer model used during storage.
2. The database performs a cosine similarity search against all stored fact vectors.
3. The top_k most similar facts are formatted as a context block and injected at the start of the LLM prompt.
4. After a successful run, the agent may write new facts learned during execution back to the store.
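The ranking step can be sketched in a few lines. This is an illustrative stand-in for the SQLite-backed store: real embeddings come from a sentence-transformer model, and the toy 3-dimensional vectors below are invented for the example:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(task_vec, store, top_k=2):
    """Return the top_k facts most similar to the task vector."""
    ranked = sorted(store, key=lambda item: cosine(task_vec, item[1]),
                    reverse=True)
    return [fact for fact, _ in ranked[:top_k]]

store = [
    ("Invoices are due net-30", [0.9, 0.1, 0.0]),
    ("The office dog is named Biscuit", [0.0, 0.1, 0.9]),
    ("Late invoices accrue 2% interest", [0.8, 0.2, 0.1]),
]
# A task vector close to the billing facts retrieves them first,
# regardless of the order they were stored in.
print(retrieve([1.0, 0.0, 0.0], store, top_k=2))
```

Note that recency plays no role: only similarity to the current task decides what gets injected.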
How the Approval System Works
Approval is configured per-connection and operates at three levels: interactive terminal, custom callback, and programmatic API.
Interactive terminal (default). When the agent pauses, a terminal prompt shows the tool name, action, and parameters, and the developer selects Approve or Reject.
Custom callback. on_confirm=fn lets you integrate with Slack, email, or a web dashboard. Return True to approve, False to reject, or None to fall back to the interactive UI.
Programmatic API. Use run_async(), then poll get_pending_approvals() and call approve() or reject() from your web handler.
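The programmatic pattern boils down to a blocking gate that a separate handler releases. This is a minimal illustrative sketch of such a manager, not the Delfhos internals; the method names mirror the ones mentioned above:

```python
import threading

class ApprovalManager:
    """Sketch: execution blocks on an Event until a handler decides."""
    def __init__(self):
        self._pending = {}
        self._lock = threading.Lock()

    def request(self, call_id, tool, action, params):
        """Called from the execution side; blocks until decided."""
        event = threading.Event()
        with self._lock:
            self._pending[call_id] = {"tool": tool, "action": action,
                                      "params": params, "event": event,
                                      "approved": None}
        event.wait()  # execution pauses here
        with self._lock:
            return self._pending.pop(call_id)["approved"]

    def get_pending_approvals(self):
        with self._lock:
            return [(cid, p["tool"], p["action"])
                    for cid, p in self._pending.items()]

    def _decide(self, call_id, approved):
        with self._lock:
            entry = self._pending[call_id]
        entry["approved"] = approved
        entry["event"].set()  # wakes the blocked request()

    def approve(self, call_id):
        self._decide(call_id, True)

    def reject(self, call_id):
        self._decide(call_id, False)
```

A web handler would call get_pending_approvals() on page load and approve() or reject() from the button handlers.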
How Tool Code Generation Works
Delfhos does not use function-calling APIs. Instead, the LLM writes a short Python script and Delfhos executes it in a sandbox.
The LLM can write loops, conditionals, and multi-step logic combining multiple tools in a single generated script.
The generated code is human-readable and can be inspected or logged. Set verbose=True to print it.
When code fails, the full error traceback is fed back to the LLM, which often generates a correct fix on the next attempt.
Only the registered tool objects are in scope. Dangerous builtins, filesystem access, and arbitrary network calls are blocked.
How the Execution Sandbox Works
Delfhos uses a layered sandbox with a pluggable backend. The SandboxExecutor selects the strongest isolation available at runtime.
Used when Docker is not available. Generated code is compiled and exec()'d inside the host Python process with restricted __builtins__, an import allowlist, and timeout enforcement via asyncio.wait_for().
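The local fallback can be sketched as follows. This is an illustrative simplification, not the real SandboxExecutor: the allowlist contents and helper names are invented for the example, and a timeout here cancels the wait rather than killing the worker thread:

```python
import asyncio

ALLOWED_IMPORTS = {"math", "json"}  # illustrative allowlist

def _safe_import(name, *args, **kwargs):
    """Only modules on the allowlist may be imported."""
    if name.split(".")[0] not in ALLOWED_IMPORTS:
        raise ImportError(f"import of {name!r} is not allowed")
    return __import__(name, *args, **kwargs)

# Restricted builtins: generated code sees only this subset.
SAFE_BUILTINS = {"len": len, "range": range, "sum": sum,
                 "print": print, "__import__": _safe_import}

async def run_sandboxed(code, tools, timeout=5.0):
    """Compile and exec generated code with restricted globals."""
    scope = {"__builtins__": SAFE_BUILTINS, **tools}

    def _exec():
        exec(compile(code, "<generated>", "exec"), scope)
        return scope.get("result")

    # Timeout enforcement, as in the description above.
    return await asyncio.wait_for(asyncio.to_thread(_exec), timeout)
```

Only the registered tool objects enter the scope, so generated code cannot reach the filesystem or the network through ordinary imports.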
Used when Docker is available. Generated code runs in a disposable container with full OS-level isolation. API credentials never enter the container. Tool calls travel over a TCP bridge (host.docker.internal) to the host process — the container receives only the serialized result. Outbound network access is blocked at the code level: urllib, requests, httpx, socket, and similar modules are excluded from the import allowlist.
How APITool Works
APITool connects any REST API to a Delfhos agent through a five-stage pipeline.
Compilation (always, no LLM)
The OpenAPICompiler reads the OpenAPI 3.x spec, resolves all $ref pointers, and transforms every operation into a Delfhos-native tool entry with a Python function signature, parameter descriptions, and compressed API docs.
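The core of $ref resolution is a recursive walk that inlines local pointers. This is an illustrative sketch, not the OpenAPICompiler itself; the real compiler also handles remote refs, cyclic refs, and doc compression:

```python
def resolve_refs(node, root):
    """Recursively inline local '$ref' pointers like
    '#/components/schemas/Pet'. Naive: cyclic refs would recurse forever."""
    if isinstance(node, dict):
        if "$ref" in node:
            target = root
            for part in node["$ref"].lstrip("#/").split("/"):
                target = target[part]      # walk the pointer path
            return resolve_refs(target, root)
        return {k: resolve_refs(v, root) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_refs(v, root) for v in node]
    return node
```

After this pass, every operation carries its full parameter and response schemas inline, which is what makes per-operation tool entries self-contained.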
LLM Enrichment (optional, enrich=True)
After compilation, an LLM rewrites endpoint descriptions to be more actionable and infers response schemas for undocumented endpoints. The enriched manifest is cached — on subsequent runs the LLM is never called again.
Registration
Compiled entries are registered into three internal stores: TOOL_REGISTRY (for the prefilter LLM), TOOL_ACTION_SUMMARIES (for prefilter ranking), and COMPRESSED_API_DOCS (for code generation prompts).
Execution
The APIExecutor receives calls from generated code and maps every argument to the correct HTTP location (path, query, header, or body). Before sending, it injects headers, params, and path_params.
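The argument-routing step can be sketched as a lookup table from parameter name to HTTP location. This is an illustrative stand-in for the APIExecutor, with an invented spec format; real specs come from the compiled OpenAPI manifest:

```python
def build_request(base_url, path, params_spec, args):
    """Map each argument to its HTTP location: path, query, header, body.
    params_spec maps a parameter name to its location; unknown names
    default to the request body."""
    url, query, headers, body = base_url + path, {}, {}, {}
    for name, value in args.items():
        location = params_spec.get(name, "body")
        if location == "path":
            url = url.replace("{" + name + "}", str(value))
        elif location == "query":
            query[name] = value
        elif location == "header":
            headers[name] = value
        else:
            body[name] = value
    return url, query, headers, body

url, q, h, b = build_request(
    "https://api.example.com", "/pets/{pet_id}",
    {"pet_id": "path", "verbose": "query", "X-Token": "header"},
    {"pet_id": 7, "verbose": True, "X-Token": "t", "name": "Rex"},
)
```

Generated code only ever passes keyword arguments; the executor decides where each one belongs, so the LLM never needs to know HTTP details.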
Background Schema Sampling (optional, sample=True by default)
After each successful API call, a daemon thread infers the exact response schema from the real data and saves it to sampled_schemas.json. The agent's knowledge of each endpoint's output improves automatically with use — zero tokens, zero latency.
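Schema inference from a live response reduces to a recursive type walk. A minimal sketch, assuming a JSON-schema-like output shape; the real sampler's format and persistence details may differ:

```python
def infer_schema(value):
    """Infer a minimal JSON-schema-like description from one sample."""
    if isinstance(value, dict):
        return {"type": "object",
                "properties": {k: infer_schema(v) for k, v in value.items()}}
    if isinstance(value, list):
        # Describe array items from the first element, if any.
        return {"type": "array",
                "items": infer_schema(value[0]) if value else {}}
    if isinstance(value, bool):   # check bool before int: bool is an int subclass
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    if value is None:
        return {"type": "null"}
    return {"type": "string"}
```

Because inference runs on data the agent already fetched, it costs no extra tokens and adds no request latency, exactly the trade-off the background thread exploits.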
