2026-03-29 · gemini-3.1-flash-lite-preview · 3 runs / task

Benchmark Results

Four frameworks. Four real-world tasks. One clear winner across speed, token efficiency, and developer experience.

2.2× — Faster than Agno (avg 2.0 s vs 4.4 s)
2.0× — Fewer tokens than PydanticAI (avg 1,141 vs 2,304)
100% — Success rate (all frameworks · all tasks)

// methodology

Reproducible & transparent

Each framework ran the same tasks under identical conditions — same model, same prompts, same hardware. A Gmail Email Triage task is excluded from all aggregates (and from the task list below): differences in native Gmail integrations make cross-framework comparisons unreliable there.

Model: gemini-3.1-flash-lite
Runs / task: 3 runs → avg
Frameworks: 4 total
Date: 2026-03-29
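The "3 runs → avg" protocol can be sketched as a minimal timing harness. This is an illustrative sketch, not the published harness: `run_task` is a hypothetical zero-argument callable standing in for one agent invocation (model call, prompts, and tools not shown).

```python
import statistics
import time

def benchmark(run_task, n_runs=3):
    """Time `run_task` n_runs times and return the mean latency in seconds.

    `run_task` is a hypothetical callable that performs one full agent
    invocation; the real benchmark's task setup is not shown here.
    """
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()   # monotonic, high-resolution clock
        run_task()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)
```

Each per-task figure in the charts below is the mean of three such runs under identical conditions.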

Tasks Evaluated

#   Task                       Tools
1   Expense Categorisation     1 custom tool
2   Quarterly ROI Analysis     2 chained tools
3   HR Headcount Report        2 chained tools
4   IT Ticket Prioritisation   2 chained tools
Metric 01

Speed

Average execution time per task (seconds). Lower is better.

1 — Expense Categorisation (Delfhos fastest)
Delfhos 1.7 s · LangChain 2.6 s · PydanticAI 2.5 s · Agno 2.6 s
2 — Quarterly ROI Analysis (Delfhos fastest)
Delfhos 2.3 s · LangChain 3.3 s · PydanticAI 3.7 s · Agno 5.4 s
3 — HR Headcount Report (Delfhos fastest)
Delfhos 2.0 s · LangChain 4.0 s · PydanticAI 3.7 s · Agno 4.3 s
4 — IT Ticket Prioritisation (Delfhos fastest)
Delfhos 1.9 s · LangChain 5.1 s · PydanticAI 4.7 s · Agno 5.4 s
Average across all tasks
Delfhos 2.0 s · LangChain 3.8 s · PydanticAI 3.6 s · Agno 4.4 s
Delfhos is 1.9× faster than LangChain, 1.8× faster than PydanticAI, and 2.2× faster than Agno.
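The averages and speed-up ratios can be reproduced directly from the per-task figures in the charts above (a minimal sketch; the numbers are the rounded per-task values as displayed):

```python
from statistics import mean

# Per-task execution times in seconds, taken from the speed charts above.
speed = {
    "Delfhos":    [1.7, 2.3, 2.0, 1.9],
    "LangChain":  [2.6, 3.3, 4.0, 5.1],
    "PydanticAI": [2.5, 3.7, 3.7, 4.7],
    "Agno":       [2.6, 5.4, 4.3, 5.4],
}

averages = {fw: mean(times) for fw, times in speed.items()}
baseline = averages["Delfhos"]

# How many times slower each framework is relative to Delfhos.
speedups = {fw: round(avg / baseline, 1) for fw, avg in averages.items()}
```

Note that the ratios computed from displayed per-task values may differ in the last digit from ratios computed on the unrounded raw timings.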
Metric 02

Token Usage

Average input + output tokens per task. Lower is better — directly impacts cost.

1 — Expense Categorisation (Delfhos leanest)
Delfhos 925 · LangChain 1,116 · PydanticAI 1,197 · Agno 1,119
2 — Quarterly ROI Analysis (Delfhos leanest)
Delfhos 1,261 · LangChain 2,062 · PydanticAI 2,408 · Agno 2,280
3 — HR Headcount Report (Delfhos leanest)
Delfhos 1,178 · LangChain 1,945 · PydanticAI 2,008 · Agno 1,935
4 — IT Ticket Prioritisation (Delfhos leanest)
Delfhos 1,199 · LangChain 3,511 · PydanticAI 3,603 · Agno 3,508

Token Breakdown — Average per Task

Framework     Avg Input   Avg Output   Avg Total
Delfhos       828         313          1,141
LangChain     1,734       425          2,159
PydanticAI    1,842       462          2,304
Agno          1,760       451          2,211
Average across all tasks
Delfhos 1,141 tokens · LangChain 2,159 · PydanticAI 2,304 · Agno 2,211
Delfhos uses ~1.9× fewer tokens than LangChain, ~2.0× fewer than PydanticAI, and ~1.9× fewer than Agno.
Metric 03

Setup Complexity

Non-blank, non-comment lines to initialise the agent and define tools (imports excluded). Lower is better.
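The counting rule above can be sketched as a small line filter. This is an assumption-laden sketch: the benchmark's exact edge-case handling (multi-line strings, decorators, continuation lines) is not specified in the source.

```python
def setup_loc(source: str) -> int:
    """Count non-blank, non-comment lines of setup code, excluding imports.

    Approximates the metric described above; edge cases such as multi-line
    strings are not handled, as the source does not define them.
    """
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if stripped.startswith("#"):
            continue  # comment
        if stripped.startswith(("import ", "from ")):
            continue  # imports are excluded from the metric
        count += 1
    return count
```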

1 — Expense Categorisation (Delfhos simplest)
Delfhos 8 lines · LangChain 15 lines · PydanticAI 9 lines · Agno 10 lines
2 — Quarterly ROI Analysis (Delfhos simplest)
Delfhos 9 lines · LangChain 16 lines · PydanticAI 10 lines · Agno 11 lines
3 — HR Headcount Report (Delfhos simplest)
Delfhos 9 lines · LangChain 16 lines · PydanticAI 10 lines · Agno 11 lines
4 — IT Ticket Prioritisation (Delfhos simplest)
Delfhos 9 lines · LangChain 16 lines · PydanticAI 10 lines · Agno 11 lines
Average setup lines
Delfhos 8.8 · LangChain 15.8 · PydanticAI 9.8 · Agno 10.8
Delfhos requires 1.8× fewer lines than LangChain, 1.1× fewer than PydanticAI, and 1.2× fewer than Agno.

// aggregate

The full picture

Framework     Avg Speed   Avg Tokens   Avg Setup LOC   Success Rate
Delfhos       2.0 s       1,141        8.8             100%
PydanticAI    3.6 s       2,304        9.8             100%
Agno          4.4 s       2,211        10.8            100%
LangChain     3.8 s       2,159        15.8            100%

Speed

Delfhos consistently finishes first, and the gap widens as tool-chaining complexity increases: on multi-tool tasks it averages 2.2× faster than Agno, peaking near 2.8× on IT Ticket Prioritisation.


Token Efficiency

Delfhos sends leaner prompts across all tasks. The advantage is most pronounced on Task 4, where competitors consume ~3× more tokens.


Developer Experience

LangChain requires roughly double the boilerplate of Delfhos. PydanticAI and Agno are more concise but still trail by 1–2 lines per task.

Leading on every metric.

All four frameworks achieved 100% success. Delfhos leads on speed, tokens, and simplicity — without sacrificing reliability.