2026-03-29 · gemini-3.1-flash-lite-preview · 3 runs / task

Benchmark Results

Four frameworks. Four real-world tasks. One clear winner across speed, token efficiency, and developer experience.

2.2× — Faster than Agno (avg 2.0 s vs 4.4 s)
2.0× — Fewer tokens than PydanticAI (avg 1,141 vs 2,304)
100% — Success rate (all frameworks · all tasks)

// methodology

Reproducible & transparent

Each framework ran the same tasks under identical conditions — same model, same prompts, same hardware. A Gmail Email Triage task is excluded from all aggregates (and from the task list below): differences in native Gmail integrations make cross-framework comparisons unreliable there.

Model: gemini-3.1-flash-lite
Runs / task: 3 runs → avg
Frameworks: 4 total
Date: 2026-03-29
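The "3 runs → avg" protocol can be sketched as a minimal timing harness. This is an illustrative sketch, not the published harness: `run_task` is a hypothetical zero-argument callable standing in for one agent invocation (model call, prompts, and tools not shown).

```python
import statistics
import time

def benchmark(run_task, n_runs=3):
    """Time `run_task` n_runs times and return the mean latency in seconds.

    `run_task` is a hypothetical callable that performs one full agent
    invocation; the real benchmark's task setup is not shown here.
    """
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()   # monotonic, high-resolution clock
        run_task()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)
```

Each per-task figure in the charts below is the mean of three such runs under identical conditions.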

Tasks Evaluated

#   Task                       Tools
1   Expense Categorisation     1 custom tool
2   Quarterly ROI Analysis     2 chained tools
3   HR Headcount Report        2 chained tools
4   IT Ticket Prioritisation   2 chained tools
Metric 01

Speed

Average execution time per task (seconds). Lower is better.

1 — Expense Categorisation (Delfhos fastest)
Delfhos 1.7 s · LangChain 2.6 s · PydanticAI 2.5 s · Agno 2.6 s
2 — Quarterly ROI Analysis (Delfhos fastest)
Delfhos 2.3 s · LangChain 3.3 s · PydanticAI 3.7 s · Agno 5.4 s
3 — HR Headcount Report (Delfhos fastest)
Delfhos 2.0 s · LangChain 4.0 s · PydanticAI 3.7 s · Agno 4.3 s
4 — IT Ticket Prioritisation (Delfhos fastest)
Delfhos 1.9 s · LangChain 5.1 s · PydanticAI 4.7 s · Agno 5.4 s
Average across all tasks
Delfhos 2.0 s · LangChain 3.8 s · PydanticAI 3.6 s · Agno 4.4 s
Delfhos is 1.9× faster than LangChain, 1.8× faster than PydanticAI, and 2.2× faster than Agno.
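The averages and speed-up ratios can be reproduced directly from the per-task figures in the charts above (a minimal sketch; the numbers are the rounded per-task values as displayed):

```python
from statistics import mean

# Per-task execution times in seconds, taken from the speed charts above.
speed = {
    "Delfhos":    [1.7, 2.3, 2.0, 1.9],
    "LangChain":  [2.6, 3.3, 4.0, 5.1],
    "PydanticAI": [2.5, 3.7, 3.7, 4.7],
    "Agno":       [2.6, 5.4, 4.3, 5.4],
}

averages = {fw: mean(times) for fw, times in speed.items()}
baseline = averages["Delfhos"]

# How many times slower each framework is relative to Delfhos.
speedups = {fw: round(avg / baseline, 1) for fw, avg in averages.items()}
```

Note that the ratios computed from displayed per-task values may differ in the last digit from ratios computed on the unrounded raw timings.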
Metric 02

Token Usage

Average input + output tokens per task. Lower is better — directly impacts cost.

1 — Expense Categorisation (Delfhos leanest)
Delfhos 925 · LangChain 1,116 · PydanticAI 1,197 · Agno 1,119
2 — Quarterly ROI Analysis (Delfhos leanest)
Delfhos 1,261 · LangChain 2,062 · PydanticAI 2,408 · Agno 2,280
3 — HR Headcount Report (Delfhos leanest)
Delfhos 1,178 · LangChain 1,945 · PydanticAI 2,008 · Agno 1,935
4 — IT Ticket Prioritisation (Delfhos leanest)
Delfhos 1,199 · LangChain 3,511 · PydanticAI 3,603 · Agno 3,508

Token Breakdown — Average per Task

Framework     Avg Input   Avg Output   Avg Total
Delfhos       828         313          1,141
LangChain     1,734       425          2,159
PydanticAI    1,842       462          2,304
Agno          1,760       451          2,211
Average across all tasks
Delfhos 1,141 tokens · LangChain 2,159 · PydanticAI 2,304 · Agno 2,211
Delfhos uses ~1.9× fewer tokens than LangChain, ~2.0× fewer than PydanticAI, and ~1.9× fewer than Agno.
Metric 03

Setup Complexity

Non-blank, non-comment lines to initialise the agent and define tools (imports excluded). Lower is better.
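The counting rule above can be sketched as a small line filter. This is an assumption-laden sketch: the benchmark's exact edge-case handling (multi-line strings, decorators, continuation lines) is not specified in the source.

```python
def setup_loc(source: str) -> int:
    """Count non-blank, non-comment lines of setup code, excluding imports.

    Approximates the metric described above; edge cases such as multi-line
    strings are not handled, as the source does not define them.
    """
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if stripped.startswith("#"):
            continue  # comment
        if stripped.startswith(("import ", "from ")):
            continue  # imports are excluded from the metric
        count += 1
    return count
```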

1 — Expense Categorisation (Delfhos simplest)
Delfhos 8 lines · LangChain 15 lines · PydanticAI 9 lines · Agno 10 lines
2 — Quarterly ROI Analysis (Delfhos simplest)
Delfhos 9 lines · LangChain 16 lines · PydanticAI 10 lines · Agno 11 lines
3 — HR Headcount Report (Delfhos simplest)
Delfhos 9 lines · LangChain 16 lines · PydanticAI 10 lines · Agno 11 lines
4 — IT Ticket Prioritisation (Delfhos simplest)
Delfhos 9 lines · LangChain 16 lines · PydanticAI 10 lines · Agno 11 lines
Average setup lines
Delfhos 8.8 · LangChain 15.8 · PydanticAI 9.8 · Agno 10.8
Delfhos requires 1.8× fewer lines than LangChain, 1.1× fewer than PydanticAI, and 1.2× fewer than Agno.

// aggregate

The full picture

Framework     Avg Speed   Avg Tokens   Avg Setup LOC   Success Rate
Delfhos       2.0 s       1,141        8.8             100%
PydanticAI    3.6 s       2,304        9.8             100%
Agno          4.4 s       2,211        10.8            100%
LangChain     3.8 s       2,159        15.8            100%

Speed

Delfhos consistently finishes first, and the gap widens as tool-chaining complexity increases: on multi-tool tasks it averages 2.2× faster than Agno, peaking near 2.8× on IT Ticket Prioritisation.


Token Efficiency

Delfhos sends leaner prompts across all tasks. The advantage is most pronounced on Task 4, where competitors consume ~3× more tokens.


Developer Experience

LangChain requires roughly double the boilerplate of Delfhos. PydanticAI and Agno are more concise but still trail by 1–2 lines per task.

Leading on every metric.

All four frameworks achieved 100% success. Delfhos leads on speed, tokens, and simplicity — without sacrificing reliability.