POST(AI)                     netgod.dev manual                     POST(AI)
NAME
$ Evals Are the Most Underrated AI Engineering Skill
DESCRIPTION
Anyone can build an LLM demo. Shipping an LLM product means knowing if it actually works. Evals are how you know.
./assets/evals-the-most-underrated-ai-skill.png (cover)
CONTENT
Every AI startup says "we test our prompts." Almost none of them have evals. There is a difference.
Tests vs evals
A test asserts a single deterministic outcome. An eval measures the distribution of outcomes across a graded dataset, and gives you a number to track over time.
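The distinction can be made concrete in a few lines. A sketch, using a regex stand-in for the model so it runs on its own; `extract_email`, `run_eval`, and the dataset are all illustrative names, not a real framework:

```python
import re

def extract_email(text: str):
    # Toy "model": a regex stand-in for an LLM call.
    m = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
    return m.group(0) if m else None

# A test asserts one deterministic outcome:
assert extract_email("reach me at a@b.com") == "a@b.com"

# An eval scores a graded dataset and yields a number to track:
def run_eval(model_fn, dataset):
    correct = sum(1 for x, want in dataset if model_fn(x) == want)
    return correct / len(dataset)

dataset = [
    ("reach me at a@b.com", "a@b.com"),
    ("no email here", None),
    ("cc bob@example.org please", "bob@example.org"),
]
print(run_eval(extract_email, dataset))  # prints 1.0
```

The test either passes or it doesn't; the eval gives you a score you can watch move as the prompt changes.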
The minimum viable eval
- Collect 30–50 real user inputs from your logs
- Hand-grade the ideal output for each
- Run your prompt against all of them, score with another LLM ("LLM-as-judge") or with simple regex/JSON checks
- Track the score in CI on every prompt change
That's it. No eval framework needed. A spreadsheet works for the first month.
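The whole loop fits in one script. A sketch under assumptions: a hand-graded `evals.csv` with `input` and `ideal` columns, a `call_model()` you supply, and a simple JSON-structure check as the scorer (swap in an LLM-as-judge call later if you need one). The file name, column names, and baseline are all placeholders:

```python
import csv
import json
import sys

def call_model(prompt: str) -> str:
    # Placeholder: swap in your actual LLM call.
    raise NotImplementedError

def score(output: str, ideal: str) -> bool:
    # Simple structural check: output is valid JSON with the same keys as the ideal.
    try:
        return set(json.loads(output)) == set(json.loads(ideal))
    except (json.JSONDecodeError, TypeError):
        return False

def main(path="evals.csv", baseline=0.85):
    with open(path) as f:
        rows = list(csv.DictReader(f))  # columns: input, ideal
    passed = sum(score(call_model(r["input"]), r["ideal"]) for r in rows)
    rate = passed / len(rows)
    print(f"{passed}/{len(rows)} passed ({rate:.0%})")
    sys.exit(0 if rate >= baseline else 1)  # nonzero exit fails the CI step
```

Wiring `main()` into CI as a required check is what turns the score into a gate: a prompt change that drops below the baseline can't merge silently.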
What you'll discover
The prompt change you were sure was an improvement made things 8% worse on the long-tail cases. This happens every single time. Without evals you ship it anyway.
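A single average hides exactly this failure mode. One way to surface it, sketched under the assumption that each graded example carries a slice tag (names like `long_tail` are illustrative):

```python
from collections import defaultdict

def scores_by_tag(results):
    """results: list of (tag, passed) pairs.
    Returns the pass rate per slice, so a long-tail
    regression can't hide inside the overall average."""
    buckets = defaultdict(list)
    for tag, passed in results:
        buckets[tag].append(passed)
    return {tag: sum(v) / len(v) for tag, v in buckets.items()}

results = [
    ("common", True), ("common", True),
    ("long_tail", False), ("long_tail", True),
]
print(scores_by_tag(results))  # {'common': 1.0, 'long_tail': 0.5}
```

Track the per-slice numbers alongside the headline score; the headline can go up while the slice you care about goes down.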
If you don't have evals, your prompt engineering is vibes.
netgod.dev manual                     2025-03-05                     END