POST(AI)                     netgod.dev manual                     POST(AI)
NAME
$ Evals Are the Most Underrated AI Engineering Skill
DESCRIPTION
Anyone can build an LLM demo. Shipping an LLM product means knowing if it actually works. Evals are how you know.
./assets/evals-the-most-underrated-ai-skill.png (cover)
CONTENT
Every AI startup says "we test our prompts." Almost none of them have evals. There is a difference.
Tests vs evals
A test asserts a single deterministic outcome. An eval measures the distribution of outcomes across a graded dataset, and gives you a number to track over time.
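The distinction can be made concrete in a few lines. A sketch, using a regex stand-in for the model so it runs on its own; `extract_email`, `run_eval`, and the dataset are all illustrative names, not a real framework:

```python
import re

def extract_email(text: str):
    # Toy "model": a regex stand-in for an LLM call.
    m = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
    return m.group(0) if m else None

# A test asserts one deterministic outcome:
assert extract_email("reach me at a@b.com") == "a@b.com"

# An eval scores a graded dataset and yields a number to track:
def run_eval(model_fn, dataset):
    correct = sum(1 for x, want in dataset if model_fn(x) == want)
    return correct / len(dataset)

dataset = [
    ("reach me at a@b.com", "a@b.com"),
    ("no email here", None),
    ("cc bob@example.org please", "bob@example.org"),
]
print(run_eval(extract_email, dataset))  # prints 1.0
```

The test either passes or it doesn't; the eval gives you a score you can watch move as the prompt changes.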
The minimum viable eval
- Collect 30–50 real user inputs from your logs
- Hand-grade the ideal output for each
- Run your prompt against all of them, score with another LLM ("LLM-as-judge") or with simple regex/JSON checks
- Track the score in CI on every prompt change
That's it. No eval framework needed. A spreadsheet works for the first month.
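The whole loop fits in one script. A sketch under assumptions: a hand-graded `evals.csv` with `input` and `ideal` columns, a `call_model()` you supply, and a simple JSON-structure check as the scorer (swap in an LLM-as-judge call later if you need one). The file name, column names, and baseline are all placeholders:

```python
import csv
import json
import sys

def call_model(prompt: str) -> str:
    # Placeholder: swap in your actual LLM call.
    raise NotImplementedError

def score(output: str, ideal: str) -> bool:
    # Simple structural check: output is valid JSON with the same keys as the ideal.
    try:
        return set(json.loads(output)) == set(json.loads(ideal))
    except (json.JSONDecodeError, TypeError):
        return False

def main(path="evals.csv", baseline=0.85):
    with open(path) as f:
        rows = list(csv.DictReader(f))  # columns: input, ideal
    passed = sum(score(call_model(r["input"]), r["ideal"]) for r in rows)
    rate = passed / len(rows)
    print(f"{passed}/{len(rows)} passed ({rate:.0%})")
    sys.exit(0 if rate >= baseline else 1)  # nonzero exit fails the CI step
```

Wiring `main()` into CI as a required check is what turns the score into a gate: a prompt change that drops below the baseline can't merge silently.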
What you'll discover
The prompt change you were sure was an improvement made things 8% worse on the long-tail cases. This happens every single time. Without evals you ship it anyway.
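A single average hides exactly this failure mode. One way to surface it, sketched under the assumption that each graded example carries a slice tag (names like `long_tail` are illustrative):

```python
from collections import defaultdict

def scores_by_tag(results):
    """results: list of (tag, passed) pairs.
    Returns the pass rate per slice, so a long-tail
    regression can't hide inside the overall average."""
    buckets = defaultdict(list)
    for tag, passed in results:
        buckets[tag].append(passed)
    return {tag: sum(v) / len(v) for tag, v in buckets.items()}

results = [
    ("common", True), ("common", True),
    ("long_tail", False), ("long_tail", True),
]
print(scores_by_tag(results))  # {'common': 1.0, 'long_tail': 0.5}
```

Track the per-slice numbers alongside the headline score; the headline can go up while the slice you care about goes down.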
If you don't have evals, your prompt engineering is vibes.
netgod.dev manual                     2025-03-05                     END