POST(AI)                    netgod.dev manual                    POST(AI)
NAME

$ Evals Are the Most Underrated AI Engineering Skill

DESCRIPTION

Anyone can build an LLM demo. Shipping an LLM product means knowing if it actually works. Evals are how you know.

DATE
2025-03-05
DURATION
1 min read
TAGS

COVER
./assets/evals-the-most-underrated-ai-skill.png
CONTENT

Every AI startup says "we test our prompts." Almost none of them have evals. There is a difference.

Tests vs evals

A test asserts a single deterministic outcome. An eval measures the distribution of outcomes across a graded dataset, and gives you a number to track over time.
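The contrast fits in a few lines of Python. This is a toy sketch: a regex extractor stands in for the model call so the example runs on its own.

```python
import re

# Stand-in for a prompted LLM call; in practice this hits your model.
def extract_email(text: str) -> str:
    match = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
    return match.group(0) if match else ""

# A test: one input, one deterministic assertion. Pass or fail.
def test_extracts_email():
    assert extract_email("Reach me at bob@example.com") == "bob@example.com"

# An eval: many graded (input, ideal output) pairs, a score tracked over time.
def run_eval(dataset):
    correct = sum(1 for text, ideal in dataset if extract_email(text) == ideal)
    return correct / len(dataset)

dataset = [
    ("Reach me at bob@example.com", "bob@example.com"),
    ("no contact info here", ""),
    ("ana.b+x@mail.co or call", "ana.b+x@mail.co"),
]
test_extracts_email()     # a boolean: passes or raises
print(run_eval(dataset))  # a score: a number you can watch move
```

The test tells you one thing didn't break; the eval tells you how often the whole system is right.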

The minimum viable eval

  1. Collect 30–50 real user inputs from your logs
  2. Hand-write the ideal output for each
  3. Run your prompt against all of them, score with another LLM ("LLM-as-judge") or with simple regex/JSON checks
  4. Track the score in CI on every prompt change
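The four steps can be sketched in plain Python with no framework. `call_model` is a hypothetical stand-in for your prompted model call, and the grader takes the simple-JSON-checks route from step 3:

```python
import csv
import json

# Hypothetical stand-in for your actual prompted model call.
def call_model(user_input: str) -> str:
    return json.dumps({"sentiment": "positive"})

# Step 3: score with simple JSON checks — no judge model needed for this.
def grade(output: str, ideal: dict) -> bool:
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(parsed.get(k) == v for k, v in ideal.items())

# Steps 1–2 live in this list: real inputs from logs, hand-written ideals.
dataset = [
    ("loved the new release", {"sentiment": "positive"}),
    ("refund me now", {"sentiment": "negative"}),
]

score = sum(grade(call_model(inp), ideal) for inp, ideal in dataset) / len(dataset)

# Step 4: append one row per prompt change so CI (or a spreadsheet) can track it.
with open("eval_scores.csv", "a", newline="") as f:
    csv.writer(f).writerow(["prompt-v2", score])
print(score)
```

With the stub above the score is 0.5, because the fake model calls everything positive — which is exactly the kind of failure a deterministic test on one happy-path input would never surface.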

That's it. No eval framework needed. A spreadsheet works for the first month.

What you'll discover

The prompt change you were sure was an improvement made things 8% worse on the long-tail cases. This happens every single time. Without evals you ship it anyway.
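One way to catch this is to tag each example with a slice and score slices separately. The numbers below are made up, but they illustrate the trap: the aggregate score improves while the long-tail slice regresses.

```python
from collections import defaultdict

# Made-up graded results: (slice, passed_with_old_prompt, passed_with_new_prompt).
results = [
    ("common", True, True),
    ("common", False, True),
    ("common", False, True),
    ("long_tail", True, False),
    ("long_tail", True, True),
]

def slice_scores(results, idx):
    totals, passes = defaultdict(int), defaultdict(int)
    for slc, *outcomes in results:
        totals[slc] += 1
        passes[slc] += outcomes[idx]
    return {slc: passes[slc] / totals[slc] for slc in totals}

before, after = slice_scores(results, 0), slice_scores(results, 1)
overall_before = sum(r[1] for r in results) / len(results)  # 0.6
overall_after = sum(r[2] for r in results) / len(results)   # 0.8
# Aggregate went up, but the long_tail slice went from 1.0 to 0.5.
print(overall_before, overall_after, before["long_tail"], after["long_tail"])
```

If you only track the aggregate, this prompt change looks like a clear win and ships.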

If you don't have evals, your prompt engineering is vibes.
netgod.dev manual                    2025-03-05                    END