EPISODE 02

The Grader

[Animated diagram: THE FACTORY (AI article pipeline, script) · LIVE SERP (top-3 mandatory, script) · THE GRADER (a reader, not an SEO expert; judgment) · RULEBOOK · verdicts: WIN / COMPETITIVE / LOSE]

AUDIT · REVIEW #1 · A GLASSES BUYING GUIDE
7 errors

the same maintenance warning pasted ×3 · three image URLs that 404 · a fabricated "dating" section with no source

By review #15, the running total was 118.

One month. The factory got graded.

139 competitor articles fetched
15 own articles audited
118 errors found
30 rules in the rulebook
sources: fetched-file dates 04/15–05/14 · audit report (9 iterations) · review log · workflow.md
⚠️ runs in bursts on demand, not on cron — and whether the fixes recovered traffic is not yet measured

Your move

1. Grade as a reader, not a dashboard. One persona question, "can I decide after reading this?", catches what metric checklists miss.

2. Every failure becomes a numbered rule. This loop isn't cron; it's run → catch → write the rule → run again. 30 rules and counting.

3. Judge against the live environment. Your quality bar is whatever currently wins the SERP, not your style guide.

Transcript & receipts — the full written case study
01 · Cold open

Review #1 was a buying guide for semi-rimless glasses, written by an AI content pipeline that had already shipped it. The grader read it the way a customer would. It found the same maintenance warning pasted three times word-for-word. It found three image URLs that would 404 on every reader's screen. It found an entire section about glasses making you more attractive on dates — fabricated, with no source behind it. Seven errors, logged with severity ratings.

That was article one of fifteen. By review fifteen, the running total was 118 errors, every one produced by the factory the grader's owner had built.

02 · The problem

A small content studio runs an AI article factory: pipelines that research, write, and publish at a pace no human team matches. The catch is that nobody reads the output the way a reader does. The people who built the factory check metrics and formatting; the only true readers are customers and Google.

That gap stayed invisible until it got expensive. In March 2026, a Google update penalized exactly the content profile the factory was producing — mass-produced pages lost most of their traffic across the industry. When the grader later checked, 8 of the 15 audited articles matched the penalized profile. Quality drift you can't see compounds quietly, then bills you all at once.

03 · The loop

The loop's one design commitment: the grader is a reader, not an SEO expert. Its core question is "can someone decide or learn after reading this?" — never "does this tick the optimization boxes."

one grading run (per keyword or article)
│
├─ serp.py        fetch the real ranking for the keyword     [script]
│                 top-3 results are mandatory competitors
├─ fetch.py       pull each competitor's full article        [script]
│                 provider chain: fetch → CDP → CF → Jina
│                 blocked? skip, report with N−1. never stall
│
├─ grader agent   read everything as one persona:            [judgment]
│                 "I'm searching this because ___"
│                 build coverage matrix (topic union)
│                 → WIN / COMPETITIVE / LOSE + gaps
│
└─ rulebook       every miss becomes a numbered rule —       [memory]
                  30 rules so far. the next run starts smarter

An honest label, since our own format demands it: this loop runs in bursts, on demand — not unattended on cron. The loop lives in the last box. Each audit amends the rulebook, and the rulebook persists, so every run inherits everything every previous run learned. Nine audit iterations grew it to 30 numbered rules.
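
The persistence described above can be sketched as a small append-only helper. This is a minimal sketch, not the studio's actual code: it assumes the rulebook is a markdown file in which each rule is a heading of the form `## Rule N: title`. That heading format and the names `count_rules` and `append_rule` are hypothetical; the source only says the rule count comes from the section count of workflow.md.

```python
import re
from pathlib import Path

# Assumed heading format for one rule; the real workflow.md layout is not shown in the source.
RULE_HEADING = re.compile(r"^## Rule (\d+):", re.MULTILINE)


def count_rules(rulebook: Path) -> int:
    """Return the highest rule number in the rulebook (0 if the file is missing or empty)."""
    if not rulebook.exists():
        return 0
    text = rulebook.read_text(encoding="utf-8")
    numbers = [int(n) for n in RULE_HEADING.findall(text)]
    return max(numbers, default=0)


def append_rule(rulebook: Path, title: str, body: str) -> int:
    """Append a new numbered rule to the end of the rulebook and return its number."""
    number = count_rules(rulebook) + 1
    with rulebook.open("a", encoding="utf-8") as fh:
        fh.write(f"\n## Rule {number}: {title}\n\n{body}\n")
    return number
```

Because rules are only ever appended, every later run that reads the file sees all earlier rules, which is the inheritance property the paragraph above describes.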

A few rules show the flavor of what accumulated:

  • #1's page type IS the verdict. If a shop outranks every article, users want to buy, not read. Don't argue with the scoreboard.
  • More content ≠ higher ranking. If #8 covers more topics than #1, that's a signal about #1's structure — not praise for #8.
  • Padding actively hurts. If removing the emotional filler collapses the article, it was a scaffold holding up air.

04 · What broke

The blocked fetch. Some competitor sites block scrapers, and the early runs would retry the same URL until the whole audit stalled. The fix became a rule: the provider chain already tried everything — skip the site, mark it "(unfetchable)" with the snippet you have, and ship the report with N−1 competitors. A stuck loop is worse than an incomplete one.
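
The skip-instead-of-retry rule can be sketched as follows. This is an illustration under stated assumptions, not the contents of fetch.py: the `Provider` callables stand in for the fetch → CDP → CF → Jina chain, and the names `FetchResult`, `fetch_with_chain`, and `fetch_competitors` are hypothetical. The point it shows is that each provider gets exactly one attempt per URL, and a site that defeats the whole chain is marked "(unfetchable)" rather than retried.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# A provider takes a URL and returns the article text, or None if it was blocked.
Provider = Callable[[str], Optional[str]]


@dataclass
class FetchResult:
    url: str
    text: str
    fetched: bool


def fetch_with_chain(url: str, providers: Sequence[Provider], snippet: str) -> FetchResult:
    """Try each provider once, in order. Never loop back to an earlier provider."""
    for provider in providers:
        try:
            text = provider(url)
        except Exception:
            text = None  # treat any provider error as "blocked" and move on
        if text:
            return FetchResult(url, text, True)
    # The whole chain failed: keep the SERP snippet and flag the site.
    return FetchResult(url, f"(unfetchable) {snippet}", False)


def fetch_competitors(
    serp: Sequence[Tuple[str, str]], providers: Sequence[Provider]
) -> Tuple[List[FetchResult], List[FetchResult]]:
    """Return (usable, skipped) so the report can proceed with N-1 competitors."""
    results = [fetch_with_chain(url, providers, snippet) for url, snippet in serp]
    usable = [r for r in results if r.fetched]
    skipped = [r for r in results if not r.fetched]
    return usable, skipped
```

The work per audit is bounded by (number of URLs) × (number of providers), so a single hostile site cannot stall the run.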

The wrong scoreboard. Generic web search returned a plausible but wrong rank order for Chinese keywords — so the grader was comparing against a SERP that didn't exist. The fix: always hit the real SERP API first; general-purpose search results are not the scoreboard.

The grader's own bias. Early reports kept praising the longest competitor as "the most comprehensive on the SERP" — even when it ranked #8. The fix went into the rulebook as the "more content ≠ higher ranking" rule, plus a harsher one: judge by information density per sentence, and treat filler as negative. The loop needed a rule against its own taste.
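
One way to keep that bias visible is to compute the coverage matrix mechanically and flag any page that covers more topics than #1, so the report treats breadth as a signal to investigate rather than as praise. The sketch below assumes the topic sets per ranked page have already been extracted (in the loop described here that step is the grader agent's judgment); the function names `coverage_matrix` and `broader_than_top` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def coverage_matrix(topics_by_rank: Dict[int, Set[str]]) -> Tuple[List[str], Dict[int, List[bool]]]:
    """Build the topic union and, per rank, a row marking which topics that page covers."""
    all_topics = sorted(set().union(*topics_by_rank.values()))
    matrix = {
        rank: [topic in topics for topic in all_topics]
        for rank, topics in sorted(topics_by_rank.items())
    }
    return all_topics, matrix


def broader_than_top(topics_by_rank: Dict[int, Set[str]]) -> List[int]:
    """Ranks that cover more topics than the #1 result (requires rank 1 to be present)."""
    top_count = len(topics_by_rank[1])
    return [
        rank
        for rank, topics in sorted(topics_by_rank.items())
        if rank != 1 and len(topics) > top_count
    ]


# Illustrative data only, not taken from the audit.
pages = {
    1: {"price", "fit"},
    3: {"price"},
    8: {"price", "fit", "history", "care"},
}
topics, matrix = coverage_matrix(pages)
print(topics)                   # ['care', 'fit', 'history', 'price']
print(broader_than_top(pages))  # [8]
```

A non-empty result from `broader_than_top` is a prompt to ask what #1 does structurally with fewer topics, not a reason to rate the longer page higher.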

05 · The numbers

One month of grading, April 15 to May 14, 2026.

| Metric | Value | Source |
| --- | --- | --- |
| Competitor articles fetched | 139 | fetched-file dates in the workspace |
| Own articles audited | 15 | audit report |
| Errors found | 118 (7.9 / article) | audit report + review log |
| Audit iterations | 9 | audit report header |
| Rules in the rulebook | 30 | workflow.md section count |
| Articles matching the Google-penalized profile | 8 of 15 | audit report, March 2026 update |
| The quality law it found | 1-source articles worst, 5–6-source articles clean → 5-source minimum now enforced | audit report |

⚠️ Two honest gaps: the loop runs in bursts on demand (the schedule lives in the rulebook iteration, not cron), and whether the 118 fixes recovered traffic is not yet measured. The grader measures content quality, not business outcomes.

06 · Take it home

Three design rules transfer to any grade-your-own-output loop (generated code, generated reports, generated listings):

  • Grade as a reader, not a dashboard. Give the agent one persona and one question — "can I decide after reading this?" — and it catches duplication, fabrication, and filler that metric checklists score as fine.
  • Every failure becomes a numbered rule. The rulebook is the loop. If a run's mistake doesn't end as a written rule, the next run repeats it.
  • Judge against the live environment. Fetch what actually wins right now and compare against that — your internal style guide drifts; the environment doesn't lie.

The pipeline is two small Python scripts plus a markdown rulebook. The 30 rules are the part that took a month of being wrong to learn.