The wrong way to evaluate AI content
When teams first experiment with AI content tools, they almost always do the same thing: they ask the model to produce some content, read what comes out, and decide whether it's "good enough."
This is the wrong test. It calibrates the team's quality bar to whatever the model produces, rather than calibrating the workflow to whatever quality bar the team already has. The result is a slow, almost imperceptible regression — articles that would have been rejected six months ago start getting published, because they're now compared against AI-generated drafts rather than the team's pre-AI quality standard.
Six months in, content output has doubled and the team is happy. Then rankings start to slip. Nobody connects the two until twelve months have passed and the slip has become a slide.
The right test: existing content as the bar
The right way to evaluate an AI content workflow starts before any AI is involved. You select your highest-quality existing articles — the ones you'd point to as exemplars of your editorial standard. The ones that ranked, drove revenue, got shared, and made the team proud.
You take the keyword targets and brief structures from those articles. You feed them into your proposed AI workflow. And you compare what comes out.
The question isn't "is the output good?" The question is "is this output indistinguishable from our exemplars, given a reasonable amount of human editing?"
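One way to keep that comparison honest is to blind it. Here is a minimal sketch, assuming exemplars and AI drafts live as Markdown files in two folders; the function name and file layout are hypothetical, not a prescribed tool:

```python
import random
from pathlib import Path

def build_blind_review_set(exemplar_dir: str, ai_draft_dir: str, out_dir: str) -> dict:
    """Copy exemplar articles and AI drafts into one folder under anonymous
    IDs, so reviewers rate them without knowing which source produced which.
    Returns the answer key; keep it away from the reviewers."""
    drafts = [(p, "exemplar") for p in Path(exemplar_dir).glob("*.md")]
    drafts += [(p, "ai") for p in Path(ai_draft_dir).glob("*.md")]
    random.shuffle(drafts)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    answer_key = {}
    for i, (path, source) in enumerate(drafts, start=1):
        anon_id = f"draft-{i:03d}"
        (out / f"{anon_id}.md").write_text(path.read_text(encoding="utf-8"), encoding="utf-8")
        answer_key[anon_id] = source
    return answer_key
```

If reviewers' ratings can't reliably separate the two piles, the workflow is meeting the bar. If the AI pile clusters at the bottom, it isn't.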
Three honest answers
That question has three honest answers, and which one you get determines whether to deploy the workflow:
1. Yes, indistinguishable with light editing.
Light editing means a senior editor spends 20-40 minutes per article on structure, voice, and fact-checking. If the workflow gets to the team's bar with that level of effort, deploy it. The productivity gain is real. The quality bar holds.
2. Yes, but only with heavy editing.
Heavy editing means an editor spends 90+ minutes per article, often rewriting whole sections. If this is what's required, the workflow's leverage is much smaller than it looked. You're not doubling output — you're shifting which humans do which parts of the work, and the time savings might not pay for the AI infrastructure.
This is the most common case. It's also where most teams convince themselves to deploy anyway, on the assumption that "the model will get better." Sometimes it does. Often it doesn't, and you've sunk months of editorial bandwidth into a workflow that never paid back.
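The leverage difference is easy to see with back-of-envelope arithmetic. Every number below is an illustrative assumption, not a benchmark; swap in your own timings from the evaluation:

```python
# Illustrative assumptions only -- replace with your own measured numbers.
EDITOR_RATE = 75.0        # blended editorial cost, $/hour (assumed)
FROM_SCRATCH_MIN = 240    # minutes to write one article without AI (assumed)
MODEL_FEES = 2.0          # model cost per article, $ (assumed)

def cost_per_article(edit_minutes: float) -> float:
    """Real per-article cost of the AI workflow: model fees plus editing time."""
    return MODEL_FEES + (edit_minutes / 60) * EDITOR_RATE

baseline = (FROM_SCRATCH_MIN / 60) * EDITOR_RATE   # $300.00 fully human
light = cost_per_article(30)                       # $39.50  -> ~7.6x leverage
heavy = cost_per_article(110)                      # $139.50 -> ~2.2x leverage

for label, cost in [("from scratch", baseline), ("light edit", light), ("heavy edit", heavy)]:
    print(f"{label:>12}: ${cost:7.2f} per article ({baseline / cost:.1f}x vs. baseline)")
```

Under these assumptions, heavy editing cuts the apparent leverage from roughly 7.6x to about 2.2x, and that is before counting the cost of building and maintaining the workflow itself.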
3. No — not at the existing bar.
If your exemplars genuinely cannot be replicated by the workflow with reasonable human oversight, the honest answer is: don't deploy it. Either lower your editorial ambition (a real choice — sometimes appropriate) or accept that AI-assisted production isn't the right scaling lever for this content type.
Most teams skip this answer because admitting it feels like failure. But it's not failure. It's calibration. AI-assisted production is right for some content types and wrong for others. The discipline is in being honest about which is which.
The architecture matters more than the model
One reason the existing-content test is useful: it surfaces architectural problems that look like model problems. Most teams that fail the test fail because their workflow is architecturally weak — bad prompting, wrong model selection, missing context, no editorial guardrails — not because the underlying model is incapable.
Switching from GPT-4 to Claude Opus rarely fixes a workflow that fails the bar test. Rebuilding the prompt structure, adding few-shot examples from your existing content, and inserting a human-review gate often does. The model is rarely the limiting factor.
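To make "rebuilding the prompt structure" concrete, here is a minimal sketch. The names and the `call_model` stub are hypothetical stand-ins for whatever client your stack uses; what matters is the shape: style rules, few-shot exemplars pulled from your own archive, a structured brief, and a human gate before anything ships.

```python
from dataclasses import dataclass

@dataclass
class Brief:
    keyword: str
    audience: str
    outline: list[str]

def build_prompt(brief: Brief, exemplars: list[str], style_guide: str) -> str:
    """Assemble the draft prompt: editorial rules first, then few-shot
    exemplars from the team's own archive, then the structured brief."""
    shots = "\n\n".join(f"<exemplar>\n{e}\n</exemplar>" for e in exemplars[:3])
    outline = "\n".join(f"- {section}" for section in brief.outline)
    return (
        f"Follow this style guide:\n{style_guide}\n\n"
        f"Match the standard of these exemplar articles:\n{shots}\n\n"
        f"Write an article targeting '{brief.keyword}' for {brief.audience}, "
        f"using this outline:\n{outline}"
    )

def call_model(prompt: str) -> str:
    # Hypothetical stub: wire in your actual model client here.
    raise NotImplementedError

def queue_for_editorial_review(draft: str) -> str:
    # Hypothetical gate: in a real system this lands in an editor's queue,
    # and nothing publishes without sign-off.
    return draft

def generate_draft(brief: Brief, exemplars: list[str], style_guide: str) -> str:
    """Produce a draft and route it to an editor, never straight to the CMS."""
    draft = call_model(build_prompt(brief, exemplars, style_guide))
    return queue_for_editorial_review(draft)
```

The exemplars do double duty here: they calibrate the model's output, and they are the same articles the bar test compares against.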
The teams that succeed with AI content treat the model as a component in a larger system. The teams that fail treat the model as the system.
What to do before deploying any AI content workflow
A short checklist for evaluating any proposed AI content workflow before it touches your production pipeline:
- Pick your top 10 exemplar articles — the ones representing your true quality bar
- Run the proposed workflow against the briefs that produced those articles
- Compare side-by-side, not just AI output read in isolation
- Time the human editing required to bring AI output to exemplar quality
- Calculate the real productivity gain (output volume) divided by the real cost (model fees + editorial time); a worked sketch of this ratio follows the list
- Make the deployment decision based on the gain-to-cost ratio, not on whether the AI output "feels" good
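As a sketch of those last two steps, with invented numbers standing in for the measurements the checklist produces:

```python
# Hypothetical measurements -- substitute figures from your own bar test.
EDITOR_RATE = 75.0  # blended editorial cost, $/hour (assumed)

def gain_to_cost(volume_gain: float, model_fees: float, edit_minutes: float) -> float:
    """The checklist's ratio: output-volume gain divided by real per-article
    cost (model fees plus the human time the bar test measured)."""
    real_cost = model_fees + (edit_minutes / 60) * EDITOR_RATE
    return volume_gain / real_cost

# Status quo: 1x output, no model fees, 240 min of writer time per article.
status_quo = gain_to_cost(volume_gain=1.0, model_fees=0.0, edit_minutes=240)
# Proposed workflow: 2.2x output, $2 in fees, 35 min of timed editing.
proposed = gain_to_cost(volume_gain=2.2, model_fees=2.0, edit_minutes=35)

print(f"status quo: {status_quo:.4f}, proposed: {proposed:.4f}")
print(f"proposed workflow is {proposed / status_quo:.1f}x the status quo ratio")
```

The absolute ratio means little on its own; the question is whether the proposed workflow beats the status quo by enough to cover the infrastructure cost and the quality risk.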
This is the bar test. It's slow. It's unglamorous. It surfaces uncomfortable answers. And it's the difference between scaling content production successfully and watching your rankings slip a year after deployment.
Where this fits in our work
The bar test is built into how we design AI integrations. We don't recommend turning on AI content generation until the architecture passes this test against the client's existing exemplars. Sometimes that means recommending against deployment. We've done that more than once. Clients usually thank us for it later.