A compounding pattern catalog for AI slop

A methodology for detecting AI-generated content faces a structural problem the field rarely discusses directly. The patterns the methodology catalogs keep moving. Last year’s “delve” cluster gets retrained out by an alignment update. Today’s high-signal marker did not exist eighteen months ago. Tomorrow’s frontier model will produce patterns that nobody has names for yet.

Most detection systems treat this as a versioning problem. Ship v2, v3, v4. Deprecate the older catalog entries. Point users at whatever is current. The pattern catalog tracks the rolling frontier of model behavior, and last year’s pattern set is discarded once it stops firing.

That shape has a hidden cost. It makes the methodology useless for any work where the content was not produced by today’s model. A news organization auditing articles from 2023 cannot use a 2026 catalog that has dropped the 2023-era patterns. A researcher studying how LLM output has evolved cannot read across versions if each version’s catalog only contains its own present. A regulator investigating disclosure failures from two years ago has nothing in scope.

The synthesis-content-quality methodology behind slopcheck chose a different shape. Patterns are never deleted. When a newer model version trains a pattern out, the catalog tags it with its era and retains it. The “As an AI language model” preamble that was ubiquitous in 2023 ChatGPT output is gone from current GPT-5.1, but it is still in the catalog, flagged Historical, with the years of its prevalence. The same applies to the Bard “as a large language model trained by Google” stem, early Claude over-disclaimer patterns, the first-generation refusal templates from before instruction-tuning matured, and the deprecated emoji-as-bullet structure that was characteristic of GPT-3.5 in late 2022.

This is the compounding-archive principle. The catalog grows monotonically. The newest patterns are tagged Active. Patterns that are receding because a model class is being trained against them are tagged Declining. Patterns that have been substantially retrained out but are still recognizable in older content are tagged Historical. A few patterns are tagged Deprecated, meaning they have been superseded by a more precise formulation but the original is kept for backwards comparison.

What the compounding archive enables

Forensic review at the same depth as current-content review. A newsroom auditing pieces from 2023 gets the full catalog applied at era-correct calibration. Editors checking today’s drafts get the same fidelity. The catalog does not split into “current” and “archive” versions.

Historical scholarship. The era metadata is research data in its own right. Studying how the saturated-vocabulary cluster evolved from 2023 through 2026 is a real research question, and the answer is in the catalog rather than reconstructed from scattered blog posts.

Calibration that is honest about time. Each pattern carries a per-family base-rate column split by year. An editor in 2026 reading the calibration table for em-dash density learns immediately that the marker was high signal for the Claude family and pre-GPT-5.1 ChatGPT, has plummeting base rate in GPT-5.1 onward after the November 2025 anti-em-dash personalization update, and was near-zero in Llama outputs across all years. The same calibration table that tells the editor what to weight today is the calibration table that tells the historian what to weight for any given year.

The fourteen-field per-pattern template

Each pattern in the catalog carries fourteen fields. The fields exist so that extension is mechanical and falsification is possible. A short list with one-line definitions:

id. A stable identifier with a two-letter section prefix and a three-digit number. Example: A3-SS-001 is criterion 1 in the style-and-structural subsection of section A3.
name. The pattern’s human-readable label.
description. One paragraph defining the pattern operationally. Not the only example; the criterion of inclusion.
era. One of Active, Declining, Historical, Deprecated.
years_active. The range during which the pattern fires in production output. Allows time-stratified detection.
families. Which LLM families the pattern characterizes. May be all-of, one-of, or absent-from.
zone. Where in an LLM response the pattern lives. WRAPPER-OPENER, BODY-PERSISTENT, WRAPPER-CLOSER, HYBRID, or MID-BODY-INSERT.
signal_strength_when_present. A 0.0 to 1.0 conditional probability that text containing the pattern is AI-generated.
base_rate_by_family. A per-family table of how frequently the pattern appears in unedited output, split by year for the major shifts.
causal_attribution. A code from the twelve-element taxonomy that names the likely origin. RLHF reward shaping, training-data skew variants, alignment-and-safety tuning, refusal-avoidance behavior, helpfulness optimization, system-prompt artifacts, tokenizer/architecture effects, product-wrapper effects.
examples. At least three real examples, drawn from production output. Synthetic examples are forbidden; the catalog requires evidence.
non_examples. Cases where the pattern’s surface form is present but the underlying signal is not. Critical for editors learning calibration.
combined_signals. References to B2 combos this pattern participates in.
notes. Open field for edge cases, ESL considerations, and recent calibration shifts.

The template makes contribution mechanical. A researcher adding a new pattern fills in the fourteen fields, attaches three examples, picks an era tag, and the catalog accepts the entry. The same template, read in reverse, makes any pattern in the catalog falsifiable. If the examples are not real, or the family attribution is wrong, or the calibration table does not match observed base rates, the entry can be amended or retired with the audit trail intact.

The four catalog sections

The methodology divides patterns into four sections. The division is not aesthetic; each section answers a different research question.

Section A1, model-family fingerprinting, asks which family produced this output. The section covers eight families: Anthropic Claude, OpenAI GPT, Google Gemini, Meta Llama, xAI Grok, DeepSeek, Mistral, and Qwen. Active pattern count is around one hundred, with historical entries adding roughly thirty more. The headline patterns for each family are calibrated to a high signal-strength-when-present score. For Claude, the strongest single signature in mid-2026 is em-dash density combined with bulleted bolded lead-ins and uniform paragraph length. For GPT, it is the saturated-vocabulary cluster centered on “delve,” anchored in the Kobak et al. study showing 13.5 percent of 2024 biomedical abstracts carried the marker. For Gemini, the leading signature is plain-text markdown leakage in non-rendering channels.

Section A2, substance and depth, asks whether the content carries claims that do real work or merely fills space. The framing is grounded in a substantial literature: Frankfurt’s On Bullshit (2005) on the structural difference between lying and indifference to truth, Hicks-Humphries-Slater’s “ChatGPT is Bullshit” in Ethics and Information Technology 26:38 (2024) extending the framework to LLM output, Pennycook et al.’s Bullshit Receptivity Scale (Judgment and Decision Making 10:6, 2015), Sourati et al.’s 2025 homogenization survey, and Padmakumar and He’s 2024 study on output diversity loss. Seventeen sub-patterns operationalize the literature into editorial tests an editor can run in five minutes. The deletion test (A2-SUB-001) is the single most useful: if a paragraph can be removed without losing any claim, evidence, or transition, the paragraph is structural slop regardless of authorship. The any-company test (A2-SUB-006) covers business writing: if a paragraph applies equally well to any company in the industry, it carries no load-bearing claim about the specific company it is supposed to be about. The third high-yield test is load-bearing claim count (A2-SUB-003), which counts the sentences in a piece that carry claims the rest of the piece depends on. Substantive prose runs three or more load-bearing claims per hundred words.

Section A3, refreshed pattern catalog, holds the seventy-six criteria that survived the v3.1.0 to v4.0 transition or were added in the refresh. They are renumbered thematically with two-letter prefixes. A3-LT covers language and tone (fifteen criteria). A3-SS covers style and structural patterns (nine criteria, including the em-dash entry with its per-family calibration). A3-TF covers technical and formatting markers (placeholders, chatbot artifacts, broken or fabricated links). A3-CS covers citation and sourcing patterns (hallucinated citations promoted to high signal, vague attribution, the four retrieval-era additions ChatGPT introduced when the GPT models began citing sources). A3-CX covers context-specific markers (industry slop, lack of personal detail). A3-HD covers hyperbolic and dramatic patterns. A3-CE covers confidentiality exposure (scenario fingerprinting, operational decisions used as teaching material). A3-BT covers behavioral and tonal patterns (saturated vocabulary, exhausted metaphors, concierge tone, sycophancy drift, partial-refusal stems). A3-FA covers frame and audience patterns (insider context collapse and seven net-new entries). A3-SR covers social-register patterns specific to social-media writing.

The cross-cutting B layer applies to every pattern in the A1, A2, and A3 sections. B1 carries the causal-attribution column. B2 catalogs the eighty-six combined-signal fingerprints. B3 carries the two-axis calibration tables.

Calibration that takes the ESL bias seriously

The v3.1.0 heuristic (“five or more medium-confidence indicators equals very likely AI”) is replaced in v4.0 with explicit two-axis calibration. Each criterion carries a signal-strength-when-present score (0.0 to 1.0) and a base-rate-by-family column. The split reveals the asymmetries that count-based heuristics hide. Em-dash density carries a high signal-strength-when-present score against the Claude family in mid-2026 but a plummeting base rate in GPT-5.1 outputs after the anti-em-dash personalization rolled out in late 2025. Uniform paragraph length carries a moderate signal strength but a very high base rate that overlaps with non-native English writing.

The ESL safe-harbor is the calibration constraint that follows from taking that overlap seriously. Liang et al. (arxiv 2304.02819, 2023) demonstrated that GPT detectors misclassify a large fraction of non-native English writing as AI-generated. The cornerstone signature for AI in many detection systems (uniform paragraph length combined with restricted vocabulary and heavy transition use) is also the cornerstone signature for English written by writers whose first language is something else. A methodology that does not address this is producing a discrimination liability for any organization that adopts it.

The safe-harbor is a hard rule in the catalog. Any detection that triggers the cornerstone signature must be combined with at least one register-specific AI marker before the piece is flagged. The register-specific markers include the saturated-vocabulary cluster (three or more focal words from the catalog), em-dash density at the per-family threshold, system-prompt artifact bleed, chatbot reflex phrases like the sycophancy opener or concierge closer, or a hallucinated citation. Without one of those, the signature alone is most likely non-native English writing by a human author.

Combined-signal fingerprints

The B2 catalog holds eighty-six combinations. Each combination has a measured false-positive rate that is lower than the rate for any of its component patterns alone. The point is not that more patterns is better; the point is that specific co-occurrence is more diagnostic than count.

Two combinations are the workhorses for general editorial use. B2-COMBO-003, the Claude.ai default, fires when em-dash density, bulleted lists with bolded lead-ins, and uniform paragraph length appear together. The false-positive rate at full co-occurrence is below 0.5 percent. B2-COMBO-001, the ChatGPT 4o tell, fires when saturated-vocabulary clustering combines with exhausted-metaphor construction and a section-ending summary. The false-positive rate at full co-occurrence is below 1 percent.

A third combination matters for any system that processes content from non-native English authors. B2-COMBO-010, the ESL false-positive trap, has the same component patterns as the cornerstone AI signature: uniform paragraph length plus restricted vocabulary plus heavy transition use. The B2 entry catalogs this combination as a negative marker. When the combination fires without a register-specific AI marker also present, the system must not flag. The catalog records the combination so that any rules-engine downstream of the methodology has the constraint built in.

How to extend the catalog

The fourteen-field template makes contribution mechanical. The repository (github.com/synthesisengineering/synthesis-skills) is open source under the relevant CC0 / MIT terms. A researcher who finds a new pattern in a frontier model release submits the fourteen fields with three examples and an era tag. A reviewer checks the examples against production output, verifies the family attribution, and either accepts the entry, requests amendments, or rejects with an explanation.

The era taxonomy makes the catalog resistant to staleness in the obvious sense. A pattern that gets retrained out moves from Active to Declining to Historical over a year or two of release cycles; the entry stays in the catalog with the correct era tag. The catalog does not need to be rewritten every model generation; it accretes.

What this enables in practice

The methodology drives slopcheck, the hosted tool at tools.synthesiswriting.org/slopcheck/. The tool reads the catalog and applies it to submitted content with the two axes reported separately (AI-provenance signal and slop-independence). Newsroom editors, content leads, and individual writers use the tool through the web app, the Slopcheck Custom GPT in the OpenAI store, the Slopcheck Claude Project recipe, a browser extension, a command-line tool, and email submission to [email protected]. The tool is free everywhere, with zero data collection and no signup; voluntary support runs through GitHub Sponsors.

The same catalog can drive any editorial tool an organization wants to build. The fourteen-field template, the two-axis calibration tables, the ESL safe-harbor rule, and the eighty-six combined-signal fingerprints are public. The skill source is available for installation into any agent that reads the Agent Skills standard (Claude Code, OpenAI Codex, Cursor, GitHub Copilot, and forty-plus other clients).

A companion piece covers the engineering: the multi-pass orchestrator, the hosted-tier safety architecture, the tools-router pattern that fronts the subdomain. The canonical launch piece, written for editors and content owners rather than for methodology readers, lives on synthesiswriting.org. Both posts cross-link from here.