Doctrine 08: Capability-Graded Doctrine — How We Hedge Without Going Bland
A claim shipped without a confidence grade is doing one of two things wrong. If the writer is more certain than their evidence, the claim is an overclaim and reader trust collapses on the first contact with reality. If the writer is less certain than their evidence, the claim has been hedged into noise and the reader can no longer tell what the writer actually believes. The two failures look different, but they share a root: the claim cannot be scored against the future because the writer never named what would constitute a hit or a miss.
Capability-graded doctrine is the middle path. Every load-bearing claim gets two annotations: a confidence grade naming how strong the writer's belief is, and an evidence bracket naming the basis for that belief. The grade and bracket together let the reader audit the writer's calibration, and let the writer's future self update the claim when reality has had its say.
This essay specifies the discipline. The canonical user is the Mercantile Thesis (the QM canon's flagship essay on AI-2026 market structure, hereafter "V2"), whose three dated falsifiable bets are the falsification mechanism the grades route into; the rubric-application is Doctrine 06's eight-axis check (the audit rubric that scores candidate sovereign-appliance products against the merchant lens). Three artifacts make the discipline concrete: a grade on each load-bearing claim, an evidence bracket per grade, and a dated bet that the grade can be scored against. The discipline is generalizable (every essay in the canon should use it), but it has anti-patterns that look like grading and aren't, and the worst of them are the ones that destroy voice.
I. The Two Failure Modes
The unhedged-claim failure mode is familiar. A confident writer ships a sentence like "the per-token price of frontier-class inference falls roughly an order of magnitude per year for three years running." It reads as authoritative. A reader who half-believes it nods along. A reader who fact-checks it discovers that "roughly an order of magnitude" is doing a lot of work and that "three years running" depends on which benchmark you measure against. Trust quietly collapses. The writer doesn't know it has happened because the reader has stopped engaging.
The over-hedged-claim failure mode is less famous and more common. A cautious writer ships "the per-token price of frontier-class inference, on my reading and depending substantially on the benchmark and time horizon, may have fallen at a rate that some observers have described as approximately an order of magnitude per year over a multi-year period." The same factual content; none of the rhetorical force. The reader cannot tell whether the writer is hedging because the evidence genuinely is weak or because the writer is performing humility. There is no signal. The sentence cannot fail because it has not committed to anything.
The discipline that produces the first sentence is publish to be right. The discipline that produces the second is publish to be safe. Neither is calibrated. A calibrated writer publishes to be scored.
The Mercantile Thesis V2 ships the calibrated version: "It is likely that the per-token cost of frontier-class inference falls another order of magnitude over the next twenty-four months ... [Evidence: three-year price-curve extrapolation; cross-checked against Aschenbrenner's compute-trajectory framework.] Continuation of an existing trend, not a new prediction." The grade is likely. The evidence is the three-year trend plus a named cross-check. The claim commits to a specific prediction with a specific time horizon, and the writer's confidence is named as likely rather than near-certain or uncertain but plausible. A reader can score this in twenty-four months. A reader can also score the writer's calibration over time by checking how often likely claims hit and how often they miss.
II. The Four-Grade Scale
The Mercantile Thesis V2 uses four grades. The first three ascend in confidence; the fourth marks claims that cannot yet be graded:
Uncertain but plausible. The writer believes the claim is more likely true than false but cannot rule out scenarios where it is false. Used when the evidence is genuinely thin or when the reasoning relies on org-design read or analog argument. Example from V2: "It is uncertain but plausible that the appliance layer is built first by a workstation-scale integrator rather than by Apple or NVIDIA themselves." The evidence is org-design read; the writer cannot rule out Apple shipping the appliance first, and says so.
Likely. The writer believes the claim is meaningfully more likely true than false, with concrete evidence supporting it. Used for claims where the trend is established but the future could break it. Example from V2: "It is likely that the wrappers around foundation-model APIs see structural margin compression as the floor rises." The evidence is the historical analog (electricity wrappers, web-hosting wrappers, smartphone-app wrappers). A counter-trend would change the grade.
Likely-to-near-certain. The writer believes the claim is strongly supported and can name only narrow scenarios where it would fail. Used when the historical analog is robust and the mechanism is well-understood. Example from V2: "It is likely-to-near-certain that the next durable category is the appliance layer: vertically integrated, sovereign, multi-agent, deterministic, hardware-native." The evidence is the substrate-to-appliance transition pattern across electricity / internal combustion / mobile compute. The writer thinks the burden of proof should sit with people arguing AI is the exception.
Known unknown. The writer cannot grade the claim because the evidence does not yet exist. Used for forward predictions whose outcome depends on events that have not yet happened. Bet 1's outcome (SWE-bench Verified ≤10pp gap, resolving Q4 2027) is a known unknown until 2027-12-31; the writer commits to a position and accepts that the resolution is genuinely outside their current knowledge.
These four are not the only possible grades. They are the ones the canon uses because they cover the practically-useful range without proliferating into pseudo-precision. A finer scale (62% confident, 74% confident) buys nothing without a calibration record long enough to score the percentages against. The four-grade scale is honest about what the writer can reliably distinguish.
III. The Evidence Bracket
A grade without an evidence bracket is confidence theatre. The bracket names what the grade is supported by, in the format [Evidence: <type>; <source>].
Six types of evidence the canon uses:
- Trend extrapolation. A multi-year pattern in measured data. Example: V2 Bet 1's "three-year price-curve extrapolation."
- Historical analog. A prior commercial / industrial / political pattern that the current situation resembles structurally. Example: V2's "electricity wrappers 1893–1907 → web-hosting wrappers 1999–2003 → smartphone-app wrappers 2010–2014."
- Org-design read. An interpretation of an organization's incentives, structure, or stated commitments that predicts behavior. Example: V2's "org-design read of Apple's developer-facing AI cautiousness and NVIDIA's data-center-shape."
- Citation cross-check. A claim that has been independently surfaced by a named outside source. Example: V2's cross-check against Aschenbrenner's compute-trajectory framework.
- Engineering receipt. A claim that is backed by running code or a measured benchmark from the writer's own work. Example: Sovereign Audit 04's 38-microsecond latency benchmark backing the "sovereign appliance is feasible at workstation scale" claim.
- Audit procedure. A claim that is backed by a published rubric and a specified scoring procedure. Example: Bet 3's reference to Doctrine 06's eight-axis check.
Each evidence type has its own failure mode. Trend extrapolation breaks when the trend breaks. Historical analog breaks when the analog turns out to be structurally unlike the present. Org-design read breaks when the organization restructures. Citation cross-check breaks when the cross-checked source is itself wrong. Engineering receipt breaks when the receipt's measurement methodology is contested. Audit procedure breaks when the rubric is contested. The reader who sees the evidence bracket can attack the specific failure mode rather than dismissing the claim wholesale.
The bracket is also a calibration substrate for the writer. Over time, the writer can audit which evidence types they tend to over-trust (and which they under-trust) by tracking which graded-claim types hit and which miss.
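The bracket format is regular enough to be checked mechanically. What follows is a minimal sketch, assuming brackets always take the literal shape [Evidence: <type>; <source>] on a single line; the names `EVIDENCE_RE`, `EvidenceBracket`, and `parse_evidence_bracket` are hypothetical and not part of any canon tooling.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern for the "[Evidence: <type>; <source>]" format.
# The type runs up to the first semicolon; the source runs to the closing bracket.
EVIDENCE_RE = re.compile(
    r"\[Evidence:\s*(?P<type>[^;\]]+);\s*(?P<source>[^\]]+)\]"
)


class EvidenceBracket(NamedTuple):
    evidence_type: str
    source: str


def parse_evidence_bracket(text: str) -> Optional[EvidenceBracket]:
    """Return the first evidence bracket found in text, or None if absent."""
    match = EVIDENCE_RE.search(text)
    if match is None:
        return None
    return EvidenceBracket(
        evidence_type=match.group("type").strip(),
        # Drop a trailing full stop so the source reads as a bare phrase.
        source=match.group("source").strip().rstrip("."),
    )


# Usage: the bracket quoted from V2 in section I.
example = (
    "[Evidence: three-year price-curve extrapolation; cross-checked against "
    "Aschenbrenner's compute-trajectory framework.]"
)
bracket = parse_evidence_bracket(example)
```

A graded sentence for which `parse_evidence_bracket` returns None is a candidate for the "grade without a bracket" failure described at the top of this section.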
IV. The Audit-Trail Discipline
A graded claim is a public record. When reality scores it, the writer owes the canon two things: an explicit acknowledgment of the score (hit or miss), and a revised claim (or a kept claim, if the original held). The audit trail makes this auditable.
The mechanism is straightforward. Every essay containing a graded claim gets a follow-up "claim register" entry recording the grade, the evidence, the resolution date if applicable, and the eventual outcome. The Mercantile Thesis V2's three Bets are the canonical example: each Bet has a date, a falsification criterion, and an explicit commitment that "if these are wrong by the dates given, the merchant lens needs revision (not retirement, but explicit revision), with the failed bets documented as part of the canon."
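One possible shape for a claim register entry is sketched below. The four recorded fields (grade, evidence, resolution date if applicable, outcome) come from the paragraph above; the class names, the `resolve` method, and the rule that a dated claim cannot be scored before its date are assumptions made for illustration. The sketch deliberately leaves out interim revision and retraction.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Grade(Enum):
    UNCERTAIN_BUT_PLAUSIBLE = "uncertain but plausible"
    LIKELY = "likely"
    LIKELY_TO_NEAR_CERTAIN = "likely-to-near-certain"
    KNOWN_UNKNOWN = "known unknown"


class Outcome(Enum):
    PENDING = "pending"
    HIT = "hit"
    MISS = "miss"


@dataclass
class RegisterEntry:
    claim: str
    grade: Grade
    evidence: str
    resolution_date: Optional[date] = None  # None when no date applies
    outcome: Outcome = Outcome.PENDING

    def resolve(self, outcome: Outcome, today: date) -> None:
        """Record a hit or miss (assumed rule: never before the resolution date)."""
        if outcome is Outcome.PENDING:
            raise ValueError("resolve() needs a HIT or MISS verdict")
        if self.resolution_date is not None and today < self.resolution_date:
            raise ValueError("claim has not reached its resolution date")
        self.outcome = outcome


# Usage: the margin-compression claim from section II, which carries no date.
entry = RegisterEntry(
    claim="Wrappers around foundation-model APIs see structural margin compression.",
    grade=Grade.LIKELY,
    evidence="historical analog; electricity, web-hosting, smartphone-app wrappers",
)
```

Keeping the outcome as an explicit `PENDING` state, rather than leaving it blank, makes unresolved claims visible in the register instead of silently missing.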
A failed graded claim is not a discrediting event. It is the calibration mechanism doing its job. The canon's credibility comes from the discipline of recording the failures publicly, not from the absence of failures. A canon with no recorded misses is either too cautious to make load-bearing claims or is hiding them; either way the reader cannot calibrate the writer's grading.
The audit trail also feeds the canon's own evolution. If a particular evidence type repeatedly produces missed claims, the canon should down-weight that evidence type going forward. If likely claims are hitting at rates closer to likely-to-near-certain, the writer is under-grading. If likely-to-near-certain claims are missing more often than the grade implies, the writer is overconfident. The grades are calibration instruments; they only work if they get scored.
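The scoring step can be sketched as a tally of hits per label, where a label is either a grade or an evidence type. The function name `hit_rates` and the sample records are hypothetical; the records are invented to show the shape of the output and are not canon data. The essay assigns no target hit rate to any grade, so the sketch only reports rates and does not judge them.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def hit_rates(resolved: Iterable[Tuple[str, bool]]) -> Dict[str, Tuple[int, int, float]]:
    """Map each label to (hits, total, hit rate) over resolved claims."""
    counts = defaultdict(lambda: [0, 0])  # label -> [hits, total]
    for label, was_hit in resolved:
        counts[label][1] += 1
        if was_hit:
            counts[label][0] += 1
    return {label: (h, n, h / n) for label, (h, n) in counts.items()}


# Illustrative records only: (grade, did the claim hit?)
sample = [
    ("likely", True),
    ("likely", True),
    ("likely", False),
    ("likely-to-near-certain", True),
]
rates = hit_rates(sample)
```

Running the same function with evidence types as labels gives the per-evidence-type view the paragraph above describes.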
V. Anti-Patterns That Look Like Grading
Three patterns look like the discipline and aren't. The discipline exists in part to refuse them.
Hedging as throat-clearing. A writer prepends "on my reading," "for the most part," "in some sense," "to a first approximation," and "arguably" to claims that don't actually have grade-relevant content. The hedges read as humility but contribute no information about the writer's confidence. "On my reading, the merchant lens is the right frame for the appliance layer in some sense for the most part" says nothing the unhedged version doesn't, and obscures which parts of the claim the writer is actually less certain about.
The discipline rejects this. A grade is a structural commitment to a specific confidence level. A throat-clearer is rhetorical apology. Stripping out the throat-clearers tightens the prose without losing any signal because there was no signal in them.
The "for the most part" escape hatch. A writer makes a strong claim and then qualifies it with "for the most part" or "in most cases" to avoid being scored on edge cases. The escape hatch reads as nuance but actually prevents the claim from being falsified. There is no scenario in which the claim definitively misses, because any miss can be attributed to "well, the claim was about most cases, this is an edge case."
The discipline rejects this. A claim that cannot be falsified is not a graded claim. If the writer believes the claim only holds in 80% of cases, the grade should reflect that (likely, with the failure mode being [specific scenario]), and the falsification criterion should specify what proportion of the universe the claim is making predictions about.
Pseudo-precision in the grade. A writer assigns a numeric confidence (72% confident, 0.8 likelihood) without a calibration record long enough to score the numbers against. The numeric grade reads as rigor but actually under-specifies because the reader cannot tell what 72% means in the writer's idiolect. Does this writer's 72% claims hit at 72%, 50%, 90%? Without a calibration record the number is performative.
The discipline accepts numeric grades only when the writer has a calibration record long enough to ground them. The four-grade scale is the canon's default precisely because it's the right granularity for a calibration record at the canon's current size. As the canon grows and the audit trail accumulates, finer grades may become useful.
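A rough calculation shows why a short record cannot ground a number like 72%. The sketch below assumes each resolved claim is an independent trial under a binomial model, which is a simplification, and it is not a derivation of any particular threshold. The function name is hypothetical.

```python
import math


def hit_rate_standard_error(p: float, n: int) -> float:
    """Standard error of an observed hit rate p over n resolved claims,
    assuming independent claims (binomial model)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return math.sqrt(p * (1.0 - p) / n)


# How tightly a 72% hit rate is pinned down at different record sizes.
for n in (3, 50, 1000):
    print(n, round(hit_rate_standard_error(0.72, n), 3))
```

Under these assumptions the standard error is about 0.26 at three resolved claims, about 0.06 at fifty, and about 0.014 at a thousand. With three claims, an observed 72% is statistically indistinguishable from 50% or 90%, which is the sense in which the number is performative.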
The voice discipline this enforces is severe: the writer who takes capability-graded doctrine seriously cannot pad with hedges they don't mean, cannot escape-hatch with "for the most part," and cannot perform precision they haven't earned. Every prose decision has to be grade-honest. The result is denser, sharper, and more falsifiable prose, not more cautious prose. That distinction is the heart of the discipline.
VI. The Grade Graveyard
A canon that grades its claims also needs a place to record the misses. The canon's grade graveyard is the running register of claims that have resolved as misses, with the original grade, the evidence bracket, the resolution date, and the lesson extracted.
Operational state as of 2026-05: the graveyard does not yet exist as a published file. It will live at ~/blog/GRADE_GRAVEYARD.md when the canon's first dated bet resolves (Bet 1, Q4 2027 — see V2's "Falsifiable bets" section). The descriptions in this section are the design specification, not a current-state report; the discipline is operational once the first miss lands and the file is created. A reviewer pre-2027 should read this section as the commitment, not the receipt.
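Since the file does not exist yet, the following is only a sketch of what one graveyard entry could look like once rendered into Markdown. The fields follow the design specification above (original grade, evidence bracket, resolution date, lesson extracted); the class name, function name, heading layout, and the placeholder values in the usage example are all assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GraveyardEntry:
    claim: str
    original_grade: str
    evidence_bracket: str
    resolution_date: date
    lesson: str


def render_entry(entry: GraveyardEntry) -> str:
    """Render one recorded miss as a Markdown section (assumed layout)."""
    lines = [
        f"## {entry.resolution_date.isoformat()}: {entry.claim}",
        f"- Original grade: {entry.original_grade}",
        f"- Evidence: {entry.evidence_bracket}",
        f"- Lesson: {entry.lesson}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder values only; no canon claim has resolved as of this writing.
placeholder = GraveyardEntry(
    claim="<claim text>",
    original_grade="likely",
    evidence_bracket="[Evidence: <type>; <source>]",
    resolution_date=date(2027, 12, 31),
    lesson="<lesson extracted>",
)
rendered = render_entry(placeholder)
```

Leading each section with the resolution date keeps the file in chronological order when entries are appended, which suits a running register.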
The graveyard does three things:
It calibrates the writer. Looking back at five missed near-certain claims teaches the writer that they tend to overrate the structural-stability of organizations, or that they tend to under-weight regulatory disruption, or whatever the systematic miss-pattern reveals. The pattern is not visible from any single missed claim; it is visible from the aggregate.
It calibrates the reader. A reader who can audit the graveyard can decide whether to trust the writer's likely claims at face value or to discount them. Different writers will have different miss-patterns; the graveyard makes the writer's idiolect-of-grading inspectable.
It de-shames the miss. A canon with a public graveyard treats failed claims as routine calibration data, not as discrediting events. Writers are more willing to make load-bearing graded claims when the cost of being wrong is "log the miss in the graveyard and revise" rather than "be quietly discredited and pretend the claim was never made." The graveyard is what makes the discipline survivable.
The Mercantile Thesis V2's three Bets are the canon's first formal candidates for the graveyard-to-be. Bet 1 resolves Q4 2027; Bet 2 resolves Q4 2028; Bet 3 resolves Q4 2029. Each will receive a hit-or-miss verdict in the claim register, and any miss will land in the graveyard. The canon's calibration over time will be audited against that record.
The discipline's working test: would a reader who audited the graveyard be willing to take the writer's next likely claim at face value? If yes, the discipline is working. If no, either the grades are mis-calibrated or the graveyard is incomplete. Either way, the canon needs to fix the discipline before it ships another graded claim.
None of these mechanisms — the four grades, the evidence brackets, the audit trail, the graveyard — is itself the point. The point is that every load-bearing claim in the canon should be inspectable: the reader should be able to see what the writer believes, why, and how strongly, and the canon should be able to score that belief against reality on a known timeline.
Capability-graded doctrine is what makes the canon worth reading more than once. The first read is for the framework; the second read, after some claims have resolved, is for the calibration.
Known Issues for V2
This V1.1 is a first-pass discipline specification, audited adversarially via cold-reader test before publication. Six known gaps remain, deferred to V2 of this essay (V2 ships at the same URL with a dated revision footer; Bet 3's external reviewers should record which version they audited against):
- Grade-graveyard is design-spec not artifact. The discipline names the graveyard but the file ~/blog/GRADE_GRAVEYARD.md doesn't exist until Bet 1 resolves. V2 should either ship the file (empty, with a schema) when the discipline is ratified, or explicitly delete §VI until the first miss lands.
- Mid-resolution revision protocol missing. A graded claim can be falsified by mid-resolution evidence (e.g., a Q2 2027 paper that disproves Bet 1's premise before the 2027-12-31 resolution date). The discipline doesn't specify how to handle interim revision without polluting the audit trail.
- 50-claim threshold for finer scales is unjustified. The first footnote cites "~50 resolved graded claims" as the threshold for a finer grading scale. No derivation. V2 should either justify with Tetlock-literature math or remove the specific number.
- The four grades are author-internal, not externally calibrated. The grade definitions are the writer's idiolect; a different writer's "likely" may have a different empirical hit-rate than this writer's. V2 should specify a calibration-disclosure format for each writer adopting the discipline.
- Evidence-type taxonomy is six-category but unjustified at six. Why six? Why not five (collapse engineering receipt + audit procedure into "verifiable artifact") or seven (split historical analog into single-case vs cross-case)? V2 should defend the cardinality.
- No protocol for retracting a published grade. If a writer realizes a graded claim was wrong pre-resolution-date (not because reality scored it, but because the writer's reasoning was flawed), the discipline doesn't specify how to retract honestly without polluting the audit trail. V2 needs a retraction protocol.
Sources
Foundational:
- The Mercantile Thesis V2, particularly the "What this means for the next decade" section's three forward-looking claims with grades and evidence brackets, and the "Falsifiable bets" section as the canonical falsification surface.
- Doctrine 01 — Quantitative Mercantilism, A Field Statement, the methodological foundation this essay extends.
- Doctrine 06 — The Eight-Axis Check, which applies the grading discipline at the per-axis Pass/Partial/Fail level. The rubric is capability-graded doctrine applied to product evaluation.
Adjacent:
- Philip Tetlock and Dan Gardner, Superforecasting (2015). The empirical foundation for calibration as a measurable skill. The four-grade scale's restraint against pseudo-precision draws on the book's point that fine-grained probabilities are informative only when the forecaster has a scored track record to back them.
- Nassim Nicholas Taleb, Skin in the Game (2018). The dated-bet falsification mechanism is consistent with Taleb's argument that public predictions without skin in the game are noise.
- The IPCC AR6 calibrated likelihood vocabulary ("virtually certain," "very likely," "likely," "about as likely as not," "unlikely," etc.) is the closest published standard for graded scientific claims; the four-grade canon scale is a compressed version.
Cross-references in the canon:
- Doctrine 09 — The Dual-Receipt System. The companion discipline: every graded claim ships with an engineering-side receipt that the grade can be calibrated against.
- Sovereign Audit 08 — The Mercantile Thesis. Engineering-side receipt for the V2's graded claims.
Footnotes
- A finer scale (five, seven, ten grades) was considered and rejected. The Tetlock calibration literature suggests that the marginal accuracy of finer scales requires calibration records of hundreds-to-thousands of resolved predictions. The QM canon as of 2026-05 has three formal dated bets and a handful of less-formal graded claims; a four-grade scale matches the calibration record we can actually score against. As the canon grows past ~50 resolved graded claims, a finer scale may become defensible.
- "On my reading" is throat-clearing when the writer would have made the same claim without it. It's grade-relevant when the writer is signaling that the claim depends on a specific interpretive frame the reader may not share, such as a structural reading of a commercial pattern that an institutional-economics reader might categorize differently. The test: does removing "on my reading" change which scenarios would falsify the claim? If yes, keep it. If no, strip it.