Arcadia Impact · AI Governance Taskforce

Enforceable global AI red lines

Architecture before enforcement

The world has agreed which AI capabilities are too dangerous to permit. It has not yet built the machinery to define them precisely, detect a crossing independently, or make a crossing cost anything. What follows is an argument about the order in which that machinery has to be built. A thirty-five-year-old financial watchdog shows the institutional goods are achievable at global scale, and that the order in which you build them is what decides whether the regime endures or collapses.

00

Consensus on what. Silence on how.

Over two years, three separate forums have named the same red lines. At the International Dialogues on AI Safety in Beijing, Bengio, Hinton and Yao named capabilities no system should be permitted to cross; the Seoul commitments turned that into corporate pledges; and the Global Call for AI Red Lines, launched at the UN General Assembly in 2025 by more than three hundred signatories, demanded that governments agree on enforceable limits by the end of 2026. The agreement is almost entirely about the what: autonomous replication, weapons-of-mass-destruction uplift, large-scale cyberattack, loss of meaningful human control.

What no one has built is the how. A red line is only a red line if someone can specify it precisely enough to test, observe a crossing without taking the developer's word for it, and attach a consequence when one occurs. Strip those away and you have a press release. The hard problem was never naming the lines; it is the institutional plumbing underneath them, and that plumbing, the paper argues, has to be assembled in a specific order, because the pieces are not independent.

Start with detection, because everything downstream depends on whether a measurement means anything at all.

The Detection Problem

You are the evaluator.

Before anyone can enforce a red line, someone has to measure whether a model crossed it. Here are two frontier models on identical items, two method choices you would think were neutral, and the red line itself. Flip a method and the verdict moves while the models do not; switch the red line and the target moves out from under you.

Measurement bench: choose a model, two evaluation methods, and a red line to see how the reported score band moves.

Model under evaluation (identical items): two frontier models that respond to scaffolding in opposite directions on identical items.

Controls and their measured effect:

  • 01 Question format: 5–20 pt
  • 02 Scaffold: ~35 pt span
  • 03 Capability domain: 5% → 60%

Default readout (one red-line benchmark, higher = safer): Model A reads as 63–67, a band width of 4 pt. Model A and Model B are 0 points apart.

Both methods at their conventional defaults. Nothing about the model has changed.

Step 01 · the boring case

The two models agree.

At conventional defaults (multiple-choice items, single-turn, the most mature red line) the two frontier models read the same. The instrument is dull on purpose: this is what an evaluation looks like when nothing about the method is doing the talking.

Step 02 · the question format moves the score

Same items, asked open-ended.

Present the identical items as open-ended rather than multiple-choice and each band widens: the dial now reports how much format choice alone smears the score, not a verdict. The two models still overlap, so the gap between them stays at zero. Format costs you precision, not a ranking.

Step 03 · the harness alone

Change only the scaffold.

Same two models, same items. Swap single-turn for a map-reduce harness and they split: Model A loses ground, Model B gains it, and the two readings pull apart, in opposite directions, with no change to either model. Watch the hero count climb and the span draw out across the gap.
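The behaviour in Steps 01 to 03 can be sketched in a few lines. This is an illustrative model, not the bench's actual implementation: the 65-point baseline is the midpoint of the 63–67 default readout, but the per-model scaffold offsets and the open-ended half-width are hypothetical values, chosen only so that the gap matches the ~35-point span and the band stays inside the 5–20 pt format swing reported in the text.

```python
from dataclasses import dataclass

BASELINE = 65.0  # midpoint of the 63–67 default band in the readout

# Hypothetical per-model scaffold offsets, in points. Only their difference
# (~35 pt) comes from the text; the symmetric split is an assumption.
SCAFFOLD_OFFSET = {
    "single-turn": {"Model A": 0.0, "Model B": 0.0},
    "map-reduce": {"Model A": -17.5, "Model B": 17.5},
}

# Band half-width per question format. 2 pt reproduces the 4-pt default band;
# 10 pt is an assumed value for the open-ended case.
FORMAT_HALF_WIDTH = {"multiple-choice": 2.0, "open-ended": 10.0}


@dataclass(frozen=True)
class Band:
    centre: float
    half_width: float

    @property
    def low(self) -> float:
        return self.centre - self.half_width

    @property
    def high(self) -> float:
        return self.centre + self.half_width


def read_band(model: str, question_format: str, scaffold: str) -> Band:
    """Reported score band for one model under one method configuration."""
    centre = BASELINE + SCAFFOLD_OFFSET[scaffold][model]
    return Band(centre, FORMAT_HALF_WIDTH[question_format])


def points_apart(a: Band, b: Band) -> float:
    """Distance between the two band centres."""
    return abs(a.centre - b.centre)


if __name__ == "__main__":
    for fmt, scaffold in [
        ("multiple-choice", "single-turn"),  # Step 01: models agree
        ("open-ended", "single-turn"),       # Step 02: bands widen, no gap
        ("multiple-choice", "map-reduce"),   # Step 03: models split
    ]:
        a = read_band("Model A", fmt, scaffold)
        b = read_band("Model B", fmt, scaffold)
        print(fmt, scaffold, (a.low, a.high), (b.low, b.high), points_apart(a, b))
```

The point of the sketch is that nothing in `read_band` describes the model's underlying safety: every change in the output comes from the method arguments.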

Step 04 · the target won't hold still

The red line itself can move.

Biology is a mature red line; autonomous replication is a moving one. On RepliBench, component-task success climbed from under 5% in early 2023 to over 60% by summer 2025 (Black et al., 2025). When the benchmark itself moves this fast, the method knobs are second-order: there is no stable target to fix a threshold against.

Figure: RepliBench component-task success, from under 5% (early 2023) to over 60% (summer 2025), against an illustrative threshold. Two published anchors (RepliBench, Black et al. 2025); the rise between them is drawn as a trajectory between the two reported points, not measured monthly. The threshold is illustrative.

Now you try

The bench is yours.

Every switch is live. Flip the method, change the model, move the red line: the instrument answers in real time. Reset rebuilds the bench whenever you want to start over.

What each switch actually does, measured

5–20 pt

Swing in a measured red-line score from question format alone: multiple-choice versus open-ended presentation of identical items.

~35 pt

Span of model × scaffold interactions, in opposing directions: wide enough that a composite score cannot reliably separate safer models from less safe ones. Two frontier models on identical items move apart this far under nothing but a change in agent scaffold (Gringras, 2026); Wei et al. (2026) replicate the configuration-sensitivity in live agentic deployments, an 8.6-point shift in attack-success rate from scaffold choice alone across three frameworks on a single model.

5 → 60%

RepliBench autonomous-replication component-task success, from early 2023 to summer 2025 (Black et al., 2025). The benchmark is reproducible; the capability it measures is moving faster than the methodology can fix a threshold to it.

Evaluation maturity is uneven across the red lines. Biological uplift is the most mature, tested bilaterally by the UK and prior US institutes against curated CBRN scenarios. Cyber and autonomous replication are only partial: Cybench and NIST CAISI’s universal-attack work have produced credible, reproducible benchmarks, but the trajectory is fast enough that current methodology cannot yet underwrite a threshold-based verdict.

The takeaway

A composite safety score this sensitive to method can't reliably tell a safe model from an unsafe one. That is why the paper makes measurement-science standardisation the Network's first-order deliverable: until the noise floor is characterised, a threshold-based verdict is false precision, not science.
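One way to read "false precision" is as a decision rule: a verdict is only reported when the score sits further from the threshold than the method-induced noise. The sketch below is a minimal illustration under that assumption; the function name, the threshold of 60, and the choice of half the ~35-point scaffold span as the noise floor are all hypothetical, not values from the paper.

```python
from enum import Enum


class Verdict(Enum):
    SAFE = "clears the threshold"
    UNSAFE = "crosses the threshold"
    INDETERMINATE = "within the noise floor"


def threshold_verdict(score: float, threshold: float, noise_floor: float) -> Verdict:
    """Higher score = safer. Refuse a verdict when the margin to the
    threshold is no larger than the characterised noise floor."""
    margin = score - threshold
    if abs(margin) <= noise_floor:
        return Verdict.INDETERMINATE
    return Verdict.SAFE if margin > 0 else Verdict.UNSAFE


if __name__ == "__main__":
    # Hypothetical threshold of 60 on a 0-100 scale.
    # Noise floor taken as half the ~35 pt scaffold span: no verdict possible.
    print(threshold_verdict(65.0, 60.0, noise_floor=17.5))
    # With a 2 pt noise floor, the same score supports a verdict.
    print(threshold_verdict(65.0, 60.0, noise_floor=2.0))
```

Under this rule, characterising the noise floor changes how often the rule returns a verdict at all, while the score itself stays the same.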

Figure · the measurement bench

An illustrative readout: the band is assembled from each paper’s reported effect sizes, not a live evaluation. Move a method and the band moves with it.

Effect sizes from Safety Under Scaffolding (Gringras, 2026): format and scaffold effects across 62,808 scored observations, six frontier models, four deployment configurations. Capability-trajectory figure from RepliBench (Black et al., 2025); agentic corroboration from ClawSafety (Wei et al., 2026).

02

Suppose detection were solved. Who measures?

Imagine the measurement problem fixed tomorrow. A verdict still needs a verifier: a body with the mandate to demand access, the technical depth to run the tests, and standing the rest of the world will accept. The paper's nominee is the International Network for Advanced AI Measurement, Evaluation and Science (the “Network”): the only arrangement today that combines state-level mandates, pre-deployment access to frontier models, and direct relationships with the labs. In eighteen months it has produced a universal jailbreak result on GPT-5, charted the autonomous-replication trajectory from below 5% to above 60% on RepliBench, and run joint testing across nine jurisdictions.

A verifier, though, is only as credible as its reach. And the Network's, for now, is uneven.

The Candidate

One network, mandated powers, uneven reach.

The paper's nominee to verify AI red lines is the International Network for Advanced AI Measurement, Evaluation and Science (renamed in late 2025 from the International Network of AI Safety Institutes), and referred to here as the Network. On the authors' reading, no other international arrangement holds the same combination its member institutes do: state-level mandates, plus pre-deployment access to frontier models, backed by direct technical relationships with the major developers. The reach, so far, is uneven; closing that is the work ahead.


Showing all 11 jurisdictions mapped.

01 · Highest capacity

  • United Kingdom: AI Security Institute
  • European Union: EU AI Office
  • United States: Center for AI Standards and Innovation (CAISI)

02 · Operational mid-capacity

  • Japan: Japan AI Safety Institute
  • South Korea: AI Safety Institute
  • Singapore: Singapore AI Safety Institute
  • Canada: Canadian AI Safety Institute

03 · Establishing · coordination · signatory

  • Australia: Australian AI Safety Institute
  • India: IndiaAI Safety Institute
  • France: INESIA
  • Kenya: No formal institute

Step 01 · the whole field

Eleven jurisdictions, one network.

The Network the paper nominates to verify AI red lines: eleven state-mandated institutes, grouped by capacity, coloured by mandate. The full field, every member in view.

Step 02 · open any institute

Real bodies, not a logo wall.

Tap any institute and it opens: budget, staff, the labs it has formal access to, the tools it has shipped. Here, the UK's: the field's largest, and the Network's Coordinator. Every tile works the same way.

Step 03 · the concentration point

By resources, largely one institute.

Switch from the map to the resource view. On a common budget axis the UK bar erupts past the field: it dwarfs every other member, the concentrated capacity the paper reads as achievability, not deficit.

Step 04 · the full network, in view

Capacity without reach: the work ahead.

Back to the full map. The capacity is real, its reach uneven: one binding enforcer (the EU), two Global South footholds (Kenya, with India next), no Chinese member. The count below tallies it.

Now you try

The map is yours.

Every control is live. Filter by mandate to see who does what (only the EU can compel), flip to the resource view, and open any institute for its budget, staff, and remit.

One member with binding enforcement power: the EU AI Office. Every other institute can test and advise; none can compel.
Two footholds in the Global South: membership stays overwhelmingly OECD or OECD-adjacent, with Kenya the sole member today and India the second once its new institute formally joins.
No Chinese participation: China sits outside the Network entirely, though its own newly formed safety association leaves a closer relationship open.

What the map argues

A standard-setter needs both capacity and standing. The Network has capacity, concentrated in a handful of OECD states where its legitimacy also rests; that is the gap the FATF spent three decades and nine regional bodies closing. Building that reach is a precondition for enforcement, not a consequence of it.

Figures as of the paper's writing (2026), from its institutional mapping and Annex II; budgets approximate, some budgets and staffing undisclosed.

03

The Financial Action Task Force did this without a treaty.

To see where a young and uneven network is headed, look at the one body that has already governed a problem of the same shape: global, dual-use, concentrated in a handful of jurisdictions, and policed without a treaty. The Financial Action Task Force shepherded the near-universal adoption of anti-money-laundering standards through four institutional goods: principle-based standard-setting, consent-based information-sharing (the Egmont model), regional bodies that diffuse the rules and confer legitimacy, and a grey-and-black-list mechanism that creates market consequences without any legal power to compel them.

Its thirty-five-year trajectory is the clearest evidence that the goods a red-lines regime needs are achievable without binding law. It is also a warning about timing.

Two Clocks

Eighteen months, or thirty-five years?

The paper reads the Network against the Financial Action Task Force, a soft-law body that bound the world to anti‑money‑laundering rules without a treaty. Line up the two clocks and the Network’s position is plain.

The FATF’s thirty-five-year arc, in six stages

  1. 1989 · Foundation · A G7 initiative; the Forty Recommendations inside its first year, nothing yet to enforce.
  2. 1992 onward · Standards · Evaluation capacity, no enforcement. The first decade went to procedure, not punishment.
  3. June 2000 · Here · The NCCT blacklist: fifteen jurisdictions named to trigger consequences the FATF had no authority to impose. Eleven years in. This is the phase the Network has now reached.
  4. 2006 · Collapse · The blacklist is discontinued, within six years of launch: premature enforcement delegitimised itself.
  5. 2007 onward · Recovery · The ICRG rebuilds enforcement on quantified thresholds applied regardless of membership, and it held.
  6. Today · Maturity · Near-universal reach, with the standing to impose consequences earned over three decades of building first.

The International Network, formalised 2024 and roughly eighteen months building, has reached Stage 3: the same developmental phase the FATF stood at in June 2000, before its blacklist. Stages 4 through 6 remain ahead of it.

Financial Action Task Force: founded 1989 · ~35 years to maturity

International Network: formalised 2024 · ~18 months building

Step 01 · the FATF begins

One clock starts in 1989.

The Financial Action Task Force opens as a G7 initiative: sixteen founding states, the Forty Recommendations inside its first year, nothing yet to enforce. The top rail lights its first node.

Step 02 · the decades accrue

First a decade of procedure, not punishment.

First mutual evaluations, a Secretariat at the OECD, working groups: all of it before any enforcement. The phase that took the FATF years has taken the Network months; the pace strip carries the asymmetry the equal rail width hides.

Step 03 · the Network reaches here, fast

Where the FATF stood before its blacklist.

On the FATF’s clock this is June 2000, eleven years in: the NCCT blacklist, fifteen jurisdictions named to trigger consequences it had no authority to impose. The Network has reached the same phase in roughly eighteen months. The two clocks snap into register.

Step 04 · the blacklist collapses

Enforcement without foundations delegitimised itself.

Within six years the list was gone, discontinued in 2006. Naming jurisdictions to trigger consequences the regime had no standing to impose cost it legitimacy, not leverage. This is the error the paper’s sequencing argument exists to prevent.

Step 05 · the second time, it held

Rebuilt rule-bound and member-blind.

The 2007 ICRG rebuilt enforcement on quantified thresholds applied regardless of membership, foundations first and consequence second, and it held. The order the paper urges.

Step 06 · the mature regime

Near-universal, three decades on.

Today the FATF’s standards reach almost every jurisdiction, with sustained political support, the end state the right order made possible. The Network sits roughly where the FATF stood at its blacklist: its path ahead is the FATF’s second attempt, not its first. Every control is now live; step or jump to any stage, including the collapse you just passed.

The lesson

The FATF’s arc runs thirty-five years, and it proved the path works by reaching the end of it: three decades of building the goods and the legitimacy first, and only then the standing to impose consequences. After the 2000 blacklist collapsed, the 2007 ICRG rebuilt enforcement on objective, member-blind thresholds, and it held. The Network has had eighteen months, on a clock that may not allow three decades; the lesson isn’t ‘wait,’ it’s ‘build in the order that worked the second time, and skip the collapse in between.’

04

A deeper lesson in the record.

Before drawing the timing lesson, the FATF's record forces a harder one: what counts as success at all. By every institutional measure the regime is a triumph. By the measure of its stated purpose, it is hard to find any effect.

Score the regime on its own terms.

The Rational Myth

A regime that built everything except the proof it worked.

Over three decades the FATF achieved something close to total institutional success. Whether it actually reduced money laundering is a different question.

Two scales, one regime. Press to compare.

Institutional success

Did the regime take hold?

200+ jurisdictions have adopted the standards: near-universal reach.

40 members, plus nine regional review bodies (FSRBs).

25,000+ information exchanges a year run through the Egmont secure platform.

89% of relevant US investigations resulting in financial convictions drew on Bank Secrecy Act data.

Outcome effectiveness

Did it work?

97% of assessed countries receive only low-to-moderate effectiveness ratings.

Measured decline in money laundering

No evidence that laundering has become harder or less prevalent. (Nazzari & Reuter, 2025)

The evidence on prevalence, across three decades

Flat. Compliance climbed; the evidence shows no decline.

Why compliance held anyway

A “rational myth”: a commitment states maintain for legitimacy even though listing showed no measurable financial bite. (Case-Ruchala & Nance, 2024)

Near-universal compliance. No measurable effect on the underlying crime.

Step 01 · did the regime take hold?

Institutional success, near-total.

By every institutional measure the FATF won: technical compliance climbed from 36% in 2012 to 76% under the fourth round, with standards in over 200 jurisdictions, 40 members plus nine regional review bodies, 25,000+ Egmont exchanges a year, and Bank Secrecy Act data behind 89% of relevant US financial-conviction investigations.

Step 02 · but did it work?

Outcome effectiveness, flat.

Ask the other question and the record inverts: effectiveness scores average just 28% across about 120 assessed countries, 97% rated only low-to-moderate, and after three decades no evidence that laundering became harder or less prevalent (Nazzari & Reuter, 2025).

Step 03 · the two scales, one axis

The 48-point void.

Put both numbers on a single 0–100 axis and the myth becomes geometry: a 76% compliance bar towers over a 28% effectiveness bar, and the bracket spans the 48-point gap the regime built everything to close and never did.

Step 04 · what the record actually proves

Compliance climbed; the effect did not follow.

Near-universal compliance, no measurable effect on the underlying crime: a commitment states keep for legitimacy even when no bite can be measured, Case-Ruchala & Nance’s ‘rational myth.’ The lesson the paper carries forward is to judge a regime by the institutional goods it produces, not outcomes the evidence cannot yet support.

Now you’ve seen the record

The full ledger.

Both scales, side by side: the institutional record the FATF built, and the outcome it could never show. Read on for what the paper does with it.

The reframe

Read the other way, the FATF is an existence proof: a common standard adopted almost everywhere, peer-reviewed mutual evaluation, a secure information channel in daily international use. The operational backbone of a global regime can be built, and was. What it could not show is that the backbone reduced the crime; its success was institutional, not outcome-based: a “rational myth” in Case-Ruchala & Nance’s sense, a commitment states maintain for legitimacy even though listing showed no measurable financial bite. The paper’s move follows directly: judge the AI Network by whether it produces the institutional goods any future enforcement would need (shared standards, comparable evaluation, credible information flows, legitimacy), not by outcomes the evidence cannot yet support.

Figures from the paper: technical compliance 36% (2012) to 76% under the fourth round (FATF, 2022); effectiveness scores average 28% with 97% low-to-moderate (Basel Institute on Governance, 2024); Bank Secrecy Act contribution (IRS, 2026); Nazzari & Reuter (2025); Case-Ruchala & Nance (2024).

05

Build it in the right order and it holds.

Which returns us to the order. The FATF's institutional goods were not modular, and the regime nearly destroyed itself by reaching for the last one first, publishing a blacklist in 2000 before it had the legitimacy to make one stick. Build the AI regime in that same wrong order and it fails the same way; build it in the right one and the consequences finally have something to stand on. The machine below lets you try both.

Assemble the regime yourself. The enforcement lever is always live.

The synthesis · interactive

The Sequencing Machine

Each layer unlocks the one above it. The enforcement lever at the top is always live; pull it whenever you judge the regime ready. The order is the argument.

05

The Consequence Layer

Graduated escalation

Procurement conditionality → conditional pre-deployment access → compute-governance triggers. Each rung credible only because a graver one sits above it.

Year 3–5+ · Locked
The political temptation: name-and-shame today, and let the consequences manufacture the credibility. The Global Call for AI Red Lines wants agreement by the end of 2026.

Regime credibility

Unbuilt · 6%

Nothing built yet. Consequences fired now would have nothing to stand on.

What just happened

Start at the foundation: activate Shared standards to begin. Or pull the enforcement lever now and watch what arrives when consequences come before the regime can carry them.

Reading the stack. The four lower layers are the institutional goods the FATF built over thirty-five years; the paper argues they transfer to the AI Network with surprising fidelity. Without shared definitions, scores aren’t comparable; without comparable scores, shared information is noise; without credible information, an enforcement signal dissolves.
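The dependency chain in that paragraph can be written down as a small state machine. This is a sketch of the logic only, not the interactive's code: the class and method names are hypothetical, the layer names follow the four goods the text lists, and the credibility percentage shown in the widget is deliberately left out because the text gives only the 6% starting value.

```python
# Foundation layers, bottom to top. Each one unlocks the next.
LAYERS = [
    "Shared standards",
    "Comparable evaluation",
    "Credible information flows",
    "Legitimacy",
]


class SequencingMachine:
    """Layers must be built in order; the enforcement lever is always live."""

    def __init__(self) -> None:
        self.built: list[str] = []

    def activate(self, layer: str) -> None:
        if len(self.built) == len(LAYERS):
            raise ValueError("all foundation layers are already built")
        expected = LAYERS[len(self.built)]
        if layer != expected:
            raise ValueError(f"{layer!r} is locked; build {expected!r} first")
        self.built.append(layer)

    def pull_enforcement_lever(self) -> str:
        missing = LAYERS[len(self.built):]
        if missing:
            return "premature enforcement: missing " + ", ".join(missing)
        return "enforcement holds"


if __name__ == "__main__":
    machine = SequencingMachine()
    print(machine.pull_enforcement_lever())  # lever pulled before anything is built
    for layer in LAYERS:
        machine.activate(layer)
    print(machine.pull_enforcement_lever())
```

The lever never raises an error, which mirrors the argument: nothing stops a regime from enforcing early, it simply has nothing to stand on when it does.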

Premature enforcement

You rebuilt the NCCT list.

In June 2000, eleven years after its founding, the FATF published its first “Non-Cooperative Countries and Territories” blacklist: fifteen jurisdictions, named to trigger market consequences the FATF itself had no authority to impose. It exempted its own members; major financial centres such as Switzerland and Luxembourg went unexamined. Read as politically selective, the list collapsed under its legitimacy deficit within six years and was discontinued in 2006.

The FATF’s error was not that it enforced too slowly. It enforced too soon: before the foundations that make enforcement credible existed, and then built them retrospectively, under pressure.

A regime that holds

Enforcement with something to stand on.

With definitions agreed, evaluation made comparable, information flowing under control, and legitimacy banked, a consequence finally has a foundation. The FATF reached this only after its blacklist collapsed: the 2007 ICRG process tied listing to quantified thresholds, a structured observation period, and the same criteria for members and non-members alike. The gradient became credible because the regime had earned the standing to impose it.
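The difference between a list that exempts members and a member-blind one can be made concrete. The sketch below is illustrative only: the `deficiencies` metric, the threshold of 3, and both function names are hypothetical, and the structured observation period is omitted. It does not reproduce the actual NCCT or ICRG criteria.

```python
from dataclasses import dataclass

DEFICIENCY_THRESHOLD = 3  # hypothetical quantified threshold


@dataclass(frozen=True)
class Jurisdiction:
    name: str
    is_member: bool
    deficiencies: int  # hypothetical count of unmet criteria


def member_exempt_rule(j: Jurisdiction) -> bool:
    """Selective listing: members are never listed, whatever their record."""
    return (not j.is_member) and j.deficiencies >= DEFICIENCY_THRESHOLD


def member_blind_rule(j: Jurisdiction) -> bool:
    """Member-blind listing: the same threshold applies to everyone."""
    return j.deficiencies >= DEFICIENCY_THRESHOLD


if __name__ == "__main__":
    cases = [
        Jurisdiction("member, poor record", is_member=True, deficiencies=5),
        Jurisdiction("non-member, poor record", is_member=False, deficiencies=5),
        Jurisdiction("non-member, good record", is_member=False, deficiencies=1),
    ]
    for j in cases:
        print(j.name, member_exempt_rule(j), member_blind_rule(j))
```

Only the first case separates the two rules, and it is the case the text identifies as the source of the legitimacy deficit.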

  i. Procurement conditionality: Network findings become risk signals for government purchasing, which bites in a market this concentrated.
  ii. Conditional pre-deployment access: Cooperation and access tied to compliance, credible because the rung above it is real.
  iii. Compute-governance triggers: Evaluation findings linked to compute access: the black list that makes the grey list mean something.

This is the order the paper’s recommendations follow, and why it places the International Network at the FATF’s pre-enforcement moment, not its enforcement one.

The recommendation

06

What to build now, and what to wait for.

The paper's recommendations follow the same dependency chain, sequenced across three phases. Each phase is justified by what it delivers in its own years, not as a down-payment on an enforcement era that may never arrive.

Year 0–1

Build the foundation

  • A standing Coordinator with real administrative capacity, not an interpersonal arrangement that dies with personnel turnover.
  • A common glossary of red-line definitions, convened as a working group, completed no later than 2027.
  • Adopt a network-wide confidentiality protocol; expand pilot joint testing.

Year 1–3

Make findings commensurable

  • A multi-institute fractional-factorial variance study to characterise evaluator-dependent noise: the precondition for pooling results.
  • Codify the two exchange logics: publication-logic for what can be public, confidentiality-logic for what cannot.
  • Broaden evaluation from models to developer safety frameworks; fund capacity-building for lower-resourced members.
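To make the fractional-factorial idea concrete, here is a minimal sketch of a half-fraction design over four two-level factors. The paper does not specify the design; the factor names, their levels, and the choice of a 2^(4-1) design with the last factor generated as the product of the other three are all assumptions made for illustration.

```python
from itertools import product
from math import prod

# Hypothetical two-level factors for an evaluator-noise study.
FACTORS = {
    "question_format": ("multiple-choice", "open-ended"),
    "scaffold": ("single-turn", "map-reduce"),
    "institute": ("institute_1", "institute_2"),
    "elicitation_budget": ("low", "high"),
}


def half_fraction(factors: dict[str, tuple[str, str]]) -> list[dict[str, str]]:
    """Build a 2^(k-1) half-fraction: enumerate the first k-1 factors in full
    and set the last factor's coded level to the product of the others."""
    *base, generated = list(factors)
    runs = []
    for signs in product((-1, 1), repeat=len(base)):
        coded = dict(zip(base, signs))
        coded[generated] = prod(signs)
        # Map coded levels -1 and +1 to the first and second named level.
        runs.append({name: factors[name][(sign + 1) // 2] for name, sign in coded.items()})
    return runs


if __name__ == "__main__":
    design = half_fraction(FACTORS)
    print(len(design), "runs instead of", 2 ** len(FACTORS))
    for run in design:
        print(run)
```

With four factors this gives 8 runs instead of 16. In this particular design, each main effect is aliased only with a three-factor interaction, which is the usual reason for accepting the smaller run count when each run is an expensive frontier-model evaluation.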

Year 3–5+

Authoritative findings

  • Mutual peer review among institutes; a regional-body layer for legitimacy and reach.
  • Only then, a graduated consequence gradient: procurement conditionality, conditional pre-deployment access, compute-governance triggers.
  • Reserve activation of the upper rungs until the foundations can sustain them.

The honest caveat

The politically achievable may fall short of the analytically necessary. That is exactly why the foundation and standardisation work, the unglamorous first years, is designed to be worth doing on its own terms, whether or not the enforcement era ever arrives.