A framework for figuring out when AI-generated code can be formally verified — and when you're kidding yourself.
I've been thinking about a problem that's been bugging me for a while.
We're all using AI to write code now. Copilot, Claude, ChatGPT, internal tools — whatever your flavor. And the code is… surprisingly good? It passes tests, it looks reasonable, it usually does what you asked for.
But "usually" is doing a lot of heavy lifting in that sentence.
Here's the thing nobody talks about at the stand-up: testing can show you bugs exist. It cannot prove they don't. That's not a philosophical position. It's a mathematical fact, courtesy of Dijkstra, circa 1972. And it matters a lot more now that we're generating code at a pace where human review is becoming a bottleneck rather than a safety net.
So I started asking a different question. Not "is this code correct?" but something more fundamental:
Can we even know if this code is correct? And if so, how hard is it to find out?
That question led to a framework I've been developing — a way to classify any AI-generated artifact by how amenable it is to formal verification. Not testing. Not code review. Mathematical proof. I'm wrapping up an academic paper on this with machine-checked proofs of the core claims, and I plan to publish it soon. But the ideas are too useful to keep locked in a PDF, so here's the practitioner's version.
Let me walk you through it.
After a lot of analysis (and quite a bit of theorem proving), I identified five independent structural properties that determine verification tractability. Think of them as dials. Each one, independently, makes the proof harder or easier.
Start with purity. A pure function takes inputs, produces outputs, done. No database calls, no API hits, no checking the clock, no random numbers. Give it the same inputs and you get the same outputs, every single time.
This is the single biggest factor. If your AI-generated function is pure, proving it correct is dramatically easier because the proof doesn't need to account for every possible state of the universe. It just needs to reason about inputs and outputs.
The moment your code reads from a database or calls an external service, you've introduced the entire state of that external system into your verification problem. The proof now has to say "this works regardless of what the database returns," which is a much, much harder claim.
Practical takeaway: If you're configuring AI to generate code for critical paths, push it toward pure functions wherever possible. This isn't just good software engineering — it's the single most impactful thing you can do for verifiability.
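To make the distinction concrete, here's a minimal sketch (the function names and the happy-hour rule are invented for illustration):

```python
from datetime import datetime, timezone

# Pure: the result depends only on the arguments. Proving properties of
# this function means reasoning about integers, nothing else.
def apply_discount(price_cents: int, discount_pct: int) -> int:
    return price_cents * (100 - discount_pct) // 100

# Impure: the result depends on the wall clock. A proof would have to
# quantify over every clock state the function might observe.
def apply_happy_hour_discount(price_cents: int) -> int:
    hour = datetime.now(timezone.utc).hour
    discount_pct = 20 if 17 <= hour < 19 else 0
    return price_cents * (100 - discount_pct) // 100
```

The first function is a Tier 1 candidate as written. The second drags the system clock into the verification problem the moment it's called.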
The second dial is state space complexity. A function that validates a credit card number using the Luhn algorithm has a tiny, well-defined state space. A real-time order book has a state space that grows with every order, every cancellation, every partial fill.
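The Luhn check is a useful reference point precisely because its state is so small. Here's a sketch: pure, a bounded walk over a digit string, and the only accumulated state is a running sum.

```python
def luhn_valid(number: str) -> bool:
    # Walk digits right-to-left; double every second digit; any doubled
    # digit over 9 has 9 subtracted; the total must be divisible by 10.
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("79927398713"))  # a standard Luhn test number: True
```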
There are three flavors here: finite state spaces, which you can enumerate outright; inductive ones, unbounded but built up by well-defined rules (lists, trees, counters), which you can reason about by induction; and unstructured ones that grow without any invariant to induct on. The first two are tractable. The third is where verification efforts go to die.
The third dial, specifiability, is the one that catches people off guard.
Before you can prove code is correct, you need a precise definition of "correct." Not a user story. Not a Jira ticket. A mathematical statement.
For some problems, this is easy: "the output list must be sorted and contain exactly the same elements as the input list." That's crisp. You can write a theorem about it.
For others? "The recommendation engine should surface relevant products." Good luck writing a theorem about that.
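The crisp spec can be written down directly as executable predicates (a sketch; a theorem prover would state the same thing in logic rather than Python):

```python
from collections import Counter

def is_sorted(xs: list) -> bool:
    return all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1))

def meets_sort_spec(inp: list, out: list) -> bool:
    # "Sorted, and exactly the same elements as the input" -- both halves
    # of the spec matter; either half alone is easy to satisfy vacuously.
    return is_sorted(out) and Counter(inp) == Counter(out)

print(meets_sort_spec([3, 1, 2], [1, 2, 3]))  # True
print(meets_sort_spec([3, 1, 2], [1, 2, 2]))  # False: not the same elements
```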
There are actually three levels: formally specifiable, where you can write the theorem (the sorting example); informally specifiable, where humans broadly agree on what "correct" means but can't reduce it to a mathematical statement (the recommendation engine); and undecidable, where no procedure can check the property in general.
Recognizing undecidability early saves you from pouring resources into a hole with no bottom.
The fourth dial is dependencies. Every external system your code calls — every API, every database, every third-party library — enters your verification as an unverified assumption. "We assume the payment gateway returns accurate status codes." "We assume the market data feed delivers prices within 50ms."
These assumptions might be true. They might not. The point is: your proof is only as strong as your weakest assumption.
And here's the kicker: the confidence degrades multiplicatively, not additively. If each of 10 assumptions has a 95% chance of being correct, your overall confidence isn't 50% (10 × 5% risk). It's 0.95¹⁰ ≈ 60%. With 20 assumptions at 95% each, you're at 36%.
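The arithmetic, spelled out (the 95% figure is the illustrative number from above, not a measured value):

```python
def system_confidence(p_each: float, n: int) -> float:
    # n independent assumptions, each correct with probability p_each:
    # the chain holds only if every link does, so confidence multiplies.
    return p_each ** n

print(f"{system_confidence(0.95, 10):.2f}")  # 0.60
print(f"{system_confidence(0.95, 20):.2f}")  # 0.36
```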
That math should make anyone with a lot of microservices a little nervous.
The fifth dial is concurrency. Code that processes one thing at a time: manageable. Code where multiple threads are reading and writing shared state simultaneously: nightmare.
The number of possible execution orderings (interleavings) grows combinatorially with the number of concurrent threads — this is the classic state-space explosion problem. For n threads of m steps each, the number of interleavings is (nm)! / (m!)ⁿ. Concretely, a system with 3 threads of 10 steps each has about 5.5 trillion possible interleavings. Testing can cover a tiny fraction of them. Proving correctness must account for all of them.
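The formula is easy to check yourself:

```python
from math import factorial

def interleavings(n_threads: int, steps_each: int) -> int:
    # (n*m)! / (m!)^n : the multinomial count of distinct orders in which
    # n threads of m ordered steps each can merge into one schedule.
    total = n_threads * steps_each
    return factorial(total) // factorial(steps_each) ** n_threads

print(interleavings(2, 2))   # 6
print(interleavings(3, 10))  # 5550996791340 (~5.5 trillion)
```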
This is why concurrent systems are the last frontier of formal verification, and why race conditions haunt production systems for years before they manifest.
Based on where a piece of software falls on these five dimensions, it lands in one of four tiers. I've formally proved that these tiers are exhaustive, mutually exclusive, and strictly ordered — every artifact goes into exactly one bucket, and the buckets are ranked by difficulty.
Tier 1 — Directly Verifiable. Pure code, simple state, clear spec, minimal dependencies, no concurrency. Current tools can handle this today. If your AI is generating pure data transformations, sorting algorithms, validation logic, or calculation engines — you can formally verify them. Right now. This is not speculative.
Tier 2 — Verifiable with Effort. Some managed side effects, structured state, sequential processing, moderate dependencies. You'll need a theorem prover and either a human expert or a very good AI proof assistant, but it's achievable. Think API contract verification, state machine correctness, regulatory rule engines.
Tier 3 — Research Frontier. High concurrency, many dependencies, complex state. Theoretically verifiable, but beyond what current tools can handle practically. Active research area. Think distributed transaction systems, real-time multi-service orchestration.
Beyond — Formally Intractable. The specification itself is undecidable, or the combination of unbounded state, full concurrency, and opaque dependencies makes verification provably impossible. Don't throw money at formal verification here. Invest in robust testing, runtime monitoring, and circuit breakers instead. Know when to stop trying to prove and start trying to detect.
Here's the result that surprised me the most.
If your code is pure (no side effects) and your state space is well-structured (finite or inductive), then the concurrency dimension collapses entirely. It effectively becomes zero.
Why? Because without side effects and with deterministic transitions, there's only one possible execution trace for any given input sequence. There's nothing to interleave. The scariest dimension of the framework just… vanishes.
This has a profound architectural implication: the way you generate the code determines whether you can verify the code. If your AI coding pipeline is configured to produce pure, structurally clean artifacts, you've preemptively eliminated the hardest verification challenge. This is a design-time decision, not a test-time one.
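A sketch of why the dimension collapses: with a pure transition function and a fixed event sequence, there is exactly one trace, so there is nothing for a scheduler to reorder. (The account example is invented for illustration.)

```python
from functools import reduce

def step(balance: int, event: tuple) -> int:
    # Pure transition: the next state is a function of (state, event) only.
    kind, amount = event
    if kind == "deposit":
        return balance + amount
    if kind == "withdraw" and amount <= balance:
        return balance - amount
    return balance  # reject anything else, including overdrafts

events = [("deposit", 100), ("withdraw", 30), ("withdraw", 90)]
# Same events in, same trace out -- every time, on every run.
print(reduce(step, events, 0))  # 70 (the 90 withdrawal is rejected)
```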
Okay, but here's the question sharp readers are already asking: if we're using AI to write the code and AI to write the proofs, haven't we just moved the problem?
Yes and no.
The proof-checking side is actually solved. Modern theorem provers have a tiny, human-audited kernel — typically a few thousand lines of code — that mechanically checks whether a proof is valid. This kernel is deterministic and exhaustively scrutinized. If it says the proof is valid, it's valid. It doesn't matter if the proof was generated by a human, an AI, or a monkey at a typewriter.
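For a sense of what the kernel actually checks, here is a complete Lean 4 proof. However this term was produced, by a human or a model, the kernel re-verifies it mechanically before accepting it:

```lean
-- The kernel checks the proof term below against the stated type;
-- authorship of the proof is irrelevant to its validity.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```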
The specification side is NOT solved. If an AI generates both the code and the spec, it can (unconsciously, through optimization pressure) gravitate toward specs that are easy to prove rather than specs that are meaningful to prove. "I proved the output list has the same length as the input" is not the same as "I proved the output is correctly sorted." The proof is valid. The spec is inadequate. You now have false confidence — the worst kind.
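Here's that failure mode in miniature: a deliberately bad "sort" that the weak spec fails to reject.

```python
def fake_sort(xs: list) -> list:
    # Satisfies "output has the same length as the input"...
    return [0] * len(xs)

inp = [3, 1, 2]
out = fake_sort(inp)
print(len(out) == len(inp))  # True: the weak spec is "proved"
print(out == sorted(inp))    # False: the meaningful spec fails
```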
The fix: Specifications must remain under human review. The human says what the code should do. The AI figures out how to prove it does. The kernel confirms the proof is valid. Three-party separation of concerns.
One more finding that matters for anyone building systems from components: Tier 1 is not closed under composition.
Translation: you can have two individually verified components, combine them, and the result is not automatically verified. Why? Because each component's verification depends on assumptions about its interfaces, and when you compose them, those assumptions accumulate. Two components with 2 external dependencies each combine into a system with 4, potentially pushing it past the Tier 1 threshold.
This is a real problem for microservice architectures. Each service might be verified in isolation, but the system-level guarantee requires an additional composition argument. That's the subject of the next paper I'm working on.
Theory is nice. Toolchains are nicer. Here's a tier-by-tier guide to what you can use today and what to watch for tomorrow.
These are your pure functions, your data transformations, your business logic calculations. The tooling exists now: SMT-backed verification languages like Dafny and F* let you state pre- and postconditions and discharge many proofs automatically, while interactive provers like Lean, Coq, and Isabelle handle what automation can't.
The bottleneck at Tier 2 isn't the proving — it's the specifying. Focus on making your intentions machine-readable: executable contracts (pre- and postconditions), property-based test properties, and types that encode invariants all force the ambiguity out of a spec before anyone attempts a proof.
You can't prove these correct (yet), but you can be disciplined about managing risk: verify the pure sub-components that are tractable, add runtime monitors that check invariants in production, and wire in circuit breakers to contain the blast radius when an invariant is violated.
What I'm working on next is the decomposition question: if a system is too complex for direct verification (Tier 3 or beyond), can we systematically break it into pieces that are verifiable, and does verifying the pieces give us confidence in the whole?
Early results say: it's subtle. Decomposition and recomposition are not symmetric. But there are conditions under which it works, and I think they're formalizable. Stay tuned.
There are open questions here I'm genuinely wrestling with and would appreciate perspectives on.
Drop a comment. Disagree with me. Tell me I'm wrong about the tiers. That's how this gets better.
I'm finishing an academic paper with the full formal framework and machine-checked proofs behind these ideas. I'll share the link here once it's published. If you work on formal methods, AI code generation, or software assurance and want to collaborate on the decomposition paper, reach out.