Regular Expressions: The Formal Model
You already write regex. You know \d+ matches digits and .* matches everything. But when a colleague asks "why did that regex take down the server?" or "why can't I match balanced parentheses with regex?" — the answer isn't in any syntax guide.
This is the theory behind the tool.
Understanding how regex engines work explains ReDoS, why some engines are faster than others, why backreferences are controversial, and why your regex-based HTML parser is formally broken. This article assumes you know regex syntax — if you need a refresher, start with Regular Expressions.
Learning Objectives
By the end of this article, you'll be able to:
- Explain how Thompson's construction compiles a regex pattern into an NFA
- Distinguish DFA engines (guaranteed \(O(n)\)) from NFA/backtracking engines and their practical tradeoffs
- Identify patterns vulnerable to catastrophic backtracking and explain the mechanism (ReDoS)
- Explain why backreferences make regex non-regular — and what that means for engine design
- Know when regex cannot solve a problem and why a parser is the correct tool instead
Where You've Seen This
These are the situations where the formal model matters:
- ReDoS CVEs — Vulnerability reports like Cloudflare's 2019 outage trace back to the backtracking model described here; the root cause is always a mathematical property of the pattern, not a code bug
re2vsre— Google'sre2library guarantees \(O(n)\) matching by using a DFA engine; Python'sremodule uses backtracking and can be exponential on adversarial input- Regex101's "explain" mode — The NFA diagram it generates is literally what your engine compiles your pattern into
- "You can't parse HTML with regex" — This isn't snobbery; it's a formal result about what regular languages can and cannot express
- Database regex performance — PostgreSQL's
~operator and MySQL'sREGEXPhave different performance characteristics based on their underlying engine architectures
Why This Matters for Production Code
In July 2019, a regex in Cloudflare's WAF firewall ruleset caused CPU usage to spike to 100% across their entire network for 27 minutes. A minor pattern update introduced catastrophic backtracking — exponential runtime against certain inputs.
This was not a bug in the code. It was a mathematical property of the pattern interacting with the backtracking engine. The same mechanism has caused multiple Stack Overflow outages and is tracked as a vulnerability class (CWE-1333).
When you install google-re2 in Python or reach for Rust's regex crate, you're trading away backtracking features for guaranteed \(O(n)\) performance. That trade-off only makes sense if you understand what the backtracking engine is doing — and when it becomes dangerous.
High-traffic input validation (URL routing, request filtering, log parsing) benefits from DFA engines. Complex extraction with backreferences requires NFA/backtracking engines. Knowing the difference helps you pick the right tool.
Balanced parentheses, matching XML tags at arbitrary nesting depth, validating that a string has equal numbers of as and bs — none of these are regular languages. No regex, no matter how clever, can correctly handle them for all possible inputs. Understanding why tells you when to stop reaching for regex and start reaching for a parser.
From Pattern to Automaton
When you call re.compile(r'\d+'), your regex engine doesn't interpret the pattern character by character at match time. It compiles the pattern to a state machine first.
The compilation pipeline:
| Stage | Operation | Output |
|---|---|---|
| 1 | Parse the regex | Syntax tree |
| 2 | Thompson's construction | NFA |
| 3 | Subset construction (optional) | DFA |
| 4 | Execute against input | Match / No match |
Thompson's construction converts a regex to an NFA systematically — each regex operator maps to a small NFA fragment that gets composed into a larger machine:
| Regex | NFA structure |
|---|---|
a |
Two states, one transition labeled a |
a\|b |
Split state with two paths, both joining at end |
ab |
Chain: state for a → state for b |
a* |
Loop: a transition back to same state, epsilon path to exit |
The regex ab*c compiles to this automaton:
stateDiagram-v2
direction LR
[*] --> S0
S0 --> S1: a
S1 --> S1: b
S1 --> S2: c
S2 --> [*]
- S0 → S1 on
a(must see exactly onea) - S1 → S1 on
b(loop: zero or morebs) - S1 → S2 on
c(exit loop, accept)
Complex patterns are built by composing these fragments. Every regex you've ever written has an equivalent state machine — regex101's "explain" diagram visualizes exactly this.
Two Engine Architectures
You already know how DFAs and NFAs differ. Real regex engines implement one model or the other, and the choice has concrete consequences for every regex you run in production.
DFA Engines
DFA engines convert the NFA to a deterministic automaton before execution. At match time, they make exactly one state transition per input character — guaranteed \(O(n)\) time regardless of the pattern.
Common DFA engines: Google's re2, GNU grep (basic patterns), Rust's regex crate, awk, Go's regexp package
| Property | Value |
|---|---|
| Time complexity | \(O(n)\) always |
| ReDoS vulnerable | No |
Backreferences (\1) |
Not supported |
| Variable-length lookbehind | Not supported |
NFA/Backtracking Engines
NFA engines simulate the automaton using backtracking. When the engine reaches a choice point — a|b, or a quantifier like a+ — it tries one path. If that path fails, it backtracks and tries the next.
Common NFA/backtracking engines: Python's re, JavaScript, Java, PHP (preg_*), Ruby, Perl, most "standard" regex engines
| Property | Value |
|---|---|
| Time complexity | \(O(n)\) typical, \(O(2^n)\) worst case |
| ReDoS vulnerable | Yes |
Backreferences (\1) |
Supported |
| Complex lookahead/lookbehind | Supported |
Choosing an Engine in Practice
The engine is a library choice, not a language choice. Here's what each major language provides by default — and how to opt into DFA-based matching when it matters:
Backreferences Break the Model
Here is a result that surprises most working engineers: backreferences make regex non-regular.
Consider the pattern (\w+) \1 — any word followed by a space and the same word. This matches "the the" and "error error" but not "the cat". The language it describes — "a word, a space, the same word" — cannot be recognized by any finite automaton.
Why? To evaluate \1, the engine must remember the exact text captured by Group 1. That text could be any length. An FSM has no memory beyond its current state — it can track which state it's in, but not what characters it previously matched. Backreferences require unbounded memory, which puts them outside the class of regular languages entirely.
This has practical implications:
- DFA engines don't support
\1— supporting it would require abandoning the DFA architecture - Backtracking engines support
\1but can be slow when patterns combine backreferences with quantifiers - When you write
(\w+)\s+\1, you're using a feature that technically makes your "regex" more than a regular expression
The same argument applies to any backreference: \2, \3, and named captures all step outside the formal regular language model.
Why Backtracking Blows Up
Now the consequence: because backtracking NFA engines support backreferences and other advanced features, they expose you to exponential blowup on adversarial input.
The pattern (a+)+b against aaaaaaaaac (many a's, ending in something that's not b):
The outer + can match 1, 2, 3, ... times. Each time, the inner a+ can consume 1, 2, ... remaining a's. When the overall match fails on c, the engine backtracks through every possible way to partition the a's among the outer repetitions.
For \(n\) a's, there are \(2^{n-1}\) partitions to try:
| Input | Backtracks attempted |
|---|---|
5 a's + c |
16 |
10 a's + c |
512 |
20 a's + c |
524,288 |
30 a's + c |
~500 million |
40 a's + c |
~500 billion |
A request with 40 adversarial characters can pin a server thread indefinitely.
Why the fix works:
With a single a+, there's exactly one way to match the a's. The engine reaches c, sees the b doesn't match, and fails immediately — no choice points to revisit.
The general rule: nested quantifiers on the same character class create exponential choice points. Flatten them. A linear regex always performs linearly.
Lookahead and Lookbehind
Lookahead ((?=...)) and lookbehind ((?<=...)) are zero-width assertions — they check a condition at the current position without consuming input characters.
Unlike backreferences, these don't require unbounded memory in most implementations. The lookahead pattern itself is finite. But they do require the engine to "branch" at the assertion point and explore, which is why their support varies:
| Feature | DFA engines | NFA engines |
|---|---|---|
Positive lookahead (?=...) |
Limited | Supported |
Negative lookahead (?!...) |
Limited | Supported |
Fixed-length lookbehind (?<=abc) |
Limited | Supported |
Variable-length lookbehind (?<=\w+) |
Not supported | Python only (most don't) |
JavaScript and Java require fixed-length lookbehind. Python's re module uniquely supports variable-length lookbehind — most other NFA engines don't.
The Practical Boundary
| What you need to match | Right tool |
|---|---|
| Fixed patterns, no captures | DFA engine (re2, Rust regex) |
| Capture groups, substitution | NFA engine (Python re, JS, Java) |
Backreferences (\1) |
NFA engine only |
| Balanced parentheses | Parser — regex cannot do this |
| Nested structures (HTML, XML) | Parser — regex cannot do this |
Equal counts (aⁿbⁿ) |
Parser — regex cannot do this |
| Programming language syntax | Parser — regex cannot do this |
The FSM article covers why FSMs — and therefore regex — cannot handle nested or counting problems. When you hit that boundary, the tool you reach for is a parser.
Technical Interview Context
The formal model of regex engines is relevant in security-focused interviews and in systems discussions about high-throughput input processing.
What is ReDoS and how do you prevent it?
Regex Denial of Service exploits catastrophic backtracking in NFA/backtracking engines. Patterns with nested quantifiers on the same character class (like (a+)+) create exponential backtracking on adversarial input. Prevention: flatten nested quantifiers, switch to a DFA engine (re2, Rust's regex crate), or add input length limits.
Why is re2 safer than Python's re?
re2 uses a DFA engine with guaranteed \(O(n)\) matching regardless of the pattern. Python's re uses backtracking and can exhibit exponential runtime on crafted input, making it exploitable if the pattern accepts user-controlled input.
What's the difference between an NFA and a DFA?
Both are finite automata, but an NFA can be in multiple states simultaneously; a DFA has exactly one state at each step. DFAs are faster to execute but cannot support backreferences. NFA engines support \1-style backreferences; DFA engines structurally cannot.
Can regex match balanced parentheses?
No. Balanced parentheses require counting to arbitrary depth, which requires unbounded memory. Finite automata have no memory beyond their current state — this problem is formally outside the regular language class and requires a parser.
Practice Problems
Practice Problem 1: Engine Selection
You're building an API gateway that validates incoming URLs against a whitelist pattern: ^/api/v\d+/[a-z]+$. The gateway handles 50,000 requests per second. You have two library choices: one using a DFA engine, one using backtracking.
Which do you choose and why?
Answer
DFA engine. The pattern contains no backreferences or complex lookaheads, so both engines produce correct results. But at 50,000 req/s, the guaranteed \(O(n)\) DFA performance matters — and you eliminate the theoretical ReDoS risk if the pattern ever gets extended. The backtracking engine offers no benefit for this problem.
Practice Problem 2: Identify the ReDoS Risk
Which of these patterns is vulnerable to catastrophic backtracking? Why?
a. ^\d+$
b. ^(a|aa)+$
c. ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
Answer
b. ^(a|aa)+$ is vulnerable. At each position the engine can match a or aa — two choices per character. Against aaaa...x (many a's, ending in something that fails $), the engine backtracks through \(2^n\) combinations.
a. ^\d+$ is safe — single quantifier, no choice points.
c. The email regex has potential: the [a-zA-Z0-9.-]+ component repeated over the domain can backtrack on adversarial input without the @. In practice, the literal @ and anchors limit the damage — but this class of pattern is why some systems replace regex email validation with a simple contains-@ check.
Practice Problem 3: Regular or Not?
For each language, decide: regular (expressible by pure regex without backreferences), or not regular?
a. "All strings where the number of 0s is divisible by 3"
b. "All strings of the form ww — a string followed by itself"
c. "All valid IPv4 addresses (four octets 0–255)"
Answer
a. Regular. Divisibility by 3 can be tracked with three states (remainder 0, 1, 2) — this is the exact binary divisibility FSM from the Finite State Machines article. No unbounded memory needed.
b. Not regular. Matching ww requires remembering the exact string w, which could be arbitrarily long. This is the same class of problem as backreferences — no FSM can solve it.
c. Technically regular, but barely. Each octet is bounded (0–255), so you can write a finite regex for it. In practice, the pattern ((?:25[0-5]|2[0-4]\d|[01]?\d\d?) per octet) is complex enough that parsing and range-checking in code is more readable and maintainable.
Key Takeaways
| Concept | What to Remember |
|---|---|
| Thompson's construction | Every regex compiles to an NFA; each operator maps to a small NFA fragment |
| DFA engines | \(O(n)\) always; no backreferences (re2, Rust regex, Go regexp) |
| NFA/backtracking engines | Flexible, supports \1; can be exponential (Python re, JS, Java) |
| Catastrophic backtracking | Nested quantifiers → exponential choice points → ReDoS |
| Backreferences are non-regular | \1 requires unbounded memory; no DFA can support it |
| Lookahead/lookbehind | Zero-width assertions; variable-length lookbehind is engine-specific |
| The hard boundary | Nested structures and counting problems require a parser, not regex |
Further Reading
On This Site
- Regular Expressions — Syntax reference: patterns, quantifiers, groups, flags, and practical patterns
- Finite State Machines — The automaton theory this article builds on
- How Parsers Work — The tool you reach for when regex isn't enough
External
- Regular Expression Matching Can Be Simple And Fast by Russ Cox — the technical argument for DFA engines and why
re2exists - Cloudflare Outage Post-Mortem — the 2019 ReDoS incident, explained in detail by Cloudflare's engineering team
Regular expressions look like a simple text-processing tool. They are — until a pattern interacts with the backtracking engine on adversarial input, or until you try to match a structure that requires counting or nesting. The backtracking model makes them dangerous in one direction; the regular language boundary makes them wrong in another. Knowing which category your problem falls into is what separates engineers who reach for regex reflexively from engineers who reach for it deliberately.