Regular Expressions
You've copied regex from Stack Overflow. You've pasted ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$ into a validator without reading it. You've used IDE find-and-replace in regex mode and wondered why . suddenly matches everything.
This is the reference that makes those patterns readable.
Regular expressions are a pattern language for matching, extracting, and transforming text. Every working engineer uses them; few have a complete mental model. This article closes that gap.
Learning Objectives
By the end of this article, you'll be able to:
- Read and write regex patterns using literals, metacharacters, quantifiers, and groups
- Use capturing groups and backreferences to extract and transform text
- Apply flags to modify matching behavior across Python, JavaScript, Go, and other languages
- Recognize the common pitfalls: catastrophic backtracking, missing anchors, and unescaped special characters
- Know when regex is the right tool and when to reach for a parser instead
Where You've Seen This
Regex appears throughout a working engineer's day:
- `grep` and command-line tools: `grep -E 'ERROR|FATAL' app.log` is regex pattern matching
- IDE find-and-replace: the regex modes in VS Code and IntelliJ use this exact syntax
- Input validation: email, phone, zip code validators in every web form
- Log parsing: extracting timestamps, log levels, and messages from structured log output
- URL routing: Rails, Django, and Express route patterns use regex under the hood
- `sed`, `awk`, `perl -pe`: every time you've transformed text on the command line
- Database queries: `WHERE column REGEXP 'pattern'` in MySQL; `~` in PostgreSQL
Why This Matters for Production Code
Most production incident investigations start with log parsing. A single command such as grep -E '\[ERROR\].*request_id=abc123' app.log can replace dozens of lines of string-splitting and loop logic. Log shipping pipelines like Logstash and Fluentd use regex-based pattern matching to extract structured fields from unstructured log lines.
Every input your API accepts (email addresses, phone numbers, passwords, credit cards) needs format validation. Regex encodes complex rules declaratively: ^(?=.*\d)(?=.*[a-zA-Z]).{8,}$ expresses a full password policy (at least 8 characters, including a digit and a letter) in one pattern.
Use regex for format validation (does this look like an email?), not semantic validation (is this email address real?). Send a confirmation email for that.
ETL pipelines, data migrations, and API normalization involve transforming text. Capturing groups extract structured data from unstructured input: timestamps from log entries, reformatted phone numbers, normalized addresses from multiple source formats.
Patterns with nested quantifiers like (a+)+ can trigger catastrophic backtracking, which means exponential runtime against adversarial input. This has caused real outages at Cloudflare, Stack Overflow, and others.
The practical fix: avoid nested quantifiers. For the theory of why this happens (how regex engines compile to automata and where the explosion comes from), see Regular Expressions: The Formal Model.
What is a Regular Expression?
A regular expression is a pattern that describes a set of strings. Instead of listing every valid string, you describe the rules for what makes a string valid.
| Email Address Pattern |
|---|
| `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$` |
That monstrosity matches email addresses. Let's learn to read (and write) these things.
Basic Building Blocks
Literal Characters
Most characters match themselves:
| Pattern | Matches |
|---|---|
| `cat` | "cat" |
| `hello` | "hello" |
| `123` | "123" |
Metacharacters
Some characters have special meaning:
| Character | Meaning | Example | Matches |
|---|---|---|---|
| `.` | Any single character (except newline by default) | `c.t` | "cat", "cot", "c9t" |
| `^` | Start of string | `^hello` | "hello world" (at start) |
| `$` | End of string | `world$` | "hello world" (at end) |
| `\` | Escape special char | `\.` | literal "." |
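As a quick illustration of these four metacharacters, here is a minimal sketch using Python's standard `re` module. The helper name `matches` and the sample strings are made up for this example.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

# "." stands for any single character (except a newline by default).
dot_hits = [w for w in ["cat", "cot", "c9t", "ct"] if matches(r"^c.t$", w)]

# "^" and "$" pin the match to the start or end of the string.
starts_with_hello = matches(r"^hello", "hello world")
hello_not_at_start = matches(r"^hello", "say hello")
ends_with_world = matches(r"world$", "hello world")

# A backslash makes the dot literal; without it, "." also matches "x".
escaped_dot = matches(r"3\.14", "3x14")
unescaped_dot = matches(r"3.14", "3x14")
```

The last two lines show the trap mentioned in the introduction: an unescaped `.` quietly matches far more than a literal dot.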
Character Classes
Match one character from a set:
| Pattern | Meaning | Matches |
|---|---|---|
| `[aeiou]` | Any vowel | "a", "e", "i", "o", "u" |
| `[0-9]` | Any digit | "0" through "9" |
| `[a-zA-Z]` | Any letter | "a"-"z", "A"-"Z" |
| `[^0-9]` | NOT a digit | anything except "0"-"9" |
Shorthand Classes
Common patterns have shortcuts:
| Shorthand | Equivalent | Meaning |
|---|---|---|
| `\d` | `[0-9]` | Digit |
| `\D` | `[^0-9]` | Not a digit |
| `\w` | `[a-zA-Z0-9_]` | Word character |
| `\W` | `[^a-zA-Z0-9_]` | Not a word character |
| `\s` | `[ \t\n\r]` | Whitespace |
| `\S` | `[^ \t\n\r]` | Not whitespace |
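To see character classes and their shorthands side by side, the following sketch runs a few of them through Python's `re.findall` and `re.split`. The sample strings are invented for illustration.

```python
import re

# Explicit character classes.
vowels = re.findall(r"[aeiou]", "regex")
non_digit_runs = re.findall(r"[^0-9]+", "a1b22c")

# Shorthand classes.
digit_runs = re.findall(r"\d+", "Order 66, aisle 7")
words = re.findall(r"\w+", "user_1 logged-in")   # "-" is not a word character
fields = re.split(r"\s+", "a  b\tc\nd")          # split on any whitespace run

print(vowels, non_digit_runs, digit_runs, words, fields)
```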
Unicode and \w
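By default in many engines, `\w` matches only ASCII (a-z, A-Z, 0-9, _). It won't match accented letters ("é", "ñ"), non-Latin alphabets, or emoji.
- JavaScript: `\w` stays ASCII-only even with the `u` flag; use Unicode property escapes, which require that flag: `/[\p{L}\p{N}_]+/u`
- Python: `re.UNICODE` is the default for `str` patterns in Python 3, so `\w` already matches Unicode letters and digits
- Or be explicit: `[a-zA-ZÀ-ÿ]` for Latin with accents (note that this range also includes the symbols × and ÷)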
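The Python behavior can be checked directly. This sketch compares the default (Unicode-aware) `\w` with the ASCII-only behavior selected by the `re.ASCII` flag; the sample text is an assumption chosen to contain accented letters.

```python
import re

# "café niño", written with escapes so the source encoding cannot interfere.
text = "caf\u00e9 ni\u00f1o"

# Default for str patterns in Python 3: \w is Unicode-aware.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```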
Quantifiers: How Many?
| Quantifier | Meaning | Example | Matches |
|---|---|---|---|
| `*` | Zero or more | `ab*c` | "ac", "abc", "abbc"... |
| `+` | One or more | `ab+c` | "abc", "abbc"... (not "ac") |
| `?` | Zero or one | `colou?r` | "color", "colour" |
| `{n}` | Exactly n | `a{3}` | "aaa" |
| `{n,}` | n or more | `a{2,}` | "aa", "aaa"... |
| `{n,m}` | Between n and m | `a{2,4}` | "aa", "aaa", "aaaa" |
Greedy vs Lazy
By default, quantifiers are greedy, meaning they match as much as possible:
| Greedy Matching | |
|---|---|
Add ? after a quantifier for lazy matching, which matches as little as possible:
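A small Python sketch of the difference, using the same `<a>foo</b>` string that appears in the interview section of this article. The third pattern, a negated character class, is a common alternative added here for comparison rather than something the article prescribes.

```python
import re

html = "<a>foo</b>"

# Greedy: ".+" runs to the end of the string, then backs up to the last ">".
greedy = re.findall(r"<.+>", html)

# Lazy: ".+?" stops at the first ">" it can.
lazy = re.findall(r"<.+?>", html)

# Alternative: say what you mean with a negated class instead of laziness.
negated = re.findall(r"<[^>]+>", html)

print(greedy, lazy, negated)
```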
Grouping and Alternatives
Groups: ( )
Parentheses group patterns together:
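For instance, a quantifier after a group applies to the whole group, and alternation inside a group stays local to it. The sketch below uses Python; the helper `full` and the sample words are illustrative only.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# With a group, "+" repeats the two-character unit "ab".
group_repeats_unit = full(r"(ab)+", "ababab")

# Without a group, "+" applies only to the "b".
no_group_on_ababab = full(r"ab+", "ababab")
no_group_on_abbb = full(r"ab+", "abbb")

# Alternation scoped by a group: "gr", then "a" or "e", then "y".
spellings = [w for w in ["gray", "grey", "groy"] if full(r"gr(a|e)y", w)]
```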
Capturing Groups
Groups also capture what they match for later use:
| Capturing Groups in Python | |
|---|---|
- Two capturing groups: 3 digits, hyphen, 4 digits
- Group 0 is always the entire matched string
- Groups 1, 2, 3... correspond to left parentheses in order
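The bullets above describe a pattern with two capturing groups. A minimal Python sketch of that shape, assuming the pattern `(\d{3})-(\d{4})` and an invented input string:

```python
import re

match = re.search(r"(\d{3})-(\d{4})", "Call 555-1234 today")

whole = match.group(0)       # the entire matched text
first = match.group(1)       # first pair of parentheses
second = match.group(2)      # second pair of parentheses
both = match.groups()        # all captured groups as a tuple

print(whole, first, second, both)
```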
A practical use is reformatting phone numbers:
| Reformatting with Capture Groups | |
|---|---|
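One way to do this in Python is `re.sub` with group references in the replacement string. The input and output formats here are assumptions picked for the example.

```python
import re

def reformat_phone(number: str) -> str:
    """Rewrite 555-123-4567 as (555) 123-4567 using three captured groups."""
    return re.sub(r"(\d{3})-(\d{3})-(\d{4})", r"(\1) \2-\3", number)

formatted = reformat_phone("555-123-4567")
untouched = reformat_phone("not a number")   # no match, so returned unchanged

print(formatted)
```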
Non-Capturing Groups: (?: )
When you need grouping but don't need to capture:
| Non-Capturing Group | |
|---|---|
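In Python the difference is visible through `re.findall`, which returns group contents when the pattern has a capturing group and the whole match when it has none. A short sketch with an invented sample string:

```python
import re

text = "ababab cd ab"

# Capturing: findall reports what the group captured last, not the whole match.
capturing = re.findall(r"(ab)+", text)

# Non-capturing: findall reports the whole match.
non_capturing = re.findall(r"(?:ab)+", text)

# The compiled pattern records how many capturing groups it has.
capturing_count = re.compile(r"(ab)+").groups
non_capturing_count = re.compile(r"(?:ab)+").groups
```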
Backreferences
Match the same text again that was captured by an earlier group:
| Finding Duplicate Words | |
|---|---|
\1 matches exactly the text captured by Group 1. If Group 1 captured "the", then \1 only matches "the" at that position, not any other word.
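A sketch of duplicate-word detection in Python, using the pattern `\b(\w+)\s+\1\b` from the practice problems later in this article. The sample sentences are made up.

```python
import re

DUPLICATE = re.compile(r"\b(\w+)\s+\1\b")

text = "This is is the the test"

# findall returns what Group 1 captured for each duplicate found.
duplicates = DUPLICATE.findall(text)

# Replacing the whole match with Group 1 collapses each duplicate.
cleaned = DUPLICATE.sub(r"\1", text)

# The closing \b stops "this" from pairing with the start of "thistle".
false_positive = DUPLICATE.search("this thistle")
```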
Matching paired HTML tags:
| Matching Paired Tags | |
|---|---|
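A hedged sketch of the idea in Python: the pattern `<(\w+)>(.*?)</\1>` is one possible way to require that the closing tag repeat the opening tag's name. It handles only simple, attribute-free tags and is not a substitute for an HTML parser.

```python
import re

PAIRED = re.compile(r"<(\w+)>(.*?)</\1>")

matched = PAIRED.search("<b>bold</b>")
tag_and_body = matched.groups()

# The closing tag must repeat Group 1, so a mismatched pair is rejected.
mismatched = PAIRED.search("<b>bold</i>")

print(tag_and_body)
```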
Backreferences and the Formal Model
Backreferences are powerful but technically step outside the regular language model. An FSM has no memory beyond its current state, so it can't remember arbitrary captured text. This is why automata-based engines like RE2 don't support \1. For the full explanation, see Regular Expressions: The Formal Model.
Anchors and Boundaries
| Anchor | Meaning |
|---|---|
| `^` | Start of string (or line with multiline flag) |
| `$` | End of string (or line with multiline flag) |
| `\b` | Word boundary |
| `\B` | Not a word boundary |
Word boundaries prevent partial matches:
| Word Boundary Example | |
|---|---|
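For example, searching for the word "cat" with and without boundaries. This is a Python sketch with an invented sample string.

```python
import re

text = "cat concatenate category"

# Without boundaries, "cat" is also found inside longer words.
loose = re.findall(r"cat", text)

# With \b on both sides, only the standalone word matches.
strict = re.findall(r"\bcat\b", text)

# Marking the strict matches shows which occurrence was selected.
marked = re.sub(r"\bcat\b", "[cat]", text)
```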
Lookahead and Lookbehind
These match a position without consuming characters:
| Syntax | Name | Meaning |
|---|---|---|
| `(?=...)` | Positive lookahead | Followed by ... |
| `(?!...)` | Negative lookahead | NOT followed by ... |
| `(?<=...)` | Positive lookbehind | Preceded by ... |
| `(?<!...)` | Negative lookbehind | NOT preceded by ... |
For example, a password must contain a digit and a letter:
| Password Validation with Lookaheads |
|---|
| `^(?=.*\d)(?=.*[a-zA-Z]).{8,}$` |
- `(?=.*\d)`: somewhere ahead, there's a digit
- `(?=.*[a-zA-Z])`: somewhere ahead, there's a letter
- `.{8,}`: at least 8 characters total
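A minimal Python sketch of this password check. The function name `is_valid_password` and the test passwords are hypothetical; `re.fullmatch` stands in for the `^` and `$` anchors.

```python
import re

# Two lookaheads assert "a digit exists" and "a letter exists" without
# consuming input; ".{8,}" then consumes the whole password.
PASSWORD = re.compile(r"(?=.*\d)(?=.*[a-zA-Z]).{8,}")

def is_valid_password(candidate: str) -> bool:
    return PASSWORD.fullmatch(candidate) is not None

results = {
    "abc12345": is_valid_password("abc12345"),   # letters and digits, 8 chars
    "abcdefgh": is_valid_password("abcdefgh"),   # no digit
    "12345678": is_valid_password("12345678"),   # no letter
    "a1": is_valid_password("a1"),               # too short
}
```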
Flags and Modifiers
Flags change how the engine interprets your pattern:
| Flag | Name | What It Does |
|---|---|---|
| `i` | Case insensitive | Match both uppercase and lowercase |
| `g` | Global | Find all matches (not just first) |
| `m` | Multiline | `^` and `$` match line starts/ends |
| `s` | Dotall | `.` matches newlines too |
| Flags in Rust | |
|---|---|
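For comparison, here is how the same flags look in Python's `re` module (Python is used to stay consistent with the other sketches in this article; flag spellings differ between languages). Python has no `g` flag: `re.findall` and `re.finditer` return every match, while `re.search` returns only the first.

```python
import re

# i: case-insensitive matching.
case_insensitive = re.search(r"error", "ERROR: disk full", re.IGNORECASE)

# m: "^" matches at the start of every line, not just the string.
lines = "one\ntwo\nthree"
first_word_only = re.findall(r"^\w+", lines)
every_line_start = re.findall(r"^\w+", lines, re.MULTILINE)

# s: "." also matches newline characters.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)
```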
Practical Patterns
| Email Pattern |
|---|
| `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$` |

| Part | Meaning |
|---|---|
| `[a-zA-Z0-9._%+-]+` | Local part |
| `@` | Literal @ |
| `[a-zA-Z0-9.-]+` | Domain name |
| `\.[a-zA-Z]{2,}` | Dot + TLD (2+ letters) |
Email Validation Reality
This is a simplification. The actual spec (RFC 5322) is absurdly complex. In practice: check for @ and send a confirmation email.
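To make the "simplification" concrete, this Python sketch wraps the email pattern in a hypothetical helper, `looks_like_email`, and shows one malformed address that it still accepts. The addresses are invented for illustration.

```python
import re

EMAIL_SHAPE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def looks_like_email(text: str) -> bool:
    """Format check only; it says nothing about whether the mailbox exists."""
    return EMAIL_SHAPE.fullmatch(text) is not None

plain = looks_like_email("user@example.com")
tagged = looks_like_email("first.last+tag@sub.example.org")
no_at = looks_like_email("no-at-sign.example.com")
no_tld = looks_like_email("user@localhost")

# The domain class allows consecutive dots, so this slips through.
double_dot = looks_like_email("user@example..com")
```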
| US Phone Number | |
|---|---|
Matches 5551234567, 555-123-4567, (555) 123-4567, 555.123.4567.
Mismatched Parentheses
This allows (555-123-4567 (open paren, no close). Stricter version:
| Strict Parentheses | |
|---|---|
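The two patterns below are reconstructions that behave as this section describes; they are assumptions, not necessarily the exact patterns the article originally showed. The loose one makes each parenthesis independently optional, while the strict one uses alternation so parentheses must come as a pair.

```python
import re

# Loose: "(" and ")" are each optional on their own.
LOOSE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

# Strict: either a parenthesized area code or a bare one, never half of each.
STRICT = re.compile(r"(?:\(\d{3}\)\s?|\d{3}[-.\s]?)\d{3}[-.\s]?\d{4}")

samples = ["5551234567", "555-123-4567", "(555) 123-4567",
           "555.123.4567", "(555-123-4567"]

loose_ok = [s for s in samples if LOOSE.fullmatch(s)]
strict_ok = [s for s in samples if STRICT.fullmatch(s)]
```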
| IPv4 Address | |
|---|---|
Not Fully Validated
This matches 999.999.999.999. For true validation, parse the octets and check 0-255 in code.
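A sketch of that split of responsibilities in Python: the regex checks the shape and plain code checks the range. The pattern and the function name `is_ipv4` are assumptions for this example.

```python
import re

# Shape only: four groups of one to three ASCII digits separated by dots.
IPV4_SHAPE = re.compile(r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})")

def is_ipv4(text: str) -> bool:
    match = IPV4_SHAPE.fullmatch(text)
    if match is None:
        return False
    # Range check happens in code, where it is easy to read and test.
    return all(0 <= int(octet) <= 255 for octet in match.groups())

shape_accepts_999 = IPV4_SHAPE.fullmatch("999.999.999.999") is not None
valid = is_ipv4("192.168.1.1")
out_of_range = is_ipv4("999.999.999.999")
too_short = is_ipv4("1.2.3")
```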
| Structured Log Pattern | |
|---|---|
For [2024-03-15 14:30:45] [ERROR] Something went wrong:
- Group 1: 2024-03-15 14:30:45
- Group 2: ERROR
- Group 3: Something went wrong
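One pattern that yields exactly these three groups is sketched below in Python. It is a reconstruction that fits the example line, not necessarily the article's original pattern.

```python
import re

# [timestamp] [LEVEL] message
LOG_LINE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*)$"
)

line = "[2024-03-15 14:30:45] [ERROR] Something went wrong"
match = LOG_LINE.match(line)
timestamp, level, message = match.groups()

# Lines in another format simply do not match.
unparsed = LOG_LINE.match("plain text without brackets")
```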
Regex in Your Language
| Regex in Python | |
|---|---|
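As a brief tour of Python's `re` module, this sketch compiles one pattern and runs it through the most commonly used calls. The pattern and log text are invented for the example.

```python
import re

LOG_LEVEL = re.compile(r"\b(ERROR|WARN|INFO)\b")
line = "INFO start; ERROR disk full; WARN retrying"

first = LOG_LEVEL.search(line)            # first match anywhere in the string
at_start = LOG_LEVEL.match(line)          # match only at position 0
whole = LOG_LEVEL.fullmatch("ERROR")      # the entire string must match
all_levels = LOG_LEVEL.findall(line)      # list of Group 1 strings
positions = [(m.group(1), m.start()) for m in LOG_LEVEL.finditer(line)]
masked = LOG_LEVEL.sub("<level>", line)   # replace every match

print(all_levels)
print(masked)
```

Compiling once with `re.compile` and reusing the pattern object keeps the pattern in one place when it is used in several calls.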
| Regex in Go | |
|---|---|
Common Pitfalls
Catastrophic Backtracking
Patterns with nested quantifiers can cause exponential runtime against crafted input:
| Dangerous Pattern |
|---|
| `(a+)+` |
This has caused real outages (Cloudflare 2019, Stack Overflow). The fix: flatten to a single quantifier.
| Safe Rewrite |
|---|
| `a+` |
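The growth can be observed on deliberately small inputs. This Python sketch times `(a+)+$` against the flattened `a+$` on strings of "a" followed by a "b" that forces the match to fail. The input sizes are kept small on purpose, and the timings it prints will vary by machine.

```python
import re
import time

DANGEROUS = re.compile(r"(a+)+$")   # nested quantifiers
SAFE = re.compile(r"a+$")           # same language, single quantifier

def time_match(pattern, text):
    """Return (matched, seconds) for one anchored match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start

for n in (10, 14, 18):
    attack = "a" * n + "b"          # the trailing "b" guarantees failure
    _, slow = time_match(DANGEROUS, attack)
    _, fast = time_match(SAFE, attack)
    print(f"n={n}: nested {slow:.5f}s, flat {fast:.5f}s")
```

In a backtracking engine such as Python's `re`, each extra "a" roughly doubles the work for the nested pattern, which is why inputs much longer than these should not be tried casually.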
For the theory of why this happens, see Regular Expressions: The Formal Model.
Forgetting Anchors
\d{3}-\d{4} matches "555-1234" anywhere in a string, including inside longer text. Add anchors when you need a full-string match:
| With Anchors |
|---|
| `^\d{3}-\d{4}$` |
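In Python the difference looks like this. The sketch also shows a detail specific to Python's `$`, which matches just before a trailing newline; `re.fullmatch` is the stricter choice for validation. The input strings are invented.

```python
import re

PHONE = r"\d{3}-\d{4}"

# Unanchored search finds the pattern inside a longer run of digits.
unanchored = re.search(PHONE, "id 9555-12345")

# Anchored: the same input is rejected.
anchored = re.search(r"^\d{3}-\d{4}$", "id 9555-12345")

# "$" still accepts a single trailing newline; fullmatch does not.
with_newline = re.search(r"^\d{3}-\d{4}$", "555-1234\n")
strict = re.fullmatch(PHONE, "555-1234\n")
```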
Escaping Special Characters
To match literal special characters, escape them with \:
| Escaping | |
|---|---|
Inside a character class, most special characters are literal: [.*+?] matches ., *, +, or ?.
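A short Python sketch of both points. `re.escape` is the standard-library helper for escaping arbitrary literal text; the sample strings are invented.

```python
import re

text = "so 1+1=2 holds"

# Unescaped, "1+" means "one or more 1s", so the literal text is not found.
unescaped = re.search(r"1+1=2", text)

# Escaping by hand or with re.escape matches the literal "+".
hand_escaped = re.search(r"1\+1=2", text)
auto_escaped = re.search(re.escape("1+1=2"), text)

# Inside a character class these characters need no escaping.
in_class = re.findall(r"[.*+?]", "a.b*c+d?")
```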
Technical Interview Context
Regex problems appear in interviews either as direct tasks ("write a pattern to extract all URLs") or as discussion topics in code review and security scenarios.
Write a regex to validate a phone number / extract all URLs from this text
You'll be expected to know \d, \w, \s, quantifiers (+, *, ?, {n,m}), anchors (^, $), and groups (). The key technique is to read a pattern aloud: "one or more digits, then a hyphen, then more digits" maps directly to \d+-\d+.
What's the difference between greedy and lazy matching?
Greedy (.*) matches as much as possible; lazy (.*?) matches as little as possible. On <a>foo</b>, <.+> matches the whole string; <.+?> matches just <a> and </b> separately.
Why shouldn't you parse HTML with regex?
HTML allows arbitrary nesting, which requires counting and memory beyond what finite automata provide. Any regex that looks like it handles nesting will fail on edge cases such as malformed input, deeply nested tags, or attributes with angle brackets. Use a proper HTML parser.
What is catastrophic backtracking / ReDoS?
Certain patterns like (a+)+ create exponential backtracking on adversarial input: 30 characters can trigger billions of match attempts. Nested quantifiers on overlapping character classes are the warning sign.
Practice Problems
Practice Problem 1: URL Validation
Write a regex that matches HTTP/HTTPS URLs:
- https://example.com
- http://sub.domain.org/path
- https://site.io/page?id=123
Solution
Pattern: ^https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(/[^\s]*)?$
- `^https?`: "http" or "https" (the s is optional)
- `://`: literal
- `[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`: domain and TLD
- `(/[^\s]*)?`: optional path (any non-whitespace after /)
- `$`: end of string
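The solution can be checked against the three example URLs with a few lines of Python. The rejected inputs are extra cases added here for illustration.

```python
import re

URL = re.compile(r"^https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(/[^\s]*)?$")

should_match = [
    "https://example.com",
    "http://sub.domain.org/path",
    "https://site.io/page?id=123",
]
should_fail = [
    "ftp://example.com",     # wrong scheme
    "https://example",       # no dot followed by a TLD
    "https://exa mple.com",  # whitespace in the host
]

accepted = [u for u in should_match + should_fail if URL.search(u)]
```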
Practice Problem 2: Date Validation
Match dates in YYYY-MM-DD format where month is 01-12 and day is 01-31.
Solution
Pattern: ^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$
- `\d{4}`: four-digit year
- `(0[1-9]|1[0-2])`: month 01-09 or 10-12
- `(0[1-9]|[12]\d|3[01])`: day 01-09, 10-29, or 30-31
Note: This rejects month 13 but accepts February 31. For calendar-valid dates, parse the numbers and validate in code.
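A sketch of that two-step check in Python: the regex confirms the shape, then `datetime.date` confirms the date exists on the calendar. The function name `is_real_date` is hypothetical.

```python
import re
from datetime import date

DATE_SHAPE = re.compile(
    r"(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])", re.ASCII
)

def is_real_date(text: str) -> bool:
    match = DATE_SHAPE.fullmatch(text)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)   # raises ValueError for impossible dates
    except ValueError:
        return False
    return True

shape_accepts_feb_31 = DATE_SHAPE.fullmatch("2024-02-31") is not None
shape_rejects_month_13 = DATE_SHAPE.fullmatch("2024-13-01") is None
leap_day = is_real_date("2024-02-29")
non_leap_day = is_real_date("2023-02-29")
feb_31 = is_real_date("2024-02-31")
```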
Practice Problem 3: Find Duplicate Words
Write a regex that finds repeated consecutive words like "the the" or "is is".
Solution
Pattern: \b(\w+)\s+\1\b
- `\b`: word boundary (prevents matching partial words)
- `(\w+)`: capture one or more word characters as Group 1
- `\s+`: one or more whitespace characters
- `\1`: backreference; must match the exact text captured by Group 1
- `\b`: closing word boundary
Note: This uses a backreference, which technically steps outside the regular language model, because FSMs can't remember arbitrary captured text. See Regular Expressions: The Formal Model for why.
Key Takeaways
| Concept | Syntax | Example |
|---|---|---|
| Any character | `.` | `a.c` matches "abc", "a1c" |
| Character class | `[...]` | `[aeiou]` matches vowels |
| Negated class | `[^...]` | `[^0-9]` matches non-digits |
| Zero or more | `*` | `a*` matches "", "a", "aaa" |
| One or more | `+` | `a+` matches "a", "aaa" |
| Optional | `?` | `colou?r` matches both spellings |
| Alternation | `\|` | `cat\|dog` matches either |
| Capture group | `(...)` | `(\d{3})` captures three digits |
| Non-capturing | `(?:...)` | `(?:ab)+` groups without capturing |
| Backreference | `\1` | `(\w+)\s+\1` finds repeated words |
| Word boundary | `\b` | `\bword\b` matches whole word |
| Lookahead | `(?=...)` | `(?=.*\d)` requires digit ahead |
Further Reading
On This Site
- Regular Expressions: The Formal Model (how regex engines compile patterns to automata, why backtracking causes ReDoS, and what regex fundamentally cannot match)
- Finite State Machines (the automaton theory underlying regex engines)
- How Parsers Work (when regex isn't powerful enough)
External
- Regex101 (interactive regex tester with explanation, NFA visualizer, and step debugger)
- RFC 5322 (the full email address specification, a useful reminder of when not to use regex)
Regular expressions look like line noise until suddenly they don't, and then you'll reach for them constantly. Build patterns piece by piece, test at Regex101, and resist the urge to write everything in one inscrutable expression. Understanding the syntax is the first step; understanding the engine is what separates regex that works from regex that causes incidents.