Regular Expressions

You've seen Finite State Machines—elegant diagrams showing states and transitions. You've seen BNF—formal grammar notation. But when you need to actually use pattern matching in your code, you reach for regular expressions (regex).

Regex is the practical face of formal language theory. It's the same mathematical power as FSMs, wrapped in a terse syntax that fits in a single line. Love them or hate them (often both), regular expressions are an essential tool.

Why Regular Expressions Matter

Regular expressions are embedded in virtually every modern programming language and tool. They're the Swiss Army knife of text processing:

Validation: Email addresses, phone numbers, passwords, credit cards
Search and Replace: Find complex patterns across codebases (grep, IDE search)
Parsing: Extract data from logs, CSV files, API responses
Lexical Analysis: The first stage of compilation—breaking source code into tokens
Data Cleaning: Normalize messy input, strip unwanted characters
URL Routing: Web frameworks use regex to match request paths

The return on investment is high: an hour spent on the fundamentals pays off for years, and a task that would take dozens of lines of manual string manipulation often collapses into a single pattern.

They're not just for programmers. Journalists use regex to analyze leaked documents. Scientists extract data from research papers. Anyone working with text at scale needs this tool.

What is a Regular Expression?

A regular expression is a pattern that describes a set of strings. Instead of listing every valid string, you describe the rules for what makes a string valid.

Email Address Pattern
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

That monstrosity? It matches email addresses. 😅 Let's learn to read (and write) these things.

Basic Building Blocks

Literal Characters

Most characters match themselves:

Pattern	Matches
`cat`	"cat"
`hello`	"hello"
`123`	"123"

Metacharacters

Some characters have special meaning:

Character	Meaning	Example	Matches
`.`	Any single character (except a newline, by default)	`c.t`	"cat", "cot", "c9t"
`^`	Start of string	`^hello`	"hello world" (at start)
`$`	End of string	`world$`	"hello world" (at end)
`\`	Escape special char	`\.`	literal "."

Character Classes

Match one character from a set:

Pattern	Meaning	Matches
`[aeiou]`	Any vowel	"a", "e", "i", "o", "u"
`[0-9]`	Any digit	"0" through "9"
`[a-zA-Z]`	Any letter	"a"-"z", "A"-"Z"
`[^0-9]`	NOT a digit	anything except "0"-"9"

Shorthand Classes

Common patterns have shortcuts:

Shorthand	Equivalent	Meaning
`\d`	`[0-9]`	Digit
`\D`	`[^0-9]`	Not a digit
`\w`	`[a-zA-Z0-9_]`	Word character
`\W`	`[^a-zA-Z0-9_]`	Not a word character
`\s`	`[ \t\n\r\f\v]`	Whitespace
`\S`	`[^ \t\n\r\f\v]`	Not whitespace

Unicode Characters

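In many regex engines (JavaScript, Java, and Go among them), \w matches only ASCII characters (a-z, A-Z, 0-9, _) by default.

In those engines, \w WON'T match:

Accented letters: "é", "ñ", "ü"
Non-Latin alphabets: "π", "こ", "א"
Emoji: "😀" (not a word character in any mode)

Solutions:

JavaScript: The u flag alone does not change \w; combine it with a Unicode property escape instead: /[\p{L}\p{N}_]+/u
Python: \w is already Unicode-aware for str patterns in Python 3, so re.search(r'\w+', text) needs no extra flag; pass re.ASCII to restrict it to ASCII
Or be explicit: Use [a-zA-ZÀ-ÿ] for Latin with accents (note that this range also includes × and ÷), or custom character classes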

For truly international text, consider using Unicode categories like \p{L} (any letter) if your engine supports them.
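To see the difference in Python 3, the same pattern can be run with and without re.ASCII. This is a minimal sketch; the sample string is made up for illustration, and the accented letters are written as escapes so the code points are unambiguous.

```python
import re

# Illustrative sample text: "café naïve" with precomposed accented letters
text = "caf\u00e9 na\u00efve"

# Python 3 str patterns are Unicode-aware by default
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [a-zA-Z0-9_]
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)  # ['café', 'naïve']
print(ascii_words)    # ['caf', 'na', 've']
```

In ASCII mode the accented letters act as separators, which is why the words are split.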

Concept Check 1: Three-Letter Words

Write a regex pattern that matches three letters starting with a vowel.

Solution

Pattern: [aeiouAEIOU][a-zA-Z][a-zA-Z]

[aeiouAEIOU] - starts with a vowel
[a-zA-Z] - second letter
[a-zA-Z] - third letter

Note: This matches the pattern anywhere in text. Later we'll learn about word boundaries (\b) to match complete words only.

Concept Check 2: Postal Codes

Write a regex pattern that matches a Canadian postal code format (like "K1A 0B1" - letter, digit, letter, space, digit, letter, digit).

Solution

Pattern: [A-Z]\d[A-Z] \d[A-Z]\d

[A-Z] - uppercase letter
\d - digit
[A-Z] - uppercase letter
(space) - literal space character
\d - digit
[A-Z] - uppercase letter
\d - digit

Quantifiers: How Many?

Quantifiers specify repetition:

Quantifier	Meaning	Example	Matches
`*`	Zero or more	`ab*c`	"ac", "abc", "abbc", "abbbc"...
`+`	One or more	`ab+c`	"abc", "abbc", "abbbc"... (not "ac")
`?`	Zero or one	`colou?r`	"color", "colour"
`{n}`	Exactly n	`a{3}`	"aaa"
`{n,}`	n or more	`a{2,}`	"aa", "aaa", "aaaa"...
`{n,m}`	Between n and m	`a{2,4}`	"aa", "aaa", "aaaa"
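The rows of the table can be checked directly in Python. The sketch below uses re.fullmatch so that each candidate must match in its entirety; the helper name accepts is hypothetical, chosen only for this example.

```python
import re

def accepts(pattern, candidates):
    """Hypothetical helper: return the candidates the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

samples = ["ac", "abc", "abbc"]

star = accepts(r"ab*c", samples)
plus = accepts(r"ab+c", samples)
optional = accepts(r"colou?r", ["color", "colour", "colouur"])
bounded = accepts(r"a{2,4}", ["a", "aa", "aaaa", "aaaaa"])

print(star)      # ['ac', 'abc', 'abbc']
print(plus)      # ['abc', 'abbc']
print(optional)  # ['color', 'colour']
print(bounded)   # ['aa', 'aaaa']
```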

Greedy vs Lazy

By default, quantifiers are greedy—they match as much as possible.

Greedy Matching Example
Pattern: <.*>
String:  <div>hello</div>
Match:   <div>hello</div> (the whole thing!)

Add ? for lazy matching—match as little as possible:

Lazy Matching Example
Pattern: <.*?>
String:  <div>hello</div>
Match:   <div> (just the first tag)
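The same comparison in Python, as a small sketch using the string from the examples above:

```python
import re

html = "<div>hello</div>"

# Greedy: .* runs to the end, then backs off to the last ">"
greedy = re.search(r"<.*>", html).group()

# Lazy: .*? stops at the first ">" that lets the match succeed
lazy = re.search(r"<.*?>", html).group()

# With the lazy form, findall picks up each tag separately
all_tags = re.findall(r"<.*?>", html)

print(greedy)    # <div>hello</div>
print(lazy)      # <div>
print(all_tags)  # ['<div>', '</div>']
```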

Concept Check 3: Password Length

Write a regex pattern for a password that's between 8 and 16 characters long (no fewer, no more).

Solution

Pattern: ^.{8,16}$

^ - start of string
. - any character
{8,16} - between 8 and 16 times
$ - end of string

Concept Check 4: Hashtags

Write a regex pattern that matches a hashtag (starts with #, followed by one or more word characters).

Solution

Pattern: #\w+

# - literal hashtag
\w+ - one or more word characters

Concept Check 5: Multiple Spaces

Write a regex pattern that matches multiple spaces (two or more consecutive spaces).

Solution

Pattern: ` {2,}` or `  +` (each pattern begins with a literal space)

` {2,}` - space character, 2 or more times
`  +` - one space followed by one or more spaces

Grouping and Alternatives

Groups: `( )`

Parentheses group patterns together:

Grouping Examples
(ab)+       # One or more "ab": "ab", "abab", "ababab"
(cat|dog)   # "cat" or "dog"

Capturing Groups

Groups also capture what they match for later use. Think of parentheses as creating a "memory slot" that saves whatever matched inside them.

Why capture? You often want to extract specific parts of a match, not just verify a pattern exists.

For example, if you're matching a phone number like 555-1234, you might want to:

Extract just the first three digits (555)
Extract just the last four digits (1234)
Rearrange the parts into a different format

How it works:

Capturing Groups in Python
import re

match = re.search(r'(\d{3})-(\d{4})', 'Call 555-1234')  # (1)!
print(match.group(0))  # "555-1234" (entire match)  # (2)!
print(match.group(1))  # "555" (first capture group)  # (3)!
print(match.group(2))  # "1234" (second capture group)  # (4)!

Search for pattern with two capturing groups: 3 digits, hyphen, 4 digits
Group 0 is always the entire matched string
Group 1 captures the first parenthesized sub-pattern (3 digits)
Group 2 captures the second parenthesized sub-pattern (4 digits)

Breaking it down:

Part	What it does
`(\d{3})`	Group 1: Captures first 3 digits
`-`	Matches literal hyphen (not captured)
`(\d{4})`	Group 2: Captures last 4 digits

The parentheses create numbered groups (1, 2, 3...). Group 0 is always the entire match.

Another example - Parsing dates:

Extracting Date Components
import re

text = "Born on 1995-08-23"
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)  # (1)!

year = match.group(1)   # "1995"  # (2)!
month = match.group(2)  # "08"
day = match.group(3)    # "23"

print(f"Year: {year}, Month: {month}, Day: {day}")
# Output: Year: 1995, Month: 08, Day: 23

Pattern with three groups: 4 digits (year), 2 digits (month), 2 digits (day)
Extract each component by accessing numbered capture groups

Practical use - Reformatting:

Reformatting Phone Numbers
import re

phone = "555-123-4567"
# Capture three groups
match = re.search(r'(\d{3})-(\d{3})-(\d{4})', phone)  # (1)!

# Rearrange into different format
formatted = f"({match.group(1)}) {match.group(2)}-{match.group(3)}"  # (2)!
print(formatted)  # Output: (555) 123-4567

Capture three groups: area code, exchange, and line number
Reconstruct the phone number in a different format using the captured groups

Key insight: Parentheses do two things:

Group patterns together for quantifiers (like (ab)+)
Capture the matched text for later use

When you need grouping but don't want to capture, use non-capturing groups (?:...) (explained next).

Non-Capturing Groups: `(?: )`

When you need grouping but don't need to capture:

Non-Capturing Group Example
(?:ab)+   # Groups without capturing
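The effect on numbering is easiest to see by comparing groups() for the two forms. This is a small sketch; the input string is made up for illustration.

```python
import re

text = "ababab7"

# Capturing: the repeated (ab) takes slot 1, so the digit lands in slot 2
capturing = re.search(r"(ab)+(\d)", text)

# Non-capturing: (?:ab) groups for the + but reserves no slot
non_capturing = re.search(r"(?:ab)+(\d)", text)

print(capturing.groups())      # ('ab', '7')
print(non_capturing.groups())  # ('7',)
```

With the non-capturing form, the digit is group 1, so later code does not need to skip over a group it never uses.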

Backreferences

A backreference lets you match the same text again that was captured by an earlier group. Instead of matching a pattern, you're matching the exact string that was already captured.

Why use backreferences? To find repeated or matching patterns where you don't know in advance what the text will be, only that it should be the same.

Basic example - Finding duplicate words:

Backreference Example
(\w+)\s+\1   # Matches repeated words: "the the", "is is"

How it works:

(\w+) - Captures one or more word characters (this is Group 1)
\s+ - Matches one or more whitespace characters
\1 - Matches exactly the same text that Group 1 captured

So if Group 1 captured "the", then \1 will only match "the" (not "a" or "dog" or anything else).

Step-by-step with "the the":

Step	Pattern Part	Matches	Group 1 Contains
1	`(\w+)`	"the"	"the"
2	`\s+`	" " (space)	"the"
3	`\1`	"the" (must match exactly!)	"the"

If the text were "the dog", it wouldn't match because \1 looks for "the" again (not "dog").
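A short Python sketch of the same pattern, first finding a doubled word and then collapsing it with re.sub. The sentence is made up for illustration; in a replacement string, \1 inserts whatever Group 1 captured.

```python
import re

sentence = "It was the the best"
doubled = re.compile(r"(\w+)\s+\1")

found = doubled.search(sentence)
print(found.group(0))  # the the
print(found.group(1))  # the

# Replace each doubled word with a single copy
cleaned = doubled.sub(r"\1", sentence)
print(cleaned)  # It was the best

# No repetition, so no match
print(doubled.search("the dog"))  # None
```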

Finding matching HTML tags:

Matching Opening and Closing Tags
<(\w+)>.*?</\1>

This matches paired HTML tags like <div>content</div> or <span>text</span>:

<(\w+)> - Captures the opening tag name (Group 1)
.*? - Matches any content (lazy)
</\1> - Matches closing tag with the same name as Group 1

Breaking it down:

For the string <div>hello</div>:

Part	Matches	Group 1
`<(\w+)>`	`<div>`	"div"
`.*?`	"hello"	"div"
`</\1>`	`</div>`	"div"

But <div>hello</span> wouldn't match because \1 is "div", not "span".

Multiple backreferences:

Multiple Backreferences Example
(\w+) and \1, (\w+) and \2

This matches patterns like "cats and cats, dogs and dogs":

(\w+) - Group 1 captures first word
and \1 - Matches " and " followed by same word as Group 1
, (\w+) - Group 2 captures second word
and \2 - Matches " and " followed by same word as Group 2

Backreference numbers:

\1 refers to Group 1
\2 refers to Group 2
\3 refers to Group 3
And so on...

Important: Backreferences match the captured text, not the pattern. If (\d+) captures "42", then \1 will only match "42" exactly, not any other number.

Concept Check 6: Color Spelling

Write a regex pattern that matches either "color" or "colour" using alternation.

Solution

Pattern: colou?r (simpler) or col(o|ou)r (using alternation)

Both work, but colou?r is more concise

Concept Check 7: HTML Tag Capture

Write a regex pattern that matches HTML tags like <div>, <span>, <p> and captures the tag name.

Solution

Pattern: <(\w+)>

< - literal less-than
(\w+) - capture group for tag name (one or more word chars)
> - literal greater-than

Concept Check 8: Doubled Words

Write a regex pattern that finds doubled words like "the the" or "is is".

Solution

Pattern: (\w+)\s+\1

(\w+) - capture one or more word characters
\s+ - one or more whitespace
\1 - backreference to first captured group (must match the same text)

Note: This pattern works but may match partial words. Later we'll learn about word boundaries (\b) to match complete words only.

Anchors and Boundaries

Anchor	Meaning
`^`	Start of string (or line with multiline flag)
`$`	End of string (or line with multiline flag)
`\b`	Word boundary
`\B`	Not a word boundary

Word boundaries are incredibly useful:

Word Boundary Example
Pattern: \bcat\b
Matches: "the cat sat" ✓
Doesn't match: "category" ✗, "bobcat" ✗
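The difference shows up clearly when counting matches with and without the boundaries. A minimal sketch, with a made-up sentence:

```python
import re

text = "the cat sat near a bobcat in the category aisle"

# Without boundaries: also hits "cat" inside "bobcat" and "category"
loose = re.findall(r"cat", text)

# With boundaries: only the standalone word
strict = re.findall(r"\bcat\b", text)

print(len(loose))  # 3
print(strict)      # ['cat']
```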

Concept Check 9: TODO Lines

Write a regex pattern that matches lines that start with "TODO:".

Solution

Pattern: ^TODO:

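^ - start of string (or start of each line when the multiline flag is set, which you need when scanning a multi-line text)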
TODO: - literal text

Concept Check 10: File Extensions

Write a regex pattern that matches files that end with .md or .txt.

Solution

Pattern: \.(md|txt)$

\. - literal dot (escaped)
(md|txt) - either "md" or "txt"
$ - end of string

Concept Check 11: Word Boundaries

Write a regex pattern that matches the word "run" as a standalone word (not in "running" or "runner").

Solution

Pattern: \brun\b

\b - word boundary
run - literal text
\b - word boundary
This won't match "running" or "runner" because of the boundaries

Lookahead and Lookbehind

These match a position without consuming characters:

Syntax	Name	Meaning
`(?=...)`	Positive lookahead	Followed by ...
`(?!...)`	Negative lookahead	NOT followed by ...
`(?<=...)`	Positive lookbehind	Preceded by ...
`(?<!...)`	Negative lookbehind	NOT preceded by ...

Example: Password Validation

Password must have a digit and a letter, and be at least 8 characters long:

Password Validation Pattern
^(?=.*\d)(?=.*[a-zA-Z]).{8,}$

Breaking it down:

^ — start
(?=.*\d) — somewhere ahead, there's a digit
(?=.*[a-zA-Z]) — somewhere ahead, there's a letter
.{8,} — at least 8 characters total
$ — end
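The breakdown above can be exercised in Python as follows. This is a sketch only; the helper name is_acceptable is hypothetical, and the sample passwords are made up.

```python
import re

# The pattern from the breakdown: a digit, a letter, and 8+ characters
PASSWORD = re.compile(r"^(?=.*\d)(?=.*[a-zA-Z]).{8,}$")

def is_acceptable(password):
    """Hypothetical helper: True if the password satisfies all three rules."""
    return PASSWORD.search(password) is not None

print(is_acceptable("abc12345"))  # True
print(is_acceptable("abcdefgh"))  # False (no digit)
print(is_acceptable("12345678"))  # False (no letter)
print(is_acceptable("a1"))        # False (too short)
```

Each lookahead checks its condition from the start of the string without consuming anything, so the final .{8,} still sees the whole password.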

Concept Check 12: Contains Uppercase

Write a regex pattern that matches a string that contains at least one uppercase letter (anywhere).

Solution

Pattern: ^(?=.*[A-Z]).+$

^ - start
(?=.*[A-Z]) - positive lookahead: somewhere there's an uppercase
.+ - one or more characters
$ - end

Concept Check 13: Password Validation

Write a regex pattern for a password with at least one digit AND at least one special character (!@#$%).

Solution

Pattern: ^(?=.*\d)(?=.*[!@#$%]).{8,}$

^ - start
(?=.*\d) - lookahead: contains a digit
(?=.*[!@#$%]) - lookahead: contains a special character
.{8,} - at least 8 characters
$ - end

Concept Check 14: Lookahead Without Capture

Write a regex pattern that matches a dollar amount that's followed by "USD" (but don't capture "USD").

Solution

Pattern: \$\d+(?:\.\d{2})?(?= USD)

\$ - literal dollar sign
\d+ - one or more digits
(?:\.\d{2})? - optional decimal point and 2 digits
(?= USD) - positive lookahead: followed by " USD" (not captured)
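A quick Python check of this pattern, as a sketch with a made-up sentence. The lookahead requires " USD" to follow but leaves it out of the match.

```python
import re

text = "Pay $19.99 USD or $5 CAD, then $7 USD"

amounts = re.findall(r"\$\d+(?:\.\d{2})?(?= USD)", text)
print(amounts)  # ['$19.99', '$7']
```

The "$5" is skipped because it is followed by " CAD", and none of the results include the currency code.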

Flags and Modifiers

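Flags (also called modifiers) change how the regex engine interprets your pattern. In languages with regex literals, such as JavaScript, they follow the closing delimiter; elsewhere they are passed as arguments or written inline at the start of the pattern, as in (?i).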

Common Flags

Flag	Name	What It Does
`i`	Case insensitive	Match both uppercase and lowercase
`g`	Global	Find all matches (not just first)
`m`	Multiline	`^` and `$` match line starts/ends, not just string
`s`	Dotall	`.` matches newlines too

Flag Syntax by Language


Flags in Python
import re

re.search(r'hello', text, re.I)              # Case insensitive (re.IGNORECASE)
re.findall(r'\d+', text)                     # findall is inherently global
re.search(r'^line', text, re.M)              # Multiline (re.MULTILINE)
re.search(r'.', text, re.S)                  # Dotall (re.DOTALL)
re.search(r'hello', text, re.I | re.M)       # Multiple flags with |

Flags in JavaScript
// Syntax: /pattern/flags

/hello/i          // Case insensitive
/\d+/g            // Global - find all numbers
/^line/m          // Multiline - ^ matches line starts
/./s              // Dotall - . matches newlines
/hello/gi         // Multiple flags: global + case insensitive

Flags in Go
import "regexp"

// Go regex is always case-sensitive by default
regexp.MatchString(`(?i)hello`, text)        // Case insensitive (inline flag)
regexp.MustCompile(`\d+`).FindAllString(text, -1)  // Find all (FindAllString is a method on *Regexp; -1 means no limit)
regexp.MatchString(`(?m)^line`, text)        // Multiline (inline flag)
regexp.MatchString(`(?s).`, text)            // Dotall (inline flag)
regexp.MatchString(`(?im)hello`, text)       // Multiple flags (inline)

Flags in Rust
use regex::Regex;

// Case insensitive - use (?i) inline flag
let re = Regex::new(r"(?i)hello").unwrap();
re.is_match(text);

// Find all matches
let re = Regex::new(r"\d+").unwrap();
let matches: Vec<_> = re.find_iter(text).collect();

// Multiline - use (?m) inline flag
let re = Regex::new(r"(?m)^line").unwrap();

// Dotall - use (?s) inline flag
let re = Regex::new(r"(?s).").unwrap();

// Multiple flags
let re = Regex::new(r"(?im)hello").unwrap();

Flags in Java
import java.util.regex.*;

// Case insensitive
Pattern.compile("hello", Pattern.CASE_INSENSITIVE);
Pattern.compile("(?i)hello");  // Inline flag alternative

// Find all matches (use Matcher.find() in loop)
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(text);
while (m.find()) { /* ... */ }

// Multiline
Pattern.compile("^line", Pattern.MULTILINE);

// Dotall
Pattern.compile(".", Pattern.DOTALL);

// Multiple flags
Pattern.compile("hello", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

Flags in C++
#include <regex>

// Case insensitive
std::regex re("hello", std::regex::icase);

// Find all matches (use std::sregex_iterator)
std::regex re("\\d+");
auto matches = std::sregex_iterator(text.begin(), text.end(), re);

// Multiline (ECMAScript grammar only; std::regex::multiline requires C++17)
std::regex re("^line", std::regex::multiline);

// Dotall - not directly supported in C++11, use [\s\S] instead
std::regex re("[\\s\\S]");  // Workaround for dotall

// Multiple flags
std::regex re("hello", std::regex::icase | std::regex::multiline);

Example: Case Insensitive Matching

Without flag:

Case Sensitive (Default)
/cat/   # Matches: "cat"
        # Doesn't match: "Cat", "CAT"

With i flag:

Case Insensitive
/cat/i  # Matches: "cat", "Cat", "CAT", "CaT"

Example: Global Flag

The global flag controls whether to find just the first match or all matches:


Find All vs Find First in Python
import re

text = "2024-12-15"

# Find first match only
first = re.search(r'\d+', text)
print(first.group())  # "2024"

# Find all matches
all_matches = re.findall(r'\d+', text)
print(all_matches)  # ["2024", "12", "15"]

Global vs Non-Global in JavaScript
const text = "2024-12-15";

// Without g - finds first match only
text.match(/\d+/)     // ["2024"]

// With g - finds all matches
text.match(/\d+/g)    // ["2024", "12", "15"]

Find All vs Find First in Go
package main

import (
    "fmt"
    "regexp"
)

func main() {
    text := "2024-12-15"
    re := regexp.MustCompile(`\d+`)  // (1)!

    // Find first match only
    first := re.FindString(text)  // (2)!
    fmt.Println(first)  // "2024"

    // Find all matches
    all := re.FindAllString(text, -1)  // (3)!
    fmt.Println(all)  // [2024 12 15]
}

MustCompile panics on invalid regex (use Compile for error handling)
FindString returns first match as string (empty string if no match)
Second parameter -1 means find all matches (positive number limits results)

Find All vs Find First in Rust
use regex::Regex;

fn main() {
    let text = "2024-12-15";
    let re = Regex::new(r"\d+").unwrap();  // (1)!

    // Find first match only
    if let Some(first) = re.find(text) {  // (2)!
        println!("{}", first.as_str());  // "2024"
    }

    // Find all matches
    let all: Vec<&str> = re.find_iter(text)  // (3)!
        .map(|m| m.as_str())  // (4)!
        .collect();
    println!("{:?}", all);  // ["2024", "12", "15"]
}

unwrap() panics on invalid regex (prefer ? in real code)
find() returns Option<Match> - use if let to handle
find_iter() returns iterator over all matches (lazy evaluation)
map() extracts string slice from each Match object

Find All vs Find First in Java
import java.util.regex.*;
import java.util.*;

public class GlobalFlag {
    public static void main(String[] args) {
        String text = "2024-12-15";
        Pattern pattern = Pattern.compile("\\d+");  // (1)!
        Matcher matcher = pattern.matcher(text);  // (2)!

        // Find first match only
        if (matcher.find()) {  // (3)!
            System.out.println(matcher.group());  // "2024"
        }

        // Find all matches
        matcher.reset();  // (4)!
        List<String> all = new ArrayList<>();
        while (matcher.find()) {  // (5)!
            all.add(matcher.group());
        }
        System.out.println(all);  // [2024, 12, 15]
    }
}

Compile pattern once for reuse (throws PatternSyntaxException on invalid regex)
Create Matcher object that performs operations on the input text
find() advances to next match each call (stateful operation)
reset() returns matcher to start of string for re-scanning
Loop repeatedly calling find() to get all matches (Java's "global" approach)

Find All vs Find First in C++
#include <iostream>
#include <regex>
#include <string>
#include <vector>

int main() {
    std::string text = "2024-12-15";
    std::regex re(R"(\d+)");  // (1)!

    // Find first match only
    std::smatch match;  // (2)!
    if (std::regex_search(text, match, re)) {  // (3)!
        std::cout << match[0] << std::endl;  // "2024"
    }

    // Find all matches
    auto begin = std::sregex_iterator(text.begin(), text.end(), re);  // (4)!
    auto end = std::sregex_iterator();  // (5)!
    std::vector<std::string> all;
    for (auto i = begin; i != end; ++i) {  // (6)!
        all.push_back(i->str());
    }
    // Prints: 2024 12 15
    for (const auto& s : all) {
        std::cout << s << " ";
    }
    std::cout << std::endl;
    return 0;
}

Raw string literal R"(...)" avoids escaping backslashes
std::smatch stores match results for strings (use std::cmatch for C-strings)
regex_search finds first match and populates match object
sregex_iterator iterates over all matches (begin points to first match)
Default-constructed iterator serves as end sentinel
Dereference iterator to get match_results, then call str() for matched text

Example: Multiline Flag

Changes how ^ and $ work:


Multiline Flag in Python
import re

text = """Line 1
Line 2
Line 3"""

# Without MULTILINE: ^ matches only start of entire string
matches = re.findall(r'^Line', text)
print(matches)  # ['Line'] - only first line

# With MULTILINE: ^ matches start of any line
matches = re.findall(r'^Line', text, re.MULTILINE)
print(matches)  # ['Line', 'Line', 'Line'] - all three lines

Multiline Flag in JavaScript
const text = `Line 1
Line 2
Line 3`;

// Without m: ^ matches only start of entire string
/^Line/              // Matches "Line 1" only

// With m: ^ matches start of any line
/^Line/m             // Matches at start of each line

// Example with matchAll
Array.from(text.matchAll(/^Line/gm))  // 3 matches

Multiline Flag in Go
package main

import (
    "fmt"
    "regexp"
)

func main() {
    text := `Line 1
Line 2
Line 3`

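    // Without (?m): ^ matches only at the start of the entire text (Go's default)
    re := regexp.MustCompile(`^Line`)
    fmt.Println(len(re.FindAllString(text, -1)))  // 1

    // With (?m): ^ matches at the start of every line
    reMulti := regexp.MustCompile(`(?m)^Line`)
    matches := reMulti.FindAllString(text, -1)
    fmt.Println(matches)  // [Line Line Line]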
}

Multiline Flag in Rust
use regex::Regex;

fn main() {
    let text = "Line 1\nLine 2\nLine 3";

    // Without multiline: ^ matches only start of entire string
    let re = Regex::new(r"^Line").unwrap();
    let count = re.find_iter(text).count();
    println!("{}", count);  // 1 - only first line

    // With multiline: ^ matches start of any line
    let re_multi = Regex::new(r"(?m)^Line").unwrap();
    let count = re_multi.find_iter(text).count();
    println!("{}", count);  // 3 - all three lines
}

Multiline Flag in Java
import java.util.regex.*;

public class MultilineFlag {
    public static void main(String[] args) {
        String text = "Line 1\nLine 2\nLine 3";

        // Without MULTILINE: ^ matches only start of entire string
        Pattern pattern = Pattern.compile("^Line");
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) count++;
        System.out.println(count);  // 1

        // With MULTILINE: ^ matches start of any line
        Pattern multiPattern = Pattern.compile("^Line", Pattern.MULTILINE);
        Matcher multiMatcher = multiPattern.matcher(text);
        count = 0;
        while (multiMatcher.find()) count++;
        System.out.println(count);  // 3
    }
}

Multiline Flag in C++
#include <iostream>
#include <iterator>  // std::distance
#include <regex>
#include <string>

int main() {
    std::string text = "Line 1\nLine 2\nLine 3";

    // Without multiline (C++ default is NOT multiline)
    std::regex re("^Line");
    auto begin = std::sregex_iterator(text.begin(), text.end(), re);
    auto end = std::sregex_iterator();
    std::cout << std::distance(begin, end) << std::endl;  // 1

    // With multiline (std::regex::multiline, C++17, ECMAScript grammar only)
    std::regex re_multi("^Line", std::regex::multiline);
    begin = std::sregex_iterator(text.begin(), text.end(), re_multi);
    std::cout << std::distance(begin, end) << std::endl;  // 3
    return 0;
}

Example: Dotall Flag

Makes . match newlines:


Dotall Flag in Python
import re

text = "Hello\nWorld"

# Without DOTALL: . doesn't match newlines
match = re.search(r'Hello.World', text)
print(match)  # None

# With DOTALL: . matches newlines too
match = re.search(r'Hello.World', text, re.DOTALL)
print(match.group())  # "Hello\nWorld"

Dotall Flag in JavaScript
const text = "Hello\nWorld";

// Without s: . doesn't match newlines
/Hello.World/        // Doesn't match

// With s: . matches newlines too
/Hello.World/s       // Matches!

// Using test()
/Hello.World/s.test(text)  // true

Dotall Flag in Go
package main

import (
    "fmt"
    "regexp"
)

func main() {
    text := "Hello\nWorld"

    // Without dotall: . doesn't match newlines (Go default)
    re := regexp.MustCompile(`Hello.World`)
    fmt.Println(re.MatchString(text))  // false

    // With dotall: . matches newlines (use (?s) flag)
    re_dotall := regexp.MustCompile(`(?s)Hello.World`)
    fmt.Println(re_dotall.MatchString(text))  // true
}

Dotall Flag in Rust
use regex::Regex;

fn main() {
    let text = "Hello\nWorld";

    // Without dotall: . doesn't match newlines
    let re = Regex::new(r"Hello.World").unwrap();
    println!("{}", re.is_match(text));  // false

    // With dotall: . matches newlines (use (?s) flag)
    let re_dotall = Regex::new(r"(?s)Hello.World").unwrap();
    println!("{}", re_dotall.is_match(text));  // true
}

Dotall Flag in Java
import java.util.regex.*;

public class DotallFlag {
    public static void main(String[] args) {
        String text = "Hello\nWorld";

        // Without DOTALL: . doesn't match newlines
        Pattern pattern = Pattern.compile("Hello.World");
        Matcher matcher = pattern.matcher(text);
        System.out.println(matcher.find());  // false

        // With DOTALL: . matches newlines
        Pattern dotallPattern = Pattern.compile("Hello.World", Pattern.DOTALL);
        Matcher dotallMatcher = dotallPattern.matcher(text);
        System.out.println(dotallMatcher.find());  // true
    }
}

Dotall Flag in C++
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string text = "Hello\nWorld";

    // Without dotall: . doesn't match newlines (C++ default)
    std::regex re("Hello.World");
    std::cout << std::regex_search(text, re) << std::endl;  // false (0)

    // C++ doesn't have direct dotall flag
    // Workaround: use [\s\S] instead of .
    std::regex re_workaround(R"(Hello[\s\S]World)");
    std::cout << std::regex_search(text, re_workaround) << std::endl;  // true (1)
    return 0;
}

Concept Check 15: Case Insensitive Email

Modify the email pattern to match emails regardless of case (e.g., "User@EXAMPLE.COM").

Solution

JavaScript: /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/i

Python: re.search(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', email, re.I)

The i flag makes the match case insensitive, so the character classes could be shortened to [a-z] alone; the longer form is kept here only so the pattern stays identical to the original.

Practical Examples


Email Address Pattern
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

Part	Meaning
`^`	Start
`[a-zA-Z0-9._%+-]+`	Local part (one or more valid chars)
`@`	Literal @
`[a-zA-Z0-9.-]+`	Domain name
`\.`	Literal dot
`[a-zA-Z]{2,}`	TLD (at least 2 letters)
`$`	End

Email Validation Reality

This regex is a simplification. The actual email spec (RFC 5322) is absurdly complex. 🤯 In practice, just check for @ and send a confirmation email.

US Phone Number Pattern (Simplified)
^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$

Matches:

5551234567
555-123-4567
(555) 123-4567
555.123.4567

Mismatched Parentheses

This pattern has a flaw: it allows mismatched parentheses!

Invalid matches it allows:

(555-123-4567 (opening paren, no closing)
555)-123-4567 (closing paren, no opening)

Fixed version (requires both or neither):

Phone Number with Matched Parentheses
^(\d{3}|(\(\d{3}\)))[-.\s]?(\d{3})[-.\s]?(\d{4})$

Or, listing the parenthesized form first and dropping the extra inner group:

Phone Number - Strict Parentheses
^(\(\d{3}\)|\d{3})[-.\s]?(\d{3})[-.\s]?(\d{4})$

This says: "Either (555) OR 555, but not a mix."
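A short Python sketch that checks the strict form against the valid and invalid inputs listed above. The helper name is_us_phone is hypothetical.

```python
import re

# Strict form: either "(555)" or "555", never a lone parenthesis
PHONE = re.compile(r"^(\(\d{3}\)|\d{3})[-.\s]?(\d{3})[-.\s]?(\d{4})$")

def is_us_phone(text):
    """Hypothetical helper: True if text fits the simplified US phone format."""
    return PHONE.search(text) is not None

valid = ["5551234567", "555-123-4567", "(555) 123-4567", "555.123.4567"]
invalid = ["(555-123-4567", "555)-123-4567"]

print([is_us_phone(p) for p in valid])    # [True, True, True, True]
print([is_us_phone(p) for p in invalid])  # [False, False]
```

Note that group 1 now includes the parentheses when they are present, so code that extracts the area code may want to strip them.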

IPv4 Address Pattern
^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$

This Isn't Perfect

This matches 999.999.999.999, which isn't a valid IP. For true validation, you'd need (?:25[0-5]|2[0-4]\d|[01]?\d\d?) for each octet, or just parse the numbers and check in code.
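The "parse the numbers and check in code" approach might look like the sketch below: the regex checks only the shape, and plain Python checks the range. The function name is_valid_ipv4 is hypothetical.

```python
import re

# Shape check only: four dot-separated groups of 1 to 3 digits
IPV4_SHAPE = re.compile(r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$")

def is_valid_ipv4(address):
    """Hypothetical helper: regex for the shape, integer comparison for the range."""
    match = IPV4_SHAPE.match(address)
    if match is None:
        return False
    return all(int(octet) <= 255 for octet in match.groups())

print(is_valid_ipv4("192.168.1.1"))      # True
print(is_valid_ipv4("999.999.999.999"))  # False (octets out of range)
print(is_valid_ipv4("1.2.3"))            # False (only three octets)
```

Splitting the work this way keeps the pattern readable and puts the numeric rule where it is easy to test.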

Log Parsing Pattern
^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*)$

For: [2024-03-15 14:30:45] [ERROR] Something went wrong

Group 1: 2024-03-15 14:30:45
Group 2: ERROR
Group 3: Something went wrong
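In Python, the three groups can be unpacked in one step. A minimal sketch using the sample line above:

```python
import re

LOG_LINE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*)$")

line = "[2024-03-15 14:30:45] [ERROR] Something went wrong"
match = LOG_LINE.match(line)

# groups() returns the captures in order: timestamp, level, message
timestamp, level, message = match.groups()

print(timestamp)  # 2024-03-15 14:30:45
print(level)      # ERROR
print(message)    # Something went wrong
```

Real log files usually contain lines that do not fit the format, so production code should check that match is not None before unpacking.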

Regex in Different Languages

Most languages use similar syntax, with minor variations:


Regular Expressions in Python
import re

# Search for pattern
match = re.search(r'\d+', 'Order 12345')  # (1)!
print(match.group())  # "12345"

# Find all matches
matches = re.findall(r'\d+', 'Items: 5, 10, 15')  # (2)!
print(matches)  # ['5', '10', '15']

# Replace
result = re.sub(r'\d+', 'X', 'Order 123')  # (3)!
print(result)  # "Order X"

search() finds the first match in the string and returns a match object
findall() returns a list of all non-overlapping matches
sub() replaces all matches with a replacement string

Regular Expressions in JavaScript
// Test if pattern matches
/\d+/.test('Order 12345')  // true  // (1)!

// Find match
'Order 12345'.match(/\d+/)  // ['12345']  // (2)!

// Replace
'Order 123'.replace(/\d+/, 'X')  // "Order X"  // (3)!

test() returns boolean - true if pattern is found anywhere in string
match() returns array of matches (use /g flag for all matches)
replace() substitutes first match with replacement (use /g for all)

Regular Expressions in Rust
use regex::Regex;  // (1)!

// Search for pattern
let re = Regex::new(r"\d+").unwrap();
if let Some(mat) = re.find("Order 12345") {  // (2)!
    println!("{}", mat.as_str());  // "12345"
}

// Find all matches
let caps: Vec<&str> = re
    .find_iter("Items: 5, 10, 15")  // (3)!
    .map(|m| m.as_str())
    .collect();
println!("{:?}", caps);  // ["5", "10", "15"]

// Replace
let result = re.replace_all("Order 123", "X");  // (4)!
println!("{}", result);  // "Order X"

Requires regex crate: add regex = "1" to Cargo.toml
find() returns Option<Match> for the first match
find_iter() returns an iterator over all matches
replace_all() substitutes all matches with replacement string

Regular Expressions in grep
# Find lines containing "error"
grep -E 'error' logfile.txt

# Find lines starting with a date
grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}' logfile.txt

The Connection to Theory

Remember how we said FSMs and regular expressions describe the same languages? Here's the connection:

Regex	FSM Equivalent
`ab`	Sequence of states
`a|b`	Branch (two paths from same state)
`a*`	Loop back to same state
`a+`	Transition + loop
`a?`	Optional path (epsilon transition)

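Every regular expression in the classical sense (literals, concatenation, alternation, and repetition) can be converted to an NFA (Non-deterministic Finite Automaton), then to a DFA (Deterministic Finite Automaton), then executed efficiently. Engines such as Go's regexp package and Rust's regex crate are built on automata and guarantee linear-time matching. Many others, including those in Python, JavaScript, and Java, use backtracking instead, which makes features like backreferences possible (these go beyond what an FSM can express) at the cost of the pitfalls described below.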

Example: The regex ab*c converts to this FSM:

stateDiagram-v2
    direction LR

    [*] --> S0
    S0 --> S1: a
    S1 --> S1: b
    S1 --> S2: c
    S2 --> [*]

Start at S0
Read 'a' → move to S1
Read zero or more 'b's → loop on S1
Read 'c' → move to S2 (accepting state)
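The walk-through above can be sketched as a table-driven simulation. This is a minimal illustration, not how a production engine is built; the names `transitions`, `acceptingStates`, and `acceptsAbStarC` are made up for the example.

```javascript
// Transition table for the diagram: state -> input character -> next state
const transitions = {
  S0: { a: 'S1' },
  S1: { b: 'S1', c: 'S2' },
  S2: {},
};
const acceptingStates = new Set(['S2']);

// Run the machine over the whole input; reject on any missing transition
function acceptsAbStarC(input) {
  let state = 'S0';
  for (const ch of input) {
    const next = transitions[state][ch];
    if (next === undefined) return false;
    state = next;
  }
  return acceptingStates.has(state);
}

// The anchored regex should agree with the hand-built machine on every sample
for (const s of ['ac', 'abc', 'abbbc', 'a', 'abca', 'bc', '']) {
  console.log(JSON.stringify(s), acceptsAbStarC(s), /^ab*c$/.test(s));
}
```

The regex is anchored with ^ and $ because the machine consumes the entire input, whereas an unanchored regex would also accept a match buried inside a longer string.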

Common Pitfalls

Catastrophic Backtracking

Some patterns cause exponential backtracking:

Catastrophic Backtracking
`(a+)+b`

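Against a string like aaaaaaaaaaaaaaaaac, which contains no b, the match is bound to fail. Before giving up, though, a backtracking engine tries every possible way to divide the a's among the repetitions of the group, and there are exponentially many. This can freeze your program. ❄️ Not fun.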
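To see where "exponentially many" comes from, the sketch below counts the ways a run of n a's can be cut into consecutive non-empty chunks, one chunk per repetition of the inner a+. It is a rough model of the search space rather than a measurement of any real engine, and the function name `countSplits` is made up for the example.

```javascript
// Count the ways to split a run of n characters into non-empty
// consecutive chunks (each chunk is one repetition of the inner a+).
function countSplits(n) {
  if (n === 0) return 1;  // nothing left to split: exactly one way
  let ways = 0;
  for (let firstChunk = 1; firstChunk <= n; firstChunk++) {
    ways += countSplits(n - firstChunk);
  }
  return ways;
}

for (const n of [1, 2, 4, 8, 16]) {
  console.log(n, countSplits(n));  // 1, 2, 8, 128, 32768
}
```

The count doubles with every extra a (it equals 2^(n-1)), which is why adding a handful of characters can turn a slow match into one that never finishes.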

Solutions:

Refactor to avoid nested quantifiers (universal solution):

Avoiding Nested Quantifiers
# Bad - nested quantifiers
(a+)+b

# Better - single quantifier
a+b
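A quick way to gain confidence in a refactor like this is to check that both patterns agree on a handful of inputs. The sketch below does that with short strings only, since the nested pattern is the slow one; the sample list and variable names are illustrative.

```javascript
const nested = /^(a+)+b$/;  // pattern from the text, anchored for whole-string tests
const flat = /^a+b$/;       // refactored version

// Short inputs only: the nested pattern is slow on long failing inputs
const samples = ['b', 'ab', 'aab', 'aaab', 'aac', 'ba', '', 'aaaa'];
const agree = samples.every((s) => nested.test(s) === flat.test(s));
console.log(agree);  // true

// The refactored pattern rejects a long failing input without trouble
const longInput = 'a'.repeat(50000) + 'c';
console.log(flat.test(longInput));  // false
```

Agreement on samples is evidence rather than proof, but here the equivalence also follows from the patterns themselves: one or more runs of one or more a's is simply one or more a's.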

Use atomic groups (advanced, not supported everywhere):

An atomic group (?>...) matches like a normal group, but once it succeeds, the regex engine "commits" to that match and won't backtrack into it.

How it works:

Normal group (a+): If the overall pattern fails, the engine can backtrack and try matching fewer a's
Atomic group (?>a+): Once matched, the engine won't reconsider - it's "locked in"

Example:

Atomic Groups

# Without atomic group - catastrophic backtracking
(a+)+b    # Against "aaaaaac", tries every way to split a's

# With atomic group - no backtracking inside
(?>a+)+b  # Matches all a's in one chunk, can't backtrack into it

The atomic group prevents the exponential backtracking by saying "once I've matched the a's, I'm done - don't try different ways to split them up."

Use possessive quantifiers (advanced, limited support):

Possessive quantifiers (*+, ++, ?+) work like atomic groups but with shorter syntax - they match and don't give back:

Possessive Quantifiers
# Possessive quantifier - no backtracking
a++b

Limited Support

Atomic groups and possessive quantifiers are not supported in all regex engines. JavaScript doesn't support them at all. Stick with solution #1 (refactoring) for maximum compatibility.
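When refactoring is not an option, one commonly cited workaround for JavaScript is to imitate an atomic group with a capturing lookahead followed by a backreference. The sketch below assumes the engine does not backtrack into a lookahead once it has succeeded, which is how ECMAScript defines lookaheads; treat it as an illustration of the idea rather than a drop-in replacement.

```javascript
// (?=(a+))\1 behaves like the atomic group (?>a+):
// the lookahead captures the longest run of a's, then \1 consumes exactly that text.
const atomicLike = /^(?:(?=(a+))\1)+b$/;

console.log(atomicLike.test('aaab'));  // true
console.log(atomicLike.test('b'));     // false: at least one a is required

// A long failing input is rejected without exponential blow-up
const failing = 'a'.repeat(5000) + 'c';
console.log(atomicLike.test(failing)); // false
```

The pattern is harder to read than a+b, so the plain refactoring remains the better choice whenever it is available.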

Forgetting Anchors

Without Anchors
`\d{3}-\d{4}`

This matches "555-1234" inside "call 555-1234 now". If you want exact matches, use anchors:

With Anchors
`^\d{3}-\d{4}$`
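The difference is easy to check in code. The snippet below is a small sketch using the two patterns above; the test strings beyond "call 555-1234 now" are made up for illustration.

```javascript
const unanchored = /\d{3}-\d{4}/;
const anchored = /^\d{3}-\d{4}$/;

console.log(unanchored.test('call 555-1234 now'));  // true: found inside the string
console.log(anchored.test('call 555-1234 now'));    // false: extra text around it
console.log(anchored.test('555-1234'));             // true: the whole string matches
console.log(unanchored.test('5555-12345'));         // true: "555-1234" hides inside
```

The last line is the kind of input that slips through validation when anchors are missing: the string as a whole is malformed, yet a valid-looking piece of it still matches.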

Escaping Special Characters

To match literal special characters, escape them:

Escaped Special Characters
`\.\*\+\?\[\]\{\}\^\$\|\\`

Or use a character class where most specials are literal:

Special Characters in Character Class
`[.*+?]  # Matches literal ., *, +, or ?`
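Escaping matters most when a pattern is built from text you do not control. The sketch below uses a hand-written helper, here called `escapeRegExp` (a hypothetical name, not a built-in), to escape every special character before the text is passed to the RegExp constructor. The sample input is illustrative.

```javascript
// Escape every character that has special meaning in a regex
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');  // $& is the matched character
}

const userInput = '3.50 (USD)';  // illustrative input containing specials
const pattern = new RegExp(escapeRegExp(userInput));

console.log(escapeRegExp(userInput));            // 3\.50 \(USD\)
console.log(pattern.test('Price: 3.50 (USD)'));  // true
console.log(pattern.test('Price: 3x50 USD'));    // false
```

Without the escaping step, the dot would match any character and the parentheses would form a group, so the second string would be accepted as well.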

Practice Problems

Practice Problem 1: URL Validation

Write a regex that matches HTTP/HTTPS URLs like:

https://example.com
http://sub.domain.org/path
https://site.io/page?id=123

Practice Problem 2: Date Formats

Match dates in YYYY-MM-DD format where:

Year is 4 digits
Month is 01-12
Day is 01-31

Bonus: Can you ensure month doesn't exceed 12?

Practice Problem 3: Find Duplicates

Write a regex that finds repeated consecutive words in text, like "the the" or "is is".

Hint: Use backreferences.

Key Takeaways

Concept	Syntax	Example
Any character	`.`	`a.c` matches "abc", "a1c"
Character class	`[...]`	`[aeiou]` matches vowels
Negated class	`[^...]`	`[^0-9]` matches non-digits
Zero or more	`*`	`a*` matches "", "a", "aaa"
One or more	`+`	`a+` matches "a", "aaa"
Optional	`?`	`colou?r` matches both spellings
Alternation	`|`	`cat|dog` matches either
Group	`(...)`	`(ab)+` matches "ab", "abab"
Word boundary	`\b`	`\bword\b` matches whole word
