description: From raw text to meaning: How parsers use grammars to understand code and data.
How Parsers Work
You've learned about RTNs, BNF, and Finite State Machines. These are all ways to describe languages. But how do we actually use these descriptions to process text? How does Python know that `print("hello")` is valid but `print("hello` isn't?
The answer is parsing—the process of analyzing text according to a grammar. It's how compilers understand your code, how browsers render HTML, and how your JSON config files become usable data. 🎯
Why Is Parsing Necessary?
Computers can't directly understand text—not even simple code like x = 2 + 3. Here's why parsing is essential:
The Problem: Text is ambiguous and unstructured
When you write 2 + 3 * 4, you know multiplication happens first. But to a computer, it's just a string: "2 + 3 * 4". How does it know:
- What's an operator vs. a number?
- Which operation happens first?
- Whether the syntax is valid?
- How to represent nested structures like `f(g(x))`?
What Happens Without Parsing?
Imagine trying to execute code directly from text:
"x = 2 + 3 * 4"- Which+or*do you do first?"if (x > 5) print(x)"- Where does the condition end and the body begin?"func(a, b)"- How do you knowfuncis a function call anda, bare arguments?
You'd need custom logic for every possible code pattern. It's chaos.
The Solution: Structured Representation
Parsing transforms unstructured text into a tree structure that:
- ✓ Validates syntax ("Is this valid code?")
- ✓ Captures meaning ("What operations are being done?")
- ✓ Shows relationships ("Which operations happen first?")
- ✓ Enables execution ("Now I can evaluate/compile this!")
Without parsing, you can't compile, execute, or even validate code. It's the bridge between human-readable text and computer-executable instructions.
The Big Picture
When a compiler or interpreter processes your code, it goes through three stages:
Step 1: Lexical Analysis
```mermaid
flowchart LR
    A[("<b>Source Code</b>")]
    B["<b>Lexer</b>"]
    C[("<b>Tokens</b>")]
    A -->|input| B
    B -->|output| C
    style A fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style B fill:#4a5568,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style C fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
```
Step 2: Parsing
```mermaid
flowchart LR
    A[("<b>Tokens</b>")]
    B["<b>Parser</b>"]
    C[("<b>Parse Tree / AST</b>")]
    A -->|input| B
    B -->|output| C
    style A fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style B fill:#4a5568,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style C fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
```
Step 3: Compilation/Interpretation
```mermaid
flowchart LR
    A[("<b>Parse Tree / AST</b>")]
    B["<b>Compiler / Interpreter</b>"]
    C[("<b>Result</b>")]
    A -->|input| B
    B -->|output| C
    style A fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style B fill:#4a5568,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style C fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
```
| Stage | Input | Output | What It Does |
|---|---|---|---|
| Lexer | Source Code | Tokens | Breaks text into meaningful chunks. Example: `"x = 42"` → `[ID:x, EQUALS, NUM:42]` |
| Parser | Tokens | Parse Tree / AST | Builds a tree structure from tokens. Example: tokens → tree with assignment node |
| Compiler/Interpreter | Parse Tree / AST | Result | Executes or compiles the tree. Example: tree → machine code or output |
What is AST?
AST stands for Abstract Syntax Tree - a simplified tree structure representing the code's meaning and structure, without unnecessary syntax details. More details below!
We'll focus on the first two stages: lexing and parsing.
Stage 1: Lexical Analysis (Lexing)
The lexer (or tokenizer) breaks raw text into meaningful chunks called tokens. It's like turning a stream of characters into a stream of words.
What Tokens Look Like
For the expression total = price * 2 + tax:
| Token Type | Value |
|---|---|
| IDENTIFIER | "total" |
| EQUALS | "=" |
| IDENTIFIER | "price" |
| STAR | "*" |
| NUMBER | "2" |
| PLUS | "+" |
| IDENTIFIER | "tax" |
The lexer doesn't understand grammar—it just recognizes patterns. "Is this a number? A keyword? An operator?" That's all it asks. Simple creature, the lexer. 🔍
Lexer Implementation
Lexers are typically implemented as Finite State Machines and use Regular Expressions to define token patterns:
```mermaid
stateDiagram-v2
    direction LR
    [*] --> Start
    Start --> InNumber: digit
    InNumber --> InNumber: digit
    InNumber --> Start: whitespace [emit NUMBER]
    Start --> InIdent: letter
    InIdent --> InIdent: letter/digit
    InIdent --> Start: whitespace [emit IDENTIFIER]
    Start --> Start: whitespace [skip]
    Start --> Start: + [emit PLUS]
    Start --> Start: * [emit STAR]
    Start --> Start: = [emit EQUALS]
```
Here's a simple lexer implementation:
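Below is a minimal sketch in Python. The token names (`IDENTIFIER`, `STAR`, etc.) and the `tokenize` helper are this chapter's own conventions, not a standard API:

```python title="Simple Lexer in Python" linenums="1"
import re

def tokenize(text):
    token_specs = [                       # (1)!
        ('NUMBER',     r'\d+'),
        ('IDENTIFIER', r'[A-Za-z_]\w*'),
        ('EQUALS',     r'='),
        ('PLUS',       r'\+'),
        ('MINUS',      r'-'),
        ('STAR',       r'\*'),
        ('SLASH',      r'/'),
        ('LPAREN',     r'\('),
        ('RPAREN',     r'\)'),
        ('SKIP',       r'[ \t]+'),        # (2)!
    ]
    tokens = []
    pos = 0                               # (3)!
    while pos < len(text):
        for token_type, pattern in token_specs:    # (4)!
            match = re.match(pattern, text[pos:])  # (5)!
            if match:
                if token_type != 'SKIP':           # (6)!
                    tokens.append((token_type, match.group()))
                pos += match.end()                 # (7)!
                break
        else:
            raise ValueError(f"Unexpected character: {text[pos]!r}")  # (8)!
    return tokens

print(tokenize("total = price * 2 + tax"))
# [('IDENTIFIER', 'total'), ('EQUALS', '='), ('IDENTIFIER', 'price'),
#  ('STAR', '*'), ('NUMBER', '2'), ('PLUS', '+'), ('IDENTIFIER', 'tax')]
```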
- Define all token types and their regex patterns as (name, pattern) tuples
- SKIP tokens (whitespace) are recognized but not added to the output
- Track current position in the input string
- Try each token pattern in order until one matches
- Try to match the pattern at the current position
- Only add non-whitespace tokens to the result
- Move position forward past the matched token
- If no pattern matches, we have an invalid character
This produces a list of tokens, where each token is a tuple of (TOKEN_TYPE, VALUE). The lexer has broken the input string into meaningful pieces that the parser can work with.
Stage 2: Parsing
The parser takes tokens and builds a structured representation—usually a tree—according to the grammar rules.
Why Trees?
A flat list of tokens doesn't capture the structure and relationships in code. Consider 2 + 3 * 4:
- As tokens: `[NUM:2, PLUS, NUM:3, STAR, NUM:4]` - just a flat sequence
- As a tree: Shows that `3 * 4` happens first (nested deeper), then `2 +` the result
Trees naturally represent:
- Operator precedence: Deeper operations execute first (`*` before `+`)
- Hierarchical structure: Nested expressions, function calls, code blocks
- Grammar rules: Each tree node represents a grammar rule being applied
- Evaluation order: Walk the tree to know what to compute when
Without a tree, you'd need complex logic to figure out "which operation happens first?" The tree makes it obvious—start at the leaves, work up to the root.
Connection to BNF and RTN
Remember BNF and RTN from earlier? They're grammar notations that describe valid syntax. The parser uses these rules to build the tree:
- Each BNF rule becomes a tree node: When the parser applies `<expression> ::= <term> "+" <term>`, it creates an expression node with two term children
- RTN paths trace tree construction: Following an RTN from start to end corresponds to building a subtree
- Recursion in grammar = Recursion in tree: Nested expressions in code become nested nodes in the tree
The grammar is the blueprint; the tree is the structure built from that blueprint. The parser is the construction worker following the plans.
Parse Trees vs Abstract Syntax Trees
Parse Tree (Concrete Syntax Tree): Shows every grammar rule applied.
Abstract Syntax Tree (AST): Simplified tree focusing on meaning, not syntax.
For 2 + 3 * 4:
```
          Expression
              |
    +---------+---------+
    |         |         |
   Term      '+'       Term
    |                    |
  Factor         +-------+-------+
    |            |       |       |
   '2'          Term    '*'    Factor
                 |               |
               Factor           '4'
                 |
                '3'
```
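The same expression as an AST drops the intermediate `Term` and `Factor` nodes, keeping only operators and values:

```
     '+'
    /   \
  '2'   '*'
       /   \
     '3'   '4'
```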
The AST is what most compilers actually work with.
Scheme: Where Parse Tree = AST
In Scheme & Parse Trees, you'll see a language where the written code is the AST. Because Scheme uses prefix notation with explicit parentheses (+ 1 (* 2 3)), there is no ambiguity, and the parse tree matches the code structure 1:1.
Why ASTs Drop Punctuation
Notice how the AST doesn't include parentheses, commas, or semicolons? That's intentional. Once you've parsed the code and built the tree structure, punctuation has served its purpose—it told the parser how to group things.
What gets dropped:
- Parentheses () - the tree structure already shows grouping
- Commas , - already represented as multiple children
- Semicolons ; - just statement separators, not meaningful after parsing
- Keywords like then, do - the tree node type already captures the meaning
What gets kept:
- Operators +, -, * - needed to know which operation to perform
- Literals 42, "hello" - the actual values
- Identifiers x, foo - variable and function names
The AST keeps only what's needed for execution or compilation. This makes it smaller and easier to work with than the full parse tree.
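For example, a statement like `f(a, b);` might become, in the nested-tuple style used later in this chapter, something like:

```python
('call', 'f', [('identifier', 'a'), ('identifier', 'b')])
# The parentheses, comma, and semicolon are gone;
# the node type and child list carry the same information.
```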
SQL Execution Plans: A Related Concept
If you've worked with databases, you might recognize tree structures from SQL execution plans. They're related but different:
Parse Tree/AST (what parsers build):

- Represents the syntactic structure of the code
- Shows what the query means grammatically
- Created during parsing (before execution)

Execution Plan (what query optimizers build):

- Represents the execution strategy for running the query
- Shows how to efficiently execute the query (which indexes, join order, etc.)
- Created after parsing, during query optimization
The full pipeline is: SQL text → Lexer → Parser → AST → Query Optimizer → Execution Plan → Results
When database administrators tune queries by examining execution plans, they're looking at a tree structure that comes after parsing—it's the optimized plan for executing the already-parsed query. Both are trees showing hierarchical operations, but the AST captures syntax while the execution plan captures runtime strategy.
Grammar Ambiguity
Some grammars are ambiguous—they allow multiple valid parse trees for the same input. This is a problem because the program's meaning becomes unclear.
The Classic Example: Dangling Else
Consider this grammar for if-statements:
```text title="Ambiguous If-Statement Grammar"
<statement> ::= "if" <condition> "then" <statement>
              | "if" <condition> "then" <statement> "else" <statement>
              | <other-statement>
```
Now parse this code:
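```
if A then if B then X else Y
```

(Here `A` and `B` stand for conditions, `X` and `Y` for statements.)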
Two valid parse trees exist:
Interpretation 1: The else belongs to the inner if:
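```
if A then
    if B then
        X
    else        ← this if gets the else
        Y
```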
Interpretation 2: The else belongs to the outer if:
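```
if A then
    if B then
        X
else            ← the outer if gets the else
    Y
```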
Both are valid according to the grammar! This is the Dangling Else problem.
How Languages Solve It
Most languages resolve this through precedence rules:
- Convention: The `else` matches the nearest unmatched `if` (Interpretation 1)
- Require explicit delimiters: Python uses indentation; C uses braces `{ }`
- Require `end` keywords: Pascal and Ada make all blocks explicit
Why Ambiguity Matters for Parsing
- Parsers must make choices: When faced with ambiguity, the parser picks one interpretation (usually via precedence rules)
- Different parsers might choose differently: Without explicit rules, two implementations could parse the same code differently
- Grammar design is crucial: Good language design avoids ambiguity through careful grammar construction
Key insight: Ambiguity is a property of the grammar, not the parser. A well-designed grammar eliminates ambiguity before parsing begins.
Parsing Strategies
Not all grammars are equally easy to parse, and different approaches have different trade-offs. Think of parsing strategies like different tools in a toolbox—a hammer for some jobs, a screwdriver for others.
Choosing Your Approach
Why Different Strategies?
Different parsing strategies exist because:
- Grammar constraints: Some grammars are ambiguous or have features (like left-recursion) that break certain parsing approaches
- Implementation complexity: Hand-written parsers need simplicity; generated parsers can be more complex
- Error handling: Some strategies give better error messages than others
- Performance: Different strategies have different speed and memory characteristics
- Parsing power: Some strategies can handle more complex grammars than others
Which Strategy Should You Use?
| Situation | Recommended Strategy | Why? |
|---|---|---|
| Writing by hand | Top-down (Recursive Descent) | Easy to understand, maps directly to grammar, good error messages |
| Using a parser generator | Bottom-up (LR/LALR) | More powerful, handles more grammars, tools do the hard work |
| Simple expressions | Top-down | Quick to implement, sufficient for most expression grammars |
| Complex language (C, Java) | Bottom-up (via tool) | Handles complex grammar features, proven at scale |
| Educational purposes | Top-down | Easier to understand and trace execution |
For most projects: start with recursive descent (top-down) because it's intuitive. Only reach for more powerful strategies when you hit limitations.
Top-Down Parsing
Strategy: Start with the goal (e.g., "program") and work down to terminals.
Recursive Descent is the most intuitive top-down approach. Each grammar rule becomes a function:
Given this grammar:
```text title="Expression Grammar"
<expression> ::= <term> { ("+" | "-") <term> }
<term>       ::= <factor> { ("*" | "/") <factor> }
<factor>     ::= NUMBER | "(" <expression> ")"
```
Left Recursion: A Top-Down Parser Killer
Top-down parsers cannot handle left-recursive grammars. A rule like this will cause infinite recursion:
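```
<expression> ::= <expression> "+" <term>
```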
Why it breaks: When parse_expression() is called, the first thing it does is call parse_expression() again (to match <expression>), which calls parse_expression() again, forever. The parser never consumes a token!
The fix: Rewrite left recursion as iteration (loops):
```
# Instead of:
<expression> ::= <expression> "+" <term>

# Use:
<expression> ::= <term> { "+" <term> }
```
The { } notation means "zero or more," which translates to a while loop in code. This is why our grammar above uses { ... } - it avoids left recursion.
Bottom-up parsers (LR) handle left recursion naturally, which is one reason they're more powerful.
Here's a recursive descent parser implementation:
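A minimal Python sketch following the grammar above. The token types match the lexer sketch earlier, and the tuple node shapes (`('binop', ...)`, `('number', ...)`) are this chapter's own convention:

```python title="Recursive Descent Parser in Python" linenums="1"
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens                    # (1)!
        self.pos = 0                            # (2)!

    def current_token(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def consume(self, expected_type=None):      # (3)!
        token = self.current_token()
        if token is None:
            raise SyntaxError("Unexpected end of input")
        if expected_type and token[0] != expected_type:
            raise SyntaxError(f"Expected {expected_type}, got {token[0]}")
        self.pos += 1                           # (4)!
        return token

    def parse_expression(self):                 # (5)!
        node = self.parse_term()                # (6)!
        while self.current_token() and self.current_token()[0] in ('PLUS', 'MINUS'):
            op = self.consume()[1]
            right = self.parse_term()
            node = ('binop', op, node, right)   # (7)!
        return node

    def parse_term(self):                       # (8)!
        node = self.parse_factor()
        while self.current_token() and self.current_token()[0] in ('STAR', 'SLASH'):
            op = self.consume()[1]
            right = self.parse_factor()
            node = ('binop', op, node, right)
        return node

    def parse_factor(self):                     # (9)!
        token = self.current_token()
        if token is None:
            raise SyntaxError("Unexpected end of input")
        if token[0] == 'NUMBER':
            self.consume()
            return ('number', int(token[1]))
        if token[0] == 'LPAREN':                # (10)!
            self.consume('LPAREN')
            node = self.parse_expression()      # (11)!
            self.consume('RPAREN')
            return node
        raise SyntaxError(f"Unexpected token: {token}")

tokens = [('NUMBER', '2'), ('PLUS', '+'), ('NUMBER', '3'),
          ('STAR', '*'), ('NUMBER', '4')]
# (or: tokens = tokenize("2 + 3 * 4") using the lexer sketch above)
print(Parser(tokens).parse_expression())
# ('binop', '+', ('number', 2), ('binop', '*', ('number', 3), ('number', 4)))
```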
- Store the list of tokens from the lexer
- Track current position in the token list
- Consume one token and optionally validate its type
- Move to the next token after consuming
- Handles lowest precedence operators (+ and -)
- Start by parsing the higher-precedence term
- Build a binary operation node combining left and right operands
- Handles medium precedence operators (* and /)
- Handles highest precedence: numbers and parenthesized expressions
- Handle parenthesized sub-expressions
- Recursively parse the expression inside parentheses
This is a tree structure represented as nested tuples—our AST! Each tuple is a node:

- `('binop', '+', left, right)` = a binary operation node with operator `+`
- `('number', 2)` = a leaf node containing the value `2`
- The nesting shows the tree structure: `2 + (3 * 4)`
Compare this to the lexer's flat list of tokens—the parser has transformed that flat list into a hierarchical tree that captures the meaning and precedence.
Notice the Structure
The parser functions mirror the grammar exactly:
- `parse_expression` handles `+` and `-`
- `parse_term` handles `*` and `/`
- `parse_factor` handles numbers and parentheses
This is why BNF translates so directly into code!
Recursion and the Call Stack
Notice how parse_factor can call parse_expression, which calls parse_term, which calls parse_factor again? This mutual recursion works because each function call is managed by the function call stack.
When parsing (2 + 3), the call stack looks like:
```
parse_factor()                  ← handles the parentheses
  → parse_expression()          ← handles the + inside
      → parse_term()
          → parse_factor()      ← handles 2
```
When the innermost call completes, it returns to the previous function—classic LIFO behavior. Without the call stack, recursive parsing would be impossible.
How It Handles Operator Precedence:
Our grammar naturally handles precedence! Here's why:
- `expression` handles `+` and `-`
- `term` handles `*` and `/`
- `factor` handles numbers and parentheses
Since term is nested inside expression, multiplication happens "deeper" in the tree—which means it's evaluated first.
For 2 + 3 * 4:
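Sketching the AST our parser builds, the `*` node sits deeper than the `+` node, so it's evaluated first:

```
      +
     / \
    2   *
       / \
      3   4
```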
Result: 2 + (3 * 4) = 14 ✓
Adding More Precedence Levels:
Want to add exponentiation (^) with highest precedence?
```text title="Grammar with Exponentiation"
<expression> ::= <term> { ("+" | "-") <term> }
<term>       ::= <power> { ("*" | "/") <power> }
<power>      ::= <factor> [ "^" <power> ]
<factor>     ::= NUMBER | "(" <expression> ")"
```
Notice <power> calls itself on the right side—this makes ^ right-associative: \(2^{3^4} = 2^{(3^4)}\).
Bottom-Up Parsing
Strategy: Start with tokens and combine them into larger structures, building the parse tree from the leaves up to the root.
How it works:
- Begin with individual tokens
- Recognize patterns that match the right-hand side of grammar rules
- "Reduce" those patterns to non-terminals
- Continue until reaching the start symbol
Advantages:
- More powerful than top-down (handles more grammars)
- Naturally handles left-recursive grammars
- Can detect syntax errors earlier
Disadvantages:
- More complex to implement by hand
- Harder to understand and debug
- Error messages can be less intuitive
In practice: Bottom-up parsers are typically generated by tools like YACC and GNU Bison rather than written by hand. These tools take a grammar specification and automatically generate efficient parser code.
Understanding LL and LR
Now that we've seen both approaches, let's compare them:
| Type | Reads | Builds Tree | Used By |
|---|---|---|---|
| LL(k) | Left-to-right | Leftmost derivation (top-down) | Recursive descent, ANTLR |
| LR(k) | Left-to-right | Rightmost derivation (bottom-up) | YACC, GNU Bison |
What "Leftmost" and "Rightmost" Derivation Mean:
These terms describe the order in which the parser builds the parse tree:
Leftmost Derivation (LL - Top-Down):
- Start with the goal (e.g., `<expression>`)
- Expand grammar rules from left to right, top to bottom
- Like reading: start with the big picture, fill in details left-to-right
Example: 2 + 3
```
<expression>
→ <term> + <term>      (expand expression)
→ <number> + <term>    (expand leftmost term first)
→ 2 + <term>           (substitute 2)
→ 2 + <number>         (expand remaining term)
→ 2 + 3                (substitute 3)
```
Rightmost Derivation (LR - Bottom-Up):
- Start with the tokens (e.g., `2`, `+`, `3`)
- Combine them into larger structures; the sequence of reductions, read in reverse, traces a rightmost derivation
- Like building with blocks: start with pieces, assemble upward
Example: 2 + 3
```
2 + 3
→ <number> + 3         (reduce 2 to <number>)
→ <number> + <number>  (reduce 3 to <number>)
→ <term> + <term>      (reduce to terms)
→ <expression>         (reduce to expression)
```
For practical purposes: LL is top-down (goal → tokens), LR is bottom-up (tokens → goal). The "leftmost/rightmost" distinction matters for formal theory but the key difference is the direction.
Understanding Lookahead: The "(k)" Parameter
The "(k)" means how many tokens ahead the parser peeks to decide which grammar rule to apply.
LL(1) - One Token Lookahead:
When parsing, you often need to decide between alternatives. With LL(1), you look at the next token to decide:
```python title="LL(1) Lookahead Example" linenums="1"
# Grammar rule with alternatives:
#   <factor> ::= NUMBER | "(" <expression> ")"

def parse_factor(self):
    token = self.current_token()    # Look at next token
    if token[0] == 'NUMBER':        # Lookahead says: it's a number
        return self.parse_number()
    elif token[0] == 'LPAREN':      # Lookahead says: it's a parenthesized expr
        return self.parse_paren_expr()
```
One token is enough to decide which path to take.
When You Need More Lookahead:
Some grammars need LL(2) or LL(3), and some defeat any fixed lookahead:

```
# Ambiguous with LL(1):
<statement> ::= "if" <expr> "then" <stmt>
              | "if" <expr> "then" <stmt> "else" <stmt>
```

After seeing if, both rules start the same way, and no fixed number of lookahead tokens can tell them apart: the `else`, if present, only appears after an arbitrarily long `<stmt>`. The usual fix is to left-factor the grammar, merging the common prefix and making the `else` part optional.
Why Lookahead Matters:
- LL(1) is simplest and most efficient (what recursive descent typically uses)
- LL(k) for k>1 requires more complex logic
- LR(1) is more powerful than LL(1) - can handle more grammars with just 1 token lookahead
- More lookahead = more memory and complexity
Most hand-written parsers use LL(1) because it's simple. Parser generators can handle larger k values automatically.
Working with Parse Trees
Once you have a parse tree or AST, you need to do something with it—evaluate it, compile it, or analyze it. This section covers common operations.
Evaluating the AST
Once you have an AST, you can walk it to compute results:
```python title="AST Evaluator in Python" linenums="1"
def evaluate(node):  # (1)!
    if node[0] == 'number':  # (2)!
        return node[1]
    elif node[0] == 'binop':  # (3)!
        op, left, right = node[1], node[2], node[3]  # (4)!
        left_val = evaluate(left)  # (5)!
        right_val = evaluate(right)
        if op == '+': return left_val + right_val  # (6)!
        if op == '-': return left_val - right_val
        if op == '*': return left_val * right_val
        if op == '/': return left_val / right_val
    raise ValueError(f"Unknown node type: {node[0]}")

# Using our earlier AST
ast = ('binop', '+', ('number', 2), ('binop', '*', ('number', 3), ('number', 4)))
print(evaluate(ast))  # 14
```
- Recursively evaluate an AST node and return its computed value
- Base case: if it's a number node, return its value
- If it's a binary operation, evaluate both operands and apply the operator
- Unpack the operator and operands from the tuple
- Recursively evaluate left and right subtrees first
- Apply the operator to the evaluated operands
This tree-walking interpreter is the simplest approach. Real interpreters might:
- Compile the AST to bytecode
- Optimize the tree before execution
- Generate machine code
Error Handling
Good parsers give helpful error messages:
```python title="Error Handling in Parser" linenums="1"
def consume(self, expected_type):
    token = self.current_token()
    if token is None:  # (1)!
        raise SyntaxError(f"Unexpected end of input, expected {expected_type}")
    if token[0] != expected_type:  # (2)!
        raise SyntaxError(
            f"Line {self.line}, column {self.column}: "  # (3)!
            f"Expected {expected_type}, got {token[0]} ('{token[1]}')"
        )
    self.pos += 1
    return token
```
- Check if we've run out of tokens unexpectedly
- Validate that the token type matches what we expected
- Provide location information (line and column) for better error messages
More sophisticated parsers can:
- Recover from errors and continue parsing
- Suggest corrections ("Did you mean...?")
- Highlight the exact location of the problem
Parser Generators
Writing parsers by hand is educational, but for real projects, consider parser generators:
| Tool | Language | Grammar Style |
|---|---|---|
| ANTLR | Java, Python, etc. | EBNF-like |
| PLY | Python | YACC-like |
| Lark | Python | EBNF |
| Peggy (formerly PEG.js) | JavaScript | PEG |
| GNU Bison | C/C++ | YACC |
You write the grammar, the tool generates the parser code. ✨ Laziness is a virtue in programming.
Example with Lark (Python):
```python title="Parser Generator with Lark" linenums="1"
from lark import Lark

grammar = r""" # (1)!
    start: expr # (2)!
    expr: term (("+"|"-") term)* # (3)!
    term: factor (("*"|"/") factor)* # (4)!
    factor: NUMBER | "(" expr ")" # (5)!

    NUMBER: /\d+/ # (6)!
    %ignore " " # (7)!
"""

parser = Lark(grammar)  # (8)!
tree = parser.parse("2 + 3 * 4")
print(tree.pretty())
```
- Define grammar using EBNF-like syntax as a multi-line string
- Entry point of the grammar - must start with an expression
- Expression handles lowest precedence: addition and subtraction
- Term handles higher precedence: multiplication and division
- Factor handles highest precedence: numbers and parenthesized expressions
- Define what a NUMBER token looks like using regex
- Ignore whitespace in the input
- Create parser from grammar - Lark generates all parsing code automatically
Real-World Parsing
JSON is simple enough to parse by hand:
```
value  = object | array | string | number | "true" | "false" | "null"
object = "{" [ pair { "," pair } ] "}"
pair   = string ":" value
array  = "[" [ value { "," value } ] "]"
```
Most languages have built-in JSON parsers because it's so common. The grammar is simple and unambiguous enough that a hand-written recursive descent parser works well.
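In Python, for instance, the built-in parser is one import away:

```python
import json

config = json.loads('{"debug": true, "retries": [1, 2, 3]}')
print(config["retries"])  # [1, 2, 3]
```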
HTML is messy—browsers handle malformed HTML gracefully. Real HTML parsers use complex error recovery:
```html title="Malformed HTML Example"
<html>
  <body>
    <p>First paragraph
    <p>Second paragraph
    <b><i>bold and italic</b></i>
```
Technically invalid, but browsers render it anyway! The web is wild.
Why HTML parsing is hard:
- Browsers must handle broken HTML (unclosed tags, misnested elements)
- Contextual parsing rules (what's valid inside `<script>` differs from `<div>`)
- Historical quirks mode for backward compatibility
Modern language parsers are sophisticated:
Error recovery — IDE features like syntax highlighting and autocomplete work even with incomplete/invalid code
Incremental parsing — Fast re-parsing on edits by only updating changed parts of the tree
Loose parsing modes — Handle incomplete code during typing for real-time feedback
Practice Problems
Challenge 1: Add Unary Minus
Modify the expression grammar and parser to handle unary minus:
- `-5`
- `2 + -3`
- `-(-4)`
Where does the new rule go in the precedence hierarchy?
Challenge 2: Parse Variable Assignments
Extend the grammar to handle:
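For example (illustrative syntax):

```
x = 5
y = x * 2 + 3
```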
You'll need to track variable names and store their values.
Challenge 3: Error Messages
Improve the parser to give line and column numbers in error messages. What information do you need to track during lexing?
Key Takeaways
| Concept | What It Does |
|---|---|
| Lexer | Breaks text into tokens |
| Parser | Builds tree from tokens |
| Token | Meaningful chunk (keyword, number, operator) |
| AST | Tree representing program structure |
| Recursive Descent | Each grammar rule = one function |
| Precedence | Handled by grammar structure (nesting depth) |
Further Reading
- Binary Trees & Representation — Tree structures and hierarchical data
- Recursive Transition Networks — Visual grammars
- Backus-Naur Form — Grammar notation
- Finite State Machines — Foundation for lexers
- Crafting Interpreters by Robert Nystrom — Free online book, excellent deep dive
Parsing bridges the gap between human-readable text and computer-manipulable structure. It's where theory meets practice—where BNF rules become working code. Once you understand parsing, you can build your own languages, transform code, and peek behind the curtain of every compiler and interpreter you use. That's real power. 🔧