description: From raw text to meaning: How parsers use grammars to understand code and data.
How Parsers Work
You've learned about RTNs, BNF, and Finite State Machines. These are all ways to describe languages. But how do we actually use these descriptions to process text? How does Python know that `print("hello")` is valid but `print("hello` isn't?
The answer is parsing—the process of analyzing text according to a grammar. It's how compilers understand your code, how browsers render HTML, and how your JSON config files become usable data. 🎯
Why Is Parsing Necessary?
Computers can't directly understand text—not even simple code like x = 2 + 3. Here's why parsing is essential:
The Problem: Text is ambiguous and unstructured
When you write 2 + 3 * 4, you know multiplication happens first. But to a computer, it's just a string: "2 + 3 * 4". How does it know:
- What's an operator vs. a number?
- Which operation happens first?
- Whether the syntax is valid?
- How to represent nested structures like `f(g(x))`?
What Happens Without Parsing?
Imagine trying to execute code directly from text:
"x = 2 + 3 * 4"- Which+or*do you do first?"if (x > 5) print(x)"- Where does the condition end and the body begin?"func(a, b)"- How do you knowfuncis a function call anda, bare arguments?
You'd need custom logic for every possible code pattern. It's chaos.
The Solution: Structured Representation
Parsing transforms unstructured text into a tree structure that:
- ✓ Validates syntax ("Is this valid code?")
- ✓ Captures meaning ("What operations are being done?")
- ✓ Shows relationships ("Which operations happen first?")
- ✓ Enables execution ("Now I can evaluate/compile this!")
Without parsing, you can't compile, execute, or even validate code. It's the bridge between human-readable text and computer-executable instructions.
The Big Picture
When a compiler or interpreter processes your code, it goes through three stages:
Step 1: Lexical Analysis
```mermaid
flowchart LR
    A[("<b>Source Code</b>")]
    B["<b>Lexer</b>"]
    C[("<b>Tokens</b>")]
    A -->|input| B
    B -->|output| C
    style A fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style B fill:#4a5568,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style C fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
```
Step 2: Parsing
```mermaid
flowchart LR
    A[("<b>Tokens</b>")]
    B["<b>Parser</b>"]
    C[("<b>Parse Tree / AST</b>")]
    A -->|input| B
    B -->|output| C
    style A fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style B fill:#4a5568,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style C fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
```
Step 3: Compilation/Interpretation
```mermaid
flowchart LR
    A[("<b>Parse Tree / AST</b>")]
    B["<b>Compiler / Interpreter</b>"]
    C[("<b>Result</b>")]
    A -->|input| B
    B -->|output| C
    style A fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style B fill:#4a5568,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style C fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
```
| Stage | Input | Output | What It Does |
|---|---|---|---|
| Lexer | Source Code | Tokens | Breaks text into meaningful chunks. Example: `"x = 42"` → `[ID:x, EQUALS, NUM:42]` |
| Parser | Tokens | Parse Tree / AST | Builds a tree structure from tokens. Example: tokens → tree with assignment node |
| Compiler/Interpreter | Parse Tree / AST | Result | Executes or compiles the tree. Example: tree → machine code or output |
What is AST?
AST stands for Abstract Syntax Tree - a simplified tree structure representing the code's meaning and structure, without unnecessary syntax details. More details below!
We'll focus on the first two stages: lexing and parsing.
Stage 1: Lexical Analysis (Lexing)
The lexer (or tokenizer) breaks raw text into meaningful chunks called tokens. It's like turning a stream of characters into a stream of words.
What Tokens Look Like
For the expression total = price * 2 + tax:
| Token Type | Value |
|---|---|
| IDENTIFIER | "total" |
| EQUALS | "=" |
| IDENTIFIER | "price" |
| STAR | "*" |
| NUMBER | "2" |
| PLUS | "+" |
| IDENTIFIER | "tax" |
The lexer doesn't understand grammar—it just recognizes patterns. "Is this a number? A keyword? An operator?" That's all it asks. Simple creature, the lexer. 🔍
Lexer Implementation
Lexers are typically implemented as Finite State Machines and use Regular Expressions to define token patterns:
```mermaid
stateDiagram-v2
    direction LR
    [*] --> Start
    Start --> InNumber: digit
    InNumber --> InNumber: digit
    InNumber --> Start: whitespace [emit NUMBER]
    Start --> InIdent: letter
    InIdent --> InIdent: letter/digit
    InIdent --> Start: whitespace [emit IDENTIFIER]
    Start --> Start: whitespace [skip]
    Start --> Start: + [emit PLUS]
    Start --> Start: * [emit STAR]
    Start --> Start: = [emit EQUALS]
```
Here's a simple lexer implementation:
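Below is a minimal sketch in Python. The token names (`IDENTIFIER`, `STAR`, etc.) and the `tokenize` helper are this chapter's own conventions, not a standard API:

```python title="Simple Lexer in Python" linenums="1"
import re

def tokenize(text):
    token_specs = [                       # (1)!
        ('NUMBER',     r'\d+'),
        ('IDENTIFIER', r'[A-Za-z_]\w*'),
        ('EQUALS',     r'='),
        ('PLUS',       r'\+'),
        ('MINUS',      r'-'),
        ('STAR',       r'\*'),
        ('SLASH',      r'/'),
        ('LPAREN',     r'\('),
        ('RPAREN',     r'\)'),
        ('SKIP',       r'[ \t]+'),        # (2)!
    ]
    tokens = []
    pos = 0                               # (3)!
    while pos < len(text):
        for token_type, pattern in token_specs:    # (4)!
            match = re.match(pattern, text[pos:])  # (5)!
            if match:
                if token_type != 'SKIP':           # (6)!
                    tokens.append((token_type, match.group()))
                pos += match.end()                 # (7)!
                break
        else:
            raise ValueError(f"Unexpected character: {text[pos]!r}")  # (8)!
    return tokens

print(tokenize("total = price * 2 + tax"))
# [('IDENTIFIER', 'total'), ('EQUALS', '='), ('IDENTIFIER', 'price'),
#  ('STAR', '*'), ('NUMBER', '2'), ('PLUS', '+'), ('IDENTIFIER', 'tax')]
```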
- Define all token types and their regex patterns as (name, pattern) tuples
- SKIP tokens (whitespace) are recognized but not added to the output
- Track current position in the input string
- Try each token pattern in order until one matches
- Try to match the pattern at the current position
- Only add non-whitespace tokens to the result
- Move position forward past the matched token
- If no pattern matches, we have an invalid character
This produces a list of tokens, where each token is a tuple of (TOKEN_TYPE, VALUE). The lexer has broken the input string into meaningful pieces that the parser can work with.
Stage 2: Parsing
The parser takes tokens and builds a structured representation—usually a tree—according to the grammar rules.
Why Trees?
A flat list of tokens doesn't capture the structure and relationships in code. Consider 2 + 3 * 4:
- As tokens: `[NUM:2, PLUS, NUM:3, STAR, NUM:4]` - just a flat sequence
- As a tree: Shows that `3 * 4` happens first (nested deeper), then `2 +` the result
Trees naturally represent:
- Operator precedence: Deeper operations execute first (`*` before `+`)
- Hierarchical structure: Nested expressions, function calls, code blocks
- Grammar rules: Each tree node represents a grammar rule being applied
- Evaluation order: Walk the tree to know what to compute when
Without a tree, you'd need complex logic to figure out "which operation happens first?" The tree makes it obvious—start at the leaves, work up to the root.
Connection to BNF and RTN
Remember BNF and RTN from earlier? They're grammar notations that describe valid syntax. The parser uses these rules to build the tree:
- Each BNF rule becomes a tree node: When the parser applies `<expression> ::= <term> "+" <term>`, it creates an expression node with two term children
- RTN paths trace tree construction: Following an RTN from start to end corresponds to building a subtree
- Recursion in grammar = Recursion in tree: Nested expressions in code become nested nodes in the tree
The grammar is the blueprint; the tree is the structure built from that blueprint. The parser is the construction worker following the plans.
Parse Trees vs Abstract Syntax Trees
Parse Tree (Concrete Syntax Tree): Shows every grammar rule applied.
Abstract Syntax Tree (AST): Simplified tree focusing on meaning, not syntax.
For 2 + 3 * 4:
```
          Expression
              |
    +---------+---------+
    |         |         |
   Term      '+'       Term
    |                    |
  Factor         +-------+-------+
    |            |       |       |
   '2'          Term    '*'    Factor
                 |               |
               Factor           '4'
                 |
                '3'
```
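The same expression as an AST drops the intermediate `Term` and `Factor` nodes, keeping only operators and values:

```
     '+'
    /   \
  '2'   '*'
       /   \
     '3'   '4'
```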
The AST is what most compilers actually work with.
Scheme: Where Parse Tree = AST
In Scheme & Parse Trees, you'll see a language where the written code is the AST. Because Scheme uses prefix notation with explicit parentheses (+ 1 (* 2 3)), there is no ambiguity, and the parse tree matches the code structure 1:1.
Why ASTs Drop Punctuation
Notice how the AST doesn't include parentheses, commas, or semicolons? That's intentional. Once you've parsed the code and built the tree structure, punctuation has served its purpose—it told the parser how to group things.
What gets dropped:
- Parentheses () - the tree structure already shows grouping
- Commas , - already represented as multiple children
- Semicolons ; - just statement separators, not meaningful after parsing
- Keywords like then, do - the tree node type already captures the meaning
What gets kept:
- Operators +, -, * - needed to know which operation to perform
- Literals 42, "hello" - the actual values
- Identifiers x, foo - variable and function names
The AST keeps only what's needed for execution or compilation. This makes it smaller and easier to work with than the full parse tree.
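For example, a statement like `f(a, b);` might become, in the nested-tuple style used later in this chapter, something like:

```python
('call', 'f', [('identifier', 'a'), ('identifier', 'b')])
# The parentheses, comma, and semicolon are gone;
# the node type and child list carry the same information.
```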
SQL Execution Plans: A Related Concept
If you've worked with databases, you might recognize tree structures from SQL execution plans. They're related but different:
Parse Tree/AST (what parsers build):

- Represents the syntactic structure of the code
- Shows what the query means grammatically
- Created during parsing (before execution)

Execution Plan (what query optimizers build):

- Represents the execution strategy for running the query
- Shows how to efficiently execute the query (which indexes, join order, etc.)
- Created after parsing, during query optimization
The full pipeline is: SQL text → Lexer → Parser → AST → Query Optimizer → Execution Plan → Results
When database administrators tune queries by examining execution plans, they're looking at a tree structure that comes after parsing—it's the optimized plan for executing the already-parsed query. Both are trees showing hierarchical operations, but the AST captures syntax while the execution plan captures runtime strategy.
Grammar Ambiguity
Some grammars are ambiguous—they allow multiple valid parse trees for the same input. This is a problem because the program's meaning becomes unclear.
The Classic Example: Dangling Else
Consider this grammar for if-statements:
```text title="Ambiguous If-Statement Grammar"
<statement> ::= "if" <condition> "then" <statement>
              | "if" <condition> "then" <statement> "else" <statement>
              | <other-statement>
```
Now parse this code:
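```
if A then if B then X else Y
```

(Here `A` and `B` stand for conditions, `X` and `Y` for statements.)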
Two valid parse trees exist:
Interpretation 1: The else belongs to the inner if:
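```
if A then
    if B then
        X
    else        ← this if gets the else
        Y
```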
Interpretation 2: The else belongs to the outer if:
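```
if A then
    if B then
        X
else            ← the outer if gets the else
    Y
```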
Both are valid according to the grammar! This is the Dangling Else problem.
How Languages Solve It
Most languages resolve this through precedence rules:
- Convention: The `else` matches the nearest unmatched `if` (Interpretation 1)
- Require explicit delimiters: Python uses indentation; C uses braces `{ }`
- Require `end` keywords: Pascal and Ada make all blocks explicit
Why Ambiguity Matters for Parsing
- Parsers must make choices: When faced with ambiguity, the parser picks one interpretation (usually via precedence rules)
- Different parsers might choose differently: Without explicit rules, two implementations could parse the same code differently
- Grammar design is crucial: Good language design avoids ambiguity through careful grammar construction
Key insight: Ambiguity is a property of the grammar, not the parser. A well-designed grammar eliminates ambiguity before parsing begins.
Parsing Strategies
Not all grammars are equally easy to parse, and different approaches have different trade-offs. Think of parsing strategies like different tools in a toolbox—a hammer for some jobs, a screwdriver for others.
Choosing Your Approach
Why Different Strategies?
Different parsing strategies exist because:
- Grammar constraints: Some grammars are ambiguous or have features (like left-recursion) that break certain parsing approaches
- Implementation complexity: Hand-written parsers need simplicity; generated parsers can be more complex
- Error handling: Some strategies give better error messages than others
- Performance: Different strategies have different speed and memory characteristics
- Parsing power: Some strategies can handle more complex grammars than others
Which Strategy Should You Use?
| Situation | Recommended Strategy | Why? |
|---|---|---|
| Writing by hand | Top-down (Recursive Descent) | Easy to understand, maps directly to grammar, good error messages |
| Using a parser generator | Bottom-up (LR/LALR) | More powerful, handles more grammars, tools do the hard work |
| Simple expressions | Top-down | Quick to implement, sufficient for most expression grammars |
| Complex language (C, Java) | Bottom-up (via tool) | Handles complex grammar features, proven at scale |
| Educational purposes | Top-down | Easier to understand and trace execution |
For most projects: start with recursive descent (top-down) because it's intuitive. Only reach for more powerful strategies when you hit limitations.
Top-Down Parsing
Strategy: Start with the goal (e.g., "program") and work down to terminals.
Recursive Descent is the most intuitive top-down approach. Each grammar rule becomes a function:
Given this grammar:
```text title="Expression Grammar"
<expression> ::= <term> { ("+" | "-") <term> }
<term>       ::= <factor> { ("*" | "/") <factor> }
<factor>     ::= NUMBER | "(" <expression> ")"
```
Left Recursion: A Top-Down Parser Killer
Top-down parsers cannot handle left-recursive grammars. A rule like this will cause infinite recursion:
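```
<expression> ::= <expression> "+" <term>
```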
Why it breaks: When parse_expression() is called, the first thing it does is call parse_expression() again (to match <expression>), which calls parse_expression() again, forever. The parser never consumes a token!
The fix: Rewrite left recursion as iteration (loops):
```
# Instead of:
<expression> ::= <expression> "+" <term>

# Use:
<expression> ::= <term> { "+" <term> }
```
The { } notation means "zero or more," which translates to a while loop in code. This is why our grammar above uses { ... } - it avoids left recursion.
Bottom-up parsers (LR) handle left recursion naturally, which is one reason they're more powerful.
Here's a recursive descent parser implementation:
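A minimal Python sketch following the grammar above. The token types match the lexer sketch earlier, and the tuple node shapes (`('binop', ...)`, `('number', ...)`) are this chapter's own convention:

```python title="Recursive Descent Parser in Python" linenums="1"
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens                    # (1)!
        self.pos = 0                            # (2)!

    def current_token(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def consume(self, expected_type=None):      # (3)!
        token = self.current_token()
        if token is None:
            raise SyntaxError("Unexpected end of input")
        if expected_type and token[0] != expected_type:
            raise SyntaxError(f"Expected {expected_type}, got {token[0]}")
        self.pos += 1                           # (4)!
        return token

    def parse_expression(self):                 # (5)!
        node = self.parse_term()                # (6)!
        while self.current_token() and self.current_token()[0] in ('PLUS', 'MINUS'):
            op = self.consume()[1]
            right = self.parse_term()
            node = ('binop', op, node, right)   # (7)!
        return node

    def parse_term(self):                       # (8)!
        node = self.parse_factor()
        while self.current_token() and self.current_token()[0] in ('STAR', 'SLASH'):
            op = self.consume()[1]
            right = self.parse_factor()
            node = ('binop', op, node, right)
        return node

    def parse_factor(self):                     # (9)!
        token = self.current_token()
        if token is None:
            raise SyntaxError("Unexpected end of input")
        if token[0] == 'NUMBER':
            self.consume()
            return ('number', int(token[1]))
        if token[0] == 'LPAREN':                # (10)!
            self.consume('LPAREN')
            node = self.parse_expression()      # (11)!
            self.consume('RPAREN')
            return node
        raise SyntaxError(f"Unexpected token: {token}")

tokens = [('NUMBER', '2'), ('PLUS', '+'), ('NUMBER', '3'),
          ('STAR', '*'), ('NUMBER', '4')]
# (or: tokens = tokenize("2 + 3 * 4") using the lexer sketch above)
print(Parser(tokens).parse_expression())
# ('binop', '+', ('number', 2), ('binop', '*', ('number', 3), ('number', 4)))
```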
- Store the list of tokens from the lexer
- Track current position in the token list
- Consume one token and optionally validate its type
- Move to the next token after consuming
- Handles lowest precedence operators (+ and -)
- Start by parsing the higher-precedence term
- Build a binary operation node combining left and right operands
- Handles medium precedence operators (* and /)
- Handles highest precedence: numbers and parenthesized expressions
- Handle parenthesized sub-expressions
- Recursively parse the expression inside parentheses
This is a tree structure represented as nested tuples—our AST! Each tuple is a node:

- `('binop', '+', left, right)` = a binary operation node with operator `+`
- `('number', 2)` = a leaf node containing the value `2`
- The nesting shows the tree structure: `2 + (3 * 4)`
Compare this to the lexer's flat list of tokens—the parser has transformed that flat list into a hierarchical tree that captures the meaning and precedence.
Notice the Structure
The parser functions mirror the grammar exactly:
- `parse_expression` handles `+` and `-`
- `parse_term` handles `*` and `/`
- `parse_factor` handles numbers and parentheses
This is why BNF translates so directly into code!
Recursion and the Call Stack
Notice how parse_factor can call parse_expression, which calls parse_term, which calls parse_factor again? This mutual recursion works because each function call is managed by the function call stack.
When parsing (2 + 3), the call stack looks like:
```
parse_factor()                  ← handles the parentheses
  → parse_expression()          ← handles the + inside
      → parse_term()
          → parse_factor()      ← handles 2
```
When the innermost call completes, it returns to the previous function—classic LIFO behavior. Without the call stack, recursive parsing would be impossible.
How It Handles Operator Precedence:
Our grammar naturally handles precedence! Here's why:
- `expression` handles `+` and `-`
- `term` handles `*` and `/`
- `factor` handles numbers and parentheses
Since term is nested inside expression, multiplication happens "deeper" in the tree—which means it's evaluated first.
For 2 + 3 * 4:
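Sketching the AST our parser builds, the `*` node sits deeper than the `+` node, so it's evaluated first:

```
      +
     / \
    2   *
       / \
      3   4
```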
Result: 2 + (3 * 4) = 14 ✓
Adding More Precedence Levels:
Want to add exponentiation (^) with highest precedence?
```text title="Grammar with Exponentiation"
<expression> ::= <term> { ("+" | "-") <term> }
<term>       ::= <power> { ("*" | "/") <power> }
<power>      ::= <factor> [ "^" <power> ]
<factor>     ::= NUMBER | "(" <expression> ")"
```
Notice <power> calls itself on the right side—this makes ^ right-associative: \(2^{3^4} = 2^{(3^4)}\).
Bottom-Up Parsing
Strategy: Start with tokens and combine them into larger structures, building the parse tree from the leaves up to the root.
How it works:
- Begin with individual tokens
- Recognize patterns that match the right-hand side of grammar rules
- "Reduce" those patterns to non-terminals
- Continue until reaching the start symbol
Advantages:
- More powerful than top-down (handles more grammars)
- Naturally handles left-recursive grammars
- Can detect syntax errors earlier
Disadvantages:
- More complex to implement by hand
- Harder to understand and debug
- Error messages can be less intuitive
In practice: Bottom-up parsers are typically generated by tools like YACC and GNU Bison rather than written by hand. These tools take a grammar specification and automatically generate efficient parser code.
Understanding LL and LR
Now that we've seen both approaches, let's compare them:
| Type | Reads | Builds Tree | Used By |
|---|---|---|---|
| LL(k) | Left-to-right | Leftmost derivation (top-down) | Recursive descent, ANTLR |
| LR(k) | Left-to-right | Rightmost derivation (bottom-up) | YACC, GNU Bison |
What "Leftmost" and "Rightmost" Derivation Mean:
These terms describe the order in which the parser builds the parse tree:
Leftmost Derivation (LL - Top-Down):
- Start with the goal (e.g., `<expression>`)
- Expand grammar rules from left to right, top to bottom
- Like reading: start with the big picture, fill in details left-to-right
Example: 2 + 3
```
<expression>
→ <term> + <term>      (expand expression)
→ <number> + <term>    (expand leftmost term first)
→ 2 + <term>           (substitute 2)
→ 2 + <number>         (expand remaining term)
→ 2 + 3                (substitute 3)
```
Rightmost Derivation (LR - Bottom-Up):
- Start with the tokens (e.g., `2`, `+`, `3`)
- Combine them into larger structures; the sequence of reductions, read in reverse, traces a rightmost derivation
- Like building with blocks: start with pieces, assemble upward
Example: 2 + 3
```
2 + 3
→ <number> + 3         (reduce 2 to <number>)
→ <number> + <number>  (reduce 3 to <number>)
→ <term> + <term>      (reduce to terms)
→ <expression>         (reduce to expression)
```
For practical purposes: LL is top-down (goal → tokens), LR is bottom-up (tokens → goal). The "leftmost/rightmost" distinction matters for formal theory but the key difference is the direction.
Understanding Lookahead: The "(k)" Parameter
The "(k)" means how many tokens ahead the parser peeks to decide which grammar rule to apply.
LL(1) - One Token Lookahead:
When parsing, you often need to decide between alternatives. With LL(1), you look at the next token to decide:
```python title="LL(1) Lookahead Example" linenums="1"
# Grammar rule with alternatives:
#   <factor> ::= NUMBER | "(" <expression> ")"

def parse_factor(self):
    token = self.current_token()    # Look at next token
    if token[0] == 'NUMBER':        # Lookahead says: it's a number
        return self.parse_number()
    elif token[0] == 'LPAREN':      # Lookahead says: it's a parenthesized expr
        return self.parse_paren_expr()
```
One token is enough to decide which path to take.
When You Need More Lookahead:
Some grammars need LL(2) or LL(3), and some defeat any fixed lookahead:

```
# Ambiguous with LL(1):
<statement> ::= "if" <expr> "then" <stmt>
              | "if" <expr> "then" <stmt> "else" <stmt>
```

After seeing if, both rules start the same way, and no fixed number of lookahead tokens can tell them apart: the `else`, if present, only appears after an arbitrarily long `<stmt>`. The usual fix is to left-factor the grammar, merging the common prefix and making the `else` part optional.
Why Lookahead Matters:
- LL(1) is simplest and most efficient (what recursive descent typically uses)
- LL(k) for k>1 requires more complex logic
- LR(1) is more powerful than LL(1) - can handle more grammars with just 1 token lookahead
- More lookahead = more memory and complexity
Most hand-written parsers use LL(1) because it's simple. Parser generators can handle larger k values automatically.
Working with Parse Trees
Once you have a parse tree or AST, you need to do something with it—evaluate it, compile it, or analyze it. This section covers common operations.
Evaluating the AST
Once you have an AST, you can walk it to compute results:
```python title="AST Evaluator in Python" linenums="1"
def evaluate(node):  # (1)!
    if node[0] == 'number':  # (2)!
        return node[1]
    elif node[0] == 'binop':  # (3)!
        op, left, right = node[1], node[2], node[3]  # (4)!
        left_val = evaluate(left)  # (5)!
        right_val = evaluate(right)
        if op == '+': return left_val + right_val  # (6)!
        if op == '-': return left_val - right_val
        if op == '*': return left_val * right_val
        if op == '/': return left_val / right_val
    raise ValueError(f"Unknown node type: {node[0]}")

# Using our earlier AST
ast = ('binop', '+', ('number', 2), ('binop', '*', ('number', 3), ('number', 4)))
print(evaluate(ast))  # 14
```
- Recursively evaluate an AST node and return its computed value
- Base case: if it's a number node, return its value
- If it's a binary operation, evaluate both operands and apply the operator
- Unpack the operator and operands from the tuple
- Recursively evaluate left and right subtrees first
- Apply the operator to the evaluated operands
This tree-walking interpreter is the simplest approach. Real interpreters might:
- Compile the AST to bytecode
- Optimize the tree before execution
- Generate machine code
Error Handling
Good parsers give helpful error messages:
```python title="Error Handling in Parser" linenums="1"
def consume(self, expected_type):
    token = self.current_token()
    if token is None:  # (1)!
        raise SyntaxError(f"Unexpected end of input, expected {expected_type}")
    if token[0] != expected_type:  # (2)!
        raise SyntaxError(
            f"Line {self.line}, column {self.column}: "  # (3)!
            f"Expected {expected_type}, got {token[0]} ('{token[1]}')"
        )
    self.pos += 1
    return token
```
- Check if we've run out of tokens unexpectedly
- Validate that the token type matches what we expected
- Provide location information (line and column) for better error messages
More sophisticated parsers can:
- Recover from errors and continue parsing
- Suggest corrections ("Did you mean...?")
- Highlight the exact location of the problem
Parser Generators
Writing parsers by hand is educational, but for real projects, consider parser generators:
| Tool | Language | Grammar Style |
|---|---|---|
| ANTLR | Java, Python, etc. | EBNF-like |
| PLY | Python | YACC-like |
| Lark | Python | EBNF |
| Peggy (formerly PEG.js) | JavaScript | PEG |
| GNU Bison | C/C++ | YACC |
You write the grammar, the tool generates the parser code. ✨ Laziness is a virtue in programming.
Example with Lark (Python):
```python title="Parser Generator with Lark" linenums="1"
from lark import Lark

grammar = r""" # (1)!
    start: expr # (2)!
    expr: term (("+"|"-") term)* # (3)!
    term: factor (("*"|"/") factor)* # (4)!
    factor: NUMBER | "(" expr ")" # (5)!

    NUMBER: /\d+/ # (6)!
    %ignore " " # (7)!
"""

parser = Lark(grammar)  # (8)!
tree = parser.parse("2 + 3 * 4")
print(tree.pretty())
```
- Define grammar using EBNF-like syntax as a multi-line string
- Entry point of the grammar - must start with an expression
- Expression handles lowest precedence: addition and subtraction
- Term handles higher precedence: multiplication and division
- Factor handles highest precedence: numbers and parenthesized expressions
- Define what a NUMBER token looks like using regex
- Ignore whitespace in the input
- Create parser from grammar - Lark generates all parsing code automatically
Real-World Parsing
JSON is simple enough to parse by hand:
```
value  = object | array | string | number | "true" | "false" | "null"
object = "{" [ pair { "," pair } ] "}"
pair   = string ":" value
array  = "[" [ value { "," value } ] "]"
```
Most languages have built-in JSON parsers because it's so common. The grammar is simple and unambiguous enough that a hand-written recursive descent parser works well.
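In Python, for instance, the built-in parser is one import away:

```python
import json

config = json.loads('{"debug": true, "retries": [1, 2, 3]}')
print(config["retries"])  # [1, 2, 3]
```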
HTML is messy—browsers handle malformed HTML gracefully. Real HTML parsers use complex error recovery:
```html title="Malformed HTML Example"
<html>
  <body>
    <p>First paragraph
    <p>Second paragraph
    <b><i>bold and italic</b></i>
```
Technically invalid, but browsers render it anyway! The web is wild.
Why HTML parsing is hard:
- Browsers must handle broken HTML (unclosed tags, misnested elements)
- Contextual parsing rules (what's valid inside `<script>` differs from `<div>`)
- Historical quirks mode for backward compatibility
Modern language parsers are sophisticated:
Error recovery — IDE features like syntax highlighting and autocomplete work even with incomplete/invalid code
Incremental parsing — Fast re-parsing on edits by only updating changed parts of the tree
Loose parsing modes — Handle incomplete code during typing for real-time feedback
Practice Problems
Challenge 1: Add Unary Minus
Modify the expression grammar and parser to handle unary minus:
- `-5`
- `2 + -3`
- `-(-4)`
Where does the new rule go in the precedence hierarchy?
Challenge 2: Parse Variable Assignments
Extend the grammar to handle:
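For example (illustrative syntax):

```
x = 5
y = x * 2 + 3
```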
You'll need to track variable names and store their values.
Challenge 3: Error Messages
Improve the parser to give line and column numbers in error messages. What information do you need to track during lexing?
Key Takeaways
| Concept | What It Does |
|---|---|
| Lexer | Breaks text into tokens |
| Parser | Builds tree from tokens |
| Token | Meaningful chunk (keyword, number, operator) |
| AST | Tree representing program structure |
| Recursive Descent | Each grammar rule = one function |
| Precedence | Handled by grammar structure (nesting depth) |
Further Reading
- Binary Trees & Representation — Tree structures and hierarchical data
- Recursive Transition Networks — Visual grammars
- Backus-Naur Form — Grammar notation
- Finite State Machines — Foundation for lexers
- Crafting Interpreters by Robert Nystrom — Free online book, excellent deep dive
Parsing bridges the gap between human-readable text and computer-manipulable structure. It's where theory meets practice—where BNF rules become working code. Once you understand parsing, you can build your own languages, transform code, and peek behind the curtain of every compiler and interpreter you use. That's real power. 🔧