How Parsers Work
Every time you call json.loads(), yaml.safe_load(), or JSON.parse(), a parser runs. Every time Python reports SyntaxError: unexpected indent or Go says syntax error: unexpected {, a parser has analyzed your code and found the exact point where it stopped making sense. Every SQL query you write is parsed before it touches a database.
Parsers are everywhere in your stack — and understanding how they work makes you better at every tool built on top of one.
Parsing is the process of analyzing text according to a formal grammar. It's how Python knows that print("hello") is valid but print("hello isn't. It's how browsers turn HTML into a DOM tree. It's how your config files become data structures your code can use.
Learning Objectives
By the end of this article, you'll be able to:
- Explain the two-phase pipeline: lexing (tokenizing raw text) then parsing (building a tree)
- Read a grammar rule and trace how a recursive descent parser uses it to process input
- Distinguish a concrete syntax tree from an abstract syntax tree (AST)
- Implement a simple recursive descent parser for arithmetic expressions
- Explain how operator precedence is encoded in grammar structure — not as special-case logic
Where You've Seen This
Parsers run constantly behind the scenes in your daily work:
- Language runtimes — Python, Go, Node.js, the JVM all parse your source code before executing it
- JSON and YAML — `json.loads()`, `yaml.safe_load()`, `JSON.parse()` are all parser calls
- SQL — every query goes through a parser before the query planner optimizes it
- HTML/DOM — browsers parse HTML into a tree structure; `BeautifulSoup` wraps a parser
- Template engines — Jinja2, Handlebars, Helm charts all parse template syntax
- Config formats — Docker Compose, Kubernetes manifests, Terraform HCL, `.toml` files
- GraphQL — every incoming query is parsed against the schema grammar
- Your IDE — syntax highlighting, autocomplete, and linting all use parsers in real time
Why This Matters for Production Code
Every SyntaxError: unexpected indent or unexpected token message comes from a parser. The line and column numbers pinpoint exactly where the parser was in the input stream when it failed. Run python -m ast your_file.py to print the full parse tree for any Python file — the same structure the runtime uses to execute your code.
Understanding parsers means understanding what your error messages are actually telling you.
YAML, JSON, TOML, HCL, XML — every configuration format in your infrastructure stack is parsed on every deployment. Docker Compose, Kubernetes manifests, Terraform, Helm charts all go through parsers before they do anything. When Kubernetes rejects your manifest with a YAML parsing error, you've violated the grammar.
The error message quality (and your ability to read it) depends on understanding what a parser does with your input.
Every SQL query is parsed before it touches a database. The query planner that decides whether to use an index or a full table scan operates on the parsed AST. GraphQL servers parse every incoming query against the schema grammar. ORMs generate SQL strings that get parsed again on the database side.
Understanding parsers explains why query structure matters and why parameterized queries are safer than string interpolation.
SQL injection, XSS, and template injection attacks all exploit the gap between what a developer thinks a parser will do with input and what it actually does. The parser boundary — where input transitions from data to structure — is where these vulnerabilities live.
Parameterized queries work because they keep user input outside the parse boundary. Understanding parsers is foundational to writing code that handles untrusted input safely.
Why Is Parsing Necessary?
Computers can't directly understand text—not even simple code like x = 2 + 3. Here's why parsing is essential:
The Problem: Text is ambiguous and unstructured
When you write 2 + 3 * 4, you know multiplication happens first. But to a computer, it's just a string: "2 + 3 * 4". How does it know:
- What's an operator vs. a number?
- Which operation happens first?
- Whether the syntax is valid?
- How to represent nested structures like `f(g(x))`?
What Happens Without Parsing?
Imagine trying to execute code directly from text:
"x = 2 + 3 * 4"- Which+or*do you do first?"if (x > 5) print(x)"- Where does the condition end and the body begin?"func(a, b)"- How do you knowfuncis a function call anda, bare arguments?
You'd need custom logic for every possible code pattern. It's chaos.
The Solution: Structured Representation
Parsing transforms unstructured text into a tree structure that:
- ✓ Validates syntax ("Is this valid code?")
- ✓ Captures meaning ("What operations are being done?")
- ✓ Shows relationships ("Which operations happen first?")
- ✓ Enables execution ("Now I can evaluate/compile this!")
Without parsing, you can't compile, execute, or even validate code. It's the bridge between human-readable text and computer-executable instructions.
The Big Picture
When a compiler or interpreter processes your code, it goes through three stages:
Step 1: Lexical Analysis
```mermaid
flowchart LR
    A[("<b>Source Code</b>")]
    B["<b>Lexer</b>"]
    C[("<b>Tokens</b>")]
    A -->|input| B
    B -->|output| C
    style A fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style B fill:#4a5568,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style C fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
```
Step 2: Parsing
```mermaid
flowchart LR
    A[("<b>Tokens</b>")]
    B["<b>Parser</b>"]
    C[("<b>Parse Tree / AST</b>")]
    A -->|input| B
    B -->|output| C
    style A fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style B fill:#4a5568,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style C fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
```
Step 3: Compilation/Interpretation
```mermaid
flowchart LR
    A[("<b>Parse Tree / AST</b>")]
    B["<b>Compiler / Interpreter</b>"]
    C[("<b>Result</b>")]
    A -->|input| B
    B -->|output| C
    style A fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style B fill:#4a5568,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style C fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
```
| Stage | Input | Output | What It Does |
|---|---|---|---|
| Lexer | Source Code | Tokens | Breaks text into meaningful chunks Example: "x = 42" → [ID:x, EQUALS, NUM:42] |
| Parser | Tokens | Parse Tree / AST | Builds a tree structure from tokens Example: Tokens → Tree with assignment node |
| Compiler/Interpreter | Parse Tree / AST | Result | Executes or compiles the tree Example: Tree → machine code or output |
What is an AST?
AST stands for Abstract Syntax Tree: a simplified tree structure representing the code's meaning and structure, without unnecessary syntax details. More details below!
We'll focus on the first two stages: lexing and parsing.
Stage 1: Lexical Analysis (Lexing)
The lexer (or tokenizer) breaks raw text into meaningful chunks called tokens. It's like turning a stream of characters into a stream of words.
What Tokens Look Like
For the expression total = price * 2 + tax:
| Token Type | Value |
|---|---|
| IDENTIFIER | "total" |
| EQUALS | "=" |
| IDENTIFIER | "price" |
| STAR | "*" |
| NUMBER | "2" |
| PLUS | "+" |
| IDENTIFIER | "tax" |
The lexer doesn't understand grammar — it just recognizes patterns. "Is this a number? A keyword? An operator?" That's all it asks. Simple creature, the lexer.
Lexer Implementation
Lexers are typically implemented as Finite State Machines and use Regular Expressions to define token patterns:
```mermaid
stateDiagram-v2
    direction LR
    [*] --> Start
    Start --> InNumber: digit
    InNumber --> InNumber: digit
    InNumber --> Start: whitespace [emit NUMBER]
    Start --> InIdent: letter
    InIdent --> InIdent: letter/digit
    InIdent --> Start: whitespace [emit IDENTIFIER]
    Start --> Start: whitespace [skip]
    Start --> Start: + [emit PLUS]
    Start --> Start: * [emit STAR]
    Start --> Start: = [emit EQUALS]
```
This produces a list of tokens, where each token is a tuple of (TOKEN_TYPE, VALUE). The lexer has broken the input string into meaningful pieces that the parser can work with.
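The state machine above can be sketched in a few lines of Python using one master regular expression with a named group per token type. This is a minimal illustration, not a production lexer (for one thing, it silently skips characters it doesn't recognize, where a real lexer would raise an error):

```python
import re

# One regex alternative per token type; names match the table above
TOKEN_SPEC = [
    ('NUMBER',     r'\d+'),
    ('IDENTIFIER', r'[A-Za-z_]\w*'),
    ('PLUS',       r'\+'),
    ('STAR',       r'\*'),
    ('EQUALS',     r'='),
    ('SKIP',       r'\s+'),   # whitespace: matched but not emitted
]
MASTER = re.compile('|'.join(f'(?P<{name}>{pat})' for name, pat in TOKEN_SPEC))

def tokenize(text):
    tokens = []
    for match in MASTER.finditer(text):
        if match.lastgroup != 'SKIP':            # drop whitespace
            tokens.append((match.lastgroup, match.group()))
    return tokens

print(tokenize("total = price * 2 + tax"))
```

Running it on `total = price * 2 + tax` reproduces exactly the token table from the previous section.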
Stage 2: Parsing
The parser takes tokens and builds a structured representation—usually a tree—according to the grammar rules.
Why Trees?
A flat list of tokens doesn't capture the structure and relationships in code. Consider 2 + 3 * 4:
- As tokens: `[NUM:2, PLUS, NUM:3, STAR, NUM:4]` - just a flat sequence
- As a tree: Shows that `3 * 4` happens first (nested deeper), then `2 +` the result
Trees naturally represent:
- Operator precedence: Deeper operations execute first (`*` before `+`)
- Hierarchical structure: Nested expressions, function calls, code blocks
- Grammar rules: Each tree node represents a grammar rule being applied
- Evaluation order: Walk the tree to know what to compute when
Without a tree, you'd need complex logic to figure out "which operation happens first?" The tree makes it obvious—start at the leaves, work up to the root.
Connection to BNF and RTN
BNF (Backus-Naur Form) and RTNs (recursive transition networks) are grammar notations that formally describe valid syntax. The parser uses those rules to build the tree:
- Each BNF rule becomes a tree node: When the parser applies `<expression> ::= <term> "+" <term>`, it creates an expression node with two term children
- RTN paths trace tree construction: Following an RTN from start to end corresponds to building a subtree
- Recursion in grammar = recursion in tree: Nested expressions in code become nested nodes in the tree
The grammar is the blueprint; the tree is the structure built from that blueprint. The parser is the construction worker following the plans. See the Mastery section for the full formal treatment of grammars.
Parse Trees vs Abstract Syntax Trees
Parse Tree (Concrete Syntax Tree): Shows every grammar rule applied.
Abstract Syntax Tree (AST): Simplified tree focusing on meaning, not syntax.
For `2 + 3 * 4`, the parse tree looks like:

```
              Expression
                  |
        +---------+---------+
        |         |         |
      Term       '+'      Term
        |                   |
     Factor         +-------+-------+
        |           |       |       |
       '2'        Term     '*'   Factor
                    |               |
                 Factor            '4'
                    |
                   '3'
```
The AST is what most compilers actually work with.
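To make the contrast concrete, here is the same expression as an AST, using the nested-tuple representation that the evaluator later in this article consumes (each node is a tag followed by its children):

```python
# AST for 2 + 3 * 4: the '*' node sits deeper in the tree,
# so it is evaluated before the '+' node above it
ast = ('binop', '+',
       ('number', 2),
       ('binop', '*', ('number', 3), ('number', 4)))
```

Compare this with the parse tree above: the `Term` and `Factor` layers are gone, and only operators and values remain.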
Scheme: Where Parse Tree = AST
In Scheme & Parse Trees, you'll see a language where the written code is the AST. Because Scheme uses prefix notation with explicit parentheses (+ 1 (* 2 3)), there is no ambiguity, and the parse tree matches the code structure 1:1.
Why ASTs Drop Punctuation
Notice how the AST doesn't include parentheses, commas, or semicolons? That's intentional. Once you've parsed the code and built the tree structure, punctuation has served its purpose—it told the parser how to group things.
What gets dropped:
- Parentheses `()` - the tree structure already shows grouping
- Commas `,` - already represented as multiple children
- Semicolons `;` - just statement separators, not meaningful after parsing
- Keywords like `then`, `do` - the tree node type already captures the meaning
What gets kept:
- Operators `+`, `-`, `*` - needed to know which operation to perform
- Literals `42`, `"hello"` - the actual values
- Identifiers `x`, `foo` - variable and function names
The AST keeps only what's needed for execution or compilation. This makes it smaller and easier to work with than the full parse tree.
SQL Execution Plans: A Related Concept
If you've worked with databases, you might recognize tree structures from SQL execution plans. They're related but different:
Parse Tree/AST (what parsers build):
- Represents the syntactic structure of the code
- Shows what the query means grammatically
- Created during parsing (before execution)

Execution Plan (what query optimizers build):
- Represents the execution strategy for running the query
- Shows how to efficiently execute the query (which indexes, join order, etc.)
- Created after parsing, during query optimization
The full pipeline is: SQL text → Lexer → Parser → AST → Query Optimizer → Execution Plan → Results
When database administrators tune queries by examining execution plans, they're looking at a tree structure that comes after parsing—it's the optimized plan for executing the already-parsed query. Both are trees showing hierarchical operations, but the AST captures syntax while the execution plan captures runtime strategy.
Grammar Ambiguity
Some grammars are ambiguous—they allow multiple valid parse trees for the same input. This is a problem because the program's meaning becomes unclear.
The Classic Example: Dangling Else
Consider this grammar for if-statements:
`statement ::= "if" condition "then" statement`
`statement ::= "if" condition "then" statement "else" statement`

Now parse this code: `if C1 then if C2 then S1 else S2`

Two valid parse trees exist:

Interpretation 1: The else belongs to the inner if: `if C1 then (if C2 then S1 else S2)`

Interpretation 2: The else belongs to the outer if: `if C1 then (if C2 then S1) else S2`

Both are valid according to the grammar! This is the Dangling Else problem.
How Languages Solve It
Most languages resolve this through precedence rules:
- Convention: The `else` matches the nearest unmatched `if` (Interpretation 1)
- Require explicit delimiters: Python uses indentation; C uses braces `{ }`
- Require `end` keywords: Pascal and Ada make all blocks explicit
Why Ambiguity Matters for Parsing
- Parsers must make choices: When faced with ambiguity, the parser picks one interpretation (usually via precedence rules)
- Different parsers might choose differently: Without explicit rules, two implementations could parse the same code differently
- Grammar design is crucial: Good language design avoids ambiguity through careful grammar construction
Key insight: Ambiguity is a property of the grammar, not the parser. A well-designed grammar eliminates ambiguity before parsing begins.
Parsing Strategies
Not all grammars are equally easy to parse, and different approaches have different trade-offs. Think of parsing strategies like different tools in a toolbox—a hammer for some jobs, a screwdriver for others.
Choosing Your Approach
Why Different Strategies?
Different parsing strategies exist because:
- Grammar constraints: Some grammars are ambiguous or have features (like left-recursion) that break certain parsing approaches
- Implementation complexity: Hand-written parsers need simplicity; generated parsers can be more complex
- Error handling: Some strategies give better error messages than others
- Performance: Different strategies have different speed and memory characteristics
- Parsing power: Some strategies can handle more complex grammars than others
Which Strategy Should You Use?
| Situation | Recommended Strategy | Why? |
|---|---|---|
| Writing by hand | Top-down (Recursive Descent) | Easy to understand, maps directly to grammar, good error messages |
| Using a parser generator | Bottom-up (LR/LALR) | More powerful, handles more grammars, tools do the hard work |
| Simple expressions | Top-down | Quick to implement, sufficient for most expression grammars |
| Complex language (C, Java) | Bottom-up (via tool) | Handles complex grammar features, proven at scale |
| Educational purposes | Top-down | Easier to understand and trace execution |
For most projects: start with recursive descent (top-down) because it's intuitive. Only reach for more powerful strategies when you hit limitations.
Top-down (LL) parsers start with the goal (e.g., "program") and work down to terminals. Recursive descent is the most common variant — each grammar rule becomes a function. It's intuitive, easy to debug, and produces clear error messages.
Bottom-up (LR) parsers start with terminals and work up, deferring decisions until more context is available. They handle more complex grammars but are harder to write by hand. Parser generators like YACC and GNU Bison automate the hard parts.
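To make the top-down approach concrete, here is a small, self-contained sketch of a recursive descent parser for the arithmetic grammar used throughout this article: one function per grammar rule, producing the tuple-style AST that the evaluator in the next section consumes. The class and helper names are illustrative, not from any library:

```python
import re

def tokenize(text):
    # Produce (TYPE, value) pairs; raise on any unexpected character
    spec = [('NUMBER', r'\d+'), ('OP', r'[+\-*/()]'), ('SKIP', r'\s+')]
    master = re.compile('|'.join(f'(?P<{n}>{p})' for n, p in spec))
    pos, tokens = 0, []
    for m in master.finditer(text):
        if m.start() != pos:
            raise SyntaxError(f"Unexpected character at position {pos}")
        pos = m.end()
        if m.lastgroup != 'SKIP':
            tokens.append((m.lastgroup, m.group()))
    return tokens

class Parser:
    # Grammar (one method per rule):
    #   expression -> term   (('+' | '-') term)*
    #   term       -> factor (('*' | '/') factor)*
    #   factor     -> NUMBER | '(' expression ')'
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        token = self.peek()
        if token is None:
            raise SyntaxError("Unexpected end of input")
        self.pos += 1
        return token

    def parse_expression(self):
        node = self.parse_term()
        while self.peek() and self.peek()[1] in ('+', '-'):
            op = self.next()[1]
            node = ('binop', op, node, self.parse_term())
        return node

    def parse_term(self):
        node = self.parse_factor()
        while self.peek() and self.peek()[1] in ('*', '/'):
            op = self.next()[1]
            node = ('binop', op, node, self.parse_factor())
        return node

    def parse_factor(self):
        token = self.next()
        if token[0] == 'NUMBER':
            return ('number', int(token[1]))
        if token[1] == '(':
            node = self.parse_expression()
            if self.next()[1] != ')':
                raise SyntaxError("Expected ')'")
            return node
        raise SyntaxError(f"Unexpected token {token[1]!r}")

print(Parser("2 + 3 * 4").parse_expression())
```

Notice how precedence falls out of the call structure: `parse_expression` calls `parse_term`, which calls `parse_factor`, so `*` binds tighter than `+` without any special-case logic.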
Working with Parse Trees
Once you have a parse tree or AST, you need to do something with it—evaluate it, compile it, or analyze it. This section covers common operations.
Evaluating the AST
Once you have an AST, you can walk it to compute results:
```python title="AST Evaluator in Python" linenums="1"
def evaluate(node):  # (1)!
    if node[0] == 'number':  # (2)!
        return node[1]
    elif node[0] == 'binop':  # (3)!
        op, left, right = node[1], node[2], node[3]  # (4)!
        left_val = evaluate(left)  # (5)!
        right_val = evaluate(right)
        if op == '+': return left_val + right_val  # (6)!
        if op == '-': return left_val - right_val
        if op == '*': return left_val * right_val
        if op == '/': return left_val / right_val
    raise ValueError(f"Unknown node type: {node[0]}")

# Using our earlier AST
ast = ('binop', '+', ('number', 2), ('binop', '*', ('number', 3), ('number', 4)))
print(evaluate(ast))  # 14
```
- Recursively evaluate an AST node and return its computed value
- Base case: if it's a number node, return its value
- If it's a binary operation, evaluate both operands and apply the operator
- Unpack the operator and operands from the tuple
- Recursively evaluate left and right subtrees first
- Apply the operator to the evaluated operands
This tree-walking interpreter is the simplest approach. Real interpreters might:
- Compile the AST to bytecode
- Optimize the tree before execution
- Generate machine code
Error Handling
Good parsers give helpful error messages:
```python title="Error Handling in Parser" linenums="1"
def consume(self, expected_type):
    token = self.current_token()
    if token is None:  # (1)!
        raise SyntaxError(f"Unexpected end of input, expected {expected_type}")
    if token[0] != expected_type:  # (2)!
        raise SyntaxError(
            f"Line {self.line}, column {self.column}: "  # (3)!
            f"Expected {expected_type}, got {token[0]} ('{token[1]}')"
        )
    self.pos += 1
    return token
```
- Check if we've run out of tokens unexpectedly
- Validate that the token type matches what we expected
- Provide location information (line and column) for better error messages
More sophisticated parsers can:
- Recover from errors and continue parsing
- Suggest corrections ("Did you mean...?")
- Highlight the exact location of the problem
Parser Generators
Writing parsers by hand is educational, but for real projects, consider parser generators:
| Tool | Language | Grammar Style |
|---|---|---|
| ANTLR | Java, Python, etc. | EBNF-like |
| PLY | Python | YACC-like |
| Lark | Python | EBNF |
| Peggy (formerly PEG.js) | JavaScript | PEG |
| GNU Bison | C/C++ | YACC |
You write the grammar, the tool generates the parser code. Parser generators make this pattern practical for complex grammars.
Example with Lark (Python):
```python title="Parser Generator with Lark" linenums="1"
from lark import Lark

grammar = """  # (1)!
start: expr                       # (2)!
expr: term (("+"|"-") term)*      # (3)!
term: factor (("*"|"/") factor)*  # (4)!
factor: NUMBER | "(" expr ")"     # (5)!
NUMBER: /\d+/                     # (6)!
%ignore " "                       # (7)!
"""
parser = Lark(grammar)  # (8)!
tree = parser.parse("2 + 3 * 4")
print(tree.pretty())
```
- Define grammar using EBNF-like syntax as a multi-line string
- Entry point of the grammar - must start with an expression
- Expression handles lowest precedence: addition and subtraction
- Term handles higher precedence: multiplication and division
- Factor handles highest precedence: numbers and parenthesized expressions
- Define what a NUMBER token looks like using regex
- Ignore whitespace in the input
- Create parser from grammar - Lark generates all parsing code automatically
Real-World Parsing
JSON is simple enough to parse by hand:
```
value  = object | array | string | number | "true" | "false" | "null"
object = "{" [ pair { "," pair } ] "}"
pair   = string ":" value
array  = "[" [ value { "," value } ] "]"
```
Most languages have built-in JSON parsers because it's so common. The grammar is regular enough that a hand-written recursive descent parser works well.
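As an illustration (not a replacement for `json.loads`), here is a hand-written recursive descent sketch following the grammar above. It handles a JSON subset only: objects, arrays, plain double-quoted strings (no escapes), integers, and the three literals:

```python
def parse_json(text):
    # Illustrative subset parser: no string escapes, floats, or exponents.
    pos = 0

    def skip_ws():
        nonlocal pos
        while pos < len(text) and text[pos] in ' \t\r\n':
            pos += 1

    def expect(ch):
        nonlocal pos
        if pos >= len(text) or text[pos] != ch:
            raise SyntaxError(f"Expected {ch!r} at position {pos}")
        pos += 1

    def parse_value():
        nonlocal pos
        skip_ws()
        ch = text[pos]
        if ch == '{': return parse_object()
        if ch == '[': return parse_array()
        if ch == '"': return parse_string()
        if text.startswith('true', pos):  pos += 4; return True
        if text.startswith('false', pos): pos += 5; return False
        if text.startswith('null', pos):  pos += 4; return None
        return parse_number()

    def parse_object():
        nonlocal pos
        expect('{'); skip_ws()
        obj = {}
        if text[pos] == '}': pos += 1; return obj
        while True:
            skip_ws()
            key = parse_string()
            skip_ws(); expect(':')
            obj[key] = parse_value()
            skip_ws()
            if text[pos] == ',': pos += 1; continue
            expect('}'); return obj

    def parse_array():
        nonlocal pos
        expect('['); skip_ws()
        arr = []
        if text[pos] == ']': pos += 1; return arr
        while True:
            arr.append(parse_value())
            skip_ws()
            if text[pos] == ',': pos += 1; continue
            expect(']'); return arr

    def parse_string():
        nonlocal pos
        expect('"')
        start = pos
        while text[pos] != '"':
            pos += 1
        value = text[start:pos]
        pos += 1
        return value

    def parse_number():
        nonlocal pos
        start = pos
        if text[pos] == '-': pos += 1
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        return int(text[start:pos])

    result = parse_value()
    skip_ws()
    return result
```

Each grammar rule became one function, and `parse_value` dispatches on the next character; that one-rule-one-function shape is why JSON is such a pleasant hand-parsing exercise.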
HTML is messy—browsers handle malformed HTML gracefully, and real HTML parsers use complex error recovery. Input full of unclosed `<p>` tags or misnested elements like `<b><i>bold italic</b></i>` is technically invalid, but browsers render it anyway! The web is wild.
Why HTML parsing is hard:
- Browsers must handle broken HTML (unclosed tags, misnested elements)
- Contextual parsing rules (what's valid inside `<script>` differs from `<div>`)
- Historical quirks mode for backward compatibility
Modern language parsers are sophisticated:
- Error recovery — IDE features like syntax highlighting and autocomplete work even with incomplete/invalid code
- Incremental parsing — fast re-parsing on edits by only updating changed parts of the tree
- Loose parsing modes — handle incomplete code during typing for real-time feedback
Technical Interview Context
Parsers come up in security-focused interviews (injection attacks) and in any discussion about building interpreters, config file processors, or DSLs.
Why does parameterized SQL prevent SQL injection?
SQL injection exploits the parser boundary: user input is treated as SQL syntax rather than data. Parameterized queries keep user data outside the parse boundary entirely — the SQL structure is fixed before user input arrives, so no amount of crafted input can modify the query structure.
What is an AST and where do you see them?
An Abstract Syntax Tree is the parsed representation of structured input. Python's ast module exposes the AST for any Python file. Linters, type checkers, transpilers, and compilers all operate on ASTs rather than raw source text.
What's the difference between top-down and bottom-up parsing?
Top-down parsers start from the root grammar rule and work toward the leaves; recursive descent parsers are top-down. Bottom-up parsers build the tree from leaves up, recognizing patterns as they accumulate input. Most hand-written parsers are top-down; most generated parsers (yacc, LALR) are bottom-up.
When would you write a parser vs use an existing library?
Use an existing library for any standard format (JSON, YAML, TOML, SQL). Write a parser when you're defining a custom DSL or config format — and even then, parser generators (ANTLR, PLY, pest in Rust) are usually faster to build with than hand-rolling from scratch.
Practice Problems
Challenge 1: Add Unary Minus
Modify the expression grammar and parser to handle unary minus:
- `-5`
- `2 + -3`
- `-(-4)`
Where does the new rule go in the precedence hierarchy?
Approach
Where it goes: the factor level — same depth as number literals and parenthesized expressions. This gives unary minus the highest precedence, so -2 * 3 parses as (-2) * 3 and 2 + -3 parses as 2 + (-3).
Updated grammar rule: `factor → '-' factor | NUMBER | '(' expression ')'`
Why factor calls itself recursively: This handles -(-4). The outer - consumes itself, then calls parse_factor() again — which sees another -, consumes it, and parses 4. Without recursion here, double negation would fail to parse.
Parser change: in `parse_factor()`, check for a leading minus token; if present, consume it, call `parse_factor()` recursively, and wrap the result in a negation node.
Why not at the expression level? Placing it there would make -2 + 3 parse as -(2 + 3) = -5 instead of (-2) + 3 = 1. Grammar depth controls precedence — deeper means higher priority.
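A sketch of that `parse_factor()` change in isolation, assuming tokens are `(TYPE, value)` pairs; the `'neg'` node tag is an invented convention for this example:

```python
def parse_factor(tokens, pos):
    # Returns (ast_node, new_pos).
    # Grammar sketch: factor -> '-' factor | NUMBER
    # (the '(' expression ')' branch is omitted to keep this short)
    kind, value = tokens[pos]
    if value == '-':                                  # unary minus
        node, pos = parse_factor(tokens, pos + 1)     # recursion handles -(-4)
        return ('neg', node), pos
    if kind == 'NUMBER':
        return ('number', int(value)), pos + 1
    raise SyntaxError(f"Unexpected token {value!r}")

# Double negation: '- - 4' parses as neg(neg(4))
tokens = [('OP', '-'), ('OP', '-'), ('NUMBER', '4')]
print(parse_factor(tokens, 0)[0])
```

Because the minus branch calls `parse_factor` again, any depth of stacked negations parses without extra code.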
Challenge 2: Parse Variable Assignments
Extend the grammar to handle assignment statements such as `x = 5` and `y = x + 5`. You'll need to track variable names and store their values.
Approach
Updated grammar:
```
program    → statement*
statement  → IDENTIFIER '=' expression    ← assignment
           | expression                   ← expression statement
expression → term ('+' term | '-' term)*
...
```
Key additions:
1. Symbol table — a dict mapping variable names to current values
2. Assignment parsing — look ahead one token: if the current token is an identifier AND the next is `=`, it's an assignment
3. Variable lookup — when an identifier appears in expression context, look up its value in the symbol table (and fail clearly if it was never assigned)
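A hedged sketch of all three pieces together, interpreting pre-tokenized statements eagerly; the helper names and `(TYPE, value)` token shape are invented for this example:

```python
def run(statements):
    variables = {}                          # 1. the symbol table

    def eval_expr(tokens):
        # Minimal subset: atom ('+' atom)*
        def atom(tok):
            kind, value = tok
            if kind == 'NUMBER':
                return int(value)
            if kind == 'IDENT':             # 3. variable lookup
                if value not in variables:
                    raise NameError(f"Undefined variable {value!r}")
                return variables[value]
            raise SyntaxError(f"Unexpected token {value!r}")
        total = atom(tokens[0])
        i = 1
        while i < len(tokens):
            assert tokens[i] == ('OP', '+')
            total += atom(tokens[i + 1])
            i += 2
        return total

    for tokens in statements:
        # 2. lookahead: IDENT followed by '=' means assignment
        if len(tokens) >= 2 and tokens[0][0] == 'IDENT' and tokens[1] == ('OP', '='):
            variables[tokens[0][1]] = eval_expr(tokens[2:])
        else:
            eval_expr(tokens)               # bare expression statement
    return variables

program = [
    [('IDENT', 'x'), ('OP', '='), ('NUMBER', '5')],
    [('IDENT', 'y'), ('OP', '='), ('IDENT', 'x'), ('OP', '+'), ('NUMBER', '5')],
]
print(run(program))  # {'x': 5, 'y': 10}
```

Swapping the statements in `program` triggers the `NameError`, which is exactly the ordering constraint discussed below.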
The order matters: y = x + 5 only works if x was assigned first. Your parser can enforce this at parse time (eager lookup, as above) or defer to evaluation time (build a full AST, evaluate later). Most real compilers do the latter — it enables forward references and function calls before definitions.
Challenge 3: Error Messages
Improve the parser to give line and column numbers in error messages. What information do you need to track during lexing?
Approach
Track position in the lexer: capture line and column as you scan, incrementing the line counter (and resetting the column) at every newline, and snapshot both at the start of each token.

Store the position in every token, for example as a `(TYPE, value, line, column)` tuple, so it travels with the token to the parser.

Use the stored position in parser errors, as the `consume()` example earlier does with `self.line` and `self.column`.
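A sketch of that tracking in a character-by-character lexer; the four-field token shape is one common convention, not a standard:

```python
def tokenize(text):
    tokens = []
    line, col, pos = 1, 1, 0
    while pos < len(text):
        ch = text[pos]
        if ch == '\n':
            line += 1; col = 1; pos += 1      # newline: advance line, reset column
        elif ch.isspace():
            col += 1; pos += 1
        elif ch.isdigit():
            start_line, start_col = line, col  # snapshot at the token START
            start = pos
            while pos < len(text) and text[pos].isdigit():
                pos += 1; col += 1
            tokens.append(('NUMBER', text[start:pos], start_line, start_col))
        elif ch in '+-*/=':
            tokens.append(('OP', ch, line, col))
            col += 1; pos += 1
        else:
            raise SyntaxError(f"Line {line}, column {col}: unexpected {ch!r}")
    return tokens

for tok in tokenize("12 +\n 34"):
    print(tok)
```

The snapshot at token start is the key move: by the time a multi-character token finishes, `line` and `col` already point past it.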
Critical insight: Snapshot line/column at the start of each token during lexing — not at parse time. By the time the parser raises an error, the lexer has already moved past the relevant position. This is exactly how Python, Go, and Rust compilers track source positions in their token types, enabling error messages like SyntaxError: unexpected token '+' at line 3, column 12.
Key Takeaways
| Concept | What It Does |
|---|---|
| Lexer | Breaks text into tokens |
| Parser | Builds tree from tokens |
| Token | Meaningful chunk (keyword, number, operator) |
| AST | Tree representing program structure |
| Recursive Descent | Each grammar rule = one function |
| Precedence | Handled by grammar structure (nesting depth) |
Further Reading
On This Site
- Trees — Tree structures and hierarchical data
- Finite State Machines — Foundation for lexers
External
- Crafting Interpreters by Robert Nystrom — Free online book, excellent deep dive
Parsing bridges the gap between human-readable text and computer-manipulable structure. It's where formal grammar theory meets production code — where BNF rules become the error messages you read, the config formats you write, and the query languages you depend on. Every tool that reads structured text runs a parser. Knowing how parsers work turns those tools from black boxes into understandable systems you can reason about, debug, and extend.