How tf postgres works under the hood?

How SQL Queries Are Processed?

Every query takes the same fundamental path through PostgreSQL:

Stage 1: Parsing - 'Is this even valid SQL?'

The parser identifies names, verbs, and clauses, but it doesn’t yet understand what they mean in the context of your database. (Like english teacher!)

We send:

SELECT * FROM users WHERE age > 30;

PostgreSQL first checks:

Syntax correct?

Table/keywords valid? etc

The lexer breaks the raw SQL text into tokens.

The parser takes those tokens and checks whether they follow SQL grammar rules.

reference

Stage 2: Analysis - 'Wtf mean?'

The analyzer does several critical things:

Resolves references: It looks up employees in the system catalogs to verify the table exists. It checks whether the columns exist in that table.

etc… wait? so parser and analyzer are same? not really.

Parser = "Is this SQL syntactically valid?"

Analyzer = "Do these tables/columns actually exist?"

reference

Now PostgreSQL understands what our query means! yay!

Stage 3: Rewriting - “Can I simplify this(even if no one asked for!)?”

The rewriter takes our semantically valid query tree and applies automatic transformations.

We can use this phase as security validation as well! After applying all transformations, PostgreSQL has the final version of the query. The next step is figuring out the most efficient way to execute it.

It answer: Can this query be transformed into another equivalent query before optimization?

Example: View Expansion

Suppose you have:

CREATE VIEW active_users 
AS SELECT id, name
FROM users
WHERE is_active = true;

And you run:

SELECT * FROM active_users;

Parse Tree

Initially:

SELECT * FROM active_users

Rewriter Output

The rewriter replaces the view with its underlying query:

SELECT id, name
FROM users
WHERE is_active = true;

Basically rewriting the query to its expanded form.

Stage 4: Planning - “What’s the best way to run this?”

(most sophisticated part)

The planner examines all available indexes for each table and evaluates which access method will be fastest. Sometimes an index lookup is the clear winner. Other times, especially for small tables or when you need most of the rows, a sequential scan is actually faster because it avoids the overhead of index navigation.

This is why the planner, which optimizes the parsed query, is one of the most complex elements of the system.

With all this information, the planner evaluates different execution strategies. For each possible plan, PostgreSQL calculates an estimated cost based on:

Expected disk I/O (reading pages from disk or cache)
CPU usage (processing rows, evaluating conditions)
Memory requirements (sorting, hashing)

The planner explores multiple options and picks the one with the lowest estimated cost

But wait have you noticed any problem here?

When your query joins many tables, the number of possible join orders explodes exponentially. ex. for 3 tables how many possibilities?

(PostgreSQL switches to a Genetic Query Optimizer (geqo_threshold) that uses DP and heuristics approach.)

“Dynamic programming: “Don’t solve the same sub-problem again and again”
Heuristics: “Don’t try everything: use rules to skip bad options”

(Like the obvious ones)

reference

Wait a minute! Is there any way we can stop here and see what results we got before actually executing?(So we don’t delete prod accidentally?)

yes! we do have! that’s what EXPLAIN do!

Suppose you have:

SELECT *
FROM employees
WHERE department = 'Sales';

If you run:

EXPLAIN
SELECT *
FROM employees
WHERE department = 'Sales';

You might see:

Seq Scan on employees  (cost=0.00..18.10 rows=3 width=64)
  Filter: (department = 'Sales')
  
  👉 PostgreSQL plans to:
  
  Scan entire employees table
  Expects ~3 rows
Thinks it’s very cheap (cost 18.10)
Each row ≈ 64 bytes

reference

Stage 5: Execution -”enough talk! now do the actual work”

“Okay, I know the best way. now let’s actually run it.”

It does NOT rethink

It does NOT optimize anymore

It just follows instructions

PostgreSQL uses a pull-based execution model.

It’s like “Nothing happens until the top node asks for data.”

The top node of the execution plan initiates the process by requesting rows from its child nodes. Each node processes data and passes results up the tree until the final output is produced.

Again wait a minute! Is there anyway to see final stats? yes! we have explain analyze.

EXPLAIN ANALYZE

Seq Scan on users  (cost=0.00..2.00 rows=100 width=11)
(actual time=0.010..0.014 rows=99 loops=1)
Planning Time: 0.049 ms
Execution Time: 0.048 ms

👉 What actually happened:

Full table scan executed
Returned 99 rows (close to estimate ✔)
Took ~0.014 ms to run (very fast)
Planning took ~0.049 ms

Done Execution of your query! hurrah!

Why This Matters?

Write better queries: Knowing how the planner works helps you structure queries for optimal performance. You’ll understand why certain queries can’t use indexes, why join order matters,etc
Debug performance issues: When a query is slow, you can use EXPLAIN to see which stage is problematic. Is the planner choosing a bad path? Is an index missing?
Design better schemas: Understanding how PostgreSQL processes queries informs index design, table partitioning, and view usage.
Appreciate the complexity: There’s a lot happening behind that simple SELECT statement. PostgreSQL handles connection management, memory allocation, syntax validation, semantic analysis, rule application, cost-based optimization, and pull-based execution(all transparently.)

(Articles I love!)

Thank you for reading! See you in the next one!