Query Optimization and Cost Estimation

Overview

Query types and SQL join topologies.
How the DBMS parses, rewrites, optimizes and executes a query.
Relational algebra equivalences and algebraic trees.
Heuristic optimization rules and plan generation.
Cost estimation with statistics, access methods and join algorithms.

1. Query typology

1.1 Types of queries

Simple queries: basic operations on a single table (selection, insertion, update, deletion).
Complex queries: joins, subqueries, aggregations, analytical functions.

Example of a complex query:

SELECT c.customer_id, c.customer_name, SUM(o.amount) AS total_spent
FROM Customers c
JOIN Orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name
HAVING SUM(o.amount) > (SELECT AVG(amount) FROM Orders);

1.2 Query topology (join shapes)

Query topology describes how tables are organized and linked in a query, especially when joins are involved.

Star schema join: a central fact table connected to several dimension tables.

SELECT F.date, D.nom_produit, C.nom_client
FROM Fournisseur F
JOIN Produits D ON F.id_produit = D.id_produit
JOIN Clients C ON F.id_client = C.id_client;

Snowflake schema join: a normalized star schema; dimension tables are decomposed into sub-dimensions and joins involve several levels.
Chain join: tables are joined sequentially, each linked to the next through join attributes.
Tree join: joins form a tree structure; one table connects to several others and some of those connect further.
Cyclic join: tables form cycles; some joins loop back to already-included tables.

2. The different query processing phases

SQL is declarative: it specifies what to retrieve, not how. The DBMS turns SQL into an executable plan using several phases.

2.1 Parsing (Query Parser)

1) Lexical and syntactic analysis
- Lexical: split into tokens (keywords, identifiers, literals).
- Syntactic: validate the structure against SQL grammar.

2) Semantic analysis
- Object verification: tables, columns, views exist.
- Privilege checking: user permissions are valid.
- Type checking: data types are compatible.

2.2 Rewriting (Query Rewriter)

The SQL query is transformed into a relational algebra tree.

2.3 Optimization (Query Optimizer)

The optimizer applies rules and heuristics to generate multiple execution plans and selects the least-cost plan.

2.4 Execution (Query Executor)

Execution operators (selection, projection, join, etc.) are applied to produce the final result.

3. Query optimization (syntactic method)

3.1 Equivalences in relational algebra

Projection (cascade)

πa1,a2,...,an(R) = πa1( ... (πan(R)) ... )

Cartesian product

(T x S) x R = T x (S x R)    (associative)
T x S = S x T               (commutative)

Join = Cartesian product + selection

Join reordering relies on associativity/commutativity under valid attribute conditions. The lecture cases use:
- R(a,z), S(a,b), T(b,y)
- Case 1, Case 2, Case 3: equivalent join orderings (see slides for diagrams).

3.2 Algebraic trees

Leaf nodes: base relations.
Internal nodes: relational algebra operators.
Root: final operation.

(Exercise in slides: build the algebraic tree for a query using R(a,z), S(a,b), T(b,y).)

3.3 Transformation rules for algebraic trees

1) Push down selection: apply σ (selection) as early as possible to reduce intermediate results.

Example relations:
- Reserves (sid: integer, bid: integer, day: date, rname: text)
- Sailors (sid: integer, sname: text, rating: integer, age: real)

2) Push down projection: move π (projection) downward to eliminate unneeded columns early.

3) Avoid Cartesian products: replace them with joins whenever possible.

3.4 Simple heuristic algorithm to generate a query plan

Input: SQL query
Output: algebraic tree representing the execution plan

1) Translate SQL to relational algebra
2) Build the initial algebraic tree
3) Apply heuristic transformations:
- Push-down selection
- Push-down projection
- Remove unnecessary Cartesian products

4. Cost estimation

Multiple logically equivalent plans may exist; the optimizer chooses the least-cost plan.
Logical equivalence is not the same as physical equivalence.

4.1 Statistical data for optimization

Statistics guide the optimizer by summarizing the data distribution.

Table cardinality: number of tuples.
Example: a table Employees contains 10,000 rows.
Selectivity factor: fraction of tuples satisfying a predicate.
Example: condition age > 40, if 30% satisfy it, selectivity = 0.3.
Page size: tuples per disk page; used to estimate I/O.
Indexing: existence of indexes on specific columns.
Example: if an index is defined on id, a query id = 10 may use an Index Scan.

4.2 Algorithms for selection

1) Table Scan (Heap Scan)
2) Index Scan
3) Binary Search
4) Bitmap Scan

4.3 Algorithms for projection

1) Without duplicate elimination
2) With duplicate elimination:
- Sort-based
- Hash-based
3) With indexing

4.4 Join algorithms

1) Nested Loop Join (NLJ)
2) Block Nested Loop Join (BNLJ)
3) Sort-Merge Join (SMJ)
4) Hash Join
5) Index Nested Loop Join

4.5 Join implementation techniques

4.5.1 Page Nested Loop Join

Principle: scan the outer table, and for each page, scan the inner table for matching tuples.
Cost: C = T1 * T2 where T1 and T2 are the table cardinalities.
Use case: outer table is small.

Example:

SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid = S.sid
AND R.bid = 100
AND S.rating > 5;

Given:
- Table S has 500 pages
- Table R has 1,000 pages

Costs:
- Basic scan: 500 + (500 * 1,000) = 500,500 I/Os.
- If rating > 5 keeps 250 pages of S: 500 + (250 * 1,000) = 250,500 I/Os.
- If bid = 100 keeps 10 pages of R and R is the outer table:
1,000 + (10 * 500) = 6,000 I/Os.

4.5.2 Sort-Merge Join

Principle: sort both relations on the join attribute, then merge them in order.

Steps:
1) Sort both relations on the join attribute (External Sort-Merge).
2) Merge sequentially:
- If R[i] < S[j], advance in R.
- If R[i] > S[j], advance in S.
- If equal, output all matching pairs.
- Duplicates are handled by generating the Cartesian product of matching ranges.

External Sort-Merge Algorithm

Input:
- Relation R stored on disk
- B available buffer pages in memory
- Sort key A

Output:
- Relation R sorted on attribute A

Phase 1: Run generation (initial runs)
1) While pages of R remain:
- Read up to B pages into memory.
- Sort the tuples in memory (e.g., quicksort).
- Write the sorted pages back to disk as a sorted run.
2) End of phase:
- R is partitioned into ceil(|R| / B) sorted runs.
- Each run is sorted, but the relation is not globally sorted yet.

Phase 2: Multi-way merge (merge passes)
1) While more than one run exists:
- Select up to B - 1 runs.
- Allocate input buffers (one per run) and one output buffer.
- Read the first page of each run.
- Repeatedly select the smallest tuple and write it to output.
- Refill input buffers as needed and flush output when full.
- Write the merged output as a new run.
2) Repeat until a single sorted run remains.

External Sort-Merge Cost

Cost = 2 * size(R) * (ceil(log_B(size(R) / B)) + 1)

Sort-Merge Join Cost

Cost = size(R) + size(S)
     + 2 * size(R) * (ceil(log_B(size(R) / B)) + 1)
     + 2 * size(S) * (ceil(log_B(size(S) / B)) + 1)

5. Key takeaways

Query optimization is a mix of algebraic equivalences and cost-based decisions.
Heuristic transformations reduce intermediate results before cost estimation.
Statistics (cardinality, selectivity, page size, indexing) drive cost models.
Join choice and join order dominate overall query performance.