SQL Optimizations

Other sources

a few postgres specific configurations

https://explain.depesz.com/

Most part comes from (Willems, 2019)

Time Coomplexity

Search on indexed column is O[log(N)] (e.g. index [a.k.a table, or full index] or cluster index scan) and search on non-indexed column is O[N] (e.g. seq [a.k.a. full table] scan).
- SELECT COUNT(*) FROM TABLE is full search.
A hash join is O[M+N].
- The classical hash join for an inner join of two tables prepares a hash table of the smaller table, where an entry consist of the join attribut and its row. Then the larger table is scanned and the relevant rows from the smaller are found looking at the hash table.
Merge join is O[M+N] in the general case, but depends heavily on the indexes on the join column, or, in case of non-indexed column, wheter rows are sorted according to the join keys.
- If both tables are sorted according to the join keys, then O[M+N].
- If both tables have indexed join columns, then no need of sorting, so O[M+N].
- If neither table has an index on the join column, then sorting must be performed in advance, so O[M log(M) + N log(N)].
- If only one of the tables has an indexed join column, then O[M + N log(N)].
Nested join is O[MN] in the general case. It compares every record in one table to every record in the other. It is efficient when one of the tables is extremely small (e.g. 10 rows). This is common situation as some subqueries are written to return one or few rows only.
!! When performing a join, occasionally, when a table does not fit into memory, the first full table scan does not pose a problem, but then there are a lot of randomreads to fetch indexes that are generally order of magnitudes slower than sequential reads. In these cases a full table scan is indeed faster than a full index scan.

Operators

LIKE + pattern starting with % or _ : prevents the DB from using indexes.
OR : it is likely a non-index c olumn is used. Can be replace with IN.
- SELECT license, name FROM Drivers WHERE license = 123 OR license = 234 OR license = 345
- SELECT license, name FROM Drivers WHERE license IN (123, 234, 345)
NOT : it is likely that the index is not used. Consider replacing it with comparison operators >, <=, <>, !> …
- SELECT license, name FROM Drivers WHERE NOT (year > 1980)
- SELECT license, name FROM Drivers WHERE year <= 1980
AND : another operator that doesn’t make use of the indexes. Consider using BETWEEN.
- SELECT license, name FROM Drivers WHERE year >= 1960 AND year <= 1980
- SELECT license, name FROM Drivers WHERE year BETWEEN 1960 AND 1980
ANY, ALL : another two operators that do not use indexes. Consider MIN and MAX. However, note taht all agregator operatos SUM, AVG, MIN, MAX can result in a long-running operations.
UNION : it is likely that you go thru the same table mulitple times. Consider updating the query with one SELECT containing all conditions, or replacing the UNION with OUTER JOIN.
HAVING : no indexes. It was introduced in SQL as WHERE is not working on GROUPS. The difference between them is that WHERE introduces a condition on individual rows, while HAVING do so on aggregated ones.
- SELECT state, COUNT(*) FROM Drivers GROUP BY state HAVING state IN ('GA', 'TX') ORDER BY state
- SELECT state, COUNT(*) FROM Drivers WHERE state IN ('GA', 'TX') GROUP BY state ORDER BY state

Good Practices

Limit your results: TOP, LIMIT, and ROWNUM.
When using joins, put the bigger table last in the join.
Cache small table full table scans
Choose highly selective indexes on commonly-used data.

General Notes

If the execution plan is using a Hash Join it can be very slow if table size estimates are wrong. Therefore, make sure your table statistics are accurate by reviewing your vacuuming strategy.
Avoid correlated subqueries where possible; they can significantly increase query cost.
Look for a large variance between estimated rows and actual rows in the EXPLAIN statement.

Examples

-- unoptimized due to the OR 
SELECT DISTINCT
	PRODUCT.ProductID,
	PRODUCT.Name
FROM Production.Product PRODUCT
INNER JOIN Sales.SalesOrderDetail DETAIL
ON PRODUCT.ProductID = DETAIL.ProductID -- indexed
OR PRODUCT.rowguid = DETAIL.rowguid;    -- non-indexed
-- turning the OR into its own select with UNION
-- althoug longer, it is far better in terms of performance.
SELECT
	PRODUCT.ProductID,
	PRODUCT.Name
FROM Production.Product PRODUCT
INNER JOIN Sales.SalesOrderDetail DETAIL
ON PRODUCT.ProductID = DETAIL.ProductID
UNION
SELECT
	PRODUCT.ProductID,
	PRODUCT.Name
FROM Production.Product PRODUCT
INNER JOIN Sales.SalesOrderDetail DETAIL
ON PRODUCT.rowguid = DETAIL.rowguid

AWS Aurora PostrgeSQL Perfomrance Tuning

random_page_cost - set to 1 or 1.5 instead of default 4, as SSD (no rotation latency) are used, and buffer ahead of time.

References

Willems, K. (2019). SQL Tutorial: How to write better queries. https://www.datacamp.com/community/tutorials/sql-tutorial-query