SQL Optimizations
created: 14 May 2021
modified: 19 December 2021
revision: 3
Other sources
a few postgres specific configurations
https://explain.depesz.com/
Most part comes from (Willems, 2019)
Time Coomplexity
- Search on indexed column is
O[log(N)]
(e.g. index [a.k.a table, or full index] or cluster index scan) and search on non-indexed column isO[N]
(e.g. seq [a.k.a. full table] scan).SELECT COUNT(*) FROM TABLE
is full search.
- A hash join is
O[M+N]
.- The classical hash join for an inner join of two tables prepares a hash table of the smaller table, where an entry consist of the join attribut and its row. Then the larger table is scanned and the relevant rows from the smaller are found looking at the hash table.
- Merge join is
O[M+N]
in the general case, but depends heavily on the indexes on the join column, or, in case of non-indexed column, wheter rows are sorted according to the join keys.- If both tables are sorted according to the join keys, then
O[M+N]
. - If both tables have indexed join columns, then no need of sorting, so
O[M+N]
. - If neither table has an index on the join column, then sorting must be performed in advance, so
O[M log(M) + N log(N)]
. - If only one of the tables has an indexed join column, then
O[M + N log(N)]
.
- If both tables are sorted according to the join keys, then
- Nested join is
O[MN]
in the general case. It compares every record in one table to every record in the other. It is efficient when one of the tables is extremely small (e.g. 10 rows). This is common situation as some subqueries are written to return one or few rows only. - !! When performing a join, occasionally, when a table does not fit into memory, the first full table scan does not pose a problem, but then there are a lot of randomreads to fetch indexes that are generally order of magnitudes slower than sequential reads. In these cases a full table scan is indeed faster than a full index scan.
Operators
LIKE
+ pattern starting with % or _ : prevents the DB from using indexes.OR
: it is likely a non-index c olumn is used. Can be replace withIN
.SELECT license, name FROM Drivers WHERE license = 123 OR license = 234 OR license = 345
SELECT license, name FROM Drivers WHERE license IN (123, 234, 345)
NOT
: it is likely that the index is not used. Consider replacing it with comparison operators>
,<=
,<>
,!>
…SELECT license, name FROM Drivers WHERE NOT (year > 1980)
SELECT license, name FROM Drivers WHERE year <= 1980
AND
: another operator that doesn’t make use of the indexes. Consider usingBETWEEN
.SELECT license, name FROM Drivers WHERE year >= 1960 AND year <= 1980
SELECT license, name FROM Drivers WHERE year BETWEEN 1960 AND 1980
ANY
,ALL
: another two operators that do not use indexes. ConsiderMIN
andMAX
. However, note taht all agregator operatosSUM
,AVG
,MIN
,MAX
can result in a long-running operations.UNION
: it is likely that you go thru the same table mulitple times. Consider updating the query with oneSELECT
containing all conditions, or replacing theUNION
withOUTER JOIN
.HAVING
: no indexes. It was introduced in SQL asWHERE
is not working on GROUPS. The difference between them is thatWHERE
introduces a condition on individual rows, whileHAVING
do so on aggregated ones.SELECT state, COUNT(*) FROM Drivers GROUP BY state HAVING state IN ('GA', 'TX') ORDER BY state
SELECT state, COUNT(*) FROM Drivers WHERE state IN ('GA', 'TX') GROUP BY state ORDER BY state
Good Practices
- Limit your results:
TOP
,LIMIT
, andROWNUM
. - When using joins, put the bigger table last in the join.
- Cache small table full table scans
- Choose highly selective indexes on commonly-used data.
General Notes
- If the execution plan is using a Hash Join it can be very slow if table size estimates are wrong. Therefore, make sure your table statistics are accurate by reviewing your vacuuming strategy.
- Avoid correlated subqueries where possible; they can significantly increase query cost.
- Look for a large variance between estimated rows and actual rows in the EXPLAIN statement.
Examples
-- unoptimized due to the OR
SELECT DISTINCT
PRODUCT.ProductID,
PRODUCT.Name
FROM Production.Product PRODUCT
INNER JOIN Sales.SalesOrderDetail DETAIL
ON PRODUCT.ProductID = DETAIL.ProductID -- indexed
OR PRODUCT.rowguid = DETAIL.rowguid; -- non-indexed
-- turning the OR into its own select with UNION
-- althoug longer, it is far better in terms of performance.
SELECT
PRODUCT.ProductID,
PRODUCT.Name
FROM Production.Product PRODUCT
INNER JOIN Sales.SalesOrderDetail DETAIL
ON PRODUCT.ProductID = DETAIL.ProductID
UNION
SELECT
PRODUCT.ProductID,
PRODUCT.Name
FROM Production.Product PRODUCT
INNER JOIN Sales.SalesOrderDetail DETAIL
ON PRODUCT.rowguid = DETAIL.rowguid
AWS Aurora PostrgeSQL Perfomrance Tuning
random_page_cost
- set to 1 or 1.5 instead of default 4, as SSD (no rotation latency) are used, and buffer ahead of time.
References
- Willems, K. (2019). SQL Tutorial: How to write better queries. https://www.datacamp.com/community/tutorials/sql-tutorial-query