SQL Join Optimization

Anonymous user · ‎2021-06-09

📖 Home Back
This page describes in detail how the Query Processor's join optimisation algorithm works. It is intended to help people construct queries that can be optimised efficiently and won't be rejected. Knowledge of SQL, particularly joins, is assumed. Some knowledge of the mathematical concept of permutations would be helpful.

Overview

The two main factors that affect how efficiently a join can be processed are the order of the tables in the join, and how each table is accessed (called its access path). The combination of join order and each table's access path is known as the access plan for the query. A table's access path is determined by its position in the table order, the join type and certain sub-clauses of the WHERE clause that reference the table. Note that a query that only selects from a single table is still classed as a join for the purposes of optimisation, though in this case the order of tables is irrelevant.
The basic algorithm for join optimisation is to try each possible permutation of table order, determine the access path of each table, then calculate an execution cost for that access plan. The access plan with the lowest cost is then used for execution.

Access Paths

There are five possible access paths for a table, which are (in order of preference):

Unique Lookup By Value
Unique Lookup By Column
Non-Unique Lookup By Value
Non-Unique Lookup By Column
Scan

An access path of Scan, which is the default path, is the most expensive as the Query Processor simply iterates around all rows in the table. A Lookup access path is preferable, as the Query Processor can go directly to the required row or rows using an indexed column. If the indexed column is a unique index, such as the Id of an object, the access path is Unique Lookup, otherwise it is a Non-Unique Lookup. If the value of the indexed column is specified in the query using a Constant Expressions, the access path is Lookup By Value, otherwise it is Lookup By Column. An index column has an associated cost which indicates the relative expense of lookup up a row using that column. A column with a lower cost is preferred to a column with a higher cost. For example, is is quicker to look up an object by Id than by FullName.

The following table lists the index columns for the various table types, with the lowest cost columns listed first:

Table Type	Unique Indices	Non-Unique Indices
Object	Id, FullName	ParentGroupId, Search Keys
Aggregate	-	Id
Alarm Condition	-	Id
Event Journal	RecordId	Id, Source, FileId
Point Historic	RecordId	Id
Processed Historic	RecordId	Id
Forecast Historic	RecordId	Id
Data Grid	-	-
Data Table	-	As configured
Permits	Id, Name	-

Sub-Clauses

The access path of a table is determined by the sub-clauses that reference that table. A sub-clause is a simple condition that contains no AND operators, for example:

FullName LIKE '%Pump%'

RecordTime BETWEEN TIMESTAMP '2007-01-01 00:00:00' AND TIMESTAMP '2007-01-01 23:59:59'

Id = 123 OR Id = 456

Filter Conditions

In addition to an access path, a table also has a filter condition, consisting of any sub-clauses from the clause list that refer to that table and aren't used to determine the access path or as constraint columns. The filter condition is evaluated for each row of the join to determine if that row should be in the resultset. A sub-clause that references more than one table will only be assigned to one of those table's filter conditions, which table that is depends on the table order.

Constant Expressions

A constant expression is an expression that consists only of literals, parameter markers and constant functions such as CURRENT_TIMESTAMP, for example:

123
1 + 2
TIMESTAMP '2007-01-01 00:00:00'
OPC'D'
CURRENT_TIMESTAMP
?
? || '%'

In the various expressions and conditions used throughout this page, Value1, Value2 etc. represent constant expressions.

Optimisation Algorithm

The algorithm to determine the best access plan consists of the following steps:

Build Sub-Clause List
- Transform Sub-Clauses
- Collate Comparisons
Determine Sub-Clause Types
Calculate Table Order Permutations
- Determine Access Paths
- Calculate Cost
Determine Final Access Plan

Build Sub-Clause List

The initial list of sub-clauses is built from the join specifiers and the WHERE clause:

For a join Table A NATURAL JOIN TableB a sub-clause of the form TableA.Column = TableB.Column is added for each column that is present in both TableA and TableB.

For a join TableA JOIN TableB ON ( Condition ), Condition is split into AND-seperated sub-clauses.

For a join TableA JOIN TableB USING ( ColumnList ), a sub-clause of the form TableA.Column = TableB.Column is added for each column in ColumnList.

The WHERE clause is also split into AND-seperated sub-clauses.

Transform Sub-Clauses

Once the initial sub-clause list has been built, sub-clauses that in certain forms are transformed into one or more equivalent sub-clauses that are easier to use for determining access paths. This allows queries to be expressed more precisely, or to convey more semantic information to a human reader, and still be optimised efficiently. The possible transformations are:

Column BETWEEN Value1 AND Value2 is transformed into Column >= Value1 and Column <= Value2

Column LIKE 'StringWithNoWildcards' is transformed into Column = 'StringWithNoWildcards'. A similar transformation exists for NOT LIKE.

Column IN ( Value1, Value2, ... ) is transformed into Column = Value1 OR Column = Value2 ...

Column = ANY ( Value1, Value2, ... ) is transformed into Column = Value1 OR Column = Value2 OR .... Similar transformations exist for <>, <, <=, > and =.

Column = ALL ( Value1, Value2, ... ) is transformed into Column = Value1 and Column = Value2 etc. Similar transformations exist for <>, <, <=, > and >=.

NOT Column = Value1 is transformed into Column <> Value1. Similar transformations exist for <>, <, <=, > and >=.

Collate Comparisons

If there are multiple sub-clauses of the form Column ''op'' Value, where ''op'' is one of comparison operators <, <=, > or >=, and Value is a literal value (not a parameter), then these are collated into a single < or <= sub-clause and/or a single > or >= sub-clause with the appropriate values.

Determine Sub-Clause Type

Each sub-clause is categorised into one of five types, depending on its form and the number of tables referenced.

A no-dependency sub-clause is a sub-clause that does not reference any table. These are of little use in a realistic query, but are syntactically valid. No-dependency sub-clauses are added to a special filter condition that is evaluated before the join is executed. If this condition evaluates to false, then the join is not executed and an empty resultset is returned.

A constraint sub-clause is a sub-clause that references a single table and can be used to reduce the amount of rows searched in the table, but does not determine the table's access path. See the main section on constraint sub-clauses for more details.

An access sub-clause is a sub-clause that refences one or two tables and is in the correct form to determine a table's access path. An access sub-clause that references one table must reference exactly one column, and an access sub-clause that references two table must reference exactly two columns. At least one of the columns reference must be an index column. The possible forms of an access sub-clause are:
- TableA.Column = Value1
- TableA.Column = Value1 OR TableA.Column = Value2 OR ...
- TableA.Column1 = TableB.Colum2

A filter sub-clause is a sub-clause that reference a single table and is not suitable as a constraint or access sub-clause. Filter-sub clauses are added to the tables Filter Conditions.

A link sub-clause is a sub-clause that references two table and is not suitable as an access sub-clause, or that references more than two tables.

Calculate Table Order Permutations

The permutations of the table order are determined by the number of consecutive initial inner joins. If the query contains an outer join, subsequent tables will not be reordered as outer joins are not commutative. Only the first five inner joins will be reordered (for a maximum of 120 permutations), subsequent tables will not be ordered. It is the responsibility of the person writing the query to ensure the query can be efficiently optimised if it contains outer joins or more than five tables. For example:

A JOIN B will result in 2 permutations (A, B; B, A).

A JOIN B JOIN C will result in 6 permutations (A, B, C; A, C, B; B, A, C; B, C, A; C, A, B; C, B, A).

A OUTER JOIN B will result in 1 permutations (A, B).

A JOIN B OUTER JOIN C will result in 2 permutations (A, B, C; B, A, C).

A JOIN B JOIN C JOIN D JOIN E JOIN F JOIN G will result in 120 permutations (A, B, C, D, E, F, G; A, B, C, E, D, F, G; etc.).

Determine Access Paths

For each permutation of table order, the access and link sub-clauses are assigned to tables to determine the access paths, using the following rules:

An access sub-clause that references a single table is assigned to that table.

An access sub-clause that references two tables is assigned to the right-most of those tables in the table order, if the column of that table referenced by the sub-clause is indexed. If the column referenced is not indexed, the sub-clause is added to the table's filter condition.

Only a single sub-clause can be used to determine a table's access path, so if another access sub-clause can improve the access path (e.g. by changing from Non-Unique Lookup By Value to Unique Lookup By Value), then the sub-clause used for the worse access path is added to the table's filter condition.

Link sub-clauses are added to the filter condition of the right-most table referenced by the sub-clause.

Calculate Cost

The cost of an access plan is the product of the costs of each of the tables in the join. The cost of a table is determined by the table type and its access path. The cost of a table with a Unique Lookup access path is always 1. All other cost calculations are approximations that can be performed quickly:

Table Type	Scan Cost	Non-Unique Lookup Cost
Object	Object count / 100	Average no. object per group, no. objects in search map
Aggregate	Aggregate count / 100	1
Alarm Condition	Alarmable object count / 33	1
Event Journal	Total no. records	Average no. records per object
Point Historic	-	Average no. records per point
Processed Historic	-	Average no. records per point
Forecast Historic	-	100
Data Grid	No. rows in grid	-
Data Table	No. rows in table	No. rows in table
Permits	No. permits	-

The number of objects or aggregates in a particular table is approximated by using the nearest database index. This number is reduced by a factor of 100 to represent the fact that scanning objects is quicker than scanning the same number of historic records.

If an Id or Source constraint has been applied to an event journal scan, the cost is the average number of records per object multiple by the number of objects that match the constraint. If the constraint includes LIKEs, the number of objects that match each LIKE is approximated as 1000.

If a RecordTime constraint has been applied to an event journal, point historic or processed historic lookup, the cost is reduced by a factor of 100.

It is not possible to do a scan of a historic table i.e. retrieve all historic records for all objects. An access plan that contains a historic scan is rejected outright. If all possible access plans for a query include a historic scan, the query is rejected. Note that it is possible to each a similar effect using a query such as SELECT ... FROM CDBHistoric WHERE Id IN ( SELECT Id FROM CHistory ), though this is not recommended unless you can be certain the database does not contain large amounts of historic data.

If the cost of a table is calculated as 0 (e.g. there are no objects in that table), that table is not used to calculate the overall cost. This is to prevent an empty table causing an access plan with expensive tables at the beginning of the table order being chosen.

If there are no potential access sub-clauses that reference two tables, or there is only one permutation, then the join order specified in the the query is used. In this case, access and link sub-clauses are still assigned to tables to determine access paths, but the cost calculation is skipped.

Determine Final Access Plan

The chosen final access plan is the plan with the lowest cost. If multiple plans have the same cost, the chosen plan will be the first one that was calculated.
The cost of each access plan and the chosen final access plan can be found in the database log file:

06-DEC-2007 11:49:42.578 0F8C [SCX] 2:12 Prepare( IN: GetColumnEnums false )
06-DEC-2007 11:49:42.578 0F8C [SCX] 2:12 SQL: SELECT H.RecordId, H.RecordTime, H.Value FROM CDBHistoric AS H JOIN CDBPoint As P USING (Id)
06-DEC-2007 11:49:42.593 0F8C [SCX] 2:12 (PHIL::CQPAccess) Join Permutation 0: CDBHISTORIC = Scan, CDBPOINT = Unique By Column; Cost = -1
06-DEC-2007 11:49:42.593 0F8C [SCX] 2:12 (PHIL::CQPAccess) Join Permutation 1: CDBPOINT = Scan, CDBHISTORIC = Non-Unique By Column; Cost = 18.2
06-DEC-2007 11:49:42.593 0F8C [SCX] 2:12 (PHIL::CQPAccess) Final Access Plan: CDBPOINT = Scan, CDBHISTORIC = Non-Unique By Column; Cost = 18.2
06-DEC-2007 11:49:42.593 0F8C [SCX] 2:12 Prepare( OUT: hr 0 -> Ok., TableId 0x2 )

A cost of -1 indicates that the access was rejected because it contained a scan of a historic table.

Constraint Sub-Clauses

Certain clauses that reference a single table may be chosen as constraint sub-clauses. A constraint sub-clause reduces the amount of data searched by an access path of Scan or Non-Unique Lookup. A sub-clause that is used for a constraint will not be used to determine a table's access path or added to a table's filter condition, even if it meets the requirements for an access sub-clause.
A constraint sub-clause must reference a constraint column (which may or may not be an indexed column) and must be in a specific form. The possible constraint columns and their required forms are:

Table Type	Constraint Columns	Required Form
Event Journal	Id, Source	Column = Value1
Event Journal	Id, Source	Column = Value1 OR Column = Value2 OR ...
Event Journal	Source	Source LIKE Value1
Event Journal	Source	Source LIKE Value1 OR Source = Value2
Event Journal, Historic	RecordTime	RecordTime ''op'' Value1 (where ''op'' is any comparison operator other than <>)

Multiple constraints constraints can be applied to the same column, and a table can have constraints on multiple constraints. Each constraint reduces the amount of data to be searched and can vastly improve the performance of a query.

Database Indices

A database index is an internal set of all the objects or aggregates in the database derived from a particular class. These are used to quickly iterate around all objects of a class, by both the Query Processor and other areas. Having an index for every class would be prohibitively expensive in terms of memory, so only certain strategic base classes have an index. When calculating costs for an object or aggregate table, the nearest base class with an index is used, for example when scanning CPointAlgManual, the index for CDBPoint is used. The classes that have indices are:

CAggregate
CAlarmObject
CDBObject
CDBPoint
CDBPointSource
CDBPulse
CDBServer
CDBUser
CFileObject
CGroup
CHisFiles
CHistory
CParam
CResettingAccumulator
CTemplate
CTemplateInstance
CVariable

Go: Home Back

SQL Join Optimization