
- Bookmark
- Subscribe
- Email to a Friend
- Printer Friendly Page
- Report Inappropriate Content
Link copied. Please paste this link to share this article on your social media post.
SQL Join Optimization
Originally published on Geo SCADA Knowledge Base by Anonymous user | June 09, 2021 05:27 PM
📖 Home Back
This page describes in detail how the Query Processor's join optimisation algorithm works. It is intended to help people construct queries that can be optimised efficiently and won't be rejected. Knowledge of SQL, particularly joins, is assumed. Some knowledge of the mathematical concept of permutations would be helpful.
Overview
The two main factors that affect how efficiently a join can be processed are the order of the tables in the join, and how each table is accessed (called its access path). The combination of join order and each table's access path is known as the access plan for the query. A table's access path is determined by its position in the table order, the join type and certain sub-clauses of the WHERE clause that reference the table. Note that a query that only selects from a single table is still classed as a join for the purposes of optimisation, though in this case the order of tables is irrelevant.
The basic algorithm for join optimisation is to try each possible permutation of table order, determine the access path of each table, then calculate an execution cost for that access plan. The access plan with the lowest cost is then used for execution.
Access Paths
There are five possible access paths for a table, which are (in order of preference):
- Unique Lookup By Value
- Unique Lookup By Column
- Non-Unique Lookup By Value
- Non-Unique Lookup By Column
- Scan
An access path of Scan, which is the default path, is the most expensive as the Query Processor simply iterates around all rows in the table. A Lookup access path is preferable, as the Query Processor can go directly to the required row or rows using an indexed column. If the indexed column is a unique index, such as the Id of an object, the access path is Unique Lookup, otherwise it is a Non-Unique Lookup. If the value of the indexed column is specified in the query using a Constant Expressions, the access path is Lookup By Value, otherwise it is Lookup By Column. An index column has an associated cost which indicates the relative expense of lookup up a row using that column. A column with a lower cost is preferred to a column with a higher cost. For example, is is quicker to look up an object by Id than by FullName.
The following table lists the index columns for the various table types, with the lowest cost columns listed first:
Table Type | Unique Indices | Non-Unique Indices |
---|---|---|
Object | Id, FullName | ParentGroupId, Search Keys |
Aggregate | - | Id |
Alarm Condition | - | Id |
Event Journal | RecordId | Id, Source, FileId |
Point Historic | RecordId | Id |
Processed Historic | RecordId | Id |
Forecast Historic | RecordId | Id |
Data Grid | - | - |
Data Table | - | As configured |
Permits | Id, Name | - |
Sub-Clauses
The access path of a table is determined by the sub-clauses that reference that table. A sub-clause is a simple condition that contains no AND operators, for example:
FullName LIKE '%Pump%'
RecordTime BETWEEN TIMESTAMP '2007-01-01 00:00:00' AND TIMESTAMP '2007-01-01 23:59:59'
Id = 123 OR Id = 456
Filter Conditions
In addition to an access path, a table also has a filter condition, consisting of any sub-clauses from the clause list that refer to that table and aren't used to determine the access path or as constraint columns. The filter condition is evaluated for each row of the join to determine if that row should be in the resultset. A sub-clause that references more than one table will only be assigned to one of those table's filter conditions, which table that is depends on the table order.
Constant Expressions
A constant expression is an expression that consists only of literals, parameter markers and constant functions such as CURRENT_TIMESTAMP, for example:
- 123
- 1 + 2
- TIMESTAMP '2007-01-01 00:00:00'
- OPC'D'
- CURRENT_TIMESTAMP
- ?
- ? || '%'
In the various expressions and conditions used throughout this page, Value1, Value2 etc. represent constant expressions.
Optimisation Algorithm
The algorithm to determine the best access plan consists of the following steps:
- Build Sub-Clause List
- Transform Sub-Clauses
- Collate Comparisons
- Determine Sub-Clause Types
- Calculate Table Order Permutations
- Determine Access Paths
- Calculate Cost
- Determine Final Access Plan
Build Sub-Clause List
The initial list of sub-clauses is built from the join specifiers and the WHERE clause:
- For a join Table A NATURAL JOIN TableB a sub-clause of the form TableA.Column = TableB.Column is added for each column that is present in both TableA and TableB.
- For a join TableA JOIN TableB ON ( Condition ), Condition is split into AND-seperated sub-clauses.
- For a join TableA JOIN TableB USING ( ColumnList ), a sub-clause of the form TableA.Column = TableB.Column is added for each column in ColumnList.
- The WHERE clause is also split into AND-seperated sub-clauses.
Transform Sub-Clauses
Once the initial sub-clause list has been built, sub-clauses that in certain forms are transformed into one or more equivalent sub-clauses that are easier to use for determining access paths. This allows queries to be expressed more precisely, or to convey more semantic information to a human reader, and still be optimised efficiently. The possible transformations are:
- Column BETWEEN Value1 AND Value2 is transformed into Column >= Value1 and Column <= Value2
- Column LIKE 'StringWithNoWildcards' is transformed into Column = 'StringWithNoWildcards'. A similar transformation exists for NOT LIKE.
- Column IN ( Value1, Value2, ... ) is transformed into Column = Value1 OR Column = Value2 ...
- Column = ANY ( Value1, Value2, ... ) is transformed into Column = Value1 OR Column = Value2 OR .... Similar transformations exist for <>, <, <=, > and =.
- Column = ALL ( Value1, Value2, ... ) is transformed into Column = Value1 and Column = Value2 etc. Similar transformations exist for <>, <, <=, > and >=.
- NOT Column = Value1 is transformed into Column <> Value1. Similar transformations exist for <>, <, <=, > and >=.
Collate Comparisons
If there are multiple sub-clauses of the form Column ''op'' Value, where ''op'' is one of comparison operators <, <=, > or >=, and Value is a literal value (not a parameter), then these are collated into a single < or <= sub-clause and/or a single > or >= sub-clause with the appropriate values.
Determine Sub-Clause Type
Each sub-clause is categorised into one of five types, depending on its form and the number of tables referenced.
- A no-dependency sub-clause is a sub-clause that does not reference any table. These are of little use in a realistic query, but are syntactically valid. No-dependency sub-clauses are added to a special filter condition that is evaluated before the join is executed. If this condition evaluates to false, then the join is not executed and an empty resultset is returned.
- A constraint sub-clause is a sub-clause that references a single table and can be used to reduce the amount of rows searched in the table, but does not determine the table's access path. See the main section on constraint sub-clauses for more details.
- An access sub-clause is a sub-clause that refences one or two tables and is in the correct form to determine a table's access path. An access sub-clause that references one table must reference exactly one column, and an access sub-clause that references two table must reference exactly two columns. At least one of the columns reference must be an index column. The possible forms of an access sub-clause are:
- TableA.Column = Value1
- TableA.Column = Value1 OR TableA.Column = Value2 OR ...
- TableA.Column1 = TableB.Colum2
- A filter sub-clause is a sub-clause that reference a single table and is not suitable as a constraint or access sub-clause. Filter-sub clauses are added to the tables Filter Conditions.
- A link sub-clause is a sub-clause that references two table and is not suitable as an access sub-clause, or that references more than two tables.
Calculate Table Order Permutations
The permutations of the table order are determined by the number of consecutive initial inner joins. If the query contains an outer join, subsequent tables will not be reordered as outer joins are not commutative. Only the first five inner joins will be reordered (for a maximum of 120 permutations), subsequent tables will not be ordered. It is the responsibility of the person writing the query to ensure the query can be efficiently optimised if it contains outer joins or more than five tables. For example:
- A JOIN B will result in 2 permutations (A, B; B, A).
- A JOIN B JOIN C will result in 6 permutations (A, B, C; A, C, B; B, A, C; B, C, A; C, A, B; C, B, A).
- A OUTER JOIN B will result in 1 permutations (A, B).
- A JOIN B OUTER JOIN C will result in 2 permutations (A, B, C; B, A, C).
- A JOIN B JOIN C JOIN D JOIN E JOIN F JOIN G will result in 120 permutations (A, B, C, D, E, F, G; A, B, C, E, D, F, G; etc.).
Determine Access Paths
For each permutation of table order, the access and link sub-clauses are assigned to tables to determine the access paths, using the following rules:
- An access sub-clause that references a single table is assigned to that table.
- An access sub-clause that references two tables is assigned to the right-most of those tables in the table order, if the column of that table referenced by the sub-clause is indexed. If the column referenced is not indexed, the sub-clause is added to the table's filter condition.
- Only a single sub-clause can be used to determine a table's access path, so if another access sub-clause can improve the access path (e.g. by changing from Non-Unique Lookup By Value to Unique Lookup By Value), then the sub-clause used for the worse access path is added to the table's filter condition.
- Link sub-clauses are added to the filter condition of the right-most table referenced by the sub-clause.
Calculate Cost
The cost of an access plan is the product of the costs of each of the tables in the join. The cost of a table is determined by the table type and its access path. The cost of a table with a Unique Lookup access path is always 1. All other cost calculations are approximations that can be performed quickly:
Table Type | Scan Cost | Non-Unique Lookup Cost |
---|---|---|
Object | Object count / 100 | Average no. object per group, no. objects in search map |
Aggregate | Aggregate count / 100 | 1 |
Alarm Condition | Alarmable object count / 33 | 1 |
Event Journal | Total no. records | Average no. records per object |
Point Historic | - | Average no. records per point |
Processed Historic | - | Average no. records per point |
Forecast Historic | - | 100 |
Data Grid | No. rows in grid | - |
Data Table | No. rows in table | No. rows in table |
Permits | No. permits | - |
- The number of objects or aggregates in a particular table is approximated by using the nearest database index. This number is reduced by a factor of 100 to represent the fact that scanning objects is quicker than scanning the same number of historic records.
- If an Id or Source constraint has been applied to an event journal scan, the cost is the average number of records per object multiple by the number of objects that match the constraint. If the constraint includes LIKEs, the number of objects that match each LIKE is approximated as 1000.
- If a RecordTime constraint has been applied to an event journal, point historic or processed historic lookup, the cost is reduced by a factor of 100.
- It is not possible to do a scan of a historic table i.e. retrieve all historic records for all objects. An access plan that contains a historic scan is rejected outright. If all possible access plans for a query include a historic scan, the query is rejected. Note that it is possible to each a similar effect using a query such as SELECT ... FROM CDBHistoric WHERE Id IN ( SELECT Id FROM CHistory ), though this is not recommended unless you can be certain the database does not contain large amounts of historic data.
- If the cost of a table is calculated as 0 (e.g. there are no objects in that table), that table is not used to calculate the overall cost. This is to prevent an empty table causing an access plan with expensive tables at the beginning of the table order being chosen.
- If there are no potential access sub-clauses that reference two tables, or there is only one permutation, then the join order specified in the the query is used. In this case, access and link sub-clauses are still assigned to tables to determine access paths, but the cost calculation is skipped.
Determine Final Access Plan
The chosen final access plan is the plan with the lowest cost. If multiple plans have the same cost, the chosen plan will be the first one that was calculated.
The cost of each access plan and the chosen final access plan can be found in the database log file:
06-DEC-2007 11:49:42.578 0F8C [SCX] 2:12 Prepare( IN: GetColumnEnums false )
06-DEC-2007 11:49:42.578 0F8C [SCX] 2:12 SQL: SELECT H.RecordId, H.RecordTime, H.Value FROM CDBHistoric AS H JOIN CDBPoint As P USING (Id)
06-DEC-2007 11:49:42.593 0F8C [SCX] 2:12 (PHIL::CQPAccess) Join Permutation 0: CDBHISTORIC = Scan, CDBPOINT = Unique By Column; Cost = -1
06-DEC-2007 11:49:42.593 0F8C [SCX] 2:12 (PHIL::CQPAccess) Join Permutation 1: CDBPOINT = Scan, CDBHISTORIC = Non-Unique By Column; Cost = 18.2
06-DEC-2007 11:49:42.593 0F8C [SCX] 2:12 (PHIL::CQPAccess) Final Access Plan: CDBPOINT = Scan, CDBHISTORIC = Non-Unique By Column; Cost = 18.2
06-DEC-2007 11:49:42.593 0F8C [SCX] 2:12 Prepare( OUT: hr 0 -> Ok., TableId 0x2 )
A cost of -1 indicates that the access was rejected because it contained a scan of a historic table.
Constraint Sub-Clauses
Certain clauses that reference a single table may be chosen as constraint sub-clauses. A constraint sub-clause reduces the amount of data searched by an access path of Scan or Non-Unique Lookup. A sub-clause that is used for a constraint will not be used to determine a table's access path or added to a table's filter condition, even if it meets the requirements for an access sub-clause.
A constraint sub-clause must reference a constraint column (which may or may not be an indexed column) and must be in a specific form. The possible constraint columns and their required forms are:
Table Type | Constraint Columns | Required Form |
---|---|---|
Event Journal | Id, Source | Column = Value1 |
Event Journal | Id, Source | Column = Value1 OR Column = Value2 OR ... |
Event Journal | Source | Source LIKE Value1 |
Event Journal | Source | Source LIKE Value1 OR Source = Value2 |
Event Journal, Historic | RecordTime | RecordTime ''op'' Value1 (where ''op'' is any comparison operator other than <>) |
Multiple constraints constraints can be applied to the same column, and a table can have constraints on multiple constraints. Each constraint reduces the amount of data to be searched and can vastly improve the performance of a query.
Database Indices
A database index is an internal set of all the objects or aggregates in the database derived from a particular class. These are used to quickly iterate around all objects of a class, by both the Query Processor and other areas. Having an index for every class would be prohibitively expensive in terms of memory, so only certain strategic base classes have an index. When calculating costs for an object or aggregate table, the nearest base class with an index is used, for example when scanning CPointAlgManual, the index for CDBPoint is used. The classes that have indices are:
- CAggregate
- CAlarmObject
- CDBObject
- CDBPoint
- CDBPointSource
- CDBPulse
- CDBServer
- CDBUser
- CFileObject
- CGroup
- CHisFiles
- CHistory
- CParam
- CResettingAccumulator
- CTemplate
- CTemplateInstance
- CVariable
Author
Link copied. Please paste this link to share this article on your social media post.