About Spectre Joins
Spectre supports join operations similar to those available in the Integrator language. The idea of a join is to match rows from a left and right flow and either discard or duplicate records. The join types match those of classical relational database logic. That is, the Spectre join types are as follows:
- Left Outer: All rows from the left flow are returned.
- For every row in the left table, Spectre finds the set of rows in the right table with matching values in the key fields. If there is no right match, nulls are used as the values for the columns for which the right table would otherwise supply values. When there is a match, the left row is combined with each of the matching rows on the right.
- Right Outer: All rows from the right flow are returned.
- The right join is like a left join, except that the engine is examining the rows of the right table for matches in the left table. That is, for every row in the right table, Spectre finds the set of rows in the left table with matching values in the key fields. If there is no left match, nulls are used as the values for the columns for which the left table would otherwise supply values. When there is a match, the right row is combined with each of the matching rows on the left.
- Inner Join: Only rows that have matching keys are returned.
- The inner join is identical to the left outer join, except that when there is no match between the left and right, no row is returned.
- Outer Join: All rows are returned in the output flow.
- The outer join follows a similar key matching pattern as the inner join, but every key value in either of the tables is represented. For the columns supplied by the table where the key value is absent, nulls are placed.
To understand how these operations fit together, consider that the following holds for all tables l, r:
left-outer-join(l, r) + right-outer-join(l, r) = outer-join(l, r) + inner-join(l, r)
- Unlike in a lookup, UNKNOWN is treated as a normal value in an input table and is not thought of as a special case.
- Like in a lookup, NULL is treated as a normal value in an input table and is not to be thought of as a special case.
- The row order of the output is unspecified. It is not guaranteed to be the same on successive runs of the same join operation.
- The output columns are ordered as follows, from left to right:
- Key columns in the order they are specified.
- Columns from the left table in the order they are specified.
- Columns from the right table in the order they are specified.
- Key columns are brought in automatically, in addition to any other columns specified. By their nature, the key columns brought in will never be filled in with nulls due to missing values.
- The output table's columns must all have unique names.
- If a user wants a version of a key column brought in with values from only one side, then that key column should be brought in as a column. Users should be careful to ensure that they choose names that will not result in duplicate column names in the output table.
- A join with no keys is permitted. All rows in the left table will match all rows in the right table, and so on. Note that when no keys are specified and there is at least one row in both the left and right table, all of the join types are equivalent.
For best performance: avoid an outer join when one of the other join types will do (for example, is known to produce identical or otherwise acceptable output). Outer joins are likely to be slower than other joins.
An example join follows:
join "left outer" { text-input "right.txt" column "LEFT" column "RIGHT" key "KeyColumn" "KeyColumn" "KeyColumn" }
The enclosing data source is considered to be the left table. The right table is specified by the data source within the join block (in this case, text-input "right.txt").
The left outer attribute specifies that this is a left join. Columns LEFT and RIGHT will be brought in. In order for this to work, only one of the tables should have a matching column for each name.
The key tag specifies the name of the key in the left table, the name of the key in the right table, and the desired name of the key column in the output table. If fewer than three names are provided, the last provided is duplicated to fill the other fields (that is, (x,,)->(x,x,x) and (x,y,)->(x,y,y)).