Squash Process Object

The Integrator Squash process object combines rows that are identical for a set of Dimension columns, and optionally summarizes numeric columns by calculating the total, minimum, maximum, average, or standard deviation. This object is the equivalent of a Builder squash or a SQL GROUP BY.

The user can also define other_columns, equivalent to Builder Info columns, which will be contained in the output data flow. The values for the other_columns should be the same for all the rows that have matching Dimension values; otherwise, Integrator will print out warning messages about the integrity of the other_columns.

By default, the Squash process object works by reading the input rows and combining them all in memory. Integrator must have enough memory to contain all the summarized output rows at the same time. If the adjacent attribute is "true", the Squash object only combines adjacent rows; in this case, it would only need to store a single row in memory. This will work on data that is correctly sorted for the given Dimensions. The autosort attribute can be used to perform this sort automatically. A squash can be done on large amounts of data by processing the data using the Sort process object, followed by a Squash process object.

The Squash process object only outputs the dimensions, summary, min_columns, max_columns, std_dev_columns, average_columns, other_columns, and count_column. Columns that are not listed in these arrays are not passed to the output.

Squash Attributes

Attribute Type Description
process_type
(required)
String Identifies the object as a Squash process object. The value of this string is "squash".
input
(required)
String Defines the object from which the data flow is arriving.
dimensions
(required)
Array of Strings Defines the input columns that are used to combine rows. Rows that have identical values for all the Dimension columns are combined. To squash on zero Dimensions, use this attribute with an empty value.
summary Array of Strings Defines the input columns that should be totalled. These columns are summed into the summarized output row. The values in the summary columns must be numeric or blank (null value). Integrator will issue warnings for non-numeric summary values. It will treat null values as 0 when combining them with non-null values, but will return a null value if all input rows for an output row contain null values.
other_columns Array of Strings Defines the input columns that are preserved and passed along as output from the Squash object. The values for the other_columns should be identical across input rows with the same combination of Dimensions; otherwise, Integrator issues a warning.
To allow for easy cut and paste from Builder Descriptions, the attribute can also be named "info", and characters following a ":" in a string will be ignored (since Build Descriptions list Info Fields as "info name":"dimension name"). See the description for the remove_other_columns attribute. This attribute is optional, but useful for keeping the data intact.
count_column String

Defines an output column that will contain the count of rows with identical dimension values that have been squashed into a single output row. This option is equivalent to totaling a column containing the number 1.

remove_other_columns Array of Strings Defines columns that should not be included in other_columns. If this attribute is present, then the other_columns are defined to be all input columns that are not listed in the dimensions, summary, min_columns, max_columns, std_dev_columns, average_columns, or remove_other_columns attributes. These remaining columns should be identical across input rows with the same combination of Dimensions, as described above for the other_columns attribute.
This attribute may be blank (an empty array of strings), in which case, all remaining columns are defined as other_columns.
adjacent Boolean Determines whether the squash is limited to rows that are adjacent. This attribute is optional. Values include:
  • true—The Squash object will only combine adjacent rows that have identical Dimension values. The squash can perform this operation by storing a single row in memory. It is assumed that the data is sorted.
  • false—A sort is performed and the data needs to fit into memory. (default)
warn_limit Numeric

Controls how many data integrity warnings are displayed. By default, this limit is one. After the limit is exceeded, Integrator will issue the following message:
"Further data integrity warnings suppressed for squash object 'xxx'. Set the warn_limit attribute to see more."

The warn_limit can be set to 0, but if it is, no warning information will be provided. This setting should be used with caution.

NOTE: This attribute is Warn Limit in Visual Integrator.

autosort Boolean

Determines whether the Squash object automatically sorts the input flow according to the squash Dimensions, which is needed to squash adjacent rows. It does this by creating an anonymous Sort object with sort columns matching the squash Dimensions, thus saving the script writer the effort of writing a Sort object.
This optional attribute is only recognized if the adjacent attribute is also "true". The default value is "false". Note that the sort uses default values for Sort attributes such as temp_directory and sort_size. If these need to be set, then an explicit Sort object should be used. Large data sets (data that does not fit into the memory of the machine) are best sorted before hand with the Sort object.

NOTE: This attribute is Auto Sort in Visual Integrator.

average_columns Array of Strings Defines the input columns that will be combined by calculating the average value of the column for the rows being squashed. The resulting column will be named
"Average <name>"
where <name> is the corresponding input column.
Values in the average columns must be numeric or blank (null value). The Data Integrator will issue warnings for non-numeric values. It will treat null values as no data, which will not affect the resulting average. If all input rows for an output row contain null values, the resulting average will be blank.
max_columns Array of Strings Defines the input columns that will be combined by calculating the maximum value of the column for the rows being squashed. The resulting column will be named
"Max <name>"
where <name> is the corresponding input column.
Values in the maximum columns must be numeric or blank (null value). Integrator will issue warnings for non-numeric values. It will treat null values as no data, which will not effect the resulting value. If all input rows for an output row contain null values, then the resulting maximum value will be blank.
min_columns Array of Strings Defines the input columns that will be combined by calculating the minimum value of the column for the rows being squashed. The resulting column will be named
"Min <name>"
where <name> is the corresponding input column.
Values in the minimum columns must be numeric or blank (null value). Integrator will issue warnings for non-numeric values. It will treat null values as no data, which will not effect the resulting value. If all input rows for an output row contain null values, then the resulting minimum value will be blank.
std_dev_columns Array of Strings

Defines the input columns that will be combined by calculating the standard deviation of the column for the rows being squashed. The resulting column will be named
"Std Dev <name>"
where <name> is the corresponding input column.
Values in the standard deviation columns must be numeric or blank (null value). Integrator will issue warnings for non-numeric values. It will treat null values as no data, which will not affect the standard deviation. If all input rows for an output row contain null values, then the resulting standard deviation will be blank. Note that the standard deviation of a single value is defined to be null.

NOTE: This attribute is StdDev in Visual Integrator.

trace_after Sub-object Traces data flows leaving the specified object, which makes debugging scripts easier. This is equivalent to adding a Trace process object immediately after the current object. See Embedded Trace Object for more on using trace sub-objects.
trace_before Sub-object Traces data flows entering the specified object. This is equivalent to adding a Trace process object immediately before the current object. See Embedded Trace Object for more on using trace sub-objects.