Squash Process Object
The Integrator Squash process object combines rows that are identical for a set of Dimension columns, and optionally summarizes numeric columns by calculating the total, minimum, maximum, average, or standard deviation. This object is the equivalent of a Builder squash or a SQL GROUP BY.
The user can also define other_columns, equivalent to Builder Info columns, which will be contained in the output data flow. The values for the other_columns should be the same for all the rows that have matching Dimension values; otherwise, Integrator will print out warning messages about the integrity of the other_columns.
By default, the Squash process object works by reading the input rows and combining them all in memory. Integrator must have enough memory to contain all the summarized output rows at the same time. If the adjacent attribute is "true", the Squash object only combines adjacent rows; in this case, it would only need to store a single row in memory. This will work on data that is correctly sorted for the given Dimensions. The autosort attribute can be used to perform this sort automatically. A squash can be done on large amounts of data by processing the data using the Sort process object, followed by a Squash process object.
The Squash process object only outputs the dimensions, summary, min_columns, max_columns, std_dev_columns, average_columns, other_columns, and count_column. Columns that are not listed in these arrays are not passed to the output.
Squash Attributes
| Attribute | Type | Description |
|---|---|---|
| process_type (required) |
String | Identifies the object as a Squash process object. The value of this string is "squash". |
| input (required) |
String | Defines the object from which the data flow is arriving. |
| dimensions (required) |
Array of Strings | Defines the input columns that are used to combine rows. Rows that have identical values for all the Dimension columns are combined. To squash on zero Dimensions, use this attribute with an empty value. |
| summary | Array of Strings | Defines the input columns that should be totalled. These columns are summed into the summarized output row. The values in the summary columns must be numeric or blank (null value). Integrator will issue warnings for non-numeric summary values. It will treat null values as 0 when combining them with non-null values, but will return a null value if all input rows for an output row contain null values. |
| other_columns | Array of Strings | Defines the input columns that are preserved and passed along as output from the Squash object.
The values for the other_columns should be identical across input rows with the same
combination of Dimensions; otherwise, Integrator issues a warning. To allow for easy cut and paste from Builder Descriptions, the attribute can also be named "info", and characters following a ":" in a string will be ignored (since Build Descriptions list Info Fields as "info name":"dimension name"). See the description for the remove_other_columns attribute. This attribute is optional, but useful for keeping the data intact. |
| count_column | String |
Defines an output column that will contain the count of rows with identical dimension values that have been squashed into a single output row. This option is equivalent to totaling a column containing the number 1. |
| remove_other_columns | Array of Strings | Defines columns that should not be included in other_columns. If this attribute is present, then
the other_columns are defined to be all input columns that are not listed in the dimensions,
summary, min_columns, max_columns, std_dev_columns, average_columns, or
remove_other_columns attributes. These remaining columns should be identical across input
rows with the same combination of Dimensions, as described above for the other_columns attribute. This attribute may be blank (an empty array of strings), in which case, all remaining columns are defined as other_columns. |
| adjacent | Boolean | Determines whether the squash is limited to rows that are adjacent. This attribute is optional.
Values include:
|
| warn_limit | Numeric |
Controls how many data integrity warnings are displayed. By default, this
limit is one. After the limit is exceeded, Integrator will issue the following message: The warn_limit can be set to 0, but if it is, no warning information will be provided. This setting should be used with caution. NOTE: This attribute is Warn Limit in Visual Integrator. |
| autosort | Boolean |
Determines whether the Squash object automatically sorts the input flow
according to the squash Dimensions, which is needed to squash adjacent rows. It does this by
creating an anonymous Sort object with sort columns matching the squash Dimensions, thus
saving the script writer the effort of writing a Sort object. NOTE: This attribute is Auto Sort in Visual Integrator. |
| average_columns | Array of Strings | Defines the input columns that will be combined by calculating the average value of the column
for the rows being squashed. The resulting column will be named "Average <name>" where <name> is the corresponding input column. Values in the average columns must be numeric or blank (null value). The Data Integrator will issue warnings for non-numeric values. It will treat null values as no data, which will not affect the resulting average. If all input rows for an output row contain null values, the resulting average will be blank. |
| max_columns | Array of Strings | Defines the input columns that will be combined by calculating the maximum value of the column
for the rows being squashed. The resulting column will be named "Max <name>" where <name> is the corresponding input column. Values in the maximum columns must be numeric or blank (null value). Integrator will issue warnings for non-numeric values. It will treat null values as no data, which will not effect the resulting value. If all input rows for an output row contain null values, then the resulting maximum value will be blank. |
| min_columns | Array of Strings | Defines the input columns that will be combined by calculating the minimum value of the column
for the rows being squashed. The resulting column will be named "Min <name>" where <name> is the corresponding input column. Values in the minimum columns must be numeric or blank (null value). Integrator will issue warnings for non-numeric values. It will treat null values as no data, which will not effect the resulting value. If all input rows for an output row contain null values, then the resulting minimum value will be blank. |
| std_dev_columns | Array of Strings |
Defines the input columns that will be combined by calculating the standard deviation of the
column for the rows being squashed. The resulting column will be named NOTE: This attribute is StdDev in Visual Integrator. |
| trace_after | Sub-object | Traces data flows leaving the specified object, which makes debugging scripts easier. This is equivalent to adding a Trace process object immediately after the current object. See Embedded Trace Object for more on using trace sub-objects. |
| trace_before | Sub-object | Traces data flows entering the specified object. This is equivalent to adding a Trace process object immediately before the current object. See Embedded Trace Object for more on using trace sub-objects. |