Datagen Input Object

The Integrator Datagen input object provides data generation capability within the Integrator. A Datagen object creates a given number of rows containing integer sequences and random data of various types.

Datagen Attributes

Attribute Type Description
process_type (required) String Identifies the object as a Datagen input object. The value of this string is "datagen".
count (required) Integer Defines the number of rows to generate.
seed Integer Defines the seed used to initialize the pseudo-random number generator. If the seed is non-zero, the random data is generated from it, so running the script again with the same seed regenerates the same data set. If the seed is 0 or not set (the default), the seed is based on the time the Data Integrator was run, so each script execution generates a new data set.
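The seed rule can be illustrated with a short Python sketch. It is not the Integrator's implementation: make_rng is a hypothetical helper, Python's random.Random stands in for whatever generator the Integrator uses, and the nanosecond clock is an assumed choice of time-based seed.

```python
import random
import time


def make_rng(seed=0):
    """Return a generator seeded the way the seed attribute describes.

    A non-zero seed gives a repeatable stream. A seed of 0 (or no seed)
    falls back to a time-based seed, so each run differs.
    """
    if seed:
        return random.Random(seed)
    # Assumption: use the current time in nanoseconds as the seed.
    return random.Random(time.time_ns())


# Two "runs" with the same non-zero seed produce the same values.
run_a = make_rng(42)
run_b = make_rng(42)
values_a = [run_a.randint(0, 100) for _ in range(5)]
values_b = [run_b.randint(0, 100) for _ in range(5)]
print(values_a == values_b)  # True
```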
trace_after Sub-object Traces data flows leaving the specified object. This is equivalent to adding a Trace process object immediately after the current object. See Embedded Trace Object for more on using trace sub-objects.

gen_columns (required) Array of Objects Describes input columns for the generated data. Each sub-object can have a variety of attributes as described in the following table:

Datagen Sub-Object Attributes

Attribute Type Description
name String Defines the input column name.
type String Specifies the generating type for the column. It controls the contents and the randomness of data. The type attribute can have the following values:
  • city—A random American city/town.
  • date—A random date in the standard DI date format ("YYYY/MM/DD"), chosen between the start date and the end date, inclusive. The start and end attributes are described later in this section.
  • decimal—A random number with two decimal places, chosen between the values of the min and max attributes, inclusive.
  • integer—A random integer, chosen between the min and max attributes inclusively.
  • firstname—A random first name, based on the 1000 most popular American first names.
  • lastname—A random last name, based on the 1000 most popular American last names.
  • month—A month name, such as January or February.
  • name—A concatenation of a random first name and a random last name. Example: John Smith.
  • sequence—An integer sequence running from the value of the start attribute to the value of the end attribute. If more rows are required after the sequence reaches the end value, the sequence starts again at the start value.
  • state—A two-character US state code.
  • strings—A string, chosen from the strings listed in the strings array attribute.
  • word—A random pronounceable word of a given length or range of lengths. This may not produce an actual word.
  • zip—A random number between 01000 and 99999, in 5-digit zip-code format. It does not necessarily correlate with generated city and/or state columns (see above).
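The wrap-around behavior of the sequence type can be sketched in Python as follows. This only illustrates the rule stated above; sequence_value is a hypothetical helper, not part of the Integrator, and it assumes rows are numbered from 0.

```python
def sequence_value(row_index, start, end):
    """Value of a sequence column for a 0-based row index.

    Counts from start to end (inclusive), then starts again at start.
    """
    span = end - start + 1  # number of distinct values in the sequence
    return start + (row_index % span)


# Seven rows from a sequence with start=1 and end=3.
values = [sequence_value(i, 1, 3) for i in range(7)]
print(values)  # [1, 2, 3, 1, 2, 3, 1]
```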
distribution String Controls distribution of random values, and is valid for all generating types except sequence. Values include:
  • linear—Specifies an even linear distribution. Each of the possible values will appear with about the same frequency. (default)
  • standard—Specifies a normal bell curve distribution, centered about the midpoint of possible values, with a standard deviation of a quarter of the range. For example, a standard distribution of a random integer with min=0 and max=100 will be a normal distribution centered about 50, with a standard deviation of 25. Over half the values will be between 25 and 75.

    Because the standard deviation is fixed at one quarter of the range, standard distribution values that fall more than two standard deviations from the midpoint do not fit inside the range and do not appear in the output. This results in higher tails than normal.

  • 80-20—Specifies a random distribution that corresponds to the "80-20" Pareto rule found in many data sets, where a small number of discrete values (10%-20%) account for 80%-90% of the total data. This distribution, which is valid for non-decimal data, orders all possible values by a random permutation and then chooses among the ordered values using an exponential distribution. This results in an 80-20 distribution, and requires memory corresponding to 8 bytes per possible output value. It is not recommended for selections of more than 50,000 possible values.
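The standard and 80-20 distributions can be approximated in Python as shown below. This is a rough sketch under stated assumptions, not the Integrator's algorithm: standard_int and make_80_20 are hypothetical names, out-of-range values are assumed to be redrawn (the text above says only that they do not appear in the output), and the exponential rate is an assumed calibration that places about 80% of the draws on the first 20% of the shuffled values.

```python
import math
import random


def standard_int(rng, lo, hi):
    """Random integer from a bell curve centered on the midpoint of lo..hi."""
    mid = (lo + hi) / 2
    sd = (hi - lo) / 4  # standard deviation is a quarter of the range
    while True:
        value = round(rng.gauss(mid, sd))
        if lo <= value <= hi:  # assumption: redraw anything out of range
            return value


def make_80_20(rng, values, top_share=0.2, mass=0.8):
    """Return a function that picks from values with an 80-20 skew."""
    order = list(values)
    rng.shuffle(order)  # random permutation of all possible values
    n = len(order)
    # Choose the rate so that P(index < top_share * n) equals mass.
    rate = -math.log(1 - mass) / (top_share * n)

    def pick():
        while True:
            index = int(rng.expovariate(rate))
            if index < n:  # redraw the rare index past the end of the list
                return order[index]

    return pick


rng = random.Random(1)
print([standard_int(rng, 0, 100) for _ in range(5)])
pick = make_80_20(rng, range(1000))
print([pick() for _ in range(5)])
```

The shuffled list holds one entry per possible value, which is consistent with the memory cost that the 80-20 description attaches to large value sets.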

strings Array of Strings Contains an array of string values to be used by the strings generating type (see type attribute above).
start Integer for sequence, String for date If the generating type is "sequence", this is the start of the sequence, with a default of 1. If the generating type is "date", this is the start of the date range, with no default. Use the standard DI date format.
end Integer for sequence, String for date If the generating type is "sequence", this is the end of the sequence, with a default of the count of rows being generated. If the generating type is "date", this is the end of the date range, with a default of the current date when the Data Integrator is started. Use the standard DI date format.
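For the date generating type, choosing a date between start and end inclusive can be sketched in Python as follows. The helper name random_date is hypothetical, and the sketch assumes the "YYYY/MM/DD" format named in the type list and an even (linear) distribution.

```python
import random
from datetime import datetime, timedelta

DI_DATE_FORMAT = "%Y/%m/%d"  # "YYYY/MM/DD"


def random_date(rng, start, end):
    """Return a random date string between start and end, inclusive."""
    first = datetime.strptime(start, DI_DATE_FORMAT).date()
    last = datetime.strptime(end, DI_DATE_FORMAT).date()
    # randint includes both endpoints, so start and end can both be chosen.
    offset = rng.randint(0, (last - first).days)
    return (first + timedelta(days=offset)).strftime(DI_DATE_FORMAT)


print(random_date(random.Random(7), "2020/01/01", "2020/01/31"))
```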
max Integer or Decimal Indicates the highest possible value that can be generated.
min Integer or Decimal Indicates the lowest possible value that can be generated.
length Integer Determines the maximum length of a word. It defaults to 8.
min_length Integer If the generating type is "word", defines the minimum length of the word. It defaults to the value of length.
capitalize Boolean If this attribute is "true" and the generating type is "word", the generated word is capitalized. By default, the word is not capitalized.
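To show how the attributes fit together, the following Python sketch models a Datagen object as a dictionary and checks it against the rules in the tables above. This section does not show the Integrator's script syntax, so the dictionary is only a stand-in for it. The column names and values are made up, and check_datagen is a hypothetical helper, not an Integrator function.

```python
VALID_TYPES = {
    "city", "date", "decimal", "integer", "firstname", "lastname", "month",
    "name", "sequence", "state", "strings", "word", "zip",
}
VALID_DISTRIBUTIONS = {"linear", "standard", "80-20"}

# Illustrative object only; all names and values are invented.
datagen = {
    "process_type": "datagen",
    "count": 1000,
    "seed": 42,
    "gen_columns": [
        {"name": "id", "type": "sequence", "start": 1},
        {"name": "age", "type": "integer", "min": 18, "max": 90,
         "distribution": "standard"},
        {"name": "signup", "type": "date",
         "start": "2020/01/01", "end": "2020/12/31"},
        {"name": "plan", "type": "strings", "strings": ["free", "pro"],
         "distribution": "80-20"},
        {"name": "code", "type": "word", "length": 8, "min_length": 4,
         "capitalize": True},
    ],
}


def check_datagen(obj):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    if obj.get("process_type") != "datagen":
        problems.append('process_type must be "datagen"')
    if not isinstance(obj.get("count"), int):
        problems.append("count is required and must be an integer")
    columns = obj.get("gen_columns")
    if not isinstance(columns, list):
        problems.append("gen_columns is required and must be an array")
        columns = []
    for col in columns:
        name = col.get("name", "?")
        col_type = col.get("type")
        if col_type not in VALID_TYPES:
            problems.append(f"{name}: unknown type {col_type!r}")
        if "distribution" in col:
            dist = col["distribution"]
            if dist not in VALID_DISTRIBUTIONS:
                problems.append(f"{name}: unknown distribution {dist!r}")
            elif col_type == "sequence":
                problems.append(f"{name}: distribution is not valid for sequence")
            elif dist == "80-20" and col_type == "decimal":
                problems.append(f"{name}: 80-20 is for non-decimal data")
    return problems


print(check_datagen(datagen))  # []
```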