Datagen Input Object

The Integrator Datagen input object provides data generation capability within the Integrator. A Datagen object creates a given number of rows containing integer sequences and random data of various types.

Datagen Attributes

Attribute	Type	Description
process_type (required)	String	Identifies the object as a Datagen input object. The value of this string is "datagen".
count (required)	Integer	Defines the number of rows to generate.
seed	Integer	Defines the seed used to initialize the pseudo-random number generator. If the value is non-zero, the random data is generated from this seed, so the same seed regenerates the same data set on every run. If the value is 0 or the attribute is not set, the seed is based on the time the Data Integrator was run, so each script execution generates a new data set.
trace_after	Sub-object	Traces data flows leaving the specified object. This is equivalent to adding a Trace process object immediately after the current object. See Embedded Trace Object for more on using trace sub-objects.
gen_columns (required)	Array of Objects	Describes input columns for the generated data. Each sub-object accepts the attributes listed under Datagen Sub-Object Attributes.
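
The seed behavior described for the seed attribute can be illustrated with a minimal Python sketch. This is not the Integrator's implementation: the helper names make_rng and generate are hypothetical, and Python's random module stands in for whatever pseudo-random number generator the Integrator uses. It shows only the documented rule: a non-zero seed reproduces the same data, while 0 or an unset seed falls back to the clock.

```python
import random
import time


def make_rng(seed=0):
    """Return a generator seeded as the seed attribute describes.

    Hypothetical helper: a non-zero seed is used as given; 0 or None
    falls back to the current time, so each run differs.
    """
    if not seed:
        seed = time.time_ns()
    return random.Random(seed)


def generate(count, seed=0):
    """Generate `count` random integers between 0 and 100 inclusive."""
    rng = make_rng(seed)
    return [rng.randint(0, 100) for _ in range(count)]


first = generate(5, seed=42)
second = generate(5, seed=42)
print(first == second)  # True: the same non-zero seed regenerates the same rows
```

Calling generate(5) with no seed would be expected to give different rows on different runs, because the fallback seed comes from the clock.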

Datagen Sub-Object Attributes

Attribute	Type	Description
name	String	Defines the input column name.
type	String	Specifies the generating type for the column, which controls the contents and the randomness of the data. The type attribute can have the following values:
		city: A random American city or town.
		date: A random date in the standard DI date format ("YYYY/MM/DD"), chosen between the start date and end date, inclusive. The start and end attributes are described later in this section.
		decimal: A random number with two decimal places, chosen between the min and max attributes, inclusive.
		integer: A random integer, chosen between the min and max attributes, inclusive.
		firstname: A random first name, based on the 1000 most popular American first names.
		lastname: A random last name, based on the 1000 most popular American last names.
		month: A month string, for example January or February.
		name: A concatenation of a random first name and a random last name. Example: John Smith.
		sequence: An integer sequence running from the start attribute to the end attribute. If more rows are required after the sequence reaches the end value, the sequence starts again at the start value.
		state: A two-character US state code.
		strings: A string chosen from those listed in the strings array attribute.
		word: A random pronounceable word of a given length or range of lengths. The result is not necessarily an actual word.
		zip: A random number between 01000 and 99999, in 5-digit zip-code format. It does not necessarily correlate with generated city or state columns.
distribution	String	Controls the distribution of random values. It is valid for all generating types except sequence. Values include:
		linear: (default) Specifies an even linear distribution. Each of the possible values appears with about the same frequency.
		standard: Specifies a normal bell curve distribution, centered on the midpoint of the possible values, with a standard deviation of one quarter of the range. For example, a standard distribution of a random integer with min=0 and max=100 is a normal distribution centered on 50, with a standard deviation of 25. Over half the values will be between 25 and 75. Because the standard deviation is fixed at one quarter of the range, values more than two standard deviations from the midpoint do not fit inside the range and do not appear in the output. This results in higher tails than a normal distribution.
		80-20: Specifies a random distribution that corresponds to the "80-20" Pareto rule found in many data sets, where a small number of discrete values (10%-20%) account for 80%-90% of the total data. This distribution, valid for non-decimal data, orders all possible values by a random permutation and chooses from the ordered values using an exponential distribution. It requires about 8 bytes of memory per possible output value and is not recommended for selections of more than 50,000 possible values.
strings	Array of Strings	Contains an array of string values to be used by the strings generating type (see type attribute above).
start	Integer for sequence, String for date	If the generating type is "sequence", this is the start of the sequence, with a default of 1. If the generating type is "date", this is the start of the date range, with no default. Use the standard DI date format.
end	Integer for sequence, String for date	If the generating type is "sequence", this is the end of the sequence, with a default of the count of rows being generated. If the generating type is "date", this is the end of the date range, with a default of the current date when the Data Integrator is started. Use the standard DI date format.
max	Integer or Decimal	Indicates the highest possible value that can be generated.
min	Integer or Decimal	Indicates the lowest possible value that can be generated.
length	Integer	If the generating type is "word", defines the maximum length of the word. It defaults to 8.
min_length	Integer	If the generating type is "word", defines the minimum length of the word. It defaults to the value of length.
capitalize	Boolean	If this attribute is "true" and the generating type is "word", the generated word is capitalized. By default, the word is not capitalized.
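
The sequence, standard, 80-20, and word behaviors described in the table above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Integrator's code. All function names are hypothetical. Redrawing out-of-range values for the standard distribution, the exponential rate used for 80-20, and the alternating consonant and vowel rule for "pronounceable" words are guesses made to get a runnable example.

```python
import math
import random

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnprstvw"


def sequence(count, start=1, end=None):
    """Integers from start to end inclusive, restarting at start when exhausted."""
    if end is None:
        end = count  # documented default: the count of rows being generated
    span = end - start + 1
    if span <= 0:
        raise ValueError("end must not be less than start")
    return [start + (i % span) for i in range(count)]


def standard_int(rng, lo, hi):
    """Normal draw centered on the midpoint with sd = range / 4.

    Assumption: draws that land outside [lo, hi] are simply redrawn.
    """
    mid, sd = (lo + hi) / 2, (hi - lo) / 4
    while True:
        value = round(rng.gauss(mid, sd))
        if lo <= value <= hi:
            return value


def make_80_20(rng, values):
    """Return a picker that favors a random 20% of the possible values."""
    order = list(values)
    rng.shuffle(order)  # random permutation of all possible values
    # Assumed rate: chosen so that 80% of draws fall in the first 20% of positions.
    rate = math.log(5) / 0.2

    def pick():
        while True:
            index = int(rng.expovariate(rate) * len(order))
            if index < len(order):
                return order[index]

    return pick


def word(rng, length=8, min_length=None, capitalize=False):
    """Alternate consonants and vowels (an assumed notion of pronounceable)."""
    if min_length is None:
        min_length = length  # documented default: equal to length
    n = rng.randint(min_length, length)
    text = "".join(rng.choice(CONSONANTS if i % 2 == 0 else VOWELS) for i in range(n))
    return text.capitalize() if capitalize else text


rng = random.Random(2024)
print(sequence(7, start=1, end=3))  # [1, 2, 3, 1, 2, 3, 1]
print([standard_int(rng, 0, 100) for _ in range(5)])
pick = make_80_20(rng, range(100))
print([pick() for _ in range(5)])
print(word(rng, length=8, min_length=4, capitalize=True))
```

The 80-20 picker keeps the whole permuted list in memory, which is in line with the table's note that memory use grows with the number of possible output values.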