File Settings

When you generate file-based results, you can configure the filename, storage format, compression, number of files, and publishing actions in the right-hand panel.

Note

By default, when scheduled or API jobs are executed, the write settings objects for file-based outputs are not validated. Issues with these objects may cause failures during the transformation or publishing stages of job execution. Jobs of these types should be tested through the Trifacta Application first. A workspace administrator can enable these validations for scheduled and API jobs.

Figure: Output File Settings

Configure the following settings.

  1. Create a new file: Enter the filename to create. A filename extension is automatically added for you, so you should omit the extension from the filename.

    1. File output paths can have a maximum length of 2048 characters.

  2. Output directory: Read-only value for the current directory. To change it, navigate to the proper directory.

    Note

During job execution, a canary file is written for each set of results to validate the path. For datasets with parameters, if the path includes folder-level parameterization, a separate folder is created for each parameterized path. During cleanup, only the canary files and the original folder path are removed; the parameterized folders are not. This is a known issue.

  3. Data Storage Format: Select the output format you want to generate for the job.

    1. Avro: This open source format is used widely for data serialization and data exchange between systems.

    2. CSV and JSON: These formats are supported for all types of imported datasets and all running environments.

      Note

JSON-formatted files that are generated by Designer Cloud Powered by Trifacta Enterprise Edition are rendered in JSON Lines format, a variant of JSON in which each record is written as a single line. See the example after this list. For more information, see http://jsonlines.org.

    3. Parquet: This open source format provides columnar storage.

    4. HYPER: Choose HYPER to generate results that can be imported into Tableau.

      If you have created a Tableau Server connection, you can write results to Tableau Server or publish them after they have been generated in Hyper format.

      Note

      If you encounter errors generating results in Hyper format, additional configuration may be required. See Supported File Formats.

    5. For more information, see Supported File Formats.
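
    For illustration, here is what JSON Lines output looks like, shown with a minimal Python sketch using the standard json module. This mirrors the layout only; it is not how the product itself writes the file.

      import json

      records = [
          {"id": 1, "name": "alpha"},
          {"id": 2, "name": "beta"},
      ]

      # JSON Lines: one complete JSON object per line, no enclosing array.
      with open("output.json", "w") as f:
          for record in records:
              f.write(json.dumps(record) + "\n")

      # Resulting file contents:
      # {"id": 1, "name": "alpha"}
      # {"id": 2, "name": "beta"}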

  4. Publishing action: Select one of the following (a sketch illustrating these actions follows the list):

    Note

    If multiple jobs attempt to publish to the same filename, a numeric suffix (_N) is appended to subsequent filenames (e.g., filename_1.csv).

    Note

    If a single user executes two jobs with the same output settings except for different methods (e.g. create vs. replace) on the same output destination, the generated results and potential error conditions are unpredictable. Please wait for the first job to complete execution before changing the configuration for the second job.

    1. Create new file every run: For each job run with the selected publishing destination, a new file is created with the same base name with the job number appended to it (e.g. myOutput_2.csv, myOutput_3.csv, and so on).

    2. Append to this file every run: For each job run with the selected publishing destination, results are appended to the same file, which grows until it is purged or trimmed.

      Note

      The append action is not supported when publishing to S3.

      Note

      When publishing single files to WASB, the append action is not supported.

      Note

      When appending data into a Hive table, the columns displayed in the Transformer page must match the order and data type of the columns in the Hive table.

      Note

      Compression of published files is not supported for an append action.

    3. Replace this file every run: For each job run with the selected publishing destination, the existing file is overwritten by the contents of the new results.
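
    The three publishing actions behave much like familiar file-open semantics. The following is a minimal local-filesystem sketch with a hypothetical publish helper; the product's actual distributed writes and job-number suffixing are more involved.

      import os

      def publish(path, data, action):
          # Illustrative approximation of the three publishing actions.
          if action == "create":
              # Create new file every run: never overwrite; add a numeric
              # suffix (_N) before the extension if the name is taken.
              base, ext = os.path.splitext(path)
              candidate, n = path, 1
              while os.path.exists(candidate):
                  candidate = f"{base}_{n}{ext}"
                  n += 1
              path, mode = candidate, "x"
          elif action == "append":
              mode = "a"  # Append to this file every run: the file grows.
          else:
              mode = "w"  # Replace this file every run: contents are overwritten.
          with open(path, mode) as f:
              f.write(data)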

  5. More Options: Several of these options apply to CSV outputs and are illustrated in the sketch after this list.

    1. Include headers as first row on creation: For CSV outputs, you can choose to include the column headers as the first row in the output. For other formats, these headers are included automatically.

      Note

      Headers cannot be applied to compressed outputs.

    2. Include quotes: For CSV outputs, you can choose to include double quote marks around all values, including headers.

    3. Include mismatched values: For CSV outputs, you can choose to include any value that is mismatched for its column data type. When disabled, mismatched values are written as null values.

    4. Delimiter: For CSV outputs, you can enter the delimiter that is used to separate fields in the output. The default value is the global delimiter, which you can override on a per-job basis in this field.

      Tip

      If needed for your job, you can enter Unicode characters in the following format: \uXXXX. For example, \u0009 specifies a tab delimiter.

      Note

      The Spark running environment does not support use of multi-character delimiters for CSV outputs. You can switch your job to a different running environment or use single-character delimiters. For more information on this issue, see https://issues.apache.org/jira/browse/SPARK-24540.

    5. Single File: Output is written to a single file. This is the default setting for smaller file-based jobs.

    6. Multiple Files: Output is written to multiple files. This is the default setting for larger file-based jobs.
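
    Here is a minimal sketch of how the CSV options above interact, using Python's standard csv module. The settings shown are illustrative stand-ins for the UI options, not the product's own writer.

      import csv

      rows = [["id", "name"], [1, "alpha, inc."], [2, "beta"]]

      with open("output.csv", "w", newline="") as f:
          writer = csv.writer(
              f,
              delimiter=",",          # Delimiter: e.g. "\u0009" for a tab character
              quoting=csv.QUOTE_ALL,  # Include quotes: wrap every value in double quotes
          )
          writer.writerows(rows)      # the first row acts as the header row

      # Resulting file contents:
      # "id","name"
      # "1","alpha, inc."
      # "2","beta"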

  6. Compression: For text-based outputs, compression can be applied to significantly reduce the size of the output. Select a compression format for each output format that you want to compress.

    Note

    If you encounter errors generating results using Snappy, additional configuration may be required. See Supported File Formats.
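
    As an illustration of the effect, the following sketch gzip-compresses an existing CSV output with Python's standard library. The product applies its own compression during publishing; this simply shows why text formats shrink well.

      import gzip
      import shutil

      # Compress an existing CSV output; repetitive text compresses well.
      with open("output.csv", "rb") as src, gzip.open("output.csv.gz", "wb") as dst:
          shutil.copyfileobj(src, dst)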

  7. To save the publishing action, click Add.