Snowflake Running Environment
Snowflake provides cloud-based data storage and analytics as a service. Among other infrastructures, Snowflake runs on Amazon S3. If all of your source datasets and outputs are in Snowflake locations and other conditions are met, then the entire execution of the transformations can occur in Snowflake.
Transferring the execution steps from the Trifacta node to Snowflake yields the following benefits:
A minimum of data (recipe steps and associated metadata) is transferred between systems. Everything else remains in Snowflake.
Recipe steps are converted into SQL that is understandable and native to Snowflake. Execution times are much faster.
Depending on your environment, total cost of executing the job may be lower in Snowflake.
In this scenario, the recipe steps are converted to SQL, which is executed sequentially against your source data in temporary tables, from which the results that you have defined for your output are written.
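For illustration, the following is a hedged sketch of what the generated SQL might look like for a simple recipe; the table, column, and output names are hypothetical, and the actual SQL is generated by the Trifacta Application.

```sql
-- Hypothetical sketch only: table, column, and step names are assumptions.
-- Recipe steps are applied to the source data through a temporary table...
CREATE TEMPORARY TABLE "TMP_RECIPE_RESULT" AS
SELECT
  "ORDER_ID",
  "ORDER_DATE",
  UPPER("REGION") AS "REGION"          -- e.g., a case-standardization step
FROM "SALES_DB"."PUBLIC"."ORDERS"
WHERE "ORDER_DATE" >= '2021-01-01';    -- e.g., a row-filter step

-- ...and the output that you defined is written from those temporary results.
CREATE OR REPLACE TABLE "SALES_DB"."PUBLIC"."ORDERS_CLEANED" AS
SELECT * FROM "TMP_RECIPE_RESULT";
```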
Tip
When running a job in Snowflake, your data never leaves Snowflake.
Tip
Execution on datasets created with custom SQL is supported.
If the requirements and limitations are met, the Trifacta Application automatically executes the job in Snowflake.
Requirements
General
This feature must be enabled by the workspace admin. See below.
Trifacta Application must be integrated with Snowflake. See Snowflake Connections.
The permission to execute jobs in Snowflake must be enabled.
All sources and outputs must reside in Snowflake.
Spark + Snowflake must be selected as the running environment. See Run Job Page.
Jobs are executed in the virtual warehouse that is specified as part of the Snowflake connection.
Note
Job execution requires significantly more resources than ingest or publish jobs on Snowflake. Before you begin using Snowflake as a running environment, you should verify that your Snowflake virtual warehouse has sufficient resources to handle the expected load; a sketch of checking and resizing a warehouse follows this list. For more information, see Snowflake Connections.
In your flow, you must enable all general and Snowflake-specific flow optimizations. When all of these optimizations are enabled, the job can be pushed down to Snowflake for execution. See "Flow Optimizations" below.
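The following is a minimal sketch of how an administrator might review and resize a warehouse in Snowflake before running transformation jobs. The warehouse name is a placeholder; use the warehouse specified in your Snowflake connection, and choose a size appropriate to your workload.

```sql
-- Hypothetical warehouse name; substitute the warehouse used by your Snowflake connection.
SHOW WAREHOUSES LIKE 'TRIFACTA_WH';

-- Increase the size if transformation jobs need more resources than ingest or publish jobs.
ALTER WAREHOUSE "TRIFACTA_WH" SET WAREHOUSE_SIZE = 'MEDIUM';
```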
For Designer Cloud Enterprise Edition
For customer-managed deployments, the following additional requirements apply:
S3: Base storage layer must be S3. See Set Base Storage Layer.
AWS running environment: The Trifacta node must be integrated with a running environment that is compatible with AWS. For more information, see Configure for EMR or Configure for AWS Databricks.
Requirements across multiple Snowflake connections
If you are executing a job on Snowflake that utilizes multiple connections, the following requirements must also be met for execution of the job on Snowflake:
All Snowflake connections used in the job must connect to the same Snowflake account.
All Snowflake connections used in the job must be backed by the same Snowflake primary role. For more information, see https://docs.snowflake.com/en/user-guide/security-access-control-overview.html#enforcement-model-the-primary-role-and-secondary-roles.
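To check whether two connections satisfy these requirements, you can run Snowflake's standard context functions through each connection and compare the results; this is a minimal sketch, not a product feature.

```sql
-- Returns the account identifier and the primary role for the current session.
SELECT CURRENT_ACCOUNT(), CURRENT_ROLE();
```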
Limitations
Snowflake as a running environment requires that pushdowns be enabled for the workspace and for the specific flow for which the job is executed. If the flow and the workspace are properly configured, the job is automatically executed in Snowflake.
Note
Snowflake is not a running environment that you explicitly select or specify as part of a job. If all of the requirements are met, then the job is executed in Snowflake when you select the Spark + Snowflake running environment.
All datasources and all outputs specified in a job must be located within Snowflake.
All recipe steps, including all Wrangle functions in the recipe, must be translatable to SQL.
Note
When attempting to execute a job in Snowflake, the Trifacta Application executes the recipe steps in Snowflake until it reaches a step that cannot be executed there. At that point, data is transferred to EMR, where the remainder of the job is executed.
If the schemas have changed for your datasets, pushdown execution on Snowflake is not supported. Designer Cloud Powered by Trifacta Enterprise Edition falls back to submitting the job through another running environment.
Some transformations and functions are not currently supported for execution in Snowflake. See below.
Sampling jobs are not supported for execution in Snowflake.
If your recipe includes data quality rules, the job cannot be fully executed in Snowflake.
Visual profiling is supported with the following conditions or requirements.
Visual profiles are unloaded to a stage in an S3 bucket.
If a stage is named in the connection, it is used. This stage must point to the default S3 bucket in use.
If no stage is named, a temporary stage is created in the PUBLIC schema. The connecting user must have write access to PUBLIC.
Note
Creating a temporary stage requires temporary credentials from AWS. These credentials are valid for 1 hour only. If a job is expected to run longer than one hour, you should define a named stage.
For more information, see Snowflake Connections.
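If you want to avoid the one-hour limit on temporary credentials, you can define a named stage in Snowflake and reference it in the connection. The following is a minimal sketch; the stage name, schema, bucket path, and credential placeholders are assumptions, and the URL must point to the default S3 bucket in use.

```sql
-- Hypothetical names and placeholders; the stage must point to the default S3 bucket.
CREATE STAGE "ANALYTICS"."PUBLIC"."TRIFACTA_PROFILE_STAGE"
  URL = 's3://my-default-trifacta-bucket/visual-profiles/'
  CREDENTIALS = (AWS_KEY_ID = '<aws_access_key_id>' AWS_SECRET_KEY = '<aws_secret_access_key>');
```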
Enable
Workspace Settings
The following setting must be enabled in the workspace. Select User menu > Admin console > Workspace settings.
Optimization | Description |
---|---|
Logical and physical optimization of jobs | When enabled, the Trifacta Application attempts to optimize job execution through logical optimizations of your recipe and physical optimizations of your recipe's interactions with data. |
For more information, see Workspace Settings Page.
Flow Optimizations
You must enable the Snowflake optimizations in your flow. In Flow View, select More menu > Optimization settings.
Note
All general optimizations must be enabled for your flow, as well as the following optimizations, which are specific to Snowflake.
Optimization | Description |
---|---|
Snowflake > Column pruning from source | When enabled, job execution performance is improved by removing any unused or redundant columns from the queries against the source database. |
Snowflake > Filter pushdown | When this setting is enabled, the Trifacta Application optimizes job performance on this flow by pushing data filters directly to the source database. |
Snowflake > Full pushdown | When this setting is enabled, all supported pushdown operations, including full transformation and profiling job execution, are pushed down to Snowflake, where possible. |
For more information, see Flow Optimization Settings Dialog.
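As a concrete illustration of the first two optimizations, the generated source query reads only the columns that the recipe references and applies the recipe's row filters directly in Snowflake. The names below are hypothetical; the actual queries are generated by the Trifacta Application.

```sql
-- Column pruning: only the referenced columns are selected.
-- Filter pushdown: the recipe's row filter becomes a WHERE clause at the source.
SELECT "ORDER_ID", "REGION"
FROM "SALES_DB"."PUBLIC"."ORDERS"
WHERE "REGION" = 'EMEA';
```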
Run Job
To execute a job in Snowflake in the Designer Cloud application:
Your job must meet the requirements listed above.
Your job must not include the functions, transformations, or other unsupported elements that are listed below.
You must select Spark + Snowflake as your running environment in the Run Job page.
Note
If this running environment option does not appear in the Run Job page, then either the required optimization settings have not been enabled for the workspace or the flow (see above), or the data or recipes do not meet the criteria for execution.
See Run Job Page.
Tip
After launching the job, you can monitor job execution through the Job Details page, which includes a link to the corresponding job in the Snowflake console.
Unsupported Wrangle for Snowflake Execution
The following transformations and functions are not currently supported for execution in Snowflake.
Note
If your recipe contains any of the following transformations or functions, full job execution in Snowflake is not possible at this time. These transformations and functions are expected to be supported, and removed from this list, in future releases.
General limitations
For more information on limitations on specific push-downs, see Flow Optimization Settings Dialog.
Unsupported input data types
The following Snowflake data types are not supported for input into Designer Cloud Powered by Trifacta Enterprise Edition:
BINARY
VARBINARY
GEOGRAPHY
Unsupported Alteryx data types
None.
Unsupported transformations
The following Wrangle transformations are not currently supported for execution in Snowflake.
Standardize
Unsupported functions
The following Wrangle functions are not currently supported for execution in Snowflake.
KTHLARGEST
KTHLARGESTIF
KTHLARGESTUNIQUE
KTHLARGESTUNIQUEIF
APPROXIMATEMEDIAN
APPROXIMATEPERCENTILE
APPROXIMATEQUARTILE
QUARTILE
For more information, see Aggregate Functions.
LCM
NUMVALUE
Partially supported:
NUMFORMAT: Only supported when used for rounding.
For more information, see Math Functions.
NETWORKDAYS
NETWORKDAYSINTL
WORKDAY
WORKDAYINTL
KTHLARGESTDATE
KTHLARGESTUNIQUEDATE
KTHLARGESTUNIQUEDATEIF
KTHLARGESTDATEIF
EOMONTH
SERIALNUMBER
DOUBLEMETAPHONEEQUALS
TRANSLITERATE
For more information, see String Functions.
Partially supported:
IFMISSING
Note
When the IFMISSING function immediately follows the PREV function in your recipe steps, Snowflake generates an incorrect value. This is a known issue and will be fixed in a future Snowflake release.
SESSION
For more information, see Window Functions.
Verify Execution
To verify execution in Snowflake, please do the following:
Steps:
In the left nav bar, click the Jobs link.
In the Jobs page, select the job that you executed.
In the Overview tab, the value for Environment under the Execution summary should be Snowflake.
For more information, see Job Details Page.
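You can also confirm from the Snowflake side that the queries ran there, for example by inspecting query history for the connection's warehouse. The warehouse name below is a placeholder; this is a minimal sketch using Snowflake's standard QUERY_HISTORY_BY_WAREHOUSE table function.

```sql
-- Hypothetical warehouse name; lists recent queries executed in that warehouse.
SELECT QUERY_TEXT, START_TIME, TOTAL_ELAPSED_TIME
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_WAREHOUSE(
  WAREHOUSE_NAME => 'TRIFACTA_WH',
  RESULT_LIMIT   => 50))
ORDER BY START_TIME DESC;
```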