Apache Spark Direct
Connection Type | REST/HTML Server |
Distributions Validated On | Hortonworks 2.6, Cloudera 5.7 |
Server Details | Apache Livy download information can be found here. |
Type of Support | In-Database |
Validated On | Apache Livy 0.3, Apache Spark 1.6, 2.0, 2.1, and 2.2 |
Alteryx Tools Used to Connect
In-Database Workflow Processing
Connect In-DB Tool
Data Stream In Tool
Apache Spark Code Tool
Connect to Apache Spark by dragging a Connect In-DB tool or the Apache Spark Code tool onto the canvas. Create a new Livy connection using the Apache Spark Direct driver. Use the instructions below to configure the connection.
Configure the Livy Connection Window
To connect to Livy Server and create an Alteryx connection string:
Add a new In-DB connection, setting Data Source to Apache Spark Direct. For more information on setting up an In-DB connection, see Connect In-DB Tool.
On the Read tab, Driver is locked to Apache Spark Direct. Click the Connection String drop-down arrow and select New database connection.
Configure the Livy Connection window.
Livy Server Configuration
Select your security preference:
Enter or paste the Host IP Address or DNS name of the Livy node within your Apache Spark cluster.
Enter the Port used by Livy. The default port is 8998.
Optionally provide the User Name to set user impersonation, that is, the name that Apache Spark will use when running jobs.
Enter or paste the URL of your Knox gateway.
Enter the User Name and Password associated with the specified gateway.
Optionally test the connection:
Select the Apache Spark Version used on your cluster.
Select the Kerberos connection type.
Select Test.
Set the Connection Mode to the coding language to use in the Apache Spark Code tool.
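To sanity-check the Livy settings outside Alteryx, it can help to see how they map onto Livy's REST API. The sketch below is illustrative only. It builds the base URL from a host and port (8998 by default) and a request body for creating a session, with an optional `proxyUser` for impersonation. The helper names and the example host are hypothetical, and this page does not describe how Alteryx itself calls Livy.

```python
import json
from typing import Optional


def livy_base_url(host: str, port: int = 8998) -> str:
    """Build the base URL of a Livy server (plain HTTP, default port 8998)."""
    return f"http://{host}:{port}"


def session_request(kind: str = "pyspark", proxy_user: Optional[str] = None) -> dict:
    """Build the JSON body for Livy's POST /sessions endpoint.

    `kind` selects the session language (for example "spark" for Scala,
    "pyspark" for Python, "sparkr" for R). `proxyUser` asks Livy to run
    the session as another user (impersonation).
    """
    body = {"kind": kind}
    if proxy_user:
        body["proxyUser"] = proxy_user
    return body


# Hypothetical host name and user, for illustration only.
base = livy_base_url("livy.example.com")
sessions_url = base + "/sessions"
payload = json.dumps(session_request("pyspark", proxy_user="analyst"), sort_keys=True)
print(sessions_url)
print(payload)
```

Sending a `GET` request to the sessions URL from a machine that can reach the cluster is one way to confirm that the host and port are correct before configuring the connection in Alteryx.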
HDFS Connection
Select the Server Configuration option that matches the HDFS protocol used to communicate with the cluster.
Enter the Host IP Address or DNS name for the HDFS name node within your Apache Spark cluster.
Enter the Port number. The default port will be populated automatically.
Enter or paste the URL of your Knox gateway.
Optionally enter the Username for the HDFS connection.
Optionally enter the Password for the HDFS connection.
Select the Kerberos protocol to use.
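If the HDFS endpoint speaks the WebHDFS REST protocol (an assumption, since this page does not name the protocols offered), the host, port, and optional username combine into a URL like the one built below. The function name, host, port, and user are placeholders rather than values taken from this page.

```python
from typing import Optional
from urllib.parse import urlencode


def webhdfs_url(host: str, port: int, path: str, op: str = "LISTSTATUS",
                user: Optional[str] = None, scheme: str = "http") -> str:
    """Build a WebHDFS REST URL for an absolute HDFS `path`."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    params = {"op": op}
    if user:
        # WebHDFS accepts the caller's name via the user.name query parameter
        # when security is off.
        params["user.name"] = user
    return f"{scheme}://{host}:{port}/webhdfs/v1{path}?{urlencode(params)}"


# Placeholder host and port; substitute the values for your own cluster.
url = webhdfs_url("namenode.example.com", 50070, "/tmp", user="analyst")
print(url)
```

Listing a directory this way is a lightweight check that the name node address and port entered in the connection window are reachable.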
Advanced Options
Set the Poll Interval (ms), the time between checks from Alteryx for Apache Spark code execution requests. The default is 1,000 ms, or 1 second.
Set the Wait Time (ms), the time that Alteryx waits for execution requests to complete. Operations that take longer than the set wait time result in a time-out error. The default is 60,000 ms, or 1 minute.
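The interaction between the two timing settings can be pictured as a polling loop: check the status once per poll interval and give up once the wait time has elapsed. This is a sketch of the general pattern, not Alteryx's implementation. `get_state` is a hypothetical callable standing in for a status check, and the state names are illustrative.

```python
import time
from typing import Callable


def wait_for_completion(get_state: Callable[[], str],
                        poll_interval_ms: int = 1000,
                        wait_time_ms: int = 60000) -> str:
    """Poll `get_state` until it reports a final state or the wait time elapses."""
    deadline = time.monotonic() + wait_time_ms / 1000.0
    while True:
        state = get_state()
        if state in ("available", "error"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result after {wait_time_ms} ms")
        time.sleep(poll_interval_ms / 1000.0)


# Simulated status source: reports "running" twice, then "available".
states = iter(["running", "running", "available"])
result = wait_for_completion(lambda: next(states), poll_interval_ms=10, wait_time_ms=500)
print(result)
```

A shorter poll interval notices completion sooner at the cost of more requests. A longer wait time suits long-running jobs that would otherwise hit the time-out error.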
The Apache Spark Configuration Options customize the created Apache Spark context and allow advanced users to override the default Apache Spark settings.
Note
By default, the Configuration Option is spark.jars.packages and the Value is com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:2.0.1. Depending on your Apache Spark version, you might need to override the default value.
Apache Spark Version | Value |
---|---|
2.0 - 2.1 | com.databricks:spark-avro_2.11:3.2.0;com.databricks:spark-csv_2.11:1.5.0 |
2.2 | com.databricks:spark-avro_2.11:4.0.0;com.databricks:spark-csv_2.11:1.5.0 |
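The table above can be expressed as a small lookup, which may be handy when scripting or documenting connection settings. The function and constant names are hypothetical. The values are copied verbatim from this page, and any version not listed falls back to the stated default.

```python
# Default value of spark.jars.packages, as stated in the note above.
DEFAULT_PACKAGES = (
    "com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:2.0.1"
)

# Overrides copied from the table, keyed by major.minor Apache Spark version.
OVERRIDES = {
    "2.0": "com.databricks:spark-avro_2.11:3.2.0;com.databricks:spark-csv_2.11:1.5.0",
    "2.1": "com.databricks:spark-avro_2.11:3.2.0;com.databricks:spark-csv_2.11:1.5.0",
    "2.2": "com.databricks:spark-avro_2.11:4.0.0;com.databricks:spark-csv_2.11:1.5.0",
}


def packages_for(spark_version: str) -> str:
    """Return the spark.jars.packages value suggested for a Spark version."""
    major_minor = ".".join(spark_version.split(".")[:2])
    return OVERRIDES.get(major_minor, DEFAULT_PACKAGES)


print(packages_for("2.2.0"))
print(packages_for("1.6"))
```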
Select (+ icon) to add another row to the configuration options table.
Select (save icon) to save the current advanced settings as a JSON file. The file can then be loaded into the advanced settings of another connection.
Select (open icon) to load a JSON file into the configuration options table.
Select OK to create your Apache Spark Direct connection.
Limitations
At this time, Alteryx supports native Spark in Cloudera Data Platform (CDP) but not Cloudera Distributed Hadoop (CDH).
TLS/SSL-enabled Livy servers are not supported.