Apache Spark Direct

Connection Type	REST/HTML Server
Distributions Validated On	Hortonworks 2.6, Cloudera 5.7
Server Details	Download information is available on the Apache Livy website.
Type of Support	In-Database
Validated On	Apache Livy 0.3; Apache Spark 1.6, 2.0, 2.1, and 2.2

Alteryx Tools Used to Connect

In-database Workflow Processing

Connect In-DB Tool

Data Stream In Tool

Apache Spark Code Tool

To connect to Apache Spark, drag a Connect In-DB tool or the Apache Spark Code tool onto the canvas. Create a new Livy connection using the Apache Spark Direct driver. Use the instructions below to configure the connection.

Configure the Livy Connection Window

To connect to the Livy server and create an Alteryx connection string:

Add a new In-DB connection, setting Data Source to Apache Spark Direct. For more information on setting up an In-DB connection, visit Connect In-DB Tool.

On the Read tab, Driver is locked to Apache Spark Direct. Select the Connection String drop-down arrow and select New database connection.

Configure the Livy Connection window.

Livy Server Configuration

Select your security preference:

None

Enter or paste the Host IP Address or DNS name of the Livy node within your Apache Spark cluster.
Enter the Port used by Livy. The default port is 8998.
Optionally provide the User Name to set user impersonation. This is the name that Apache Spark uses when running jobs.

Knox

Enter or paste the URL of your Knox gateway.
Enter the User Name and Password associated with the specified gateway.

Optionally test the connection:

Select the Apache Spark Version used on your cluster.
Select the Kerberos connection type.
Select Test.

Set the Connection Mode to the coding language to use in the Apache Spark Code tool.
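For orientation, the settings above correspond roughly to the request a client sends to Livy's REST API when it opens a session. The following is a minimal sketch, not Alteryx's implementation: it assumes the unsecured (None) option, a hypothetical host name, and a hypothetical mapping from Connection Mode languages to Livy session kinds. It only builds the request and does not send it.

```python
import json

# Hypothetical mapping from a Connection Mode language to a Livy session kind.
LIVY_KINDS = {"scala": "spark", "python": "pyspark", "r": "sparkr"}


def build_session_request(host, port=8998, language="python", user_name=None):
    """Return (url, json_body) for a Livy POST /sessions call."""
    # Plain HTTP is used because TLS/SSL-enabled Livy servers are not supported.
    url = f"http://{host}:{port}/sessions"
    body = {"kind": LIVY_KINDS[language.lower()]}
    if user_name:
        # proxyUser asks Livy to impersonate this user when running jobs.
        body["proxyUser"] = user_name
    return url, json.dumps(body, sort_keys=True)


# Example with a made-up host and user name.
url, body = build_session_request("livy.example.com", user_name="analyst")
print(url)
print(body)
```

The default port argument mirrors the Livy default of 8998 noted above. Leaving `user_name` empty omits `proxyUser`, so jobs run as whatever user the Livy server itself uses.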

HDFS Connection

Select the Server Configuration option that matches the HDFS protocol used to communicate with the cluster.

HTTPFS

Enter the Host IP Address or DNS name for the HDFS name node within your Apache Spark cluster.
Enter the Port number. The default port is populated automatically.

WebHDFS

Enter the Host IP Address or DNS name for the HDFS name node within your Apache Spark cluster.
Enter the Port number. The default port is populated automatically.

Knox Gateway

Enter or paste the URL of your Knox gateway.

Optionally enter the User Name for the HDFS connection.

Optionally enter the Password for the HDFS connection.

Select the Kerberos protocol to use.
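To see how the three server configurations differ on the wire, the sketch below builds the REST URL for a directory listing under each option. It is illustrative only. The ports 50070 (WebHDFS) and 14000 (HttpFS) are the stock Hadoop 2.x defaults and are an assumption here, since your cluster may use others. The host names, the Knox topology path, and the decision to pass `user.name` only for the non-Knox options are also assumptions.

```python
from urllib.parse import quote, urlencode

# Stock Hadoop 2.x defaults, used as assumptions; your cluster may differ.
DEFAULT_PORTS = {"webhdfs": 50070, "httpfs": 14000}


def hdfs_rest_url(protocol, path, host=None, port=None, knox_url=None,
                  user=None, op="LISTSTATUS"):
    """Build a WebHDFS-style REST URL for the chosen server configuration."""
    if protocol == "knox":
        # Knox proxies the same REST API under the gateway URL.
        base = knox_url.rstrip("/") + "/webhdfs/v1"
    else:
        # WebHDFS and HttpFS expose the same /webhdfs/v1 path on different ports.
        base = f"http://{host}:{port or DEFAULT_PORTS[protocol]}/webhdfs/v1"
    query = {"op": op}
    if user and protocol != "knox":
        # Assumption: with Knox, credentials travel in the HTTP request instead.
        query["user.name"] = user
    return f"{base}{quote(path)}?{urlencode(query)}"


# Examples with made-up hosts.
print(hdfs_rest_url("webhdfs", "/tmp", host="namenode.example.com", user="analyst"))
print(hdfs_rest_url("httpfs", "/tmp", host="namenode.example.com"))
print(hdfs_rest_url("knox", "/tmp", knox_url="https://knox.example.com:8443/gateway/default"))
```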

Advanced Options

Set the Poll Interval (ms), the time between Alteryx checks on Apache Spark code execution requests. The default is 1,000 ms, or 1 second.

Set the Wait Time (ms), the maximum time that Alteryx waits for an execution request to complete. Operations that take longer than the set wait time result in a time-out error. The default is 60,000 ms, or 1 minute.
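The interplay of the two settings can be pictured as a polling loop with a deadline. This is a generic sketch of that pattern, not Alteryx's code. The function name `wait_for_result` and the `check` callback are hypothetical, and the sleep and clock functions are parameters only so the loop can be exercised without real delays.

```python
import time


def wait_for_result(check, poll_interval_ms=1000, wait_time_ms=60000,
                    sleep=time.sleep, clock=time.monotonic):
    """Call check() once per poll interval until it returns a non-None result.

    Raises TimeoutError once wait_time_ms has elapsed without a result.
    """
    deadline = clock() + wait_time_ms / 1000.0
    while True:
        result = check()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"no result after {wait_time_ms} ms")
        sleep(poll_interval_ms / 1000.0)


# Example: the third check reports completion; short intervals keep it quick.
statuses = iter([None, None, "available"])
print(wait_for_result(lambda: next(statuses), poll_interval_ms=10, wait_time_ms=1000))
```

A shorter poll interval notices completion sooner at the cost of more requests. A longer wait time tolerates slow jobs but delays the time-out error when something has actually failed.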

The Apache Spark Configuration Options customize the created Apache Spark context and allow advanced users to override the default Apache Spark settings.

Note

By default, the Configuration Option is spark.jars.packages and the Value is com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:2.0.1. Depending on your Apache Spark version, you might need to override the default value.

Apache Spark Version	Value
2.0 - 2.1	com.databricks:spark-avro_2.11:3.2.0;com.databricks:spark-csv_2.11:1.5.0
2.2	com.databricks:spark-avro_2.11:4.0.0;com.databricks:spark-csv_2.11:1.5.0

Select (+ icon) to add another row to the configuration options table.
Select (save icon) to save the current advanced settings as a JSON file. The file can then be loaded into the advanced settings of another connection.
Select (open icon) to load a JSON file into the configuration options table.
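As a rough illustration of the version-dependent override and the save and load round trip, the sketch below picks the spark.jars.packages value for a given Apache Spark version and writes the options to a JSON file. The package strings are copied verbatim from this page. Everything else is an assumption: the function names are hypothetical, and the flat JSON layout is an illustration, not the file format Alteryx writes.

```python
import json
import os
import tempfile

# Values copied verbatim from this page.
DEFAULT_PACKAGES = "com.databricks:spark-csv_2.10:1.5.0,com.databricks:spark-avro_2.10:2.0.1"
PACKAGES_2_0_TO_2_1 = "com.databricks:spark-avro_2.11:3.2.0;com.databricks:spark-csv_2.11:1.5.0"
PACKAGES_2_2 = "com.databricks:spark-avro_2.11:4.0.0;com.databricks:spark-csv_2.11:1.5.0"
OVERRIDES = {"2.0": PACKAGES_2_0_TO_2_1, "2.1": PACKAGES_2_0_TO_2_1, "2.2": PACKAGES_2_2}


def config_options(spark_version, extra=None):
    """Return configuration rows as a dict, overriding the default per version."""
    options = {"spark.jars.packages": OVERRIDES.get(spark_version, DEFAULT_PACKAGES)}
    options.update(extra or {})
    return options


def save_options(options, path):
    # Hypothetical layout: one flat JSON object of option name to value.
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(options, handle, indent=2)


def load_options(path):
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Example: build options for Spark 2.2, save them, and load them back.
options = config_options("2.2", extra={"spark.executor.memory": "4g"})
path = os.path.join(tempfile.mkdtemp(), "spark_options.json")
save_options(options, path)
print(load_options(path) == options)
```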

Select OK to create your Apache Spark Direct connection.

Limitations

At this time, Alteryx supports native Spark in Cloudera Data Platform (CDP) but not Cloudera Distributed Hadoop (CDH).

TLS/SSL-enabled Livy servers are not supported.