Working with other libraries

Working with Spark

Two configuration tweaks are required to work with Spark:

  • Set up Java correctly: Spark only works with Java 8, while activeviam requires Java 11.

  • Use the correct version of py4j: pyspark depends on version 0.10.7, while activeviam uses 0.10.8.1.

Set up the Java version

To work with both activeviam and Spark, you need both Java 11 and Java 8 installed on your system, and you must make sure that each library uses the correct version.
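Before configuring anything, it can help to check which Java is currently the default. Below is a minimal sketch; note that java -version prints to stderr, not stdout:

import os
import subprocess

# Show the JAVA_HOME currently visible to this Python process, if any.
print(os.environ.get("JAVA_HOME"))

# Print the default Java version (written to stderr by the JVM).
subprocess.run(["java", "-version"])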

As Java 8 will soon be deprecated, we recommend using Java 11 as your default Java installation. Below are two ways to provide the required Java version to Spark.

Set up JAVA_HOME directly inside Python

This is not an elegant way of doing it, but it is the easiest: temporarily change JAVA_HOME in the environment while starting the Spark session.

import os

from pyspark.sql import SparkSession

# First modify the env to point to Java 8
previous_java_home = os.environ["JAVA_HOME"]
os.environ["JAVA_HOME"] = "path/to/java8"

# Start the Spark session (the JVM is launched here, using Java 8)
spark = SparkSession.builder.appName("Demo").getOrCreate()

# Set the env variable back to its initial value
os.environ["JAVA_HOME"] = previous_java_home
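If you want JAVA_HOME restored even when session creation fails, the swap can be wrapped in a small context manager. This is a minimal sketch; the java_home helper below is illustrative and not part of activeviam or pyspark:

import os
from contextlib import contextmanager

from pyspark.sql import SparkSession


@contextmanager
def java_home(path):
    """Temporarily point JAVA_HOME to *path*, restoring it afterwards."""
    previous = os.environ.get("JAVA_HOME")
    os.environ["JAVA_HOME"] = path
    try:
        yield
    finally:
        if previous is None:
            del os.environ["JAVA_HOME"]
        else:
            os.environ["JAVA_HOME"] = previous


# The JVM is launched while JAVA_HOME points to Java 8;
# the original value is restored when the block exits.
with java_home("path/to/java8"):
    spark = SparkSession.builder.appName("Demo").getOrCreate()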

Using standalone Spark

The main purpose of pyspark is to connect to another Spark instance. One solution is to install a standalone Spark distribution, configure it, and then use it from pyspark:

  • Install standalone Spark and pyspark (the same version).

  • Set your SPARK_HOME environment variable to your standalone Spark installation (pyspark will then use it).

  • In $SPARK_HOME/conf/spark-env.sh, set JAVA_HOME=/path/to/java8. You can verify the setup with the sketch after this list.
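To check that pyspark actually picked up the standalone installation, a quick sanity check helps. This is a minimal sketch, assuming SPARK_HOME was exported before launching Python:

import os

from pyspark.sql import SparkSession

# SPARK_HOME should point at the standalone installation.
print(os.environ.get("SPARK_HOME"))

# The reported version should match the standalone Spark version.
spark = SparkSession.builder.appName("Check").getOrCreate()
print(spark.version)
spark.stop()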

Set up the py4j version

py4j is a connector between Python and Java used by both activeviam and Spark.

We recommend using the latest version of py4j, even though Spark requires 0.10.7. There is no conflict, and the latest version has better support for newer Python versions.

For instance, with conda you can run conda install py4j==0.10.8.1.
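To confirm which py4j version is active in your environment, you can print it from Python. A minimal sketch using the standard library (Python 3.8+):

from importlib.metadata import version

# Print the installed py4j version; expect 0.10.8.1 or later.
print(version("py4j"))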