Sparkube Tutorial
Learn how to start a multidimensional cube on Spark

What is Sparkube?

Sparkube is a tool for analysing Apache Spark datasets as multidimensional cubes. Running on top of Spark, it examines a dataset, creates dimensions and aggregation functions for each column, and launches a server to expose the resulting cube. The cube can then be accessed and analysed from Excel or other analytical tools such as ActiveUI. If a user creates a pivot table, for instance, Sparkube automatically converts the actions into Spark transformations and returns the data.

Sparkube allows data scientists to perform projection operations along the dimensions effortlessly and in the graphical tools they love.

Getting started!

This tutorial will teach you how to launch a Sparkube server on top of your Spark cluster.

Deployment platforms: Databricks, Linux, Windows. The steps below cover the Databricks deployment.


In your workspace, create a new library by uploading the Sparkube jar, then attach it to the cluster(s) on which you want to run Sparkube.

Update the Databricks security group in your AWS account to allow incoming Sparkube connections. You will need to specify which IP addresses are allowed to connect to Sparkube. You or your admin only need to complete this step once:

  • From your AWS console, go to the VPC dashboard and find the Databricks security group. Its name ends with -worker-unmanaged, for example dbc-fb3asdddd3-worker-unmanaged.
  • Edit the security group and add an inbound TCP rule allowing traffic on port 6006 to the worker machines. The source can be the single IP address of your machine or a range.
  • Make sure your laptop and office network allow outgoing TCP traffic on port 6006.

First Cube

You can build your first cube with any dataset. We will use this small dataset in this tutorial: a record of online retail invoices, which provides interesting hierarchies and measures.

In the data menu, you can upload it from your computer as a new Table:

Start by creating a dataset as usual (for instance from a CSV file):

var retailDataset = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/online_retail.csv") // adjust to where you uploaded the file

Note that when you import your dataset from a CSV file, you have to infer the schema explicitly (or provide one yourself), otherwise all your columns will be of string type and your cube won't have any measures.
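If you prefer not to rely on schema inference, you can declare the schema yourself. The sketch below assumes column names from this tutorial's retail dataset (InvoiceNo and Description are illustrative; adapt them to your file):

```scala
import org.apache.spark.sql.types._

// Explicit schema for the example retail dataset.
// InvoiceNo and Description are assumed column names; adjust to your CSV.
val retailSchema = StructType(Seq(
  StructField("InvoiceNo", StringType),
  StructField("Description", StringType),
  StructField("Quantity", IntegerType),
  StructField("UnitPrice", DoubleType),
  StructField("Country", StringType)
))

var retailDataset = spark.read
  .option("header", "true")
  .schema(retailSchema)
  .csv("/path/to/online_retail.csv") // adjust to where you uploaded the file
```

With an explicit schema, Quantity and UnitPrice arrive as numeric columns, so the cube will have measures even without inferSchema.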

Import Sparkube dependencies:

import com.activeviam.sparkube._

You can do any transformation you want on your dataset. For instance, you can create a new column "Sales Revenue" by multiplying "Quantity" and "UnitPrice":

retailDataset = retailDataset.withColumn("Sales Revenue", $"Quantity" * $"UnitPrice")

When you are ready you can expose it as a multidimensional cube:

new Sparkube()
  .fromDataset(retailDataset)
  .withName("My first dataset")
  .expose()

If the launch was successful, you should get a success message with some links to connect to your new OLAP cube.
That's it! Your cube is ready, you can now use it from Excel or ActiveUI with the provided links.

Visualization in ActiveUI

Once your cube is exposed, you can access it in your browser at the link given by expose (for example http://localhost:9090/ui if you're running locally). All your actions in the UI are converted by Sparkube into Spark queries.

In the right panel you will find the Data Explorer. Every string column of your dataset has been converted into a Hierarchy:

All numerical columns have been turned into measures. Because Spark does all the aggregations, Sparkube has created one measure for each Spark aggregation function (sum, average, min, max, variance...).

Query the cube in ActiveUI

In ActiveUI, click on "Pivot Table" to create your first view. You should see a table with a single cell: the number of rows in your dataset. "Contributors.COUNT" is the default measure; it simply counts the number of elements. With the "plus" button or by drag-and-drop, you can add another measure such as "Sales Revenue.SUM", which gives you the total revenue of the transactions in your dataset.

You can drag-and-drop any hierarchy on "Row" or "Column". This will aggregate per member of your hierarchy. In the example dataset you can try grouping by "Country" on rows.
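Under the hood, grouping by "Country" with "Sales Revenue.SUM" corresponds roughly to a plain Spark aggregation like this sketch (illustrative only; Sparkube generates its own queries):

```scala
import org.apache.spark.sql.functions.sum

// Roughly what the pivot table asks Spark to compute:
// one row per Country, with the summed revenue.
retailDataset
  .groupBy("Country")
  .agg(sum("Sales Revenue").as("Sales Revenue.SUM"))
  .show()
```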

Visualization in Excel

In Excel you can access your cube as a Microsoft Analysis Services server. Data is not loaded into Excel and all the computation is done by Spark; Excel is only used as a user interface.

Once your cube is exposed, it is available at the link given by expose (for example http://localhost:9090/xmla if you're running locally).

You can load the cube by clicking on Data > Get Data > From Database > From Analysis Services.

Enter your cube address and leave the login blank; no login is required.

All the cubes you have exposed will appear here. The name shown is the one you provided with the withName() method. Select the cube you want to explore and click "Finish".

That's it, you now have your data in Spark exposed as a multidimensional pivot table in Excel.

You can read more about Analysis Services in Microsoft's documentation.

Query the cube in Excel

In this part of the tutorial you will learn how to query the cube containing your data. All your multidimensional Excel queries will be converted by Sparkube into Spark queries.

In your pivot table you can see that every string column of your dataset has been converted into a Hierarchy.

All numerical columns have been turned into measures. Because Spark does all the aggregations, Sparkube has created one measure for each Spark aggregation function (sum, average, min, max, variance...).

You can drag-and-drop any measure on "Values". This aggregates over all the rows of your dataset. With the example retail dataset, you can use UnitPrice.MEAN to get the mean price of the products sold by the retailer.

You can drag-and-drop any hierarchy on "Row" or "Column". This will aggregate per member of your hierarchy. In the example dataset you can try grouping by "Country" on rows and adding more measures on columns.

Any multidimensional query is possible, even complex ones. For example, you can ask for the 10 countries with the most products sold. To do that, click the drop-down button next to a hierarchy, then Value Filters > Top 10.
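For intuition, the Top 10 query above is roughly equivalent to the following Spark sketch (assuming "products sold" is measured by summing the "Quantity" column; Sparkube builds its own query plan):

```scala
import org.apache.spark.sql.functions.{col, sum}

// Top 10 countries by total quantity sold.
retailDataset
  .groupBy("Country")
  .agg(sum("Quantity").as("ProductsSold"))
  .orderBy(col("ProductsSold").desc)
  .limit(10)
  .show()
```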

There are many possible multidimensional queries, and Sparkube lets you run them through the Excel UI with the power of a Spark cluster.

You now know the basics of Sparkube. We invite you to try it with different and bigger datasets. The rest of this tutorial explains more advanced features that will help you understand more deeply how Sparkube works.

Cube and server management

When you create a cube you can store it in a variable to use it later.

var cube = new Sparkube().fromDataset(df).expose()

You can use this to rename or delete your cube.


You can also stop the whole server:


If you want, you can create the server on a specific port. This is useful if the default port (9090) is already in use. You need to call this before exposing your cube.


Any cube exposed after that (by calling the expose function) will be added to this new server.

Hierarchy Creation

Sparkube keeps a list of every distinct member of each hierarchy in order to optimize requests. This list is built at the first request on the hierarchy; we call this lazy loading of hierarchies. The first request can therefore take time and memory to complete. You can deactivate lazy loading at cube creation.

There is also a limit to the number of distinct members in a hierarchy (1.5M by default) to avoid latency issues. You can modify this upper bound at cube creation.

new Sparkube()
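Before exposing a cube, you can check a column's cardinality against this limit with a plain Spark query:

```scala
// Count the distinct members a hierarchy would have,
// to verify it stays under the configured upper bound.
val countryMembers = retailDataset.select("Country").distinct().count()
println(s"Country has $countryMembers distinct members")
```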

Slicing Hierarchies

When you create your cube, you can use the slicing hierarchy option. A slicing hierarchy is a hierarchy whose members are never aggregated together. For example, currency is usually a slicing hierarchy because you don't want to aggregate euros and dollars together; you filter on only one of them instead.
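Conceptually, a query on a slicing hierarchy always filters on a single member before aggregating, as in this Spark sketch (using "Country" from the example dataset, which has no currency column):

```scala
import org.apache.spark.sql.functions.{col, sum}

// Aggregate within one member only, never across members.
retailDataset
  .filter(col("Country") === "France")
  .agg(sum("Sales Revenue"))
  .show()
```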

You can define a slicing hierarchy like this:

new Sparkube()
 .withName("My first dataset")

Custom Aggregate Functions

For each dimension, Sparkube offers basic measures from Spark by default: count, min, max, avg, sum...
With Spark, you can also write user defined aggregate functions to aggregate some columns of your dataset in a particular way. Sparkube is able to add these aggregate functions as new measures to your cube dimensions.

You can define a UserDefinedAggregateFunction and add it at cube creation like this:

val valueAtRisk = new UserDefinedAggregateFunction {
  override def inputSchema: org.apache.spark.sql.types.StructType = ???
  // also override bufferSchema, dataType, deterministic,
  // initialize, update, merge and evaluate
}

new Sparkube()
 .fromDataset(retailDataset)
 .withName("My first dataset")
 .withUDAF(valueAtRisk, "VAR95")
 .expose()
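For reference, here is a complete minimal UDAF (a simple sum of squares rather than a real value-at-risk) showing every method a UserDefinedAggregateFunction must override; valueAtRisk would follow the same structure:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Minimal example UDAF: the sum of squared input values.
object SumOfSquares extends UserDefinedAggregateFunction {
  // One double input column.
  override def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  // One double accumulator.
  override def bufferSchema: StructType = StructType(StructField("acc", DoubleType) :: Nil)
  // The aggregated result is a double.
  override def dataType: DataType = DoubleType
  override def deterministic: Boolean = true
  override def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0))
      buffer(0) = buffer.getDouble(0) + input.getDouble(0) * input.getDouble(0)
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
  override def evaluate(buffer: Row): Double = buffer.getDouble(0)
}
```

It could then be registered on a cube with withUDAF, just like valueAtRisk above.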