Advanced In-Memory Analytics in Azure with the ActiveViam platform

This blog explores new possibilities for in-memory analytics in Azure. We fire up, on demand, a cluster of 128 G5 instances and load a 40TiB dataset into memory, straight from blob storage. The ActiveViam analytical platform puts those 4,096 cores and 60TiB of RAM into the hands of business users.

In-memory computing combined with multi-core processors is arguably the most powerful force of digitalization. It truly enables the transition from batch processing, pre-canned reporting and spreadsheets towards interactive applications that deliver analytics and calculations on the fly, on live data.

For the last ten years in-memory computing has been on the rise, fulfilling the growing need for real-time insight and interactivity in support of decision making. Ten years ago, the technology was still associated with mission-critical applications that justified the investment in dedicated hardware, but the continuing fall of memory prices has helped in-memory computing become more mainstream. By making computing and memory resources elastic and available on demand, cloud computing now puts it in the hands of everyone.

As in-memory analytics specialists, we are partnering with Microsoft on this subject and this blog series relates our most significant joint achievements, touching on topics such as high-speed data transfers from blob storage, on-demand elastic clusters, getting the best out of high-memory instances, etc.

Today we look at large scale clustering. The use case is back testing and trend analysis on a large financial dataset: millions of transactions, prices, sensitivities recorded for hundreds of historical dates, amounting to tens of terabytes of data.

Ideally, business users would like to perform trend analysis on the fly, following a train of thought: filtering and grouping by any attribute, testing different calculations, doing what-if simulations. However, purchasing servers with enough memory to bring that data online is very expensive, disproportionately so for analyses done only periodically at the end of the month or for occasional expert investigations. That's why such analysis is still done with batch processing and static reporting.

We leverage Azure cloud infrastructure to overcome this situation. We store the historical dataset in Azure blob storage, start the servers on the fly when a user needs them and shut them down when they are done. To operate the cluster, mount the data in memory and put analytics in the hands of business users, we use the ActiveViam platform.

Let’s do it step by step for 400 days of historical data and 100GiB of data per day. That’s around 40TiB of data that we will load into a cluster of 128 G5 instances. The G5 instance type comes with 32 cores and 448GiB of RAM, so our cluster will have a capacity of 4,096 cores and about 56TiB of memory.

Step 1: starting the cluster

Our number one objective is to optimize the usage of resources. For that we leverage the cloud infrastructure to start the cluster on demand when it’s needed and shut it down when users are finished with it. To guarantee a good user experience, we must start the entire cluster in the shortest possible time.

We prepared a standard Linux image, with Java and one ActivePivot data node pre-deployed (ActivePivot is the server component of the ActiveViam platform). We start our 128 G5 instances with that image, programmatically using the Java API:

// Create the 128 VMs in parallel, one network interface and one VM per data node.
final List<Future<?>> creations = IntStream.range(0, 128)
        .mapToObj(id -> "datacubeactiveviam" + id)
        .map(dataCube -> EXECUTOR.submit(() -> {
            // Network interface for the data node, in the benchmark resource group.
            final NetworkInterface nic = azure.networkInterfaces()
                    .define(dataCube)
                    .withRegion(Region.EUROPE_WEST)
                    .withExistingResourceGroup("benchmark")
                    .withExistingPrimaryNetwork(azure.networks().getByGroup("benchmark", "vnet"))
                    .withSubnet("subnet")
                    .withPrimaryPrivateIpAddressDynamic()
                    .withExistingNetworkSecurityGroup(
                            azure.networkSecurityGroups().getByGroup("benchmark", "securityGroup"))
                    .create();

            // G5 VM booted from the pre-built image with ActivePivot deployed.
            azure.virtualMachines()
                    .define(dataCube)
                    .withRegion(Region.EUROPE_WEST)
                    .withExistingResourceGroup("benchmark")
                    .withExistingPrimaryNetworkInterface(nic)
                    .withStoredLinuxImage("https://vmstorage.blob.core.windows.net/container/image.vhd")
                    .withRootUsername("username")
                    .withRootPassword("password")
                    .withComputerName(dataCube)
                    .withSize(VirtualMachineSizeTypes.STANDARD_G5)
                    .withOsDiskName(dataCube)
                    .withExistingStorageAccount(
                            azure.storageAccounts().getByGroup("benchmark", "activeviamstorage"))
                    .create();
        }))
        .collect(Collectors.toList());

// Wait for all the creations to complete, failing if any of them failed.
for (final Future<?> creation : creations) {
    try {
        creation.get();
    } catch (InterruptedException | ExecutionException e) {
        throw new RuntimeException("A VM creation failed.", e);
    }
}

All the ActivePivot data nodes start automatically when the VMs boot, then discover each other automatically. Finally, we start a couple of ActivePivot query nodes, the entry points to the cluster that coordinate distributed queries and calculations.

This entire sequence takes only ten minutes. We go from nothing to 4,096 cores and 60TiB of memory in ten minutes.

Step 2: loading data from blob storage

The next challenge is data loading. The dataset must be transferred from blob storage to the cluster within minutes to satisfy the on-demand constraint. But how are we going to move 40TB in minutes? This is where you can really leverage the world class infrastructure of a large cloud provider such as Azure.

We have about 400 days of historical data, so we load about 3 days per node (300GiB). The objective is for each of the 128 nodes to download data at maximum speed, and for them all to do it at the same time.

At the time of this test, a G5 instance has a 10Gbps network interface, so it has the capacity to transfer data at roughly 1GiB per second. In our case, however, the data is downloaded from blob storage, and the throughput of a single HTTP-based blob transfer is far less than that. So ActiveViam developed a special connector that opens tens of HTTP connections to blob storage and performs the transfer in parallel, transparently reassembling the blocks directly in ActivePivot. With this trick, an ActivePivot node easily saturates the G5 NIC and downloads its 300GiB in about 5 minutes.
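The connector itself is proprietary, but the core idea can be sketched in a few lines of Java (the class and method names below are illustrative, not the actual connector API): split the blob into byte ranges, fetch each range on its own connection, and reassemble the chunks in order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Sketch of the parallel-download idea (illustrative, not the real connector):
// split a blob into byte ranges, fetch the ranges concurrently, reassemble in order.
class ParallelBlobDownload {

    // A byte range [start, end], inclusive, as used in an HTTP "Range: bytes=start-end" header.
    record ByteRange(long start, long end) {
        String toHeader() {
            return "bytes=" + start + "-" + end;
        }
    }

    // Split a blob of the given size into ranges of at most chunkSize bytes.
    static List<ByteRange> splitRanges(long blobSize, long chunkSize) {
        final List<ByteRange> ranges = new ArrayList<>();
        for (long start = 0; start < blobSize; start += chunkSize) {
            ranges.add(new ByteRange(start, Math.min(start + chunkSize, blobSize) - 1));
        }
        return ranges;
    }

    // Fetch all ranges concurrently with the supplied fetcher (in production, an HTTP GET
    // with a Range header against blob storage), then reassemble the chunks in order.
    static byte[] download(long blobSize, long chunkSize, int parallelism,
                           Function<ByteRange, byte[]> fetchRange) {
        final ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            final List<Future<byte[]>> chunks = new ArrayList<>();
            for (final ByteRange range : splitRanges(blobSize, chunkSize)) {
                chunks.add(pool.submit(() -> fetchRange.apply(range)));
            }
            final byte[] result = new byte[(int) blobSize];
            int offset = 0;
            for (final Future<byte[]> chunk : chunks) {
                final byte[] bytes = chunk.get();
                System.arraycopy(bytes, 0, result, offset, bytes.length);
                offset += bytes.length;
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException("Parallel download failed.", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A single HTTP connection cannot fill a 10Gbps pipe, but tens of them running like this in parallel can, which is what saturates the G5 NIC.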

Blob storage becomes the next bottleneck when 128 VMs download from it at full speed. Currently the limit of a blob storage account is about 30 Gbps, so 3 ActivePivot nodes are enough to hit the cap. But this is just a per-account limit; the underlying infrastructure is virtually unlimited. To bypass this limitation, we use the well-known trick of splitting the data across multiple storage accounts. We created 40 accounts and stored 10 days, or about 1TB of data, in each. This gives us a theoretical aggregated bandwidth of 1,200 Gbps (!), enough to transfer data at more than 100GiB per second… in theory. In real life, where bandwidth aggregation is not perfect, it takes about 15 minutes for the 128 instances to download their data. We’ll try to beat that later, but that’s already an impressive throughput of about 50GiB per second!
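The day-to-account layout can be expressed as a tiny helper (the account and container names here are made up for illustration; they are not the ones used in the benchmark):

```java
// Hypothetical layout: 400 historical days spread over 40 storage accounts, 10 days each,
// so downloads spread evenly across accounts and no single per-account cap is hit.
class BlobLayout {
    static final int DAYS_PER_ACCOUNT = 10;

    // Storage account holding a given historical date (0-based day index).
    static String accountFor(int day) {
        return "benchmarkdata" + (day / DAYS_PER_ACCOUNT);
    }

    // Blob URL of one day's slice inside its account (container name is illustrative).
    static String blobUrlFor(int day) {
        return "https://" + accountFor(day) + ".blob.core.windows.net/history/day-" + day;
    }
}
```

With this scheme, any 3 nodes downloading simultaneously are statistically likely to hit 3 different accounts, so the per-account cap is almost never the limiting factor.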

128 VMs each downloading 100 GiB of data

Step 3: ActivePivot Cluster

ActivePivot stores data in memory, in columnar structures and special indexes that accelerate analytical workloads. This is done in parallel with the data fetching, so when the transfer is complete, ActivePivot is almost ready and only needs one more minute to finish building some indexes.
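As a rough illustration of why columnar layouts help (this is a toy sketch, not ActivePivot's actual storage engine): each field lives in its own contiguous primitive array, so an aggregation is a single cache-friendly scan, and a per-column index such as a sort order accelerates filtering.

```java
import java.util.stream.IntStream;

// Toy columnar store: one contiguous array per field, plus a sort-order index.
class ColumnarSketch {
    final double[] price;      // one column: contiguous, primitive, cache-friendly
    final int[] sortedByPrice; // index: row positions ordered by ascending price

    ColumnarSketch(double[] price) {
        this.price = price;
        this.sortedByPrice = IntStream.range(0, price.length)
                .boxed()
                .sorted((a, b) -> Double.compare(price[a], price[b]))
                .mapToInt(Integer::intValue)
                .toArray();
    }

    // Aggregate the whole column in a single linear scan.
    double sum() {
        double total = 0;
        for (final double v : price) {
            total += v;
        }
        return total;
    }
}
```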

The clustering itself does not require additional data transfer. The 128 data nodes share nothing and don’t talk to each other. The ActivePivot query nodes are started and automatically discover the data nodes. Query nodes can then serve queries and calculations, which are automatically distributed among the data nodes.
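The pattern the query nodes apply is classic scatter-gather; a generic sketch (the names and signatures below are ours, not the ActivePivot API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Scatter-gather sketch: fan a query out to every data node in parallel,
// then merge the partial results into the final answer.
class ScatterGather {
    static <Q, R> R query(Q query, List<Function<Q, R>> dataNodes, BinaryOperator<R> merge) {
        final ExecutorService pool = Executors.newFixedThreadPool(dataNodes.size());
        try {
            // Scatter: each data node evaluates the query on its own shard.
            final List<Future<R>> partials = new ArrayList<>();
            for (final Function<Q, R> node : dataNodes) {
                partials.add(pool.submit(() -> node.apply(query)));
            }
            // Gather: merge the partial results as they come back.
            R result = null;
            for (final Future<R> partial : partials) {
                result = (result == null) ? partial.get() : merge.apply(result, partial.get());
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException("Distributed query failed.", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the data nodes share nothing, the only traffic a query generates is the small query itself going out and the small partial aggregates coming back.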

Azure Portal showing the creation of 128 G5 VMs

To recap: a 4,096-core, 60TiB-RAM cluster, from scratch to fully loaded and ready to use with ActiveViam, in less than half an hour.

Benchmarking queries

Now it is time to see how well ActivePivot fares on such a large cluster.

To do that, we ran two different queries and checked whether they follow Gustafson’s law: queries on twice the dataset should take the same time if you also double the CPU power and RAM.
We scaled the cluster up step by step, each time doubling its size, from 2 VMs to 128 VMs. In the end, our two queries aggregate the entire dataset, about 200 billion records, in about 10 seconds.

ActiveUI showing a breakdown of the cluster per hostname

Below you can see that the scaling is good: the query times are close to constant. In the end, there is only a 16% difference between a query running on a 2-VM cluster and the same query running on a 128-VM cluster.

Query response times, from a 2-VM cluster to a 128-VM cluster

Looking at how many GiB of the dataset are handled per second, we see a similar story: when we multiplied the dataset size by 64, from 600 GiB to 37.5 TiB, the throughput increased 54-fold.
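As a quick sanity check (our arithmetic, not a figure from the benchmark), the 16% slowdown and the 54-fold throughput gain are consistent with each other: a dataset 64 times larger, processed in 1.16 times the time, yields a throughput gain of 64 / 1.16 ≈ 55.

```java
// Cross-check of the weak-scaling numbers: dataset growth divided by
// query-time growth gives the expected throughput gain.
class WeakScalingCheck {
    static double throughputGain(double datasetGrowth, double timeGrowth) {
        return datasetGrowth / timeGrowth;
    }
}
```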

Query throughput as the dataset grows from 600 GiB to 37.5 TiB

We are able to run ActivePivot on a large cluster and to execute responsive queries on tens of terabytes of data.


This entry was posted in Big Data Analytics, Technology.