To vectorise or not to vectorise?

Vectorising data for drastic memory savings

One of the strengths and benefits of ActivePivot over the years has been its ability to load, store, aggregate and perform calculations on vectors of data. The most common example of this on projects is for computing Value at Risk (VaR)-like measures. Typically, ActivePivot would load a P&L vector (e.g. containing 250 doubles for one year of historical returns/simulations), aggregate this to the required level and perform the VaR or other calculations. We have had clients doing this for years, and ActivePivot does it very efficiently and quickly.

In this article, I want to elaborate on this technology choice and present the benefits our clients draw from ActivePivot’s capability of storing data as vectors, in terms of memory usage, load and query times.

What is a vector in this context?

In programming terms, we are talking about an array of values: for example, [100, 101, 102.5, 99.5, …, 102] is what we would refer to as a vector of doubles. The vector size varies widely: from a few tens of values, through 260 for the Expected Shortfall (ES) calculation in the FRTB Internal Models Approach (IMA), to tens of thousands for Potential Future Exposure (PFE) credit risk use cases, and even into the millions under the IMA Default Risk Charge (DRC). ActivePivot stores vectors efficiently in direct memory, off the heap. ActivePivot's API also provides highly optimised methods for calculating measures such as VaR on these vectors.
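As a concrete illustration, here is a minimal plain-Java sketch of a P&L vector, with a naive historical VaR read straight off the sorted distribution. This is not ActivePivot's API; the class, method names and the simple percentile convention are invented for the example.

```java
import java.util.Arrays;

public class VectorBasics {

    // Sum all entries of a P&L vector.
    static double sum(double[] pnl) {
        double total = 0.0;
        for (double v : pnl) {
            total += v;
        }
        return total;
    }

    // A naive historical VaR: the loss at the given percentile of the
    // sorted P&L distribution (sign convention: losses are negative).
    static double historicalVar(double[] pnl, double confidence) {
        double[] sorted = pnl.clone();
        Arrays.sort(sorted);                                  // ascending: worst losses first
        int index = (int) Math.floor((1.0 - confidence) * sorted.length);
        return -sorted[index];                                // report VaR as a positive loss
    }

    public static void main(String[] args) {
        double[] pnl = {-120.0, -80.0, -10.0, 5.0, 15.0, 40.0, 60.0, 75.0, 90.0, 110.0};
        System.out.println("Sum: " + sum(pnl));               // 185.0
        System.out.println("95% VaR: " + historicalVar(pnl, 0.95)); // 120.0
    }
}
```

In a real system the vector would of course be much longer (250 scenarios or more) and the percentile handling more careful, but the access pattern is the same: one contiguous array per trade.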

Other technologies, such as relational and NoSQL databases, are generally not designed or optimised to store arrays or vectors of numbers. Instead, they store a single value per row, and replicating that one-value-per-row pattern in ActivePivot often leads to problems.

Although many projects we work on use and store their data as vectors, many others do not. As a result, these projects may run into difficulties with memory usage and with load and query times. To promote the benefits of vectorisation, we therefore decided to run an experiment based on a use case similar to a recent project, and also similar to what we have built into the Market Risk Accelerator for sensitivity analysis.

In this toy example, we want to calculate and sum our Delta sensitivities across various time buckets. We assume a simple model: each trade has 100 buckets (from 1D to 100Y), and thus 100 sensitivities. To calculate the total exposure, Delta.SUM for each trade is the sum of all 100 buckets. So, in our non-vectorised example, we have 100 rows per trade, one per bucket, e.g.:

TradeID Bucket Delta Value + Other fields
Trade1  1D     25
Trade1  2D     30
…       …      …
Trade1  100Y   45

In order to vectorise the data, we run a simple ETL job which produces the following output:

TradeID Bucket Delta Value + Other fields
Trade1 [1D,2D,...,100Y] [25, 30, …, 45]
Trade2 [1D,2D,...,100Y] [15, 20, …, 40]

We have now reduced the number of rows or “facts” in our datastore from 100 to 1.

Note: on a real ActivePivot project, not just a toy example, we would look to move the buckets to a separate store.
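The ETL step can be sketched in plain Java: group the one-row-per-bucket input by trade and write each delta into its slot in a fixed-size array. The Row record and vectorise method are illustrative only, not part of any ActivePivot tooling.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VectoriseEtl {

    // One denormalised input row: trade, bucket index (0 = 1D ... 99 = 100Y), delta.
    record Row(String tradeId, int bucketIndex, double delta) {}

    // Collapse one-row-per-bucket input into one delta vector per trade.
    static Map<String, double[]> vectorise(List<Row> rows, int bucketCount) {
        Map<String, double[]> vectors = new LinkedHashMap<>();
        for (Row row : rows) {
            double[] v = vectors.computeIfAbsent(row.tradeId(), k -> new double[bucketCount]);
            v[row.bucketIndex()] = row.delta();
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("Trade1", 0, 25.0),
            new Row("Trade1", 1, 30.0),
            new Row("Trade2", 0, 15.0)
        );
        vectorise(rows, 100).forEach((trade, v) ->
            System.out.println(trade + " -> vector of " + v.length + " buckets"));
    }
}
```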

What about changing the store config? That is very easy:


.withField(DELTA_VALUE, DOUBLE)
Becomes
.withVectorField(DELTA_VALUES, DOUBLE)

And what about the measures we have in the cube? I used to calculate my Delta.SUM measure with CoPPeR by doing:

protected static Dataset sensiMeasures(final BuildingContext context) {
	return context
		.withFormatter(DOUBLE_FORMATTER)
		.createDatasetFromFacts()
		.agg(
			sum(DELTA_VALUE).as(DELTA_SUM)
		)
		.publish();
}

As a vector this calculation changes slightly:


protected static Dataset sensiMeasuresFromVector(final BuildingContext context) {
	return context
		.createDatasetFromFacts()
		.agg(sum(DELTA_VALUES).as(DELTA_VALUES_VECTOR))
		.withColumn(DELTA_SUM, col(DELTA_VALUES_VECTOR).map((IVector v) -> v.sumDouble()))
		.agg(
			sum(DELTA_VALUES_VECTOR),
			sum(DELTA_SUM).withFormatter(DOUBLE_FORMATTER)
		);
}
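To make explicit what the two stages of that measure compute, here is a plain-Java equivalent: first an element-wise sum of the delta vectors across facts (the vector aggregation), then a scalar sum over the components of the result (what the sumDouble mapping produces). Class and method names are invented for the sketch.

```java
public class VectorAggregation {

    // Element-wise sum of delta vectors, mirroring sum(DELTA_VALUES) over the facts.
    static double[] sumVectors(double[][] vectors) {
        double[] result = new double[vectors[0].length];
        for (double[] v : vectors) {
            for (int i = 0; i < v.length; i++) {
                result[i] += v[i];
            }
        }
        return result;
    }

    // Scalar total over one vector, mirroring the v.sumDouble() mapping.
    static double sumComponents(double[] vector) {
        double total = 0.0;
        for (double v : vector) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] deltasByTrade = {
            {25.0, 30.0, 45.0},   // Trade1 (3 buckets for brevity)
            {15.0, 20.0, 40.0}    // Trade2
        };
        double[] aggregated = sumVectors(deltasByTrade);            // {40.0, 50.0, 85.0}
        System.out.println("Delta.SUM = " + sumComponents(aggregated)); // 175.0
    }
}
```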

Results

Spectacular to say the least:

[Chart: load time, memory usage and query time, before and after vectorisation]

● Load times decreased by a factor of 8
● Memory usage down by a factor of 6
● Query times 15x quicker

As expected, the reduction in memory and load times can be observed quite clearly. By vectorising the data, the underlying file size dropped by a similar factor to the memory size.
It is easy to see why. Say a row of data has 50 attributes and 1 numerical field. When vectorising that field into vectors of size 100, you collapse 100 rows into one: you save 99 × 50 = 4,950 attribute values per vector, at the cost of one wider column holding the 100 numbers. The file naturally becomes a lot smaller.
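The arithmetic is easy to check. With the assumed 50 attributes, 100 buckets and one numeric field per row, a hypothetical back-of-the-envelope helper gives a 5,100-to-150 reduction in stored values per trade:

```java
public class FieldCountEstimate {

    // Values stored per trade when each bucket is its own row:
    // (attributes + 1 numeric field) per row, one row per bucket.
    static long rowWiseValues(int attributes, int buckets) {
        return (long) (attributes + 1) * buckets;
    }

    // Values stored per trade after vectorising: the attributes once,
    // plus one vector holding all the bucket values.
    static long vectorisedValues(int attributes, int buckets) {
        return attributes + buckets;
    }

    public static void main(String[] args) {
        int attributes = 50, buckets = 100;
        System.out.println("Row-wise:   " + rowWiseValues(attributes, buckets));    // 5100
        System.out.println("Vectorised: " + vectorisedValues(attributes, buckets)); // 150
    }
}
```

The difference between the two figures is exactly the 4,950 attribute values saved per vector; the real on-disk ratio depends on field widths and encoding, so treat this as an order-of-magnitude estimate.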

If the file is smaller, there are fewer bytes to transfer from disk to memory, so loading is quicker. Similarly, since the data is smaller, it occupies less space in memory. The secondary effect is that any indices, dictionaries and other internal ActivePivot structures are much smaller, which also improves performance. And, as mentioned earlier, ActivePivot stores vectors off the heap in direct memory.

When it comes to query times, the main improvement comes from the "reduction" mapping. When aggregating data, you have to read some attributes of each fact to figure out where to aggregate its value: this is called reduction.
Say you are summing the Delta values by currency: for each fact, you read the value of the currency attribute in the currency column, map it through the results dictionary to find the index in the result column at which the Delta value will be aggregated, and finally perform the addition. Although we still add the same number of values together, you can see from this example that summing the doubles themselves is an almost negligible part of the reduction. With vectors, there are 100 times fewer facts to reduce.
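A simplified sketch of that reduction, with an explicit dictionary from currency to result index, makes the cost split visible. The real ActivePivot internals are of course more sophisticated; the names here are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ReductionSketch {

    // Sum delta values by currency: the per-fact work is dominated by the
    // dictionary lookup, not by the floating-point addition itself.
    static double[] reduceByCurrency(String[] currencies, double[] deltas,
                                     Map<String, Integer> dictionary) {
        double[] results = new double[dictionary.size()];
        for (int fact = 0; fact < currencies.length; fact++) {
            int index = dictionary.get(currencies[fact]); // the "reduction" lookup
            results[index] += deltas[fact];               // the cheap part: one add
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("EUR", 0);
        dictionary.put("USD", 1);

        String[] currencies = {"EUR", "USD", "EUR"};
        double[] deltas = {10.0, 5.0, 2.5};
        double[] results = reduceByCurrency(currencies, deltas, dictionary);
        System.out.println("EUR=" + results[0] + ", USD=" + results[1]); // EUR=12.5, USD=5.0
    }
}
```

With vectorised data, the loop runs over one fact per trade instead of one per bucket, so the lookup cost is paid once per vector rather than once per value.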

Another improvement could be that, at query time, the CPU may be vectorising some of the vector aggregations using the SIMD instructions of modern processor architectures. This is something that should become even more powerful once Project Panama comes to fruition.

Conclusions

The results from vectorising the data speak for themselves. Projects which perform analysis across different time buckets, like sensitivities, are a perfect candidate for benefitting from this change in data structure.
Because ActivePivot is an in-memory database, reducing the amount of data stored in memory allows us to store more data: more business dates online for analysis, for example, or new products and desks available in the cube. In addition, as we have seen, query times become shorter, not longer!

If you’re interested in the improvements vectors can bring to your project, as well as other performance optimisation techniques, speak to your account manager about attending our new Performance Optimisation course.
