5 April 2016

Speeding up a MATLAB simulation – A case study

The benefit of speed

In model-based design, the model is the workhorse for product development. In general, the main objective during model definition is to implement the required features and functions. Speed of execution may be a lower priority. In practice, improving run-time performance is often postponed, as it takes time and may require re-programming the fresh implementation. However, it may be worthwhile: short execution times are beneficial both in the development and the deployment phase.

During development, a short execution time provides timely feedback: it invites people to test more implementation ideas, and to more thoroughly validate the model implementation. The resulting quick iterations will lead to higher quality models.

During deployment, one can gain much more insight from a speedy model: explore numerous what-if scenarios, explore parameter spaces, optimize parameters, run more complex cases, validate people’s gut feeling during discussions, etc. A high-performing model may even become part of real-time applications, or run on a server to be deployed by many.

In this blog we provide a real-world example of model speedup: we outline a possible approach and  show how simulation time was accelerated 15 times, while making no concessions to simulation accuracy.

Case: A subsidence model

In a recent project, we implemented a subsidence model. Subsidence is the vertical movement of the ground surface due to pressure reduction in a reservoir. Subsidence can be a side-effect of gas exploitation. Van Opstal’s method is an often used, semi-analytical approach to predict subsidence.  Van Opstal’s equations quantify the csketchontribution of a reservoir cell at position (x,y,z) to the eventual subsidence of a point (x,y) at the surface. In our model, the cells that compose an arbitrary 3D reservoir shape have an in-homogeneous  and time-dependent pressure distribution. At the ground surface we define a rectangular grid around the reservoir footprint. We use the model to study subsidence over time, by feeding it time-varying reservoir cell pressures P(x,y,z,t).

M-code implementation of Van Opstal’s equations

We implemented the subsidence model in a few MATLAB classes and translated Van Opstal’s equation set to a solver method. Our base case contains 20k surface grid elements and 115k reservoir elements. As a result, a total 2.3G subsidence contributions need to be calculated to run the case. On the laptop (Intel Core2 Duo T9400 @2.53GHz) the simulation took 163 minutes (2h43m) to complete.

Speedup options

There are many options to speed up MATLAB code in general. Which options are effective depends on factors like functions used, data dimensions, numerical accuracy, MATLAB version, hardware, bottlenecks related to I/O, etc. It is even possible to run code on a computer cluster using the MATLAB Distributed Computing Server.

In this blog we focus on a single box solution, and apply and quantify four options to speed up the model:

  1. Faster hardware
  2. M-code optimization
  3. Parallel processing on CPU
  4. Parallel processing on GPU

Option 1: Faster hardware

A logical first step is to run the simulation on faster hardware. We ran the model on a quad-core machine (Intel Core i7 CPU @2.93 GHz, 4 physical cores — an average CPU). The simulation now took 45 minutes, a 3.6x speedup. This will be our performance benchmark for further work.

Option 2: M-code optimization

We used the MATLAB profiler to identify performance bottlenecks. The main bottleneck appeared to be the calculation of factors that include element-wise divisions and element-wise powers. Especially the element-wise power is expensive in MATLAB, because MathWorks uses the fdlibm library which is more accurate but slower than straight multiplication (see also Bobby Cheng’s post). We re-coded some sections and validated the accuracy of the implementation. The simulation now took 8 minutes, a further 5.6x speedup.

Option 3: Parallel processing on CPU

The Van Opstal solver spends most time in a for loop which calculates subsidence at each surface grid point. The iterations of this loop are independent which means that the problem is massively parallel. We can turn the for loop in a parfor loop, and have a so-called parallel pool of workers (multiple headless MATLAB sessions running on separate cores) perform the iterations, using all cores of a machine.

However, it should be taken into account that recent MATLAB versions implement implicit multi threading for many base functions. In our case, vectors of more than 100k elements (that is: the reservoir data vectors) trigger multi threaded behavior for operations like addition, subtraction, multiplication, division and square root.  Therefore, the parfor construct will not speed up our code on a single machine. Actually, a parallel pool of workers slowed down the computations due to the overhead it introduces. We better skip this option.

Option 4: Parallel processing on GPU

General purpose Graphics Processing Units (GPUs) are high-performance many-core processors that can be used to accelerate engineering computations. We ran our simulation on a powerful NVIDIA graphics cards using the GPU interface provided by the Parallel Computing Toolbox.

We systematically assessed the GPU acceleration potential of individual operations and functions to estimate the overall GPU acceleration potential for the Van Opstal solver. In this case we expected an overall speedup of 10-15x for a mid-range NVIDIA Quadro GPU card.

However, it is not always straightforward to release the full GPU potential from MATLAB. Simply transferring the MATLAB arrays to the GPU (using the gpuArray class) and letting it do the math does not work in this case. On the contrary: that approach slowed down the solver.

Instead, we modified our solver implementation in order to feed the GPU tailored chunks. We turned the vectorized implementation into an element-wise implementation, and used gpuArray.arrayfun() to calculate a number of array elements in parallel on the GPU. After tuning the number of parallel GPU threads, it ran stable and fast, resulting in a 14.9x acceleration compared to benchmark. One model run takes 3 minutes to complete. This is acceptable performance: we can now simulate a 30-year time horizon in 1.5 hours instead of a day.

Summary

speedup_v2

In short, a GPU implementation accelerated our model 14.9 times compared to the original M-code running on a quad-core machine, and 53.7 times compared to running it on a dual-core laptop. It was also shown that there is no “one size fits all” approach: potential speedup options may prove ineffective, or may require tuning.

Speed increases the quality, applicability and therefore value of a model. Do you need to speed up or enhance your simulations? Contact MonkeyProof Solutions to discuss how the experts can be of help.