The Answer to Parallel Processing

Friday April 28, 2023 at 8:00am

Those of us performing FEA have often found ourselves wishing our jobs would finish sooner. All too often in my three decades in the industry I’ve come across people who’ve addressed this by building smaller models or by reducing the fidelity of the output. In one horrific case, someone used linear tetrahedral elements instead of parabolic ones and fudged the elastic modulus of the materials to try to counter the overstiff behaviour, all on the basis of one simple cantilever example.

The proliferation of multicore hardware in the last 20 years has led to more users ticking the box for parallel processing to try to improve their turnaround. Without careful consideration of the type of problem you have and the nature of the parallelisation in your chosen software, though, this too can cause problems, as we will see.

There are three main ways you can go about getting your results faster.

1. Reduce your model size.

There are legitimate ways to reduce the size of the problem without affecting accuracy, some of which have been discussed in previous blog articles, such as the use of symmetry, zoom modelling and superelements.

2. Improve your hardware.

You can throw money at bigger, better, faster hardware. A good way to tell if your hardware is letting you down is to look at the log files of your current jobs. Examples of Nastran and Marc are shown below, but your FEA solver should be producing similar output.

The former job, the Marc example, shows that the total elapsed time and the CPU time are very close. There is some time lost, about 190s, but not enough to warrant a hardware upgrade on the basis of it being overwhelmed.

The latter job, the Nastran example, shows a big discrepancy. The sum of User plus System time only represents about 62% of the total elapsed time. This difference, about 50 minutes of a 2 hour job, is essentially Nastran waiting for the computer to do stuff, most often write to or read from disk. Spending some money, perhaps on speeding up the IO performance via RAIDed disks or through more RAM that can be used for scratch memory or buffer pooling, could eat substantially into the runtime of this type of job.
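The log-file check described above comes down to one ratio: CPU time (user plus system) divided by elapsed time. A minimal Python sketch of that arithmetic follows. The function names and the timings are hypothetical, chosen only to mirror the roughly 62% figure quoted for the Nastran job; they are not taken from the actual log files.

```python
def cpu_fraction(user_s: float, system_s: float, elapsed_s: float) -> float:
    """Return (user + system) CPU time as a fraction of elapsed wall-clock time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return (user_s + system_s) / elapsed_s


def waiting_seconds(user_s: float, system_s: float, elapsed_s: float) -> float:
    """Return the elapsed time not accounted for by CPU work, in seconds."""
    return elapsed_s - (user_s + system_s)


if __name__ == "__main__":
    # Illustrative numbers only (assumed, not read from a real log): a 2 hour job.
    user, system, elapsed = 4100.0, 364.0, 7200.0
    print(f"CPU busy for {cpu_fraction(user, system, elapsed):.0%} of the elapsed time")
    print(f"Time spent waiting: {waiting_seconds(user, system, elapsed) / 60:.1f} minutes")
```

This comparison is most meaningful for a job run without parallel processing. On a multi-threaded run the CPU time may be summed across threads, in which case it can exceed the elapsed time and the ratio needs to be read differently.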

3. Software solutions.

This feels like the easiest way to address job speed but, in many ways, it is the most complex. In the 50 years we have been using FEA simulation the various programmes have evolved different solvers and different built-in methods of speeding up these problems through parallel processing. It’s a huge area, so let’s look at some specific examples and talk about how you as a user can decide how to get the best result for your problems.

The easiest to use is shared memory parallel (SMP). We can request this on the command line when we run Nastran or Marc. It needs licensing, but if you have MSC One token licensing this feature is included. Essentially, shared memory parallel takes any part of the solution sequence that can be treated as lots of independent steps and spreads those steps across multiple threads. Examples include Gaussian elimination when assembling a matrix, or calculating stress results once we have solved for strain. Because only those parts of the sequence are parallelised, the scalability is limited, but it can still be effective.

It’s important to know that the scaling isn’t linear and there are diminishing returns from adding more cores. As you throw more and more threads at your computer the workload goes up and the queue of data in and out gets longer. I’ve often seen users look at their PC as a 32 core system and submit their jobs as 32 way. Often 16 of those 32 cores are not ‘real’; they are a result of having hyperthreading turned on in the BIOS. Hyperthreading is supposed to make virtual cores available using spare CPU cycles, but when Nastran is hammering the cores with constant operations there are no ‘spare’ cycles, so you slow everything down. It is recommended to turn off hyperthreading in the BIOS for computers used for High Performance Computing type work. Even so, running a 16 way job on a 16 core PC leaves nothing for the OS, for Outlook, for your pre/post processor and so on, so they have to take turns with the solver threads, which again slows everything down.
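One standard way to picture the diminishing returns is Amdahl's law, which the discussion above does not name but which describes the same effect: if only a fraction of the runtime can be parallelised, the serial remainder caps the achievable speedup. The sketch below is illustrative only. The 70% parallel fraction is an assumed value, not a measurement from Nastran or Marc.

```python
def amdahl_speedup(parallel_fraction: float, n_threads: int) -> float:
    """Ideal speedup under Amdahl's law for a given parallelisable fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


if __name__ == "__main__":
    # Assumed for illustration: 70% of the runtime parallelises perfectly.
    for n in (1, 2, 4, 8, 16, 32):
        print(f"{n:>2} threads: ideal speedup {amdahl_speedup(0.7, n):.2f}x")
```

Note that this is an upper bound. It ignores the memory and I/O contention described above, which is why measured run times can actually get longer at high thread counts rather than merely levelling off.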

The example below is from Marc. It’s from a model used to predict springback of a composite test piece during the curing cycle. It’s not a big model, but it couples three physics domains (structure, thermal and cure chemistry) and is run as a non-linear transient solution. We’re using the Multi-frontal Sparse solver and scaling 1-2-4-6-8-16 way parallel to see how the elapsed time changes.

From the graph it’s very obvious that there’s little benefit to be had from going from 4 way to 6 way parallel on this PC with this solver and beyond that the run times get longer again as the PC is overwhelmed. If we switch to the Pardiso solver then we do get further improvements in time beyond that level, but still not out to 16 way.

An alternative approach to parallelisation is called Distributed Memory Parallel (DMP). In this approach we segment the model into domains, solving each domain on a different processor and negotiating the response on the shared boundary nodes via a message passing interface (MPI). This parallel architecture means that the processes are much more independent of each other and much more of the solution sequence is parallelised, so the performance gains are larger. The parallel processes also don’t all have to be in the same physical hardware: this can be run on a Linux cluster or just a set of workstations connected with a fast network such as Gigabit Ethernet or InfiniBand.

The model for this comparison is a dense solid mesh with around 1.5M DOF, run as a non-linear static analysis. It was run 1, 2, 4 and 8 way parallel using both the SMP and DMP methods to compare their relative performance.

It’s obvious that the DMP method gives the best performance, including super-linear scaling going from 1 to 2 way parallel, probably due to the two domains having much less impact on the hardware than the non-parallelised run. Both are showing the limit of useful improvement on this computer at 4 way parallel.
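For readers who want to put numbers on a comparison like this, speedup is the 1 way elapsed time divided by the N way elapsed time, and parallel efficiency is that speedup divided by N. An efficiency above 1.0 is what is meant by super-linear scaling. The following Python sketch shows the calculation; the timings in it are hypothetical placeholders, not the measurements behind the graph.

```python
def scaling_table(elapsed_by_ways: dict[int, float]) -> list[tuple[int, float, float]]:
    """Return (ways, speedup, efficiency) rows relative to the 1 way run."""
    if 1 not in elapsed_by_ways:
        raise ValueError("a 1 way (serial) timing is needed as the baseline")
    baseline = elapsed_by_ways[1]
    rows = []
    for ways in sorted(elapsed_by_ways):
        speedup = baseline / elapsed_by_ways[ways]
        efficiency = speedup / ways  # 1.0 is linear scaling, above 1.0 is super-linear
        rows.append((ways, speedup, efficiency))
    return rows


if __name__ == "__main__":
    # Hypothetical elapsed times in seconds, for illustration only.
    timings = {1: 1000.0, 2: 450.0, 4: 300.0, 8: 310.0}
    for ways, speedup, efficiency in scaling_table(timings):
        flag = " (super-linear)" if efficiency > 1.0 else ""
        print(f"{ways} way: speedup {speedup:.2f}x, efficiency {efficiency:.2f}{flag}")
```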

Dynamics

The examples so far have all been for non-linear statics. The other area where customers complain about runtime is with modal dynamics, particularly with Nastran. When running a modal dynamics analysis the solution of the normal modes can represent a huge proportion of the total elapsed time, but there are ways of reducing this time. We can take a simple shared memory parallel approach, where Nastran runs independent parts of the solution sequence as before. The example model in this case has around 2.5M DOF and 637 modes in the 0-2kHz range.

We can see a small improvement going to 4 threads but on this hardware using further threads is increasing the total time.

We can again look at splitting the model into domains, only this time we have more options. We can segment the geometry into 2, 4, 8 or more segments, or we can split the model in frequency space and solve, for example, 0-500Hz on one processor, 501-1000Hz on another and so on. This latter approach does require a lot of resource: we’re now running a set of identical jobs to solve the problem, so it is best used in a distributed environment like a cluster, with a dedicated compute node per domain. Adding the geometry domain and frequency domain results to the graph shows that these methods perform better than the SMP option but, again, do not scale beyond 4 way parallel.
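To make the frequency-space split concrete, here is a small Python sketch that divides a frequency range into one band per domain. It assumes equal-width bands purely for illustration. The function name is hypothetical, and this is not a description of how Nastran chooses its frequency segments; since modes are rarely spread evenly across the range, a real solver may balance the bands differently.

```python
def frequency_bands(f_min: float, f_max: float, n_domains: int) -> list[tuple[float, float]]:
    """Split [f_min, f_max] into n_domains equal-width (low, high) bands in Hz."""
    if n_domains < 1:
        raise ValueError("n_domains must be at least 1")
    if f_max <= f_min:
        raise ValueError("f_max must be greater than f_min")
    width = (f_max - f_min) / n_domains
    # Build the band edges, pinning the last edge to f_max to avoid rounding drift.
    edges = [f_min + i * width for i in range(n_domains)] + [float(f_max)]
    return list(zip(edges[:-1], edges[1:]))


if __name__ == "__main__":
    # The 0-2kHz range from the example model, split four ways.
    for domain, (low, high) in enumerate(frequency_bands(0.0, 2000.0, 4), start=1):
        print(f"Domain {domain}: {low:.0f} Hz to {high:.0f} Hz")
```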

There is another option with Nastran modal: Automated Component Mode Synthesis (ACMS). This method automatically cuts the model into many domains (7049 for this example), reduces each one to a superelement, solves these and then recombines them for the global solution. These superelements are tiny and their solutions are independent of one another, so scalability is better, although the total time can never drop below the time it takes to do the split and reassembly. Plotting these results with the others above makes clear the benefit of this technique.

This shows a reduction of >30% in the total time without using parallel processing with further gains from 2 and 4 way parallel.

Conclusion

The examples above show that speeding up your job is not just a question of throwing all the cores at it; you could in fact be making it take longer.

There’s a sweet spot for the number of cores to use on any given computer, and that too can vary by the overall size of your job. The move from RISC type UNIX hardware for high performance computing to commodity PC workstations means that the hardware most of us are using is not designed to do what we’re asking.

If you want to get the optimal job turnaround for your problems it’s worth talking to your support provider to understand all your options, and then spending some time as I have done here, running batches of your typical jobs using different settings to understand where the sweet spot is in your case.
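Once you have a batch of timings from such a study, picking the sweet spot can be automated. The Python sketch below returns the smallest core count whose elapsed time is within a chosen tolerance of the fastest run, on the reasoning that a few percent of runtime is rarely worth tying up extra cores. The function name, the 5% default tolerance and the timings are all assumptions for illustration, not results from the jobs discussed above.

```python
def sweet_spot(elapsed_by_cores: dict[int, float], tolerance: float = 0.05) -> int:
    """Return the smallest core count within `tolerance` of the best elapsed time."""
    if not elapsed_by_cores:
        raise ValueError("at least one timing is required")
    if tolerance < 0:
        raise ValueError("tolerance must not be negative")
    best_time = min(elapsed_by_cores.values())
    limit = best_time * (1.0 + tolerance)
    return min(cores for cores, t in elapsed_by_cores.items() if t <= limit)


if __name__ == "__main__":
    # Hypothetical elapsed times in seconds from a batch of otherwise identical runs.
    timings = {1: 3600.0, 2: 2100.0, 4: 1500.0, 6: 1480.0, 8: 1550.0, 16: 1900.0}
    print(f"Suggested setting: {sweet_spot(timings)} way parallel")
```

Repeat the study for each typical job size, since the best setting on a given computer can change with the size of the job.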

So if runtime is an issue for your productivity, and you’d like to understand the best way to reduce it, please get in touch.
