Those of us performing FEA have often found ourselves thinking this. All too often in my three decades in the industry I’ve come across people who’ve addressed this by building smaller models, reducing the fidelity of the output or, in one horrific case, using linear tetrahedral elements instead of parabolic ones and fudging the elastic modulus of the materials to try to counter their over-stiff behaviour, based on one simple cantilever example.
The proliferation of multicore hardware over the last 20 years has led to more users ticking the box for parallel processing to try to improve their turnaround, but without careful consideration of the type of problem you have and the nature of the parallelisation in your chosen software, this too can cause problems, as we will see.
There are three main ways you can go about getting your results faster.
1. Reduce your model size.
There are legitimate ways to reduce the size of the problem without affecting accuracy, some of which have been
discussed in previous blog articles, such as the use of symmetry, zoom modelling and superelements.
2. Improve your hardware.
You can throw money at bigger, better, faster hardware. A good way to tell if your hardware is letting you down is
to look at the log files of your current jobs. Examples of Nastran and Marc are shown below, but your FEA solver
should be producing similar output.
The former job, the Marc example, shows that the total elapsed time and the CPU time are very close. There is some
time lost, about 190 s, but not enough to warrant a hardware upgrade on the grounds that the machine is being overwhelmed.
The latter job, the Nastran example, shows a big discrepancy. The sum of User plus System time only represents about
62% of the total elapsed time. This difference, about 50 minutes of a 2 hour job, is essentially Nastran waiting for
the computer to do stuff, most often write to or read from disk. Spending some money, perhaps on speeding up the IO
performance via RAIDed disks or through more RAM that can be used for scratch memory or buffer pooling, could eat
substantially into the runtime of this type of job.
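This kind of check is easy to quantify from your own log files by comparing the CPU figures against the elapsed time. The sketch below is a minimal illustration in Python; the function name and the numbers are invented for the example, chosen to be in the spirit of the Nastran log described above.

```python
def io_wait_fraction(user_s, system_s, elapsed_s):
    """Fraction of elapsed wall-clock time the solver spent waiting
    (typically on disk I/O) rather than doing useful CPU work."""
    cpu = user_s + system_s
    return (elapsed_s - cpu) / elapsed_s

# Hypothetical figures: a 2-hour job whose User + System time is
# only ~62% of the total elapsed time, as in the Nastran example.
elapsed = 2 * 3600           # 7200 s of wall clock
cpu = 0.62 * elapsed         # ~4464 s of actual CPU work
wait = io_wait_fraction(cpu, 0.0, elapsed)
print(f"waiting: {wait:.0%} of elapsed, roughly {wait * elapsed / 60:.0f} minutes")
```

A job where this fraction is consistently large is a candidate for faster disks or more RAM rather than more cores.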
3. Software solutions
It feels like the easiest, but in many ways this is the most complex way to address job speed. In the 50 years we
have been using FEA simulation, the various programs have evolved different solvers and different built-in methods
of speeding up these problems through parallel processing. It’s a huge area, so let’s look at some
specific examples and talk about how you, as a user, can decide how to get the best result for your problems.
The easiest to use is shared memory parallel (SMP). We can request this on the command line when we run Nastran or
Marc. It needs licensing, but if you have MSC One token licensing this feature is included. Essentially, shared
memory parallel takes any part of the solution sequence that can be treated as lots of independent steps, such as
the Gaussian elimination used when decomposing the matrix, or calculating stress results once we have solved for the
strains, and spreads those steps across threads.
As such the scalability is limited, but it can be effective. It’s important to know that the scaling isn’t
linear and there’s a law of diminishing returns from adding more cores. As you throw more and more threads at
your computer the workload goes up and the queue of data in and out gets longer. I’ve often seen users look at
their PC as a 32-core system and submit their jobs 32-way. Often 16 of those 32 cores are not ‘real’;
they are the result of having hyperthreading turned on in the BIOS. This is supposed to make virtual cores available
using spare CPU cycles, but because Nastran hammers the cores with constant operations there are no
‘spare’ cycles, so you slow everything down. It is recommended to turn off hyperthreading in the BIOS on
computers used for high-performance-computing work. Even then, running a 16-way job on a 16-core PC leaves
nothing for the OS, for Outlook, for your pre/post-processor and so on, so they have to take turns with the solver
threads, which again slows everything down.
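The diminishing returns described above are the familiar consequence of Amdahl’s law: if only part of the solution sequence parallelises, the achievable speedup is bounded however many cores you add. A minimal sketch, where the 80% parallel fraction is purely an assumed figure for illustration:

```python
def amdahl_speedup(cores, parallel_fraction):
    """Amdahl's law: ideal speedup when only a fraction of the
    work can be parallelised; the serial remainder caps the gain."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# Assume 80% of the solution sequence parallelises: each doubling
# of cores returns less, and the speedup can never exceed 1/0.2 = 5x.
for n in (1, 2, 4, 8, 16, 32):
    print(f"{n:2d} cores -> {amdahl_speedup(n, 0.80):.2f}x")
```

And this is the ideal case; in practice memory and I/O contention between the threads pushes the real curve below this, which is why elapsed times can actually get worse at high core counts.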
The example below is from Marc. It’s from a model used to predict the springback of a composite test piece during
the curing cycle. It’s not a big model, but it couples three physics domains: structure, thermal and cure
chemistry, and is run as a non-linear transient solution. We’re using the multifrontal sparse solver and
scaling 1-2-4-6-8-16 way parallel to see how the elapsed time changes.
From the graph it’s very obvious that there’s little benefit to be had from going from 4 way to 6 way
parallel on this PC with this solver and beyond that the run times get longer again as the PC is overwhelmed. If we
switch to the Pardiso solver then we do get further improvements in time beyond that level, but still not out to 16
way.
An alternative approach to parallelisation is called Distributed Memory Parallel (DMP). In this approach we segment
the model into domains, solving each domain on a different processor and negotiating the response on the shared
boundary nodes via a message passing interface (MPI). This parallel architecture means that the threads are much
more independent of each other, much more of the solution sequence is parallelised so the performance gains are
larger and the parallel processes don’t all have to be in the same physical hardware –
this can be run on a Linux cluster or just a set of workstations connected with a fast network such as gigabit
ethernet or Infiniband.
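To make the idea concrete, here is a toy sketch in Python of the domain-decomposition concept (this is not how Marc or Nastran implement it, just an illustration): a 1D heat-conduction problem split into two domains that exchange their shared boundary values each iteration, standing in for the MPI messages.

```python
# Toy DMP sketch: 1D steady heat conduction, u(0)=0, u(20)=100,
# solved by Jacobi iteration on two domains. The "message passing"
# is just the ghost-value swap at the top of each iteration.
N = 21                       # grid points 0..20
m = 10                       # split: left domain owns 0..10, right owns 11..20
left = [0.0] * (m + 1)
right = [0.0] * (N - m - 1)
right[-1] = 100.0            # fixed hot end

for _ in range(3000):        # Jacobi iterations to convergence
    # Each domain sends its edge value to its neighbour (the "MPI message").
    ghost_for_left = right[0]      # value at global node 11, needed by left
    ghost_for_right = left[-1]     # value at global node 10, needed by right

    new_left = left[:]
    for i in range(1, m):
        new_left[i] = 0.5 * (left[i - 1] + left[i + 1])
    new_left[m] = 0.5 * (left[m - 1] + ghost_for_left)

    new_right = right[:]
    new_right[0] = 0.5 * (ghost_for_right + right[1])
    for j in range(1, len(right) - 1):
        new_right[j] = 0.5 * (right[j - 1] + right[j + 1])

    left, right = new_left, new_right

full = left + right          # reassembled global solution
print(round(full[m], 1))     # prints 50.0 (exact linear answer at node 10)
```

In a real DMP run each domain’s loop would execute on its own process, possibly on another machine, with only the two ghost values crossing the network per iteration, which is why the threads are so much more independent than in SMP.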
This model is a dense solid mesh with around 1.5M DOF, solved as a non-linear static analysis. It was run 1-, 2-, 4-
and 8-way parallel using both the SMP and DMP methods to compare their relative performance.
It’s obvious that the DMP method gives the best performance, including super-linear scaling going from 1 to 2
way parallel, probably due to the two domains having much less impact on the hardware than the non-parallelised run.
Both are showing the limit of useful improvement on this computer at 4 way parallel.
Dynamics
The examples so far have all been for non-linear statics. The other area where customers complain about runtime is
with modal dynamics, particularly with Nastran. When running a modal dynamics analysis the solution of the normal
modes can represent a huge proportion of the total elapsed time, but there are ways of reducing this time. We can
take a simple shared memory parallel approach, where Nastran runs independent parts of the solution sequence as
before. The example model in this case has around 2.5M DOF and 637 modes in the 0-2kHz range.
We can see a small improvement going to 4 threads, but on this hardware using further threads increases the total
time.
We can again look at splitting the model into domains, only this time we have more options: we can segment
the geometry into 2, 4, 8 or more segments, OR we can domain the model in frequency space and solve, for example,
0-500Hz on one processor, 501-1000Hz on another, and so on. This latter approach does require a lot of resource.
We’re now running a set of identical jobs to solve the problem, so this is best used in a distributed
environment like a cluster, with a dedicated compute node per domain for best results. Adding the geometry-domain
and frequency-domain results to the graph shows that these methods give better performance than the SMP option
but, again, will not scale beyond 4-way parallel.
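Conceptually, the frequency-domain split is just a partitioning of the analysis range into independent normal-modes jobs. A naive equal-width sketch (the function is illustrative; real schemes pick the band boundaries to balance the expected mode count per band):

```python
def frequency_bands(f_min, f_max, n_domains):
    """Split an analysis frequency range into equal-width bands,
    one per compute process. Equal width is a simplification: in
    practice the bands are sized so each holds similar mode counts."""
    width = (f_max - f_min) / n_domains
    return [(f_min + k * width, f_min + (k + 1) * width)
            for k in range(n_domains)]

# The 0-2 kHz range of the example model, split four ways:
for lo, hi in frequency_bands(0.0, 2000.0, 4):
    print(f"solve {lo:.0f}-{hi:.0f} Hz on its own process")
```

Since each band is a full eigensolution in its own right, this is the approach that most benefits from a dedicated compute node per domain.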
There is another option with Nastran modal dynamics, Automated Component Mode Synthesis (ACMS). This method
automatically cuts the model into many domains – for this example 7049 domains – reduces each one to a superelement,
solves these and then recombines them for the global solution. These superelements are tiny and their solutions are
independent of one another, so scalability is better, albeit asymptotic to the time it takes to do the split and
reassembly. Plotting these results with the others above makes the benefit of this technique clear.
This shows a reduction of more than 30% in the total time without using parallel processing at all, with further
gains from 2- and 4-way parallel.
Conclusion
The examples above show that speeding up your job is not just a question of throwing all the cores at it; do that
blindly and you could in fact be making it longer.
There’s a sweet spot for the number of cores to use on any given computer, and that too can vary with the
overall size of your job. The move from RISC-based UNIX hardware for high-performance computing to commodity PC
workstations means that the hardware most of us are using was not designed to do what we’re asking of it.
If you want the optimal job turnaround for your problems, it’s worth talking to your support provider to
understand all your options, and then spending some time, as I have done here, running batches of your typical jobs
with different settings to understand where the sweet spot is in your case.
So if runtime is an issue for your productivity, and you’d like to understand the best way to reduce it,
please get in touch.