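Reference blog: https://www.comsol.com/blogs/added-value-task-parallelism-batch-sweeps/
We know that parallel computing can speed up a computation, but the speedup is not unlimited, and how much we gain depends on how the algorithm is written. This post explains the limitations of parallel computing from a theoretical standpoint, and shows how to use the COMSOL batch sweep to improve performance when you reach these limits.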
Amdahl’s and Gustafson-Barsis’ laws
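Algorithms can be divided into serial algorithms and parallel algorithms. Adding compute units (also called processes or threads) speeds up a parallel algorithm, but has no effect on a serial one. The algorithms we write in practice are usually a mix of the two. Suppose the parallel fraction of the code is $\varphi$, so that the serial fraction is $(1-\varphi)$. Let $T(P)$ be the computation time, where $P$ is the number of processes. Writing the computation time for $P=1$ as $T(1)$, the computation time with $P$ active processes is $T(P) = T(1)\left(\frac{\varphi}{P} + (1-\varphi)\right)$, and the corresponding speedup is $S(P) = \frac{T(1)}{T(P)} = \frac{1}{\frac{\varphi}{P} + (1-\varphi)}$.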
Amdahl’s Law
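For 100% parallelized code, the sky is the limit. When the parallel fraction $\varphi$ is less than 1, the speedup $S(P)$ approaches a limit as the number of processes $P \to \infty$:
$\lim_{P \to \infty} S(P) = \frac{1}{1-\varphi}$. An example is shown in the figure below: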
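To get a feel for how quickly this limit bites, here is a minimal Python sketch of Amdahl's law. The names `amdahl_speedup`, `amdahl_limit`, and `phi` are my own choices for illustration, and the values of `phi` in the loop are arbitrary examples, not measurements.

```python
def amdahl_speedup(phi: float, p: int) -> float:
    """Speedup S(P) = 1 / (phi / P + (1 - phi)) for parallel fraction phi."""
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must lie in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (phi / p + (1.0 - phi))


def amdahl_limit(phi: float) -> float:
    """Upper bound on the speedup as the number of processes grows without bound."""
    return float("inf") if phi == 1.0 else 1.0 / (1.0 - phi)


if __name__ == "__main__":
    # Arbitrary example fractions, chosen only to show the trend.
    for phi in (0.5, 0.9, 0.95, 1.0):
        row = ", ".join(f"P={p}: {amdahl_speedup(phi, p):6.2f}" for p in (1, 4, 16, 64))
        print(f"phi={phi:.2f} -> {row} | limit: {amdahl_limit(phi):.1f}")
```

Even a small serial fraction caps the speedup: with half of the code serial, no number of processes can make it more than twice as fast.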
Gustafson-Barsis’ Law
Amdahl’s law assumes that the size of the problem is fixed. Yet, by assuming that the size of the problem increases with the number of added processes, then you are utilizing all the processes to an assumed level, and the speedup of the performed computations remains unbounded.
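As a rough illustration of the difference, the sketch below compares the fixed-size (Amdahl) speedup with the scaled speedup in its usual textbook form, S(P) = (1 - phi) + phi * P. Here `phi` is taken to be the parallel fraction of the run time measured on the parallel system; the function names and the value 0.9 are my own assumptions for the example.

```python
def gustafson_speedup(phi: float, p: int) -> float:
    """Scaled speedup when the problem grows with the number of processes."""
    return (1.0 - phi) + phi * p


def amdahl_speedup(phi: float, p: int) -> float:
    """Fixed-size speedup, shown here only for comparison."""
    return 1.0 / (phi / p + (1.0 - phi))


if __name__ == "__main__":
    phi = 0.9  # arbitrary example value
    for p in (1, 4, 16, 64, 256):
        print(f"P={p:3d}: fixed size {amdahl_speedup(phi, p):6.2f}, "
              f"scaled {gustafson_speedup(phi, p):7.2f}")
```

The scaled speedup grows linearly in P, while the fixed-size speedup flattens out below 1 / (1 - phi).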
The Cost of Communication
Gustafson-Barsis’ law implies that we are only restricted in the size of the problem we can compute, but sometimes communication is expensive. Let’s consider an overhead that is dominated by the communication and synchronization required in parallel processing, and model this as time added to the computation time.
In the case of no overhead, the result is as predicted by Amdahl’s law (see the previous figure), but once we start adding overhead, the speedup begins to suffer.
For a quadratic function, the result is worse and, as you might recall from our earlier blog post on distributed memory computing, the increase of communication is quadratic in the case of all-to-all communication. Due to this phenomenon, we cannot expect to have a speedup on a cluster for, say, a small time-dependent problem when adding more and more processes. The amount of communication would increase faster than any gain from added processes. Note, however, that we are considering a fixed-size problem here. In fact, as we increase the size of the problem, the “slowdown” effect introduced through communication becomes less relevant.
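The effect of overhead can be sketched with a simple model. I assume here that the overhead adds a term `c * f(P)` to the relative run time, so that T(P) / T(1) = phi / P + (1 - phi) + c * f(P). The constant `c`, the overhead functions `linear` and `quadratic`, and the values 0.95 and 0.001 are hypothetical choices for illustration and are not taken from the blog.

```python
from typing import Callable


def speedup_with_overhead(phi: float, p: int, c: float,
                          f: Callable[[int], float]) -> float:
    """Speedup when a communication cost c * f(P) is added to the run time.

    Assumed model (times are relative to T(1) without overhead):
        T(P) / T(1) = phi / P + (1 - phi) + c * f(P)
    """
    return 1.0 / (phi / p + (1.0 - phi) + c * f(p))


def best_process_count(phi: float, c: float,
                       f: Callable[[int], float], max_p: int = 64) -> int:
    """Process count in 1..max_p that gives the largest modeled speedup."""
    return max(range(1, max_p + 1),
               key=lambda p: speedup_with_overhead(phi, p, c, f))


def linear(p: int) -> float:
    """Overhead that grows proportionally to the number of processes."""
    return float(p)


def quadratic(p: int) -> float:
    """Overhead that grows like all-to-all communication."""
    return float(p * p)


if __name__ == "__main__":
    phi, c = 0.95, 0.001  # illustrative values only
    for name, f in (("linear", linear), ("quadratic", quadratic)):
        p_best = best_process_count(phi, c, f)
        s_best = speedup_with_overhead(phi, p_best, c, f)
        print(f"{name:9s} overhead: best P = {p_best}, speedup = {s_best:.2f}")
```

In this model the speedup no longer grows monotonically: beyond some process count the overhead term dominates and adding processes makes the run slower, and that turning point comes much earlier for quadratic overhead than for linear overhead.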
Batch Sweeps in COMSOL Multiphysics
As our example model, we will use the electrodeless lamp, which is available in the Model Gallery. This model is small, at around 80,000 degrees of freedom, but needs about 130 time steps in its solution. To make this transient model parametric as well, we will compute the model for several values of the lamp power, namely 50 W, 60 W, 70 W, and 80 W.
On my workstation, a Fujitsu® CELSIUS® equipped with an Intel® Xeon® E5-2643 quad core processor and 16 GB of RAM, I obtained the following compute times:
Number of Cores | Compute Time per Parameter | Compute Time for Sweep |
---|---|---|
1 | 30 mins | 120 mins |
2 | 21 mins | 82 mins |
3 | 17 mins | 68 mins |
4 | 18 mins | 72 mins |
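As the table above shows, simply using more cores does not keep making the computation faster. In fact, going from 3 cores to 4 cores made it slower.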
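The same table can be read in terms of speedup and parallel efficiency (speedup divided by the number of cores). The short Python sketch below only reuses the per-parameter times from the table; the names `minutes_per_parameter` and `speedup_and_efficiency` are my own.

```python
# Compute time per parameter value in minutes, copied from the table above.
minutes_per_parameter = {1: 30, 2: 21, 3: 17, 4: 18}


def speedup_and_efficiency(times: dict) -> dict:
    """Map cores -> (speedup relative to 1 core, speedup per core)."""
    t1 = times[1]
    return {cores: (t1 / t, t1 / t / cores) for cores, t in times.items()}


results = speedup_and_efficiency(minutes_per_parameter)

if __name__ == "__main__":
    for cores, (speedup, efficiency) in sorted(results.items()):
        print(f"{cores} core(s): speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

The efficiency drops with every added core, and the speedup itself falls between 3 and 4 cores.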
We will now use the batch sweep functionality to parallelize this problem in another way: we will switch from data parallelism to task parallelism. We will create a batch job for each parameter value and see what this does to our computation times.
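As the figure above shows, when we split the work into four jobs that run at the same time, each using one core, the computation becomes much faster.
When I tested the squareloop job on my own computer (4 cores, 16 GB of RAM), I found that setting up a batch sweep does indeed speed things up. The measured times were 3 min 41 s, 2 min 20 s, and 2 min 6 s. The speedup is not very pronounced.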
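A back-of-the-envelope sketch of why task parallelism helps for this sweep: the helper below (a hypothetical `sweep_minutes`, not a COMSOL function) assumes that runs are scheduled in waves and that concurrent single-core jobs do not slow each other down. That second assumption is optimistic, since the jobs share memory bandwidth, so the task-parallel number is an idealized lower bound and not a measurement.

```python
import math


def sweep_minutes(minutes_per_run: float, n_parameters: int,
                  concurrent_jobs: int = 1) -> float:
    """Wall time when runs are scheduled in waves of `concurrent_jobs`.

    Assumes every run takes `minutes_per_run` no matter how many other
    jobs are running at the same time (an idealization).
    """
    waves = math.ceil(n_parameters / concurrent_jobs)
    return waves * minutes_per_run


n_parameters = 4  # lamp power of 50 W, 60 W, 70 W, and 80 W

# Data parallelism: each run uses all 4 cores (18 min), one run after another.
data_parallel = sweep_minutes(18, n_parameters)

# Task parallelism: each run uses 1 core (30 min), 4 runs at the same time.
task_parallel_ideal = sweep_minutes(30, n_parameters, concurrent_jobs=4)

if __name__ == "__main__":
    print(f"data parallel, 4 cores per run: {data_parallel} min")
    print(f"task parallel, idealized bound: {task_parallel_ideal} min")
```

The data-parallel estimate reproduces the 72 minutes in the table. The idealized task-parallel bound is well below that, which agrees with the direction of the result above, although a real batch sweep will take longer than this bound.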
Conclusion
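Setting up parallel computing in COMSOL is a complicated matter, much like choosing a solver. It depends heavily on the problem being solved and on the performance characteristics of the computer.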
Selecting the right parallel configuration is not always easy, and it can be hard to know beforehand how you should “hybridize” your parallel computations. But as in many other cases, experience comes from playing around and testing, and with COMSOL Multiphysics, you have the possibility to do that. Try it yourself with different configurations and different models, and you will soon know how to set the software up in order to get the best performance out of your hardware.