Michal,

Thank you for the links, I'll look into those.

My current plan for combined MPI + OpenMP parallelization is to have two levels:

1) The calculation is divided into a number of MPI tasks at the beginning of the transport simulation and the results are combined at the end (this is the way things are done at the moment in Serpent 1)

2) Each criticality cycle is threaded using the "omp for" pragma, so that the neutron histories are distributed over separate threads (roughly as sketched below)
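In rough terms, the second level would look something like the following. This is only a sketch: SimulateHistory() and nsrc are placeholders for illustration, not the actual transport routines.

```
#include <omp.h>

/* Placeholder for whatever routine tracks a single neutron history */

void SimulateHistory(long n);

void TransportCycle(long nsrc)
{
  long n;

  /* Histories of one criticality cycle are split over the OpenMP threads, */
  /* so each thread handles its own share of the neutron histories.        */

#pragma omp parallel for
  for (n = 0; n < nsrc; n++)
    SimulateHistory(n);
}
```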

Something similar is done for other parts of the calculation, such as data processing and the solution of depletion equations.

Even though Monte Carlo calculation is ideally suited to parallelization, the unfortunate thing is that neutron transport differs from most of the problems discussed in textbooks and tutorials. The calculation doesn't involve CPU-intensive mathematical routines, but rather frequent access to reaction data stored in memory, which can amount to several megabytes or even gigabytes. For parallelization with MPI this means rather slow data transfer, but since the tasks communicate only at the beginning and the end of the transport cycle, the speed-up is almost linear. The drawback is that the memory demand is multiplied by the number of tasks.
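At the MPI level the structure is essentially just independent tasks whose results are combined once at the end, along these lines. The tally array and its size are made up here for illustration; the point is only that each task holds its own full copy of the data and communication happens only in the final reduction.

```
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NTALLY 1000

int main(int argc, char **argv)
{
  int rank, ntasks;
  double *tally, *total;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

  /* Every task holds a full copy of the data, which is why the total */
  /* memory demand is multiplied by the number of tasks.              */

  tally = (double *)calloc(NTALLY, sizeof(double));
  total = (double *)calloc(NTALLY, sizeof(double));

  /* ... run this task's share of the transport cycles, scoring into tally ... */

  /* Combine the results from all tasks only once, at the end */

  MPI_Reduce(tally, total, NTALLY, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("Combined results from %d tasks\n", ntasks);

  MPI_Finalize();

  free(tally);
  free(total);

  return 0;
}
```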

For OpenMP and shared memory things are a bit different. I suspect that the size of the data block and the way the data is organized in memory form a bottleneck when different threads try to access it. For some reason this behavior seems to be system-dependent (see the running times in my first post), which suggests that there might be something that can be done about it.

To reduce the complexity of the problem to a minimum, I wrote another simple test routine that allocates a large block of memory, fills it with random numbers, and performs simple operations that are timed.

omptester.c:

```
#include "header.h"
#include "locations.h"
#define FUNCTION_NAME "OmpTester:"
void OmpTester()
{
long max_threads, ptr, n, nt, N, test;
double t0, sum, sum0, *res;
/* Reset variables */
sum = 0.0;
sum0 = 0.0;
t0 = 0.0;
/* Initialize data structure */
InitData();
/* Set block size */
N = 100000000;
/* Allocate memory */
ptr = ReallocMem(DATA_ARRAY, N);
/* Fill data block with random numbers */
for (n = 0; n < N; n++)
DATA[ptr + n] = RandF();
/* Get maximum number of threads */
max_threads = omp_get_max_threads();
/* Allocate memory for results array */
res = (double *)calloc(max_threads, sizeof(double));
/* Loop over threads */
for (nt = 0; nt < max_threads; nt++)
{
/* reset results array */
for (n = 0; n < max_threads; n++)
res[n] = 0.0;
/* Set number of threads */
omp_set_num_threads(nt + 1);
/* Reset and start timer */
ResetTimer(0);
StartTimer(0);
/* Set test mode */
test = 4;
/* Parallelized loop */
#pragma omp parallel
{
#pragma omp for
for (n = 0; n < N; n++)
{
/* Operations */
if (test == 1)
res[omp_get_thread_num()] += erf(DATA[ptr + n]);
else if (test == 2)
res[omp_get_thread_num()] += DATA[ptr + n];
else if (test == 3)
res[omp_get_thread_num()] += erf(n);
else if (test == 4)
erf(n);
}
}
/* Stop timer */
StopTimer(0);
/* Calculate sum */
sum = 0.0;
for (n = 0; n < max_threads; n++)
sum = sum + res[n];
/* Save single-thread time and sum */
if (nt == 0)
{
t0 = TimerVal(0);
sum0 = sum;
}
/* Print */
printf("Threads: %ld check: %8.1E time: %1.2f factor: %1.2f\n", nt + 1,
sum/sum0 - 1.0, TimerVal(0), t0/TimerVal(0));
}
/* Free memory */
FreeMem();
free(res);
/* Terminate calculation */
exit(-1);
}
```

The routine was written for the new version, but it should compile with Serpent 1 source code as well, because the function names are the same. The compiler (gcc) requires an additional -fopenmp flag to handle the pragmas. To run the test, a call to the subroutine must be placed in main.c before anything else.
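For a quick stand-alone test, the compile command would be something like the following (the "..." stands for the rest of the source files and the output name is arbitrary):

```
gcc -O2 -fopenmp main.c omptester.c ... -lm -o omptester
```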

I tried the routine with four operations, set by the "test" variable:

1) Calculate erf() of a value stored in the data block and add to sum

2) Add value stored in data block to sum

3) Calculate erf() of loop index and add to sum

4) Calculate erf() of loop index without storing the result anywhere

Options 1 and 2 involve access to the large data block, option 3 only to the small results array, and option 4 to neither. The error function is there just to add some number crunching. The sum is calculated to ensure that the result is the same regardless of the number of threads. The value in the output after "check:" should be close to zero (or NAN with option 4, where the sums remain zero).

Here's the output for some test runs (3 GHz Intel Xeon, two quad-core processors):

```
test = 1
Threads: 1 check: 0.0E+00 time: 2.07 speed-up factor: 1.00
Threads: 2 check: 5.7E-14 time: 2.32 speed-up factor: 0.89
Threads: 3 check: 6.5E-14 time: 2.11 speed-up factor: 0.98
Threads: 4 check: 5.3E-14 time: 1.96 speed-up factor: 1.06
Threads: 5 check: -9.7E-14 time: 1.99 speed-up factor: 1.04
Threads: 6 check: 3.2E-14 time: 2.34 speed-up factor: 0.88
Threads: 7 check: -2.8E-14 time: 1.88 speed-up factor: 1.10
Threads: 8 check: -1.1E-15 time: 1.75 speed-up factor: 1.18
```

```
test = 2
Threads: 1 check: 0.0E+00 time: 0.52 speed-up factor: 1.00
Threads: 2 check: 7.2E-14 time: 0.64 speed-up factor: 0.81
Threads: 3 check: 4.9E-14 time: 0.70 speed-up factor: 0.74
Threads: 4 check: 5.2E-14 time: 0.74 speed-up factor: 0.70
Threads: 5 check: 5.0E-14 time: 0.82 speed-up factor: 0.64
Threads: 6 check: 5.6E-14 time: 0.94 speed-up factor: 0.55
Threads: 7 check: 5.5E-14 time: 0.80 speed-up factor: 0.65
Threads: 8 check: 5.5E-14 time: 0.69 speed-up factor: 0.75
```

```
test = 3
Threads: 1 check: 0.0E+00 time: 1.01 speed-up factor: 1.00
Threads: 2 check: 0.0E+00 time: 0.66 speed-up factor: 1.53
Threads: 3 check: 0.0E+00 time: 1.10 speed-up factor: 0.91
Threads: 4 check: 0.0E+00 time: 0.84 speed-up factor: 1.20
Threads: 5 check: 0.0E+00 time: 1.08 speed-up factor: 0.93
Threads: 6 check: 0.0E+00 time: 1.56 speed-up factor: 0.64
Threads: 7 check: 0.0E+00 time: 1.33 speed-up factor: 0.75
Threads: 8 check: 0.0E+00 time: 1.22 speed-up factor: 0.83
```

```
test = 4
Threads: 1 check: NAN time: 0.70 speed-up factor: 1.00
Threads: 2 check: NAN time: 0.35 speed-up factor: 2.00
Threads: 3 check: NAN time: 0.24 speed-up factor: 2.96
Threads: 4 check: NAN time: 0.18 speed-up factor: 4.01
Threads: 5 check: NAN time: 0.14 speed-up factor: 5.00
Threads: 6 check: NAN time: 0.12 speed-up factor: 5.78
Threads: 7 check: NAN time: 0.10 speed-up factor: 6.84
Threads: 8 check: NAN time: 0.09 speed-up factor: 7.58
```

So again I get very poor performance whenever data is read or written. What amazes me is that even option 3 doesn't show any speed-up, even though the value of erf(n) is stored in an array of only 8 values. The increase in speed is linear only if the value is not stored *anywhere*. This doesn't really make any sense, so I assume I am doing something wrong...
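One variant I can think of would be to avoid writing into the shared res[] array altogether and let OpenMP do the accumulation with a reduction clause, so that each thread sums into a private copy. Inside OmpTester() the loop for test mode 1 would then look roughly like this (untested sketch, same variables as above):

```
      /* Accumulate into a per-thread private copy of sum instead of the */
      /* shared res[] array (untested sketch)                            */

      sum = 0.0;

#pragma omp parallel for reduction(+:sum)
      for (n = 0; n < N; n++)
        sum += erf(DATA[ptr + n]);
```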