Question

Below is the rubric for a parallel program that performs trapezoidal integration. The rubric and all needed files are pasted below; additional information follows at the end. Thank you.

Timing the Trapezoidal Rule

Pacheco (3.2) shows a version of the program that has been called the "Hello World" of parallel computing. Numerical integration is standard fare in scientific computing. Hopefully you recall Riemann sums from calculus, where you added up the areas of thinner and thinner rectangles as an approximation of the total area under the curve:

\int_a^b f(x)\,dx = \lim_{\Delta x \to 0} \sum_{i=1}^{n} f(x_i)\,\Delta x

[Figure 1: Midpoint Riemann Sums (Wikipedia)]

[Figure 2: Trapezoids instead of Rectangles (Wikipedia)]

You should be able to tell from our two figures that trapezoidal integration should give much better results for fewer partial sums. From Figure 3.4 in Pacheco you can see that the area of a trapezoid is its averageHeight * width. The average height is the sum of the left and right sides divided by two. With a bit of thought you can see that each interior point is counted exactly once (one half for the trapezoid on its left, and one half for the trapezoid on its right); the two endpoints are counted just one half each. That gives you the large equation near the middle of page 95. To put it for our simple example in Figure 2 above:

sumOfTraps = h * [ f(x_0)/2 + f(x_1) + f(x_2) + f(x_3) + f(x_4)/2 ]

           = 0.5 * [ f(0.0)/2 + f(0.5) + f(1.0) + f(1.5) + f(2.0)/2 ]

Pacheco then gives the classic (serial) pseudocode for numerical trapezoidal integration.

Then Pacheco decides how to divide the work over the number of MPI processes we have. He decides to divide the data (x values). For NP processes he will divide the x-range into NP equal chunks. Then run a trapezoidal rule for each chunk of x-values. Finally, add up all those areas. In our example, with NP=2, he does one integration from 0 to 1 and a second integration from 1 to 2. Adding those two (which may each have thousands of trapezoids) yields his answer.

He uses a traditional SPMD division algorithm where the rank of each process is used to compute what are called local_n, local_a, and local_b in his Program 3.2 code.

We want to integrate ∫ from 0 to 1 of 4/(1 + x^2) dx using the trapezoidal rule. As we did in class, we want to compare results for n = 512 to 1073741824, doubling n on each pass. You may have noticed our bounds are 2^9 up through 2^30. We are dividing the range into powers of two to help mitigate floating-point round-off errors. But we want to make some changes to Pacheco's Program 3.2:

We want to divide the work over all our processes. That is, rather than run NP trapezoidal integrations (one for each process), we want to divide the summing up in a single trapezoidal integration across all our processes. That will mean you need to correctly handle the two endpoints: f(x0) and f(xn). This is still SPMD. Only now you need to use the rank of each process to calculate the values of i for which it is doing a sum of f(xi) values. To put it another way, Pacheco divided the data, so each process had start/stop values of x: local_a and local_b. We are dividing the work, so each process will have a start-i and a stop-i. NB: beware integer division!!

There is no good reason to make the slaves queue up to sum the results. Rewrite lines 22-26 in Pacheco's code so the Master takes the slave responses first come, first served.

To see the benefits of MPI we need to time the generation of all these 22 values (see the trap function). That is, one run will be the generation of our whole table (about 25 seconds on my laptop). You should report the time of your serial program (developed in class), and then the times for NP = 4 to 16 by 4s. That is, at the end of it all, we will have a table of 5 runtimes: one serial and four parallel (with different numbers of processes).

We are interested in the time between the point where MPI is initialized to the point where Master has printed the final result to its table. For the serial run, select the equivalent places in the code. Serial is NOT running MPI with NP=1. We shall require all the processes to wait before starting the next number-of-intervals until Master has printed the results of the current number-of-intervals. HINT: MPI_Barrier.

One approach to this data collection would have Master write timing results to a simple text file. C (and Bash) will allow you to append to the end of an existing file. Thus, each run would print the normal table of results to the screen, AND then Master would write timing results to a file.

So your output will be pretty much like our in-class example. But you will also report elapsed (wall clock) time for each pass.

end rubric

Pacheco's Program 3.2:

#include <stdio.h>
#include <mpi.h>

double f(double x);   /* integrand; supplied elsewhere in Pacheco's text */

double Trap(double leftPoint, double rightPoint, int count, double baseLen) {
    double estimate;
    double x;
    int i;

    estimate = (f(leftPoint) + f(rightPoint)) / 2.0;
    for (i = 1; i <= count - 1; i++) {
        x = leftPoint + i * baseLen;
        estimate += f(x);
    }
    estimate = estimate * baseLen;

    return estimate;
}

int main(void) {
    int myRank;
    int commSZ;
    int localN;
    int n = 1024;
    double a = 0.0;
    double b = 1.0;
    double h;
    double localA;
    double localB;
    double localInt;
    double totalInt;
    int source;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
    MPI_Comm_size(MPI_COMM_WORLD, &commSZ);

    h = (b - a) / n;
    localN = n / commSZ;

    localA = a + myRank * localN * h;
    localB = localA + localN * h;
    localInt = Trap(localA, localB, localN, h);

    if (myRank != 0) {
        MPI_Send(&localInt, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        totalInt = localInt;
        for (source = 1; source < commSZ; source++) {
            MPI_Recv(&localInt, 1, MPI_DOUBLE, source, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            totalInt += localInt;
        }
    }

    if (myRank == 0) {
        printf("With n = %d trapezoids, our estimate\n", n);
        printf("of the integral from %f to %f = %.15e\n", a, b, totalInt);
    }
    MPI_Finalize();
    return 0;
}

The in-class example referenced above:

#include <stdio.h>
#include <math.h>

long double f(long double x) {
    long double y = 4.0 / (1.0 + x * x);
    return y;
}

int main(void) {
    int interval = 256;   /* half of 512: it is doubled before first use,
                             so the table starts at 512 */

    printf("#intervals \t approximation \t error\n");

    while (interval < 1e9) {          /* while interval is less than 1 billion */
        interval *= 2;                /* double the interval count */
        long double dx = 1.0 / interval;
        long double approx = 0;
        long double x = 0;
        for (int i = 0; i < interval; i++) {
            approx += f(x) * dx;      /* accumulate the left-point sum */
            x += dx;
        }
        /* find the error by subtracting pi from the approximation */
        long double error = approx - M_PI;
        printf("%10d \t%.12Lf %.2Le\n", interval, approx, error);
    }
    printf("\nNormal Termination\n");
    return 0;
}

additional info:

totalInt = localInt;

The code below (lines 22-26 of Pacheco's program) needs to be changed so the Master takes the results first come, first served among the processes.

for (source = 1; source < commSZ; source++) {
    MPI_Recv(&localInt, 1, MPI_DOUBLE, source, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    totalInt += localInt;
}
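One common way to take the results first come, first served is to receive from MPI_ANY_SOURCE instead of looping over the ranks in order. A sketch, reusing the variable names above:

```c
/* Master accepts each slave's partial sum in whatever order it arrives */
int i;
MPI_Status status;
totalInt = localInt;
for (i = 1; i < commSZ; i++) {
    MPI_Recv(&localInt, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
             MPI_COMM_WORLD, &status);
    /* status.MPI_SOURCE identifies which rank this sum came from, if needed */
    totalInt += localInt;
}
```

The loop still runs commSZ - 1 times, but no slave is forced to wait behind a slower, lower-numbered rank.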

Also, the integration work needs to be divided among all the cores (e.g., 4 ways for NP = 4), and wall-clock timing needs to be added. Thank you.
