Question
For this assignment youre going to use OpenMP to implement a parallel prefix sum that works with an arbitrary sized array and the default number
For this assignment youre going to use OpenMP to implement a parallel prefix sum that works with an arbitrary sized array and the default number of OpenMP threads.
Use the flat version of the algorithm presented in class that involves the following steps:
1. local prefix sum on each partition
2. a serial prefix sum of the maximum values of each partition
3. a second iteration over each partition, adding the maximum from the partition to the left.
Remember, this code evenly distributes size indices across nt threads:
int nt = omp_get_num_threads ();
int rank = omp_get_thread_num ();
int count = size / nt;
int start = rank * count ;
int mod = size % nt;
if ( rank < mod )
{ count ++; start += rank ; }
else { start += mod; }
Use omp_get_wtime() to calculate the wall clock times for various runs. How close to linear speedup do you get? You will be given accounts on the CSE Qbert cluster where you can test on 1-16 cores.
Your program should be compiled like this:
g++ -fopenmp -O3 -Wall source.cpp -o executable
Your program can be run with different numbers of threads like this:
OMP_NUM_THREADS=16 ./executable
This algorithm requires a large array and relatively many threads to see a performance improvement.
Heres a place to start:
# include
# include
# include
# include
// replace the contents of data with the prefix sum of the contents
void prefix_sum (std :: vector & data )
{
# pragma omp parallel default ( none ) shared ( data )
{
int rank = omp_get_thread_num ();
int nt = omp_get_num_threads ();
// Your code goes here .
}
}
int main (int argc , char * argv [])
{
// test with a vector of 1s
std :: vector data (10000 , 1);
double start = omp_get_wtime ();
prefix_sum ( data );
double end = omp_get_wtime ();
printf (" took %lf to prefix sum %d elements on %d ", (end - start ), (int ) data . size (), omp_get_max_threads ());
}
Hints
Use #pragma omp barrier to synchronize threads within a parallel section.
Iterate over the prefix sum results to see if it works.
Comment out extraneous print statements in the final code.
Turn in ..
The source code file.
A screenshot of your code being compiled and run.
A plot of the wall clock time as a function of number of threads, for a large enough array that the values are meaningful.
Step by Step Solution
There are 3 Steps involved in it
Step: 1
Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started