- Kemal Ahmed
- example is posted online, mirrors the renderer function
- Should not need to do any more than just add pragmas to the code
- bottleneck is the renderer: 2 for loops that iterate through all pixels in the image; this process is highly parallel, with no dependencies between pixels
- extern function parallelizing:
- Makefile will be posted (use pgCC for all files!)
- sshfs to mount the files!
- make sure it compiles with pgCC on the server before you add pragmas
- use the OpenACC API from openacc.org
- both loops: #pragma acc parallel
- you get an error that the sum function needs routine information: write #pragma acc routine seq above the sum function
- still errors because of the extern function: write #pragma acc routine seq above the extern declaration as well
- (needs to be above both)
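
A minimal sketch of the routine fix described above, not the posted example itself. The names `sum` and `fill` and their signatures are made up for illustration. It assumes the helper is declared extern in one place and defined in another, so the pragma appears above both. A compiler without OpenACC support simply ignores the pragmas and runs the loops on the CPU.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, declared here the way a header would declare it.
// routine seq asks the compiler to also build a device version of the function;
// seq means the body runs sequentially inside the calling thread.
#pragma acc routine seq
extern double sum(double a, double b);

// The same directive goes above the definition as well.
#pragma acc routine seq
double sum(double a, double b) {
    return a + b;
}

// Hypothetical stand-in for the renderer: writes sum(x, y) for every pixel.
std::vector<double> fill(int width, int height) {
    assert(width > 0 && height > 0);
    const int n = width * height;
    std::vector<double> result(n);
    double *out = result.data();   // raw pointer for the data clause

    #pragma acc parallel copyout(out[0:n])
    {
        #pragma acc loop
        for (int y = 0; y < height; ++y) {
            #pragma acc loop
            for (int x = 0; x < width; ++x) {
                out[y * width + x] = sum(x, y);   // call into the seq routine
            }
        }
    }
    return result;
}
```

Calling `fill(4, 3)` returns 12 values, one per pixel, each equal to x + y for that pixel.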
- make data copies explicit! put the data clauses on #pragma acc parallel
- #pragma acc loop goes above each of the 2 for loops
- copyin/copyout (for any variables used in the acc region)
- copyin = pcopyin (the p version is faster: present-or-copyin skips the transfer when the data is already on the GPU)
- #pragma acc parallel copyout(arr[0:height*width]) create(tmp, arbitrary) copyin(params)
- if passing the image around with image[0:width*height], that is only 1/3 of the array, since the array is [r][g][b] per pixel, so use image[0:width*height*3]
- compile with tesla, no ACC, get makefile running
- compile C++ with pgcc, NOT pgcc++
- copy: brings data back and forth between CPU and GPU
- copyin: brings data into the GPU, but not back
- private(): when you don't want the same variable to be changed by multiple threads
Pair