Mezzano · July 29, 2016 11:56
diff --git a/deepcl_unittests.txt b/deepcl_unittests.txt
 args: ./deepcl_unittests --gtest_filter=-DATA*:SLOW*
 Note: Google Test filter = -DATA*:SLOW*
 [==========] Running 158 tests from 29 test cases.
 [----------] Global test environment set-up.
 [----------] 7 tests from testClBlas
 [ RUN      ] testClBlas.basic
 DEBUG TANGUY: 18200632Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 clblas teardown
 unknown file: Failure
 C++ exception with description "clblasSgemm() failed with -11" thrown in the test body.
 [  FAILED  ] testClBlas.basic (77 ms)
 [ RUN      ] testClBlas.transA
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 1 2 9 
 3 7 5 
 initializing clblas
 clblas teardown
 unknown file: Failure
 C++ exception with description "clblasSgemm() failed with -11" thrown in the test body.
 [  FAILED  ] testClBlas.transA (52 ms)
 [ RUN      ] testClBlas.transB
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 3 
 -1 
 initializing clblas
 clblas teardown
 unknown file: Failure
 C++ exception with description "clblasSgemm() failed with -11" thrown in the test body.
 [  FAILED  ] testClBlas.transB (55 ms)
 [ RUN      ] testClBlas.colMajor
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 clblas teardown
 unknown file: Failure
 C++ exception with description "clblasSgemm() failed with -11" thrown in the test body.
 [  FAILED  ] testClBlas.colMajor (50 ms)
 [ RUN      ] testClBlas.colMajor2
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 clblas teardown
 unknown file: Failure
 C++ exception with description "clblasSgemm() failed with -11" thrown in the test body.
 [  FAILED  ] testClBlas.colMajor2 (48 ms)
 [ RUN      ] testClBlas.colMajorTransA
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 clblas teardown
 unknown file: Failure
 C++ exception with description "clblasSgemm() failed with -11" thrown in the test body.
 [  FAILED  ] testClBlas.colMajorTransA (43 ms)
 [ RUN      ] testClBlas.colMajorTransB
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 clblas teardown
 unknown file: Failure
 C++ exception with description "clblasSgemm() failed with -11" thrown in the test body.
 [  FAILED  ] testClBlas.colMajorTransB (51 ms)
 [----------] 7 tests from testClBlas (377 ms total)

 [----------] 1 test from testDeepCL
 [ RUN      ] testDeepCL.basic
 unknown file: Failure
 C++ exception with description "No devices found" thrown in the test body.
 [  FAILED  ] testDeepCL.basic (0 ms)
 [----------] 1 test from testDeepCL (0 ms total)

 [----------] 23 tests from testupdateweights
 [ RUN      ] testupdateweights.conv1
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 layer 0:InputLayer{ outputPlanes=2 outputSize=5 }
 layer 1:ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=5 numFilters=2 filterSize=3 outputSize=3 padZeros=0 biased=0 skip=0} }
 layer 2:SquareLossLayer{}

 layer 0:InputLayer{ outputPlanes=2 outputSize=5 }
 layer 1:ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=5 numFilters=2 filterSize=3 outputSize=3 padZeros=0 biased=0 skip=0} }
 layer 2:SquareLossLayer{}

 batchSize: 4
 inputtotalsize=200 outputTotalSize=72
 layer ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=5 numFilters=2 filterSize=3 outputSize=3 padZeros=0 biased=0 skip=0} }
 weightsize=36 biassize=0
 statefultimer v0.7
 forward try kernel 0
  ... not plausibly optimal, skipping
 forward try kernel 1
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 1: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 2
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, const int N) {
 8:     int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize;
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * gWorkgroupSize + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [outplane]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [imageid][filterid][row][col]
 26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB
 27: //                            eg 5 * 5 * 32 * 4 = 3.2KB => ok :-)
 28: //                           but 28 * 28 * 32 * 4 = 100KB => less good :-P
 29: void kernel forward_2_by_outplane(
 30:         const int batchSize,
 31:         global const float *images, global const float *filters,
 32:         global float *output,
 33:         local float *_inputPlane, local float *_filterCube) {
 34:     const int globalId = get_global_id(0);
 35: 
 36:     const int workgroupId = get_group_id(0);
 37:     const int workgroupSize = get_local_size(0);
 38:     const int outPlane = workgroupId;
 39: 
 40:     const int localId = get_local_id(0);
 41:     const int outputRow = localId / gOutputSize;
 42:     const int outputCol = localId % gOutputSize;
 43: 
 44:     #if gPadZeros == 1
 45:         const int minu = max(-gHalfFilterSize, -outputRow);
 46:         const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven;
 47:         const int minv = max(-gHalfFilterSize, -outputCol);
 48:         const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven;
 49:     #else
 50:         const int minu = -gHalfFilterSize;
 51:         const int maxu = gHalfFilterSize - gEven;
 52:         const int minv = -gHalfFilterSize;
 53:         const int maxv = gHalfFilterSize - gEven;
 54:     #endif
 55: 
 56:     {
 57:         const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 58:         copyLocal(_filterCube,
 59:                 filters + outPlane * filterCubeLength,
 60:                 filterCubeLength);
 61:     }
 62:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 63: 
 64:     for (int n = 0; n < batchSize; n++) {
 65:         float sum = 0;
 66:         for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 67:             barrier(CLK_LOCAL_MEM_FENCE);
 68:             copyLocal(_inputPlane,
 69:                        images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared,
 70:                        gInputSizeSquared);
 71:             barrier(CLK_LOCAL_MEM_FENCE);
 72:             int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 73:             if (localId < gOutputSizeSquared) {
 74:                 for (int u = minu; u <= maxu; u++) {
 75:                     int inputRow = outputRow + u;
 76:                     #if gPadZeros == 0
 77:                          inputRow += gHalfFilterSize;
 78:                     #endif
 79:                     int inputimagerowoffset = inputRow * gInputSize;
 80:                     int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 81:                     for (int v = minv; v <= maxv; v++) {
 82:                         int inputCol = outputCol + v;
 83:                         #if gPadZeros == 0
 84:                              inputCol += gHalfFilterSize;
 85:                         #endif
 86:                         sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 87:                     }
 88:                 }
 89:             }
 90:         }
 91:         // output are organized like [imageid][filterid][row][col]
 92:         int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 93:         if (localId < gOutputSizeSquared) {
 94:             output[resultIndex ] = sum;
 95:         }
 96:     }
 97: }
 98: #endif
 99: 
 100: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

 ForwardAuto: kernel 2: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, const int N) {
 8:     int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize;
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * gWorkgroupSize + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [outplane]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [imageid][filterid][row][col]
 26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB
 27: //                            eg 5 * 5 * 32 * 4 = 3.2KB => ok :-)
 28: //                           but 28 * 28 * 32 * 4 = 100KB => less good :-P
 29: void kernel forward_2_by_outplane(
 30:         const int batchSize,
 31:         global const float *images, global const float *filters,
 32:         global float *output,
 33:         local float *_inputPlane, local float *_filterCube) {
 34:     const int globalId = get_global_id(0);
 35: 
 36:     const int workgroupId = get_group_id(0);
 37:     const int workgroupSize = get_local_size(0);
 38:     const int outPlane = workgroupId;
 39: 
 40:     const int localId = get_local_id(0);
 41:     const int outputRow = localId / gOutputSize;
 42:     const int outputCol = localId % gOutputSize;
 43: 
 44:     #if gPadZeros == 1
 45:         const int minu = max(-gHalfFilterSize, -outputRow);
 46:         const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven;
 47:         const int minv = max(-gHalfFilterSize, -outputCol);
 48:         const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven;
 49:     #else
 50:         const int minu = -gHalfFilterSize;
 51:         const int maxu = gHalfFilterSize - gEven;
 52:         const int minv = -gHalfFilterSize;
 53:         const int maxv = gHalfFilterSize - gEven;
 54:     #endif
 55: 
 56:     {
 57:         const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 58:         copyLocal(_filterCube,
 59:                 filters + outPlane * filterCubeLength,
 60:                 filterCubeLength);
 61:     }
 62:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 63: 
 64:     for (int n = 0; n < batchSize; n++) {
 65:         float sum = 0;
 66:         for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 67:             barrier(CLK_LOCAL_MEM_FENCE);
 68:             copyLocal(_inputPlane,
 69:                        images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared,
 70:                        gInputSizeSquared);
 71:             barrier(CLK_LOCAL_MEM_FENCE);
 72:             int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 73:             if (localId < gOutputSizeSquared) {
 74:                 for (int u = minu; u <= maxu; u++) {
 75:                     int inputRow = outputRow + u;
 76:                     #if gPadZeros == 0
 77:                          inputRow += gHalfFilterSize;
 78:                     #endif
 79:                     int inputimagerowoffset = inputRow * gInputSize;
 80:                     int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 81:                     for (int v = minv; v <= maxv; v++) {
 82:                         int inputCol = outputCol + v;
 83:                         #if gPadZeros == 0
 84:                              inputCol += gHalfFilterSize;
 85:                         #endif
 86:                         sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 87:                     }
 88:                 }
 89:             }
 90:         }
 91:         // output are organized like [imageid][filterid][row][col]
 92:         int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 93:         if (localId < gOutputSizeSquared) {
 94:             output[resultIndex ] = sum;
 95:         }
 96:     }
 97: }
 98: #endif
 99: 
 100: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

   ... not valid
 forward try kernel 3
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept: each workgroup handles convolving one input example with one filtercube
 8: // and writing out one single output plane
 9: //
 10: // workgroup id organized like: [imageid][outplane]
 11: // local id organized like: [outrow][outcol]
 12: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 13: // number workgroups = 32
 14: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 16: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 17: // output are organized like [imageid][filterid][row][col]
 18: void kernel forward_3_by_n_outplane(const int batchSize,
 19:       global const float *images, global const float *filters,
 20:     global float *output,
 21:     local float *_upstreamImage, local float *_filterCube) {
 22:     const int globalId = get_global_id(0);
 23: 
 24:     const int workgroupId = get_group_id(0);
 25:     const int workgroupSize = get_local_size(0);
 26:     const int n = workgroupId / gNumFilters;
 27:     const int outPlane = workgroupId % gNumFilters;
 28: 
 29:     const int localId = get_local_id(0);
 30:     const int outputRow = localId / gOutputSize;
 31:     const int outputCol = localId % gOutputSize;
 32: 
 33:     const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize;
 34:     const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow  - gEven) : gHalfFilterSize - gEven;
 35:     const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize;
 36:     const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven;
 37: 
 38:     const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 39: 
 40:     const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 41:     const int filterCubeGlobalOffset = outPlane * filterCubeLength;
 42:     const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize;
 43:     for (int i = 0; i < numPixelsPerThread; i++) {
 44:         int thisOffset = localId + i * workgroupSize;
 45:         if (thisOffset < filterCubeLength) {
 46:             _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset];
 47:         }
 48:     }
 49:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 50: 
 51:     float sum = 0;
 52:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 53:         int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 54:         barrier(CLK_LOCAL_MEM_FENCE);
 55:         for (int i = 0; i < numUpstreamsPerThread; i++) {
 56:             int thisOffset = workgroupSize * i + localId;
 57:             if (thisOffset < gInputSizeSquared) {
 58:                 _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ];
 59:             }
 60:         }
 61:         barrier(CLK_LOCAL_MEM_FENCE);
 62:         int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 63:         for (int u = minu; u <= maxu; u++) {
 64:             int inputRow = outputRow + u;
 65:             #if gPadZeros == 0
 66:                 inputRow += gHalfFilterSize;
 67:             #endif
 68:             int inputimagerowoffset = inputRow * gInputSize;
 69:             int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 70:             for (int v = minv; v <= maxv; v++) {
 71:                 int inputCol = outputCol + v;
 72:                 #if gPadZeros == 0
 73:                     inputCol += gHalfFilterSize;
 74:                 #endif
 75:                 if (localId < gOutputSizeSquared) {
 76:                     sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 77:                 }
 78:             }
 79:         }
 80:     }
 81: 
 82:     // output are organized like [imageid][filterid][row][col]
 83:     int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 84:     if (localId < gOutputSizeSquared) {
 85:         output[resultIndex ] = sum;
 86:     }
 87: }
 88: 
 89: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 3: this instance cant be used: 
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
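The `-45` that `clCreateKernel` reports above is `CL_INVALID_PROGRAM_EXECUTABLE` in `CL/cl.h`: the program has no successfully built executable, which is consistent with the option-string error in the build log just before it. The sketch below maps the status codes that appear in this log to their names. `clStatusName` is a hypothetical helper for reading the log; it is not part of DeepCL or EasyCL.

```cpp
#include <string>

// Subset of OpenCL status codes, with the values defined in CL/cl.h.
// Only codes seen in this log are listed; anything else falls through.
// clStatusName is a hypothetical helper name, not an existing API.
inline std::string clStatusName(int code) {
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -45: return "CL_INVALID_PROGRAM_EXECUTABLE";
        default:  return "unknown (" + std::to_string(code) + ")";
    }
}
```

Calling `clStatusName(-45)` returns the symbolic name, which makes it clearer that the kernel creation failure is a consequence of the failed build rather than a separate problem.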
 forward try kernel 4
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, int N) {
 8:     int numLoops = (N + get_local_size(0) - 1) / get_local_size(0);
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * get_local_size(0) + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [n][filterid]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [n][filterid][outrow][outcol]
 26: // the pixels per thread thing... :
 27: // - we have one thread (~= cuda core) per output value,
 28: //   ie one thread for each combination of [outrow][outcol]
 29: // - however, the number of threads is typically limited on a gpu,
 30: //   eg to 512 (eg Intel HD), or 1024 (eg nVidia K520)
 31: // - so what happens if the number of output points is larger than
 32: //   the maximum workgroup size?
 33: // - then we have several possibilities really:
 34: //   - we can divide the image into blocks, and process each block
 35: //     separately.  This is probably a good option, but fair amount of
 36: //     work
 37: //   - we can get each thread to handle more than one output
 38: //     pixel, by looping
 39: //   - we can consider the output image in 1d, by putting the rows
 40: //     one after another, and assign each contiguous workgroup-size
 41: //     block to one workgroup
 42: //     => this is how this kernel works
 43: //     basically, it's a hack, so larger images actually run, without
 44: //     crashing, and we can probably improve it a lot :-)
 45: //
 46: // So, when outputSize * outputSize > workgroupSize, then
 47: // multiple workgroups will be created for each output plane
 48: // the number of such workgroups is given by: `gPixelsPerThread`
 49: // the id of our workgroup within such a set of workgroups is calculated
 50: // as `pixel`
 51: // effectiveLocalId is our local id if we had one enormous workgroup
 52: // containing the whole output image plane
 53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize,
 54:       global const float *images, global const float *filters,
 55:     global float *output,
 56:     local float *_inputPlane, local float *_filterPlane) {
 57:     #define globalId (get_global_id(0))
 58: 
 59:     #define localId (get_local_id(0))
 60:     #define workgroupId (get_group_id(0))
 61: //    const int workgroupSize = get_local_size(0);
 62:     const int effectiveWorkgroupId = workgroupId / gPixelsPerThread;
 63:     const int pixel = workgroupId % gPixelsPerThread;
 64:     const int effectiveLocalId = localId + pixel * gWorkgroupSize;
 65:     const int n = effectiveWorkgroupId / gNumFilters;
 66:     const int outPlane = effectiveWorkgroupId % gNumFilters;
 67: 
 68:     const int outputRow = effectiveLocalId / gOutputSize;
 69:     const int outputCol = effectiveLocalId % gOutputSize;
 70: 
 71:     float sum = 0;
 72:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 73:         barrier(CLK_LOCAL_MEM_FENCE);
 74:         copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared);
 75:         copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared);
 76:         barrier(CLK_LOCAL_MEM_FENCE);
 77: 
 78:         if (effectiveLocalId < gOutputSizeSquared) {
 79:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 80:                 // trying to reduce register pressure...
 81:                 #if gPadZeros == 1
 82:                     #define inputRow (outputRow + u)
 83:                 #else
 84:                     #define inputRow (outputRow + u + gHalfFilterSize)
 85:                 #endif
 86:                 int inputimagerowoffset = inputRow * gInputSize;
 87:                 int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 88:                 bool rowOk = inputRow >= 0 && inputRow < gInputSize;
 89:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 90:                     #if gPadZeros == 1
 91:                         #define inputCol (outputCol + v)
 92:                     #else
 93:                         #define inputCol (outputCol + v + gHalfFilterSize)
 94:                     #endif
 95:                     bool process = rowOk && inputCol >= 0 && inputCol < gInputSize;
 96:                     if (process) {
 97:                             sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ];
 98:                     }
 99:                 }
 100:             }
 101:         }
 102:     }
 103:     // output are organized like [imageid][filterid][row][col]
 104:     #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId)
 105:     if (effectiveLocalId < gOutputSizeSquared) {
 106:         output[resultIndex ] = sum;
 107:     }
 108: }
 109: #endif
 110: 
 111: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 4: this instance cant be used: 
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
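The comments in the `cl/forward4.cl` listing above describe how one output plane is split across several workgroups when it has more pixels than a workgroup has threads. As a rough host-side illustration, the sketch below repeats the index arithmetic from lines 62 to 69 of that listing in plain C++. `OutputCoord` and `mapWorkItem` are hypothetical names introduced here; the kernel itself takes these values from `-D` defines and `get_group_id` / `get_local_id`.

```cpp
#include <cassert>

// Where one work-item writes, following lines 62-69 of the listing.
// OutputCoord and mapWorkItem are illustrative names only.
struct OutputCoord {
    int n;          // image index within the batch
    int outPlane;   // filter id
    int outputRow;
    int outputCol;
    bool active;    // false when the work-item falls past the end of the plane
};

inline OutputCoord mapWorkItem(int workgroupId, int localId,
                               int workgroupSize, int pixelsPerThread,
                               int numFilters, int outputSize) {
    assert(workgroupSize > 0 && pixelsPerThread > 0);
    assert(numFilters > 0 && outputSize > 0);
    // Several consecutive workgroups share one [n][outPlane] pair.
    const int effectiveWorkgroupId = workgroupId / pixelsPerThread;
    // Position of this workgroup within that set.
    const int pixel = workgroupId % pixelsPerThread;
    // Local id as if one large workgroup covered the whole output plane.
    const int effectiveLocalId = localId + pixel * workgroupSize;
    return OutputCoord{
        effectiveWorkgroupId / numFilters,
        effectiveWorkgroupId % numFilters,
        effectiveLocalId / outputSize,
        effectiveLocalId % outputSize,
        effectiveLocalId < outputSize * outputSize};
}
```

With the values from the option string in this log (`gWorkgroupSize=32`, `gPixelsPerThread=1`, `gNumFilters=2`, `gOutputSize=3`), only local ids 0 to 8 of each 32-thread workgroup are active, since the output plane has 9 pixels.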
 forward try kernel 5
 ForwardAuto: kernel 5: this instance cant be used: For ForwardFc, filtersize and inputimagesize must be identical
   ... not valid
 forward try kernel 6
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept:
 8: // - load same input plane from each image
 9: // - hold filter plane for this input plane, for all filters
 10: // - reduce afterwards
 11: // local memory for one plane from each filter of 64c7 = 64 * 7 * 7 * 4 = 12.5KB
 12: // local memory for one single input plane = 19 * 19 * 4 = 1.4KB
 13: // => seems ok?
 14: // workgroupid: [inputPlaneId]
 15: // localid: [filterId][outRow] (if this is more than workgroupsize, we should reuse some threads...)
 16: // iterate over: [n][outCol]
 17: // output: [n][filterId][outRow][outCol][inputPlane]
 18: // need to later reduce output over: [inputPlane]
 19: void kernel forward_byinputplane(const int batchSize,
 20:       global const float *images, global const float *filters,
 21:     global float *output,
 22:     local float *_inputPlane, local float *_filterPlanes) {
 23: //    const int evenPadding = gFilterSize % 2 == 0 ? 1 : 0;
 24: 
 25:     const int globalId = get_global_id(0);
 26:     const int workgroupId = get_group_id(0);
 27:     const int workgroupSize = get_local_size(0);
 28:     const int localId = get_local_id(0);
 29: 
 30:     const int inputPlaneId = workgroupId;
 31:     const int numLoops = (gNumFilters * gOutputSize + workgroupSize - 1) / workgroupSize;
 32:     const int numFilterCopyLoops = (gFilterSizeSquared + gOutputSize - 1) / gOutputSize;
 33:     const int numImageCopyLoops = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 34:     for (int loop = 0; loop < numLoops; loop++) {
 35:         const int loopLocalId = localId + loop * workgroupSize;
 36:         const int filterId = loopLocalId / gOutputSize;
 37:         const int outRow = loopLocalId % gOutputSize;
 38: 
 39:         // copy down our filter, we have gOutputSize threads to do this
 40:         global float const *globalFilterPlane = filters +
 41:             (filterId * gNumInputPlanes + inputPlaneId) * gFilterSizeSquared;
 42:         local float *_localFilterPlane = _filterPlanes + filterId * gFilterSizeSquared;
 43:         barrier(CLK_LOCAL_MEM_FENCE);
 44:         for (int i = 0; i < numFilterCopyLoops; i++) {
 45:             const int offset = i * gOutputSize + outRow;
 46:             bool process = filterId < gNumFilters && offset < gFilterSizeSquared;
 47:             if (process) {
 48:                 _localFilterPlane[ offset ] = globalFilterPlane[ offset ];
 49:             }
 50:         }
 51:         // loop over n ...
 52:         for (int n = 0; n < batchSize; n++) {
 53:             // copy down our imageplane, we have workgroupSize threads to do this
 54:             barrier(CLK_LOCAL_MEM_FENCE);
 55:             global float const *globalImagePlane = images +
 56:                 (n * gNumInputPlanes + inputPlaneId) * gInputSizeSquared;
 57:             for (int i = 0; i< numImageCopyLoops; i++) {
 58:                 const int offset = i * workgroupSize + localId;
 59:                 if (offset < gInputSizeSquared) {
 60:                     _inputPlane[ offset ] = globalImagePlane[ offset ];
 61:                 }
 62:             }
 63:             barrier(CLK_LOCAL_MEM_FENCE);
 64:             // calc output for each [outrow][outcol]
 65:             bool filterPlaneOk = filterId < gNumFilters;
 66:             for (int outCol = 0; outCol < gOutputSize; outCol++) {
 67:                 float sum = 0;
 68:                 for (int filterRow = 0; filterRow < gFilterSize; filterRow++) {
 69:                     int inRow = outRow + filterRow;
 70:                     #if gPadZeros == 1
 71:                         inRow -= gHalfFilterSize;
 72:                     #endif
 73:                     bool rowOk = filterPlaneOk && inRow >= 0 && inRow < gInputSize;
 74:                     for (int filterCol = 0; filterCol < gFilterSize; filterCol++) {
 75:                         int inCol = outCol + filterCol;
 76:                         #if gPadZeros == 1
 77:                             inCol -= gHalfFilterSize;
 78:                         #endif
 79:                         bool process = rowOk && inCol >= 0 && inCol < gInputSize;
 80:                         if (process) {
 81:                             float imageValue = _inputPlane[ inRow * gInputSize + inCol ];
 82:                             float filterValue = _localFilterPlane[ filterRow * gFilterSize + filterCol ];
 83:                             sum += imageValue * filterValue;
 84:                         }
 85:                     }
 86:                 }
 87:                 if (filterId < gNumFilters) {
 88:                     // [n][filterId][outRow][outCol][inputPlane]
 89:                     int resultIndex = (( (n
 90:                         * gNumFilters + filterId)
 91:                         * gOutputSize + outRow)
 92:                         * gOutputSize + outCol)
 93:                         * gNumInputPlanes + inputPlaneId;
 94:                     output[resultIndex] = sum;
 95:                     //if (globalId == 2) output[0] = resultIndex;
 96: //                    output[resultIndex] = outRow;
 97:                 }
 98: //                output[localId] = _localFilterPlane[localId];
 99:             }
 100:         }
 101:     }
 102: }
 103: 
 104: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 6: this instance cant be used: 
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
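The header comments of `forward_byinputplane` state that the kernel writes its output as `[n][filterId][outRow][outCol][inputPlane]` and that the result still has to be reduced over `[inputPlane]` afterwards. That later step is not shown in this log. The following is only a host-side reference sketch of what such a reduction computes, assuming the layout given in the comments; `reduceOverInputPlanes` is a hypothetical name and this is not DeepCL's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// perPlane layout: [n][filterId][outRow][outCol][inputPlane]
// result layout:   [n][filterId][outRow][outCol]
// Because inputPlane is the innermost dimension, the partial sums for one
// output value sit next to each other in memory.
inline std::vector<float> reduceOverInputPlanes(const std::vector<float>& perPlane,
                                                int numInputPlanes) {
    assert(numInputPlanes > 0);
    const std::size_t planes = static_cast<std::size_t>(numInputPlanes);
    assert(perPlane.size() % planes == 0);
    const std::size_t numOutputs = perPlane.size() / planes;
    std::vector<float> reduced(numOutputs, 0.0f);
    for (std::size_t i = 0; i < numOutputs; ++i) {
        float sum = 0.0f;
        for (std::size_t p = 0; p < planes; ++p) {
            sum += perPlane[i * planes + p];
        }
        reduced[i] = sum;
    }
    return reduced;
}
```

For the configuration in this log (`gNumInputPlanes=2`), each final output value would be the sum of two adjacent partial results.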
 forward try kernel 7
   ... seems valid
 ForwardIm2Col.cl build log: 
 (19:0) : error : invalid global address space qualifier specified for parameter type
 (19:0) : error : syntax error at 'const'

 kernel build error:

 kernel source:
 1: // from SpatialConvolutionMM.cu:
 2: 
 3: // CL: grid stride looping
 4: #define CL_KERNEL_LOOP(i, n)                        \
 5:   for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \
 6:       i < (n);                                       \
 7:       i += get_local_size(0) * get_num_groups(0))
 8: 
 9: //#define gPadding 0
 10: //#define gStride 1
 11: //#define gColSize 3
 12: //#define gFilterSize 3
 13: //#define gSize 5
 14: 
 15: // Kernel for fast unfold+copy
 16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu)
 17: kernel void im2col(
 18:     const int n,
 19:     global float const * im_data, int im_offset,
 20:     global float* data_col) {
 21:   global const float *data_im = im_data + im_offset;
 22: 
 23:   CL_KERNEL_LOOP(index, n) {
 24:     int w_out = index % 3;
 25:     index /= 3;
 26:     int h_out = index % 3;
 27:     int channel_in = index / 3;
 28:     int channel_out = channel_in * 3 * 3;
 29:     int h_in = h_out * 1 - 0;
 30:     int w_in = w_out * 1 - 0;
 31:     data_col += (channel_out * 3 + h_out) * 3 + w_out;
 32:     data_im += (channel_in * 5 + h_in) * 5 + w_in;
 33:     for (int i = 0; i < 3; ++i) {
 34:       for (int j = 0; j < 3; ++j) {
 35:         int h = h_in + i;
 36:         int w = w_in + j;
 37:         *data_col = (h >= 0 && w >= 0 && h < 5 && w < 5) ?
 38:           data_im[i * 5 + j] : 0;
 39:         data_col += 3 * 3;
 40:       }
 41:     }
 42:   }
 43: }
 44: 
 45: kernel void col2im(
 46:     const int n,
 47:     global float const *data_col,
 48:     global float* im_data, int im_offset) {
 49:   global float *data_im = im_data + im_offset;
 50: 
 51:   for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) {
 52:     float val = 0;
 53:     int w = index % 5 + 0;
 54:     int h = (index / 5) % 5 + 0;
 55:     int c = index / (5 * 5);
 56:     // compute the start and end of the output
 57:     int w_col_start = (w < 3) ? 0 : (w - 3) / 1 + 1;
 58:     int w_col_end = min(w / 1 + 1, 3);
 59:     int h_col_start = (h < 3) ? 0 : (h - 3) / 1 + 1;
 60:     int h_col_end = min(h / 1 + 1, 3);
 61: 
 62:     int offset = (c * 3 * 3 + h * 3 + w) * 3 * 3;
 63:     int coeff_h_col = (1 - 1 * 3 * 3) * 3;
 64:     int coeff_w_col = (1 - 1 * 3 * 3);
 65:     for (int h_col = h_col_start; h_col < h_col_end; ++h_col) {
 66:       for (int w_col = w_col_start; w_col < w_col_end; ++w_col) {
 67:         val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col];
 68:       }
 69:     }
 70:     data_im[index] = val;
 71:   }
 72: }
 73: 
 74: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 ForwardIm2Col.cl build log: 
 (19:0) : error : invalid global address space qualifier specified for parameter type
 (19:0) : error : syntax error at 'const'
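
Note on the `im2col` kernel dumped above: it never builds on this device, so it helps to have a host-side reference for what it is meant to compute. The sketch below is a plain Python transcription of the index arithmetic in the dump (5x5 input, 3x3 filter, stride 1, padding 0, 3x3 output). The function name and its keyword parameters are chosen here for illustration. Unlike the kernel, it recomputes the source and destination offsets for every index instead of advancing pointers.

```python
def im2col(im, channels, size=5, ksize=3, stride=1, pad=0):
    """Unfold a [channel][row][col] image into [channel*k*k][out_row][out_col]."""
    out = (size + 2 * pad - ksize) // stride + 1  # 3 for the values in the log
    col = [0.0] * (channels * ksize * ksize * out * out)
    for index in range(channels * out * out):
        w_out = index % out
        h_out = (index // out) % out
        c_in = index // (out * out)
        h_in = h_out * stride - pad
        w_in = w_out * stride - pad
        # Same start offset as the kernel: (channel_out * out + h_out) * out + w_out
        dst = (c_in * ksize * ksize * out + h_out) * out + w_out
        for i in range(ksize):
            for j in range(ksize):
                h, w = h_in + i, w_in + j
                inside = 0 <= h < size and 0 <= w < size
                col[dst] = im[(c_in * size + h) * size + w] if inside else 0.0
                dst += out * out  # next filter tap lives one output plane further
    return col


if __name__ == "__main__":
    image = [float(v) for v in range(25)]  # one 5x5 plane holding 0..24
    columns = im2col(image, channels=1)
    print(len(columns), columns[:9])
```

Comparing this output against the device buffer is one way to validate the kernel once it compiles.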

 ForwardAuto: kernel 7 this instance cant be used: 
   forward kernel 0: cannot be used
   forward kernel 1: cannot be used
   forward kernel 2: cannot be used
   forward kernel 3: cannot be used
   forward kernel 4: cannot be used
   forward kernel 5: cannot be used
   forward kernel 6: cannot be used
   forward kernel 7: cannot be used
 clblas teardown
 unknown file: Failure
 C++ exception with description "No valid forward implementations found" thrown in the test body.
 [  FAILED  ] testupdateweights.conv1 (147 ms)
 [ RUN      ] testupdateweights.conv1z
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 layer 0:InputLayer{ outputPlanes=2 outputSize=3 }
 layer 1:ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=3 numFilters=2 filterSize=3 outputSize=3 padZeros=1 biased=0 skip=0} }
 layer 2:SquareLossLayer{}

 layer 0:InputLayer{ outputPlanes=2 outputSize=3 }
 layer 1:ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=3 numFilters=2 filterSize=3 outputSize=3 padZeros=1 biased=0 skip=0} }
 layer 2:SquareLossLayer{}

 batchSize: 4
 inputtotalsize=72 outputTotalSize=72
 layer ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=3 numFilters=2 filterSize=3 outputSize=3 padZeros=1 biased=0 skip=0} }
 weightsize=36 biassize=0
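
Note on the sizes printed above: they follow from the layer dimensions. The sketch below reproduces them. The output-size rule is taken from the kernel comments and the logged dimensions: without padding, `inputSize - filterSize + 1`; with `padZeros=1`, the same size for odd filters and one larger for even filters. The bias size for `biased=1` (one value per filter) is an assumption, since this log only shows `biased=0`. The function names are illustrative.

```python
def output_size(input_size, filter_size, pad_zeros):
    """Output edge length for a square convolution, per the kernel comments."""
    even = 1 if filter_size % 2 == 0 else 0
    if pad_zeros:
        return input_size + even  # odd filter: unchanged; even filter: grows by 1
    return input_size - filter_size + 1


def layer_sizes(batch, in_planes, in_size, num_filters, filter_size,
                pad_zeros, biased):
    """Element counts for the input, output, weight and bias buffers."""
    out = output_size(in_size, filter_size, pad_zeros)
    return {
        "inputTotalSize": batch * in_planes * in_size * in_size,
        "outputTotalSize": batch * num_filters * out * out,
        "weightSize": num_filters * in_planes * filter_size * filter_size,
        # Assumption: one bias per filter when biased; the log only shows biased=0.
        "biasSize": num_filters if biased else 0,
    }


if __name__ == "__main__":
    # Dimensions from the conv1z layer in the log.
    print(layer_sizes(batch=4, in_planes=2, in_size=3, num_filters=2,
                      filter_size=3, pad_zeros=True, biased=False))
```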
 forward try kernel 0
  ... not plausibly optimal, skipping
 forward try kernel 1
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0"
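
Note on `convolve_imagecubes_float2` dumped above: each work item produces one output value by summing over all input planes and filter taps, skipping taps that fall outside the image. A host-side reference is sketched below. It assumes `gHalfFilterSize` is `filterSize // 2` and `gEven` is 1 for even filter sizes, which matches the option strings in this log (`gFilterSize=3`, `gHalfFilterSize=1`, `gEven=0`). The function name `forward_reference` is illustrative.

```python
def forward_reference(inputs, filters, num_examples, in_planes, in_size,
                      num_filters, filter_size, pad_zeros):
    """Direct convolution with the layouts used by forward1.cl.

    inputs:  [example][plane][row][col], flattened
    filters: [filter][plane][row][col], flattened
    returns: [example][filter][row][col], flattened
    """
    half = filter_size // 2
    even = 1 if filter_size % 2 == 0 else 0
    out_size = in_size + even if pad_zeros else in_size - filter_size + 1
    shift = 0 if pad_zeros else half  # unpadded case offsets into the image
    output = []
    for n in range(num_examples):
        for f in range(num_filters):
            for row in range(out_size):
                for col in range(out_size):
                    total = 0.0
                    for p in range(in_planes):
                        for u in range(-half, half - even + 1):
                            for v in range(-half, half - even + 1):
                                ir, ic = row + u + shift, col + v + shift
                                if 0 <= ir < in_size and 0 <= ic < in_size:
                                    x = inputs[((n * in_planes + p) * in_size + ir) * in_size + ic]
                                    w = filters[((f * in_planes + p) * filter_size + u + half)
                                                * filter_size + v + half]
                                    total += x * w
                    output.append(total)
    return output


if __name__ == "__main__":
    ones_image = [1.0] * 9   # one 3x3 plane
    ones_filter = [1.0] * 9  # one 3x3 filter
    print(forward_reference(ones_image, ones_filter, 1, 1, 3, 1, 3, True))
```

With an all-ones image and filter, each output value is simply the number of taps that land inside the image, which makes the padding behaviour easy to check by eye.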

 ForwardAuto: kernel 1: this instance cant be used: 
   ... not valid
 forward try kernel 2
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, const int N) {
 8:     int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize;
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * gWorkgroupSize + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [outplane]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [imageid][filterid][row][col]
 26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB
 27: //                            eg 5 * 5 * 32 * 4 = 3.2KB => ok :-)
 28: //                           but 28 * 28 * 32 * 4 = 100KB => less good :-P
 29: void kernel forward_2_by_outplane(
 30:         const int batchSize,
 31:         global const float *images, global const float *filters,
 32:         global float *output,
 33:         local float *_inputPlane, local float *_filterCube) {
 34:     const int globalId = get_global_id(0);
 35: 
 36:     const int workgroupId = get_group_id(0);
 37:     const int workgroupSize = get_local_size(0);
 38:     const int outPlane = workgroupId;
 39: 
 40:     const int localId = get_local_id(0);
 41:     const int outputRow = localId / gOutputSize;
 42:     const int outputCol = localId % gOutputSize;
 43: 
 44:     #if gPadZeros == 1
 45:         const int minu = max(-gHalfFilterSize, -outputRow);
 46:         const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven;
 47:         const int minv = max(-gHalfFilterSize, -outputCol);
 48:         const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven;
 49:     #else
 50:         const int minu = -gHalfFilterSize;
 51:         const int maxu = gHalfFilterSize - gEven;
 52:         const int minv = -gHalfFilterSize;
 53:         const int maxv = gHalfFilterSize - gEven;
 54:     #endif
 55: 
 56:     {
 57:         const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 58:         copyLocal(_filterCube,
 59:                 filters + outPlane * filterCubeLength,
 60:                 filterCubeLength);
 61:     }
 62:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 63: 
 64:     for (int n = 0; n < batchSize; n++) {
 65:         float sum = 0;
 66:         for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 67:             barrier(CLK_LOCAL_MEM_FENCE);
 68:             copyLocal(_inputPlane,
 69:                        images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared,
 70:                        gInputSizeSquared);
 71:             barrier(CLK_LOCAL_MEM_FENCE);
 72:             int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 73:             if (localId < gOutputSizeSquared) {
 74:                 for (int u = minu; u <= maxu; u++) {
 75:                     int inputRow = outputRow + u;
 76:                     #if gPadZeros == 0
 77:                          inputRow += gHalfFilterSize;
 78:                     #endif
 79:                     int inputimagerowoffset = inputRow * gInputSize;
 80:                     int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 81:                     for (int v = minv; v <= maxv; v++) {
 82:                         int inputCol = outputCol + v;
 83:                         #if gPadZeros == 0
 84:                              inputCol += gHalfFilterSize;
 85:                         #endif
 86:                         sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 87:                     }
 88:                 }
 89:             }
 90:         }
 91:         // output are organized like [imageid][filterid][row][col]
 92:         int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 93:         if (localId < gOutputSizeSquared) {
 94:             output[resultIndex ] = sum;
 95:         }
 96:     }
 97: }
 98: #endif
 99: 
 100: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"
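
Note on the local-memory budget in the `forward_2_by_outplane` comments above: the kernel keeps one filter cube and one input plane in local memory, so their combined size must fit on the device. The sketch below redoes that arithmetic. `local_mem_bytes` stands for the value a host program would read from `CL_DEVICE_LOCAL_MEM_SIZE`. The 32 KB figure in the example is a placeholder, not a measured value for the Vivante device, and the function names are illustrative.

```python
BYTES_PER_FLOAT = 4


def filter_cube_bytes(filter_size, input_planes):
    """Local memory needed for one filter cube (all input planes of one filter)."""
    return filter_size * filter_size * input_planes * BYTES_PER_FLOAT


def input_plane_bytes(input_size):
    """Local memory needed for one input plane."""
    return input_size * input_size * BYTES_PER_FLOAT


def fits_in_local_memory(filter_size, input_planes, input_size, local_mem_bytes):
    """True if one filter cube plus one input plane fit in local memory."""
    needed = filter_cube_bytes(filter_size, input_planes) + input_plane_bytes(input_size)
    return needed <= local_mem_bytes


if __name__ == "__main__":
    print(filter_cube_bytes(5, 32))    # the 3.2KB case from the kernel comment
    print(filter_cube_bytes(28, 32))   # the roughly 100KB case from the kernel comment
    # Dimensions from this test run; 32 * 1024 is a placeholder device limit.
    print(fits_in_local_memory(3, 2, 3, local_mem_bytes=32 * 1024))
```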

 ForwardAuto: kernel 2: this instance cant be used: 
   ... not valid
 forward try kernel 3
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept: each workgroup handles convolving one input example with one filtercube
 8: // and writing out one single output plane
 9: //
 10: // workgroup id organized like: [imageid][outplane]
 11: // local id organized like: [outrow][outcol]
 12: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 13: // number workgroups = 32
 14: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 16: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 17: // output are organized like [imageid][filterid][row][col]
 18: void kernel forward_3_by_n_outplane(const int batchSize,
 19:       global const float *images, global const float *filters,
 20:     global float *output,
 21:     local float *_upstreamImage, local float *_filterCube) {
 22:     const int globalId = get_global_id(0);
 23: 
 24:     const int workgroupId = get_group_id(0);
 25:     const int workgroupSize = get_local_size(0);
 26:     const int n = workgroupId / gNumFilters;
 27:     const int outPlane = workgroupId % gNumFilters;
 28: 
 29:     const int localId = get_local_id(0);
 30:     const int outputRow = localId / gOutputSize;
 31:     const int outputCol = localId % gOutputSize;
 32: 
 33:     const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize;
 34:     const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow  - gEven) : gHalfFilterSize - gEven;
 35:     const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize;
 36:     const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven;
 37: 
 38:     const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 39: 
 40:     const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 41:     const int filterCubeGlobalOffset = outPlane * filterCubeLength;
 42:     const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize;
 43:     for (int i = 0; i < numPixelsPerThread; i++) {
 44:         int thisOffset = localId + i * workgroupSize;
 45:         if (thisOffset < filterCubeLength) {
 46:             _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset];
 47:         }
 48:     }
 49:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 50: 
 51:     float sum = 0;
 52:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 53:         int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 54:         barrier(CLK_LOCAL_MEM_FENCE);
 55:         for (int i = 0; i < numUpstreamsPerThread; i++) {
 56:             int thisOffset = workgroupSize * i + localId;
 57:             if (thisOffset < gInputSizeSquared) {
 58:                 _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ];
 59:             }
 60:         }
 61:         barrier(CLK_LOCAL_MEM_FENCE);
 62:         int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 63:         for (int u = minu; u <= maxu; u++) {
 64:             int inputRow = outputRow + u;
 65:             #if gPadZeros == 0
 66:                 inputRow += gHalfFilterSize;
 67:             #endif
 68:             int inputimagerowoffset = inputRow * gInputSize;
 69:             int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 70:             for (int v = minv; v <= maxv; v++) {
 71:                 int inputCol = outputCol + v;
 72:                 #if gPadZeros == 0
 73:                     inputCol += gHalfFilterSize;
 74:                 #endif
 75:                 if (localId < gOutputSizeSquared) {
 76:                     sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 77:                 }
 78:             }
 79:         }
 80:     }
 81: 
 82:     // output are organized like [imageid][filterid][row][col]
 83:     int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 84:     if (localId < gOutputSizeSquared) {
 85:         output[resultIndex ] = sum;
 86:     }
 87: }
 88: 
 89: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 3: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept: each workgroup handles convolving one input example with one filtercube
 8: // and writing out one single output plane
 9: //
 10: // workgroup id organized like: [imageid][outplane]
 11: // local id organized like: [outrow][outcol]
 12: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 13: // number workgroups = 32
 14: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 16: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 17: // output are organized like [imageid][filterid][row][col]
 18: void kernel forward_3_by_n_outplane(const int batchSize,
 19:       global const float *images, global const float *filters,
 20:     global float *output,
 21:     local float *_upstreamImage, local float *_filterCube) {
 22:     const int globalId = get_global_id(0);
 23: 
 24:     const int workgroupId = get_group_id(0);
 25:     const int workgroupSize = get_local_size(0);
 26:     const int n = workgroupId / gNumFilters;
 27:     const int outPlane = workgroupId % gNumFilters;
 28: 
 29:     const int localId = get_local_id(0);
 30:     const int outputRow = localId / gOutputSize;
 31:     const int outputCol = localId % gOutputSize;
 32: 
 33:     const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize;
 34:     const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow  - gEven) : gHalfFilterSize - gEven;
 35:     const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize;
 36:     const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven;
 37: 
 38:     const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 39: 
 40:     const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 41:     const int filterCubeGlobalOffset = outPlane * filterCubeLength;
 42:     const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize;
 43:     for (int i = 0; i < numPixelsPerThread; i++) {
 44:         int thisOffset = localId + i * workgroupSize;
 45:         if (thisOffset < filterCubeLength) {
 46:             _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset];
 47:         }
 48:     }
 49:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 50: 
 51:     float sum = 0;
 52:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 53:         int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 54:         barrier(CLK_LOCAL_MEM_FENCE);
 55:         for (int i = 0; i < numUpstreamsPerThread; i++) {
 56:             int thisOffset = workgroupSize * i + localId;
 57:             if (thisOffset < gInputSizeSquared) {
 58:                 _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ];
 59:             }
 60:         }
 61:         barrier(CLK_LOCAL_MEM_FENCE);
 62:         int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 63:         for (int u = minu; u <= maxu; u++) {
 64:             int inputRow = outputRow + u;
 65:             #if gPadZeros == 0
 66:                 inputRow += gHalfFilterSize;
 67:             #endif
 68:             int inputimagerowoffset = inputRow * gInputSize;
 69:             int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 70:             for (int v = minv; v <= maxv; v++) {
 71:                 int inputCol = outputCol + v;
 72:                 #if gPadZeros == 0
 73:                     inputCol += gHalfFilterSize;
 74:                 #endif
 75:                 if (localId < gOutputSizeSquared) {
 76:                     sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 77:                 }
 78:             }
 79:         }
 80:     }
 81: 
 82:     // output are organized like [imageid][filterid][row][col]
 83:     int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 84:     if (localId < gOutputSizeSquared) {
 85:         output[resultIndex ] = sum;
 86:     }
 87: }
 88: 
 89: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 4
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, int N) {
 8:     int numLoops = (N + get_local_size(0) - 1) / get_local_size(0);
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * get_local_size(0) + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [n][filterid]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [n][filterid][outrow][outcol]
 26: // the pixels per thread thing... :
 27: // - we have one thread (~= cuda core) per output value,
 28: //   ie one thread for each combination of [outrow][outcol]
 29: // - however, the number of threads is typically limited on a gpu,
 30: //   eg to 512 (eg Intel HD), or 1024 (eg nVidia K520)
 31: // - so what happens if the number of output points is larger than
 32: //   the maximum workgroup size?
 33: // - then we have several possibilities really:
 34: //   - we can divide the image into blocks, and process each block
 35: //     separately.  This is probably a good option, but fair amount of
 36: //     work
 37: //   - we can get each thread to handle more than one output
 38: //     pixel, by looping
 39: //   - we can consider the output image in 1d, by putting the rows
 40: //     one after another, and assign each contiguous workgroup-size
 41: //     block to one workgroup
 42: //     => this is how this kernel works
 43: //     basically, it's a hack, so larger images actually run, without
 44: //     crashing, and we can probably improve it a lot :-)
 45: //
 46: // So, when outputSize * outputSize > workgroupSize, then
 47: // multiple workgroups will be created for each output plane
 48: // the number of such workgroups is given by: `gPixelsPerThread`
 49: // the id of our workgroup within such a set of workgroups is calculated
 50: // as `pixel`
 51: // effectiveLocalId is our local id if we had one enormous workgroup
 52: // containing the whole output image plane
 53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize,
 54:       global const float *images, global const float *filters,
 55:     global float *output,
 56:     local float *_inputPlane, local float *_filterPlane) {
 57:     #define globalId (get_global_id(0))
 58: 
 59:     #define localId (get_local_id(0))
 60:     #define workgroupId (get_group_id(0))
 61: //    const int workgroupSize = get_local_size(0);
 62:     const int effectiveWorkgroupId = workgroupId / gPixelsPerThread;
 63:     const int pixel = workgroupId % gPixelsPerThread;
 64:     const int effectiveLocalId = localId + pixel * gWorkgroupSize;
 65:     const int n = effectiveWorkgroupId / gNumFilters;
 66:     const int outPlane = effectiveWorkgroupId % gNumFilters;
 67: 
 68:     const int outputRow = effectiveLocalId / gOutputSize;
 69:     const int outputCol = effectiveLocalId % gOutputSize;
 70: 
 71:     float sum = 0;
 72:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 73:         barrier(CLK_LOCAL_MEM_FENCE);
 74:         copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared);
 75:         copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared);
 76:         barrier(CLK_LOCAL_MEM_FENCE);
 77: 
 78:         if (effectiveLocalId < gOutputSizeSquared) {
 79:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 80:                 // trying to reduce register pressure...
 81:                 #if gPadZeros == 1
 82:                     #define inputRow (outputRow + u)
 83:                 #else
 84:                     #define inputRow (outputRow + u + gHalfFilterSize)
 85:                 #endif
 86:                 int inputimagerowoffset = inputRow * gInputSize;
 87:                 int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 88:                 bool rowOk = inputRow >= 0 && inputRow < gInputSize;
 89:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 90:                     #if gPadZeros == 1
 91:                         #define inputCol (outputCol + v)
 92:                     #else
 93:                         #define inputCol (outputCol + v + gHalfFilterSize)
 94:                     #endif
 95:                     bool process = rowOk && inputCol >= 0 && inputCol < gInputSize;
 96:                     if (process) {
 97:                             sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ];
 98:                     }
 99:                 }
 100:             }
 101:         }
 102:     }
 103:     // output are organized like [imageid][filterid][row][col]
 104:     #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId)
 105:     if (effectiveLocalId < gOutputSizeSquared) {
 106:         output[resultIndex ] = sum;
 107:     }
 108: }
 109: #endif
 110: 
 111: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 4: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, int N) {
 8:     int numLoops = (N + get_local_size(0) - 1) / get_local_size(0);
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * get_local_size(0) + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [n][filterid]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [n][filterid][outrow][outcol]
 26: // the pixels per thread thing... :
 27: // - we have one thread (~= cuda core) per output value,
 28: //   ie one thread for each combination of [outrow][outcol]
 29: // - however, the number of threads is typically limited on a gpu,
 30: //   eg to 512 (eg Intel HD), or 1024 (eg nVidia K520)
 31: // - so what happens if the number of output points is larger than
 32: //   the maximum workgroup size?
 33: // - then we have several possibilities really:
 34: //   - we can divide the image into blocks, and process each block
 35: //     separately.  This is probably a good option, but fair amount of
 36: //     work
 37: //   - we can get each thread to handle more than one output
 38: //     pixel, by looping
 39: //   - we can consider the output image in 1d, by putting the rows
 40: //     one after another, and assign each contiguous workgroup-size
 41: //     block to one workgroup
 42: //     => this is how this kernel works
 43: //     basically, it's a hack, so larger images actually run, without
 44: //     crashing, and we can probably improve it a lot :-)
 45: //
 46: // So, when outputSize * outputSize > workgroupSize, then
 47: // multiple workgroups will be created for each output plane
 48: // the number of such workgroups is given by: `gPixelsPerThread`
 49: // the id of our workgroup within such a set of workgroups is calculated
 50: // as `pixel`
 51: // effectiveLocalId is our local id if we had one enormous workgroup
 52: // containing the whole output image plane
 53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize,
 54:       global const float *images, global const float *filters,
 55:     global float *output,
 56:     local float *_inputPlane, local float *_filterPlane) {
 57:     #define globalId (get_global_id(0))
 58: 
 59:     #define localId (get_local_id(0))
 60:     #define workgroupId (get_group_id(0))
 61: //    const int workgroupSize = get_local_size(0);
 62:     const int effectiveWorkgroupId = workgroupId / gPixelsPerThread;
 63:     const int pixel = workgroupId % gPixelsPerThread;
 64:     const int effectiveLocalId = localId + pixel * gWorkgroupSize;
 65:     const int n = effectiveWorkgroupId / gNumFilters;
 66:     const int outPlane = effectiveWorkgroupId % gNumFilters;
 67: 
 68:     const int outputRow = effectiveLocalId / gOutputSize;
 69:     const int outputCol = effectiveLocalId % gOutputSize;
 70: 
 71:     float sum = 0;
 72:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 73:         barrier(CLK_LOCAL_MEM_FENCE);
 74:         copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared);
 75:         copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared);
 76:         barrier(CLK_LOCAL_MEM_FENCE);
 77: 
 78:         if (effectiveLocalId < gOutputSizeSquared) {
 79:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 80:                 // trying to reduce register pressure...
 81:                 #if gPadZeros == 1
 82:                     #define inputRow (outputRow + u)
 83:                 #else
 84:                     #define inputRow (outputRow + u + gHalfFilterSize)
 85:                 #endif
 86:                 int inputimagerowoffset = inputRow * gInputSize;
 87:                 int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 88:                 bool rowOk = inputRow >= 0 && inputRow < gInputSize;
 89:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 90:                     #if gPadZeros == 1
 91:                         #define inputCol (outputCol + v)
 92:                     #else
 93:                         #define inputCol (outputCol + v + gHalfFilterSize)
 94:                     #endif
 95:                     bool process = rowOk && inputCol >= 0 && inputCol < gInputSize;
 96:                     if (process) {
 97:                             sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ];
 98:                     }
 99:                 }
 100:             }
 101:         }
 102:     }
 103:     // output are organized like [imageid][filterid][row][col]
 104:     #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId)
 105:     if (effectiveLocalId < gOutputSizeSquared) {
 106:         output[resultIndex ] = sum;
 107:     }
 108: }
 109: #endif
 110: 
 111: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 5
 ForwardAuto: kernel 5: this instance cant be used: For ForwardFc, padzeros must be disabled
   ... not valid
 forward try kernel 6
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept:
 8: // - load same input plane from each image
 9: // - hold filter plane for this input plane, for all filters
 10: // - reduce afterwards
 11: // local memory for one plane from each filter of 64c7 = 64 * 7 * 7 * 4 = 12.5KB
 12: // local memory for one single input plane = 19 * 19 * 4 = 1.4KB
 13: // => seems ok?
 14: // workgroupid: [inputPlaneId]
 15: // localid: [filterId][outRow] (if this is more than workgroupsize, we should reuse some threads...)
 16: // iterate over: [n][outCol]
 17: // output: [n][filterId][outRow][outCol][inputPlane]
 18: // need to later reduce output over: [inputPlane]
 19: void kernel forward_byinputplane(const int batchSize,
 20:       global const float *images, global const float *filters,
 21:     global float *output,
 22:     local float *_inputPlane, local float *_filterPlanes) {
 23: //    const int evenPadding = gFilterSize % 2 == 0 ? 1 : 0;
 24: 
 25:     const int globalId = get_global_id(0);
 26:     const int workgroupId = get_group_id(0);
 27:     const int workgroupSize = get_local_size(0);
 28:     const int localId = get_local_id(0);
 29: 
 30:     const int inputPlaneId = workgroupId;
 31:     const int numLoops = (gNumFilters * gOutputSize + workgroupSize - 1) / workgroupSize;
 32:     const int numFilterCopyLoops = (gFilterSizeSquared + gOutputSize - 1) / gOutputSize;
 33:     const int numImageCopyLoops = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 34:     for (int loop = 0; loop < numLoops; loop++) {
 35:         const int loopLocalId = localId + loop * workgroupSize;
 36:         const int filterId = loopLocalId / gOutputSize;
 37:         const int outRow = loopLocalId % gOutputSize;
 38: 
 39:         // copy down our filter, we have gOutputSize threads to do this
 40:         global float const *globalFilterPlane = filters +
 41:             (filterId * gNumInputPlanes + inputPlaneId) * gFilterSizeSquared;
 42:         local float *_localFilterPlane = _filterPlanes + filterId * gFilterSizeSquared;
 43:         barrier(CLK_LOCAL_MEM_FENCE);
 44:         for (int i = 0; i < numFilterCopyLoops; i++) {
 45:             const int offset = i * gOutputSize + outRow;
 46:             bool process = filterId < gNumFilters && offset < gFilterSizeSquared;
 47:             if (process) {
 48:                 _localFilterPlane[ offset ] = globalFilterPlane[ offset ];
 49:             }
 50:         }
 51:         // loop over n ...
 52:         for (int n = 0; n < batchSize; n++) {
 53:             // copy down our imageplane, we have workgroupSize threads to do this
 54:             barrier(CLK_LOCAL_MEM_FENCE);
 55:             global float const *globalImagePlane = images +
 56:                 (n * gNumInputPlanes + inputPlaneId) * gInputSizeSquared;
 57:             for (int i = 0; i< numImageCopyLoops; i++) {
 58:                 const int offset = i * workgroupSize + localId;
 59:                 if (offset < gInputSizeSquared) {
 60:                     _inputPlane[ offset ] = globalImagePlane[ offset ];
 61:                 }
 62:             }
 63:             barrier(CLK_LOCAL_MEM_FENCE);
 64:             // calc output for each [outrow][outcol]
 65:             bool filterPlaneOk = filterId < gNumFilters;
 66:             for (int outCol = 0; outCol < gOutputSize; outCol++) {
 67:                 float sum = 0;
 68:                 for (int filterRow = 0; filterRow < gFilterSize; filterRow++) {
 69:                     int inRow = outRow + filterRow;
 70:                     #if gPadZeros == 1
 71:                         inRow -= gHalfFilterSize;
 72:                     #endif
 73:                     bool rowOk = filterPlaneOk && inRow >= 0 && inRow < gInputSize;
 74:                     for (int filterCol = 0; filterCol < gFilterSize; filterCol++) {
 75:                         int inCol = outCol + filterCol;
 76:                         #if gPadZeros == 1
 77:                             inCol -= gHalfFilterSize;
 78:                         #endif
 79:                         bool process = rowOk && inCol >= 0 && inCol < gInputSize;
 80:                         if (process) {
 81:                             float imageValue = _inputPlane[ inRow * gInputSize + inCol ];
 82:                             float filterValue = _localFilterPlane[ filterRow * gFilterSize + filterCol ];
 83:                             sum += imageValue * filterValue;
 84:                         }
 85:                     }
 86:                 }
 87:                 if (filterId < gNumFilters) {
 88:                     // [n][filterId][outRow][outCol][inputPlane]
 89:                     int resultIndex = (( (n
 90:                         * gNumFilters + filterId)
 91:                         * gOutputSize + outRow)
 92:                         * gOutputSize + outCol)
 93:                         * gNumInputPlanes + inputPlaneId;
 94:                     output[resultIndex] = sum;
 95:                     //if (globalId == 2) output[0] = resultIndex;
 96: //                    output[resultIndex] = outRow;
 97:                 }
 98: //                output[localId] = _localFilterPlane[localId];
 99:             }
 100:         }
 101:     }
 102: }
 103: 
 104: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 6: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept:
 8: // - load same input plane from each image
 9: // - hold filter plane for this input plane, for all filters
 10: // - reduce afterwards
 11: // local memory for one plane from each filter of 64c7 = 64 * 7 * 7 * 4 = 12.5KB
 12: // local memory for one single input plane = 19 * 19 * 4 = 1.4KB
 13: // => seems ok?
 14: // workgroupid: [inputPlaneId]
 15: // localid: [filterId][outRow] (if this is more than workgroupsize, we should reuse some threads...)
 16: // iterate over: [n][outCol]
 17: // output: [n][filterId][outRow][outCol][inputPlane]
 18: // need to later reduce output over: [inputPlane]
 19: void kernel forward_byinputplane(const int batchSize,
 20:       global const float *images, global const float *filters,
 21:     global float *output,
 22:     local float *_inputPlane, local float *_filterPlanes) {
 23: //    const int evenPadding = gFilterSize % 2 == 0 ? 1 : 0;
 24: 
 25:     const int globalId = get_global_id(0);
 26:     const int workgroupId = get_group_id(0);
 27:     const int workgroupSize = get_local_size(0);
 28:     const int localId = get_local_id(0);
 29: 
 30:     const int inputPlaneId = workgroupId;
 31:     const int numLoops = (gNumFilters * gOutputSize + workgroupSize - 1) / workgroupSize;
 32:     const int numFilterCopyLoops = (gFilterSizeSquared + gOutputSize - 1) / gOutputSize;
 33:     const int numImageCopyLoops = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 34:     for (int loop = 0; loop < numLoops; loop++) {
 35:         const int loopLocalId = localId + loop * workgroupSize;
 36:         const int filterId = loopLocalId / gOutputSize;
 37:         const int outRow = loopLocalId % gOutputSize;
 38: 
 39:         // copy down our filter, we have gOutputSize threads to do this
 40:         global float const *globalFilterPlane = filters +
 41:             (filterId * gNumInputPlanes + inputPlaneId) * gFilterSizeSquared;
 42:         local float *_localFilterPlane = _filterPlanes + filterId * gFilterSizeSquared;
 43:         barrier(CLK_LOCAL_MEM_FENCE);
 44:         for (int i = 0; i < numFilterCopyLoops; i++) {
 45:             const int offset = i * gOutputSize + outRow;
 46:             bool process = filterId < gNumFilters && offset < gFilterSizeSquared;
 47:             if (process) {
 48:                 _localFilterPlane[ offset ] = globalFilterPlane[ offset ];
 49:             }
 50:         }
 51:         // loop over n ...
 52:         for (int n = 0; n < batchSize; n++) {
 53:             // copy down our imageplane, we have workgroupSize threads to do this
 54:             barrier(CLK_LOCAL_MEM_FENCE);
 55:             global float const *globalImagePlane = images +
 56:                 (n * gNumInputPlanes + inputPlaneId) * gInputSizeSquared;
 57:             for (int i = 0; i< numImageCopyLoops; i++) {
 58:                 const int offset = i * workgroupSize + localId;
 59:                 if (offset < gInputSizeSquared) {
 60:                     _inputPlane[ offset ] = globalImagePlane[ offset ];
 61:                 }
 62:             }
 63:             barrier(CLK_LOCAL_MEM_FENCE);
 64:             // calc output for each [outrow][outcol]
 65:             bool filterPlaneOk = filterId < gNumFilters;
 66:             for (int outCol = 0; outCol < gOutputSize; outCol++) {
 67:                 float sum = 0;
 68:                 for (int filterRow = 0; filterRow < gFilterSize; filterRow++) {
 69:                     int inRow = outRow + filterRow;
 70:                     #if gPadZeros == 1
 71:                         inRow -= gHalfFilterSize;
 72:                     #endif
 73:                     bool rowOk = filterPlaneOk && inRow >= 0 && inRow < gInputSize;
 74:                     for (int filterCol = 0; filterCol < gFilterSize; filterCol++) {
 75:                         int inCol = outCol + filterCol;
 76:                         #if gPadZeros == 1
 77:                             inCol -= gHalfFilterSize;
 78:                         #endif
 79:                         bool process = rowOk && inCol >= 0 && inCol < gInputSize;
 80:                         if (process) {
 81:                             float imageValue = _inputPlane[ inRow * gInputSize + inCol ];
 82:                             float filterValue = _localFilterPlane[ filterRow * gFilterSize + filterCol ];
 83:                             sum += imageValue * filterValue;
 84:                         }
 85:                     }
 86:                 }
 87:                 if (filterId < gNumFilters) {
 88:                     // [n][filterId][outRow][outCol][inputPlane]
 89:                     int resultIndex = (( (n
 90:                         * gNumFilters + filterId)
 91:                         * gOutputSize + outRow)
 92:                         * gOutputSize + outCol)
 93:                         * gNumInputPlanes + inputPlaneId;
 94:                     output[resultIndex] = sum;
 95:                     //if (globalId == 2) output[0] = resultIndex;
 96: //                    output[resultIndex] = outRow;
 97:                 }
 98: //                output[localId] = _localFilterPlane[localId];
 99:             }
 100:         }
 101:     }
 102: }
 103: 
 104: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 7
   ... seems valid
 ForwardIm2Col.cl build log: 
 (19:0) : error : invalid global address space qualifier specified for parameter type
 (19:0) : error : syntax error at 'const'

 kernel build error:

 kernel source:
 1: // from SpatialConvolutionMM.cu:
 2: 
 3: // CL: grid stride looping
 4: #define CL_KERNEL_LOOP(i, n)                        \
 5:   for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \
 6:       i < (n);                                       \
 7:       i += get_local_size(0) * get_num_groups(0))
 8: 
 9: //#define gPadding 1
 10: //#define gStride 1
 11: //#define gColSize 3
 12: //#define gFilterSize 3
 13: //#define gSize 3
 14: 
 15: // Kernel for fast unfold+copy
 16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu)
 17: kernel void im2col(
 18:     const int n,
 19:     global float const * im_data, int im_offset,
 20:     global float* data_col) {
 21:   global const float *data_im = im_data + im_offset;
 22: 
 23:   CL_KERNEL_LOOP(index, n) {
 24:     int w_out = index % 3;
 25:     index /= 3;
 26:     int h_out = index % 3;
 27:     int channel_in = index / 3;
 28:     int channel_out = channel_in * 3 * 3;
 29:     int h_in = h_out * 1 - 1;
 30:     int w_in = w_out * 1 - 1;
 31:     data_col += (channel_out * 3 + h_out) * 3 + w_out;
 32:     data_im += (channel_in * 3 + h_in) * 3 + w_in;
 33:     for (int i = 0; i < 3; ++i) {
 34:       for (int j = 0; j < 3; ++j) {
 35:         int h = h_in + i;
 36:         int w = w_in + j;
 37:         *data_col = (h >= 0 && w >= 0 && h < 3 && w < 3) ?
 38:           data_im[i * 3 + j] : 0;
 39:         data_col += 3 * 3;
 40:       }
 41:     }
 42:   }
 43: }
 44: 
 45: kernel void col2im(
 46:     const int n,
 47:     global float const *data_col,
 48:     global float* im_data, int im_offset) {
 49:   global float *data_im = im_data + im_offset;
 50: 
 51:   for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) {
 52:     float val = 0;
 53:     int w = index % 3 + 1;
 54:     int h = (index / 3) % 3 + 1;
 55:     int c = index / (3 * 3);
 56:     // compute the start and end of the output
 57:     int w_col_start = (w < 3) ? 0 : (w - 3) / 1 + 1;
 58:     int w_col_end = min(w / 1 + 1, 3);
 59:     int h_col_start = (h < 3) ? 0 : (h - 3) / 1 + 1;
 60:     int h_col_end = min(h / 1 + 1, 3);
 61: 
 62:     int offset = (c * 3 * 3 + h * 3 + w) * 3 * 3;
 63:     int coeff_h_col = (1 - 1 * 3 * 3) * 3;
 64:     int coeff_w_col = (1 - 1 * 3 * 3);
 65:     for (int h_col = h_col_start; h_col < h_col_end; ++h_col) {
 66:       for (int w_col = w_col_start; w_col < w_col_end; ++w_col) {
 67:         val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col];
 68:       }
 69:     }
 70:     data_im[index] = val;
 71:   }
 72: }
 73: 
 74: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 ForwardIm2Col.cl build log: 
 (19:0) : error : invalid global address space qualifier specified for parameter type
 (19:0) : error : syntax error at 'const'

 ForwardAuto: kernel 7 this instance cant be used: 
 kernel source:
 1: // from SpatialConvolutionMM.cu:
 2: 
 3: // CL: grid stride looping
 4: #define CL_KERNEL_LOOP(i, n)                        \
 5:   for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \
 6:       i < (n);                                       \
 7:       i += get_local_size(0) * get_num_groups(0))
 8: 
 9: //#define gPadding 1
 10: //#define gStride 1
 11: //#define gColSize 3
 12: //#define gFilterSize 3
 13: //#define gSize 3
 14: 
 15: // Kernel for fast unfold+copy
 16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu)
 17: kernel void im2col(
 18:     const int n,
 19:     global float const * im_data, int im_offset,
 20:     global float* data_col) {
 21:   global const float *data_im = im_data + im_offset;
 22: 
 23:   CL_KERNEL_LOOP(index, n) {
 24:     int w_out = index % 3;
 25:     index /= 3;
 26:     int h_out = index % 3;
 27:     int channel_in = index / 3;
 28:     int channel_out = channel_in * 3 * 3;
 29:     int h_in = h_out * 1 - 1;
 30:     int w_in = w_out * 1 - 1;
 31:     data_col += (channel_out * 3 + h_out) * 3 + w_out;
 32:     data_im += (channel_in * 3 + h_in) * 3 + w_in;
 33:     for (int i = 0; i < 3; ++i) {
 34:       for (int j = 0; j < 3; ++j) {
 35:         int h = h_in + i;
 36:         int w = w_in + j;
 37:         *data_col = (h >= 0 && w >= 0 && h < 3 && w < 3) ?
 38:           data_im[i * 3 + j] : 0;
 39:         data_col += 3 * 3;
 40:       }
 41:     }
 42:   }
 43: }
 44: 
 45: kernel void col2im(
 46:     const int n,
 47:     global float const *data_col,
 48:     global float* im_data, int im_offset) {
 49:   global float *data_im = im_data + im_offset;
 50: 
 51:   for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) {
 52:     float val = 0;
 53:     int w = index % 3 + 1;
 54:     int h = (index / 3) % 3 + 1;
 55:     int c = index / (3 * 3);
 56:     // compute the start and end of the output
 57:     int w_col_start = (w < 3) ? 0 : (w - 3) / 1 + 1;
 58:     int w_col_end = min(w / 1 + 1, 3);
 59:     int h_col_start = (h < 3) ? 0 : (h - 3) / 1 + 1;
 60:     int h_col_end = min(h / 1 + 1, 3);
 61: 
 62:     int offset = (c * 3 * 3 + h * 3 + w) * 3 * 3;
 63:     int coeff_h_col = (1 - 1 * 3 * 3) * 3;
 64:     int coeff_w_col = (1 - 1 * 3 * 3);
 65:     for (int h_col = h_col_start; h_col < h_col_end; ++h_col) {
 66:       for (int w_col = w_col_start; w_col < w_col_end; ++w_col) {
 67:         val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col];
 68:       }
 69:     }
 70:     data_im[index] = val;
 71:   }
 72: }
 73: 
 74: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 ForwardIm2Col.cl build log: 
 (19:0) : error : invalid global address space qualifier specified for parameter type
 (19:0) : error : syntax error at 'const'

   forward kernel 0: cannot be used
   forward kernel 1: cannot be used
   forward kernel 2: cannot be used
   forward kernel 3: cannot be used
   forward kernel 4: cannot be used
   forward kernel 5: cannot be used
   forward kernel 6: cannot be used
   forward kernel 7: cannot be used
 clblas teardown
 unknown file: Failure
 C++ exception with description "No valid forward implementations found" thrown in the test body.
 [  FAILED  ] testupdateweights.conv1z (141 ms)
 [ RUN      ] testupdateweights.numericallytest
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.numericallytest (56 ms)
 [ RUN      ] testupdateweights.numericallytest_imagesize3
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.numericallytest_imagesize3 (66 ms)
 [ RUN      ] testupdateweights.numericallytest_imagesize5
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=5 -DgOutputSizeSquared=25 -DgInputSize=5 -DgInputSizeSquared=25 -DgNumPlanes=1 -D TANH"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=5 -DgOutputSizeSquared=25 -DgInputSize=5 -DgInputSizeSquared=25 -DgNumPlanes=1 -D TANH"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=5 -DgOutputSizeSquared=25 -DgInputSize=5 -DgInputSizeSquared=25 -DgNumPlanes=1 -D TANH"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.numericallytest_imagesize5 (66 ms)
 [ RUN      ] testupdateweights.numericallytest_imagesize9
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=9 -DgOutputSizeSquared=81 -DgInputSize=9 -DgInputSizeSquared=81 -DgNumPlanes=1 -D TANH"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=9 -DgOutputSizeSquared=81 -DgInputSize=9 -DgInputSizeSquared=81 -DgNumPlanes=1 -D TANH"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=9 -DgOutputSizeSquared=81 -DgInputSize=9 -DgInputSizeSquared=81 -DgNumPlanes=1 -D TANH"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.numericallytest_imagesize9 (57 ms)
 [ RUN      ] testupdateweights.numericallytest_imagesize9_filtersize9
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.numericallytest_imagesize9_filtersize9 (56 ms)
 [ RUN      ] testupdateweights.numericallytest_imagesize9_filtersize3
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=7 -DgOutputSizeSquared=49 -DgInputSize=7 -DgInputSizeSquared=49 -DgNumPlanes=1 -D TANH"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=7 -DgOutputSizeSquared=49 -DgInputSize=7 -DgInputSizeSquared=49 -DgNumPlanes=1 -D TANH"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=7 -DgOutputSizeSquared=49 -DgInputSize=7 -DgInputSizeSquared=49 -DgNumPlanes=1 -D TANH"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.numericallytest_imagesize9_filtersize3 (67 ms)
 [ RUN      ] testupdateweights.numericallytest_imagesize3_filtersize3
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.numericallytest_imagesize3_filtersize3 (68 ms)
 [ RUN      ] testupdateweights.numericallytest_imagesize5_filtersize3
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.numericallytest_imagesize5_filtersize3 (68 ms)
 [ RUN      ] testupdateweights.numericallytest_imagesize5_filtersize3_batchsize3
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.numericallytest_imagesize5_filtersize3_batchsize3 (69 ms)
 [ RUN      ] testupdateweights.numericallytest_imagesize5_filtersize3_planes3
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.numericallytest_imagesize5_filtersize3_planes3 (70 ms)
 [ RUN      ] testupdateweights.numericallytest_imagesize5_filtersize3_planes3_batchsize3
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.numericallytest_imagesize5_filtersize3_planes3_batchsize3 (71 ms)
 [ RUN      ] testupdateweights.backprop_weights_2
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 options:  -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=1 -DgInputStripeOuterSize=1 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=1 -DgOutputStripeSize=1
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=1 -DgInputStripeOuterSize=1 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=1 -DgOutputStripeSize=1"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=1 -DgInputStripeOuterSize=1 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=1 -DgOutputStripeSize=1"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=1 -DgInputStripeOuterSize=1 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=1 -DgOutputStripeSize=1"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.backprop_weights_2 (25 ms)
 [ RUN      ] testupdateweights.backprop_weights_2_upstreamimagesize2
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 options:  -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=2 -DgInputStripeOuterNumRows=2 -DgInputStripeInnerSize=4 -DgInputStripeOuterSize=4 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=4
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=2 -DgInputStripeOuterNumRows=2 -DgInputStripeInnerSize=4 -DgInputStripeOuterSize=4 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=4"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=2 -DgInputStripeOuterNumRows=2 -DgInputStripeInnerSize=4 -DgInputStripeOuterSize=4 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=4"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=2 -DgInputStripeOuterNumRows=2 -DgInputStripeInnerSize=4 -DgInputStripeOuterSize=4 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=4"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.backprop_weights_2_upstreamimagesize2 (30 ms)
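
Note on the status code above: -45 is a standard OpenCL status value defined in CL/cl.h. A minimal sketch for decoding it follows. The table is deliberately partial, and the helper name `describe_cl_error` is hypothetical (it is not part of DeepCL or EasyCL).

```python
# Partial table of OpenCL status codes, values as defined in CL/cl.h.
CL_STATUS_NAMES = {
    0: "CL_SUCCESS",
    -11: "CL_BUILD_PROGRAM_FAILURE",
    -43: "CL_INVALID_BUILD_OPTIONS",
    -44: "CL_INVALID_PROGRAM",
    -45: "CL_INVALID_PROGRAM_EXECUTABLE",
    -46: "CL_INVALID_KERNEL_NAME",
}


def describe_cl_error(code: int) -> str:
    """Return the symbolic name for a status code, or a placeholder if unlisted."""
    return CL_STATUS_NAMES.get(code, f"unlisted OpenCL status {code}")


print(describe_cl_error(-45))
```

CL_INVALID_PROGRAM_EXECUTABLE from clCreateKernel means the program has no successfully built executable. That is consistent with the build log above, where the compiler rejects the option string before compiling the kernel source.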
 [ RUN      ] testupdateweights.backprop_weights_2_upstreamimagesize3_filtersize3
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 options:  -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=3 -DgInputStripeOuterNumRows=7 -DgInputStripeInnerSize=9 -DgInputStripeOuterSize=21 -DgInputStripeMarginSize=6 -DgOutputStripeNumRows=1 -DgOutputStripeSize=1
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=3 -DgInputStripeOuterNumRows=7 -DgInputStripeInnerSize=9 -DgInputStripeOuterSize=21 -DgInputStripeMarginSize=6 -DgOutputStripeNumRows=1 -DgOutputStripeSize=1"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=3 -DgInputStripeOuterNumRows=7 -DgInputStripeInnerSize=9 -DgInputStripeOuterSize=21 -DgInputStripeMarginSize=6 -DgOutputStripeNumRows=1 -DgOutputStripeSize=1"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=3 -DgInputStripeOuterNumRows=7 -DgInputStripeInnerSize=9 -DgInputStripeOuterSize=21 -DgInputStripeMarginSize=6 -DgOutputStripeNumRows=1 -DgOutputStripeSize=1"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.backprop_weights_2_upstreamimagesize3_filtersize3 (25 ms)
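
Note on the stripe defines in the option string above: the kernel comments give the size formulas (rows times gInputSize), so the logged values can be checked by hand. The sketch below does that check for this test, using a shortened copy of the logged options. The helper `parse_defines` is hypothetical and only illustrates the arithmetic. It accepts both the `-D name=value` and `-Dname=value` spellings that appear in the string, and makes no claim about which part the Vivante compiler rejects.

```python
import re


def parse_defines(options: str) -> dict:
    """Collect integer -D defines; accepts '-D name=1' and '-Dname=1'."""
    return {name: int(value)
            for name, value in re.findall(r"-D\s*(\w+)=(-?\d+)", options)}


# Subset of the option string logged for upstreamimagesize3_filtersize3.
options = ("-D gInputSize=3 -D gFilterSize=3 -D gHalfFilterSize=1 "
           "-DgNumStripes=1 -DgInputStripeMarginRows=2 "
           "-DgInputStripeInnerNumRows=3 -DgInputStripeOuterNumRows=7 "
           "-DgInputStripeInnerSize=9 -DgInputStripeOuterSize=21 "
           "-DgInputStripeMarginSize=6")
d = parse_defines(options)

# Size formulas taken from the kernel comments: rows * gInputSize.
inner_size = d["gInputStripeInnerNumRows"] * d["gInputSize"]
outer_size = d["gInputStripeOuterNumRows"] * d["gInputSize"]
margin_size = d["gInputStripeMarginRows"] * d["gInputSize"]

print(inner_size, outer_size, margin_size)
```

The computed sizes agree with the logged gInputStripeInnerSize=9, gInputStripeOuterSize=21 and gInputStripeMarginSize=6. One observation from the logged values: gInputStripeMarginRows is 2 while gHalfFilterSize is 1, so here the outer row count is inner rows plus twice the margin rows (3 + 2 * 2 = 7), not inner rows plus twice gHalfFilterSize as the kernel comment suggests.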
 [ RUN      ] testupdateweights.backprop_weights_2_upstreamimagesize4_filtersize3
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 options:  -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=4 -DgInputStripeOuterNumRows=8 -DgInputStripeInnerSize=16 -DgInputStripeOuterSize=32 -DgInputStripeMarginSize=8 -DgOutputStripeNumRows=2 -DgOutputStripeSize=4
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=4 -DgInputStripeOuterNumRows=8 -DgInputStripeInnerSize=16 -DgInputStripeOuterSize=32 -DgInputStripeMarginSize=8 -DgOutputStripeNumRows=2 -DgOutputStripeSize=4"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=4 -DgInputStripeOuterNumRows=8 -DgInputStripeInnerSize=16 -DgInputStripeOuterSize=32 -DgInputStripeMarginSize=8 -DgOutputStripeNumRows=2 -DgOutputStripeSize=4"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL error code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=4 -DgInputStripeOuterNumRows=8 -DgInputStripeInnerSize=16 -DgInputStripeOuterSize=32 -DgInputStripeMarginSize=8 -DgOutputStripeNumRows=2 -DgOutputStripeSize=4"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.backprop_weights_2_upstreamimagesize4_filtersize3 (34 ms)
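 The stripe defines in the option strings can be cross-checked against the formulas in the kernel comments. The sketch below is a hypothetical helper, not DeepCL code. It assumes the margin is `gFilterSize - 1` rows, because that is what reproduces the logged values (`gInputStripeMarginRows=2` for `gFilterSize=3`), even though the kernel comment describes the margin as roughly `gHalfFilterSize`. The rounding-up of the output stripe rows is also an assumption, since only `gNumStripes=1` appears in this log.

```c
#include <assert.h>

/* Hypothetical helper (not part of DeepCL): recomputes the stripe-related
 * -D defines seen in the logged compiler option strings. */
typedef struct {
    int marginRows;           /* gInputStripeMarginRows   */
    int innerNumRows;         /* gInputStripeInnerNumRows */
    int outerNumRows;         /* gInputStripeOuterNumRows */
    int innerSize;            /* gInputStripeInnerSize    */
    int outerSize;            /* gInputStripeOuterSize    */
    int marginSize;           /* gInputStripeMarginSize   */
    int outputStripeNumRows;  /* gOutputStripeNumRows     */
    int outputStripeSize;     /* gOutputStripeSize        */
} StripeDims;

static StripeDims stripe_dims(int inputSize, int filterSize,
                              int outputSize, int numStripes) {
    StripeDims d;
    assert(numStripes > 0);
    /* Assumption inferred from the log: margin = filterSize - 1 rows. */
    d.marginRows = filterSize - 1;
    d.innerNumRows = inputSize / numStripes;
    d.outerNumRows = d.innerNumRows + 2 * d.marginRows;
    d.innerSize = d.innerNumRows * inputSize;
    d.outerSize = d.outerNumRows * inputSize;
    d.marginSize = d.marginRows * inputSize;
    /* Assumption: output rows are split evenly, rounding up. */
    d.outputStripeNumRows = (outputSize + numStripes - 1) / numStripes;
    d.outputStripeSize = d.outputStripeNumRows * outputSize;
    return d;
}
```

 Under these assumptions, `stripe_dims(4, 3, 2, 1)` gives the same eight values as the option string of the failed test above. The defines therefore look internally consistent, and the syntax error reported by the Vivante compiler is more likely about the form of the option string than about the numbers in it.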
 [ RUN      ] testupdateweights.backprop_weights_2_upstreamimagesize5_filtersize3
 Couldn't find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 options:  -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=5 -DgInputStripeOuterNumRows=9 -DgInputStripeInnerSize=25 -DgInputStripeOuterSize=45 -DgInputStripeMarginSize=10 -DgOutputStripeNumRows=3 -DgOutputStripeSize=9
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=5 -DgInputStripeOuterNumRows=9 -DgInputStripeInnerSize=25 -DgInputStripeOuterSize=45 -DgInputStripeMarginSize=10 -DgOutputStripeNumRows=3 -DgOutputStripeSize=9"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL error code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=5 -DgInputStripeOuterNumRows=9 -DgInputStripeInnerSize=25 -DgInputStripeOuterSize=45 -DgInputStripeMarginSize=10 -DgOutputStripeNumRows=3 -DgOutputStripeSize=9"

 unknown file: Failure
 C++ exception with description "
 Something went wrong with clCreateKernel, OpenCL error code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=5 -DgInputStripeOuterNumRows=9 -DgInputStripeInnerSize=25 -DgInputStripeOuterSize=45 -DgInputStripeMarginSize=10 -DgOutputStripeNumRows=3 -DgOutputStripeSize=9"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.backprop_weights_2_upstreamimagesize5_filtersize3 (50 ms)
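 The numeric code reported for clCreateKernel can be mapped back to its symbolic name. The lookup below is a minimal sketch with a hypothetical function name, covering only a few codes from the Khronos `cl.h` header. Code -45 is `CL_INVALID_PROGRAM_EXECUTABLE`, which fits the log: the program build fails on the option string first, so no executable exists when the kernel is created.

```c
#include <stddef.h>

/* Hypothetical helper: maps a few OpenCL status codes to their cl.h names.
 * Only a small subset is listed; everything else falls through to "unknown". */
static const char *cl_error_name(int code) {
    switch (code) {
    case 0:   return "CL_SUCCESS";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    case -43: return "CL_INVALID_BUILD_OPTIONS";
    case -44: return "CL_INVALID_PROGRAM";
    case -45: return "CL_INVALID_PROGRAM_EXECUTABLE";
    case -46: return "CL_INVALID_KERNEL_NAME";
    default:  return "unknown";
    }
}
```

 Printing `cl_error_name(-45)` next to the raw number would make failures like the one above easier to read without consulting the header.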
 [ RUN      ] testupdateweights.backprop_weights_2_upstreamimagesize3_filtersize1
 Couldn't find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 options:  -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=3 -DgInputStripeOuterNumRows=3 -DgInputStripeInnerSize=9 -DgInputStripeOuterSize=9 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=3 -DgOutputStripeSize=9
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=3 -DgInputStripeOuterNumRows=3 -DgInputStripeInnerSize=9 -DgInputStripeOuterSize=9 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=3 -DgOutputStripeSize=9"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=3 -DgInputStripeOuterNumRows=3 -DgInputStripeInnerSize=9 -DgInputStripeOuterSize=9 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=3 -DgOutputStripeSize=9"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=3 -DgInputStripeOuterNumRows=3 -DgInputStripeInnerSize=9 -DgInputStripeOuterSize=9 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=3 -DgOutputStripeSize=9"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.backprop_weights_2_upstreamimagesize3_filtersize1 (52 ms)
 [ RUN      ] testupdateweights.backprop_weights_2_upstreamimagesize16_filtersize1
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 options:  -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=16 -D gInputSizeSquared=256 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=16 -D gOutputSizeSquared=256 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=8 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=2 -DgInputStripeOuterNumRows=2 -DgInputStripeInnerSize=32 -DgInputStripeOuterSize=32 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=32
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=16 -D gInputSizeSquared=256 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=16 -D gOutputSizeSquared=256 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=8 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=2 -DgInputStripeOuterNumRows=2 -DgInputStripeInnerSize=32 -DgInputStripeOuterSize=32 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=32"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=16 -D gInputSizeSquared=256 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=16 -D gOutputSizeSquared=256 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=8 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=2 -DgInputStripeOuterNumRows=2 -DgInputStripeInnerSize=32 -DgInputStripeOuterSize=32 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=32"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=16 -D gInputSizeSquared=256 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=16 -D gOutputSizeSquared=256 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=8 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=2 -DgInputStripeOuterNumRows=2 -DgInputStripeInnerSize=32 -DgInputStripeOuterSize=32 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=32"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.backprop_weights_2_upstreamimagesize16_filtersize1 (48 ms)
 [ RUN      ] testupdateweights.backprop_weights_2_upstreamimagesize17_filtersize1
 LayerDimensions{ inputPlanes=1 inputSize=17 numFilters=1 filterSize=1 outputSize=17 padZeros=0 biased=0 skip=0}
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 options:  -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=17 -D gInputSizeSquared=289 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=17 -D gOutputSizeSquared=289 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=16 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=17 -DgInputStripeOuterSize=17 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=34
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=17 -D gInputSizeSquared=289 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=17 -D gOutputSizeSquared=289 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=16 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=17 -DgInputStripeOuterSize=17 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=34"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=17 -D gInputSizeSquared=289 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=17 -D gOutputSizeSquared=289 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=16 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=17 -DgInputStripeOuterSize=17 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=34"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=17 -D gInputSizeSquared=289 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=17 -D gOutputSizeSquared=289 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=16 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=17 -DgInputStripeOuterSize=17 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=34"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.backprop_weights_2_upstreamimagesize17_filtersize1 (54 ms)
 [ RUN      ] testupdateweights.backprop_weights_2_upstreamimagesize17_filtersize1_moredata
 expectedresult: -958.715
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 options:  -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=17 -D gInputSizeSquared=289 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=17 -D gOutputSizeSquared=289 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=16 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=17 -DgInputStripeOuterSize=17 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=34
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=17 -D gInputSizeSquared=289 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=17 -D gOutputSizeSquared=289 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=16 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=17 -DgInputStripeOuterSize=17 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=34"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=17 -D gInputSizeSquared=289 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=17 -D gOutputSizeSquared=289 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=16 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=17 -DgInputStripeOuterSize=17 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=34"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=17 -D gInputSizeSquared=289 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=17 -D gOutputSizeSquared=289 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=16 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=17 -DgInputStripeOuterSize=17 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=34"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.backprop_weights_2_upstreamimagesize17_filtersize1_moredata (57 ms)
 [ RUN      ] testupdateweights.backprop_instance3_smaller2
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 numweights: 36
 options:  -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=96 -D gInputSizeSquared=9216 -D gNumFilters=1 -D gFilterSize=6 -D gHalfFilterSize=3 -D gFilterSizeSquared=36 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=91 -D gOutputSizeSquared=8281 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0 -DgNumStripes=512 -DgInputStripeMarginRows=5 -DgInputStripeInnerNumRows=0 -DgInputStripeOuterNumRows=10 -DgInputStripeInnerSize=0 -DgInputStripeOuterSize=960 -DgInputStripeMarginSize=480 -DgOutputStripeNumRows=1 -DgOutputStripeSize=91
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=96 -D gInputSizeSquared=9216 -D gNumFilters=1 -D gFilterSize=6 -D gHalfFilterSize=3 -D gFilterSizeSquared=36 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=91 -D gOutputSizeSquared=8281 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0 -DgNumStripes=512 -DgInputStripeMarginRows=5 -DgInputStripeInnerNumRows=0 -DgInputStripeOuterNumRows=10 -DgInputStripeInnerSize=0 -DgInputStripeOuterSize=960 -DgInputStripeMarginSize=480 -DgOutputStripeNumRows=1 -DgOutputStripeSize=91"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=96 -D gInputSizeSquared=9216 -D gNumFilters=1 -D gFilterSize=6 -D gHalfFilterSize=3 -D gFilterSizeSquared=36 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=91 -D gOutputSizeSquared=8281 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0 -DgNumStripes=512 -DgInputStripeMarginRows=5 -DgInputStripeInnerNumRows=0 -DgInputStripeOuterNumRows=10 -DgInputStripeInnerSize=0 -DgInputStripeOuterSize=960 -DgInputStripeMarginSize=480 -DgOutputStripeNumRows=1 -DgOutputStripeSize=91"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // BIASED (or not)
 9: 
 10: // workgroupId: [outputPlane][inputPlane]
 11: // localId: [filterRow][filterCol]
 12: // per-thread iteration: [n][outputRow][outputCol]
 13: // local: errorimage: outputSize * outputSize
 14: //        imageimage: inputSize * inputSize
 15: // specific characteristic: load one stripe of each image at a time,
 16: // so we dont run out of memory
 17: // number of stripes set in: gNumStripes
 18: // note that whilst we can stripe the gradOutput simply,
 19: // we actually need to add a half-filter widthed additional few rows
 20: // onto the images stripe, otherwise we will be missing data
 21: //   we will call the size of the non-overlapping image stripes: gInputStripeInnerSize
 22: //      the outersize, including the two margins is: gInputStripeOuterSize
 23: //      of course, the first and last stripes will be missing a bit off the top/bottom, where the
 24: //      corresponding outer margin would be
 25: void kernel backprop_floats_withscratch_dobias_striped(
 26:         const float learningRateMultiplier, const int batchSize,
 27:          global const float *gradOutput, global const float *images,
 28:         global float *gradWeights,
 29:         #ifdef BIASED
 30:              global float *gradBiasWeights,
 31:         #endif
 32:         local float *_errorStripe, local float *_imageStripe
 33:  ) {
 34:     // gHalfFilterSize
 35:     // gInputSize
 36:     //
 37:     // gInputStripeMarginRows => basically equal to gHalfFilterSize
 38:     // gInputStripeInnerNumRows = gInputSize / gNumStripes
 39:     // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize  (note: one row less than
 40:     //                                                         if we just added gFilterSize)
 41:     // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize
 42:     // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize
 43:     // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize
 44:     //
 45:     // gOutputStripeNumRows
 46:     // gOutputStripeSize
 47: 
 48:     const int globalId = get_global_id(0);
 49:     const int localId = get_local_id(0);
 50:     const int workgroupId = get_group_id(0);
 51:     const int workgroupSize = get_local_size(0);
 52: 
 53:     const int filterRow = localId / gFilterSize;
 54:     const int filterCol = localId % gFilterSize;
 55: 
 56:     const int outPlane = workgroupId / gInputPlanes;
 57:     const int upstreamPlane = workgroupId % gInputPlanes;
 58: 
 59:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 60:     //       aggregate over:  [outRow][outCol][n]
 61:     float thiswchange = 0;
 62: #ifdef BIASED
 63:     float thisbiaschange = 0;
 64: #endif
 65:     const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize;
 66:     const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize;
 67:     for (int n = 0; n < batchSize; n++) {
 68:         const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 69:         const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared;
 70:         const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared;
 71:         const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared;
 72:         for (int stripe = 0; stripe < gNumStripes; stripe++) {
 73:             const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize;
 74:             const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize;
 75:             // need to fetch the image, but it's bigger than us, so will need to loop...
 76:             barrier(CLK_LOCAL_MEM_FENCE);
 77:             for (int i = 0; i < numLoopsForImageStripe; i++) {
 78:                 int thisOffset = i * workgroupSize + localId;
 79:                 int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset;
 80:                 bool process = thisOffset < gInputStripeOuterSize
 81:                     && thisGlobalImagesOffset >= imageImageGlobalOffset
 82:                     && thisGlobalImagesOffset < imageImageGlobalOffsetAfter;
 83:                 if (process) {
 84:                     _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ];
 85:                 }
 86:             }
 87:             int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize;
 88:             for (int i = 0; i < numLoopsForErrorStripe; i++) {
 89:                 int thisOffset = i * workgroupSize + localId;
 90:                 int globalErrorsOffset = errorStripeOffset + thisOffset;
 91:                 bool process = thisOffset < gOutputStripeSize
 92:                     && globalErrorsOffset < errorImageGlobalOffsetAfter;
 93:                 if (process) {
 94:                     _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset];
 95:                 }
 96:             }
 97:             const int stripeOutRowStart = stripe * gOutputStripeNumRows;
 98:             const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows;
 99:             barrier(CLK_LOCAL_MEM_FENCE);
 100: //            if (localId == 13) {
 101: //                for (int i = 0; i < 12; i++) {
 102: //                    gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize];
 103: //                }
 104: //                for (int i = 0; i < 20; i++) {
 105: //                    gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize];
 106: //                }
 107: //            }
 108:             if (localId < gFilterSizeSquared) {
 109:                 for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) {
 110:                     int upstreamRow = outRow - gMargin + filterRow;
 111:                     for (int outCol = 0; outCol < gOutputSize; outCol++) {
 112:                         int upstreamCol = outCol - gMargin + filterCol;
 113:                         bool proceed =
 114:                             upstreamRow >= 0 && upstreamCol >= 0
 115:                             && upstreamRow < gInputSize && upstreamCol < gInputSize
 116:                             && outRow < gOutputSize;
 117:                         if (proceed) {
 118:                             int resultIndex = outRow * gOutputSize + outCol;
 119:                             float error = _errorStripe[resultIndex - stripe * gOutputStripeSize];
 120:                             int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol;
 121:                             float upstreamResult = _imageStripe[upstreamDataIndex +  gInputStripeMarginSize
 122:                                         - stripe * gInputStripeInnerSize ];
 123:                             thiswchange += upstreamResult * error;
 124:         #ifdef BIASED
 125:                             thisbiaschange += error;
 126:         #endif
 127:                         }
 128:                     }
 129:                 }
 130:             }
 131:         }
 132:     }
 133:     if (localId < gFilterSizeSquared) {
 134:         gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange;
 135: //        weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId;
 136:     }
 137: #ifdef BIASED
 138:     bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin;
 139:     if (writeBias) {
 140:         gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange;
 141:     }
 142: #endif
 143:     // gradWeights:     [outPlane][upstreamPlane][filterRow][filterCol]
 144:     //       aggregate over:  [outRow][outCol][n]
 145: }
 146: 
 147: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/BackpropWeightsScratchLarge.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=96 -D gInputSizeSquared=9216 -D gNumFilters=1 -D gFilterSize=6 -D gHalfFilterSize=3 -D gFilterSizeSquared=36 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=91 -D gOutputSizeSquared=8281 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0 -DgNumStripes=512 -DgInputStripeMarginRows=5 -DgInputStripeInnerNumRows=0 -DgInputStripeOuterNumRows=10 -DgInputStripeInnerSize=0 -DgInputStripeOuterSize=960 -DgInputStripeMarginSize=480 -DgOutputStripeNumRows=1 -DgOutputStripeSize=91"
 " thrown in the test body.
 [  FAILED  ] testupdateweights.backprop_instance3_smaller2 (63 ms)
 [----------] 23 tests from testupdateweights (1443 ms total)
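
Note on the failures above: every build log in this test case reports "syntax error in compiler option string", and clCreateKernel then fails with -45 (CL_INVALID_PROGRAM_EXECUTABLE), which is consistent with the program never having been built. The kernel source itself is not what the compiler objects to. The rejected strings all start with a space and use the spaced form "-D name=value"; some also contain the compact form "-DgNumStripes=16". The log does not say which of these the Vivante compiler dislikes, so the following is only a sketch of one experiment: normalize the option string to the compact form before handing it to the build call. The helper name compactDefines is hypothetical and is not part of DeepCL or EasyCL, and it is an untested assumption that this change would make the build succeed on this device.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of DeepCL or EasyCL).
// Rewrites "-D name=value" as "-Dname=value" and drops leading and
// repeated whitespace. Assumes no option value contains spaces, which
// holds for the option strings shown in this log.
std::string compactDefines(const std::string &options) {
    std::istringstream in(options);
    std::string token;
    std::string out;
    bool joinNext = false;  // true right after a bare "-D" token
    while (in >> token) {
        if (joinNext) {
            out += token;   // glue the macro definition onto the "-D"
            joinNext = false;
            continue;
        }
        if (!out.empty()) {
            out += ' ';
        }
        out += token;
        joinNext = (token == "-D");
    }
    return out;
}
```

For example, passing " -D gInputSize=17 -D gSkip=0 -DgNumStripes=16" yields "-DgInputSize=17 -DgSkip=0 -DgNumStripes=16". If the build still fails with the compact string, the spaced "-D" form is not the cause and the option parser on this platform needs a closer look.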

 [----------] 17 tests from testforward
 [ RUN      ] testforward.imagesize2_nopadzeros
 expected number of output: 4
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=2 -D gFilterSize=2 -D gHalfFilterSize=1 -D gFilterSizeSquared=4 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept: each workgroup handles convolving one input example with one filtercube
 8: // and writing out one single output plane
 9: //
 10: // workgroup id organized like: [imageid][outplane]
 11: // local id organized like: [outrow][outcol]
 12: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 13: // number workgroups = 32
 14: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 16: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 17: // output are organized like [imageid][filterid][row][col]
 18: void kernel forward_3_by_n_outplane(const int batchSize,
 19:       global const float *images, global const float *filters,
 20:     global float *output,
 21:     local float *_upstreamImage, local float *_filterCube) {
 22:     const int globalId = get_global_id(0);
 23: 
 24:     const int workgroupId = get_group_id(0);
 25:     const int workgroupSize = get_local_size(0);
 26:     const int n = workgroupId / gNumFilters;
 27:     const int outPlane = workgroupId % gNumFilters;
 28: 
 29:     const int localId = get_local_id(0);
 30:     const int outputRow = localId / gOutputSize;
 31:     const int outputCol = localId % gOutputSize;
 32: 
 33:     const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize;
 34:     const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow  - gEven) : gHalfFilterSize - gEven;
 35:     const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize;
 36:     const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven;
 37: 
 38:     const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 39: 
 40:     const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 41:     const int filterCubeGlobalOffset = outPlane * filterCubeLength;
 42:     const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize;
 43:     for (int i = 0; i < numPixelsPerThread; i++) {
 44:         int thisOffset = localId + i * workgroupSize;
 45:         if (thisOffset < filterCubeLength) {
 46:             _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset];
 47:         }
 48:     }
 49:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 50: 
 51:     float sum = 0;
 52:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 53:         int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 54:         barrier(CLK_LOCAL_MEM_FENCE);
 55:         for (int i = 0; i < numUpstreamsPerThread; i++) {
 56:             int thisOffset = workgroupSize * i + localId;
 57:             if (thisOffset < gInputSizeSquared) {
 58:                 _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ];
 59:             }
 60:         }
 61:         barrier(CLK_LOCAL_MEM_FENCE);
 62:         int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 63:         for (int u = minu; u <= maxu; u++) {
 64:             int inputRow = outputRow + u;
 65:             #if gPadZeros == 0
 66:                 inputRow += gHalfFilterSize;
 67:             #endif
 68:             int inputimagerowoffset = inputRow * gInputSize;
 69:             int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 70:             for (int v = minv; v <= maxv; v++) {
 71:                 int inputCol = outputCol + v;
 72:                 #if gPadZeros == 0
 73:                     inputCol += gHalfFilterSize;
 74:                 #endif
 75:                 if (localId < gOutputSizeSquared) {
 76:                     sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 77:                 }
 78:             }
 79:         }
 80:     }
 81: 
 82:     // output are organized like [imageid][filterid][row][col]
 83:     int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 84:     if (localId < gOutputSizeSquared) {
 85:         output[resultIndex ] = sum;
 86:     }
 87: }
 88: 
 89: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=2 -D gFilterSize=2 -D gHalfFilterSize=1 -D gFilterSizeSquared=4 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0"

 unknown file: Failure
 C++ exception with description "
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=2 -D gFilterSize=2 -D gHalfFilterSize=1 -D gFilterSizeSquared=4 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0"
 " thrown in the test body.
 [  FAILED  ] testforward.imagesize2_nopadzeros (75 ms)
 [ RUN      ] testforward.imagesize2_padzeros
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=2 -D gFilterSize=2 -D gHalfFilterSize=1 -D gFilterSizeSquared=4 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=1 -D gSkip=0 -DgWorkgroupSize=32"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, const int N) {
 8:     int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize;
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * gWorkgroupSize + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [outplane]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [imageid][filterid][row][col]
 26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB
 27: //                            eg 5 * 5 * 32 * 4 = 3.2KB => ok :-)
 28: //                           but 28 * 28 * 32 * 4 = 100KB => less good :-P
 29: void kernel forward_2_by_outplane(
 30:         const int batchSize,
 31:         global const float *images, global const float *filters,
 32:         global float *output,
 33:         local float *_inputPlane, local float *_filterCube) {
 34:     const int globalId = get_global_id(0);
 35: 
 36:     const int workgroupId = get_group_id(0);
 37:     const int workgroupSize = get_local_size(0);
 38:     const int outPlane = workgroupId;
 39: 
 40:     const int localId = get_local_id(0);
 41:     const int outputRow = localId / gOutputSize;
 42:     const int outputCol = localId % gOutputSize;
 43: 
 44:     #if gPadZeros == 1
 45:         const int minu = max(-gHalfFilterSize, -outputRow);
 46:         const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven;
 47:         const int minv = max(-gHalfFilterSize, -outputCol);
 48:         const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven;
 49:     #else
 50:         const int minu = -gHalfFilterSize;
 51:         const int maxu = gHalfFilterSize - gEven;
 52:         const int minv = -gHalfFilterSize;
 53:         const int maxv = gHalfFilterSize - gEven;
 54:     #endif
 55: 
 56:     {
 57:         const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 58:         copyLocal(_filterCube,
 59:                 filters + outPlane * filterCubeLength,
 60:                 filterCubeLength);
 61:     }
 62:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 63: 
 64:     for (int n = 0; n < batchSize; n++) {
 65:         float sum = 0;
 66:         for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 67:             barrier(CLK_LOCAL_MEM_FENCE);
 68:             copyLocal(_inputPlane,
 69:                        images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared,
 70:                        gInputSizeSquared);
 71:             barrier(CLK_LOCAL_MEM_FENCE);
 72:             int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 73:             if (localId < gOutputSizeSquared) {
 74:                 for (int u = minu; u <= maxu; u++) {
 75:                     int inputRow = outputRow + u;
 76:                     #if gPadZeros == 0
 77:                          inputRow += gHalfFilterSize;
 78:                     #endif
 79:                     int inputimagerowoffset = inputRow * gInputSize;
 80:                     int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 81:                     for (int v = minv; v <= maxv; v++) {
 82:                         int inputCol = outputCol + v;
 83:                         #if gPadZeros == 0
 84:                              inputCol += gHalfFilterSize;
 85:                         #endif
 86:                         sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 87:                     }
 88:                 }
 89:             }
 90:         }
 91:         // output are organized like [imageid][filterid][row][col]
 92:         int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 93:         if (localId < gOutputSizeSquared) {
 94:             output[resultIndex ] = sum;
 95:         }
 96:     }
 97: }
 98: #endif
 99: 
 100: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=2 -D gFilterSize=2 -D gHalfFilterSize=1 -D gFilterSizeSquared=4 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=1 -D gSkip=0 -DgWorkgroupSize=32"

 unknown file: Failure
 C++ exception with description "
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=2 -D gFilterSize=2 -D gHalfFilterSize=1 -D gFilterSizeSquared=4 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=1 -D gSkip=0 -DgWorkgroupSize=32"
 " thrown in the test body.
 [  FAILED  ] testforward.imagesize2_padzeros (49 ms)
 [ RUN      ] testforward.imagesize3
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, const int N) {
 8:     int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize;
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * gWorkgroupSize + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [outplane]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [imageid][filterid][row][col]
 26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB
 27: //                            eg 5 * 5 * 32 * 4 = 3.2KB => ok :-)
 28: //                           but 28 * 28 * 32 * 4 = 100KB => less good :-P
 29: void kernel forward_2_by_outplane(
 30:         const int batchSize,
 31:         global const float *images, global const float *filters,
 32:         global float *output,
 33:         local float *_inputPlane, local float *_filterCube) {
 34:     const int globalId = get_global_id(0);
 35: 
 36:     const int workgroupId = get_group_id(0);
 37:     const int workgroupSize = get_local_size(0);
 38:     const int outPlane = workgroupId;
 39: 
 40:     const int localId = get_local_id(0);
 41:     const int outputRow = localId / gOutputSize;
 42:     const int outputCol = localId % gOutputSize;
 43: 
 44:     #if gPadZeros == 1
 45:         const int minu = max(-gHalfFilterSize, -outputRow);
 46:         const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven;
 47:         const int minv = max(-gHalfFilterSize, -outputCol);
 48:         const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven;
 49:     #else
 50:         const int minu = -gHalfFilterSize;
 51:         const int maxu = gHalfFilterSize - gEven;
 52:         const int minv = -gHalfFilterSize;
 53:         const int maxv = gHalfFilterSize - gEven;
 54:     #endif
 55: 
 56:     {
 57:         const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 58:         copyLocal(_filterCube,
 59:                 filters + outPlane * filterCubeLength,
 60:                 filterCubeLength);
 61:     }
 62:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 63: 
 64:     for (int n = 0; n < batchSize; n++) {
 65:         float sum = 0;
 66:         for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 67:             barrier(CLK_LOCAL_MEM_FENCE);
 68:             copyLocal(_inputPlane,
 69:                        images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared,
 70:                        gInputSizeSquared);
 71:             barrier(CLK_LOCAL_MEM_FENCE);
 72:             int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 73:             if (localId < gOutputSizeSquared) {
 74:                 for (int u = minu; u <= maxu; u++) {
 75:                     int inputRow = outputRow + u;
 76:                     #if gPadZeros == 0
 77:                          inputRow += gHalfFilterSize;
 78:                     #endif
 79:                     int inputimagerowoffset = inputRow * gInputSize;
 80:                     int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 81:                     for (int v = minv; v <= maxv; v++) {
 82:                         int inputCol = outputCol + v;
 83:                         #if gPadZeros == 0
 84:                              inputCol += gHalfFilterSize;
 85:                         #endif
 86:                         sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 87:                     }
 88:                 }
 89:             }
 90:         }
 91:         // output are organized like [imageid][filterid][row][col]
 92:         int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 93:         if (localId < gOutputSizeSquared) {
 94:             output[resultIndex ] = sum;
 95:         }
 96:     }
 97: }
 98: #endif
 99: 
 100: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

 unknown file: Failure
 C++ exception with description "
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"
 " thrown in the test body.
 [  FAILED  ] testforward.imagesize3 (91 ms)
 [ RUN      ] testforward.test2
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"
 " thrown in the test body.
 [  FAILED  ] testforward.test2 (102 ms)
 [ RUN      ] testforward.test3
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, const int N) {
 8:     int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize;
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * gWorkgroupSize + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [outplane]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [imageid][filterid][row][col]
 26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB
 27: //                            eg 5 * 5 * 32 * 4 = 3.2KB => ok :-)
 28: //                           but 28 * 28 * 32 * 4 = 100KB => less good :-P
 29: void kernel forward_2_by_outplane(
 30:         const int batchSize,
 31:         global const float *images, global const float *filters,
 32:         global float *output,
 33:         local float *_inputPlane, local float *_filterCube) {
 34:     const int globalId = get_global_id(0);
 35: 
 36:     const int workgroupId = get_group_id(0);
 37:     const int workgroupSize = get_local_size(0);
 38:     const int outPlane = workgroupId;
 39: 
 40:     const int localId = get_local_id(0);
 41:     const int outputRow = localId / gOutputSize;
 42:     const int outputCol = localId % gOutputSize;
 43: 
 44:     #if gPadZeros == 1
 45:         const int minu = max(-gHalfFilterSize, -outputRow);
 46:         const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven;
 47:         const int minv = max(-gHalfFilterSize, -outputCol);
 48:         const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven;
 49:     #else
 50:         const int minu = -gHalfFilterSize;
 51:         const int maxu = gHalfFilterSize - gEven;
 52:         const int minv = -gHalfFilterSize;
 53:         const int maxv = gHalfFilterSize - gEven;
 54:     #endif
 55: 
 56:     {
 57:         const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 58:         copyLocal(_filterCube,
 59:                 filters + outPlane * filterCubeLength,
 60:                 filterCubeLength);
 61:     }
 62:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 63: 
 64:     for (int n = 0; n < batchSize; n++) {
 65:         float sum = 0;
 66:         for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 67:             barrier(CLK_LOCAL_MEM_FENCE);
 68:             copyLocal(_inputPlane,
 69:                        images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared,
 70:                        gInputSizeSquared);
 71:             barrier(CLK_LOCAL_MEM_FENCE);
 72:             int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 73:             if (localId < gOutputSizeSquared) {
 74:                 for (int u = minu; u <= maxu; u++) {
 75:                     int inputRow = outputRow + u;
 76:                     #if gPadZeros == 0
 77:                          inputRow += gHalfFilterSize;
 78:                     #endif
 79:                     int inputimagerowoffset = inputRow * gInputSize;
 80:                     int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 81:                     for (int v = minv; v <= maxv; v++) {
 82:                         int inputCol = outputCol + v;
 83:                         #if gPadZeros == 0
 84:                              inputCol += gHalfFilterSize;
 85:                         #endif
 86:                         sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 87:                     }
 88:                 }
 89:             }
 90:         }
 91:         // output are organized like [imageid][filterid][row][col]
 92:         int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 93:         if (localId < gOutputSizeSquared) {
 94:             output[resultIndex ] = sum;
 95:         }
 96:     }
 97: }
 98: #endif
 99: 
 100: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, const int N) {
 8:     int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize;
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * gWorkgroupSize + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [outplane]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [imageid][filterid][row][col]
 26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB
 27: //                            eg 5 * 5 * 32 * 4 = 3.2KB => ok :-)
 28: //                           but 28 * 28 * 32 * 4 = 100KB => less good :-P
 29: void kernel forward_2_by_outplane(
 30:         const int batchSize,
 31:         global const float *images, global const float *filters,
 32:         global float *output,
 33:         local float *_inputPlane, local float *_filterCube) {
 34:     const int globalId = get_global_id(0);
 35: 
 36:     const int workgroupId = get_group_id(0);
 37:     const int workgroupSize = get_local_size(0);
 38:     const int outPlane = workgroupId;
 39: 
 40:     const int localId = get_local_id(0);
 41:     const int outputRow = localId / gOutputSize;
 42:     const int outputCol = localId % gOutputSize;
 43: 
 44:     #if gPadZeros == 1
 45:         const int minu = max(-gHalfFilterSize, -outputRow);
 46:         const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven;
 47:         const int minv = max(-gHalfFilterSize, -outputCol);
 48:         const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven;
 49:     #else
 50:         const int minu = -gHalfFilterSize;
 51:         const int maxu = gHalfFilterSize - gEven;
 52:         const int minv = -gHalfFilterSize;
 53:         const int maxv = gHalfFilterSize - gEven;
 54:     #endif
 55: 
 56:     {
 57:         const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 58:         copyLocal(_filterCube,
 59:                 filters + outPlane * filterCubeLength,
 60:                 filterCubeLength);
 61:     }
 62:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 63: 
 64:     for (int n = 0; n < batchSize; n++) {
 65:         float sum = 0;
 66:         for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 67:             barrier(CLK_LOCAL_MEM_FENCE);
 68:             copyLocal(_inputPlane,
 69:                        images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared,
 70:                        gInputSizeSquared);
 71:             barrier(CLK_LOCAL_MEM_FENCE);
 72:             int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 73:             if (localId < gOutputSizeSquared) {
 74:                 for (int u = minu; u <= maxu; u++) {
 75:                     int inputRow = outputRow + u;
 76:                     #if gPadZeros == 0
 77:                          inputRow += gHalfFilterSize;
 78:                     #endif
 79:                     int inputimagerowoffset = inputRow * gInputSize;
 80:                     int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 81:                     for (int v = minv; v <= maxv; v++) {
 82:                         int inputCol = outputCol + v;
 83:                         #if gPadZeros == 0
 84:                              inputCol += gHalfFilterSize;
 85:                         #endif
 86:                         sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 87:                     }
 88:                 }
 89:             }
 90:         }
 91:         // output are organized like [imageid][filterid][row][col]
 92:         int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 93:         if (localId < gOutputSizeSquared) {
 94:             output[resultIndex ] = sum;
 95:         }
 96:     }
 97: }
 98: #endif
 99: 
 100: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"
 " thrown in the test body.
 [  FAILED  ] testforward.test3 (50 ms)
 [ RUN      ] testforward.compare_0_1_biased_nopad
 LayerDimensions{ inputPlanes=8 inputSize=19 numFilters=8 filterSize=5 outputSize=15 padZeros=0 biased=1 skip=0}
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=15 -D gOutputSizeSquared=225 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=15 -D gOutputSizeSquared=225 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=15 -D gOutputSizeSquared=225 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"
 " thrown in the test body.
 [  FAILED  ] testforward.compare_0_1_biased_nopad (101 ms)
 [ RUN      ] testforward.compare_0_1_biased_pad
 LayerDimensions{ inputPlanes=8 inputSize=19 numFilters=8 filterSize=5 outputSize=19 padZeros=1 biased=1 skip=0}
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=19 -D gOutputSizeSquared=361 -D gPadZeros=1 -D gMargin=2 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=19 -D gOutputSizeSquared=361 -D gPadZeros=1 -D gMargin=2 -D gEven=0 -D gSkip=0"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=19 -D gOutputSizeSquared=361 -D gPadZeros=1 -D gMargin=2 -D gEven=0 -D gSkip=0"
 " thrown in the test body.
 [  FAILED  ] testforward.compare_0_1_biased_pad (47 ms)
 [ RUN      ] testforward.compare_1_n_biased_nopad
 instance: 2
 LayerDimensions{ inputPlanes=8 inputSize=19 numFilters=8 filterSize=5 outputSize=15 padZeros=0 biased=1 skip=0}
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=15 -D gOutputSizeSquared=225 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=15 -D gOutputSizeSquared=225 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=15 -D gOutputSizeSquared=225 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"
 " thrown in the test body.
 [  FAILED  ] testforward.compare_1_n_biased_nopad (61 ms)
 [ RUN      ] testforward.compare_1_n_biased_pad
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 instance: 2
 LayerDimensions{ inputPlanes=8 inputSize=19 numFilters=8 filterSize=5 outputSize=19 padZeros=1 biased=1 skip=0}
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=19 -D gOutputSizeSquared=361 -D gPadZeros=1 -D gMargin=2 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=19 -D gOutputSizeSquared=361 -D gPadZeros=1 -D gMargin=2 -D gEven=0 -D gSkip=0"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=19 -D gOutputSizeSquared=361 -D gPadZeros=1 -D gMargin=2 -D gEven=0 -D gSkip=0"
 " thrown in the test body.
 [  FAILED  ] testforward.compare_1_n_biased_pad (145 ms)
 [ RUN      ] testforward.compare_1_5_biased_nopad
 LayerDimensions{ inputPlanes=8 inputSize=19 numFilters=8 filterSize=19 outputSize=1 padZeros=0 biased=1 skip=0}
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=19 -D gHalfFilterSize=9 -D gFilterSizeSquared=361 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=19 -D gHalfFilterSize=9 -D gFilterSizeSquared=361 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=19 -D gHalfFilterSize=9 -D gFilterSizeSquared=361 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"
 " thrown in the test body.
 [  FAILED  ] testforward.compare_1_5_biased_nopad (50 ms)
 [ RUN      ] testforward.compare_1_4_fcscenario
 LayerDimensions{ inputPlanes=10 inputSize=24 numFilters=10 filterSize=24 outputSize=1 padZeros=0 biased=1 skip=0}
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=10 -D gInputPlanes=10 -D gInputSize=24 -D gInputSizeSquared=576 -D gNumFilters=10 -D gFilterSize=24 -D gHalfFilterSize=12 -D gFilterSizeSquared=576 -D gNumOutputPlanes=10 -D gOutputPlanes=10 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=10 -D gInputPlanes=10 -D gInputSize=24 -D gInputSizeSquared=576 -D gNumFilters=10 -D gFilterSize=24 -D gHalfFilterSize=12 -D gFilterSizeSquared=576 -D gNumOutputPlanes=10 -D gOutputPlanes=10 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=10 -D gInputPlanes=10 -D gInputSize=24 -D gInputSizeSquared=576 -D gNumFilters=10 -D gFilterSize=24 -D gHalfFilterSize=12 -D gFilterSizeSquared=576 -D gNumOutputPlanes=10 -D gOutputPlanes=10 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0"
 " thrown in the test body.
 [  FAILED  ] testforward.compare_1_4_fcscenario (59 ms)
 [ RUN      ] testforward.compare_break1_0_1
 LayerDimensions{ inputPlanes=1 inputSize=33 numFilters=1 filterSize=1 outputSize=33 padZeros=0 biased=0 skip=0}
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=33 -D gInputSizeSquared=1089 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=33 -D gOutputSizeSquared=1089 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=33 -D gInputSizeSquared=1089 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=33 -D gOutputSizeSquared=1089 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=33 -D gInputSizeSquared=1089 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=33 -D gOutputSizeSquared=1089 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"
 " thrown in the test body.
 [  FAILED  ] testforward.compare_break1_0_1 (101 ms)
 [ RUN      ] testforward.compare_break1_0_4
 LayerDimensions{ inputPlanes=1 inputSize=33 numFilters=1 filterSize=1 outputSize=33 padZeros=0 biased=0 skip=0}
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=545 -D gPixelsPerThread=2 -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=33 -D gInputSizeSquared=1089 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=33 -D gOutputSizeSquared=1089 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, int N) {
 8:     int numLoops = (N + get_local_size(0) - 1) / get_local_size(0);
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * get_local_size(0) + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [n][filterid]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [n][filterid][outrow][outcol]
 26: // the pixels per thread thing... :
 27: // - we have one thread (~= cuda core) per output value,
 28: //   ie one thread for each combination of [outrow][outcol]
 29: // - however, the number of threads is typically limited on a gpu,
 30: //   eg to 512 (eg Intel HD), or 1024 (eg nVidia K520)
 31: // - so what happens if the number of output points is larger than
 32: //   the maximum workgroup size?
 33: // - then we have several possibilities really:
 34: //   - we can divide the image into blocks, and process each block
 35: //     separately.  This is probably a good option, but fair amount of
 36: //     work
 37: //   - we can get each thread to handle more than one output
 38: //     pixel, by looping
 39: //   - we can consider the output image in 1d, by putting the rows
 40: //     one after another, and assign each contiguous workgroup-size
 41: //     block to one workgroup
 42: //     => this is how this kernel works
 43: //     basically, it's a hack, so larger images actually run, without
 44: //     crashing, and we can probably improve it a lot :-)
 45: //
 46: // So, when outputSize * outputSize > workgroupSize, then
 47: // multiple workgroups will be created for each output plane
 48: // the number of such workgroups is given by: `gPixelsPerThread`
 49: // the id of our workgroup within such a set of workgroups is calculated
 50: // as `pixel`
 51: // effectiveLocalId is our local id if we had one enormous workgroup
 52: // containing the whole output image plane
 53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize,
 54:       global const float *images, global const float *filters,
 55:     global float *output,
 56:     local float *_inputPlane, local float *_filterPlane) {
 57:     #define globalId (get_global_id(0))
 58: 
 59:     #define localId (get_local_id(0))
 60:     #define workgroupId (get_group_id(0))
 61: //    const int workgroupSize = get_local_size(0);
 62:     const int effectiveWorkgroupId = workgroupId / gPixelsPerThread;
 63:     const int pixel = workgroupId % gPixelsPerThread;
 64:     const int effectiveLocalId = localId + pixel * gWorkgroupSize;
 65:     const int n = effectiveWorkgroupId / gNumFilters;
 66:     const int outPlane = effectiveWorkgroupId % gNumFilters;
 67: 
 68:     const int outputRow = effectiveLocalId / gOutputSize;
 69:     const int outputCol = effectiveLocalId % gOutputSize;
 70: 
 71:     float sum = 0;
 72:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 73:         barrier(CLK_LOCAL_MEM_FENCE);
 74:         copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared);
 75:         copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared);
 76:         barrier(CLK_LOCAL_MEM_FENCE);
 77: 
 78:         if (effectiveLocalId < gOutputSizeSquared) {
 79:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 80:                 // trying to reduce register pressure...
 81:                 #if gPadZeros == 1
 82:                     #define inputRow (outputRow + u)
 83:                 #else
 84:                     #define inputRow (outputRow + u + gHalfFilterSize)
 85:                 #endif
 86:                 int inputimagerowoffset = inputRow * gInputSize;
 87:                 int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 88:                 bool rowOk = inputRow >= 0 && inputRow < gInputSize;
 89:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 90:                     #if gPadZeros == 1
 91:                         #define inputCol (outputCol + v)
 92:                     #else
 93:                         #define inputCol (outputCol + v + gHalfFilterSize)
 94:                     #endif
 95:                     bool process = rowOk && inputCol >= 0 && inputCol < gInputSize;
 96:                     if (process) {
 97:                             sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ];
 98:                     }
 99:                 }
 100:             }
 101:         }
 102:     }
 103:     // output are organized like [imageid][filterid][row][col]
 104:     #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId)
 105:     if (effectiveLocalId < gOutputSizeSquared) {
 106:         output[resultIndex ] = sum;
 107:     }
 108: }
 109: #endif
 110: 
 111: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=545 -D gPixelsPerThread=2 -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=33 -D gInputSizeSquared=1089 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=33 -D gOutputSizeSquared=1089 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, int N) {
 8:     int numLoops = (N + get_local_size(0) - 1) / get_local_size(0);
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * get_local_size(0) + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [n][filterid]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [n][filterid][outrow][outcol]
 26: // the pixels per thread thing... :
 27: // - we have one thread (~= cuda core) per output value,
 28: //   ie one thread for each combination of [outrow][outcol]
 29: // - however, the number of threads is typically limited on a gpu,
 30: //   eg to 512 (eg Intel HD), or 1024 (eg nVidia K520)
 31: // - so what happens if the number of output points is larger than
 32: //   the maximum workgroup size?
 33: // - then we have several possibilities really:
 34: //   - we can divide the image into blocks, and process each block
 35: //     separately.  This is probably a good option, but fair amount of
 36: //     work
 37: //   - we can get each thread to handle more than one output
 38: //     pixel, by looping
 39: //   - we can consider the output image in 1d, by putting the rows
 40: //     one after another, and assign each contiguous workgroup-size
 41: //     block to one workgroup
 42: //     => this is how this kernel works
 43: //     basically, it's a hack, so larger images actually run, without
 44: //     crashing, and we can probably improve it a lot :-)
 45: //
 46: // So, when outputSize * outputSize > workgroupSize, then
 47: // multiple workgroups will be created for each output plane
 48: // the number of such workgroups is given by: `gPixelsPerThread`
 49: // the id of our workgroup within such a set of workgroups is calculated
 50: // as `pixel`
 51: // effectiveLocalId is our local id if we had one enormous workgroup
 52: // containing the whole output image plane
 53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize,
 54:       global const float *images, global const float *filters,
 55:     global float *output,
 56:     local float *_inputPlane, local float *_filterPlane) {
 57:     #define globalId (get_global_id(0))
 58: 
 59:     #define localId (get_local_id(0))
 60:     #define workgroupId (get_group_id(0))
 61: //    const int workgroupSize = get_local_size(0);
 62:     const int effectiveWorkgroupId = workgroupId / gPixelsPerThread;
 63:     const int pixel = workgroupId % gPixelsPerThread;
 64:     const int effectiveLocalId = localId + pixel * gWorkgroupSize;
 65:     const int n = effectiveWorkgroupId / gNumFilters;
 66:     const int outPlane = effectiveWorkgroupId % gNumFilters;
 67: 
 68:     const int outputRow = effectiveLocalId / gOutputSize;
 69:     const int outputCol = effectiveLocalId % gOutputSize;
 70: 
 71:     float sum = 0;
 72:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 73:         barrier(CLK_LOCAL_MEM_FENCE);
 74:         copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared);
 75:         copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared);
 76:         barrier(CLK_LOCAL_MEM_FENCE);
 77: 
 78:         if (effectiveLocalId < gOutputSizeSquared) {
 79:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 80:                 // trying to reduce register pressure...
 81:                 #if gPadZeros == 1
 82:                     #define inputRow (outputRow + u)
 83:                 #else
 84:                     #define inputRow (outputRow + u + gHalfFilterSize)
 85:                 #endif
 86:                 int inputimagerowoffset = inputRow * gInputSize;
 87:                 int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 88:                 bool rowOk = inputRow >= 0 && inputRow < gInputSize;
 89:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 90:                     #if gPadZeros == 1
 91:                         #define inputCol (outputCol + v)
 92:                     #else
 93:                         #define inputCol (outputCol + v + gHalfFilterSize)
 94:                     #endif
 95:                     bool process = rowOk && inputCol >= 0 && inputCol < gInputSize;
 96:                     if (process) {
 97:                             sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ];
 98:                     }
 99:                 }
 100:             }
 101:         }
 102:     }
 103:     // output are organized like [imageid][filterid][row][col]
 104:     #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId)
 105:     if (effectiveLocalId < gOutputSizeSquared) {
 106:         output[resultIndex ] = sum;
 107:     }
 108: }
 109: #endif
 110: 
 111: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=545 -D gPixelsPerThread=2 -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=33 -D gInputSizeSquared=1089 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=33 -D gOutputSizeSquared=1089 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"
 " thrown in the test body.
 [  FAILED  ] testforward.compare_break1_0_4 (53 ms)
 [ RUN      ] testforward.comparespecific_break2
 LayerDimensions{ inputPlanes=64 inputSize=19 numFilters=64 filterSize=19 outputSize=1 padZeros=0 biased=0 skip=0}
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=64 -D gInputPlanes=64 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=64 -D gFilterSize=19 -D gHalfFilterSize=9 -D gFilterSizeSquared=361 -D gNumOutputPlanes=64 -D gOutputPlanes=64 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=64 -D gInputPlanes=64 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=64 -D gFilterSize=19 -D gHalfFilterSize=9 -D gFilterSizeSquared=361 -D gNumOutputPlanes=64 -D gOutputPlanes=64 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=64 -D gInputPlanes=64 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=64 -D gFilterSize=19 -D gHalfFilterSize=9 -D gFilterSizeSquared=361 -D gNumOutputPlanes=64 -D gOutputPlanes=64 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"
 " thrown in the test body.
 [  FAILED  ] testforward.comparespecific_break2 (138 ms)
 [ RUN      ] testforward.softmax
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 output[0]=0.0320586
 output[1]=0.0871443
 output[2]=0.643914
 output[3]=0.236883
 loss 0.44019
 loss 3.44019
 loss 2.44019
 loss 1.44019
 [       OK ] testforward.softmax (25 ms)
 [ RUN      ] testforward.softmax_byplane
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 output[0]=0.0320586
 output[1]=0.0871443
 output[2]=0.643914
 output[3]=0.236883
 loss 0.44019
 loss 3.44019
 loss 2.44019
 loss 1.44019
 [       OK ] testforward.softmax_byplane (17 ms)
 [ RUN      ] testforward.crash_from_jm
 -D gNumInputPlanes=32 -D gInputPlanes=32 -D gInputSize=28 -D gInputSizeSquared=784 -D gNumFilters=20 -D gFilterSize=28 -D gHalfFilterSize=14 -D gFilterSizeSquared=784 -D gNumOutputPlanes=20 -D gOutputPlanes=20 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=32 -D gInputPlanes=32 -D gInputSize=28 -D gInputSizeSquared=784 -D gNumFilters=20 -D gFilterSize=28 -D gHalfFilterSize=14 -D gFilterSizeSquared=784 -D gNumOutputPlanes=20 -D gOutputPlanes=20 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=32 -D gInputPlanes=32 -D gInputSize=28 -D gInputSizeSquared=784 -D gNumFilters=20 -D gFilterSize=28 -D gHalfFilterSize=14 -D gFilterSizeSquared=784 -D gNumOutputPlanes=20 -D gOutputPlanes=20 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0"

 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=32 -D gInputPlanes=32 -D gInputSize=28 -D gInputSizeSquared=784 -D gNumFilters=20 -D gFilterSize=28 -D gHalfFilterSize=14 -D gFilterSizeSquared=784 -D gNumOutputPlanes=20 -D gOutputPlanes=20 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0"
 " thrown in the test body.
 [  FAILED  ] testforward.crash_from_jm (157 ms)
 [----------] 17 tests from testforward (1322 ms total)

 [----------] 2 tests from testfilehelper
 [ RUN      ] testfilehelper.testfilehelper
 [       OK ] testfilehelper.testfilehelper (19 ms)
 [ RUN      ] testfilehelper.testreadchunk
 [       OK ] testfilehelper.testreadchunk (11 ms)
 [----------] 2 tests from testfilehelper (30 ms total)

 [----------] 12 tests from testsimpleconvolvenet
 [ RUN      ] testsimpleconvolvenet.imagesize1_planes2_filters2_unbiased_tanh
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH"
 " thrown in the test body.
 [  FAILED  ] testsimpleconvolvenet.imagesize1_planes2_filters2_unbiased_tanh (77 ms)
 [ RUN      ] testsimpleconvolvenet.imagesize1_planes2_filters2_tanh
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH"
 " thrown in the test body.
 [  FAILED  ] testsimpleconvolvenet.imagesize1_planes2_filters2_tanh (77 ms)
 [ RUN      ] testsimpleconvolvenet.imagesize3_n4_filtersize3_tanh
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH"
 " thrown in the test body.
 [  FAILED  ] testsimpleconvolvenet.imagesize3_n4_filtersize3_tanh (78 ms)
 [ RUN      ] testsimpleconvolvenet.imagesize1_2planes_filtersize1
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 forward try kernel 0
  ... not plausibly optimal, skipping
 forward try kernel 1
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at top left or bottom right
 13: // let's position it at the bottom right
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 1: this instance cant be used: 
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 2
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, const int N) {
 8:     int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize;
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * gWorkgroupSize + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [outplane]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [imageid][filterid][row][col]
 26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB
 27: //                            eg 5 * 5 * 32 * 4 = 3.2KB => ok :-)
 28: //                           but 28 * 28 * 32 * 4 = 100KB => less good :-P
 29: void kernel forward_2_by_outplane(
 30:         const int batchSize,
 31:         global const float *images, global const float *filters,
 32:         global float *output,
 33:         local float *_inputPlane, local float *_filterCube) {
 34:     const int globalId = get_global_id(0);
 35: 
 36:     const int workgroupId = get_group_id(0);
 37:     const int workgroupSize = get_local_size(0);
 38:     const int outPlane = workgroupId;
 39: 
 40:     const int localId = get_local_id(0);
 41:     const int outputRow = localId / gOutputSize;
 42:     const int outputCol = localId % gOutputSize;
 43: 
 44:     #if gPadZeros == 1
 45:         const int minu = max(-gHalfFilterSize, -outputRow);
 46:         const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven;
 47:         const int minv = max(-gHalfFilterSize, -outputCol);
 48:         const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven;
 49:     #else
 50:         const int minu = -gHalfFilterSize;
 51:         const int maxu = gHalfFilterSize - gEven;
 52:         const int minv = -gHalfFilterSize;
 53:         const int maxv = gHalfFilterSize - gEven;
 54:     #endif
 55: 
 56:     {
 57:         const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 58:         copyLocal(_filterCube,
 59:                 filters + outPlane * filterCubeLength,
 60:                 filterCubeLength);
 61:     }
 62:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 63: 
 64:     for (int n = 0; n < batchSize; n++) {
 65:         float sum = 0;
 66:         for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 67:             barrier(CLK_LOCAL_MEM_FENCE);
 68:             copyLocal(_inputPlane,
 69:                        images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared,
 70:                        gInputSizeSquared);
 71:             barrier(CLK_LOCAL_MEM_FENCE);
 72:             int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 73:             if (localId < gOutputSizeSquared) {
 74:                 for (int u = minu; u <= maxu; u++) {
 75:                     int inputRow = outputRow + u;
 76:                     #if gPadZeros == 0
 77:                          inputRow += gHalfFilterSize;
 78:                     #endif
 79:                     int inputimagerowoffset = inputRow * gInputSize;
 80:                     int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 81:                     for (int v = minv; v <= maxv; v++) {
 82:                         int inputCol = outputCol + v;
 83:                         #if gPadZeros == 0
 84:                              inputCol += gHalfFilterSize;
 85:                         #endif
 86:                         sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 87:                     }
 88:                 }
 89:             }
 90:         }
 91:         // output are organized like [imageid][filterid][row][col]
 92:         int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 93:         if (localId < gOutputSizeSquared) {
 94:             output[resultIndex ] = sum;
 95:         }
 96:     }
 97: }
 98: #endif
 99: 
 100: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

 ForwardAuto: kernel 2: this instance cant be used: 
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

   ... not valid
 forward try kernel 3
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept: each workgroup handles convolving one input example with one filtercube
 8: // and writing out one single output plane
 9: //
 10: // workgroup id organized like: [imageid][outplane]
 11: // local id organized like: [outrow][outcol]
 12: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 13: // number workgroups = 32
 14: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 16: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 17: // output are organized like [imageid][filterid][row][col]
 18: void kernel forward_3_by_n_outplane(const int batchSize,
 19:       global const float *images, global const float *filters,
 20:     global float *output,
 21:     local float *_upstreamImage, local float *_filterCube) {
 22:     const int globalId = get_global_id(0);
 23: 
 24:     const int workgroupId = get_group_id(0);
 25:     const int workgroupSize = get_local_size(0);
 26:     const int n = workgroupId / gNumFilters;
 27:     const int outPlane = workgroupId % gNumFilters;
 28: 
 29:     const int localId = get_local_id(0);
 30:     const int outputRow = localId / gOutputSize;
 31:     const int outputCol = localId % gOutputSize;
 32: 
 33:     const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize;
 34:     const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow  - gEven) : gHalfFilterSize - gEven;
 35:     const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize;
 36:     const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven;
 37: 
 38:     const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 39: 
 40:     const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 41:     const int filterCubeGlobalOffset = outPlane * filterCubeLength;
 42:     const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize;
 43:     for (int i = 0; i < numPixelsPerThread; i++) {
 44:         int thisOffset = localId + i * workgroupSize;
 45:         if (thisOffset < filterCubeLength) {
 46:             _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset];
 47:         }
 48:     }
 49:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 50: 
 51:     float sum = 0;
 52:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 53:         int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 54:         barrier(CLK_LOCAL_MEM_FENCE);
 55:         for (int i = 0; i < numUpstreamsPerThread; i++) {
 56:             int thisOffset = workgroupSize * i + localId;
 57:             if (thisOffset < gInputSizeSquared) {
 58:                 _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ];
 59:             }
 60:         }
 61:         barrier(CLK_LOCAL_MEM_FENCE);
 62:         int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 63:         for (int u = minu; u <= maxu; u++) {
 64:             int inputRow = outputRow + u;
 65:             #if gPadZeros == 0
 66:                 inputRow += gHalfFilterSize;
 67:             #endif
 68:             int inputimagerowoffset = inputRow * gInputSize;
 69:             int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 70:             for (int v = minv; v <= maxv; v++) {
 71:                 int inputCol = outputCol + v;
 72:                 #if gPadZeros == 0
 73:                     inputCol += gHalfFilterSize;
 74:                 #endif
 75:                 if (localId < gOutputSizeSquared) {
 76:                     sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 77:                 }
 78:             }
 79:         }
 80:     }
 81: 
 82:     // output are organized like [imageid][filterid][row][col]
 83:     int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 84:     if (localId < gOutputSizeSquared) {
 85:         output[resultIndex ] = sum;
 86:     }
 87: }
 88: 
 89: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 3: this instance cant be used: 
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 4
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, int N) {
 8:     int numLoops = (N + get_local_size(0) - 1) / get_local_size(0);
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * get_local_size(0) + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [n][filterid]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [n][filterid][outrow][outcol]
 26: // the pixels per thread thing... :
 27: // - we have one thread (~= cuda core) per output value,
 28: //   ie one thread for each combination of [outrow][outcol]
 29: // - however, the number of threads is typically limited on a gpu,
 30: //   eg to 512 (eg Intel HD), or 1024 (eg nVidia K520)
 31: // - so what happens if the number of output points is larger than
 32: //   the maximum workgroup size?
 33: // - then we have several possibilities really:
 34: //   - we can divide the image into blocks, and process each block
 35: //     separately.  This is probably a good option, but fair amount of
 36: //     work
 37: //   - we can get each thread to handle more than one output
 38: //     pixel, by looping
 39: //   - we can consider the output image in 1d, by putting the rows
 40: //     one after another, and assign each contiguous workgroup-size
 41: //     block to one workgroup
 42: //     => this is how this kernel works
 43: //     basically, it's a hack, so larger images actually run, without
 44: //     crashing, and we can probably improve it a lot :-)
 45: //
 46: // So, when outputSize * outputSize > workgroupSize, then
 47: // multiple workgroups will be created for each output plane
 48: // the number of such workgroups is given by: `gPixelsPerThread`
 49: // the id of our workgroup within such a set of workgroups is calculated
 50: // as `pixel`
 51: // effectiveLocalId is our local id if we had one enormous workgroup
 52: // containing the whole output image plane
 53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize,
 54:       global const float *images, global const float *filters,
 55:     global float *output,
 56:     local float *_inputPlane, local float *_filterPlane) {
 57:     #define globalId (get_global_id(0))
 58: 
 59:     #define localId (get_local_id(0))
 60:     #define workgroupId (get_group_id(0))
 61: //    const int workgroupSize = get_local_size(0);
 62:     const int effectiveWorkgroupId = workgroupId / gPixelsPerThread;
 63:     const int pixel = workgroupId % gPixelsPerThread;
 64:     const int effectiveLocalId = localId + pixel * gWorkgroupSize;
 65:     const int n = effectiveWorkgroupId / gNumFilters;
 66:     const int outPlane = effectiveWorkgroupId % gNumFilters;
 67: 
 68:     const int outputRow = effectiveLocalId / gOutputSize;
 69:     const int outputCol = effectiveLocalId % gOutputSize;
 70: 
 71:     float sum = 0;
 72:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 73:         barrier(CLK_LOCAL_MEM_FENCE);
 74:         copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared);
 75:         copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared);
 76:         barrier(CLK_LOCAL_MEM_FENCE);
 77: 
 78:         if (effectiveLocalId < gOutputSizeSquared) {
 79:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 80:                 // trying to reduce register pressure...
 81:                 #if gPadZeros == 1
 82:                     #define inputRow (outputRow + u)
 83:                 #else
 84:                     #define inputRow (outputRow + u + gHalfFilterSize)
 85:                 #endif
 86:                 int inputimagerowoffset = inputRow * gInputSize;
 87:                 int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 88:                 bool rowOk = inputRow >= 0 && inputRow < gInputSize;
 89:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 90:                     #if gPadZeros == 1
 91:                         #define inputCol (outputCol + v)
 92:                     #else
 93:                         #define inputCol (outputCol + v + gHalfFilterSize)
 94:                     #endif
 95:                     bool process = rowOk && inputCol >= 0 && inputCol < gInputSize;
 96:                     if (process) {
 97:                             sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ];
 98:                     }
 99:                 }
 100:             }
 101:         }
 102:     }
 103:     // output are organized like [imageid][filterid][row][col]
 104:     #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId)
 105:     if (effectiveLocalId < gOutputSizeSquared) {
 106:         output[resultIndex ] = sum;
 107:     }
 108: }
 109: #endif
 110: 
 111: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 4: this instance cant be used: 
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 5
 cl/reduce_segments.cl build log: 
 (8:0) : error : invalid global address space qualifier specified for parameter type
 (8:0) : error : syntax error at 'const'

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: kernel void reduce_segments(const int numSegments, const int segmentLength,
 8:         global float const *in, global float* out) {
 9:     const int globalId = get_global_id(0);
 10:     const int segmentId = globalId;
 11: 
 12:     if (segmentId >= numSegments) {
 13:         return;
 14:     }
 15: 
 16:     float sum = 0;
 17:     global const float *segment = in + segmentId * segmentLength;
 18:     for (int i = 0; i < segmentLength; i++) {
 19:         sum += segment[i];
 20:     }
 21:     out[segmentId] = sum;
 22: }
 23: 
 24: 
 25: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/reduce_segments.cl build log: 
 (8:0) : error : invalid global address space qualifier specified for parameter type
 (8:0) : error : syntax error at 'const'

 ForwardAuto: kernel 5: this instance cant be used: 
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/reduce_segments.cl build log: 
 (8:0) : error : invalid global address space qualifier specified for parameter type
 (8:0) : error : syntax error at 'const'

   ... not valid
 forward try kernel 6
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept:
 8: // - load same input plane from each image
 9: // - hold filter plane for this input plane, for all filters
 10: // - reduce afterwards
 11: // local memory for one plane from each filter of 64c7 = 64 * 7 * 7 * 4 = 12.5KB
 12: // local memory for one single input plane = 19 * 19 * 4 = 1.4KB
 13: // => seems ok?
 14: // workgroupid: [inputPlaneId]
 15: // localid: [filterId][outRow] (if this is more than workgroupsize, we should reuse some threads...)
 16: // iterate over: [n][outCol]
 17: // output: [n][filterId][outRow][outCol][inputPlane]
 18: // need to later reduce output over: [inputPlane]
 19: void kernel forward_byinputplane(const int batchSize,
 20:       global const float *images, global const float *filters,
 21:     global float *output,
 22:     local float *_inputPlane, local float *_filterPlanes) {
 23: //    const int evenPadding = gFilterSize % 2 == 0 ? 1 : 0;
 24: 
 25:     const int globalId = get_global_id(0);
 26:     const int workgroupId = get_group_id(0);
 27:     const int workgroupSize = get_local_size(0);
 28:     const int localId = get_local_id(0);
 29: 
 30:     const int inputPlaneId = workgroupId;
 31:     const int numLoops = (gNumFilters * gOutputSize + workgroupSize - 1) / workgroupSize;
 32:     const int numFilterCopyLoops = (gFilterSizeSquared + gOutputSize - 1) / gOutputSize;
 33:     const int numImageCopyLoops = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 34:     for (int loop = 0; loop < numLoops; loop++) {
 35:         const int loopLocalId = localId + loop * workgroupSize;
 36:         const int filterId = loopLocalId / gOutputSize;
 37:         const int outRow = loopLocalId % gOutputSize;
 38: 
 39:         // copy down our filter, we have gOutputSize threads to do this
 40:         global float const *globalFilterPlane = filters +
 41:             (filterId * gNumInputPlanes + inputPlaneId) * gFilterSizeSquared;
 42:         local float *_localFilterPlane = _filterPlanes + filterId * gFilterSizeSquared;
 43:         barrier(CLK_LOCAL_MEM_FENCE);
 44:         for (int i = 0; i < numFilterCopyLoops; i++) {
 45:             const int offset = i * gOutputSize + outRow;
 46:             bool process = filterId < gNumFilters && offset < gFilterSizeSquared;
 47:             if (process) {
 48:                 _localFilterPlane[ offset ] = globalFilterPlane[ offset ];
 49:             }
 50:         }
 51:         // loop over n ...
 52:         for (int n = 0; n < batchSize; n++) {
 53:             // copy down our imageplane, we have workgroupSize threads to do this
 54:             barrier(CLK_LOCAL_MEM_FENCE);
 55:             global float const *globalImagePlane = images +
 56:                 (n * gNumInputPlanes + inputPlaneId) * gInputSizeSquared;
 57:             for (int i = 0; i< numImageCopyLoops; i++) {
 58:                 const int offset = i * workgroupSize + localId;
 59:                 if (offset < gInputSizeSquared) {
 60:                     _inputPlane[ offset ] = globalImagePlane[ offset ];
 61:                 }
 62:             }
 63:             barrier(CLK_LOCAL_MEM_FENCE);
 64:             // calc output for each [outrow][outcol]
 65:             bool filterPlaneOk = filterId < gNumFilters;
 66:             for (int outCol = 0; outCol < gOutputSize; outCol++) {
 67:                 float sum = 0;
 68:                 for (int filterRow = 0; filterRow < gFilterSize; filterRow++) {
 69:                     int inRow = outRow + filterRow;
 70:                     #if gPadZeros == 1
 71:                         inRow -= gHalfFilterSize;
 72:                     #endif
 73:                     bool rowOk = filterPlaneOk && inRow >= 0 && inRow < gInputSize;
 74:                     for (int filterCol = 0; filterCol < gFilterSize; filterCol++) {
 75:                         int inCol = outCol + filterCol;
 76:                         #if gPadZeros == 1
 77:                             inCol -= gHalfFilterSize;
 78:                         #endif
 79:                         bool process = rowOk && inCol >= 0 && inCol < gInputSize;
 80:                         if (process) {
 81:                             float imageValue = _inputPlane[ inRow * gInputSize + inCol ];
 82:                             float filterValue = _localFilterPlane[ filterRow * gFilterSize + filterCol ];
 83:                             sum += imageValue * filterValue;
 84:                         }
 85:                     }
 86:                 }
 87:                 if (filterId < gNumFilters) {
 88:                     // [n][filterId][outRow][outCol][inputPlane]
 89:                     int resultIndex = (( (n
 90:                         * gNumFilters + filterId)
 91:                         * gOutputSize + outRow)
 92:                         * gOutputSize + outCol)
 93:                         * gNumInputPlanes + inputPlaneId;
 94:                     output[resultIndex] = sum;
 95:                     //if (globalId == 2) output[0] = resultIndex;
 96: //                    output[resultIndex] = outRow;
 97:                 }
 98: //                output[localId] = _localFilterPlane[localId];
 99:             }
 100:         }
 101:     }
 102: }
 103: 
 104: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 6: this instance cant be used: 
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 7
   ... seems valid
 ForwardIm2Col.cl build log: 
 (19:0) : error : invalid global address space qualifier specified for parameter type
 (19:0) : error : syntax error at 'const'

 kernel build error:

 kernel source:
 1: // from SpatialConvolutionMM.cu:
 2: 
 3: // CL: grid stride looping
 4: #define CL_KERNEL_LOOP(i, n)                        \
 5:   for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \
 6:       i < (n);                                       \
 7:       i += get_local_size(0) * get_num_groups(0))
 8: 
 9: //#define gPadding 0
 10: //#define gStride 1
 11: //#define gColSize 1
 12: //#define gFilterSize 1
 13: //#define gSize 1
 14: 
 15: // Kernel for fast unfold+copy
 16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu)
 17: kernel void im2col(
 18:     const int n,
 19:     global float const * im_data, int im_offset,
 20:     global float* data_col) {
 21:   global const float *data_im = im_data + im_offset;
 22: 
 23:   CL_KERNEL_LOOP(index, n) {
 24:     int w_out = index % 1;
 25:     index /= 1;
 26:     int h_out = index % 1;
 27:     int channel_in = index / 1;
 28:     int channel_out = channel_in * 1 * 1;
 29:     int h_in = h_out * 1 - 0;
 30:     int w_in = w_out * 1 - 0;
 31:     data_col += (channel_out * 1 + h_out) * 1 + w_out;
 32:     data_im += (channel_in * 1 + h_in) * 1 + w_in;
 33:     for (int i = 0; i < 1; ++i) {
 34:       for (int j = 0; j < 1; ++j) {
 35:         int h = h_in + i;
 36:         int w = w_in + j;
 37:         *data_col = (h >= 0 && w >= 0 && h < 1 && w < 1) ?
 38:           data_im[i * 1 + j] : 0;
 39:         data_col += 1 * 1;
 40:       }
 41:     }
 42:   }
 43: }
 44: 
 45: kernel void col2im(
 46:     const int n,
 47:     global float const *data_col,
 48:     global float* im_data, int im_offset) {
 49:   global float *data_im = im_data + im_offset;
 50: 
 51:   for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) {
 52:     float val = 0;
 53:     int w = index % 1 + 0;
 54:     int h = (index / 1) % 1 + 0;
 55:     int c = index / (1 * 1);
 56:     // compute the start and end of the output
 57:     int w_col_start = (w < 1) ? 0 : (w - 1) / 1 + 1;
 58:     int w_col_end = min(w / 1 + 1, 1);
 59:     int h_col_start = (h < 1) ? 0 : (h - 1) / 1 + 1;
 60:     int h_col_end = min(h / 1 + 1, 1);
 61: 
 62:     int offset = (c * 1 * 1 + h * 1 + w) * 1 * 1;
 63:     int coeff_h_col = (1 - 1 * 1 * 1) * 1;
 64:     int coeff_w_col = (1 - 1 * 1 * 1);
 65:     for (int h_col = h_col_start; h_col < h_col_end; ++h_col) {
 66:       for (int w_col = w_col_start; w_col < w_col_end; ++w_col) {
 67:         val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col];
 68:       }
 69:     }
 70:     data_im[index] = val;
 71:   }
 72: }
 73: 
 74: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 ForwardIm2Col.cl build log: 
 (19:0) : error : invalid global address space qualifier specified for parameter type
 (19:0) : error : syntax error at 'const'

 ForwardAuto: kernel 7 this instance cant be used: 
 kernel source:
 1: // from SpatialConvolutionMM.cu:
 2: 
 3: // CL: grid stride looping
 4: #define CL_KERNEL_LOOP(i, n)                        \
 5:   for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \
 6:       i < (n);                                       \
 7:       i += get_local_size(0) * get_num_groups(0))
 8: 
 9: //#define gPadding 0
 10: //#define gStride 1
 11: //#define gColSize 1
 12: //#define gFilterSize 1
 13: //#define gSize 1
 14: 
 15: // Kernel for fast unfold+copy
 16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu)
 17: kernel void im2col(
 18:     const int n,
 19:     global float const * im_data, int im_offset,
 20:     global float* data_col) {
 21:   global const float *data_im = im_data + im_offset;
 22: 
 23:   CL_KERNEL_LOOP(index, n) {
 24:     int w_out = index % 1;
 25:     index /= 1;
 26:     int h_out = index % 1;
 27:     int channel_in = index / 1;
 28:     int channel_out = channel_in * 1 * 1;
 29:     int h_in = h_out * 1 - 0;
 30:     int w_in = w_out * 1 - 0;
 31:     data_col += (channel_out * 1 + h_out) * 1 + w_out;
 32:     data_im += (channel_in * 1 + h_in) * 1 + w_in;
 33:     for (int i = 0; i < 1; ++i) {
 34:       for (int j = 0; j < 1; ++j) {
 35:         int h = h_in + i;
 36:         int w = w_in + j;
 37:         *data_col = (h >= 0 && w >= 0 && h < 1 && w < 1) ?
 38:           data_im[i * 1 + j] : 0;
 39:         data_col += 1 * 1;
 40:       }
 41:     }
 42:   }
 43: }
 44: 
 45: kernel void col2im(
 46:     const int n,
 47:     global float const *data_col,
 48:     global float* im_data, int im_offset) {
 49:   global float *data_im = im_data + im_offset;
 50: 
 51:   for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) {
 52:     float val = 0;
 53:     int w = index % 1 + 0;
 54:     int h = (index / 1) % 1 + 0;
 55:     int c = index / (1 * 1);
 56:     // compute the start and end of the output
 57:     int w_col_start = (w < 1) ? 0 : (w - 1) / 1 + 1;
 58:     int w_col_end = min(w / 1 + 1, 1);
 59:     int h_col_start = (h < 1) ? 0 : (h - 1) / 1 + 1;
 60:     int h_col_end = min(h / 1 + 1, 1);
 61: 
 62:     int offset = (c * 1 * 1 + h * 1 + w) * 1 * 1;
 63:     int coeff_h_col = (1 - 1 * 1 * 1) * 1;
 64:     int coeff_w_col = (1 - 1 * 1 * 1);
 65:     for (int h_col = h_col_start; h_col < h_col_end; ++h_col) {
 66:       for (int w_col = w_col_start; w_col < w_col_end; ++w_col) {
 67:         val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col];
 68:       }
 69:     }
 70:     data_im[index] = val;
 71:   }
 72: }
 73: 
 74: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 ForwardIm2Col.cl build log: 
 (19:0) : error : invalid global address space qualifier specified for parameter type
 (19:0) : error : syntax error at 'const'

   forward kernel 0: cannot be used
   forward kernel 1: cannot be used
   forward kernel 2: cannot be used
   forward kernel 3: cannot be used
   forward kernel 4: cannot be used
   forward kernel 5: cannot be used
   forward kernel 6: cannot be used
   forward kernel 7: cannot be used
 clblas teardown
 unknown file: Failure
 C++ exception with description "No valid forward implementations found" thrown in the test body.
 [  FAILED  ] testsimpleconvolvenet.imagesize1_2planes_filtersize1 (186 ms)
 [ RUN      ] testsimpleconvolvenet.imagesize3_n4_filtersize3_relu
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU"
 " thrown in the test body.
 [  FAILED  ] testsimpleconvolvenet.imagesize3_n4_filtersize3_relu (70 ms)
 [ RUN      ] testsimpleconvolvenet.imagesize3_n4_filtersize3_linear
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 forward try kernel 0
  ... not plausibly optimal, skipping
 forward try kernel 1
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 1: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 2
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, const int N) {
 8:     int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize;
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * gWorkgroupSize + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [outplane]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [imageid][filterid][row][col]
 26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB
 27: //                            eg 5 * 5 * 32 * 4 = 3.2KB => ok :-)
 28: //                           but 28 * 28 * 32 * 4 = 100KB => less good :-P
 29: void kernel forward_2_by_outplane(
 30:         const int batchSize,
 31:         global const float *images, global const float *filters,
 32:         global float *output,
 33:         local float *_inputPlane, local float *_filterCube) {
 34:     const int globalId = get_global_id(0);
 35: 
 36:     const int workgroupId = get_group_id(0);
 37:     const int workgroupSize = get_local_size(0);
 38:     const int outPlane = workgroupId;
 39: 
 40:     const int localId = get_local_id(0);
 41:     const int outputRow = localId / gOutputSize;
 42:     const int outputCol = localId % gOutputSize;
 43: 
 44:     #if gPadZeros == 1
 45:         const int minu = max(-gHalfFilterSize, -outputRow);
 46:         const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven;
 47:         const int minv = max(-gHalfFilterSize, -outputCol);
 48:         const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven;
 49:     #else
 50:         const int minu = -gHalfFilterSize;
 51:         const int maxu = gHalfFilterSize - gEven;
 52:         const int minv = -gHalfFilterSize;
 53:         const int maxv = gHalfFilterSize - gEven;
 54:     #endif
 55: 
 56:     {
 57:         const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 58:         copyLocal(_filterCube,
 59:                 filters + outPlane * filterCubeLength,
 60:                 filterCubeLength);
 61:     }
 62:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 63: 
 64:     for (int n = 0; n < batchSize; n++) {
 65:         float sum = 0;
 66:         for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 67:             barrier(CLK_LOCAL_MEM_FENCE);
 68:             copyLocal(_inputPlane,
 69:                        images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared,
 70:                        gInputSizeSquared);
 71:             barrier(CLK_LOCAL_MEM_FENCE);
 72:             int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 73:             if (localId < gOutputSizeSquared) {
 74:                 for (int u = minu; u <= maxu; u++) {
 75:                     int inputRow = outputRow + u;
 76:                     #if gPadZeros == 0
 77:                          inputRow += gHalfFilterSize;
 78:                     #endif
 79:                     int inputimagerowoffset = inputRow * gInputSize;
 80:                     int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 81:                     for (int v = minv; v <= maxv; v++) {
 82:                         int inputCol = outputCol + v;
 83:                         #if gPadZeros == 0
 84:                              inputCol += gHalfFilterSize;
 85:                         #endif
 86:                         sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 87:                     }
 88:                 }
 89:             }
 90:         }
 91:         // output are organized like [imageid][filterid][row][col]
 92:         int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 93:         if (localId < gOutputSizeSquared) {
 94:             output[resultIndex ] = sum;
 95:         }
 96:     }
 97: }
 98: #endif
 99: 
 100: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

 ForwardAuto: kernel 2: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, const int N) {
 8:     int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize;
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * gWorkgroupSize + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [outplane]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [imageid][filterid][row][col]
 26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB
 27: //                            eg 5 * 5 * 32 * 4 = 3.2KB => ok :-)
 28: //                           but 28 * 28 * 32 * 4 = 100KB => less good :-P
 29: void kernel forward_2_by_outplane(
 30:         const int batchSize,
 31:         global const float *images, global const float *filters,
 32:         global float *output,
 33:         local float *_inputPlane, local float *_filterCube) {
 34:     const int globalId = get_global_id(0);
 35: 
 36:     const int workgroupId = get_group_id(0);
 37:     const int workgroupSize = get_local_size(0);
 38:     const int outPlane = workgroupId;
 39: 
 40:     const int localId = get_local_id(0);
 41:     const int outputRow = localId / gOutputSize;
 42:     const int outputCol = localId % gOutputSize;
 43: 
 44:     #if gPadZeros == 1
 45:         const int minu = max(-gHalfFilterSize, -outputRow);
 46:         const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven;
 47:         const int minv = max(-gHalfFilterSize, -outputCol);
 48:         const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven;
 49:     #else
 50:         const int minu = -gHalfFilterSize;
 51:         const int maxu = gHalfFilterSize - gEven;
 52:         const int minv = -gHalfFilterSize;
 53:         const int maxv = gHalfFilterSize - gEven;
 54:     #endif
 55: 
 56:     {
 57:         const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 58:         copyLocal(_filterCube,
 59:                 filters + outPlane * filterCubeLength,
 60:                 filterCubeLength);
 61:     }
 62:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 63: 
 64:     for (int n = 0; n < batchSize; n++) {
 65:         float sum = 0;
 66:         for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 67:             barrier(CLK_LOCAL_MEM_FENCE);
 68:             copyLocal(_inputPlane,
 69:                        images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared,
 70:                        gInputSizeSquared);
 71:             barrier(CLK_LOCAL_MEM_FENCE);
 72:             int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 73:             if (localId < gOutputSizeSquared) {
 74:                 for (int u = minu; u <= maxu; u++) {
 75:                     int inputRow = outputRow + u;
 76:                     #if gPadZeros == 0
 77:                          inputRow += gHalfFilterSize;
 78:                     #endif
 79:                     int inputimagerowoffset = inputRow * gInputSize;
 80:                     int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 81:                     for (int v = minv; v <= maxv; v++) {
 82:                         int inputCol = outputCol + v;
 83:                         #if gPadZeros == 0
 84:                              inputCol += gHalfFilterSize;
 85:                         #endif
 86:                         sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 87:                     }
 88:                 }
 89:             }
 90:         }
 91:         // output are organized like [imageid][filterid][row][col]
 92:         int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 93:         if (localId < gOutputSizeSquared) {
 94:             output[resultIndex ] = sum;
 95:         }
 96:     }
 97: }
 98: #endif
 99: 
 100: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

   ... not valid
 forward try kernel 3
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept: each workgroup handles convolving one input example with one filtercube
 8: // and writing out one single output plane
 9: //
 10: // workgroup id organized like: [imageid][outplane]
 11: // local id organized like: [outrow][outcol]
 12: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 13: // number workgroups = 32
 14: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 16: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 17: // output are organized like [imageid][filterid][row][col]
 18: void kernel forward_3_by_n_outplane(const int batchSize,
 19:       global const float *images, global const float *filters,
 20:     global float *output,
 21:     local float *_upstreamImage, local float *_filterCube) {
 22:     const int globalId = get_global_id(0);
 23: 
 24:     const int workgroupId = get_group_id(0);
 25:     const int workgroupSize = get_local_size(0);
 26:     const int n = workgroupId / gNumFilters;
 27:     const int outPlane = workgroupId % gNumFilters;
 28: 
 29:     const int localId = get_local_id(0);
 30:     const int outputRow = localId / gOutputSize;
 31:     const int outputCol = localId % gOutputSize;
 32: 
 33:     const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize;
 34:     const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow  - gEven) : gHalfFilterSize - gEven;
 35:     const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize;
 36:     const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven;
 37: 
 38:     const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 39: 
 40:     const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 41:     const int filterCubeGlobalOffset = outPlane * filterCubeLength;
 42:     const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize;
 43:     for (int i = 0; i < numPixelsPerThread; i++) {
 44:         int thisOffset = localId + i * workgroupSize;
 45:         if (thisOffset < filterCubeLength) {
 46:             _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset];
 47:         }
 48:     }
 49:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 50: 
 51:     float sum = 0;
 52:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 53:         int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 54:         barrier(CLK_LOCAL_MEM_FENCE);
 55:         for (int i = 0; i < numUpstreamsPerThread; i++) {
 56:             int thisOffset = workgroupSize * i + localId;
 57:             if (thisOffset < gInputSizeSquared) {
 58:                 _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ];
 59:             }
 60:         }
 61:         barrier(CLK_LOCAL_MEM_FENCE);
 62:         int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 63:         for (int u = minu; u <= maxu; u++) {
 64:             int inputRow = outputRow + u;
 65:             #if gPadZeros == 0
 66:                 inputRow += gHalfFilterSize;
 67:             #endif
 68:             int inputimagerowoffset = inputRow * gInputSize;
 69:             int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 70:             for (int v = minv; v <= maxv; v++) {
 71:                 int inputCol = outputCol + v;
 72:                 #if gPadZeros == 0
 73:                     inputCol += gHalfFilterSize;
 74:                 #endif
 75:                 if (localId < gOutputSizeSquared) {
 76:                     sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 77:                 }
 78:             }
 79:         }
 80:     }
 81: 
 82:     // output are organized like [imageid][filterid][row][col]
 83:     int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 84:     if (localId < gOutputSizeSquared) {
 85:         output[resultIndex ] = sum;
 86:     }
 87: }
 88: 
 89: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 3: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept: each workgroup handles convolving one input example with one filtercube
 8: // and writing out one single output plane
 9: //
 10: // workgroup id organized like: [imageid][outplane]
 11: // local id organized like: [outrow][outcol]
 12: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 13: // number workgroups = 32
 14: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 16: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 17: // output are organized like [imageid][filterid][row][col]
 18: void kernel forward_3_by_n_outplane(const int batchSize,
 19:       global const float *images, global const float *filters,
 20:     global float *output,
 21:     local float *_upstreamImage, local float *_filterCube) {
 22:     const int globalId = get_global_id(0);
 23: 
 24:     const int workgroupId = get_group_id(0);
 25:     const int workgroupSize = get_local_size(0);
 26:     const int n = workgroupId / gNumFilters;
 27:     const int outPlane = workgroupId % gNumFilters;
 28: 
 29:     const int localId = get_local_id(0);
 30:     const int outputRow = localId / gOutputSize;
 31:     const int outputCol = localId % gOutputSize;
 32: 
 33:     const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize;
 34:     const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow  - gEven) : gHalfFilterSize - gEven;
 35:     const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize;
 36:     const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven;
 37: 
 38:     const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 39: 
 40:     const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 41:     const int filterCubeGlobalOffset = outPlane * filterCubeLength;
 42:     const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize;
 43:     for (int i = 0; i < numPixelsPerThread; i++) {
 44:         int thisOffset = localId + i * workgroupSize;
 45:         if (thisOffset < filterCubeLength) {
 46:             _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset];
 47:         }
 48:     }
 49:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 50: 
 51:     float sum = 0;
 52:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 53:         int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 54:         barrier(CLK_LOCAL_MEM_FENCE);
 55:         for (int i = 0; i < numUpstreamsPerThread; i++) {
 56:             int thisOffset = workgroupSize * i + localId;
 57:             if (thisOffset < gInputSizeSquared) {
 58:                 _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ];
 59:             }
 60:         }
 61:         barrier(CLK_LOCAL_MEM_FENCE);
 62:         int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 63:         for (int u = minu; u <= maxu; u++) {
 64:             int inputRow = outputRow + u;
 65:             #if gPadZeros == 0
 66:                 inputRow += gHalfFilterSize;
 67:             #endif
 68:             int inputimagerowoffset = inputRow * gInputSize;
 69:             int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 70:             for (int v = minv; v <= maxv; v++) {
 71:                 int inputCol = outputCol + v;
 72:                 #if gPadZeros == 0
 73:                     inputCol += gHalfFilterSize;
 74:                 #endif
 75:                 if (localId < gOutputSizeSquared) {
 76:                     sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 77:                 }
 78:             }
 79:         }
 80:     }
 81: 
 82:     // output are organized like [imageid][filterid][row][col]
 83:     int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 84:     if (localId < gOutputSizeSquared) {
 85:         output[resultIndex ] = sum;
 86:     }
 87: }
 88: 
 89: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 4
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, int N) {
 8:     int numLoops = (N + get_local_size(0) - 1) / get_local_size(0);
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * get_local_size(0) + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [n][filterid]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [n][filterid][outrow][outcol]
 26: // the pixels per thread thing... :
 27: // - we have one thread (~= cuda core) per output value,
 28: //   ie one thread for each combination of [outrow][outcol]
 29: // - however, the number of threads is typically limited on a gpu,
 30: //   eg to 512 (eg Intel HD), or 1024 (eg nVidia K520)
 31: // - so what happens if the number of output points is larger than
 32: //   the maximum workgroup size?
 33: // - then we have several possibilities really:
 34: //   - we can divide the image into blocks, and process each block
 35: //     separately.  This is probably a good option, but fair amount of
 36: //     work
 37: //   - we can get each thread to handle more than one output
 38: //     pixel, by looping
 39: //   - we can consider the output image in 1d, by putting the rows
 40: //     one after another, and assign each contiguous workgroup-size
 41: //     block to one workgroup
 42: //     => this is how this kernel works
 43: //     basically, it's a hack, so larger images actually run, without
 44: //     crashing, and we can probably improve it a lot :-)
 45: //
 46: // So, when outputSize * outputSize > workgroupSize, then
 47: // multiple workgroups will be created for each output plane
 48: // the number of such workgroups is given by: `gPixelsPerThread`
 49: // the id of our workgroup within such a set of workgroups is calculated
 50: // as `pixel`
 51: // effectiveLocalId is our local id if we had one enormous workgroup
 52: // containing the whole output image plane
 53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize,
 54:       global const float *images, global const float *filters,
 55:     global float *output,
 56:     local float *_inputPlane, local float *_filterPlane) {
 57:     #define globalId (get_global_id(0))
 58: 
 59:     #define localId (get_local_id(0))
 60:     #define workgroupId (get_group_id(0))
 61: //    const int workgroupSize = get_local_size(0);
 62:     const int effectiveWorkgroupId = workgroupId / gPixelsPerThread;
 63:     const int pixel = workgroupId % gPixelsPerThread;
 64:     const int effectiveLocalId = localId + pixel * gWorkgroupSize;
 65:     const int n = effectiveWorkgroupId / gNumFilters;
 66:     const int outPlane = effectiveWorkgroupId % gNumFilters;
 67: 
 68:     const int outputRow = effectiveLocalId / gOutputSize;
 69:     const int outputCol = effectiveLocalId % gOutputSize;
 70: 
 71:     float sum = 0;
 72:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 73:         barrier(CLK_LOCAL_MEM_FENCE);
 74:         copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared);
 75:         copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared);
 76:         barrier(CLK_LOCAL_MEM_FENCE);
 77: 
 78:         if (effectiveLocalId < gOutputSizeSquared) {
 79:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 80:                 // trying to reduce register pressure...
 81:                 #if gPadZeros == 1
 82:                     #define inputRow (outputRow + u)
 83:                 #else
 84:                     #define inputRow (outputRow + u + gHalfFilterSize)
 85:                 #endif
 86:                 int inputimagerowoffset = inputRow * gInputSize;
 87:                 int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 88:                 bool rowOk = inputRow >= 0 && inputRow < gInputSize;
 89:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 90:                     #if gPadZeros == 1
 91:                         #define inputCol (outputCol + v)
 92:                     #else
 93:                         #define inputCol (outputCol + v + gHalfFilterSize)
 94:                     #endif
 95:                     bool process = rowOk && inputCol >= 0 && inputCol < gInputSize;
 96:                     if (process) {
 97:                             sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ];
 98:                     }
 99:                 }
 100:             }
 101:         }
 102:     }
 103:     // output are organized like [imageid][filterid][row][col]
 104:     #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId)
 105:     if (effectiveLocalId < gOutputSizeSquared) {
 106:         output[resultIndex ] = sum;
 107:     }
 108: }
 109: #endif
 110: 
 111: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 4: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, int N) {
 8:     int numLoops = (N + get_local_size(0) - 1) / get_local_size(0);
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * get_local_size(0) + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [n][filterid]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [n][filterid][outrow][outcol]
 26: // the pixels per thread thing... :
 27: // - we have one thread (~= cuda core) per output value,
 28: //   ie one thread for each combination of [outrow][outcol]
 29: // - however, the number of threads is typically limited on a gpu,
 30: //   eg to 512 (eg Intel HD), or 1024 (eg nVidia K520)
 31: // - so what happens if the number of output points is larger than
 32: //   the maximum workgroup size?
 33: // - then we have several possibilities really:
 34: //   - we can divide the image into blocks, and process each block
 35: //     separately.  This is probably a good option, but fair amount of
 36: //     work
 37: //   - we can get each thread to handle more than one output
 38: //     pixel, by looping
 39: //   - we can consider the output image in 1d, by putting the rows
 40: //     one after another, and assign each contiguous workgroup-size
 41: //     block to one workgroup
 42: //     => this is how this kernel works
 43: //     basically, it's a hack, so larger images actually run, without
 44: //     crashing, and we can probably improve it a lot :-)
 45: //
 46: // So, when outputSize * outputSize > workgroupSize, then
 47: // multiple workgroups will be created for each output plane
 48: // the number of such workgroups is given by: `gPixelsPerThread`
 49: // the id of our workgroup within such a set of workgroups is calculated
 50: // as `pixel`
 51: // effectiveLocalId is our local id if we had one enormous workgroup
 52: // containing the whole output image plane
 53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize,
 54:       global const float *images, global const float *filters,
 55:     global float *output,
 56:     local float *_inputPlane, local float *_filterPlane) {
 57:     #define globalId (get_global_id(0))
 58: 
 59:     #define localId (get_local_id(0))
 60:     #define workgroupId (get_group_id(0))
 61: //    const int workgroupSize = get_local_size(0);
 62:     const int effectiveWorkgroupId = workgroupId / gPixelsPerThread;
 63:     const int pixel = workgroupId % gPixelsPerThread;
 64:     const int effectiveLocalId = localId + pixel * gWorkgroupSize;
 65:     const int n = effectiveWorkgroupId / gNumFilters;
 66:     const int outPlane = effectiveWorkgroupId % gNumFilters;
 67: 
 68:     const int outputRow = effectiveLocalId / gOutputSize;
 69:     const int outputCol = effectiveLocalId % gOutputSize;
 70: 
 71:     float sum = 0;
 72:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 73:         barrier(CLK_LOCAL_MEM_FENCE);
 74:         copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared);
 75:         copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared);
 76:         barrier(CLK_LOCAL_MEM_FENCE);
 77: 
 78:         if (effectiveLocalId < gOutputSizeSquared) {
 79:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 80:                 // trying to reduce register pressure...
 81:                 #if gPadZeros == 1
 82:                     #define inputRow (outputRow + u)
 83:                 #else
 84:                     #define inputRow (outputRow + u + gHalfFilterSize)
 85:                 #endif
 86:                 int inputimagerowoffset = inputRow * gInputSize;
 87:                 int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 88:                 bool rowOk = inputRow >= 0 && inputRow < gInputSize;
 89:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 90:                     #if gPadZeros == 1
 91:                         #define inputCol (outputCol + v)
 92:                     #else
 93:                         #define inputCol (outputCol + v + gHalfFilterSize)
 94:                     #endif
 95:                     bool process = rowOk && inputCol >= 0 && inputCol < gInputSize;
 96:                     if (process) {
 97:                             sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ];
 98:                     }
 99:                 }
 100:             }
 101:         }
 102:     }
 103:     // output are organized like [imageid][filterid][row][col]
 104:     #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId)
 105:     if (effectiveLocalId < gOutputSizeSquared) {
 106:         output[resultIndex ] = sum;
 107:     }
 108: }
 109: #endif
 110: 
 111: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 5
 cl/reduce_segments.cl build log: 
 (8:0) : error : invalid global address space qualifier specified for parameter type
 (8:0) : error : syntax error at 'const'

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: kernel void reduce_segments(const int numSegments, const int segmentLength,
 8:         global float const *in, global float* out) {
 9:     const int globalId = get_global_id(0);
 10:     const int segmentId = globalId;
 11: 
 12:     if (segmentId >= numSegments) {
 13:         return;
 14:     }
 15: 
 16:     float sum = 0;
 17:     global const float *segment = in + segmentId * segmentLength;
 18:     for (int i = 0; i < segmentLength; i++) {
 19:         sum += segment[i];
 20:     }
 21:     out[segmentId] = sum;
 22: }
 23: 
 24: 
 25: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/reduce_segments.cl build log: 
 (8:0) : error : invalid global address space qualifier specified for parameter type
 (8:0) : error : syntax error at 'const'

 ForwardAuto: kernel 5: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: kernel void reduce_segments(const int numSegments, const int segmentLength,
 8:         global float const *in, global float* out) {
 9:     const int globalId = get_global_id(0);
 10:     const int segmentId = globalId;
 11: 
 12:     if (segmentId >= numSegments) {
 13:         return;
 14:     }
 15: 
 16:     float sum = 0;
 17:     global const float *segment = in + segmentId * segmentLength;
 18:     for (int i = 0; i < segmentLength; i++) {
 19:         sum += segment[i];
 20:     }
 21:     out[segmentId] = sum;
 22: }
 23: 
 24: 
 25: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/reduce_segments.cl build log: 
 (8:0) : error : invalid global address space qualifier specified for parameter type
 (8:0) : error : syntax error at 'const'

   ... not valid
 forward try kernel 6
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept:
 8: // - load same input plane from each image
 9: // - hold filter plane for this input plane, for all filters
 10: // - reduce afterwards
 11: // local memory for one plane from each filter of 64c7 = 64 * 7 * 7 * 4 = 12.5KB
 12: // local memory for one single input plane = 19 * 19 * 4 = 1.4KB
 13: // => seems ok?
 14: // workgroupid: [inputPlaneId]
 15: // localid: [filterId][outRow] (if this is more than workgroupsize, we should reuse some threads...)
 16: // iterate over: [n][outCol]
 17: // output: [n][filterId][outRow][outCol][inputPlane]
 18: // need to later reduce output over: [inputPlane]
 19: void kernel forward_byinputplane(const int batchSize,
 20:       global const float *images, global const float *filters,
 21:     global float *output,
 22:     local float *_inputPlane, local float *_filterPlanes) {
 23: //    const int evenPadding = gFilterSize % 2 == 0 ? 1 : 0;
 24: 
 25:     const int globalId = get_global_id(0);
 26:     const int workgroupId = get_group_id(0);
 27:     const int workgroupSize = get_local_size(0);
 28:     const int localId = get_local_id(0);
 29: 
 30:     const int inputPlaneId = workgroupId;
 31:     const int numLoops = (gNumFilters * gOutputSize + workgroupSize - 1) / workgroupSize;
 32:     const int numFilterCopyLoops = (gFilterSizeSquared + gOutputSize - 1) / gOutputSize;
 33:     const int numImageCopyLoops = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 34:     for (int loop = 0; loop < numLoops; loop++) {
 35:         const int loopLocalId = localId + loop * workgroupSize;
 36:         const int filterId = loopLocalId / gOutputSize;
 37:         const int outRow = loopLocalId % gOutputSize;
 38: 
 39:         // copy down our filter, we have gOutputSize threads to do this
 40:         global float const *globalFilterPlane = filters +
 41:             (filterId * gNumInputPlanes + inputPlaneId) * gFilterSizeSquared;
 42:         local float *_localFilterPlane = _filterPlanes + filterId * gFilterSizeSquared;
 43:         barrier(CLK_LOCAL_MEM_FENCE);
 44:         for (int i = 0; i < numFilterCopyLoops; i++) {
 45:             const int offset = i * gOutputSize + outRow;
 46:             bool process = filterId < gNumFilters && offset < gFilterSizeSquared;
 47:             if (process) {
 48:                 _localFilterPlane[ offset ] = globalFilterPlane[ offset ];
 49:             }
 50:         }
 51:         // loop over n ...
 52:         for (int n = 0; n < batchSize; n++) {
 53:             // copy down our imageplane, we have workgroupSize threads to do this
 54:             barrier(CLK_LOCAL_MEM_FENCE);
 55:             global float const *globalImagePlane = images +
 56:                 (n * gNumInputPlanes + inputPlaneId) * gInputSizeSquared;
 57:             for (int i = 0; i< numImageCopyLoops; i++) {
 58:                 const int offset = i * workgroupSize + localId;
 59:                 if (offset < gInputSizeSquared) {
 60:                     _inputPlane[ offset ] = globalImagePlane[ offset ];
 61:                 }
 62:             }
 63:             barrier(CLK_LOCAL_MEM_FENCE);
 64:             // calc output for each [outrow][outcol]
 65:             bool filterPlaneOk = filterId < gNumFilters;
 66:             for (int outCol = 0; outCol < gOutputSize; outCol++) {
 67:                 float sum = 0;
 68:                 for (int filterRow = 0; filterRow < gFilterSize; filterRow++) {
 69:                     int inRow = outRow + filterRow;
 70:                     #if gPadZeros == 1
 71:                         inRow -= gHalfFilterSize;
 72:                     #endif
 73:                     bool rowOk = filterPlaneOk && inRow >= 0 && inRow < gInputSize;
 74:                     for (int filterCol = 0; filterCol < gFilterSize; filterCol++) {
 75:                         int inCol = outCol + filterCol;
 76:                         #if gPadZeros == 1
 77:                             inCol -= gHalfFilterSize;
 78:                         #endif
 79:                         bool process = rowOk && inCol >= 0 && inCol < gInputSize;
 80:                         if (process) {
 81:                             float imageValue = _inputPlane[ inRow * gInputSize + inCol ];
 82:                             float filterValue = _localFilterPlane[ filterRow * gFilterSize + filterCol ];
 83:                             sum += imageValue * filterValue;
 84:                         }
 85:                     }
 86:                 }
 87:                 if (filterId < gNumFilters) {
 88:                     // [n][filterId][outRow][outCol][inputPlane]
 89:                     int resultIndex = (( (n
 90:                         * gNumFilters + filterId)
 91:                         * gOutputSize + outRow)
 92:                         * gOutputSize + outCol)
 93:                         * gNumInputPlanes + inputPlaneId;
 94:                     output[resultIndex] = sum;
 95:                     //if (globalId == 2) output[0] = resultIndex;
 96: //                    output[resultIndex] = outRow;
 97:                 }
 98: //                output[localId] = _localFilterPlane[localId];
 99:             }
 100:         }
 101:     }
 102: }
 103: 
 104: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 6: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept:
 8: // - load same input plane from each image
 9: // - hold filter plane for this input plane, for all filters
 10: // - reduce afterwards
 11: // local memory for one plane from each filter of 64c7 = 64 * 7 * 7 * 4 = 12.5KB
 12: // local memory for one single input plane = 19 * 19 * 4 = 1.4KB
 13: // => seems ok?
 14: // workgroupid: [inputPlaneId]
 15: // localid: [filterId][outRow] (if this is more than workgroupsize, we should reuse some threads...)
 16: // iterate over: [n][outCol]
 17: // output: [n][filterId][outRow][outCol][inputPlane]
 18: // need to later reduce output over: [inputPlane]
 19: void kernel forward_byinputplane(const int batchSize,
 20:       global const float *images, global const float *filters,
 21:     global float *output,
 22:     local float *_inputPlane, local float *_filterPlanes) {
 23: //    const int evenPadding = gFilterSize % 2 == 0 ? 1 : 0;
 24: 
 25:     const int globalId = get_global_id(0);
 26:     const int workgroupId = get_group_id(0);
 27:     const int workgroupSize = get_local_size(0);
 28:     const int localId = get_local_id(0);
 29: 
 30:     const int inputPlaneId = workgroupId;
 31:     const int numLoops = (gNumFilters * gOutputSize + workgroupSize - 1) / workgroupSize;
 32:     const int numFilterCopyLoops = (gFilterSizeSquared + gOutputSize - 1) / gOutputSize;
 33:     const int numImageCopyLoops = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 34:     for (int loop = 0; loop < numLoops; loop++) {
 35:         const int loopLocalId = localId + loop * workgroupSize;
 36:         const int filterId = loopLocalId / gOutputSize;
 37:         const int outRow = loopLocalId % gOutputSize;
 38: 
 39:         // copy down our filter, we have gOutputSize threads to do this
 40:         global float const *globalFilterPlane = filters +
 41:             (filterId * gNumInputPlanes + inputPlaneId) * gFilterSizeSquared;
 42:         local float *_localFilterPlane = _filterPlanes + filterId * gFilterSizeSquared;
 43:         barrier(CLK_LOCAL_MEM_FENCE);
 44:         for (int i = 0; i < numFilterCopyLoops; i++) {
 45:             const int offset = i * gOutputSize + outRow;
 46:             bool process = filterId < gNumFilters && offset < gFilterSizeSquared;
 47:             if (process) {
 48:                 _localFilterPlane[ offset ] = globalFilterPlane[ offset ];
 49:             }
 50:         }
 51:         // loop over n ...
 52:         for (int n = 0; n < batchSize; n++) {
 53:             // copy down our imageplane, we have workgroupSize threads to do this
 54:             barrier(CLK_LOCAL_MEM_FENCE);
 55:             global float const *globalImagePlane = images +
 56:                 (n * gNumInputPlanes + inputPlaneId) * gInputSizeSquared;
 57:             for (int i = 0; i< numImageCopyLoops; i++) {
 58:                 const int offset = i * workgroupSize + localId;
 59:                 if (offset < gInputSizeSquared) {
 60:                     _inputPlane[ offset ] = globalImagePlane[ offset ];
 61:                 }
 62:             }
 63:             barrier(CLK_LOCAL_MEM_FENCE);
 64:             // calc output for each [outrow][outcol]
 65:             bool filterPlaneOk = filterId < gNumFilters;
 66:             for (int outCol = 0; outCol < gOutputSize; outCol++) {
 67:                 float sum = 0;
 68:                 for (int filterRow = 0; filterRow < gFilterSize; filterRow++) {
 69:                     int inRow = outRow + filterRow;
 70:                     #if gPadZeros == 1
 71:                         inRow -= gHalfFilterSize;
 72:                     #endif
 73:                     bool rowOk = filterPlaneOk && inRow >= 0 && inRow < gInputSize;
 74:                     for (int filterCol = 0; filterCol < gFilterSize; filterCol++) {
 75:                         int inCol = outCol + filterCol;
 76:                         #if gPadZeros == 1
 77:                             inCol -= gHalfFilterSize;
 78:                         #endif
 79:                         bool process = rowOk && inCol >= 0 && inCol < gInputSize;
 80:                         if (process) {
 81:                             float imageValue = _inputPlane[ inRow * gInputSize + inCol ];
 82:                             float filterValue = _localFilterPlane[ filterRow * gFilterSize + filterCol ];
 83:                             sum += imageValue * filterValue;
 84:                         }
 85:                     }
 86:                 }
 87:                 if (filterId < gNumFilters) {
 88:                     // [n][filterId][outRow][outCol][inputPlane]
 89:                     int resultIndex = (( (n
 90:                         * gNumFilters + filterId)
 91:                         * gOutputSize + outRow)
 92:                         * gOutputSize + outCol)
 93:                         * gNumInputPlanes + inputPlaneId;
 94:                     output[resultIndex] = sum;
 95:                     //if (globalId == 2) output[0] = resultIndex;
 96: //                    output[resultIndex] = outRow;
 97:                 }
 98: //                output[localId] = _localFilterPlane[localId];
 99:             }
 100:         }
 101:     }
 102: }
 103: 
 104: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 7
   ... seems valid
 ForwardIm2Col.cl build log: 
 (19:0) : error : invalid global address space qualifier specified for parameter type
 (19:0) : error : syntax error at 'const'

 kernel build error:

 kernel source:
 1: // from SpatialConvolutionMM.cu:
 2: 
 3: // CL: grid stride looping
 4: #define CL_KERNEL_LOOP(i, n)                        \
 5:   for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \
 6:       i < (n);                                       \
 7:       i += get_local_size(0) * get_num_groups(0))
 8: 
 9: //#define gPadding 0
 10: //#define gStride 1
 11: //#define gColSize 1
 12: //#define gFilterSize 3
 13: //#define gSize 3
 14: 
 15: // Kernel for fast unfold+copy
 16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu)
 17: kernel void im2col(
 18:     const int n,
 19:     global float const * im_data, int im_offset,
 20:     global float* data_col) {
 21:   global const float *data_im = im_data + im_offset;
 22: 
 23:   CL_KERNEL_LOOP(index, n) {
 24:     int w_out = index % 1;
 25:     index /= 1;
 26:     int h_out = index % 1;
 27:     int channel_in = index / 1;
 28:     int channel_out = channel_in * 3 * 3;
 29:     int h_in = h_out * 1 - 0;
 30:     int w_in = w_out * 1 - 0;
 31:     data_col += (channel_out * 1 + h_out) * 1 + w_out;
 32:     data_im += (channel_in * 3 + h_in) * 3 + w_in;
 33:     for (int i = 0; i < 3; ++i) {
 34:       for (int j = 0; j < 3; ++j) {
 35:         int h = h_in + i;
 36:         int w = w_in + j;
 37:         *data_col = (h >= 0 && w >= 0 && h < 3 && w < 3) ?
 38:           data_im[i * 3 + j] : 0;
 39:         data_col += 1 * 1;
 40:       }
 41:     }
 42:   }
 43: }
 44: 
 45: kernel void col2im(
 46:     const int n,
 47:     global float const *data_col,
 48:     global float* im_data, int im_offset) {
 49:   global float *data_im = im_data + im_offset;
 50: 
 51:   for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) {
 52:     float val = 0;
 53:     int w = index % 3 + 0;
 54:     int h = (index / 3) % 3 + 0;
 55:     int c = index / (3 * 3);
 56:     // compute the start and end of the output
 57:     int w_col_start = (w < 3) ? 0 : (w - 3) / 1 + 1;
 58:     int w_col_end = min(w / 1 + 1, 1);
 59:     int h_col_start = (h < 3) ? 0 : (h - 3) / 1 + 1;
 60:     int h_col_end = min(h / 1 + 1, 1);
 61: 
 62:     int offset = (c * 3 * 3 + h * 3 + w) * 1 * 1;
 63:     int coeff_h_col = (1 - 1 * 3 * 1) * 1;
 64:     int coeff_w_col = (1 - 1 * 1 * 1);
 65:     for (int h_col = h_col_start; h_col < h_col_end; ++h_col) {
 66:       for (int w_col = w_col_start; w_col < w_col_end; ++w_col) {
 67:         val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col];
 68:       }
 69:     }
 70:     data_im[index] = val;
 71:   }
 72: }
 73: 
 74: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 ForwardIm2Col.cl build log: 
 (19:0) : error : invalid global address space qualifier specified for parameter type
 (19:0) : error : syntax error at 'const'

 ForwardAuto: kernel 7 this instance cant be used: 
 kernel source:
 1: // from SpatialConvolutionMM.cu:
 2: 
 3: // CL: grid stride looping
 4: #define CL_KERNEL_LOOP(i, n)                        \
 5:   for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \
 6:       i < (n);                                       \
 7:       i += get_local_size(0) * get_num_groups(0))
 8: 
 9: //#define gPadding 0
 10: //#define gStride 1
 11: //#define gColSize 1
 12: //#define gFilterSize 3
 13: //#define gSize 3
 14: 
 15: // Kernel for fast unfold+copy
 16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu)
 17: kernel void im2col(
 18:     const int n,
 19:     global float const * im_data, int im_offset,
 20:     global float* data_col) {
 21:   global const float *data_im = im_data + im_offset;
 22: 
 23:   CL_KERNEL_LOOP(index, n) {
 24:     int w_out = index % 1;
 25:     index /= 1;
 26:     int h_out = index % 1;
 27:     int channel_in = index / 1;
 28:     int channel_out = channel_in * 3 * 3;
 29:     int h_in = h_out * 1 - 0;
 30:     int w_in = w_out * 1 - 0;
 31:     data_col += (channel_out * 1 + h_out) * 1 + w_out;
 32:     data_im += (channel_in * 3 + h_in) * 3 + w_in;
 33:     for (int i = 0; i < 3; ++i) {
 34:       for (int j = 0; j < 3; ++j) {
 35:         int h = h_in + i;
 36:         int w = w_in + j;
 37:         *data_col = (h >= 0 && w >= 0 && h < 3 && w < 3) ?
 38:           data_im[i * 3 + j] : 0;
 39:         data_col += 1 * 1;
 40:       }
 41:     }
 42:   }
 43: }
 44: 
 45: kernel void col2im(
 46:     const int n,
 47:     global float const *data_col,
 48:     global float* im_data, int im_offset) {
 49:   global float *data_im = im_data + im_offset;
 50: 
 51:   for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) {
 52:     float val = 0;
 53:     int w = index % 3 + 0;
 54:     int h = (index / 3) % 3 + 0;
 55:     int c = index / (3 * 3);
 56:     // compute the start and end of the output
 57:     int w_col_start = (w < 3) ? 0 : (w - 3) / 1 + 1;
 58:     int w_col_end = min(w / 1 + 1, 1);
 59:     int h_col_start = (h < 3) ? 0 : (h - 3) / 1 + 1;
 60:     int h_col_end = min(h / 1 + 1, 1);
 61: 
 62:     int offset = (c * 3 * 3 + h * 3 + w) * 1 * 1;
 63:     int coeff_h_col = (1 - 1 * 3 * 1) * 1;
 64:     int coeff_w_col = (1 - 1 * 1 * 1);
 65:     for (int h_col = h_col_start; h_col < h_col_end; ++h_col) {
 66:       for (int w_col = w_col_start; w_col < w_col_end; ++w_col) {
 67:         val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col];
 68:       }
 69:     }
 70:     data_im[index] = val;
 71:   }
 72: }
 73: 
 74: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 ForwardIm2Col.cl build log: 
 (19:0) : error : invalid global address space qualifier specified for parameter type
 (19:0) : error : syntax error at 'const'

   forward kernel 0: cannot be used
   forward kernel 1: cannot be used
   forward kernel 2: cannot be used
   forward kernel 3: cannot be used
   forward kernel 4: cannot be used
   forward kernel 5: cannot be used
   forward kernel 6: cannot be used
   forward kernel 7: cannot be used
 clblas teardown
 unknown file: Failure
 C++ exception with description "No valid forward implementations found" thrown in the test body.
 [  FAILED  ] testsimpleconvolvenet.imagesize3_n4_filtersize3_linear (190 ms)
 [ RUN      ] testsimpleconvolvenet.imagesize1_n2_2layers_unbiased
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU"
 " thrown in the test body.
 [  FAILED  ] testsimpleconvolvenet.imagesize1_n2_2layers_unbiased (79 ms)
 [ RUN      ] testsimpleconvolvenet.imagesize1_n2_2layers_biased
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU"
 " thrown in the test body.
 [  FAILED  ] testsimpleconvolvenet.imagesize1_n2_2layers_biased (83 ms)
 [ RUN      ] testsimpleconvolvenet.imagesize_5_4_2layers_filtersize_2_4_biased_n3
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=4 -DgOutputSizeSquared=16 -DgInputSize=4 -DgInputSizeSquared=16 -DgNumPlanes=3 -D RELU"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=4 -DgOutputSizeSquared=16 -DgInputSize=4 -DgInputSizeSquared=16 -DgNumPlanes=3 -D RELU"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=4 -DgOutputSizeSquared=16 -DgInputSize=4 -DgInputSizeSquared=16 -DgNumPlanes=3 -D RELU"
 " thrown in the test body.
 [  FAILED  ] testsimpleconvolvenet.imagesize_5_4_2layers_filtersize_2_4_biased_n3 (76 ms)
 [ RUN      ] testsimpleconvolvenet.imagesize_5_4_2layers_filtersize_2_4_biased_n6
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=4 -DgOutputSizeSquared=16 -DgInputSize=4 -DgInputSizeSquared=16 -DgNumPlanes=3 -D RELU"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=4 -DgOutputSizeSquared=16 -DgInputSize=4 -DgInputSizeSquared=16 -DgNumPlanes=3 -D RELU"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=4 -DgOutputSizeSquared=16 -DgInputSize=4 -DgInputSizeSquared=16 -DgNumPlanes=3 -D RELU"
 " thrown in the test body.
 [  FAILED  ] testsimpleconvolvenet.imagesize_5_4_2layers_filtersize_2_4_biased_n6 (84 ms)
 [ RUN      ] testsimpleconvolvenet.imagesize_5_3_2layers_filtersize_3_3_biased_n6
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=3 -D RELU"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=3 -D RELU"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=3 -D RELU"
 " thrown in the test body.
 [  FAILED  ] testsimpleconvolvenet.imagesize_5_3_2layers_filtersize_3_3_biased_n6 (75 ms)
 [ RUN      ] testsimpleconvolvenet.imagesize_5_3_2layers_filtersize_3_3_biased_n18
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=3 -D RELU"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=3 -D RELU"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=3 -D RELU"
 " thrown in the test body.
 [  FAILED  ] testsimpleconvolvenet.imagesize_5_3_2layers_filtersize_3_3_biased_n18 (86 ms)
 [----------] 12 tests from testsimpleconvolvenet (1163 ms total)

 [----------] 3 tests from testlogicaloperators
 [ RUN      ] testlogicaloperators.Convolve_1layer_biased_And
 And
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 forward try kernel 0
  ... not plausibly optimal, skipping
 forward try kernel 1
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 1: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 2
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, const int N) {
 8:     int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize;
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * gWorkgroupSize + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [outplane]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [imageid][filterid][row][col]
 26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB
 27: //                            eg 5 * 5 * 32 * 4 = 3.2KB => ok :-)
 28: //                           but 28 * 28 * 32 * 4 = 100KB => less good :-P
 29: void kernel forward_2_by_outplane(
 30:         const int batchSize,
 31:         global const float *images, global const float *filters,
 32:         global float *output,
 33:         local float *_inputPlane, local float *_filterCube) {
 34:     const int globalId = get_global_id(0);
 35: 
 36:     const int workgroupId = get_group_id(0);
 37:     const int workgroupSize = get_local_size(0);
 38:     const int outPlane = workgroupId;
 39: 
 40:     const int localId = get_local_id(0);
 41:     const int outputRow = localId / gOutputSize;
 42:     const int outputCol = localId % gOutputSize;
 43: 
 44:     #if gPadZeros == 1
 45:         const int minu = max(-gHalfFilterSize, -outputRow);
 46:         const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven;
 47:         const int minv = max(-gHalfFilterSize, -outputCol);
 48:         const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven;
 49:     #else
 50:         const int minu = -gHalfFilterSize;
 51:         const int maxu = gHalfFilterSize - gEven;
 52:         const int minv = -gHalfFilterSize;
 53:         const int maxv = gHalfFilterSize - gEven;
 54:     #endif
 55: 
 56:     {
 57:         const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 58:         copyLocal(_filterCube,
 59:                 filters + outPlane * filterCubeLength,
 60:                 filterCubeLength);
 61:     }
 62:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 63: 
 64:     for (int n = 0; n < batchSize; n++) {
 65:         float sum = 0;
 66:         for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 67:             barrier(CLK_LOCAL_MEM_FENCE);
 68:             copyLocal(_inputPlane,
 69:                        images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared,
 70:                        gInputSizeSquared);
 71:             barrier(CLK_LOCAL_MEM_FENCE);
 72:             int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 73:             if (localId < gOutputSizeSquared) {
 74:                 for (int u = minu; u <= maxu; u++) {
 75:                     int inputRow = outputRow + u;
 76:                     #if gPadZeros == 0
 77:                          inputRow += gHalfFilterSize;
 78:                     #endif
 79:                     int inputimagerowoffset = inputRow * gInputSize;
 80:                     int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 81:                     for (int v = minv; v <= maxv; v++) {
 82:                         int inputCol = outputCol + v;
 83:                         #if gPadZeros == 0
 84:                              inputCol += gHalfFilterSize;
 85:                         #endif
 86:                         sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 87:                     }
 88:                 }
 89:             }
 90:         }
 91:         // output are organized like [imageid][filterid][row][col]
 92:         int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 93:         if (localId < gOutputSizeSquared) {
 94:             output[resultIndex ] = sum;
 95:         }
 96:     }
 97: }
 98: #endif
 99: 
 100: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

 ForwardAuto: kernel 2: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, const int N) {
 8:     int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize;
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * gWorkgroupSize + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [outplane]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [imageid][filterid][row][col]
 26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB
 27: //                            eg 5 * 5 * 32 * 4 = 3.2KB => ok :-)
 28: //                           but 28 * 28 * 32 * 4 = 100KB => less good :-P
 29: void kernel forward_2_by_outplane(
 30:         const int batchSize,
 31:         global const float *images, global const float *filters,
 32:         global float *output,
 33:         local float *_inputPlane, local float *_filterCube) {
 34:     const int globalId = get_global_id(0);
 35: 
 36:     const int workgroupId = get_group_id(0);
 37:     const int workgroupSize = get_local_size(0);
 38:     const int outPlane = workgroupId;
 39: 
 40:     const int localId = get_local_id(0);
 41:     const int outputRow = localId / gOutputSize;
 42:     const int outputCol = localId % gOutputSize;
 43: 
 44:     #if gPadZeros == 1
 45:         const int minu = max(-gHalfFilterSize, -outputRow);
 46:         const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven;
 47:         const int minv = max(-gHalfFilterSize, -outputCol);
 48:         const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven;
 49:     #else
 50:         const int minu = -gHalfFilterSize;
 51:         const int maxu = gHalfFilterSize - gEven;
 52:         const int minv = -gHalfFilterSize;
 53:         const int maxv = gHalfFilterSize - gEven;
 54:     #endif
 55: 
 56:     {
 57:         const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 58:         copyLocal(_filterCube,
 59:                 filters + outPlane * filterCubeLength,
 60:                 filterCubeLength);
 61:     }
 62:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 63: 
 64:     for (int n = 0; n < batchSize; n++) {
 65:         float sum = 0;
 66:         for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 67:             barrier(CLK_LOCAL_MEM_FENCE);
 68:             copyLocal(_inputPlane,
 69:                        images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared,
 70:                        gInputSizeSquared);
 71:             barrier(CLK_LOCAL_MEM_FENCE);
 72:             int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 73:             if (localId < gOutputSizeSquared) {
 74:                 for (int u = minu; u <= maxu; u++) {
 75:                     int inputRow = outputRow + u;
 76:                     #if gPadZeros == 0
 77:                          inputRow += gHalfFilterSize;
 78:                     #endif
 79:                     int inputimagerowoffset = inputRow * gInputSize;
 80:                     int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 81:                     for (int v = minv; v <= maxv; v++) {
 82:                         int inputCol = outputCol + v;
 83:                         #if gPadZeros == 0
 84:                              inputCol += gHalfFilterSize;
 85:                         #endif
 86:                         sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 87:                     }
 88:                 }
 89:             }
 90:         }
 91:         // output are organized like [imageid][filterid][row][col]
 92:         int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 93:         if (localId < gOutputSizeSquared) {
 94:             output[resultIndex ] = sum;
 95:         }
 96:     }
 97: }
 98: #endif
 99: 
 100: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

   ... not valid
 forward try kernel 3
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept: each workgroup handles convolving one input example with one filtercube
 8: // and writing out one single output plane
 9: //
 10: // workgroup id organized like: [imageid][outplane]
 11: // local id organized like: [outrow][outcol]
 12: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 13: // number workgroups = 32
 14: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 16: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 17: // output are organized like [imageid][filterid][row][col]
 18: void kernel forward_3_by_n_outplane(const int batchSize,
 19:       global const float *images, global const float *filters,
 20:     global float *output,
 21:     local float *_upstreamImage, local float *_filterCube) {
 22:     const int globalId = get_global_id(0);
 23: 
 24:     const int workgroupId = get_group_id(0);
 25:     const int workgroupSize = get_local_size(0);
 26:     const int n = workgroupId / gNumFilters;
 27:     const int outPlane = workgroupId % gNumFilters;
 28: 
 29:     const int localId = get_local_id(0);
 30:     const int outputRow = localId / gOutputSize;
 31:     const int outputCol = localId % gOutputSize;
 32: 
 33:     const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize;
 34:     const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow  - gEven) : gHalfFilterSize - gEven;
 35:     const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize;
 36:     const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven;
 37: 
 38:     const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 39: 
 40:     const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 41:     const int filterCubeGlobalOffset = outPlane * filterCubeLength;
 42:     const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize;
 43:     for (int i = 0; i < numPixelsPerThread; i++) {
 44:         int thisOffset = localId + i * workgroupSize;
 45:         if (thisOffset < filterCubeLength) {
 46:             _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset];
 47:         }
 48:     }
 49:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 50: 
 51:     float sum = 0;
 52:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 53:         int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 54:         barrier(CLK_LOCAL_MEM_FENCE);
 55:         for (int i = 0; i < numUpstreamsPerThread; i++) {
 56:             int thisOffset = workgroupSize * i + localId;
 57:             if (thisOffset < gInputSizeSquared) {
 58:                 _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ];
 59:             }
 60:         }
 61:         barrier(CLK_LOCAL_MEM_FENCE);
 62:         int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 63:         for (int u = minu; u <= maxu; u++) {
 64:             int inputRow = outputRow + u;
 65:             #if gPadZeros == 0
 66:                 inputRow += gHalfFilterSize;
 67:             #endif
 68:             int inputimagerowoffset = inputRow * gInputSize;
 69:             int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 70:             for (int v = minv; v <= maxv; v++) {
 71:                 int inputCol = outputCol + v;
 72:                 #if gPadZeros == 0
 73:                     inputCol += gHalfFilterSize;
 74:                 #endif
 75:                 if (localId < gOutputSizeSquared) {
 76:                     sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 77:                 }
 78:             }
 79:         }
 80:     }
 81: 
 82:     // output are organized like [imageid][filterid][row][col]
 83:     int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 84:     if (localId < gOutputSizeSquared) {
 85:         output[resultIndex ] = sum;
 86:     }
 87: }
 88: 
 89: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 3: this instance cant be used: 
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
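
Note on the kernel 3 index layout: the header comments of `forward_3_by_n_outplane` describe workgroup ids as `[imageid][outplane]` and local ids as `[outrow][outcol]`. The host-side sketch below mirrors that arithmetic using the values from the `-D` options in the build log (`gNumFilters=2`, `gOutputSize=1`). The names `OutputCoord`, `decode_ids` and `ceil_div` are illustrative and do not appear in DeepCL.

```c
#include <assert.h>

/* Values taken from the -D options shown in the build log. */
enum { gNumFilters = 2, gOutputSize = 1, gOutputSizeSquared = 1 };

typedef struct {
    int n, outPlane, outputRow, outputCol, resultIndex;
} OutputCoord;

/* Mirrors the id decomposition in forward_3_by_n_outplane (host-side sketch). */
OutputCoord decode_ids(int workgroupId, int localId) {
    OutputCoord c;
    c.n = workgroupId / gNumFilters;         /* image index within the batch */
    c.outPlane = workgroupId % gNumFilters;  /* filter / output plane index */
    c.outputRow = localId / gOutputSize;
    c.outputCol = localId % gOutputSize;
    /* output is laid out as [imageid][filterid][row][col] */
    c.resultIndex = (c.n * gNumFilters + c.outPlane) * gOutputSizeSquared + localId;
    return c;
}

/* Rounded-up division, the pattern behind numUpstreamsPerThread and
   numPixelsPerThread in the kernel. */
int ceil_div(int total, int workgroupSize) {
    assert(workgroupSize > 0);
    return (total + workgroupSize - 1) / workgroupSize;
}
```

With these sizes each output plane holds a single value, so only local id 0 passes the `localId < gOutputSizeSquared` guard.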
 forward try kernel 4
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, int N) {
 8:     int numLoops = (N + get_local_size(0) - 1) / get_local_size(0);
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * get_local_size(0) + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [n][filterid]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [n][filterid][outrow][outcol]
 26: // the pixels per thread thing... :
 27: // - we have one thread (~= cuda core) per output value,
 28: //   ie one thread for each combination of [outrow][outcol]
 29: // - however, the number of threads is typically limited on a gpu,
 30: //   eg to 512 (eg Intel HD), or 1024 (eg nVidia K520)
 31: // - so what happens if the number of output points is larger than
 32: //   the maximum workgroup size?
 33: // - then we have several possibilities really:
 34: //   - we can divide the image into blocks, and process each block
 35: //     separately.  This is probably a good option, but fair amount of
 36: //     work
 37: //   - we can get each thread to handle more than one output
 38: //     pixel, by looping
 39: //   - we can consider the output image in 1d, by putting the rows
 40: //     one after another, and assign each contiguous workgroup-size
 41: //     block to one workgroup
 42: //     => this is how this kernel works
 43: //     basically, it's a hack, so larger images actually run, without
 44: //     crashing, and we can probably improve it a lot :-)
 45: //
 46: // So, when outputSize * outputSize > workgroupSize, then
 47: // multiple workgroups will be created for each output plane
 48: // the number of such workgroups is given by: `gPixelsPerThread`
 49: // the id of our workgroup within such a set of workgroups is calculated
 50: // as `pixel`
 51: // effectiveLocalId is our local id if we had one enormous workgroup
 52: // containing the whole output image plane
 53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize,
 54:       global const float *images, global const float *filters,
 55:     global float *output,
 56:     local float *_inputPlane, local float *_filterPlane) {
 57:     #define globalId (get_global_id(0))
 58: 
 59:     #define localId (get_local_id(0))
 60:     #define workgroupId (get_group_id(0))
 61: //    const int workgroupSize = get_local_size(0);
 62:     const int effectiveWorkgroupId = workgroupId / gPixelsPerThread;
 63:     const int pixel = workgroupId % gPixelsPerThread;
 64:     const int effectiveLocalId = localId + pixel * gWorkgroupSize;
 65:     const int n = effectiveWorkgroupId / gNumFilters;
 66:     const int outPlane = effectiveWorkgroupId % gNumFilters;
 67: 
 68:     const int outputRow = effectiveLocalId / gOutputSize;
 69:     const int outputCol = effectiveLocalId % gOutputSize;
 70: 
 71:     float sum = 0;
 72:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 73:         barrier(CLK_LOCAL_MEM_FENCE);
 74:         copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared);
 75:         copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared);
 76:         barrier(CLK_LOCAL_MEM_FENCE);
 77: 
 78:         if (effectiveLocalId < gOutputSizeSquared) {
 79:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 80:                 // trying to reduce register pressure...
 81:                 #if gPadZeros == 1
 82:                     #define inputRow (outputRow + u)
 83:                 #else
 84:                     #define inputRow (outputRow + u + gHalfFilterSize)
 85:                 #endif
 86:                 int inputimagerowoffset = inputRow * gInputSize;
 87:                 int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 88:                 bool rowOk = inputRow >= 0 && inputRow < gInputSize;
 89:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 90:                     #if gPadZeros == 1
 91:                         #define inputCol (outputCol + v)
 92:                     #else
 93:                         #define inputCol (outputCol + v + gHalfFilterSize)
 94:                     #endif
 95:                     bool process = rowOk && inputCol >= 0 && inputCol < gInputSize;
 96:                     if (process) {
 97:                             sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ];
 98:                     }
 99:                 }
 100:             }
 101:         }
 102:     }
 103:     // output are organized like [imageid][filterid][row][col]
 104:     #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId)
 105:     if (effectiveLocalId < gOutputSizeSquared) {
 106:         output[resultIndex ] = sum;
 107:     }
 108: }
 109: #endif
 110: 
 111: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 4: this instance cant be used: 
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
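
Note on kernel 4: `copyLocal` has every work-item copy a strided share of the buffer, and `effectiveLocalId` extends the local id across the several workgroups that cover one output plane. The sketch below emulates both on the host. It is sequential, so it does not model the barriers around the copies; `copy_local_one_item`, `copy_local_group` and `effective_local_id` are illustrative names, not DeepCL functions.

```c
#include <assert.h>

/* One work-item's share of copyLocal: a strided copy of N floats. */
void copy_local_one_item(float *target, const float *source, int N,
                         int localSize, int localId) {
    int numLoops = (N + localSize - 1) / localSize;
    for (int loop = 0; loop < numLoops; loop++) {
        int offset = loop * localSize + localId;
        if (offset < N) {
            target[offset] = source[offset];
        }
    }
}

/* Sequential stand-in for a workgroup: run every local id in turn. */
void copy_local_group(float *target, const float *source, int N, int localSize) {
    assert(localSize > 0);
    for (int localId = 0; localId < localSize; localId++) {
        copy_local_one_item(target, source, N, localSize, localId);
    }
}

/* Output pixel handled by a work-item when one plane spans several workgroups,
   following the kernel's `pixel` and `effectiveLocalId` definitions. */
int effective_local_id(int workgroupId, int localId,
                       int pixelsPerThread, int workgroupSize) {
    int pixel = workgroupId % pixelsPerThread;
    return localId + pixel * workgroupSize;
}
```

In this run `gPixelsPerThread=1`, so `pixel` is always 0 and `effectiveLocalId` equals the plain local id.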
 forward try kernel 5
 cl/reduce_segments.cl build log: 
 (8:0) : error : invalid global address space qualifier specified for parameter type
 (8:0) : error : syntax error at 'const'

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: kernel void reduce_segments(const int numSegments, const int segmentLength,
 8:         global float const *in, global float* out) {
 9:     const int globalId = get_global_id(0);
 10:     const int segmentId = globalId;
 11: 
 12:     if (segmentId >= numSegments) {
 13:         return;
 14:     }
 15: 
 16:     float sum = 0;
 17:     global const float *segment = in + segmentId * segmentLength;
 18:     for (int i = 0; i < segmentLength; i++) {
 19:         sum += segment[i];
 20:     }
 21:     out[segmentId] = sum;
 22: }
 23: 
 24: 
 25: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/reduce_segments.cl build log: 
 (8:0) : error : invalid global address space qualifier specified for parameter type
 (8:0) : error : syntax error at 'const'

 ForwardAuto: kernel 5: this instance cant be used: 
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/reduce_segments.cl build log: 
 (8:0) : error : invalid global address space qualifier specified for parameter type
 (8:0) : error : syntax error at 'const'

   ... not valid
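
Note on kernel 5: unlike the other kernels, `reduce_segments` fails on its source rather than on the option string. The compiler objects to line 8, where the parameter is spelled `global float const *in`; the other kernels in this log spell the same kind of parameter `global const float *`. Whether reordering the qualifiers would satisfy the Vivante compiler is an untested assumption. For checking results once the kernel does build, a host-side reference of the same reduction could look like this (`reduce_segments_ref` is a hypothetical name):

```c
#include <assert.h>

/* Host-side reference for reduce_segments: out[s] = sum of the s-th segment.
   The kernel assigns one work-item per segment; here a plain loop stands in. */
void reduce_segments_ref(int numSegments, int segmentLength,
                         const float *in, float *out) {
    assert(numSegments >= 0 && segmentLength >= 0);
    for (int segmentId = 0; segmentId < numSegments; segmentId++) {
        const float *segment = in + segmentId * segmentLength;
        float sum = 0;
        for (int i = 0; i < segmentLength; i++) {
            sum += segment[i];
        }
        out[segmentId] = sum;
    }
}
```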
 forward try kernel 6
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept:
 8: // - load same input plane from each image
 9: // - hold filter plane for this input plane, for all filters
 10: // - reduce afterwards
 11: // local memory for one plane from each filter of 64c7 = 64 * 7 * 7 * 4 = 12.5KB
 12: // local memory for one single input plane = 19 * 19 * 4 = 1.4KB
 13: // => seems ok?
 14: // workgroupid: [inputPlaneId]
 15: // localid: [filterId][outRow] (if this is more than workgroupsize, we should reuse some threads...)
 16: // iterate over: [n][outCol]
 17: // output: [n][filterId][outRow][outCol][inputPlane]
 18: // need to later reduce output over: [inputPlane]
 19: void kernel forward_byinputplane(const int batchSize,
 20:       global const float *images, global const float *filters,
 21:     global float *output,
 22:     local float *_inputPlane, local float *_filterPlanes) {
 23: //    const int evenPadding = gFilterSize % 2 == 0 ? 1 : 0;
 24: 
 25:     const int globalId = get_global_id(0);
 26:     const int workgroupId = get_group_id(0);
 27:     const int workgroupSize = get_local_size(0);
 28:     const int localId = get_local_id(0);
 29: 
 30:     const int inputPlaneId = workgroupId;
 31:     const int numLoops = (gNumFilters * gOutputSize + workgroupSize - 1) / workgroupSize;
 32:     const int numFilterCopyLoops = (gFilterSizeSquared + gOutputSize - 1) / gOutputSize;
 33:     const int numImageCopyLoops = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 34:     for (int loop = 0; loop < numLoops; loop++) {
 35:         const int loopLocalId = localId + loop * workgroupSize;
 36:         const int filterId = loopLocalId / gOutputSize;
 37:         const int outRow = loopLocalId % gOutputSize;
 38: 
 39:         // copy down our filter, we have gOutputSize threads to do this
 40:         global float const *globalFilterPlane = filters +
 41:             (filterId * gNumInputPlanes + inputPlaneId) * gFilterSizeSquared;
 42:         local float *_localFilterPlane = _filterPlanes + filterId * gFilterSizeSquared;
 43:         barrier(CLK_LOCAL_MEM_FENCE);
 44:         for (int i = 0; i < numFilterCopyLoops; i++) {
 45:             const int offset = i * gOutputSize + outRow;
 46:             bool process = filterId < gNumFilters && offset < gFilterSizeSquared;
 47:             if (process) {
 48:                 _localFilterPlane[ offset ] = globalFilterPlane[ offset ];
 49:             }
 50:         }
 51:         // loop over n ...
 52:         for (int n = 0; n < batchSize; n++) {
 53:             // copy down our imageplane, we have workgroupSize threads to do this
 54:             barrier(CLK_LOCAL_MEM_FENCE);
 55:             global float const *globalImagePlane = images +
 56:                 (n * gNumInputPlanes + inputPlaneId) * gInputSizeSquared;
 57:             for (int i = 0; i< numImageCopyLoops; i++) {
 58:                 const int offset = i * workgroupSize + localId;
 59:                 if (offset < gInputSizeSquared) {
 60:                     _inputPlane[ offset ] = globalImagePlane[ offset ];
 61:                 }
 62:             }
 63:             barrier(CLK_LOCAL_MEM_FENCE);
 64:             // calc output for each [outrow][outcol]
 65:             bool filterPlaneOk = filterId < gNumFilters;
 66:             for (int outCol = 0; outCol < gOutputSize; outCol++) {
 67:                 float sum = 0;
 68:                 for (int filterRow = 0; filterRow < gFilterSize; filterRow++) {
 69:                     int inRow = outRow + filterRow;
 70:                     #if gPadZeros == 1
 71:                         inRow -= gHalfFilterSize;
 72:                     #endif
 73:                     bool rowOk = filterPlaneOk && inRow >= 0 && inRow < gInputSize;
 74:                     for (int filterCol = 0; filterCol < gFilterSize; filterCol++) {
 75:                         int inCol = outCol + filterCol;
 76:                         #if gPadZeros == 1
 77:                             inCol -= gHalfFilterSize;
 78:                         #endif
 79:                         bool process = rowOk && inCol >= 0 && inCol < gInputSize;
 80:                         if (process) {
 81:                             float imageValue = _inputPlane[ inRow * gInputSize + inCol ];
 82:                             float filterValue = _localFilterPlane[ filterRow * gFilterSize + filterCol ];
 83:                             sum += imageValue * filterValue;
 84:                         }
 85:                     }
 86:                 }
 87:                 if (filterId < gNumFilters) {
 88:                     // [n][filterId][outRow][outCol][inputPlane]
 89:                     int resultIndex = (( (n
 90:                         * gNumFilters + filterId)
 91:                         * gOutputSize + outRow)
 92:                         * gOutputSize + outCol)
 93:                         * gNumInputPlanes + inputPlaneId;
 94:                     output[resultIndex] = sum;
 95:                     //if (globalId == 2) output[0] = resultIndex;
 96: //                    output[resultIndex] = outRow;
 97:                 }
 98: //                output[localId] = _localFilterPlane[localId];
 99:             }
 100:         }
 101:     }
 102: }
 103: 
 104: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 6: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept:
 8: // - load same input plane from each image
 9: // - hold filter plane for this input plane, for all filters
 10: // - reduce afterwards
 11: // local memory for one plane from each filter of 64c7 = 64 * 7 * 7 * 4 = 12.5KB
 12: // local memory for one single input plane = 19 * 19 * 4 = 1.4KB
 13: // => seems ok?
 14: // workgroupid: [inputPlaneId]
 15: // localid: [filterId][outRow] (if this is more than workgroupsize, we should reuse some threads...)
 16: // iterate over: [n][outCol]
 17: // output: [n][filterId][outRow][outCol][inputPlane]
 18: // need to later reduce output over: [inputPlane]
 19: void kernel forward_byinputplane(const int batchSize,
 20:       global const float *images, global const float *filters,
 21:     global float *output,
 22:     local float *_inputPlane, local float *_filterPlanes) {
 23: //    const int evenPadding = gFilterSize % 2 == 0 ? 1 : 0;
 24: 
 25:     const int globalId = get_global_id(0);
 26:     const int workgroupId = get_group_id(0);
 27:     const int workgroupSize = get_local_size(0);
 28:     const int localId = get_local_id(0);
 29: 
 30:     const int inputPlaneId = workgroupId;
 31:     const int numLoops = (gNumFilters * gOutputSize + workgroupSize - 1) / workgroupSize;
 32:     const int numFilterCopyLoops = (gFilterSizeSquared + gOutputSize - 1) / gOutputSize;
 33:     const int numImageCopyLoops = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 34:     for (int loop = 0; loop < numLoops; loop++) {
 35:         const int loopLocalId = localId + loop * workgroupSize;
 36:         const int filterId = loopLocalId / gOutputSize;
 37:         const int outRow = loopLocalId % gOutputSize;
 38: 
 39:         // copy down our filter, we have gOutputSize threads to do this
 40:         global float const *globalFilterPlane = filters +
 41:             (filterId * gNumInputPlanes + inputPlaneId) * gFilterSizeSquared;
 42:         local float *_localFilterPlane = _filterPlanes + filterId * gFilterSizeSquared;
 43:         barrier(CLK_LOCAL_MEM_FENCE);
 44:         for (int i = 0; i < numFilterCopyLoops; i++) {
 45:             const int offset = i * gOutputSize + outRow;
 46:             bool process = filterId < gNumFilters && offset < gFilterSizeSquared;
 47:             if (process) {
 48:                 _localFilterPlane[ offset ] = globalFilterPlane[ offset ];
 49:             }
 50:         }
 51:         // loop over n ...
 52:         for (int n = 0; n < batchSize; n++) {
 53:             // copy down our imageplane, we have workgroupSize threads to do this
 54:             barrier(CLK_LOCAL_MEM_FENCE);
 55:             global float const *globalImagePlane = images +
 56:                 (n * gNumInputPlanes + inputPlaneId) * gInputSizeSquared;
 57:             for (int i = 0; i< numImageCopyLoops; i++) {
 58:                 const int offset = i * workgroupSize + localId;
 59:                 if (offset < gInputSizeSquared) {
 60:                     _inputPlane[ offset ] = globalImagePlane[ offset ];
 61:                 }
 62:             }
 63:             barrier(CLK_LOCAL_MEM_FENCE);
 64:             // calc output for each [outrow][outcol]
 65:             bool filterPlaneOk = filterId < gNumFilters;
 66:             for (int outCol = 0; outCol < gOutputSize; outCol++) {
 67:                 float sum = 0;
 68:                 for (int filterRow = 0; filterRow < gFilterSize; filterRow++) {
 69:                     int inRow = outRow + filterRow;
 70:                     #if gPadZeros == 1
 71:                         inRow -= gHalfFilterSize;
 72:                     #endif
 73:                     bool rowOk = filterPlaneOk && inRow >= 0 && inRow < gInputSize;
 74:                     for (int filterCol = 0; filterCol < gFilterSize; filterCol++) {
 75:                         int inCol = outCol + filterCol;
 76:                         #if gPadZeros == 1
 77:                             inCol -= gHalfFilterSize;
 78:                         #endif
 79:                         bool process = rowOk && inCol >= 0 && inCol < gInputSize;
 80:                         if (process) {
 81:                             float imageValue = _inputPlane[ inRow * gInputSize + inCol ];
 82:                             float filterValue = _localFilterPlane[ filterRow * gFilterSize + filterCol ];
 83:                             sum += imageValue * filterValue;
 84:                         }
 85:                     }
 86:                 }
 87:                 if (filterId < gNumFilters) {
 88:                     // [n][filterId][outRow][outCol][inputPlane]
 89:                     int resultIndex = (( (n
 90:                         * gNumFilters + filterId)
 91:                         * gOutputSize + outRow)
 92:                         * gOutputSize + outCol)
 93:                         * gNumInputPlanes + inputPlaneId;
 94:                     output[resultIndex] = sum;
 95:                     //if (globalId == 2) output[0] = resultIndex;
 96: //                    output[resultIndex] = outRow;
 97:                 }
 98: //                output[localId] = _localFilterPlane[localId];
 99:             }
 100:         }
 101:     }
 102: }
 103: 
 104: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 7
   ... seems valid
 ForwardIm2Col.cl build log: 
 (19:0) : error : invalid global address space qualifier specified for parameter type
 (19:0) : error : syntax error at 'const'

 kernel build error:

 kernel source:
 1: // from SpatialConvolutionMM.cu:
 2: 
 3: // CL: grid stride looping
 4: #define CL_KERNEL_LOOP(i, n)                        \
 5:   for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \
 6:       i < (n);                                       \
 7:       i += get_local_size(0) * get_num_groups(0))
 8: 
 9: //#define gPadding 0
 10: //#define gStride 1
 11: //#define gColSize 1
 12: //#define gFilterSize 1
 13: //#define gSize 1
 14: 
 15: // Kernel for fast unfold+copy
 16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu)
 17: kernel void im2col(
 18:     const int n,
 19:     global float const * im_data, int im_offset,
 20:     global float* data_col) {
 21:   global const float *data_im = im_data + im_offset;
 22: 
 23:   CL_KERNEL_LOOP(index, n) {
 24:     int w_out = index % 1;
 25:     index /= 1;
 26:     int h_out = index % 1;
 27:     int channel_in = index / 1;
 28:     int channel_out = channel_in * 1 * 1;
 29:     int h_in = h_out * 1 - 0;
 30:     int w_in = w_out * 1 - 0;
 31:     data_col += (channel_out * 1 + h_out) * 1 + w_out;
 32:     data_im += (channel_in * 1 + h_in) * 1 + w_in;
 33:     for (int i = 0; i < 1; ++i) {
 34:       for (int j = 0; j < 1; ++j) {
 35:         int h = h_in + i;
 36:         int w = w_in + j;
 37:         *data_col = (h >= 0 && w >= 0 && h < 1 && w < 1) ?
 38:           data_im[i * 1 + j] : 0;
 39:         data_col += 1 * 1;
 40:       }
 41:     }
 42:   }
 43: }
 44: 
 45: kernel void col2im(
 46:     const int n,
 47:     global float const *data_col,
 48:     global float* im_data, int im_offset) {
 49:   global float *data_im = im_data + im_offset;
 50: 
 51:   for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) {
 52:     float val = 0;
 53:     int w = index % 1 + 0;
 54:     int h = (index / 1) % 1 + 0;
 55:     int c = index / (1 * 1);
 56:     // compute the start and end of the output
 57:     int w_col_start = (w < 1) ? 0 : (w - 1) / 1 + 1;
 58:     int w_col_end = min(w / 1 + 1, 1);
 59:     int h_col_start = (h < 1) ? 0 : (h - 1) / 1 + 1;
 60:     int h_col_end = min(h / 1 + 1, 1);
 61: 
 62:     int offset = (c * 1 * 1 + h * 1 + w) * 1 * 1;
 63:     int coeff_h_col = (1 - 1 * 1 * 1) * 1;
 64:     int coeff_w_col = (1 - 1 * 1 * 1);
 65:     for (int h_col = h_col_start; h_col < h_col_end; ++h_col) {
 66:       for (int w_col = w_col_start; w_col < w_col_end; ++w_col) {
 67:         val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col];
 68:       }
 69:     }
 70:     data_im[index] = val;
 71:   }
 72: }
 73: 
 74: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 ForwardIm2Col.cl build log: 
 (19:0) : error : invalid global address space qualifier specified for parameter type
 (19:0) : error : syntax error at 'const'

 ForwardAuto: kernel 7 this instance cant be used: 
 kernel source:
 1: // from SpatialConvolutionMM.cu:
 2: 
 3: // CL: grid stride looping
 4: #define CL_KERNEL_LOOP(i, n)                        \
 5:   for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \
 6:       i < (n);                                       \
 7:       i += get_local_size(0) * get_num_groups(0))
 8: 
 9: //#define gPadding 0
 10: //#define gStride 1
 11: //#define gColSize 1
 12: //#define gFilterSize 1
 13: //#define gSize 1
 14: 
 15: // Kernel for fast unfold+copy
 16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu)
 17: kernel void im2col(
 18:     const int n,
 19:     global float const * im_data, int im_offset,
 20:     global float* data_col) {
 21:   global const float *data_im = im_data + im_offset;
 22: 
 23:   CL_KERNEL_LOOP(index, n) {
 24:     int w_out = index % 1;
 25:     index /= 1;
 26:     int h_out = index % 1;
 27:     int channel_in = index / 1;
 28:     int channel_out = channel_in * 1 * 1;
 29:     int h_in = h_out * 1 - 0;
 30:     int w_in = w_out * 1 - 0;
 31:     data_col += (channel_out * 1 + h_out) * 1 + w_out;
 32:     data_im += (channel_in * 1 + h_in) * 1 + w_in;
 33:     for (int i = 0; i < 1; ++i) {
 34:       for (int j = 0; j < 1; ++j) {
 35:         int h = h_in + i;
 36:         int w = w_in + j;
 37:         *data_col = (h >= 0 && w >= 0 && h < 1 && w < 1) ?
 38:           data_im[i * 1 + j] : 0;
 39:         data_col += 1 * 1;
 40:       }
 41:     }
 42:   }
 43: }
 44: 
 45: kernel void col2im(
 46:     const int n,
 47:     global float const *data_col,
 48:     global float* im_data, int im_offset) {
 49:   global float *data_im = im_data + im_offset;
 50: 
 51:   for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) {
 52:     float val = 0;
 53:     int w = index % 1 + 0;
 54:     int h = (index / 1) % 1 + 0;
 55:     int c = index / (1 * 1);
 56:     // compute the start and end of the output
 57:     int w_col_start = (w < 1) ? 0 : (w - 1) / 1 + 1;
 58:     int w_col_end = min(w / 1 + 1, 1);
 59:     int h_col_start = (h < 1) ? 0 : (h - 1) / 1 + 1;
 60:     int h_col_end = min(h / 1 + 1, 1);
 61: 
 62:     int offset = (c * 1 * 1 + h * 1 + w) * 1 * 1;
 63:     int coeff_h_col = (1 - 1 * 1 * 1) * 1;
 64:     int coeff_w_col = (1 - 1 * 1 * 1);
 65:     for (int h_col = h_col_start; h_col < h_col_end; ++h_col) {
 66:       for (int w_col = w_col_start; w_col < w_col_end; ++w_col) {
 67:         val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col];
 68:       }
 69:     }
 70:     data_im[index] = val;
 71:   }
 72: }
 73: 
 74: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 ForwardIm2Col.cl build log: 
 (19:0) : error : invalid global address space qualifier specified for parameter type
 (19:0) : error : syntax error at 'const'

   forward kernel 0: cannot be used
   forward kernel 1: cannot be used
   forward kernel 2: cannot be used
   forward kernel 3: cannot be used
   forward kernel 4: cannot be used
   forward kernel 5: cannot be used
   forward kernel 6: cannot be used
   forward kernel 7: cannot be used
 clblas teardown
 unknown file: Failure
 C++ exception with description "No valid forward implementations found" thrown in the test body.
 [  FAILED  ] testlogicaloperators.Convolve_1layer_biased_And (182 ms)
 [ RUN      ] testlogicaloperators.Convolve_1layerbiased_Or
 Or, convolve
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 forward try kernel 0
  ... not plausibly optimal, skipping
 forward try kernel 1
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 1: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 2
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, const int N) {
 8:     int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize;
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * gWorkgroupSize + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [outplane]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [imageid][filterid][row][col]
 26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB
 27: //                            eg 5 * 5 * 32 * 4 = 3.2KB => ok :-)
 28: //                           but 28 * 28 * 32 * 4 = 100KB => less good :-P
 29: void kernel forward_2_by_outplane(
 30:         const int batchSize,
 31:         global const float *images, global const float *filters,
 32:         global float *output,
 33:         local float *_inputPlane, local float *_filterCube) {
 34:     const int globalId = get_global_id(0);
 35: 
 36:     const int workgroupId = get_group_id(0);
 37:     const int workgroupSize = get_local_size(0);
 38:     const int outPlane = workgroupId;
 39: 
 40:     const int localId = get_local_id(0);
 41:     const int outputRow = localId / gOutputSize;
 42:     const int outputCol = localId % gOutputSize;
 43: 
 44:     #if gPadZeros == 1
 45:         const int minu = max(-gHalfFilterSize, -outputRow);
 46:         const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven;
 47:         const int minv = max(-gHalfFilterSize, -outputCol);
 48:         const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven;
 49:     #else
 50:         const int minu = -gHalfFilterSize;
 51:         const int maxu = gHalfFilterSize - gEven;
 52:         const int minv = -gHalfFilterSize;
 53:         const int maxv = gHalfFilterSize - gEven;
 54:     #endif
 55: 
 56:     {
 57:         const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 58:         copyLocal(_filterCube,
 59:                 filters + outPlane * filterCubeLength,
 60:                 filterCubeLength);
 61:     }
 62:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 63: 
 64:     for (int n = 0; n < batchSize; n++) {
 65:         float sum = 0;
 66:         for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 67:             barrier(CLK_LOCAL_MEM_FENCE);
 68:             copyLocal(_inputPlane,
 69:                        images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared,
 70:                        gInputSizeSquared);
 71:             barrier(CLK_LOCAL_MEM_FENCE);
 72:             int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 73:             if (localId < gOutputSizeSquared) {
 74:                 for (int u = minu; u <= maxu; u++) {
 75:                     int inputRow = outputRow + u;
 76:                     #if gPadZeros == 0
 77:                          inputRow += gHalfFilterSize;
 78:                     #endif
 79:                     int inputimagerowoffset = inputRow * gInputSize;
 80:                     int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 81:                     for (int v = minv; v <= maxv; v++) {
 82:                         int inputCol = outputCol + v;
 83:                         #if gPadZeros == 0
 84:                              inputCol += gHalfFilterSize;
 85:                         #endif
 86:                         sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 87:                     }
 88:                 }
 89:             }
 90:         }
 91:         // output are organized like [imageid][filterid][row][col]
 92:         int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 93:         if (localId < gOutputSizeSquared) {
 94:             output[resultIndex ] = sum;
 95:         }
 96:     }
 97: }
 98: #endif
 99: 
 100: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

 ForwardAuto: kernel 2: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, const int N) {
 8:     int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize;
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * gWorkgroupSize + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [outplane]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [imageid][filterid][row][col]
 26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB
 27: //                            eg 5 * 5 * 32 * 4 = 3.2KB => ok :-)
 28: //                           but 28 * 28 * 32 * 4 = 100KB => less good :-P
 29: void kernel forward_2_by_outplane(
 30:         const int batchSize,
 31:         global const float *images, global const float *filters,
 32:         global float *output,
 33:         local float *_inputPlane, local float *_filterCube) {
 34:     const int globalId = get_global_id(0);
 35: 
 36:     const int workgroupId = get_group_id(0);
 37:     const int workgroupSize = get_local_size(0);
 38:     const int outPlane = workgroupId;
 39: 
 40:     const int localId = get_local_id(0);
 41:     const int outputRow = localId / gOutputSize;
 42:     const int outputCol = localId % gOutputSize;
 43: 
 44:     #if gPadZeros == 1
 45:         const int minu = max(-gHalfFilterSize, -outputRow);
 46:         const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven;
 47:         const int minv = max(-gHalfFilterSize, -outputCol);
 48:         const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven;
 49:     #else
 50:         const int minu = -gHalfFilterSize;
 51:         const int maxu = gHalfFilterSize - gEven;
 52:         const int minv = -gHalfFilterSize;
 53:         const int maxv = gHalfFilterSize - gEven;
 54:     #endif
 55: 
 56:     {
 57:         const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 58:         copyLocal(_filterCube,
 59:                 filters + outPlane * filterCubeLength,
 60:                 filterCubeLength);
 61:     }
 62:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 63: 
 64:     for (int n = 0; n < batchSize; n++) {
 65:         float sum = 0;
 66:         for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 67:             barrier(CLK_LOCAL_MEM_FENCE);
 68:             copyLocal(_inputPlane,
 69:                        images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared,
 70:                        gInputSizeSquared);
 71:             barrier(CLK_LOCAL_MEM_FENCE);
 72:             int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 73:             if (localId < gOutputSizeSquared) {
 74:                 for (int u = minu; u <= maxu; u++) {
 75:                     int inputRow = outputRow + u;
 76:                     #if gPadZeros == 0
 77:                          inputRow += gHalfFilterSize;
 78:                     #endif
 79:                     int inputimagerowoffset = inputRow * gInputSize;
 80:                     int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 81:                     for (int v = minv; v <= maxv; v++) {
 82:                         int inputCol = outputCol + v;
 83:                         #if gPadZeros == 0
 84:                              inputCol += gHalfFilterSize;
 85:                         #endif
 86:                         sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 87:                     }
 88:                 }
 89:             }
 90:         }
 91:         // output are organized like [imageid][filterid][row][col]
 92:         int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 93:         if (localId < gOutputSizeSquared) {
 94:             output[resultIndex ] = sum;
 95:         }
 96:     }
 97: }
 98: #endif
 99: 
 100: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

   ... not valid
 forward try kernel 3
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept: each workgroup handles convolving one input example with one filtercube
 8: // and writing out one single output plane
 9: //
 10: // workgroup id organized like: [imageid][outplane]
 11: // local id organized like: [outrow][outcol]
 12: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 13: // number workgroups = 32
 14: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 16: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 17: // output are organized like [imageid][filterid][row][col]
 18: void kernel forward_3_by_n_outplane(const int batchSize,
 19:       global const float *images, global const float *filters,
 20:     global float *output,
 21:     local float *_upstreamImage, local float *_filterCube) {
 22:     const int globalId = get_global_id(0);
 23: 
 24:     const int workgroupId = get_group_id(0);
 25:     const int workgroupSize = get_local_size(0);
 26:     const int n = workgroupId / gNumFilters;
 27:     const int outPlane = workgroupId % gNumFilters;
 28: 
 29:     const int localId = get_local_id(0);
 30:     const int outputRow = localId / gOutputSize;
 31:     const int outputCol = localId % gOutputSize;
 32: 
 33:     const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize;
 34:     const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow  - gEven) : gHalfFilterSize - gEven;
 35:     const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize;
 36:     const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven;
 37: 
 38:     const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 39: 
 40:     const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 41:     const int filterCubeGlobalOffset = outPlane * filterCubeLength;
 42:     const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize;
 43:     for (int i = 0; i < numPixelsPerThread; i++) {
 44:         int thisOffset = localId + i * workgroupSize;
 45:         if (thisOffset < filterCubeLength) {
 46:             _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset];
 47:         }
 48:     }
 49:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 50: 
 51:     float sum = 0;
 52:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 53:         int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 54:         barrier(CLK_LOCAL_MEM_FENCE);
 55:         for (int i = 0; i < numUpstreamsPerThread; i++) {
 56:             int thisOffset = workgroupSize * i + localId;
 57:             if (thisOffset < gInputSizeSquared) {
 58:                 _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ];
 59:             }
 60:         }
 61:         barrier(CLK_LOCAL_MEM_FENCE);
 62:         int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 63:         for (int u = minu; u <= maxu; u++) {
 64:             int inputRow = outputRow + u;
 65:             #if gPadZeros == 0
 66:                 inputRow += gHalfFilterSize;
 67:             #endif
 68:             int inputimagerowoffset = inputRow * gInputSize;
 69:             int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 70:             for (int v = minv; v <= maxv; v++) {
 71:                 int inputCol = outputCol + v;
 72:                 #if gPadZeros == 0
 73:                     inputCol += gHalfFilterSize;
 74:                 #endif
 75:                 if (localId < gOutputSizeSquared) {
 76:                     sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 77:                 }
 78:             }
 79:         }
 80:     }
 81: 
 82:     // output are organized like [imageid][filterid][row][col]
 83:     int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 84:     if (localId < gOutputSizeSquared) {
 85:         output[resultIndex ] = sum;
 86:     }
 87: }
 88: 
 89: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 3: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept: each workgroup handles convolving one input example with one filtercube
 8: // and writing out one single output plane
 9: //
 10: // workgroup id organized like: [imageid][outplane]
 11: // local id organized like: [outrow][outcol]
 12: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 13: // number workgroups = 32
 14: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 16: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 17: // output are organized like [imageid][filterid][row][col]
 18: void kernel forward_3_by_n_outplane(const int batchSize,
 19:       global const float *images, global const float *filters,
 20:     global float *output,
 21:     local float *_upstreamImage, local float *_filterCube) {
 22:     const int globalId = get_global_id(0);
 23: 
 24:     const int workgroupId = get_group_id(0);
 25:     const int workgroupSize = get_local_size(0);
 26:     const int n = workgroupId / gNumFilters;
 27:     const int outPlane = workgroupId % gNumFilters;
 28: 
 29:     const int localId = get_local_id(0);
 30:     const int outputRow = localId / gOutputSize;
 31:     const int outputCol = localId % gOutputSize;
 32: 
 33:     const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize;
 34:     const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow  - gEven) : gHalfFilterSize - gEven;
 35:     const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize;
 36:     const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven;
 37: 
 38:     const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 39: 
 40:     const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 41:     const int filterCubeGlobalOffset = outPlane * filterCubeLength;
 42:     const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize;
 43:     for (int i = 0; i < numPixelsPerThread; i++) {
 44:         int thisOffset = localId + i * workgroupSize;
 45:         if (thisOffset < filterCubeLength) {
 46:             _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset];
 47:         }
 48:     }
 49:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 50: 
 51:     float sum = 0;
 52:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 53:         int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 54:         barrier(CLK_LOCAL_MEM_FENCE);
 55:         for (int i = 0; i < numUpstreamsPerThread; i++) {
 56:             int thisOffset = workgroupSize * i + localId;
 57:             if (thisOffset < gInputSizeSquared) {
 58:                 _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ];
 59:             }
 60:         }
 61:         barrier(CLK_LOCAL_MEM_FENCE);
 62:         int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 63:         for (int u = minu; u <= maxu; u++) {
 64:             int inputRow = outputRow + u;
 65:             #if gPadZeros == 0
 66:                 inputRow += gHalfFilterSize;
 67:             #endif
 68:             int inputimagerowoffset = inputRow * gInputSize;
 69:             int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 70:             for (int v = minv; v <= maxv; v++) {
 71:                 int inputCol = outputCol + v;
 72:                 #if gPadZeros == 0
 73:                     inputCol += gHalfFilterSize;
 74:                 #endif
 75:                 if (localId < gOutputSizeSquared) {
 76:                     sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 77:                 }
 78:             }
 79:         }
 80:     }
 81: 
 82:     // output are organized like [imageid][filterid][row][col]
 83:     int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 84:     if (localId < gOutputSizeSquared) {
 85:         output[resultIndex ] = sum;
 86:     }
 87: }
 88: 
 89: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 4
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, int N) {
 8:     int numLoops = (N + get_local_size(0) - 1) / get_local_size(0);
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * get_local_size(0) + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [n][filterid]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [n][filterid][outrow][outcol]
 26: // the pixels per thread thing... :
 27: // - we have one thread (~= cuda core) per output value,
 28: //   ie one thread for each combination of [outrow][outcol]
 29: // - however, the number of threads is typically limited on a gpu,
 30: //   eg to 512 (eg Intel HD), or 1024 (eg nVidia K520)
 31: // - so what happens if the number of output points is larger than
 32: //   the maximum workgroup size?
 33: // - then we have several possibilities really:
 34: //   - we can divide the image into blocks, and process each block
 35: //     separately.  This is probably a good option, but fair amount of
 36: //     work
 37: //   - we can get each thread to handle more than one output
 38: //     pixel, by looping
 39: //   - we can consider the output image in 1d, by putting the rows
 40: //     one after another, and assign each contiguous workgroup-size
 41: //     block to one workgroup
 42: //     => this is how this kernel works
 43: //     basically, it's a hack, so larger images actually run, without
 44: //     crashing, and we can probably improve it a lot :-)
 45: //
 46: // So, when outputSize * outputSize > workgroupSize, then
 47: // multiple workgroups will be created for each output plane
 48: // the number of such workgroups is given by: `gPixelsPerThread`
 49: // the id of our workgroup within such a set of workgroups is calculated
 50: // as `pixel`
 51: // effectiveLocalId is our local id if we had one enormous workgroup
 52: // containing the whole output image plane
 53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize,
 54:       global const float *images, global const float *filters,
 55:     global float *output,
 56:     local float *_inputPlane, local float *_filterPlane) {
 57:     #define globalId (get_global_id(0))
 58: 
 59:     #define localId (get_local_id(0))
 60:     #define workgroupId (get_group_id(0))
 61: //    const int workgroupSize = get_local_size(0);
 62:     const int effectiveWorkgroupId = workgroupId / gPixelsPerThread;
 63:     const int pixel = workgroupId % gPixelsPerThread;
 64:     const int effectiveLocalId = localId + pixel * gWorkgroupSize;
 65:     const int n = effectiveWorkgroupId / gNumFilters;
 66:     const int outPlane = effectiveWorkgroupId % gNumFilters;
 67: 
 68:     const int outputRow = effectiveLocalId / gOutputSize;
 69:     const int outputCol = effectiveLocalId % gOutputSize;
 70: 
 71:     float sum = 0;
 72:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 73:         barrier(CLK_LOCAL_MEM_FENCE);
 74:         copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared);
 75:         copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared);
 76:         barrier(CLK_LOCAL_MEM_FENCE);
 77: 
 78:         if (effectiveLocalId < gOutputSizeSquared) {
 79:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 80:                 // trying to reduce register pressure...
 81:                 #if gPadZeros == 1
 82:                     #define inputRow (outputRow + u)
 83:                 #else
 84:                     #define inputRow (outputRow + u + gHalfFilterSize)
 85:                 #endif
 86:                 int inputimagerowoffset = inputRow * gInputSize;
 87:                 int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 88:                 bool rowOk = inputRow >= 0 && inputRow < gInputSize;
 89:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 90:                     #if gPadZeros == 1
 91:                         #define inputCol (outputCol + v)
 92:                     #else
 93:                         #define inputCol (outputCol + v + gHalfFilterSize)
 94:                     #endif
 95:                     bool process = rowOk && inputCol >= 0 && inputCol < gInputSize;
 96:                     if (process) {
 97:                             sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ];
 98:                     }
 99:                 }
 100:             }
 101:         }
 102:     }
 103:     // output are organized like [imageid][filterid][row][col]
 104:     #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId)
 105:     if (effectiveLocalId < gOutputSizeSquared) {
 106:         output[resultIndex ] = sum;
 107:     }
 108: }
 109: #endif
 110: 
 111: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 4: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, int N) {
 8:     int numLoops = (N + get_local_size(0) - 1) / get_local_size(0);
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * get_local_size(0) + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [n][filterid]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [n][filterid][outrow][outcol]
 26: // the pixels per thread thing... :
 27: // - we have one thread (~= cuda core) per output value,
 28: //   ie one thread for each combination of [outrow][outcol]
 29: // - however, the number of threads is typically limited on a gpu,
 30: //   eg to 512 (eg Intel HD), or 1024 (eg nVidia K520)
 31: // - so what happens if the number of output points is larger than
 32: //   the maximum workgroup size?
 33: // - then we have several possibilities really:
 34: //   - we can divide the image into blocks, and process each block
 35: //     separately.  This is probably a good option, but fair amount of
 36: //     work
 37: //   - we can get each thread to handle more than one output
 38: //     pixel, by looping
 39: //   - we can consider the output image in 1d, by putting the rows
 40: //     one after another, and assign each contiguous workgroup-size
 41: //     block to one workgroup
 42: //     => this is how this kernel works
 43: //     basically, it's a hack, so larger images actually run, without
 44: //     crashing, and we can probably improve it a lot :-)
 45: //
 46: // So, when outputSize * outputSize > workgroupSize, then
 47: // multiple workgroups will be created for each output plane
 48: // the number of such workgroups is given by: `gPixelsPerThread`
 49: // the id of our workgroup within such a set of workgroups is calculated
 50: // as `pixel`
 51: // effectiveLocalId is our local id if we had one enormous workgroup
 52: // containing the whole output image plane
 53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize,
 54:       global const float *images, global const float *filters,
 55:     global float *output,
 56:     local float *_inputPlane, local float *_filterPlane) {
 57:     #define globalId (get_global_id(0))
 58: 
 59:     #define localId (get_local_id(0))
 60:     #define workgroupId (get_group_id(0))
 61: //    const int workgroupSize = get_local_size(0);
 62:     const int effectiveWorkgroupId = workgroupId / gPixelsPerThread;
 63:     const int pixel = workgroupId % gPixelsPerThread;
 64:     const int effectiveLocalId = localId + pixel * gWorkgroupSize;
 65:     const int n = effectiveWorkgroupId / gNumFilters;
 66:     const int outPlane = effectiveWorkgroupId % gNumFilters;
 67: 
 68:     const int outputRow = effectiveLocalId / gOutputSize;
 69:     const int outputCol = effectiveLocalId % gOutputSize;
 70: 
 71:     float sum = 0;
 72:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 73:         barrier(CLK_LOCAL_MEM_FENCE);
 74:         copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared);
 75:         copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared);
 76:         barrier(CLK_LOCAL_MEM_FENCE);
 77: 
 78:         if (effectiveLocalId < gOutputSizeSquared) {
 79:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 80:                 // trying to reduce register pressure...
 81:                 #if gPadZeros == 1
 82:                     #define inputRow (outputRow + u)
 83:                 #else
 84:                     #define inputRow (outputRow + u + gHalfFilterSize)
 85:                 #endif
 86:                 int inputimagerowoffset = inputRow * gInputSize;
 87:                 int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 88:                 bool rowOk = inputRow >= 0 && inputRow < gInputSize;
 89:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 90:                     #if gPadZeros == 1
 91:                         #define inputCol (outputCol + v)
 92:                     #else
 93:                         #define inputCol (outputCol + v + gHalfFilterSize)
 94:                     #endif
 95:                     bool process = rowOk && inputCol >= 0 && inputCol < gInputSize;
 96:                     if (process) {
 97:                             sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ];
 98:                     }
 99:                 }
 100:             }
 101:         }
 102:     }
 103:     // output are organized like [imageid][filterid][row][col]
 104:     #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId)
 105:     if (effectiveLocalId < gOutputSizeSquared) {
 106:         output[resultIndex ] = sum;
 107:     }
 108: }
 109: #endif
 110: 
 111: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 5
 cl/reduce_segments.cl build log: 
 (8:0) : error : invalid global address space qualifier specified for parameter type
 (8:0) : error : syntax error at 'const'

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: kernel void reduce_segments(const int numSegments, const int segmentLength,
 8:         global float const *in, global float* out) {
 9:     const int globalId = get_global_id(0);
 10:     const int segmentId = globalId;
 11: 
 12:     if (segmentId >= numSegments) {
 13:         return;
 14:     }
 15: 
 16:     float sum = 0;
 17:     global const float *segment = in + segmentId * segmentLength;
 18:     for (int i = 0; i < segmentLength; i++) {
 19:         sum += segment[i];
 20:     }
 21:     out[segmentId] = sum;
 22: }
 23: 
 24: 
 25: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/reduce_segments.cl build log: 
 (8:0) : error : invalid global address space qualifier specified for parameter type
 (8:0) : error : syntax error at 'const'

 ForwardAuto: kernel 5: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: kernel void reduce_segments(const int numSegments, const int segmentLength,
 8:         global float const *in, global float* out) {
 9:     const int globalId = get_global_id(0);
 10:     const int segmentId = globalId;
 11: 
 12:     if (segmentId >= numSegments) {
 13:         return;
 14:     }
 15: 
 16:     float sum = 0;
 17:     global const float *segment = in + segmentId * segmentLength;
 18:     for (int i = 0; i < segmentLength; i++) {
 19:         sum += segment[i];
 20:     }
 21:     out[segmentId] = sum;
 22: }
 23: 
 24: 
 25: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/reduce_segments.cl build log: 
 (8:0) : error : invalid global address space qualifier specified for parameter type
 (8:0) : error : syntax error at 'const'

   ... not valid
 forward try kernel 6
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept:
 8: // - load same input plane from each image
 9: // - hold filter plane for this input plane, for all filters
 10: // - reduce afterwards
 11: // local memory for one plane from each filter of 64c7 = 64 * 7 * 7 * 4 = 12.5KB
 12: // local memory for one single input plane = 19 * 19 * 4 = 1.4KB
 13: // => seems ok?
 14: // workgroupid: [inputPlaneId]
 15: // localid: [filterId][outRow] (if this is more than workgroupsize, we should reuse some threads...)
 16: // iterate over: [n][outCol]
 17: // output: [n][filterId][outRow][outCol][inputPlane]
 18: // need to later reduce output over: [inputPlane]
 19: void kernel forward_byinputplane(const int batchSize,
 20:       global const float *images, global const float *filters,
 21:     global float *output,
 22:     local float *_inputPlane, local float *_filterPlanes) {
 23: //    const int evenPadding = gFilterSize % 2 == 0 ? 1 : 0;
 24: 
 25:     const int globalId = get_global_id(0);
 26:     const int workgroupId = get_group_id(0);
 27:     const int workgroupSize = get_local_size(0);
 28:     const int localId = get_local_id(0);
 29: 
 30:     const int inputPlaneId = workgroupId;
 31:     const int numLoops = (gNumFilters * gOutputSize + workgroupSize - 1) / workgroupSize;
 32:     const int numFilterCopyLoops = (gFilterSizeSquared + gOutputSize - 1) / gOutputSize;
 33:     const int numImageCopyLoops = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 34:     for (int loop = 0; loop < numLoops; loop++) {
 35:         const int loopLocalId = localId + loop * workgroupSize;
 36:         const int filterId = loopLocalId / gOutputSize;
 37:         const int outRow = loopLocalId % gOutputSize;
 38: 
 39:         // copy down our filter, we have gOutputSize threads to do this
 40:         global float const *globalFilterPlane = filters +
 41:             (filterId * gNumInputPlanes + inputPlaneId) * gFilterSizeSquared;
 42:         local float *_localFilterPlane = _filterPlanes + filterId * gFilterSizeSquared;
 43:         barrier(CLK_LOCAL_MEM_FENCE);
 44:         for (int i = 0; i < numFilterCopyLoops; i++) {
 45:             const int offset = i * gOutputSize + outRow;
 46:             bool process = filterId < gNumFilters && offset < gFilterSizeSquared;
 47:             if (process) {
 48:                 _localFilterPlane[ offset ] = globalFilterPlane[ offset ];
 49:             }
 50:         }
 51:         // loop over n ...
 52:         for (int n = 0; n < batchSize; n++) {
 53:             // copy down our imageplane, we have workgroupSize threads to do this
 54:             barrier(CLK_LOCAL_MEM_FENCE);
 55:             global float const *globalImagePlane = images +
 56:                 (n * gNumInputPlanes + inputPlaneId) * gInputSizeSquared;
 57:             for (int i = 0; i< numImageCopyLoops; i++) {
 58:                 const int offset = i * workgroupSize + localId;
 59:                 if (offset < gInputSizeSquared) {
 60:                     _inputPlane[ offset ] = globalImagePlane[ offset ];
 61:                 }
 62:             }
 63:             barrier(CLK_LOCAL_MEM_FENCE);
 64:             // calc output for each [outrow][outcol]
 65:             bool filterPlaneOk = filterId < gNumFilters;
 66:             for (int outCol = 0; outCol < gOutputSize; outCol++) {
 67:                 float sum = 0;
 68:                 for (int filterRow = 0; filterRow < gFilterSize; filterRow++) {
 69:                     int inRow = outRow + filterRow;
 70:                     #if gPadZeros == 1
 71:                         inRow -= gHalfFilterSize;
 72:                     #endif
 73:                     bool rowOk = filterPlaneOk && inRow >= 0 && inRow < gInputSize;
 74:                     for (int filterCol = 0; filterCol < gFilterSize; filterCol++) {
 75:                         int inCol = outCol + filterCol;
 76:                         #if gPadZeros == 1
 77:                             inCol -= gHalfFilterSize;
 78:                         #endif
 79:                         bool process = rowOk && inCol >= 0 && inCol < gInputSize;
 80:                         if (process) {
 81:                             float imageValue = _inputPlane[ inRow * gInputSize + inCol ];
 82:                             float filterValue = _localFilterPlane[ filterRow * gFilterSize + filterCol ];
 83:                             sum += imageValue * filterValue;
 84:                         }
 85:                     }
 86:                 }
 87:                 if (filterId < gNumFilters) {
 88:                     // [n][filterId][outRow][outCol][inputPlane]
 89:                     int resultIndex = (( (n
 90:                         * gNumFilters + filterId)
 91:                         * gOutputSize + outRow)
 92:                         * gOutputSize + outCol)
 93:                         * gNumInputPlanes + inputPlaneId;
 94:                     output[resultIndex] = sum;
 95:                     //if (globalId == 2) output[0] = resultIndex;
 96: //                    output[resultIndex] = outRow;
 97:                 }
 98: //                output[localId] = _localFilterPlane[localId];
 99:             }
 100:         }
 101:     }
 102: }
 103: 
 104: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 6: this instance cant be used: 
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward_byinputplane.cl build log: 
 error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 7
   ... seems valid
 ForwardIm2Col.cl build log: 
 (19:0) : error : invalid global address space qualifier specified for parameter type
 (19:0) : error : syntax error at 'const'

 kernel build error:

 kernel source:
 1: // from SpatialConvolutionMM.cu:
 2: 
 3: // CL: grid stride looping
 4: #define CL_KERNEL_LOOP(i, n)                        \
 5:   for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \
 6:       i < (n);                                       \
 7:       i += get_local_size(0) * get_num_groups(0))
 8: 
 9: //#define gPadding 0
 10: //#define gStride 1
 11: //#define gColSize 1
 12: //#define gFilterSize 1
 13: //#define gSize 1
 14: 
 15: // Kernel for fast unfold+copy
 16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu)
 17: kernel void im2col(
 18:     const int n,
 19:     global float const * im_data, int im_offset,
 20:     global float* data_col) {
 21:   global const float *data_im = im_data + im_offset;
 22: 
 23:   CL_KERNEL_LOOP(index, n) {
 24:     int w_out = index % 1;
 25:     index /= 1;
 26:     int h_out = index % 1;
 27:     int channel_in = index / 1;
 28:     int channel_out = channel_in * 1 * 1;
 29:     int h_in = h_out * 1 - 0;
 30:     int w_in = w_out * 1 - 0;
 31:     data_col += (channel_out * 1 + h_out) * 1 + w_out;
 32:     data_im += (channel_in * 1 + h_in) * 1 + w_in;
 33:     for (int i = 0; i < 1; ++i) {
 34:       for (int j = 0; j < 1; ++j) {
 35:         int h = h_in + i;
 36:         int w = w_in + j;
 37:         *data_col = (h >= 0 && w >= 0 && h < 1 && w < 1) ?
 38:           data_im[i * 1 + j] : 0;
 39:         data_col += 1 * 1;
 40:       }
 41:     }
 42:   }
 43: }
 44: 
 45: kernel void col2im(
 46:     const int n,
 47:     global float const *data_col,
 48:     global float* im_data, int im_offset) {
 49:   global float *data_im = im_data + im_offset;
 50: 
 51:   for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) {
 52:     float val = 0;
 53:     int w = index % 1 + 0;
 54:     int h = (index / 1) % 1 + 0;
 55:     int c = index / (1 * 1);
 56:     // compute the start and end of the output
 57:     int w_col_start = (w < 1) ? 0 : (w - 1) / 1 + 1;
 58:     int w_col_end = min(w / 1 + 1, 1);
 59:     int h_col_start = (h < 1) ? 0 : (h - 1) / 1 + 1;
 60:     int h_col_end = min(h / 1 + 1, 1);
 61: 
 62:     int offset = (c * 1 * 1 + h * 1 + w) * 1 * 1;
 63:     int coeff_h_col = (1 - 1 * 1 * 1) * 1;
 64:     int coeff_w_col = (1 - 1 * 1 * 1);
 65:     for (int h_col = h_col_start; h_col < h_col_end; ++h_col) {
 66:       for (int w_col = w_col_start; w_col < w_col_end; ++w_col) {
 67:         val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col];
 68:       }
 69:     }
 70:     data_im[index] = val;
 71:   }
 72: }
 73: 
 74: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 ForwardIm2Col.cl build log: 
 (19:0) : error : invalid global address space qualifier specified for parameter type
 (19:0) : error : syntax error at 'const'

 ForwardAuto: kernel 7 this instance cant be used: 
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 ForwardIm2Col.cl build log: 
 (19:0) : error : invalid global address space qualifier specified for parameter type
 (19:0) : error : syntax error at 'const'

   forward kernel 0: cannot be used
   forward kernel 1: cannot be used
   forward kernel 2: cannot be used
   forward kernel 3: cannot be used
   forward kernel 4: cannot be used
   forward kernel 5: cannot be used
   forward kernel 6: cannot be used
   forward kernel 7: cannot be used
 clblas teardown
 unknown file: Failure
 C++ exception with description "No valid forward implementations found" thrown in the test body.
 [  FAILED  ] testlogicaloperators.Convolve_1layerbiased_Or (193 ms)
 [ RUN      ] testlogicaloperators.Convolve_2layers_relu_Xor
 Xor, convolve
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // expected defines:
 8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ]
 9: 
 10: #ifdef TANH
 11:     #define ACTIVATION_FUNCTION(output) (tanh(output))
 12: #elif defined SCALEDTANH
 13:     #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output))
 14: #elif SIGMOID
 15:     #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output)))
 16: #elif defined RELU
 17:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0)
 18: #elif defined ELU
 19:     #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1)
 20: #elif defined LINEAR
 21:     #define ACTIVATION_FUNCTION(output) (output)
 22: #endif
 23: 
 24: #ifdef ACTIVATION_FUNCTION // protect against not defined
 25: kernel void activate(const int N, global float *inout) {
 26:     const int globalId = get_global_id(0);
 27:     if (globalId >= N) {
 28:         return;
 29:     }
 30:     inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]);
 31: }
 32: #endif
 33: 
 34: #ifdef ACTIVATION_FUNCTION // protect against not defined
 35: kernel void forwardNaive(const int N, global float *out, global const float *in) {
 36:     const int globalId = get_global_id(0);
 37:     if (globalId >= N) {
 38:         return;
 39:     }
 40:     out[globalId] = ACTIVATION_FUNCTION(in[globalId]);
 41: }
 42: #endif
 43: 
 44: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU"

 clblas teardown
 unknown file: Failure
 C++ exception with description "
 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/activate.cl build log: 
 error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU"
 " thrown in the test body.
 [  FAILED  ] testlogicaloperators.Convolve_2layers_relu_Xor (85 ms)
 [----------] 3 tests from testlogicaloperators (460 ms total)

 [----------] 12 tests from testbackward
 [ RUN      ] testbackward.squareloss
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 layer 0:InputLayer{ outputPlanes=3 outputSize=5 }
 layer 1:ForceBackpropLayer{ outputPlanes=3 outputSize=5 }
 layer 2:SquareLossLayer{}

 inputtotalsize=2400 outputTotalSize=2400
 layer 0:InputLayer{ outputPlanes=3 outputSize=5 }
 layer 1:ForceBackpropLayer{ outputPlanes=3 outputSize=5 }
 layer 2:SquareLossLayer{}
 Parameters overview: (skipping 3 layers with 0 params)
 TOTAL  : params=0
 idx=44 predicted losschange=-0.000912508 actual=-0.000976562
 idx=2245 predicted losschange=0.00785823 actual=0.00805664
 idx=648 predicted losschange=0.00965759 actual=0.00976562
 idx=586 predicted losschange=0.0136895 actual=0.0136719
 idx=730 predicted losschange=0.00117897 actual=0.00146484
 idx=611 predicted losschange=0.00152302 actual=0.00195312
 idx=1130 predicted losschange=0.0159167 actual=0.0161133
 idx=15 predicted losschange=0.0434798 actual=0.0439453
 idx=1923 predicted losschange=-0.00790002 actual=-0.0078125
 idx=670 predicted losschange=0.0335141 actual=0.0336914
 [       OK ] testbackward.squareloss (64 ms)
 [ RUN      ] testbackward.crossentropyloss
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 layer 0:InputLayer{ outputPlanes=3 outputSize=5 }
 layer 1:ForceBackpropLayer{ outputPlanes=3 outputSize=5 }
 layer 2:Layer{}

 inputtotalsize=300 outputTotalSize=300
 layer 0:InputLayer{ outputPlanes=3 outputSize=5 }
 layer 1:ForceBackpropLayer{ outputPlanes=3 outputSize=5 }
 layer 2:Layer{}
 Parameters overview: (skipping 3 layers with 0 params)
 TOTAL  : params=0
 idx=44 predicted losschange=0.000274935 actual=0.000274658
 idx=145 predicted losschange=-0.000885784 actual=-0.00088501
 idx=48 predicted losschange=-0.000859834 actual=-0.000854492
 idx=286 predicted losschange=0.00713042 actual=0.00717163
 idx=130 predicted losschange=-0.000264829 actual=-0.000244141
 idx=11 predicted losschange=-1.98163e-05 actual=0
 idx=230 predicted losschange=-0.000594819 actual=-0.000610352
 idx=15 predicted losschange=-0.0006499 actual=-0.000640869
 idx=123 predicted losschange=-0.000846121 actual=-0.000823975
 idx=70 predicted losschange=0.000790196 actual=0.000793457
 [       OK ] testbackward.crossentropyloss (53 ms)
 [ RUN      ] testbackward.softmaxloss
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 layer 0:InputLayer{ outputPlanes=5 outputSize=1 }
 layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 }
 layer 2:SoftMaxLayer{ perPlane=0 numPlanes=5 imageSize=1 }

 inputtotalsize=10 outputTotalSize=10
 layer 0:InputLayer{ outputPlanes=5 outputSize=1 }
 layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 }
 layer 2:SoftMaxLayer{ perPlane=0 numPlanes=5 imageSize=1 }
 Parameters overview: (skipping 3 layers with 0 params)
 TOTAL  : params=0
 idx=4 predicted losschange=0.000113075 actual=0.00011301
 idx=5 predicted losschange=0.000145627 actual=0.000145674
 idx=8 predicted losschange=3.16699e-05 actual=3.19481e-05
 idx=6 predicted losschange=4.89271e-06 actual=5.24521e-06
 idx=0 predicted losschange=2.29469e-05 actual=2.28882e-05
 idx=1 predicted losschange=-8.26119e-05 actual=-8.27312e-05
 idx=0 predicted losschange=2.29469e-05 actual=2.28882e-05
 idx=5 predicted losschange=0.000145627 actual=0.000145674
 idx=3 predicted losschange=-5.50179e-05 actual=-5.50747e-05
 idx=0 predicted losschange=2.29469e-05 actual=2.28882e-05
 [       OK ] testbackward.softmaxloss (50 ms)
 [ RUN      ] testbackward.squareloss2
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 layer 0:InputLayer{ outputPlanes=5 outputSize=1 }
 layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 }
 layer 2:SquareLossLayer{}

 layer 0:InputLayer{ outputPlanes=5 outputSize=1 }
 layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 }
 layer 2:SquareLossLayer{}

 batchSize: 32
 inputtotalsize=160 outputTotalSize=160
 layer SquareLossLayer{}
 layer 0:InputLayer{ outputPlanes=5 outputSize=1 }
 layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 }
 layer 2:SquareLossLayer{}
 Parameters overview: (skipping 3 layers with 0 params)
 TOTAL  : params=0
 idx=44 predicted losschange=0.000126406 actual=0.000125885
 idx=5 predicted losschange=0.00461891 actual=0.00464439
 idx=8 predicted losschange=0.000356787 actual=0.000356674
 idx=106 predicted losschange=0.00716324 actual=0.00719643
 idx=90 predicted losschange=0.000474759 actual=0.000480652
 idx=131 predicted losschange=0.000979017 actual=0.000984192
 idx=10 predicted losschange=0.000660134 actual=0.000663757
 idx=15 predicted losschange=0.00961313 actual=0.00965118
 idx=3 predicted losschange=0.00264732 actual=0.00267029
 idx=30 predicted losschange=0.00865312 actual=0.00868607
 [       OK ] testbackward.squareloss2 (60 ms)
 [ RUN      ] testbackward.crossentropy2
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 layer 0:InputLayer{ outputPlanes=5 outputSize=1 }
 layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 }
 layer 2:Layer{}

 layer 0:InputLayer{ outputPlanes=5 outputSize=1 }
 layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 }
 layer 2:Layer{}

 batchSize: 2
 inputtotalsize=10 outputTotalSize=10
 layer Layer{}
 layer 0:InputLayer{ outputPlanes=5 outputSize=1 }
 layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 }
 layer 2:Layer{}
 Parameters overview: (skipping 3 layers with 0 params)
 TOTAL  : params=0
 idx=4 predicted losschange=0.00258649 actual=nan
 idx=5 predicted losschange=0.0227095 actual=nan
 idx=8 predicted losschange=-0.00202714 actual=nan
 idx=6 predicted losschange=-0.000846508 actual=nan
 idx=0 predicted losschange=-0.000424821 actual=nan
 idx=1 predicted losschange=-0.00171216 actual=nan
 idx=0 predicted losschange=-0.000424821 actual=nan
 idx=5 predicted losschange=0.0227095 actual=nan
 idx=3 predicted losschange=0.0123444 actual=nan
 idx=0 predicted losschange=-0.000424821 actual=nan
 [       OK ] testbackward.crossentropy2 (21 ms)
 [ RUN      ] testbackward.softmax2
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 layer 0:InputLayer{ outputPlanes=5 outputSize=1 }
 layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 }
 layer 2:SoftMaxLayer{ perPlane=0 numPlanes=5 imageSize=1 }

 layer 0:InputLayer{ outputPlanes=5 outputSize=1 }
 layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 }
 layer 2:SoftMaxLayer{ perPlane=0 numPlanes=5 imageSize=1 }

 batchSize: 2
 inputtotalsize=10 outputTotalSize=10
 layer SoftMaxLayer{ perPlane=0 numPlanes=5 imageSize=1 }
 layer 0:InputLayer{ outputPlanes=5 outputSize=1 }
 layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 }
 layer 2:SoftMaxLayer{ perPlane=0 numPlanes=5 imageSize=1 }
 Parameters overview: (skipping 3 layers with 0 params)
 TOTAL  : params=0
 idx=4 predicted losschange=0.00035729 actual=0.000357628
 idx=5 predicted losschange=0.0015055 actual=0.00151086
 idx=8 predicted losschange=-5.63632e-05 actual=-5.65052e-05
 idx=6 predicted losschange=-1.48864e-05 actual=-1.4782e-05
 idx=0 predicted losschange=1.96542e-05 actual=1.95503e-05
 idx=1 predicted losschange=-0.000287167 actual=-0.000287056
 idx=0 predicted losschange=1.96542e-05 actual=1.95503e-05
 idx=5 predicted losschange=0.0015055 actual=0.00151086
 idx=3 predicted losschange=-0.000152824 actual=-0.00014782
 idx=0 predicted losschange=1.96542e-05 actual=1.95503e-05
 [       OK ] testbackward.softmax2 (20 ms)
 [ RUN      ] testbackward.conv1
 Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
 Trying for OpenCL-enabled CPU
 Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
 Using OpenCL device: Vivante OpenCL Device
 initializing clblas
 layer 0:InputLayer{ outputPlanes=2 outputSize=4 }
 layer 1:ForceBackpropLayer{ outputPlanes=2 outputSize=4 }
 layer 2:ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=4 numFilters=2 filterSize=3 outputSize=2 padZeros=0 biased=0 skip=0} }
 layer 3:SquareLossLayer{}

 layer 0:InputLayer{ outputPlanes=2 outputSize=4 }
 layer 1:ForceBackpropLayer{ outputPlanes=2 outputSize=4 }
 layer 2:ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=4 numFilters=2 filterSize=3 outputSize=2 padZeros=0 biased=0 skip=0} }
 layer 3:SquareLossLayer{}

 batchSize: 4
 inputtotalsize=128 outputTotalSize=32
 layer ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=4 numFilters=2 filterSize=3 outputSize=2 padZeros=0 biased=0 skip=0} }
 forward try kernel 0
  ... not plausibly optimal, skipping
 forward try kernel 1
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 1: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // notes on non-odd filtersizes:
 8: // for odd, imagesize and filtersize 3, padZeros = 0:
 9: // output is a single square
 10: // m and n should vary between -1,0,1
 11: // for even, imagesize and filtersize 2, padzeros = 0
 12: // output is a single square, which we can position at topleft or bottomrigth
 13: // lets position it in bottomright
 14: // then m and n should vary as -1,0
 15: //
 16: // for even, imagesize and filtersize 2, padzeros = 1
 17: // output is 2 by 2
 18: // well... if it is even:
 19: // - if we are not padding zeros, then we simply move our filter around the image somehow
 20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1
 21: // filtersize remains the same
 22: //      m will vary as -1,0,1
 23: //       outputrow is fixed by globalid
 24: //       inputrow should be unchanged...
 25: // padzeros = 0:
 26: //  x x .  . . .
 27: //  x x .  . x x
 28: //  . . .  . x x
 29: // when filtersize even:
 30: //    new imagesize = oldimagesize - filtersize + 1
 31: // when filtersize odd:
 32: //    x x x .
 33: //    x x x .
 34: //    x x x .
 35: //    . . . .
 36: //    new imagesize = oldimagesize - filtersize + 1
 37: // padzeros = 1:
 38: // x x
 39: // x x . .   x x .    . . .     . . .
 40: //   . . .   x x .    . x x     . . .
 41: //   . . .   . . .    . x x     . . x x
 42: // outrow=0 outrow=1  outrow=2      x x
 43: // outcol=0 outcol=1  outcol=2    outrow=3
 44: //                                outcol=3
 45: // when filtersize is even, and padzeros, imagesize grows by 1 each time...
 46: //    imagesize = oldimagesize + 1
 47: // when filtersize is odd
 48: //  x x x
 49: //  x x x .   x x x    . . .
 50: //  x x x .   x x x    . x x x
 51: //    . . .   x x x    . x x x
 52: //                       x x x
 53: 
 54: // images are organized like [imageId][plane][row][col]
 55: // filters are organized like [filterid][inplane][filterrow][filtercol]
 56: // output are organized like [imageid][filterid][row][col]
 57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol]
 58: // - no local memory used currently
 59: // - each thread:
 60: //     - loads a whole upstream cube
 61: //     - loads a whole filter cube
 62: //     - writes one output...
 63: void kernel convolve_imagecubes_float2(
 64:     const int numExamples,
 65:       global const float *inputs, global const float *filters,
 66:     global float *output) {
 67:     int globalId = get_global_id(0);
 68: 
 69:     int outputImage2Id = globalId / gOutputSizeSquared;
 70:     int exampleId = outputImage2Id / gNumFilters;
 71:     int filterId = outputImage2Id % gNumFilters;
 72: 
 73:     // intraimage coords
 74:     int localid = globalId % gOutputSizeSquared;
 75:     int outputRow = localid / gOutputSize;
 76:     int outputCol = localid % gOutputSize;
 77: 
 78:     global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared;
 79:     global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared;
 80: 
 81:     float sum = 0;
 82:     if (exampleId < numExamples) {
 83:         for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) {
 84:             global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared;
 85:             global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared;
 86:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 87:                 // trying to reduce register pressure...
 88:                 #if gPadZeros == 1
 89:                     #define inputRowIdx (outputRow + u)
 90:                 #else
 91:                     #define inputRowIdx (outputRow + u + gHalfFilterSize)
 92:                 #endif
 93:                 global float const *inputRow = inputPlane + inputRowIdx * gInputSize;
 94:                 global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 95:                 bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize;
 96:                 #pragma unroll
 97:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 98:                     #if gPadZeros == 1
 99:                         #define inputColIdx (outputCol + v)
 100:                     #else
 101:                         #define inputColIdx (outputCol + v + gHalfFilterSize)
 102:                     #endif
 103:                     bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize;
 104:                     if (process) {
 105:                             sum += inputRow[inputColIdx] * filterRow[v];
 106:                     }
 107:                 }
 108:             }
 109:         }
 110:     }
 111: 
 112:     if (exampleId < numExamples) {
 113:         output[globalId] = sum;
 114:     }
 115: }
 116: 
 117: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward1.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 2
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, const int N) {
 8:     int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize;
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * gWorkgroupSize + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [outplane]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [imageid][filterid][row][col]
 26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB
 27: //                            eg 5 * 5 * 32 * 4 = 3.2KB => ok :-)
 28: //                           but 28 * 28 * 32 * 4 = 100KB => less good :-P
 29: void kernel forward_2_by_outplane(
 30:         const int batchSize,
 31:         global const float *images, global const float *filters,
 32:         global float *output,
 33:         local float *_inputPlane, local float *_filterCube) {
 34:     const int globalId = get_global_id(0);
 35: 
 36:     const int workgroupId = get_group_id(0);
 37:     const int workgroupSize = get_local_size(0);
 38:     const int outPlane = workgroupId;
 39: 
 40:     const int localId = get_local_id(0);
 41:     const int outputRow = localId / gOutputSize;
 42:     const int outputCol = localId % gOutputSize;
 43: 
 44:     #if gPadZeros == 1
 45:         const int minu = max(-gHalfFilterSize, -outputRow);
 46:         const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven;
 47:         const int minv = max(-gHalfFilterSize, -outputCol);
 48:         const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven;
 49:     #else
 50:         const int minu = -gHalfFilterSize;
 51:         const int maxu = gHalfFilterSize - gEven;
 52:         const int minv = -gHalfFilterSize;
 53:         const int maxv = gHalfFilterSize - gEven;
 54:     #endif
 55: 
 56:     {
 57:         const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 58:         copyLocal(_filterCube,
 59:                 filters + outPlane * filterCubeLength,
 60:                 filterCubeLength);
 61:     }
 62:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 63: 
 64:     for (int n = 0; n < batchSize; n++) {
 65:         float sum = 0;
 66:         for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 67:             barrier(CLK_LOCAL_MEM_FENCE);
 68:             copyLocal(_inputPlane,
 69:                        images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared,
 70:                        gInputSizeSquared);
 71:             barrier(CLK_LOCAL_MEM_FENCE);
 72:             int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 73:             if (localId < gOutputSizeSquared) {
 74:                 for (int u = minu; u <= maxu; u++) {
 75:                     int inputRow = outputRow + u;
 76:                     #if gPadZeros == 0
 77:                          inputRow += gHalfFilterSize;
 78:                     #endif
 79:                     int inputimagerowoffset = inputRow * gInputSize;
 80:                     int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 81:                     for (int v = minv; v <= maxv; v++) {
 82:                         int inputCol = outputCol + v;
 83:                         #if gPadZeros == 0
 84:                              inputCol += gHalfFilterSize;
 85:                         #endif
 86:                         sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 87:                     }
 88:                 }
 89:             }
 90:         }
 91:         // output are organized like [imageid][filterid][row][col]
 92:         int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 93:         if (localId < gOutputSizeSquared) {
 94:             output[resultIndex ] = sum;
 95:         }
 96:     }
 97: }
 98: #endif
 99: 
 100: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

 ForwardAuto: kernel 2: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, const int N) {
 8:     int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize;
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * gWorkgroupSize + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [outplane]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [imageid][filterid][row][col]
 26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB
 27: //                            eg 5 * 5 * 32 * 4 = 3.2KB => ok :-)
 28: //                           but 28 * 28 * 32 * 4 = 100KB => less good :-P
 29: void kernel forward_2_by_outplane(
 30:         const int batchSize,
 31:         global const float *images, global const float *filters,
 32:         global float *output,
 33:         local float *_inputPlane, local float *_filterCube) {
 34:     const int globalId = get_global_id(0);
 35: 
 36:     const int workgroupId = get_group_id(0);
 37:     const int workgroupSize = get_local_size(0);
 38:     const int outPlane = workgroupId;
 39: 
 40:     const int localId = get_local_id(0);
 41:     const int outputRow = localId / gOutputSize;
 42:     const int outputCol = localId % gOutputSize;
 43: 
 44:     #if gPadZeros == 1
 45:         const int minu = max(-gHalfFilterSize, -outputRow);
 46:         const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven;
 47:         const int minv = max(-gHalfFilterSize, -outputCol);
 48:         const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven;
 49:     #else
 50:         const int minu = -gHalfFilterSize;
 51:         const int maxu = gHalfFilterSize - gEven;
 52:         const int minv = -gHalfFilterSize;
 53:         const int maxv = gHalfFilterSize - gEven;
 54:     #endif
 55: 
 56:     {
 57:         const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 58:         copyLocal(_filterCube,
 59:                 filters + outPlane * filterCubeLength,
 60:                 filterCubeLength);
 61:     }
 62:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 63: 
 64:     for (int n = 0; n < batchSize; n++) {
 65:         float sum = 0;
 66:         for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 67:             barrier(CLK_LOCAL_MEM_FENCE);
 68:             copyLocal(_inputPlane,
 69:                        images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared,
 70:                        gInputSizeSquared);
 71:             barrier(CLK_LOCAL_MEM_FENCE);
 72:             int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 73:             if (localId < gOutputSizeSquared) {
 74:                 for (int u = minu; u <= maxu; u++) {
 75:                     int inputRow = outputRow + u;
 76:                     #if gPadZeros == 0
 77:                          inputRow += gHalfFilterSize;
 78:                     #endif
 79:                     int inputimagerowoffset = inputRow * gInputSize;
 80:                     int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 81:                     for (int v = minv; v <= maxv; v++) {
 82:                         int inputCol = outputCol + v;
 83:                         #if gPadZeros == 0
 84:                              inputCol += gHalfFilterSize;
 85:                         #endif
 86:                         sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 87:                     }
 88:                 }
 89:             }
 90:         }
 91:         // output are organized like [imageid][filterid][row][col]
 92:         int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 93:         if (localId < gOutputSizeSquared) {
 94:             output[resultIndex ] = sum;
 95:         }
 96:     }
 97: }
 98: #endif
 99: 
 100: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward2.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32"

   ... not valid
 forward try kernel 3
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept: each workgroup handles convolving one input example with one filtercube
 8: // and writing out one single output plane
 9: //
 10: // workgroup id organized like: [imageid][outplane]
 11: // local id organized like: [outrow][outcol]
 12: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 13: // number workgroups = 32
 14: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 16: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 17: // output are organized like [imageid][filterid][row][col]
 18: void kernel forward_3_by_n_outplane(const int batchSize,
 19:       global const float *images, global const float *filters,
 20:     global float *output,
 21:     local float *_upstreamImage, local float *_filterCube) {
 22:     const int globalId = get_global_id(0);
 23: 
 24:     const int workgroupId = get_group_id(0);
 25:     const int workgroupSize = get_local_size(0);
 26:     const int n = workgroupId / gNumFilters;
 27:     const int outPlane = workgroupId % gNumFilters;
 28: 
 29:     const int localId = get_local_id(0);
 30:     const int outputRow = localId / gOutputSize;
 31:     const int outputCol = localId % gOutputSize;
 32: 
 33:     const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize;
 34:     const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow  - gEven) : gHalfFilterSize - gEven;
 35:     const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize;
 36:     const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven;
 37: 
 38:     const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 39: 
 40:     const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 41:     const int filterCubeGlobalOffset = outPlane * filterCubeLength;
 42:     const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize;
 43:     for (int i = 0; i < numPixelsPerThread; i++) {
 44:         int thisOffset = localId + i * workgroupSize;
 45:         if (thisOffset < filterCubeLength) {
 46:             _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset];
 47:         }
 48:     }
 49:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 50: 
 51:     float sum = 0;
 52:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 53:         int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 54:         barrier(CLK_LOCAL_MEM_FENCE);
 55:         for (int i = 0; i < numUpstreamsPerThread; i++) {
 56:             int thisOffset = workgroupSize * i + localId;
 57:             if (thisOffset < gInputSizeSquared) {
 58:                 _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ];
 59:             }
 60:         }
 61:         barrier(CLK_LOCAL_MEM_FENCE);
 62:         int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 63:         for (int u = minu; u <= maxu; u++) {
 64:             int inputRow = outputRow + u;
 65:             #if gPadZeros == 0
 66:                 inputRow += gHalfFilterSize;
 67:             #endif
 68:             int inputimagerowoffset = inputRow * gInputSize;
 69:             int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 70:             for (int v = minv; v <= maxv; v++) {
 71:                 int inputCol = outputCol + v;
 72:                 #if gPadZeros == 0
 73:                     inputCol += gHalfFilterSize;
 74:                 #endif
 75:                 if (localId < gOutputSizeSquared) {
 76:                     sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 77:                 }
 78:             }
 79:         }
 80:     }
 81: 
 82:     // output are organized like [imageid][filterid][row][col]
 83:     int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 84:     if (localId < gOutputSizeSquared) {
 85:         output[resultIndex ] = sum;
 86:     }
 87: }
 88: 
 89: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 3: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: // concept: each workgroup handles convolving one input example with one filtercube
 8: // and writing out one single output plane
 9: //
 10: // workgroup id organized like: [imageid][outplane]
 11: // local id organized like: [outrow][outcol]
 12: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 13: // number workgroups = 32
 14: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 16: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 17: // output are organized like [imageid][filterid][row][col]
 18: void kernel forward_3_by_n_outplane(const int batchSize,
 19:       global const float *images, global const float *filters,
 20:     global float *output,
 21:     local float *_upstreamImage, local float *_filterCube) {
 22:     const int globalId = get_global_id(0);
 23: 
 24:     const int workgroupId = get_group_id(0);
 25:     const int workgroupSize = get_local_size(0);
 26:     const int n = workgroupId / gNumFilters;
 27:     const int outPlane = workgroupId % gNumFilters;
 28: 
 29:     const int localId = get_local_id(0);
 30:     const int outputRow = localId / gOutputSize;
 31:     const int outputCol = localId % gOutputSize;
 32: 
 33:     const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize;
 34:     const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow  - gEven) : gHalfFilterSize - gEven;
 35:     const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize;
 36:     const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven;
 37: 
 38:     const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize;
 39: 
 40:     const int filterCubeLength = gInputPlanes * gFilterSizeSquared;
 41:     const int filterCubeGlobalOffset = outPlane * filterCubeLength;
 42:     const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize;
 43:     for (int i = 0; i < numPixelsPerThread; i++) {
 44:         int thisOffset = localId + i * workgroupSize;
 45:         if (thisOffset < filterCubeLength) {
 46:             _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset];
 47:         }
 48:     }
 49:     // dont need a barrier, since we'll just run behind the barrier from the upstream image download
 50: 
 51:     float sum = 0;
 52:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 53:         int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared;
 54:         barrier(CLK_LOCAL_MEM_FENCE);
 55:         for (int i = 0; i < numUpstreamsPerThread; i++) {
 56:             int thisOffset = workgroupSize * i + localId;
 57:             if (thisOffset < gInputSizeSquared) {
 58:                 _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ];
 59:             }
 60:         }
 61:         barrier(CLK_LOCAL_MEM_FENCE);
 62:         int filterImageOffset = upstreamPlane * gFilterSizeSquared;
 63:         for (int u = minu; u <= maxu; u++) {
 64:             int inputRow = outputRow + u;
 65:             #if gPadZeros == 0
 66:                 inputRow += gHalfFilterSize;
 67:             #endif
 68:             int inputimagerowoffset = inputRow * gInputSize;
 69:             int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 70:             for (int v = minv; v <= maxv; v++) {
 71:                 int inputCol = outputCol + v;
 72:                 #if gPadZeros == 0
 73:                     inputCol += gHalfFilterSize;
 74:                 #endif
 75:                 if (localId < gOutputSizeSquared) {
 76:                     sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ];
 77:                 }
 78:             }
 79:         }
 80:     }
 81: 
 82:     // output are organized like [imageid][filterid][row][col]
 83:     int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId;
 84:     if (localId < gOutputSizeSquared) {
 85:         output[resultIndex ] = sum;
 86:     }
 87: }
 88: 
 89: 


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward3.cl build log: 
 error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

   ... not valid
 forward try kernel 4
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 kernel build error:

 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, int N) {
 8:     int numLoops = (N + get_local_size(0) - 1) / get_local_size(0);
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * get_local_size(0) + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [n][filterid]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [n][filterid][outrow][outcol]
 26: // the pixels per thread thing... :
 27: // - we have one thread (~= cuda core) per output value,
 28: //   ie one thread for each combination of [outrow][outcol]
 29: // - however, the number of threads is typically limited on a gpu,
 30: //   eg to 512 (eg Intel HD), or 1024 (eg nVidia K520)
 31: // - so what happens if the number of output points is larger than
 32: //   the maximum workgroup size?
 33: // - then we have several possibilities really:
 34: //   - we can divide the image into blocks, and process each block
 35: //     separately.  This is probably a good option, but fair amount of
 36: //     work
 37: //   - we can get each thread to handle more than one output
 38: //     pixel, by looping
 39: //   - we can consider the output image in 1d, by putting the rows
 40: //     one after another, and assign each contiguous workgroup-size
 41: //     block to one workgroup
 42: //     => this is how this kernel works
 43: //     basically, it's a hack, so larger images actually run, without
 44: //     crashing, and we can probably improve it a lot :-)
 45: //
 46: // So, when outputSize * outputSize > workgroupSize, then
 47: // multiple workgroups will be created for each output plane
 48: // the number of such workgroups is given by: `gPixelsPerThread`
 49: // the id of our workgroup within such a set of workgroups is calculated
 50: // as `pixel`
 51: // effectiveLocalId is our local id if we had one enormous workgroup
 52: // containing the whole output image plane
 53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize,
 54:       global const float *images, global const float *filters,
 55:     global float *output,
 56:     local float *_inputPlane, local float *_filterPlane) {
 57:     #define globalId (get_global_id(0))
 58: 
 59:     #define localId (get_local_id(0))
 60:     #define workgroupId (get_group_id(0))
 61: //    const int workgroupSize = get_local_size(0);
 62:     const int effectiveWorkgroupId = workgroupId / gPixelsPerThread;
 63:     const int pixel = workgroupId % gPixelsPerThread;
 64:     const int effectiveLocalId = localId + pixel * gWorkgroupSize;
 65:     const int n = effectiveWorkgroupId / gNumFilters;
 66:     const int outPlane = effectiveWorkgroupId % gNumFilters;
 67: 
 68:     const int outputRow = effectiveLocalId / gOutputSize;
 69:     const int outputCol = effectiveLocalId % gOutputSize;
 70: 
 71:     float sum = 0;
 72:     for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) {
 73:         barrier(CLK_LOCAL_MEM_FENCE);
 74:         copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared);
 75:         copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared);
 76:         barrier(CLK_LOCAL_MEM_FENCE);
 77: 
 78:         if (effectiveLocalId < gOutputSizeSquared) {
 79:             for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) {
 80:                 // trying to reduce register pressure...
 81:                 #if gPadZeros == 1
 82:                     #define inputRow (outputRow + u)
 83:                 #else
 84:                     #define inputRow (outputRow + u + gHalfFilterSize)
 85:                 #endif
 86:                 int inputimagerowoffset = inputRow * gInputSize;
 87:                 int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize;
 88:                 bool rowOk = inputRow >= 0 && inputRow < gInputSize;
 89:                 for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) {
 90:                     #if gPadZeros == 1
 91:                         #define inputCol (outputCol + v)
 92:                     #else
 93:                         #define inputCol (outputCol + v + gHalfFilterSize)
 94:                     #endif
 95:                     bool process = rowOk && inputCol >= 0 && inputCol < gInputSize;
 96:                     if (process) {
 97:                             sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ];
 98:                     }
 99:                 }
 100:             }
 101:         }
 102:     }
 103:     // output are organized like [imageid][filterid][row][col]
 104:     #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId)
 105:     if (effectiveLocalId < gOutputSizeSquared) {
 106:         output[resultIndex ] = sum;
 107:     }
 108: }
 109: #endif
 110: 
 111: 
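The index arithmetic described in the forward4 kernel comments above can be sketched on the host in plain C. This is only an illustration, not DeepCL code: `PixelMapping` and `map_work_item` are hypothetical names, and the parameters stand in for the `-D` constants (`gWorkgroupSize`, `gPixelsPerThread`, `gNumFilters`, `gOutputSize`) that the kernel receives at build time.

```c
/* Hypothetical helper, not part of DeepCL: mirrors the index arithmetic
   at the top of forward_4_by_n_outplane_smallercache. */
typedef struct {
    int n;          /* image index within the batch */
    int outPlane;   /* filter (output plane) index */
    int outputRow;  /* row of the output pixel this work-item computes */
    int outputCol;  /* column of the output pixel this work-item computes */
    int active;     /* 1 if the work-item maps to a real output pixel */
} PixelMapping;

PixelMapping map_work_item(int workgroupId, int localId,
                           int workgroupSize, int pixelsPerThread,
                           int numFilters, int outputSize) {
    PixelMapping m;
    /* Several consecutive workgroups share one [n][filterid] output plane. */
    int effectiveWorkgroupId = workgroupId / pixelsPerThread;
    /* Position of this workgroup within that set of workgroups. */
    int pixel = workgroupId % pixelsPerThread;
    /* Local id as if one enormous workgroup covered the whole plane. */
    int effectiveLocalId = localId + pixel * workgroupSize;

    m.n = effectiveWorkgroupId / numFilters;
    m.outPlane = effectiveWorkgroupId % numFilters;
    m.outputRow = effectiveLocalId / outputSize;
    m.outputCol = effectiveLocalId % outputSize;
    /* Work-items past the end of the plane do no work and write nothing. */
    m.active = effectiveLocalId < outputSize * outputSize;
    return m;
}
```

With the values from the build log (workgroup size 32, `gPixelsPerThread=1`, 2 filters, 2x2 output), `map_work_item(3, 3, 32, 1, 2, 2)` maps to image 1, output plane 1, row 1, column 1. Work-items with local id 4 and above are inactive, because the plane holds only 4 output values.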


 Something went wrong with clCreateKernel, OpenCL erorr code -45
 cl/forward4.cl build log: 
 error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0"

 ForwardAuto: kernel 4: this instance cant be used: 
 kernel source:
 1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail
 2: //
 3: // This Source Code Form is subject to the terms of the Mozilla Public License,
 4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can
 5: // obtain one at http://mozilla.org/MPL/2.0/.
 6: 
 7: void copyLocal(local float *target, global float const *source, int N) {
 8:     int numLoops = (N + get_local_size(0) - 1) / get_local_size(0);
 9:     for (int loop = 0; loop < numLoops; loop++) {
 10:         int offset = loop * get_local_size(0) + get_local_id(0);
 11:         if (offset < N) {
 12:             target[offset] = source[offset];
 13:         }
 14:     }
 15: }
 16: 
 17: #ifdef gOutputSize // for previous tests that dont define it
 18: // workgroup id organized like: [n][filterid]
 19: // local id organized like: [outrow][outcol]
 20: // each thread iterates over: [upstreamplane][filterrow][filtercol]
 21: // number workgroups = 32
 22: // one filter plane takes up 5 * 5 * 4 = 100 bytes
 23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok)
 24: // all filter cubes = 3.2KB * 32 = 102KB (too big)
 25: // output are organized like [n][filterid][outrow][outcol]
 26: // the pixels per thread thing... :
 27: // - we have one thread (~= cuda core) per output value,
 28: //   ie one thread for each combination of [outrow][outcol]
 29: // - however, the number of threads is typically limited on a gpu,
 30: //   eg to 512 (eg Intel HD), or 1024 (eg nVidia K520)
 31: // - so what happens if the number of output points is larger than
 32: //   the maximum workgroup size?
 33: // - then we have several possibilities really:
 34: //   - we can divide the image into blocks, and process each block
 35: //     separately.  This is probably a good option, but fair amount of
 36: //     work
 37: //   - we can get each thread to handle more than one output
 38: //     pixel, by looping
 39: //   - we can consider the output image in 1d, by putting the rows
 40: //     one after another, and assign each contiguous workgroup-size
 41: //     block to one workgroup
 42: //     => this is how this kernel works
 43: //     basically, it's a hack, so larger images actually run, without
 44: //     crashing, and we can probably improve it a lot :-)
 45: //
 46: // So, when outputSize * outputSize > workgroupSize, then
 47: // multiple workgroups will be created for each output plane
 48: // the number of such workgroups is given by: `gPixelsPerThread`
 49: // the id of our workgroup within such a set of workgroups is calculated
 50: // as `pixel`
 51: // effectiveLocalId is our local id if we had one enormous workgroup
 52: // containing the whole output image plane
 53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize,
 54:       global const float *images, global const float *filters,
 55:     global float *output,
 56:     local float *_inputPlane, local float *_filterPlane) {
 57:     #define globalId (get_global_id(0))
 58: 
 59:     #define localId (get_local_id(0))
 60:     #define workgroupId (get_group_id(0))
 61: //    const int workgroupSize = get_local_size(0);
 62:     const int effectiveWorkgroupId = workgroupId / gPixelsPerThread;
 63:     const int pixel = workgroupId % gPixelsPerThread;
 64:     const int effectiveLocalId = localId + pixel * gWorkgroupSize;
 65:     const int n = effectiveWorkgroupId / gNumFilters;
 66:     const int outPlane = effectiveWorkgroupId % gNumFilters;
 67: 
 68:     const int outputRow = effectiveLocalId / gOutputSize;
 69:     const int outputCol = effectiveLocalId % gOutputSize;
 70: 
 71:     float sum = 0;
 72:     for (int upstreamPlane = 0; upstreamP