Created July 29, 2016 11:56
This file has been truncated, but you can view the full file.
args: ./deepcl_unittests --gtest_filter=-DATA*:SLOW* | |
Note: Google Test filter = -DATA*:SLOW* | |
[==========] Running 158 tests from 29 test cases. | |
[----------] Global test environment set-up. | |
[----------] 7 tests from testClBlas | |
[ RUN ] testClBlas.basic | |
DEBUG TANGUY: 18200632Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description "clblasSgemm() failed with -11" thrown in the test body. | |
[ FAILED ] testClBlas.basic (77 ms) | |
[ RUN ] testClBlas.transA | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
1 2 9 | |
3 7 5 | |
initializing clblas | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description "clblasSgemm() failed with -11" thrown in the test body. | |
[ FAILED ] testClBlas.transA (52 ms) | |
[ RUN ] testClBlas.transB | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
3 | |
-1 | |
initializing clblas | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description "clblasSgemm() failed with -11" thrown in the test body. | |
[ FAILED ] testClBlas.transB (55 ms) | |
[ RUN ] testClBlas.colMajor | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description "clblasSgemm() failed with -11" thrown in the test body. | |
[ FAILED ] testClBlas.colMajor (50 ms) | |
[ RUN ] testClBlas.colMajor2 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description "clblasSgemm() failed with -11" thrown in the test body. | |
[ FAILED ] testClBlas.colMajor2 (48 ms) | |
[ RUN ] testClBlas.colMajorTransA | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description "clblasSgemm() failed with -11" thrown in the test body. | |
[ FAILED ] testClBlas.colMajorTransA (43 ms) | |
[ RUN ] testClBlas.colMajorTransB | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description "clblasSgemm() failed with -11" thrown in the test body. | |
[ FAILED ] testClBlas.colMajorTransB (51 ms) | |
[----------] 7 tests from testClBlas (377 ms total) | |
[----------] 1 test from testDeepCL | |
[ RUN ] testDeepCL.basic | |
unknown file: Failure | |
C++ exception with description "No devices found" thrown in the test body. | |
[ FAILED ] testDeepCL.basic (0 ms) | |
[----------] 1 test from testDeepCL (0 ms total) | |
[----------] 23 tests from testupdateweights | |
[ RUN ] testupdateweights.conv1 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
layer 0:InputLayer{ outputPlanes=2 outputSize=5 } | |
layer 1:ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=5 numFilters=2 filterSize=3 outputSize=3 padZeros=0 biased=0 skip=0} } | |
layer 2:SquareLossLayer{} | |
layer 0:InputLayer{ outputPlanes=2 outputSize=5 } | |
layer 1:ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=5 numFilters=2 filterSize=3 outputSize=3 padZeros=0 biased=0 skip=0} } | |
layer 2:SquareLossLayer{} | |
batchSize: 4 | |
inputtotalsize=200 outputTotalSize=72 | |
layer ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=5 numFilters=2 filterSize=3 outputSize=3 padZeros=0 biased=0 skip=0} } | |
weightsize=36 biassize=0 | |
statefultimer v0.7 | |
forward try kernel 0 | |
... not plausibly optimal, skipping | |
forward try kernel 1 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 1: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 2 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, const int N) { | |
8: int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize; | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * gWorkgroupSize + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [outplane] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [imageid][filterid][row][col] | |
26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB | |
27: // eg 5 * 5 * 32 * 4 = 3.2KB => ok :-) | |
28: // but 28 * 28 * 32 * 4 = 100KB => less good :-P | |
29: void kernel forward_2_by_outplane( | |
30: const int batchSize, | |
31: global const float *images, global const float *filters, | |
32: global float *output, | |
33: local float *_inputPlane, local float *_filterCube) { | |
34: const int globalId = get_global_id(0); | |
35: | |
36: const int workgroupId = get_group_id(0); | |
37: const int workgroupSize = get_local_size(0); | |
38: const int outPlane = workgroupId; | |
39: | |
40: const int localId = get_local_id(0); | |
41: const int outputRow = localId / gOutputSize; | |
42: const int outputCol = localId % gOutputSize; | |
43: | |
44: #if gPadZeros == 1 | |
45: const int minu = max(-gHalfFilterSize, -outputRow); | |
46: const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven; | |
47: const int minv = max(-gHalfFilterSize, -outputCol); | |
48: const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven; | |
49: #else | |
50: const int minu = -gHalfFilterSize; | |
51: const int maxu = gHalfFilterSize - gEven; | |
52: const int minv = -gHalfFilterSize; | |
53: const int maxv = gHalfFilterSize - gEven; | |
54: #endif | |
55: | |
56: { | |
57: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
58: copyLocal(_filterCube, | |
59: filters + outPlane * filterCubeLength, | |
60: filterCubeLength); | |
61: } | |
62: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
63: | |
64: for (int n = 0; n < batchSize; n++) { | |
65: float sum = 0; | |
66: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
67: barrier(CLK_LOCAL_MEM_FENCE); | |
68: copyLocal(_inputPlane, | |
69: images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, | |
70: gInputSizeSquared); | |
71: barrier(CLK_LOCAL_MEM_FENCE); | |
72: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
73: if (localId < gOutputSizeSquared) { | |
74: for (int u = minu; u <= maxu; u++) { | |
75: int inputRow = outputRow + u; | |
76: #if gPadZeros == 0 | |
77: inputRow += gHalfFilterSize; | |
78: #endif | |
79: int inputimagerowoffset = inputRow * gInputSize; | |
80: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
81: for (int v = minv; v <= maxv; v++) { | |
82: int inputCol = outputCol + v; | |
83: #if gPadZeros == 0 | |
84: inputCol += gHalfFilterSize; | |
85: #endif | |
86: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
87: } | |
88: } | |
89: } | |
90: } | |
91: // output are organized like [imageid][filterid][row][col] | |
92: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
93: if (localId < gOutputSizeSquared) { | |
94: output[resultIndex ] = sum; | |
95: } | |
96: } | |
97: } | |
98: #endif | |
99: | |
100: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
ForwardAuto: kernel 2: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, const int N) { | |
8: int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize; | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * gWorkgroupSize + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [outplane] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [imageid][filterid][row][col] | |
26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB | |
27: // eg 5 * 5 * 32 * 4 = 3.2KB => ok :-) | |
28: // but 28 * 28 * 32 * 4 = 100KB => less good :-P | |
29: void kernel forward_2_by_outplane( | |
30: const int batchSize, | |
31: global const float *images, global const float *filters, | |
32: global float *output, | |
33: local float *_inputPlane, local float *_filterCube) { | |
34: const int globalId = get_global_id(0); | |
35: | |
36: const int workgroupId = get_group_id(0); | |
37: const int workgroupSize = get_local_size(0); | |
38: const int outPlane = workgroupId; | |
39: | |
40: const int localId = get_local_id(0); | |
41: const int outputRow = localId / gOutputSize; | |
42: const int outputCol = localId % gOutputSize; | |
43: | |
44: #if gPadZeros == 1 | |
45: const int minu = max(-gHalfFilterSize, -outputRow); | |
46: const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven; | |
47: const int minv = max(-gHalfFilterSize, -outputCol); | |
48: const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven; | |
49: #else | |
50: const int minu = -gHalfFilterSize; | |
51: const int maxu = gHalfFilterSize - gEven; | |
52: const int minv = -gHalfFilterSize; | |
53: const int maxv = gHalfFilterSize - gEven; | |
54: #endif | |
55: | |
56: { | |
57: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
58: copyLocal(_filterCube, | |
59: filters + outPlane * filterCubeLength, | |
60: filterCubeLength); | |
61: } | |
62: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
63: | |
64: for (int n = 0; n < batchSize; n++) { | |
65: float sum = 0; | |
66: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
67: barrier(CLK_LOCAL_MEM_FENCE); | |
68: copyLocal(_inputPlane, | |
69: images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, | |
70: gInputSizeSquared); | |
71: barrier(CLK_LOCAL_MEM_FENCE); | |
72: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
73: if (localId < gOutputSizeSquared) { | |
74: for (int u = minu; u <= maxu; u++) { | |
75: int inputRow = outputRow + u; | |
76: #if gPadZeros == 0 | |
77: inputRow += gHalfFilterSize; | |
78: #endif | |
79: int inputimagerowoffset = inputRow * gInputSize; | |
80: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
81: for (int v = minv; v <= maxv; v++) { | |
82: int inputCol = outputCol + v; | |
83: #if gPadZeros == 0 | |
84: inputCol += gHalfFilterSize; | |
85: #endif | |
86: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
87: } | |
88: } | |
89: } | |
90: } | |
91: // output are organized like [imageid][filterid][row][col] | |
92: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
93: if (localId < gOutputSizeSquared) { | |
94: output[resultIndex ] = sum; | |
95: } | |
96: } | |
97: } | |
98: #endif | |
99: | |
100: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
... not valid | |
forward try kernel 3 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: each workgroup handles convolving one input example with one filtercube | |
8: // and writing out one single output plane | |
9: // | |
10: // workgroup id organized like: [imageid][outplane] | |
11: // local id organized like: [outrow][outcol] | |
12: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
13: // number workgroups = 32 | |
14: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
16: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
17: // output are organized like [imageid][filterid][row][col] | |
18: void kernel forward_3_by_n_outplane(const int batchSize, | |
19: global const float *images, global const float *filters, | |
20: global float *output, | |
21: local float *_upstreamImage, local float *_filterCube) { | |
22: const int globalId = get_global_id(0); | |
23: | |
24: const int workgroupId = get_group_id(0); | |
25: const int workgroupSize = get_local_size(0); | |
26: const int n = workgroupId / gNumFilters; | |
27: const int outPlane = workgroupId % gNumFilters; | |
28: | |
29: const int localId = get_local_id(0); | |
30: const int outputRow = localId / gOutputSize; | |
31: const int outputCol = localId % gOutputSize; | |
32: | |
33: const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize; | |
34: const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow - gEven) : gHalfFilterSize - gEven; | |
35: const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize; | |
36: const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven; | |
37: | |
38: const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
39: | |
40: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
41: const int filterCubeGlobalOffset = outPlane * filterCubeLength; | |
42: const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize; | |
43: for (int i = 0; i < numPixelsPerThread; i++) { | |
44: int thisOffset = localId + i * workgroupSize; | |
45: if (thisOffset < filterCubeLength) { | |
46: _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset]; | |
47: } | |
48: } | |
49: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
50: | |
51: float sum = 0; | |
52: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
53: int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: for (int i = 0; i < numUpstreamsPerThread; i++) { | |
56: int thisOffset = workgroupSize * i + localId; | |
57: if (thisOffset < gInputSizeSquared) { | |
58: _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ]; | |
59: } | |
60: } | |
61: barrier(CLK_LOCAL_MEM_FENCE); | |
62: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
63: for (int u = minu; u <= maxu; u++) { | |
64: int inputRow = outputRow + u; | |
65: #if gPadZeros == 0 | |
66: inputRow += gHalfFilterSize; | |
67: #endif | |
68: int inputimagerowoffset = inputRow * gInputSize; | |
69: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
70: for (int v = minv; v <= maxv; v++) { | |
71: int inputCol = outputCol + v; | |
72: #if gPadZeros == 0 | |
73: inputCol += gHalfFilterSize; | |
74: #endif | |
75: if (localId < gOutputSizeSquared) { | |
76: sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
77: } | |
78: } | |
79: } | |
80: } | |
81: | |
82: // output are organized like [imageid][filterid][row][col] | |
83: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
84: if (localId < gOutputSizeSquared) { | |
85: output[resultIndex ] = sum; | |
86: } | |
87: } | |
88: | |
89: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 3: this instance cant be used: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 4 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, int N) { | |
8: int numLoops = (N + get_local_size(0) - 1) / get_local_size(0); | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * get_local_size(0) + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [n][filterid] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [n][filterid][outrow][outcol] | |
26: // the pixels per thread thing... : | |
27: // - we have one thread (~= cuda core) per output value, | |
28: // ie one thread for each combination of [outrow][outcol] | |
29: // - however, the number of threads is typically limited on a gpu, | |
30: // eg to 512 (eg Intel HD), or 1024 (eg nVidia K520) | |
31: // - so what happens if the number of output points is larger than | |
32: // the maximum workgroup size? | |
33: // - then we have several possibilities really: | |
34: // - we can divide the image into blocks, and process each block | |
35: // separately. This is probably a good option, but fair amount of | |
36: // work | |
37: // - we can get each thread to handle more than one output | |
38: // pixel, by looping | |
39: // - we can consider the output image in 1d, by putting the rows | |
40: // one after another, and assign each contiguous workgroup-size | |
41: // block to one workgroup | |
42: // => this is how this kernel works | |
43: // basically, it's a hack, so larger images actually run, without | |
44: // crashing, and we can probably improve it a lot :-) | |
45: // | |
46: // So, when outputSize * outputSize > workgroupSize, then | |
47: // multiple workgroups will be created for each output plane | |
48: // the number of such workgroups is given by: `gPixelsPerThread` | |
49: // the id of our workgroup within such a set of workgroups is calculated | |
50: // as `pixel` | |
51: // effectiveLocalId is our local id if we had one enormous workgroup | |
52: // containing the whole output image plane | |
53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize, | |
54: global const float *images, global const float *filters, | |
55: global float *output, | |
56: local float *_inputPlane, local float *_filterPlane) { | |
57: #define globalId (get_global_id(0)) | |
58: | |
59: #define localId (get_local_id(0)) | |
60: #define workgroupId (get_group_id(0)) | |
61: // const int workgroupSize = get_local_size(0); | |
62: const int effectiveWorkgroupId = workgroupId / gPixelsPerThread; | |
63: const int pixel = workgroupId % gPixelsPerThread; | |
64: const int effectiveLocalId = localId + pixel * gWorkgroupSize; | |
65: const int n = effectiveWorkgroupId / gNumFilters; | |
66: const int outPlane = effectiveWorkgroupId % gNumFilters; | |
67: | |
68: const int outputRow = effectiveLocalId / gOutputSize; | |
69: const int outputCol = effectiveLocalId % gOutputSize; | |
70: | |
71: float sum = 0; | |
72: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
73: barrier(CLK_LOCAL_MEM_FENCE); | |
74: copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared); | |
75: copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared); | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: | |
78: if (effectiveLocalId < gOutputSizeSquared) { | |
79: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
80: // trying to reduce register pressure... | |
81: #if gPadZeros == 1 | |
82: #define inputRow (outputRow + u) | |
83: #else | |
84: #define inputRow (outputRow + u + gHalfFilterSize) | |
85: #endif | |
86: int inputimagerowoffset = inputRow * gInputSize; | |
87: int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
88: bool rowOk = inputRow >= 0 && inputRow < gInputSize; | |
89: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
90: #if gPadZeros == 1 | |
91: #define inputCol (outputCol + v) | |
92: #else | |
93: #define inputCol (outputCol + v + gHalfFilterSize) | |
94: #endif | |
95: bool process = rowOk && inputCol >= 0 && inputCol < gInputSize; | |
96: if (process) { | |
97: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ]; | |
98: } | |
99: } | |
100: } | |
101: } | |
102: } | |
103: // output are organized like [imageid][filterid][row][col] | |
104: #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId) | |
105: if (effectiveLocalId < gOutputSizeSquared) { | |
106: output[resultIndex ] = sum; | |
107: } | |
108: } | |
109: #endif | |
110: | |
111: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 4: this instance cant be used: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 5 | |
ForwardAuto: kernel 5: this instance cant be used: For ForwardFc, filtersize and inputimagesize must be identical | |
... not valid | |
forward try kernel 6 | |
cl/forward_byinputplane.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: | |
8: // - load same input plane from each image | |
9: // - hold filter plane for this input plane, for all filters | |
10: // - reduce afterwards | |
11: // local memory for one plane from each filter of 64c7 = 64 * 7 * 7 * 4 = 12.5KB | |
12: // local memory for one single input plane = 19 * 19 * 4 = 1.4KB | |
13: // => seems ok? | |
14: // workgroupid: [inputPlaneId] | |
15: // localid: [filterId][outRow] (if this is more than workgroupsize, we should reuse some threads...) | |
16: // iterate over: [n][outCol] | |
17: // output: [n][filterId][outRow][outCol][inputPlane] | |
18: // need to later reduce output over: [inputPlane] | |
19: void kernel forward_byinputplane(const int batchSize, | |
20: global const float *images, global const float *filters, | |
21: global float *output, | |
22: local float *_inputPlane, local float *_filterPlanes) { | |
23: // const int evenPadding = gFilterSize % 2 == 0 ? 1 : 0; | |
24: | |
25: const int globalId = get_global_id(0); | |
26: const int workgroupId = get_group_id(0); | |
27: const int workgroupSize = get_local_size(0); | |
28: const int localId = get_local_id(0); | |
29: | |
30: const int inputPlaneId = workgroupId; | |
31: const int numLoops = (gNumFilters * gOutputSize + workgroupSize - 1) / workgroupSize; | |
32: const int numFilterCopyLoops = (gFilterSizeSquared + gOutputSize - 1) / gOutputSize; | |
33: const int numImageCopyLoops = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
34: for (int loop = 0; loop < numLoops; loop++) { | |
35: const int loopLocalId = localId + loop * workgroupSize; | |
36: const int filterId = loopLocalId / gOutputSize; | |
37: const int outRow = loopLocalId % gOutputSize; | |
38: | |
39: // copy down our filter, we have gOutputSize threads to do this | |
40: global float const *globalFilterPlane = filters + | |
41: (filterId * gNumInputPlanes + inputPlaneId) * gFilterSizeSquared; | |
42: local float *_localFilterPlane = _filterPlanes + filterId * gFilterSizeSquared; | |
43: barrier(CLK_LOCAL_MEM_FENCE); | |
44: for (int i = 0; i < numFilterCopyLoops; i++) { | |
45: const int offset = i * gOutputSize + outRow; | |
46: bool process = filterId < gNumFilters && offset < gFilterSizeSquared; | |
47: if (process) { | |
48: _localFilterPlane[ offset ] = globalFilterPlane[ offset ]; | |
49: } | |
50: } | |
51: // loop over n ... | |
52: for (int n = 0; n < batchSize; n++) { | |
53: // copy down our imageplane, we have workgroupSize threads to do this | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: global float const *globalImagePlane = images + | |
56: (n * gNumInputPlanes + inputPlaneId) * gInputSizeSquared; | |
57: for (int i = 0; i< numImageCopyLoops; i++) { | |
58: const int offset = i * workgroupSize + localId; | |
59: if (offset < gInputSizeSquared) { | |
60: _inputPlane[ offset ] = globalImagePlane[ offset ]; | |
61: } | |
62: } | |
63: barrier(CLK_LOCAL_MEM_FENCE); | |
64: // calc output for each [outrow][outcol] | |
65: bool filterPlaneOk = filterId < gNumFilters; | |
66: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
67: float sum = 0; | |
68: for (int filterRow = 0; filterRow < gFilterSize; filterRow++) { | |
69: int inRow = outRow + filterRow; | |
70: #if gPadZeros == 1 | |
71: inRow -= gHalfFilterSize; | |
72: #endif | |
73: bool rowOk = filterPlaneOk && inRow >= 0 && inRow < gInputSize; | |
74: for (int filterCol = 0; filterCol < gFilterSize; filterCol++) { | |
75: int inCol = outCol + filterCol; | |
76: #if gPadZeros == 1 | |
77: inCol -= gHalfFilterSize; | |
78: #endif | |
79: bool process = rowOk && inCol >= 0 && inCol < gInputSize; | |
80: if (process) { | |
81: float imageValue = _inputPlane[ inRow * gInputSize + inCol ]; | |
82: float filterValue = _localFilterPlane[ filterRow * gFilterSize + filterCol ]; | |
83: sum += imageValue * filterValue; | |
84: } | |
85: } | |
86: } | |
87: if (filterId < gNumFilters) { | |
88: // [n][filterId][outRow][outCol][inputPlane] | |
89: int resultIndex = (( (n | |
90: * gNumFilters + filterId) | |
91: * gOutputSize + outRow) | |
92: * gOutputSize + outCol) | |
93: * gNumInputPlanes + inputPlaneId; | |
94: output[resultIndex] = sum; | |
95: //if (globalId == 2) output[0] = resultIndex; | |
96: // output[resultIndex] = outRow; | |
97: } | |
98: // output[localId] = _localFilterPlane[localId]; | |
99: } | |
100: } | |
101: } | |
102: } | |
103: | |
104: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward_byinputplane.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 6: this instance cant be used: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward_byinputplane.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 7 | |
... seems valid | |
ForwardIm2Col.cl build log: | |
(19:0) : error : invalid global address space qualifier specified for parameter type | |
(19:0) : error : syntax error at 'const' | |
kernel build error: | |
kernel source: | |
1: // from SpatialConvolutionMM.cu: | |
2: | |
3: // CL: grid stride looping | |
4: #define CL_KERNEL_LOOP(i, n) \ | |
5: for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \ | |
6: i < (n); \ | |
7: i += get_local_size(0) * get_num_groups(0)) | |
8: | |
9: //#define gPadding 0 | |
10: //#define gStride 1 | |
11: //#define gColSize 3 | |
12: //#define gFilterSize 3 | |
13: //#define gSize 5 | |
14: | |
15: // Kernel for fast unfold+copy | |
16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu) | |
17: kernel void im2col( | |
18: const int n, | |
19: global float const * im_data, int im_offset, | |
20: global float* data_col) { | |
21: global const float *data_im = im_data + im_offset; | |
22: | |
23: CL_KERNEL_LOOP(index, n) { | |
24: int w_out = index % 3; | |
25: index /= 3; | |
26: int h_out = index % 3; | |
27: int channel_in = index / 3; | |
28: int channel_out = channel_in * 3 * 3; | |
29: int h_in = h_out * 1 - 0; | |
30: int w_in = w_out * 1 - 0; | |
31: data_col += (channel_out * 3 + h_out) * 3 + w_out; | |
32: data_im += (channel_in * 5 + h_in) * 5 + w_in; | |
33: for (int i = 0; i < 3; ++i) { | |
34: for (int j = 0; j < 3; ++j) { | |
35: int h = h_in + i; | |
36: int w = w_in + j; | |
37: *data_col = (h >= 0 && w >= 0 && h < 5 && w < 5) ? | |
38: data_im[i * 5 + j] : 0; | |
39: data_col += 3 * 3; | |
40: } | |
41: } | |
42: } | |
43: } | |
44: | |
45: kernel void col2im( | |
46: const int n, | |
47: global float const *data_col, | |
48: global float* im_data, int im_offset) { | |
49: global float *data_im = im_data + im_offset; | |
50: | |
51: for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) { | |
52: float val = 0; | |
53: int w = index % 5 + 0; | |
54: int h = (index / 5) % 5 + 0; | |
55: int c = index / (5 * 5); | |
56: // compute the start and end of the output | |
57: int w_col_start = (w < 3) ? 0 : (w - 3) / 1 + 1; | |
58: int w_col_end = min(w / 1 + 1, 3); | |
59: int h_col_start = (h < 3) ? 0 : (h - 3) / 1 + 1; | |
60: int h_col_end = min(h / 1 + 1, 3); | |
61: | |
62: int offset = (c * 3 * 3 + h * 3 + w) * 3 * 3; | |
63: int coeff_h_col = (1 - 1 * 3 * 3) * 3; | |
64: int coeff_w_col = (1 - 1 * 3 * 3); | |
65: for (int h_col = h_col_start; h_col < h_col_end; ++h_col) { | |
66: for (int w_col = w_col_start; w_col < w_col_end; ++w_col) { | |
67: val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col]; | |
68: } | |
69: } | |
70: data_im[index] = val; | |
71: } | |
72: } | |
73: | |
74: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
ForwardIm2Col.cl build log: | |
(19:0) : error : invalid global address space qualifier specified for parameter type | |
(19:0) : error : syntax error at 'const' | |
ForwardAuto: kernel 7 this instance cant be used: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
ForwardIm2Col.cl build log: | |
(19:0) : error : invalid global address space qualifier specified for parameter type | |
(19:0) : error : syntax error at 'const' | |
forward kernel 0: cannot be used | |
forward kernel 1: cannot be used | |
forward kernel 2: cannot be used | |
forward kernel 3: cannot be used | |
forward kernel 4: cannot be used | |
forward kernel 5: cannot be used | |
forward kernel 6: cannot be used | |
forward kernel 7: cannot be used | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description "No valid forward implementations found" thrown in the test body. | |
[ FAILED ] testupdateweights.conv1 (147 ms) | |
[ RUN ] testupdateweights.conv1z | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
layer 0:InputLayer{ outputPlanes=2 outputSize=3 } | |
layer 1:ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=3 numFilters=2 filterSize=3 outputSize=3 padZeros=1 biased=0 skip=0} } | |
layer 2:SquareLossLayer{} | |
layer 0:InputLayer{ outputPlanes=2 outputSize=3 } | |
layer 1:ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=3 numFilters=2 filterSize=3 outputSize=3 padZeros=1 biased=0 skip=0} } | |
layer 2:SquareLossLayer{} | |
batchSize: 4 | |
inputtotalsize=72 outputTotalSize=72 | |
layer ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=3 numFilters=2 filterSize=3 outputSize=3 padZeros=1 biased=0 skip=0} } | |
weightsize=36 biassize=0 | |
forward try kernel 0 | |
... not plausibly optimal, skipping | |
forward try kernel 1 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 1: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 2 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, const int N) { | |
8: int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize; | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * gWorkgroupSize + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [outplane] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [imageid][filterid][row][col] | |
26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB | |
27: // eg 5 * 5 * 32 * 4 = 3.2KB => ok :-) | |
28: // but 28 * 28 * 32 * 4 = 100KB => less good :-P | |
29: void kernel forward_2_by_outplane( | |
30: const int batchSize, | |
31: global const float *images, global const float *filters, | |
32: global float *output, | |
33: local float *_inputPlane, local float *_filterCube) { | |
34: const int globalId = get_global_id(0); | |
35: | |
36: const int workgroupId = get_group_id(0); | |
37: const int workgroupSize = get_local_size(0); | |
38: const int outPlane = workgroupId; | |
39: | |
40: const int localId = get_local_id(0); | |
41: const int outputRow = localId / gOutputSize; | |
42: const int outputCol = localId % gOutputSize; | |
43: | |
44: #if gPadZeros == 1 | |
45: const int minu = max(-gHalfFilterSize, -outputRow); | |
46: const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven; | |
47: const int minv = max(-gHalfFilterSize, -outputCol); | |
48: const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven; | |
49: #else | |
50: const int minu = -gHalfFilterSize; | |
51: const int maxu = gHalfFilterSize - gEven; | |
52: const int minv = -gHalfFilterSize; | |
53: const int maxv = gHalfFilterSize - gEven; | |
54: #endif | |
55: | |
56: { | |
57: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
58: copyLocal(_filterCube, | |
59: filters + outPlane * filterCubeLength, | |
60: filterCubeLength); | |
61: } | |
62: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
63: | |
64: for (int n = 0; n < batchSize; n++) { | |
65: float sum = 0; | |
66: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
67: barrier(CLK_LOCAL_MEM_FENCE); | |
68: copyLocal(_inputPlane, | |
69: images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, | |
70: gInputSizeSquared); | |
71: barrier(CLK_LOCAL_MEM_FENCE); | |
72: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
73: if (localId < gOutputSizeSquared) { | |
74: for (int u = minu; u <= maxu; u++) { | |
75: int inputRow = outputRow + u; | |
76: #if gPadZeros == 0 | |
77: inputRow += gHalfFilterSize; | |
78: #endif | |
79: int inputimagerowoffset = inputRow * gInputSize; | |
80: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
81: for (int v = minv; v <= maxv; v++) { | |
82: int inputCol = outputCol + v; | |
83: #if gPadZeros == 0 | |
84: inputCol += gHalfFilterSize; | |
85: #endif | |
86: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
87: } | |
88: } | |
89: } | |
90: } | |
91: // output are organized like [imageid][filterid][row][col] | |
92: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
93: if (localId < gOutputSizeSquared) { | |
94: output[resultIndex ] = sum; | |
95: } | |
96: } | |
97: } | |
98: #endif | |
99: | |
100: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
ForwardAuto: kernel 2: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, const int N) { | |
8: int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize; | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * gWorkgroupSize + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [outplane] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [imageid][filterid][row][col] | |
26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB | |
27: // eg 5 * 5 * 32 * 4 = 3.2KB => ok :-) | |
28: // but 28 * 28 * 32 * 4 = 100KB => less good :-P | |
29: void kernel forward_2_by_outplane( | |
30: const int batchSize, | |
31: global const float *images, global const float *filters, | |
32: global float *output, | |
33: local float *_inputPlane, local float *_filterCube) { | |
34: const int globalId = get_global_id(0); | |
35: | |
36: const int workgroupId = get_group_id(0); | |
37: const int workgroupSize = get_local_size(0); | |
38: const int outPlane = workgroupId; | |
39: | |
40: const int localId = get_local_id(0); | |
41: const int outputRow = localId / gOutputSize; | |
42: const int outputCol = localId % gOutputSize; | |
43: | |
44: #if gPadZeros == 1 | |
45: const int minu = max(-gHalfFilterSize, -outputRow); | |
46: const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven; | |
47: const int minv = max(-gHalfFilterSize, -outputCol); | |
48: const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven; | |
49: #else | |
50: const int minu = -gHalfFilterSize; | |
51: const int maxu = gHalfFilterSize - gEven; | |
52: const int minv = -gHalfFilterSize; | |
53: const int maxv = gHalfFilterSize - gEven; | |
54: #endif | |
55: | |
56: { | |
57: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
58: copyLocal(_filterCube, | |
59: filters + outPlane * filterCubeLength, | |
60: filterCubeLength); | |
61: } | |
62: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
63: | |
64: for (int n = 0; n < batchSize; n++) { | |
65: float sum = 0; | |
66: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
67: barrier(CLK_LOCAL_MEM_FENCE); | |
68: copyLocal(_inputPlane, | |
69: images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, | |
70: gInputSizeSquared); | |
71: barrier(CLK_LOCAL_MEM_FENCE); | |
72: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
73: if (localId < gOutputSizeSquared) { | |
74: for (int u = minu; u <= maxu; u++) { | |
75: int inputRow = outputRow + u; | |
76: #if gPadZeros == 0 | |
77: inputRow += gHalfFilterSize; | |
78: #endif | |
79: int inputimagerowoffset = inputRow * gInputSize; | |
80: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
81: for (int v = minv; v <= maxv; v++) { | |
82: int inputCol = outputCol + v; | |
83: #if gPadZeros == 0 | |
84: inputCol += gHalfFilterSize; | |
85: #endif | |
86: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
87: } | |
88: } | |
89: } | |
90: } | |
91: // output are organized like [imageid][filterid][row][col] | |
92: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
93: if (localId < gOutputSizeSquared) { | |
94: output[resultIndex ] = sum; | |
95: } | |
96: } | |
97: } | |
98: #endif | |
99: | |
100: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
... not valid | |
forward try kernel 3 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: each workgroup handles convolving one input example with one filtercube | |
8: // and writing out one single output plane | |
9: // | |
10: // workgroup id organized like: [imageid][outplane] | |
11: // local id organized like: [outrow][outcol] | |
12: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
13: // number workgroups = 32 | |
14: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
16: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
17: // output are organized like [imageid][filterid][row][col] | |
18: void kernel forward_3_by_n_outplane(const int batchSize, | |
19: global const float *images, global const float *filters, | |
20: global float *output, | |
21: local float *_upstreamImage, local float *_filterCube) { | |
22: const int globalId = get_global_id(0); | |
23: | |
24: const int workgroupId = get_group_id(0); | |
25: const int workgroupSize = get_local_size(0); | |
26: const int n = workgroupId / gNumFilters; | |
27: const int outPlane = workgroupId % gNumFilters; | |
28: | |
29: const int localId = get_local_id(0); | |
30: const int outputRow = localId / gOutputSize; | |
31: const int outputCol = localId % gOutputSize; | |
32: | |
33: const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize; | |
34: const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow - gEven) : gHalfFilterSize - gEven; | |
35: const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize; | |
36: const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven; | |
37: | |
38: const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
39: | |
40: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
41: const int filterCubeGlobalOffset = outPlane * filterCubeLength; | |
42: const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize; | |
43: for (int i = 0; i < numPixelsPerThread; i++) { | |
44: int thisOffset = localId + i * workgroupSize; | |
45: if (thisOffset < filterCubeLength) { | |
46: _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset]; | |
47: } | |
48: } | |
49: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
50: | |
51: float sum = 0; | |
52: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
53: int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: for (int i = 0; i < numUpstreamsPerThread; i++) { | |
56: int thisOffset = workgroupSize * i + localId; | |
57: if (thisOffset < gInputSizeSquared) { | |
58: _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ]; | |
59: } | |
60: } | |
61: barrier(CLK_LOCAL_MEM_FENCE); | |
62: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
63: for (int u = minu; u <= maxu; u++) { | |
64: int inputRow = outputRow + u; | |
65: #if gPadZeros == 0 | |
66: inputRow += gHalfFilterSize; | |
67: #endif | |
68: int inputimagerowoffset = inputRow * gInputSize; | |
69: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
70: for (int v = minv; v <= maxv; v++) { | |
71: int inputCol = outputCol + v; | |
72: #if gPadZeros == 0 | |
73: inputCol += gHalfFilterSize; | |
74: #endif | |
75: if (localId < gOutputSizeSquared) { | |
76: sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
77: } | |
78: } | |
79: } | |
80: } | |
81: | |
82: // output are organized like [imageid][filterid][row][col] | |
83: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
84: if (localId < gOutputSizeSquared) { | |
85: output[resultIndex ] = sum; | |
86: } | |
87: } | |
88: | |
89: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 3: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: each workgroup handles convolving one input example with one filtercube | |
8: // and writing out one single output plane | |
9: // | |
10: // workgroup id organized like: [imageid][outplane] | |
11: // local id organized like: [outrow][outcol] | |
12: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
13: // number workgroups = 32 | |
14: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
16: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
17: // output are organized like [imageid][filterid][row][col] | |
18: void kernel forward_3_by_n_outplane(const int batchSize, | |
19: global const float *images, global const float *filters, | |
20: global float *output, | |
21: local float *_upstreamImage, local float *_filterCube) { | |
22: const int globalId = get_global_id(0); | |
23: | |
24: const int workgroupId = get_group_id(0); | |
25: const int workgroupSize = get_local_size(0); | |
26: const int n = workgroupId / gNumFilters; | |
27: const int outPlane = workgroupId % gNumFilters; | |
28: | |
29: const int localId = get_local_id(0); | |
30: const int outputRow = localId / gOutputSize; | |
31: const int outputCol = localId % gOutputSize; | |
32: | |
33: const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize; | |
34: const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow - gEven) : gHalfFilterSize - gEven; | |
35: const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize; | |
36: const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven; | |
37: | |
38: const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
39: | |
40: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
41: const int filterCubeGlobalOffset = outPlane * filterCubeLength; | |
42: const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize; | |
43: for (int i = 0; i < numPixelsPerThread; i++) { | |
44: int thisOffset = localId + i * workgroupSize; | |
45: if (thisOffset < filterCubeLength) { | |
46: _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset]; | |
47: } | |
48: } | |
49: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
50: | |
51: float sum = 0; | |
52: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
53: int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: for (int i = 0; i < numUpstreamsPerThread; i++) { | |
56: int thisOffset = workgroupSize * i + localId; | |
57: if (thisOffset < gInputSizeSquared) { | |
58: _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ]; | |
59: } | |
60: } | |
61: barrier(CLK_LOCAL_MEM_FENCE); | |
62: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
63: for (int u = minu; u <= maxu; u++) { | |
64: int inputRow = outputRow + u; | |
65: #if gPadZeros == 0 | |
66: inputRow += gHalfFilterSize; | |
67: #endif | |
68: int inputimagerowoffset = inputRow * gInputSize; | |
69: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
70: for (int v = minv; v <= maxv; v++) { | |
71: int inputCol = outputCol + v; | |
72: #if gPadZeros == 0 | |
73: inputCol += gHalfFilterSize; | |
74: #endif | |
75: if (localId < gOutputSizeSquared) { | |
76: sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
77: } | |
78: } | |
79: } | |
80: } | |
81: | |
82: // output are organized like [imageid][filterid][row][col] | |
83: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
84: if (localId < gOutputSizeSquared) { | |
85: output[resultIndex ] = sum; | |
86: } | |
87: } | |
88: | |
89: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 4 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, int N) { | |
8: int numLoops = (N + get_local_size(0) - 1) / get_local_size(0); | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * get_local_size(0) + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [n][filterid] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [n][filterid][outrow][outcol] | |
26: // the pixels per thread thing... : | |
27: // - we have one thread (~= cuda core) per output value, | |
28: // ie one thread for each combination of [outrow][outcol] | |
29: // - however, the number of threads is typically limited on a gpu, | |
30: // eg to 512 (eg Intel HD), or 1024 (eg nVidia K520) | |
31: // - so what happens if the number of output points is larger than | |
32: // the maximum workgroup size? | |
33: // - then we have several possibilities really: | |
34: // - we can divide the image into blocks, and process each block | |
35: // separately. This is probably a good option, but fair amount of | |
36: // work | |
37: // - we can get each thread to handle more than one output | |
38: // pixel, by looping | |
39: // - we can consider the output image in 1d, by putting the rows | |
40: // one after another, and assign each contiguous workgroup-size | |
41: // block to one workgroup | |
42: // => this is how this kernel works | |
43: // basically, it's a hack, so larger images actually run, without | |
44: // crashing, and we can probably improve it a lot :-) | |
45: // | |
46: // So, when outputSize * outputSize > workgroupSize, then | |
47: // multiple workgroups will be created for each output plane | |
48: // the number of such workgroups is given by: `gPixelsPerThread` | |
49: // the id of our workgroup within such a set of workgroups is calculated | |
50: // as `pixel` | |
51: // effectiveLocalId is our local id if we had one enormous workgroup | |
52: // containing the whole output image plane | |
53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize, | |
54: global const float *images, global const float *filters, | |
55: global float *output, | |
56: local float *_inputPlane, local float *_filterPlane) { | |
57: #define globalId (get_global_id(0)) | |
58: | |
59: #define localId (get_local_id(0)) | |
60: #define workgroupId (get_group_id(0)) | |
61: // const int workgroupSize = get_local_size(0); | |
62: const int effectiveWorkgroupId = workgroupId / gPixelsPerThread; | |
63: const int pixel = workgroupId % gPixelsPerThread; | |
64: const int effectiveLocalId = localId + pixel * gWorkgroupSize; | |
65: const int n = effectiveWorkgroupId / gNumFilters; | |
66: const int outPlane = effectiveWorkgroupId % gNumFilters; | |
67: | |
68: const int outputRow = effectiveLocalId / gOutputSize; | |
69: const int outputCol = effectiveLocalId % gOutputSize; | |
70: | |
71: float sum = 0; | |
72: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
73: barrier(CLK_LOCAL_MEM_FENCE); | |
74: copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared); | |
75: copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared); | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: | |
78: if (effectiveLocalId < gOutputSizeSquared) { | |
79: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
80: // trying to reduce register pressure... | |
81: #if gPadZeros == 1 | |
82: #define inputRow (outputRow + u) | |
83: #else | |
84: #define inputRow (outputRow + u + gHalfFilterSize) | |
85: #endif | |
86: int inputimagerowoffset = inputRow * gInputSize; | |
87: int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
88: bool rowOk = inputRow >= 0 && inputRow < gInputSize; | |
89: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
90: #if gPadZeros == 1 | |
91: #define inputCol (outputCol + v) | |
92: #else | |
93: #define inputCol (outputCol + v + gHalfFilterSize) | |
94: #endif | |
95: bool process = rowOk && inputCol >= 0 && inputCol < gInputSize; | |
96: if (process) { | |
97: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ]; | |
98: } | |
99: } | |
100: } | |
101: } | |
102: } | |
103: // output are organized like [imageid][filterid][row][col] | |
104: #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId) | |
105: if (effectiveLocalId < gOutputSizeSquared) { | |
106: output[resultIndex ] = sum; | |
107: } | |
108: } | |
109: #endif | |
110: | |
111: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 4: this instance cant be used: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 5 | |
ForwardAuto: kernel 5: this instance cant be used: For ForwardFc, padzeros must be disabled | |
... not valid | |
forward try kernel 6 | |
cl/forward_byinputplane.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: | |
8: // - load same input plane from each image | |
9: // - hold filter plane for this input plane, for all filters | |
10: // - reduce afterwards | |
11: // local memory for one plane from each filter of 64c7 = 64 * 7 * 7 * 4 = 12.5KB | |
12: // local memory for one single input plane = 19 * 19 * 4 = 1.4KB | |
13: // => seems ok? | |
14: // workgroupid: [inputPlaneId] | |
15: // localid: [filterId][outRow] (if this is more than workgroupsize, we should reuse some threads...) | |
16: // iterate over: [n][outCol] | |
17: // output: [n][filterId][outRow][outCol][inputPlane] | |
18: // need to later reduce output over: [inputPlane] | |
19: void kernel forward_byinputplane(const int batchSize, | |
20: global const float *images, global const float *filters, | |
21: global float *output, | |
22: local float *_inputPlane, local float *_filterPlanes) { | |
23: // const int evenPadding = gFilterSize % 2 == 0 ? 1 : 0; | |
24: | |
25: const int globalId = get_global_id(0); | |
26: const int workgroupId = get_group_id(0); | |
27: const int workgroupSize = get_local_size(0); | |
28: const int localId = get_local_id(0); | |
29: | |
30: const int inputPlaneId = workgroupId; | |
31: const int numLoops = (gNumFilters * gOutputSize + workgroupSize - 1) / workgroupSize; | |
32: const int numFilterCopyLoops = (gFilterSizeSquared + gOutputSize - 1) / gOutputSize; | |
33: const int numImageCopyLoops = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
34: for (int loop = 0; loop < numLoops; loop++) { | |
35: const int loopLocalId = localId + loop * workgroupSize; | |
36: const int filterId = loopLocalId / gOutputSize; | |
37: const int outRow = loopLocalId % gOutputSize; | |
38: | |
39: // copy down our filter, we have gOutputSize threads to do this | |
40: global float const *globalFilterPlane = filters + | |
41: (filterId * gNumInputPlanes + inputPlaneId) * gFilterSizeSquared; | |
42: local float *_localFilterPlane = _filterPlanes + filterId * gFilterSizeSquared; | |
43: barrier(CLK_LOCAL_MEM_FENCE); | |
44: for (int i = 0; i < numFilterCopyLoops; i++) { | |
45: const int offset = i * gOutputSize + outRow; | |
46: bool process = filterId < gNumFilters && offset < gFilterSizeSquared; | |
47: if (process) { | |
48: _localFilterPlane[ offset ] = globalFilterPlane[ offset ]; | |
49: } | |
50: } | |
51: // loop over n ... | |
52: for (int n = 0; n < batchSize; n++) { | |
53: // copy down our imageplane, we have workgroupSize threads to do this | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: global float const *globalImagePlane = images + | |
56: (n * gNumInputPlanes + inputPlaneId) * gInputSizeSquared; | |
57: for (int i = 0; i< numImageCopyLoops; i++) { | |
58: const int offset = i * workgroupSize + localId; | |
59: if (offset < gInputSizeSquared) { | |
60: _inputPlane[ offset ] = globalImagePlane[ offset ]; | |
61: } | |
62: } | |
63: barrier(CLK_LOCAL_MEM_FENCE); | |
64: // calc output for each [outrow][outcol] | |
65: bool filterPlaneOk = filterId < gNumFilters; | |
66: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
67: float sum = 0; | |
68: for (int filterRow = 0; filterRow < gFilterSize; filterRow++) { | |
69: int inRow = outRow + filterRow; | |
70: #if gPadZeros == 1 | |
71: inRow -= gHalfFilterSize; | |
72: #endif | |
73: bool rowOk = filterPlaneOk && inRow >= 0 && inRow < gInputSize; | |
74: for (int filterCol = 0; filterCol < gFilterSize; filterCol++) { | |
75: int inCol = outCol + filterCol; | |
76: #if gPadZeros == 1 | |
77: inCol -= gHalfFilterSize; | |
78: #endif | |
79: bool process = rowOk && inCol >= 0 && inCol < gInputSize; | |
80: if (process) { | |
81: float imageValue = _inputPlane[ inRow * gInputSize + inCol ]; | |
82: float filterValue = _localFilterPlane[ filterRow * gFilterSize + filterCol ]; | |
83: sum += imageValue * filterValue; | |
84: } | |
85: } | |
86: } | |
87: if (filterId < gNumFilters) { | |
88: // [n][filterId][outRow][outCol][inputPlane] | |
89: int resultIndex = (( (n | |
90: * gNumFilters + filterId) | |
91: * gOutputSize + outRow) | |
92: * gOutputSize + outCol) | |
93: * gNumInputPlanes + inputPlaneId; | |
94: output[resultIndex] = sum; | |
95: //if (globalId == 2) output[0] = resultIndex; | |
96: // output[resultIndex] = outRow; | |
97: } | |
98: // output[localId] = _localFilterPlane[localId]; | |
99: } | |
100: } | |
101: } | |
102: } | |
103: | |
104: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward_byinputplane.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 6: this instance cant be used: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward_byinputplane.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 7 | |
... seems valid | |
ForwardIm2Col.cl build log: | |
(19:0) : error : invalid global address space qualifier specified for parameter type | |
(19:0) : error : syntax error at 'const' | |
kernel build error: | |
kernel source: | |
1: // from SpatialConvolutionMM.cu: | |
2: | |
3: // CL: grid stride looping | |
4: #define CL_KERNEL_LOOP(i, n) \ | |
5: for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \ | |
6: i < (n); \ | |
7: i += get_local_size(0) * get_num_groups(0)) | |
8: | |
9: //#define gPadding 1 | |
10: //#define gStride 1 | |
11: //#define gColSize 3 | |
12: //#define gFilterSize 3 | |
13: //#define gSize 3 | |
14: | |
15: // Kernel for fast unfold+copy | |
16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu) | |
17: kernel void im2col( | |
18: const int n, | |
19: global float const * im_data, int im_offset, | |
20: global float* data_col) { | |
21: global const float *data_im = im_data + im_offset; | |
22: | |
23: CL_KERNEL_LOOP(index, n) { | |
24: int w_out = index % 3; | |
25: index /= 3; | |
26: int h_out = index % 3; | |
27: int channel_in = index / 3; | |
28: int channel_out = channel_in * 3 * 3; | |
29: int h_in = h_out * 1 - 1; | |
30: int w_in = w_out * 1 - 1; | |
31: data_col += (channel_out * 3 + h_out) * 3 + w_out; | |
32: data_im += (channel_in * 3 + h_in) * 3 + w_in; | |
33: for (int i = 0; i < 3; ++i) { | |
34: for (int j = 0; j < 3; ++j) { | |
35: int h = h_in + i; | |
36: int w = w_in + j; | |
37: *data_col = (h >= 0 && w >= 0 && h < 3 && w < 3) ? | |
38: data_im[i * 3 + j] : 0; | |
39: data_col += 3 * 3; | |
40: } | |
41: } | |
42: } | |
43: } | |
44: | |
45: kernel void col2im( | |
46: const int n, | |
47: global float const *data_col, | |
48: global float* im_data, int im_offset) { | |
49: global float *data_im = im_data + im_offset; | |
50: | |
51: for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) { | |
52: float val = 0; | |
53: int w = index % 3 + 1; | |
54: int h = (index / 3) % 3 + 1; | |
55: int c = index / (3 * 3); | |
56: // compute the start and end of the output | |
57: int w_col_start = (w < 3) ? 0 : (w - 3) / 1 + 1; | |
58: int w_col_end = min(w / 1 + 1, 3); | |
59: int h_col_start = (h < 3) ? 0 : (h - 3) / 1 + 1; | |
60: int h_col_end = min(h / 1 + 1, 3); | |
61: | |
62: int offset = (c * 3 * 3 + h * 3 + w) * 3 * 3; | |
63: int coeff_h_col = (1 - 1 * 3 * 3) * 3; | |
64: int coeff_w_col = (1 - 1 * 3 * 3); | |
65: for (int h_col = h_col_start; h_col < h_col_end; ++h_col) { | |
66: for (int w_col = w_col_start; w_col < w_col_end; ++w_col) { | |
67: val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col]; | |
68: } | |
69: } | |
70: data_im[index] = val; | |
71: } | |
72: } | |
73: | |
74: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
ForwardIm2Col.cl build log: | |
(19:0) : error : invalid global address space qualifier specified for parameter type | |
(19:0) : error : syntax error at 'const' | |
ForwardAuto: kernel 7 this instance cant be used: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
ForwardIm2Col.cl build log: | |
(19:0) : error : invalid global address space qualifier specified for parameter type | |
(19:0) : error : syntax error at 'const' | |
forward kernel 0: cannot be used | |
forward kernel 1: cannot be used | |
forward kernel 2: cannot be used | |
forward kernel 3: cannot be used | |
forward kernel 4: cannot be used | |
forward kernel 5: cannot be used | |
forward kernel 6: cannot be used | |
forward kernel 7: cannot be used | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description "No valid forward implementations found" thrown in the test body. | |
[ FAILED ] testupdateweights.conv1z (141 ms) | |
[ RUN ] testupdateweights.numericallytest | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH" | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.numericallytest (56 ms) | |
[ RUN ] testupdateweights.numericallytest_imagesize3 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH" | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.numericallytest_imagesize3 (66 ms) | |
[ RUN ] testupdateweights.numericallytest_imagesize5 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=5 -DgOutputSizeSquared=25 -DgInputSize=5 -DgInputSizeSquared=25 -DgNumPlanes=1 -D TANH" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=5 -DgOutputSizeSquared=25 -DgInputSize=5 -DgInputSizeSquared=25 -DgNumPlanes=1 -D TANH" | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=5 -DgOutputSizeSquared=25 -DgInputSize=5 -DgInputSizeSquared=25 -DgNumPlanes=1 -D TANH" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.numericallytest_imagesize5 (66 ms) | |
[ RUN ] testupdateweights.numericallytest_imagesize9 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=9 -DgOutputSizeSquared=81 -DgInputSize=9 -DgInputSizeSquared=81 -DgNumPlanes=1 -D TANH" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=9 -DgOutputSizeSquared=81 -DgInputSize=9 -DgInputSizeSquared=81 -DgNumPlanes=1 -D TANH" | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=9 -DgOutputSizeSquared=81 -DgInputSize=9 -DgInputSizeSquared=81 -DgNumPlanes=1 -D TANH" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.numericallytest_imagesize9 (57 ms) | |
[ RUN ] testupdateweights.numericallytest_imagesize9_filtersize9 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH" | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.numericallytest_imagesize9_filtersize9 (56 ms) | |
[ RUN ] testupdateweights.numericallytest_imagesize9_filtersize3 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=7 -DgOutputSizeSquared=49 -DgInputSize=7 -DgInputSizeSquared=49 -DgNumPlanes=1 -D TANH" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=7 -DgOutputSizeSquared=49 -DgInputSize=7 -DgInputSizeSquared=49 -DgNumPlanes=1 -D TANH" | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=7 -DgOutputSizeSquared=49 -DgInputSize=7 -DgInputSizeSquared=49 -DgNumPlanes=1 -D TANH" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.numericallytest_imagesize9_filtersize3 (67 ms) | |
[ RUN ] testupdateweights.numericallytest_imagesize3_filtersize3 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH" | |
unknown file: Failure | |
C++ exception with description " | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=1 -D TANH" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.numericallytest_imagesize3_filtersize3 (68 ms) | |
[ RUN ] testupdateweights.numericallytest_imagesize5_filtersize3 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH" | |
unknown file: Failure | |
C++ exception with description " | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.numericallytest_imagesize5_filtersize3 (68 ms) | |
[ RUN ] testupdateweights.numericallytest_imagesize5_filtersize3_batchsize3 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH" | |
unknown file: Failure | |
C++ exception with description " | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.numericallytest_imagesize5_filtersize3_batchsize3 (69 ms) | |
[ RUN ] testupdateweights.numericallytest_imagesize5_filtersize3_planes3 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH" | |
unknown file: Failure | |
C++ exception with description " | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.numericallytest_imagesize5_filtersize3_planes3 (70 ms) | |
[ RUN ] testupdateweights.numericallytest_imagesize5_filtersize3_planes3_batchsize3 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH" | |
unknown file: Failure | |
C++ exception with description " | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=1 -D TANH" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.numericallytest_imagesize5_filtersize3_planes3_batchsize3 (71 ms) | |
[ RUN ] testupdateweights.backprop_weights_2 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
options: -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=1 -DgInputStripeOuterSize=1 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=1 -DgOutputStripeSize=1 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=1 -DgInputStripeOuterSize=1 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=1 -DgOutputStripeSize=1" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // BIASED (or not) | |
9: | |
10: // workgroupId: [outputPlane][inputPlane] | |
11: // localId: [filterRow][filterCol] | |
12: // per-thread iteration: [n][outputRow][outputCol] | |
13: // local: errorimage: outputSize * outputSize | |
14: // imageimage: inputSize * inputSize | |
15: // specific characteristic: load one stripe of each image at a time, | |
16: // so we dont run out of memory | |
17: // number of stripes set in: gNumStripes | |
18: // note that whilst we can stripe the gradOutput simply, | |
19: // we actually need to add a half-filter widthed additional few rows | |
20: // onto the images stripe, otherwise we will be missing data | |
21: // we will call the size of the non-overlapping image stripes: gInputStripeInnerSize | |
22: // the outersize, including the two margins is: gInputStripeOuterSize | |
23: // of course, the first and last stripes will be missing a bit off the top/bottom, where the | |
24: // corresponding outer margin would be | |
25: void kernel backprop_floats_withscratch_dobias_striped( | |
26: const float learningRateMultiplier, const int batchSize, | |
27: global const float *gradOutput, global const float *images, | |
28: global float *gradWeights, | |
29: #ifdef BIASED | |
30: global float *gradBiasWeights, | |
31: #endif | |
32: local float *_errorStripe, local float *_imageStripe | |
33: ) { | |
34: // gHalfFilterSize | |
35: // gInputSize | |
36: // | |
37: // gInputStripeMarginRows => basically equal to gHalfFilterSize | |
38: // gInputStripeInnerNumRows = gInputSize / gNumStripes | |
39: // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize (note: one row less than | |
40: // if we just added gFilterSize) | |
41: // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize | |
42: // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize | |
43: // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize | |
44: // | |
45: // gOutputStripeNumRows | |
46: // gOutputStripeSize | |
47: | |
48: const int globalId = get_global_id(0); | |
49: const int localId = get_local_id(0); | |
50: const int workgroupId = get_group_id(0); | |
51: const int workgroupSize = get_local_size(0); | |
52: | |
53: const int filterRow = localId / gFilterSize; | |
54: const int filterCol = localId % gFilterSize; | |
55: | |
56: const int outPlane = workgroupId / gInputPlanes; | |
57: const int upstreamPlane = workgroupId % gInputPlanes; | |
58: | |
59: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
60: // aggregate over: [outRow][outCol][n] | |
61: float thiswchange = 0; | |
62: #ifdef BIASED | |
63: float thisbiaschange = 0; | |
64: #endif | |
65: const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize; | |
66: const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize; | |
67: for (int n = 0; n < batchSize; n++) { | |
68: const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
69: const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared; | |
70: const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared; | |
71: const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared; | |
72: for (int stripe = 0; stripe < gNumStripes; stripe++) { | |
73: const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize; | |
74: const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize; | |
75: // need to fetch the image, but it's bigger than us, so will need to loop... | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: for (int i = 0; i < numLoopsForImageStripe; i++) { | |
78: int thisOffset = i * workgroupSize + localId; | |
79: int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset; | |
80: bool process = thisOffset < gInputStripeOuterSize | |
81: && thisGlobalImagesOffset >= imageImageGlobalOffset | |
82: && thisGlobalImagesOffset < imageImageGlobalOffsetAfter; | |
83: if (process) { | |
84: _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ]; | |
85: } | |
86: } | |
87: int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize; | |
88: for (int i = 0; i < numLoopsForErrorStripe; i++) { | |
89: int thisOffset = i * workgroupSize + localId; | |
90: int globalErrorsOffset = errorStripeOffset + thisOffset; | |
91: bool process = thisOffset < gOutputStripeSize | |
92: && globalErrorsOffset < errorImageGlobalOffsetAfter; | |
93: if (process) { | |
94: _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset]; | |
95: } | |
96: } | |
97: const int stripeOutRowStart = stripe * gOutputStripeNumRows; | |
98: const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows; | |
99: barrier(CLK_LOCAL_MEM_FENCE); | |
100: // if (localId == 13) { | |
101: // for (int i = 0; i < 12; i++) { | |
102: // gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize]; | |
103: // } | |
104: // for (int i = 0; i < 20; i++) { | |
105: // gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize]; | |
106: // } | |
107: // } | |
108: if (localId < gFilterSizeSquared) { | |
109: for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) { | |
110: int upstreamRow = outRow - gMargin + filterRow; | |
111: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
112: int upstreamCol = outCol - gMargin + filterCol; | |
113: bool proceed = | |
114: upstreamRow >= 0 && upstreamCol >= 0 | |
115: && upstreamRow < gInputSize && upstreamCol < gInputSize | |
116: && outRow < gOutputSize; | |
117: if (proceed) { | |
118: int resultIndex = outRow * gOutputSize + outCol; | |
119: float error = _errorStripe[resultIndex - stripe * gOutputStripeSize]; | |
120: int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol; | |
121: float upstreamResult = _imageStripe[upstreamDataIndex + gInputStripeMarginSize | |
122: - stripe * gInputStripeInnerSize ]; | |
123: thiswchange += upstreamResult * error; | |
124: #ifdef BIASED | |
125: thisbiaschange += error; | |
126: #endif | |
127: } | |
128: } | |
129: } | |
130: } | |
131: } | |
132: } | |
133: if (localId < gFilterSizeSquared) { | |
134: gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange; | |
135: // weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId; | |
136: } | |
137: #ifdef BIASED | |
138: bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin; | |
139: if (writeBias) { | |
140: gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange; | |
141: } | |
142: #endif | |
143: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
144: // aggregate over: [outRow][outCol][n] | |
145: } | |
146: | |
147: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=1 -DgInputStripeOuterSize=1 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=1 -DgOutputStripeSize=1" | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // BIASED (or not) | |
9: | |
10: // workgroupId: [outputPlane][inputPlane] | |
11: // localId: [filterRow][filterCol] | |
12: // per-thread iteration: [n][outputRow][outputCol] | |
13: // local: errorimage: outputSize * outputSize | |
14: // imageimage: inputSize * inputSize | |
15: // specific characteristic: load one stripe of each image at a time, | |
16: // so we dont run out of memory | |
17: // number of stripes set in: gNumStripes | |
18: // note that whilst we can stripe the gradOutput simply, | |
19: // we actually need to add a half-filter widthed additional few rows | |
20: // onto the images stripe, otherwise we will be missing data | |
21: // we will call the size of the non-overlapping image stripes: gInputStripeInnerSize | |
22: // the outersize, including the two margins is: gInputStripeOuterSize | |
23: // of course, the first and last stripes will be missing a bit off the top/bottom, where the | |
24: // corresponding outer margin would be | |
25: void kernel backprop_floats_withscratch_dobias_striped( | |
26: const float learningRateMultiplier, const int batchSize, | |
27: global const float *gradOutput, global const float *images, | |
28: global float *gradWeights, | |
29: #ifdef BIASED | |
30: global float *gradBiasWeights, | |
31: #endif | |
32: local float *_errorStripe, local float *_imageStripe | |
33: ) { | |
34: // gHalfFilterSize | |
35: // gInputSize | |
36: // | |
37: // gInputStripeMarginRows => basically equal to gHalfFilterSize | |
38: // gInputStripeInnerNumRows = gInputSize / gNumStripes | |
39: // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize (note: one row less than | |
40: // if we just added gFilterSize) | |
41: // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize | |
42: // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize | |
43: // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize | |
44: // | |
45: // gOutputStripeNumRows | |
46: // gOutputStripeSize | |
47: | |
48: const int globalId = get_global_id(0); | |
49: const int localId = get_local_id(0); | |
50: const int workgroupId = get_group_id(0); | |
51: const int workgroupSize = get_local_size(0); | |
52: | |
53: const int filterRow = localId / gFilterSize; | |
54: const int filterCol = localId % gFilterSize; | |
55: | |
56: const int outPlane = workgroupId / gInputPlanes; | |
57: const int upstreamPlane = workgroupId % gInputPlanes; | |
58: | |
59: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
60: // aggregate over: [outRow][outCol][n] | |
61: float thiswchange = 0; | |
62: #ifdef BIASED | |
63: float thisbiaschange = 0; | |
64: #endif | |
65: const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize; | |
66: const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize; | |
67: for (int n = 0; n < batchSize; n++) { | |
68: const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
69: const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared; | |
70: const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared; | |
71: const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared; | |
72: for (int stripe = 0; stripe < gNumStripes; stripe++) { | |
73: const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize; | |
74: const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize; | |
75: // need to fetch the image, but it's bigger than us, so will need to loop... | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: for (int i = 0; i < numLoopsForImageStripe; i++) { | |
78: int thisOffset = i * workgroupSize + localId; | |
79: int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset; | |
80: bool process = thisOffset < gInputStripeOuterSize | |
81: && thisGlobalImagesOffset >= imageImageGlobalOffset | |
82: && thisGlobalImagesOffset < imageImageGlobalOffsetAfter; | |
83: if (process) { | |
84: _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ]; | |
85: } | |
86: } | |
87: int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize; | |
88: for (int i = 0; i < numLoopsForErrorStripe; i++) { | |
89: int thisOffset = i * workgroupSize + localId; | |
90: int globalErrorsOffset = errorStripeOffset + thisOffset; | |
91: bool process = thisOffset < gOutputStripeSize | |
92: && globalErrorsOffset < errorImageGlobalOffsetAfter; | |
93: if (process) { | |
94: _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset]; | |
95: } | |
96: } | |
97: const int stripeOutRowStart = stripe * gOutputStripeNumRows; | |
98: const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows; | |
99: barrier(CLK_LOCAL_MEM_FENCE); | |
100: // if (localId == 13) { | |
101: // for (int i = 0; i < 12; i++) { | |
102: // gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize]; | |
103: // } | |
104: // for (int i = 0; i < 20; i++) { | |
105: // gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize]; | |
106: // } | |
107: // } | |
108: if (localId < gFilterSizeSquared) { | |
109: for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) { | |
110: int upstreamRow = outRow - gMargin + filterRow; | |
111: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
112: int upstreamCol = outCol - gMargin + filterCol; | |
113: bool proceed = | |
114: upstreamRow >= 0 && upstreamCol >= 0 | |
115: && upstreamRow < gInputSize && upstreamCol < gInputSize | |
116: && outRow < gOutputSize; | |
117: if (proceed) { | |
118: int resultIndex = outRow * gOutputSize + outCol; | |
119: float error = _errorStripe[resultIndex - stripe * gOutputStripeSize]; | |
120: int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol; | |
121: float upstreamResult = _imageStripe[upstreamDataIndex + gInputStripeMarginSize | |
122: - stripe * gInputStripeInnerSize ]; | |
123: thiswchange += upstreamResult * error; | |
124: #ifdef BIASED | |
125: thisbiaschange += error; | |
126: #endif | |
127: } | |
128: } | |
129: } | |
130: } | |
131: } | |
132: } | |
133: if (localId < gFilterSizeSquared) { | |
134: gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange; | |
135: // weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId; | |
136: } | |
137: #ifdef BIASED | |
138: bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin; | |
139: if (writeBias) { | |
140: gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange; | |
141: } | |
142: #endif | |
143: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
144: // aggregate over: [outRow][outCol][n] | |
145: } | |
146: | |
147: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=1 -DgInputStripeOuterSize=1 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=1 -DgOutputStripeSize=1" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.backprop_weights_2 (25 ms) | |
[ RUN ] testupdateweights.backprop_weights_2_upstreamimagesize2 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
options: -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=2 -DgInputStripeOuterNumRows=2 -DgInputStripeInnerSize=4 -DgInputStripeOuterSize=4 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=4 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=2 -DgInputStripeOuterNumRows=2 -DgInputStripeInnerSize=4 -DgInputStripeOuterSize=4 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=4" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // BIASED (or not) | |
9: | |
10: // workgroupId: [outputPlane][inputPlane] | |
11: // localId: [filterRow][filterCol] | |
12: // per-thread iteration: [n][outputRow][outputCol] | |
13: // local: errorimage: outputSize * outputSize | |
14: // imageimage: inputSize * inputSize | |
15: // specific characteristic: load one stripe of each image at a time, | |
16: // so we dont run out of memory | |
17: // number of stripes set in: gNumStripes | |
18: // note that whilst we can stripe the gradOutput simply, | |
19: // we actually need to add a half-filter widthed additional few rows | |
20: // onto the images stripe, otherwise we will be missing data | |
21: // we will call the size of the non-overlapping image stripes: gInputStripeInnerSize | |
22: // the outersize, including the two margins is: gInputStripeOuterSize | |
23: // of course, the first and last stripes will be missing a bit off the top/bottom, where the | |
24: // corresponding outer margin would be | |
25: void kernel backprop_floats_withscratch_dobias_striped( | |
26: const float learningRateMultiplier, const int batchSize, | |
27: global const float *gradOutput, global const float *images, | |
28: global float *gradWeights, | |
29: #ifdef BIASED | |
30: global float *gradBiasWeights, | |
31: #endif | |
32: local float *_errorStripe, local float *_imageStripe | |
33: ) { | |
34: // gHalfFilterSize | |
35: // gInputSize | |
36: // | |
37: // gInputStripeMarginRows => basically equal to gHalfFilterSize | |
38: // gInputStripeInnerNumRows = gInputSize / gNumStripes | |
39: // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize (note: one row less than | |
40: // if we just added gFilterSize) | |
41: // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize | |
42: // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize | |
43: // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize | |
44: // | |
45: // gOutputStripeNumRows | |
46: // gOutputStripeSize | |
47: | |
48: const int globalId = get_global_id(0); | |
49: const int localId = get_local_id(0); | |
50: const int workgroupId = get_group_id(0); | |
51: const int workgroupSize = get_local_size(0); | |
52: | |
53: const int filterRow = localId / gFilterSize; | |
54: const int filterCol = localId % gFilterSize; | |
55: | |
56: const int outPlane = workgroupId / gInputPlanes; | |
57: const int upstreamPlane = workgroupId % gInputPlanes; | |
58: | |
59: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
60: // aggregate over: [outRow][outCol][n] | |
61: float thiswchange = 0; | |
62: #ifdef BIASED | |
63: float thisbiaschange = 0; | |
64: #endif | |
65: const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize; | |
66: const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize; | |
67: for (int n = 0; n < batchSize; n++) { | |
68: const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
69: const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared; | |
70: const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared; | |
71: const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared; | |
72: for (int stripe = 0; stripe < gNumStripes; stripe++) { | |
73: const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize; | |
74: const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize; | |
75: // need to fetch the image, but it's bigger than us, so will need to loop... | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: for (int i = 0; i < numLoopsForImageStripe; i++) { | |
78: int thisOffset = i * workgroupSize + localId; | |
79: int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset; | |
80: bool process = thisOffset < gInputStripeOuterSize | |
81: && thisGlobalImagesOffset >= imageImageGlobalOffset | |
82: && thisGlobalImagesOffset < imageImageGlobalOffsetAfter; | |
83: if (process) { | |
84: _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ]; | |
85: } | |
86: } | |
87: int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize; | |
88: for (int i = 0; i < numLoopsForErrorStripe; i++) { | |
89: int thisOffset = i * workgroupSize + localId; | |
90: int globalErrorsOffset = errorStripeOffset + thisOffset; | |
91: bool process = thisOffset < gOutputStripeSize | |
92: && globalErrorsOffset < errorImageGlobalOffsetAfter; | |
93: if (process) { | |
94: _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset]; | |
95: } | |
96: } | |
97: const int stripeOutRowStart = stripe * gOutputStripeNumRows; | |
98: const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows; | |
99: barrier(CLK_LOCAL_MEM_FENCE); | |
100: // if (localId == 13) { | |
101: // for (int i = 0; i < 12; i++) { | |
102: // gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize]; | |
103: // } | |
104: // for (int i = 0; i < 20; i++) { | |
105: // gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize]; | |
106: // } | |
107: // } | |
108: if (localId < gFilterSizeSquared) { | |
109: for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) { | |
110: int upstreamRow = outRow - gMargin + filterRow; | |
111: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
112: int upstreamCol = outCol - gMargin + filterCol; | |
113: bool proceed = | |
114: upstreamRow >= 0 && upstreamCol >= 0 | |
115: && upstreamRow < gInputSize && upstreamCol < gInputSize | |
116: && outRow < gOutputSize; | |
117: if (proceed) { | |
118: int resultIndex = outRow * gOutputSize + outCol; | |
119: float error = _errorStripe[resultIndex - stripe * gOutputStripeSize]; | |
120: int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol; | |
121: float upstreamResult = _imageStripe[upstreamDataIndex + gInputStripeMarginSize | |
122: - stripe * gInputStripeInnerSize ]; | |
123: thiswchange += upstreamResult * error; | |
124: #ifdef BIASED | |
125: thisbiaschange += error; | |
126: #endif | |
127: } | |
128: } | |
129: } | |
130: } | |
131: } | |
132: } | |
133: if (localId < gFilterSizeSquared) { | |
134: gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange; | |
135: // weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId; | |
136: } | |
137: #ifdef BIASED | |
138: bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin; | |
139: if (writeBias) { | |
140: gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange; | |
141: } | |
142: #endif | |
143: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
144: // aggregate over: [outRow][outCol][n] | |
145: } | |
146: | |
147: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=2 -DgInputStripeOuterNumRows=2 -DgInputStripeInnerSize=4 -DgInputStripeOuterSize=4 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=4" | |
unknown file: Failure | |
C++ exception with description " | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=2 -DgInputStripeOuterNumRows=2 -DgInputStripeInnerSize=4 -DgInputStripeOuterSize=4 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=4" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.backprop_weights_2_upstreamimagesize2 (30 ms) | |
[ RUN ] testupdateweights.backprop_weights_2_upstreamimagesize3_filtersize3 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
options: -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=3 -DgInputStripeOuterNumRows=7 -DgInputStripeInnerSize=9 -DgInputStripeOuterSize=21 -DgInputStripeMarginSize=6 -DgOutputStripeNumRows=1 -DgOutputStripeSize=1 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=3 -DgInputStripeOuterNumRows=7 -DgInputStripeInnerSize=9 -DgInputStripeOuterSize=21 -DgInputStripeMarginSize=6 -DgOutputStripeNumRows=1 -DgOutputStripeSize=1" | |
kernel build error: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=3 -DgInputStripeOuterNumRows=7 -DgInputStripeInnerSize=9 -DgInputStripeOuterSize=21 -DgInputStripeMarginSize=6 -DgOutputStripeNumRows=1 -DgOutputStripeSize=1" | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // BIASED (or not) | |
9: | |
10: // workgroupId: [outputPlane][inputPlane] | |
11: // localId: [filterRow][filterCol] | |
12: // per-thread iteration: [n][outputRow][outputCol] | |
13: // local: errorimage: outputSize * outputSize | |
14: // imageimage: inputSize * inputSize | |
15: // specific characteristic: load one stripe of each image at a time, | |
16: // so we dont run out of memory | |
17: // number of stripes set in: gNumStripes | |
18: // note that whilst we can stripe the gradOutput simply, | |
19: // we actually need to add a half-filter widthed additional few rows | |
20: // onto the images stripe, otherwise we will be missing data | |
21: // we will call the size of the non-overlapping image stripes: gInputStripeInnerSize | |
22: // the outersize, including the two margins is: gInputStripeOuterSize | |
23: // of course, the first and last stripes will be missing a bit off the top/bottom, where the | |
24: // corresponding outer margin would be | |
25: void kernel backprop_floats_withscratch_dobias_striped( | |
26: const float learningRateMultiplier, const int batchSize, | |
27: global const float *gradOutput, global const float *images, | |
28: global float *gradWeights, | |
29: #ifdef BIASED | |
30: global float *gradBiasWeights, | |
31: #endif | |
32: local float *_errorStripe, local float *_imageStripe | |
33: ) { | |
34: // gHalfFilterSize | |
35: // gInputSize | |
36: // | |
37: // gInputStripeMarginRows => basically equal to gHalfFilterSize | |
38: // gInputStripeInnerNumRows = gInputSize / gNumStripes | |
39: // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize (note: one row less than | |
40: // if we just added gFilterSize) | |
41: // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize | |
42: // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize | |
43: // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize | |
44: // | |
45: // gOutputStripeNumRows | |
46: // gOutputStripeSize | |
47: | |
48: const int globalId = get_global_id(0); | |
49: const int localId = get_local_id(0); | |
50: const int workgroupId = get_group_id(0); | |
51: const int workgroupSize = get_local_size(0); | |
52: | |
53: const int filterRow = localId / gFilterSize; | |
54: const int filterCol = localId % gFilterSize; | |
55: | |
56: const int outPlane = workgroupId / gInputPlanes; | |
57: const int upstreamPlane = workgroupId % gInputPlanes; | |
58: | |
59: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
60: // aggregate over: [outRow][outCol][n] | |
61: float thiswchange = 0; | |
62: #ifdef BIASED | |
63: float thisbiaschange = 0; | |
64: #endif | |
65: const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize; | |
66: const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize; | |
67: for (int n = 0; n < batchSize; n++) { | |
68: const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
69: const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared; | |
70: const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared; | |
71: const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared; | |
72: for (int stripe = 0; stripe < gNumStripes; stripe++) { | |
73: const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize; | |
74: const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize; | |
75: // need to fetch the image, but it's bigger than us, so will need to loop... | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: for (int i = 0; i < numLoopsForImageStripe; i++) { | |
78: int thisOffset = i * workgroupSize + localId; | |
79: int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset; | |
80: bool process = thisOffset < gInputStripeOuterSize | |
81: && thisGlobalImagesOffset >= imageImageGlobalOffset | |
82: && thisGlobalImagesOffset < imageImageGlobalOffsetAfter; | |
83: if (process) { | |
84: _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ]; | |
85: } | |
86: } | |
87: int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize; | |
88: for (int i = 0; i < numLoopsForErrorStripe; i++) { | |
89: int thisOffset = i * workgroupSize + localId; | |
90: int globalErrorsOffset = errorStripeOffset + thisOffset; | |
91: bool process = thisOffset < gOutputStripeSize | |
92: && globalErrorsOffset < errorImageGlobalOffsetAfter; | |
93: if (process) { | |
94: _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset]; | |
95: } | |
96: } | |
97: const int stripeOutRowStart = stripe * gOutputStripeNumRows; | |
98: const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows; | |
99: barrier(CLK_LOCAL_MEM_FENCE); | |
100: // if (localId == 13) { | |
101: // for (int i = 0; i < 12; i++) { | |
102: // gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize]; | |
103: // } | |
104: // for (int i = 0; i < 20; i++) { | |
105: // gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize]; | |
106: // } | |
107: // } | |
108: if (localId < gFilterSizeSquared) { | |
109: for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) { | |
110: int upstreamRow = outRow - gMargin + filterRow; | |
111: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
112: int upstreamCol = outCol - gMargin + filterCol; | |
113: bool proceed = | |
114: upstreamRow >= 0 && upstreamCol >= 0 | |
115: && upstreamRow < gInputSize && upstreamCol < gInputSize | |
116: && outRow < gOutputSize; | |
117: if (proceed) { | |
118: int resultIndex = outRow * gOutputSize + outCol; | |
119: float error = _errorStripe[resultIndex - stripe * gOutputStripeSize]; | |
120: int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol; | |
121: float upstreamResult = _imageStripe[upstreamDataIndex + gInputStripeMarginSize | |
122: - stripe * gInputStripeInnerSize ]; | |
123: thiswchange += upstreamResult * error; | |
124: #ifdef BIASED | |
125: thisbiaschange += error; | |
126: #endif | |
127: } | |
128: } | |
129: } | |
130: } | |
131: } | |
132: } | |
133: if (localId < gFilterSizeSquared) { | |
134: gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange; | |
135: // weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId; | |
136: } | |
137: #ifdef BIASED | |
138: bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin; | |
139: if (writeBias) { | |
140: gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange; | |
141: } | |
142: #endif | |
143: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
144: // aggregate over: [outRow][outCol][n] | |
145: } | |
146: | |
147: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=3 -DgInputStripeOuterNumRows=7 -DgInputStripeInnerSize=9 -DgInputStripeOuterSize=21 -DgInputStripeMarginSize=6 -DgOutputStripeNumRows=1 -DgOutputStripeSize=1" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.backprop_weights_2_upstreamimagesize3_filtersize3 (25 ms) | |
[ RUN ] testupdateweights.backprop_weights_2_upstreamimagesize4_filtersize3 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
options: -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=4 -DgInputStripeOuterNumRows=8 -DgInputStripeInnerSize=16 -DgInputStripeOuterSize=32 -DgInputStripeMarginSize=8 -DgOutputStripeNumRows=2 -DgOutputStripeSize=4 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=4 -DgInputStripeOuterNumRows=8 -DgInputStripeInnerSize=16 -DgInputStripeOuterSize=32 -DgInputStripeMarginSize=8 -DgOutputStripeNumRows=2 -DgOutputStripeSize=4" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // BIASED (or not) | |
9: | |
10: // workgroupId: [outputPlane][inputPlane] | |
11: // localId: [filterRow][filterCol] | |
12: // per-thread iteration: [n][outputRow][outputCol] | |
13: // local: errorimage: outputSize * outputSize | |
14: // imageimage: inputSize * inputSize | |
15: // specific characteristic: load one stripe of each image at a time, | |
16: // so we dont run out of memory | |
17: // number of stripes set in: gNumStripes | |
18: // note that whilst we can stripe the gradOutput simply, | |
19: // we actually need to add a half-filter widthed additional few rows | |
20: // onto the images stripe, otherwise we will be missing data | |
21: // we will call the size of the non-overlapping image stripes: gInputStripeInnerSize | |
22: // the outersize, including the two margins is: gInputStripeOuterSize | |
23: // of course, the first and last stripes will be missing a bit off the top/bottom, where the | |
24: // corresponding outer margin would be | |
25: void kernel backprop_floats_withscratch_dobias_striped( | |
26: const float learningRateMultiplier, const int batchSize, | |
27: global const float *gradOutput, global const float *images, | |
28: global float *gradWeights, | |
29: #ifdef BIASED | |
30: global float *gradBiasWeights, | |
31: #endif | |
32: local float *_errorStripe, local float *_imageStripe | |
33: ) { | |
34: // gHalfFilterSize | |
35: // gInputSize | |
36: // | |
37: // gInputStripeMarginRows => basically equal to gHalfFilterSize | |
38: // gInputStripeInnerNumRows = gInputSize / gNumStripes | |
39: // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize (note: one row less than | |
40: // if we just added gFilterSize) | |
41: // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize | |
42: // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize | |
43: // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize | |
44: // | |
45: // gOutputStripeNumRows | |
46: // gOutputStripeSize | |
47: | |
48: const int globalId = get_global_id(0); | |
49: const int localId = get_local_id(0); | |
50: const int workgroupId = get_group_id(0); | |
51: const int workgroupSize = get_local_size(0); | |
52: | |
53: const int filterRow = localId / gFilterSize; | |
54: const int filterCol = localId % gFilterSize; | |
55: | |
56: const int outPlane = workgroupId / gInputPlanes; | |
57: const int upstreamPlane = workgroupId % gInputPlanes; | |
58: | |
59: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
60: // aggregate over: [outRow][outCol][n] | |
61: float thiswchange = 0; | |
62: #ifdef BIASED | |
63: float thisbiaschange = 0; | |
64: #endif | |
65: const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize; | |
66: const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize; | |
67: for (int n = 0; n < batchSize; n++) { | |
68: const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
69: const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared; | |
70: const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared; | |
71: const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared; | |
72: for (int stripe = 0; stripe < gNumStripes; stripe++) { | |
73: const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize; | |
74: const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize; | |
75: // need to fetch the image, but it's bigger than us, so will need to loop... | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: for (int i = 0; i < numLoopsForImageStripe; i++) { | |
78: int thisOffset = i * workgroupSize + localId; | |
79: int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset; | |
80: bool process = thisOffset < gInputStripeOuterSize | |
81: && thisGlobalImagesOffset >= imageImageGlobalOffset | |
82: && thisGlobalImagesOffset < imageImageGlobalOffsetAfter; | |
83: if (process) { | |
84: _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ]; | |
85: } | |
86: } | |
87: int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize; | |
88: for (int i = 0; i < numLoopsForErrorStripe; i++) { | |
89: int thisOffset = i * workgroupSize + localId; | |
90: int globalErrorsOffset = errorStripeOffset + thisOffset; | |
91: bool process = thisOffset < gOutputStripeSize | |
92: && globalErrorsOffset < errorImageGlobalOffsetAfter; | |
93: if (process) { | |
94: _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset]; | |
95: } | |
96: } | |
97: const int stripeOutRowStart = stripe * gOutputStripeNumRows; | |
98: const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows; | |
99: barrier(CLK_LOCAL_MEM_FENCE); | |
100: // if (localId == 13) { | |
101: // for (int i = 0; i < 12; i++) { | |
102: // gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize]; | |
103: // } | |
104: // for (int i = 0; i < 20; i++) { | |
105: // gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize]; | |
106: // } | |
107: // } | |
108: if (localId < gFilterSizeSquared) { | |
109: for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) { | |
110: int upstreamRow = outRow - gMargin + filterRow; | |
111: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
112: int upstreamCol = outCol - gMargin + filterCol; | |
113: bool proceed = | |
114: upstreamRow >= 0 && upstreamCol >= 0 | |
115: && upstreamRow < gInputSize && upstreamCol < gInputSize | |
116: && outRow < gOutputSize; | |
117: if (proceed) { | |
118: int resultIndex = outRow * gOutputSize + outCol; | |
119: float error = _errorStripe[resultIndex - stripe * gOutputStripeSize]; | |
120: int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol; | |
121: float upstreamResult = _imageStripe[upstreamDataIndex + gInputStripeMarginSize | |
122: - stripe * gInputStripeInnerSize ]; | |
123: thiswchange += upstreamResult * error; | |
124: #ifdef BIASED | |
125: thisbiaschange += error; | |
126: #endif | |
127: } | |
128: } | |
129: } | |
130: } | |
131: } | |
132: } | |
133: if (localId < gFilterSizeSquared) { | |
134: gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange; | |
135: // weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId; | |
136: } | |
137: #ifdef BIASED | |
138: bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin; | |
139: if (writeBias) { | |
140: gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange; | |
141: } | |
142: #endif | |
143: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
144: // aggregate over: [outRow][outCol][n] | |
145: } | |
146: | |
147: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=4 -DgInputStripeOuterNumRows=8 -DgInputStripeInnerSize=16 -DgInputStripeOuterSize=32 -DgInputStripeMarginSize=8 -DgOutputStripeNumRows=2 -DgOutputStripeSize=4" | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // BIASED (or not) | |
9: | |
10: // workgroupId: [outputPlane][inputPlane] | |
11: // localId: [filterRow][filterCol] | |
12: // per-thread iteration: [n][outputRow][outputCol] | |
13: // local: errorimage: outputSize * outputSize | |
14: // imageimage: inputSize * inputSize | |
15: // specific characteristic: load one stripe of each image at a time, | |
16: // so we dont run out of memory | |
17: // number of stripes set in: gNumStripes | |
18: // note that whilst we can stripe the gradOutput simply, | |
19: // we actually need to add a half-filter widthed additional few rows | |
20: // onto the images stripe, otherwise we will be missing data | |
21: // we will call the size of the non-overlapping image stripes: gInputStripeInnerSize | |
22: // the outersize, including the two margins is: gInputStripeOuterSize | |
23: // of course, the first and last stripes will be missing a bit off the top/bottom, where the | |
24: // corresponding outer margin would be | |
25: void kernel backprop_floats_withscratch_dobias_striped( | |
26: const float learningRateMultiplier, const int batchSize, | |
27: global const float *gradOutput, global const float *images, | |
28: global float *gradWeights, | |
29: #ifdef BIASED | |
30: global float *gradBiasWeights, | |
31: #endif | |
32: local float *_errorStripe, local float *_imageStripe | |
33: ) { | |
34: // gHalfFilterSize | |
35: // gInputSize | |
36: // | |
37: // gInputStripeMarginRows => basically equal to gHalfFilterSize | |
38: // gInputStripeInnerNumRows = gInputSize / gNumStripes | |
39: // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize (note: one row less than | |
40: // if we just added gFilterSize) | |
41: // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize | |
42: // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize | |
43: // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize | |
44: // | |
45: // gOutputStripeNumRows | |
46: // gOutputStripeSize | |
47: | |
48: const int globalId = get_global_id(0); | |
49: const int localId = get_local_id(0); | |
50: const int workgroupId = get_group_id(0); | |
51: const int workgroupSize = get_local_size(0); | |
52: | |
53: const int filterRow = localId / gFilterSize; | |
54: const int filterCol = localId % gFilterSize; | |
55: | |
56: const int outPlane = workgroupId / gInputPlanes; | |
57: const int upstreamPlane = workgroupId % gInputPlanes; | |
58: | |
59: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
60: // aggregate over: [outRow][outCol][n] | |
61: float thiswchange = 0; | |
62: #ifdef BIASED | |
63: float thisbiaschange = 0; | |
64: #endif | |
65: const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize; | |
66: const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize; | |
67: for (int n = 0; n < batchSize; n++) { | |
68: const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
69: const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared; | |
70: const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared; | |
71: const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared; | |
72: for (int stripe = 0; stripe < gNumStripes; stripe++) { | |
73: const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize; | |
74: const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize; | |
75: // need to fetch the image, but it's bigger than us, so will need to loop... | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: for (int i = 0; i < numLoopsForImageStripe; i++) { | |
78: int thisOffset = i * workgroupSize + localId; | |
79: int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset; | |
80: bool process = thisOffset < gInputStripeOuterSize | |
81: && thisGlobalImagesOffset >= imageImageGlobalOffset | |
82: && thisGlobalImagesOffset < imageImageGlobalOffsetAfter; | |
83: if (process) { | |
84: _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ]; | |
85: } | |
86: } | |
87: int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize; | |
88: for (int i = 0; i < numLoopsForErrorStripe; i++) { | |
89: int thisOffset = i * workgroupSize + localId; | |
90: int globalErrorsOffset = errorStripeOffset + thisOffset; | |
91: bool process = thisOffset < gOutputStripeSize | |
92: && globalErrorsOffset < errorImageGlobalOffsetAfter; | |
93: if (process) { | |
94: _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset]; | |
95: } | |
96: } | |
97: const int stripeOutRowStart = stripe * gOutputStripeNumRows; | |
98: const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows; | |
99: barrier(CLK_LOCAL_MEM_FENCE); | |
100: // if (localId == 13) { | |
101: // for (int i = 0; i < 12; i++) { | |
102: // gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize]; | |
103: // } | |
104: // for (int i = 0; i < 20; i++) { | |
105: // gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize]; | |
106: // } | |
107: // } | |
108: if (localId < gFilterSizeSquared) { | |
109: for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) { | |
110: int upstreamRow = outRow - gMargin + filterRow; | |
111: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
112: int upstreamCol = outCol - gMargin + filterCol; | |
113: bool proceed = | |
114: upstreamRow >= 0 && upstreamCol >= 0 | |
115: && upstreamRow < gInputSize && upstreamCol < gInputSize | |
116: && outRow < gOutputSize; | |
117: if (proceed) { | |
118: int resultIndex = outRow * gOutputSize + outCol; | |
119: float error = _errorStripe[resultIndex - stripe * gOutputStripeSize]; | |
120: int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol; | |
121: float upstreamResult = _imageStripe[upstreamDataIndex + gInputStripeMarginSize | |
122: - stripe * gInputStripeInnerSize ]; | |
123: thiswchange += upstreamResult * error; | |
124: #ifdef BIASED | |
125: thisbiaschange += error; | |
126: #endif | |
127: } | |
128: } | |
129: } | |
130: } | |
131: } | |
132: } | |
133: if (localId < gFilterSizeSquared) { | |
134: gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange; | |
135: // weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId; | |
136: } | |
137: #ifdef BIASED | |
138: bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin; | |
139: if (writeBias) { | |
140: gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange; | |
141: } | |
142: #endif | |
143: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
144: // aggregate over: [outRow][outCol][n] | |
145: } | |
146: | |
147: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=4 -DgInputStripeOuterNumRows=8 -DgInputStripeInnerSize=16 -DgInputStripeOuterSize=32 -DgInputStripeMarginSize=8 -DgOutputStripeNumRows=2 -DgOutputStripeSize=4" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.backprop_weights_2_upstreamimagesize4_filtersize3 (34 ms) | |
[ RUN ] testupdateweights.backprop_weights_2_upstreamimagesize5_filtersize3 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
options: -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=5 -DgInputStripeOuterNumRows=9 -DgInputStripeInnerSize=25 -DgInputStripeOuterSize=45 -DgInputStripeMarginSize=10 -DgOutputStripeNumRows=3 -DgOutputStripeSize=9 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=5 -DgInputStripeOuterNumRows=9 -DgInputStripeInnerSize=25 -DgInputStripeOuterSize=45 -DgInputStripeMarginSize=10 -DgOutputStripeNumRows=3 -DgOutputStripeSize=9" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // BIASED (or not) | |
9: | |
10: // workgroupId: [outputPlane][inputPlane] | |
11: // localId: [filterRow][filterCol] | |
12: // per-thread iteration: [n][outputRow][outputCol] | |
13: // local: errorimage: outputSize * outputSize | |
14: // imageimage: inputSize * inputSize | |
15: // specific characteristic: load one stripe of each image at a time, | |
16: // so we dont run out of memory | |
17: // number of stripes set in: gNumStripes | |
18: // note that whilst we can stripe the gradOutput simply, | |
19: // we actually need to add a half-filter widthed additional few rows | |
20: // onto the images stripe, otherwise we will be missing data | |
21: // we will call the size of the non-overlapping image stripes: gInputStripeInnerSize | |
22: // the outersize, including the two margins is: gInputStripeOuterSize | |
23: // of course, the first and last stripes will be missing a bit off the top/bottom, where the | |
24: // corresponding outer margin would be | |
25: void kernel backprop_floats_withscratch_dobias_striped( | |
26: const float learningRateMultiplier, const int batchSize, | |
27: global const float *gradOutput, global const float *images, | |
28: global float *gradWeights, | |
29: #ifdef BIASED | |
30: global float *gradBiasWeights, | |
31: #endif | |
32: local float *_errorStripe, local float *_imageStripe | |
33: ) { | |
34: // gHalfFilterSize | |
35: // gInputSize | |
36: // | |
37: // gInputStripeMarginRows => basically equal to gHalfFilterSize | |
38: // gInputStripeInnerNumRows = gInputSize / gNumStripes | |
39: // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize (note: one row less than | |
40: // if we just added gFilterSize) | |
41: // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize | |
42: // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize | |
43: // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize | |
44: // | |
45: // gOutputStripeNumRows | |
46: // gOutputStripeSize | |
47: | |
48: const int globalId = get_global_id(0); | |
49: const int localId = get_local_id(0); | |
50: const int workgroupId = get_group_id(0); | |
51: const int workgroupSize = get_local_size(0); | |
52: | |
53: const int filterRow = localId / gFilterSize; | |
54: const int filterCol = localId % gFilterSize; | |
55: | |
56: const int outPlane = workgroupId / gInputPlanes; | |
57: const int upstreamPlane = workgroupId % gInputPlanes; | |
58: | |
59: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
60: // aggregate over: [outRow][outCol][n] | |
61: float thiswchange = 0; | |
62: #ifdef BIASED | |
63: float thisbiaschange = 0; | |
64: #endif | |
65: const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize; | |
66: const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize; | |
67: for (int n = 0; n < batchSize; n++) { | |
68: const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
69: const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared; | |
70: const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared; | |
71: const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared; | |
72: for (int stripe = 0; stripe < gNumStripes; stripe++) { | |
73: const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize; | |
74: const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize; | |
75: // need to fetch the image, but it's bigger than us, so will need to loop... | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: for (int i = 0; i < numLoopsForImageStripe; i++) { | |
78: int thisOffset = i * workgroupSize + localId; | |
79: int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset; | |
80: bool process = thisOffset < gInputStripeOuterSize | |
81: && thisGlobalImagesOffset >= imageImageGlobalOffset | |
82: && thisGlobalImagesOffset < imageImageGlobalOffsetAfter; | |
83: if (process) { | |
84: _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ]; | |
85: } | |
86: } | |
87: int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize; | |
88: for (int i = 0; i < numLoopsForErrorStripe; i++) { | |
89: int thisOffset = i * workgroupSize + localId; | |
90: int globalErrorsOffset = errorStripeOffset + thisOffset; | |
91: bool process = thisOffset < gOutputStripeSize | |
92: && globalErrorsOffset < errorImageGlobalOffsetAfter; | |
93: if (process) { | |
94: _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset]; | |
95: } | |
96: } | |
97: const int stripeOutRowStart = stripe * gOutputStripeNumRows; | |
98: const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows; | |
99: barrier(CLK_LOCAL_MEM_FENCE); | |
100: // if (localId == 13) { | |
101: // for (int i = 0; i < 12; i++) { | |
102: // gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize]; | |
103: // } | |
104: // for (int i = 0; i < 20; i++) { | |
105: // gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize]; | |
106: // } | |
107: // } | |
108: if (localId < gFilterSizeSquared) { | |
109: for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) { | |
110: int upstreamRow = outRow - gMargin + filterRow; | |
111: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
112: int upstreamCol = outCol - gMargin + filterCol; | |
113: bool proceed = | |
114: upstreamRow >= 0 && upstreamCol >= 0 | |
115: && upstreamRow < gInputSize && upstreamCol < gInputSize | |
116: && outRow < gOutputSize; | |
117: if (proceed) { | |
118: int resultIndex = outRow * gOutputSize + outCol; | |
119: float error = _errorStripe[resultIndex - stripe * gOutputStripeSize]; | |
120: int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol; | |
121: float upstreamResult = _imageStripe[upstreamDataIndex + gInputStripeMarginSize | |
122: - stripe * gInputStripeInnerSize ]; | |
123: thiswchange += upstreamResult * error; | |
124: #ifdef BIASED | |
125: thisbiaschange += error; | |
126: #endif | |
127: } | |
128: } | |
129: } | |
130: } | |
131: } | |
132: } | |
133: if (localId < gFilterSizeSquared) { | |
134: gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange; | |
135: // weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId; | |
136: } | |
137: #ifdef BIASED | |
138: bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin; | |
139: if (writeBias) { | |
140: gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange; | |
141: } | |
142: #endif | |
143: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
144: // aggregate over: [outRow][outCol][n] | |
145: } | |
146: | |
147: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45
cl/BackpropWeightsScratchLarge.cl build log:
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=5 -DgInputStripeOuterNumRows=9 -DgInputStripeInnerSize=25 -DgInputStripeOuterSize=45 -DgInputStripeMarginSize=10 -DgOutputStripeNumRows=3 -DgOutputStripeSize=9"
unknown file: Failure
C++ exception with description "
kernel source:
1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // BIASED (or not) | |
9: | |
10: // workgroupId: [outputPlane][inputPlane] | |
11: // localId: [filterRow][filterCol] | |
12: // per-thread iteration: [n][outputRow][outputCol] | |
13: // local: errorimage: outputSize * outputSize | |
14: // imageimage: inputSize * inputSize | |
15: // specific characteristic: load one stripe of each image at a time, | |
16: // so we dont run out of memory | |
17: // number of stripes set in: gNumStripes | |
18: // note that whilst we can stripe the gradOutput simply, | |
19: // we actually need to add a half-filter widthed additional few rows | |
20: // onto the images stripe, otherwise we will be missing data | |
21: // we will call the size of the non-overlapping image stripes: gInputStripeInnerSize | |
22: // the outersize, including the two margins is: gInputStripeOuterSize | |
23: // of course, the first and last stripes will be missing a bit off the top/bottom, where the | |
24: // corresponding outer margin would be | |
25: void kernel backprop_floats_withscratch_dobias_striped( | |
26: const float learningRateMultiplier, const int batchSize, | |
27: global const float *gradOutput, global const float *images, | |
28: global float *gradWeights, | |
29: #ifdef BIASED | |
30: global float *gradBiasWeights, | |
31: #endif | |
32: local float *_errorStripe, local float *_imageStripe | |
33: ) { | |
34: // gHalfFilterSize | |
35: // gInputSize | |
36: // | |
37: // gInputStripeMarginRows => basically equal to gHalfFilterSize | |
38: // gInputStripeInnerNumRows = gInputSize / gNumStripes | |
39: // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize (note: one row less than | |
40: // if we just added gFilterSize) | |
41: // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize | |
42: // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize | |
43: // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize | |
44: // | |
45: // gOutputStripeNumRows | |
46: // gOutputStripeSize | |
47: | |
48: const int globalId = get_global_id(0); | |
49: const int localId = get_local_id(0); | |
50: const int workgroupId = get_group_id(0); | |
51: const int workgroupSize = get_local_size(0); | |
52: | |
53: const int filterRow = localId / gFilterSize; | |
54: const int filterCol = localId % gFilterSize; | |
55: | |
56: const int outPlane = workgroupId / gInputPlanes; | |
57: const int upstreamPlane = workgroupId % gInputPlanes; | |
58: | |
59: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
60: // aggregate over: [outRow][outCol][n] | |
61: float thiswchange = 0; | |
62: #ifdef BIASED | |
63: float thisbiaschange = 0; | |
64: #endif | |
65: const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize; | |
66: const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize; | |
67: for (int n = 0; n < batchSize; n++) { | |
68: const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
69: const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared; | |
70: const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared; | |
71: const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared; | |
72: for (int stripe = 0; stripe < gNumStripes; stripe++) { | |
73: const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize; | |
74: const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize; | |
75: // need to fetch the image, but it's bigger than us, so will need to loop... | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: for (int i = 0; i < numLoopsForImageStripe; i++) { | |
78: int thisOffset = i * workgroupSize + localId; | |
79: int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset; | |
80: bool process = thisOffset < gInputStripeOuterSize | |
81: && thisGlobalImagesOffset >= imageImageGlobalOffset | |
82: && thisGlobalImagesOffset < imageImageGlobalOffsetAfter; | |
83: if (process) { | |
84: _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ]; | |
85: } | |
86: } | |
87: int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize; | |
88: for (int i = 0; i < numLoopsForErrorStripe; i++) { | |
89: int thisOffset = i * workgroupSize + localId; | |
90: int globalErrorsOffset = errorStripeOffset + thisOffset; | |
91: bool process = thisOffset < gOutputStripeSize | |
92: && globalErrorsOffset < errorImageGlobalOffsetAfter; | |
93: if (process) { | |
94: _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset]; | |
95: } | |
96: } | |
97: const int stripeOutRowStart = stripe * gOutputStripeNumRows; | |
98: const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows; | |
99: barrier(CLK_LOCAL_MEM_FENCE); | |
100: // if (localId == 13) { | |
101: // for (int i = 0; i < 12; i++) { | |
102: // gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize]; | |
103: // } | |
104: // for (int i = 0; i < 20; i++) { | |
105: // gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize]; | |
106: // } | |
107: // } | |
108: if (localId < gFilterSizeSquared) { | |
109: for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) { | |
110: int upstreamRow = outRow - gMargin + filterRow; | |
111: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
112: int upstreamCol = outCol - gMargin + filterCol; | |
113: bool proceed = | |
114: upstreamRow >= 0 && upstreamCol >= 0 | |
115: && upstreamRow < gInputSize && upstreamCol < gInputSize | |
116: && outRow < gOutputSize; | |
117: if (proceed) { | |
118: int resultIndex = outRow * gOutputSize + outCol; | |
119: float error = _errorStripe[resultIndex - stripe * gOutputStripeSize]; | |
120: int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol; | |
121: float upstreamResult = _imageStripe[upstreamDataIndex + gInputStripeMarginSize | |
122: - stripe * gInputStripeInnerSize ]; | |
123: thiswchange += upstreamResult * error; | |
124: #ifdef BIASED | |
125: thisbiaschange += error; | |
126: #endif | |
127: } | |
128: } | |
129: } | |
130: } | |
131: } | |
132: } | |
133: if (localId < gFilterSizeSquared) { | |
134: gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange; | |
135: // weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId; | |
136: } | |
137: #ifdef BIASED | |
138: bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin; | |
139: if (writeBias) { | |
140: gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange; | |
141: } | |
142: #endif | |
143: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
144: // aggregate over: [outRow][outCol][n] | |
145: } | |
146: | |
147: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45
cl/BackpropWeightsScratchLarge.cl build log:
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=5 -D gInputSizeSquared=25 -D gNumFilters=1 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=2 -DgInputStripeInnerNumRows=5 -DgInputStripeOuterNumRows=9 -DgInputStripeInnerSize=25 -DgInputStripeOuterSize=45 -DgInputStripeMarginSize=10 -DgOutputStripeNumRows=3 -DgOutputStripeSize=9"
" thrown in the test body.
[ FAILED ] testupdateweights.backprop_weights_2_upstreamimagesize5_filtersize3 (50 ms)
[ RUN ] testupdateweights.backprop_weights_2_upstreamimagesize3_filtersize1
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
Trying for OpenCL-enabled CPU
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
Using OpenCL device: Vivante OpenCL Device
options: -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=3 -DgInputStripeOuterNumRows=3 -DgInputStripeInnerSize=9 -DgInputStripeOuterSize=9 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=3 -DgOutputStripeSize=9
cl/BackpropWeightsScratchLarge.cl build log:
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=3 -DgInputStripeOuterNumRows=3 -DgInputStripeInnerSize=9 -DgInputStripeOuterSize=9 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=3 -DgOutputStripeSize=9"
kernel build error:
kernel source:
1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // BIASED (or not) | |
9: | |
10: // workgroupId: [outputPlane][inputPlane] | |
11: // localId: [filterRow][filterCol] | |
12: // per-thread iteration: [n][outputRow][outputCol] | |
13: // local: errorimage: outputSize * outputSize | |
14: // imageimage: inputSize * inputSize | |
15: // specific characteristic: load one stripe of each image at a time, | |
16: // so we dont run out of memory | |
17: // number of stripes set in: gNumStripes | |
18: // note that whilst we can stripe the gradOutput simply, | |
19: // we actually need to add a half-filter widthed additional few rows | |
20: // onto the images stripe, otherwise we will be missing data | |
21: // we will call the size of the non-overlapping image stripes: gInputStripeInnerSize | |
22: // the outersize, including the two margins is: gInputStripeOuterSize | |
23: // of course, the first and last stripes will be missing a bit off the top/bottom, where the | |
24: // corresponding outer margin would be | |
25: void kernel backprop_floats_withscratch_dobias_striped( | |
26: const float learningRateMultiplier, const int batchSize, | |
27: global const float *gradOutput, global const float *images, | |
28: global float *gradWeights, | |
29: #ifdef BIASED | |
30: global float *gradBiasWeights, | |
31: #endif | |
32: local float *_errorStripe, local float *_imageStripe | |
33: ) { | |
34: // gHalfFilterSize | |
35: // gInputSize | |
36: // | |
37: // gInputStripeMarginRows => basically equal to gHalfFilterSize | |
38: // gInputStripeInnerNumRows = gInputSize / gNumStripes | |
39: // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize (note: one row less than | |
40: // if we just added gFilterSize) | |
41: // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize | |
42: // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize | |
43: // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize | |
44: // | |
45: // gOutputStripeNumRows | |
46: // gOutputStripeSize | |
47: | |
48: const int globalId = get_global_id(0); | |
49: const int localId = get_local_id(0); | |
50: const int workgroupId = get_group_id(0); | |
51: const int workgroupSize = get_local_size(0); | |
52: | |
53: const int filterRow = localId / gFilterSize; | |
54: const int filterCol = localId % gFilterSize; | |
55: | |
56: const int outPlane = workgroupId / gInputPlanes; | |
57: const int upstreamPlane = workgroupId % gInputPlanes; | |
58: | |
59: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
60: // aggregate over: [outRow][outCol][n] | |
61: float thiswchange = 0; | |
62: #ifdef BIASED | |
63: float thisbiaschange = 0; | |
64: #endif | |
65: const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize; | |
66: const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize; | |
67: for (int n = 0; n < batchSize; n++) { | |
68: const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
69: const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared; | |
70: const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared; | |
71: const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared; | |
72: for (int stripe = 0; stripe < gNumStripes; stripe++) { | |
73: const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize; | |
74: const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize; | |
75: // need to fetch the image, but it's bigger than us, so will need to loop... | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: for (int i = 0; i < numLoopsForImageStripe; i++) { | |
78: int thisOffset = i * workgroupSize + localId; | |
79: int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset; | |
80: bool process = thisOffset < gInputStripeOuterSize | |
81: && thisGlobalImagesOffset >= imageImageGlobalOffset | |
82: && thisGlobalImagesOffset < imageImageGlobalOffsetAfter; | |
83: if (process) { | |
84: _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ]; | |
85: } | |
86: } | |
87: int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize; | |
88: for (int i = 0; i < numLoopsForErrorStripe; i++) { | |
89: int thisOffset = i * workgroupSize + localId; | |
90: int globalErrorsOffset = errorStripeOffset + thisOffset; | |
91: bool process = thisOffset < gOutputStripeSize | |
92: && globalErrorsOffset < errorImageGlobalOffsetAfter; | |
93: if (process) { | |
94: _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset]; | |
95: } | |
96: } | |
97: const int stripeOutRowStart = stripe * gOutputStripeNumRows; | |
98: const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows; | |
99: barrier(CLK_LOCAL_MEM_FENCE); | |
100: // if (localId == 13) { | |
101: // for (int i = 0; i < 12; i++) { | |
102: // gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize]; | |
103: // } | |
104: // for (int i = 0; i < 20; i++) { | |
105: // gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize]; | |
106: // } | |
107: // } | |
108: if (localId < gFilterSizeSquared) { | |
109: for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) { | |
110: int upstreamRow = outRow - gMargin + filterRow; | |
111: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
112: int upstreamCol = outCol - gMargin + filterCol; | |
113: bool proceed = | |
114: upstreamRow >= 0 && upstreamCol >= 0 | |
115: && upstreamRow < gInputSize && upstreamCol < gInputSize | |
116: && outRow < gOutputSize; | |
117: if (proceed) { | |
118: int resultIndex = outRow * gOutputSize + outCol; | |
119: float error = _errorStripe[resultIndex - stripe * gOutputStripeSize]; | |
120: int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol; | |
121: float upstreamResult = _imageStripe[upstreamDataIndex + gInputStripeMarginSize | |
122: - stripe * gInputStripeInnerSize ]; | |
123: thiswchange += upstreamResult * error; | |
124: #ifdef BIASED | |
125: thisbiaschange += error; | |
126: #endif | |
127: } | |
128: } | |
129: } | |
130: } | |
131: } | |
132: } | |
133: if (localId < gFilterSizeSquared) { | |
134: gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange; | |
135: // weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId; | |
136: } | |
137: #ifdef BIASED | |
138: bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin; | |
139: if (writeBias) { | |
140: gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange; | |
141: } | |
142: #endif | |
143: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
144: // aggregate over: [outRow][outCol][n] | |
145: } | |
146: | |
147: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45
cl/BackpropWeightsScratchLarge.cl build log:
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=3 -DgInputStripeOuterNumRows=3 -DgInputStripeInnerSize=9 -DgInputStripeOuterSize=9 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=3 -DgOutputStripeSize=9"
unknown file: Failure
C++ exception with description "
kernel source:
1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // BIASED (or not) | |
9: | |
10: // workgroupId: [outputPlane][inputPlane] | |
11: // localId: [filterRow][filterCol] | |
12: // per-thread iteration: [n][outputRow][outputCol] | |
13: // local: errorimage: outputSize * outputSize | |
14: // imageimage: inputSize * inputSize | |
15: // specific characteristic: load one stripe of each image at a time, | |
16: // so we dont run out of memory | |
17: // number of stripes set in: gNumStripes | |
18: // note that whilst we can stripe the gradOutput simply, | |
19: // we actually need to add a half-filter widthed additional few rows | |
20: // onto the images stripe, otherwise we will be missing data | |
21: // we will call the size of the non-overlapping image stripes: gInputStripeInnerSize | |
22: // the outersize, including the two margins is: gInputStripeOuterSize | |
23: // of course, the first and last stripes will be missing a bit off the top/bottom, where the | |
24: // corresponding outer margin would be | |
25: void kernel backprop_floats_withscratch_dobias_striped( | |
26: const float learningRateMultiplier, const int batchSize, | |
27: global const float *gradOutput, global const float *images, | |
28: global float *gradWeights, | |
29: #ifdef BIASED | |
30: global float *gradBiasWeights, | |
31: #endif | |
32: local float *_errorStripe, local float *_imageStripe | |
33: ) { | |
34: // gHalfFilterSize | |
35: // gInputSize | |
36: // | |
37: // gInputStripeMarginRows => basically equal to gHalfFilterSize | |
38: // gInputStripeInnerNumRows = gInputSize / gNumStripes | |
39: // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize (note: one row less than | |
40: // if we just added gFilterSize) | |
41: // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize | |
42: // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize | |
43: // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize | |
44: // | |
45: // gOutputStripeNumRows | |
46: // gOutputStripeSize | |
47: | |
48: const int globalId = get_global_id(0); | |
49: const int localId = get_local_id(0); | |
50: const int workgroupId = get_group_id(0); | |
51: const int workgroupSize = get_local_size(0); | |
52: | |
53: const int filterRow = localId / gFilterSize; | |
54: const int filterCol = localId % gFilterSize; | |
55: | |
56: const int outPlane = workgroupId / gInputPlanes; | |
57: const int upstreamPlane = workgroupId % gInputPlanes; | |
58: | |
59: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
60: // aggregate over: [outRow][outCol][n] | |
61: float thiswchange = 0; | |
62: #ifdef BIASED | |
63: float thisbiaschange = 0; | |
64: #endif | |
65: const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize; | |
66: const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize; | |
67: for (int n = 0; n < batchSize; n++) { | |
68: const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
69: const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared; | |
70: const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared; | |
71: const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared; | |
72: for (int stripe = 0; stripe < gNumStripes; stripe++) { | |
73: const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize; | |
74: const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize; | |
75: // need to fetch the image, but it's bigger than us, so will need to loop... | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: for (int i = 0; i < numLoopsForImageStripe; i++) { | |
78: int thisOffset = i * workgroupSize + localId; | |
79: int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset; | |
80: bool process = thisOffset < gInputStripeOuterSize | |
81: && thisGlobalImagesOffset >= imageImageGlobalOffset | |
82: && thisGlobalImagesOffset < imageImageGlobalOffsetAfter; | |
83: if (process) { | |
84: _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ]; | |
85: } | |
86: } | |
87: int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize; | |
88: for (int i = 0; i < numLoopsForErrorStripe; i++) { | |
89: int thisOffset = i * workgroupSize + localId; | |
90: int globalErrorsOffset = errorStripeOffset + thisOffset; | |
91: bool process = thisOffset < gOutputStripeSize | |
92: && globalErrorsOffset < errorImageGlobalOffsetAfter; | |
93: if (process) { | |
94: _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset]; | |
95: } | |
96: } | |
97: const int stripeOutRowStart = stripe * gOutputStripeNumRows; | |
98: const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows; | |
99: barrier(CLK_LOCAL_MEM_FENCE); | |
100: // if (localId == 13) { | |
101: // for (int i = 0; i < 12; i++) { | |
102: // gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize]; | |
103: // } | |
104: // for (int i = 0; i < 20; i++) { | |
105: // gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize]; | |
106: // } | |
107: // } | |
108: if (localId < gFilterSizeSquared) { | |
109: for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) { | |
110: int upstreamRow = outRow - gMargin + filterRow; | |
111: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
112: int upstreamCol = outCol - gMargin + filterCol; | |
113: bool proceed = | |
114: upstreamRow >= 0 && upstreamCol >= 0 | |
115: && upstreamRow < gInputSize && upstreamCol < gInputSize | |
116: && outRow < gOutputSize; | |
117: if (proceed) { | |
118: int resultIndex = outRow * gOutputSize + outCol; | |
119: float error = _errorStripe[resultIndex - stripe * gOutputStripeSize]; | |
120: int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol; | |
121: float upstreamResult = _imageStripe[upstreamDataIndex + gInputStripeMarginSize | |
122: - stripe * gInputStripeInnerSize ]; | |
123: thiswchange += upstreamResult * error; | |
124: #ifdef BIASED | |
125: thisbiaschange += error; | |
126: #endif | |
127: } | |
128: } | |
129: } | |
130: } | |
131: } | |
132: } | |
133: if (localId < gFilterSizeSquared) { | |
134: gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange; | |
135: // weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId; | |
136: } | |
137: #ifdef BIASED | |
138: bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin; | |
139: if (writeBias) { | |
140: gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange; | |
141: } | |
142: #endif | |
143: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
144: // aggregate over: [outRow][outCol][n] | |
145: } | |
146: | |
147: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45
cl/BackpropWeightsScratchLarge.cl build log:
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=1 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=3 -DgInputStripeOuterNumRows=3 -DgInputStripeInnerSize=9 -DgInputStripeOuterSize=9 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=3 -DgOutputStripeSize=9"
" thrown in the test body.
[ FAILED ] testupdateweights.backprop_weights_2_upstreamimagesize3_filtersize1 (52 ms)
[ RUN ] testupdateweights.backprop_weights_2_upstreamimagesize16_filtersize1
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found
Trying for OpenCL-enabled CPU
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform
Using OpenCL device: Vivante OpenCL Device
options: -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=16 -D gInputSizeSquared=256 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=16 -D gOutputSizeSquared=256 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=8 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=2 -DgInputStripeOuterNumRows=2 -DgInputStripeInnerSize=32 -DgInputStripeOuterSize=32 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=32
cl/BackpropWeightsScratchLarge.cl build log:
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=16 -D gInputSizeSquared=256 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=16 -D gOutputSizeSquared=256 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=8 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=2 -DgInputStripeOuterNumRows=2 -DgInputStripeInnerSize=32 -DgInputStripeOuterSize=32 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=32"
kernel build error:
kernel source:
1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // BIASED (or not) | |
9: | |
10: // workgroupId: [outputPlane][inputPlane] | |
11: // localId: [filterRow][filterCol] | |
12: // per-thread iteration: [n][outputRow][outputCol] | |
13: // local: errorimage: outputSize * outputSize | |
14: // imageimage: inputSize * inputSize | |
15: // specific characteristic: load one stripe of each image at a time, | |
16: // so we dont run out of memory | |
17: // number of stripes set in: gNumStripes | |
18: // note that whilst we can stripe the gradOutput simply, | |
19: // we actually need to add a half-filter widthed additional few rows | |
20: // onto the images stripe, otherwise we will be missing data | |
21: // we will call the size of the non-overlapping image stripes: gInputStripeInnerSize | |
22: // the outersize, including the two margins is: gInputStripeOuterSize | |
23: // of course, the first and last stripes will be missing a bit off the top/bottom, where the | |
24: // corresponding outer margin would be | |
25: void kernel backprop_floats_withscratch_dobias_striped( | |
26: const float learningRateMultiplier, const int batchSize, | |
27: global const float *gradOutput, global const float *images, | |
28: global float *gradWeights, | |
29: #ifdef BIASED | |
30: global float *gradBiasWeights, | |
31: #endif | |
32: local float *_errorStripe, local float *_imageStripe | |
33: ) { | |
34: // gHalfFilterSize | |
35: // gInputSize | |
36: // | |
37: // gInputStripeMarginRows => basically equal to gHalfFilterSize | |
38: // gInputStripeInnerNumRows = gInputSize / gNumStripes | |
39: // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize (note: one row less than | |
40: // if we just added gFilterSize) | |
41: // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize | |
42: // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize | |
43: // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize | |
44: // | |
45: // gOutputStripeNumRows | |
46: // gOutputStripeSize | |
47: | |
48: const int globalId = get_global_id(0); | |
49: const int localId = get_local_id(0); | |
50: const int workgroupId = get_group_id(0); | |
51: const int workgroupSize = get_local_size(0); | |
52: | |
53: const int filterRow = localId / gFilterSize; | |
54: const int filterCol = localId % gFilterSize; | |
55: | |
56: const int outPlane = workgroupId / gInputPlanes; | |
57: const int upstreamPlane = workgroupId % gInputPlanes; | |
58: | |
59: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
60: // aggregate over: [outRow][outCol][n] | |
61: float thiswchange = 0; | |
62: #ifdef BIASED | |
63: float thisbiaschange = 0; | |
64: #endif | |
65: const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize; | |
66: const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize; | |
67: for (int n = 0; n < batchSize; n++) { | |
68: const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
69: const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared; | |
70: const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared; | |
71: const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared; | |
72: for (int stripe = 0; stripe < gNumStripes; stripe++) { | |
73: const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize; | |
74: const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize; | |
75: // need to fetch the image, but it's bigger than us, so will need to loop... | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: for (int i = 0; i < numLoopsForImageStripe; i++) { | |
78: int thisOffset = i * workgroupSize + localId; | |
79: int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset; | |
80: bool process = thisOffset < gInputStripeOuterSize | |
81: && thisGlobalImagesOffset >= imageImageGlobalOffset | |
82: && thisGlobalImagesOffset < imageImageGlobalOffsetAfter; | |
83: if (process) { | |
84: _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ]; | |
85: } | |
86: } | |
87: int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize; | |
88: for (int i = 0; i < numLoopsForErrorStripe; i++) { | |
89: int thisOffset = i * workgroupSize + localId; | |
90: int globalErrorsOffset = errorStripeOffset + thisOffset; | |
91: bool process = thisOffset < gOutputStripeSize | |
92: && globalErrorsOffset < errorImageGlobalOffsetAfter; | |
93: if (process) { | |
94: _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset]; | |
95: } | |
96: } | |
97: const int stripeOutRowStart = stripe * gOutputStripeNumRows; | |
98: const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows; | |
99: barrier(CLK_LOCAL_MEM_FENCE); | |
100: // if (localId == 13) { | |
101: // for (int i = 0; i < 12; i++) { | |
102: // gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize]; | |
103: // } | |
104: // for (int i = 0; i < 20; i++) { | |
105: // gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize]; | |
106: // } | |
107: // } | |
108: if (localId < gFilterSizeSquared) { | |
109: for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) { | |
110: int upstreamRow = outRow - gMargin + filterRow; | |
111: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
112: int upstreamCol = outCol - gMargin + filterCol; | |
113: bool proceed = | |
114: upstreamRow >= 0 && upstreamCol >= 0 | |
115: && upstreamRow < gInputSize && upstreamCol < gInputSize | |
116: && outRow < gOutputSize; | |
117: if (proceed) { | |
118: int resultIndex = outRow * gOutputSize + outCol; | |
119: float error = _errorStripe[resultIndex - stripe * gOutputStripeSize]; | |
120: int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol; | |
121: float upstreamResult = _imageStripe[upstreamDataIndex + gInputStripeMarginSize | |
122: - stripe * gInputStripeInnerSize ]; | |
123: thiswchange += upstreamResult * error; | |
124: #ifdef BIASED | |
125: thisbiaschange += error; | |
126: #endif | |
127: } | |
128: } | |
129: } | |
130: } | |
131: } | |
132: } | |
133: if (localId < gFilterSizeSquared) { | |
134: gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange; | |
135: // weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId; | |
136: } | |
137: #ifdef BIASED | |
138: bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin; | |
139: if (writeBias) { | |
140: gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange; | |
141: } | |
142: #endif | |
143: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
144: // aggregate over: [outRow][outCol][n] | |
145: } | |
146: | |
147: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=16 -D gInputSizeSquared=256 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=16 -D gOutputSizeSquared=256 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=8 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=2 -DgInputStripeOuterNumRows=2 -DgInputStripeInnerSize=32 -DgInputStripeOuterSize=32 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=32" | |
unknown file: Failure | |
C++ exception with description " | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=16 -D gInputSizeSquared=256 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=16 -D gOutputSizeSquared=256 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=8 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=2 -DgInputStripeOuterNumRows=2 -DgInputStripeInnerSize=32 -DgInputStripeOuterSize=32 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=32" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.backprop_weights_2_upstreamimagesize16_filtersize1 (48 ms) | |
[ RUN ] testupdateweights.backprop_weights_2_upstreamimagesize17_filtersize1 | |
LayerDimensions{ inputPlanes=1 inputSize=17 numFilters=1 filterSize=1 outputSize=17 padZeros=0 biased=0 skip=0} | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
options: -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=17 -D gInputSizeSquared=289 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=17 -D gOutputSizeSquared=289 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=16 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=17 -DgInputStripeOuterSize=17 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=34 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=17 -D gInputSizeSquared=289 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=17 -D gOutputSizeSquared=289 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=16 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=17 -DgInputStripeOuterSize=17 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=34" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // BIASED (or not) | |
9: | |
10: // workgroupId: [outputPlane][inputPlane] | |
11: // localId: [filterRow][filterCol] | |
12: // per-thread iteration: [n][outputRow][outputCol] | |
13: // local: errorimage: outputSize * outputSize | |
14: // imageimage: inputSize * inputSize | |
15: // specific characteristic: load one stripe of each image at a time, | |
16: // so we dont run out of memory | |
17: // number of stripes set in: gNumStripes | |
18: // note that whilst we can stripe the gradOutput simply, | |
19: // we actually need to add a half-filter widthed additional few rows | |
20: // onto the images stripe, otherwise we will be missing data | |
21: // we will call the size of the non-overlapping image stripes: gInputStripeInnerSize | |
22: // the outersize, including the two margins is: gInputStripeOuterSize | |
23: // of course, the first and last stripes will be missing a bit off the top/bottom, where the | |
24: // corresponding outer margin would be | |
25: void kernel backprop_floats_withscratch_dobias_striped( | |
26: const float learningRateMultiplier, const int batchSize, | |
27: global const float *gradOutput, global const float *images, | |
28: global float *gradWeights, | |
29: #ifdef BIASED | |
30: global float *gradBiasWeights, | |
31: #endif | |
32: local float *_errorStripe, local float *_imageStripe | |
33: ) { | |
34: // gHalfFilterSize | |
35: // gInputSize | |
36: // | |
37: // gInputStripeMarginRows => basically equal to gHalfFilterSize | |
38: // gInputStripeInnerNumRows = gInputSize / gNumStripes | |
39: // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize (note: one row less than | |
40: // if we just added gFilterSize) | |
41: // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize | |
42: // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize | |
43: // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize | |
44: // | |
45: // gOutputStripeNumRows | |
46: // gOutputStripeSize | |
47: | |
48: const int globalId = get_global_id(0); | |
49: const int localId = get_local_id(0); | |
50: const int workgroupId = get_group_id(0); | |
51: const int workgroupSize = get_local_size(0); | |
52: | |
53: const int filterRow = localId / gFilterSize; | |
54: const int filterCol = localId % gFilterSize; | |
55: | |
56: const int outPlane = workgroupId / gInputPlanes; | |
57: const int upstreamPlane = workgroupId % gInputPlanes; | |
58: | |
59: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
60: // aggregate over: [outRow][outCol][n] | |
61: float thiswchange = 0; | |
62: #ifdef BIASED | |
63: float thisbiaschange = 0; | |
64: #endif | |
65: const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize; | |
66: const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize; | |
67: for (int n = 0; n < batchSize; n++) { | |
68: const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
69: const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared; | |
70: const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared; | |
71: const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared; | |
72: for (int stripe = 0; stripe < gNumStripes; stripe++) { | |
73: const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize; | |
74: const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize; | |
75: // need to fetch the image, but it's bigger than us, so will need to loop... | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: for (int i = 0; i < numLoopsForImageStripe; i++) { | |
78: int thisOffset = i * workgroupSize + localId; | |
79: int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset; | |
80: bool process = thisOffset < gInputStripeOuterSize | |
81: && thisGlobalImagesOffset >= imageImageGlobalOffset | |
82: && thisGlobalImagesOffset < imageImageGlobalOffsetAfter; | |
83: if (process) { | |
84: _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ]; | |
85: } | |
86: } | |
87: int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize; | |
88: for (int i = 0; i < numLoopsForErrorStripe; i++) { | |
89: int thisOffset = i * workgroupSize + localId; | |
90: int globalErrorsOffset = errorStripeOffset + thisOffset; | |
91: bool process = thisOffset < gOutputStripeSize | |
92: && globalErrorsOffset < errorImageGlobalOffsetAfter; | |
93: if (process) { | |
94: _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset]; | |
95: } | |
96: } | |
97: const int stripeOutRowStart = stripe * gOutputStripeNumRows; | |
98: const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows; | |
99: barrier(CLK_LOCAL_MEM_FENCE); | |
100: // if (localId == 13) { | |
101: // for (int i = 0; i < 12; i++) { | |
102: // gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize]; | |
103: // } | |
104: // for (int i = 0; i < 20; i++) { | |
105: // gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize]; | |
106: // } | |
107: // } | |
108: if (localId < gFilterSizeSquared) { | |
109: for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) { | |
110: int upstreamRow = outRow - gMargin + filterRow; | |
111: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
112: int upstreamCol = outCol - gMargin + filterCol; | |
113: bool proceed = | |
114: upstreamRow >= 0 && upstreamCol >= 0 | |
115: && upstreamRow < gInputSize && upstreamCol < gInputSize | |
116: && outRow < gOutputSize; | |
117: if (proceed) { | |
118: int resultIndex = outRow * gOutputSize + outCol; | |
119: float error = _errorStripe[resultIndex - stripe * gOutputStripeSize]; | |
120: int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol; | |
121: float upstreamResult = _imageStripe[upstreamDataIndex + gInputStripeMarginSize | |
122: - stripe * gInputStripeInnerSize ]; | |
123: thiswchange += upstreamResult * error; | |
124: #ifdef BIASED | |
125: thisbiaschange += error; | |
126: #endif | |
127: } | |
128: } | |
129: } | |
130: } | |
131: } | |
132: } | |
133: if (localId < gFilterSizeSquared) { | |
134: gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange; | |
135: // weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId; | |
136: } | |
137: #ifdef BIASED | |
138: bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin; | |
139: if (writeBias) { | |
140: gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange; | |
141: } | |
142: #endif | |
143: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
144: // aggregate over: [outRow][outCol][n] | |
145: } | |
146: | |
147: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=17 -D gInputSizeSquared=289 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=17 -D gOutputSizeSquared=289 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=16 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=17 -DgInputStripeOuterSize=17 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=34" | |
unknown file: Failure | |
C++ exception with description " | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=17 -D gInputSizeSquared=289 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=17 -D gOutputSizeSquared=289 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=16 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=17 -DgInputStripeOuterSize=17 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=34" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.backprop_weights_2_upstreamimagesize17_filtersize1 (54 ms) | |
[ RUN ] testupdateweights.backprop_weights_2_upstreamimagesize17_filtersize1_moredata | |
expectedresult: -958.715 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
options: -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=17 -D gInputSizeSquared=289 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=17 -D gOutputSizeSquared=289 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=16 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=17 -DgInputStripeOuterSize=17 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=34 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=17 -D gInputSizeSquared=289 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=17 -D gOutputSizeSquared=289 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=16 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=17 -DgInputStripeOuterSize=17 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=34" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // BIASED (or not) | |
9: | |
10: // workgroupId: [outputPlane][inputPlane] | |
11: // localId: [filterRow][filterCol] | |
12: // per-thread iteration: [n][outputRow][outputCol] | |
13: // local: errorimage: outputSize * outputSize | |
14: // imageimage: inputSize * inputSize | |
15: // specific characteristic: load one stripe of each image at a time, | |
16: // so we dont run out of memory | |
17: // number of stripes set in: gNumStripes | |
18: // note that whilst we can stripe the gradOutput simply, | |
19: // we actually need to add a half-filter widthed additional few rows | |
20: // onto the images stripe, otherwise we will be missing data | |
21: // we will call the size of the non-overlapping image stripes: gInputStripeInnerSize | |
22: // the outersize, including the two margins is: gInputStripeOuterSize | |
23: // of course, the first and last stripes will be missing a bit off the top/bottom, where the | |
24: // corresponding outer margin would be | |
25: void kernel backprop_floats_withscratch_dobias_striped( | |
26: const float learningRateMultiplier, const int batchSize, | |
27: global const float *gradOutput, global const float *images, | |
28: global float *gradWeights, | |
29: #ifdef BIASED | |
30: global float *gradBiasWeights, | |
31: #endif | |
32: local float *_errorStripe, local float *_imageStripe | |
33: ) { | |
34: // gHalfFilterSize | |
35: // gInputSize | |
36: // | |
37: // gInputStripeMarginRows => basically equal to gHalfFilterSize | |
38: // gInputStripeInnerNumRows = gInputSize / gNumStripes | |
39: // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize (note: one row less than | |
40: // if we just added gFilterSize) | |
41: // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize | |
42: // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize | |
43: // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize | |
44: // | |
45: // gOutputStripeNumRows | |
46: // gOutputStripeSize | |
47: | |
48: const int globalId = get_global_id(0); | |
49: const int localId = get_local_id(0); | |
50: const int workgroupId = get_group_id(0); | |
51: const int workgroupSize = get_local_size(0); | |
52: | |
53: const int filterRow = localId / gFilterSize; | |
54: const int filterCol = localId % gFilterSize; | |
55: | |
56: const int outPlane = workgroupId / gInputPlanes; | |
57: const int upstreamPlane = workgroupId % gInputPlanes; | |
58: | |
59: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
60: // aggregate over: [outRow][outCol][n] | |
61: float thiswchange = 0; | |
62: #ifdef BIASED | |
63: float thisbiaschange = 0; | |
64: #endif | |
65: const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize; | |
66: const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize; | |
67: for (int n = 0; n < batchSize; n++) { | |
68: const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
69: const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared; | |
70: const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared; | |
71: const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared; | |
72: for (int stripe = 0; stripe < gNumStripes; stripe++) { | |
73: const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize; | |
74: const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize; | |
75: // need to fetch the image, but it's bigger than us, so will need to loop... | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: for (int i = 0; i < numLoopsForImageStripe; i++) { | |
78: int thisOffset = i * workgroupSize + localId; | |
79: int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset; | |
80: bool process = thisOffset < gInputStripeOuterSize | |
81: && thisGlobalImagesOffset >= imageImageGlobalOffset | |
82: && thisGlobalImagesOffset < imageImageGlobalOffsetAfter; | |
83: if (process) { | |
84: _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ]; | |
85: } | |
86: } | |
87: int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize; | |
88: for (int i = 0; i < numLoopsForErrorStripe; i++) { | |
89: int thisOffset = i * workgroupSize + localId; | |
90: int globalErrorsOffset = errorStripeOffset + thisOffset; | |
91: bool process = thisOffset < gOutputStripeSize | |
92: && globalErrorsOffset < errorImageGlobalOffsetAfter; | |
93: if (process) { | |
94: _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset]; | |
95: } | |
96: } | |
97: const int stripeOutRowStart = stripe * gOutputStripeNumRows; | |
98: const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows; | |
99: barrier(CLK_LOCAL_MEM_FENCE); | |
100: // if (localId == 13) { | |
101: // for (int i = 0; i < 12; i++) { | |
102: // gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize]; | |
103: // } | |
104: // for (int i = 0; i < 20; i++) { | |
105: // gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize]; | |
106: // } | |
107: // } | |
108: if (localId < gFilterSizeSquared) { | |
109: for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) { | |
110: int upstreamRow = outRow - gMargin + filterRow; | |
111: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
112: int upstreamCol = outCol - gMargin + filterCol; | |
113: bool proceed = | |
114: upstreamRow >= 0 && upstreamCol >= 0 | |
115: && upstreamRow < gInputSize && upstreamCol < gInputSize | |
116: && outRow < gOutputSize; | |
117: if (proceed) { | |
118: int resultIndex = outRow * gOutputSize + outCol; | |
119: float error = _errorStripe[resultIndex - stripe * gOutputStripeSize]; | |
120: int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol; | |
121: float upstreamResult = _imageStripe[upstreamDataIndex + gInputStripeMarginSize | |
122: - stripe * gInputStripeInnerSize ]; | |
123: thiswchange += upstreamResult * error; | |
124: #ifdef BIASED | |
125: thisbiaschange += error; | |
126: #endif | |
127: } | |
128: } | |
129: } | |
130: } | |
131: } | |
132: } | |
133: if (localId < gFilterSizeSquared) { | |
134: gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange; | |
135: // weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId; | |
136: } | |
137: #ifdef BIASED | |
138: bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin; | |
139: if (writeBias) { | |
140: gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange; | |
141: } | |
142: #endif | |
143: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
144: // aggregate over: [outRow][outCol][n] | |
145: } | |
146: | |
147: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=17 -D gInputSizeSquared=289 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=17 -D gOutputSizeSquared=289 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=16 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=17 -DgInputStripeOuterSize=17 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=34" | |
unknown file: Failure | |
C++ exception with description " | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=17 -D gInputSizeSquared=289 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=17 -D gOutputSizeSquared=289 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgNumStripes=16 -DgInputStripeMarginRows=0 -DgInputStripeInnerNumRows=1 -DgInputStripeOuterNumRows=1 -DgInputStripeInnerSize=17 -DgInputStripeOuterSize=17 -DgInputStripeMarginSize=0 -DgOutputStripeNumRows=2 -DgOutputStripeSize=34" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.backprop_weights_2_upstreamimagesize17_filtersize1_moredata (57 ms) | |
[ RUN ] testupdateweights.backprop_instance3_smaller2 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
numweights: 36 | |
options: -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=96 -D gInputSizeSquared=9216 -D gNumFilters=1 -D gFilterSize=6 -D gHalfFilterSize=3 -D gFilterSizeSquared=36 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=91 -D gOutputSizeSquared=8281 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0 -DgNumStripes=512 -DgInputStripeMarginRows=5 -DgInputStripeInnerNumRows=0 -DgInputStripeOuterNumRows=10 -DgInputStripeInnerSize=0 -DgInputStripeOuterSize=960 -DgInputStripeMarginSize=480 -DgOutputStripeNumRows=1 -DgOutputStripeSize=91 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=96 -D gInputSizeSquared=9216 -D gNumFilters=1 -D gFilterSize=6 -D gHalfFilterSize=3 -D gFilterSizeSquared=36 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=91 -D gOutputSizeSquared=8281 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0 -DgNumStripes=512 -DgInputStripeMarginRows=5 -DgInputStripeInnerNumRows=0 -DgInputStripeOuterNumRows=10 -DgInputStripeInnerSize=0 -DgInputStripeOuterSize=960 -DgInputStripeMarginSize=480 -DgOutputStripeNumRows=1 -DgOutputStripeSize=91" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014,2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // BIASED (or not) | |
9: | |
10: // workgroupId: [outputPlane][inputPlane] | |
11: // localId: [filterRow][filterCol] | |
12: // per-thread iteration: [n][outputRow][outputCol] | |
13: // local: errorimage: outputSize * outputSize | |
14: // imageimage: inputSize * inputSize | |
15: // specific characteristic: load one stripe of each image at a time, | |
16: // so we dont run out of memory | |
17: // number of stripes set in: gNumStripes | |
18: // note that whilst we can stripe the gradOutput simply, | |
19: // we actually need to add a half-filter widthed additional few rows | |
20: // onto the images stripe, otherwise we will be missing data | |
21: // we will call the size of the non-overlapping image stripes: gInputStripeInnerSize | |
22: // the outersize, including the two margins is: gInputStripeOuterSize | |
23: // of course, the first and last stripes will be missing a bit off the top/bottom, where the | |
24: // corresponding outer margin would be | |
25: void kernel backprop_floats_withscratch_dobias_striped( | |
26: const float learningRateMultiplier, const int batchSize, | |
27: global const float *gradOutput, global const float *images, | |
28: global float *gradWeights, | |
29: #ifdef BIASED | |
30: global float *gradBiasWeights, | |
31: #endif | |
32: local float *_errorStripe, local float *_imageStripe | |
33: ) { | |
34: // gHalfFilterSize | |
35: // gInputSize | |
36: // | |
37: // gInputStripeMarginRows => basically equal to gHalfFilterSize | |
38: // gInputStripeInnerNumRows = gInputSize / gNumStripes | |
39: // gInputStripeOuterNumRows = gInputStripeInnerNumRows + 2 * gHalfFilterSize (note: one row less than | |
40: // if we just added gFilterSize) | |
41: // gInputStripeInnerSize = gInputStripeInnerNumRows * gInputSize | |
42: // gInputStripeOuterSize = gInputStripeOuterNumRows * gInputSize | |
43: // gInputStripeMarginSize = gInputStripeMarginRows * gInputSize | |
44: // | |
45: // gOutputStripeNumRows | |
46: // gOutputStripeSize | |
47: | |
48: const int globalId = get_global_id(0); | |
49: const int localId = get_local_id(0); | |
50: const int workgroupId = get_group_id(0); | |
51: const int workgroupSize = get_local_size(0); | |
52: | |
53: const int filterRow = localId / gFilterSize; | |
54: const int filterCol = localId % gFilterSize; | |
55: | |
56: const int outPlane = workgroupId / gInputPlanes; | |
57: const int upstreamPlane = workgroupId % gInputPlanes; | |
58: | |
59: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
60: // aggregate over: [outRow][outCol][n] | |
61: float thiswchange = 0; | |
62: #ifdef BIASED | |
63: float thisbiaschange = 0; | |
64: #endif | |
65: const int numLoopsForImageStripe = (gInputStripeOuterSize + workgroupSize - 1) / workgroupSize; | |
66: const int numLoopsForErrorStripe = (gOutputSizeSquared + workgroupSize - 1) / workgroupSize; | |
67: for (int n = 0; n < batchSize; n++) { | |
68: const int imageImageGlobalOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
69: const int imageImageGlobalOffsetAfter = imageImageGlobalOffset + gInputSizeSquared; | |
70: const int errorImageGlobalOffset = (n * gNumFilters + outPlane) * gOutputSizeSquared; | |
71: const int errorImageGlobalOffsetAfter = errorImageGlobalOffset + gOutputSizeSquared; | |
72: for (int stripe = 0; stripe < gNumStripes; stripe++) { | |
73: const int imageStripeInnerOffset = imageImageGlobalOffset + stripe * gInputStripeInnerSize; | |
74: const int imageStripeOuterOffset = imageStripeInnerOffset - gInputStripeMarginSize; | |
75: // need to fetch the image, but it's bigger than us, so will need to loop... | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: for (int i = 0; i < numLoopsForImageStripe; i++) { | |
78: int thisOffset = i * workgroupSize + localId; | |
79: int thisGlobalImagesOffset = imageStripeOuterOffset + thisOffset; | |
80: bool process = thisOffset < gInputStripeOuterSize | |
81: && thisGlobalImagesOffset >= imageImageGlobalOffset | |
82: && thisGlobalImagesOffset < imageImageGlobalOffsetAfter; | |
83: if (process) { | |
84: _imageStripe[thisOffset] = images[ thisGlobalImagesOffset ]; | |
85: } | |
86: } | |
87: int errorStripeOffset = errorImageGlobalOffset + stripe * gOutputStripeSize; | |
88: for (int i = 0; i < numLoopsForErrorStripe; i++) { | |
89: int thisOffset = i * workgroupSize + localId; | |
90: int globalErrorsOffset = errorStripeOffset + thisOffset; | |
91: bool process = thisOffset < gOutputStripeSize | |
92: && globalErrorsOffset < errorImageGlobalOffsetAfter; | |
93: if (process) { | |
94: _errorStripe[thisOffset ] = gradOutput[globalErrorsOffset]; | |
95: } | |
96: } | |
97: const int stripeOutRowStart = stripe * gOutputStripeNumRows; | |
98: const int stripeOutRowEndExcl = stripeOutRowStart + gOutputStripeNumRows; | |
99: barrier(CLK_LOCAL_MEM_FENCE); | |
100: // if (localId == 13) { | |
101: // for (int i = 0; i < 12; i++) { | |
102: // gradWeights[100 + stripe * 12 + i ] = _errorStripe[i * gOutputSize]; | |
103: // } | |
104: // for (int i = 0; i < 20; i++) { | |
105: // gradWeights[200 + stripe * 20 + i ] = _imageStripe[i * gInputSize]; | |
106: // } | |
107: // } | |
108: if (localId < gFilterSizeSquared) { | |
109: for (int outRow = stripeOutRowStart; outRow < stripeOutRowEndExcl; outRow++) { | |
110: int upstreamRow = outRow - gMargin + filterRow; | |
111: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
112: int upstreamCol = outCol - gMargin + filterCol; | |
113: bool proceed = | |
114: upstreamRow >= 0 && upstreamCol >= 0 | |
115: && upstreamRow < gInputSize && upstreamCol < gInputSize | |
116: && outRow < gOutputSize; | |
117: if (proceed) { | |
118: int resultIndex = outRow * gOutputSize + outCol; | |
119: float error = _errorStripe[resultIndex - stripe * gOutputStripeSize]; | |
120: int upstreamDataIndex = upstreamRow * gInputSize + upstreamCol; | |
121: float upstreamResult = _imageStripe[upstreamDataIndex + gInputStripeMarginSize | |
122: - stripe * gInputStripeInnerSize ]; | |
123: thiswchange += upstreamResult * error; | |
124: #ifdef BIASED | |
125: thisbiaschange += error; | |
126: #endif | |
127: } | |
128: } | |
129: } | |
130: } | |
131: } | |
132: } | |
133: if (localId < gFilterSizeSquared) { | |
134: gradWeights[ workgroupId * gFilterSizeSquared + localId ] = learningRateMultiplier * thiswchange; | |
135: // weightChanges[ workgroupId * gFilterSizeSquared + localId ] = workgroupId; | |
136: } | |
137: #ifdef BIASED | |
138: bool writeBias = upstreamPlane == 0 && filterRow == gMargin && filterCol == gMargin; | |
139: if (writeBias) { | |
140: gradBiasWeights[outPlane] = learningRateMultiplier * thisbiaschange; | |
141: } | |
142: #endif | |
143: // gradWeights: [outPlane][upstreamPlane][filterRow][filterCol] | |
144: // aggregate over: [outRow][outCol][n] | |
145: } | |
146: | |
147: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=96 -D gInputSizeSquared=9216 -D gNumFilters=1 -D gFilterSize=6 -D gHalfFilterSize=3 -D gFilterSizeSquared=36 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=91 -D gOutputSizeSquared=8281 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0 -DgNumStripes=512 -DgInputStripeMarginRows=5 -DgInputStripeInnerNumRows=0 -DgInputStripeOuterNumRows=10 -DgInputStripeInnerSize=0 -DgInputStripeOuterSize=960 -DgInputStripeMarginSize=480 -DgOutputStripeNumRows=1 -DgOutputStripeSize=91" | |
unknown file: Failure | |
C++ exception with description " | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/BackpropWeightsScratchLarge.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=96 -D gInputSizeSquared=9216 -D gNumFilters=1 -D gFilterSize=6 -D gHalfFilterSize=3 -D gFilterSizeSquared=36 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=91 -D gOutputSizeSquared=8281 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0 -DgNumStripes=512 -DgInputStripeMarginRows=5 -DgInputStripeInnerNumRows=0 -DgInputStripeOuterNumRows=10 -DgInputStripeInnerSize=0 -DgInputStripeOuterSize=960 -DgInputStripeMarginSize=480 -DgOutputStripeNumRows=1 -DgOutputStripeSize=91" | |
" thrown in the test body. | |
[ FAILED ] testupdateweights.backprop_instance3_smaller2 (63 ms) | |
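Note on the stripe values: the stripe-related `-D` values in the option string above fit the arithmetic sketched below. This is inferred from that single option string, not taken from the DeepCL sources. `stripe_defines` is a hypothetical helper name, and `num_stripes` is passed in as given (512) because the log does not show how it is chosen.

```python
def stripe_defines(input_size, filter_size, output_size, num_stripes):
    """Recompute the stripe -D values seen in the log (inferred, not DeepCL's code)."""
    margin_rows = filter_size - 1                      # assumption: rows of overlap per side
    inner_rows = input_size // num_stripes             # integer division, may be 0
    outer_rows = inner_rows + 2 * margin_rows
    out_rows = (output_size + num_stripes - 1) // num_stripes  # ceiling division
    return {
        "gNumStripes": num_stripes,
        "gInputStripeMarginRows": margin_rows,
        "gInputStripeInnerNumRows": inner_rows,
        "gInputStripeOuterNumRows": outer_rows,
        "gInputStripeInnerSize": inner_rows * input_size,
        "gInputStripeOuterSize": outer_rows * input_size,
        "gInputStripeMarginSize": margin_rows * input_size,
        "gOutputStripeNumRows": out_rows,
        "gOutputStripeSize": out_rows * output_size,
    }


if __name__ == "__main__":
    for name, value in stripe_defines(96, 6, 91, 512).items():
        print("-D%s=%d" % (name, value))
```

With `gInputSize=96`, `gFilterSize=6`, `gOutputSize=91` and 512 stripes this reproduces the values logged for this test. It has not been checked against any other configuration.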
[----------] 23 tests from testupdateweights (1443 ms total) | |
[----------] 17 tests from testforward | |
[ RUN ] testforward.imagesize2_nopadzeros | |
expected number of output: 4 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=2 -D gFilterSize=2 -D gHalfFilterSize=1 -D gFilterSizeSquared=4 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: each workgroup handles convolving one input example with one filtercube | |
8: // and writing out one single output plane | |
9: // | |
10: // workgroup id organized like: [imageid][outplane] | |
11: // local id organized like: [outrow][outcol] | |
12: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
13: // number workgroups = 32 | |
14: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
16: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
17: // output are organized like [imageid][filterid][row][col] | |
18: void kernel forward_3_by_n_outplane(const int batchSize, | |
19: global const float *images, global const float *filters, | |
20: global float *output, | |
21: local float *_upstreamImage, local float *_filterCube) { | |
22: const int globalId = get_global_id(0); | |
23: | |
24: const int workgroupId = get_group_id(0); | |
25: const int workgroupSize = get_local_size(0); | |
26: const int n = workgroupId / gNumFilters; | |
27: const int outPlane = workgroupId % gNumFilters; | |
28: | |
29: const int localId = get_local_id(0); | |
30: const int outputRow = localId / gOutputSize; | |
31: const int outputCol = localId % gOutputSize; | |
32: | |
33: const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize; | |
34: const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow - gEven) : gHalfFilterSize - gEven; | |
35: const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize; | |
36: const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven; | |
37: | |
38: const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
39: | |
40: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
41: const int filterCubeGlobalOffset = outPlane * filterCubeLength; | |
42: const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize; | |
43: for (int i = 0; i < numPixelsPerThread; i++) { | |
44: int thisOffset = localId + i * workgroupSize; | |
45: if (thisOffset < filterCubeLength) { | |
46: _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset]; | |
47: } | |
48: } | |
49: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
50: | |
51: float sum = 0; | |
52: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
53: int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: for (int i = 0; i < numUpstreamsPerThread; i++) { | |
56: int thisOffset = workgroupSize * i + localId; | |
57: if (thisOffset < gInputSizeSquared) { | |
58: _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ]; | |
59: } | |
60: } | |
61: barrier(CLK_LOCAL_MEM_FENCE); | |
62: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
63: for (int u = minu; u <= maxu; u++) { | |
64: int inputRow = outputRow + u; | |
65: #if gPadZeros == 0 | |
66: inputRow += gHalfFilterSize; | |
67: #endif | |
68: int inputimagerowoffset = inputRow * gInputSize; | |
69: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
70: for (int v = minv; v <= maxv; v++) { | |
71: int inputCol = outputCol + v; | |
72: #if gPadZeros == 0 | |
73: inputCol += gHalfFilterSize; | |
74: #endif | |
75: if (localId < gOutputSizeSquared) { | |
76: sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
77: } | |
78: } | |
79: } | |
80: } | |
81: | |
82: // output are organized like [imageid][filterid][row][col] | |
83: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
84: if (localId < gOutputSizeSquared) { | |
85: output[resultIndex ] = sum; | |
86: } | |
87: } | |
88: | |
89: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=2 -D gFilterSize=2 -D gHalfFilterSize=1 -D gFilterSizeSquared=4 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0" | |
unknown file: Failure | |
C++ exception with description " | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=2 -D gFilterSize=2 -D gHalfFilterSize=1 -D gFilterSizeSquared=4 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0" | |
" thrown in the test body. | |
[ FAILED ] testforward.imagesize2_nopadzeros (75 ms) | |
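Note on the error code: in the Khronos OpenCL header, -45 is `CL_INVALID_PROGRAM_EXECUTABLE`. `clCreateKernel` returns it when the program has no successfully built executable, which fits the build log rejecting the option string before any kernel source is compiled. The lookup below is a small sketch covering only a few codes; `cl_error_name` is a hypothetical helper, not part of DeepCL.

```python
# Subset of the error codes defined in the Khronos OpenCL header (CL/cl.h).
CL_ERROR_NAMES = {
    -11: "CL_BUILD_PROGRAM_FAILURE",
    -43: "CL_INVALID_BUILD_OPTIONS",
    -44: "CL_INVALID_PROGRAM",
    -45: "CL_INVALID_PROGRAM_EXECUTABLE",
    -46: "CL_INVALID_KERNEL_NAME",
}


def cl_error_name(code):
    """Return the symbolic name for an OpenCL error code, if it is in the table."""
    return CL_ERROR_NAMES.get(code, "unknown OpenCL error %d" % code)


if __name__ == "__main__":
    print(cl_error_name(-45))
```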
[ RUN ] testforward.imagesize2_padzeros | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=2 -D gFilterSize=2 -D gHalfFilterSize=1 -D gFilterSizeSquared=4 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=1 -D gSkip=0 -DgWorkgroupSize=32" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, const int N) { | |
8: int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize; | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * gWorkgroupSize + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [outplane] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [imageid][filterid][row][col] | |
26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB | |
27: // eg 5 * 5 * 32 * 4 = 3.2KB => ok :-) | |
28: // but 28 * 28 * 32 * 4 = 100KB => less good :-P | |
29: void kernel forward_2_by_outplane( | |
30: const int batchSize, | |
31: global const float *images, global const float *filters, | |
32: global float *output, | |
33: local float *_inputPlane, local float *_filterCube) { | |
34: const int globalId = get_global_id(0); | |
35: | |
36: const int workgroupId = get_group_id(0); | |
37: const int workgroupSize = get_local_size(0); | |
38: const int outPlane = workgroupId; | |
39: | |
40: const int localId = get_local_id(0); | |
41: const int outputRow = localId / gOutputSize; | |
42: const int outputCol = localId % gOutputSize; | |
43: | |
44: #if gPadZeros == 1 | |
45: const int minu = max(-gHalfFilterSize, -outputRow); | |
46: const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven; | |
47: const int minv = max(-gHalfFilterSize, -outputCol); | |
48: const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven; | |
49: #else | |
50: const int minu = -gHalfFilterSize; | |
51: const int maxu = gHalfFilterSize - gEven; | |
52: const int minv = -gHalfFilterSize; | |
53: const int maxv = gHalfFilterSize - gEven; | |
54: #endif | |
55: | |
56: { | |
57: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
58: copyLocal(_filterCube, | |
59: filters + outPlane * filterCubeLength, | |
60: filterCubeLength); | |
61: } | |
62: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
63: | |
64: for (int n = 0; n < batchSize; n++) { | |
65: float sum = 0; | |
66: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
67: barrier(CLK_LOCAL_MEM_FENCE); | |
68: copyLocal(_inputPlane, | |
69: images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, | |
70: gInputSizeSquared); | |
71: barrier(CLK_LOCAL_MEM_FENCE); | |
72: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
73: if (localId < gOutputSizeSquared) { | |
74: for (int u = minu; u <= maxu; u++) { | |
75: int inputRow = outputRow + u; | |
76: #if gPadZeros == 0 | |
77: inputRow += gHalfFilterSize; | |
78: #endif | |
79: int inputimagerowoffset = inputRow * gInputSize; | |
80: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
81: for (int v = minv; v <= maxv; v++) { | |
82: int inputCol = outputCol + v; | |
83: #if gPadZeros == 0 | |
84: inputCol += gHalfFilterSize; | |
85: #endif | |
86: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
87: } | |
88: } | |
89: } | |
90: } | |
91: // output are organized like [imageid][filterid][row][col] | |
92: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
93: if (localId < gOutputSizeSquared) { | |
94: output[resultIndex ] = sum; | |
95: } | |
96: } | |
97: } | |
98: #endif | |
99: | |
100: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=2 -D gFilterSize=2 -D gHalfFilterSize=1 -D gFilterSizeSquared=4 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=1 -D gSkip=0 -DgWorkgroupSize=32" | |
unknown file: Failure | |
C++ exception with description " | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=2 -D gInputSizeSquared=4 -D gNumFilters=2 -D gFilterSize=2 -D gHalfFilterSize=1 -D gFilterSizeSquared=4 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=3 -D gOutputSizeSquared=9 -D gPadZeros=1 -D gMargin=1 -D gEven=1 -D gSkip=0 -DgWorkgroupSize=32" | |
" thrown in the test body. | |
[ FAILED ] testforward.imagesize2_padzeros (49 ms) | |
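Note on the option string: every failure in this part of the run reports a syntax error in the compiler option string, and the log does not say which part the Vivante compiler rejects. The strings mix `-D name=value` (with a space) and `-Dname=value` (without). The spaced form is the one the OpenCL specification documents, so the rejection looks specific to this compiler. One experiment that may be worth trying is to pass every define without the space and without leading whitespace. The helper below is a hypothetical sketch of that rewrite (`normalize_defines` is not a DeepCL function), and whether it avoids the error on this platform is untested.

```python
import re


def normalize_defines(options):
    """Rewrite '-D name=value' as '-Dname=value' and collapse extra whitespace."""
    joined = re.sub(r"-D\s+", "-D", options)
    return " ".join(joined.split())


if __name__ == "__main__":
    logged = " -D gPadZeros=1 -D gMargin=1 -D gEven=1 -D gSkip=0 -DgWorkgroupSize=32"
    print(normalize_defines(logged))
```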
[ RUN ] testforward.imagesize3 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, const int N) { | |
8: int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize; | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * gWorkgroupSize + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [outplane] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [imageid][filterid][row][col] | |
26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB | |
27: // eg 5 * 5 * 32 * 4 = 3.2KB => ok :-) | |
28: // but 28 * 28 * 32 * 4 = 100KB => less good :-P | |
29: void kernel forward_2_by_outplane( | |
30: const int batchSize, | |
31: global const float *images, global const float *filters, | |
32: global float *output, | |
33: local float *_inputPlane, local float *_filterCube) { | |
34: const int globalId = get_global_id(0); | |
35: | |
36: const int workgroupId = get_group_id(0); | |
37: const int workgroupSize = get_local_size(0); | |
38: const int outPlane = workgroupId; | |
39: | |
40: const int localId = get_local_id(0); | |
41: const int outputRow = localId / gOutputSize; | |
42: const int outputCol = localId % gOutputSize; | |
43: | |
44: #if gPadZeros == 1 | |
45: const int minu = max(-gHalfFilterSize, -outputRow); | |
46: const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven; | |
47: const int minv = max(-gHalfFilterSize, -outputCol); | |
48: const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven; | |
49: #else | |
50: const int minu = -gHalfFilterSize; | |
51: const int maxu = gHalfFilterSize - gEven; | |
52: const int minv = -gHalfFilterSize; | |
53: const int maxv = gHalfFilterSize - gEven; | |
54: #endif | |
55: | |
56: { | |
57: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
58: copyLocal(_filterCube, | |
59: filters + outPlane * filterCubeLength, | |
60: filterCubeLength); | |
61: } | |
62: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
63: | |
64: for (int n = 0; n < batchSize; n++) { | |
65: float sum = 0; | |
66: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
67: barrier(CLK_LOCAL_MEM_FENCE); | |
68: copyLocal(_inputPlane, | |
69: images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, | |
70: gInputSizeSquared); | |
71: barrier(CLK_LOCAL_MEM_FENCE); | |
72: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
73: if (localId < gOutputSizeSquared) { | |
74: for (int u = minu; u <= maxu; u++) { | |
75: int inputRow = outputRow + u; | |
76: #if gPadZeros == 0 | |
77: inputRow += gHalfFilterSize; | |
78: #endif | |
79: int inputimagerowoffset = inputRow * gInputSize; | |
80: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
81: for (int v = minv; v <= maxv; v++) { | |
82: int inputCol = outputCol + v; | |
83: #if gPadZeros == 0 | |
84: inputCol += gHalfFilterSize; | |
85: #endif | |
86: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
87: } | |
88: } | |
89: } | |
90: } | |
91: // output are organized like [imageid][filterid][row][col] | |
92: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
93: if (localId < gOutputSizeSquared) { | |
94: output[resultIndex ] = sum; | |
95: } | |
96: } | |
97: } | |
98: #endif | |
99: | |
100: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
unknown file: Failure | |
C++ exception with description " | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
" thrown in the test body. | |
[ FAILED ] testforward.imagesize3 (91 ms) | |
[ RUN ] testforward.test2 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
unknown file: Failure | |
C++ exception with description " | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
" thrown in the test body. | |
[ FAILED ] testforward.test2 (102 ms) | |
[ RUN ] testforward.test3 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, const int N) { | |
8: int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize; | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * gWorkgroupSize + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [outplane] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [imageid][filterid][row][col] | |
26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB | |
27: // eg 5 * 5 * 32 * 4 = 3.2KB => ok :-) | |
28: // but 28 * 28 * 32 * 4 = 100KB => less good :-P | |
29: void kernel forward_2_by_outplane( | |
30: const int batchSize, | |
31: global const float *images, global const float *filters, | |
32: global float *output, | |
33: local float *_inputPlane, local float *_filterCube) { | |
34: const int globalId = get_global_id(0); | |
35: | |
36: const int workgroupId = get_group_id(0); | |
37: const int workgroupSize = get_local_size(0); | |
38: const int outPlane = workgroupId; | |
39: | |
40: const int localId = get_local_id(0); | |
41: const int outputRow = localId / gOutputSize; | |
42: const int outputCol = localId % gOutputSize; | |
43: | |
44: #if gPadZeros == 1 | |
45: const int minu = max(-gHalfFilterSize, -outputRow); | |
46: const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven; | |
47: const int minv = max(-gHalfFilterSize, -outputCol); | |
48: const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven; | |
49: #else | |
50: const int minu = -gHalfFilterSize; | |
51: const int maxu = gHalfFilterSize - gEven; | |
52: const int minv = -gHalfFilterSize; | |
53: const int maxv = gHalfFilterSize - gEven; | |
54: #endif | |
55: | |
56: { | |
57: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
58: copyLocal(_filterCube, | |
59: filters + outPlane * filterCubeLength, | |
60: filterCubeLength); | |
61: } | |
62: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
63: | |
64: for (int n = 0; n < batchSize; n++) { | |
65: float sum = 0; | |
66: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
67: barrier(CLK_LOCAL_MEM_FENCE); | |
68: copyLocal(_inputPlane, | |
69: images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, | |
70: gInputSizeSquared); | |
71: barrier(CLK_LOCAL_MEM_FENCE); | |
72: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
73: if (localId < gOutputSizeSquared) { | |
74: for (int u = minu; u <= maxu; u++) { | |
75: int inputRow = outputRow + u; | |
76: #if gPadZeros == 0 | |
77: inputRow += gHalfFilterSize; | |
78: #endif | |
79: int inputimagerowoffset = inputRow * gInputSize; | |
80: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
81: for (int v = minv; v <= maxv; v++) { | |
82: int inputCol = outputCol + v; | |
83: #if gPadZeros == 0 | |
84: inputCol += gHalfFilterSize; | |
85: #endif | |
86: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
87: } | |
88: } | |
89: } | |
90: } | |
91: // output are organized like [imageid][filterid][row][col] | |
92: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
93: if (localId < gOutputSizeSquared) { | |
94: output[resultIndex ] = sum; | |
95: } | |
96: } | |
97: } | |
98: #endif | |
99: | |
100: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
unknown file: Failure | |
C++ exception with description " | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
" thrown in the test body. | |
[ FAILED ] testforward.test3 (50 ms) | |
[ RUN ] testforward.compare_0_1_biased_nopad | |
LayerDimensions{ inputPlanes=8 inputSize=19 numFilters=8 filterSize=5 outputSize=15 padZeros=0 biased=1 skip=0} | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=15 -D gOutputSizeSquared=225 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=15 -D gOutputSizeSquared=225 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=15 -D gOutputSizeSquared=225 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
" thrown in the test body. | |
[ FAILED ] testforward.compare_0_1_biased_nopad (101 ms) | |
[ RUN ] testforward.compare_0_1_biased_pad | |
LayerDimensions{ inputPlanes=8 inputSize=19 numFilters=8 filterSize=5 outputSize=19 padZeros=1 biased=1 skip=0} | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=19 -D gOutputSizeSquared=361 -D gPadZeros=1 -D gMargin=2 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=19 -D gOutputSizeSquared=361 -D gPadZeros=1 -D gMargin=2 -D gEven=0 -D gSkip=0" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=19 -D gOutputSizeSquared=361 -D gPadZeros=1 -D gMargin=2 -D gEven=0 -D gSkip=0" | |
" thrown in the test body. | |
[ FAILED ] testforward.compare_0_1_biased_pad (47 ms) | |
[ RUN ] testforward.compare_1_n_biased_nopad | |
instance: 2 | |
LayerDimensions{ inputPlanes=8 inputSize=19 numFilters=8 filterSize=5 outputSize=15 padZeros=0 biased=1 skip=0} | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=15 -D gOutputSizeSquared=225 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=15 -D gOutputSizeSquared=225 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=15 -D gOutputSizeSquared=225 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
" thrown in the test body. | |
[ FAILED ] testforward.compare_1_n_biased_nopad (61 ms) | |
[ RUN ] testforward.compare_1_n_biased_pad | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
instance: 2 | |
LayerDimensions{ inputPlanes=8 inputSize=19 numFilters=8 filterSize=5 outputSize=19 padZeros=1 biased=1 skip=0} | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=19 -D gOutputSizeSquared=361 -D gPadZeros=1 -D gMargin=2 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=19 -D gOutputSizeSquared=361 -D gPadZeros=1 -D gMargin=2 -D gEven=0 -D gSkip=0" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=5 -D gHalfFilterSize=2 -D gFilterSizeSquared=25 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=19 -D gOutputSizeSquared=361 -D gPadZeros=1 -D gMargin=2 -D gEven=0 -D gSkip=0" | |
" thrown in the test body. | |
[ FAILED ] testforward.compare_1_n_biased_pad (145 ms) | |
[ RUN ] testforward.compare_1_5_biased_nopad | |
LayerDimensions{ inputPlanes=8 inputSize=19 numFilters=8 filterSize=19 outputSize=1 padZeros=0 biased=1 skip=0} | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=19 -D gHalfFilterSize=9 -D gFilterSizeSquared=361 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=19 -D gHalfFilterSize=9 -D gFilterSizeSquared=361 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=8 -D gInputPlanes=8 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=8 -D gFilterSize=19 -D gHalfFilterSize=9 -D gFilterSizeSquared=361 -D gNumOutputPlanes=8 -D gOutputPlanes=8 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
" thrown in the test body. | |
[ FAILED ] testforward.compare_1_5_biased_nopad (50 ms) | |
[ RUN ] testforward.compare_1_4_fcscenario | |
LayerDimensions{ inputPlanes=10 inputSize=24 numFilters=10 filterSize=24 outputSize=1 padZeros=0 biased=1 skip=0} | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=10 -D gInputPlanes=10 -D gInputSize=24 -D gInputSizeSquared=576 -D gNumFilters=10 -D gFilterSize=24 -D gHalfFilterSize=12 -D gFilterSizeSquared=576 -D gNumOutputPlanes=10 -D gOutputPlanes=10 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=10 -D gInputPlanes=10 -D gInputSize=24 -D gInputSizeSquared=576 -D gNumFilters=10 -D gFilterSize=24 -D gHalfFilterSize=12 -D gFilterSizeSquared=576 -D gNumOutputPlanes=10 -D gOutputPlanes=10 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=10 -D gInputPlanes=10 -D gInputSize=24 -D gInputSizeSquared=576 -D gNumFilters=10 -D gFilterSize=24 -D gHalfFilterSize=12 -D gFilterSizeSquared=576 -D gNumOutputPlanes=10 -D gOutputPlanes=10 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0" | |
" thrown in the test body. | |
[ FAILED ] testforward.compare_1_4_fcscenario (59 ms) | |
[ RUN ] testforward.compare_break1_0_1 | |
LayerDimensions{ inputPlanes=1 inputSize=33 numFilters=1 filterSize=1 outputSize=33 padZeros=0 biased=0 skip=0} | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=33 -D gInputSizeSquared=1089 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=33 -D gOutputSizeSquared=1089 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=33 -D gInputSizeSquared=1089 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=33 -D gOutputSizeSquared=1089 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=33 -D gInputSizeSquared=1089 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=33 -D gOutputSizeSquared=1089 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
" thrown in the test body. | |
[ FAILED ] testforward.compare_break1_0_1 (101 ms) | |
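
Note on the status code above: in the Khronos `CL/cl.h` header, -45 is `CL_INVALID_PROGRAM_EXECUTABLE`, which `clCreateKernel` returns when the program has no successfully built executable. That fits the build log printed just before it, where the compiler rejected the option string. The helper below is a minimal sketch for making such codes readable in a log. `clErrorName` is a hypothetical name and not part of DeepCL or clBLAS; only the numeric values come from the OpenCL headers.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a DeepCL function): maps a few OpenCL status
// codes to their symbolic names. The numeric values match CL/cl.h.
inline std::string clErrorName(int code) {
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -43: return "CL_INVALID_BUILD_OPTIONS";
        case -44: return "CL_INVALID_PROGRAM";
        case -45: return "CL_INVALID_PROGRAM_EXECUTABLE";
        case -46: return "CL_INVALID_KERNEL_NAME";
        // Anything not listed is reported with its raw number.
        default:  return "unknown OpenCL error " + std::to_string(code);
    }
}
```

Calling `clErrorName(-45)` on the code reported here gives the symbolic name, which points at the failed program build rather than at the kernel name or its arguments.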
[ RUN ] testforward.compare_break1_0_4 | |
LayerDimensions{ inputPlanes=1 inputSize=33 numFilters=1 filterSize=1 outputSize=33 padZeros=0 biased=0 skip=0} | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=545 -D gPixelsPerThread=2 -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=33 -D gInputSizeSquared=1089 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=33 -D gOutputSizeSquared=1089 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, int N) { | |
8: int numLoops = (N + get_local_size(0) - 1) / get_local_size(0); | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * get_local_size(0) + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [n][filterid] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [n][filterid][outrow][outcol] | |
26: // the pixels per thread thing... : | |
27: // - we have one thread (~= cuda core) per output value, | |
28: // ie one thread for each combination of [outrow][outcol] | |
29: // - however, the number of threads is typically limited on a gpu, | |
30: // eg to 512 (eg Intel HD), or 1024 (eg nVidia K520) | |
31: // - so what happens if the number of output points is larger than | |
32: // the maximum workgroup size? | |
33: // - then we have several possibilities really: | |
34: // - we can divide the image into blocks, and process each block | |
35: // separately. This is probably a good option, but fair amount of | |
36: // work | |
37: // - we can get each thread to handle more than one output | |
38: // pixel, by looping | |
39: // - we can consider the output image in 1d, by putting the rows | |
40: // one after another, and assign each contiguous workgroup-size | |
41: // block to one workgroup | |
42: // => this is how this kernel works | |
43: // basically, it's a hack, so larger images actually run, without | |
44: // crashing, and we can probably improve it a lot :-) | |
45: // | |
46: // So, when outputSize * outputSize > workgroupSize, then | |
47: // multiple workgroups will be created for each output plane | |
48: // the number of such workgroups is given by: `gPixelsPerThread` | |
49: // the id of our workgroup within such a set of workgroups is calculated | |
50: // as `pixel` | |
51: // effectiveLocalId is our local id if we had one enormous workgroup | |
52: // containing the whole output image plane | |
53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize, | |
54: global const float *images, global const float *filters, | |
55: global float *output, | |
56: local float *_inputPlane, local float *_filterPlane) { | |
57: #define globalId (get_global_id(0)) | |
58: | |
59: #define localId (get_local_id(0)) | |
60: #define workgroupId (get_group_id(0)) | |
61: // const int workgroupSize = get_local_size(0); | |
62: const int effectiveWorkgroupId = workgroupId / gPixelsPerThread; | |
63: const int pixel = workgroupId % gPixelsPerThread; | |
64: const int effectiveLocalId = localId + pixel * gWorkgroupSize; | |
65: const int n = effectiveWorkgroupId / gNumFilters; | |
66: const int outPlane = effectiveWorkgroupId % gNumFilters; | |
67: | |
68: const int outputRow = effectiveLocalId / gOutputSize; | |
69: const int outputCol = effectiveLocalId % gOutputSize; | |
70: | |
71: float sum = 0; | |
72: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
73: barrier(CLK_LOCAL_MEM_FENCE); | |
74: copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared); | |
75: copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared); | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: | |
78: if (effectiveLocalId < gOutputSizeSquared) { | |
79: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
80: // trying to reduce register pressure... | |
81: #if gPadZeros == 1 | |
82: #define inputRow (outputRow + u) | |
83: #else | |
84: #define inputRow (outputRow + u + gHalfFilterSize) | |
85: #endif | |
86: int inputimagerowoffset = inputRow * gInputSize; | |
87: int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
88: bool rowOk = inputRow >= 0 && inputRow < gInputSize; | |
89: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
90: #if gPadZeros == 1 | |
91: #define inputCol (outputCol + v) | |
92: #else | |
93: #define inputCol (outputCol + v + gHalfFilterSize) | |
94: #endif | |
95: bool process = rowOk && inputCol >= 0 && inputCol < gInputSize; | |
96: if (process) { | |
97: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ]; | |
98: } | |
99: } | |
100: } | |
101: } | |
102: } | |
103: // output are organized like [imageid][filterid][row][col] | |
104: #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId) | |
105: if (effectiveLocalId < gOutputSizeSquared) { | |
106: output[resultIndex ] = sum; | |
107: } | |
108: } | |
109: #endif | |
110: | |
111: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=545 -D gPixelsPerThread=2 -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=33 -D gInputSizeSquared=1089 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=33 -D gOutputSizeSquared=1089 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, int N) { | |
8: int numLoops = (N + get_local_size(0) - 1) / get_local_size(0); | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * get_local_size(0) + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [n][filterid] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [n][filterid][outrow][outcol] | |
26: // the pixels per thread thing... : | |
27: // - we have one thread (~= cuda core) per output value, | |
28: // ie one thread for each combination of [outrow][outcol] | |
29: // - however, the number of threads is typically limited on a gpu, | |
30: // eg to 512 (eg Intel HD), or 1024 (eg nVidia K520) | |
31: // - so what happens if the number of output points is larger than | |
32: // the maximum workgroup size? | |
33: // - then we have several possibilities really: | |
34: // - we can divide the image into blocks, and process each block | |
35: // separately. This is probably a good option, but fair amount of | |
36: // work | |
37: // - we can get each thread to handle more than one output | |
38: // pixel, by looping | |
39: // - we can consider the output image in 1d, by putting the rows | |
40: // one after another, and assign each contiguous workgroup-size | |
41: // block to one workgroup | |
42: // => this is how this kernel works | |
43: // basically, it's a hack, so larger images actually run, without | |
44: // crashing, and we can probably improve it a lot :-) | |
45: // | |
46: // So, when outputSize * outputSize > workgroupSize, then | |
47: // multiple workgroups will be created for each output plane | |
48: // the number of such workgroups is given by: `gPixelsPerThread` | |
49: // the id of our workgroup within such a set of workgroups is calculated | |
50: // as `pixel` | |
51: // effectiveLocalId is our local id if we had one enormous workgroup | |
52: // containing the whole output image plane | |
53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize, | |
54: global const float *images, global const float *filters, | |
55: global float *output, | |
56: local float *_inputPlane, local float *_filterPlane) { | |
57: #define globalId (get_global_id(0)) | |
58: | |
59: #define localId (get_local_id(0)) | |
60: #define workgroupId (get_group_id(0)) | |
61: // const int workgroupSize = get_local_size(0); | |
62: const int effectiveWorkgroupId = workgroupId / gPixelsPerThread; | |
63: const int pixel = workgroupId % gPixelsPerThread; | |
64: const int effectiveLocalId = localId + pixel * gWorkgroupSize; | |
65: const int n = effectiveWorkgroupId / gNumFilters; | |
66: const int outPlane = effectiveWorkgroupId % gNumFilters; | |
67: | |
68: const int outputRow = effectiveLocalId / gOutputSize; | |
69: const int outputCol = effectiveLocalId % gOutputSize; | |
70: | |
71: float sum = 0; | |
72: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
73: barrier(CLK_LOCAL_MEM_FENCE); | |
74: copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared); | |
75: copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared); | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: | |
78: if (effectiveLocalId < gOutputSizeSquared) { | |
79: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
80: // trying to reduce register pressure... | |
81: #if gPadZeros == 1 | |
82: #define inputRow (outputRow + u) | |
83: #else | |
84: #define inputRow (outputRow + u + gHalfFilterSize) | |
85: #endif | |
86: int inputimagerowoffset = inputRow * gInputSize; | |
87: int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
88: bool rowOk = inputRow >= 0 && inputRow < gInputSize; | |
89: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
90: #if gPadZeros == 1 | |
91: #define inputCol (outputCol + v) | |
92: #else | |
93: #define inputCol (outputCol + v + gHalfFilterSize) | |
94: #endif | |
95: bool process = rowOk && inputCol >= 0 && inputCol < gInputSize; | |
96: if (process) { | |
97: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ]; | |
98: } | |
99: } | |
100: } | |
101: } | |
102: } | |
103: // output are organized like [imageid][filterid][row][col] | |
104: #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId) | |
105: if (effectiveLocalId < gOutputSizeSquared) { | |
106: output[resultIndex ] = sum; | |
107: } | |
108: } | |
109: #endif | |
110: | |
111: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=545 -D gPixelsPerThread=2 -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=33 -D gInputSizeSquared=1089 -D gNumFilters=1 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=1 -D gOutputPlanes=1 -D gOutputSize=33 -D gOutputSizeSquared=1089 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
" thrown in the test body. | |
[ FAILED ] testforward.compare_break1_0_4 (53 ms) | |
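
Note on `gWorkgroupSize=545` and `gPixelsPerThread=2` in the option string above: the forward4 kernel comments describe splitting one output plane across several workgroups when `outputSize * outputSize` exceeds the workgroup limit. The sketch below shows one way those two values could be derived and how the kernel's index arithmetic then maps a work-item to an output pixel. The device limit of 1024 is an assumption (the log does not print it), and `planWorkgroups` and `locate` are hypothetical names. Only the arithmetic inside `locate` is taken from the kernel source.

```cpp
#include <cassert>

// Hypothetical host-side plan: how many workgroups cover one output plane,
// and how many work-items each of them has.
struct WorkgroupPlan {
    int pixelsPerThread;  // number of workgroups per output plane (kernel macro name)
    int workgroupSize;    // work-items per workgroup
};

// Assumed derivation, chosen because it reproduces the logged values.
inline WorkgroupPlan planWorkgroups(int outputSizeSquared, int maxWorkgroupSize) {
    // Ceiling division: workgroups needed so that none exceeds the limit.
    int pixelsPerThread = (outputSizeSquared + maxWorkgroupSize - 1) / maxWorkgroupSize;
    // Spread the outputs evenly over those workgroups, rounding up.
    int workgroupSize = (outputSizeSquared + pixelsPerThread - 1) / pixelsPerThread;
    return {pixelsPerThread, workgroupSize};
}

struct OutputCoord {
    int effectiveLocalId;  // index into the flattened output plane
    int row;
    int col;
    bool inRange;          // false for padding work-items past the last pixel
};

// Same arithmetic as the kernel: pixel = workgroupId % gPixelsPerThread,
// effectiveLocalId = localId + pixel * gWorkgroupSize.
inline OutputCoord locate(int workgroupId, int localId, WorkgroupPlan plan, int outputSize) {
    int pixel = workgroupId % plan.pixelsPerThread;
    int effectiveLocalId = localId + pixel * plan.workgroupSize;
    return {effectiveLocalId,
            effectiveLocalId / outputSize,
            effectiveLocalId % outputSize,
            effectiveLocalId < outputSize * outputSize};
}
```

With 1089 outputs and an assumed limit of 1024, the plan is 2 workgroups of 545 work-items. That yields 1090 work-items for 1089 pixels, which is why the kernel guards its write with `effectiveLocalId < gOutputSizeSquared`.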
[ RUN ] testforward.comparespecific_break2 | |
LayerDimensions{ inputPlanes=64 inputSize=19 numFilters=64 filterSize=19 outputSize=1 padZeros=0 biased=0 skip=0} | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=64 -D gInputPlanes=64 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=64 -D gFilterSize=19 -D gHalfFilterSize=9 -D gFilterSizeSquared=361 -D gNumOutputPlanes=64 -D gOutputPlanes=64 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=64 -D gInputPlanes=64 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=64 -D gFilterSize=19 -D gHalfFilterSize=9 -D gFilterSizeSquared=361 -D gNumOutputPlanes=64 -D gOutputPlanes=64 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=64 -D gInputPlanes=64 -D gInputSize=19 -D gInputSizeSquared=361 -D gNumFilters=64 -D gFilterSize=19 -D gHalfFilterSize=9 -D gFilterSizeSquared=361 -D gNumOutputPlanes=64 -D gOutputPlanes=64 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
" thrown in the test body. | |
[ FAILED ] testforward.comparespecific_break2 (138 ms) | |
[ RUN ] testforward.softmax | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
output[0]=0.0320586 | |
output[1]=0.0871443 | |
output[2]=0.643914 | |
output[3]=0.236883 | |
loss 0.44019 | |
loss 3.44019 | |
loss 2.44019 | |
loss 1.44019 | |
[ OK ] testforward.softmax (25 ms) | |
[ RUN ] testforward.softmax_byplane | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
output[0]=0.0320586 | |
output[1]=0.0871443 | |
output[2]=0.643914 | |
output[3]=0.236883 | |
loss 0.44019 | |
loss 3.44019 | |
loss 2.44019 | |
loss 1.44019 | |
[ OK ] testforward.softmax_byplane (17 ms) | |
[ RUN ] testforward.crash_from_jm | |
-D gNumInputPlanes=32 -D gInputPlanes=32 -D gInputSize=28 -D gInputSizeSquared=784 -D gNumFilters=20 -D gFilterSize=28 -D gHalfFilterSize=14 -D gFilterSizeSquared=784 -D gNumOutputPlanes=20 -D gOutputPlanes=20 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=32 -D gInputPlanes=32 -D gInputSize=28 -D gInputSizeSquared=784 -D gNumFilters=20 -D gFilterSize=28 -D gHalfFilterSize=14 -D gFilterSizeSquared=784 -D gNumOutputPlanes=20 -D gOutputPlanes=20 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=32 -D gInputPlanes=32 -D gInputSize=28 -D gInputSizeSquared=784 -D gNumFilters=20 -D gFilterSize=28 -D gHalfFilterSize=14 -D gFilterSizeSquared=784 -D gNumOutputPlanes=20 -D gOutputPlanes=20 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0" | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=32 -D gInputPlanes=32 -D gInputSize=28 -D gInputSizeSquared=784 -D gNumFilters=20 -D gFilterSize=28 -D gHalfFilterSize=14 -D gFilterSizeSquared=784 -D gNumOutputPlanes=20 -D gOutputPlanes=20 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=1 -D gSkip=0" | |
" thrown in the test body. | |
[ FAILED ] testforward.crash_from_jm (157 ms) | |
[----------] 17 tests from testforward (1322 ms total) | |
[----------] 2 tests from testfilehelper | |
[ RUN ] testfilehelper.testfilehelper | |
[ OK ] testfilehelper.testfilehelper (19 ms) | |
[ RUN ] testfilehelper.testreadchunk | |
[ OK ] testfilehelper.testreadchunk (11 ms) | |
[----------] 2 tests from testfilehelper (30 ms total) | |
[----------] 12 tests from testsimpleconvolvenet | |
[ RUN ] testsimpleconvolvenet.imagesize1_planes2_filters2_unbiased_tanh | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH" | |
" thrown in the test body. | |
[ FAILED ] testsimpleconvolvenet.imagesize1_planes2_filters2_unbiased_tanh (77 ms) | |
[ RUN ] testsimpleconvolvenet.imagesize1_planes2_filters2_tanh | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH" | |
" thrown in the test body. | |
[ FAILED ] testsimpleconvolvenet.imagesize1_planes2_filters2_tanh (77 ms) | |
[ RUN ] testsimpleconvolvenet.imagesize3_n4_filtersize3_tanh | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D TANH" | |
" thrown in the test body. | |
[ FAILED ] testsimpleconvolvenet.imagesize3_n4_filtersize3_tanh (78 ms) | |
[ RUN ] testsimpleconvolvenet.imagesize1_2planes_filtersize1 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
forward try kernel 0 | |
... not plausibly optimal, skipping | |
forward try kernel 1 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 1: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 2 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, const int N) { | |
8: int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize; | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * gWorkgroupSize + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [outplane] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [imageid][filterid][row][col] | |
26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB | |
27: // eg 5 * 5 * 32 * 4 = 3.2KB => ok :-) | |
28: // but 28 * 28 * 32 * 4 = 100KB => less good :-P | |
29: void kernel forward_2_by_outplane( | |
30: const int batchSize, | |
31: global const float *images, global const float *filters, | |
32: global float *output, | |
33: local float *_inputPlane, local float *_filterCube) { | |
34: const int globalId = get_global_id(0); | |
35: | |
36: const int workgroupId = get_group_id(0); | |
37: const int workgroupSize = get_local_size(0); | |
38: const int outPlane = workgroupId; | |
39: | |
40: const int localId = get_local_id(0); | |
41: const int outputRow = localId / gOutputSize; | |
42: const int outputCol = localId % gOutputSize; | |
43: | |
44: #if gPadZeros == 1 | |
45: const int minu = max(-gHalfFilterSize, -outputRow); | |
46: const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven; | |
47: const int minv = max(-gHalfFilterSize, -outputCol); | |
48: const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven; | |
49: #else | |
50: const int minu = -gHalfFilterSize; | |
51: const int maxu = gHalfFilterSize - gEven; | |
52: const int minv = -gHalfFilterSize; | |
53: const int maxv = gHalfFilterSize - gEven; | |
54: #endif | |
55: | |
56: { | |
57: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
58: copyLocal(_filterCube, | |
59: filters + outPlane * filterCubeLength, | |
60: filterCubeLength); | |
61: } | |
62: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
63: | |
64: for (int n = 0; n < batchSize; n++) { | |
65: float sum = 0; | |
66: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
67: barrier(CLK_LOCAL_MEM_FENCE); | |
68: copyLocal(_inputPlane, | |
69: images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, | |
70: gInputSizeSquared); | |
71: barrier(CLK_LOCAL_MEM_FENCE); | |
72: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
73: if (localId < gOutputSizeSquared) { | |
74: for (int u = minu; u <= maxu; u++) { | |
75: int inputRow = outputRow + u; | |
76: #if gPadZeros == 0 | |
77: inputRow += gHalfFilterSize; | |
78: #endif | |
79: int inputimagerowoffset = inputRow * gInputSize; | |
80: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
81: for (int v = minv; v <= maxv; v++) { | |
82: int inputCol = outputCol + v; | |
83: #if gPadZeros == 0 | |
84: inputCol += gHalfFilterSize; | |
85: #endif | |
86: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
87: } | |
88: } | |
89: } | |
90: } | |
91: // output are organized like [imageid][filterid][row][col] | |
92: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
93: if (localId < gOutputSizeSquared) { | |
94: output[resultIndex ] = sum; | |
95: } | |
96: } | |
97: } | |
98: #endif | |
99: | |
100: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
ForwardAuto: kernel 2: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, const int N) { | |
8: int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize; | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * gWorkgroupSize + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [outplane] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [imageid][filterid][row][col] | |
26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB | |
27: // eg 5 * 5 * 32 * 4 = 3.2KB => ok :-) | |
28: // but 28 * 28 * 32 * 4 = 100KB => less good :-P | |
29: void kernel forward_2_by_outplane( | |
30: const int batchSize, | |
31: global const float *images, global const float *filters, | |
32: global float *output, | |
33: local float *_inputPlane, local float *_filterCube) { | |
34: const int globalId = get_global_id(0); | |
35: | |
36: const int workgroupId = get_group_id(0); | |
37: const int workgroupSize = get_local_size(0); | |
38: const int outPlane = workgroupId; | |
39: | |
40: const int localId = get_local_id(0); | |
41: const int outputRow = localId / gOutputSize; | |
42: const int outputCol = localId % gOutputSize; | |
43: | |
44: #if gPadZeros == 1 | |
45: const int minu = max(-gHalfFilterSize, -outputRow); | |
46: const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven; | |
47: const int minv = max(-gHalfFilterSize, -outputCol); | |
48: const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven; | |
49: #else | |
50: const int minu = -gHalfFilterSize; | |
51: const int maxu = gHalfFilterSize - gEven; | |
52: const int minv = -gHalfFilterSize; | |
53: const int maxv = gHalfFilterSize - gEven; | |
54: #endif | |
55: | |
56: { | |
57: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
58: copyLocal(_filterCube, | |
59: filters + outPlane * filterCubeLength, | |
60: filterCubeLength); | |
61: } | |
62: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
63: | |
64: for (int n = 0; n < batchSize; n++) { | |
65: float sum = 0; | |
66: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
67: barrier(CLK_LOCAL_MEM_FENCE); | |
68: copyLocal(_inputPlane, | |
69: images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, | |
70: gInputSizeSquared); | |
71: barrier(CLK_LOCAL_MEM_FENCE); | |
72: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
73: if (localId < gOutputSizeSquared) { | |
74: for (int u = minu; u <= maxu; u++) { | |
75: int inputRow = outputRow + u; | |
76: #if gPadZeros == 0 | |
77: inputRow += gHalfFilterSize; | |
78: #endif | |
79: int inputimagerowoffset = inputRow * gInputSize; | |
80: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
81: for (int v = minv; v <= maxv; v++) { | |
82: int inputCol = outputCol + v; | |
83: #if gPadZeros == 0 | |
84: inputCol += gHalfFilterSize; | |
85: #endif | |
86: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
87: } | |
88: } | |
89: } | |
90: } | |
91: // output are organized like [imageid][filterid][row][col] | |
92: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
93: if (localId < gOutputSizeSquared) { | |
94: output[resultIndex ] = sum; | |
95: } | |
96: } | |
97: } | |
98: #endif | |
99: | |
100: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
... not valid | |
forward try kernel 3 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: each workgroup handles convolving one input example with one filtercube | |
8: // and writing out one single output plane | |
9: // | |
10: // workgroup id organized like: [imageid][outplane] | |
11: // local id organized like: [outrow][outcol] | |
12: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
13: // number workgroups = 32 | |
14: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
16: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
17: // output are organized like [imageid][filterid][row][col] | |
18: void kernel forward_3_by_n_outplane(const int batchSize, | |
19: global const float *images, global const float *filters, | |
20: global float *output, | |
21: local float *_upstreamImage, local float *_filterCube) { | |
22: const int globalId = get_global_id(0); | |
23: | |
24: const int workgroupId = get_group_id(0); | |
25: const int workgroupSize = get_local_size(0); | |
26: const int n = workgroupId / gNumFilters; | |
27: const int outPlane = workgroupId % gNumFilters; | |
28: | |
29: const int localId = get_local_id(0); | |
30: const int outputRow = localId / gOutputSize; | |
31: const int outputCol = localId % gOutputSize; | |
32: | |
33: const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize; | |
34: const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow - gEven) : gHalfFilterSize - gEven; | |
35: const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize; | |
36: const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven; | |
37: | |
38: const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
39: | |
40: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
41: const int filterCubeGlobalOffset = outPlane * filterCubeLength; | |
42: const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize; | |
43: for (int i = 0; i < numPixelsPerThread; i++) { | |
44: int thisOffset = localId + i * workgroupSize; | |
45: if (thisOffset < filterCubeLength) { | |
46: _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset]; | |
47: } | |
48: } | |
49: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
50: | |
51: float sum = 0; | |
52: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
53: int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: for (int i = 0; i < numUpstreamsPerThread; i++) { | |
56: int thisOffset = workgroupSize * i + localId; | |
57: if (thisOffset < gInputSizeSquared) { | |
58: _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ]; | |
59: } | |
60: } | |
61: barrier(CLK_LOCAL_MEM_FENCE); | |
62: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
63: for (int u = minu; u <= maxu; u++) { | |
64: int inputRow = outputRow + u; | |
65: #if gPadZeros == 0 | |
66: inputRow += gHalfFilterSize; | |
67: #endif | |
68: int inputimagerowoffset = inputRow * gInputSize; | |
69: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
70: for (int v = minv; v <= maxv; v++) { | |
71: int inputCol = outputCol + v; | |
72: #if gPadZeros == 0 | |
73: inputCol += gHalfFilterSize; | |
74: #endif | |
75: if (localId < gOutputSizeSquared) { | |
76: sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
77: } | |
78: } | |
79: } | |
80: } | |
81: | |
82: // output are organized like [imageid][filterid][row][col] | |
83: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
84: if (localId < gOutputSizeSquared) { | |
85: output[resultIndex ] = sum; | |
86: } | |
87: } | |
88: | |
89: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 3: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: each workgroup handles convolving one input example with one filtercube | |
8: // and writing out one single output plane | |
9: // | |
10: // workgroup id organized like: [imageid][outplane] | |
11: // local id organized like: [outrow][outcol] | |
12: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
13: // number workgroups = 32 | |
14: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
16: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
17: // output are organized like [imageid][filterid][row][col] | |
18: void kernel forward_3_by_n_outplane(const int batchSize, | |
19: global const float *images, global const float *filters, | |
20: global float *output, | |
21: local float *_upstreamImage, local float *_filterCube) { | |
22: const int globalId = get_global_id(0); | |
23: | |
24: const int workgroupId = get_group_id(0); | |
25: const int workgroupSize = get_local_size(0); | |
26: const int n = workgroupId / gNumFilters; | |
27: const int outPlane = workgroupId % gNumFilters; | |
28: | |
29: const int localId = get_local_id(0); | |
30: const int outputRow = localId / gOutputSize; | |
31: const int outputCol = localId % gOutputSize; | |
32: | |
33: const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize; | |
34: const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow - gEven) : gHalfFilterSize - gEven; | |
35: const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize; | |
36: const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven; | |
37: | |
38: const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
39: | |
40: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
41: const int filterCubeGlobalOffset = outPlane * filterCubeLength; | |
42: const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize; | |
43: for (int i = 0; i < numPixelsPerThread; i++) { | |
44: int thisOffset = localId + i * workgroupSize; | |
45: if (thisOffset < filterCubeLength) { | |
46: _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset]; | |
47: } | |
48: } | |
49: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
50: | |
51: float sum = 0; | |
52: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
53: int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: for (int i = 0; i < numUpstreamsPerThread; i++) { | |
56: int thisOffset = workgroupSize * i + localId; | |
57: if (thisOffset < gInputSizeSquared) { | |
58: _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ]; | |
59: } | |
60: } | |
61: barrier(CLK_LOCAL_MEM_FENCE); | |
62: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
63: for (int u = minu; u <= maxu; u++) { | |
64: int inputRow = outputRow + u; | |
65: #if gPadZeros == 0 | |
66: inputRow += gHalfFilterSize; | |
67: #endif | |
68: int inputimagerowoffset = inputRow * gInputSize; | |
69: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
70: for (int v = minv; v <= maxv; v++) { | |
71: int inputCol = outputCol + v; | |
72: #if gPadZeros == 0 | |
73: inputCol += gHalfFilterSize; | |
74: #endif | |
75: if (localId < gOutputSizeSquared) { | |
76: sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
77: } | |
78: } | |
79: } | |
80: } | |
81: | |
82: // output are organized like [imageid][filterid][row][col] | |
83: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
84: if (localId < gOutputSizeSquared) { | |
85: output[resultIndex ] = sum; | |
86: } | |
87: } | |
88: | |
89: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 4 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, int N) { | |
8: int numLoops = (N + get_local_size(0) - 1) / get_local_size(0); | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * get_local_size(0) + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [n][filterid] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [n][filterid][outrow][outcol] | |
26: // the pixels per thread thing... : | |
27: // - we have one thread (~= cuda core) per output value, | |
28: // ie one thread for each combination of [outrow][outcol] | |
29: // - however, the number of threads is typically limited on a gpu, | |
30: // eg to 512 (eg Intel HD), or 1024 (eg nVidia K520) | |
31: // - so what happens if the number of output points is larger than | |
32: // the maximum workgroup size? | |
33: // - then we have several possibilities really: | |
34: // - we can divide the image into blocks, and process each block | |
35: // separately. This is probably a good option, but fair amount of | |
36: // work | |
37: // - we can get each thread to handle more than one output | |
38: // pixel, by looping | |
39: // - we can consider the output image in 1d, by putting the rows | |
40: // one after another, and assign each contiguous workgroup-size | |
41: // block to one workgroup | |
42: // => this is how this kernel works | |
43: // basically, it's a hack, so larger images actually run, without | |
44: // crashing, and we can probably improve it a lot :-) | |
45: // | |
46: // So, when outputSize * outputSize > workgroupSize, then | |
47: // multiple workgroups will be created for each output plane | |
48: // the number of such workgroups is given by: `gPixelsPerThread` | |
49: // the id of our workgroup within such a set of workgroups is calculated | |
50: // as `pixel` | |
51: // effectiveLocalId is our local id if we had one enormous workgroup | |
52: // containing the whole output image plane | |
53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize, | |
54: global const float *images, global const float *filters, | |
55: global float *output, | |
56: local float *_inputPlane, local float *_filterPlane) { | |
57: #define globalId (get_global_id(0)) | |
58: | |
59: #define localId (get_local_id(0)) | |
60: #define workgroupId (get_group_id(0)) | |
61: // const int workgroupSize = get_local_size(0); | |
62: const int effectiveWorkgroupId = workgroupId / gPixelsPerThread; | |
63: const int pixel = workgroupId % gPixelsPerThread; | |
64: const int effectiveLocalId = localId + pixel * gWorkgroupSize; | |
65: const int n = effectiveWorkgroupId / gNumFilters; | |
66: const int outPlane = effectiveWorkgroupId % gNumFilters; | |
67: | |
68: const int outputRow = effectiveLocalId / gOutputSize; | |
69: const int outputCol = effectiveLocalId % gOutputSize; | |
70: | |
71: float sum = 0; | |
72: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
73: barrier(CLK_LOCAL_MEM_FENCE); | |
74: copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared); | |
75: copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared); | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: | |
78: if (effectiveLocalId < gOutputSizeSquared) { | |
79: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
80: // trying to reduce register pressure... | |
81: #if gPadZeros == 1 | |
82: #define inputRow (outputRow + u) | |
83: #else | |
84: #define inputRow (outputRow + u + gHalfFilterSize) | |
85: #endif | |
86: int inputimagerowoffset = inputRow * gInputSize; | |
87: int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
88: bool rowOk = inputRow >= 0 && inputRow < gInputSize; | |
89: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
90: #if gPadZeros == 1 | |
91: #define inputCol (outputCol + v) | |
92: #else | |
93: #define inputCol (outputCol + v + gHalfFilterSize) | |
94: #endif | |
95: bool process = rowOk && inputCol >= 0 && inputCol < gInputSize; | |
96: if (process) { | |
97: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ]; | |
98: } | |
99: } | |
100: } | |
101: } | |
102: } | |
103: // output are organized like [imageid][filterid][row][col] | |
104: #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId) | |
105: if (effectiveLocalId < gOutputSizeSquared) { | |
106: output[resultIndex ] = sum; | |
107: } | |
108: } | |
109: #endif | |
110: | |
111: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 4: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, int N) { | |
8: int numLoops = (N + get_local_size(0) - 1) / get_local_size(0); | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * get_local_size(0) + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [n][filterid] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [n][filterid][outrow][outcol] | |
26: // the pixels per thread thing... : | |
27: // - we have one thread (~= cuda core) per output value, | |
28: // ie one thread for each combination of [outrow][outcol] | |
29: // - however, the number of threads is typically limited on a gpu, | |
30: // eg to 512 (eg Intel HD), or 1024 (eg nVidia K520) | |
31: // - so what happens if the number of output points is larger than | |
32: // the maximum workgroup size? | |
33: // - then we have several possibilities really: | |
34: // - we can divide the image into blocks, and process each block | |
35: // separately. This is probably a good option, but fair amount of | |
36: // work | |
37: // - we can get each thread to handle more than one output | |
38: // pixel, by looping | |
39: // - we can consider the output image in 1d, by putting the rows | |
40: // one after another, and assign each contiguous workgroup-size | |
41: // block to one workgroup | |
42: // => this is how this kernel works | |
43: // basically, it's a hack, so larger images actually run, without | |
44: // crashing, and we can probably improve it a lot :-) | |
45: // | |
46: // So, when outputSize * outputSize > workgroupSize, then | |
47: // multiple workgroups will be created for each output plane | |
48: // the number of such workgroups is given by: `gPixelsPerThread` | |
49: // the id of our workgroup within such a set of workgroups is calculated | |
50: // as `pixel` | |
51: // effectiveLocalId is our local id if we had one enormous workgroup | |
52: // containing the whole output image plane | |
53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize, | |
54: global const float *images, global const float *filters, | |
55: global float *output, | |
56: local float *_inputPlane, local float *_filterPlane) { | |
57: #define globalId (get_global_id(0)) | |
58: | |
59: #define localId (get_local_id(0)) | |
60: #define workgroupId (get_group_id(0)) | |
61: // const int workgroupSize = get_local_size(0); | |
62: const int effectiveWorkgroupId = workgroupId / gPixelsPerThread; | |
63: const int pixel = workgroupId % gPixelsPerThread; | |
64: const int effectiveLocalId = localId + pixel * gWorkgroupSize; | |
65: const int n = effectiveWorkgroupId / gNumFilters; | |
66: const int outPlane = effectiveWorkgroupId % gNumFilters; | |
67: | |
68: const int outputRow = effectiveLocalId / gOutputSize; | |
69: const int outputCol = effectiveLocalId % gOutputSize; | |
70: | |
71: float sum = 0; | |
72: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
73: barrier(CLK_LOCAL_MEM_FENCE); | |
74: copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared); | |
75: copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared); | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: | |
78: if (effectiveLocalId < gOutputSizeSquared) { | |
79: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
80: // trying to reduce register pressure... | |
81: #if gPadZeros == 1 | |
82: #define inputRow (outputRow + u) | |
83: #else | |
84: #define inputRow (outputRow + u + gHalfFilterSize) | |
85: #endif | |
86: int inputimagerowoffset = inputRow * gInputSize; | |
87: int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
88: bool rowOk = inputRow >= 0 && inputRow < gInputSize; | |
89: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
90: #if gPadZeros == 1 | |
91: #define inputCol (outputCol + v) | |
92: #else | |
93: #define inputCol (outputCol + v + gHalfFilterSize) | |
94: #endif | |
95: bool process = rowOk && inputCol >= 0 && inputCol < gInputSize; | |
96: if (process) { | |
97: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ]; | |
98: } | |
99: } | |
100: } | |
101: } | |
102: } | |
103: // output are organized like [imageid][filterid][row][col] | |
104: #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId) | |
105: if (effectiveLocalId < gOutputSizeSquared) { | |
106: output[resultIndex ] = sum; | |
107: } | |
108: } | |
109: #endif | |
110: | |
111: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 5 | |
cl/reduce_segments.cl build log: | |
(8:0) : error : invalid global address space qualifier specified for parameter type | |
(8:0) : error : syntax error at 'const' | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: kernel void reduce_segments(const int numSegments, const int segmentLength, | |
8: global float const *in, global float* out) { | |
9: const int globalId = get_global_id(0); | |
10: const int segmentId = globalId; | |
11: | |
12: if (segmentId >= numSegments) { | |
13: return; | |
14: } | |
15: | |
16: float sum = 0; | |
17: global const float *segment = in + segmentId * segmentLength; | |
18: for (int i = 0; i < segmentLength; i++) { | |
19: sum += segment[i]; | |
20: } | |
21: out[segmentId] = sum; | |
22: } | |
23: | |
24: | |
25: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/reduce_segments.cl build log: | |
(8:0) : error : invalid global address space qualifier specified for parameter type | |
(8:0) : error : syntax error at 'const' | |
ForwardAuto: kernel 5: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: kernel void reduce_segments(const int numSegments, const int segmentLength, | |
8: global float const *in, global float* out) { | |
9: const int globalId = get_global_id(0); | |
10: const int segmentId = globalId; | |
11: | |
12: if (segmentId >= numSegments) { | |
13: return; | |
14: } | |
15: | |
16: float sum = 0; | |
17: global const float *segment = in + segmentId * segmentLength; | |
18: for (int i = 0; i < segmentLength; i++) { | |
19: sum += segment[i]; | |
20: } | |
21: out[segmentId] = sum; | |
22: } | |
23: | |
24: | |
25: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/reduce_segments.cl build log: | |
(8:0) : error : invalid global address space qualifier specified for parameter type | |
(8:0) : error : syntax error at 'const' | |
... not valid | |
forward try kernel 6 | |
cl/forward_byinputplane.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: | |
8: // - load same input plane from each image | |
9: // - hold filter plane for this input plane, for all filters | |
10: // - reduce afterwards | |
11: // local memory for one plane from each filter of 64c7 = 64 * 7 * 7 * 4 = 12.5KB | |
12: // local memory for one single input plane = 19 * 19 * 4 = 1.4KB | |
13: // => seems ok? | |
14: // workgroupid: [inputPlaneId] | |
15: // localid: [filterId][outRow] (if this is more than workgroupsize, we should reuse some threads...) | |
16: // iterate over: [n][outCol] | |
17: // output: [n][filterId][outRow][outCol][inputPlane] | |
18: // need to later reduce output over: [inputPlane] | |
19: void kernel forward_byinputplane(const int batchSize, | |
20: global const float *images, global const float *filters, | |
21: global float *output, | |
22: local float *_inputPlane, local float *_filterPlanes) { | |
23: // const int evenPadding = gFilterSize % 2 == 0 ? 1 : 0; | |
24: | |
25: const int globalId = get_global_id(0); | |
26: const int workgroupId = get_group_id(0); | |
27: const int workgroupSize = get_local_size(0); | |
28: const int localId = get_local_id(0); | |
29: | |
30: const int inputPlaneId = workgroupId; | |
31: const int numLoops = (gNumFilters * gOutputSize + workgroupSize - 1) / workgroupSize; | |
32: const int numFilterCopyLoops = (gFilterSizeSquared + gOutputSize - 1) / gOutputSize; | |
33: const int numImageCopyLoops = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
34: for (int loop = 0; loop < numLoops; loop++) { | |
35: const int loopLocalId = localId + loop * workgroupSize; | |
36: const int filterId = loopLocalId / gOutputSize; | |
37: const int outRow = loopLocalId % gOutputSize; | |
38: | |
39: // copy down our filter, we have gOutputSize threads to do this | |
40: global float const *globalFilterPlane = filters + | |
41: (filterId * gNumInputPlanes + inputPlaneId) * gFilterSizeSquared; | |
42: local float *_localFilterPlane = _filterPlanes + filterId * gFilterSizeSquared; | |
43: barrier(CLK_LOCAL_MEM_FENCE); | |
44: for (int i = 0; i < numFilterCopyLoops; i++) { | |
45: const int offset = i * gOutputSize + outRow; | |
46: bool process = filterId < gNumFilters && offset < gFilterSizeSquared; | |
47: if (process) { | |
48: _localFilterPlane[ offset ] = globalFilterPlane[ offset ]; | |
49: } | |
50: } | |
51: // loop over n ... | |
52: for (int n = 0; n < batchSize; n++) { | |
53: // copy down our imageplane, we have workgroupSize threads to do this | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: global float const *globalImagePlane = images + | |
56: (n * gNumInputPlanes + inputPlaneId) * gInputSizeSquared; | |
57: for (int i = 0; i< numImageCopyLoops; i++) { | |
58: const int offset = i * workgroupSize + localId; | |
59: if (offset < gInputSizeSquared) { | |
60: _inputPlane[ offset ] = globalImagePlane[ offset ]; | |
61: } | |
62: } | |
63: barrier(CLK_LOCAL_MEM_FENCE); | |
64: // calc output for each [outrow][outcol] | |
65: bool filterPlaneOk = filterId < gNumFilters; | |
66: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
67: float sum = 0; | |
68: for (int filterRow = 0; filterRow < gFilterSize; filterRow++) { | |
69: int inRow = outRow + filterRow; | |
70: #if gPadZeros == 1 | |
71: inRow -= gHalfFilterSize; | |
72: #endif | |
73: bool rowOk = filterPlaneOk && inRow >= 0 && inRow < gInputSize; | |
74: for (int filterCol = 0; filterCol < gFilterSize; filterCol++) { | |
75: int inCol = outCol + filterCol; | |
76: #if gPadZeros == 1 | |
77: inCol -= gHalfFilterSize; | |
78: #endif | |
79: bool process = rowOk && inCol >= 0 && inCol < gInputSize; | |
80: if (process) { | |
81: float imageValue = _inputPlane[ inRow * gInputSize + inCol ]; | |
82: float filterValue = _localFilterPlane[ filterRow * gFilterSize + filterCol ]; | |
83: sum += imageValue * filterValue; | |
84: } | |
85: } | |
86: } | |
87: if (filterId < gNumFilters) { | |
88: // [n][filterId][outRow][outCol][inputPlane] | |
89: int resultIndex = (( (n | |
90: * gNumFilters + filterId) | |
91: * gOutputSize + outRow) | |
92: * gOutputSize + outCol) | |
93: * gNumInputPlanes + inputPlaneId; | |
94: output[resultIndex] = sum; | |
95: //if (globalId == 2) output[0] = resultIndex; | |
96: // output[resultIndex] = outRow; | |
97: } | |
98: // output[localId] = _localFilterPlane[localId]; | |
99: } | |
100: } | |
101: } | |
102: } | |
103: | |
104: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward_byinputplane.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 6: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: | |
8: // - load same input plane from each image | |
9: // - hold filter plane for this input plane, for all filters | |
10: // - reduce afterwards | |
11: // local memory for one plane from each filter of 64c7 = 64 * 7 * 7 * 4 = 12.5KB | |
12: // local memory for one single input plane = 19 * 19 * 4 = 1.4KB | |
13: // => seems ok? | |
14: // workgroupid: [inputPlaneId] | |
15: // localid: [filterId][outRow] (if this is more than workgroupsize, we should reuse some threads...) | |
16: // iterate over: [n][outCol] | |
17: // output: [n][filterId][outRow][outCol][inputPlane] | |
18: // need to later reduce output over: [inputPlane] | |
19: void kernel forward_byinputplane(const int batchSize, | |
20: global const float *images, global const float *filters, | |
21: global float *output, | |
22: local float *_inputPlane, local float *_filterPlanes) { | |
23: // const int evenPadding = gFilterSize % 2 == 0 ? 1 : 0; | |
24: | |
25: const int globalId = get_global_id(0); | |
26: const int workgroupId = get_group_id(0); | |
27: const int workgroupSize = get_local_size(0); | |
28: const int localId = get_local_id(0); | |
29: | |
30: const int inputPlaneId = workgroupId; | |
31: const int numLoops = (gNumFilters * gOutputSize + workgroupSize - 1) / workgroupSize; | |
32: const int numFilterCopyLoops = (gFilterSizeSquared + gOutputSize - 1) / gOutputSize; | |
33: const int numImageCopyLoops = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
34: for (int loop = 0; loop < numLoops; loop++) { | |
35: const int loopLocalId = localId + loop * workgroupSize; | |
36: const int filterId = loopLocalId / gOutputSize; | |
37: const int outRow = loopLocalId % gOutputSize; | |
38: | |
39: // copy down our filter, we have gOutputSize threads to do this | |
40: global float const *globalFilterPlane = filters + | |
41: (filterId * gNumInputPlanes + inputPlaneId) * gFilterSizeSquared; | |
42: local float *_localFilterPlane = _filterPlanes + filterId * gFilterSizeSquared; | |
43: barrier(CLK_LOCAL_MEM_FENCE); | |
44: for (int i = 0; i < numFilterCopyLoops; i++) { | |
45: const int offset = i * gOutputSize + outRow; | |
46: bool process = filterId < gNumFilters && offset < gFilterSizeSquared; | |
47: if (process) { | |
48: _localFilterPlane[ offset ] = globalFilterPlane[ offset ]; | |
49: } | |
50: } | |
51: // loop over n ... | |
52: for (int n = 0; n < batchSize; n++) { | |
53: // copy down our imageplane, we have workgroupSize threads to do this | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: global float const *globalImagePlane = images + | |
56: (n * gNumInputPlanes + inputPlaneId) * gInputSizeSquared; | |
57: for (int i = 0; i< numImageCopyLoops; i++) { | |
58: const int offset = i * workgroupSize + localId; | |
59: if (offset < gInputSizeSquared) { | |
60: _inputPlane[ offset ] = globalImagePlane[ offset ]; | |
61: } | |
62: } | |
63: barrier(CLK_LOCAL_MEM_FENCE); | |
64: // calc output for each [outrow][outcol] | |
65: bool filterPlaneOk = filterId < gNumFilters; | |
66: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
67: float sum = 0; | |
68: for (int filterRow = 0; filterRow < gFilterSize; filterRow++) { | |
69: int inRow = outRow + filterRow; | |
70: #if gPadZeros == 1 | |
71: inRow -= gHalfFilterSize; | |
72: #endif | |
73: bool rowOk = filterPlaneOk && inRow >= 0 && inRow < gInputSize; | |
74: for (int filterCol = 0; filterCol < gFilterSize; filterCol++) { | |
75: int inCol = outCol + filterCol; | |
76: #if gPadZeros == 1 | |
77: inCol -= gHalfFilterSize; | |
78: #endif | |
79: bool process = rowOk && inCol >= 0 && inCol < gInputSize; | |
80: if (process) { | |
81: float imageValue = _inputPlane[ inRow * gInputSize + inCol ]; | |
82: float filterValue = _localFilterPlane[ filterRow * gFilterSize + filterCol ]; | |
83: sum += imageValue * filterValue; | |
84: } | |
85: } | |
86: } | |
87: if (filterId < gNumFilters) { | |
88: // [n][filterId][outRow][outCol][inputPlane] | |
89: int resultIndex = (( (n | |
90: * gNumFilters + filterId) | |
91: * gOutputSize + outRow) | |
92: * gOutputSize + outCol) | |
93: * gNumInputPlanes + inputPlaneId; | |
94: output[resultIndex] = sum; | |
95: //if (globalId == 2) output[0] = resultIndex; | |
96: // output[resultIndex] = outRow; | |
97: } | |
98: // output[localId] = _localFilterPlane[localId]; | |
99: } | |
100: } | |
101: } | |
102: } | |
103: | |
104: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward_byinputplane.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 7 | |
... seems valid | |
ForwardIm2Col.cl build log: | |
(19:0) : error : invalid global address space qualifier specified for parameter type | |
(19:0) : error : syntax error at 'const' | |
kernel build error: | |
kernel source: | |
1: // from SpatialConvolutionMM.cu: | |
2: | |
3: // CL: grid stride looping | |
4: #define CL_KERNEL_LOOP(i, n) \ | |
5: for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \ | |
6: i < (n); \ | |
7: i += get_local_size(0) * get_num_groups(0)) | |
8: | |
9: //#define gPadding 0 | |
10: //#define gStride 1 | |
11: //#define gColSize 1 | |
12: //#define gFilterSize 1 | |
13: //#define gSize 1 | |
14: | |
15: // Kernel for fast unfold+copy | |
16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu) | |
17: kernel void im2col( | |
18: const int n, | |
19: global float const * im_data, int im_offset, | |
20: global float* data_col) { | |
21: global const float *data_im = im_data + im_offset; | |
22: | |
23: CL_KERNEL_LOOP(index, n) { | |
24: int w_out = index % 1; | |
25: index /= 1; | |
26: int h_out = index % 1; | |
27: int channel_in = index / 1; | |
28: int channel_out = channel_in * 1 * 1; | |
29: int h_in = h_out * 1 - 0; | |
30: int w_in = w_out * 1 - 0; | |
31: data_col += (channel_out * 1 + h_out) * 1 + w_out; | |
32: data_im += (channel_in * 1 + h_in) * 1 + w_in; | |
33: for (int i = 0; i < 1; ++i) { | |
34: for (int j = 0; j < 1; ++j) { | |
35: int h = h_in + i; | |
36: int w = w_in + j; | |
37: *data_col = (h >= 0 && w >= 0 && h < 1 && w < 1) ? | |
38: data_im[i * 1 + j] : 0; | |
39: data_col += 1 * 1; | |
40: } | |
41: } | |
42: } | |
43: } | |
44: | |
45: kernel void col2im( | |
46: const int n, | |
47: global float const *data_col, | |
48: global float* im_data, int im_offset) { | |
49: global float *data_im = im_data + im_offset; | |
50: | |
51: for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) { | |
52: float val = 0; | |
53: int w = index % 1 + 0; | |
54: int h = (index / 1) % 1 + 0; | |
55: int c = index / (1 * 1); | |
56: // compute the start and end of the output | |
57: int w_col_start = (w < 1) ? 0 : (w - 1) / 1 + 1; | |
58: int w_col_end = min(w / 1 + 1, 1); | |
59: int h_col_start = (h < 1) ? 0 : (h - 1) / 1 + 1; | |
60: int h_col_end = min(h / 1 + 1, 1); | |
61: | |
62: int offset = (c * 1 * 1 + h * 1 + w) * 1 * 1; | |
63: int coeff_h_col = (1 - 1 * 1 * 1) * 1; | |
64: int coeff_w_col = (1 - 1 * 1 * 1); | |
65: for (int h_col = h_col_start; h_col < h_col_end; ++h_col) { | |
66: for (int w_col = w_col_start; w_col < w_col_end; ++w_col) { | |
67: val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col]; | |
68: } | |
69: } | |
70: data_im[index] = val; | |
71: } | |
72: } | |
73: | |
74: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
ForwardIm2Col.cl build log: | |
(19:0) : error : invalid global address space qualifier specified for parameter type | |
(19:0) : error : syntax error at 'const' | |
ForwardAuto: kernel 7 this instance cant be used: | |
kernel source: | |
1: // from SpatialConvolutionMM.cu: | |
2: | |
3: // CL: grid stride looping | |
4: #define CL_KERNEL_LOOP(i, n) \ | |
5: for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \ | |
6: i < (n); \ | |
7: i += get_local_size(0) * get_num_groups(0)) | |
8: | |
9: //#define gPadding 0 | |
10: //#define gStride 1 | |
11: //#define gColSize 1 | |
12: //#define gFilterSize 1 | |
13: //#define gSize 1 | |
14: | |
15: // Kernel for fast unfold+copy | |
16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu) | |
17: kernel void im2col( | |
18: const int n, | |
19: global float const * im_data, int im_offset, | |
20: global float* data_col) { | |
21: global const float *data_im = im_data + im_offset; | |
22: | |
23: CL_KERNEL_LOOP(index, n) { | |
24: int w_out = index % 1; | |
25: index /= 1; | |
26: int h_out = index % 1; | |
27: int channel_in = index / 1; | |
28: int channel_out = channel_in * 1 * 1; | |
29: int h_in = h_out * 1 - 0; | |
30: int w_in = w_out * 1 - 0; | |
31: data_col += (channel_out * 1 + h_out) * 1 + w_out; | |
32: data_im += (channel_in * 1 + h_in) * 1 + w_in; | |
33: for (int i = 0; i < 1; ++i) { | |
34: for (int j = 0; j < 1; ++j) { | |
35: int h = h_in + i; | |
36: int w = w_in + j; | |
37: *data_col = (h >= 0 && w >= 0 && h < 1 && w < 1) ? | |
38: data_im[i * 1 + j] : 0; | |
39: data_col += 1 * 1; | |
40: } | |
41: } | |
42: } | |
43: } | |
44: | |
45: kernel void col2im( | |
46: const int n, | |
47: global float const *data_col, | |
48: global float* im_data, int im_offset) { | |
49: global float *data_im = im_data + im_offset; | |
50: | |
51: for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) { | |
52: float val = 0; | |
53: int w = index % 1 + 0; | |
54: int h = (index / 1) % 1 + 0; | |
55: int c = index / (1 * 1); | |
56: // compute the start and end of the output | |
57: int w_col_start = (w < 1) ? 0 : (w - 1) / 1 + 1; | |
58: int w_col_end = min(w / 1 + 1, 1); | |
59: int h_col_start = (h < 1) ? 0 : (h - 1) / 1 + 1; | |
60: int h_col_end = min(h / 1 + 1, 1); | |
61: | |
62: int offset = (c * 1 * 1 + h * 1 + w) * 1 * 1; | |
63: int coeff_h_col = (1 - 1 * 1 * 1) * 1; | |
64: int coeff_w_col = (1 - 1 * 1 * 1); | |
65: for (int h_col = h_col_start; h_col < h_col_end; ++h_col) { | |
66: for (int w_col = w_col_start; w_col < w_col_end; ++w_col) { | |
67: val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col]; | |
68: } | |
69: } | |
70: data_im[index] = val; | |
71: } | |
72: } | |
73: | |
74: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
ForwardIm2Col.cl build log: | |
(19:0) : error : invalid global address space qualifier specified for parameter type | |
(19:0) : error : syntax error at 'const' | |
forward kernel 0: cannot be used | |
forward kernel 1: cannot be used | |
forward kernel 2: cannot be used | |
forward kernel 3: cannot be used | |
forward kernel 4: cannot be used | |
forward kernel 5: cannot be used | |
forward kernel 6: cannot be used | |
forward kernel 7: cannot be used | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description "No valid forward implementations found" thrown in the test body. | |
[ FAILED ] testsimpleconvolvenet.imagesize1_2planes_filtersize1 (186 ms) | |
[ RUN ] testsimpleconvolvenet.imagesize3_n4_filtersize3_relu | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU" | |
" thrown in the test body. | |
[ FAILED ] testsimpleconvolvenet.imagesize3_n4_filtersize3_relu (70 ms) | |
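A note on the numeric code in the failure above: in the standard Khronos `CL/cl.h` header, -45 is `CL_INVALID_PROGRAM_EXECUTABLE`, and -11 (seen in the clBLAS failures earlier in this log) is `CL_BUILD_PROGRAM_FAILURE`. Read that way, the `clCreateKernel` failure is most likely a downstream symptom: the program never built, so there is no executable to create a kernel from. The sketch below is a minimal lookup for a few codes. `clErrorName` is a hypothetical helper, not part of DeepCL or EasyCL, and the constants are copied by hand so that the snippet compiles without the OpenCL headers.

```cpp
#include <string>

// Hypothetical helper (not part of DeepCL or EasyCL): maps a few OpenCL
// status codes to their names. Values are copied from the Khronos CL/cl.h header.
inline std::string clErrorName(int code) {
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -43: return "CL_INVALID_BUILD_OPTIONS";
        case -44: return "CL_INVALID_PROGRAM";
        case -45: return "CL_INVALID_PROGRAM_EXECUTABLE";
        case -46: return "CL_INVALID_KERNEL_NAME";
        default:  return "unknown OpenCL status " + std::to_string(code);
    }
}
```

Calling `clErrorName(-45)` on the code reported here returns `CL_INVALID_PROGRAM_EXECUTABLE`, which points back at the build log rather than at the kernel name.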
[ RUN ] testsimpleconvolvenet.imagesize3_n4_filtersize3_linear | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
forward try kernel 0 | |
... not plausibly optimal, skipping | |
forward try kernel 1 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 1: this instance cant be used: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
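The index arithmetic in `convolve_imagecubes_float2` can be checked on the host. The sketch below mirrors the kernel's decomposition of a flat `globalId` into `[imageid][filterid][row][col]`, together with the "new imagesize = oldimagesize - filtersize + 1" rule from the kernel comments. `OutputCoord`, `decodeGlobalId`, and `validOutputSize` are hypothetical names introduced only for illustration; they do not exist in DeepCL.

```cpp
#include <cassert>

// Hypothetical host-side mirror of the kernel's index arithmetic.
struct OutputCoord {
    int exampleId;
    int filterId;
    int outputRow;
    int outputCol;
};

// Output edge length when gPadZeros == 0, per the kernel comments:
// new imagesize = oldimagesize - filtersize + 1.
inline int validOutputSize(int inputSize, int filterSize) {
    assert(filterSize >= 1 && inputSize >= filterSize);
    return inputSize - filterSize + 1;
}

// Splits a flat global id laid out as [imageid][filterid][row][col].
inline OutputCoord decodeGlobalId(int globalId, int numFilters, int outputSize) {
    assert(globalId >= 0 && numFilters > 0 && outputSize > 0);
    const int outputSizeSquared = outputSize * outputSize;
    const int outputImage2Id = globalId / outputSizeSquared;  // [imageid][filterid]
    const int localId = globalId % outputSizeSquared;         // [row][col]
    return OutputCoord{
        outputImage2Id / numFilters,  // exampleId
        outputImage2Id % numFilters,  // filterId
        localId / outputSize,         // outputRow
        localId % outputSize,         // outputCol
    };
}
```

With the values from this test run (`gInputSize=3`, `gFilterSize=3`, `gNumFilters=2`), `validOutputSize(3, 3)` is 1, matching `gOutputSize=1`, and `decodeGlobalId(3, 2, 1)` yields example 1, filter 1, row 0, col 0.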
forward try kernel 2 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, const int N) { | |
8: int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize; | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * gWorkgroupSize + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [outplane] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [imageid][filterid][row][col] | |
26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB | |
27: // eg 5 * 5 * 32 * 4 = 3.2KB => ok :-) | |
28: // but 28 * 28 * 32 * 4 = 100KB => less good :-P | |
29: void kernel forward_2_by_outplane( | |
30: const int batchSize, | |
31: global const float *images, global const float *filters, | |
32: global float *output, | |
33: local float *_inputPlane, local float *_filterCube) { | |
34: const int globalId = get_global_id(0); | |
35: | |
36: const int workgroupId = get_group_id(0); | |
37: const int workgroupSize = get_local_size(0); | |
38: const int outPlane = workgroupId; | |
39: | |
40: const int localId = get_local_id(0); | |
41: const int outputRow = localId / gOutputSize; | |
42: const int outputCol = localId % gOutputSize; | |
43: | |
44: #if gPadZeros == 1 | |
45: const int minu = max(-gHalfFilterSize, -outputRow); | |
46: const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven; | |
47: const int minv = max(-gHalfFilterSize, -outputCol); | |
48: const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven; | |
49: #else | |
50: const int minu = -gHalfFilterSize; | |
51: const int maxu = gHalfFilterSize - gEven; | |
52: const int minv = -gHalfFilterSize; | |
53: const int maxv = gHalfFilterSize - gEven; | |
54: #endif | |
55: | |
56: { | |
57: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
58: copyLocal(_filterCube, | |
59: filters + outPlane * filterCubeLength, | |
60: filterCubeLength); | |
61: } | |
62: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
63: | |
64: for (int n = 0; n < batchSize; n++) { | |
65: float sum = 0; | |
66: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
67: barrier(CLK_LOCAL_MEM_FENCE); | |
68: copyLocal(_inputPlane, | |
69: images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, | |
70: gInputSizeSquared); | |
71: barrier(CLK_LOCAL_MEM_FENCE); | |
72: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
73: if (localId < gOutputSizeSquared) { | |
74: for (int u = minu; u <= maxu; u++) { | |
75: int inputRow = outputRow + u; | |
76: #if gPadZeros == 0 | |
77: inputRow += gHalfFilterSize; | |
78: #endif | |
79: int inputimagerowoffset = inputRow * gInputSize; | |
80: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
81: for (int v = minv; v <= maxv; v++) { | |
82: int inputCol = outputCol + v; | |
83: #if gPadZeros == 0 | |
84: inputCol += gHalfFilterSize; | |
85: #endif | |
86: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
87: } | |
88: } | |
89: } | |
90: } | |
91: // output are organized like [imageid][filterid][row][col] | |
92: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
93: if (localId < gOutputSizeSquared) { | |
94: output[resultIndex ] = sum; | |
95: } | |
96: } | |
97: } | |
98: #endif | |
99: | |
100: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
ForwardAuto: kernel 2: this instance cant be used: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
... not valid | |
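The `copyLocal` helper and the local-memory estimates in the `forward2.cl` comments can be reproduced on the host. The sketch below simulates the strided copy (each work-item copies elements `localId`, `localId + workgroupSize`, and so on) and computes the bytes one filter cube needs. `simulateCopyLocal` and `filterCubeBytes` are hypothetical names, and the serial loop over `localId` only stands in for work-items that would run in parallel on a device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// An OpenCL float is 32 bits wide.
constexpr std::size_t kBytesPerFloat = 4;

// Local memory needed for one filter cube:
// filterSize * filterSize * inputPlanes * 4 bytes.
inline std::size_t filterCubeBytes(int filterSize, int inputPlanes) {
    assert(filterSize > 0 && inputPlanes > 0);
    return static_cast<std::size_t>(filterSize) * filterSize * inputPlanes * kBytesPerFloat;
}

// Host-side simulation of copyLocal: every "work-item" copies the elements
// at offsets localId, localId + workgroupSize, localId + 2 * workgroupSize, ...
inline std::vector<float> simulateCopyLocal(const std::vector<float>& source,
                                            int workgroupSize) {
    assert(workgroupSize > 0);
    const int n = static_cast<int>(source.size());
    std::vector<float> target(source.size(), 0.0f);
    const int numLoops = (n + workgroupSize - 1) / workgroupSize;  // ceil(n / workgroupSize)
    for (int localId = 0; localId < workgroupSize; ++localId) {
        for (int loop = 0; loop < numLoops; ++loop) {
            const int offset = loop * workgroupSize + localId;
            if (offset < n) {  // guard: the last pass may run past the end
                target[offset] = source[offset];
            }
        }
    }
    return target;
}
```

`filterCubeBytes(5, 32)` gives 3200 bytes, the 3.2KB figure in the comments, and `filterCubeBytes(28, 32)` gives 100352 bytes, the "about 100KB" case the comments flag as too large.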
forward try kernel 3 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: each workgroup handles convolving one input example with one filtercube | |
8: // and writing out one single output plane | |
9: // | |
10: // workgroup id organized like: [imageid][outplane] | |
11: // local id organized like: [outrow][outcol] | |
12: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
13: // number workgroups = 32 | |
14: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
16: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
17: // output are organized like [imageid][filterid][row][col] | |
18: void kernel forward_3_by_n_outplane(const int batchSize, | |
19: global const float *images, global const float *filters, | |
20: global float *output, | |
21: local float *_upstreamImage, local float *_filterCube) { | |
22: const int globalId = get_global_id(0); | |
23: | |
24: const int workgroupId = get_group_id(0); | |
25: const int workgroupSize = get_local_size(0); | |
26: const int n = workgroupId / gNumFilters; | |
27: const int outPlane = workgroupId % gNumFilters; | |
28: | |
29: const int localId = get_local_id(0); | |
30: const int outputRow = localId / gOutputSize; | |
31: const int outputCol = localId % gOutputSize; | |
32: | |
33: const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize; | |
34: const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow - gEven) : gHalfFilterSize - gEven; | |
35: const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize; | |
36: const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven; | |
37: | |
38: const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
39: | |
40: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
41: const int filterCubeGlobalOffset = outPlane * filterCubeLength; | |
42: const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize; | |
43: for (int i = 0; i < numPixelsPerThread; i++) { | |
44: int thisOffset = localId + i * workgroupSize; | |
45: if (thisOffset < filterCubeLength) { | |
46: _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset]; | |
47: } | |
48: } | |
49: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
50: | |
51: float sum = 0; | |
52: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
53: int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: for (int i = 0; i < numUpstreamsPerThread; i++) { | |
56: int thisOffset = workgroupSize * i + localId; | |
57: if (thisOffset < gInputSizeSquared) { | |
58: _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ]; | |
59: } | |
60: } | |
61: barrier(CLK_LOCAL_MEM_FENCE); | |
62: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
63: for (int u = minu; u <= maxu; u++) { | |
64: int inputRow = outputRow + u; | |
65: #if gPadZeros == 0 | |
66: inputRow += gHalfFilterSize; | |
67: #endif | |
68: int inputimagerowoffset = inputRow * gInputSize; | |
69: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
70: for (int v = minv; v <= maxv; v++) { | |
71: int inputCol = outputCol + v; | |
72: #if gPadZeros == 0 | |
73: inputCol += gHalfFilterSize; | |
74: #endif | |
75: if (localId < gOutputSizeSquared) { | |
76: sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
77: } | |
78: } | |
79: } | |
80: } | |
81: | |
82: // output are organized like [imageid][filterid][row][col] | |
83: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
84: if (localId < gOutputSizeSquared) { | |
85: output[resultIndex ] = sum; | |
86: } | |
87: } | |
88: | |
89: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 3: this instance cant be used: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 4 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, int N) { | |
8: int numLoops = (N + get_local_size(0) - 1) / get_local_size(0); | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * get_local_size(0) + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [n][filterid] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [n][filterid][outrow][outcol] | |
26: // the pixels per thread thing... : | |
27: // - we have one thread (~= cuda core) per output value, | |
28: // ie one thread for each combination of [outrow][outcol] | |
29: // - however, the number of threads is typically limited on a gpu, | |
30: // eg to 512 (eg Intel HD), or 1024 (eg nVidia K520) | |
31: // - so what happens if the number of output points is larger than | |
32: // the maximum workgroup size? | |
33: // - then we have several possibilities really: | |
34: // - we can divide the image into blocks, and process each block | |
35: // separately. This is probably a good option, but fair amount of | |
36: // work | |
37: // - we can get each thread to handle more than one output | |
38: // pixel, by looping | |
39: // - we can consider the output image in 1d, by putting the rows | |
40: // one after another, and assign each contiguous workgroup-size | |
41: // block to one workgroup | |
42: // => this is how this kernel works | |
43: // basically, it's a hack, so larger images actually run, without | |
44: // crashing, and we can probably improve it a lot :-) | |
45: // | |
46: // So, when outputSize * outputSize > workgroupSize, then | |
47: // multiple workgroups will be created for each output plane | |
48: // the number of such workgroups is given by: `gPixelsPerThread` | |
49: // the id of our workgroup within such a set of workgroups is calculated | |
50: // as `pixel` | |
51: // effectiveLocalId is our local id if we had one enormous workgroup | |
52: // containing the whole output image plane | |
53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize, | |
54: global const float *images, global const float *filters, | |
55: global float *output, | |
56: local float *_inputPlane, local float *_filterPlane) { | |
57: #define globalId (get_global_id(0)) | |
58: | |
59: #define localId (get_local_id(0)) | |
60: #define workgroupId (get_group_id(0)) | |
61: // const int workgroupSize = get_local_size(0); | |
62: const int effectiveWorkgroupId = workgroupId / gPixelsPerThread; | |
63: const int pixel = workgroupId % gPixelsPerThread; | |
64: const int effectiveLocalId = localId + pixel * gWorkgroupSize; | |
65: const int n = effectiveWorkgroupId / gNumFilters; | |
66: const int outPlane = effectiveWorkgroupId % gNumFilters; | |
67: | |
68: const int outputRow = effectiveLocalId / gOutputSize; | |
69: const int outputCol = effectiveLocalId % gOutputSize; | |
70: | |
71: float sum = 0; | |
72: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
73: barrier(CLK_LOCAL_MEM_FENCE); | |
74: copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared); | |
75: copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared); | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: | |
78: if (effectiveLocalId < gOutputSizeSquared) { | |
79: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
80: // trying to reduce register pressure... | |
81: #if gPadZeros == 1 | |
82: #define inputRow (outputRow + u) | |
83: #else | |
84: #define inputRow (outputRow + u + gHalfFilterSize) | |
85: #endif | |
86: int inputimagerowoffset = inputRow * gInputSize; | |
87: int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
88: bool rowOk = inputRow >= 0 && inputRow < gInputSize; | |
89: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
90: #if gPadZeros == 1 | |
91: #define inputCol (outputCol + v) | |
92: #else | |
93: #define inputCol (outputCol + v + gHalfFilterSize) | |
94: #endif | |
95: bool process = rowOk && inputCol >= 0 && inputCol < gInputSize; | |
96: if (process) { | |
97: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ]; | |
98: } | |
99: } | |
100: } | |
101: } | |
102: } | |
103: // output are organized like [imageid][filterid][row][col] | |
104: #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId) | |
105: if (effectiveLocalId < gOutputSizeSquared) { | |
106: output[resultIndex ] = sum; | |
107: } | |
108: } | |
109: #endif | |
110: | |
111: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 4: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, int N) { | |
8: int numLoops = (N + get_local_size(0) - 1) / get_local_size(0); | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * get_local_size(0) + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [n][filterid] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [n][filterid][outrow][outcol] | |
26: // the pixels per thread thing... : | |
27: // - we have one thread (~= cuda core) per output value, | |
28: // ie one thread for each combination of [outrow][outcol] | |
29: // - however, the number of threads is typically limited on a gpu, | |
30: // eg to 512 (eg Intel HD), or 1024 (eg nVidia K520) | |
31: // - so what happens if the number of output points is larger than | |
32: // the maximum workgroup size? | |
33: // - then we have several possibilities really: | |
34: // - we can divide the image into blocks, and process each block | |
35: // separately. This is probably a good option, but fair amount of | |
36: // work | |
37: // - we can get each thread to handle more than one output | |
38: // pixel, by looping | |
39: // - we can consider the output image in 1d, by putting the rows | |
40: // one after another, and assign each contiguous workgroup-size | |
41: // block to one workgroup | |
42: // => this is how this kernel works | |
43: // basically, it's a hack, so larger images actually run, without | |
44: // crashing, and we can probably improve it a lot :-) | |
45: // | |
46: // So, when outputSize * outputSize > workgroupSize, then | |
47: // multiple workgroups will be created for each output plane | |
48: // the number of such workgroups is given by: `gPixelsPerThread` | |
49: // the id of our workgroup within such a set of workgroups is calculated | |
50: // as `pixel` | |
51: // effectiveLocalId is our local id if we had one enormous workgroup | |
52: // containing the whole output image plane | |
53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize, | |
54: global const float *images, global const float *filters, | |
55: global float *output, | |
56: local float *_inputPlane, local float *_filterPlane) { | |
57: #define globalId (get_global_id(0)) | |
58: | |
59: #define localId (get_local_id(0)) | |
60: #define workgroupId (get_group_id(0)) | |
61: // const int workgroupSize = get_local_size(0); | |
62: const int effectiveWorkgroupId = workgroupId / gPixelsPerThread; | |
63: const int pixel = workgroupId % gPixelsPerThread; | |
64: const int effectiveLocalId = localId + pixel * gWorkgroupSize; | |
65: const int n = effectiveWorkgroupId / gNumFilters; | |
66: const int outPlane = effectiveWorkgroupId % gNumFilters; | |
67: | |
68: const int outputRow = effectiveLocalId / gOutputSize; | |
69: const int outputCol = effectiveLocalId % gOutputSize; | |
70: | |
71: float sum = 0; | |
72: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
73: barrier(CLK_LOCAL_MEM_FENCE); | |
74: copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared); | |
75: copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared); | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: | |
78: if (effectiveLocalId < gOutputSizeSquared) { | |
79: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
80: // trying to reduce register pressure... | |
81: #if gPadZeros == 1 | |
82: #define inputRow (outputRow + u) | |
83: #else | |
84: #define inputRow (outputRow + u + gHalfFilterSize) | |
85: #endif | |
86: int inputimagerowoffset = inputRow * gInputSize; | |
87: int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
88: bool rowOk = inputRow >= 0 && inputRow < gInputSize; | |
89: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
90: #if gPadZeros == 1 | |
91: #define inputCol (outputCol + v) | |
92: #else | |
93: #define inputCol (outputCol + v + gHalfFilterSize) | |
94: #endif | |
95: bool process = rowOk && inputCol >= 0 && inputCol < gInputSize; | |
96: if (process) { | |
97: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ]; | |
98: } | |
99: } | |
100: } | |
101: } | |
102: } | |
103: // output are organized like [imageid][filterid][row][col] | |
104: #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId) | |
105: if (effectiveLocalId < gOutputSizeSquared) { | |
106: output[resultIndex ] = sum; | |
107: } | |
108: } | |
109: #endif | |
110: | |
111: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 5 | |
cl/reduce_segments.cl build log: | |
(8:0) : error : invalid global address space qualifier specified for parameter type | |
(8:0) : error : syntax error at 'const' | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: kernel void reduce_segments(const int numSegments, const int segmentLength, | |
8: global float const *in, global float* out) { | |
9: const int globalId = get_global_id(0); | |
10: const int segmentId = globalId; | |
11: | |
12: if (segmentId >= numSegments) { | |
13: return; | |
14: } | |
15: | |
16: float sum = 0; | |
17: global const float *segment = in + segmentId * segmentLength; | |
18: for (int i = 0; i < segmentLength; i++) { | |
19: sum += segment[i]; | |
20: } | |
21: out[segmentId] = sum; | |
22: } | |
23: | |
24: | |
25: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/reduce_segments.cl build log: | |
(8:0) : error : invalid global address space qualifier specified for parameter type | |
(8:0) : error : syntax error at 'const' | |
ForwardAuto: kernel 5: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: kernel void reduce_segments(const int numSegments, const int segmentLength, | |
8: global float const *in, global float* out) { | |
9: const int globalId = get_global_id(0); | |
10: const int segmentId = globalId; | |
11: | |
12: if (segmentId >= numSegments) { | |
13: return; | |
14: } | |
15: | |
16: float sum = 0; | |
17: global const float *segment = in + segmentId * segmentLength; | |
18: for (int i = 0; i < segmentLength; i++) { | |
19: sum += segment[i]; | |
20: } | |
21: out[segmentId] = sum; | |
22: } | |
23: | |
24: | |
25: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/reduce_segments.cl build log: | |
(8:0) : error : invalid global address space qualifier specified for parameter type | |
(8:0) : error : syntax error at 'const' | |
... not valid | |
forward try kernel 6 | |
cl/forward_byinputplane.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: | |
8: // - load same input plane from each image | |
9: // - hold filter plane for this input plane, for all filters | |
10: // - reduce afterwards | |
11: // local memory for one plane from each filter of 64c7 = 64 * 7 * 7 * 4 = 12.5KB | |
12: // local memory for one single input plane = 19 * 19 * 4 = 1.4KB | |
13: // => seems ok? | |
14: // workgroupid: [inputPlaneId] | |
15: // localid: [filterId][outRow] (if this is more than workgroupsize, we should reuse some threads...) | |
16: // iterate over: [n][outCol] | |
17: // output: [n][filterId][outRow][outCol][inputPlane] | |
18: // need to later reduce output over: [inputPlane] | |
19: void kernel forward_byinputplane(const int batchSize, | |
20: global const float *images, global const float *filters, | |
21: global float *output, | |
22: local float *_inputPlane, local float *_filterPlanes) { | |
23: // const int evenPadding = gFilterSize % 2 == 0 ? 1 : 0; | |
24: | |
25: const int globalId = get_global_id(0); | |
26: const int workgroupId = get_group_id(0); | |
27: const int workgroupSize = get_local_size(0); | |
28: const int localId = get_local_id(0); | |
29: | |
30: const int inputPlaneId = workgroupId; | |
31: const int numLoops = (gNumFilters * gOutputSize + workgroupSize - 1) / workgroupSize; | |
32: const int numFilterCopyLoops = (gFilterSizeSquared + gOutputSize - 1) / gOutputSize; | |
33: const int numImageCopyLoops = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
34: for (int loop = 0; loop < numLoops; loop++) { | |
35: const int loopLocalId = localId + loop * workgroupSize; | |
36: const int filterId = loopLocalId / gOutputSize; | |
37: const int outRow = loopLocalId % gOutputSize; | |
38: | |
39: // copy down our filter, we have gOutputSize threads to do this | |
40: global float const *globalFilterPlane = filters + | |
41: (filterId * gNumInputPlanes + inputPlaneId) * gFilterSizeSquared; | |
42: local float *_localFilterPlane = _filterPlanes + filterId * gFilterSizeSquared; | |
43: barrier(CLK_LOCAL_MEM_FENCE); | |
44: for (int i = 0; i < numFilterCopyLoops; i++) { | |
45: const int offset = i * gOutputSize + outRow; | |
46: bool process = filterId < gNumFilters && offset < gFilterSizeSquared; | |
47: if (process) { | |
48: _localFilterPlane[ offset ] = globalFilterPlane[ offset ]; | |
49: } | |
50: } | |
51: // loop over n ... | |
52: for (int n = 0; n < batchSize; n++) { | |
53: // copy down our imageplane, we have workgroupSize threads to do this | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: global float const *globalImagePlane = images + | |
56: (n * gNumInputPlanes + inputPlaneId) * gInputSizeSquared; | |
57: for (int i = 0; i< numImageCopyLoops; i++) { | |
58: const int offset = i * workgroupSize + localId; | |
59: if (offset < gInputSizeSquared) { | |
60: _inputPlane[ offset ] = globalImagePlane[ offset ]; | |
61: } | |
62: } | |
63: barrier(CLK_LOCAL_MEM_FENCE); | |
64: // calc output for each [outrow][outcol] | |
65: bool filterPlaneOk = filterId < gNumFilters; | |
66: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
67: float sum = 0; | |
68: for (int filterRow = 0; filterRow < gFilterSize; filterRow++) { | |
69: int inRow = outRow + filterRow; | |
70: #if gPadZeros == 1 | |
71: inRow -= gHalfFilterSize; | |
72: #endif | |
73: bool rowOk = filterPlaneOk && inRow >= 0 && inRow < gInputSize; | |
74: for (int filterCol = 0; filterCol < gFilterSize; filterCol++) { | |
75: int inCol = outCol + filterCol; | |
76: #if gPadZeros == 1 | |
77: inCol -= gHalfFilterSize; | |
78: #endif | |
79: bool process = rowOk && inCol >= 0 && inCol < gInputSize; | |
80: if (process) { | |
81: float imageValue = _inputPlane[ inRow * gInputSize + inCol ]; | |
82: float filterValue = _localFilterPlane[ filterRow * gFilterSize + filterCol ]; | |
83: sum += imageValue * filterValue; | |
84: } | |
85: } | |
86: } | |
87: if (filterId < gNumFilters) { | |
88: // [n][filterId][outRow][outCol][inputPlane] | |
89: int resultIndex = (( (n | |
90: * gNumFilters + filterId) | |
91: * gOutputSize + outRow) | |
92: * gOutputSize + outCol) | |
93: * gNumInputPlanes + inputPlaneId; | |
94: output[resultIndex] = sum; | |
95: //if (globalId == 2) output[0] = resultIndex; | |
96: // output[resultIndex] = outRow; | |
97: } | |
98: // output[localId] = _localFilterPlane[localId]; | |
99: } | |
100: } | |
101: } | |
102: } | |
103: | |
104: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward_byinputplane.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 6: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: | |
8: // - load same input plane from each image | |
9: // - hold filter plane for this input plane, for all filters | |
10: // - reduce afterwards | |
11: // local memory for one plane from each filter of 64c7 = 64 * 7 * 7 * 4 = 12.5KB | |
12: // local memory for one single input plane = 19 * 19 * 4 = 1.4KB | |
13: // => seems ok? | |
14: // workgroupid: [inputPlaneId] | |
15: // localid: [filterId][outRow] (if this is more than workgroupsize, we should reuse some threads...) | |
16: // iterate over: [n][outCol] | |
17: // output: [n][filterId][outRow][outCol][inputPlane] | |
18: // need to later reduce output over: [inputPlane] | |
19: void kernel forward_byinputplane(const int batchSize, | |
20: global const float *images, global const float *filters, | |
21: global float *output, | |
22: local float *_inputPlane, local float *_filterPlanes) { | |
23: // const int evenPadding = gFilterSize % 2 == 0 ? 1 : 0; | |
24: | |
25: const int globalId = get_global_id(0); | |
26: const int workgroupId = get_group_id(0); | |
27: const int workgroupSize = get_local_size(0); | |
28: const int localId = get_local_id(0); | |
29: | |
30: const int inputPlaneId = workgroupId; | |
31: const int numLoops = (gNumFilters * gOutputSize + workgroupSize - 1) / workgroupSize; | |
32: const int numFilterCopyLoops = (gFilterSizeSquared + gOutputSize - 1) / gOutputSize; | |
33: const int numImageCopyLoops = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
34: for (int loop = 0; loop < numLoops; loop++) { | |
35: const int loopLocalId = localId + loop * workgroupSize; | |
36: const int filterId = loopLocalId / gOutputSize; | |
37: const int outRow = loopLocalId % gOutputSize; | |
38: | |
39: // copy down our filter, we have gOutputSize threads to do this | |
40: global float const *globalFilterPlane = filters + | |
41: (filterId * gNumInputPlanes + inputPlaneId) * gFilterSizeSquared; | |
42: local float *_localFilterPlane = _filterPlanes + filterId * gFilterSizeSquared; | |
43: barrier(CLK_LOCAL_MEM_FENCE); | |
44: for (int i = 0; i < numFilterCopyLoops; i++) { | |
45: const int offset = i * gOutputSize + outRow; | |
46: bool process = filterId < gNumFilters && offset < gFilterSizeSquared; | |
47: if (process) { | |
48: _localFilterPlane[ offset ] = globalFilterPlane[ offset ]; | |
49: } | |
50: } | |
51: // loop over n ... | |
52: for (int n = 0; n < batchSize; n++) { | |
53: // copy down our imageplane, we have workgroupSize threads to do this | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: global float const *globalImagePlane = images + | |
56: (n * gNumInputPlanes + inputPlaneId) * gInputSizeSquared; | |
57: for (int i = 0; i< numImageCopyLoops; i++) { | |
58: const int offset = i * workgroupSize + localId; | |
59: if (offset < gInputSizeSquared) { | |
60: _inputPlane[ offset ] = globalImagePlane[ offset ]; | |
61: } | |
62: } | |
63: barrier(CLK_LOCAL_MEM_FENCE); | |
64: // calc output for each [outrow][outcol] | |
65: bool filterPlaneOk = filterId < gNumFilters; | |
66: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
67: float sum = 0; | |
68: for (int filterRow = 0; filterRow < gFilterSize; filterRow++) { | |
69: int inRow = outRow + filterRow; | |
70: #if gPadZeros == 1 | |
71: inRow -= gHalfFilterSize; | |
72: #endif | |
73: bool rowOk = filterPlaneOk && inRow >= 0 && inRow < gInputSize; | |
74: for (int filterCol = 0; filterCol < gFilterSize; filterCol++) { | |
75: int inCol = outCol + filterCol; | |
76: #if gPadZeros == 1 | |
77: inCol -= gHalfFilterSize; | |
78: #endif | |
79: bool process = rowOk && inCol >= 0 && inCol < gInputSize; | |
80: if (process) { | |
81: float imageValue = _inputPlane[ inRow * gInputSize + inCol ]; | |
82: float filterValue = _localFilterPlane[ filterRow * gFilterSize + filterCol ]; | |
83: sum += imageValue * filterValue; | |
84: } | |
85: } | |
86: } | |
87: if (filterId < gNumFilters) { | |
88: // [n][filterId][outRow][outCol][inputPlane] | |
89: int resultIndex = (( (n | |
90: * gNumFilters + filterId) | |
91: * gOutputSize + outRow) | |
92: * gOutputSize + outCol) | |
93: * gNumInputPlanes + inputPlaneId; | |
94: output[resultIndex] = sum; | |
95: //if (globalId == 2) output[0] = resultIndex; | |
96: // output[resultIndex] = outRow; | |
97: } | |
98: // output[localId] = _localFilterPlane[localId]; | |
99: } | |
100: } | |
101: } | |
102: } | |
103: | |
104: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward_byinputplane.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=1 -D gInputPlanes=1 -D gInputSize=3 -D gInputSizeSquared=9 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 7 | |
... seems valid | |
ForwardIm2Col.cl build log: | |
(19:0) : error : invalid global address space qualifier specified for parameter type | |
(19:0) : error : syntax error at 'const' | |
kernel build error: | |
kernel source: | |
1: // from SpatialConvolutionMM.cu: | |
2: | |
3: // CL: grid stride looping | |
4: #define CL_KERNEL_LOOP(i, n) \ | |
5: for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \ | |
6: i < (n); \ | |
7: i += get_local_size(0) * get_num_groups(0)) | |
8: | |
9: //#define gPadding 0 | |
10: //#define gStride 1 | |
11: //#define gColSize 1 | |
12: //#define gFilterSize 3 | |
13: //#define gSize 3 | |
14: | |
15: // Kernel for fast unfold+copy | |
16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu) | |
17: kernel void im2col( | |
18: const int n, | |
19: global float const * im_data, int im_offset, | |
20: global float* data_col) { | |
21: global const float *data_im = im_data + im_offset; | |
22: | |
23: CL_KERNEL_LOOP(index, n) { | |
24: int w_out = index % 1; | |
25: index /= 1; | |
26: int h_out = index % 1; | |
27: int channel_in = index / 1; | |
28: int channel_out = channel_in * 3 * 3; | |
29: int h_in = h_out * 1 - 0; | |
30: int w_in = w_out * 1 - 0; | |
31: data_col += (channel_out * 1 + h_out) * 1 + w_out; | |
32: data_im += (channel_in * 3 + h_in) * 3 + w_in; | |
33: for (int i = 0; i < 3; ++i) { | |
34: for (int j = 0; j < 3; ++j) { | |
35: int h = h_in + i; | |
36: int w = w_in + j; | |
37: *data_col = (h >= 0 && w >= 0 && h < 3 && w < 3) ? | |
38: data_im[i * 3 + j] : 0; | |
39: data_col += 1 * 1; | |
40: } | |
41: } | |
42: } | |
43: } | |
44: | |
45: kernel void col2im( | |
46: const int n, | |
47: global float const *data_col, | |
48: global float* im_data, int im_offset) { | |
49: global float *data_im = im_data + im_offset; | |
50: | |
51: for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) { | |
52: float val = 0; | |
53: int w = index % 3 + 0; | |
54: int h = (index / 3) % 3 + 0; | |
55: int c = index / (3 * 3); | |
56: // compute the start and end of the output | |
57: int w_col_start = (w < 3) ? 0 : (w - 3) / 1 + 1; | |
58: int w_col_end = min(w / 1 + 1, 1); | |
59: int h_col_start = (h < 3) ? 0 : (h - 3) / 1 + 1; | |
60: int h_col_end = min(h / 1 + 1, 1); | |
61: | |
62: int offset = (c * 3 * 3 + h * 3 + w) * 1 * 1; | |
63: int coeff_h_col = (1 - 1 * 3 * 1) * 1; | |
64: int coeff_w_col = (1 - 1 * 1 * 1); | |
65: for (int h_col = h_col_start; h_col < h_col_end; ++h_col) { | |
66: for (int w_col = w_col_start; w_col < w_col_end; ++w_col) { | |
67: val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col]; | |
68: } | |
69: } | |
70: data_im[index] = val; | |
71: } | |
72: } | |
73: | |
74: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
ForwardIm2Col.cl build log: | |
(19:0) : error : invalid global address space qualifier specified for parameter type | |
(19:0) : error : syntax error at 'const' | |
ForwardAuto: kernel 7 this instance cant be used: | |
kernel source: | |
1: // from SpatialConvolutionMM.cu: | |
2: | |
3: // CL: grid stride looping | |
4: #define CL_KERNEL_LOOP(i, n) \ | |
5: for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \ | |
6: i < (n); \ | |
7: i += get_local_size(0) * get_num_groups(0)) | |
8: | |
9: //#define gPadding 0 | |
10: //#define gStride 1 | |
11: //#define gColSize 1 | |
12: //#define gFilterSize 3 | |
13: //#define gSize 3 | |
14: | |
15: // Kernel for fast unfold+copy | |
16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu) | |
17: kernel void im2col( | |
18: const int n, | |
19: global float const * im_data, int im_offset, | |
20: global float* data_col) { | |
21: global const float *data_im = im_data + im_offset; | |
22: | |
23: CL_KERNEL_LOOP(index, n) { | |
24: int w_out = index % 1; | |
25: index /= 1; | |
26: int h_out = index % 1; | |
27: int channel_in = index / 1; | |
28: int channel_out = channel_in * 3 * 3; | |
29: int h_in = h_out * 1 - 0; | |
30: int w_in = w_out * 1 - 0; | |
31: data_col += (channel_out * 1 + h_out) * 1 + w_out; | |
32: data_im += (channel_in * 3 + h_in) * 3 + w_in; | |
33: for (int i = 0; i < 3; ++i) { | |
34: for (int j = 0; j < 3; ++j) { | |
35: int h = h_in + i; | |
36: int w = w_in + j; | |
37: *data_col = (h >= 0 && w >= 0 && h < 3 && w < 3) ? | |
38: data_im[i * 3 + j] : 0; | |
39: data_col += 1 * 1; | |
40: } | |
41: } | |
42: } | |
43: } | |
44: | |
45: kernel void col2im( | |
46: const int n, | |
47: global float const *data_col, | |
48: global float* im_data, int im_offset) { | |
49: global float *data_im = im_data + im_offset; | |
50: | |
51: for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) { | |
52: float val = 0; | |
53: int w = index % 3 + 0; | |
54: int h = (index / 3) % 3 + 0; | |
55: int c = index / (3 * 3); | |
56: // compute the start and end of the output | |
57: int w_col_start = (w < 3) ? 0 : (w - 3) / 1 + 1; | |
58: int w_col_end = min(w / 1 + 1, 1); | |
59: int h_col_start = (h < 3) ? 0 : (h - 3) / 1 + 1; | |
60: int h_col_end = min(h / 1 + 1, 1); | |
61: | |
62: int offset = (c * 3 * 3 + h * 3 + w) * 1 * 1; | |
63: int coeff_h_col = (1 - 1 * 3 * 1) * 1; | |
64: int coeff_w_col = (1 - 1 * 1 * 1); | |
65: for (int h_col = h_col_start; h_col < h_col_end; ++h_col) { | |
66: for (int w_col = w_col_start; w_col < w_col_end; ++w_col) { | |
67: val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col]; | |
68: } | |
69: } | |
70: data_im[index] = val; | |
71: } | |
72: } | |
73: | |
74: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
ForwardIm2Col.cl build log: | |
(19:0) : error : invalid global address space qualifier specified for parameter type | |
(19:0) : error : syntax error at 'const' | |
forward kernel 0: cannot be used | |
forward kernel 1: cannot be used | |
forward kernel 2: cannot be used | |
forward kernel 3: cannot be used | |
forward kernel 4: cannot be used | |
forward kernel 5: cannot be used | |
forward kernel 6: cannot be used | |
forward kernel 7: cannot be used | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description "No valid forward implementations found" thrown in the test body. | |
[ FAILED ] testsimpleconvolvenet.imagesize3_n4_filtersize3_linear (190 ms) | |
[ RUN ] testsimpleconvolvenet.imagesize1_n2_2layers_unbiased | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU" | |
" thrown in the test body. | |
[ FAILED ] testsimpleconvolvenet.imagesize1_n2_2layers_unbiased (79 ms) | |
[ RUN ] testsimpleconvolvenet.imagesize1_n2_2layers_biased | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU" | |
" thrown in the test body. | |
[ FAILED ] testsimpleconvolvenet.imagesize1_n2_2layers_biased (83 ms) | |
[ RUN ] testsimpleconvolvenet.imagesize_5_4_2layers_filtersize_2_4_biased_n3 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=4 -DgOutputSizeSquared=16 -DgInputSize=4 -DgInputSizeSquared=16 -DgNumPlanes=3 -D RELU" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=4 -DgOutputSizeSquared=16 -DgInputSize=4 -DgInputSizeSquared=16 -DgNumPlanes=3 -D RELU" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=4 -DgOutputSizeSquared=16 -DgInputSize=4 -DgInputSizeSquared=16 -DgNumPlanes=3 -D RELU" | |
" thrown in the test body. | |
[ FAILED ] testsimpleconvolvenet.imagesize_5_4_2layers_filtersize_2_4_biased_n3 (76 ms) | |
[ RUN ] testsimpleconvolvenet.imagesize_5_4_2layers_filtersize_2_4_biased_n6 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=4 -DgOutputSizeSquared=16 -DgInputSize=4 -DgInputSizeSquared=16 -DgNumPlanes=3 -D RELU" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=4 -DgOutputSizeSquared=16 -DgInputSize=4 -DgInputSizeSquared=16 -DgNumPlanes=3 -D RELU" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=4 -DgOutputSizeSquared=16 -DgInputSize=4 -DgInputSizeSquared=16 -DgNumPlanes=3 -D RELU" | |
" thrown in the test body. | |
[ FAILED ] testsimpleconvolvenet.imagesize_5_4_2layers_filtersize_2_4_biased_n6 (84 ms) | |
[ RUN ] testsimpleconvolvenet.imagesize_5_3_2layers_filtersize_3_3_biased_n6 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=3 -D RELU" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=3 -D RELU" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=3 -D RELU" | |
" thrown in the test body. | |
[ FAILED ] testsimpleconvolvenet.imagesize_5_3_2layers_filtersize_3_3_biased_n6 (75 ms) | |
[ RUN ] testsimpleconvolvenet.imagesize_5_3_2layers_filtersize_3_3_biased_n18 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=3 -D RELU" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=3 -D RELU" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=3 -DgOutputSizeSquared=9 -DgInputSize=3 -DgInputSizeSquared=9 -DgNumPlanes=3 -D RELU" | |
" thrown in the test body. | |
[ FAILED ] testsimpleconvolvenet.imagesize_5_3_2layers_filtersize_3_3_biased_n18 (86 ms) | |
[----------] 12 tests from testsimpleconvolvenet (1163 ms total) | |
[----------] 3 tests from testlogicaloperators | |
[ RUN ] testlogicaloperators.Convolve_1layer_biased_And | |
And | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
forward try kernel 0 | |
... not plausibly optimal, skipping | |
forward try kernel 1 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
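Note on the kernel dumped above: the index arithmetic in `convolve_imagecubes_float2` can be checked on the host. The sketch below, in plain C, mirrors the kernel's decomposition of a flat global id into [imageid][filterid][row][col], plus the unpadded output-size rule from the kernel comments (new imagesize = oldimagesize - filtersize + 1). The names `OutputCoord`, `decompose_global_id` and `unpadded_output_size` are illustrative only and are not part of DeepCL. The padded case is left out because the comments describe it mainly by diagram.

```c
#include <assert.h>

/* Hypothetical helper type: coordinates of one output value. */
typedef struct {
    int example; /* imageid in the kernel comments */
    int filter;  /* filterid, i.e. the output plane */
    int row;
    int col;
} OutputCoord;

/* Mirrors the kernel: output is laid out as [imageid][filterid][row][col],
   and the global id is organized the same way. */
OutputCoord decompose_global_id(int global_id, int num_filters, int output_size) {
    int output_size_squared = output_size * output_size;
    int output_image_id = global_id / output_size_squared;
    int local_id = global_id % output_size_squared; /* intra-image index */
    OutputCoord c;
    c.example = output_image_id / num_filters;
    c.filter = output_image_id % num_filters;
    c.row = local_id / output_size;
    c.col = local_id % output_size;
    return c;
}

/* Unpadded case only (gPadZeros == 0), as stated in the kernel comments. */
int unpadded_output_size(int input_size, int filter_size) {
    return input_size - filter_size + 1;
}
```

For example, with 2 filters and a 3 by 3 output, global id 23 maps to example 1, filter 0, row 1, column 2. With the values in the compiler option string above (gInputSize=1, gFilterSize=1, gPadZeros=0) the unpadded rule gives 1, which agrees with gOutputSize=1.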
ForwardAuto: kernel 1: this instance cant be used: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 2 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, const int N) { | |
8: int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize; | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * gWorkgroupSize + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [outplane] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [imageid][filterid][row][col] | |
26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB | |
27: // eg 5 * 5 * 32 * 4 = 3.2KB => ok :-) | |
28: // but 28 * 28 * 32 * 4 = 100KB => less good :-P | |
29: void kernel forward_2_by_outplane( | |
30: const int batchSize, | |
31: global const float *images, global const float *filters, | |
32: global float *output, | |
33: local float *_inputPlane, local float *_filterCube) { | |
34: const int globalId = get_global_id(0); | |
35: | |
36: const int workgroupId = get_group_id(0); | |
37: const int workgroupSize = get_local_size(0); | |
38: const int outPlane = workgroupId; | |
39: | |
40: const int localId = get_local_id(0); | |
41: const int outputRow = localId / gOutputSize; | |
42: const int outputCol = localId % gOutputSize; | |
43: | |
44: #if gPadZeros == 1 | |
45: const int minu = max(-gHalfFilterSize, -outputRow); | |
46: const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven; | |
47: const int minv = max(-gHalfFilterSize, -outputCol); | |
48: const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven; | |
49: #else | |
50: const int minu = -gHalfFilterSize; | |
51: const int maxu = gHalfFilterSize - gEven; | |
52: const int minv = -gHalfFilterSize; | |
53: const int maxv = gHalfFilterSize - gEven; | |
54: #endif | |
55: | |
56: { | |
57: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
58: copyLocal(_filterCube, | |
59: filters + outPlane * filterCubeLength, | |
60: filterCubeLength); | |
61: } | |
62: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
63: | |
64: for (int n = 0; n < batchSize; n++) { | |
65: float sum = 0; | |
66: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
67: barrier(CLK_LOCAL_MEM_FENCE); | |
68: copyLocal(_inputPlane, | |
69: images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, | |
70: gInputSizeSquared); | |
71: barrier(CLK_LOCAL_MEM_FENCE); | |
72: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
73: if (localId < gOutputSizeSquared) { | |
74: for (int u = minu; u <= maxu; u++) { | |
75: int inputRow = outputRow + u; | |
76: #if gPadZeros == 0 | |
77: inputRow += gHalfFilterSize; | |
78: #endif | |
79: int inputimagerowoffset = inputRow * gInputSize; | |
80: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
81: for (int v = minv; v <= maxv; v++) { | |
82: int inputCol = outputCol + v; | |
83: #if gPadZeros == 0 | |
84: inputCol += gHalfFilterSize; | |
85: #endif | |
86: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
87: } | |
88: } | |
89: } | |
90: } | |
91: // output are organized like [imageid][filterid][row][col] | |
92: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
93: if (localId < gOutputSizeSquared) { | |
94: output[resultIndex ] = sum; | |
95: } | |
96: } | |
97: } | |
98: #endif | |
99: | |
100: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
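Note on the kernel dumped above: the local-memory budget worked out in the `forward_2_by_outplane` comments can be reproduced with a few lines of host-side C. This is only a sketch: `filter_cube_bytes` and `copy_loops` are hypothetical names, and 4 bytes per float is assumed, as in the comments. The second helper mirrors the ceiling division that `copyLocal` uses to decide how many strided passes a workgroup makes over a buffer of N floats.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of local memory needed to cache one filter cube
   (all input planes feeding one output plane), assuming 4-byte floats. */
size_t filter_cube_bytes(int filter_size, int input_planes) {
    return (size_t)filter_size * (size_t)filter_size * (size_t)input_planes * 4u;
}

/* Number of passes copyLocal makes over n floats: ceil(n / workgroup_size). */
int copy_loops(int n, int workgroup_size) {
    return (n + workgroup_size - 1) / workgroup_size;
}
```

A 5 by 5 filter with 32 input planes needs 3200 bytes, matching the 3.2KB figure in the comments, while a 28 by 28 filter with 32 planes needs 100352 bytes, the "about 100KB" case. Copying the 800 floats of the 5 by 5 cube with a workgroup of 32 takes 25 passes.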
ForwardAuto: kernel 2: this instance cant be used: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
... not valid | |
forward try kernel 3 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: each workgroup handles convolving one input example with one filtercube | |
8: // and writing out one single output plane | |
9: // | |
10: // workgroup id organized like: [imageid][outplane] | |
11: // local id organized like: [outrow][outcol] | |
12: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
13: // number workgroups = 32 | |
14: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
16: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
17: // output are organized like [imageid][filterid][row][col] | |
18: void kernel forward_3_by_n_outplane(const int batchSize, | |
19: global const float *images, global const float *filters, | |
20: global float *output, | |
21: local float *_upstreamImage, local float *_filterCube) { | |
22: const int globalId = get_global_id(0); | |
23: | |
24: const int workgroupId = get_group_id(0); | |
25: const int workgroupSize = get_local_size(0); | |
26: const int n = workgroupId / gNumFilters; | |
27: const int outPlane = workgroupId % gNumFilters; | |
28: | |
29: const int localId = get_local_id(0); | |
30: const int outputRow = localId / gOutputSize; | |
31: const int outputCol = localId % gOutputSize; | |
32: | |
33: const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize; | |
34: const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow - gEven) : gHalfFilterSize - gEven; | |
35: const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize; | |
36: const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven; | |
37: | |
38: const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
39: | |
40: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
41: const int filterCubeGlobalOffset = outPlane * filterCubeLength; | |
42: const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize; | |
43: for (int i = 0; i < numPixelsPerThread; i++) { | |
44: int thisOffset = localId + i * workgroupSize; | |
45: if (thisOffset < filterCubeLength) { | |
46: _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset]; | |
47: } | |
48: } | |
49: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
50: | |
51: float sum = 0; | |
52: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
53: int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: for (int i = 0; i < numUpstreamsPerThread; i++) { | |
56: int thisOffset = workgroupSize * i + localId; | |
57: if (thisOffset < gInputSizeSquared) { | |
58: _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ]; | |
59: } | |
60: } | |
61: barrier(CLK_LOCAL_MEM_FENCE); | |
62: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
63: for (int u = minu; u <= maxu; u++) { | |
64: int inputRow = outputRow + u; | |
65: #if gPadZeros == 0 | |
66: inputRow += gHalfFilterSize; | |
67: #endif | |
68: int inputimagerowoffset = inputRow * gInputSize; | |
69: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
70: for (int v = minv; v <= maxv; v++) { | |
71: int inputCol = outputCol + v; | |
72: #if gPadZeros == 0 | |
73: inputCol += gHalfFilterSize; | |
74: #endif | |
75: if (localId < gOutputSizeSquared) { | |
76: sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
77: } | |
78: } | |
79: } | |
80: } | |
81: | |
82: // output are organized like [imageid][filterid][row][col] | |
83: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
84: if (localId < gOutputSizeSquared) { | |
85: output[resultIndex ] = sum; | |
86: } | |
87: } | |
88: | |
89: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 3: this instance cant be used: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 4 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, int N) { | |
8: int numLoops = (N + get_local_size(0) - 1) / get_local_size(0); | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * get_local_size(0) + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [n][filterid] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [n][filterid][outrow][outcol] | |
26: // the pixels per thread thing... : | |
27: // - we have one thread (~= cuda core) per output value, | |
28: // ie one thread for each combination of [outrow][outcol] | |
29: // - however, the number of threads is typically limited on a gpu, | |
30: // eg to 512 (eg Intel HD), or 1024 (eg nVidia K520) | |
31: // - so what happens if the number of output points is larger than | |
32: // the maximum workgroup size? | |
33: // - then we have several possibilities really: | |
34: // - we can divide the image into blocks, and process each block | |
35: // separately. This is probably a good option, but fair amount of | |
36: // work | |
37: // - we can get each thread to handle more than one output | |
38: // pixel, by looping | |
39: // - we can consider the output image in 1d, by putting the rows | |
40: // one after another, and assign each contiguous workgroup-size | |
41: // block to one workgroup | |
42: // => this is how this kernel works | |
43: // basically, it's a hack, so larger images actually run, without | |
44: // crashing, and we can probably improve it a lot :-) | |
45: // | |
46: // So, when outputSize * outputSize > workgroupSize, then | |
47: // multiple workgroups will be created for each output plane | |
48: // the number of such workgroups is given by: `gPixelsPerThread` | |
49: // the id of our workgroup within such a set of workgroups is calculated | |
50: // as `pixel` | |
51: // effectiveLocalId is our local id if we had one enormous workgroup | |
52: // containing the whole output image plane | |
53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize, | |
54: global const float *images, global const float *filters, | |
55: global float *output, | |
56: local float *_inputPlane, local float *_filterPlane) { | |
57: #define globalId (get_global_id(0)) | |
58: | |
59: #define localId (get_local_id(0)) | |
60: #define workgroupId (get_group_id(0)) | |
61: // const int workgroupSize = get_local_size(0); | |
62: const int effectiveWorkgroupId = workgroupId / gPixelsPerThread; | |
63: const int pixel = workgroupId % gPixelsPerThread; | |
64: const int effectiveLocalId = localId + pixel * gWorkgroupSize; | |
65: const int n = effectiveWorkgroupId / gNumFilters; | |
66: const int outPlane = effectiveWorkgroupId % gNumFilters; | |
67: | |
68: const int outputRow = effectiveLocalId / gOutputSize; | |
69: const int outputCol = effectiveLocalId % gOutputSize; | |
70: | |
71: float sum = 0; | |
72: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
73: barrier(CLK_LOCAL_MEM_FENCE); | |
74: copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared); | |
75: copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared); | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: | |
78: if (effectiveLocalId < gOutputSizeSquared) { | |
79: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
80: // trying to reduce register pressure... | |
81: #if gPadZeros == 1 | |
82: #define inputRow (outputRow + u) | |
83: #else | |
84: #define inputRow (outputRow + u + gHalfFilterSize) | |
85: #endif | |
86: int inputimagerowoffset = inputRow * gInputSize; | |
87: int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
88: bool rowOk = inputRow >= 0 && inputRow < gInputSize; | |
89: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
90: #if gPadZeros == 1 | |
91: #define inputCol (outputCol + v) | |
92: #else | |
93: #define inputCol (outputCol + v + gHalfFilterSize) | |
94: #endif | |
95: bool process = rowOk && inputCol >= 0 && inputCol < gInputSize; | |
96: if (process) { | |
97: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ]; | |
98: } | |
99: } | |
100: } | |
101: } | |
102: } | |
103: // output are organized like [imageid][filterid][row][col] | |
104: #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId) | |
105: if (effectiveLocalId < gOutputSizeSquared) { | |
106: output[resultIndex ] = sum; | |
107: } | |
108: } | |
109: #endif | |
110: | |
111: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 4: this instance cant be used: | |
... not valid | |
forward try kernel 5 | |
cl/reduce_segments.cl build log: | |
(8:0) : error : invalid global address space qualifier specified for parameter type | |
(8:0) : error : syntax error at 'const' | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: kernel void reduce_segments(const int numSegments, const int segmentLength, | |
8: global float const *in, global float* out) { | |
9: const int globalId = get_global_id(0); | |
10: const int segmentId = globalId; | |
11: | |
12: if (segmentId >= numSegments) { | |
13: return; | |
14: } | |
15: | |
16: float sum = 0; | |
17: global const float *segment = in + segmentId * segmentLength; | |
18: for (int i = 0; i < segmentLength; i++) { | |
19: sum += segment[i]; | |
20: } | |
21: out[segmentId] = sum; | |
22: } | |
23: | |
24: | |
25: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/reduce_segments.cl build log: | |
(8:0) : error : invalid global address space qualifier specified for parameter type | |
(8:0) : error : syntax error at 'const' | |
ForwardAuto: kernel 5: this instance cant be used: | |
... not valid | |
forward try kernel 6 | |
cl/forward_byinputplane.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: | |
8: // - load same input plane from each image | |
9: // - hold filter plane for this input plane, for all filters | |
10: // - reduce afterwards | |
11: // local memory for one plane from each filter of 64c7 = 64 * 7 * 7 * 4 = 12.5KB | |
12: // local memory for one single input plane = 19 * 19 * 4 = 1.4KB | |
13: // => seems ok? | |
14: // workgroupid: [inputPlaneId] | |
15: // localid: [filterId][outRow] (if this is more than workgroupsize, we should reuse some threads...) | |
16: // iterate over: [n][outCol] | |
17: // output: [n][filterId][outRow][outCol][inputPlane] | |
18: // need to later reduce output over: [inputPlane] | |
19: void kernel forward_byinputplane(const int batchSize, | |
20: global const float *images, global const float *filters, | |
21: global float *output, | |
22: local float *_inputPlane, local float *_filterPlanes) { | |
23: // const int evenPadding = gFilterSize % 2 == 0 ? 1 : 0; | |
24: | |
25: const int globalId = get_global_id(0); | |
26: const int workgroupId = get_group_id(0); | |
27: const int workgroupSize = get_local_size(0); | |
28: const int localId = get_local_id(0); | |
29: | |
30: const int inputPlaneId = workgroupId; | |
31: const int numLoops = (gNumFilters * gOutputSize + workgroupSize - 1) / workgroupSize; | |
32: const int numFilterCopyLoops = (gFilterSizeSquared + gOutputSize - 1) / gOutputSize; | |
33: const int numImageCopyLoops = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
34: for (int loop = 0; loop < numLoops; loop++) { | |
35: const int loopLocalId = localId + loop * workgroupSize; | |
36: const int filterId = loopLocalId / gOutputSize; | |
37: const int outRow = loopLocalId % gOutputSize; | |
38: | |
39: // copy down our filter, we have gOutputSize threads to do this | |
40: global float const *globalFilterPlane = filters + | |
41: (filterId * gNumInputPlanes + inputPlaneId) * gFilterSizeSquared; | |
42: local float *_localFilterPlane = _filterPlanes + filterId * gFilterSizeSquared; | |
43: barrier(CLK_LOCAL_MEM_FENCE); | |
44: for (int i = 0; i < numFilterCopyLoops; i++) { | |
45: const int offset = i * gOutputSize + outRow; | |
46: bool process = filterId < gNumFilters && offset < gFilterSizeSquared; | |
47: if (process) { | |
48: _localFilterPlane[ offset ] = globalFilterPlane[ offset ]; | |
49: } | |
50: } | |
51: // loop over n ... | |
52: for (int n = 0; n < batchSize; n++) { | |
53: // copy down our imageplane, we have workgroupSize threads to do this | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: global float const *globalImagePlane = images + | |
56: (n * gNumInputPlanes + inputPlaneId) * gInputSizeSquared; | |
57: for (int i = 0; i< numImageCopyLoops; i++) { | |
58: const int offset = i * workgroupSize + localId; | |
59: if (offset < gInputSizeSquared) { | |
60: _inputPlane[ offset ] = globalImagePlane[ offset ]; | |
61: } | |
62: } | |
63: barrier(CLK_LOCAL_MEM_FENCE); | |
64: // calc output for each [outrow][outcol] | |
65: bool filterPlaneOk = filterId < gNumFilters; | |
66: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
67: float sum = 0; | |
68: for (int filterRow = 0; filterRow < gFilterSize; filterRow++) { | |
69: int inRow = outRow + filterRow; | |
70: #if gPadZeros == 1 | |
71: inRow -= gHalfFilterSize; | |
72: #endif | |
73: bool rowOk = filterPlaneOk && inRow >= 0 && inRow < gInputSize; | |
74: for (int filterCol = 0; filterCol < gFilterSize; filterCol++) { | |
75: int inCol = outCol + filterCol; | |
76: #if gPadZeros == 1 | |
77: inCol -= gHalfFilterSize; | |
78: #endif | |
79: bool process = rowOk && inCol >= 0 && inCol < gInputSize; | |
80: if (process) { | |
81: float imageValue = _inputPlane[ inRow * gInputSize + inCol ]; | |
82: float filterValue = _localFilterPlane[ filterRow * gFilterSize + filterCol ]; | |
83: sum += imageValue * filterValue; | |
84: } | |
85: } | |
86: } | |
87: if (filterId < gNumFilters) { | |
88: // [n][filterId][outRow][outCol][inputPlane] | |
89: int resultIndex = (( (n | |
90: * gNumFilters + filterId) | |
91: * gOutputSize + outRow) | |
92: * gOutputSize + outCol) | |
93: * gNumInputPlanes + inputPlaneId; | |
94: output[resultIndex] = sum; | |
95: //if (globalId == 2) output[0] = resultIndex; | |
96: // output[resultIndex] = outRow; | |
97: } | |
98: // output[localId] = _localFilterPlane[localId]; | |
99: } | |
100: } | |
101: } | |
102: } | |
103: | |
104: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward_byinputplane.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 6: this instance cant be used: | |
... not valid | |
forward try kernel 7 | |
... seems valid | |
ForwardIm2Col.cl build log: | |
(19:0) : error : invalid global address space qualifier specified for parameter type | |
(19:0) : error : syntax error at 'const' | |
kernel build error: | |
kernel source: | |
1: // from SpatialConvolutionMM.cu: | |
2: | |
3: // CL: grid stride looping | |
4: #define CL_KERNEL_LOOP(i, n) \ | |
5: for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \ | |
6: i < (n); \ | |
7: i += get_local_size(0) * get_num_groups(0)) | |
8: | |
9: //#define gPadding 0 | |
10: //#define gStride 1 | |
11: //#define gColSize 1 | |
12: //#define gFilterSize 1 | |
13: //#define gSize 1 | |
14: | |
15: // Kernel for fast unfold+copy | |
16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu) | |
17: kernel void im2col( | |
18: const int n, | |
19: global float const * im_data, int im_offset, | |
20: global float* data_col) { | |
21: global const float *data_im = im_data + im_offset; | |
22: | |
23: CL_KERNEL_LOOP(index, n) { | |
24: int w_out = index % 1; | |
25: index /= 1; | |
26: int h_out = index % 1; | |
27: int channel_in = index / 1; | |
28: int channel_out = channel_in * 1 * 1; | |
29: int h_in = h_out * 1 - 0; | |
30: int w_in = w_out * 1 - 0; | |
31: data_col += (channel_out * 1 + h_out) * 1 + w_out; | |
32: data_im += (channel_in * 1 + h_in) * 1 + w_in; | |
33: for (int i = 0; i < 1; ++i) { | |
34: for (int j = 0; j < 1; ++j) { | |
35: int h = h_in + i; | |
36: int w = w_in + j; | |
37: *data_col = (h >= 0 && w >= 0 && h < 1 && w < 1) ? | |
38: data_im[i * 1 + j] : 0; | |
39: data_col += 1 * 1; | |
40: } | |
41: } | |
42: } | |
43: } | |
44: | |
45: kernel void col2im( | |
46: const int n, | |
47: global float const *data_col, | |
48: global float* im_data, int im_offset) { | |
49: global float *data_im = im_data + im_offset; | |
50: | |
51: for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) { | |
52: float val = 0; | |
53: int w = index % 1 + 0; | |
54: int h = (index / 1) % 1 + 0; | |
55: int c = index / (1 * 1); | |
56: // compute the start and end of the output | |
57: int w_col_start = (w < 1) ? 0 : (w - 1) / 1 + 1; | |
58: int w_col_end = min(w / 1 + 1, 1); | |
59: int h_col_start = (h < 1) ? 0 : (h - 1) / 1 + 1; | |
60: int h_col_end = min(h / 1 + 1, 1); | |
61: | |
62: int offset = (c * 1 * 1 + h * 1 + w) * 1 * 1; | |
63: int coeff_h_col = (1 - 1 * 1 * 1) * 1; | |
64: int coeff_w_col = (1 - 1 * 1 * 1); | |
65: for (int h_col = h_col_start; h_col < h_col_end; ++h_col) { | |
66: for (int w_col = w_col_start; w_col < w_col_end; ++w_col) { | |
67: val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col]; | |
68: } | |
69: } | |
70: data_im[index] = val; | |
71: } | |
72: } | |
73: | |
74: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
ForwardIm2Col.cl build log: | |
(19:0) : error : invalid global address space qualifier specified for parameter type | |
(19:0) : error : syntax error at 'const' | |
ForwardAuto: kernel 7 this instance cant be used: | |
forward kernel 0: cannot be used | |
forward kernel 1: cannot be used | |
forward kernel 2: cannot be used | |
forward kernel 3: cannot be used | |
forward kernel 4: cannot be used | |
forward kernel 5: cannot be used | |
forward kernel 6: cannot be used | |
forward kernel 7: cannot be used | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description "No valid forward implementations found" thrown in the test body. | |
[ FAILED ] testlogicaloperators.Convolve_1layer_biased_And (182 ms) | |
[ RUN ] testlogicaloperators.Convolve_1layerbiased_Or | |
Or, convolve | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
forward try kernel 0 | |
... not plausibly optimal, skipping | |
forward try kernel 1 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomright | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 1: this instance cant be used: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 2 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, const int N) { | |
8: int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize; | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * gWorkgroupSize + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [outplane] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [imageid][filterid][row][col] | |
26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB | |
27: // eg 5 * 5 * 32 * 4 = 3.2KB => ok :-) | |
28: // but 28 * 28 * 32 * 4 = 100KB => less good :-P | |
29: void kernel forward_2_by_outplane( | |
30: const int batchSize, | |
31: global const float *images, global const float *filters, | |
32: global float *output, | |
33: local float *_inputPlane, local float *_filterCube) { | |
34: const int globalId = get_global_id(0); | |
35: | |
36: const int workgroupId = get_group_id(0); | |
37: const int workgroupSize = get_local_size(0); | |
38: const int outPlane = workgroupId; | |
39: | |
40: const int localId = get_local_id(0); | |
41: const int outputRow = localId / gOutputSize; | |
42: const int outputCol = localId % gOutputSize; | |
43: | |
44: #if gPadZeros == 1 | |
45: const int minu = max(-gHalfFilterSize, -outputRow); | |
46: const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven; | |
47: const int minv = max(-gHalfFilterSize, -outputCol); | |
48: const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven; | |
49: #else | |
50: const int minu = -gHalfFilterSize; | |
51: const int maxu = gHalfFilterSize - gEven; | |
52: const int minv = -gHalfFilterSize; | |
53: const int maxv = gHalfFilterSize - gEven; | |
54: #endif | |
55: | |
56: { | |
57: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
58: copyLocal(_filterCube, | |
59: filters + outPlane * filterCubeLength, | |
60: filterCubeLength); | |
61: } | |
62: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
63: | |
64: for (int n = 0; n < batchSize; n++) { | |
65: float sum = 0; | |
66: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
67: barrier(CLK_LOCAL_MEM_FENCE); | |
68: copyLocal(_inputPlane, | |
69: images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, | |
70: gInputSizeSquared); | |
71: barrier(CLK_LOCAL_MEM_FENCE); | |
72: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
73: if (localId < gOutputSizeSquared) { | |
74: for (int u = minu; u <= maxu; u++) { | |
75: int inputRow = outputRow + u; | |
76: #if gPadZeros == 0 | |
77: inputRow += gHalfFilterSize; | |
78: #endif | |
79: int inputimagerowoffset = inputRow * gInputSize; | |
80: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
81: for (int v = minv; v <= maxv; v++) { | |
82: int inputCol = outputCol + v; | |
83: #if gPadZeros == 0 | |
84: inputCol += gHalfFilterSize; | |
85: #endif | |
86: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
87: } | |
88: } | |
89: } | |
90: } | |
91: // output are organized like [imageid][filterid][row][col] | |
92: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
93: if (localId < gOutputSizeSquared) { | |
94: output[resultIndex ] = sum; | |
95: } | |
96: } | |
97: } | |
98: #endif | |
99: | |
100: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
ForwardAuto: kernel 2: this instance cant be used: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
... not valid | |
forward try kernel 3 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: each workgroup handles convolving one input example with one filtercube | |
8: // and writing out one single output plane | |
9: // | |
10: // workgroup id organized like: [imageid][outplane] | |
11: // local id organized like: [outrow][outcol] | |
12: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
13: // number workgroups = 32 | |
14: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
16: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
17: // output are organized like [imageid][filterid][row][col] | |
18: void kernel forward_3_by_n_outplane(const int batchSize, | |
19: global const float *images, global const float *filters, | |
20: global float *output, | |
21: local float *_upstreamImage, local float *_filterCube) { | |
22: const int globalId = get_global_id(0); | |
23: | |
24: const int workgroupId = get_group_id(0); | |
25: const int workgroupSize = get_local_size(0); | |
26: const int n = workgroupId / gNumFilters; | |
27: const int outPlane = workgroupId % gNumFilters; | |
28: | |
29: const int localId = get_local_id(0); | |
30: const int outputRow = localId / gOutputSize; | |
31: const int outputCol = localId % gOutputSize; | |
32: | |
33: const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize; | |
34: const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow - gEven) : gHalfFilterSize - gEven; | |
35: const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize; | |
36: const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven; | |
37: | |
38: const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
39: | |
40: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
41: const int filterCubeGlobalOffset = outPlane * filterCubeLength; | |
42: const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize; | |
43: for (int i = 0; i < numPixelsPerThread; i++) { | |
44: int thisOffset = localId + i * workgroupSize; | |
45: if (thisOffset < filterCubeLength) { | |
46: _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset]; | |
47: } | |
48: } | |
49: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
50: | |
51: float sum = 0; | |
52: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
53: int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: for (int i = 0; i < numUpstreamsPerThread; i++) { | |
56: int thisOffset = workgroupSize * i + localId; | |
57: if (thisOffset < gInputSizeSquared) { | |
58: _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ]; | |
59: } | |
60: } | |
61: barrier(CLK_LOCAL_MEM_FENCE); | |
62: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
63: for (int u = minu; u <= maxu; u++) { | |
64: int inputRow = outputRow + u; | |
65: #if gPadZeros == 0 | |
66: inputRow += gHalfFilterSize; | |
67: #endif | |
68: int inputimagerowoffset = inputRow * gInputSize; | |
69: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
70: for (int v = minv; v <= maxv; v++) { | |
71: int inputCol = outputCol + v; | |
72: #if gPadZeros == 0 | |
73: inputCol += gHalfFilterSize; | |
74: #endif | |
75: if (localId < gOutputSizeSquared) { | |
76: sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
77: } | |
78: } | |
79: } | |
80: } | |
81: | |
82: // output are organized like [imageid][filterid][row][col] | |
83: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
84: if (localId < gOutputSizeSquared) { | |
85: output[resultIndex ] = sum; | |
86: } | |
87: } | |
88: | |
89: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 3: this instance cant be used: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 4 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, int N) { | |
8: int numLoops = (N + get_local_size(0) - 1) / get_local_size(0); | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * get_local_size(0) + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [n][filterid] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [n][filterid][outrow][outcol] | |
26: // the pixels per thread thing... : | |
27: // - we have one thread (~= cuda core) per output value, | |
28: // ie one thread for each combination of [outrow][outcol] | |
29: // - however, the number of threads is typically limited on a gpu, | |
30: // eg to 512 (eg Intel HD), or 1024 (eg nVidia K520) | |
31: // - so what happens if the number of output points is larger than | |
32: // the maximum workgroup size? | |
33: // - then we have several possibilities really: | |
34: // - we can divide the image into blocks, and process each block | |
35: // separately. This is probably a good option, but fair amount of | |
36: // work | |
37: // - we can get each thread to handle more than one output | |
38: // pixel, by looping | |
39: // - we can consider the output image in 1d, by putting the rows | |
40: // one after another, and assign each contiguous workgroup-size | |
41: // block to one workgroup | |
42: // => this is how this kernel works | |
43: // basically, it's a hack, so larger images actually run, without | |
44: // crashing, and we can probably improve it a lot :-) | |
45: // | |
46: // So, when outputSize * outputSize > workgroupSize, then | |
47: // multiple workgroups will be created for each output plane | |
48: // the number of such workgroups is given by: `gPixelsPerThread` | |
49: // the id of our workgroup within such a set of workgroups is calculated | |
50: // as `pixel` | |
51: // effectiveLocalId is our local id if we had one enormous workgroup | |
52: // containing the whole output image plane | |
53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize, | |
54: global const float *images, global const float *filters, | |
55: global float *output, | |
56: local float *_inputPlane, local float *_filterPlane) { | |
57: #define globalId (get_global_id(0)) | |
58: | |
59: #define localId (get_local_id(0)) | |
60: #define workgroupId (get_group_id(0)) | |
61: // const int workgroupSize = get_local_size(0); | |
62: const int effectiveWorkgroupId = workgroupId / gPixelsPerThread; | |
63: const int pixel = workgroupId % gPixelsPerThread; | |
64: const int effectiveLocalId = localId + pixel * gWorkgroupSize; | |
65: const int n = effectiveWorkgroupId / gNumFilters; | |
66: const int outPlane = effectiveWorkgroupId % gNumFilters; | |
67: | |
68: const int outputRow = effectiveLocalId / gOutputSize; | |
69: const int outputCol = effectiveLocalId % gOutputSize; | |
70: | |
71: float sum = 0; | |
72: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
73: barrier(CLK_LOCAL_MEM_FENCE); | |
74: copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared); | |
75: copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared); | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: | |
78: if (effectiveLocalId < gOutputSizeSquared) { | |
79: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
80: // trying to reduce register pressure... | |
81: #if gPadZeros == 1 | |
82: #define inputRow (outputRow + u) | |
83: #else | |
84: #define inputRow (outputRow + u + gHalfFilterSize) | |
85: #endif | |
86: int inputimagerowoffset = inputRow * gInputSize; | |
87: int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
88: bool rowOk = inputRow >= 0 && inputRow < gInputSize; | |
89: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
90: #if gPadZeros == 1 | |
91: #define inputCol (outputCol + v) | |
92: #else | |
93: #define inputCol (outputCol + v + gHalfFilterSize) | |
94: #endif | |
95: bool process = rowOk && inputCol >= 0 && inputCol < gInputSize; | |
96: if (process) { | |
97: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ]; | |
98: } | |
99: } | |
100: } | |
101: } | |
102: } | |
103: // output are organized like [imageid][filterid][row][col] | |
104: #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId) | |
105: if (effectiveLocalId < gOutputSizeSquared) { | |
106: output[resultIndex ] = sum; | |
107: } | |
108: } | |
109: #endif | |
110: | |
111: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 4: this instance cant be used: | |
... not valid | |
forward try kernel 5 | |
cl/reduce_segments.cl build log: | |
(8:0) : error : invalid global address space qualifier specified for parameter type | |
(8:0) : error : syntax error at 'const' | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: kernel void reduce_segments(const int numSegments, const int segmentLength, | |
8: global float const *in, global float* out) { | |
9: const int globalId = get_global_id(0); | |
10: const int segmentId = globalId; | |
11: | |
12: if (segmentId >= numSegments) { | |
13: return; | |
14: } | |
15: | |
16: float sum = 0; | |
17: global const float *segment = in + segmentId * segmentLength; | |
18: for (int i = 0; i < segmentLength; i++) { | |
19: sum += segment[i]; | |
20: } | |
21: out[segmentId] = sum; | |
22: } | |
23: | |
24: | |
25: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/reduce_segments.cl build log: | |
(8:0) : error : invalid global address space qualifier specified for parameter type | |
(8:0) : error : syntax error at 'const' | |
ForwardAuto: kernel 5: this instance cant be used: | |
... not valid | |
forward try kernel 6 | |
cl/forward_byinputplane.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: | |
8: // - load same input plane from each image | |
9: // - hold filter plane for this input plane, for all filters | |
10: // - reduce afterwards | |
11: // local memory for one plane from each filter of 64c7 = 64 * 7 * 7 * 4 = 12.5KB | |
12: // local memory for one single input plane = 19 * 19 * 4 = 1.4KB | |
13: // => seems ok? | |
14: // workgroupid: [inputPlaneId] | |
15: // localid: [filterId][outRow] (if this is more than workgroupsize, we should reuse some threads...) | |
16: // iterate over: [n][outCol] | |
17: // output: [n][filterId][outRow][outCol][inputPlane] | |
18: // need to later reduce output over: [inputPlane] | |
19: void kernel forward_byinputplane(const int batchSize, | |
20: global const float *images, global const float *filters, | |
21: global float *output, | |
22: local float *_inputPlane, local float *_filterPlanes) { | |
23: // const int evenPadding = gFilterSize % 2 == 0 ? 1 : 0; | |
24: | |
25: const int globalId = get_global_id(0); | |
26: const int workgroupId = get_group_id(0); | |
27: const int workgroupSize = get_local_size(0); | |
28: const int localId = get_local_id(0); | |
29: | |
30: const int inputPlaneId = workgroupId; | |
31: const int numLoops = (gNumFilters * gOutputSize + workgroupSize - 1) / workgroupSize; | |
32: const int numFilterCopyLoops = (gFilterSizeSquared + gOutputSize - 1) / gOutputSize; | |
33: const int numImageCopyLoops = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
34: for (int loop = 0; loop < numLoops; loop++) { | |
35: const int loopLocalId = localId + loop * workgroupSize; | |
36: const int filterId = loopLocalId / gOutputSize; | |
37: const int outRow = loopLocalId % gOutputSize; | |
38: | |
39: // copy down our filter, we have gOutputSize threads to do this | |
40: global float const *globalFilterPlane = filters + | |
41: (filterId * gNumInputPlanes + inputPlaneId) * gFilterSizeSquared; | |
42: local float *_localFilterPlane = _filterPlanes + filterId * gFilterSizeSquared; | |
43: barrier(CLK_LOCAL_MEM_FENCE); | |
44: for (int i = 0; i < numFilterCopyLoops; i++) { | |
45: const int offset = i * gOutputSize + outRow; | |
46: bool process = filterId < gNumFilters && offset < gFilterSizeSquared; | |
47: if (process) { | |
48: _localFilterPlane[ offset ] = globalFilterPlane[ offset ]; | |
49: } | |
50: } | |
51: // loop over n ... | |
52: for (int n = 0; n < batchSize; n++) { | |
53: // copy down our imageplane, we have workgroupSize threads to do this | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: global float const *globalImagePlane = images + | |
56: (n * gNumInputPlanes + inputPlaneId) * gInputSizeSquared; | |
57: for (int i = 0; i< numImageCopyLoops; i++) { | |
58: const int offset = i * workgroupSize + localId; | |
59: if (offset < gInputSizeSquared) { | |
60: _inputPlane[ offset ] = globalImagePlane[ offset ]; | |
61: } | |
62: } | |
63: barrier(CLK_LOCAL_MEM_FENCE); | |
64: // calc output for each [outrow][outcol] | |
65: bool filterPlaneOk = filterId < gNumFilters; | |
66: for (int outCol = 0; outCol < gOutputSize; outCol++) { | |
67: float sum = 0; | |
68: for (int filterRow = 0; filterRow < gFilterSize; filterRow++) { | |
69: int inRow = outRow + filterRow; | |
70: #if gPadZeros == 1 | |
71: inRow -= gHalfFilterSize; | |
72: #endif | |
73: bool rowOk = filterPlaneOk && inRow >= 0 && inRow < gInputSize; | |
74: for (int filterCol = 0; filterCol < gFilterSize; filterCol++) { | |
75: int inCol = outCol + filterCol; | |
76: #if gPadZeros == 1 | |
77: inCol -= gHalfFilterSize; | |
78: #endif | |
79: bool process = rowOk && inCol >= 0 && inCol < gInputSize; | |
80: if (process) { | |
81: float imageValue = _inputPlane[ inRow * gInputSize + inCol ]; | |
82: float filterValue = _localFilterPlane[ filterRow * gFilterSize + filterCol ]; | |
83: sum += imageValue * filterValue; | |
84: } | |
85: } | |
86: } | |
87: if (filterId < gNumFilters) { | |
88: // [n][filterId][outRow][outCol][inputPlane] | |
89: int resultIndex = (( (n | |
90: * gNumFilters + filterId) | |
91: * gOutputSize + outRow) | |
92: * gOutputSize + outCol) | |
93: * gNumInputPlanes + inputPlaneId; | |
94: output[resultIndex] = sum; | |
95: //if (globalId == 2) output[0] = resultIndex; | |
96: // output[resultIndex] = outRow; | |
97: } | |
98: // output[localId] = _localFilterPlane[localId]; | |
99: } | |
100: } | |
101: } | |
102: } | |
103: | |
104: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward_byinputplane.cl build log: | |
error : syntax error in compiler option string " -D BIASED -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=1 -D gInputSizeSquared=1 -D gNumFilters=2 -D gFilterSize=1 -D gHalfFilterSize=0 -D gFilterSizeSquared=1 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=1 -D gOutputSizeSquared=1 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 6: this instance cant be used: | |
... not valid | |
forward try kernel 7 | |
... seems valid | |
ForwardIm2Col.cl build log: | |
(19:0) : error : invalid global address space qualifier specified for parameter type | |
(19:0) : error : syntax error at 'const' | |
kernel build error: | |
kernel source: | |
1: // from SpatialConvolutionMM.cu: | |
2: | |
3: // CL: grid stride looping | |
4: #define CL_KERNEL_LOOP(i, n) \ | |
5: for (int i = get_group_id(0) * get_local_size(0) + get_local_id(0); \ | |
6: i < (n); \ | |
7: i += get_local_size(0) * get_num_groups(0)) | |
8: | |
9: //#define gPadding 0 | |
10: //#define gStride 1 | |
11: //#define gColSize 1 | |
12: //#define gFilterSize 1 | |
13: //#define gSize 1 | |
14: | |
15: // Kernel for fast unfold+copy | |
16: // (adapted from Caffe: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cu) | |
17: kernel void im2col( | |
18: const int n, | |
19: global float const * im_data, int im_offset, | |
20: global float* data_col) { | |
21: global const float *data_im = im_data + im_offset; | |
22: | |
23: CL_KERNEL_LOOP(index, n) { | |
24: int w_out = index % 1; | |
25: index /= 1; | |
26: int h_out = index % 1; | |
27: int channel_in = index / 1; | |
28: int channel_out = channel_in * 1 * 1; | |
29: int h_in = h_out * 1 - 0; | |
30: int w_in = w_out * 1 - 0; | |
31: data_col += (channel_out * 1 + h_out) * 1 + w_out; | |
32: data_im += (channel_in * 1 + h_in) * 1 + w_in; | |
33: for (int i = 0; i < 1; ++i) { | |
34: for (int j = 0; j < 1; ++j) { | |
35: int h = h_in + i; | |
36: int w = w_in + j; | |
37: *data_col = (h >= 0 && w >= 0 && h < 1 && w < 1) ? | |
38: data_im[i * 1 + j] : 0; | |
39: data_col += 1 * 1; | |
40: } | |
41: } | |
42: } | |
43: } | |
44: | |
45: kernel void col2im( | |
46: const int n, | |
47: global float const *data_col, | |
48: global float* im_data, int im_offset) { | |
49: global float *data_im = im_data + im_offset; | |
50: | |
51: for (int index = get_group_id(0) * get_local_size(0) + get_local_id(0); index < (n); index += get_local_size(0) * get_num_groups(0)) { | |
52: float val = 0; | |
53: int w = index % 1 + 0; | |
54: int h = (index / 1) % 1 + 0; | |
55: int c = index / (1 * 1); | |
56: // compute the start and end of the output | |
57: int w_col_start = (w < 1) ? 0 : (w - 1) / 1 + 1; | |
58: int w_col_end = min(w / 1 + 1, 1); | |
59: int h_col_start = (h < 1) ? 0 : (h - 1) / 1 + 1; | |
60: int h_col_end = min(h / 1 + 1, 1); | |
61: | |
62: int offset = (c * 1 * 1 + h * 1 + w) * 1 * 1; | |
63: int coeff_h_col = (1 - 1 * 1 * 1) * 1; | |
64: int coeff_w_col = (1 - 1 * 1 * 1); | |
65: for (int h_col = h_col_start; h_col < h_col_end; ++h_col) { | |
66: for (int w_col = w_col_start; w_col < w_col_end; ++w_col) { | |
67: val += data_col[offset + h_col * coeff_h_col + w_col * coeff_w_col]; | |
68: } | |
69: } | |
70: data_im[index] = val; | |
71: } | |
72: } | |
73: | |
74: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
ForwardIm2Col.cl build log: | |
(19:0) : error : invalid global address space qualifier specified for parameter type | |
(19:0) : error : syntax error at 'const' | |
ForwardAuto: kernel 7 this instance cant be used: | |
forward kernel 0: cannot be used | |
forward kernel 1: cannot be used | |
forward kernel 2: cannot be used | |
forward kernel 3: cannot be used | |
forward kernel 4: cannot be used | |
forward kernel 5: cannot be used | |
forward kernel 6: cannot be used | |
forward kernel 7: cannot be used | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description "No valid forward implementations found" thrown in the test body. | |
[ FAILED ] testlogicaloperators.Convolve_1layerbiased_Or (193 ms) | |
[ RUN ] testlogicaloperators.Convolve_2layers_relu_Xor | |
Xor, convolve | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU" | |
clblas teardown | |
unknown file: Failure | |
C++ exception with description " | |
kernel source: | |
1: // Copyright Hugh Perkins 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // expected defines: | |
8: // one of: [ TANH | RELU | LINEAR | SIGMOID | SCALEDTANH | ELU ] | |
9: | |
10: #ifdef TANH | |
11: #define ACTIVATION_FUNCTION(output) (tanh(output)) | |
12: #elif defined SCALEDTANH | |
13: #define ACTIVATION_FUNCTION(output) (1.7159f * tanh(0.66667f * output)) | |
14: #elif SIGMOID | |
15: #define ACTIVATION_FUNCTION(output) (1.0f / (1 + exp(-output))) | |
16: #elif defined RELU | |
17: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : 0) | |
18: #elif defined ELU | |
19: #define ACTIVATION_FUNCTION(output) (output> 0 ? output : exp(output) - 1) | |
20: #elif defined LINEAR | |
21: #define ACTIVATION_FUNCTION(output) (output) | |
22: #endif | |
23: | |
24: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
25: kernel void activate(const int N, global float *inout) { | |
26: const int globalId = get_global_id(0); | |
27: if (globalId >= N) { | |
28: return; | |
29: } | |
30: inout[globalId] = ACTIVATION_FUNCTION(inout[globalId]); | |
31: } | |
32: #endif | |
33: | |
34: #ifdef ACTIVATION_FUNCTION // protect against not defined | |
35: kernel void forwardNaive(const int N, global float *out, global const float *in) { | |
36: const int globalId = get_global_id(0); | |
37: if (globalId >= N) { | |
38: return; | |
39: } | |
40: out[globalId] = ACTIVATION_FUNCTION(in[globalId]); | |
41: } | |
42: #endif | |
43: | |
44: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/activate.cl build log: | |
error : syntax error in compiler option string " -DgOutputSize=1 -DgOutputSizeSquared=1 -DgInputSize=1 -DgInputSizeSquared=1 -DgNumPlanes=2 -D RELU" | |
" thrown in the test body. | |
[ FAILED ] testlogicaloperators.Convolve_2layers_relu_Xor (85 ms) | |
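Note on the error code above: in the standard OpenCL header `cl.h`, -45 is `CL_INVALID_PROGRAM_EXECUTABLE`. `clCreateKernel` returns it when the program has no successfully built executable, which fits the build log rejecting the compiler option string. The lookup below is a minimal Python sketch for decoding such codes while reading this log. The helper name `describe_cl_error` is hypothetical and is not part of DeepCL or EasyCL; the numeric values are the ones defined in `cl.h`.

```python
# Subset of OpenCL status codes, with the values defined in cl.h.
OPENCL_ERROR_NAMES = {
    0: "CL_SUCCESS",
    -11: "CL_BUILD_PROGRAM_FAILURE",
    -43: "CL_INVALID_BUILD_OPTIONS",
    -44: "CL_INVALID_PROGRAM",
    -45: "CL_INVALID_PROGRAM_EXECUTABLE",
    -46: "CL_INVALID_KERNEL_NAME",
}


def describe_cl_error(code):
    """Return the symbolic name for an OpenCL status code (hypothetical helper)."""
    return OPENCL_ERROR_NAMES.get(code, "unknown OpenCL error %d" % code)


print(describe_cl_error(-45))  # CL_INVALID_PROGRAM_EXECUTABLE
```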
[----------] 3 tests from testlogicaloperators (460 ms total) | |
[----------] 12 tests from testbackward | |
[ RUN ] testbackward.squareloss | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
layer 0:InputLayer{ outputPlanes=3 outputSize=5 } | |
layer 1:ForceBackpropLayer{ outputPlanes=3 outputSize=5 } | |
layer 2:SquareLossLayer{} | |
inputtotalsize=2400 outputTotalSize=2400 | |
layer 0:InputLayer{ outputPlanes=3 outputSize=5 } | |
layer 1:ForceBackpropLayer{ outputPlanes=3 outputSize=5 } | |
layer 2:SquareLossLayer{} | |
Parameters overview: (skipping 3 layers with 0 params) | |
TOTAL : params=0 | |
idx=44 predicted losschange=-0.000912508 actual=-0.000976562 | |
idx=2245 predicted losschange=0.00785823 actual=0.00805664 | |
idx=648 predicted losschange=0.00965759 actual=0.00976562 | |
idx=586 predicted losschange=0.0136895 actual=0.0136719 | |
idx=730 predicted losschange=0.00117897 actual=0.00146484 | |
idx=611 predicted losschange=0.00152302 actual=0.00195312 | |
idx=1130 predicted losschange=0.0159167 actual=0.0161133 | |
idx=15 predicted losschange=0.0434798 actual=0.0439453 | |
idx=1923 predicted losschange=-0.00790002 actual=-0.0078125 | |
idx=670 predicted losschange=0.0335141 actual=0.0336914 | |
[ OK ] testbackward.squareloss (64 ms) | |
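Note on the `predicted losschange` / `actual` pairs above: they look like the output of a numerical gradient check, where the loss change predicted from the analytic gradient is compared with the change measured after perturbing one input. The Python sketch below illustrates that idea only. It assumes a square loss of the form 0.5 * sum((output - target)^2) and a perturbation of 1e-3; neither is confirmed by this log, and every name in it is hypothetical rather than taken from DeepCL.

```python
import random


def square_loss(output, target):
    # Assumed loss form: 0.5 * sum of squared differences.
    return 0.5 * sum((o - t) ** 2 for o, t in zip(output, target))


def square_loss_grad(output, target):
    # Analytic gradient of the assumed loss with respect to each output.
    return [o - t for o, t in zip(output, target)]


def check_gradient(output, target, idx, delta=1e-3):
    """Return (predicted, actual) loss change for a small nudge at index idx."""
    predicted = square_loss_grad(output, target)[idx] * delta
    perturbed = list(output)
    perturbed[idx] += delta
    actual = square_loss(perturbed, target) - square_loss(output, target)
    return predicted, actual


rng = random.Random(0)
output = [rng.uniform(-1.0, 1.0) for _ in range(8)]
target = [rng.uniform(-1.0, 1.0) for _ in range(8)]
for idx in (0, 3, 5):
    predicted, actual = check_gradient(output, target, idx)
    print("idx=%d predicted losschange=%g actual=%g" % (idx, predicted, actual))
```

For this loss the two numbers differ by exactly 0.5 * delta^2 (up to rounding), so close agreement indicates that the analytic gradient is consistent with the loss.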
[ RUN ] testbackward.crossentropyloss | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
layer 0:InputLayer{ outputPlanes=3 outputSize=5 } | |
layer 1:ForceBackpropLayer{ outputPlanes=3 outputSize=5 } | |
layer 2:Layer{} | |
inputtotalsize=300 outputTotalSize=300 | |
layer 0:InputLayer{ outputPlanes=3 outputSize=5 } | |
layer 1:ForceBackpropLayer{ outputPlanes=3 outputSize=5 } | |
layer 2:Layer{} | |
Parameters overview: (skipping 3 layers with 0 params) | |
TOTAL : params=0 | |
idx=44 predicted losschange=0.000274935 actual=0.000274658 | |
idx=145 predicted losschange=-0.000885784 actual=-0.00088501 | |
idx=48 predicted losschange=-0.000859834 actual=-0.000854492 | |
idx=286 predicted losschange=0.00713042 actual=0.00717163 | |
idx=130 predicted losschange=-0.000264829 actual=-0.000244141 | |
idx=11 predicted losschange=-1.98163e-05 actual=0 | |
idx=230 predicted losschange=-0.000594819 actual=-0.000610352 | |
idx=15 predicted losschange=-0.0006499 actual=-0.000640869 | |
idx=123 predicted losschange=-0.000846121 actual=-0.000823975 | |
idx=70 predicted losschange=0.000790196 actual=0.000793457 | |
[ OK ] testbackward.crossentropyloss (53 ms) | |
[ RUN ] testbackward.softmaxloss | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
layer 0:InputLayer{ outputPlanes=5 outputSize=1 } | |
layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 } | |
layer 2:SoftMaxLayer{ perPlane=0 numPlanes=5 imageSize=1 } | |
inputtotalsize=10 outputTotalSize=10 | |
layer 0:InputLayer{ outputPlanes=5 outputSize=1 } | |
layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 } | |
layer 2:SoftMaxLayer{ perPlane=0 numPlanes=5 imageSize=1 } | |
Parameters overview: (skipping 3 layers with 0 params) | |
TOTAL : params=0 | |
idx=4 predicted losschange=0.000113075 actual=0.00011301 | |
idx=5 predicted losschange=0.000145627 actual=0.000145674 | |
idx=8 predicted losschange=3.16699e-05 actual=3.19481e-05 | |
idx=6 predicted losschange=4.89271e-06 actual=5.24521e-06 | |
idx=0 predicted losschange=2.29469e-05 actual=2.28882e-05 | |
idx=1 predicted losschange=-8.26119e-05 actual=-8.27312e-05 | |
idx=0 predicted losschange=2.29469e-05 actual=2.28882e-05 | |
idx=5 predicted losschange=0.000145627 actual=0.000145674 | |
idx=3 predicted losschange=-5.50179e-05 actual=-5.50747e-05 | |
idx=0 predicted losschange=2.29469e-05 actual=2.28882e-05 | |
[ OK ] testbackward.softmaxloss (50 ms) | |
[ RUN ] testbackward.squareloss2 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
layer 0:InputLayer{ outputPlanes=5 outputSize=1 } | |
layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 } | |
layer 2:SquareLossLayer{} | |
layer 0:InputLayer{ outputPlanes=5 outputSize=1 } | |
layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 } | |
layer 2:SquareLossLayer{} | |
batchSize: 32 | |
inputtotalsize=160 outputTotalSize=160 | |
layer SquareLossLayer{} | |
layer 0:InputLayer{ outputPlanes=5 outputSize=1 } | |
layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 } | |
layer 2:SquareLossLayer{} | |
Parameters overview: (skipping 3 layers with 0 params) | |
TOTAL : params=0 | |
idx=44 predicted losschange=0.000126406 actual=0.000125885 | |
idx=5 predicted losschange=0.00461891 actual=0.00464439 | |
idx=8 predicted losschange=0.000356787 actual=0.000356674 | |
idx=106 predicted losschange=0.00716324 actual=0.00719643 | |
idx=90 predicted losschange=0.000474759 actual=0.000480652 | |
idx=131 predicted losschange=0.000979017 actual=0.000984192 | |
idx=10 predicted losschange=0.000660134 actual=0.000663757 | |
idx=15 predicted losschange=0.00961313 actual=0.00965118 | |
idx=3 predicted losschange=0.00264732 actual=0.00267029 | |
idx=30 predicted losschange=0.00865312 actual=0.00868607 | |
[ OK ] testbackward.squareloss2 (60 ms) | |
[ RUN ] testbackward.crossentropy2 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
layer 0:InputLayer{ outputPlanes=5 outputSize=1 } | |
layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 } | |
layer 2:Layer{} | |
layer 0:InputLayer{ outputPlanes=5 outputSize=1 } | |
layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 } | |
layer 2:Layer{} | |
batchSize: 2 | |
inputtotalsize=10 outputTotalSize=10 | |
layer Layer{} | |
layer 0:InputLayer{ outputPlanes=5 outputSize=1 } | |
layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 } | |
layer 2:Layer{} | |
Parameters overview: (skipping 3 layers with 0 params) | |
TOTAL : params=0 | |
idx=4 predicted losschange=0.00258649 actual=nan | |
idx=5 predicted losschange=0.0227095 actual=nan | |
idx=8 predicted losschange=-0.00202714 actual=nan | |
idx=6 predicted losschange=-0.000846508 actual=nan | |
idx=0 predicted losschange=-0.000424821 actual=nan | |
idx=1 predicted losschange=-0.00171216 actual=nan | |
idx=0 predicted losschange=-0.000424821 actual=nan | |
idx=5 predicted losschange=0.0227095 actual=nan | |
idx=3 predicted losschange=0.0123444 actual=nan | |
idx=0 predicted losschange=-0.000424821 actual=nan | |
[ OK ] testbackward.crossentropy2 (21 ms) | |
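Note on the result above: every `actual` value is `nan`, yet the test is reported as `[ OK ]`. One possible explanation, which is an assumption and not something this log or the DeepCL source confirms, is that the pass/fail decision is a tolerance comparison. Under IEEE 754 every ordered comparison involving NaN is false, so a check of the form "difference greater than tolerance" never fires. The Python sketch below shows the effect with hypothetical helper names.

```python
import math


def naive_mismatch(predicted, actual, tol=1e-4):
    # Hypothetical tolerance check: a NaN difference is never "> tol".
    return abs(predicted - actual) > tol


def strict_mismatch(predicted, actual, tol=1e-4):
    # Same check, but NaN on either side is treated as a failure.
    if math.isnan(predicted) or math.isnan(actual):
        return True
    return abs(predicted - actual) > tol


predicted = 0.00258649  # first "predicted losschange" value in the test above
actual = float("nan")
print(naive_mismatch(predicted, actual))   # False: the NaN slips through
print(strict_mismatch(predicted, actual))  # True: the NaN is flagged
```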
[ RUN ] testbackward.softmax2 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
layer 0:InputLayer{ outputPlanes=5 outputSize=1 } | |
layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 } | |
layer 2:SoftMaxLayer{ perPlane=0 numPlanes=5 imageSize=1 } | |
layer 0:InputLayer{ outputPlanes=5 outputSize=1 } | |
layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 } | |
layer 2:SoftMaxLayer{ perPlane=0 numPlanes=5 imageSize=1 } | |
batchSize: 2 | |
inputtotalsize=10 outputTotalSize=10 | |
layer SoftMaxLayer{ perPlane=0 numPlanes=5 imageSize=1 } | |
layer 0:InputLayer{ outputPlanes=5 outputSize=1 } | |
layer 1:ForceBackpropLayer{ outputPlanes=5 outputSize=1 } | |
layer 2:SoftMaxLayer{ perPlane=0 numPlanes=5 imageSize=1 } | |
Parameters overview: (skipping 3 layers with 0 params) | |
TOTAL : params=0 | |
idx=4 predicted losschange=0.00035729 actual=0.000357628 | |
idx=5 predicted losschange=0.0015055 actual=0.00151086 | |
idx=8 predicted losschange=-5.63632e-05 actual=-5.65052e-05 | |
idx=6 predicted losschange=-1.48864e-05 actual=-1.4782e-05 | |
idx=0 predicted losschange=1.96542e-05 actual=1.95503e-05 | |
idx=1 predicted losschange=-0.000287167 actual=-0.000287056 | |
idx=0 predicted losschange=1.96542e-05 actual=1.95503e-05 | |
idx=5 predicted losschange=0.0015055 actual=0.00151086 | |
idx=3 predicted losschange=-0.000152824 actual=-0.00014782 | |
idx=0 predicted losschange=1.96542e-05 actual=1.95503e-05 | |
[ OK ] testbackward.softmax2 (20 ms) | |
[ RUN ] testbackward.conv1 | |
Couldnt find OpenCL-enabled GPU: No OpenCL-enabled GPUs found | |
Trying for OpenCL-enabled CPU | |
Using Vivante Corporation , OpenCL platform: Vivante OpenCL Platform | |
Using OpenCL device: Vivante OpenCL Device | |
initializing clblas | |
layer 0:InputLayer{ outputPlanes=2 outputSize=4 } | |
layer 1:ForceBackpropLayer{ outputPlanes=2 outputSize=4 } | |
layer 2:ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=4 numFilters=2 filterSize=3 outputSize=2 padZeros=0 biased=0 skip=0} } | |
layer 3:SquareLossLayer{} | |
layer 0:InputLayer{ outputPlanes=2 outputSize=4 } | |
layer 1:ForceBackpropLayer{ outputPlanes=2 outputSize=4 } | |
layer 2:ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=4 numFilters=2 filterSize=3 outputSize=2 padZeros=0 biased=0 skip=0} } | |
layer 3:SquareLossLayer{} | |
batchSize: 4 | |
inputtotalsize=128 outputTotalSize=32 | |
layer ConvolutionalLayer{ LayerDimensions{ inputPlanes=2 inputSize=4 numFilters=2 filterSize=3 outputSize=2 padZeros=0 biased=0 skip=0} } | |
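Note on the `LayerDimensions` values above: with `padZeros=0`, the kernel source comments in this log give the output size as `oldimagesize - filtersize + 1`, and the `-D` options passed to the compiler carry a few derived constants. The Python sketch below recomputes those constants for the unpadded case only. The function name and the formulas for `gHalfFilterSize` and `gEven` are assumptions inferred from the values printed in this log, not taken from the DeepCL source.

```python
def derived_defines(input_size, filter_size):
    """Recompute a subset of the -D constants for padZeros=0 (hypothetical helper)."""
    output_size = input_size - filter_size + 1  # formula from the kernel comments
    return {
        "gInputSize": input_size,
        "gInputSizeSquared": input_size * input_size,
        "gFilterSize": filter_size,
        "gHalfFilterSize": filter_size // 2,            # assumed: integer half
        "gFilterSizeSquared": filter_size * filter_size,
        "gOutputSize": output_size,
        "gOutputSizeSquared": output_size * output_size,
        "gEven": 1 if filter_size % 2 == 0 else 0,      # assumed: even-filter flag
    }


# inputSize=4, filterSize=3, as in the ConvolutionalLayer line above.
for name, value in derived_defines(4, 3).items():
    print("-D %s=%d" % (name, value))
```

For `inputSize=4` and `filterSize=3` this reproduces the matching values in the logged option string: `gInputSizeSquared=16`, `gHalfFilterSize=1`, `gFilterSizeSquared=9`, `gOutputSize=2`, `gOutputSizeSquared=4`, `gEven=0`.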
forward try kernel 0 | |
... not plausibly optimal, skipping | |
forward try kernel 1 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 1: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // notes on non-odd filtersizes: | |
8: // for odd, imagesize and filtersize 3, padZeros = 0: | |
9: // output is a single square | |
10: // m and n should vary between -1,0,1 | |
11: // for even, imagesize and filtersize 2, padzeros = 0 | |
12: // output is a single square, which we can position at topleft or bottomrigth | |
13: // lets position it in bottomright | |
14: // then m and n should vary as -1,0 | |
15: // | |
16: // for even, imagesize and filtersize 2, padzeros = 1 | |
17: // output is 2 by 2 | |
18: // well... if it is even: | |
19: // - if we are not padding zeros, then we simply move our filter around the image somehow | |
20: // - if we are padding zeros, then we conceptually pad the bottom and right edge of the image with zeros by 1 | |
21: // filtersize remains the same | |
22: // m will vary as -1,0,1 | |
23: // outputrow is fixed by globalid | |
24: // inputrow should be unchanged... | |
25: // padzeros = 0: | |
26: // x x . . . . | |
27: // x x . . x x | |
28: // . . . . x x | |
29: // when filtersize even: | |
30: // new imagesize = oldimagesize - filtersize + 1 | |
31: // when filtersize odd: | |
32: // x x x . | |
33: // x x x . | |
34: // x x x . | |
35: // . . . . | |
36: // new imagesize = oldimagesize - filtersize + 1 | |
37: // padzeros = 1: | |
38: // x x | |
39: // x x . . x x . . . . . . . | |
40: // . . . x x . . x x . . . | |
41: // . . . . . . . x x . . x x | |
42: // outrow=0 outrow=1 outrow=2 x x | |
43: // outcol=0 outcol=1 outcol=2 outrow=3 | |
44: // outcol=3 | |
45: // when filtersize is even, and padzeros, imagesize grows by 1 each time... | |
46: // imagesize = oldimagesize + 1 | |
47: // when filtersize is odd | |
48: // x x x | |
49: // x x x . x x x . . . | |
50: // x x x . x x x . x x x | |
51: // . . . x x x . x x x | |
52: // x x x | |
53: | |
54: // images are organized like [imageId][plane][row][col] | |
55: // filters are organized like [filterid][inplane][filterrow][filtercol] | |
56: // output are organized like [imageid][filterid][row][col] | |
57: // global id is organized like output, ie: [imageid][outplane][outrow][outcol] | |
58: // - no local memory used currently | |
59: // - each thread: | |
60: // - loads a whole upstream cube | |
61: // - loads a whole filter cube | |
62: // - writes one output... | |
63: void kernel convolve_imagecubes_float2( | |
64: const int numExamples, | |
65: global const float *inputs, global const float *filters, | |
66: global float *output) { | |
67: int globalId = get_global_id(0); | |
68: | |
69: int outputImage2Id = globalId / gOutputSizeSquared; | |
70: int exampleId = outputImage2Id / gNumFilters; | |
71: int filterId = outputImage2Id % gNumFilters; | |
72: | |
73: // intraimage coords | |
74: int localid = globalId % gOutputSizeSquared; | |
75: int outputRow = localid / gOutputSize; | |
76: int outputCol = localid % gOutputSize; | |
77: | |
78: global float const*inputCube = inputs + exampleId * gNumInputPlanes * gInputSizeSquared; | |
79: global float const*filterCube = filters + filterId * gNumInputPlanes * gFilterSizeSquared; | |
80: | |
81: float sum = 0; | |
82: if (exampleId < numExamples) { | |
83: for (int inputPlaneIdx = 0; inputPlaneIdx < gNumInputPlanes; inputPlaneIdx++) { | |
84: global float const*inputPlane = inputCube + inputPlaneIdx * gInputSizeSquared; | |
85: global float const*filterPlane = filterCube + inputPlaneIdx * gFilterSizeSquared; | |
86: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
87: // trying to reduce register pressure... | |
88: #if gPadZeros == 1 | |
89: #define inputRowIdx (outputRow + u) | |
90: #else | |
91: #define inputRowIdx (outputRow + u + gHalfFilterSize) | |
92: #endif | |
93: global float const *inputRow = inputPlane + inputRowIdx * gInputSize; | |
94: global float const *filterRow = filterPlane + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
95: bool rowOk = inputRowIdx >= 0 && inputRowIdx < gInputSize; | |
96: #pragma unroll | |
97: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
98: #if gPadZeros == 1 | |
99: #define inputColIdx (outputCol + v) | |
100: #else | |
101: #define inputColIdx (outputCol + v + gHalfFilterSize) | |
102: #endif | |
103: bool process = rowOk && inputColIdx >= 0 && inputColIdx < gInputSize; | |
104: if (process) { | |
105: sum += inputRow[inputColIdx] * filterRow[v]; | |
106: } | |
107: } | |
108: } | |
109: } | |
110: } | |
111: | |
112: if (exampleId < numExamples) { | |
113: output[globalId] = sum; | |
114: } | |
115: } | |
116: | |
117: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward1.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 2 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, const int N) { | |
8: int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize; | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * gWorkgroupSize + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [outplane] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [imageid][filterid][row][col] | |
26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB | |
27: // eg 5 * 5 * 32 * 4 = 3.2KB => ok :-) | |
28: // but 28 * 28 * 32 * 4 = 100KB => less good :-P | |
29: void kernel forward_2_by_outplane( | |
30: const int batchSize, | |
31: global const float *images, global const float *filters, | |
32: global float *output, | |
33: local float *_inputPlane, local float *_filterCube) { | |
34: const int globalId = get_global_id(0); | |
35: | |
36: const int workgroupId = get_group_id(0); | |
37: const int workgroupSize = get_local_size(0); | |
38: const int outPlane = workgroupId; | |
39: | |
40: const int localId = get_local_id(0); | |
41: const int outputRow = localId / gOutputSize; | |
42: const int outputCol = localId % gOutputSize; | |
43: | |
44: #if gPadZeros == 1 | |
45: const int minu = max(-gHalfFilterSize, -outputRow); | |
46: const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven; | |
47: const int minv = max(-gHalfFilterSize, -outputCol); | |
48: const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven; | |
49: #else | |
50: const int minu = -gHalfFilterSize; | |
51: const int maxu = gHalfFilterSize - gEven; | |
52: const int minv = -gHalfFilterSize; | |
53: const int maxv = gHalfFilterSize - gEven; | |
54: #endif | |
55: | |
56: { | |
57: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
58: copyLocal(_filterCube, | |
59: filters + outPlane * filterCubeLength, | |
60: filterCubeLength); | |
61: } | |
62: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
63: | |
64: for (int n = 0; n < batchSize; n++) { | |
65: float sum = 0; | |
66: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
67: barrier(CLK_LOCAL_MEM_FENCE); | |
68: copyLocal(_inputPlane, | |
69: images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, | |
70: gInputSizeSquared); | |
71: barrier(CLK_LOCAL_MEM_FENCE); | |
72: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
73: if (localId < gOutputSizeSquared) { | |
74: for (int u = minu; u <= maxu; u++) { | |
75: int inputRow = outputRow + u; | |
76: #if gPadZeros == 0 | |
77: inputRow += gHalfFilterSize; | |
78: #endif | |
79: int inputimagerowoffset = inputRow * gInputSize; | |
80: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
81: for (int v = minv; v <= maxv; v++) { | |
82: int inputCol = outputCol + v; | |
83: #if gPadZeros == 0 | |
84: inputCol += gHalfFilterSize; | |
85: #endif | |
86: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
87: } | |
88: } | |
89: } | |
90: } | |
91: // output are organized like [imageid][filterid][row][col] | |
92: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
93: if (localId < gOutputSizeSquared) { | |
94: output[resultIndex ] = sum; | |
95: } | |
96: } | |
97: } | |
98: #endif | |
99: | |
100: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
ForwardAuto: kernel 2: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, const int N) { | |
8: int numLoops = (N + gWorkgroupSize - 1) / gWorkgroupSize; | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * gWorkgroupSize + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [outplane] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [imageid][upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [imageid][filterid][row][col] | |
26: // assumes filter is small, so filtersize * filterSize * inputPlanes * 4 < about 3KB | |
27: // eg 5 * 5 * 32 * 4 = 3.2KB => ok :-) | |
28: // but 28 * 28 * 32 * 4 = 100KB => less good :-P | |
29: void kernel forward_2_by_outplane( | |
30: const int batchSize, | |
31: global const float *images, global const float *filters, | |
32: global float *output, | |
33: local float *_inputPlane, local float *_filterCube) { | |
34: const int globalId = get_global_id(0); | |
35: | |
36: const int workgroupId = get_group_id(0); | |
37: const int workgroupSize = get_local_size(0); | |
38: const int outPlane = workgroupId; | |
39: | |
40: const int localId = get_local_id(0); | |
41: const int outputRow = localId / gOutputSize; | |
42: const int outputCol = localId % gOutputSize; | |
43: | |
44: #if gPadZeros == 1 | |
45: const int minu = max(-gHalfFilterSize, -outputRow); | |
46: const int maxu = min(gHalfFilterSize, gOutputSize - 1 - outputRow) - gEven; | |
47: const int minv = max(-gHalfFilterSize, -outputCol); | |
48: const int maxv = min(gHalfFilterSize, gOutputSize - 1 - outputCol) - gEven; | |
49: #else | |
50: const int minu = -gHalfFilterSize; | |
51: const int maxu = gHalfFilterSize - gEven; | |
52: const int minv = -gHalfFilterSize; | |
53: const int maxv = gHalfFilterSize - gEven; | |
54: #endif | |
55: | |
56: { | |
57: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
58: copyLocal(_filterCube, | |
59: filters + outPlane * filterCubeLength, | |
60: filterCubeLength); | |
61: } | |
62: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
63: | |
64: for (int n = 0; n < batchSize; n++) { | |
65: float sum = 0; | |
66: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
67: barrier(CLK_LOCAL_MEM_FENCE); | |
68: copyLocal(_inputPlane, | |
69: images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, | |
70: gInputSizeSquared); | |
71: barrier(CLK_LOCAL_MEM_FENCE); | |
72: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
73: if (localId < gOutputSizeSquared) { | |
74: for (int u = minu; u <= maxu; u++) { | |
75: int inputRow = outputRow + u; | |
76: #if gPadZeros == 0 | |
77: inputRow += gHalfFilterSize; | |
78: #endif | |
79: int inputimagerowoffset = inputRow * gInputSize; | |
80: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
81: for (int v = minv; v <= maxv; v++) { | |
82: int inputCol = outputCol + v; | |
83: #if gPadZeros == 0 | |
84: inputCol += gHalfFilterSize; | |
85: #endif | |
86: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
87: } | |
88: } | |
89: } | |
90: } | |
91: // output are organized like [imageid][filterid][row][col] | |
92: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
93: if (localId < gOutputSizeSquared) { | |
94: output[resultIndex ] = sum; | |
95: } | |
96: } | |
97: } | |
98: #endif | |
99: | |
100: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward2.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0 -DgWorkgroupSize=32" | |
... not valid | |
forward try kernel 3 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: each workgroup handles convolving one input example with one filtercube | |
8: // and writing out one single output plane | |
9: // | |
10: // workgroup id organized like: [imageid][outplane] | |
11: // local id organized like: [outrow][outcol] | |
12: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
13: // number workgroups = 32 | |
14: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
16: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
17: // output are organized like [imageid][filterid][row][col] | |
18: void kernel forward_3_by_n_outplane(const int batchSize, | |
19: global const float *images, global const float *filters, | |
20: global float *output, | |
21: local float *_upstreamImage, local float *_filterCube) { | |
22: const int globalId = get_global_id(0); | |
23: | |
24: const int workgroupId = get_group_id(0); | |
25: const int workgroupSize = get_local_size(0); | |
26: const int n = workgroupId / gNumFilters; | |
27: const int outPlane = workgroupId % gNumFilters; | |
28: | |
29: const int localId = get_local_id(0); | |
30: const int outputRow = localId / gOutputSize; | |
31: const int outputCol = localId % gOutputSize; | |
32: | |
33: const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize; | |
34: const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow - gEven) : gHalfFilterSize - gEven; | |
35: const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize; | |
36: const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven; | |
37: | |
38: const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
39: | |
40: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
41: const int filterCubeGlobalOffset = outPlane * filterCubeLength; | |
42: const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize; | |
43: for (int i = 0; i < numPixelsPerThread; i++) { | |
44: int thisOffset = localId + i * workgroupSize; | |
45: if (thisOffset < filterCubeLength) { | |
46: _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset]; | |
47: } | |
48: } | |
49: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
50: | |
51: float sum = 0; | |
52: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
53: int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: for (int i = 0; i < numUpstreamsPerThread; i++) { | |
56: int thisOffset = workgroupSize * i + localId; | |
57: if (thisOffset < gInputSizeSquared) { | |
58: _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ]; | |
59: } | |
60: } | |
61: barrier(CLK_LOCAL_MEM_FENCE); | |
62: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
63: for (int u = minu; u <= maxu; u++) { | |
64: int inputRow = outputRow + u; | |
65: #if gPadZeros == 0 | |
66: inputRow += gHalfFilterSize; | |
67: #endif | |
68: int inputimagerowoffset = inputRow * gInputSize; | |
69: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
70: for (int v = minv; v <= maxv; v++) { | |
71: int inputCol = outputCol + v; | |
72: #if gPadZeros == 0 | |
73: inputCol += gHalfFilterSize; | |
74: #endif | |
75: if (localId < gOutputSizeSquared) { | |
76: sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
77: } | |
78: } | |
79: } | |
80: } | |
81: | |
82: // output are organized like [imageid][filterid][row][col] | |
83: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
84: if (localId < gOutputSizeSquared) { | |
85: output[resultIndex ] = sum; | |
86: } | |
87: } | |
88: | |
89: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 3: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: // concept: each workgroup handles convolving one input example with one filtercube | |
8: // and writing out one single output plane | |
9: // | |
10: // workgroup id organized like: [imageid][outplane] | |
11: // local id organized like: [outrow][outcol] | |
12: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
13: // number workgroups = 32 | |
14: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
15: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
16: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
17: // output are organized like [imageid][filterid][row][col] | |
18: void kernel forward_3_by_n_outplane(const int batchSize, | |
19: global const float *images, global const float *filters, | |
20: global float *output, | |
21: local float *_upstreamImage, local float *_filterCube) { | |
22: const int globalId = get_global_id(0); | |
23: | |
24: const int workgroupId = get_group_id(0); | |
25: const int workgroupSize = get_local_size(0); | |
26: const int n = workgroupId / gNumFilters; | |
27: const int outPlane = workgroupId % gNumFilters; | |
28: | |
29: const int localId = get_local_id(0); | |
30: const int outputRow = localId / gOutputSize; | |
31: const int outputCol = localId % gOutputSize; | |
32: | |
33: const int minu = gPadZeros ? max(-gHalfFilterSize, -outputRow) : -gHalfFilterSize; | |
34: const int maxu = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputRow - gEven) : gHalfFilterSize - gEven; | |
35: const int minv = gPadZeros ? max(-gHalfFilterSize, -outputCol) : - gHalfFilterSize; | |
36: const int maxv = gPadZeros ? min(gHalfFilterSize - gEven, gOutputSize - 1 - outputCol - gEven) : gHalfFilterSize - gEven; | |
37: | |
38: const int numUpstreamsPerThread = (gInputSizeSquared + workgroupSize - 1) / workgroupSize; | |
39: | |
40: const int filterCubeLength = gInputPlanes * gFilterSizeSquared; | |
41: const int filterCubeGlobalOffset = outPlane * filterCubeLength; | |
42: const int numPixelsPerThread = (filterCubeLength + workgroupSize - 1) / workgroupSize; | |
43: for (int i = 0; i < numPixelsPerThread; i++) { | |
44: int thisOffset = localId + i * workgroupSize; | |
45: if (thisOffset < filterCubeLength) { | |
46: _filterCube[thisOffset] = filters[filterCubeGlobalOffset + thisOffset]; | |
47: } | |
48: } | |
49: // dont need a barrier, since we'll just run behind the barrier from the upstream image download | |
50: | |
51: float sum = 0; | |
52: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
53: int thisUpstreamImageOffset = (n * gInputPlanes + upstreamPlane) * gInputSizeSquared; | |
54: barrier(CLK_LOCAL_MEM_FENCE); | |
55: for (int i = 0; i < numUpstreamsPerThread; i++) { | |
56: int thisOffset = workgroupSize * i + localId; | |
57: if (thisOffset < gInputSizeSquared) { | |
58: _upstreamImage[ thisOffset ] = images[ thisUpstreamImageOffset + thisOffset ]; | |
59: } | |
60: } | |
61: barrier(CLK_LOCAL_MEM_FENCE); | |
62: int filterImageOffset = upstreamPlane * gFilterSizeSquared; | |
63: for (int u = minu; u <= maxu; u++) { | |
64: int inputRow = outputRow + u; | |
65: #if gPadZeros == 0 | |
66: inputRow += gHalfFilterSize; | |
67: #endif | |
68: int inputimagerowoffset = inputRow * gInputSize; | |
69: int filterrowoffset = filterImageOffset + (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
70: for (int v = minv; v <= maxv; v++) { | |
71: int inputCol = outputCol + v; | |
72: #if gPadZeros == 0 | |
73: inputCol += gHalfFilterSize; | |
74: #endif | |
75: if (localId < gOutputSizeSquared) { | |
76: sum += _upstreamImage[ inputimagerowoffset + inputCol] * _filterCube[ filterrowoffset + v ]; | |
77: } | |
78: } | |
79: } | |
80: } | |
81: | |
82: // output are organized like [imageid][filterid][row][col] | |
83: int resultIndex = (n * gNumFilters + outPlane) * gOutputSizeSquared + localId; | |
84: if (localId < gOutputSizeSquared) { | |
85: output[resultIndex ] = sum; | |
86: } | |
87: } | |
88: | |
89: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward3.cl build log: | |
error : syntax error in compiler option string " -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
... not valid | |
forward try kernel 4 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
kernel build error: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, int N) { | |
8: int numLoops = (N + get_local_size(0) - 1) / get_local_size(0); | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * get_local_size(0) + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [n][filterid] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [n][filterid][outrow][outcol] | |
26: // the pixels per thread thing... : | |
27: // - we have one thread (~= cuda core) per output value, | |
28: // ie one thread for each combination of [outrow][outcol] | |
29: // - however, the number of threads is typically limited on a gpu, | |
30: // eg to 512 (eg Intel HD), or 1024 (eg nVidia K520) | |
31: // - so what happens if the number of output points is larger than | |
32: // the maximum workgroup size? | |
33: // - then we have several possibilities really: | |
34: // - we can divide the image into blocks, and process each block | |
35: // separately. This is probably a good option, but fair amount of | |
36: // work | |
37: // - we can get each thread to handle more than one output | |
38: // pixel, by looping | |
39: // - we can consider the output image in 1d, by putting the rows | |
40: // one after another, and assign each contiguous workgroup-size | |
41: // block to one workgroup | |
42: // => this is how this kernel works | |
43: // basically, it's a hack, so larger images actually run, without | |
44: // crashing, and we can probably improve it a lot :-) | |
45: // | |
46: // So, when outputSize * outputSize > workgroupSize, then | |
47: // multiple workgroups will be created for each output plane | |
48: // the number of such workgroups is given by: `gPixelsPerThread` | |
49: // the id of our workgroup within such a set of workgroups is calculated | |
50: // as `pixel` | |
51: // effectiveLocalId is our local id if we had one enormous workgroup | |
52: // containing the whole output image plane | |
53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize, | |
54: global const float *images, global const float *filters, | |
55: global float *output, | |
56: local float *_inputPlane, local float *_filterPlane) { | |
57: #define globalId (get_global_id(0)) | |
58: | |
59: #define localId (get_local_id(0)) | |
60: #define workgroupId (get_group_id(0)) | |
61: // const int workgroupSize = get_local_size(0); | |
62: const int effectiveWorkgroupId = workgroupId / gPixelsPerThread; | |
63: const int pixel = workgroupId % gPixelsPerThread; | |
64: const int effectiveLocalId = localId + pixel * gWorkgroupSize; | |
65: const int n = effectiveWorkgroupId / gNumFilters; | |
66: const int outPlane = effectiveWorkgroupId % gNumFilters; | |
67: | |
68: const int outputRow = effectiveLocalId / gOutputSize; | |
69: const int outputCol = effectiveLocalId % gOutputSize; | |
70: | |
71: float sum = 0; | |
72: for (int upstreamPlane = 0; upstreamPlane < gInputPlanes; upstreamPlane++) { | |
73: barrier(CLK_LOCAL_MEM_FENCE); | |
74: copyLocal(_inputPlane, images + (n * gInputPlanes + upstreamPlane) * gInputSizeSquared, gInputSizeSquared); | |
75: copyLocal(_filterPlane, filters + (outPlane * gInputPlanes + upstreamPlane) * gFilterSizeSquared, gFilterSizeSquared); | |
76: barrier(CLK_LOCAL_MEM_FENCE); | |
77: | |
78: if (effectiveLocalId < gOutputSizeSquared) { | |
79: for (int u = -gHalfFilterSize; u <= gHalfFilterSize - gEven; u++) { | |
80: // trying to reduce register pressure... | |
81: #if gPadZeros == 1 | |
82: #define inputRow (outputRow + u) | |
83: #else | |
84: #define inputRow (outputRow + u + gHalfFilterSize) | |
85: #endif | |
86: int inputimagerowoffset = inputRow * gInputSize; | |
87: int filterrowoffset = (u+gHalfFilterSize) * gFilterSize + gHalfFilterSize; | |
88: bool rowOk = inputRow >= 0 && inputRow < gInputSize; | |
89: for (int v = -gHalfFilterSize; v <= gHalfFilterSize - gEven; v++) { | |
90: #if gPadZeros == 1 | |
91: #define inputCol (outputCol + v) | |
92: #else | |
93: #define inputCol (outputCol + v + gHalfFilterSize) | |
94: #endif | |
95: bool process = rowOk && inputCol >= 0 && inputCol < gInputSize; | |
96: if (process) { | |
97: sum += _inputPlane[ inputimagerowoffset + inputCol] * _filterPlane[ filterrowoffset + v ]; | |
98: } | |
99: } | |
100: } | |
101: } | |
102: } | |
103: // output are organized like [imageid][filterid][row][col] | |
104: #define resultIndex (( n * gNumFilters + outPlane) * gOutputSizeSquared + effectiveLocalId) | |
105: if (effectiveLocalId < gOutputSizeSquared) { | |
106: output[resultIndex ] = sum; | |
107: } | |
108: } | |
109: #endif | |
110: | |
111: | |
Something went wrong with clCreateKernel, OpenCL erorr code -45 | |
cl/forward4.cl build log: | |
error : syntax error in compiler option string " -D gWorkgroupSize=32 -D gPixelsPerThread=1 -D gNumInputPlanes=2 -D gInputPlanes=2 -D gInputSize=4 -D gInputSizeSquared=16 -D gNumFilters=2 -D gFilterSize=3 -D gHalfFilterSize=1 -D gFilterSizeSquared=9 -D gNumOutputPlanes=2 -D gOutputPlanes=2 -D gOutputSize=2 -D gOutputSizeSquared=4 -D gPadZeros=0 -D gMargin=0 -D gEven=0 -D gSkip=0" | |
ForwardAuto: kernel 4: this instance cant be used: | |
kernel source: | |
1: // Copyright Hugh Perkins 2014, 2015 hughperkins at gmail | |
2: // | |
3: // This Source Code Form is subject to the terms of the Mozilla Public License, | |
4: // v. 2.0. If a copy of the MPL was not distributed with this file, You can | |
5: // obtain one at http://mozilla.org/MPL/2.0/. | |
6: | |
7: void copyLocal(local float *target, global float const *source, int N) { | |
8: int numLoops = (N + get_local_size(0) - 1) / get_local_size(0); | |
9: for (int loop = 0; loop < numLoops; loop++) { | |
10: int offset = loop * get_local_size(0) + get_local_id(0); | |
11: if (offset < N) { | |
12: target[offset] = source[offset]; | |
13: } | |
14: } | |
15: } | |
16: | |
17: #ifdef gOutputSize // for previous tests that dont define it | |
18: // workgroup id organized like: [n][filterid] | |
19: // local id organized like: [outrow][outcol] | |
20: // each thread iterates over: [upstreamplane][filterrow][filtercol] | |
21: // number workgroups = 32 | |
22: // one filter plane takes up 5 * 5 * 4 = 100 bytes | |
23: // one filter cube (corresponding to one outplane) = 5*5 * 32 * 4 = 3.2KB (ok) | |
24: // all filter cubes = 3.2KB * 32 = 102KB (too big) | |
25: // output are organized like [n][filterid][outrow][outcol] | |
26: // the pixels per thread thing... : | |
27: // - we have one thread (~= cuda core) per output value, | |
28: // ie one thread for each combination of [outrow][outcol] | |
29: // - however, the number of threads is typically limited on a gpu, | |
30: // eg to 512 (eg Intel HD), or 1024 (eg nVidia K520) | |
31: // - so what happens if the number of output points is larger than | |
32: // the maximum workgroup size? | |
33: // - then we have several possibilities really: | |
34: // - we can divide the image into blocks, and process each block | |
35: // separately. This is probably a good option, but fair amount of | |
36: // work | |
37: // - we can get each thread to handle more than one output | |
38: // pixel, by looping | |
39: // - we can consider the output image in 1d, by putting the rows | |
40: // one after another, and assign each contiguous workgroup-size | |
41: // block to one workgroup | |
42: // => this is how this kernel works | |
43: // basically, it's a hack, so larger images actually run, without | |
44: // crashing, and we can probably improve it a lot :-) | |
45: // | |
46: // So, when outputSize * outputSize > workgroupSize, then | |
47: // multiple workgroups will be created for each output plane | |
48: // the number of such workgroups is given by: `gPixelsPerThread` | |
49: // the id of our workgroup within such a set of workgroups is calculated | |
50: // as `pixel` | |
51: // effectiveLocalId is our local id if we had one enormous workgroup | |
52: // containing the whole output image plane | |
53: void kernel forward_4_by_n_outplane_smallercache(const int batchSize, | |
54: global const float *images, global const float *filters, | |
55: global float *output, | |
56: local float *_inputPlane, local float *_filterPlane) { | |
57: #define globalId (get_global_id(0)) | |
58: | |
59: #define localId (get_local_id(0)) | |
60: #define workgroupId (get_group_id(0)) | |
61: // const int workgroupSize = get_local_size(0); | |
62: const int effectiveWorkgroupId = workgroupId / gPixelsPerThread; | |
63: const int pixel = workgroupId % gPixelsPerThread; | |
64: const int effectiveLocalId = localId + pixel * gWorkgroupSize; | |
65: const int n = effectiveWorkgroupId / gNumFilters; | |
66: const int outPlane = effectiveWorkgroupId % gNumFilters; | |
67: | |
68: const int outputRow = effectiveLocalId / gOutputSize; | |
69: const int outputCol = effectiveLocalId % gOutputSize; | |
70: | |
71: float sum = 0; | |
72: for (int upstreamPlane = 0; upstreamP |