caffe make runtest -j4
iCar-G3-446:/dev/disk/mounted/sda1/opencl/caffe # make runtest -j4
CXX/LD -o .build_release/test/test_all.testbin src/caffe/test/test_caffe_main.cpp
In file included from ../caffe_requirements/ViennaCL-1.7.1/viennacl/scheduler/preset.hpp:21:0,
                 from ../caffe_requirements/ViennaCL-1.7.1/viennacl/linalg/opencl/kernels/vector.hpp:25,
                 from ../caffe_requirements/ViennaCL-1.7.1/viennacl/linalg/opencl/vector_operations.hpp:35,
                 from ../caffe_requirements/ViennaCL-1.7.1/viennacl/linalg/vector_operations.hpp:39,
                 from ../caffe_requirements/ViennaCL-1.7.1/viennacl/vector.hpp:33,
                 from ./include/caffe/greentea/greentea.hpp:40,
                 from ./include/caffe/common.hpp:25,
                 from ./include/caffe/blob.hpp:8,
                 from ./include/caffe/caffe.hpp:7,
                 from src/caffe/test/test_caffe_main.cpp:3:
../caffe_requirements/ViennaCL-1.7.1/viennacl/device_specific/forwards.h:129:20: warning: ‘std::string viennacl::device_specific::generate_value_kernel_argument(const string&, const string&)’ defined but not used [-Wunused-function]
 static std::string generate_value_kernel_argument(std::string const & scalartype, std::string const & name)
                    ^
../caffe_requirements/ViennaCL-1.7.1/viennacl/device_specific/forwards.h:135:20: warning: ‘std::string viennacl::device_specific::generate_pointer_kernel_argument(const string&, const string&, const string&)’ defined but not used [-Wunused-function]
 static std::string generate_pointer_kernel_argument(std::string const & address_space, std::string const & scalartype, std::string const & name)
                    ^
.build_release/tools/caffe
caffe: command line brew
usage: caffe <command> <args>

commands:
  train           train or finetune a model
  test            score a model
  device_query    show GPU diagnostic information
  time            benchmark model execution time
  autotune        autotune a model

Flags from tools/caffe.cpp:
  -gpu (Optional; run in GPU mode on given device IDs separated by ','. Use
    '-gpu all' to run on all available GPUs. The effective training batch
    size is multiplied by the number of devices.) type: string default: ""
  -iterations (The number of iterations to run.) type: int32 default: 50
  -level (Optional; network level.) type: int32 default: 0
  -model (The model definition protocol buffer text file.) type: string
    default: ""
  -phase (Optional; network phase (TRAIN or TEST). Only used for 'time'.)
    type: string default: ""
  -sighup_effect (Optional; action to take when a SIGHUP signal is received:
    snapshot, stop or none.) type: string default: "snapshot"
  -sigint_effect (Optional; action to take when a SIGINT signal is received:
    snapshot, stop or none.) type: string default: "stop"
  -snapshot (Optional; the snapshot solver state to resume training.)
    type: string default: ""
  -solver (The solver definition protocol buffer text file.) type: string
    default: ""
  -stage (Optional; network stages (not to be confused with phase), separated
    by ','.) type: string default: ""
  -weights (Optional; the pretrained weights to initialize finetuning,
    separated by ','. Cannot be set simultaneously with snapshot.)
    type: string default: ""
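
(For reference, a typical finetuning invocation assembled from the flags above might look like the following; the paths are illustrative, not taken from this run:

caffe train -solver models/solver.prototxt -weights pretrained.caffemodel -gpu 0
)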
.build_release/test/test_all.testbin 0 --gtest_shuffle
Setting to use device 0
Note: Randomizing tests' orders with a seed of 49319 .
[==========] Running 2046 tests from 274 test cases.
[----------] Global test environment set-up.
[----------] 5 tests from BenchmarkTest/1, where TypeParam = caffe::CPUDevice<double>
[ RUN      ] BenchmarkTest/1.TestTimerSeconds
[       OK ] BenchmarkTest/1.TestTimerSeconds (318 ms)
[ RUN      ] BenchmarkTest/1.TestTimerConstructor
[       OK ] BenchmarkTest/1.TestTimerConstructor (0 ms)
[ RUN      ] BenchmarkTest/1.TestTimerMilliSeconds
[       OK ] BenchmarkTest/1.TestTimerMilliSeconds (314 ms)
[ RUN      ] BenchmarkTest/1.TestTimerStop
[       OK ] BenchmarkTest/1.TestTimerStop (0 ms)
[ RUN      ] BenchmarkTest/1.TestTimerStart
[       OK ] BenchmarkTest/1.TestTimerStart (1 ms)
[----------] 5 tests from BenchmarkTest/1 (634 ms total)
[----------] 9 tests from InnerProductLayerTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN      ] InnerProductLayerTest/0.TestSetUpTranposeTrue
[       OK ] InnerProductLayerTest/0.TestSetUpTranposeTrue (9 ms)
[ RUN      ] InnerProductLayerTest/0.TestForward
[       OK ] InnerProductLayerTest/0.TestForward (16 ms)
[ RUN      ] InnerProductLayerTest/0.TestSetUp
[       OK ] InnerProductLayerTest/0.TestSetUp (1 ms)
[ RUN      ] InnerProductLayerTest/0.TestGradient
[       OK ] InnerProductLayerTest/0.TestGradient (1950 ms)
[ RUN      ] InnerProductLayerTest/0.TestGradientTranspose
[       OK ] InnerProductLayerTest/0.TestGradientTranspose (2308 ms)
[ RUN      ] InnerProductLayerTest/0.TestBackwardTranspose
[       OK ] InnerProductLayerTest/0.TestBackwardTranspose (2 ms)
[ RUN      ] InnerProductLayerTest/0.TestForwardTranspose
[       OK ] InnerProductLayerTest/0.TestForwardTranspose (1 ms)
[ RUN      ] InnerProductLayerTest/0.TestForwardNoBatch
[       OK ] InnerProductLayerTest/0.TestForwardNoBatch (0 ms)
[ RUN      ] InnerProductLayerTest/0.TestSetUpTranposeFalse
[       OK ] InnerProductLayerTest/0.TestSetUpTranposeFalse (1 ms)
[----------] 9 tests from InnerProductLayerTest/0 (4291 ms total)
[----------] 5 tests from ImageDataLayerTest/1, where TypeParam = caffe::CPUDevice<double>
[ RUN      ] ImageDataLayerTest/1.TestResize
[       OK ] ImageDataLayerTest/1.TestResize (764 ms)
[ RUN      ] ImageDataLayerTest/1.TestSpace
[       OK ] ImageDataLayerTest/1.TestSpace (247 ms)
[ RUN      ] ImageDataLayerTest/1.TestShuffle
[       OK ] ImageDataLayerTest/1.TestShuffle (1038 ms)
[ RUN      ] ImageDataLayerTest/1.TestReshape
[       OK ] ImageDataLayerTest/1.TestReshape (242 ms)
[ RUN      ] ImageDataLayerTest/1.TestRead
[       OK ] ImageDataLayerTest/1.TestRead (1012 ms)
[----------] 5 tests from ImageDataLayerTest/1 (3304 ms total)
[----------] 2 tests from SigmoidCrossEntropyLossLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN      ] SigmoidCrossEntropyLossLayerTest/3.TestGradient
[       OK ] SigmoidCrossEntropyLossLayerTest/3.TestGradient (1 ms)
[ RUN      ] SigmoidCrossEntropyLossLayerTest/3.TestSigmoidCrossEntropyLoss
[       OK ] SigmoidCrossEntropyLossLayerTest/3.TestSigmoidCrossEntropyLoss (0 ms)
[----------] 2 tests from SigmoidCrossEntropyLossLayerTest/3 (2 ms total)
[----------] 2 tests from HingeLossLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN      ] HingeLossLayerTest/3.TestGradientL2
[       OK ] HingeLossLayerTest/3.TestGradientL2 (0 ms)
[ RUN      ] HingeLossLayerTest/3.TestGradientL1
[       OK ] HingeLossLayerTest/3.TestGradientL1 (0 ms)
[----------] 2 tests from HingeLossLayerTest/3 (1 ms total)
[----------] 8 tests from RMSPropSolverTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN      ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithWeightDecay
[       OK ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithWeightDecay (303 ms)
[ RUN      ] RMSPropSolverTest/0.TestLeastSquaresUpdateWithEverythingAccumShare
[       OK ] RMSPropSolverTest/0.TestLeastSquaresUpdateWithEverythingAccumShare (44 ms)
[ RUN      ] RMSPropSolverTest/0.TestLeastSquaresUpdateWithEverythingAccum
[       OK ] RMSPropSolverTest/0.TestLeastSquaresUpdateWithEverythingAccum (31 ms)
[ RUN      ] RMSPropSolverTest/0.TestSnapshotShare
[       OK ] RMSPropSolverTest/0.TestSnapshotShare (270 ms)
[ RUN      ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithEverything
[       OK ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithEverything (721 ms)
[ RUN      ] RMSPropSolverTest/0.TestSnapshot
[       OK ] RMSPropSolverTest/0.TestSnapshot (192 ms)
[ RUN      ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithEverythingShare
[       OK ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithEverythingShare (1171 ms)
[ RUN      ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithRmsDecay
[       OK ] RMSPropSolverTest/0.TestRMSPropLeastSquaresUpdateWithRmsDecay (721 ms)
[----------] 8 tests from RMSPropSolverTest/0 (3455 ms total)
[----------] 1 test from SolverTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN      ] SolverTest/0.TestInitTrainTestNets
[       OK ] SolverTest/0.TestInitTrainTestNets (34 ms)
[----------] 1 test from SolverTest/0 (34 ms total)
[----------] 3 tests from BlobMathTest/2, where TypeParam = caffe::GPUDevice<float>
[ RUN      ] BlobMathTest/2.TestSumOfSquares
[       OK ] BlobMathTest/2.TestSumOfSquares (74 ms)
[ RUN      ] BlobMathTest/2.TestScaleData
[       OK ] BlobMathTest/2.TestScaleData (86 ms)
[ RUN      ] BlobMathTest/2.TestAsum
[       OK ] BlobMathTest/2.TestAsum (38 ms)
[----------] 3 tests from BlobMathTest/2 (199 ms total)
[----------] 12 tests from SGDSolverTest/2, where TypeParam = caffe::GPUDevice<float>
[ RUN      ] SGDSolverTest/2.TestLeastSquaresUpdate
OpenCL compiler error/warning:
(1491:0) : error : function: 'InitAccRegisters' hasn't the corresponding declaration
(1536:0) : error : function: 'MultiplyAccumulate' hasn't the corresponding declaration
(1664:0) : error : function: 'XgemmBody' hasn't the corresponding declaration
(1668:0) : error : function: 'StoreResults' hasn't the corresponding declaration
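
(These errors suggest that this device's OpenCL compiler requires forward declarations for helper functions that are called before they are defined inside the generated GEMM program. A minimal sketch of the kind of declarations that would satisfy it, with the signature of InitAccRegisters taken from its definition later in this dump; the other three definitions fall outside the truncated listing, so they are not reproduced here:

void InitAccRegisters(realM cpm[NWI][MWI/VWM]);
// MultiplyAccumulate, XgemmBody and StoreResults would need matching forward
// declarations as well; their definitions are beyond the truncated dump.
)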
OpenCL code program was:
#define KWG 16
#define KWI 2
#define MDIMA 8
#define MDIMC 8
#define STRM 0
#define MWG 32
#define NDIMC 8
#define SB 0
#define STRN 0
#define VWM 1
#define PAD_WPTX 1
#define VWN 1
#define NWG 64
#define PAD_WPTY 1
#define COPY_VW 1
#define PAD_DIMX 8
#define COPY_DIMY 8
#define NDIMB 8
#define COPY_DIMX 8
#define SA 0
#define COPY_WPT 1
#define TRA_SHUFFLE 0
#define PAD_DIMY 8
#define TRA_PAD 0
#define TRA_DIM 4
#define PADTRA_WPT 1
#define PADTRA_PAD 0
#define TRA_WPT 1
#define PADTRA_TILE 8
#define PRECISION 32
#define ROUTINE_GEMM
// =================================================================================================
// Parameters set by the tuner or by the database. Here they are given a basic default value in case
// this file is used outside of the CLBlast library.
#ifndef PRECISION
  #define PRECISION 32      // Data-types: half, single or double precision, complex or regular
#endif
// =================================================================================================
// Enable support for half-precision
#if PRECISION == 16
  #pragma OPENCL EXTENSION cl_khr_fp16: enable
#endif
// Enable support for double-precision
#if PRECISION == 64 || PRECISION == 6464
  #if __OPENCL_VERSION__ <= CL_VERSION_1_1
    #pragma OPENCL EXTENSION cl_khr_fp64: enable
  #endif
#endif
// Half-precision
#if PRECISION == 16
  typedef half real;
  typedef half2 real2;
  typedef half4 real4;
  typedef half8 real8;
  typedef half16 real16;
  #define ZERO 0
  #define ONE 1
  #define SMALLEST -1.0e14
// Single-precision
#elif PRECISION == 32
  typedef float real;
  typedef float2 real2;
  typedef float4 real4;
  typedef float8 real8;
  typedef float16 real16;
  #define ZERO 0.0f
  #define ONE 1.0f
  #define SMALLEST -1.0e37f
// Double-precision
#elif PRECISION == 64
  typedef double real;
  typedef double2 real2;
  typedef double4 real4;
  typedef double8 real8;
  typedef double16 real16;
  #define ZERO 0.0
  #define ONE 1.0
  #define SMALLEST -1.0e37
// Complex single-precision
#elif PRECISION == 3232
  typedef struct cfloat {float x; float y;} real;
  typedef struct cfloat2 {real x; real y;} real2;
  typedef struct cfloat4 {real x; real y; real z; real w;} real4;
  typedef struct cfloat8 {real s0; real s1; real s2; real s3;
                          real s4; real s5; real s6; real s7;} real8;
  typedef struct cfloat16 {real s0; real s1; real s2; real s3;
                           real s4; real s5; real s6; real s7;
                           real s8; real s9; real sA; real sB;
                           real sC; real sD; real sE; real sF;} real16;
  #define ZERO 0.0f
  #define ONE 1.0f
  #define SMALLEST -1.0e37f
// Complex double-precision
#elif PRECISION == 6464
  typedef struct cdouble {double x; double y;} real;
  typedef struct cdouble2 {real x; real y;} real2;
  typedef struct cdouble4 {real x; real y; real z; real w;} real4;
  typedef struct cdouble8 {real s0; real s1; real s2; real s3;
                           real s4; real s5; real s6; real s7;} real8;
  typedef struct cdouble16 {real s0; real s1; real s2; real s3;
                            real s4; real s5; real s6; real s7;
                            real s8; real s9; real sA; real sB;
                            real sC; real sD; real sE; real sF;} real16;
  #define ZERO 0.0
  #define ONE 1.0
  #define SMALLEST -1.0e37
#endif
// Single-element version of a complex number
#if PRECISION == 3232
  typedef float singlereal;
#elif PRECISION == 6464
  typedef double singlereal;
#else
  typedef real singlereal;
#endif
// =================================================================================================
// Don't use the non-IEEE754 compliant OpenCL built-in mad() instruction by default. For specific
// devices, this is enabled (see src/routine.cc).
#ifndef USE_CL_MAD
  #define USE_CL_MAD 0
#endif
// Sets a variable to zero
#if PRECISION == 3232 || PRECISION == 6464
  #define SetToZero(a) a.x = ZERO; a.y = ZERO
#else
  #define SetToZero(a) a = ZERO
#endif
// Sets a variable to zero (only the imaginary part)
#if PRECISION == 3232 || PRECISION == 6464
  #define ImagToZero(a) a.y = ZERO
#else
  #define ImagToZero(a)
#endif
// Sets a variable to one
#if PRECISION == 3232 || PRECISION == 6464
  #define SetToOne(a) a.x = ONE; a.y = ZERO
#else
  #define SetToOne(a) a = ONE
#endif
// The absolute value (component-wise)
#if PRECISION == 3232 || PRECISION == 6464
  #define AbsoluteValue(value) value.x = fabs(value.x); value.y = fabs(value.y)
#else
  #define AbsoluteValue(value) value = fabs(value)
#endif
// Adds two complex variables
#if PRECISION == 3232 || PRECISION == 6464
  #define Add(c, a, b) c.x = a.x + b.x; c.y = a.y + b.y
#else
  #define Add(c, a, b) c = a + b
#endif
// Multiply two complex variables (used in the defines below)
#if PRECISION == 3232 || PRECISION == 6464
  #define MulReal(a, b) a.x*b.x - a.y*b.y
  #define MulImag(a, b) a.x*b.y + a.y*b.x
#endif
// The scalar multiply function
#if PRECISION == 3232 || PRECISION == 6464
  #define Multiply(c, a, b) c.x = MulReal(a,b); c.y = MulImag(a,b)
#else
  #define Multiply(c, a, b) c = a * b
#endif
// The scalar multiply-add function
#if PRECISION == 3232 || PRECISION == 6464
  #define MultiplyAdd(c, a, b) c.x += MulReal(a,b); c.y += MulImag(a,b)
#else
  #if USE_CL_MAD == 1
    #define MultiplyAdd(c, a, b) c = mad(a, b, c)
  #else
    #define MultiplyAdd(c, a, b) c += a * b
  #endif
#endif
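// Illustration (not part of the dumped program): how MultiplyAdd specializes for the precision
// this build actually uses (PRECISION == 32 with USE_CL_MAD == 0) versus the complex case:
//   MultiplyAdd(c, a, b)  ->  c += a * b;                        // PRECISION == 32
//   MultiplyAdd(c, a, b)  ->  c.x += a.x*b.x - a.y*b.y;
//                             c.y += a.x*b.y + a.y*b.x;          // PRECISION == 3232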
// The scalar AXPBY function
#if PRECISION == 3232 || PRECISION == 6464
  #define AXPBY(e, a, b, c, d) e.x = MulReal(a,b) + MulReal(c,d); e.y = MulImag(a,b) + MulImag(c,d)
#else
  #define AXPBY(e, a, b, c, d) e = a*b + c*d
#endif
// The complex conjugate operation for complex transforms
#if PRECISION == 3232 || PRECISION == 6464
  #define COMPLEX_CONJUGATE(value) value.x = value.x; value.y = -value.y
#else
  #define COMPLEX_CONJUGATE(value) value = value
#endif
// =================================================================================================
// Shuffled workgroup indices to avoid partition camping, see below. For specific devices, this is
// enabled (see src/routine.cc).
#ifndef USE_STAGGERED_INDICES
  #define USE_STAGGERED_INDICES 0
#endif
// Staggered/shuffled group indices to avoid partition camping (AMD GPUs). Formulas are taken from:
// http://docs.nvidia.com/cuda/samples/6_Advanced/transpose/doc/MatrixTranspose.pdf
// More details: https://github.com/CNugteren/CLBlast/issues/53
#if USE_STAGGERED_INDICES == 1
inline size_t GetGroupIDFlat() {
  return get_group_id(0) + get_num_groups(0) * get_group_id(1);
}
inline size_t GetGroupID1() {
  return (GetGroupIDFlat()) % get_num_groups(1);
}
inline size_t GetGroupID0() {
  return ((GetGroupIDFlat() / get_num_groups(1)) + GetGroupID1()) % get_num_groups(0);
}
#else
inline size_t GetGroupID1() { return get_group_id(1); }
inline size_t GetGroupID0() { return get_group_id(0); }
#endif
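// Worked example (illustrative values, not from this run): with a 4x4 grid of workgroups and
// USE_STAGGERED_INDICES == 1, the physical group (get_group_id(0)=1, get_group_id(1)=2) maps to:
//   GetGroupIDFlat() = 1 + 4*2           = 9
//   GetGroupID1()    = 9 % 4             = 1
//   GetGroupID0()    = ((9 / 4) + 1) % 4 = 3
// so that group processes the logical tile (3,1), spreading neighbouring tiles over different
// memory partitions.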
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Parameters set by the tuner or by the database. Here they are given a basic default value in case
// this kernel file is used outside of the CLBlast library.
// For the 'fast' copy kernel
#ifndef COPY_DIMX
  #define COPY_DIMX 8      // Local workgroup size in the first dimension (x)
#endif
#ifndef COPY_DIMY
  #define COPY_DIMY 8      // Local workgroup size in the second dimension (y)
#endif
#ifndef COPY_WPT
  #define COPY_WPT 1       // Work per thread in the first dimension (x)
#endif
#ifndef COPY_VW
  #define COPY_VW 1        // Vector width in the second dimension (y)
#endif
// For the padding/copy kernels and the conversion kernels
#ifndef PAD_DIMX
  #define PAD_DIMX 8       // Local workgroup size in the first dimension (x)
#endif
#ifndef PAD_DIMY
  #define PAD_DIMY 8       // Local workgroup size in the second dimension (y)
#endif
#ifndef PAD_WPTX
  #define PAD_WPTX 1       // Work per thread in the first dimension (x)
#endif
#ifndef PAD_WPTY
  #define PAD_WPTY 1       // Work per thread in the second dimension (y)
#endif
// For the 'fast' transpose kernel
#ifndef TRA_DIM
  #define TRA_DIM 8        // Number of local threads in the two dimensions (x,y)
#endif
#ifndef TRA_WPT
  #define TRA_WPT 1        // Work per thread in one dimension and vector-width in the other
#endif
#ifndef TRA_PAD
  #define TRA_PAD 0        // Padding of the local memory to avoid bank-conflicts
#endif
#ifndef TRA_SHUFFLE
  #define TRA_SHUFFLE 0    // Shuffling of the global indices to avoid global memory bank-conflicts
#endif
// For the padding/transpose kernels
#ifndef PADTRA_TILE
  #define PADTRA_TILE 8    // Number of local threads in the two dimensions (x,y)
#endif
#ifndef PADTRA_WPT
  #define PADTRA_WPT 1     // Amount of work per thread
#endif
#ifndef PADTRA_PAD
  #define PADTRA_PAD 0     // Padding of the local memory to avoid bank-conflicts
#endif
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Data-widths
#if COPY_VW == 1
  typedef real realC;
#elif COPY_VW == 2
  typedef real2 realC;
#elif COPY_VW == 4
  typedef real4 realC;
#elif COPY_VW == 8
  typedef real8 realC;
#elif COPY_VW == 16
  typedef real16 realC;
#endif
// =================================================================================================
// Fast copy kernel. Requires 'ld' and the number of threads in dimension 0 to be a multiple of
// COPY_VW. Also requires both matrices to be of the same dimensions and without offset.
__kernel __attribute__((reqd_work_group_size(COPY_DIMX, COPY_DIMY, 1)))
void CopyMatrixFast(const int ld,
                    __global const realC* restrict src,
                    __global realC* dest,
                    const __constant real* restrict arg_alpha) {
  const real alpha = arg_alpha[0];
  #pragma unroll
  for (int w_one=0; w_one<COPY_WPT; ++w_one) {
    const int id_one = get_global_id(0);
    const int id_two = (get_group_id(1)*COPY_WPT + w_one) * COPY_DIMY + get_local_id(1);
    const int id = id_two*(ld/COPY_VW) + id_one;
    realC result;
    #if COPY_VW == 1
      Multiply(result, alpha, src[id]);
    #elif COPY_VW == 2
      Multiply(result.x, alpha, src[id].x);
      Multiply(result.y, alpha, src[id].y);
    #elif COPY_VW == 4
      Multiply(result.x, alpha, src[id].x);
      Multiply(result.y, alpha, src[id].y);
      Multiply(result.z, alpha, src[id].z);
      Multiply(result.w, alpha, src[id].w);
    #elif COPY_VW == 8
      Multiply(result.s0, alpha, src[id].s0);
      Multiply(result.s1, alpha, src[id].s1);
      Multiply(result.s2, alpha, src[id].s2);
      Multiply(result.s3, alpha, src[id].s3);
      Multiply(result.s4, alpha, src[id].s4);
      Multiply(result.s5, alpha, src[id].s5);
      Multiply(result.s6, alpha, src[id].s6);
      Multiply(result.s7, alpha, src[id].s7);
    #elif COPY_VW == 16
      Multiply(result.s0, alpha, src[id].s0);
      Multiply(result.s1, alpha, src[id].s1);
      Multiply(result.s2, alpha, src[id].s2);
      Multiply(result.s3, alpha, src[id].s3);
      Multiply(result.s4, alpha, src[id].s4);
      Multiply(result.s5, alpha, src[id].s5);
      Multiply(result.s6, alpha, src[id].s6);
      Multiply(result.s7, alpha, src[id].s7);
      Multiply(result.s8, alpha, src[id].s8);
      Multiply(result.s9, alpha, src[id].s9);
      Multiply(result.sA, alpha, src[id].sA);
      Multiply(result.sB, alpha, src[id].sB);
      Multiply(result.sC, alpha, src[id].sC);
      Multiply(result.sD, alpha, src[id].sD);
      Multiply(result.sE, alpha, src[id].sE);
      Multiply(result.sF, alpha, src[id].sF);
    #endif
    dest[id] = result;
  }
}
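// For context, a host-side launch of CopyMatrixFast would look roughly like the sketch below.
// This is an illustration only: the names "program", "queue", "src_buf", "dest_buf", "alpha_buf"
// and "n_cols" are assumptions, not taken from CLBlast or from this log.
//   cl_int err;
//   cl_kernel kernel = clCreateKernel(program, "CopyMatrixFast", &err);
//   clSetKernelArg(kernel, 0, sizeof(int), &ld);
//   clSetKernelArg(kernel, 1, sizeof(cl_mem), &src_buf);
//   clSetKernelArg(kernel, 2, sizeof(cl_mem), &dest_buf);
//   clSetKernelArg(kernel, 3, sizeof(cl_mem), &alpha_buf);       // buffer holding one 'real' alpha
//   const size_t global[2] = {ld / COPY_VW, n_cols / COPY_WPT};  // dim 0 is vectorized by COPY_VW
//   const size_t local[2]  = {COPY_DIMX, COPY_DIMY};
//   err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);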
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Copies a matrix from source to destination. The output is padded with zero values in case the
// destination matrix dimensions are larger than the source matrix dimensions. Additionally, the ld
// value and offset can be different.
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void CopyPadMatrix(const int src_one, const int src_two,
                   const int src_ld, const int src_offset,
                   __global const real* restrict src,
                   const int dest_one, const int dest_two,
                   const int dest_ld, const int dest_offset,
                   __global real* dest,
                   const __constant real* restrict arg_alpha,
                   const int do_conjugate) {
  const real alpha = arg_alpha[0];
  // Loops over the work per thread in both dimensions
  #pragma unroll
  for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
    const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
    #pragma unroll
    for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
      const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
      if (id_two < dest_two && id_one < dest_one) {
        // Loads data if the thread IDs are within bounds of the source matrix. Otherwise, set the
        // value to be written to zero.
        real value;
        SetToZero(value);
        if (id_two < src_two && id_one < src_one) {
          value = src[id_two*src_ld + id_one + src_offset];
        }
        // Stores the value in the destination matrix
        if (do_conjugate == 1) { COMPLEX_CONJUGATE(value); }
        Multiply(dest[id_two*dest_ld + id_one + dest_offset], alpha, value);
      }
    }
  }
}
// =================================================================================================
// Same as above, but now un-pads a matrix. This kernel reads data from a padded source matrix, but
// writes only the actual data back to the destination matrix. Again, the ld value and offset can
// be different.
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void CopyMatrix(const int src_one, const int src_two,
                const int src_ld, const int src_offset,
                __global const real* restrict src,
                const int dest_one, const int dest_two,
                const int dest_ld, const int dest_offset,
                __global real* dest,
                const __constant real* restrict arg_alpha,
                const int upper, const int lower,
                const int diagonal_imag_zero) {
  const real alpha = arg_alpha[0];
  // Loops over the work per thread in both dimensions
  #pragma unroll
  for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
    const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
    #pragma unroll
    for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
      const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
      // Masking in case of triangular matrices: updates only the upper or lower part
      bool condition = true;
      #if defined(ROUTINE_SYRK) || defined(ROUTINE_HERK) || defined(ROUTINE_SYR2K) || defined(ROUTINE_HER2K)
        if (upper == 1) { condition = (id_two >= id_one); }
        else if (lower == 1) { condition = (id_two <= id_one); }
      #endif
      if (condition) {
        // Copies the value into the destination matrix. This is always within bounds of the source
        // matrix, as we know that the destination matrix is smaller or equal to the source.
        if (id_two < dest_two && id_one < dest_one) {
          real value = src[id_two*src_ld + id_one + src_offset];
          if (diagonal_imag_zero == 1 && id_one == id_two) { ImagToZero(value); }
          Multiply(dest[id_two*dest_ld + id_one + dest_offset], alpha, value);
        }
      }
    }
  }
}
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Data-widths
#if TRA_WPT == 1
  typedef real realT;
#elif TRA_WPT == 2
  typedef real2 realT;
#elif TRA_WPT == 4
  typedef real4 realT;
#elif TRA_WPT == 8
  typedef real8 realT;
#elif TRA_WPT == 16
  typedef real16 realT;
#endif
// =================================================================================================
// Transposes and copies a matrix. Requires both matrices to be of the same dimensions and without
// offset. A more general version is available in 'padtranspose.opencl'.
__kernel __attribute__((reqd_work_group_size(TRA_DIM, TRA_DIM, 1)))
void TransposeMatrixFast(const int ld,
                         __global const realT* restrict src,
                         __global realT* dest,
                         const __constant real* restrict arg_alpha) {
  const real alpha = arg_alpha[0];
  // Sets the group identifiers. They might be 'shuffled' around to distribute work in a different
  // way over workgroups, breaking memory-bank dependencies.
  const int gid0 = get_group_id(0);
  #if TRA_SHUFFLE == 1
    const int gid1 = (get_group_id(0) + get_group_id(1)) % get_num_groups(0);
  #else
    const int gid1 = get_group_id(1);
  #endif
  // Local memory to store a tile of the matrix (for coalescing)
  __local realT tile[TRA_WPT*TRA_DIM][TRA_DIM + TRA_PAD];
  // Loops over the work per thread
  #pragma unroll
  for (int w_one=0; w_one<TRA_WPT; ++w_one) {
    // Computes the identifiers for the source matrix. Note that the local and global dimensions
    // do not correspond to each other!
    const int id_one = gid1 * TRA_DIM + get_local_id(0);
    const int id_two = (gid0 * TRA_DIM + get_local_id(1))*TRA_WPT + w_one;
    // Loads data into the local memory
    realT value = src[id_two*(ld/TRA_WPT) + id_one];
    tile[get_local_id(0)*TRA_WPT + w_one][get_local_id(1)] = value;
  }
  // Synchronizes all threads in a workgroup
  barrier(CLK_LOCAL_MEM_FENCE);
  // Loads transposed data from the local memory
  realT v[TRA_WPT];
  #pragma unroll
  for (int w_one=0; w_one<TRA_WPT; ++w_one) {
    v[w_one] = tile[get_local_id(1)*TRA_WPT + w_one][get_local_id(0)];
  }
  // Performs the register-level transpose of the vectorized data
  realT results[TRA_WPT];
  #if TRA_WPT == 1
    results[0] = v[0];
  #elif TRA_WPT == 2
    results[0] = (realT) {v[0].x, v[1].x};
    results[1] = (realT) {v[0].y, v[1].y};
  #elif TRA_WPT == 4
    results[0] = (realT) {v[0].x, v[1].x, v[2].x, v[3].x};
    results[1] = (realT) {v[0].y, v[1].y, v[2].y, v[3].y};
    results[2] = (realT) {v[0].z, v[1].z, v[2].z, v[3].z};
    results[3] = (realT) {v[0].w, v[1].w, v[2].w, v[3].w};
  #elif TRA_WPT == 8
    results[0] = (realT) {v[0].s0, v[1].s0, v[2].s0, v[3].s0, v[4].s0, v[5].s0, v[6].s0, v[7].s0};
    results[1] = (realT) {v[0].s1, v[1].s1, v[2].s1, v[3].s1, v[4].s1, v[5].s1, v[6].s1, v[7].s1};
    results[2] = (realT) {v[0].s2, v[1].s2, v[2].s2, v[3].s2, v[4].s2, v[5].s2, v[6].s2, v[7].s2};
    results[3] = (realT) {v[0].s3, v[1].s3, v[2].s3, v[3].s3, v[4].s3, v[5].s3, v[6].s3, v[7].s3};
    results[4] = (realT) {v[0].s4, v[1].s4, v[2].s4, v[3].s4, v[4].s4, v[5].s4, v[6].s4, v[7].s4};
    results[5] = (realT) {v[0].s5, v[1].s5, v[2].s5, v[3].s5, v[4].s5, v[5].s5, v[6].s5, v[7].s5};
    results[6] = (realT) {v[0].s6, v[1].s6, v[2].s6, v[3].s6, v[4].s6, v[5].s6, v[6].s6, v[7].s6};
    results[7] = (realT) {v[0].s7, v[1].s7, v[2].s7, v[3].s7, v[4].s7, v[5].s7, v[6].s7, v[7].s7};
  #elif TRA_WPT == 16
    results[ 0] = (realT) {v[0].s0, v[1].s0, v[2].s0, v[3].s0, v[4].s0, v[5].s0, v[6].s0, v[7].s0, v[8].s0, v[9].s0, v[10].s0, v[11].s0, v[12].s0, v[13].s0, v[14].s0, v[15].s0};
    results[ 1] = (realT) {v[0].s1, v[1].s1, v[2].s1, v[3].s1, v[4].s1, v[5].s1, v[6].s1, v[7].s1, v[8].s1, v[9].s1, v[10].s1, v[11].s1, v[12].s1, v[13].s1, v[14].s1, v[15].s1};
    results[ 2] = (realT) {v[0].s2, v[1].s2, v[2].s2, v[3].s2, v[4].s2, v[5].s2, v[6].s2, v[7].s2, v[8].s2, v[9].s2, v[10].s2, v[11].s2, v[12].s2, v[13].s2, v[14].s2, v[15].s2};
    results[ 3] = (realT) {v[0].s3, v[1].s3, v[2].s3, v[3].s3, v[4].s3, v[5].s3, v[6].s3, v[7].s3, v[8].s3, v[9].s3, v[10].s3, v[11].s3, v[12].s3, v[13].s3, v[14].s3, v[15].s3};
    results[ 4] = (realT) {v[0].s4, v[1].s4, v[2].s4, v[3].s4, v[4].s4, v[5].s4, v[6].s4, v[7].s4, v[8].s4, v[9].s4, v[10].s4, v[11].s4, v[12].s4, v[13].s4, v[14].s4, v[15].s4};
    results[ 5] = (realT) {v[0].s5, v[1].s5, v[2].s5, v[3].s5, v[4].s5, v[5].s5, v[6].s5, v[7].s5, v[8].s5, v[9].s5, v[10].s5, v[11].s5, v[12].s5, v[13].s5, v[14].s5, v[15].s5};
    results[ 6] = (realT) {v[0].s6, v[1].s6, v[2].s6, v[3].s6, v[4].s6, v[5].s6, v[6].s6, v[7].s6, v[8].s6, v[9].s6, v[10].s6, v[11].s6, v[12].s6, v[13].s6, v[14].s6, v[15].s6};
    results[ 7] = (realT) {v[0].s7, v[1].s7, v[2].s7, v[3].s7, v[4].s7, v[5].s7, v[6].s7, v[7].s7, v[8].s7, v[9].s7, v[10].s7, v[11].s7, v[12].s7, v[13].s7, v[14].s7, v[15].s7};
    results[ 8] = (realT) {v[0].s8, v[1].s8, v[2].s8, v[3].s8, v[4].s8, v[5].s8, v[6].s8, v[7].s8, v[8].s8, v[9].s8, v[10].s8, v[11].s8, v[12].s8, v[13].s8, v[14].s8, v[15].s8};
    results[ 9] = (realT) {v[0].s9, v[1].s9, v[2].s9, v[3].s9, v[4].s9, v[5].s9, v[6].s9, v[7].s9, v[8].s9, v[9].s9, v[10].s9, v[11].s9, v[12].s9, v[13].s9, v[14].s9, v[15].s9};
    results[10] = (realT) {v[0].sA, v[1].sA, v[2].sA, v[3].sA, v[4].sA, v[5].sA, v[6].sA, v[7].sA, v[8].sA, v[9].sA, v[10].sA, v[11].sA, v[12].sA, v[13].sA, v[14].sA, v[15].sA};
    results[11] = (realT) {v[0].sB, v[1].sB, v[2].sB, v[3].sB, v[4].sB, v[5].sB, v[6].sB, v[7].sB, v[8].sB, v[9].sB, v[10].sB, v[11].sB, v[12].sB, v[13].sB, v[14].sB, v[15].sB};
    results[12] = (realT) {v[0].sC, v[1].sC, v[2].sC, v[3].sC, v[4].sC, v[5].sC, v[6].sC, v[7].sC, v[8].sC, v[9].sC, v[10].sC, v[11].sC, v[12].sC, v[13].sC, v[14].sC, v[15].sC};
    results[13] = (realT) {v[0].sD, v[1].sD, v[2].sD, v[3].sD, v[4].sD, v[5].sD, v[6].sD, v[7].sD, v[8].sD, v[9].sD, v[10].sD, v[11].sD, v[12].sD, v[13].sD, v[14].sD, v[15].sD};
    results[14] = (realT) {v[0].sE, v[1].sE, v[2].sE, v[3].sE, v[4].sE, v[5].sE, v[6].sE, v[7].sE, v[8].sE, v[9].sE, v[10].sE, v[11].sE, v[12].sE, v[13].sE, v[14].sE, v[15].sE};
    results[15] = (realT) {v[0].sF, v[1].sF, v[2].sF, v[3].sF, v[4].sF, v[5].sF, v[6].sF, v[7].sF, v[8].sF, v[9].sF, v[10].sF, v[11].sF, v[12].sF, v[13].sF, v[14].sF, v[15].sF};
  #endif
  // Multiplies by alpha and then stores the results into the destination matrix
  #pragma unroll
  for (int w_two=0; w_two<TRA_WPT; ++w_two) {
    realT result;
    #if TRA_WPT == 1
      Multiply(result, alpha, results[w_two]);
    #elif TRA_WPT == 2
      Multiply(result.x, alpha, results[w_two].x);
      Multiply(result.y, alpha, results[w_two].y);
    #elif TRA_WPT == 4
      Multiply(result.x, alpha, results[w_two].x);
      Multiply(result.y, alpha, results[w_two].y);
      Multiply(result.z, alpha, results[w_two].z);
      Multiply(result.w, alpha, results[w_two].w);
    #elif TRA_WPT == 8
      Multiply(result.s0, alpha, results[w_two].s0);
      Multiply(result.s1, alpha, results[w_two].s1);
      Multiply(result.s2, alpha, results[w_two].s2);
      Multiply(result.s3, alpha, results[w_two].s3);
      Multiply(result.s4, alpha, results[w_two].s4);
      Multiply(result.s5, alpha, results[w_two].s5);
      Multiply(result.s6, alpha, results[w_two].s6);
      Multiply(result.s7, alpha, results[w_two].s7);
    #elif TRA_WPT == 16
      Multiply(result.s0, alpha, results[w_two].s0);
      Multiply(result.s1, alpha, results[w_two].s1);
      Multiply(result.s2, alpha, results[w_two].s2);
      Multiply(result.s3, alpha, results[w_two].s3);
      Multiply(result.s4, alpha, results[w_two].s4);
      Multiply(result.s5, alpha, results[w_two].s5);
      Multiply(result.s6, alpha, results[w_two].s6);
      Multiply(result.s7, alpha, results[w_two].s7);
      Multiply(result.s8, alpha, results[w_two].s8);
      Multiply(result.s9, alpha, results[w_two].s9);
      Multiply(result.sA, alpha, results[w_two].sA);
      Multiply(result.sB, alpha, results[w_two].sB);
      Multiply(result.sC, alpha, results[w_two].sC);
      Multiply(result.sD, alpha, results[w_two].sD);
      Multiply(result.sE, alpha, results[w_two].sE);
      Multiply(result.sF, alpha, results[w_two].sF);
    #endif
    const int id_one = gid0*TRA_DIM + get_local_id(0);
    const int id_two = (gid1*TRA_DIM + get_local_id(1))*TRA_WPT + w_two;
    dest[id_two*(ld/TRA_WPT) + id_one] = result;
  }
}
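// In short, TransposeMatrixFast works in two stages (a sketch of the data flow, not additional
// code from the dump): coalesced loads fill the __local tile, a barrier separates the stages,
// each thread then gathers a TRA_WPT-wide slice into v[], shuffles it register-to-register into
// results[], and stores alpha*results[] back with coalesced writes:
//   global src --coalesced load--> __local tile --barrier--> registers v[]
//   v[] --register shuffle--> results[] --alpha scale--> global dest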
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Transposes a matrix from source to destination. The output is padded with zero values in case the
// destination matrix dimensions are larger than the transposed source matrix dimensions.
__kernel __attribute__((reqd_work_group_size(PADTRA_TILE, PADTRA_TILE, 1)))
void TransposePadMatrix(const int src_one, const int src_two,
                        const int src_ld, const int src_offset,
                        __global const real* restrict src,
                        const int dest_one, const int dest_two,
                        const int dest_ld, const int dest_offset,
                        __global real* dest,
                        const __constant real* restrict arg_alpha,
                        const int do_conjugate) {
  const real alpha = arg_alpha[0];
  // Local memory to store a tile of the matrix (for coalescing)
  __local real tile[PADTRA_WPT*PADTRA_TILE][PADTRA_WPT*PADTRA_TILE + PADTRA_PAD];
  // Loop over the work per thread
  #pragma unroll
  for (int w_one=0; w_one<PADTRA_WPT; ++w_one) {
    #pragma unroll
    for (int w_two=0; w_two<PADTRA_WPT; ++w_two) {
      // Computes the identifiers for the source matrix. Note that the local and global dimensions
      // do not correspond to each other!
      const int id_src_one = (get_group_id(1)*PADTRA_WPT + w_two) * PADTRA_TILE + get_local_id(0);
      const int id_src_two = (get_group_id(0)*PADTRA_WPT + w_one) * PADTRA_TILE + get_local_id(1);
      // Loads data into the local memory if the thread IDs are within bounds of the source matrix.
      // Otherwise, set the local memory value to zero.
      real value;
      SetToZero(value);
      if (id_src_two < src_two && id_src_one < src_one) {
        value = src[id_src_two*src_ld + id_src_one + src_offset];
      }
      tile[get_local_id(1)*PADTRA_WPT + w_two][get_local_id(0)*PADTRA_WPT + w_one] = value;
    }
  }
  // Synchronizes all threads in a workgroup
  barrier(CLK_LOCAL_MEM_FENCE);
  // Loop over the work per thread
  #pragma unroll
  for (int w_one=0; w_one<PADTRA_WPT; ++w_one) {
    #pragma unroll
    for (int w_two=0; w_two<PADTRA_WPT; ++w_two) {
      // Computes the identifiers for the destination matrix
      const int id_dest_one = (get_group_id(0)*PADTRA_WPT + w_one) * PADTRA_TILE + get_local_id(0);
      const int id_dest_two = (get_group_id(1)*PADTRA_WPT + w_two) * PADTRA_TILE + get_local_id(1);
      // Stores the transposed value in the destination matrix
      if ((id_dest_one < dest_one) && (id_dest_two < dest_two)) {
        real value = tile[get_local_id(0)*PADTRA_WPT + w_two][get_local_id(1)*PADTRA_WPT + w_one];
        if (do_conjugate == 1) { COMPLEX_CONJUGATE(value); }
        Multiply(dest[id_dest_two*dest_ld + id_dest_one + dest_offset], alpha, value);
      }
    }
  }
}
// =================================================================================================
// Transposes a matrix, while considering possible padding in the source matrix. Data is read from a
// padded source matrix, but only the actual data is written back to the transposed destination
// matrix. This kernel optionally checks for upper/lower triangular matrices.
__kernel __attribute__((reqd_work_group_size(PADTRA_TILE, PADTRA_TILE, 1)))
void TransposeMatrix(const int src_one, const int src_two,
                     const int src_ld, const int src_offset,
                     __global const real* restrict src,
                     const int dest_one, const int dest_two,
                     const int dest_ld, const int dest_offset,
                     __global real* dest,
                     const __constant real* restrict arg_alpha,
                     const int upper, const int lower,
                     const int diagonal_imag_zero) {
  const real alpha = arg_alpha[0];
  // Local memory to store a tile of the matrix (for coalescing)
  __local real tile[PADTRA_WPT*PADTRA_TILE][PADTRA_WPT*PADTRA_TILE + PADTRA_PAD];
  // Loop over the work per thread
  #pragma unroll
  for (int w_one=0; w_one<PADTRA_WPT; ++w_one) {
    #pragma unroll
    for (int w_two=0; w_two<PADTRA_WPT; ++w_two) {
      // Computes the identifiers for the source matrix. Note that the local and global dimensions
      // do not correspond to each other!
      const int id_src_one = (get_group_id(1)*PADTRA_WPT + w_two) * PADTRA_TILE + get_local_id(0);
      const int id_src_two = (get_group_id(0)*PADTRA_WPT + w_one) * PADTRA_TILE + get_local_id(1);
      // Loads data into the local memory if the thread IDs are within bounds of the source matrix.
      if ((id_src_one < src_one) && (id_src_two < src_two)) {
        real value = src[id_src_two*src_ld + id_src_one + src_offset];
        tile[get_local_id(1)*PADTRA_WPT + w_two][get_local_id(0)*PADTRA_WPT + w_one] = value;
      }
    }
  }
  // Synchronizes all threads in a workgroup
  barrier(CLK_LOCAL_MEM_FENCE);
  // Loop over the work per thread
  #pragma unroll
  for (int w_one=0; w_one<PADTRA_WPT; ++w_one) {
    #pragma unroll
    for (int w_two=0; w_two<PADTRA_WPT; ++w_two) {
      // Computes the identifiers for the destination matrix
      const int id_dest_one = (get_group_id(0)*PADTRA_WPT + w_one) * PADTRA_TILE + get_local_id(0);
      const int id_dest_two = (get_group_id(1)*PADTRA_WPT + w_two) * PADTRA_TILE + get_local_id(1);
      // Masking in case of triangular matrices: updates only the upper or lower part
      bool condition = true;
      #if defined(ROUTINE_SYRK) || defined(ROUTINE_HERK) || defined(ROUTINE_SYR2K) || defined(ROUTINE_HER2K)
        if (upper == 1) { condition = (id_dest_one >= id_dest_two); }
        else if (lower == 1) { condition = (id_dest_one <= id_dest_two); }
      #endif
      if (condition) {
        // Stores the transposed value in the destination matrix
        if ((id_dest_one < dest_one) && (id_dest_two < dest_two)) {
          real value = tile[get_local_id(0)*PADTRA_WPT + w_two][get_local_id(1)*PADTRA_WPT + w_one];
          if (diagonal_imag_zero == 1 && id_dest_one == id_dest_two) { ImagToZero(value); }
          Multiply(dest[id_dest_two*dest_ld + id_dest_one + dest_offset], alpha, value);
        }
      }
    }
  }
}
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
#if defined(ROUTINE_SYMM)
// Kernel to populate a squared symmetric matrix, given that the triangle which holds the data is
// stored as the lower-triangle of the input matrix. This uses the padding kernel's parameters.
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void SymmLowerToSquared(const int src_dim,
                        const int src_ld, const int src_offset,
                        __global const real* restrict src,
                        const int dest_dim,
                        const int dest_ld, const int dest_offset,
                        __global real* dest) {
  // Loops over the work per thread in both dimensions
  #pragma unroll
  for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
    const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
    #pragma unroll
    for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
      const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
      if (id_two < dest_dim && id_one < dest_dim) {
        // Loads data from the lower-symmetric matrix
        real result;
        SetToZero(result);
        if (id_two < src_dim && id_one < src_dim) {
          if (id_two <= id_one) { result = src[id_two*src_ld + id_one + src_offset]; }
          else { result = src[id_one*src_ld + id_two + src_offset]; }
        }
        // Stores the result in the destination matrix
        dest[id_two*dest_ld + id_one + dest_offset] = result;
      }
    }
  }
}
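// Worked trace (illustrative indices, not from this run): for an element above the diagonal,
// say id_one = 1 and id_two = 2, the test (id_two <= id_one) fails, so the kernel mirrors the
// read into the stored lower triangle:
//   result = src[id_one*src_ld + id_two + src_offset];    // (row 2, col 1), lower triangle
//   dest[id_two*dest_ld + id_one + dest_offset] = result; // written to (row 1, col 2)
// which is what makes the squared output symmetric.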
// Same as above, but now the matrix' data is stored in the upper-triangle
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void SymmUpperToSquared(const int src_dim,
                        const int src_ld, const int src_offset,
                        __global const real* restrict src,
                        const int dest_dim,
                        const int dest_ld, const int dest_offset,
                        __global real* dest) {
  // Loops over the work per thread in both dimensions
  #pragma unroll
  for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
    const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
    #pragma unroll
    for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
      const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
      if (id_two < dest_dim && id_one < dest_dim) {
        // Loads data from the upper-symmetric matrix
        real result;
        SetToZero(result);
        if (id_two < src_dim && id_one < src_dim) {
          if (id_one <= id_two) { result = src[id_two*src_ld + id_one + src_offset]; }
          else { result = src[id_one*src_ld + id_two + src_offset]; }
        }
        // Stores the result in the destination matrix
        dest[id_two*dest_ld + id_one + dest_offset] = result;
      }
    }
  }
}
#endif
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
#if defined(ROUTINE_TRMM)
// Kernel to populate a squared triangular matrix, given that the triangle which holds the data is
// stored as the lower-triangle of the input matrix. This uses the padding kernel's parameters.
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void TriaLowerToSquared(const int src_dim,
                        const int src_ld, const int src_offset,
                        __global const real* restrict src,
                        const int dest_dim,
                        const int dest_ld, const int dest_offset,
                        __global real* dest,
                        const int unit_diagonal) {
  // Loops over the work per thread in both dimensions
  #pragma unroll
  for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
    const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
    #pragma unroll
    for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
      const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
      if (id_two < dest_dim && id_one < dest_dim) {
        // Loads data from the lower-triangular matrix
        real result;
        SetToZero(result);
        if (id_two < src_dim && id_one < src_dim) {
          if (id_two <= id_one) { result = src[id_two*src_ld + id_one + src_offset]; }
          if (id_two == id_one && unit_diagonal) { SetToOne(result); }
          // Else: result is zero
        }
        // Stores the result in the destination matrix
        dest[id_two*dest_ld + id_one + dest_offset] = result;
      }
    }
  }
}
// Same as above, but now the matrix' data is stored in the upper-triangle
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void TriaUpperToSquared(const int src_dim,
                        const int src_ld, const int src_offset,
                        __global const real* restrict src,
                        const int dest_dim,
                        const int dest_ld, const int dest_offset,
                        __global real* dest,
                        const int unit_diagonal) {
  // Loops over the work per thread in both dimensions
  #pragma unroll
  for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
    const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
    #pragma unroll
    for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
      const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
      if (id_two < dest_dim && id_one < dest_dim) {
        // Loads data from the upper-triangular matrix
        real result;
        SetToZero(result);
        if (id_two < src_dim && id_one < src_dim) {
          if (id_one <= id_two) { result = src[id_two*src_ld + id_one + src_offset]; }
          if (id_one == id_two && unit_diagonal) { SetToOne(result); }
          // Else: result is zero
        }
        // Stores the result in the destination matrix
        dest[id_two*dest_ld + id_one + dest_offset] = result;
      }
    }
  }
}
#endif
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
#if defined(ROUTINE_HEMM) && (PRECISION == 3232 || PRECISION == 6464)
// Kernel to populate a squared hermitian matrix, given that the triangle which holds the data is
// stored as the lower-triangle of the input matrix. This uses the padding kernel's parameters.
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void HermLowerToSquared(const int src_dim,
                        const int src_ld, const int src_offset,
                        __global const real* restrict src,
                        const int dest_dim,
                        const int dest_ld, const int dest_offset,
                        __global real* dest) {
  // Loops over the work per thread in both dimensions
  #pragma unroll
  for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
    const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
    #pragma unroll
    for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
      const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
      if (id_two < dest_dim && id_one < dest_dim) {
        // Loads data from the lower-hermitian matrix
        real result;
        SetToZero(result);
        if (id_two < src_dim && id_one < src_dim) {
          if (id_two <= id_one) {
            result = src[id_two*src_ld + id_one + src_offset];
            if (id_one == id_two) { result.y = ZERO; }
          }
          else {
            result = src[id_one*src_ld + id_two + src_offset];
            COMPLEX_CONJUGATE(result);
          }
        }
        // Stores the result in the destination matrix
        dest[id_two*dest_ld + id_one + dest_offset] = result;
      }
    }
  }
}
// Same as above, but now the matrix' data is stored in the upper-triangle
__kernel __attribute__((reqd_work_group_size(PAD_DIMX, PAD_DIMY, 1)))
void HermUpperToSquared(const int src_dim,
                        const int src_ld, const int src_offset,
                        __global const real* restrict src,
                        const int dest_dim,
                        const int dest_ld, const int dest_offset,
                        __global real* dest) {
  // Loops over the work per thread in both dimensions
  #pragma unroll
  for (int w_one=0; w_one<PAD_WPTX; ++w_one) {
    const int id_one = (get_group_id(0)*PAD_WPTX + w_one) * PAD_DIMX + get_local_id(0);
    #pragma unroll
    for (int w_two=0; w_two<PAD_WPTY; ++w_two) {
      const int id_two = (get_group_id(1)*PAD_WPTY + w_two) * PAD_DIMY + get_local_id(1);
      if (id_two < dest_dim && id_one < dest_dim) {
        // Loads data from the upper-hermitian matrix
        real result;
        SetToZero(result);
        if (id_two < src_dim && id_one < src_dim) {
          if (id_one <= id_two) {
            result = src[id_two*src_ld + id_one + src_offset];
            if (id_one == id_two) { result.y = ZERO; }
          }
          else {
            result = src[id_one*src_ld + id_two + src_offset];
            COMPLEX_CONJUGATE(result);
          }
        }
        // Stores the result in the destination matrix
        dest[id_two*dest_ld + id_one + dest_offset] = result;
      }
    }
  }
}
#endif
// =================================================================================================
// End of the C++11 raw string literal
// =================================================================================================
// Parameters set by the tuner or by the database. Here they are given a basic default value in case
// this kernel file is used outside of the CLBlast library.
#ifndef MWG
  #define MWG 8      // Tile-size in dimension M (e.g. 64, 128)
#endif
#ifndef NWG
  #define NWG 8      // Tile-size in dimension N (e.g. 64, 128)
#endif
#ifndef KWG
  #define KWG 8      // Tile-size in dimension K (e.g. 8, 16)
#endif
#ifndef MDIMC
  #define MDIMC 8    // Threads per workgroup in M-dimension (e.g. 8, 16, 32)
#endif
#ifndef NDIMC
  #define NDIMC 8    // Threads per workgroup in N-dimension (e.g. 8, 16, 32)
#endif
#ifndef MDIMA
  #define MDIMA 8    // Re-shaped tile dimension of matrix A: KDIMA * MDIMA
#endif
#ifndef NDIMB
  #define NDIMB 8    // Re-shaped tile dimension of matrix B: KDIMB * NDIMB
#endif
#ifndef KWI
  #define KWI 1      // Unroll factor of the KWG loop (smaller or equal than KWG)
#endif
#ifndef VWM
  #define VWM 1      // Vector width of matrices A and C
#endif
#ifndef VWN
  #define VWN 1      // Vector width of matrix B
#endif
#ifndef STRM
  #define STRM 0     // Use strided access within a thread in the M-dimension (1) or not (0)
#endif
#ifndef STRN
  #define STRN 0     // Use strided access within a thread in the N-dimension (1) or not (0)
#endif
#ifndef SA
  #define SA 0       // Use local/shared memory to cache matrix A (1) or not (0)
#endif
#ifndef SB
  #define SB 0       // Use local/shared memory to cache matrix B (1) or not (0)
#endif
// Helper parameters based on the above tuning parameters
#define MWI (MWG/MDIMC)               // Work per work-item (M-dimension)
#define NWI (NWG/NDIMC)               // Work per work-item (N-dimension)
#define KDIMA ((MDIMC*NDIMC)/(MDIMA)) // Re-shaped tile dimension of matrix A: KDIMA * MDIMA
#define KDIMB ((MDIMC*NDIMC)/(NDIMB)) // Re-shaped tile dimension of matrix B: KDIMB * NDIMB
#define MWA (MWG/MDIMA)               // Amount of loads-per-thread for matrix A (M-dimension)
#define KWA (KWG/KDIMA)               // Amount of loads-per-thread for matrix A (K-dimension)
#define KWB (KWG/KDIMB)               // Amount of loads-per-thread for matrix B (K-dimension)
#define NWB (NWG/NDIMB)               // Amount of loads-per-thread for matrix B (N-dimension)
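// Sanity check of the helper-parameter arithmetic using the values from the #define block at the
// top of this dump (MWG=32, NWG=64, KWG=16, MDIMC=NDIMC=MDIMA=NDIMB=8, VWM=VWN=1):
//   MWI   = 32/8    = 4      NWI   = 64/8    = 8
//   KDIMA = (8*8)/8 = 8      KDIMB = (8*8)/8 = 8
//   MWA   = 32/8    = 4      KWA   = 16/8    = 2
//   KWB   = 16/8    = 2      NWB   = 64/8    = 8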
// Settings | |
#ifndef USE_VECTOR_MAD | |
#define USE_VECTOR_MAD 0 // Unroll (0) or don't (1) unroll the vector MAD manually | |
#endif | |
#ifndef GLOBAL_MEM_FENCE | |
#define GLOBAL_MEM_FENCE 0 // Global synchronisation barrier for potential better performance | |
#endif | |
// ================================================================================================= | |
// Data-widths in dimension M | |
#if VWM == 1 | |
typedef real realM; | |
#elif VWM == 2 | |
typedef real2 realM; | |
#elif VWM == 4 | |
typedef real4 realM; | |
#elif VWM == 8 | |
typedef real8 realM; | |
#elif VWM == 16 | |
typedef real16 realM; | |
#endif | |
// Data-widths in dimension N | |
#if VWN == 1 | |
typedef real realN; | |
#elif VWN == 2 | |
typedef real2 realN; | |
#elif VWN == 4 | |
typedef real4 realN; | |
#elif VWN == 8 | |
typedef real8 realN; | |
#elif VWN == 16 | |
typedef real16 realN; | |
#endif | |
// ================================================================================================= | |
// Initializes the accumulation registers to zero | |
void InitAccRegisters(realM cpm[NWI][MWI/VWM]) { | |
#pragma unroll | |
for (int mi=0; mi<MWI/VWM; ++mi) { | |
#pragma unroll | |
for (int ni=0; ni<NWI; ++ni) { | |
#if VWM == 1 | |
SetToZero(cpm[ni][mi]); | |
#elif VWM == 2 | |
SetToZero(cpm[ni][mi].x); | |
SetToZero(cpm[ni][mi].y); | |
#elif VWM == 4 | |
SetToZero(cpm[ni][mi].x); | |
SetToZero(cpm[ni][mi].y); | |
SetToZero(cpm[ni][mi].z); | |
SetToZero(cpm[ni][mi].w); | |
#elif VWM == 8 | |
SetToZero(cpm[ni][mi].s0); | |
SetToZero(cpm[ni][mi].s1); | |
SetToZero(cpm[ni][mi].s2); | |
SetToZero(cpm[ni][mi].s3); | |
SetToZero(cpm[ni][mi].s4); | |
SetToZero(cpm[ni][mi].s5); | |
SetToZero(cpm[ni][mi].s6); | |
SetToZero(cpm[ni][mi].s7); | |
#elif VWM == 16 | |
SetToZero(cpm[ni][mi].s0); | |
SetToZero(cpm[ni][mi].s1); | |
SetToZero(cpm[ni][mi].s2); | |
SetToZero(cpm[ni][mi].s3); | |
SetToZero(cpm[ni][mi].s4); | |
SetToZero(cpm[ni][mi].s5); | |
SetToZero(cpm[ni][mi].s6); | |
SetToZero(cpm[ni][mi].s7); | |
SetToZero(cpm[ni][mi].s8); | |
SetToZero(cpm[ni][mi].s9); | |
SetToZero(cpm[ni][mi].sA); | |
SetToZero(cpm[ni][mi].sB); | |
SetToZero(cpm[ni][mi].sC); | |
SetToZero(cpm[ni][mi].sD); | |
SetToZero(cpm[ni][mi].sE); | |
SetToZero(cpm[ni][mi].sF); | |
#endif | |
} | |
} | |
} | |
// ================================================================================================= | |
// Caches global off-chip memory into local (shared) memory on-chip. This function is specific for | |
// caching the A input matrix. | |
#if SA == 1 | |
inline void GlobalToLocalA(const __global realM* restrict agm, __local realM* alm, | |
const int kSizeM, const int tid, const int kwg) { | |
const int la0 = tid % MDIMA; | |
const int la1 = tid / MDIMA; | |
#pragma unroll | |
for (int mia=0; mia<MWA/VWM; ++mia) { | |
#pragma unroll | |
for (int kia=0; kia<KWA; ++kia) { | |
// Computes the indices based on strided/non-strided access | |
#if STRM == 0 | |
int mg = mia + la0*(MWA/VWM); | |
#elif STRM == 1 | |
int mg = la0 + mia*MDIMA; | |
#endif | |
// Computes the indices for the global memory | |
int kg = kia + la1*KWA; | |
int idm = mg + GetGroupID0() * (MWG/VWM); | |
int idk = kg + kwg; | |
// Loads the data from global memory (not transposed) into the local memory | |
alm[kg*(MWG/VWM) + mg] = agm[idk*(kSizeM/VWM) + idm]; | |
} | |
} | |
} | |
#endif | |
// Same as above, but now for the B input matrix | |
#if SB == 1 | |
inline void GlobalToLocalB(const __global realN* restrict bgm, __local realN* blm, | |
const int kSizeN, const int tid, const int kwg) { | |
const int lb0 = tid % NDIMB; | |
const int lb1 = tid / NDIMB; | |
#pragma unroll | |
for (int kib=0; kib<KWB; ++kib) { | |
#pragma unroll | |
for (int nib=0; nib<NWB/VWN; ++nib) { | |
// Computes the indices based on strided/non-strided access | |
#if STRN == 0 | |
int ng = nib + lb0*(NWB/VWN); | |
#elif STRN == 1 | |
int ng = lb0 + nib*NDIMB; | |
#endif | |
// Computes the indices for the global memory | |
int kg = kib + lb1*KWB; | |
int idn = ng + GetGroupID1() * (NWG/VWN); | |
int idk = kg + kwg; | |
// Loads the data from global memory (transposed) into the local memory | |
blm[kg*(NWG/VWN) + ng] = bgm[idk*(kSizeN/VWN) + idn]; | |
} | |
} | |
} | |
#endif | |
// ================================================================================================= | |
// Caches global off-chip memory directly into per-thread private memory (registers). This function | |
// is specific for caching the A input matrix. | |
#if SA == 0 | |
inline void GlobalToPrivateA(const __global realM* restrict agm, realM apm[MWI/VWM], | |
const int kSizeM, const int idk, const int kwg) { | |
#pragma unroll | |
for (int mi=0; mi<MWI/VWM; ++mi) { | |
// Computes the indices based on strided/non-strided access | |
#if STRM == 0 | |
int mg = mi + get_local_id(0)*(MWI/VWM); | |
#elif STRM == 1 | |
int mg = get_local_id(0) + mi*MDIMC; | |
#endif | |
// Computes the indices for the global memory | |
int idm = mg + GetGroupID0() * (MWG/VWM); | |
// Loads the data from global memory (not transposed) and stores into registers | |
apm[mi] = agm[idk*(kSizeM/VWM) + idm]; | |
} | |
} | |
#endif | |
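// [editor's note] When SA == 0 there is no local-memory staging: for every k index the
// thread re-reads its MWI-wide slice of A from global memory straight into registers,
// trading the barriers and local-memory capacity of the SA == 1 path for extra global
// traffic. Which variant wins is device-dependent, which is why SA and SB are tuning
// parameters rather than fixed choices.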
// Same as above, but now for the B input matrix | |
#if SB == 0 | |
inline void GlobalToPrivateB(const __global realN* restrict bgm, realN bpm[NWI/VWN], | |
const int kSizeN, const int idk) { | |
#pragma unroll | |
for (int ni=0; ni<NWI/VWN; ++ni) { | |
// Computes the indices based on strided/non-strided access | |
#if STRN == 0 | |
int ng = ni + get_local_id(1)*(NWI/VWN); | |
#elif STRN == 1 | |
int ng = get_local_id(1) + ni*NDIMC; | |
#endif | |
// Computes the indices for the global memory | |
int idn = ng + GetGroupID1() * (NWG/VWN); | |
// Loads the data from global memory (transposed) and stores into registers | |
bpm[ni] = bgm[idk*(kSizeN/VWN) + idn]; | |
} | |
} | |
#endif | |
// ================================================================================================= | |
// Caches on-chip local memory into per-thread private memory (registers). This function is specific | |
// for caching the A input matrix. | |
#if SA == 1 | |
inline void LocalToPrivateA(__local realM* alm, realM apm[MWI/VWM], const int kg) { | |
#pragma unroll | |
for (int mi=0; mi<MWI/VWM; ++mi) { | |
#if STRM == 0 | |
int mg = mi + get_local_id(0)*(MWI/VWM); | |
#elif STRM == 1 | |
int mg = get_local_id(0) + mi*MDIMC; | |
#endif | |
apm[mi] = alm[kg*(MWG/VWM) + mg]; | |
} | |
} | |
#endif | |
// Same as above, but now for the B input matrix | |
#if SB == 1 | |
inline void LocalToPrivateB(__local realN* blm, realN bpm[NWI/VWN], const int kg) { | |
#pragma unroll | |
for (int ni=0; ni<NWI/VWN; ++ni) { | |
#if STRN == 0 | |
int ng = ni + get_local_id(1)*(NWI/VWN); | |
#elif STRN == 1 | |
int ng = get_local_id(1) + ni*NDIMC; | |
#endif | |
bpm[ni] = blm[kg*(NWG/VWN) + ng]; | |
} | |
} | |
#endif | |
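// [editor's note] Together the routines above implement a three-level staging chain,
// global --> local --> private, for the SA/SB == 1 configurations, and a direct
// global --> private path otherwise. A rough sketch of one k-step for matrix A,
// assuming SA == 1 (these calls appear verbatim in XgemmBody below):
#if 0
GlobalToLocalA(agm, alm, kSizeM, tid, kwg);   // whole workgroup fills the local tile
barrier(CLK_LOCAL_MEM_FENCE);                 // tile must be complete before use
LocalToPrivateA(alm, apm, kg);                // each thread grabs its register slice
#endif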
// ================================================================================================= | |
// End of the C++11 raw string literal | |
// ================================================================================================= | |
// The vectorised multiply-add function | |
inline realM MultiplyAddVector(realM cvec, const realM avec, const real bval) { | |
#if USE_VECTOR_MAD == 1 | |
cvec += avec * bval; | |
#else | |
#if VWM == 1 | |
MultiplyAdd(cvec, avec, bval); | |
#elif VWM == 2 | |
MultiplyAdd(cvec.x , avec.x, bval); | |
MultiplyAdd(cvec.y , avec.y, bval); | |
#elif VWM == 4 | |
MultiplyAdd(cvec.x , avec.x, bval); | |
MultiplyAdd(cvec.y , avec.y, bval); | |
MultiplyAdd(cvec.z , avec.z, bval); | |
MultiplyAdd(cvec.w , avec.w, bval); | |
#elif VWM == 8 | |
MultiplyAdd(cvec.s0, avec.s0, bval); | |
MultiplyAdd(cvec.s1, avec.s1, bval); | |
MultiplyAdd(cvec.s2, avec.s2, bval); | |
MultiplyAdd(cvec.s3, avec.s3, bval); | |
MultiplyAdd(cvec.s4, avec.s4, bval); | |
MultiplyAdd(cvec.s5, avec.s5, bval); | |
MultiplyAdd(cvec.s6, avec.s6, bval); | |
MultiplyAdd(cvec.s7, avec.s7, bval); | |
#elif VWM == 16 | |
MultiplyAdd(cvec.s0, avec.s0, bval); | |
MultiplyAdd(cvec.s1, avec.s1, bval); | |
MultiplyAdd(cvec.s2, avec.s2, bval); | |
MultiplyAdd(cvec.s3, avec.s3, bval); | |
MultiplyAdd(cvec.s4, avec.s4, bval); | |
MultiplyAdd(cvec.s5, avec.s5, bval); | |
MultiplyAdd(cvec.s6, avec.s6, bval); | |
MultiplyAdd(cvec.s7, avec.s7, bval); | |
MultiplyAdd(cvec.s8, avec.s8, bval); | |
MultiplyAdd(cvec.s9, avec.s9, bval); | |
MultiplyAdd(cvec.sA, avec.sA, bval); | |
MultiplyAdd(cvec.sB, avec.sB, bval); | |
MultiplyAdd(cvec.sC, avec.sC, bval); | |
MultiplyAdd(cvec.sD, avec.sD, bval); | |
MultiplyAdd(cvec.sE, avec.sE, bval); | |
MultiplyAdd(cvec.sF, avec.sF, bval); | |
#endif | |
#endif | |
return cvec; | |
} | |
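// [editor's note] With USE_VECTOR_MAD == 1 the whole vector update is left to the
// compiler; otherwise it is expanded per component via MultiplyAdd, which this source
// presumably defines earlier (e.g. as c += a*b or an explicit mad()). For VWM == 4 the
// #else branch is then equivalent to:
#if 0
cvec.x += avec.x * bval;  cvec.y += avec.y * bval;
cvec.z += avec.z * bval;  cvec.w += avec.w * bval;
#endif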
// Performs the actual computation: Cpm += Apm * Bpm | |
inline void MultiplyAccumulate(realM cpm[NWI][MWI/VWM], realM apm[MWI/VWM], realN bpm[NWI/VWN]) { | |
#pragma unroll | |
for (int ni=0; ni<NWI/VWN; ++ni) { | |
#pragma unroll | |
for (int mi=0; mi<MWI/VWM; ++mi) { | |
const realM aval = apm[mi]; | |
#if VWN == 1 | |
cpm[ni*VWN + 0][mi] = MultiplyAddVector(cpm[ni*VWN + 0][mi], aval, bpm[ni]); | |
#elif VWN == 2 | |
cpm[ni*VWN + 0][mi] = MultiplyAddVector(cpm[ni*VWN + 0][mi], aval, bpm[ni].x); | |
cpm[ni*VWN + 1][mi] = MultiplyAddVector(cpm[ni*VWN + 1][mi], aval, bpm[ni].y); | |
#elif VWN == 4 | |
cpm[ni*VWN + 0][mi] = MultiplyAddVector(cpm[ni*VWN + 0][mi], aval, bpm[ni].x); | |
cpm[ni*VWN + 1][mi] = MultiplyAddVector(cpm[ni*VWN + 1][mi], aval, bpm[ni].y); | |
cpm[ni*VWN + 2][mi] = MultiplyAddVector(cpm[ni*VWN + 2][mi], aval, bpm[ni].z); | |
cpm[ni*VWN + 3][mi] = MultiplyAddVector(cpm[ni*VWN + 3][mi], aval, bpm[ni].w); | |
#elif VWN == 8 | |
cpm[ni*VWN + 0][mi] = MultiplyAddVector(cpm[ni*VWN + 0][mi], aval, bpm[ni].s0); | |
cpm[ni*VWN + 1][mi] = MultiplyAddVector(cpm[ni*VWN + 1][mi], aval, bpm[ni].s1); | |
cpm[ni*VWN + 2][mi] = MultiplyAddVector(cpm[ni*VWN + 2][mi], aval, bpm[ni].s2); | |
cpm[ni*VWN + 3][mi] = MultiplyAddVector(cpm[ni*VWN + 3][mi], aval, bpm[ni].s3); | |
cpm[ni*VWN + 4][mi] = MultiplyAddVector(cpm[ni*VWN + 4][mi], aval, bpm[ni].s4); | |
cpm[ni*VWN + 5][mi] = MultiplyAddVector(cpm[ni*VWN + 5][mi], aval, bpm[ni].s5); | |
cpm[ni*VWN + 6][mi] = MultiplyAddVector(cpm[ni*VWN + 6][mi], aval, bpm[ni].s6); | |
cpm[ni*VWN + 7][mi] = MultiplyAddVector(cpm[ni*VWN + 7][mi], aval, bpm[ni].s7); | |
#elif VWN == 16 | |
cpm[ni*VWN + 0 ][mi] = MultiplyAddVector(cpm[ni*VWN + 0 ][mi], aval, bpm[ni].s0); | |
cpm[ni*VWN + 1 ][mi] = MultiplyAddVector(cpm[ni*VWN + 1 ][mi], aval, bpm[ni].s1); | |
cpm[ni*VWN + 2 ][mi] = MultiplyAddVector(cpm[ni*VWN + 2 ][mi], aval, bpm[ni].s2); | |
cpm[ni*VWN + 3 ][mi] = MultiplyAddVector(cpm[ni*VWN + 3 ][mi], aval, bpm[ni].s3); | |
cpm[ni*VWN + 4 ][mi] = MultiplyAddVector(cpm[ni*VWN + 4 ][mi], aval, bpm[ni].s4); | |
cpm[ni*VWN + 5 ][mi] = MultiplyAddVector(cpm[ni*VWN + 5 ][mi], aval, bpm[ni].s5); | |
cpm[ni*VWN + 6 ][mi] = MultiplyAddVector(cpm[ni*VWN + 6 ][mi], aval, bpm[ni].s6); | |
cpm[ni*VWN + 7 ][mi] = MultiplyAddVector(cpm[ni*VWN + 7 ][mi], aval, bpm[ni].s7); | |
cpm[ni*VWN + 8 ][mi] = MultiplyAddVector(cpm[ni*VWN + 8 ][mi], aval, bpm[ni].s8); | |
cpm[ni*VWN + 9 ][mi] = MultiplyAddVector(cpm[ni*VWN + 9 ][mi], aval, bpm[ni].s9); | |
cpm[ni*VWN + 10][mi] = MultiplyAddVector(cpm[ni*VWN + 10][mi], aval, bpm[ni].sA); | |
cpm[ni*VWN + 11][mi] = MultiplyAddVector(cpm[ni*VWN + 11][mi], aval, bpm[ni].sB); | |
cpm[ni*VWN + 12][mi] = MultiplyAddVector(cpm[ni*VWN + 12][mi], aval, bpm[ni].sC); | |
cpm[ni*VWN + 13][mi] = MultiplyAddVector(cpm[ni*VWN + 13][mi], aval, bpm[ni].sD); | |
cpm[ni*VWN + 14][mi] = MultiplyAddVector(cpm[ni*VWN + 14][mi], aval, bpm[ni].sE); | |
cpm[ni*VWN + 15][mi] = MultiplyAddVector(cpm[ni*VWN + 15][mi], aval, bpm[ni].sF); | |
#endif | |
} | |
} | |
} | |
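// [editor's note] Stripped of vectorisation, the micro-kernel above is just the rank-1
// update of each thread's MWI x NWI register tile. Scalar reference, assuming
// VWM == VWN == 1:
#if 0
for (int ni = 0; ni < NWI; ++ni) {
  for (int mi = 0; mi < MWI; ++mi) {
    cpm[ni][mi] += apm[mi] * bpm[ni];
  }
}
#endif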
// ================================================================================================= | |
// Merges the results in Cpm with the global array in Cgm. This also performs the multiplication | |
// with the constants: Cgm = alpha*A*B + beta*Cgm = alpha*Cpm + beta*Cgm | |
inline void StoreResults(__global realM* cgm, realM cpm[NWI][MWI/VWM], const int kSizeM, | |
const real alpha, const real beta) { | |
#pragma unroll | |
for (int ni=0; ni<NWI; ++ni) { | |
#pragma unroll | |
for (int mi=0; mi<MWI/VWM; ++mi) { | |
#if STRM == 0 | |
int mg = mi + get_local_id(0)*(MWI/VWM); | |
#elif STRM == 1 | |
int mg = get_local_id(0) + mi*MDIMC; | |
#endif | |
#if STRN == 0 | |
int ng = ni + get_local_id(1)*NWI; | |
#elif STRN == 1 | |
int ng = ni%VWN + get_local_id(1)*VWN + (ni/VWN)*VWN*NDIMC; | |
#endif | |
int idm = mg + GetGroupID0() * (MWG/VWM); | |
int idn = ng + GetGroupID1() * NWG; | |
// The final multiplication with alpha and the addition with beta*C | |
int index = idn*(kSizeM/VWM) + idm; | |
realM result; | |
realM xval = cpm[ni][mi]; | |
realM yval = cgm[index]; | |
#if VWM == 1 | |
AXPBY(result, alpha, xval, beta, yval); | |
#elif VWM == 2 | |
AXPBY(result.x, alpha, xval.x, beta, yval.x); | |
AXPBY(result.y, alpha, xval.y, beta, yval.y); | |
#elif VWM == 4 | |
AXPBY(result.x, alpha, xval.x, beta, yval.x); | |
AXPBY(result.y, alpha, xval.y, beta, yval.y); | |
AXPBY(result.z, alpha, xval.z, beta, yval.z); | |
AXPBY(result.w, alpha, xval.w, beta, yval.w); | |
#elif VWM == 8 | |
AXPBY(result.s0, alpha, xval.s0, beta, yval.s0); | |
AXPBY(result.s1, alpha, xval.s1, beta, yval.s1); | |
AXPBY(result.s2, alpha, xval.s2, beta, yval.s2); | |
AXPBY(result.s3, alpha, xval.s3, beta, yval.s3); | |
AXPBY(result.s4, alpha, xval.s4, beta, yval.s4); | |
AXPBY(result.s5, alpha, xval.s5, beta, yval.s5); | |
AXPBY(result.s6, alpha, xval.s6, beta, yval.s6); | |
AXPBY(result.s7, alpha, xval.s7, beta, yval.s7); | |
#elif VWM == 16 | |
AXPBY(result.s0, alpha, xval.s0, beta, yval.s0); | |
AXPBY(result.s1, alpha, xval.s1, beta, yval.s1); | |
AXPBY(result.s2, alpha, xval.s2, beta, yval.s2); | |
AXPBY(result.s3, alpha, xval.s3, beta, yval.s3); | |
AXPBY(result.s4, alpha, xval.s4, beta, yval.s4); | |
AXPBY(result.s5, alpha, xval.s5, beta, yval.s5); | |
AXPBY(result.s6, alpha, xval.s6, beta, yval.s6); | |
AXPBY(result.s7, alpha, xval.s7, beta, yval.s7); | |
AXPBY(result.s8, alpha, xval.s8, beta, yval.s8); | |
AXPBY(result.s9, alpha, xval.s9, beta, yval.s9); | |
AXPBY(result.sA, alpha, xval.sA, beta, yval.sA); | |
AXPBY(result.sB, alpha, xval.sB, beta, yval.sB); | |
AXPBY(result.sC, alpha, xval.sC, beta, yval.sC); | |
AXPBY(result.sD, alpha, xval.sD, beta, yval.sD); | |
AXPBY(result.sE, alpha, xval.sE, beta, yval.sE); | |
AXPBY(result.sF, alpha, xval.sF, beta, yval.sF); | |
#endif | |
cgm[index] = result; | |
} | |
} | |
} | |
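// [editor's note] Each AXPBY above is assumed (from an earlier part of this source) to
// expand per component to result = alpha*xval + beta*yval, so the store implements the
// GEMM epilogue C = alpha*(A*B) + beta*C one realM vector at a time, at the same
// strided positions the accumulators were computed for.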
// ================================================================================================= | |
// Main body of the matrix-multiplication algorithm. It calls the (inlined) functions above. | |
inline void XgemmBody(const int kSizeM, const int kSizeN, const int kSizeK, | |
const __global realM* restrict agm, const __global realN* restrict bgm, | |
__global realM* cgm, realM cpm[NWI][MWI/VWM] | |
#if SA == 1 && SB == 1 | |
, __local realM* alm, __local realN* blm | |
#elif SA == 1 | |
, __local realM* alm | |
#elif SB == 1 | |
, __local realN* blm | |
#endif | |
) { | |
// Allocates workitem-private memory (registers) | |
realM apm[MWI/VWM]; | |
realN bpm[NWI/VWN]; | |
// Combined thread identifier (volatile to disable caching) | |
#if SA == 1 || SB == 1 | |
volatile int tid = get_local_id(0) + MDIMC*get_local_id(1); | |
#endif | |
// Initializes the accumulation registers | |
InitAccRegisters(cpm); | |
// Loops over all workgroup tiles | |
for (int kwg=0; kwg<kSizeK; kwg+=KWG) { | |
// Loads data: off-chip --> local (matrix A) | |
#if SA == 1 | |
GlobalToLocalA(agm, alm, kSizeM, tid, kwg); | |
#endif | |
// Loads data: off-chip --> local (matrix B) | |
#if SB == 1 | |
GlobalToLocalB(bgm, blm, kSizeN, tid, kwg); | |
#endif | |
#if SA == 1 || SB == 1 | |
barrier(CLK_LOCAL_MEM_FENCE); | |
#endif | |
// Loops over all workitem tiles, unrolled by a factor KWI | |
for (int pwi=0; pwi<KWG; pwi+=KWI) { | |
#pragma unroll | |
for (int pit=0; pit<KWI; ++pit) { | |
#if SA == 0 || SB == 0 | |
int idk = kwg + pwi + pit; | |
#endif | |
#if SA == 1 || SB == 1 | |
int kg = pwi+pit; | |
#endif | |
// Loads data: local --> private (matrix A) | |
#if SA == 1 | |
LocalToPrivateA(alm, apm, kg); | |
// Loads data: off-chip --> private (matrix A) | |
#else | |
GlobalToPrivateA(agm, apm, kSizeM, idk, kwg); | |
#endif | |
// Loads data: local --> private (matrix B) | |
#if SB == 1 | |
LocalToPrivateB(blm, bpm, kg); | |
// Loads data: off-chip --> private (matrix B) | |
#else | |
GlobalToPrivateB(bgm, bpm, kSizeN, idk); | |
#endif | |
// Performs the accumulation (Cpm += Apm * Bpm) | |
MultiplyAccumulate(cpm, apm, bpm); | |
} | |
} | |
#if SA == 1 || SB == 1 | |
barrier(CLK_LOCAL_MEM_FENCE); | |
#endif | |
} | |
#if GLOBAL_MEM_FENCE == 1 | |
barrier(CLK_GLOBAL_MEM_FENCE); | |
#endif | |
} | |
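// [editor's note] The control flow above, with the caching variants stripped away:
//   for (kwg = 0; kwg < kSizeK; kwg += KWG)         // workgroup tile over K
//     stage A and/or B tiles in local memory         // only if SA/SB == 1, with barriers
//     for (pwi = 0; pwi < KWG; pwi += KWI)           // work-item tile over K
//       for (pit = 0; pit < KWI; ++pit)              // fully unrolled inner steps
//         load apm/bpm, then MultiplyAccumulate(cpm, apm, bpm)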
// ================================================================================================= | |
// The upper-triangular and lower-triangular kernels are only used in special cases | |
#if defined(ROUTINE_SYRK) || defined(ROUTINE_HERK) || defined(ROUTINE_SYR2K) || defined(ROUTINE_HER2K) | |
// Main entry point of the kernel. This is the upper-triangular version. | |
__kernel __attribute__((reqd_work_group_size(MDIMC, NDIMC, 1))) | |
void XgemmUpper(const int kSizeN, const int kSizeK, | |
const __constant real* restrict arg_alpha, | |
const __constant real* restrict arg_beta, | |
const __global realM* restrict agm, | |
const __global realN* restrict bgm, | |
__global realM* cgm) { | |
const real alpha = arg_alpha[0]; | |
const real beta = arg_beta[0]; | |
  // Skips this workgroup early if its tile contributes nothing to the upper triangle
if (GetGroupID1()*NWG < GetGroupID0()*MWG) { | |
return; | |
} | |
// Allocates workgroup-private memory (local memory) | |
#if SA == 1 | |
__local realM alm[KWG * MWG/VWM]; | |
#endif | |
#if SB == 1 | |
__local realN blm[KWG * NWG/VWN]; | |
#endif | |
// Computes the matrix-multiplication and stores the result in register memory | |
realM cpm[NWI][MWI/VWM]; | |
#if SA == 1 && SB == 1 | |
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm, blm); | |
#elif SA == 1 | |
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm); | |
#elif SB == 1 | |
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, blm); | |
#else | |
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm); | |
#endif | |
// Stores an MWG * NWG tile of results and performs the multiplication with alpha and beta | |
StoreResults(cgm, cpm, kSizeN, alpha, beta); | |
} | |
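// [editor's note] The early-out above works at workgroup granularity: group
// (GetGroupID0(), GetGroupID1()) owns the C tile whose rows start at GetGroupID0()*MWG
// and whose columns start at GetGroupID1()*NWG, so (for matching tile sizes) a group
// whose column offset is below its row offset lies under the diagonal and can return
// before touching memory. The lower-triangular kernel below simply flips the
// comparison.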
// Main entry point of the kernel. This is the lower-triangular version. | |
__kernel __attribute__((reqd_work_group_size(MDIMC, NDIMC, 1))) | |
void XgemmLower(const int kSizeN, const int kSizeK, | |
const __constant real* restrict arg_alpha, | |
const __constant real* restrict arg_beta, | |
const __global realM* restrict agm, | |
const __global realN* restrict bgm, | |
__global realM* cgm) { | |
const real alpha = arg_alpha[0]; | |
const real beta = arg_beta[0]; | |
  // Skips this workgroup early if its tile contributes nothing to the lower triangle
if (GetGroupID1()*NWG > GetGroupID0()*MWG) { | |
return; | |
} | |
// Allocates workgroup-private memory (local memory) | |
#if SA == 1 | |
__local realM alm[KWG * MWG/VWM]; | |
#endif | |
#if SB == 1 | |
__local realN blm[KWG * NWG/VWN]; | |
#endif | |
// Computes the matrix-multiplication and stores the result in register memory | |
realM cpm[NWI][MWI/VWM]; | |
#if SA == 1 && SB == 1 | |
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm, blm); | |
#elif SA == 1 | |
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm); | |
#elif SB == 1 | |
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm, blm); | |
#else | |
XgemmBody(kSizeN, kSizeN, kSizeK, agm, bgm, cgm, cpm); | |
#endif | |
// Stores an MWG * NWG tile of results and performs the multiplication with alpha and beta | |
StoreResults(cgm, cpm, kSizeN, alpha, beta); | |
} | |
// ================================================================================================= | |
// If not using a triangular version, include the regular kernel | |
#else | |
// Main entry point of the kernel. This is the regular full version. | |
__kernel __attribute__((reqd_work_group_size(MDIMC, NDIMC, 1))) | |
void Xgemm(const int kSizeM, const int kSizeN, const int kSizeK, | |
const __constant real* restrict arg_alpha, | |
const __constant real* restrict arg_beta, | |
const __global realM* restrict agm, | |
const __global realN* restrict bgm, | |
__global realM* cgm) { | |
const real alpha = arg_alpha[0]; | |
const real beta = arg_beta[0]; | |
// Allocates workgroup-private memory (local memory) | |
#if SA == 1 | |
__local realM alm[KWG * MWG/VWM]; | |
#endif | |
#if SB == 1 | |
__local realN blm[KWG * NWG/VWN]; | |
#endif | |
// Computes the matrix-multiplication and stores the result in register memory | |
realM cpm[NWI][MWI/VWM]; | |
#if SA == 1 && SB == 1 | |
XgemmBody(kSizeM, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm, blm); | |
#elif SA == 1 | |
XgemmBody(kSizeM, kSizeN, kSizeK, agm, bgm, cgm, cpm, alm); | |
#elif SB == 1 | |
XgemmBody(kSizeM, kSizeN, kSizeK, agm, bgm, cgm, cpm, blm); | |
#else | |
XgemmBody(kSizeM, kSizeN, kSizeK, agm, bgm, cgm, cpm); | |
#endif | |
// Stores an MWG * NWG tile of results and performs the multiplication with alpha and beta | |
StoreResults(cgm, cpm, kSizeM, alpha, beta); | |
} | |
#endif | |
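/* [editor's note] A minimal host-side launch sketch for the regular Xgemm entry point
   above, under these assumptions: M, N and K are multiples of MWG, NWG and KWG; alpha
   and beta sit in small device buffers (the kernel takes __constant pointers); and
   `kernel`, `queue` and the cl_mem handles already exist. The work-group shape must
   match reqd_work_group_size(MDIMC, NDIMC, 1), giving one group per MWG x NWG tile of
   C. Error checking omitted; guarded out since this is host code, not kernel code. */
#if 0
size_t local_size[2]  = { MDIMC, NDIMC };
size_t global_size[2] = { (size_t)(M/MWG)*MDIMC, (size_t)(N/NWG)*NDIMC };
clSetKernelArg(kernel, 0, sizeof(int), &M);
clSetKernelArg(kernel, 1, sizeof(int), &N);
clSetKernelArg(kernel, 2, sizeof(int), &K);
clSetKernelArg(kernel, 3, sizeof(cl_mem), &alpha_buf);
clSetKernelArg(kernel, 4, sizeof(cl_mem), &beta_buf);
clSetKernelArg(kernel, 5, sizeof(cl_mem), &a_buf);
clSetKernelArg(kernel, 6, sizeof(cl_mem), &b_buf);
clSetKernelArg(kernel, 7, sizeof(cl_mem), &c_buf);
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, local_size, 0, NULL, NULL);
#endif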
// ================================================================================================= | |
// End of the C++11 raw string literal | |
F0829 11:54:00.688835 5841 greentea_math_functions.cpp:254] Check failed: static_cast<int>(status) == static_cast<int>(clblast::StatusCode::kSuccess) (-11 vs. 0) GREENTEA ERROR: CLBlast error | |
*** Check failure stack trace: *** | |
@ 0x76ea4c24 (unknown) | |
@ 0x76ea6cfc (unknown) | |
@ 0x76ea4814 (unknown) | |
@ 0x76ea7648 (unknown) | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
@ 0x74bdc138 caffe::greentea_gpu_gemm<>() | |
Makefile:657: recipe for target 'runtest' failed | |
make: *** [runtest] Aborted | |
iCar-G3-446:/dev/disk/mounted/sda1/opencl/caffe # |
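[editor's note] CLBlast mirrors the OpenCL status codes, so the -11 above most likely
maps to CL_BUILD_PROGRAM_FAILURE: the kernel source dumped above failed to compile on
this device's OpenCL compiler. A minimal host-side sketch for retrieving the device
compiler's build log, assuming `program` and `device` are the handles from the failing
clBuildProgram call:

    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    size_t log_size = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *build_log = malloc(log_size);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          log_size, build_log, NULL);
    fprintf(stderr, "%s\n", build_log);
    free(build_log);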