$ CHAINER_DTYPE=float16 python train_ptb.py -d 0 -e 10
#vocab = 10000
epoch iteration perplexity val_perplexity
0 500 326440
0 1000 301342
1 1500 298940 inf
1 2000 334369
1 2500 334369
2 3000 306202 inf
2 3500 339762
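The log does not say why `val_perplexity` is reported as `inf`. One thing that can be checked independently is that perplexities of the magnitude shown (around 3e5) exceed the largest finite float16 value, 65504, so any such value stored in float16 becomes `inf`. The sketch below uses only the Python standard library; `to_float16` and `perplexity` are hypothetical helpers for illustration, not part of Chainer or `train_ptb.py`.

```python
import math
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE 754 binary16 value


def to_float16(value):
    """Round a Python float to float16 precision, mapping overflow to +/-inf.

    Hypothetical helper: struct's 'e' format raises OverflowError for values
    that do not fit in binary16, which we translate to infinity here.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except OverflowError:
        return math.copysign(math.inf, value)


def perplexity(mean_loss):
    """Perplexity as exp of the mean cross-entropy loss (assumed definition)."""
    return math.exp(mean_loss)


# Compare the double-precision perplexity with its float16 counterpart.
for loss in (5.0, 11.0, 12.7):
    p = perplexity(loss)
    print(f"loss={loss:5.1f}  float64={p:12.2f}  float16={to_float16(p)}")
```

Whether this is the actual cause in the run above is an assumption; the training perplexities are printed as finite numbers, which suggests they were accumulated in a wider type than the validation value.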
@takagi
takagi / nccl_broadcast.py
Created July 22, 2019 09:10
Test code for cupy.cuda.nccl.NcclCommunicator's broadcast method.
import multiprocessing
import cupy
from cupy import cuda
from cupy.cuda import nccl
from cupy import testing
def f(n_devices, device, comm_id, rank):
    device.use()
    comm = nccl.NcclCommunicator(n_devices, comm_id, rank)
takagi / out
Created November 29, 2020 04:30
This file has been truncated, but you can view the full file.
{
"_nodetype": "FileAST",
"coord": null,
"ext": [
{
"_nodetype": "Pragma",
"coord": "../utils/fake_libc_include/_fake_typedefs.h:56:9",
"string": "GCC diagnostic ignored \"-Wunused-function\""
},
{
takagi / README.md
Last active February 1, 2021 23:56

A tool to generate files for C extensions of CUDA-related libraries for CuPy. Currently covered are cuBLAS, cuSPARSE, and cuSOLVER, which have too many APIs to write their extensions by hand.

Usage

Generate files for all of the libraries
./gen.sh
takagi / diff.patch
Created December 21, 2022 01:11
Eliminate D2H sync on ascending flag
diff --git a/cupyx/scipy/interpolate/_interpolate.py b/cupyx/scipy/interpolate/_interpolate.py
index bab74671e..ec3c5bcac 100644
--- a/cupyx/scipy/interpolate/_interpolate.py
+++ b/cupyx/scipy/interpolate/_interpolate.py
@@ -22,7 +22,7 @@ INTERVAL_KERNEL = r'''
extern "C" {
__global__ void find_breakpoint_position(
const double* breakpoints, const double* x, long long* out,
- bool extrapolate, int total_x, int total_breakpoints, bool asc) {
+ bool extrapolate, int total_x, int total_breakpoints, const bool* pasc) {
takagi / test.cu
Created January 30, 2023 01:12
small kernels
#include <cassert>
#include <iostream>
#include <thread>
__global__ void vecAddOne(float *a, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n)
        a[id] += 1.0f;
}
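The host code that launches `vecAddOne` is not shown in this preview. As a rough illustration of the indexing scheme only, the following pure-Python sketch emulates on the CPU what each thread of the kernel does: compute a global index from block and thread indices, then skip the update when the index falls past the end of the array. The function name `vec_add_one` and the launch configuration (`block_dim = 4`, ceiling-division grid size) are assumptions made for this example.

```python
def vec_add_one(a, n, grid_dim, block_dim):
    """CPU emulation of the vecAddOne kernel over a 1-D grid of 1-D blocks."""
    for block_idx in range(grid_dim):          # plays the role of blockIdx.x
        for thread_idx in range(block_dim):    # plays the role of threadIdx.x
            idx = block_idx * block_dim + thread_idx
            if idx < n:                        # same bounds guard as the kernel
                a[idx] += 1.0


n = 10
block_dim = 4                                  # assumed block size
grid_dim = (n + block_dim - 1) // block_dim    # ceiling division: 3 blocks
data = [float(i) for i in range(n)]
vec_add_one(data, n, grid_dim, block_dim)
print(data)
```

With 3 blocks of 4 threads, 12 indices are generated for 10 elements, so the last two threads do nothing. That is the case the `if (id < n)` guard in the kernel exists for.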