
Memory Usage

Background

GPU memory is used in a few main ways:

  • Memory to store the network's parameters
  • Memory to store the network's gradients
  • Memory to store the activations of the current batch
  • Memory used by optimizers (momentum, Adam, etc.) that store running averages
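The contributions above (excluding activations, which depend on batch size and architecture) can be sketched as a rough lower bound; this is a minimal sketch assuming float32 storage and that the optimizer keeps two running averages per parameter, as Adam does:

```python
def training_memory_bytes(num_params, bytes_per_param=4, optimizer_states=2):
    """Rough lower bound on GPU memory for training, excluding activations.

    Counts parameters, gradients, and optimizer running averages
    (Adam keeps two per parameter; plain momentum keeps one).
    """
    weights = num_params * bytes_per_param
    grads = num_params * bytes_per_param
    opt = num_params * bytes_per_param * optimizer_states
    return weights + grads + opt

# e.g. a 100M-parameter model trained with Adam in float32:
print(training_memory_bytes(100_000_000) / 1e9)  # 1.6 GB, before activations
```

Activation memory usually dominates at large batch sizes, so treat this as a floor, not an estimate.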
bilal2vec / TPU.md
Last active October 15, 2019 21:36

TPU info

  • v2 and v3

    • v2 has 180 TFLOPS and 64 GB RAM (Colab uses v2s)
    • v3 has 420 TFLOPS and 128 GB RAM
  • two form factors: single TPUs or pods

    • single TPUs have 8 cores each

version: "2"

networks:
  gitea:
    external: false

services:
  server:
    image: gitea/gitea:latest
    environment:

Initialization

  • means and stddevs of activations should stay close to 0 and 1 to keep gradients from exploding or vanishing

  • with naive unit-variance weights, the activations of a layer end up with stddevs close to sqrt(num_input_channels)

  • so, to bring the stddevs back to 1, multiply the random weights by 1 / sqrt(c_in)

  • this works well for linear layers without activations, but still leads to vanishing or exploding gradients when combined with a tanh or sigmoid activation function

  • bias weights should be initialized to 0

  • initializations can be drawn from either a uniform distribution or a normal distribution

  • use Xavier initialization for sigmoid and softmax activations
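The 1 / sqrt(c_in) scaling above can be checked numerically; a minimal NumPy sketch (the layer width and sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
c_in = 512
x = rng.standard_normal((1000, c_in))  # unit-variance inputs

# Naive unit-variance weights: output stddev grows to ~sqrt(c_in)
w = rng.standard_normal((c_in, c_in))
print((x @ w).std())          # roughly sqrt(512) ~ 22.6

# Scaled weights: multiplying by 1 / sqrt(c_in) brings the stddev back to ~1
w_scaled = w / np.sqrt(c_in)
print((x @ w_scaled).std())   # roughly 1.0
```

Each output element is a sum of c_in independent unit-variance products, so its variance is c_in; dividing the weights by sqrt(c_in) divides the output stddev by the same factor.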

Try more architectures
Basic architectures are sometimes better
Try other forms of ensembling than CV
Blend with linear regression
Rely more on shakeup predictions
Make sure copied code is correct
Pay more attention to correlations between folds
Try not to extensively tune hyperparameters
Optimizing thresholds can lead to "brittle" models
Random initializations between folds might help diversity
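The "blend with linear regression" tip above can be sketched as follows; this is a toy example with synthetic targets and three hypothetical model predictions (all names and shapes are illustrative, and in practice the fit should use out-of-fold predictions to avoid leakage):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(200)  # hypothetical validation targets
# Three noisy "model" predictions stacked as columns
preds = np.stack([y + 0.5 * rng.standard_normal(200) for _ in range(3)], axis=1)

# Fit blend weights by least squares on the validation predictions
w, *_ = np.linalg.lstsq(preds, y, rcond=None)
blend = preds @ w

def mse(a, b):
    return ((a - b) ** 2).mean()

# The least-squares blend cannot do worse than the best single model,
# since each single model is itself a point in the search space
print(mse(blend, y) <= min(mse(preds[:, i], y) for i in range(3)))  # True
```

Because picking column i alone corresponds to the weight vector e_i, which least squares also considers, the blend's in-sample MSE is never above the best single model's.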

import os

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

# torch.autograd.Variable is deprecated since PyTorch 0.4;
# plain tensors track gradients directly now.