
Memory Usage

Background

GPU memory is used in a few main ways:

  • Memory to store the network's parameters
  • Memory to store the network's gradients
  • Memory to store the activations of the current batch
  • Memory used by optimizers (momentum, Adam, etc.) that store running averages
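The contributions above (excluding activations, which depend on batch size and architecture) can be sketched as a rough lower bound; this is a minimal sketch assuming float32 storage and that the optimizer keeps two running averages per parameter, as Adam does:

```python
def training_memory_bytes(num_params, bytes_per_param=4, optimizer_states=2):
    """Rough lower bound on GPU memory for training, excluding activations.

    Counts parameters, gradients, and optimizer running averages
    (Adam keeps two per parameter; plain momentum keeps one).
    """
    weights = num_params * bytes_per_param
    grads = num_params * bytes_per_param
    opt = num_params * bytes_per_param * optimizer_states
    return weights + grads + opt

# e.g. a 100M-parameter model trained with Adam in float32:
print(training_memory_bytes(100_000_000) / 1e9)  # 1.6 GB, before activations
```

Activation memory usually dominates at large batch sizes, so treat this as a floor, not an estimate.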
bilal2vec / TPU.md
Last active October 15, 2019 21:36

TPU info

  • v2 and v3

    • v2 has 180 TFLOPS and 64 GB RAM (Colab uses v2s)
    • v3 has 420 TFLOPS and 128 GB RAM
  • two form factors: single TPUs or pods

    • single TPUs have 8 cores each

version: "2"

networks:
  gitea:
    external: false

services:
  server:
    image: gitea/gitea:latest
    environment:

Initialization

  • means and stddevs of activations should stay close to 0 and 1 to keep gradients from exploding or vanishing

  • with naive unit-variance weights, the activations of a layer end up with stddevs close to sqrt(num_input_channels)

  • so, to bring the stddevs back to 1, multiply the random weights by 1 / sqrt(c_in)

  • this works well for linear layers without activations, but still leads to vanishing or exploding gradients when combined with a tanh or sigmoid activation function

  • bias weights should be initialized to 0

  • initializations can be drawn from either a uniform distribution or a normal distribution

  • use Xavier initialization for sigmoid and softmax activations
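The 1 / sqrt(c_in) scaling above can be checked numerically; a minimal NumPy sketch (the layer width and sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
c_in = 512
x = rng.standard_normal((1000, c_in))  # unit-variance inputs

# Naive unit-variance weights: output stddev grows to ~sqrt(c_in)
w = rng.standard_normal((c_in, c_in))
print((x @ w).std())          # roughly sqrt(512) ~ 22.6

# Scaled weights: multiplying by 1 / sqrt(c_in) brings the stddev back to ~1
w_scaled = w / np.sqrt(c_in)
print((x @ w_scaled).std())   # roughly 1.0
```

Each output element is a sum of c_in independent unit-variance products, so its variance is c_in; dividing the weights by sqrt(c_in) divides the output stddev by the same factor.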

Try more architectures
Basic architectures are sometimes better
Try other forms of ensembling than CV
Blend with linear regression
Rely more on shakeup predictions
Make sure copied code is correct
Pay more attention to correlations between folds
Try not to extensively tune hyperparameters
Optimizing thresholds can lead to "brittle" models
Random initializations between folds might help diversity
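The "blend with linear regression" tip above can be sketched as follows; this is a toy example with synthetic targets and three hypothetical model predictions (all names and shapes are illustrative, and in practice the fit should use out-of-fold predictions to avoid leakage):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(200)  # hypothetical validation targets
# Three noisy "model" predictions stacked as columns
preds = np.stack([y + 0.5 * rng.standard_normal(200) for _ in range(3)], axis=1)

# Fit blend weights by least squares on the validation predictions
w, *_ = np.linalg.lstsq(preds, y, rcond=None)
blend = preds @ w

def mse(a, b):
    return ((a - b) ** 2).mean()

# The least-squares blend cannot do worse than the best single model,
# since each single model is itself a point in the search space
print(mse(blend, y) <= min(mse(preds[:, i], y) for i in range(3)))  # True
```

Because picking column i alone corresponds to the weight vector e_i, which least squares also considers, the blend's in-sample MSE is never above the best single model's.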

import os

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

# torch.autograd.Variable is deprecated since PyTorch 0.4;
# plain tensors track gradients directly now.