```yaml
version: "2"
networks:
  gitea:
    external: false
services:
  server:
    image: gitea/gitea:latest
    environment:
```
- means and stddevs of activations should stay close to 0 and 1 to prevent gradients from exploding or vanishing
- without correction, the activations of a layer have a stddev close to sqrt(c_in), where c_in is the number of input channels
- so, to bring the stddev back to 1, multiply the random weights by 1 / sqrt(c_in)
- this works well without activation functions, but still leads to vanishing or exploding gradients when combined with a tanh or sigmoid activation
- bias weights should be initialized to 0
- initializations can be drawn from either a uniform distribution or a normal distribution
- use Xavier initialization for sigmoid and softmax activations
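The 1 / sqrt(c_in) scaling above can be checked numerically. A minimal sketch with numpy (the dimensions here are arbitrary examples; PyTorch provides the same idea built in via `torch.nn.init.xavier_uniform_` and friends):

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out, batch = 512, 256, 4096

x = rng.standard_normal((batch, c_in))      # activations: mean ~0, stddev ~1
w_raw = rng.standard_normal((c_in, c_out))  # unscaled random weights
w_scaled = w_raw / np.sqrt(c_in)            # scale by 1 / sqrt(c_in)

std_raw = (x @ w_raw).std()        # grows to roughly sqrt(c_in) ~= 22.6
std_scaled = (x @ w_scaled).std()  # back near 1
```

Running this, `std_raw` comes out near sqrt(512) while `std_scaled` stays near 1, which is exactly the correction the notes describe.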
- Try more architectures
- Basic architectures are sometimes better
- Try other forms of ensembling besides CV (cross-validation) averaging
- Blend with linear regression
- Rely more on shakeup predictions
- Make sure copied code is correct
- Pay more attention to correlations between folds
- Try not to tune hyperparameters extensively
- Optimizing thresholds can lead to "brittle" models
- Random initializations between folds might help diversity
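"Blend with linear regression" can be sketched as fitting least-squares weights over the models' predictions. The data below is synthetic and the setup (two models, an intercept term) is an assumed illustration, not a prescribed recipe:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical out-of-fold predictions from two models, plus the true targets
y_true = rng.standard_normal(1000)
pred_a = y_true + 0.5 * rng.standard_normal(1000)  # model A, less noisy
pred_b = y_true + 0.7 * rng.standard_normal(1000)  # model B, noisier

# Fit blend weights (plus an intercept) by ordinary least squares
X = np.column_stack([pred_a, pred_b, np.ones_like(y_true)])
coef, *_ = np.linalg.lstsq(X, y_true, rcond=None)
blend = X @ coef

def rmse(p):
    return np.sqrt(np.mean((p - y_true) ** 2))
```

In-sample, the blended RMSE can never exceed that of either input model, since each model alone is a special case of the fit; in practice the weights should be fitted on out-of-fold predictions to avoid leakage.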
```python
import os
import numpy as np
import torch
import torchvision
from torch.autograd import Variable  # deprecated since PyTorch 0.4; plain tensors now track gradients
import torch.nn as nn
import torch.nn.functional as F
```