ML experiments can be very hard to reproduce: there are many hyperparameters, different dataset splits, different ways to preprocess your data, bugs, and so on. Ideally, for every single run you should log the (already preprocessed) data split, all hyperparameters (including the learning rate schedule), the initial state of your model and optimizer, the random seeds used for initialization and dataset shuffling, and all of your code. Your GPU should also run in deterministic mode (which is not the default). This is a hard task: a different random seed can significantly change your metrics, and even GPU-induced randomness can matter. We cannot solve all of these problems here, but we should address at least the ones we can handle.
For every result you report in the paper you need (at least) to:
- Track your model and optimizer hyperparameters (including learning rate schedule)
- Save final model parameters
- Report all of the parameters in the paper (make a table in the appendix) and release the code
- Set random seeds (it is not as simple as `torch.manual_seed(42)`, follow the link)
- Store everything in the cloud
To save your hyperparameters you can use the TensorBoard HParams plugin, but we recommend a specialized service such as wandb.ai. These services not only store all of your logs but also provide an easy interface for storing hyperparameters, code, and model files.
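Even without such a service, it costs almost nothing to snapshot the hyperparameters next to each run's outputs. A minimal stdlib-only sketch (the `args` namespace, `save_hparams` helper, and run directory are hypothetical):

```python
import json
import pathlib
from argparse import Namespace


def save_hparams(args: Namespace, run_dir: str) -> pathlib.Path:
    """Dump all hyperparameters to <run_dir>/hparams.json."""
    path = pathlib.Path(run_dir)
    path.mkdir(parents=True, exist_ok=True)
    out = path / "hparams.json"
    # vars() turns the argparse Namespace into a plain dict;
    # default=str keeps non-JSON values (e.g. paths) from crashing the dump
    out.write_text(json.dumps(vars(args), indent=2, sort_keys=True, default=str))
    return out


# Example with a hypothetical set of hyperparameters
import tempfile

run_dir = tempfile.mkdtemp()
args = Namespace(lr=3e-4, batch_size=32, seed=42)
hparams_file = save_hparams(args, run_dir)
```

A plain JSON file like this is also easy to re-load when you want to reconstruct a run months later.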
Ideally, also:
- Save the exact code you used (create a tag in your repository for each run)
- Save your preprocessed data, especially if you are working on a dataset paper (Data Version Control helps)
- Save your model and optimizer initialization (the state at step 0)
- Use the GPU in deterministic mode (this will slightly reduce performance)
- Store everything in the cloud and locally
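On newer PyTorch versions (1.8+) you can also request determinism globally instead of only through the cuDNN flags. A short sketch; note that with this setting any op without a deterministic implementation raises an error, and deterministic CUBLAS matmuls additionally require an environment variable:

```python
import os

import torch

# Required for deterministic CUBLAS matmuls on CUDA >= 10.2;
# must be set before the first CUDA matmul
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Raise an error whenever a nondeterministic op is used
torch.use_deterministic_algorithms(True)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```

This is stricter than the cuDNN flags alone: instead of silently running a nondeterministic kernel, PyTorch will tell you exactly which op breaks reproducibility.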
An easy way to do this:
```python
# Before the training:
import random

import numpy as np
import torch
import wandb

random.seed(args.seed)                 # python random generator
np.random.seed(args.seed)              # numpy random generator
torch.manual_seed(args.seed)
torch.cuda.manual_seed_all(args.seed)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

wandb.init(..., config=args)  # keep all your hyperparameters in args
# wandb also saves your code files and git repository automatically

checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'settings': args,
    'epoch': epoch,
    'step': global_step,
}
torch.save(checkpoint, save_path_for_init)
wandb.save(save_path_for_init)  # upload your initialization to wandb

# Your training loop is here

# After the training:
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'settings': args,
    'epoch': epoch,
    'step': global_step,
}
torch.save(checkpoint, save_path)
wandb.save(save_path)  # upload your trained model to wandb
```
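A checkpoint saved this way can later be restored into freshly constructed objects. A self-contained sketch with a hypothetical toy model (the real code would rebuild your actual model and optimizer from the saved settings first):

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical model and optimizer standing in for yours
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Save (same structure as the checkpoints above, minus 'settings')
save_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': 0,
    'step': 0,
}
torch.save(checkpoint, save_path)

# Restore: build the same architecture, then load the states into it
restored = nn.Linear(4, 2)
restored_opt = torch.optim.SGD(restored.parameters(), lr=0.1)
state = torch.load(save_path)
restored.load_state_dict(state['model'])
restored_opt.load_state_dict(state['optimizer'])
```

Saving `state_dict`s (rather than pickling the whole model object) keeps checkpoints loadable even after you refactor the surrounding code.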
At least keep all your hyperparameters for every run. Use specialized tools like wandb.ai or tensorboard.dev + TensorBoard HParams for this. Store them in the cloud, not on your machine.