ML experiments can be very hard to reproduce: there are many hyperparameters, different dataset splits, different ways to preprocess the data, bugs, and so on.
Ideally, for every single model run you should log the data split (already preprocessed), all hyperparameters (including the learning rate schedule), the initial state of your model and optimizer, the random seeds used for initialization and dataset shuffling, and all of your code. Your GPU should also run in deterministic mode (which is not the default). This is a very hard task: a different random seed can significantly change your metrics, and even GPU-induced randomness can matter. We're not solving all of these problems here, but we should address at least the ones we can handle; a sketch of fixing the randomness is shown below.
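As a concrete illustration, here is a minimal sketch of what fixing the randomness could look like in PyTorch. The seed value, the helper name, and the exact set of flags are assumptions; adapt them to your stack and PyTorch version.

```python
import random

import numpy as np
import torch


def set_determinism(seed: int = 42) -> None:
    # Seed every RNG that can affect training: Python, NumPy, and PyTorch (CPU + all GPUs).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Ask cuDNN to use deterministic kernels and disable its autotuner,
    # which may otherwise pick different algorithms between runs.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Optional on recent PyTorch versions: error out on ops that have no
    # deterministic implementation instead of silently being non-deterministic.
    torch.use_deterministic_algorithms(True)
```

Call `set_determinism()` at the very start of the script; if you use multiprocessing data loading, the DataLoader workers need to be seeded separately as well.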
For every result you report in the paper you need (at least) to:
- Track your model and optimizer hyperparameters (including learning rate schedule)
- Save final model parameters
- Report all of these parameters in the paper (see the sketch after this list)
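A minimal sketch of this bookkeeping, assuming PyTorch, a placeholder model, and a hand-rolled JSON config (experiment trackers like MLflow or Weights & Biases serve the same purpose), might look like this:

```python
import json

import torch
import torch.nn as nn

# Every hyperparameter that the reported result depends on, in one place.
config = {
    "seed": 42,
    "learning_rate": 3e-4,
    "lr_schedule": "cosine",
    "batch_size": 128,
    "epochs": 90,
    "optimizer": "AdamW",
    "weight_decay": 0.01,
}

model = nn.Linear(10, 2)  # placeholder; substitute your actual architecture
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=config["learning_rate"],
    weight_decay=config["weight_decay"],
)

# ... training loop goes here ...

# Log the hyperparameters next to the run so the paper can report all of them.
with open("run_config.json", "w") as f:
    json.dump(config, f, indent=2)

# Save the final model (and optimizer) state that produced the reported numbers.
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "config": config,
    },
    "final_checkpoint.pt",
)
```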