Andrew Shaw bearpelican

@bearpelican
bearpelican / 32gpu_93_apect_ratio.txt
Created July 9, 2018 05:19
4 p3.16xlarge machines. Validation by nearest aspect ratio. ~46.3 minutes
~~epoch hours top1Accuracy
Distributed: init_process_group success
Loaded model
Defined loss and optimizer
Created data loaders
Begin training
Changing LR from None to 1.4
~~0 0.01851892861111111 14.500
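A minimal sketch of what "validation by nearest aspect ratio" could look like (an assumption based on the description above, not the gist's actual code): sort the validation images by aspect ratio so that each batch contains images of similar shape, which lets each batch be resized to a shared rectangle with less distortion than a single square crop.

```python
# Hypothetical sketch: batch validation images by similar aspect ratio.
def batches_by_aspect_ratio(images, batch_size):
    """images: list of (name, width, height) tuples. Returns batches of names."""
    ranked = sorted(images, key=lambda im: im[1] / im[2])  # sort by w/h ratio
    return [
        [name for name, _, _ in ranked[i:i + batch_size]]
        for i in range(0, len(ranked), batch_size)
    ]

imgs = [("a", 400, 300), ("b", 300, 400), ("c", 500, 500), ("d", 640, 480)]
print(batches_by_aspect_ratio(imgs, 2))  # [['b', 'c'], ['a', 'd']]
```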
@bearpelican
bearpelican / 32gpu_93acc_50epochs_noar.txt
Created July 9, 2018 05:23
4x p3.16xlarge machines. Trained without nearest-aspect-ratio validation. Almost 1 hour. No BN0 (zero-initialized final BatchNorm weights). Linear LR increase.
~~epoch hours top1Accuracy
Distributed: init_process_group success
Loaded model
Defined loss and optimizer
Created data loaders
Begin training
Begin training loop: 1530911465.1864338
Prefetcher first preload complete
Received input: 3.8542351722717285
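The "Prefetcher first preload complete" line refers to a data prefetcher that overlaps loading the next batch with computation on the current one. The actual code likely uses CUDA streams to also overlap the host-to-GPU copy; this thread-based version is an illustrative assumption, not the gist's implementation:

```python
import threading
import queue

class Prefetcher:
    """Wrap an iterable so the next item is loaded on a background thread."""
    def __init__(self, loader, buffer_size=1):
        self.queue = queue.Queue(maxsize=buffer_size)
        self.done = object()  # sentinel marking end of the stream
        t = threading.Thread(target=self._worker, args=(loader,), daemon=True)
        t.start()

    def _worker(self, loader):
        for item in loader:
            self.queue.put(item)  # blocks when the buffer is full
        self.queue.put(self.done)

    def __iter__(self):
        while True:
            item = self.queue.get()
            if item is self.done:
                return
            yield item

print(list(Prefetcher(range(5))))  # [0, 1, 2, 3, 4]
```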
@bearpelican
bearpelican / 93model_no_aspect_ratio.txt
Created July 9, 2018 08:42
Log of our best model with original validation. Images resized by a factor of 1.14 (relative to the crop size), then center cropped.
Test: [0/391] Time 1.912 (1.912) Loss 1.3594 (1.3594) Prec@1 67.188 (67.188) Prec@5 86.719 (86.719)
Test: [10/391] Time 0.088 (0.400) Loss 1.0459 (0.9636) Prec@1 79.688 (75.000) Prec@5 89.062 (92.259)
Test: [20/391] Time 0.088 (0.392) Loss 0.9121 (1.0274) Prec@1 75.000 (73.772) Prec@5 95.312 (91.592)
Test: [30/391] Time 0.088 (0.350) Loss 0.8262 (1.0025) Prec@1 82.031 (74.320) Prec@5 93.750 (92.087)
Test: [40/391] Time 0.088 (0.357) Loss 1.0703 (0.9653) Prec@1 71.094 (75.305) Prec@5 92.969 (92.530)
Test: [50/391] Time 0.090 (0.337) Loss 1.2402 (1.0169) Prec@1 69.531 (74.357) Prec@5 92.969 (91.881)
Test: [60/391] Time 0.088 (0.340) Loss 1.7568 (1.0623) Prec@1 54.688 (73.335) Prec@5 83.594 (91.304)
Test: [70/391] Time 0.088 (0.331) Loss 1.1191 (1.0536) Prec@1 74.219 (73.537) Prec@5 89.844 (91.384)
Test: [80/391] Time 0.088 (0.331) Loss 0.9688 (1.0258) Prec@1 75.000 (74.199) Prec@5 90.625 (91.763)
Test: [90/391] Time 0.088 (0.324) Loss 1.0059 (1.0122) Prec@1 73.438 (74.511) Prec@5 93.750 (92.033)
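The 1.14 resize factor matches the common ImageNet validation convention of resizing the short side to roughly crop_size * 1.14 before center cropping (256 -> 224 is about the same ratio). A small sketch of the size arithmetic (the helper name is ours, not the gist's):

```python
def val_sizes(crop_size, ratio=1.14):
    """Return (resize_size, crop_size) for validation preprocessing."""
    return int(crop_size * ratio), crop_size

print(val_sizes(224))  # (255, 224)
print(val_sizes(128))  # (145, 128)
```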
@bearpelican
bearpelican / 93model_val_by_aspect_ratio.txt
Created July 9, 2018 08:47
Took the original model trained to 93% accuracy. Validated on batches of images grouped close to their original aspect ratio, also resized by a factor of 1.14.
Test: [0/391] Time 2.462 (2.462) Loss 1.0381 (1.0381) Prec@1 73.438 (73.438) Prec@5 91.406 (91.406)
Test: [10/391] Time 0.138 (0.494) Loss 0.9663 (0.8747) Prec@1 79.688 (77.202) Prec@5 92.188 (94.247)
Test: [20/391] Time 0.123 (0.463) Loss 0.9185 (0.9419) Prec@1 78.125 (75.930) Prec@5 94.531 (93.824)
Test: [30/391] Time 0.121 (0.408) Loss 0.7988 (0.9348) Prec@1 85.938 (76.260) Prec@5 92.969 (93.800)
Test: [40/391] Time 0.133 (0.411) Loss 1.0264 (0.9115) Prec@1 73.438 (77.115) Prec@5 93.750 (94.074)
Test: [50/391] Time 0.113 (0.387) Loss 1.1367 (0.9567) Prec@1 73.438 (76.149) Prec@5 91.406 (93.367)
Test: [60/391] Time 0.113 (0.386) Loss 1.6260 (0.9970) Prec@1 56.250 (75.128) Prec@5 85.938 (92.841)
Test: [70/391] Time 0.113 (0.373) Loss 1.0781 (0.9921) Prec@1 73.438 (75.253) Prec@5 92.969 (92.848)
Test: [80/391] Time 0.105 (0.371) Loss 0.8721 (0.9677) Prec@1 76.562 (75.791) Prec@5 93.750 (93.142)
Test: [90/391] Time 0.109 (0.361) Loss 0.8960 (0.9565) Prec@1 78.125 (75.953) Prec@5 96.094 (93.286)
# Creating a snapshot of the EBS volume that holds the ImageNet data (boto3).
import boto3

ec2 = boto3.resource('ec2')
v = ec2.Volume('vol-xxxxxxxxxxxxxxxxx')  # placeholder: the ImageNet data volume
snapshot = ec2.create_snapshot(
    Description='Imagenet data snapshot',
    VolumeId=v.id,
    TagSpecifications=[
        {
            'ResourceType': 'snapshot',
            'Tags': [
                {'Key': 'Name', 'Value': 'imagenet-data'},  # example tag; originals truncated
            ],
        },
    ],
)
#!/bin/bash
# This assumes base DLAMI - "Deep Learning AMI (Ubuntu) Version 12.0"
# YOU MUST RUN THESE COMMANDS BEFORE YOU RUN THIS SCRIPT
# conda create -n pytorch_source -y
# source activate pytorch_source
sudo rm -rf /usr/local/cuda
NCCL_RINGS="8 21 18 14 28 6 13 20 3 24 10 16 5 1 30 17 11 27 0 19 15 9 7 12 4 23 29 22 2 26 25 31 | 14 18 24 12 30 22 0 29 25 5 1 10 9 2 4 23 20 11 16 7 27 15 31 3 26 17 6 8 28 19 21 13 | 31 27 4 18 25 23 6 7 13 28 22 2 12 21 20 15 3 30 1 5 16 14 19 8 10 26 9 11 29 24 0 17 | 5 10 24 1 14 21 7 28 3 4 25 11 8 29 13 20 27 26 17 12 6 0 30 2 15 16 18 23 9 22 19 31"
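NCCL_RINGS pins four explicit communication rings over the 32 GPUs (4 machines x 8 GPUs each). Each pipe-separated ring must be a permutation of GPU indices 0-31; a quick sanity check for that invariant:

```python
# Verify each NCCL ring visits every GPU index 0-31 exactly once.
NCCL_RINGS = (
    "8 21 18 14 28 6 13 20 3 24 10 16 5 1 30 17 11 27 0 19 15 9 7 12 4 23 29 22 2 26 25 31 | "
    "14 18 24 12 30 22 0 29 25 5 1 10 9 2 4 23 20 11 16 7 27 15 31 3 26 17 6 8 28 19 21 13 | "
    "31 27 4 18 25 23 6 7 13 28 22 2 12 21 20 15 3 30 1 5 16 14 19 8 10 26 9 11 29 24 0 17 | "
    "5 10 24 1 14 21 7 28 3 4 25 11 8 29 13 20 27 26 17 12 6 0 30 2 15 16 18 23 9 22 19 31"
)

rings = [[int(g) for g in ring.split()] for ring in NCCL_RINGS.split("|")]
assert len(rings) == 4
assert all(sorted(r) == list(range(32)) for r in rings)
print("all 4 rings are valid permutations of GPUs 0-31")
```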
Changing LR from 2.188196721311475 to 2.1901597577529497
Epoch: [1][10/157] Time 0.381 (0.675) Data 0.001 (0.019) Loss 5.2887 (5.2825) Prec@1 7.458 (7.638) Prec@5 20.349 (20.150) bw 2.941 2.941
Epoch: [1][20/157] Time 0.379 (0.540) Data 0.001 (0.018) Loss 5.1616 (5.2322) Prec@1 9.119 (8.117) Prec@5 21.826 (20.883) bw 12.484 12.484
Epoch: [1][30/157] Time 0.380 (0.493) Data 0.001 (0.020) Loss 5.0941 (5.2052) Prec@1 9.253 (8.359) Prec@5 23.938 (21.450) bw 13.183 13.183
Epoch: [1][40/157] Time 0.381 (0.470) Data 0.001 (0.020) Loss 5.0707 (5.1734) Prec@1 9.363 (8.611) Pr
Namespace(arch='resnet50', batch_sched='512,192,128', data='/home/ubuntu/data/imagenet', dist_backend='nccl', dist_url='file:///home/ubuntu/data/file.sync', distributed=True, epochs=35, evaluate=False, fp16=True, init_bn0=True, local_rank=2, logdir='/efs/runs/one_machine_e35_nobnwd.03', loss_scale=1024.0, lr=1.0, lr_linear_scale=True, lr_sched='0.14,0.47,0.78,0.95', momentum=0.9, no_bn_wd=True, pretrained=False, print_freq=10, prof=False, resize_sched='0.4,0.92', resume='', save_dir='/home/ubuntu/data/training/nv/2018-08-01_22-38-one_machine_e35_nobnwd-w8', start_epoch=0, val_ar=True, weight_decay=0.0001, workers=8, world_size=8)
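The lr_sched values read as fractions of total training at which the learning rate likely changes; the "Changing LR from 2.188... to 2.190..." lines suggest per-iteration linear warmup toward the peak. The exact semantics are not spelled out by the flags, so this is one plausible interpretation, sketched for illustration: warm up linearly until the first milestone, then divide by 10 at each later milestone.

```python
def lr_at(progress, base_lr=1.0, milestones=(0.14, 0.47, 0.78, 0.95)):
    """progress: fraction of training completed, in [0, 1].
    Warm up linearly to base_lr, then decay 10x at each later milestone."""
    if progress < milestones[0]:
        return base_lr * progress / milestones[0]  # linear warmup
    drops = sum(progress >= m for m in milestones[1:])
    return base_lr / (10 ** drops)

print(lr_at(0.07))  # ~0.5: halfway through warmup
print(lr_at(0.20))  # 1.0: at peak, before the first decay
print(lr_at(0.50))  # 0.1: after the 0.47 milestone
```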
~~epoch hours top1Accuracy
Distributed: initializing process group
Distributed: success (2/8)
Loading model
Creating data loaders (this could take 6-12 minutes)
Begin training
Dataset changed.
Image size: 128
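"Dataset changed. Image size: 128" reflects progressive resizing: resize_sched='0.4,0.92' presumably switches to a larger image size at those fractions of the 35 epochs. A sketch of that schedule; the sizes (128, 224, 288) and the threshold semantics are assumptions for illustration:

```python
def image_size_at(epoch, total_epochs=35, sched=(0.4, 0.92), sizes=(128, 224, 288)):
    """Pick the training image size for an epoch under a progressive-resize schedule."""
    frac = epoch / total_epochs
    phase = sum(frac >= s for s in sched)  # how many thresholds have passed
    return sizes[phase]

print([image_size_at(e) for e in (0, 13, 14, 31, 33)])  # [128, 128, 224, 224, 288]
```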
import argparse
import os
import shutil
import time
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import math
import torch.utils.model_zoo as model_zoo
__all__ = ['ResNet', 'resnet18', 'resnet34', 'resnet50', 'resnet101',
'resnet152']
model_urls = {