Andrew Shaw bearpelican

@bearpelican
bearpelican / 32gpu_93_apect_ratio.txt
Created July 9, 2018 05:19
4 p3.16xlarge machines. Validation by nearest aspect ratio. ~46.3 minutes
~~epoch hours top1Accuracy
Distributed: init_process_group success
Loaded model
Defined loss and optimizer
Created data loaders
Begin training
Changing LR from None to 1.4
~~0 0.01851892861111111 14.500
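A minimal sketch of what "validation by nearest aspect ratio" could look like (an assumption based on the description above, not the gist's actual code): sort the validation images by aspect ratio so that each batch contains images of similar shape, which lets each batch be resized to a shared rectangle with less distortion than a single square crop.

```python
# Hypothetical sketch: batch validation images by similar aspect ratio.
def batches_by_aspect_ratio(images, batch_size):
    """images: list of (name, width, height) tuples. Returns batches of names."""
    ranked = sorted(images, key=lambda im: im[1] / im[2])  # sort by w/h ratio
    return [
        [name for name, _, _ in ranked[i:i + batch_size]]
        for i in range(0, len(ranked), batch_size)
    ]

imgs = [("a", 400, 300), ("b", 300, 400), ("c", 500, 500), ("d", 640, 480)]
print(batches_by_aspect_ratio(imgs, 2))  # [['b', 'c'], ['a', 'd']]
```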
@bearpelican
bearpelican / 32gpu_93acc_50epochs_noar.txt
Created July 9, 2018 05:23
4x p3.16xlarge machines. Trained without nearest-aspect-ratio validation. Almost 1 hour. No BN0 (zero-initialized final BatchNorm weights). Linear LR increase.
~~epoch hours top1Accuracy
Distributed: init_process_group success
Loaded model
Defined loss and optimizer
Created data loaders
Begin training
Begin training loop: 1530911465.1864338
Prefetcher first preload complete
Received input: 3.8542351722717285
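The "Prefetcher first preload complete" line refers to a data prefetcher that overlaps loading the next batch with computation on the current one. The actual code likely uses CUDA streams to also overlap the host-to-GPU copy; this thread-based version is an illustrative assumption, not the gist's implementation:

```python
import threading
import queue

class Prefetcher:
    """Wrap an iterable so the next item is loaded on a background thread."""
    def __init__(self, loader, buffer_size=1):
        self.queue = queue.Queue(maxsize=buffer_size)
        self.done = object()  # sentinel marking end of the stream
        t = threading.Thread(target=self._worker, args=(loader,), daemon=True)
        t.start()

    def _worker(self, loader):
        for item in loader:
            self.queue.put(item)  # blocks when the buffer is full
        self.queue.put(self.done)

    def __iter__(self):
        while True:
            item = self.queue.get()
            if item is self.done:
                return
            yield item

print(list(Prefetcher(range(5))))  # [0, 1, 2, 3, 4]
```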
@bearpelican
bearpelican / 93model_no_aspect_ratio.txt
Created July 9, 2018 08:42
Log of our best model with original validation. Images resized by a factor of 1.14 (relative to the crop size), then center cropped.
Test: [0/391] Time 1.912 (1.912) Loss 1.3594 (1.3594) Prec@1 67.188 (67.188) Prec@5 86.719 (86.719)
Test: [10/391] Time 0.088 (0.400) Loss 1.0459 (0.9636) Prec@1 79.688 (75.000) Prec@5 89.062 (92.259)
Test: [20/391] Time 0.088 (0.392) Loss 0.9121 (1.0274) Prec@1 75.000 (73.772) Prec@5 95.312 (91.592)
Test: [30/391] Time 0.088 (0.350) Loss 0.8262 (1.0025) Prec@1 82.031 (74.320) Prec@5 93.750 (92.087)
Test: [40/391] Time 0.088 (0.357) Loss 1.0703 (0.9653) Prec@1 71.094 (75.305) Prec@5 92.969 (92.530)
Test: [50/391] Time 0.090 (0.337) Loss 1.2402 (1.0169) Prec@1 69.531 (74.357) Prec@5 92.969 (91.881)
Test: [60/391] Time 0.088 (0.340) Loss 1.7568 (1.0623) Prec@1 54.688 (73.335) Prec@5 83.594 (91.304)
Test: [70/391] Time 0.088 (0.331) Loss 1.1191 (1.0536) Prec@1 74.219 (73.537) Prec@5 89.844 (91.384)
Test: [80/391] Time 0.088 (0.331) Loss 0.9688 (1.0258) Prec@1 75.000 (74.199) Prec@5 90.625 (91.763)
Test: [90/391] Time 0.088 (0.324) Loss 1.0059 (1.0122) Prec@1 73.438 (74.511) Prec@5 93.750 (92.033)
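The 1.14 resize factor matches the common ImageNet validation convention of resizing the short side to roughly crop_size * 1.14 before center cropping (256 -> 224 is about the same ratio). A small sketch of the size arithmetic (the helper name is ours, not the gist's):

```python
def val_sizes(crop_size, ratio=1.14):
    """Return (resize_size, crop_size) for validation preprocessing."""
    return int(crop_size * ratio), crop_size

print(val_sizes(224))  # (255, 224)
print(val_sizes(128))  # (145, 128)
```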
@bearpelican
bearpelican / 93model_val_by_aspect_ratio.txt
Created July 9, 2018 08:47
Took the original model trained to 93% accuracy. Validated on batches of images grouped close to their original aspect ratio, also resized by a factor of 1.14.
Test: [0/391] Time 2.462 (2.462) Loss 1.0381 (1.0381) Prec@1 73.438 (73.438) Prec@5 91.406 (91.406)
Test: [10/391] Time 0.138 (0.494) Loss 0.9663 (0.8747) Prec@1 79.688 (77.202) Prec@5 92.188 (94.247)
Test: [20/391] Time 0.123 (0.463) Loss 0.9185 (0.9419) Prec@1 78.125 (75.930) Prec@5 94.531 (93.824)
Test: [30/391] Time 0.121 (0.408) Loss 0.7988 (0.9348) Prec@1 85.938 (76.260) Prec@5 92.969 (93.800)
Test: [40/391] Time 0.133 (0.411) Loss 1.0264 (0.9115) Prec@1 73.438 (77.115) Prec@5 93.750 (94.074)
Test: [50/391] Time 0.113 (0.387) Loss 1.1367 (0.9567) Prec@1 73.438 (76.149) Prec@5 91.406 (93.367)
Test: [60/391] Time 0.113 (0.386) Loss 1.6260 (0.9970) Prec@1 56.250 (75.128) Prec@5 85.938 (92.841)
Test: [70/391] Time 0.113 (0.373) Loss 1.0781 (0.9921) Prec@1 73.438 (75.253) Prec@5 92.969 (92.848)
Test: [80/391] Time 0.105 (0.371) Loss 0.8721 (0.9677) Prec@1 76.562 (75.791) Prec@5 93.750 (93.142)
Test: [90/391] Time 0.109 (0.361) Loss 0.8960 (0.9565) Prec@1 78.125 (75.953) Prec@5 96.094 (93.286)
# Creating a snapshot of the EBS volume that holds the ImageNet data (boto3).
import boto3

ec2 = boto3.resource('ec2')
v = ec2.Volume('vol-xxxxxxxxxxxxxxxxx')  # placeholder: the ImageNet data volume
snapshot = ec2.create_snapshot(
    Description='Imagenet data snapshot',
    VolumeId=v.id,
    TagSpecifications=[
        {
            'ResourceType': 'snapshot',
            'Tags': [
                {'Key': 'Name', 'Value': 'imagenet-data'},  # example tag; originals truncated
            ],
        },
    ],
)
#!/bin/bash
# This assumes base DLAMI - "Deep Learning AMI (Ubuntu) Version 12.0"
# YOU MUST RUN THESE COMMANDS BEFORE YOU RUN THIS SCRIPT
# conda create -n pytorch_source -y
# source activate pytorch_source
sudo rm -rf /usr/local/cuda
NCCL_RINGS="8 21 18 14 28 6 13 20 3 24 10 16 5 1 30 17 11 27 0 19 15 9 7 12 4 23 29 22 2 26 25 31 | 14 18 24 12 30 22 0 29 25 5 1 10 9 2 4 23 20 11 16 7 27 15 31 3 26 17 6 8 28 19 21 13 | 31 27 4 18 25 23 6 7 13 28 22 2 12 21 20 15 3 30 1 5 16 14 19 8 10 26 9 11 29 24 0 17 | 5 10 24 1 14 21 7 28 3 4 25 11 8 29 13 20 27 26 17 12 6 0 30 2 15 16 18 23 9 22 19 31"
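NCCL_RINGS pins four explicit communication rings over the 32 GPUs (4 machines x 8 GPUs each). Each pipe-separated ring must be a permutation of GPU indices 0-31; a quick sanity check for that invariant:

```python
# Verify each NCCL ring visits every GPU index 0-31 exactly once.
NCCL_RINGS = (
    "8 21 18 14 28 6 13 20 3 24 10 16 5 1 30 17 11 27 0 19 15 9 7 12 4 23 29 22 2 26 25 31 | "
    "14 18 24 12 30 22 0 29 25 5 1 10 9 2 4 23 20 11 16 7 27 15 31 3 26 17 6 8 28 19 21 13 | "
    "31 27 4 18 25 23 6 7 13 28 22 2 12 21 20 15 3 30 1 5 16 14 19 8 10 26 9 11 29 24 0 17 | "
    "5 10 24 1 14 21 7 28 3 4 25 11 8 29 13 20 27 26 17 12 6 0 30 2 15 16 18 23 9 22 19 31"
)

rings = [[int(g) for g in ring.split()] for ring in NCCL_RINGS.split("|")]
assert len(rings) == 4
assert all(sorted(r) == list(range(32)) for r in rings)
print("all 4 rings are valid permutations of GPUs 0-31")
```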
Changing LR from 2.188196721311475 to 2.1901597577529497
Epoch: [1][10/157] Time 0.381 (0.675) Data 0.001 (0.019) Loss 5.2887 (5.2825) Prec@1 7.458 (7.638) Prec@5 20.349 (20.150) bw 2.941 2.941
Epoch: [1][20/157] Time 0.379 (0.540) Data 0.001 (0.018) Loss 5.1616 (5.2322) Prec@1 9.119 (8.117) Prec@5 21.826 (20.883) bw 12.484 12.484
Epoch: [1][30/157] Time 0.380 (0.493) Data 0.001 (0.020) Loss 5.0941 (5.2052) Prec@1 9.253 (8.359) Prec@5 23.938 (21.450) bw 13.183 13.183
Epoch: [1][40/157] Time 0.381 (0.470) Data 0.001 (0.020) Loss 5.0707 (5.1734) Prec@1 9.363 (8.611) Pr
Namespace(arch='resnet50', batch_sched='512,192,128', data='/home/ubuntu/data/imagenet', dist_backend='nccl', dist_url='file:///home/ubuntu/data/file.sync', distributed=True, epochs=35, evaluate=False, fp16=True, init_bn0=True, local_rank=2, logdir='/efs/runs/one_machine_e35_nobnwd.03', loss_scale=1024.0, lr=1.0, lr_linear_scale=True, lr_sched='0.14,0.47,0.78,0.95', momentum=0.9, no_bn_wd=True, pretrained=False, print_freq=10, prof=False, resize_sched='0.4,0.92', resume='', save_dir='/home/ubuntu/data/training/nv/2018-08-01_22-38-one_machine_e35_nobnwd-w8', start_epoch=0, val_ar=True, weight_decay=0.0001, workers=8, world_size=8)
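The lr_sched values read as fractions of total training at which the learning rate likely changes; the "Changing LR from 2.188... to 2.190..." lines suggest per-iteration linear warmup toward the peak. The exact semantics are not spelled out by the flags, so this is one plausible interpretation, sketched for illustration: warm up linearly until the first milestone, then divide by 10 at each later milestone.

```python
def lr_at(progress, base_lr=1.0, milestones=(0.14, 0.47, 0.78, 0.95)):
    """progress: fraction of training completed, in [0, 1].
    Warm up linearly to base_lr, then decay 10x at each later milestone."""
    if progress < milestones[0]:
        return base_lr * progress / milestones[0]  # linear warmup
    drops = sum(progress >= m for m in milestones[1:])
    return base_lr / (10 ** drops)

print(lr_at(0.07))  # ~0.5: halfway through warmup
print(lr_at(0.20))  # 1.0: at peak, before the first decay
print(lr_at(0.50))  # 0.1: after the 0.47 milestone
```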
~~epoch hours top1Accuracy
Distributed: initializing process group
Distributed: success (2/8)
Loading model
Creating data loaders (this could take 6-12 minutes)
Begin training
Dataset changed.
Image size: 128
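"Dataset changed. Image size: 128" reflects progressive resizing: resize_sched='0.4,0.92' presumably switches to a larger image size at those fractions of the 35 epochs. A sketch of that schedule; the sizes (128, 224, 288) and the threshold semantics are assumptions for illustration:

```python
def image_size_at(epoch, total_epochs=35, sched=(0.4, 0.92), sizes=(128, 224, 288)):
    """Pick the training image size for an epoch under a progressive-resize schedule."""
    frac = epoch / total_epochs
    phase = sum(frac >= s for s in sched)  # how many thresholds have passed
    return sizes[phase]

print([image_size_at(e) for e in (0, 13, 14, 31, 33)])  # [128, 128, 224, 224, 288]
```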
import argparse
import os
import shutil
import time
import torch
import torch.nn as nn
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import math
import torch.utils.model_zoo as model_zoo
__all__ = ['ResNet', 'resnet18', 'resnet34', 'resnet50', 'resnet101',
'resnet152']
model_urls = {