@taylanbil
Last active November 20, 2020 18:06
# TPU CLI
tpu=dlrm-init
TPU_IP_ADDRESS=`gcloud compute tpus describe --zone=europe-west4-a dlrm-init | grep ipAddress | cut -d ':' -f2 | head -1 | sed 's/ //g'`
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
python dlrm/dlrm_tpu_runner.py \
--arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
--arch-sparse-feature-size=64 \
--arch-mlp-bot=512-512-64 \
--arch-mlp-top=1024-1024-1024-1 \
--arch-interaction-op=dot \
--lr-num-warmup-steps 10 \
--lr-decay-start-step 10 \
--mini-batch-size=2048 \
--num-batches=1000 \
--data-generation='random' \
--numpy-rand-seed=727 \
--print-time \
--print-freq 100 \
--num-indices-per-lookup=100 \
--use-tpu \
--num-indices-per-lookup-fixed \
--tpu-model-parallel-group-len 8 \
--tpu-cores 8
# GPU CLI
cd dlrm
CUDA_VISIBLE_DEVICES=0,1,2,3 python dlrm_s_pytorch.py \
--mini-batch-size=2048 \
--test-mini-batch-size=16384 \
--test-num-workers=0 \
--num-batches=1000 \
--data-generation=random \
--arch-mlp-bot=512-512-64 \
--arch-mlp-top=1024-1024-1024-1 \
--arch-sparse-feature-size=64 \
--arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
--num-indices-per-lookup=100 \
--arch-interaction-op=dot \
--numpy-rand-seed=727 \
--print-freq=100 \
--print-time \
--enable-profiling \
--use-gpu
# NO BF16
Using 8 TPU core(s)...
XLA replica groups for Model Parallel:
[[0, 1, 2, 3, 4, 5, 6, 7]]
XLA replica groups for Model Parallel:
[[0], [1], [2], [3], [4], [5], [6], [7]]
TPU model-parallel mode, setting --drop-last=True
time/loss/accuracy (if enabled):
Finished training it 100/1000 of epoch 0, 364.48 ms/it, loss 0.084503, accuracy 0.000 %
Finished training it 200/1000 of epoch 0, 315.64 ms/it, loss 0.083122, accuracy 0.000 %
Finished training it 300/1000 of epoch 0, 326.83 ms/it, loss 0.082922, accuracy 0.000 %
Finished training it 400/1000 of epoch 0, 320.92 ms/it, loss 0.083768, accuracy 0.000 %
Finished training it 500/1000 of epoch 0, 330.18 ms/it, loss 0.082434, accuracy 0.000 %
Finished training it 600/1000 of epoch 0, 321.81 ms/it, loss 0.083133, accuracy 0.000 %
Finished training it 700/1000 of epoch 0, 319.91 ms/it, loss 0.083932, accuracy 0.000 %
Finished training it 800/1000 of epoch 0, 317.70 ms/it, loss 0.083939, accuracy 0.000 %
Finished training it 900/1000 of epoch 0, 322.93 ms/it, loss 0.082846, accuracy 0.000 %
Finished training it 1000/1000 of epoch 0, 319.75 ms/it, loss 0.082794, accuracy 0.000 %
# With BF16:
Using 8 TPU core(s)...
XLA replica groups for Model Parallel:
[[0, 1, 2, 3, 4, 5, 6, 7]]
XLA replica groups for Model Parallel:
[[0], [1], [2], [3], [4], [5], [6], [7]]
TPU model-parallel mode, setting --drop-last=True
time/loss/accuracy (if enabled):
Finished training it 100/1000 of epoch 0, 363.40 ms/it, loss 0.087402, accuracy 0.000 %
Finished training it 200/1000 of epoch 0, 320.97 ms/it, loss 0.086914, accuracy 0.000 %
Finished training it 300/1000 of epoch 0, 318.49 ms/it, loss 0.084961, accuracy 0.000 %
Finished training it 400/1000 of epoch 0, 321.91 ms/it, loss 0.085449, accuracy 0.000 %
Finished training it 500/1000 of epoch 0, 324.15 ms/it, loss 0.084961, accuracy 0.000 %
Finished training it 600/1000 of epoch 0, 321.54 ms/it, loss 0.083984, accuracy 0.000 %
Finished training it 700/1000 of epoch 0, 320.97 ms/it, loss 0.085449, accuracy 0.000 %
Finished training it 800/1000 of epoch 0, 323.22 ms/it, loss 0.085449, accuracy 0.000 %
Finished training it 900/1000 of epoch 0, 324.06 ms/it, loss 0.084961, accuracy 0.000 %
Finished training it 1000/1000 of epoch 0, 322.35 ms/it, loss 0.084473, accuracy 0.000 %
taylanbil commented Jun 25, 2020

NOTE THIS

The ms/it numbers are off on TPUs; there is probably a bug I introduced. However, I print a timestamp with every report, so please use those to approximate the TCO numbers. Sorry for the inconvenience.

Ignoring the ms/it reporting and timing the processes directly, here is how it looks:
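
For example, a minimal sketch (a hypothetical helper, not part of the gist) of deriving ms/it from the timestamps on report lines formatted like the ones below:

# Approximate ms/it from consecutive report timestamps instead of the built-in counter.
from datetime import datetime

def ms_per_it_from_log(lines, print_freq=100):
    # Assumes report lines ending in e.g. "... accuracy 0.000 % 2020-06-25 22:17:51.787500"
    stamps = [
        datetime.strptime(line.rsplit("% ", 1)[1].strip(), "%Y-%m-%d %H:%M:%S.%f")
        for line in lines
        if line.startswith("Finished training it")
    ]
    gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return [1000.0 * g / print_freq for g in gaps]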

TPU v3-8

Using 8 TPU core(s)...
XLA replica groups for Model Parallel:
         [[0, 1, 2, 3, 4, 5, 6, 7]]
XLA replica groups for Model Parallel:
         [[0], [1], [2], [3], [4], [5], [6], [7]]
TPU model-parallel mode, setting --drop-last=True
time/loss/accuracy (if enabled):
Finished training it 100/1000 of epoch 0, 144.76 ms/it, loss 0.084503, accuracy 0.000 % 2020-06-25 22:17:51.787500
Finished training it 200/1000 of epoch 0, 130.05 ms/it, loss 0.083122, accuracy 0.000 % 2020-06-25 22:18:23.687072
Finished training it 300/1000 of epoch 0, 243.49 ms/it, loss 0.082922, accuracy 0.000 % 2020-06-25 22:18:56.182165
Finished training it 400/1000 of epoch 0, 319.45 ms/it, loss 0.083768, accuracy 0.000 % 2020-06-25 22:19:28.656055
Finished training it 500/1000 of epoch 0, 331.90 ms/it, loss 0.082434, accuracy 0.000 % 2020-06-25 22:20:02.894406
Finished training it 600/1000 of epoch 0, 327.96 ms/it, loss 0.083133, accuracy 0.000 % 2020-06-25 22:20:35.901529
Finished training it 700/1000 of epoch 0, 329.52 ms/it, loss 0.083932, accuracy 0.000 % 2020-06-25 22:21:09.116600
Finished training it 800/1000 of epoch 0, 310.80 ms/it, loss 0.083939, accuracy 0.000 % 2020-06-25 22:21:41.494541
Finished training it 900/1000 of epoch 0, 326.32 ms/it, loss 0.082846, accuracy 0.000 % 2020-06-25 22:22:14.438292

GPU: 4 V100s (16 GB each)

Using 4 GPU(s)...
time/loss/accuracy (if enabled):
Finished training it 100/1000 of epoch 0, 68.60 ms/it, loss 0.084125, accuracy 0.000 % 2020-06-25 22:19:29.875367
Finished training it 200/1000 of epoch 0, 47.29 ms/it, loss 0.083529, accuracy 0.000 % 2020-06-25 22:20:34.964602
Finished training it 300/1000 of epoch 0, 47.75 ms/it, loss 0.083705, accuracy 0.000 % 2020-06-25 22:21:39.348142
Finished training it 400/1000 of epoch 0, 47.48 ms/it, loss 0.082972, accuracy 0.000 % 2020-06-25 22:22:43.165263
Finished training it 500/1000 of epoch 0, 47.96 ms/it, loss 0.083616, accuracy 0.000 % 2020-06-25 22:23:46.980054

@taylanbil

Two things to note:

  • The mini-batch size per device is 2x higher on the GPUs.
  • Each GPU holds 2 embedding tables, whereas each TPU core holds 1 (see the arithmetic sketch below).
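
A quick back-of-the-envelope sketch of both points (values copied from the CLI flags above; illustration only, not computed by the training scripts):

# Per-device mini-batch and embedding-table counts implied by the commands above.
global_batch = 2048
num_embedding_tables = 8  # --arch-embedding-size lists 8 tables of 1M rows each

for name, num_devices in [("4x V100", 4), ("TPU v3-8 (8 cores)", 8)]:
    per_device_batch = global_batch // num_devices            # 512 on GPU, 256 on TPU
    tables_per_device = num_embedding_tables // num_devices   # 2 on GPU, 1 on TPU
    print(f"{name}: batch/device={per_device_batch}, tables/device={tables_per_device}")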

@taylanbil

1 GPU vs 1 TPU

Note the change in mini-batch size and in --arch-embedding-size.

1 GPU: V100 (16 GB)

#!/bin/bash

cd dlrm
CUDA_VISIBLE_DEVICES=0 python dlrm_s_pytorch.py \
        --mini-batch-size=256 \
        --test-mini-batch-size=16384 \
        --test-num-workers=0 \
        --num-batches=1000 \
        --data-generation=random \
        --arch-mlp-bot=512-512-64 \
        --arch-mlp-top=1024-1024-1024-1 \
        --arch-sparse-feature-size=64 \
        --arch-embedding-size=1000000-1000000 \
        --num-indices-per-lookup=100 \
        --arch-interaction-op=dot \
        --numpy-rand-seed=727 \
        --print-freq=100 \
        --print-time \
        --enable-profiling \
        --use-gpu

Result:

taylanbil@dlrm-gpu:~$ ./dlrm-1gpu.sh
Using 1 GPU(s)...
time/loss/accuracy (if enabled):
Finished training it 100/1000 of epoch 0, 9.81 ms/it, loss 0.141502, accuracy 0.000 % 2020-06-25 22:32:50.847409
Finished training it 200/1000 of epoch 0, 6.36 ms/it, loss 0.084229, accuracy 0.000 % 2020-06-25 22:32:53.557824
Finished training it 300/1000 of epoch 0, 6.44 ms/it, loss 0.083750, accuracy 0.000 % 2020-06-25 22:32:56.273147
Finished training it 400/1000 of epoch 0, 6.26 ms/it, loss 0.083291, accuracy 0.000 % 2020-06-25 22:32:58.962694
Finished training it 500/1000 of epoch 0, 6.48 ms/it, loss 0.084722, accuracy 0.000 % 2020-06-25 22:33:01.682793
Finished training it 600/1000 of epoch 0, 6.55 ms/it, loss 0.083890, accuracy 0.000 % 2020-06-25 22:33:04.430844
Finished training it 700/1000 of epoch 0, 6.57 ms/it, loss 0.083577, accuracy 0.000 % 2020-06-25 22:33:07.169423
Finished training it 800/1000 of epoch 0, 6.56 ms/it, loss 0.084575, accuracy 0.000 % 2020-06-25 22:33:09.905148
Finished training it 900/1000 of epoch 0, 6.69 ms/it, loss 0.083511, accuracy 0.000 % 2020-06-25 22:33:12.649616
Finished training it 1000/1000 of epoch 0, 6.81 ms/it, loss 0.083468, accuracy 0.000 % 2020-06-25 22:33:15.417056

1 TPU core (v3)

python dlrm/dlrm_tpu_runner.py \
    --arch-embedding-size=1000000-1000000 \
    --arch-sparse-feature-size=64 \
    --arch-mlp-bot=512-512-64 \
    --arch-mlp-top=1024-1024-1024-1 \
    --arch-interaction-op=dot \
    --lr-num-warmup-steps 10 \
    --lr-decay-start-step 10 \
    --mini-batch-size=256 \
    --num-batches=1000 \
    --data-generation='random' \
    --numpy-rand-seed=727 \
    --print-time \
    --print-freq 100 \
    --num-indices-per-lookup=100 \
        --use-tpu \
        --num-indices-per-lookup-fixed \
        --tpu-model-parallel-group-len 1 \
        --tpu-cores=1

Result:

Using 1 TPU core(s)...
XLA replica groups for Model Parallel:
         [[0]]
XLA replica groups for Model Parallel:
         [[0]]
TPU data-parallel mode, setting --tpu-data-parallel to True
time/loss/accuracy (if enabled):
Finished training it 100/1000 of epoch 0, 40.63 ms/it, loss 0.147117, accuracy 0.000 % 2020-06-25 22:35:23.580818
Finished training it 200/1000 of epoch 0, 40.37 ms/it, loss 0.083852, accuracy 0.000 % 2020-06-25 22:35:28.628789
Finished training it 300/1000 of epoch 0, 40.52 ms/it, loss 0.083658, accuracy 0.000 % 2020-06-25 22:35:32.783742
Finished training it 400/1000 of epoch 0, 40.40 ms/it, loss 0.083472, accuracy 0.000 % 2020-06-25 22:35:36.926116
Finished training it 500/1000 of epoch 0, 40.40 ms/it, loss 0.083506, accuracy 0.000 % 2020-06-25 22:35:41.068098
Finished training it 600/1000 of epoch 0, 40.39 ms/it, loss 0.083928, accuracy 0.000 % 2020-06-25 22:35:45.208756
Finished training it 700/1000 of epoch 0, 40.29 ms/it, loss 0.083616, accuracy 0.000 % 2020-06-25 22:35:49.336625
Finished training it 800/1000 of epoch 0, 40.77 ms/it, loss 0.083280, accuracy 0.000 % 2020-06-25 22:35:53.515664
Finished training it 900/1000 of epoch 0, 40.89 ms/it, loss 0.083492, accuracy 0.000 % 2020-06-25 22:35:57.705103
Finished training it 1000/1000 of epoch 0, 40.47 ms/it, loss 0.083240, accuracy 0.000 % 2020-06-25 22:36:01.853866

@taylanbil

8 GPUs vs 8 TPUs

TPU (v3-8)

Same command as above, included for convenience:

python dlrm/dlrm_tpu_runner.py \
    --arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
    --arch-sparse-feature-size=64 \
    --arch-mlp-bot=512-512-64 \
    --arch-mlp-top=1024-1024-1024-1 \
    --arch-interaction-op=dot \
    --lr-num-warmup-steps 10 \
    --lr-decay-start-step 10 \
    --mini-batch-size=2048 \
    --num-batches=1000 \
    --data-generation='random' \
    --numpy-rand-seed=727 \
    --print-time \
    --print-freq 100 \
    --num-indices-per-lookup=100 \
        --use-tpu \
        --num-indices-per-lookup-fixed \
        --tpu-model-parallel-group-len 8 \
        --tpu-cores 8

Same results as above, included for convenience:

Using 8 TPU core(s)...
XLA replica groups for Model Parallel:
         [[0, 1, 2, 3, 4, 5, 6, 7]]
XLA replica groups for Model Parallel:
         [[0], [1], [2], [3], [4], [5], [6], [7]]
TPU model-parallel mode, setting --drop-last=True
time/loss/accuracy (if enabled):
Finished training it 100/1000 of epoch 0, 144.76 ms/it, loss 0.084503, accuracy 0.000 % 2020-06-25 22:17:51.787500
Finished training it 200/1000 of epoch 0, 130.05 ms/it, loss 0.083122, accuracy 0.000 % 2020-06-25 22:18:23.687072
Finished training it 300/1000 of epoch 0, 243.49 ms/it, loss 0.082922, accuracy 0.000 % 2020-06-25 22:18:56.182165
Finished training it 400/1000 of epoch 0, 319.45 ms/it, loss 0.083768, accuracy 0.000 % 2020-06-25 22:19:28.656055
Finished training it 500/1000 of epoch 0, 331.90 ms/it, loss 0.082434, accuracy 0.000 % 2020-06-25 22:20:02.894406
Finished training it 600/1000 of epoch 0, 327.96 ms/it, loss 0.083133, accuracy 0.000 % 2020-06-25 22:20:35.901529
Finished training it 700/1000 of epoch 0, 329.52 ms/it, loss 0.083932, accuracy 0.000 % 2020-06-25 22:21:09.116600
Finished training it 800/1000 of epoch 0, 310.80 ms/it, loss 0.083939, accuracy 0.000 % 2020-06-25 22:21:41.494541
Finished training it 900/1000 of epoch 0, 326.32 ms/it, loss 0.082846, accuracy 0.000 % 2020-06-25 22:22:14.438292

8 GPUs (V100, 16 GB each)

taylanbil@dlrm-gpu-8:~$ cat dlrm-bench.sh
#!/bin/bash

cd dlrm
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python dlrm_s_pytorch.py \
        --mini-batch-size=2048 \
        --test-mini-batch-size=16384 \
        --test-num-workers=0 \
        --num-batches=1000 \
        --data-generation=random \
        --arch-mlp-bot=512-512-64 \
        --arch-mlp-top=1024-1024-1024-1 \
        --arch-sparse-feature-size=64 \
        --arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
        --num-indices-per-lookup=100 \
        --arch-interaction-op=dot \
        --numpy-rand-seed=727 \
        --print-freq=100 \
        --print-time \
        --enable-profiling \
        --use-gpu

Results:

taylanbil@dlrm-gpu-8:~$ ./dlrm-bench.sh
Using 8 GPU(s)...
time/loss/accuracy (if enabled):
Finished training it 100/1000 of epoch 0, 138.55 ms/it, loss 0.084125, accuracy 0.000 % 2020-06-25 23:03:33.398652
Finished training it 200/1000 of epoch 0, 69.54 ms/it, loss 0.083529, accuracy 0.000 % 2020-06-25 23:04:38.280997
Finished training it 300/1000 of epoch 0, 71.71 ms/it, loss 0.083705, accuracy 0.000 % 2020-06-25 23:05:44.568711
Finished training it 400/1000 of epoch 0, 71.17 ms/it, loss 0.082972, accuracy 0.000 % 2020-06-25 23:06:49.885169
Finished training it 500/1000 of epoch 0, 72.14 ms/it, loss 0.083616, accuracy 0.000 % 2020-06-25 23:07:56.599056
Finished training it 600/1000 of epoch 0, 72.94 ms/it, loss 0.083187, accuracy 0.000 % 2020-06-25 23:09:03.127100
Finished training it 700/1000 of epoch 0, 72.18 ms/it, loss 0.083440, accuracy 0.000 % 2020-06-25 23:10:08.666714
Finished training it 800/1000 of epoch 0, 75.03 ms/it, loss 0.083453, accuracy 0.000 % 2020-06-25 23:11:14.998793
Finished training it 900/1000 of epoch 0, 76.54 ms/it, loss 0.083306, accuracy 0.000 % 2020-06-25 23:12:21.965458
Finished training it 1000/1000 of epoch 0, 75.13 ms/it, loss 0.083338, accuracy 0.000 % 2020-06-25 23:13:28.107878


taylanbil commented Jun 25, 2020

8 GPUs vs 8 TPUs, more iters

TPU (v3-8)

python dlrm/dlrm_tpu_runner.py \
    --arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
    --arch-sparse-feature-size=64 \
    --arch-mlp-bot=512-512-64 \
    --arch-mlp-top=1024-1024-1024-1 \
    --arch-interaction-op=dot \
    --lr-num-warmup-steps 10 \
    --lr-decay-start-step 10 \
    --mini-batch-size=2048 \
    --num-batches=10000 \
    --data-generation='random' \
    --numpy-rand-seed=727 \
    --print-time \
    --print-freq 1000 \
    --num-indices-per-lookup=100 \
        --use-tpu \
        --num-indices-per-lookup-fixed \
        --tpu-model-parallel-group-len 8 \
        --tpu-cores=8

Results:

Using 8 TPU core(s)...
XLA replica groups for Model Parallel:
         [[0, 1, 2, 3, 4, 5, 6, 7]]
XLA replica groups for Model Parallel:
         [[0], [1], [2], [3], [4], [5], [6], [7]]
TPU model-parallel mode, setting --drop-last=True
time/loss/accuracy (if enabled):
Finished training it 1000/10000 of epoch 0, 312.83 ms/it, loss 0.083339, accuracy 0.000 % 2020-06-25 23:10:44.639935
Finished training it 2000/10000 of epoch 0, 322.82 ms/it, loss 0.083453, accuracy 0.000 % 2020-06-25 23:16:09.488755
Finished training it 3000/10000 of epoch 0, 322.62 ms/it, loss 0.083233, accuracy 0.000 % 2020-06-25 23:21:37.548171
Finished training it 4000/10000 of epoch 0, 320.99 ms/it, loss 0.083312, accuracy 0.000 % 2020-06-25 23:27:01.543967
Finished training it 5000/10000 of epoch 0, 317.38 ms/it, loss 0.083520, accuracy 0.000 % 2020-06-25 23:32:26.462033
Finished training it 6000/10000 of epoch 0, 324.74 ms/it, loss 0.083305, accuracy 0.000 % 2020-06-25 23:37:55.277979
Finished training it 7000/10000 of epoch 0, 327.99 ms/it, loss 0.083492, accuracy 0.000 % 2020-06-25 23:43:25.281327
Finished training it 8000/10000 of epoch 0, 326.43 ms/it, loss 0.083512, accuracy 0.000 % 2020-06-25 23:48:54.851570

8 GPUs (V100, 16 GB each)

#!/bin/bash

cd dlrm
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python dlrm_s_pytorch.py \
        --mini-batch-size=2048 \
        --test-mini-batch-size=16384 \
        --test-num-workers=0 \
        --num-batches=10000 \
        --data-generation=random \
        --arch-mlp-bot=512-512-64 \
        --arch-mlp-top=1024-1024-1024-1 \
        --arch-sparse-feature-size=64 \
        --arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
        --num-indices-per-lookup=100 \
        --arch-interaction-op=dot \
        --numpy-rand-seed=727 \
        --print-freq=1000 \
        --print-time \
        --use-gpu

Results:

taylanbil@dlrm-gpu-8:~$ ./dlrm-bench-moreiter.sh
Using 8 GPU(s)...
time/loss/accuracy (if enabled):
Finished training it 1000/10000 of epoch 0, 57.12 ms/it, loss 0.083467, accuracy 0.000 % 2020-06-25 23:29:58.506048
Finished training it 2000/10000 of epoch 0, 29.53 ms/it, loss 0.083403, accuracy 0.000 % 2020-06-25 23:39:43.421000
Finished training it 3000/10000 of epoch 0, 31.34 ms/it, loss 0.083501, accuracy 0.000 % 2020-06-25 23:49:31.391754
Finished training it 4000/10000 of epoch 0, 31.31 ms/it, loss 0.083371, accuracy 0.000 % 2020-06-25 23:59:18.873084
Finished training it 5000/10000 of epoch 0, 32.68 ms/it, loss 0.083269, accuracy 0.000 % 2020-06-26 00:09:27.544340
Finished training it 6000/10000 of epoch 0, 32.95 ms/it, loss 0.083395, accuracy 0.000 % 2020-06-26 00:19:41.636447
Finished training it 7000/10000 of epoch 0, 31.33 ms/it, loss 0.083328, accuracy 0.000 % 2020-06-26 00:29:54.229305
Finished training it 8000/10000 of epoch 0, 28.39 ms/it, loss 0.083358, accuracy 0.000 % 2020-06-26 00:39:35.963765
Finished training it 9000/10000 of epoch 0, 28.38 ms/it, loss 0.083281, accuracy 0.000 % 2020-06-26 00:49:16.935803
Finished training it 10000/10000 of epoch 0, 28.44 ms/it, loss 0.083407, accuracy 0.000 % 2020-06-26 00:58:57.481942


taylanbil commented Jun 26, 2020

FIXING ms/it

As I noted in the first comment, ms/it reporting is problematic on TPUs. Because of torch_xla's asynchronous execution, I believe the current way ms/it is computed makes for an apples-to-oranges comparison of GPU vs TPU performance.
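
To illustrate (a standalone sketch, not taken from the DLRM code; it assumes a working torch_xla install with an attached TPU): with lazy/async execution, timing the Python call only measures graph recording, not device execution.

import time
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(2048, 64, device=device)

t0 = time.time()
y = (x @ x.t()).sum()   # only records the op into the lazy graph
t1 = time.time()        # small: device execution has not happened yet
xm.mark_step()          # cut the graph and dispatch it asynchronously
_ = y.item()            # forces materialization; the real cost lands here
t2 = time.time()
print(f"recorded-only: {1000*(t1-t0):.2f} ms, with execution: {1000*(t2-t0):.2f} ms")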

I added the following diff to make the comparison apples-to-apples (I don't think the current upstream way of measuring ms/it can be made apples-to-apples).

diff --git a/dlrm_s_pytorch.py b/dlrm_s_pytorch.py
index 344c167..7cb988f 100644
--- a/dlrm_s_pytorch.py
+++ b/dlrm_s_pytorch.py
@@ -934,7 +934,8 @@ if __name__ == "__main__":
                         iteration_time = 0
                     previous_iteration_time = current_time
                 else:
-                    t1 = time_wrap(use_gpu)
+                    if not j:
+                        t1 = time_wrap(use_gpu)

                 # early exit if nbatches was set by the user and has been exceeded
                 if nbatches > 0 and j >= nbatches:
@@ -986,7 +987,10 @@ if __name__ == "__main__":
                     total_time += iteration_time
                 else:
                     t2 = time_wrap(use_gpu)
+                    if j:
+                        print('ADDTIME', t2-t1, total_time, j, total_iter)
                     total_time += t2 - t1
+                    t1=t2
                 total_accu += A
                 total_loss += L * mbs
                 total_iter += 1
@@ -1002,6 +1006,7 @@ if __name__ == "__main__":
                 # print time, loss and accuracy
                 if should_print or should_test:
                     gT = 1000.0 * total_time / total_iter if args.print_time else -1
+                    print('time'.upper(), total_time, total_iter, j+1)
                     total_time = 0

                     gA = total_accu / total_samp
@@ -1011,11 +1016,12 @@ if __name__ == "__main__":
                     total_loss = 0

                     str_run_type = "inference" if args.inference_only else "training"
+                    from datetime import datetime
                     print(
                         "Finished {} it {}/{} of epoch {}, {:.2f} ms/it, ".format(
                             str_run_type, j + 1, nbatches, k, gT
                         )
-                        + "loss {:.6f}, accuracy {:3.3f} %".format(gL, gA * 100)
+                        + "loss {:.6f}, accuracy {:3.3f} % {}".format(gL, gA * 100, datetime.now())
                     )
                     # Uncomment the line below to print out the total time with overhead
                     # print("Accumulated time so far: {}" \

Here are the results

GPU (8 V100s, 16 GB each)

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python dlrm_s_pytorch.py \
        --mini-batch-size=2048 \
        --test-mini-batch-size=16384 \
        --test-num-workers=0 \
        --num-batches=1000 \
        --data-generation=random \
        --arch-mlp-bot=512-512-64 \
        --arch-mlp-top=1024-1024-1024-1 \
        --arch-sparse-feature-size=64 \
        --arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
        --num-indices-per-lookup=100 \
        --arch-interaction-op=dot \
        --numpy-rand-seed=727 \
        --print-freq=100 \
        --print-time \
        --enable-profiling \
        --use-gpu
taylanbil@dlrm-gpu-8:~$ ./dlrm-bench.sh | grep ^Fini
Finished training it 100/1000 of epoch 0, 705.09 ms/it, loss 0.084125, accuracy 0.000 % 2020-06-26 17:21:00.805777
Finished training it 200/1000 of epoch 0, 647.25 ms/it, loss 0.083529, accuracy 0.000 % 2020-06-26 17:22:05.530502
Finished training it 300/1000 of epoch 0, 642.23 ms/it, loss 0.083705, accuracy 0.000 % 2020-06-26 17:23:09.753057
Finished training it 400/1000 of epoch 0, 643.49 ms/it, loss 0.082972, accuracy 0.000 % 2020-06-26 17:24:14.102139
Finished training it 500/1000 of epoch 0, 645.93 ms/it, loss 0.083616, accuracy 0.000 % 2020-06-26 17:25:18.695563
Finished training it 600/1000 of epoch 0, 656.90 ms/it, loss 0.083187, accuracy 0.000 % 2020-06-26 17:26:24.385822
Finished training it 700/1000 of epoch 0, 644.19 ms/it, loss 0.083440, accuracy 0.000 % 2020-06-26 17:27:28.804989
Finished training it 800/1000 of epoch 0, 657.80 ms/it, loss 0.083453, accuracy 0.000 % 2020-06-26 17:28:34.585144
Finished training it 900/1000 of epoch 0, 654.89 ms/it, loss 0.083306, accuracy 0.000 % 2020-06-26 17:29:40.074329

TPU (v3-8)

python dlrm/dlrm_tpu_runner.py \
    --arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
    --arch-sparse-feature-size=64 \
    --arch-mlp-bot=512-512-64 \
    --arch-mlp-top=1024-1024-1024-1 \
    --arch-interaction-op=dot \
    --lr-num-warmup-steps 10 \
    --lr-decay-start-step 10 \
    --mini-batch-size=2048 \
    --num-batches=1000 \
    --data-generation='random' \
    --numpy-rand-seed=727 \
    --print-time \
    --print-freq 100 \
    --num-indices-per-lookup=100 \
        --use-tpu \
        --num-indices-per-lookup-fixed \
        --tpu-model-parallel-group-len 8 \
        --tpu-cores=8

Results:

Finished training it 100/1000 of epoch 0, 619.24 ms/it, loss 0.084503, accuracy 0.000 % 2020-06-26 17:32:41.667455
Finished training it 200/1000 of epoch 0, 435.55 ms/it, loss 0.083122, accuracy 0.000 % 2020-06-26 17:33:25.188651
Finished training it 300/1000 of epoch 0, 353.28 ms/it, loss 0.082922, accuracy 0.000 % 2020-06-26 17:34:00.517240
Finished training it 400/1000 of epoch 0, 357.82 ms/it, loss 0.083768, accuracy 0.000 % 2020-06-26 17:34:36.299376
Finished training it 500/1000 of epoch 0, 352.48 ms/it, loss 0.082434, accuracy 0.000 % 2020-06-26 17:35:11.547412
Finished training it 600/1000 of epoch 0, 352.33 ms/it, loss 0.083133, accuracy 0.000 % 2020-06-26 17:35:46.780538
Finished training it 700/1000 of epoch 0, 365.90 ms/it, loss 0.083932, accuracy 0.000 % 2020-06-26 17:36:23.372595
Finished training it 800/1000 of epoch 0, 353.92 ms/it, loss 0.083939, accuracy 0.000 % 2020-06-26 17:36:58.762597
Finished training it 900/1000 of epoch 0, 354.02 ms/it, loss 0.082846, accuracy 0.000 % 2020-06-26 17:37:34.164837
Finished training it 1000/1000 of epoch 0, 353.33 ms/it, loss 0.082794, accuracy 0.000 % 2020-06-26 17:38:09.497301

@dmudiger

Do you have the 1 GPU/TPU runs with this fix?


taylanbil commented Jun 26, 2020

Same config as in https://gist.github.com/taylanbil/c4841e7397bd875e3b0908f752fcc602#gistcomment-3354701:

Using 1 GPU(s)...
time/loss/accuracy (if enabled):
Finished training it 100/1000 of epoch 0, 27.40 ms/it, loss 0.141502, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:45:44.139930
Finished training it 200/1000 of epoch 0, 24.26 ms/it, loss 0.084229, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:45:46.565963
Finished training it 300/1000 of epoch 0, 24.68 ms/it, loss 0.083750, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:45:49.033689
Finished training it 400/1000 of epoch 0, 24.28 ms/it, loss 0.083291, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:45:51.462181
Finished training it 500/1000 of epoch 0, 24.49 ms/it, loss 0.084722, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:45:53.910777
Finished training it 600/1000 of epoch 0, 24.98 ms/it, loss 0.083890, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:45:56.408931
Finished training it 700/1000 of epoch 0, 24.70 ms/it, loss 0.083577, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:45:58.878910
Finished training it 800/1000 of epoch 0, 24.82 ms/it, loss 0.084575, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:46:01.360570
Finished training it 900/1000 of epoch 0, 24.83 ms/it, loss 0.083511, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:46:03.843675
Finished training it 1000/1000 of epoch 0, 24.77 ms/it, loss 0.083468, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:46:06.320895

vs

Using 1 TPU core(s)...
XLA replica groups for Model Parallel:
         [[0]]
XLA replica groups for Data Parallel:
         [[0]]
TPU data-parallel mode, setting --tpu-data-parallel to True
time/loss/accuracy (if enabled):
Finished training it 100/1000 of epoch 0, 61.36 ms/it, loss 0.147117, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:01:32.387126
Finished training it 200/1000 of epoch 0, 51.82 ms/it, loss 0.083852, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:01:37.533774
Finished training it 300/1000 of epoch 0, 42.16 ms/it, loss 0.083658, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:01:41.749390
Finished training it 400/1000 of epoch 0, 42.08 ms/it, loss 0.083472, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:01:45.957078
Finished training it 500/1000 of epoch 0, 42.34 ms/it, loss 0.083506, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:01:50.190714
Finished training it 600/1000 of epoch 0, 42.09 ms/it, loss 0.083928, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:01:54.399981
Finished training it 700/1000 of epoch 0, 41.97 ms/it, loss 0.083616, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:01:58.597187
Finished training it 800/1000 of epoch 0, 42.19 ms/it, loss 0.083280, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:02:02.816375
Finished training it 900/1000 of epoch 0, 42.50 ms/it, loss 0.083492, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:02:07.065817
Finished training it 1000/1000 of epoch 0, 42.44 ms/it, loss 0.083240, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:02:11.310073

Noting again that we typically do not make this comparison; we go by the equivalence 1 V100 = 2 TPU v3 cores.


taylanbil commented Nov 19, 2020

8 GPUs vs 8 TPUs, more iters

I added --num-workers to the GPU run and it is significantly faster; it turns out this workload was input bound:
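
(For illustration only, a standalone sketch that is not the DLRM data pipeline: with num_workers=0 the training process also generates every random batch, so the step loop can starve; num_workers>0 moves batch generation into DataLoader worker processes.)

import torch
from torch.utils.data import DataLoader, Dataset

class RandomBatches(Dataset):
    """Pre-batched synthetic data, standing in for --data-generation=random."""
    def __init__(self, num_batches, batch_size, dim):
        self.num_batches, self.batch_size, self.dim = num_batches, batch_size, dim
    def __len__(self):
        return self.num_batches
    def __getitem__(self, idx):
        return torch.randn(self.batch_size, self.dim)

# batch_size=None: each __getitem__ already returns a full batch.
loader = DataLoader(RandomBatches(1000, 2048, 64), batch_size=None, num_workers=9)
for batch in loader:
    pass  # the training step would consume `batch` here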

GPU

taylanbil@dlrm-gpu-8:~$ cat dlrm-bench-moreiter.sh 
#!/bin/bash

cd dlrm
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python dlrm_s_pytorch.py \
        --mini-batch-size=2048 \
        --test-mini-batch-size=16384 \
        --test-num-workers=0 \
        --num-workers=9 \
        --num-batches=10000 \
        --data-generation=random \
        --arch-mlp-bot=512-512-64 \
        --arch-mlp-top=1024-1024-1024-1 \
        --arch-sparse-feature-size=64 \
        --arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
        --num-indices-per-lookup=100 \
        --arch-interaction-op=dot \
        --numpy-rand-seed=727 \
        --print-freq=1000 \
        --print-time \
        --use-gpu
taylanbil@dlrm-gpu-8:~$ ./dlrm-bench-moreiter.sh 
Using 8 GPU(s)...
time/loss/accuracy (if enabled):  2020-11-19 21:20:24.774574
Finished training it 1000/10000 of epoch 0, 62.38 ms/it, loss 0.083458, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:21:44.984052
Finished training it 2000/10000 of epoch 0, 32.06 ms/it, loss 0.083541, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:22:33.474091
Finished training it 3000/10000 of epoch 0, 31.60 ms/it, loss 0.083576, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:23:20.998825
Finished training it 4000/10000 of epoch 0, 31.12 ms/it, loss 0.083544, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:24:07.823271
Finished training it 5000/10000 of epoch 0, 31.32 ms/it, loss 0.083297, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:24:54.423306
Finished training it 6000/10000 of epoch 0, 32.18 ms/it, loss 0.083065, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:25:42.277924
Finished training it 7000/10000 of epoch 0, 32.27 ms/it, loss 0.083232, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:26:30.194078
Finished training it 8000/10000 of epoch 0, 32.76 ms/it, loss 0.083562, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:27:17.811339
Finished training it 9000/10000 of epoch 0, 32.53 ms/it, loss 0.083362, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:28:05.319312
Finished training it 10000/10000 of epoch 0, 33.18 ms/it, loss 0.083401, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:28:53.578277

TPU

$ cat ./bench-tpu-v3-8-moresteps.sh
#!/bin/bash
pkill -9 python

#export XLA_USE_BF16=1
tpu=dlrm-init
data_path=
TPU_IP_ADDRESS=`gcloud compute tpus describe --zone=europe-west4-a dlrm-init | grep ipAddress | cut -d ':' -f2 | head -1 | sed 's/ //g'`
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
other="
 --test-num-workers=0 \
 --test-mini-batch-size=16384 \
    --data-size=$(( 512*300 )) \
"

python dlrm/dlrm_tpu_runner.py \
    --arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
    --arch-sparse-feature-size=64 \
    --arch-mlp-bot=512-512-64 \
    --arch-mlp-top=1024-1024-1024-1 \
    --arch-interaction-op=dot \
    --lr-num-warmup-steps 10 \
    --lr-decay-start-step 10 \
    --mini-batch-size=2048 \
    --num-batches=10000 \
    --data-generation='random' \
    --numpy-rand-seed=727 \
    --print-time \
    --print-freq 1000 \
    --num-workers 9 \
    --num-indices-per-lookup=100 \
        --use-tpu \
        --num-indices-per-lookup-fixed \
        --tpu-model-parallel-group-len 8 \
        --tpu-cores=8
Finished training it 1000/10000 of epoch 0, -1.00 ms/it, loss 0.083518, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:52:11.963543
Finished training it 2000/10000 of epoch 0, -1.00 ms/it, loss 0.083332, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:53:01.571697
Finished training it 3000/10000 of epoch 0, -1.00 ms/it, loss 0.083392, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:53:48.243824
Finished training it 4000/10000 of epoch 0, -1.00 ms/it, loss 0.083312, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:54:34.156202
Finished training it 5000/10000 of epoch 0, -1.00 ms/it, loss 0.083407, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:55:20.454105
Finished training it 6000/10000 of epoch 0, -1.00 ms/it, loss 0.083292, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:56:06.454427
Finished training it 7000/10000 of epoch 0, -1.00 ms/it, loss 0.083444, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:56:52.473373
Finished training it 8000/10000 of epoch 0, -1.00 ms/it, loss 0.083433, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:57:38.422858
Finished training it 9000/10000 of epoch 0, -1.00 ms/it, loss 0.083330, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:58:24.602985
Finished training it 10000/10000 of epoch 0, -1.00 ms/it, loss 0.083335, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:59:10.352625


shz0116 commented Nov 19, 2020

Still slower than TPU v3-8?

@taylanbil

Yes, I updated the comment above with TPU numbers. We get:

  • TPU: 46 seconds to do 1000 steps
  • GPU: 48 seconds to do 1000 steps

A much smaller gap than before. I think the initial gap was due to the workload being input bound, combined with the TPUs' multiprocessing design: they had 8 processes loading data versus the GPU side's 1. So the TPUs now seem to be faster, but by a smaller margin.
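
As a quick back-of-the-envelope check (arithmetic only, using the global mini-batch size of 2048 from the commands above):

steps, global_batch = 1000, 2048
for name, seconds in [("TPU v3-8", 46.0), ("8x V100", 48.0)]:
    print(f"{name}: {1000 * seconds / steps:.0f} ms/step, "
          f"{steps * global_batch / seconds:,.0f} samples/sec")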
