# TPU CLI
tpu=dlrm-init
TPU_IP_ADDRESS=`gcloud compute tpus describe --zone=europe-west4-a dlrm-init | grep ipAddress | cut -d ':' -f2 | head -1 | sed 's/ //g'`
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
python dlrm/dlrm_tpu_runner.py \
--arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
--arch-sparse-feature-size=64 \
--arch-mlp-bot=512-512-64 \
--arch-mlp-top=1024-1024-1024-1 \
--arch-interaction-op=dot \
--lr-num-warmup-steps 10 \
--lr-decay-start-step 10 \
--mini-batch-size=2048 \
--num-batches=1000 \
--data-generation='random' \
--numpy-rand-seed=727 \
--print-time \
--print-freq 100 \
--num-indices-per-lookup=100 \
--use-tpu \
--num-indices-per-lookup-fixed \
--tpu-model-parallel-group-len 8 \
--tpu-cores 8
# GPU CLI
cd dlrm
CUDA_VISIBLE_DEVICES=0,1,2,3 python dlrm_s_pytorch.py \
--mini-batch-size=2048 \
--test-mini-batch-size=16384 \
--test-num-workers=0 \
--num-batches=1000 \
--data-generation=random \
--arch-mlp-bot=512-512-64 \
--arch-mlp-top=1024-1024-1024-1 \
--arch-sparse-feature-size=64 \
--arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
--num-indices-per-lookup=100 \
--arch-interaction-op=dot \
--numpy-rand-seed=727 \
--print-freq=100 \
--print-time \
--enable-profiling \
--use-gpu
# NO BF16
Using 8 TPU core(s)...
XLA replica groups for Model Parallel:
[[0, 1, 2, 3, 4, 5, 6, 7]]
XLA replica groups for Model Parallel:
[[0], [1], [2], [3], [4], [5], [6], [7]]
TPU model-parallel mode, setting --drop-last=True
time/loss/accuracy (if enabled):
Finished training it 100/1000 of epoch 0, 364.48 ms/it, loss 0.084503, accuracy 0.000 %
Finished training it 200/1000 of epoch 0, 315.64 ms/it, loss 0.083122, accuracy 0.000 %
Finished training it 300/1000 of epoch 0, 326.83 ms/it, loss 0.082922, accuracy 0.000 %
Finished training it 400/1000 of epoch 0, 320.92 ms/it, loss 0.083768, accuracy 0.000 %
Finished training it 500/1000 of epoch 0, 330.18 ms/it, loss 0.082434, accuracy 0.000 %
Finished training it 600/1000 of epoch 0, 321.81 ms/it, loss 0.083133, accuracy 0.000 %
Finished training it 700/1000 of epoch 0, 319.91 ms/it, loss 0.083932, accuracy 0.000 %
Finished training it 800/1000 of epoch 0, 317.70 ms/it, loss 0.083939, accuracy 0.000 %
Finished training it 900/1000 of epoch 0, 322.93 ms/it, loss 0.082846, accuracy 0.000 %
Finished training it 1000/1000 of epoch 0, 319.75 ms/it, loss 0.082794, accuracy 0.000 %
# With BF16:
Using 8 TPU core(s)...
XLA replica groups for Model Parallel:
[[0, 1, 2, 3, 4, 5, 6, 7]]
XLA replica groups for Model Parallel:
[[0], [1], [2], [3], [4], [5], [6], [7]]
TPU model-parallel mode, setting --drop-last=True
time/loss/accuracy (if enabled):
Finished training it 100/1000 of epoch 0, 363.40 ms/it, loss 0.087402, accuracy 0.000 %
Finished training it 200/1000 of epoch 0, 320.97 ms/it, loss 0.086914, accuracy 0.000 %
Finished training it 300/1000 of epoch 0, 318.49 ms/it, loss 0.084961, accuracy 0.000 %
Finished training it 400/1000 of epoch 0, 321.91 ms/it, loss 0.085449, accuracy 0.000 %
Finished training it 500/1000 of epoch 0, 324.15 ms/it, loss 0.084961, accuracy 0.000 %
Finished training it 600/1000 of epoch 0, 321.54 ms/it, loss 0.083984, accuracy 0.000 %
Finished training it 700/1000 of epoch 0, 320.97 ms/it, loss 0.085449, accuracy 0.000 %
Finished training it 800/1000 of epoch 0, 323.22 ms/it, loss 0.085449, accuracy 0.000 %
Finished training it 900/1000 of epoch 0, 324.06 ms/it, loss 0.084961, accuracy 0.000 %
Finished training it 1000/1000 of epoch 0, 322.35 ms/it, loss 0.084473, accuracy 0.000 %
8 GPUs vs 8 TPUs, more iters
TPU (v3-8)
python dlrm/dlrm_tpu_runner.py \
--arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
--arch-sparse-feature-size=64 \
--arch-mlp-bot=512-512-64 \
--arch-mlp-top=1024-1024-1024-1 \
--arch-interaction-op=dot \
--lr-num-warmup-steps 10 \
--lr-decay-start-step 10 \
--mini-batch-size=2048 \
--num-batches=10000 \
--data-generation='random' \
--numpy-rand-seed=727 \
--print-time \
--print-freq 1000 \
--num-indices-per-lookup=100 \
--use-tpu \
--num-indices-per-lookup-fixed \
--tpu-model-parallel-group-len 8 \
--tpu-cores=8
Results
Using 8 TPU core(s)...
XLA replica groups for Model Parallel:
[[0, 1, 2, 3, 4, 5, 6, 7]]
XLA replica groups for Model Parallel:
[[0], [1], [2], [3], [4], [5], [6], [7]]
TPU model-parallel mode, setting --drop-last=True
time/loss/accuracy (if enabled):
Finished training it 1000/10000 of epoch 0, 312.83 ms/it, loss 0.083339, accuracy 0.000 % 2020-06-25 23:10:44.639935
Finished training it 2000/10000 of epoch 0, 322.82 ms/it, loss 0.083453, accuracy 0.000 % 2020-06-25 23:16:09.488755
Finished training it 3000/10000 of epoch 0, 322.62 ms/it, loss 0.083233, accuracy 0.000 % 2020-06-25 23:21:37.548171
Finished training it 4000/10000 of epoch 0, 320.99 ms/it, loss 0.083312, accuracy 0.000 % 2020-06-25 23:27:01.543967
Finished training it 5000/10000 of epoch 0, 317.38 ms/it, loss 0.083520, accuracy 0.000 % 2020-06-25 23:32:26.462033
Finished training it 6000/10000 of epoch 0, 324.74 ms/it, loss 0.083305, accuracy 0.000 % 2020-06-25 23:37:55.277979
Finished training it 7000/10000 of epoch 0, 327.99 ms/it, loss 0.083492, accuracy 0.000 % 2020-06-25 23:43:25.281327
Finished training it 8000/10000 of epoch 0, 326.43 ms/it, loss 0.083512, accuracy 0.000 % 2020-06-25 23:48:54.851570
8 GPUs (v100, 16gb)
#!/bin/bash
cd dlrm
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python dlrm_s_pytorch.py \
--mini-batch-size=2048 \
--test-mini-batch-size=16384 \
--test-num-workers=0 \
--num-batches=1000 \
--data-generation=random \
--arch-mlp-bot=512-512-64 \
--arch-mlp-top=1024-1024-1024-1 \
--arch-sparse-feature-size=64 \
--arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
--num-indices-per-lookup=100 \
--arch-interaction-op=dot \
--numpy-rand-seed=727 \
--print-freq=100 \
--print-time \
--use-gpu
Results:
taylanbil@dlrm-gpu-8:~$ ./dlrm-bench-moreiter.sh
Using 8 GPU(s)...
time/loss/accuracy (if enabled):
Finished training it 1000/10000 of epoch 0, 57.12 ms/it, loss 0.083467, accuracy 0.000 % 2020-06-25 23:29:58.506048
Finished training it 2000/10000 of epoch 0, 29.53 ms/it, loss 0.083403, accuracy 0.000 % 2020-06-25 23:39:43.421000
Finished training it 3000/10000 of epoch 0, 31.34 ms/it, loss 0.083501, accuracy 0.000 % 2020-06-25 23:49:31.391754
Finished training it 4000/10000 of epoch 0, 31.31 ms/it, loss 0.083371, accuracy 0.000 % 2020-06-25 23:59:18.873084
Finished training it 5000/10000 of epoch 0, 32.68 ms/it, loss 0.083269, accuracy 0.000 % 2020-06-26 00:09:27.544340
Finished training it 6000/10000 of epoch 0, 32.95 ms/it, loss 0.083395, accuracy 0.000 % 2020-06-26 00:19:41.636447
Finished training it 7000/10000 of epoch 0, 31.33 ms/it, loss 0.083328, accuracy 0.000 % 2020-06-26 00:29:54.229305
Finished training it 8000/10000 of epoch 0, 28.39 ms/it, loss 0.083358, accuracy 0.000 % 2020-06-26 00:39:35.963765
Finished training it 9000/10000 of epoch 0, 28.38 ms/it, loss 0.083281, accuracy 0.000 % 2020-06-26 00:49:16.935803
Finished training it 10000/10000 of epoch 0, 28.44 ms/it, loss 0.083407, accuracy 0.000 % 2020-06-26 00:58:57.481942
FIXING ms/it
As I noted in the first comment, ms/it reporting is problematic on TPUs. Because of torch_xla's asynchronous execution, I believe the current way ms/it is computed makes for an apples-to-oranges comparison of GPU vs TPU performance.
I added the following diff to make the measurement apples-to-apples (I don't think the current upstream way of measuring ms/it can be compared apples-to-apples).
diff --git a/dlrm_s_pytorch.py b/dlrm_s_pytorch.py
index 344c167..7cb988f 100644
--- a/dlrm_s_pytorch.py
+++ b/dlrm_s_pytorch.py
@@ -934,7 +934,8 @@ if __name__ == "__main__":
iteration_time = 0
previous_iteration_time = current_time
else:
- t1 = time_wrap(use_gpu)
+ if not j:
+ t1 = time_wrap(use_gpu)
# early exit if nbatches was set by the user and has been exceeded
if nbatches > 0 and j >= nbatches:
@@ -986,7 +987,10 @@ if __name__ == "__main__":
total_time += iteration_time
else:
t2 = time_wrap(use_gpu)
+ if j:
+ print('ADDTIME', t2-t1, total_time, j, total_iter)
total_time += t2 - t1
+ t1=t2
total_accu += A
total_loss += L * mbs
total_iter += 1
@@ -1002,6 +1006,7 @@ if __name__ == "__main__":
# print time, loss and accuracy
if should_print or should_test:
gT = 1000.0 * total_time / total_iter if args.print_time else -1
+ print('time'.upper(), total_time, total_iter, j+1)
total_time = 0
gA = total_accu / total_samp
@@ -1011,11 +1016,12 @@ if __name__ == "__main__":
total_loss = 0
str_run_type = "inference" if args.inference_only else "training"
+ from datetime import datetime
print(
"Finished {} it {}/{} of epoch {}, {:.2f} ms/it, ".format(
str_run_type, j + 1, nbatches, k, gT
)
- + "loss {:.6f}, accuracy {:3.3f} %".format(gL, gA * 100)
+ + "loss {:.6f}, accuracy {:3.3f} % {}".format(gL, gA * 100, datetime.now())
)
# Uncomment the line below to print out the total time with overhead
# print("Accumulated time so far: {}" \
Here are the results
GPU (8 v100s, 16gb)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python dlrm_s_pytorch.py \
--mini-batch-size=2048 \
--test-mini-batch-size=16384 \
--test-num-workers=0 \
--num-batches=1000 \
--data-generation=random \
--arch-mlp-bot=512-512-64 \
--arch-mlp-top=1024-1024-1024-1 \
--arch-sparse-feature-size=64 \
--arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
--num-indices-per-lookup=100 \
--arch-interaction-op=dot \
--numpy-rand-seed=727 \
--print-freq=100 \
--print-time \
--enable-profiling \
--use-gpu
taylanbil@dlrm-gpu-8:~$ ./dlrm-bench.sh | grep ^Fini
Finished training it 100/1000 of epoch 0, 705.09 ms/it, loss 0.084125, accuracy 0.000 % 2020-06-26 17:21:00.805777
Finished training it 200/1000 of epoch 0, 647.25 ms/it, loss 0.083529, accuracy 0.000 % 2020-06-26 17:22:05.530502
Finished training it 300/1000 of epoch 0, 642.23 ms/it, loss 0.083705, accuracy 0.000 % 2020-06-26 17:23:09.753057
Finished training it 400/1000 of epoch 0, 643.49 ms/it, loss 0.082972, accuracy 0.000 % 2020-06-26 17:24:14.102139
Finished training it 500/1000 of epoch 0, 645.93 ms/it, loss 0.083616, accuracy 0.000 % 2020-06-26 17:25:18.695563
Finished training it 600/1000 of epoch 0, 656.90 ms/it, loss 0.083187, accuracy 0.000 % 2020-06-26 17:26:24.385822
Finished training it 700/1000 of epoch 0, 644.19 ms/it, loss 0.083440, accuracy 0.000 % 2020-06-26 17:27:28.804989
Finished training it 800/1000 of epoch 0, 657.80 ms/it, loss 0.083453, accuracy 0.000 % 2020-06-26 17:28:34.585144
Finished training it 900/1000 of epoch 0, 654.89 ms/it, loss 0.083306, accuracy 0.000 % 2020-06-26 17:29:40.074329
TPU (v3-8)
python dlrm/dlrm_tpu_runner.py \
--arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
--arch-sparse-feature-size=64 \
--arch-mlp-bot=512-512-64 \
--arch-mlp-top=1024-1024-1024-1 \
--arch-interaction-op=dot \
--lr-num-warmup-steps 10 \
--lr-decay-start-step 10 \
--mini-batch-size=2048 \
--num-batches=1000 \
--data-generation='random' \
--numpy-rand-seed=727 \
--print-time \
--print-freq 100 \
--num-indices-per-lookup=100 \
--use-tpu \
--num-indices-per-lookup-fixed \
--tpu-model-parallel-group-len 8 \
--tpu-cores=8
Results:
Finished training it 100/1000 of epoch 0, 619.24 ms/it, loss 0.084503, accuracy 0.000 % 2020-06-26 17:32:41.667455
Finished training it 200/1000 of epoch 0, 435.55 ms/it, loss 0.083122, accuracy 0.000 % 2020-06-26 17:33:25.188651
Finished training it 300/1000 of epoch 0, 353.28 ms/it, loss 0.082922, accuracy 0.000 % 2020-06-26 17:34:00.517240
Finished training it 400/1000 of epoch 0, 357.82 ms/it, loss 0.083768, accuracy 0.000 % 2020-06-26 17:34:36.299376
Finished training it 500/1000 of epoch 0, 352.48 ms/it, loss 0.082434, accuracy 0.000 % 2020-06-26 17:35:11.547412
Finished training it 600/1000 of epoch 0, 352.33 ms/it, loss 0.083133, accuracy 0.000 % 2020-06-26 17:35:46.780538
Finished training it 700/1000 of epoch 0, 365.90 ms/it, loss 0.083932, accuracy 0.000 % 2020-06-26 17:36:23.372595
Finished training it 800/1000 of epoch 0, 353.92 ms/it, loss 0.083939, accuracy 0.000 % 2020-06-26 17:36:58.762597
Finished training it 900/1000 of epoch 0, 354.02 ms/it, loss 0.082846, accuracy 0.000 % 2020-06-26 17:37:34.164837
Finished training it 1000/1000 of epoch 0, 353.33 ms/it, loss 0.082794, accuracy 0.000 % 2020-06-26 17:38:09.497301
Do you have the 1 GPU/TPU runs with this fix?
Same config as in https://gist.github.com/taylanbil/c4841e7397bd875e3b0908f752fcc602#gistcomment-3354701:
Using 1 GPU(s)...
time/loss/accuracy (if enabled):
Finished training it 100/1000 of epoch 0, 27.40 ms/it, loss 0.141502, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:45:44.139930
Finished training it 200/1000 of epoch 0, 24.26 ms/it, loss 0.084229, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:45:46.565963
Finished training it 300/1000 of epoch 0, 24.68 ms/it, loss 0.083750, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:45:49.033689
Finished training it 400/1000 of epoch 0, 24.28 ms/it, loss 0.083291, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:45:51.462181
Finished training it 500/1000 of epoch 0, 24.49 ms/it, loss 0.084722, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:45:53.910777
Finished training it 600/1000 of epoch 0, 24.98 ms/it, loss 0.083890, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:45:56.408931
Finished training it 700/1000 of epoch 0, 24.70 ms/it, loss 0.083577, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:45:58.878910
Finished training it 800/1000 of epoch 0, 24.82 ms/it, loss 0.084575, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:46:01.360570
Finished training it 900/1000 of epoch 0, 24.83 ms/it, loss 0.083511, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:46:03.843675
Finished training it 1000/1000 of epoch 0, 24.77 ms/it, loss 0.083468, accuracy 0.000 %, 25600 samples, @ 2020-06-26 21:46:06.320895
vs
Using 1 TPU core(s)...
XLA replica groups for Model Parallel:
[[0]]
XLA replica groups for Data Parallel:
[[0]]
TPU data-parallel mode, setting --tpu-data-parallel to True
time/loss/accuracy (if enabled):
Finished training it 100/1000 of epoch 0, 61.36 ms/it, loss 0.147117, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:01:32.387126
Finished training it 200/1000 of epoch 0, 51.82 ms/it, loss 0.083852, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:01:37.533774
Finished training it 300/1000 of epoch 0, 42.16 ms/it, loss 0.083658, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:01:41.749390
Finished training it 400/1000 of epoch 0, 42.08 ms/it, loss 0.083472, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:01:45.957078
Finished training it 500/1000 of epoch 0, 42.34 ms/it, loss 0.083506, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:01:50.190714
Finished training it 600/1000 of epoch 0, 42.09 ms/it, loss 0.083928, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:01:54.399981
Finished training it 700/1000 of epoch 0, 41.97 ms/it, loss 0.083616, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:01:58.597187
Finished training it 800/1000 of epoch 0, 42.19 ms/it, loss 0.083280, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:02:02.816375
Finished training it 900/1000 of epoch 0, 42.50 ms/it, loss 0.083492, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:02:07.065817
Finished training it 1000/1000 of epoch 0, 42.44 ms/it, loss 0.083240, accuracy 0.000 %, 25600 samples, @ 2020-06-26 22:02:11.310073
Noting again that we typically do not do this comparison; we go off the equivalence 1 V100 = 2 TPU v3 cores.
8 GPUs vs 8 TPUs, more iters
I added --num-workers to the GPU run and it is significantly faster; it turns out this workload was input bound.
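As context, a minimal sketch of the PyTorch DataLoader pattern that the flag controls; the dataset here is a hypothetical stand-in for the random generator, not the DLRM code:

import torch
from torch.utils.data import DataLoader, Dataset

class RandomClicks(Dataset):
    # Hypothetical stand-in for --data-generation=random (sketch only).
    def __init__(self, num_samples, num_dense=64):
        self.num_samples = num_samples
        self.num_dense = num_dense

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        x = torch.rand(self.num_dense)          # dense features
        y = torch.randint(0, 2, (1,)).float()   # click label
        return x, y

# num_workers > 0 generates batches in that many background processes,
# so input creation overlaps with the training step instead of blocking it.
loader = DataLoader(RandomClicks(10000 * 2048), batch_size=2048,
                    num_workers=9, pin_memory=True, drop_last=True)

The updated command and results: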
GPU
taylanbil@dlrm-gpu-8:~$ cat dlrm-bench-moreiter.sh
#!/bin/bash
cd dlrm
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python dlrm_s_pytorch.py \
--mini-batch-size=2048 \
--test-mini-batch-size=16384 \
--test-num-workers=0 \
--num-workers=9 \
--num-batches=10000 \
--data-generation=random \
--arch-mlp-bot=512-512-64 \
--arch-mlp-top=1024-1024-1024-1 \
--arch-sparse-feature-size=64 \
--arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
--num-indices-per-lookup=100 \
--arch-interaction-op=dot \
--numpy-rand-seed=727 \
--print-freq=1000 \
--print-time \
--use-gpu
taylanbil@dlrm-gpu-8:~$ ./dlrm-bench-moreiter.sh
Using 8 GPU(s)...
time/loss/accuracy (if enabled): 2020-11-19 21:20:24.774574
Finished training it 1000/10000 of epoch 0, 62.38 ms/it, loss 0.083458, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:21:44.984052
Finished training it 2000/10000 of epoch 0, 32.06 ms/it, loss 0.083541, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:22:33.474091
Finished training it 3000/10000 of epoch 0, 31.60 ms/it, loss 0.083576, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:23:20.998825
Finished training it 4000/10000 of epoch 0, 31.12 ms/it, loss 0.083544, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:24:07.823271
Finished training it 5000/10000 of epoch 0, 31.32 ms/it, loss 0.083297, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:24:54.423306
Finished training it 6000/10000 of epoch 0, 32.18 ms/it, loss 0.083065, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:25:42.277924
Finished training it 7000/10000 of epoch 0, 32.27 ms/it, loss 0.083232, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:26:30.194078
Finished training it 8000/10000 of epoch 0, 32.76 ms/it, loss 0.083562, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:27:17.811339
Finished training it 9000/10000 of epoch 0, 32.53 ms/it, loss 0.083362, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:28:05.319312
Finished training it 10000/10000 of epoch 0, 33.18 ms/it, loss 0.083401, accuracy 0.000 %, 2048000 samples, @ 2020-11-19 21:28:53.578277
TPU
$ cat ./bench-tpu-v3-8-moresteps.sh
#!/bin/bash
pkill -9 python
#export XLA_USE_BF16=1
tpu=dlrm-init
data_path=
TPU_IP_ADDRESS=`gcloud compute tpus describe --zone=europe-west4-a dlrm-init | grep ipAddress | cut -d ':' -f2 | head -1 | sed 's/ //g'`
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
other="
--test-num-workers=0 \
--test-mini-batch-size=16384 \
--data-size=$(( 512*300 )) \
"
python dlrm/dlrm_tpu_runner.py \
--arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
--arch-sparse-feature-size=64 \
--arch-mlp-bot=512-512-64 \
--arch-mlp-top=1024-1024-1024-1 \
--arch-interaction-op=dot \
--lr-num-warmup-steps 10 \
--lr-decay-start-step 10 \
--mini-batch-size=2048 \
--num-batches=10000 \
--data-generation='random' \
--numpy-rand-seed=727 \
--print-time \
--print-freq 1000 \
--num-workers 9 \
--num-indices-per-lookup=100 \
--use-tpu \
--num-indices-per-lookup-fixed \
--tpu-model-parallel-group-len 8 \
--tpu-cores=8
Finished training it 1000/10000 of epoch 0, -1.00 ms/it, loss 0.083518, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:52:11.963543
Finished training it 2000/10000 of epoch 0, -1.00 ms/it, loss 0.083332, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:53:01.571697
Finished training it 3000/10000 of epoch 0, -1.00 ms/it, loss 0.083392, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:53:48.243824
Finished training it 4000/10000 of epoch 0, -1.00 ms/it, loss 0.083312, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:54:34.156202
Finished training it 5000/10000 of epoch 0, -1.00 ms/it, loss 0.083407, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:55:20.454105
Finished training it 6000/10000 of epoch 0, -1.00 ms/it, loss 0.083292, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:56:06.454427
Finished training it 7000/10000 of epoch 0, -1.00 ms/it, loss 0.083444, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:56:52.473373
Finished training it 8000/10000 of epoch 0, -1.00 ms/it, loss 0.083433, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:57:38.422858
Finished training it 9000/10000 of epoch 0, -1.00 ms/it, loss 0.083330, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:58:24.602985
Finished training it 10000/10000 of epoch 0, -1.00 ms/it, loss 0.083335, accuracy 0.000 %, 2048000 samples, @ 2020-11-20 17:59:10.352625
Still slower than TPUv3-8?
Yes, updated comment above w/ TPU numbers. We get:
- TPU: 46 seconds to do 1000 steps
- GPU: 48 seconds to do 1000 steps
Much smaller gap than before (roughly 46 vs 48 ms/it of wall clock). I think the initial gap was due to the workload being input bound, combined with the TPU side's multiprocessing design, which gave it 8 data-loading processes to the GPU run's 1. So TPUs now seem to be faster, but by a smaller margin.
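For context, a minimal sketch of the per-core process layout that torch_xla's multiprocessing API sets up (a simplified toy model, not the dlrm_tpu_runner code; assumes the xmp.spawn interface from that era):

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # One Python process per TPU core; each process builds its own input
    # pipeline, so data loading runs in 8 processes rather than 1.
    device = xm.xla_device()
    model = torch.nn.Linear(64, 1).to(device)         # toy model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(10):
        x = torch.rand(2048, 64, device=device)       # stands in for a real loader
        optimizer.zero_grad()
        loss = model(x).pow(2).mean()
        loss.backward()
        xm.optimizer_step(optimizer)  # grad all-reduce across cores + XLA step

if __name__ == "__main__":
    xmp.spawn(_mp_fn, nprocs=8)       # 8 processes for a v3-8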
8 GPU vs 8 TPU
TPU (v3-8)
same results as above, included for convenience
python dlrm/dlrm_tpu_runner.py \
--arch-embedding-size=1000000-1000000-1000000-1000000-1000000-1000000-1000000-1000000 \
--arch-sparse-feature-size=64 \
--arch-mlp-bot=512-512-64 \
--arch-mlp-top=1024-1024-1024-1 \
--arch-interaction-op=dot \
--lr-num-warmup-steps 10 \
--lr-decay-start-step 10 \
--mini-batch-size=2048 \
--num-batches=1000 \
--data-generation='random' \
--numpy-rand-seed=727 \
--print-time \
--print-freq 100 \
--num-indices-per-lookup=100 \
--use-tpu \
--num-indices-per-lookup-fixed \
--tpu-model-parallel-group-len 8 \
--tpu-cores 8
8 GPU (v100, 16 gb)
Results: