@armancohan
Created October 25, 2019 18:44
```
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 354 at step 88500 | time: 1345.55s | valid loss 0.74 | bpc 1.07357
----------------------------------------------------------------------------------------------------
| epoch 130 step 88510 | 16 batches | lr 0.000442 | ms/batch 11917.46 | loss 0.75 | bpc 1.07888
| epoch 130 step 88520 | 26 batches | lr 0.000442 | ms/batch 5110.18 | loss 0.78 | bpc 1.12858
| epoch 130 step 88530 | 36 batches | lr 0.000442 | ms/batch 5107.78 | loss 0.71 | bpc 1.02528
| epoch 130 step 88540 | 46 batches | lr 0.000442 | ms/batch 5109.07 | loss 0.74 | bpc 1.07031
| epoch 130 step 88550 | 56 batches | lr 0.000442 | ms/batch 5111.60 | loss 0.78 | bpc 1.12227
| epoch 130 step 88560 | 66 batches | lr 0.000442 | ms/batch 5109.90 | loss 0.76 | bpc 1.10196
| epoch 130 step 88570 | 76 batches | lr 0.000442 | ms/batch 5110.75 | loss 0.72 | bpc 1.04134
| epoch 130 step 88580 | 86 batches | lr 0.000442 | ms/batch 5106.93 | loss 0.72 | bpc 1.03947
| epoch 130 step 88590 | 96 batches | lr 0.000442 | ms/batch 5111.03 | loss 0.72 | bpc 1.03796
| epoch 130 step 88600 | 106 batches | lr 0.000442 | ms/batch 5111.09 | loss 0.76 | bpc 1.09314
| epoch 130 step 88610 | 116 batches | lr 0.000442 | ms/batch 5110.12 | loss 0.75 | bpc 1.08070
| epoch 130 step 88620 | 126 batches | lr 0.000442 | ms/batch 5111.19 | loss 0.76 | bpc 1.09950
| epoch 130 step 88630 | 136 batches | lr 0.000442 | ms/batch 5109.70 | loss 0.78 | bpc 1.12416
| epoch 130 step 88640 | 146 batches | lr 0.000442 | ms/batch 5109.59 | loss 0.77 | bpc 1.11730
| epoch 130 step 88650 | 156 batches | lr 0.000442 | ms/batch 5108.91 | loss 0.78 | bpc 1.11911
| epoch 130 step 88660 | 166 batches | lr 0.000442 | ms/batch 5107.52 | loss 0.78 | bpc 1.13135
| epoch 130 step 88670 | 176 batches | lr 0.000442 | ms/batch 5109.05 | loss 0.68 | bpc 0.98543
| epoch 130 step 88680 | 186 batches | lr 0.000442 | ms/batch 5108.64 | loss 0.69 | bpc 0.99098
| epoch 130 step 88690 | 196 batches | lr 0.000442 | ms/batch 5111.96 | loss 0.67 | bpc 0.95956
| epoch 130 step 88700 | 206 batches | lr 0.000442 | ms/batch 5109.56 | loss 0.65 | bpc 0.94402
| epoch 130 step 88710 | 216 batches | lr 0.000442 | ms/batch 5109.99 | loss 0.76 | bpc 1.08980
| epoch 130 step 88720 | 226 batches | lr 0.000442 | ms/batch 5110.06 | loss 0.74 | bpc 1.07421
| epoch 130 step 88730 | 236 batches | lr 0.000442 | ms/batch 5109.98 | loss 0.74 | bpc 1.06617
| epoch 130 step 88740 | 246 batches | lr 0.000442 | ms/batch 5111.55 | loss 0.78 | bpc 1.12912
| epoch 130 step 88750 | 256 batches | lr 0.000442 | ms/batch 5109.93 | loss 0.75 | bpc 1.08629
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 355 at step 88750 | time: 1345.67s | valid loss 0.75 | bpc 1.07538
----------------------------------------------------------------------------------------------------
| epoch 130 step 88760 | 266 batches | lr 0.000442 | ms/batch 11920.42 | loss 0.71 | bpc 1.01712
| epoch 130 step 88770 | 276 batches | lr 0.000442 | ms/batch 5116.65 | loss 0.75 | bpc 1.07615
| epoch 130 step 88780 | 286 batches | lr 0.000442 | ms/batch 5110.19 | loss 0.79 | bpc 1.13928
| epoch 130 step 88790 | 296 batches | lr 0.000442 | ms/batch 5109.88 | loss 0.75 | bpc 1.07881
| epoch 130 step 88800 | 306 batches | lr 0.000442 | ms/batch 5110.47 | loss 0.75 | bpc 1.08729
| epoch 130 step 88810 | 316 batches | lr 0.000442 | ms/batch 5117.64 | loss 0.78 | bpc 1.12217
| epoch 130 step 88820 | 326 batches | lr 0.000442 | ms/batch 5113.82 | loss 0.75 | bpc 1.08376
| epoch 130 step 88830 | 336 batches | lr 0.000442 | ms/batch 5110.18 | loss 0.75 | bpc 1.08456
| epoch 130 step 88840 | 346 batches | lr 0.000442 | ms/batch 5108.92 | loss 0.72 | bpc 1.04117
| epoch 130 step 88850 | 356 batches | lr 0.000442 | ms/batch 5110.31 | loss 0.78 | bpc 1.13044
| epoch 130 step 88860 | 366 batches | lr 0.000442 | ms/batch 5112.23 | loss 0.74 | bpc 1.07128
| epoch 130 step 88870 | 376 batches | lr 0.000442 | ms/batch 5108.59 | loss 0.73 | bpc 1.06027
| epoch 130 step 88880 | 386 batches | lr 0.000442 | ms/batch 5111.86 | loss 0.77 | bpc 1.11659
| epoch 130 step 88890 | 396 batches | lr 0.000442 | ms/batch 5110.05 | loss 0.76 | bpc 1.10069
| epoch 130 step 88900 | 406 batches | lr 0.000441 | ms/batch 5109.84 | loss 0.77 | bpc 1.11302
| epoch 130 step 88910 | 416 batches | lr 0.000441 | ms/batch 5107.88 | loss 0.75 | bpc 1.08743
| epoch 130 step 88920 | 426 batches | lr 0.000441 | ms/batch 5108.41 | loss 0.72 | bpc 1.03669
| epoch 130 step 88930 | 436 batches | lr 0.000441 | ms/batch 5106.52 | loss 0.78 | bpc 1.13148
| epoch 130 step 88940 | 446 batches | lr 0.000441 | ms/batch 5106.56 | loss 0.76 | bpc 1.09944
| epoch 130 step 88950 | 456 batches | lr 0.000441 | ms/batch 5111.91 | loss 0.76 | bpc 1.09645
| epoch 130 step 88960 | 466 batches | lr 0.000441 | ms/batch 5108.37 | loss 0.75 | bpc 1.08181
| epoch 130 step 88970 | 476 batches | lr 0.000441 | ms/batch 5111.26 | loss 0.74 | bpc 1.07304
| epoch 130 step 88980 | 486 batches | lr 0.000441 | ms/batch 5112.44 | loss 0.76 | bpc 1.09970
| epoch 130 step 88990 | 496 batches | lr 0.000441 | ms/batch 5110.46 | loss 0.76 | bpc 1.08984
| epoch 130 step 89000 | 506 batches | lr 0.000441 | ms/batch 5109.07 | loss 0.74 | bpc 1.06597
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 356 at step 89000 | time: 1345.80s | valid loss 0.74 | bpc 1.07352
----------------------------------------------------------------------------------------------------
| epoch 130 step 89010 | 516 batches | lr 0.000441 | ms/batch 11921.14 | loss 0.76 | bpc 1.08943
| epoch 130 step 89020 | 526 batches | lr 0.000441 | ms/batch 5110.10 | loss 0.76 | bpc 1.10151
| epoch 130 step 89030 | 536 batches | lr 0.000441 | ms/batch 5110.45 | loss 0.79 | bpc 1.14166
| epoch 130 step 89040 | 546 batches | lr 0.000441 | ms/batch 5107.80 | loss 0.74 | bpc 1.07298
| epoch 130 step 89050 | 556 batches | lr 0.000441 | ms/batch 5110.10 | loss 0.77 | bpc 1.11248
| epoch 130 step 89060 | 566 batches | lr 0.000441 | ms/batch 5110.26 | loss 0.78 | bpc 1.12557
| epoch 130 step 89070 | 576 batches | lr 0.000441 | ms/batch 5108.65 | loss 0.74 | bpc 1.06934
| epoch 130 step 89080 | 586 batches | lr 0.000441 | ms/batch 5109.55 | loss 0.75 | bpc 1.07815
| epoch 130 step 89090 | 596 batches | lr 0.000441 | ms/batch 5110.68 | loss 0.77 | bpc 1.11480
| epoch 130 step 89100 | 606 batches | lr 0.000441 | ms/batch 5108.65 | loss 0.74 | bpc 1.06829
| epoch 130 step 89110 | 616 batches | lr 0.000441 | ms/batch 5111.61 | loss 0.74 | bpc 1.06888
| epoch 130 step 89120 | 626 batches | lr 0.000441 | ms/batch 5109.05 | loss 0.76 | bpc 1.09776
| epoch 130 step 89130 | 636 batches | lr 0.000441 | ms/batch 5109.98 | loss 0.77 | bpc 1.11165
| epoch 130 step 89140 | 646 batches | lr 0.000441 | ms/batch 5111.07 | loss 0.76 | bpc 1.10206
| epoch 130 step 89150 | 656 batches | lr 0.000441 | ms/batch 5111.23 | loss 0.79 | bpc 1.14227
| epoch 130 step 89160 | 666 batches | lr 0.000441 | ms/batch 5109.62 | loss 0.78 | bpc 1.13069
| epoch 130 step 89170 | 676 batches | lr 0.000441 | ms/batch 5109.70 | loss 0.70 | bpc 1.01404
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 130 step 89180 | 686 batches | lr 0.000441 | ms/batch 5109.98 | loss 0.70 | bpc 1.00632
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 131 step 89190 | 10 batches | lr 0.000441 | ms/batch 5109.06 | loss 0.73 | bpc 1.05705
| epoch 131 step 89200 | 20 batches | lr 0.000441 | ms/batch 5111.14 | loss 0.76 | bpc 1.10275
| epoch 131 step 89210 | 30 batches | lr 0.000441 | ms/batch 5109.35 | loss 0.76 | bpc 1.09028
| epoch 131 step 89220 | 40 batches | lr 0.000441 | ms/batch 5108.13 | loss 0.72 | bpc 1.03795
| epoch 131 step 89230 | 50 batches | lr 0.000441 | ms/batch 5108.15 | loss 0.76 | bpc 1.09844
| epoch 131 step 89240 | 60 batches | lr 0.000441 | ms/batch 5108.84 | loss 0.77 | bpc 1.11221
| epoch 131 step 89250 | 70 batches | lr 0.000441 | ms/batch 5106.78 | loss 0.75 | bpc 1.07824
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 357 at step 89250 | time: 1345.60s | valid loss 0.75 | bpc 1.07814
----------------------------------------------------------------------------------------------------
| epoch 131 step 89260 | 80 batches | lr 0.000441 | ms/batch 11920.11 | loss 0.71 | bpc 1.02346
| epoch 131 step 89270 | 90 batches | lr 0.000441 | ms/batch 5111.27 | loss 0.74 | bpc 1.06365
| epoch 131 step 89280 | 100 batches | lr 0.000441 | ms/batch 5108.17 | loss 0.71 | bpc 1.02649
| epoch 131 step 89290 | 110 batches | lr 0.000441 | ms/batch 5111.37 | loss 0.77 | bpc 1.10952
| epoch 131 step 89300 | 120 batches | lr 0.000441 | ms/batch 5106.71 | loss 0.74 | bpc 1.06355
| epoch 131 step 89310 | 130 batches | lr 0.000441 | ms/batch 5109.98 | loss 0.78 | bpc 1.12773
| epoch 131 step 89320 | 140 batches | lr 0.000441 | ms/batch 5110.65 | loss 0.80 | bpc 1.14948
| epoch 131 step 89330 | 150 batches | lr 0.000441 | ms/batch 5107.86 | loss 0.76 | bpc 1.08944
| epoch 131 step 89340 | 160 batches | lr 0.000441 | ms/batch 5111.13 | loss 0.79 | bpc 1.13985
| epoch 131 step 89350 | 170 batches | lr 0.000441 | ms/batch 5112.23 | loss 0.74 | bpc 1.06195
| epoch 131 step 89360 | 180 batches | lr 0.000441 | ms/batch 5112.82 | loss 0.68 | bpc 0.97906
| epoch 131 step 89370 | 190 batches | lr 0.000441 | ms/batch 5111.72 | loss 0.68 | bpc 0.97975
| epoch 131 step 89380 | 200 batches | lr 0.000441 | ms/batch 5107.57 | loss 0.64 | bpc 0.91855
| epoch 131 step 89390 | 210 batches | lr 0.000441 | ms/batch 5107.89 | loss 0.70 | bpc 1.01642
| epoch 131 step 89400 | 220 batches | lr 0.000441 | ms/batch 5109.97 | loss 0.77 | bpc 1.11069
| epoch 131 step 89410 | 230 batches | lr 0.000441 | ms/batch 5113.63 | loss 0.74 | bpc 1.07091
| epoch 131 step 89420 | 240 batches | lr 0.000441 | ms/batch 5109.81 | loss 0.74 | bpc 1.07293
| epoch 131 step 89430 | 250 batches | lr 0.000441 | ms/batch 5113.48 | loss 0.79 | bpc 1.13605
| epoch 131 step 89440 | 260 batches | lr 0.000441 | ms/batch 5110.54 | loss 0.73 | bpc 1.04907
| epoch 131 step 89450 | 270 batches | lr 0.000441 | ms/batch 5109.24 | loss 0.71 | bpc 1.02469
| epoch 131 step 89460 | 280 batches | lr 0.000441 | ms/batch 5109.88 | loss 0.76 | bpc 1.08932
| epoch 131 step 89470 | 290 batches | lr 0.000441 | ms/batch 5109.57 | loss 0.78 | bpc 1.12607
| epoch 131 step 89480 | 300 batches | lr 0.000441 | ms/batch 5108.89 | loss 0.75 | bpc 1.08896
| epoch 131 step 89490 | 310 batches | lr 0.000441 | ms/batch 5110.77 | loss 0.78 | bpc 1.11997
| epoch 131 step 89500 | 320 batches | lr 0.000441 | ms/batch 5109.16 | loss 0.74 | bpc 1.06922
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 358 at step 89500 | time: 1349.56s | valid loss 0.74 | bpc 1.07336
----------------------------------------------------------------------------------------------------
| epoch 131 step 89510 | 330 batches | lr 0.000441 | ms/batch 11920.62 | loss 0.76 | bpc 1.10263
| epoch 131 step 89520 | 340 batches | lr 0.000441 | ms/batch 5111.67 | loss 0.73 | bpc 1.04803
| epoch 131 step 89530 | 350 batches | lr 0.000441 | ms/batch 5110.09 | loss 0.75 | bpc 1.07544
| epoch 131 step 89540 | 360 batches | lr 0.000441 | ms/batch 5110.72 | loss 0.79 | bpc 1.14284
| epoch 131 step 89550 | 370 batches | lr 0.000441 | ms/batch 5109.86 | loss 0.73 | bpc 1.05921
| epoch 131 step 89560 | 380 batches | lr 0.000441 | ms/batch 5108.25 | loss 0.75 | bpc 1.08468
| epoch 131 step 89570 | 390 batches | lr 0.000441 | ms/batch 5108.25 | loss 0.75 | bpc 1.08260
| epoch 131 step 89580 | 400 batches | lr 0.000441 | ms/batch 5109.65 | loss 0.79 | bpc 1.14525
| epoch 131 step 89590 | 410 batches | lr 0.000441 | ms/batch 5107.64 | loss 0.76 | bpc 1.09968
| epoch 131 step 89600 | 420 batches | lr 0.000441 | ms/batch 5112.26 | loss 0.73 | bpc 1.05447
| epoch 131 step 89610 | 430 batches | lr 0.000441 | ms/batch 5107.44 | loss 0.74 | bpc 1.06818
| epoch 131 step 89620 | 440 batches | lr 0.000441 | ms/batch 5108.20 | loss 0.77 | bpc 1.11727
| epoch 131 step 89630 | 450 batches | lr 0.000441 | ms/batch 5109.88 | loss 0.76 | bpc 1.10264
| epoch 131 step 89640 | 460 batches | lr 0.000441 | ms/batch 5109.29 | loss 0.76 | bpc 1.09631
| epoch 131 step 89650 | 470 batches | lr 0.000441 | ms/batch 5108.21 | loss 0.73 | bpc 1.05995
| epoch 131 step 89660 | 480 batches | lr 0.000441 | ms/batch 5108.40 | loss 0.77 | bpc 1.10835
| epoch 131 step 89670 | 490 batches | lr 0.000441 | ms/batch 5109.71 | loss 0.75 | bpc 1.07987
| epoch 131 step 89680 | 500 batches | lr 0.000441 | ms/batch 5108.73 | loss 0.74 | bpc 1.07462
| epoch 131 step 89690 | 510 batches | lr 0.00044 | ms/batch 5109.62 | loss 0.75 | bpc 1.08154
| epoch 131 step 89700 | 520 batches | lr 0.00044 | ms/batch 5110.14 | loss 0.76 | bpc 1.09554
| epoch 131 step 89710 | 530 batches | lr 0.00044 | ms/batch 5113.81 | loss 0.78 | bpc 1.11824
| epoch 131 step 89720 | 540 batches | lr 0.00044 | ms/batch 5107.65 | loss 0.78 | bpc 1.13175
| epoch 131 step 89730 | 550 batches | lr 0.00044 | ms/batch 5108.45 | loss 0.74 | bpc 1.07103
| epoch 131 step 89740 | 560 batches | lr 0.00044 | ms/batch 5112.42 | loss 0.77 | bpc 1.11572
| epoch 131 step 89750 | 570 batches | lr 0.00044 | ms/batch 5112.12 | loss 0.76 | bpc 1.10327
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 359 at step 89750 | time: 1345.59s | valid loss 0.75 | bpc 1.07808
----------------------------------------------------------------------------------------------------
| epoch 131 step 89760 | 580 batches | lr 0.00044 | ms/batch 11917.89 | loss 0.74 | bpc 1.07099
| epoch 131 step 89770 | 590 batches | lr 0.00044 | ms/batch 5109.72 | loss 0.76 | bpc 1.08982
| epoch 131 step 89780 | 600 batches | lr 0.00044 | ms/batch 5110.57 | loss 0.76 | bpc 1.09137
| epoch 131 step 89790 | 610 batches | lr 0.00044 | ms/batch 5110.66 | loss 0.73 | bpc 1.06013
| epoch 131 step 89800 | 620 batches | lr 0.00044 | ms/batch 5109.59 | loss 0.76 | bpc 1.09484
| epoch 131 step 89810 | 630 batches | lr 0.00044 | ms/batch 5110.00 | loss 0.77 | bpc 1.11182
| epoch 131 step 89820 | 640 batches | lr 0.00044 | ms/batch 5111.86 | loss 0.76 | bpc 1.09106
| epoch 131 step 89830 | 650 batches | lr 0.00044 | ms/batch 5106.51 | loss 0.80 | bpc 1.14809
| epoch 131 step 89840 | 660 batches | lr 0.00044 | ms/batch 5111.10 | loss 0.78 | bpc 1.11960
| epoch 131 step 89850 | 670 batches | lr 0.00044 | ms/batch 5109.66 | loss 0.75 | bpc 1.08794
| epoch 131 step 89860 | 680 batches | lr 0.00044 | ms/batch 5114.22 | loss 0.71 | bpc 1.01771
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 132 step 89870 | 4 batches | lr 0.00044 | ms/batch 5112.71 | loss 0.70 | bpc 1.01284
| epoch 132 step 89880 | 14 batches | lr 0.00044 | ms/batch 5112.81 | loss 0.74 | bpc 1.07386
| epoch 132 step 89890 | 24 batches | lr 0.00044 | ms/batch 5109.00 | loss 0.77 | bpc 1.11671
| epoch 132 step 89900 | 34 batches | lr 0.00044 | ms/batch 5108.15 | loss 0.73 | bpc 1.05401
| epoch 132 step 89910 | 44 batches | lr 0.00044 | ms/batch 5108.40 | loss 0.73 | bpc 1.05036
| epoch 132 step 89920 | 54 batches | lr 0.00044 | ms/batch 5111.73 | loss 0.77 | bpc 1.11514
| epoch 132 step 89930 | 64 batches | lr 0.00044 | ms/batch 5111.25 | loss 0.76 | bpc 1.09007
| epoch 132 step 89940 | 74 batches | lr 0.00044 | ms/batch 5112.55 | loss 0.74 | bpc 1.07113
| epoch 132 step 89950 | 84 batches | lr 0.00044 | ms/batch 5108.78 | loss 0.71 | bpc 1.01992
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
| epoch 132 step 89960 | 94 batches | lr 0.00044 | ms/batch 5110.21 | loss 0.72 | bpc 1.04388
| epoch 132 step 89970 | 104 batches | lr 0.00044 | ms/batch 5108.28 | loss 0.75 | bpc 1.07957
| epoch 132 step 89980 | 114 batches | lr 0.00044 | ms/batch 5112.02 | loss 0.75 | bpc 1.08257
| epoch 132 step 89990 | 124 batches | lr 0.00044 | ms/batch 5111.27 | loss 0.76 | bpc 1.09971
| epoch 132 step 90000 | 134 batches | lr 0.00044 | ms/batch 5111.29 | loss 0.76 | bpc 1.10265
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 360 at step 90000 | time: 1345.86s | valid loss 0.74 | bpc 1.07125
----------------------------------------------------------------------------------------------------
| epoch 132 step 90010 | 144 batches | lr 0.00044 | ms/batch 11918.63 | loss 0.79 | bpc 1.13840
| epoch 132 step 90020 | 154 batches | lr 0.00044 | ms/batch 5111.13 | loss 0.77 | bpc 1.10728
| epoch 132 step 90030 | 164 batches | lr 0.00044 | ms/batch 5108.84 | loss 0.79 | bpc 1.14190
| epoch 132 step 90040 | 174 batches | lr 0.00044 | ms/batch 5104.21 | loss 0.69 | bpc 0.98871
| epoch 132 step 90050 | 184 batches | lr 0.00044 | ms/batch 5109.38 | loss 0.69 | bpc 0.99003
| epoch 132 step 90060 | 194 batches | lr 0.00044 | ms/batch 5108.63 | loss 0.69 | bpc 0.99983
| epoch 132 step 90070 | 204 batches | lr 0.00044 | ms/batch 5113.68 | loss 0.62 | bpc 0.89916
| epoch 132 step 90080 | 214 batches | lr 0.00044 | ms/batch 5107.05 | loss 0.73 | bpc 1.05723
| epoch 132 step 90090 | 224 batches | lr 0.00044 | ms/batch 5110.52 | loss 0.75 | bpc 1.07939
| epoch 132 step 90100 | 234 batches | lr 0.00044 | ms/batch 5107.37 | loss 0.74 | bpc 1.07226
| epoch 132 step 90110 | 244 batches | lr 0.00044 | ms/batch 5111.16 | loss 0.76 | bpc 1.09685
| epoch 132 step 90120 | 254 batches | lr 0.00044 | ms/batch 5109.74 | loss 0.78 | bpc 1.12795
| epoch 132 step 90130 | 264 batches | lr 0.00044 | ms/batch 5110.06 | loss 0.71 | bpc 1.01871
| epoch 132 step 90140 | 274 batches | lr 0.00044 | ms/batch 5109.58 | loss 0.73 | bpc 1.04763
| epoch 132 step 90150 | 284 batches | lr 0.00044 | ms/batch 5109.58 | loss 0.77 | bpc 1.11597
| epoch 132 step 90160 | 294 batches | lr 0.00044 | ms/batch 5106.87 | loss 0.76 | bpc 1.10210
| epoch 132 step 90170 | 304 batches | lr 0.00044 | ms/batch 5110.62 | loss 0.76 | bpc 1.09758
| epoch 132 step 90180 | 314 batches | lr 0.00044 | ms/batch 5113.50 | loss 0.77 | bpc 1.11179
| epoch 132 step 90190 | 324 batches | lr 0.00044 | ms/batch 5113.21 | loss 0.75 | bpc 1.08505
| epoch 132 step 90200 | 334 batches | lr 0.00044 | ms/batch 5110.91 | loss 0.76 | bpc 1.09656
| epoch 132 step 90210 | 344 batches | lr 0.00044 | ms/batch 5110.33 | loss 0.71 | bpc 1.02596
| epoch 132 step 90220 | 354 batches | lr 0.00044 | ms/batch 5109.63 | loss 0.78 | bpc 1.12958
| epoch 132 step 90230 | 364 batches | lr 0.00044 | ms/batch 5109.47 | loss 0.75 | bpc 1.08200
| epoch 132 step 90240 | 374 batches | lr 0.00044 | ms/batch 5109.00 | loss 0.73 | bpc 1.04912
| epoch 132 step 90250 | 384 batches | lr 0.00044 | ms/batch 5110.77 | loss 0.76 | bpc 1.10366
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 361 at step 90250 | time: 1345.61s | valid loss 0.75 | bpc 1.07563
----------------------------------------------------------------------------------------------------
| epoch 132 step 90260 | 394 batches | lr 0.00044 | ms/batch 11914.05 | loss 0.76 | bpc 1.09406
| epoch 132 step 90270 | 404 batches | lr 0.00044 | ms/batch 5110.51 | loss 0.77 | bpc 1.11756
| epoch 132 step 90280 | 414 batches | lr 0.00044 | ms/batch 5109.65 | loss 0.76 | bpc 1.09405
| epoch 132 step 90290 | 424 batches | lr 0.00044 | ms/batch 5108.36 | loss 0.73 | bpc 1.04742
| epoch 132 step 90300 | 434 batches | lr 0.00044 | ms/batch 5108.32 | loss 0.77 | bpc 1.10474
| epoch 132 step 90310 | 444 batches | lr 0.00044 | ms/batch 5105.16 | loss 0.77 | bpc 1.10366
| epoch 132 step 90320 | 454 batches | lr 0.00044 | ms/batch 5111.63 | loss 0.77 | bpc 1.11278
| epoch 132 step 90330 | 464 batches | lr 0.00044 | ms/batch 5108.50 | loss 0.75 | bpc 1.08353
| epoch 132 step 90340 | 474 batches | lr 0.00044 | ms/batch 5111.40 | loss 0.74 | bpc 1.06543
| epoch 132 step 90350 | 484 batches | lr 0.00044 | ms/batch 5110.23 | loss 0.76 | bpc 1.09448
| epoch 132 step 90360 | 494 batches | lr 0.00044 | ms/batch 5111.32 | loss 0.76 | bpc 1.09820
| epoch 132 step 90370 | 504 batches | lr 0.00044 | ms/batch 5110.38 | loss 0.74 | bpc 1.06161
| epoch 132 step 90380 | 514 batches | lr 0.00044 | ms/batch 5110.22 | loss 0.74 | bpc 1.07323
| epoch 132 step 90390 | 524 batches | lr 0.00044 | ms/batch 5112.17 | loss 0.75 | bpc 1.08921
| epoch 132 step 90400 | 534 batches | lr 0.00044 | ms/batch 5109.81 | loss 0.80 | bpc 1.15589
| epoch 132 step 90410 | 544 batches | lr 0.00044 | ms/batch 5109.77 | loss 0.75 | bpc 1.07926
| epoch 132 step 90420 | 554 batches | lr 0.00044 | ms/batch 5109.18 | loss 0.76 | bpc 1.10255
| epoch 132 step 90430 | 564 batches | lr 0.00044 | ms/batch 5109.46 | loss 0.78 | bpc 1.11963
| epoch 132 step 90440 | 574 batches | lr 0.00044 | ms/batch 5108.64 | loss 0.75 | bpc 1.07886
| epoch 132 step 90450 | 584 batches | lr 0.00044 | ms/batch 5108.96 | loss 0.74 | bpc 1.06800
| epoch 132 step 90460 | 594 batches | lr 0.00044 | ms/batch 5109.54 | loss 0.77 | bpc 1.11170
| epoch 132 step 90470 | 604 batches | lr 0.00044 | ms/batch 5105.91 | loss 0.75 | bpc 1.08418
| epoch 132 step 90480 | 614 batches | lr 0.000439 | ms/batch 5108.94 | loss 0.73 | bpc 1.05024
| epoch 132 step 90490 | 624 batches | lr 0.000439 | ms/batch 5106.03 | loss 0.76 | bpc 1.09936
| epoch 132 step 90500 | 634 batches | lr 0.000439 | ms/batch 5108.51 | loss 0.77 | bpc 1.11459
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 362 at step 90500 | time: 1345.44s | valid loss 0.75 | bpc 1.07697
----------------------------------------------------------------------------------------------------
| epoch 132 step 90510 | 644 batches | lr 0.000439 | ms/batch 11917.64 | loss 0.75 | bpc 1.08751
| epoch 132 step 90520 | 654 batches | lr 0.000439 | ms/batch 5108.35 | loss 0.81 | bpc 1.16146
| epoch 132 step 90530 | 664 batches | lr 0.000439 | ms/batch 5110.55 | loss 0.78 | bpc 1.12155
| epoch 132 step 90540 | 674 batches | lr 0.000439 | ms/batch 5109.67 | loss 0.71 | bpc 1.02879
| epoch 132 step 90550 | 684 batches | lr 0.000439 | ms/batch 5107.41 | loss 0.70 | bpc 1.01536
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 133 step 90560 | 8 batches | lr 0.000439 | ms/batch 5109.46 | loss 0.72 | bpc 1.04578
| epoch 133 step 90570 | 18 batches | lr 0.000439 | ms/batch 5111.35 | loss 0.75 | bpc 1.08669
| epoch 133 step 90580 | 28 batches | lr 0.000439 | ms/batch 5110.25 | loss 0.77 | bpc 1.10752
| epoch 133 step 90590 | 38 batches | lr 0.000439 | ms/batch 5109.34 | loss 0.72 | bpc 1.03898
| epoch 133 step 90600 | 48 batches | lr 0.000439 | ms/batch 5110.65 | loss 0.74 | bpc 1.06887
| epoch 133 step 90610 | 58 batches | lr 0.000439 | ms/batch 5110.19 | loss 0.77 | bpc 1.11426
| epoch 133 step 90620 | 68 batches | lr 0.000439 | ms/batch 5109.88 | loss 0.76 | bpc 1.09651
| epoch 133 step 90630 | 78 batches | lr 0.000439 | ms/batch 5108.22 | loss 0.71 | bpc 1.02904
| epoch 133 step 90640 | 88 batches | lr 0.000439 | ms/batch 5112.59 | loss 0.73 | bpc 1.04620
| epoch 133 step 90650 | 98 batches | lr 0.000439 | ms/batch 5108.71 | loss 0.71 | bpc 1.02830
| epoch 133 step 90660 | 108 batches | lr 0.000439 | ms/batch 5109.81 | loss 0.77 | bpc 1.11130
| epoch 133 step 90670 | 118 batches | lr 0.000439 | ms/batch 5110.42 | loss 0.74 | bpc 1.07122
| epoch 133 step 90680 | 128 batches | lr 0.000439 | ms/batch 5111.45 | loss 0.78 | bpc 1.12016
| epoch 133 step 90690 | 138 batches | lr 0.000439 | ms/batch 5110.80 | loss 0.79 | bpc 1.13280
| epoch 133 step 90700 | 148 batches | lr 0.000439 | ms/batch 5110.50 | loss 0.75 | bpc 1.07933
| epoch 133 step 90710 | 158 batches | lr 0.000439 | ms/batch 5110.77 | loss 0.79 | bpc 1.13885
| epoch 133 step 90720 | 168 batches | lr 0.000439 | ms/batch 5111.67 | loss 0.76 | bpc 1.09247
| epoch 133 step 90730 | 178 batches | lr 0.000439 | ms/batch 5111.43 | loss 0.68 | bpc 0.97713
| epoch 133 step 90740 | 188 batches | lr 0.000439 | ms/batch 5107.81 | loss 0.68 | bpc 0.98147
| epoch 133 step 90750 | 198 batches | lr 0.000439 | ms/batch 5108.37 | loss 0.65 | bpc 0.94320
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 363 at step 90750 | time: 1345.63s | valid loss 0.74 | bpc 1.07103
----------------------------------------------------------------------------------------------------
| epoch 133 step 90760 | 208 batches | lr 0.000439 | ms/batch 11911.55 | loss 0.67 | bpc 0.97205
| epoch 133 step 90770 | 218 batches | lr 0.000439 | ms/batch 5107.83 | loss 0.76 | bpc 1.09557
| epoch 133 step 90780 | 228 batches | lr 0.000439 | ms/batch 5111.95 | loss 0.75 | bpc 1.07634
| epoch 133 step 90790 | 238 batches | lr 0.000439 | ms/batch 5109.72 | loss 0.74 | bpc 1.06581
| epoch 133 step 90800 | 248 batches | lr 0.000439 | ms/batch 5104.38 | loss 0.78 | bpc 1.12792
| epoch 133 step 90810 | 258 batches | lr 0.000439 | ms/batch 5109.96 | loss 0.75 | bpc 1.07968
| epoch 133 step 90820 | 268 batches | lr 0.000439 | ms/batch 5111.96 | loss 0.70 | bpc 1.00924
| epoch 133 step 90830 | 278 batches | lr 0.000439 | ms/batch 5113.19 | loss 0.75 | bpc 1.08046
| epoch 133 step 90840 | 288 batches | lr 0.000439 | ms/batch 5108.38 | loss 0.79 | bpc 1.13475
| epoch 133 step 90850 | 298 batches | lr 0.000439 | ms/batch 5106.72 | loss 0.75 | bpc 1.08389
| epoch 133 step 90860 | 308 batches | lr 0.000439 | ms/batch 5119.74 | loss 0.76 | bpc 1.09639
| epoch 133 step 90870 | 318 batches | lr 0.000439 | ms/batch 5127.41 | loss 0.76 | bpc 1.09804
| epoch 133 step 90880 | 328 batches | lr 0.000439 | ms/batch 5128.21 | loss 0.76 | bpc 1.09567
| epoch 133 step 90890 | 338 batches | lr 0.000439 | ms/batch 5128.82 | loss 0.73 | bpc 1.05680
| epoch 133 step 90900 | 348 batches | lr 0.000439 | ms/batch 5131.21 | loss 0.73 | bpc 1.04701
| epoch 133 step 90910 | 358 batches | lr 0.000439 | ms/batch 5125.34 | loss 0.80 | bpc 1.14760
| epoch 133 step 90920 | 368 batches | lr 0.000439 | ms/batch 5121.63 | loss 0.74 | bpc 1.07188
| epoch 133 step 90930 | 378 batches | lr 0.000439 | ms/batch 5127.15 | loss 0.74 | bpc 1.06917
| epoch 133 step 90940 | 388 batches | lr 0.000439 | ms/batch 5128.78 | loss 0.77 | bpc 1.11407
| epoch 133 step 90950 | 398 batches | lr 0.000439 | ms/batch 5122.35 | loss 0.77 | bpc 1.10687
| epoch 133 step 90960 | 408 batches | lr 0.000439 | ms/batch 5119.91 | loss 0.77 | bpc 1.11076
| epoch 133 step 90970 | 418 batches | lr 0.000439 | ms/batch 5122.06 | loss 0.74 | bpc 1.06522
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
| epoch 133 step 90980 | 428 batches | lr 0.000439 | ms/batch 5119.29 | loss 0.73 | bpc 1.05165
| epoch 133 step 90990 | 438 batches | lr 0.000439 | ms/batch 5119.67 | loss 0.77 | bpc 1.11531
| epoch 133 step 91000 | 448 batches | lr 0.000439 | ms/batch 5120.20 | loss 0.77 | bpc 1.10507
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 364 at step 91000 | time: 1347.88s | valid loss 0.75 | bpc 1.07572
----------------------------------------------------------------------------------------------------
| epoch 133 step 91010 | 458 batches | lr 0.000439 | ms/batch 11940.64 | loss 0.76 | bpc 1.10225
| epoch 133 step 91020 | 468 batches | lr 0.000439 | ms/batch 5122.30 | loss 0.73 | bpc 1.04596
| epoch 133 step 91030 | 478 batches | lr 0.000439 | ms/batch 5118.91 | loss 0.76 | bpc 1.09550
| epoch 133 step 91040 | 488 batches | lr 0.000439 | ms/batch 5119.02 | loss 0.75 | bpc 1.08896
| epoch 133 step 91050 | 498 batches | lr 0.000439 | ms/batch 5118.44 | loss 0.75 | bpc 1.07983
| epoch 133 step 91060 | 508 batches | lr 0.000439 | ms/batch 5125.66 | loss 0.76 | bpc 1.08979
| epoch 133 step 91070 | 518 batches | lr 0.000439 | ms/batch 5120.20 | loss 0.74 | bpc 1.07275
| epoch 133 step 91080 | 528 batches | lr 0.000439 | ms/batch 5121.33 | loss 0.77 | bpc 1.11169
| epoch 133 step 91090 | 538 batches | lr 0.000439 | ms/batch 5122.21 | loss 0.79 | bpc 1.14165
| epoch 133 step 91100 | 548 batches | lr 0.000439 | ms/batch 5111.56 | loss 0.74 | bpc 1.06155
| epoch 133 step 91110 | 558 batches | lr 0.000439 | ms/batch 5120.01 | loss 0.77 | bpc 1.10714
| epoch 133 step 91120 | 568 batches | lr 0.000439 | ms/batch 5123.99 | loss 0.78 | bpc 1.12115
| epoch 133 step 91130 | 578 batches | lr 0.000439 | ms/batch 5120.14 | loss 0.74 | bpc 1.06153
| epoch 133 step 91140 | 588 batches | lr 0.000439 | ms/batch 5123.13 | loss 0.76 | bpc 1.09132
| epoch 133 step 91150 | 598 batches | lr 0.000439 | ms/batch 5119.97 | loss 0.76 | bpc 1.09955
| epoch 133 step 91160 | 608 batches | lr 0.000439 | ms/batch 5120.13 | loss 0.74 | bpc 1.06929
| epoch 133 step 91170 | 618 batches | lr 0.000439 | ms/batch 5120.88 | loss 0.75 | bpc 1.08082
| epoch 133 step 91180 | 628 batches | lr 0.000439 | ms/batch 5122.55 | loss 0.77 | bpc 1.10757
| epoch 133 step 91190 | 638 batches | lr 0.000439 | ms/batch 5124.05 | loss 0.74 | bpc 1.07109
| epoch 133 step 91200 | 648 batches | lr 0.000439 | ms/batch 5115.69 | loss 0.79 | bpc 1.14675
| epoch 133 step 91210 | 658 batches | lr 0.000439 | ms/batch 5126.04 | loss 0.78 | bpc 1.12030
| epoch 133 step 91220 | 668 batches | lr 0.000439 | ms/batch 5121.64 | loss 0.77 | bpc 1.11491
| epoch 133 step 91230 | 678 batches | lr 0.000439 | ms/batch 5118.10 | loss 0.70 | bpc 1.00776
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 134 step 91240 | 2 batches | lr 0.000439 | ms/batch 5119.29 | loss 0.70 | bpc 1.00748
| epoch 134 step 91250 | 12 batches | lr 0.000438 | ms/batch 5119.71 | loss 0.74 | bpc 1.07389
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 365 at step 91250 | time: 1364.88s | valid loss 0.74 | bpc 1.07403
----------------------------------------------------------------------------------------------------
| epoch 134 step 91260 | 22 batches | lr 0.000438 | ms/batch 11939.70 | loss 0.77 | bpc 1.11210
| epoch 134 step 91270 | 32 batches | lr 0.000438 | ms/batch 5125.48 | loss 0.74 | bpc 1.06260
| epoch 134 step 91280 | 42 batches | lr 0.000438 | ms/batch 5123.30 | loss 0.72 | bpc 1.04584
| epoch 134 step 91290 | 52 batches | lr 0.000438 | ms/batch 5121.10 | loss 0.76 | bpc 1.09389
| epoch 134 step 91300 | 62 batches | lr 0.000438 | ms/batch 5124.09 | loss 0.76 | bpc 1.09346
| epoch 134 step 91310 | 72 batches | lr 0.000438 | ms/batch 5120.03 | loss 0.76 | bpc 1.09210
| epoch 134 step 91320 | 82 batches | lr 0.000438 | ms/batch 5122.85 | loss 0.70 | bpc 1.01647
| epoch 134 step 91330 | 92 batches | lr 0.000438 | ms/batch 5122.64 | loss 0.72 | bpc 1.04324
| epoch 134 step 91340 | 102 batches | lr 0.000438 | ms/batch 5121.83 | loss 0.74 | bpc 1.06071
| epoch 134 step 91350 | 112 batches | lr 0.000438 | ms/batch 5120.23 | loss 0.75 | bpc 1.08647
| epoch 134 step 91360 | 122 batches | lr 0.000438 | ms/batch 5120.56 | loss 0.75 | bpc 1.08814
| epoch 134 step 91370 | 132 batches | lr 0.000438 | ms/batch 5125.54 | loss 0.76 | bpc 1.10039
| epoch 134 step 91380 | 142 batches | lr 0.000438 | ms/batch 5122.66 | loss 0.80 | bpc 1.15846
| epoch 134 step 91390 | 152 batches | lr 0.000438 | ms/batch 5125.40 | loss 0.76 | bpc 1.10032
| epoch 134 step 91400 | 162 batches | lr 0.000438 | ms/batch 5122.63 | loss 0.79 | bpc 1.13451
| epoch 134 step 91410 | 172 batches | lr 0.000438 | ms/batch 5121.27 | loss 0.71 | bpc 1.01871
| epoch 134 step 91420 | 182 batches | lr 0.000438 | ms/batch 5117.23 | loss 0.68 | bpc 0.98026
| epoch 134 step 91430 | 192 batches | lr 0.000438 | ms/batch 5122.47 | loss 0.69 | bpc 0.99124
| epoch 134 step 91440 | 202 batches | lr 0.000438 | ms/batch 5123.32 | loss 0.61 | bpc 0.88409
| epoch 134 step 91450 | 212 batches | lr 0.000438 | ms/batch 5126.42 | loss 0.73 | bpc 1.05204
| epoch 134 step 91460 | 222 batches | lr 0.000438 | ms/batch 5120.11 | loss 0.76 | bpc 1.10027
| epoch 134 step 91470 | 232 batches | lr 0.000438 | ms/batch 5119.94 | loss 0.74 | bpc 1.07314
| epoch 134 step 91480 | 242 batches | lr 0.000438 | ms/batch 5120.24 | loss 0.74 | bpc 1.07302
| epoch 134 step 91490 | 252 batches | lr 0.000438 | ms/batch 5125.06 | loss 0.79 | bpc 1.14510
| epoch 134 step 91500 | 262 batches | lr 0.000438 | ms/batch 5120.84 | loss 0.72 | bpc 1.03439
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 366 at step 91500 | time: 1354.76s | valid loss 0.74 | bpc 1.07290
----------------------------------------------------------------------------------------------------
| epoch 134 step 91510 | 272 batches | lr 0.000438 | ms/batch 11933.97 | loss 0.71 | bpc 1.02127
| epoch 134 step 91520 | 282 batches | lr 0.000438 | ms/batch 5124.03 | loss 0.77 | bpc 1.10711
| epoch 134 step 91530 | 292 batches | lr 0.000438 | ms/batch 5121.65 | loss 0.77 | bpc 1.10673
| epoch 134 step 91540 | 302 batches | lr 0.000438 | ms/batch 5123.80 | loss 0.76 | bpc 1.09445
| epoch 134 step 91550 | 312 batches | lr 0.000438 | ms/batch 5116.27 | loss 0.78 | bpc 1.12568
| epoch 134 step 91560 | 322 batches | lr 0.000438 | ms/batch 5126.32 | loss 0.74 | bpc 1.06682
| epoch 134 step 91570 | 332 batches | lr 0.000438 | ms/batch 5121.02 | loss 0.77 | bpc 1.11527
| epoch 134 step 91580 | 342 batches | lr 0.000438 | ms/batch 5124.98 | loss 0.70 | bpc 1.01335
| epoch 134 step 91590 | 352 batches | lr 0.000438 | ms/batch 5121.97 | loss 0.77 | bpc 1.10874
| epoch 134 step 91600 | 362 batches | lr 0.000438 | ms/batch 5123.63 | loss 0.77 | bpc 1.11129
| epoch 134 step 91610 | 372 batches | lr 0.000438 | ms/batch 5123.68 | loss 0.72 | bpc 1.04203
| epoch 134 step 91620 | 382 batches | lr 0.000438 | ms/batch 5123.21 | loss 0.76 | bpc 1.09713
| epoch 134 step 91630 | 392 batches | lr 0.000438 | ms/batch 5125.98 | loss 0.75 | bpc 1.08790
| epoch 134 step 91640 | 402 batches | lr 0.000438 | ms/batch 5121.36 | loss 0.79 | bpc 1.14159
| epoch 134 step 91650 | 412 batches | lr 0.000438 | ms/batch 5121.59 | loss 0.75 | bpc 1.08212
| epoch 134 step 91660 | 422 batches | lr 0.000438 | ms/batch 5125.01 | loss 0.73 | bpc 1.05964
| epoch 134 step 91670 | 432 batches | lr 0.000438 | ms/batch 5119.58 | loss 0.74 | bpc 1.07421
| epoch 134 step 91680 | 442 batches | lr 0.000438 | ms/batch 5121.63 | loss 0.78 | bpc 1.12457
| epoch 134 step 91690 | 452 batches | lr 0.000438 | ms/batch 5124.41 | loss 0.76 | bpc 1.09801
| epoch 134 step 91700 | 462 batches | lr 0.000438 | ms/batch 5124.75 | loss 0.76 | bpc 1.09236
| epoch 134 step 91710 | 472 batches | lr 0.000438 | ms/batch 5120.29 | loss 0.74 | bpc 1.07291
| epoch 134 step 91720 | 482 batches | lr 0.000438 | ms/batch 5123.87 | loss 0.76 | bpc 1.09363
| epoch 134 step 91730 | 492 batches | lr 0.000438 | ms/batch 5127.08 | loss 0.75 | bpc 1.07610
| epoch 134 step 91740 | 502 batches | lr 0.000438 | ms/batch 5116.91 | loss 0.74 | bpc 1.07240
| epoch 134 step 91750 | 512 batches | lr 0.000438 | ms/batch 5122.61 | loss 0.75 | bpc 1.07580
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 367 at step 91750 | time: 1348.93s | valid loss 0.75 | bpc 1.07552
----------------------------------------------------------------------------------------------------
| epoch 134 step 91760 | 522 batches | lr 0.000438 | ms/batch 11937.75 | loss 0.76 | bpc 1.09140
| epoch 134 step 91770 | 532 batches | lr 0.000438 | ms/batch 5124.05 | loss 0.79 | bpc 1.14535
| epoch 134 step 91780 | 542 batches | lr 0.000438 | ms/batch 5125.27 | loss 0.75 | bpc 1.08582
| epoch 134 step 91790 | 552 batches | lr 0.000438 | ms/batch 5122.81 | loss 0.76 | bpc 1.09162
| epoch 134 step 91800 | 562 batches | lr 0.000438 | ms/batch 5118.09 | loss 0.78 | bpc 1.12371
| epoch 134 step 91810 | 572 batches | lr 0.000438 | ms/batch 5118.24 | loss 0.75 | bpc 1.08403
| epoch 134 step 91820 | 582 batches | lr 0.000438 | ms/batch 5124.38 | loss 0.73 | bpc 1.05600
| epoch 134 step 91830 | 592 batches | lr 0.000438 | ms/batch 5119.11 | loss 0.78 | bpc 1.11963
| epoch 134 step 91840 | 602 batches | lr 0.000438 | ms/batch 5126.73 | loss 0.74 | bpc 1.07304
| epoch 134 step 91850 | 612 batches | lr 0.000438 | ms/batch 5120.16 | loss 0.73 | bpc 1.04737
| epoch 134 step 91860 | 622 batches | lr 0.000438 | ms/batch 5120.81 | loss 0.77 | bpc 1.11122
| epoch 134 step 91870 | 632 batches | lr 0.000438 | ms/batch 5120.20 | loss 0.77 | bpc 1.10538
| epoch 134 step 91880 | 642 batches | lr 0.000438 | ms/batch 5120.47 | loss 0.76 | bpc 1.09339
| epoch 134 step 91890 | 652 batches | lr 0.000438 | ms/batch 5121.04 | loss 0.79 | bpc 1.14010
| epoch 134 step 91900 | 662 batches | lr 0.000438 | ms/batch 5122.39 | loss 0.78 | bpc 1.12818
| epoch 134 step 91910 | 672 batches | lr 0.000438 | ms/batch 5121.94 | loss 0.72 | bpc 1.04372
| epoch 134 step 91920 | 682 batches | lr 0.000438 | ms/batch 5122.30 | loss 0.71 | bpc 1.02469
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 135 step 91930 | 6 batches | lr 0.000438 | ms/batch 5119.01 | loss 0.71 | bpc 1.02902
| epoch 135 step 91940 | 16 batches | lr 0.000438 | ms/batch 5126.68 | loss 0.75 | bpc 1.07787
| epoch 135 step 91950 | 26 batches | lr 0.000438 | ms/batch 5123.15 | loss 0.78 | bpc 1.12506
| epoch 135 step 91960 | 36 batches | lr 0.000438 | ms/batch 5121.75 | loss 0.71 | bpc 1.02348
| epoch 135 step 91970 | 46 batches | lr 0.000438 | ms/batch 5119.93 | loss 0.74 | bpc 1.06674
| epoch 135 step 91980 | 56 batches | lr 0.000438 | ms/batch 5115.76 | loss 0.77 | bpc 1.11783
| epoch 135 step 91990 | 66 batches | lr 0.000438 | ms/batch 5120.60 | loss 0.76 | bpc 1.10003
| epoch 135 step 92000 | 76 batches | lr 0.000438 | ms/batch 5127.21 | loss 0.72 | bpc 1.03862
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 368 at step 92000 | time: 1348.79s | valid loss 0.74 | bpc 1.07102
----------------------------------------------------------------------------------------------------
| epoch 135 step 92010 | 86 batches | lr 0.000438 | ms/batch 11942.76 | loss 0.72 | bpc 1.03705
| epoch 135 step 92020 | 96 batches | lr 0.000438 | ms/batch 5123.06 | loss 0.72 | bpc 1.03665
| epoch 135 step 92030 | 106 batches | lr 0.000437 | ms/batch 5124.66 | loss 0.76 | bpc 1.08955
| epoch 135 step 92040 | 116 batches | lr 0.000437 | ms/batch 5123.00 | loss 0.75 | bpc 1.07701
| epoch 135 step 92050 | 126 batches | lr 0.000437 | ms/batch 5118.86 | loss 0.76 | bpc 1.09663
| epoch 135 step 92060 | 136 batches | lr 0.000437 | ms/batch 5124.19 | loss 0.78 | bpc 1.12018
| epoch 135 step 92070 | 146 batches | lr 0.000437 | ms/batch 5118.52 | loss 0.77 | bpc 1.11527
| epoch 135 step 92080 | 156 batches | lr 0.000437 | ms/batch 5123.18 | loss 0.77 | bpc 1.11421
| epoch 135 step 92090 | 166 batches | lr 0.000437 | ms/batch 5123.93 | loss 0.78 | bpc 1.12540
| epoch 135 step 92100 | 176 batches | lr 0.000437 | ms/batch 5122.49 | loss 0.68 | bpc 0.98181
| epoch 135 step 92110 | 186 batches | lr 0.000437 | ms/batch 5118.11 | loss 0.68 | bpc 0.98641
| epoch 135 step 92120 | 196 batches | lr 0.000437 | ms/batch 5124.54 | loss 0.66 | bpc 0.95618
| epoch 135 step 92130 | 206 batches | lr 0.000437 | ms/batch 5124.78 | loss 0.65 | bpc 0.94087
| epoch 135 step 92140 | 216 batches | lr 0.000437 | ms/batch 5121.32 | loss 0.75 | bpc 1.08768
| epoch 135 step 92150 | 226 batches | lr 0.000437 | ms/batch 5119.58 | loss 0.74 | bpc 1.06973
| epoch 135 step 92160 | 236 batches | lr 0.000437 | ms/batch 5123.95 | loss 0.74 | bpc 1.06216
| epoch 135 step 92170 | 246 batches | lr 0.000437 | ms/batch 5124.08 | loss 0.78 | bpc 1.12672
| epoch 135 step 92180 | 256 batches | lr 0.000437 | ms/batch 5118.71 | loss 0.75 | bpc 1.08195
| epoch 135 step 92190 | 266 batches | lr 0.000437 | ms/batch 5120.78 | loss 0.70 | bpc 1.01363
| epoch 135 step 92200 | 276 batches | lr 0.000437 | ms/batch 5120.06 | loss 0.74 | bpc 1.07181
| epoch 135 step 92210 | 286 batches | lr 0.000437 | ms/batch 5126.28 | loss 0.79 | bpc 1.13477
| epoch 135 step 92220 | 296 batches | lr 0.000437 | ms/batch 5122.35 | loss 0.75 | bpc 1.07681
| epoch 135 step 92230 | 306 batches | lr 0.000437 | ms/batch 5123.06 | loss 0.75 | bpc 1.08436
| epoch 135 step 92240 | 316 batches | lr 0.000437 | ms/batch 5119.68 | loss 0.78 | bpc 1.11880
| epoch 135 step 92250 | 326 batches | lr 0.000437 | ms/batch 5122.97 | loss 0.75 | bpc 1.08105
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 369 at step 92250 | time: 1348.85s | valid loss 0.74 | bpc 1.07153
----------------------------------------------------------------------------------------------------
| epoch 135 step 92260 | 336 batches | lr 0.000437 | ms/batch 11938.75 | loss 0.75 | bpc 1.07953
| epoch 135 step 92270 | 346 batches | lr 0.000437 | ms/batch 5120.34 | loss 0.72 | bpc 1.03947
| epoch 135 step 92280 | 356 batches | lr 0.000437 | ms/batch 5120.88 | loss 0.78 | bpc 1.12837
| epoch 135 step 92290 | 366 batches | lr 0.000437 | ms/batch 5117.57 | loss 0.74 | bpc 1.06870
| epoch 135 step 92300 | 376 batches | lr 0.000437 | ms/batch 5120.35 | loss 0.73 | bpc 1.05865
| epoch 135 step 92310 | 386 batches | lr 0.000437 | ms/batch 5117.74 | loss 0.77 | bpc 1.11322
| epoch 135 step 92320 | 396 batches | lr 0.000437 | ms/batch 5119.63 | loss 0.76 | bpc 1.10073
| epoch 135 step 92330 | 406 batches | lr 0.000437 | ms/batch 5126.25 | loss 0.77 | bpc 1.11209
| epoch 135 step 92340 | 416 batches | lr 0.000437 | ms/batch 5121.94 | loss 0.75 | bpc 1.08439
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
| epoch 135 step 92350 | 426 batches | lr 0.000437 | ms/batch 5123.37 | loss 0.72 | bpc 1.03635
| epoch 135 step 92360 | 436 batches | lr 0.000437 | ms/batch 5121.58 | loss 0.78 | bpc 1.12885
| epoch 135 step 92370 | 446 batches | lr 0.000437 | ms/batch 5119.95 | loss 0.76 | bpc 1.09494
| epoch 135 step 92380 | 456 batches | lr 0.000437 | ms/batch 5122.46 | loss 0.76 | bpc 1.09319
| epoch 135 step 92390 | 466 batches | lr 0.000437 | ms/batch 5121.26 | loss 0.75 | bpc 1.08044
| epoch 135 step 92400 | 476 batches | lr 0.000437 | ms/batch 5120.14 | loss 0.74 | bpc 1.06922
| epoch 135 step 92410 | 486 batches | lr 0.000437 | ms/batch 5125.21 | loss 0.76 | bpc 1.09745
| epoch 135 step 92420 | 496 batches | lr 0.000437 | ms/batch 5118.92 | loss 0.75 | bpc 1.08748
| epoch 135 step 92430 | 506 batches | lr 0.000437 | ms/batch 5115.12 | loss 0.74 | bpc 1.06351
| epoch 135 step 92440 | 516 batches | lr 0.000437 | ms/batch 5125.97 | loss 0.75 | bpc 1.08489
| epoch 135 step 92450 | 526 batches | lr 0.000437 | ms/batch 5120.37 | loss 0.76 | bpc 1.10136
| epoch 135 step 92460 | 536 batches | lr 0.000437 | ms/batch 5123.31 | loss 0.79 | bpc 1.13906
| epoch 135 step 92470 | 546 batches | lr 0.000437 | ms/batch 5124.34 | loss 0.74 | bpc 1.07216
| epoch 135 step 92480 | 556 batches | lr 0.000437 | ms/batch 5119.22 | loss 0.77 | bpc 1.10954
| epoch 135 step 92490 | 566 batches | lr 0.000437 | ms/batch 5123.43 | loss 0.78 | bpc 1.12285
| epoch 135 step 92500 | 576 batches | lr 0.000437 | ms/batch 5123.83 | loss 0.74 | bpc 1.06467
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 370 at step 92500 | time: 1348.64s | valid loss 0.74 | bpc 1.07452
----------------------------------------------------------------------------------------------------
| epoch 135 step 92510 | 586 batches | lr 0.000437 | ms/batch 11942.68 | loss 0.74 | bpc 1.07290
| epoch 135 step 92520 | 596 batches | lr 0.000437 | ms/batch 5123.11 | loss 0.77 | bpc 1.11339
| epoch 135 step 92530 | 606 batches | lr 0.000437 | ms/batch 5124.56 | loss 0.74 | bpc 1.06615
| epoch 135 step 92540 | 616 batches | lr 0.000437 | ms/batch 5123.49 | loss 0.74 | bpc 1.06559
| epoch 135 step 92550 | 626 batches | lr 0.000437 | ms/batch 5124.33 | loss 0.76 | bpc 1.09354
| epoch 135 step 92560 | 636 batches | lr 0.000437 | ms/batch 5113.53 | loss 0.77 | bpc 1.11025
| epoch 135 step 92570 | 646 batches | lr 0.000437 | ms/batch 5123.59 | loss 0.76 | bpc 1.10032
| epoch 135 step 92580 | 656 batches | lr 0.000437 | ms/batch 5125.44 | loss 0.79 | bpc 1.13868
| epoch 135 step 92590 | 666 batches | lr 0.000437 | ms/batch 5120.03 | loss 0.78 | bpc 1.12575
| epoch 135 step 92600 | 676 batches | lr 0.000437 | ms/batch 5122.01 | loss 0.70 | bpc 1.01207
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
| epoch 135 step 92610 | 686 batches | lr 0.000437 | ms/batch 5122.36 | loss 0.70 | bpc 1.00351
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 136 step 92620 | 10 batches | lr 0.000437 | ms/batch 5122.12 | loss 0.73 | bpc 1.05398
| epoch 136 step 92630 | 20 batches | lr 0.000437 | ms/batch 5122.74 | loss 0.76 | bpc 1.09805
| epoch 136 step 92640 | 30 batches | lr 0.000437 | ms/batch 5119.12 | loss 0.76 | bpc 1.08943
| epoch 136 step 92650 | 40 batches | lr 0.000437 | ms/batch 5126.25 | loss 0.72 | bpc 1.03534
| epoch 136 step 92660 | 50 batches | lr 0.000437 | ms/batch 5118.66 | loss 0.76 | bpc 1.09752
| epoch 136 step 92670 | 60 batches | lr 0.000437 | ms/batch 5123.61 | loss 0.77 | bpc 1.11096
| epoch 136 step 92680 | 70 batches | lr 0.000437 | ms/batch 5120.75 | loss 0.75 | bpc 1.07573
| epoch 136 step 92690 | 80 batches | lr 0.000437 | ms/batch 5124.45 | loss 0.71 | bpc 1.02107
| epoch 136 step 92700 | 90 batches | lr 0.000437 | ms/batch 5121.95 | loss 0.74 | bpc 1.06097
| epoch 136 step 92710 | 100 batches | lr 0.000437 | ms/batch 5123.19 | loss 0.71 | bpc 1.02251
| epoch 136 step 92720 | 110 batches | lr 0.000437 | ms/batch 5113.39 | loss 0.77 | bpc 1.10710
| epoch 136 step 92730 | 120 batches | lr 0.000437 | ms/batch 5123.65 | loss 0.74 | bpc 1.06208
| epoch 136 step 92740 | 130 batches | lr 0.000437 | ms/batch 5125.91 | loss 0.78 | bpc 1.12550
| epoch 136 step 92750 | 140 batches | lr 0.000437 | ms/batch 5127.09 | loss 0.80 | bpc 1.14714
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 371 at step 92750 | time: 1349.11s | valid loss 0.74 | bpc 1.07145
----------------------------------------------------------------------------------------------------
| epoch 136 step 92760 | 150 batches | lr 0.000437 | ms/batch 11945.82 | loss 0.75 | bpc 1.08482
| epoch 136 step 92770 | 160 batches | lr 0.000437 | ms/batch 5123.42 | loss 0.79 | bpc 1.13710
| epoch 136 step 92780 | 170 batches | lr 0.000437 | ms/batch 5121.72 | loss 0.73 | bpc 1.05799
| epoch 136 step 92790 | 180 batches | lr 0.000436 | ms/batch 5123.36 | loss 0.68 | bpc 0.97656
| epoch 136 step 92800 | 190 batches | lr 0.000436 | ms/batch 5124.59 | loss 0.68 | bpc 0.97435
| epoch 136 step 92810 | 200 batches | lr 0.000436 | ms/batch 5127.55 | loss 0.63 | bpc 0.91498
| epoch 136 step 92820 | 210 batches | lr 0.000436 | ms/batch 5126.69 | loss 0.70 | bpc 1.01295
| epoch 136 step 92830 | 220 batches | lr 0.000436 | ms/batch 5123.19 | loss 0.77 | bpc 1.10973
| epoch 136 step 92840 | 230 batches | lr 0.000436 | ms/batch 5124.17 | loss 0.74 | bpc 1.07001
| epoch 136 step 92850 | 240 batches | lr 0.000436 | ms/batch 5125.55 | loss 0.74 | bpc 1.07058
| epoch 136 step 92860 | 250 batches | lr 0.000436 | ms/batch 5121.88 | loss 0.79 | bpc 1.13413
| epoch 136 step 92870 | 260 batches | lr 0.000436 | ms/batch 5121.36 | loss 0.73 | bpc 1.04599
| epoch 136 step 92880 | 270 batches | lr 0.000436 | ms/batch 5113.14 | loss 0.71 | bpc 1.02279
| epoch 136 step 92890 | 280 batches | lr 0.000436 | ms/batch 5129.46 | loss 0.75 | bpc 1.08839
| epoch 136 step 92900 | 290 batches | lr 0.000436 | ms/batch 5126.24 | loss 0.78 | bpc 1.12098
| epoch 136 step 92910 | 300 batches | lr 0.000436 | ms/batch 5123.44 | loss 0.75 | bpc 1.08574
| epoch 136 step 92920 | 310 batches | lr 0.000436 | ms/batch 5122.10 | loss 0.77 | bpc 1.11493
| epoch 136 step 92930 | 320 batches | lr 0.000436 | ms/batch 5123.49 | loss 0.74 | bpc 1.06234
| epoch 136 step 92940 | 330 batches | lr 0.000436 | ms/batch 5120.29 | loss 0.76 | bpc 1.10155
| epoch 136 step 92950 | 340 batches | lr 0.000436 | ms/batch 5124.48 | loss 0.73 | bpc 1.04707
| epoch 136 step 92960 | 350 batches | lr 0.000436 | ms/batch 5127.02 | loss 0.74 | bpc 1.07461
| epoch 136 step 92970 | 360 batches | lr 0.000436 | ms/batch 5123.65 | loss 0.79 | bpc 1.13808
| epoch 136 step 92980 | 370 batches | lr 0.000436 | ms/batch 5120.34 | loss 0.73 | bpc 1.05712
| epoch 136 step 92990 | 380 batches | lr 0.000436 | ms/batch 5128.76 | loss 0.75 | bpc 1.08118
| epoch 136 step 93000 | 390 batches | lr 0.000436 | ms/batch 5121.16 | loss 0.75 | bpc 1.07838
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 372 at step 93000 | time: 1352.88s | valid loss 0.74 | bpc 1.07187
----------------------------------------------------------------------------------------------------
| epoch 136 step 93010 | 400 batches | lr 0.000436 | ms/batch 11945.09 | loss 0.79 | bpc 1.13832
| epoch 136 step 93020 | 410 batches | lr 0.000436 | ms/batch 5125.38 | loss 0.76 | bpc 1.09491
| epoch 136 step 93030 | 420 batches | lr 0.000436 | ms/batch 5125.95 | loss 0.73 | bpc 1.05675
| epoch 136 step 93040 | 430 batches | lr 0.000436 | ms/batch 5123.19 | loss 0.74 | bpc 1.06510
| epoch 136 step 93050 | 440 batches | lr 0.000436 | ms/batch 5114.47 | loss 0.77 | bpc 1.11440
| epoch 136 step 93060 | 450 batches | lr 0.000436 | ms/batch 5118.84 | loss 0.76 | bpc 1.09957
| epoch 136 step 93070 | 460 batches | lr 0.000436 | ms/batch 5127.15 | loss 0.76 | bpc 1.09381
| epoch 136 step 93080 | 470 batches | lr 0.000436 | ms/batch 5125.75 | loss 0.73 | bpc 1.05676
| epoch 136 step 93090 | 480 batches | lr 0.000436 | ms/batch 5124.33 | loss 0.77 | bpc 1.10690
| epoch 136 step 93100 | 490 batches | lr 0.000436 | ms/batch 5126.13 | loss 0.74 | bpc 1.07454
| epoch 136 step 93110 | 500 batches | lr 0.000436 | ms/batch 5124.39 | loss 0.74 | bpc 1.06879
| epoch 136 step 93120 | 510 batches | lr 0.000436 | ms/batch 5120.59 | loss 0.75 | bpc 1.07689
| epoch 136 step 93130 | 520 batches | lr 0.000436 | ms/batch 5120.14 | loss 0.76 | bpc 1.09218
| epoch 136 step 93140 | 530 batches | lr 0.000436 | ms/batch 5119.96 | loss 0.77 | bpc 1.11468
| epoch 136 step 93150 | 540 batches | lr 0.000436 | ms/batch 5123.98 | loss 0.78 | bpc 1.12716
| epoch 136 step 93160 | 550 batches | lr 0.000436 | ms/batch 5129.19 | loss 0.74 | bpc 1.06615
| epoch 136 step 93170 | 560 batches | lr 0.000436 | ms/batch 5126.45 | loss 0.77 | bpc 1.11445
| epoch 136 step 93180 | 570 batches | lr 0.000436 | ms/batch 5120.76 | loss 0.76 | bpc 1.09928
| epoch 136 step 93190 | 580 batches | lr 0.000436 | ms/batch 5119.55 | loss 0.74 | bpc 1.07053
| epoch 136 step 93200 | 590 batches | lr 0.000436 | ms/batch 5118.19 | loss 0.75 | bpc 1.08633
| epoch 136 step 93210 | 600 batches | lr 0.000436 | ms/batch 5120.93 | loss 0.76 | bpc 1.08994
| epoch 136 step 93220 | 610 batches | lr 0.000436 | ms/batch 5121.75 | loss 0.73 | bpc 1.05895
| epoch 136 step 93230 | 620 batches | lr 0.000436 | ms/batch 5127.13 | loss 0.76 | bpc 1.09280
| epoch 136 step 93240 | 630 batches | lr 0.000436 | ms/batch 5120.74 | loss 0.77 | bpc 1.10741
| epoch 136 step 93250 | 640 batches | lr 0.000436 | ms/batch 5122.27 | loss 0.75 | bpc 1.08669
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 373 at step 93250 | time: 1348.88s | valid loss 0.74 | bpc 1.07242
----------------------------------------------------------------------------------------------------
| epoch 136 step 93260 | 650 batches | lr 0.000436 | ms/batch 11927.68 | loss 0.79 | bpc 1.14331
| epoch 136 step 93270 | 660 batches | lr 0.000436 | ms/batch 5171.41 | loss 0.77 | bpc 1.11696
| epoch 136 step 93280 | 670 batches | lr 0.000436 | ms/batch 5193.36 | loss 0.75 | bpc 1.08359
| epoch 136 step 93290 | 680 batches | lr 0.000436 | ms/batch 5184.30 | loss 0.70 | bpc 1.01391
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 137 step 93300 | 4 batches | lr 0.000436 | ms/batch 5132.94 | loss 0.70 | bpc 1.00973
| epoch 137 step 93310 | 14 batches | lr 0.000436 | ms/batch 5125.01 | loss 0.74 | bpc 1.07193
| epoch 137 step 93320 | 24 batches | lr 0.000436 | ms/batch 5121.24 | loss 0.77 | bpc 1.11415
| epoch 137 step 93330 | 34 batches | lr 0.000436 | ms/batch 5128.98 | loss 0.73 | bpc 1.05320
| epoch 137 step 93340 | 44 batches | lr 0.000436 | ms/batch 5169.73 | loss 0.73 | bpc 1.04993
| epoch 137 step 93350 | 54 batches | lr 0.000436 | ms/batch 5131.65 | loss 0.77 | bpc 1.11105
| epoch 137 step 93360 | 64 batches | lr 0.000436 | ms/batch 5150.06 | loss 0.75 | bpc 1.08663
| epoch 137 step 93370 | 74 batches | lr 0.000436 | ms/batch 5162.11 | loss 0.74 | bpc 1.06857
| epoch 137 step 93380 | 84 batches | lr 0.000436 | ms/batch 5170.18 | loss 0.70 | bpc 1.01583
| epoch 137 step 93390 | 94 batches | lr 0.000436 | ms/batch 5212.89 | loss 0.72 | bpc 1.04192
| epoch 137 step 93400 | 104 batches | lr 0.000436 | ms/batch 5223.41 | loss 0.75 | bpc 1.07514
| epoch 137 step 93410 | 114 batches | lr 0.000436 | ms/batch 5175.11 | loss 0.75 | bpc 1.07902
| epoch 137 step 93420 | 124 batches | lr 0.000436 | ms/batch 5204.32 | loss 0.76 | bpc 1.09765
| epoch 137 step 93430 | 134 batches | lr 0.000436 | ms/batch 5218.20 | loss 0.76 | bpc 1.09961
| epoch 137 step 93440 | 144 batches | lr 0.000436 | ms/batch 5243.48 | loss 0.79 | bpc 1.13649
| epoch 137 step 93450 | 154 batches | lr 0.000436 | ms/batch 5164.13 | loss 0.77 | bpc 1.10429
| epoch 137 step 93460 | 164 batches | lr 0.000436 | ms/batch 5217.47 | loss 0.79 | bpc 1.13742
| epoch 137 step 93470 | 174 batches | lr 0.000436 | ms/batch 5210.08 | loss 0.68 | bpc 0.98510
| epoch 137 step 93480 | 184 batches | lr 0.000436 | ms/batch 5241.20 | loss 0.68 | bpc 0.98751
| epoch 137 step 93490 | 194 batches | lr 0.000436 | ms/batch 5196.69 | loss 0.69 | bpc 0.99812
| epoch 137 step 93500 | 204 batches | lr 0.000436 | ms/batch 5212.62 | loss 0.62 | bpc 0.89872
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 374 at step 93500 | time: 1364.28s | valid loss 0.74 | bpc 1.06755
----------------------------------------------------------------------------------------------------
| epoch 137 step 93510 | 214 batches | lr 0.000436 | ms/batch 12621.16 | loss 0.73 | bpc 1.05456
| epoch 137 step 93520 | 224 batches | lr 0.000436 | ms/batch 5139.73 | loss 0.75 | bpc 1.07546
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
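# The repeated "Gradient overflow" lines are the dynamic loss scaler of mixed-precision
# training reacting to inf/nan gradients: the optimizer step is skipped and the loss scale
# is reduced (e.g. to the 524288.0 shown above). A minimal, hypothetical sketch of that
# policy; names and growth_interval are assumptions, not the actual amp implementation:
def update_loss_scale(scale: float, overflow: bool, good_steps: int,
                      growth_interval: int = 2000):
    if overflow:
        return scale / 2.0, 0               # skip this step and halve the scale
    good_steps += 1
    if good_steps % growth_interval == 0:
        return scale * 2.0, good_steps      # periodically try growing the scale again
    return scale, good_steps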
| epoch 137 step 93530 | 234 batches | lr 0.000436 | ms/batch 5115.56 | loss 0.74 | bpc 1.06832
| epoch 137 step 93540 | 244 batches | lr 0.000436 | ms/batch 5118.86 | loss 0.76 | bpc 1.09369
| epoch 137 step 93550 | 254 batches | lr 0.000436 | ms/batch 5164.09 | loss 0.78 | bpc 1.12919
| epoch 137 step 93560 | 264 batches | lr 0.000435 | ms/batch 5186.32 | loss 0.70 | bpc 1.01676
| epoch 137 step 93570 | 274 batches | lr 0.000435 | ms/batch 5165.77 | loss 0.72 | bpc 1.04330
| epoch 137 step 93580 | 284 batches | lr 0.000435 | ms/batch 5201.85 | loss 0.77 | bpc 1.11192
| epoch 137 step 93590 | 294 batches | lr 0.000435 | ms/batch 5195.54 | loss 0.76 | bpc 1.09840
| epoch 137 step 93600 | 304 batches | lr 0.000435 | ms/batch 5192.47 | loss 0.76 | bpc 1.09505
| epoch 137 step 93610 | 314 batches | lr 0.000435 | ms/batch 5221.33 | loss 0.77 | bpc 1.10992
| epoch 137 step 93620 | 324 batches | lr 0.000435 | ms/batch 5203.58 | loss 0.75 | bpc 1.08079
| epoch 137 step 93630 | 334 batches | lr 0.000435 | ms/batch 5168.22 | loss 0.76 | bpc 1.09462
| epoch 137 step 93640 | 344 batches | lr 0.000435 | ms/batch 5183.99 | loss 0.71 | bpc 1.02580
| epoch 137 step 93650 | 354 batches | lr 0.000435 | ms/batch 5217.09 | loss 0.78 | bpc 1.12581
| epoch 137 step 93660 | 364 batches | lr 0.000435 | ms/batch 5200.25 | loss 0.75 | bpc 1.07971
| epoch 137 step 93670 | 374 batches | lr 0.000435 | ms/batch 5210.11 | loss 0.72 | bpc 1.04414
| epoch 137 step 93680 | 384 batches | lr 0.000435 | ms/batch 5203.70 | loss 0.76 | bpc 1.10189
| epoch 137 step 93690 | 394 batches | lr 0.000435 | ms/batch 5180.42 | loss 0.76 | bpc 1.09135
| epoch 137 step 93700 | 404 batches | lr 0.000435 | ms/batch 5176.68 | loss 0.77 | bpc 1.11452
| epoch 137 step 93710 | 414 batches | lr 0.000435 | ms/batch 5205.28 | loss 0.76 | bpc 1.09466
| epoch 137 step 93720 | 424 batches | lr 0.000435 | ms/batch 5187.01 | loss 0.72 | bpc 1.04432
| epoch 137 step 93730 | 434 batches | lr 0.000435 | ms/batch 5197.36 | loss 0.76 | bpc 1.10224
| epoch 137 step 93740 | 444 batches | lr 0.000435 | ms/batch 5218.37 | loss 0.76 | bpc 1.10300
| epoch 137 step 93750 | 454 batches | lr 0.000435 | ms/batch 5212.56 | loss 0.77 | bpc 1.11054
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 375 at step 93750 | time: 1371.10s | valid loss 0.75 | bpc 1.07559
----------------------------------------------------------------------------------------------------
| epoch 137 step 93760 | 464 batches | lr 0.000435 | ms/batch 12001.39 | loss 0.75 | bpc 1.08236
| epoch 137 step 93770 | 474 batches | lr 0.000435 | ms/batch 5183.62 | loss 0.74 | bpc 1.06372
| epoch 137 step 93780 | 484 batches | lr 0.000435 | ms/batch 5154.18 | loss 0.76 | bpc 1.09129
| epoch 137 step 93790 | 494 batches | lr 0.000435 | ms/batch 5159.11 | loss 0.76 | bpc 1.09560
| epoch 137 step 93800 | 504 batches | lr 0.000435 | ms/batch 5152.80 | loss 0.74 | bpc 1.06099
| epoch 137 step 93810 | 514 batches | lr 0.000435 | ms/batch 5153.74 | loss 0.74 | bpc 1.07025
| epoch 137 step 93820 | 524 batches | lr 0.000435 | ms/batch 5160.83 | loss 0.75 | bpc 1.08716
| epoch 137 step 93830 | 534 batches | lr 0.000435 | ms/batch 5153.33 | loss 0.80 | bpc 1.15232
| epoch 137 step 93840 | 544 batches | lr 0.000435 | ms/batch 5155.31 | loss 0.74 | bpc 1.07454
| epoch 137 step 93850 | 554 batches | lr 0.000435 | ms/batch 5169.34 | loss 0.76 | bpc 1.10187
| epoch 137 step 93860 | 564 batches | lr 0.000435 | ms/batch 5158.10 | loss 0.77 | bpc 1.11545
| epoch 137 step 93870 | 574 batches | lr 0.000435 | ms/batch 5158.42 | loss 0.75 | bpc 1.07490
| epoch 137 step 93880 | 584 batches | lr 0.000435 | ms/batch 5153.78 | loss 0.74 | bpc 1.06443
| epoch 137 step 93890 | 594 batches | lr 0.000435 | ms/batch 5156.90 | loss 0.77 | bpc 1.10882
| epoch 137 step 93900 | 604 batches | lr 0.000435 | ms/batch 5151.98 | loss 0.75 | bpc 1.08118
| epoch 137 step 93910 | 614 batches | lr 0.000435 | ms/batch 5135.47 | loss 0.73 | bpc 1.04872
| epoch 137 step 93920 | 624 batches | lr 0.000435 | ms/batch 5160.83 | loss 0.76 | bpc 1.09569
| epoch 137 step 93930 | 634 batches | lr 0.000435 | ms/batch 5172.54 | loss 0.77 | bpc 1.11162
| epoch 137 step 93940 | 644 batches | lr 0.000435 | ms/batch 5147.93 | loss 0.75 | bpc 1.08028
| epoch 137 step 93950 | 654 batches | lr 0.000435 | ms/batch 5160.18 | loss 0.80 | bpc 1.15807
| epoch 137 step 93960 | 664 batches | lr 0.000435 | ms/batch 5174.04 | loss 0.78 | bpc 1.11910
| epoch 137 step 93970 | 674 batches | lr 0.000435 | ms/batch 5170.65 | loss 0.71 | bpc 1.02453
| epoch 137 step 93980 | 684 batches | lr 0.000435 | ms/batch 5188.26 | loss 0.70 | bpc 1.01341
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 138 step 93990 | 8 batches | lr 0.000435 | ms/batch 5182.10 | loss 0.72 | bpc 1.04069
| epoch 138 step 94000 | 18 batches | lr 0.000435 | ms/batch 5168.00 | loss 0.75 | bpc 1.08499
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 376 at step 94000 | time: 1362.85s | valid loss 0.75 | bpc 1.07588
----------------------------------------------------------------------------------------------------
| epoch 138 step 94010 | 28 batches | lr 0.000435 | ms/batch 12006.86 | loss 0.77 | bpc 1.10562
| epoch 138 step 94020 | 38 batches | lr 0.000435 | ms/batch 5185.89 | loss 0.72 | bpc 1.03517
| epoch 138 step 94030 | 48 batches | lr 0.000435 | ms/batch 5174.27 | loss 0.74 | bpc 1.06706
| epoch 138 step 94040 | 58 batches | lr 0.000435 | ms/batch 5186.87 | loss 0.77 | bpc 1.11073
| epoch 138 step 94050 | 68 batches | lr 0.000435 | ms/batch 5191.17 | loss 0.76 | bpc 1.09410
| epoch 138 step 94060 | 78 batches | lr 0.000435 | ms/batch 5166.96 | loss 0.71 | bpc 1.02675
| epoch 138 step 94070 | 88 batches | lr 0.000435 | ms/batch 5183.24 | loss 0.72 | bpc 1.04319
| epoch 138 step 94080 | 98 batches | lr 0.000435 | ms/batch 5174.32 | loss 0.71 | bpc 1.02535
| epoch 138 step 94090 | 108 batches | lr 0.000435 | ms/batch 5166.65 | loss 0.77 | bpc 1.10680
| epoch 138 step 94100 | 118 batches | lr 0.000435 | ms/batch 5169.01 | loss 0.74 | bpc 1.06946
| epoch 138 step 94110 | 128 batches | lr 0.000435 | ms/batch 5172.20 | loss 0.77 | bpc 1.11680
| epoch 138 step 94120 | 138 batches | lr 0.000435 | ms/batch 5166.17 | loss 0.78 | bpc 1.12942
| epoch 138 step 94130 | 148 batches | lr 0.000435 | ms/batch 5162.64 | loss 0.75 | bpc 1.07788
| epoch 138 step 94140 | 158 batches | lr 0.000435 | ms/batch 5168.57 | loss 0.79 | bpc 1.13732
| epoch 138 step 94150 | 168 batches | lr 0.000435 | ms/batch 5206.59 | loss 0.75 | bpc 1.08776
| epoch 138 step 94160 | 178 batches | lr 0.000435 | ms/batch 5204.62 | loss 0.67 | bpc 0.97335
| epoch 138 step 94170 | 188 batches | lr 0.000435 | ms/batch 5205.50 | loss 0.68 | bpc 0.97914
| epoch 138 step 94180 | 198 batches | lr 0.000435 | ms/batch 5177.71 | loss 0.65 | bpc 0.93954
| epoch 138 step 94190 | 208 batches | lr 0.000435 | ms/batch 5177.90 | loss 0.67 | bpc 0.97060
| epoch 138 step 94200 | 218 batches | lr 0.000435 | ms/batch 5160.02 | loss 0.76 | bpc 1.09306
| epoch 138 step 94210 | 228 batches | lr 0.000435 | ms/batch 5179.14 | loss 0.74 | bpc 1.07340
| epoch 138 step 94220 | 238 batches | lr 0.000435 | ms/batch 5178.22 | loss 0.74 | bpc 1.06503
| epoch 138 step 94230 | 248 batches | lr 0.000435 | ms/batch 5177.63 | loss 0.78 | bpc 1.12385
| epoch 138 step 94240 | 258 batches | lr 0.000435 | ms/batch 5170.73 | loss 0.75 | bpc 1.07925
| epoch 138 step 94250 | 268 batches | lr 0.000435 | ms/batch 5181.85 | loss 0.70 | bpc 1.00831
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 377 at step 94250 | time: 1363.07s | valid loss 0.74 | bpc 1.06965
----------------------------------------------------------------------------------------------------
| epoch 138 step 94260 | 278 batches | lr 0.000435 | ms/batch 11947.23 | loss 0.75 | bpc 1.07843
| epoch 138 step 94270 | 288 batches | lr 0.000435 | ms/batch 5175.94 | loss 0.78 | bpc 1.13199
| epoch 138 step 94280 | 298 batches | lr 0.000435 | ms/batch 5182.28 | loss 0.75 | bpc 1.07935
| epoch 138 step 94290 | 308 batches | lr 0.000435 | ms/batch 5172.44 | loss 0.76 | bpc 1.09519
| epoch 138 step 94300 | 318 batches | lr 0.000435 | ms/batch 5176.76 | loss 0.76 | bpc 1.09535
| epoch 138 step 94310 | 328 batches | lr 0.000434 | ms/batch 5156.50 | loss 0.76 | bpc 1.09502
| epoch 138 step 94320 | 338 batches | lr 0.000434 | ms/batch 5173.13 | loss 0.73 | bpc 1.05426
| epoch 138 step 94330 | 348 batches | lr 0.000434 | ms/batch 5174.36 | loss 0.72 | bpc 1.04226
| epoch 138 step 94340 | 358 batches | lr 0.000434 | ms/batch 5171.27 | loss 0.79 | bpc 1.14506
| epoch 138 step 94350 | 368 batches | lr 0.000434 | ms/batch 5163.74 | loss 0.74 | bpc 1.06686
| epoch 138 step 94360 | 378 batches | lr 0.000434 | ms/batch 5180.91 | loss 0.74 | bpc 1.06722
| epoch 138 step 94370 | 388 batches | lr 0.000434 | ms/batch 5161.36 | loss 0.77 | bpc 1.11238
| epoch 138 step 94380 | 398 batches | lr 0.000434 | ms/batch 5157.35 | loss 0.77 | bpc 1.10399
| epoch 138 step 94390 | 408 batches | lr 0.000434 | ms/batch 5166.62 | loss 0.77 | bpc 1.10810
| epoch 138 step 94400 | 418 batches | lr 0.000434 | ms/batch 5167.67 | loss 0.74 | bpc 1.06112
| epoch 138 step 94410 | 428 batches | lr 0.000434 | ms/batch 5172.25 | loss 0.73 | bpc 1.05076
| epoch 138 step 94420 | 438 batches | lr 0.000434 | ms/batch 5179.27 | loss 0.77 | bpc 1.11392
| epoch 138 step 94430 | 448 batches | lr 0.000434 | ms/batch 5175.88 | loss 0.76 | bpc 1.10206
| epoch 138 step 94440 | 458 batches | lr 0.000434 | ms/batch 5154.07 | loss 0.76 | bpc 1.10172
| epoch 138 step 94450 | 468 batches | lr 0.000434 | ms/batch 5171.02 | loss 0.73 | bpc 1.04677
| epoch 138 step 94460 | 478 batches | lr 0.000434 | ms/batch 5176.02 | loss 0.76 | bpc 1.09384
| epoch 138 step 94470 | 488 batches | lr 0.000434 | ms/batch 5176.10 | loss 0.75 | bpc 1.08262
| epoch 138 step 94480 | 498 batches | lr 0.000434 | ms/batch 5177.77 | loss 0.75 | bpc 1.07801
| epoch 138 step 94490 | 508 batches | lr 0.000434 | ms/batch 5174.29 | loss 0.75 | bpc 1.08594
| epoch 138 step 94500 | 518 batches | lr 0.000434 | ms/batch 5176.13 | loss 0.74 | bpc 1.07081
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 378 at step 94500 | time: 1360.97s | valid loss 0.75 | bpc 1.07687
----------------------------------------------------------------------------------------------------
| epoch 138 step 94510 | 528 batches | lr 0.000434 | ms/batch 12001.55 | loss 0.77 | bpc 1.10757
| epoch 138 step 94520 | 538 batches | lr 0.000434 | ms/batch 5180.03 | loss 0.79 | bpc 1.13825
| epoch 138 step 94530 | 548 batches | lr 0.000434 | ms/batch 5168.14 | loss 0.73 | bpc 1.05871
| epoch 138 step 94540 | 558 batches | lr 0.000434 | ms/batch 5164.24 | loss 0.76 | bpc 1.10223
| epoch 138 step 94550 | 568 batches | lr 0.000434 | ms/batch 5163.19 | loss 0.77 | bpc 1.11740
| epoch 138 step 94560 | 578 batches | lr 0.000434 | ms/batch 5155.72 | loss 0.73 | bpc 1.05834
| epoch 138 step 94570 | 588 batches | lr 0.000434 | ms/batch 5166.50 | loss 0.75 | bpc 1.08571
| epoch 138 step 94580 | 598 batches | lr 0.000434 | ms/batch 5174.21 | loss 0.76 | bpc 1.09762
| epoch 138 step 94590 | 608 batches | lr 0.000434 | ms/batch 5177.91 | loss 0.74 | bpc 1.06610
| epoch 138 step 94600 | 618 batches | lr 0.000434 | ms/batch 5189.60 | loss 0.75 | bpc 1.07733
| epoch 138 step 94610 | 628 batches | lr 0.000434 | ms/batch 5178.79 | loss 0.77 | bpc 1.10469
| epoch 138 step 94620 | 638 batches | lr 0.000434 | ms/batch 5181.69 | loss 0.74 | bpc 1.06901
| epoch 138 step 94630 | 648 batches | lr 0.000434 | ms/batch 5136.32 | loss 0.79 | bpc 1.14431
| epoch 138 step 94640 | 658 batches | lr 0.000434 | ms/batch 5132.51 | loss 0.77 | bpc 1.11659
| epoch 138 step 94650 | 668 batches | lr 0.000434 | ms/batch 5140.15 | loss 0.77 | bpc 1.11276
| epoch 138 step 94660 | 678 batches | lr 0.000434 | ms/batch 5175.81 | loss 0.70 | bpc 1.00657
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 139 step 94670 | 2 batches | lr 0.000434 | ms/batch 5188.81 | loss 0.70 | bpc 1.00286
| epoch 139 step 94680 | 12 batches | lr 0.000434 | ms/batch 5124.77 | loss 0.74 | bpc 1.06950
| epoch 139 step 94690 | 22 batches | lr 0.000434 | ms/batch 5122.41 | loss 0.77 | bpc 1.10895
| epoch 139 step 94700 | 32 batches | lr 0.000434 | ms/batch 5121.74 | loss 0.73 | bpc 1.05845
| epoch 139 step 94710 | 42 batches | lr 0.000434 | ms/batch 5130.95 | loss 0.72 | bpc 1.04099
| epoch 139 step 94720 | 52 batches | lr 0.000434 | ms/batch 5124.04 | loss 0.76 | bpc 1.09174
| epoch 139 step 94730 | 62 batches | lr 0.000434 | ms/batch 5124.90 | loss 0.76 | bpc 1.09084
| epoch 139 step 94740 | 72 batches | lr 0.000434 | ms/batch 5125.21 | loss 0.76 | bpc 1.09041
| epoch 139 step 94750 | 82 batches | lr 0.000434 | ms/batch 5126.28 | loss 0.70 | bpc 1.01592
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 379 at step 94750 | time: 1357.96s | valid loss 0.75 | bpc 1.07621
----------------------------------------------------------------------------------------------------
| epoch 139 step 94760 | 92 batches | lr 0.000434 | ms/batch 11943.53 | loss 0.72 | bpc 1.04068
| epoch 139 step 94770 | 102 batches | lr 0.000434 | ms/batch 5127.97 | loss 0.73 | bpc 1.05828
| epoch 139 step 94780 | 112 batches | lr 0.000434 | ms/batch 5123.15 | loss 0.75 | bpc 1.08323
| epoch 139 step 94790 | 122 batches | lr 0.000434 | ms/batch 5125.81 | loss 0.75 | bpc 1.08411
| epoch 139 step 94800 | 132 batches | lr 0.000434 | ms/batch 5125.96 | loss 0.76 | bpc 1.09974
| epoch 139 step 94810 | 142 batches | lr 0.000434 | ms/batch 5118.80 | loss 0.80 | bpc 1.15547
| epoch 139 step 94820 | 152 batches | lr 0.000434 | ms/batch 5123.82 | loss 0.76 | bpc 1.09688
| epoch 139 step 94830 | 162 batches | lr 0.000434 | ms/batch 5127.48 | loss 0.78 | bpc 1.13015
| epoch 139 step 94840 | 172 batches | lr 0.000434 | ms/batch 5127.12 | loss 0.70 | bpc 1.01362
| epoch 139 step 94850 | 182 batches | lr 0.000434 | ms/batch 5123.50 | loss 0.68 | bpc 0.97917
| epoch 139 step 94860 | 192 batches | lr 0.000434 | ms/batch 5126.51 | loss 0.69 | bpc 0.98891
| epoch 139 step 94870 | 202 batches | lr 0.000434 | ms/batch 5123.98 | loss 0.61 | bpc 0.88130
| epoch 139 step 94880 | 212 batches | lr 0.000434 | ms/batch 5121.24 | loss 0.73 | bpc 1.05134
| epoch 139 step 94890 | 222 batches | lr 0.000434 | ms/batch 5121.61 | loss 0.76 | bpc 1.09779
| epoch 139 step 94900 | 232 batches | lr 0.000434 | ms/batch 5133.83 | loss 0.74 | bpc 1.06997
| epoch 139 step 94910 | 242 batches | lr 0.000434 | ms/batch 5114.58 | loss 0.74 | bpc 1.07003
| epoch 139 step 94920 | 252 batches | lr 0.000434 | ms/batch 5111.28 | loss 0.79 | bpc 1.14244
| epoch 139 step 94930 | 262 batches | lr 0.000434 | ms/batch 5122.40 | loss 0.72 | bpc 1.03159
| epoch 139 step 94940 | 272 batches | lr 0.000434 | ms/batch 5123.72 | loss 0.71 | bpc 1.01863
| epoch 139 step 94950 | 282 batches | lr 0.000434 | ms/batch 5119.95 | loss 0.77 | bpc 1.10381
| epoch 139 step 94960 | 292 batches | lr 0.000434 | ms/batch 5126.96 | loss 0.77 | bpc 1.10410
| epoch 139 step 94970 | 302 batches | lr 0.000434 | ms/batch 5126.37 | loss 0.76 | bpc 1.09092
| epoch 139 step 94980 | 312 batches | lr 0.000434 | ms/batch 5122.77 | loss 0.78 | bpc 1.12285
| epoch 139 step 94990 | 322 batches | lr 0.000434 | ms/batch 5125.20 | loss 0.74 | bpc 1.06340
| epoch 139 step 95000 | 332 batches | lr 0.000434 | ms/batch 5126.05 | loss 0.77 | bpc 1.10942
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 380 at step 95000 | time: 1349.20s | valid loss 0.74 | bpc 1.07174
----------------------------------------------------------------------------------------------------
| epoch 139 step 95010 | 342 batches | lr 0.000434 | ms/batch 11936.92 | loss 0.70 | bpc 1.00856
| epoch 139 step 95020 | 352 batches | lr 0.000434 | ms/batch 5126.52 | loss 0.77 | bpc 1.10700
| epoch 139 step 95030 | 362 batches | lr 0.000434 | ms/batch 5124.78 | loss 0.77 | bpc 1.10818
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
| epoch 139 step 95040 | 372 batches | lr 0.000434 | ms/batch 5123.07 | loss 0.72 | bpc 1.03961
| epoch 139 step 95050 | 382 batches | lr 0.000434 | ms/batch 5127.18 | loss 0.76 | bpc 1.09516
| epoch 139 step 95060 | 392 batches | lr 0.000434 | ms/batch 5122.51 | loss 0.75 | bpc 1.08648
| epoch 139 step 95070 | 402 batches | lr 0.000433 | ms/batch 5118.40 | loss 0.79 | bpc 1.13840
| epoch 139 step 95080 | 412 batches | lr 0.000433 | ms/batch 5121.55 | loss 0.75 | bpc 1.07699
| epoch 139 step 95090 | 422 batches | lr 0.000433 | ms/batch 5126.27 | loss 0.73 | bpc 1.05669
| epoch 139 step 95100 | 432 batches | lr 0.000433 | ms/batch 5127.08 | loss 0.74 | bpc 1.07262
| epoch 139 step 95110 | 442 batches | lr 0.000433 | ms/batch 5126.64 | loss 0.78 | bpc 1.11924
| epoch 139 step 95120 | 452 batches | lr 0.000433 | ms/batch 5129.43 | loss 0.76 | bpc 1.09771
| epoch 139 step 95130 | 462 batches | lr 0.000433 | ms/batch 5123.54 | loss 0.76 | bpc 1.08936
| epoch 139 step 95140 | 472 batches | lr 0.000433 | ms/batch 5123.67 | loss 0.74 | bpc 1.06832
| epoch 139 step 95150 | 482 batches | lr 0.000433 | ms/batch 5124.47 | loss 0.76 | bpc 1.09452
| epoch 139 step 95160 | 492 batches | lr 0.000433 | ms/batch 5125.68 | loss 0.74 | bpc 1.07144
| epoch 139 step 95170 | 502 batches | lr 0.000433 | ms/batch 5123.88 | loss 0.74 | bpc 1.06827
| epoch 139 step 95180 | 512 batches | lr 0.000433 | ms/batch 5124.25 | loss 0.74 | bpc 1.07179
| epoch 139 step 95190 | 522 batches | lr 0.000433 | ms/batch 5113.61 | loss 0.75 | bpc 1.08604
| epoch 139 step 95200 | 532 batches | lr 0.000433 | ms/batch 5110.38 | loss 0.79 | bpc 1.14135
| epoch 139 step 95210 | 542 batches | lr 0.000433 | ms/batch 5117.81 | loss 0.75 | bpc 1.08428
| epoch 139 step 95220 | 552 batches | lr 0.000433 | ms/batch 5127.91 | loss 0.76 | bpc 1.08964
| epoch 139 step 95230 | 562 batches | lr 0.000433 | ms/batch 5123.08 | loss 0.78 | bpc 1.12097
| epoch 139 step 95240 | 572 batches | lr 0.000433 | ms/batch 5128.15 | loss 0.75 | bpc 1.08046
| epoch 139 step 95250 | 582 batches | lr 0.000433 | ms/batch 5125.29 | loss 0.73 | bpc 1.05245
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 381 at step 95250 | time: 1349.11s | valid loss 0.75 | bpc 1.07664
----------------------------------------------------------------------------------------------------
| epoch 139 step 95260 | 592 batches | lr 0.000433 | ms/batch 11933.65 | loss 0.77 | bpc 1.11679
| epoch 139 step 95270 | 602 batches | lr 0.000433 | ms/batch 5125.01 | loss 0.74 | bpc 1.07227
| epoch 139 step 95280 | 612 batches | lr 0.000433 | ms/batch 5122.39 | loss 0.72 | bpc 1.04350
| epoch 139 step 95290 | 622 batches | lr 0.000433 | ms/batch 5128.57 | loss 0.77 | bpc 1.10949
| epoch 139 step 95300 | 632 batches | lr 0.000433 | ms/batch 5126.67 | loss 0.76 | bpc 1.10094
| epoch 139 step 95310 | 642 batches | lr 0.000433 | ms/batch 5124.12 | loss 0.76 | bpc 1.08982
| epoch 139 step 95320 | 652 batches | lr 0.000433 | ms/batch 5122.40 | loss 0.79 | bpc 1.13868
| epoch 139 step 95330 | 662 batches | lr 0.000433 | ms/batch 5116.28 | loss 0.78 | bpc 1.12439
| epoch 139 step 95340 | 672 batches | lr 0.000433 | ms/batch 5127.17 | loss 0.72 | bpc 1.04112
| epoch 139 step 95350 | 682 batches | lr 0.000433 | ms/batch 5124.27 | loss 0.71 | bpc 1.02218
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 140 step 95360 | 6 batches | lr 0.000433 | ms/batch 5124.32 | loss 0.71 | bpc 1.02453
| epoch 140 step 95370 | 16 batches | lr 0.000433 | ms/batch 5119.99 | loss 0.74 | bpc 1.07471
| epoch 140 step 95380 | 26 batches | lr 0.000433 | ms/batch 5124.90 | loss 0.78 | bpc 1.12215
| epoch 140 step 95390 | 36 batches | lr 0.000433 | ms/batch 5124.08 | loss 0.71 | bpc 1.02018
| epoch 140 step 95400 | 46 batches | lr 0.000433 | ms/batch 5117.27 | loss 0.74 | bpc 1.06449
| epoch 140 step 95410 | 56 batches | lr 0.000433 | ms/batch 5123.31 | loss 0.77 | bpc 1.11520
| epoch 140 step 95420 | 66 batches | lr 0.000433 | ms/batch 5123.28 | loss 0.76 | bpc 1.09756
| epoch 140 step 95430 | 76 batches | lr 0.000433 | ms/batch 5125.95 | loss 0.72 | bpc 1.03453
| epoch 140 step 95440 | 86 batches | lr 0.000433 | ms/batch 5122.74 | loss 0.72 | bpc 1.03616
| epoch 140 step 95450 | 96 batches | lr 0.000433 | ms/batch 5125.27 | loss 0.72 | bpc 1.03370
| epoch 140 step 95460 | 106 batches | lr 0.000433 | ms/batch 5120.50 | loss 0.75 | bpc 1.08859
| epoch 140 step 95470 | 116 batches | lr 0.000433 | ms/batch 5119.06 | loss 0.74 | bpc 1.07450
| epoch 140 step 95480 | 126 batches | lr 0.000433 | ms/batch 5128.59 | loss 0.76 | bpc 1.09167
| epoch 140 step 95490 | 136 batches | lr 0.000433 | ms/batch 5126.28 | loss 0.77 | bpc 1.11717
| epoch 140 step 95500 | 146 batches | lr 0.000433 | ms/batch 5111.41 | loss 0.77 | bpc 1.10991
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 382 at step 95500 | time: 1348.88s | valid loss 0.74 | bpc 1.07078
----------------------------------------------------------------------------------------------------
| epoch 140 step 95510 | 156 batches | lr 0.000433 | ms/batch 11937.58 | loss 0.77 | bpc 1.11250
| epoch 140 step 95520 | 166 batches | lr 0.000433 | ms/batch 5120.65 | loss 0.78 | bpc 1.12161
| epoch 140 step 95530 | 176 batches | lr 0.000433 | ms/batch 5127.33 | loss 0.68 | bpc 0.98176
| epoch 140 step 95540 | 186 batches | lr 0.000433 | ms/batch 5121.15 | loss 0.68 | bpc 0.98519
| epoch 140 step 95550 | 196 batches | lr 0.000433 | ms/batch 5124.66 | loss 0.66 | bpc 0.95470
| epoch 140 step 95560 | 206 batches | lr 0.000433 | ms/batch 5125.40 | loss 0.65 | bpc 0.93864
| epoch 140 step 95570 | 216 batches | lr 0.000433 | ms/batch 5126.40 | loss 0.75 | bpc 1.08320
| epoch 140 step 95580 | 226 batches | lr 0.000433 | ms/batch 5119.82 | loss 0.74 | bpc 1.06770
| epoch 140 step 95590 | 236 batches | lr 0.000433 | ms/batch 5120.07 | loss 0.74 | bpc 1.06156
| epoch 140 step 95600 | 246 batches | lr 0.000433 | ms/batch 5124.59 | loss 0.78 | bpc 1.12296
| epoch 140 step 95610 | 256 batches | lr 0.000433 | ms/batch 5128.45 | loss 0.75 | bpc 1.07933
| epoch 140 step 95620 | 266 batches | lr 0.000433 | ms/batch 5125.60 | loss 0.70 | bpc 1.01103
| epoch 140 step 95630 | 276 batches | lr 0.000433 | ms/batch 5123.98 | loss 0.74 | bpc 1.06968
| epoch 140 step 95640 | 286 batches | lr 0.000433 | ms/batch 5125.66 | loss 0.79 | bpc 1.13333
| epoch 140 step 95650 | 296 batches | lr 0.000433 | ms/batch 5119.11 | loss 0.74 | bpc 1.07271
| epoch 140 step 95660 | 306 batches | lr 0.000433 | ms/batch 5122.45 | loss 0.75 | bpc 1.08275
| epoch 140 step 95670 | 316 batches | lr 0.000433 | ms/batch 5126.38 | loss 0.78 | bpc 1.11849
| epoch 140 step 95680 | 326 batches | lr 0.000433 | ms/batch 5127.75 | loss 0.75 | bpc 1.07864
| epoch 140 step 95690 | 336 batches | lr 0.000433 | ms/batch 5123.21 | loss 0.75 | bpc 1.07753
| epoch 140 step 95700 | 346 batches | lr 0.000433 | ms/batch 5123.39 | loss 0.72 | bpc 1.03681
| epoch 140 step 95710 | 356 batches | lr 0.000433 | ms/batch 5127.67 | loss 0.78 | bpc 1.12506
| epoch 140 step 95720 | 366 batches | lr 0.000433 | ms/batch 5117.95 | loss 0.74 | bpc 1.06681
| epoch 140 step 95730 | 376 batches | lr 0.000433 | ms/batch 5122.99 | loss 0.73 | bpc 1.05486
| epoch 140 step 95740 | 386 batches | lr 0.000433 | ms/batch 5126.79 | loss 0.77 | bpc 1.11117
| epoch 140 step 95750 | 396 batches | lr 0.000433 | ms/batch 5128.96 | loss 0.76 | bpc 1.09738
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 383 at step 95750 | time: 1349.29s | valid loss 0.74 | bpc 1.07126
----------------------------------------------------------------------------------------------------
| epoch 140 step 95760 | 406 batches | lr 0.000433 | ms/batch 11941.76 | loss 0.77 | bpc 1.10719
| epoch 140 step 95770 | 416 batches | lr 0.000433 | ms/batch 5125.06 | loss 0.75 | bpc 1.08093
| epoch 140 step 95780 | 426 batches | lr 0.000433 | ms/batch 5124.10 | loss 0.72 | bpc 1.03321
| epoch 140 step 95790 | 436 batches | lr 0.000433 | ms/batch 5124.10 | loss 0.78 | bpc 1.12472
| epoch 140 step 95800 | 446 batches | lr 0.000433 | ms/batch 5127.90 | loss 0.76 | bpc 1.09225
| epoch 140 step 95810 | 456 batches | lr 0.000432 | ms/batch 5123.80 | loss 0.76 | bpc 1.09098
| epoch 140 step 95820 | 466 batches | lr 0.000432 | ms/batch 5110.90 | loss 0.75 | bpc 1.07711
| epoch 140 step 95830 | 476 batches | lr 0.000432 | ms/batch 5107.94 | loss 0.74 | bpc 1.06761
| epoch 140 step 95840 | 486 batches | lr 0.000432 | ms/batch 5115.61 | loss 0.76 | bpc 1.09537
| epoch 140 step 95850 | 496 batches | lr 0.000432 | ms/batch 5125.26 | loss 0.75 | bpc 1.08684
| epoch 140 step 95860 | 506 batches | lr 0.000432 | ms/batch 5130.22 | loss 0.74 | bpc 1.06121
| epoch 140 step 95870 | 516 batches | lr 0.000432 | ms/batch 5125.69 | loss 0.75 | bpc 1.08475
| epoch 140 step 95880 | 526 batches | lr 0.000432 | ms/batch 5127.44 | loss 0.76 | bpc 1.09670
| epoch 140 step 95890 | 536 batches | lr 0.000432 | ms/batch 5126.51 | loss 0.79 | bpc 1.13858
| epoch 140 step 95900 | 546 batches | lr 0.000432 | ms/batch 5126.05 | loss 0.74 | bpc 1.06818
| epoch 140 step 95910 | 556 batches | lr 0.000432 | ms/batch 5120.81 | loss 0.77 | bpc 1.10715
| epoch 140 step 95920 | 566 batches | lr 0.000432 | ms/batch 5126.88 | loss 0.78 | bpc 1.11989
| epoch 140 step 95930 | 576 batches | lr 0.000432 | ms/batch 5127.51 | loss 0.74 | bpc 1.06395
| epoch 140 step 95940 | 586 batches | lr 0.000432 | ms/batch 5124.65 | loss 0.74 | bpc 1.07180
| epoch 140 step 95950 | 596 batches | lr 0.000432 | ms/batch 5125.97 | loss 0.77 | bpc 1.11154
| epoch 140 step 95960 | 606 batches | lr 0.000432 | ms/batch 5125.59 | loss 0.74 | bpc 1.06412
| epoch 140 step 95970 | 616 batches | lr 0.000432 | ms/batch 5124.64 | loss 0.74 | bpc 1.06261
| epoch 140 step 95980 | 626 batches | lr 0.000432 | ms/batch 5119.59 | loss 0.76 | bpc 1.09283
| epoch 140 step 95990 | 636 batches | lr 0.000432 | ms/batch 5124.83 | loss 0.77 | bpc 1.10548
| epoch 140 step 96000 | 646 batches | lr 0.000432 | ms/batch 5125.79 | loss 0.76 | bpc 1.09583
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 384 at step 96000 | time: 1352.32s | valid loss 0.74 | bpc 1.07481
----------------------------------------------------------------------------------------------------
| epoch 140 step 96010 | 656 batches | lr 0.000432 | ms/batch 11945.96 | loss 0.79 | bpc 1.13688
| epoch 140 step 96020 | 666 batches | lr 0.000432 | ms/batch 5127.74 | loss 0.78 | bpc 1.12369
| epoch 140 step 96030 | 676 batches | lr 0.000432 | ms/batch 5120.58 | loss 0.70 | bpc 1.00978
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
| epoch 140 step 96040 | 686 batches | lr 0.000432 | ms/batch 5121.95 | loss 0.69 | bpc 1.00215
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 141 step 96050 | 10 batches | lr 0.000432 | ms/batch 5127.71 | loss 0.73 | bpc 1.05169
| epoch 141 step 96060 | 20 batches | lr 0.000432 | ms/batch 5125.65 | loss 0.76 | bpc 1.09595
| epoch 141 step 96070 | 30 batches | lr 0.000432 | ms/batch 5126.96 | loss 0.75 | bpc 1.08467
| epoch 141 step 96080 | 40 batches | lr 0.000432 | ms/batch 5126.53 | loss 0.71 | bpc 1.03144
| epoch 141 step 96090 | 50 batches | lr 0.000432 | ms/batch 5124.73 | loss 0.76 | bpc 1.09211
| epoch 141 step 96100 | 60 batches | lr 0.000432 | ms/batch 5127.60 | loss 0.77 | bpc 1.10513
| epoch 141 step 96110 | 70 batches | lr 0.000432 | ms/batch 5125.50 | loss 0.74 | bpc 1.07240
| epoch 141 step 96120 | 80 batches | lr 0.000432 | ms/batch 5126.85 | loss 0.71 | bpc 1.01875
| epoch 141 step 96130 | 90 batches | lr 0.000432 | ms/batch 5125.44 | loss 0.73 | bpc 1.05720
| epoch 141 step 96140 | 100 batches | lr 0.000432 | ms/batch 5126.24 | loss 0.71 | bpc 1.02014
| epoch 141 step 96150 | 110 batches | lr 0.000432 | ms/batch 5125.54 | loss 0.77 | bpc 1.10669
| epoch 141 step 96160 | 120 batches | lr 0.000432 | ms/batch 5123.47 | loss 0.74 | bpc 1.06048
| epoch 141 step 96170 | 130 batches | lr 0.000432 | ms/batch 5127.32 | loss 0.78 | bpc 1.12161
| epoch 141 step 96180 | 140 batches | lr 0.000432 | ms/batch 5111.52 | loss 0.79 | bpc 1.14322
| epoch 141 step 96190 | 150 batches | lr 0.000432 | ms/batch 5112.34 | loss 0.75 | bpc 1.08179
| epoch 141 step 96200 | 160 batches | lr 0.000432 | ms/batch 5112.10 | loss 0.79 | bpc 1.13382
| epoch 141 step 96210 | 170 batches | lr 0.000432 | ms/batch 5121.98 | loss 0.73 | bpc 1.05524
| epoch 141 step 96220 | 180 batches | lr 0.000432 | ms/batch 5124.73 | loss 0.68 | bpc 0.97462
| epoch 141 step 96230 | 190 batches | lr 0.000432 | ms/batch 5124.47 | loss 0.67 | bpc 0.97335
| epoch 141 step 96240 | 200 batches | lr 0.000432 | ms/batch 5125.51 | loss 0.63 | bpc 0.91298
| epoch 141 step 96250 | 210 batches | lr 0.000432 | ms/batch 5127.39 | loss 0.70 | bpc 1.01110
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 385 at step 96250 | time: 1349.16s | valid loss 0.74 | bpc 1.07436
----------------------------------------------------------------------------------------------------
| epoch 141 step 96260 | 220 batches | lr 0.000432 | ms/batch 11937.33 | loss 0.77 | bpc 1.10689
| epoch 141 step 96270 | 230 batches | lr 0.000432 | ms/batch 5122.92 | loss 0.74 | bpc 1.06636
| epoch 141 step 96280 | 240 batches | lr 0.000432 | ms/batch 5128.02 | loss 0.74 | bpc 1.06648
| epoch 141 step 96290 | 250 batches | lr 0.000432 | ms/batch 5125.68 | loss 0.78 | bpc 1.13112
| epoch 141 step 96300 | 260 batches | lr 0.000432 | ms/batch 5126.58 | loss 0.72 | bpc 1.04356
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
| epoch 141 step 96310 | 270 batches | lr 0.000432 | ms/batch 5123.95 | loss 0.71 | bpc 1.01865
| epoch 141 step 96320 | 280 batches | lr 0.000432 | ms/batch 5128.07 | loss 0.75 | bpc 1.08395
| epoch 141 step 96330 | 290 batches | lr 0.000432 | ms/batch 5121.16 | loss 0.78 | bpc 1.12022
| epoch 141 step 96340 | 300 batches | lr 0.000432 | ms/batch 5125.08 | loss 0.75 | bpc 1.08346
| epoch 141 step 96350 | 310 batches | lr 0.000432 | ms/batch 5128.04 | loss 0.77 | bpc 1.11117
| epoch 141 step 96360 | 320 batches | lr 0.000432 | ms/batch 5127.93 | loss 0.73 | bpc 1.05946
| epoch 141 step 96370 | 330 batches | lr 0.000432 | ms/batch 5128.05 | loss 0.76 | bpc 1.09847
| epoch 141 step 96380 | 340 batches | lr 0.000432 | ms/batch 5128.05 | loss 0.72 | bpc 1.04358
| epoch 141 step 96390 | 350 batches | lr 0.000432 | ms/batch 5128.57 | loss 0.74 | bpc 1.07142
| epoch 141 step 96400 | 360 batches | lr 0.000432 | ms/batch 5118.73 | loss 0.79 | bpc 1.13636
| epoch 141 step 96410 | 370 batches | lr 0.000432 | ms/batch 5126.20 | loss 0.73 | bpc 1.05501
| epoch 141 step 96420 | 380 batches | lr 0.000432 | ms/batch 5125.43 | loss 0.75 | bpc 1.07943
| epoch 141 step 96430 | 390 batches | lr 0.000432 | ms/batch 5125.88 | loss 0.75 | bpc 1.07834
| epoch 141 step 96440 | 400 batches | lr 0.000432 | ms/batch 5129.00 | loss 0.79 | bpc 1.13689
| epoch 141 step 96450 | 410 batches | lr 0.000432 | ms/batch 5126.49 | loss 0.76 | bpc 1.09376
| epoch 141 step 96460 | 420 batches | lr 0.000432 | ms/batch 5126.84 | loss 0.73 | bpc 1.05081
| epoch 141 step 96470 | 430 batches | lr 0.000432 | ms/batch 5124.50 | loss 0.74 | bpc 1.06219
| epoch 141 step 96480 | 440 batches | lr 0.000432 | ms/batch 5126.20 | loss 0.77 | bpc 1.11118
| epoch 141 step 96490 | 450 batches | lr 0.000432 | ms/batch 5127.78 | loss 0.76 | bpc 1.09711
| epoch 141 step 96500 | 460 batches | lr 0.000432 | ms/batch 5128.34 | loss 0.76 | bpc 1.09085
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 386 at step 96500 | time: 1349.77s | valid loss 0.74 | bpc 1.07314
----------------------------------------------------------------------------------------------------
| epoch 141 step 96510 | 470 batches | lr 0.000432 | ms/batch 11944.34 | loss 0.73 | bpc 1.05398
| epoch 141 step 96520 | 480 batches | lr 0.000432 | ms/batch 5126.12 | loss 0.77 | bpc 1.10448
| epoch 141 step 96530 | 490 batches | lr 0.000432 | ms/batch 5129.32 | loss 0.74 | bpc 1.07203
| epoch 141 step 96540 | 500 batches | lr 0.000432 | ms/batch 5129.17 | loss 0.74 | bpc 1.06889
| epoch 141 step 96550 | 510 batches | lr 0.000432 | ms/batch 5125.62 | loss 0.75 | bpc 1.07538
| epoch 141 step 96560 | 520 batches | lr 0.000431 | ms/batch 5112.83 | loss 0.76 | bpc 1.09061
| epoch 141 step 96570 | 530 batches | lr 0.000431 | ms/batch 5111.93 | loss 0.77 | bpc 1.11201
| epoch 141 step 96580 | 540 batches | lr 0.000431 | ms/batch 5110.38 | loss 0.78 | bpc 1.12564
| epoch 141 step 96590 | 550 batches | lr 0.000431 | ms/batch 5122.44 | loss 0.74 | bpc 1.06404
| epoch 141 step 96600 | 560 batches | lr 0.000431 | ms/batch 5131.34 | loss 0.77 | bpc 1.10997
| epoch 141 step 96610 | 570 batches | lr 0.000431 | ms/batch 5129.31 | loss 0.76 | bpc 1.09649
| epoch 141 step 96620 | 580 batches | lr 0.000431 | ms/batch 5128.21 | loss 0.74 | bpc 1.06629
| epoch 141 step 96630 | 590 batches | lr 0.000431 | ms/batch 5126.94 | loss 0.75 | bpc 1.08347
| epoch 141 step 96640 | 600 batches | lr 0.000431 | ms/batch 5125.07 | loss 0.75 | bpc 1.08675
| epoch 141 step 96650 | 610 batches | lr 0.000431 | ms/batch 5130.56 | loss 0.73 | bpc 1.05618
| epoch 141 step 96660 | 620 batches | lr 0.000431 | ms/batch 5128.29 | loss 0.75 | bpc 1.08803
| epoch 141 step 96670 | 630 batches | lr 0.000431 | ms/batch 5130.34 | loss 0.77 | bpc 1.10551
| epoch 141 step 96680 | 640 batches | lr 0.000431 | ms/batch 5127.48 | loss 0.75 | bpc 1.08583
| epoch 141 step 96690 | 650 batches | lr 0.000431 | ms/batch 5127.99 | loss 0.79 | bpc 1.14271
| epoch 141 step 96700 | 660 batches | lr 0.000431 | ms/batch 5123.00 | loss 0.77 | bpc 1.11342
| epoch 141 step 96710 | 670 batches | lr 0.000431 | ms/batch 5124.72 | loss 0.75 | bpc 1.08183
| epoch 141 step 96720 | 680 batches | lr 0.000431 | ms/batch 5127.77 | loss 0.70 | bpc 1.01069
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 142 step 96730 | 4 batches | lr 0.000431 | ms/batch 5130.71 | loss 0.70 | bpc 1.00781
| epoch 142 step 96740 | 14 batches | lr 0.000431 | ms/batch 5127.98 | loss 0.74 | bpc 1.06889
| epoch 142 step 96750 | 24 batches | lr 0.000431 | ms/batch 5126.93 | loss 0.77 | bpc 1.11191
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 387 at step 96750 | time: 1357.63s | valid loss 0.74 | bpc 1.07156
----------------------------------------------------------------------------------------------------
| epoch 142 step 96760 | 34 batches | lr 0.000431 | ms/batch 11942.42 | loss 0.73 | bpc 1.05025
| epoch 142 step 96770 | 44 batches | lr 0.000431 | ms/batch 5130.72 | loss 0.72 | bpc 1.04545
| epoch 142 step 96780 | 54 batches | lr 0.000431 | ms/batch 5132.15 | loss 0.77 | bpc 1.10910
| epoch 142 step 96790 | 64 batches | lr 0.000431 | ms/batch 5126.77 | loss 0.75 | bpc 1.08379
| epoch 142 step 96800 | 74 batches | lr 0.000431 | ms/batch 5128.39 | loss 0.74 | bpc 1.06948
| epoch 142 step 96810 | 84 batches | lr 0.000431 | ms/batch 5131.17 | loss 0.70 | bpc 1.01414
| epoch 142 step 96820 | 94 batches | lr 0.000431 | ms/batch 5132.47 | loss 0.72 | bpc 1.04163
| epoch 142 step 96830 | 104 batches | lr 0.000431 | ms/batch 5129.43 | loss 0.74 | bpc 1.07443
| epoch 142 step 96840 | 114 batches | lr 0.000431 | ms/batch 5129.54 | loss 0.75 | bpc 1.07660
| epoch 142 step 96850 | 124 batches | lr 0.000431 | ms/batch 5129.20 | loss 0.76 | bpc 1.09513
| epoch 142 step 96860 | 134 batches | lr 0.000431 | ms/batch 5131.29 | loss 0.76 | bpc 1.09823
| epoch 142 step 96870 | 144 batches | lr 0.000431 | ms/batch 5129.92 | loss 0.78 | bpc 1.13223
| epoch 142 step 96880 | 154 batches | lr 0.000431 | ms/batch 5124.83 | loss 0.76 | bpc 1.10074
| epoch 142 step 96890 | 164 batches | lr 0.000431 | ms/batch 5128.38 | loss 0.79 | bpc 1.13790
| epoch 142 step 96900 | 174 batches | lr 0.000431 | ms/batch 5129.30 | loss 0.68 | bpc 0.98089
| epoch 142 step 96910 | 184 batches | lr 0.000431 | ms/batch 5129.64 | loss 0.68 | bpc 0.98558
| epoch 142 step 96920 | 194 batches | lr 0.000431 | ms/batch 5128.70 | loss 0.69 | bpc 0.99378
| epoch 142 step 96930 | 204 batches | lr 0.000431 | ms/batch 5126.87 | loss 0.62 | bpc 0.89657
| epoch 142 step 96940 | 214 batches | lr 0.000431 | ms/batch 5130.78 | loss 0.73 | bpc 1.05288
| epoch 142 step 96950 | 224 batches | lr 0.000431 | ms/batch 5127.45 | loss 0.74 | bpc 1.07466
| epoch 142 step 96960 | 234 batches | lr 0.000431 | ms/batch 5125.65 | loss 0.74 | bpc 1.06656
| epoch 142 step 96970 | 244 batches | lr 0.000431 | ms/batch 5132.50 | loss 0.76 | bpc 1.09150
| epoch 142 step 96980 | 254 batches | lr 0.000431 | ms/batch 5130.09 | loss 0.78 | bpc 1.12462
| epoch 142 step 96990 | 264 batches | lr 0.000431 | ms/batch 5125.75 | loss 0.70 | bpc 1.01548
| epoch 142 step 97000 | 274 batches | lr 0.000431 | ms/batch 5112.42 | loss 0.72 | bpc 1.04424
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 388 at step 97000 | time: 1350.38s | valid loss 0.74 | bpc 1.06756
----------------------------------------------------------------------------------------------------
| epoch 142 step 97010 | 284 batches | lr 0.000431 | ms/batch 11923.99 | loss 0.77 | bpc 1.10858
| epoch 142 step 97020 | 294 batches | lr 0.000431 | ms/batch 5131.61 | loss 0.76 | bpc 1.09604
| epoch 142 step 97030 | 304 batches | lr 0.000431 | ms/batch 5130.64 | loss 0.76 | bpc 1.09196
| epoch 142 step 97040 | 314 batches | lr 0.000431 | ms/batch 5134.65 | loss 0.77 | bpc 1.10711
| epoch 142 step 97050 | 324 batches | lr 0.000431 | ms/batch 5130.50 | loss 0.75 | bpc 1.07969
| epoch 142 step 97060 | 334 batches | lr 0.000431 | ms/batch 5130.67 | loss 0.76 | bpc 1.09259
| epoch 142 step 97070 | 344 batches | lr 0.000431 | ms/batch 5119.63 | loss 0.71 | bpc 1.02176
| epoch 142 step 97080 | 354 batches | lr 0.000431 | ms/batch 5128.96 | loss 0.78 | bpc 1.12317
| epoch 142 step 97090 | 364 batches | lr 0.000431 | ms/batch 5128.94 | loss 0.75 | bpc 1.07901
| epoch 142 step 97100 | 374 batches | lr 0.000431 | ms/batch 5131.16 | loss 0.72 | bpc 1.04381
| epoch 142 step 97110 | 384 batches | lr 0.000431 | ms/batch 5132.30 | loss 0.76 | bpc 1.10097
| epoch 142 step 97120 | 394 batches | lr 0.000431 | ms/batch 5132.85 | loss 0.75 | bpc 1.08669
| epoch 142 step 97130 | 404 batches | lr 0.000431 | ms/batch 5129.23 | loss 0.77 | bpc 1.11108
| epoch 142 step 97140 | 414 batches | lr 0.000431 | ms/batch 5125.17 | loss 0.76 | bpc 1.09074
| epoch 142 step 97150 | 424 batches | lr 0.000431 | ms/batch 5130.87 | loss 0.72 | bpc 1.04130
| epoch 142 step 97160 | 434 batches | lr 0.000431 | ms/batch 5129.74 | loss 0.76 | bpc 1.09622
| epoch 142 step 97170 | 444 batches | lr 0.000431 | ms/batch 5131.36 | loss 0.76 | bpc 1.09815
| epoch 142 step 97180 | 454 batches | lr 0.000431 | ms/batch 5132.76 | loss 0.77 | bpc 1.10811
| epoch 142 step 97190 | 464 batches | lr 0.000431 | ms/batch 5130.68 | loss 0.75 | bpc 1.07853
| epoch 142 step 97200 | 474 batches | lr 0.000431 | ms/batch 5130.55 | loss 0.74 | bpc 1.06056
| epoch 142 step 97210 | 484 batches | lr 0.000431 | ms/batch 5127.79 | loss 0.75 | bpc 1.08760
| epoch 142 step 97220 | 494 batches | lr 0.000431 | ms/batch 5132.59 | loss 0.76 | bpc 1.09304
| epoch 142 step 97230 | 504 batches | lr 0.000431 | ms/batch 5133.11 | loss 0.73 | bpc 1.05542
| epoch 142 step 97240 | 514 batches | lr 0.000431 | ms/batch 5133.62 | loss 0.74 | bpc 1.06759
| epoch 142 step 97250 | 524 batches | lr 0.000431 | ms/batch 5131.22 | loss 0.75 | bpc 1.08336
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 389 at step 97250 | time: 1355.96s | valid loss 0.74 | bpc 1.06993
----------------------------------------------------------------------------------------------------
| epoch 142 step 97260 | 534 batches | lr 0.000431 | ms/batch 11944.49 | loss 0.80 | bpc 1.14809
| epoch 142 step 97270 | 544 batches | lr 0.000431 | ms/batch 5131.13 | loss 0.74 | bpc 1.07420
| epoch 142 step 97280 | 554 batches | lr 0.000431 | ms/batch 5129.75 | loss 0.76 | bpc 1.09996
| epoch 142 step 97290 | 564 batches | lr 0.00043 | ms/batch 5131.24 | loss 0.77 | bpc 1.11402
| epoch 142 step 97300 | 574 batches | lr 0.00043 | ms/batch 5131.62 | loss 0.74 | bpc 1.07032
| epoch 142 step 97310 | 584 batches | lr 0.00043 | ms/batch 5129.91 | loss 0.74 | bpc 1.06349
| epoch 142 step 97320 | 594 batches | lr 0.00043 | ms/batch 5131.71 | loss 0.77 | bpc 1.10582
| epoch 142 step 97330 | 604 batches | lr 0.00043 | ms/batch 5129.56 | loss 0.75 | bpc 1.08081
| epoch 142 step 97340 | 614 batches | lr 0.00043 | ms/batch 5129.16 | loss 0.72 | bpc 1.04492
| epoch 142 step 97350 | 624 batches | lr 0.00043 | ms/batch 5130.03 | loss 0.76 | bpc 1.09383
| epoch 142 step 97360 | 634 batches | lr 0.00043 | ms/batch 5129.87 | loss 0.77 | bpc 1.11037
| epoch 142 step 97370 | 644 batches | lr 0.00043 | ms/batch 5135.54 | loss 0.75 | bpc 1.08050
| epoch 142 step 97380 | 654 batches | lr 0.00043 | ms/batch 5132.89 | loss 0.80 | bpc 1.15630
| epoch 142 step 97390 | 664 batches | lr 0.00043 | ms/batch 5129.43 | loss 0.77 | bpc 1.11663
| epoch 142 step 97400 | 674 batches | lr 0.00043 | ms/batch 5130.24 | loss 0.71 | bpc 1.02130
| epoch 142 step 97410 | 684 batches | lr 0.00043 | ms/batch 5130.71 | loss 0.70 | bpc 1.01013
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 143 step 97420 | 8 batches | lr 0.00043 | ms/batch 5133.54 | loss 0.72 | bpc 1.03918
| epoch 143 step 97430 | 18 batches | lr 0.00043 | ms/batch 5128.15 | loss 0.75 | bpc 1.08081
| epoch 143 step 97440 | 28 batches | lr 0.00043 | ms/batch 5128.76 | loss 0.76 | bpc 1.10207
| epoch 143 step 97450 | 38 batches | lr 0.00043 | ms/batch 5131.83 | loss 0.72 | bpc 1.03213
| epoch 143 step 97460 | 48 batches | lr 0.00043 | ms/batch 5108.71 | loss 0.74 | bpc 1.06229
| epoch 143 step 97470 | 58 batches | lr 0.00043 | ms/batch 5110.69 | loss 0.77 | bpc 1.10747
| epoch 143 step 97480 | 68 batches | lr 0.00043 | ms/batch 5111.84 | loss 0.76 | bpc 1.09198
| epoch 143 step 97490 | 78 batches | lr 0.00043 | ms/batch 5111.86 | loss 0.71 | bpc 1.02388
| epoch 143 step 97500 | 88 batches | lr 0.00043 | ms/batch 5132.47 | loss 0.72 | bpc 1.03949
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 390 at step 97500 | time: 1358.41s | valid loss 0.74 | bpc 1.07460
----------------------------------------------------------------------------------------------------
| epoch 143 step 97510 | 98 batches | lr 0.00043 | ms/batch 11946.92 | loss 0.71 | bpc 1.02130
| epoch 143 step 97520 | 108 batches | lr 0.00043 | ms/batch 5131.05 | loss 0.76 | bpc 1.10364
| epoch 143 step 97530 | 118 batches | lr 0.00043 | ms/batch 5130.62 | loss 0.74 | bpc 1.06718
| epoch 143 step 97540 | 128 batches | lr 0.00043 | ms/batch 5127.51 | loss 0.77 | bpc 1.11463
| epoch 143 step 97550 | 138 batches | lr 0.00043 | ms/batch 5131.10 | loss 0.78 | bpc 1.12757
| epoch 143 step 97560 | 148 batches | lr 0.00043 | ms/batch 5131.22 | loss 0.75 | bpc 1.07567
| epoch 143 step 97570 | 158 batches | lr 0.00043 | ms/batch 5127.67 | loss 0.79 | bpc 1.13340
| epoch 143 step 97580 | 168 batches | lr 0.00043 | ms/batch 5130.44 | loss 0.75 | bpc 1.08599
| epoch 143 step 97590 | 178 batches | lr 0.00043 | ms/batch 5131.15 | loss 0.67 | bpc 0.96969
| epoch 143 step 97600 | 188 batches | lr 0.00043 | ms/batch 5134.45 | loss 0.68 | bpc 0.97862
| epoch 143 step 97610 | 198 batches | lr 0.00043 | ms/batch 5129.46 | loss 0.65 | bpc 0.93874
| epoch 143 step 97620 | 208 batches | lr 0.00043 | ms/batch 5130.38 | loss 0.67 | bpc 0.96848
| epoch 143 step 97630 | 218 batches | lr 0.00043 | ms/batch 5130.21 | loss 0.76 | bpc 1.09128
| epoch 143 step 97640 | 228 batches | lr 0.00043 | ms/batch 5129.01 | loss 0.74 | bpc 1.06886
| epoch 143 step 97650 | 238 batches | lr 0.00043 | ms/batch 5130.57 | loss 0.73 | bpc 1.06032
| epoch 143 step 97660 | 248 batches | lr 0.00043 | ms/batch 5132.71 | loss 0.78 | bpc 1.12229
| epoch 143 step 97670 | 258 batches | lr 0.00043 | ms/batch 5130.19 | loss 0.74 | bpc 1.07409
| epoch 143 step 97680 | 268 batches | lr 0.00043 | ms/batch 5129.78 | loss 0.70 | bpc 1.00512
| epoch 143 step 97690 | 278 batches | lr 0.00043 | ms/batch 5124.13 | loss 0.75 | bpc 1.07490
| epoch 143 step 97700 | 288 batches | lr 0.00043 | ms/batch 5127.54 | loss 0.78 | bpc 1.12796
| epoch 143 step 97710 | 298 batches | lr 0.00043 | ms/batch 5128.96 | loss 0.75 | bpc 1.07857
| epoch 143 step 97720 | 308 batches | lr 0.00043 | ms/batch 5135.69 | loss 0.76 | bpc 1.09333
| epoch 143 step 97730 | 318 batches | lr 0.00043 | ms/batch 5131.11 | loss 0.76 | bpc 1.09119
| epoch 143 step 97740 | 328 batches | lr 0.00043 | ms/batch 5131.39 | loss 0.76 | bpc 1.09129
| epoch 143 step 97750 | 338 batches | lr 0.00043 | ms/batch 5127.77 | loss 0.73 | bpc 1.05140
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 391 at step 97750 | time: 1356.46s | valid loss 0.74 | bpc 1.07408
----------------------------------------------------------------------------------------------------
| epoch 143 step 97760 | 348 batches | lr 0.00043 | ms/batch 11956.39 | loss 0.72 | bpc 1.03965
| epoch 143 step 97770 | 358 batches | lr 0.00043 | ms/batch 5128.54 | loss 0.79 | bpc 1.14297
| epoch 143 step 97780 | 368 batches | lr 0.00043 | ms/batch 5134.27 | loss 0.74 | bpc 1.06666
| epoch 143 step 97790 | 378 batches | lr 0.00043 | ms/batch 5132.63 | loss 0.74 | bpc 1.06350
| epoch 143 step 97800 | 388 batches | lr 0.00043 | ms/batch 5129.73 | loss 0.77 | bpc 1.10927
| epoch 143 step 97810 | 398 batches | lr 0.00043 | ms/batch 5124.85 | loss 0.77 | bpc 1.10378
| epoch 143 step 97820 | 408 batches | lr 0.00043 | ms/batch 5126.05 | loss 0.77 | bpc 1.10608
| epoch 143 step 97830 | 418 batches | lr 0.00043 | ms/batch 5132.46 | loss 0.73 | bpc 1.05912
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
| epoch 143 step 97840 | 428 batches | lr 0.00043 | ms/batch 5127.82 | loss 0.73 | bpc 1.04711
| epoch 143 step 97850 | 438 batches | lr 0.00043 | ms/batch 5131.09 | loss 0.77 | bpc 1.10896
| epoch 143 step 97860 | 448 batches | lr 0.00043 | ms/batch 5131.35 | loss 0.76 | bpc 1.10050
| epoch 143 step 97870 | 458 batches | lr 0.00043 | ms/batch 5132.41 | loss 0.76 | bpc 1.09651
| epoch 143 step 97880 | 468 batches | lr 0.00043 | ms/batch 5130.74 | loss 0.72 | bpc 1.04289
| epoch 143 step 97890 | 478 batches | lr 0.00043 | ms/batch 5131.50 | loss 0.76 | bpc 1.09156
| epoch 143 step 97900 | 488 batches | lr 0.00043 | ms/batch 5132.46 | loss 0.75 | bpc 1.08192
| epoch 143 step 97910 | 498 batches | lr 0.00043 | ms/batch 5128.46 | loss 0.75 | bpc 1.07493
| epoch 143 step 97920 | 508 batches | lr 0.00043 | ms/batch 5131.00 | loss 0.75 | bpc 1.08292
| epoch 143 step 97930 | 518 batches | lr 0.00043 | ms/batch 5131.07 | loss 0.74 | bpc 1.06791
| epoch 143 step 97940 | 528 batches | lr 0.00043 | ms/batch 5131.77 | loss 0.77 | bpc 1.10556
| epoch 143 step 97950 | 538 batches | lr 0.00043 | ms/batch 5131.70 | loss 0.79 | bpc 1.13427
| epoch 143 step 97960 | 548 batches | lr 0.00043 | ms/batch 5113.15 | loss 0.73 | bpc 1.05715
| epoch 143 step 97970 | 558 batches | lr 0.00043 | ms/batch 5109.59 | loss 0.76 | bpc 1.10056
| epoch 143 step 97980 | 568 batches | lr 0.00043 | ms/batch 5109.00 | loss 0.77 | bpc 1.11576
| epoch 143 step 97990 | 578 batches | lr 0.00043 | ms/batch 5110.89 | loss 0.73 | bpc 1.05526
| epoch 143 step 98000 | 588 batches | lr 0.00043 | ms/batch 5120.63 | loss 0.75 | bpc 1.08391
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 392 at step 98000 | time: 1349.97s | valid loss 0.74 | bpc 1.07386
----------------------------------------------------------------------------------------------------
| epoch 143 step 98010 | 598 batches | lr 0.00043 | ms/batch 11947.65 | loss 0.76 | bpc 1.09492
| epoch 143 step 98020 | 608 batches | lr 0.00043 | ms/batch 5128.24 | loss 0.74 | bpc 1.06414
| epoch 143 step 98030 | 618 batches | lr 0.000429 | ms/batch 5125.80 | loss 0.75 | bpc 1.07529
| epoch 143 step 98040 | 628 batches | lr 0.000429 | ms/batch 5132.44 | loss 0.76 | bpc 1.10253
| epoch 143 step 98050 | 638 batches | lr 0.000429 | ms/batch 5132.28 | loss 0.74 | bpc 1.06609
| epoch 143 step 98060 | 648 batches | lr 0.000429 | ms/batch 5132.64 | loss 0.79 | bpc 1.14153
| epoch 143 step 98070 | 658 batches | lr 0.000429 | ms/batch 5131.94 | loss 0.77 | bpc 1.11401
| epoch 143 step 98080 | 668 batches | lr 0.000429 | ms/batch 5129.08 | loss 0.77 | bpc 1.10964
| epoch 143 step 98090 | 678 batches | lr 0.000429 | ms/batch 5133.00 | loss 0.70 | bpc 1.00344
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 144 step 98100 | 2 batches | lr 0.000429 | ms/batch 5130.95 | loss 0.69 | bpc 1.00181
| epoch 144 step 98110 | 12 batches | lr 0.000429 | ms/batch 5133.38 | loss 0.74 | bpc 1.06797
| epoch 144 step 98120 | 22 batches | lr 0.000429 | ms/batch 5131.27 | loss 0.77 | bpc 1.10700
| epoch 144 step 98130 | 32 batches | lr 0.000429 | ms/batch 5133.07 | loss 0.73 | bpc 1.05607
| epoch 144 step 98140 | 42 batches | lr 0.000429 | ms/batch 5134.27 | loss 0.72 | bpc 1.04003
| epoch 144 step 98150 | 52 batches | lr 0.000429 | ms/batch 5132.89 | loss 0.76 | bpc 1.08965
| epoch 144 step 98160 | 62 batches | lr 0.000429 | ms/batch 5134.15 | loss 0.75 | bpc 1.08540
| epoch 144 step 98170 | 72 batches | lr 0.000429 | ms/batch 5133.92 | loss 0.75 | bpc 1.08664
| epoch 144 step 98180 | 82 batches | lr 0.000429 | ms/batch 5133.00 | loss 0.70 | bpc 1.01277
| epoch 144 step 98190 | 92 batches | lr 0.000429 | ms/batch 5132.57 | loss 0.72 | bpc 1.03866
| epoch 144 step 98200 | 102 batches | lr 0.000429 | ms/batch 5127.51 | loss 0.73 | bpc 1.05558
| epoch 144 step 98210 | 112 batches | lr 0.000429 | ms/batch 5129.28 | loss 0.75 | bpc 1.07966
| epoch 144 step 98220 | 122 batches | lr 0.000429 | ms/batch 5129.34 | loss 0.75 | bpc 1.07959
| epoch 144 step 98230 | 132 batches | lr 0.000429 | ms/batch 5130.99 | loss 0.76 | bpc 1.09725
| epoch 144 step 98240 | 142 batches | lr 0.000429 | ms/batch 5131.55 | loss 0.80 | bpc 1.15291
| epoch 144 step 98250 | 152 batches | lr 0.000429 | ms/batch 5131.13 | loss 0.76 | bpc 1.09529
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 393 at step 98250 | time: 1352.76s | valid loss 0.74 | bpc 1.06719
----------------------------------------------------------------------------------------------------
| epoch 144 step 98260 | 162 batches | lr 0.000429 | ms/batch 12446.42 | loss 0.78 | bpc 1.12748
| epoch 144 step 98270 | 172 batches | lr 0.000429 | ms/batch 5131.24 | loss 0.70 | bpc 1.01142
| epoch 144 step 98280 | 182 batches | lr 0.000429 | ms/batch 5133.21 | loss 0.68 | bpc 0.97597
| epoch 144 step 98290 | 192 batches | lr 0.000429 | ms/batch 5131.91 | loss 0.68 | bpc 0.98759
| epoch 144 step 98300 | 202 batches | lr 0.000429 | ms/batch 5132.77 | loss 0.61 | bpc 0.87885
| epoch 144 step 98310 | 212 batches | lr 0.000429 | ms/batch 5133.70 | loss 0.73 | bpc 1.04924
| epoch 144 step 98320 | 222 batches | lr 0.000429 | ms/batch 5130.37 | loss 0.76 | bpc 1.09534
| epoch 144 step 98330 | 232 batches | lr 0.000429 | ms/batch 5132.10 | loss 0.74 | bpc 1.06598
| epoch 144 step 98340 | 242 batches | lr 0.000429 | ms/batch 5132.31 | loss 0.74 | bpc 1.06929
| epoch 144 step 98350 | 252 batches | lr 0.000429 | ms/batch 5134.31 | loss 0.79 | bpc 1.13957
| epoch 144 step 98360 | 262 batches | lr 0.000429 | ms/batch 5124.13 | loss 0.71 | bpc 1.02977
| epoch 144 step 98370 | 272 batches | lr 0.000429 | ms/batch 5127.88 | loss 0.70 | bpc 1.01560
| epoch 144 step 98380 | 282 batches | lr 0.000429 | ms/batch 5133.81 | loss 0.76 | bpc 1.10152
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
| epoch 144 step 98390 | 292 batches | lr 0.000429 | ms/batch 5131.14 | loss 0.76 | bpc 1.10062
| epoch 144 step 98400 | 302 batches | lr 0.000429 | ms/batch 5130.83 | loss 0.75 | bpc 1.08911
| epoch 144 step 98410 | 312 batches | lr 0.000429 | ms/batch 5132.08 | loss 0.78 | bpc 1.11874
| epoch 144 step 98420 | 322 batches | lr 0.000429 | ms/batch 5134.07 | loss 0.73 | bpc 1.05887
| epoch 144 step 98430 | 332 batches | lr 0.000429 | ms/batch 5132.01 | loss 0.77 | bpc 1.10598
| epoch 144 step 98440 | 342 batches | lr 0.000429 | ms/batch 5134.58 | loss 0.70 | bpc 1.00552
| epoch 144 step 98450 | 352 batches | lr 0.000429 | ms/batch 5132.27 | loss 0.76 | bpc 1.10202
| epoch 144 step 98460 | 362 batches | lr 0.000429 | ms/batch 5132.35 | loss 0.77 | bpc 1.10740
| epoch 144 step 98470 | 372 batches | lr 0.000429 | ms/batch 5131.68 | loss 0.72 | bpc 1.03829
| epoch 144 step 98480 | 382 batches | lr 0.000429 | ms/batch 5128.85 | loss 0.76 | bpc 1.09289
| epoch 144 step 98490 | 392 batches | lr 0.000429 | ms/batch 5129.30 | loss 0.75 | bpc 1.08441
| epoch 144 step 98500 | 402 batches | lr 0.000429 | ms/batch 5110.59 | loss 0.79 | bpc 1.13765
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 394 at step 98500 | time: 1350.93s | valid loss 0.74 | bpc 1.07165
----------------------------------------------------------------------------------------------------
| epoch 144 step 98510 | 412 batches | lr 0.000429 | ms/batch 11923.55 | loss 0.75 | bpc 1.07535
| epoch 144 step 98520 | 422 batches | lr 0.000429 | ms/batch 5111.37 | loss 0.73 | bpc 1.05598
| epoch 144 step 98530 | 432 batches | lr 0.000429 | ms/batch 5121.47 | loss 0.74 | bpc 1.06764
| epoch 144 step 98540 | 442 batches | lr 0.000429 | ms/batch 5125.38 | loss 0.77 | bpc 1.11638
| epoch 144 step 98550 | 452 batches | lr 0.000429 | ms/batch 5134.20 | loss 0.76 | bpc 1.09369
| epoch 144 step 98560 | 462 batches | lr 0.000429 | ms/batch 5135.00 | loss 0.75 | bpc 1.08767
| epoch 144 step 98570 | 472 batches | lr 0.000429 | ms/batch 5133.05 | loss 0.74 | bpc 1.06612
| epoch 144 step 98580 | 482 batches | lr 0.000429 | ms/batch 5133.61 | loss 0.76 | bpc 1.09170
| epoch 144 step 98590 | 492 batches | lr 0.000429 | ms/batch 5132.08 | loss 0.74 | bpc 1.06924
| epoch 144 step 98600 | 502 batches | lr 0.000429 | ms/batch 5133.55 | loss 0.74 | bpc 1.06381
| epoch 144 step 98610 | 512 batches | lr 0.000429 | ms/batch 5132.62 | loss 0.74 | bpc 1.06926
| epoch 144 step 98620 | 522 batches | lr 0.000429 | ms/batch 5133.73 | loss 0.75 | bpc 1.08356
| epoch 144 step 98630 | 532 batches | lr 0.000429 | ms/batch 5133.86 | loss 0.79 | bpc 1.13988
| epoch 144 step 98640 | 542 batches | lr 0.000429 | ms/batch 5132.66 | loss 0.75 | bpc 1.07837
| epoch 144 step 98650 | 552 batches | lr 0.000429 | ms/batch 5134.59 | loss 0.76 | bpc 1.08933
| epoch 144 step 98660 | 562 batches | lr 0.000429 | ms/batch 5134.42 | loss 0.77 | bpc 1.11743
| epoch 144 step 98670 | 572 batches | lr 0.000429 | ms/batch 5131.12 | loss 0.75 | bpc 1.07767
| epoch 144 step 98680 | 582 batches | lr 0.000429 | ms/batch 5132.01 | loss 0.73 | bpc 1.05120
| epoch 144 step 98690 | 592 batches | lr 0.000429 | ms/batch 5131.73 | loss 0.77 | bpc 1.11512
| epoch 144 step 98700 | 602 batches | lr 0.000429 | ms/batch 5134.20 | loss 0.74 | bpc 1.06657
| epoch 144 step 98710 | 612 batches | lr 0.000429 | ms/batch 5128.71 | loss 0.72 | bpc 1.04075
| epoch 144 step 98720 | 622 batches | lr 0.000429 | ms/batch 5130.78 | loss 0.77 | bpc 1.10464
| epoch 144 step 98730 | 632 batches | lr 0.000429 | ms/batch 5132.08 | loss 0.76 | bpc 1.09809
| epoch 144 step 98740 | 642 batches | lr 0.000429 | ms/batch 5134.77 | loss 0.75 | bpc 1.08726
| epoch 144 step 98750 | 652 batches | lr 0.000429 | ms/batch 5132.02 | loss 0.79 | bpc 1.13441
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 395 at step 98750 | time: 1350.82s | valid loss 0.74 | bpc 1.07150
----------------------------------------------------------------------------------------------------
| epoch 144 step 98760 | 662 batches | lr 0.000428 | ms/batch 11947.26 | loss 0.78 | bpc 1.12179
| epoch 144 step 98770 | 672 batches | lr 0.000428 | ms/batch 5133.88 | loss 0.72 | bpc 1.04098
| epoch 144 step 98780 | 682 batches | lr 0.000428 | ms/batch 5131.29 | loss 0.71 | bpc 1.01959
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 145 step 98790 | 6 batches | lr 0.000428 | ms/batch 5134.40 | loss 0.71 | bpc 1.02109
| epoch 145 step 98800 | 16 batches | lr 0.000428 | ms/batch 5133.59 | loss 0.74 | bpc 1.07192
| epoch 145 step 98810 | 26 batches | lr 0.000428 | ms/batch 5133.88 | loss 0.78 | bpc 1.12025
| epoch 145 step 98820 | 36 batches | lr 0.000428 | ms/batch 5135.35 | loss 0.71 | bpc 1.01810
| epoch 145 step 98830 | 46 batches | lr 0.000428 | ms/batch 5131.09 | loss 0.74 | bpc 1.06045
| epoch 145 step 98840 | 56 batches | lr 0.000428 | ms/batch 5133.38 | loss 0.77 | bpc 1.11336
| epoch 145 step 98850 | 66 batches | lr 0.000428 | ms/batch 5133.21 | loss 0.76 | bpc 1.09361
| epoch 145 step 98860 | 76 batches | lr 0.000428 | ms/batch 5132.11 | loss 0.72 | bpc 1.03301
| epoch 145 step 98870 | 86 batches | lr 0.000428 | ms/batch 5130.95 | loss 0.72 | bpc 1.03271
| epoch 145 step 98880 | 96 batches | lr 0.000428 | ms/batch 5132.69 | loss 0.72 | bpc 1.03224
| epoch 145 step 98890 | 106 batches | lr 0.000428 | ms/batch 5133.25 | loss 0.75 | bpc 1.08383
| epoch 145 step 98900 | 116 batches | lr 0.000428 | ms/batch 5131.85 | loss 0.74 | bpc 1.07077
| epoch 145 step 98910 | 126 batches | lr 0.000428 | ms/batch 5130.76 | loss 0.76 | bpc 1.09058
| epoch 145 step 98920 | 136 batches | lr 0.000428 | ms/batch 5130.99 | loss 0.77 | bpc 1.11378
| epoch 145 step 98930 | 146 batches | lr 0.000428 | ms/batch 5128.79 | loss 0.77 | bpc 1.10790
| epoch 145 step 98940 | 156 batches | lr 0.000428 | ms/batch 5132.01 | loss 0.77 | bpc 1.10896
| epoch 145 step 98950 | 166 batches | lr 0.000428 | ms/batch 5132.48 | loss 0.78 | bpc 1.11872
| epoch 145 step 98960 | 176 batches | lr 0.000428 | ms/batch 5132.16 | loss 0.68 | bpc 0.97701
| epoch 145 step 98970 | 186 batches | lr 0.000428 | ms/batch 5134.49 | loss 0.68 | bpc 0.98286
| epoch 145 step 98980 | 196 batches | lr 0.000428 | ms/batch 5132.46 | loss 0.66 | bpc 0.94919
| epoch 145 step 98990 | 206 batches | lr 0.000428 | ms/batch 5132.84 | loss 0.65 | bpc 0.93711
| epoch 145 step 99000 | 216 batches | lr 0.000428 | ms/batch 5131.20 | loss 0.75 | bpc 1.08277
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 396 at step 99000 | time: 1359.99s | valid loss 0.74 | bpc 1.06702
----------------------------------------------------------------------------------------------------
| epoch 145 step 99010 | 226 batches | lr 0.000428 | ms/batch 12480.45 | loss 0.74 | bpc 1.06392
| epoch 145 step 99020 | 236 batches | lr 0.000428 | ms/batch 5132.56 | loss 0.73 | bpc 1.05754
| epoch 145 step 99030 | 246 batches | lr 0.000428 | ms/batch 5127.79 | loss 0.78 | bpc 1.11931
| epoch 145 step 99040 | 256 batches | lr 0.000428 | ms/batch 5135.40 | loss 0.75 | bpc 1.07578
| epoch 145 step 99050 | 266 batches | lr 0.000428 | ms/batch 5132.93 | loss 0.70 | bpc 1.00980
| epoch 145 step 99060 | 276 batches | lr 0.000428 | ms/batch 5134.80 | loss 0.74 | bpc 1.06543
| epoch 145 step 99070 | 286 batches | lr 0.000428 | ms/batch 5134.62 | loss 0.78 | bpc 1.13120
| epoch 145 step 99080 | 296 batches | lr 0.000428 | ms/batch 5114.43 | loss 0.74 | bpc 1.07015
| epoch 145 step 99090 | 306 batches | lr 0.000428 | ms/batch 5111.86 | loss 0.75 | bpc 1.08047
| epoch 145 step 99100 | 316 batches | lr 0.000428 | ms/batch 5115.50 | loss 0.77 | bpc 1.11436
| epoch 145 step 99110 | 326 batches | lr 0.000428 | ms/batch 5111.48 | loss 0.75 | bpc 1.07498
| epoch 145 step 99120 | 336 batches | lr 0.000428 | ms/batch 5110.55 | loss 0.75 | bpc 1.07500
| epoch 145 step 99130 | 346 batches | lr 0.000428 | ms/batch 5125.55 | loss 0.72 | bpc 1.03481
| epoch 145 step 99140 | 356 batches | lr 0.000428 | ms/batch 5136.73 | loss 0.78 | bpc 1.12544
| epoch 145 step 99150 | 366 batches | lr 0.000428 | ms/batch 5133.65 | loss 0.74 | bpc 1.06368
| epoch 145 step 99160 | 376 batches | lr 0.000428 | ms/batch 5131.94 | loss 0.73 | bpc 1.05290
| epoch 145 step 99170 | 386 batches | lr 0.000428 | ms/batch 5134.06 | loss 0.77 | bpc 1.10514
| epoch 145 step 99180 | 396 batches | lr 0.000428 | ms/batch 5132.04 | loss 0.76 | bpc 1.09312
| epoch 145 step 99190 | 406 batches | lr 0.000428 | ms/batch 5133.55 | loss 0.76 | bpc 1.10321
| epoch 145 step 99200 | 416 batches | lr 0.000428 | ms/batch 5131.40 | loss 0.75 | bpc 1.07884
| epoch 145 step 99210 | 426 batches | lr 0.000428 | ms/batch 5129.94 | loss 0.71 | bpc 1.02966
| epoch 145 step 99220 | 436 batches | lr 0.000428 | ms/batch 5130.14 | loss 0.78 | bpc 1.12073
| epoch 145 step 99230 | 446 batches | lr 0.000428 | ms/batch 5135.62 | loss 0.76 | bpc 1.09138
| epoch 145 step 99240 | 456 batches | lr 0.000428 | ms/batch 5130.62 | loss 0.75 | bpc 1.08856
| epoch 145 step 99250 | 466 batches | lr 0.000428 | ms/batch 5133.87 | loss 0.74 | bpc 1.07163
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 397 at step 99250 | time: 1365.73s | valid loss 0.75 | bpc 1.07557
----------------------------------------------------------------------------------------------------
| epoch 145 step 99260 | 476 batches | lr 0.000428 | ms/batch 11947.37 | loss 0.74 | bpc 1.06465
| epoch 145 step 99270 | 486 batches | lr 0.000428 | ms/batch 5134.08 | loss 0.76 | bpc 1.09069
| epoch 145 step 99280 | 496 batches | lr 0.000428 | ms/batch 5136.64 | loss 0.75 | bpc 1.08220
| epoch 145 step 99290 | 506 batches | lr 0.000428 | ms/batch 5132.63 | loss 0.73 | bpc 1.05702
| epoch 145 step 99300 | 516 batches | lr 0.000428 | ms/batch 5131.95 | loss 0.75 | bpc 1.07947
| epoch 145 step 99310 | 526 batches | lr 0.000428 | ms/batch 5132.71 | loss 0.76 | bpc 1.09374
| epoch 145 step 99320 | 536 batches | lr 0.000428 | ms/batch 5132.41 | loss 0.79 | bpc 1.13437
| epoch 145 step 99330 | 546 batches | lr 0.000428 | ms/batch 5132.46 | loss 0.74 | bpc 1.06502
| epoch 145 step 99340 | 556 batches | lr 0.000428 | ms/batch 5131.54 | loss 0.76 | bpc 1.10366
| epoch 145 step 99350 | 566 batches | lr 0.000428 | ms/batch 5134.02 | loss 0.78 | bpc 1.11852
| epoch 145 step 99360 | 576 batches | lr 0.000428 | ms/batch 5134.50 | loss 0.73 | bpc 1.05903
| epoch 145 step 99370 | 586 batches | lr 0.000428 | ms/batch 5133.40 | loss 0.74 | bpc 1.06825
| epoch 145 step 99380 | 596 batches | lr 0.000428 | ms/batch 5131.67 | loss 0.77 | bpc 1.10604
| epoch 145 step 99390 | 606 batches | lr 0.000428 | ms/batch 5133.88 | loss 0.74 | bpc 1.06177
| epoch 145 step 99400 | 616 batches | lr 0.000428 | ms/batch 5134.74 | loss 0.73 | bpc 1.05827
| epoch 145 step 99410 | 626 batches | lr 0.000428 | ms/batch 5131.72 | loss 0.75 | bpc 1.08903
| epoch 145 step 99420 | 636 batches | lr 0.000428 | ms/batch 5132.01 | loss 0.77 | bpc 1.10445
| epoch 145 step 99430 | 646 batches | lr 0.000428 | ms/batch 5134.56 | loss 0.76 | bpc 1.09423
| epoch 145 step 99440 | 656 batches | lr 0.000428 | ms/batch 5134.64 | loss 0.79 | bpc 1.13376
| epoch 145 step 99450 | 666 batches | lr 0.000428 | ms/batch 5132.58 | loss 0.78 | bpc 1.12000
| epoch 145 step 99460 | 676 batches | lr 0.000428 | ms/batch 5133.97 | loss 0.70 | bpc 1.00711
| epoch 145 step 99470 | 686 batches | lr 0.000428 | ms/batch 5134.62 | loss 0.69 | bpc 0.99857
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 146 step 99480 | 10 batches | lr 0.000427 | ms/batch 5133.39 | loss 0.73 | bpc 1.04966
| epoch 146 step 99490 | 20 batches | lr 0.000427 | ms/batch 5131.46 | loss 0.76 | bpc 1.09433
| epoch 146 step 99500 | 30 batches | lr 0.000427 | ms/batch 5134.10 | loss 0.75 | bpc 1.08155
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 398 at step 99500 | time: 1355.92s | valid loss 0.74 | bpc 1.07221
----------------------------------------------------------------------------------------------------
| epoch 146 step 99510 | 40 batches | lr 0.000427 | ms/batch 11948.10 | loss 0.71 | bpc 1.02970
| epoch 146 step 99520 | 50 batches | lr 0.000427 | ms/batch 5133.57 | loss 0.75 | bpc 1.08887
| epoch 146 step 99530 | 60 batches | lr 0.000427 | ms/batch 5134.67 | loss 0.77 | bpc 1.10436
| epoch 146 step 99540 | 70 batches | lr 0.000427 | ms/batch 5130.37 | loss 0.74 | bpc 1.06984
| epoch 146 step 99550 | 80 batches | lr 0.000427 | ms/batch 5134.16 | loss 0.70 | bpc 1.01636
| epoch 146 step 99560 | 90 batches | lr 0.000427 | ms/batch 5133.49 | loss 0.73 | bpc 1.05685
| epoch 146 step 99570 | 100 batches | lr 0.000427 | ms/batch 5132.06 | loss 0.71 | bpc 1.01796
| epoch 146 step 99580 | 110 batches | lr 0.000427 | ms/batch 5125.11 | loss 0.76 | bpc 1.10251
| epoch 146 step 99590 | 120 batches | lr 0.000427 | ms/batch 5132.78 | loss 0.73 | bpc 1.05633
| epoch 146 step 99600 | 130 batches | lr 0.000427 | ms/batch 5135.89 | loss 0.78 | bpc 1.11967
| epoch 146 step 99610 | 140 batches | lr 0.000427 | ms/batch 5132.50 | loss 0.79 | bpc 1.14019
| epoch 146 step 99620 | 150 batches | lr 0.000427 | ms/batch 5130.70 | loss 0.75 | bpc 1.07989
| epoch 146 step 99630 | 160 batches | lr 0.000427 | ms/batch 5132.27 | loss 0.78 | bpc 1.13139
| epoch 146 step 99640 | 170 batches | lr 0.000427 | ms/batch 5131.10 | loss 0.73 | bpc 1.05319
| epoch 146 step 99650 | 180 batches | lr 0.000427 | ms/batch 5132.63 | loss 0.67 | bpc 0.97087
| epoch 146 step 99660 | 190 batches | lr 0.000427 | ms/batch 5132.36 | loss 0.67 | bpc 0.97052
| epoch 146 step 99670 | 200 batches | lr 0.000427 | ms/batch 5136.83 | loss 0.63 | bpc 0.91016
| epoch 146 step 99680 | 210 batches | lr 0.000427 | ms/batch 5130.74 | loss 0.70 | bpc 1.00882
| epoch 146 step 99690 | 220 batches | lr 0.000427 | ms/batch 5133.01 | loss 0.77 | bpc 1.10412
| epoch 146 step 99700 | 230 batches | lr 0.000427 | ms/batch 5131.87 | loss 0.74 | bpc 1.06412
| epoch 146 step 99710 | 240 batches | lr 0.000427 | ms/batch 5131.69 | loss 0.74 | bpc 1.06468
| epoch 146 step 99720 | 250 batches | lr 0.000427 | ms/batch 5110.58 | loss 0.78 | bpc 1.12695
| epoch 146 step 99730 | 260 batches | lr 0.000427 | ms/batch 5112.41 | loss 0.72 | bpc 1.04057
| epoch 146 step 99740 | 270 batches | lr 0.000427 | ms/batch 5111.76 | loss 0.71 | bpc 1.01831
| epoch 146 step 99750 | 280 batches | lr 0.000427 | ms/batch 5112.48 | loss 0.75 | bpc 1.08026
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 399 at step 99750 | time: 1350.44s | valid loss 0.74 | bpc 1.07086
----------------------------------------------------------------------------------------------------
| epoch 146 step 99760 | 290 batches | lr 0.000427 | ms/batch 11936.34 | loss 0.77 | bpc 1.11779
| epoch 146 step 99770 | 300 batches | lr 0.000427 | ms/batch 5138.48 | loss 0.75 | bpc 1.08153
| epoch 146 step 99780 | 310 batches | lr 0.000427 | ms/batch 5133.68 | loss 0.77 | bpc 1.11038
| epoch 146 step 99790 | 320 batches | lr 0.000427 | ms/batch 5135.88 | loss 0.73 | bpc 1.05695
| epoch 146 step 99800 | 330 batches | lr 0.000427 | ms/batch 5131.38 | loss 0.76 | bpc 1.09577
| epoch 146 step 99810 | 340 batches | lr 0.000427 | ms/batch 5130.32 | loss 0.72 | bpc 1.04091
| epoch 146 step 99820 | 350 batches | lr 0.000427 | ms/batch 5137.12 | loss 0.74 | bpc 1.06843
| epoch 146 step 99830 | 360 batches | lr 0.000427 | ms/batch 5131.24 | loss 0.79 | bpc 1.13555
| epoch 146 step 99840 | 370 batches | lr 0.000427 | ms/batch 5132.55 | loss 0.73 | bpc 1.05109
| epoch 146 step 99850 | 380 batches | lr 0.000427 | ms/batch 5134.16 | loss 0.75 | bpc 1.07578
| epoch 146 step 99860 | 390 batches | lr 0.000427 | ms/batch 5131.61 | loss 0.74 | bpc 1.07312
| epoch 146 step 99870 | 400 batches | lr 0.000427 | ms/batch 5131.99 | loss 0.79 | bpc 1.13420
| epoch 146 step 99880 | 410 batches | lr 0.000427 | ms/batch 5134.14 | loss 0.76 | bpc 1.09191
| epoch 146 step 99890 | 420 batches | lr 0.000427 | ms/batch 5134.22 | loss 0.73 | bpc 1.04993
| epoch 146 step 99900 | 430 batches | lr 0.000427 | ms/batch 5130.97 | loss 0.73 | bpc 1.05848
| epoch 146 step 99910 | 440 batches | lr 0.000427 | ms/batch 5132.08 | loss 0.77 | bpc 1.10648
| epoch 146 step 99920 | 450 batches | lr 0.000427 | ms/batch 5132.87 | loss 0.76 | bpc 1.09308
| epoch 146 step 99930 | 460 batches | lr 0.000427 | ms/batch 5132.49 | loss 0.76 | bpc 1.08927
| epoch 146 step 99940 | 470 batches | lr 0.000427 | ms/batch 5130.56 | loss 0.73 | bpc 1.05161
| epoch 146 step 99950 | 480 batches | lr 0.000427 | ms/batch 5129.88 | loss 0.76 | bpc 1.10015
| epoch 146 step 99960 | 490 batches | lr 0.000427 | ms/batch 5132.43 | loss 0.74 | bpc 1.06973
| epoch 146 step 99970 | 500 batches | lr 0.000427 | ms/batch 5132.82 | loss 0.74 | bpc 1.06414
| epoch 146 step 99980 | 510 batches | lr 0.000427 | ms/batch 5131.34 | loss 0.74 | bpc 1.07050
| epoch 146 step 99990 | 520 batches | lr 0.000427 | ms/batch 5133.53 | loss 0.75 | bpc 1.08710
| epoch 146 step 100000 | 530 batches | lr 0.000427 | ms/batch 5131.35 | loss 0.77 | bpc 1.10833
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 400 at step 100000 | time: 1351.32s | valid loss 0.74 | bpc 1.07186
----------------------------------------------------------------------------------------------------
| epoch 146 step 100010 | 540 batches | lr 0.000427 | ms/batch 11944.90 | loss 0.78 | bpc 1.12299
| epoch 146 step 100020 | 550 batches | lr 0.000427 | ms/batch 5133.93 | loss 0.74 | bpc 1.06082
| epoch 146 step 100030 | 560 batches | lr 0.000427 | ms/batch 5137.92 | loss 0.77 | bpc 1.10735
| epoch 146 step 100040 | 570 batches | lr 0.000427 | ms/batch 5134.63 | loss 0.76 | bpc 1.09436
| epoch 146 step 100050 | 580 batches | lr 0.000427 | ms/batch 5133.04 | loss 0.74 | bpc 1.06308
| epoch 146 step 100060 | 590 batches | lr 0.000427 | ms/batch 5133.35 | loss 0.75 | bpc 1.07982
| epoch 146 step 100070 | 600 batches | lr 0.000427 | ms/batch 5131.73 | loss 0.75 | bpc 1.08444
| epoch 146 step 100080 | 610 batches | lr 0.000427 | ms/batch 5133.99 | loss 0.73 | bpc 1.05291
| epoch 146 step 100090 | 620 batches | lr 0.000427 | ms/batch 5131.55 | loss 0.75 | bpc 1.08698
| epoch 146 step 100100 | 630 batches | lr 0.000427 | ms/batch 5132.86 | loss 0.77 | bpc 1.10407
| epoch 146 step 100110 | 640 batches | lr 0.000427 | ms/batch 5132.31 | loss 0.75 | bpc 1.08010
| epoch 146 step 100120 | 650 batches | lr 0.000427 | ms/batch 5130.42 | loss 0.79 | bpc 1.13799
| epoch 146 step 100130 | 660 batches | lr 0.000427 | ms/batch 5127.07 | loss 0.77 | bpc 1.11329
| epoch 146 step 100140 | 670 batches | lr 0.000427 | ms/batch 5132.09 | loss 0.75 | bpc 1.07931
| epoch 146 step 100150 | 680 batches | lr 0.000427 | ms/batch 5130.90 | loss 0.70 | bpc 1.00788
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 147 step 100160 | 4 batches | lr 0.000427 | ms/batch 5134.22 | loss 0.70 | bpc 1.00335
| epoch 147 step 100170 | 14 batches | lr 0.000427 | ms/batch 5131.23 | loss 0.74 | bpc 1.06508
| epoch 147 step 100180 | 24 batches | lr 0.000427 | ms/batch 5133.92 | loss 0.77 | bpc 1.10925
| epoch 147 step 100190 | 34 batches | lr 0.000427 | ms/batch 5135.53 | loss 0.73 | bpc 1.04629
| epoch 147 step 100200 | 44 batches | lr 0.000426 | ms/batch 5133.21 | loss 0.72 | bpc 1.04100
| epoch 147 step 100210 | 54 batches | lr 0.000426 | ms/batch 5132.76 | loss 0.77 | bpc 1.10824
| epoch 147 step 100220 | 64 batches | lr 0.000426 | ms/batch 5131.88 | loss 0.75 | bpc 1.08258
| epoch 147 step 100230 | 74 batches | lr 0.000426 | ms/batch 5131.40 | loss 0.74 | bpc 1.06453
| epoch 147 step 100240 | 84 batches | lr 0.000426 | ms/batch 5132.88 | loss 0.70 | bpc 1.00981
| epoch 147 step 100250 | 94 batches | lr 0.000426 | ms/batch 5131.63 | loss 0.72 | bpc 1.03808
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 401 at step 100250 | time: 1357.19s | valid loss 0.74 | bpc 1.07105
----------------------------------------------------------------------------------------------------
| epoch 147 step 100260 | 104 batches | lr 0.000426 | ms/batch 11947.25 | loss 0.74 | bpc 1.07026
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
| epoch 147 step 100270 | 114 batches | lr 0.000426 | ms/batch 5130.17 | loss 0.74 | bpc 1.07189
| epoch 147 step 100280 | 124 batches | lr 0.000426 | ms/batch 5131.76 | loss 0.76 | bpc 1.09304
| epoch 147 step 100290 | 134 batches | lr 0.000426 | ms/batch 5131.54 | loss 0.76 | bpc 1.09474
| epoch 147 step 100300 | 144 batches | lr 0.000426 | ms/batch 5129.89 | loss 0.78 | bpc 1.12835
| epoch 147 step 100310 | 154 batches | lr 0.000426 | ms/batch 5134.90 | loss 0.76 | bpc 1.09785
| epoch 147 step 100320 | 164 batches | lr 0.000426 | ms/batch 5136.26 | loss 0.78 | bpc 1.13203
| epoch 147 step 100330 | 174 batches | lr 0.000426 | ms/batch 5133.03 | loss 0.68 | bpc 0.97983
| epoch 147 step 100340 | 184 batches | lr 0.000426 | ms/batch 5132.35 | loss 0.68 | bpc 0.98159
| epoch 147 step 100350 | 194 batches | lr 0.000426 | ms/batch 5132.35 | loss 0.69 | bpc 0.99133
| epoch 147 step 100360 | 204 batches | lr 0.000426 | ms/batch 5131.34 | loss 0.62 | bpc 0.89225
| epoch 147 step 100370 | 214 batches | lr 0.000426 | ms/batch 5134.89 | loss 0.73 | bpc 1.05025
| epoch 147 step 100380 | 224 batches | lr 0.000426 | ms/batch 5134.31 | loss 0.74 | bpc 1.07277
| epoch 147 step 100390 | 234 batches | lr 0.000426 | ms/batch 5122.10 | loss 0.74 | bpc 1.06237
| epoch 147 step 100400 | 244 batches | lr 0.000426 | ms/batch 5111.10 | loss 0.75 | bpc 1.08693
| epoch 147 step 100410 | 254 batches | lr 0.000426 | ms/batch 5113.00 | loss 0.78 | bpc 1.12171
| epoch 147 step 100420 | 264 batches | lr 0.000426 | ms/batch 5111.51 | loss 0.70 | bpc 1.01041
| epoch 147 step 100430 | 274 batches | lr 0.000426 | ms/batch 5114.54 | loss 0.72 | bpc 1.03929
| epoch 147 step 100440 | 284 batches | lr 0.000426 | ms/batch 5112.41 | loss 0.77 | bpc 1.10618
| epoch 147 step 100450 | 294 batches | lr 0.000426 | ms/batch 5113.84 | loss 0.76 | bpc 1.09254
| epoch 147 step 100460 | 304 batches | lr 0.000426 | ms/batch 5138.11 | loss 0.76 | bpc 1.09071
| epoch 147 step 100470 | 314 batches | lr 0.000426 | ms/batch 5133.53 | loss 0.76 | bpc 1.10274
| epoch 147 step 100480 | 324 batches | lr 0.000426 | ms/batch 5133.44 | loss 0.75 | bpc 1.07684
| epoch 147 step 100490 | 334 batches | lr 0.000426 | ms/batch 5135.07 | loss 0.75 | bpc 1.08891
| epoch 147 step 100500 | 344 batches | lr 0.000426 | ms/batch 5135.13 | loss 0.71 | bpc 1.02029
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 402 at step 100500 | time: 1354.33s | valid loss 0.74 | bpc 1.06913
----------------------------------------------------------------------------------------------------
| epoch 147 step 100510 | 354 batches | lr 0.000426 | ms/batch 11947.80 | loss 0.78 | bpc 1.12067
| epoch 147 step 100520 | 364 batches | lr 0.000426 | ms/batch 5132.78 | loss 0.75 | bpc 1.07560
| epoch 147 step 100530 | 374 batches | lr 0.000426 | ms/batch 5133.48 | loss 0.72 | bpc 1.04075
| epoch 147 step 100540 | 384 batches | lr 0.000426 | ms/batch 5135.08 | loss 0.76 | bpc 1.09840
| epoch 147 step 100550 | 394 batches | lr 0.000426 | ms/batch 5135.64 | loss 0.75 | bpc 1.08434
| epoch 147 step 100560 | 404 batches | lr 0.000426 | ms/batch 5136.77 | loss 0.77 | bpc 1.10859
| epoch 147 step 100570 | 414 batches | lr 0.000426 | ms/batch 5136.33 | loss 0.75 | bpc 1.08894
| epoch 147 step 100580 | 424 batches | lr 0.000426 | ms/batch 5136.56 | loss 0.72 | bpc 1.03886
| epoch 147 step 100590 | 434 batches | lr 0.000426 | ms/batch 5134.79 | loss 0.76 | bpc 1.09524
| epoch 147 step 100600 | 444 batches | lr 0.000426 | ms/batch 5134.15 | loss 0.76 | bpc 1.09388
| epoch 147 step 100610 | 454 batches | lr 0.000426 | ms/batch 5133.53 | loss 0.77 | bpc 1.10520
| epoch 147 step 100620 | 464 batches | lr 0.000426 | ms/batch 5131.79 | loss 0.74 | bpc 1.07481
| epoch 147 step 100630 | 474 batches | lr 0.000426 | ms/batch 5134.80 | loss 0.73 | bpc 1.05763
| epoch 147 step 100640 | 484 batches | lr 0.000426 | ms/batch 5132.17 | loss 0.75 | bpc 1.08677
| epoch 147 step 100650 | 494 batches | lr 0.000426 | ms/batch 5136.87 | loss 0.75 | bpc 1.08864
| epoch 147 step 100660 | 504 batches | lr 0.000426 | ms/batch 5133.09 | loss 0.73 | bpc 1.05479
| epoch 147 step 100670 | 514 batches | lr 0.000426 | ms/batch 5138.58 | loss 0.74 | bpc 1.06438
| epoch 147 step 100680 | 524 batches | lr 0.000426 | ms/batch 5135.46 | loss 0.75 | bpc 1.08040
| epoch 147 step 100690 | 534 batches | lr 0.000426 | ms/batch 5132.76 | loss 0.79 | bpc 1.14504
| epoch 147 step 100700 | 544 batches | lr 0.000426 | ms/batch 5135.74 | loss 0.74 | bpc 1.07028
| epoch 147 step 100710 | 554 batches | lr 0.000426 | ms/batch 5133.96 | loss 0.76 | bpc 1.09573
| epoch 147 step 100720 | 564 batches | lr 0.000426 | ms/batch 5135.63 | loss 0.77 | bpc 1.11092
| epoch 147 step 100730 | 574 batches | lr 0.000426 | ms/batch 5134.48 | loss 0.74 | bpc 1.06933
| epoch 147 step 100740 | 584 batches | lr 0.000426 | ms/batch 5132.90 | loss 0.73 | bpc 1.05971
| epoch 147 step 100750 | 594 batches | lr 0.000426 | ms/batch 5132.31 | loss 0.76 | bpc 1.10201
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 403 at step 100750 | time: 1356.26s | valid loss 0.74 | bpc 1.07222
----------------------------------------------------------------------------------------------------
| epoch 147 step 100760 | 604 batches | lr 0.000426 | ms/batch 11951.81 | loss 0.75 | bpc 1.07534
| epoch 147 step 100770 | 614 batches | lr 0.000426 | ms/batch 5136.51 | loss 0.72 | bpc 1.04268
| epoch 147 step 100780 | 624 batches | lr 0.000426 | ms/batch 5139.92 | loss 0.76 | bpc 1.08991
| epoch 147 step 100790 | 634 batches | lr 0.000426 | ms/batch 5135.86 | loss 0.77 | bpc 1.10746
| epoch 147 step 100800 | 644 batches | lr 0.000426 | ms/batch 5133.34 | loss 0.75 | bpc 1.07886
| epoch 147 step 100810 | 654 batches | lr 0.000426 | ms/batch 5139.22 | loss 0.80 | bpc 1.15350
| epoch 147 step 100820 | 664 batches | lr 0.000426 | ms/batch 5133.20 | loss 0.77 | bpc 1.11341
| epoch 147 step 100830 | 674 batches | lr 0.000426 | ms/batch 5135.64 | loss 0.71 | bpc 1.02121
| epoch 147 step 100840 | 684 batches | lr 0.000426 | ms/batch 5136.21 | loss 0.70 | bpc 1.00793
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 148 step 100850 | 8 batches | lr 0.000426 | ms/batch 5137.10 | loss 0.72 | bpc 1.03628
| epoch 148 step 100860 | 18 batches | lr 0.000426 | ms/batch 5131.99 | loss 0.75 | bpc 1.07765
| epoch 148 step 100870 | 28 batches | lr 0.000426 | ms/batch 5129.95 | loss 0.76 | bpc 1.10232
| epoch 148 step 100880 | 38 batches | lr 0.000426 | ms/batch 5130.93 | loss 0.71 | bpc 1.02995
| epoch 148 step 100890 | 48 batches | lr 0.000426 | ms/batch 5136.46 | loss 0.73 | bpc 1.05832
| epoch 148 step 100900 | 58 batches | lr 0.000426 | ms/batch 5133.60 | loss 0.77 | bpc 1.10821
| epoch 148 step 100910 | 68 batches | lr 0.000426 | ms/batch 5132.46 | loss 0.76 | bpc 1.08935
| epoch 148 step 100920 | 78 batches | lr 0.000425 | ms/batch 5132.69 | loss 0.71 | bpc 1.02175
| epoch 148 step 100930 | 88 batches | lr 0.000425 | ms/batch 5138.33 | loss 0.72 | bpc 1.03857
| epoch 148 step 100940 | 98 batches | lr 0.000425 | ms/batch 5134.94 | loss 0.71 | bpc 1.01942
| epoch 148 step 100950 | 108 batches | lr 0.000425 | ms/batch 5135.74 | loss 0.76 | bpc 1.10065
| epoch 148 step 100960 | 118 batches | lr 0.000425 | ms/batch 5135.96 | loss 0.74 | bpc 1.06654
| epoch 148 step 100970 | 128 batches | lr 0.000425 | ms/batch 5135.47 | loss 0.77 | bpc 1.11065
| epoch 148 step 100980 | 138 batches | lr 0.000425 | ms/batch 5135.56 | loss 0.78 | bpc 1.12438
| epoch 148 step 100990 | 148 batches | lr 0.000425 | ms/batch 5136.93 | loss 0.74 | bpc 1.07237
| epoch 148 step 101000 | 158 batches | lr 0.000425 | ms/batch 5138.53 | loss 0.78 | bpc 1.13211
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 404 at step 101000 | time: 1352.10s | valid loss 0.75 | bpc 1.07575
----------------------------------------------------------------------------------------------------
| epoch 148 step 101010 | 168 batches | lr 0.000425 | ms/batch 11952.52 | loss 0.75 | bpc 1.08429
| epoch 148 step 101020 | 178 batches | lr 0.000425 | ms/batch 5133.86 | loss 0.67 | bpc 0.96932
| epoch 148 step 101030 | 188 batches | lr 0.000425 | ms/batch 5137.40 | loss 0.68 | bpc 0.97547
| epoch 148 step 101040 | 198 batches | lr 0.000425 | ms/batch 5135.94 | loss 0.65 | bpc 0.93763
| epoch 148 step 101050 | 208 batches | lr 0.000425 | ms/batch 5136.13 | loss 0.67 | bpc 0.96370
| epoch 148 step 101060 | 218 batches | lr 0.000425 | ms/batch 5135.03 | loss 0.75 | bpc 1.08903
| epoch 148 step 101070 | 228 batches | lr 0.000425 | ms/batch 5134.44 | loss 0.74 | bpc 1.06899
| epoch 148 step 101080 | 238 batches | lr 0.000425 | ms/batch 5133.32 | loss 0.73 | bpc 1.05981
| epoch 148 step 101090 | 248 batches | lr 0.000425 | ms/batch 5134.44 | loss 0.78 | bpc 1.11979
| epoch 148 step 101100 | 258 batches | lr 0.000425 | ms/batch 5137.77 | loss 0.74 | bpc 1.07294
| epoch 148 step 101110 | 268 batches | lr 0.000425 | ms/batch 5138.58 | loss 0.69 | bpc 1.00018
| epoch 148 step 101120 | 278 batches | lr 0.000425 | ms/batch 5135.20 | loss 0.74 | bpc 1.07188
| epoch 148 step 101130 | 288 batches | lr 0.000425 | ms/batch 5138.70 | loss 0.78 | bpc 1.12651
| epoch 148 step 101140 | 298 batches | lr 0.000425 | ms/batch 5137.81 | loss 0.74 | bpc 1.07470
| epoch 148 step 101150 | 308 batches | lr 0.000425 | ms/batch 5119.18 | loss 0.76 | bpc 1.08964
| epoch 148 step 101160 | 318 batches | lr 0.000425 | ms/batch 5110.16 | loss 0.75 | bpc 1.08753
| epoch 148 step 101170 | 328 batches | lr 0.000425 | ms/batch 5110.39 | loss 0.75 | bpc 1.08893
| epoch 148 step 101180 | 338 batches | lr 0.000425 | ms/batch 5111.73 | loss 0.73 | bpc 1.04763
| epoch 148 step 101190 | 348 batches | lr 0.000425 | ms/batch 5111.74 | loss 0.72 | bpc 1.03766
| epoch 148 step 101200 | 358 batches | lr 0.000425 | ms/batch 5113.41 | loss 0.79 | bpc 1.13964
| epoch 148 step 101210 | 368 batches | lr 0.000425 | ms/batch 5111.54 | loss 0.74 | bpc 1.06073
| epoch 148 step 101220 | 378 batches | lr 0.000425 | ms/batch 5130.00 | loss 0.73 | bpc 1.05980
| epoch 148 step 101230 | 388 batches | lr 0.000425 | ms/batch 5138.08 | loss 0.77 | bpc 1.10561
| epoch 148 step 101240 | 398 batches | lr 0.000425 | ms/batch 5137.90 | loss 0.76 | bpc 1.09865
| epoch 148 step 101250 | 408 batches | lr 0.000425 | ms/batch 5139.71 | loss 0.76 | bpc 1.10031
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 405 at step 101250 | time: 1350.61s | valid loss 0.74 | bpc 1.06764
----------------------------------------------------------------------------------------------------
| epoch 148 step 101260 | 418 batches | lr 0.000425 | ms/batch 11949.10 | loss 0.73 | bpc 1.05699
| epoch 148 step 101270 | 428 batches | lr 0.000425 | ms/batch 5132.45 | loss 0.72 | bpc 1.04350
| epoch 148 step 101280 | 438 batches | lr 0.000425 | ms/batch 5136.79 | loss 0.77 | bpc 1.10559
| epoch 148 step 101290 | 448 batches | lr 0.000425 | ms/batch 5135.89 | loss 0.76 | bpc 1.09627
| epoch 148 step 101300 | 458 batches | lr 0.000425 | ms/batch 5136.89 | loss 0.76 | bpc 1.09413
| epoch 148 step 101310 | 468 batches | lr 0.000425 | ms/batch 5139.34 | loss 0.72 | bpc 1.03914
| epoch 148 step 101320 | 478 batches | lr 0.000425 | ms/batch 5137.59 | loss 0.75 | bpc 1.08801
| epoch 148 step 101330 | 488 batches | lr 0.000425 | ms/batch 5143.30 | loss 0.75 | bpc 1.07909
| epoch 148 step 101340 | 498 batches | lr 0.000425 | ms/batch 5139.47 | loss 0.74 | bpc 1.07060
| epoch 148 step 101350 | 508 batches | lr 0.000425 | ms/batch 5139.49 | loss 0.75 | bpc 1.07993
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
| epoch 148 step 101360 | 518 batches | lr 0.000425 | ms/batch 5134.76 | loss 0.74 | bpc 1.06622
| epoch 148 step 101370 | 528 batches | lr 0.000425 | ms/batch 5137.14 | loss 0.76 | bpc 1.10168
| epoch 148 step 101380 | 538 batches | lr 0.000425 | ms/batch 5136.89 | loss 0.78 | bpc 1.13203
| epoch 148 step 101390 | 548 batches | lr 0.000425 | ms/batch 5139.33 | loss 0.73 | bpc 1.05468
| epoch 148 step 101400 | 558 batches | lr 0.000425 | ms/batch 5138.22 | loss 0.76 | bpc 1.09788
| epoch 148 step 101410 | 568 batches | lr 0.000425 | ms/batch 5139.85 | loss 0.77 | bpc 1.11243
| epoch 148 step 101420 | 578 batches | lr 0.000425 | ms/batch 5136.87 | loss 0.73 | bpc 1.05150
| epoch 148 step 101430 | 588 batches | lr 0.000425 | ms/batch 5136.67 | loss 0.75 | bpc 1.08149
| epoch 148 step 101440 | 598 batches | lr 0.000425 | ms/batch 5133.66 | loss 0.76 | bpc 1.09418
| epoch 148 step 101450 | 608 batches | lr 0.000425 | ms/batch 5133.59 | loss 0.74 | bpc 1.06118
| epoch 148 step 101460 | 618 batches | lr 0.000425 | ms/batch 5137.08 | loss 0.74 | bpc 1.07279
| epoch 148 step 101470 | 628 batches | lr 0.000425 | ms/batch 5141.77 | loss 0.76 | bpc 1.10073
| epoch 148 step 101480 | 638 batches | lr 0.000425 | ms/batch 5136.48 | loss 0.74 | bpc 1.06419
| epoch 148 step 101490 | 648 batches | lr 0.000425 | ms/batch 5137.86 | loss 0.79 | bpc 1.13849
| epoch 148 step 101500 | 658 batches | lr 0.000425 | ms/batch 5136.17 | loss 0.77 | bpc 1.11173
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 406 at step 101500 | time: 1352.71s | valid loss 0.74 | bpc 1.07176
----------------------------------------------------------------------------------------------------
| epoch 148 step 101510 | 668 batches | lr 0.000425 | ms/batch 11949.99 | loss 0.77 | bpc 1.10574
| epoch 148 step 101520 | 678 batches | lr 0.000425 | ms/batch 5138.57 | loss 0.69 | bpc 1.00024
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
| epoch 149 step 101530 | 2 batches | lr 0.000425 | ms/batch 5137.52 | loss 0.69 | bpc 0.99837
| epoch 149 step 101540 | 12 batches | lr 0.000425 | ms/batch 5139.90 | loss 0.74 | bpc 1.06567
| epoch 149 step 101550 | 22 batches | lr 0.000425 | ms/batch 5135.73 | loss 0.77 | bpc 1.10631
| epoch 149 step 101560 | 32 batches | lr 0.000425 | ms/batch 5140.58 | loss 0.73 | bpc 1.05398
| epoch 149 step 101570 | 42 batches | lr 0.000425 | ms/batch 5138.21 | loss 0.72 | bpc 1.03804
| epoch 149 step 101580 | 52 batches | lr 0.000425 | ms/batch 5138.76 | loss 0.75 | bpc 1.08633
| epoch 149 step 101590 | 62 batches | lr 0.000425 | ms/batch 5139.06 | loss 0.75 | bpc 1.08207
| epoch 149 step 101600 | 72 batches | lr 0.000425 | ms/batch 5137.03 | loss 0.75 | bpc 1.08588
| epoch 149 step 101610 | 82 batches | lr 0.000425 | ms/batch 5139.12 | loss 0.70 | bpc 1.01112
| epoch 149 step 101620 | 92 batches | lr 0.000425 | ms/batch 5141.47 | loss 0.72 | bpc 1.03658
| epoch 149 step 101630 | 102 batches | lr 0.000424 | ms/batch 5138.61 | loss 0.73 | bpc 1.05425
| epoch 149 step 101640 | 112 batches | lr 0.000424 | ms/batch 5141.74 | loss 0.75 | bpc 1.07843
| epoch 149 step 101650 | 122 batches | lr 0.000424 | ms/batch 5133.86 | loss 0.75 | bpc 1.08025
| epoch 149 step 101660 | 132 batches | lr 0.000424 | ms/batch 5136.67 | loss 0.76 | bpc 1.09384
| epoch 149 step 101670 | 142 batches | lr 0.000424 | ms/batch 5140.33 | loss 0.80 | bpc 1.15137
| epoch 149 step 101680 | 152 batches | lr 0.000424 | ms/batch 5139.26 | loss 0.76 | bpc 1.09299
| epoch 149 step 101690 | 162 batches | lr 0.000424 | ms/batch 5136.88 | loss 0.78 | bpc 1.12396
| epoch 149 step 101700 | 172 batches | lr 0.000424 | ms/batch 5133.44 | loss 0.70 | bpc 1.00962
| epoch 149 step 101710 | 182 batches | lr 0.000424 | ms/batch 5138.53 | loss 0.67 | bpc 0.97317
| epoch 149 step 101720 | 192 batches | lr 0.000424 | ms/batch 5138.72 | loss 0.68 | bpc 0.98611
| epoch 149 step 101730 | 202 batches | lr 0.000424 | ms/batch 5136.32 | loss 0.61 | bpc 0.87611
| epoch 149 step 101740 | 212 batches | lr 0.000424 | ms/batch 5137.44 | loss 0.72 | bpc 1.04564
| epoch 149 step 101750 | 222 batches | lr 0.000424 | ms/batch 5141.88 | loss 0.76 | bpc 1.08981
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 407 at step 101750 | time: 1352.75s | valid loss 0.74 | bpc 1.07164
----------------------------------------------------------------------------------------------------
| epoch 149 step 101760 | 232 batches | lr 0.000424 | ms/batch 11952.14 | loss 0.74 | bpc 1.06380
| epoch 149 step 101770 | 242 batches | lr 0.000424 | ms/batch 5137.26 | loss 0.74 | bpc 1.06647
| epoch 149 step 101780 | 252 batches | lr 0.000424 | ms/batch 5135.48 | loss 0.79 | bpc 1.13786
| epoch 149 step 101790 | 262 batches | lr 0.000424 | ms/batch 5137.07 | loss 0.71 | bpc 1.02473
| epoch 149 step 101800 | 272 batches | lr 0.000424 | ms/batch 5134.57 | loss 0.70 | bpc 1.01265
| epoch 149 step 101810 | 282 batches | lr 0.000424 | ms/batch 5141.22 | loss 0.76 | bpc 1.09783
| epoch 149 step 101820 | 292 batches | lr 0.000424 | ms/batch 5135.66 | loss 0.76 | bpc 1.09710
| epoch 149 step 101830 | 302 batches | lr 0.000424 | ms/batch 5137.48 | loss 0.75 | bpc 1.08652
| epoch 149 step 101840 | 312 batches | lr 0.000424 | ms/batch 5137.33 | loss 0.77 | bpc 1.11660
| epoch 149 step 101850 | 322 batches | lr 0.000424 | ms/batch 5137.55 | loss 0.73 | bpc 1.05737
| epoch 149 step 101860 | 332 batches | lr 0.000424 | ms/batch 5136.57 | loss 0.77 | bpc 1.10586
| epoch 149 step 101870 | 342 batches | lr 0.000424 | ms/batch 5140.36 | loss 0.70 | bpc 1.00519
| epoch 149 step 101880 | 352 batches | lr 0.000424 | ms/batch 5137.52 | loss 0.76 | bpc 1.10141
| epoch 149 step 101890 | 362 batches | lr 0.000424 | ms/batch 5138.64 | loss 0.76 | bpc 1.10317
| epoch 149 step 101900 | 372 batches | lr 0.000424 | ms/batch 5139.07 | loss 0.72 | bpc 1.03524
| epoch 149 step 101910 | 382 batches | lr 0.000424 | ms/batch 5139.80 | loss 0.76 | bpc 1.09094
| epoch 149 step 101920 | 392 batches | lr 0.000424 | ms/batch 5134.38 | loss 0.75 | bpc 1.08066
| epoch 149 step 101930 | 402 batches | lr 0.000424 | ms/batch 5138.10 | loss 0.79 | bpc 1.13402
| epoch 149 step 101940 | 412 batches | lr 0.000424 | ms/batch 5135.59 | loss 0.74 | bpc 1.07405
| epoch 149 step 101950 | 422 batches | lr 0.000424 | ms/batch 5138.66 | loss 0.73 | bpc 1.05106
| epoch 149 step 101960 | 432 batches | lr 0.000424 | ms/batch 5140.52 | loss 0.74 | bpc 1.06560
| epoch 149 step 101970 | 442 batches | lr 0.000424 | ms/batch 5140.20 | loss 0.77 | bpc 1.11418
| epoch 149 step 101980 | 452 batches | lr 0.000424 | ms/batch 5134.05 | loss 0.76 | bpc 1.09063
| epoch 149 step 101990 | 462 batches | lr 0.000424 | ms/batch 5137.58 | loss 0.75 | bpc 1.08487
| epoch 149 step 102000 | 472 batches | lr 0.000424 | ms/batch 5139.30 | loss 0.74 | bpc 1.06360
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 408 at step 102000 | time: 1352.56s | valid loss 0.74 | bpc 1.07360
----------------------------------------------------------------------------------------------------
| epoch 149 step 102010 | 482 batches | lr 0.000424 | ms/batch 11916.66 | loss 0.75 | bpc 1.08851
| epoch 149 step 102020 | 492 batches | lr 0.000424 | ms/batch 5112.55 | loss 0.74 | bpc 1.06774
| epoch 149 step 102030 | 502 batches | lr 0.000424 | ms/batch 5113.05 | loss 0.74 | bpc 1.06282
| epoch 149 step 102040 | 512 batches | lr 0.000424 | ms/batch 5111.73 | loss 0.74 | bpc 1.06766
| epoch 149 step 102050 | 522 batches | lr 0.000424 | ms/batch 5112.78 | loss 0.75 | bpc 1.08245
| epoch 149 step 102060 | 532 batches | lr 0.000424 | ms/batch 5111.02 | loss 0.79 | bpc 1.13798
| epoch 149 step 102070 | 542 batches | lr 0.000424 | ms/batch 5134.60 | loss 0.75 | bpc 1.07704
| epoch 149 step 102080 | 552 batches | lr 0.000424 | ms/batch 5139.50 | loss 0.75 | bpc 1.08570
| epoch 149 step 102090 | 562 batches | lr 0.000424 | ms/batch 5138.34 | loss 0.77 | bpc 1.11736
| epoch 149 step 102100 | 572 batches | lr 0.000424 | ms/batch 5141.98 | loss 0.74 | bpc 1.07341
| epoch 149 step 102110 | 582 batches | lr 0.000424 | ms/batch 5138.24 | loss 0.73 | bpc 1.04859
| epoch 149 step 102120 | 592 batches | lr 0.000424 | ms/batch 5141.29 | loss 0.77 | bpc 1.11223
| epoch 149 step 102130 | 602 batches | lr 0.000424 | ms/batch 5142.65 | loss 0.74 | bpc 1.06518
| epoch 149 step 102140 | 612 batches | lr 0.000424 | ms/batch 5139.33 | loss 0.72 | bpc 1.03899
| epoch 149 step 102150 | 622 batches | lr 0.000424 | ms/batch 5136.50 | loss 0.76 | bpc 1.10261
| epoch 149 step 102160 | 632 batches | lr 0.000424 | ms/batch 5137.92 | loss 0.76 | bpc 1.09612
| epoch 149 step 102170 | 642 batches | lr 0.000424 | ms/batch 5139.81 | loss 0.75 | bpc 1.08526
| epoch 149 step 102180 | 652 batches | lr 0.000424 | ms/batch 5139.07 | loss 0.78 | bpc 1.13062
| epoch 149 step 102190 | 662 batches | lr 0.000424 | ms/batch 5131.62 | loss 0.78 | bpc 1.12133
| epoch 149 step 102200 | 672 batches | lr 0.000424 | ms/batch 5133.67 | loss 0.72 | bpc 1.03573
| epoch 149 step 102210 | 682 batches | lr 0.000424 | ms/batch 5139.46 | loss 0.71 | bpc 1.01831
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 150 step 102220 | 6 batches | lr 0.000424 | ms/batch 5141.88 | loss 0.71 | bpc 1.01972
| epoch 150 step 102230 | 16 batches | lr 0.000424 | ms/batch 5139.52 | loss 0.74 | bpc 1.06812
| epoch 150 step 102240 | 26 batches | lr 0.000424 | ms/batch 5140.45 | loss 0.78 | bpc 1.11827
| epoch 150 step 102250 | 36 batches | lr 0.000424 | ms/batch 5139.38 | loss 0.70 | bpc 1.01665
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 409 at step 102250 | time: 1352.06s | valid loss 0.75 | bpc 1.07572
----------------------------------------------------------------------------------------------------
| epoch 150 step 102260 | 46 batches | lr 0.000424 | ms/batch 11952.69 | loss 0.73 | bpc 1.05823
| epoch 150 step 102270 | 56 batches | lr 0.000424 | ms/batch 5139.24 | loss 0.77 | bpc 1.11028
| epoch 150 step 102280 | 66 batches | lr 0.000424 | ms/batch 5141.48 | loss 0.76 | bpc 1.09239
| epoch 150 step 102290 | 76 batches | lr 0.000424 | ms/batch 5139.39 | loss 0.71 | bpc 1.03041
| epoch 150 step 102300 | 86 batches | lr 0.000424 | ms/batch 5133.58 | loss 0.71 | bpc 1.02768
| epoch 150 step 102310 | 96 batches | lr 0.000424 | ms/batch 5135.01 | loss 0.71 | bpc 1.02901
| epoch 150 step 102320 | 106 batches | lr 0.000424 | ms/batch 5138.24 | loss 0.75 | bpc 1.08286
| epoch 150 step 102330 | 116 batches | lr 0.000424 | ms/batch 5142.09 | loss 0.74 | bpc 1.07194
| epoch 150 step 102340 | 126 batches | lr 0.000423 | ms/batch 5139.32 | loss 0.75 | bpc 1.08635
| epoch 150 step 102350 | 136 batches | lr 0.000423 | ms/batch 5138.91 | loss 0.77 | bpc 1.11044
| epoch 150 step 102360 | 146 batches | lr 0.000423 | ms/batch 5134.72 | loss 0.77 | bpc 1.10607
| epoch 150 step 102370 | 156 batches | lr 0.000423 | ms/batch 5139.40 | loss 0.77 | bpc 1.10760
| epoch 150 step 102380 | 166 batches | lr 0.000423 | ms/batch 5138.01 | loss 0.77 | bpc 1.11706
| epoch 150 step 102390 | 176 batches | lr 0.000423 | ms/batch 5138.40 | loss 0.68 | bpc 0.97499
| epoch 150 step 102400 | 186 batches | lr 0.000423 | ms/batch 5137.77 | loss 0.68 | bpc 0.97980
| epoch 150 step 102410 | 196 batches | lr 0.000423 | ms/batch 5141.53 | loss 0.66 | bpc 0.94869
| epoch 150 step 102420 | 206 batches | lr 0.000423 | ms/batch 5142.00 | loss 0.65 | bpc 0.93256
| epoch 150 step 102430 | 216 batches | lr 0.000423 | ms/batch 5135.85 | loss 0.75 | bpc 1.07755
| epoch 150 step 102440 | 226 batches | lr 0.000423 | ms/batch 5135.69 | loss 0.74 | bpc 1.06292
| epoch 150 step 102450 | 236 batches | lr 0.000423 | ms/batch 5137.18 | loss 0.73 | bpc 1.05491
| epoch 150 step 102460 | 246 batches | lr 0.000423 | ms/batch 5141.74 | loss 0.78 | bpc 1.12033
| epoch 150 step 102470 | 256 batches | lr 0.000423 | ms/batch 5134.34 | loss 0.74 | bpc 1.07473
| epoch 150 step 102480 | 266 batches | lr 0.000423 | ms/batch 5127.36 | loss 0.70 | bpc 1.00529
| epoch 150 step 102490 | 276 batches | lr 0.000423 | ms/batch 5139.47 | loss 0.74 | bpc 1.06253
| epoch 150 step 102500 | 286 batches | lr 0.000423 | ms/batch 5142.70 | loss 0.78 | bpc 1.12674
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 410 at step 102500 | time: 1352.73s | valid loss 0.74 | bpc 1.06951
----------------------------------------------------------------------------------------------------
| epoch 150 step 102510 | 296 batches | lr 0.000423 | ms/batch 11951.56 | loss 0.74 | bpc 1.06885
| epoch 150 step 102520 | 306 batches | lr 0.000423 | ms/batch 5137.37 | loss 0.75 | bpc 1.07694
| epoch 150 step 102530 | 316 batches | lr 0.000423 | ms/batch 5139.50 | loss 0.77 | bpc 1.11137
| epoch 150 step 102540 | 326 batches | lr 0.000423 | ms/batch 5138.39 | loss 0.75 | bpc 1.07547
| epoch 150 step 102550 | 336 batches | lr 0.000423 | ms/batch 5135.67 | loss 0.74 | bpc 1.07426
| epoch 150 step 102560 | 346 batches | lr 0.000423 | ms/batch 5132.89 | loss 0.71 | bpc 1.03140
| epoch 150 step 102570 | 356 batches | lr 0.000423 | ms/batch 5137.90 | loss 0.78 | bpc 1.12098
| epoch 150 step 102580 | 366 batches | lr 0.000423 | ms/batch 5138.51 | loss 0.74 | bpc 1.06078
| epoch 150 step 102590 | 376 batches | lr 0.000423 | ms/batch 5136.11 | loss 0.73 | bpc 1.04947
| epoch 150 step 102600 | 386 batches | lr 0.000423 | ms/batch 5135.69 | loss 0.77 | bpc 1.10426
| epoch 150 step 102610 | 396 batches | lr 0.000423 | ms/batch 5138.97 | loss 0.76 | bpc 1.08984
| epoch 150 step 102620 | 406 batches | lr 0.000423 | ms/batch 5139.83 | loss 0.76 | bpc 1.10274
| epoch 150 step 102630 | 416 batches | lr 0.000423 | ms/batch 5136.99 | loss 0.75 | bpc 1.07485
| epoch 150 step 102640 | 426 batches | lr 0.000423 | ms/batch 5138.18 | loss 0.71 | bpc 1.02683
| epoch 150 step 102650 | 436 batches | lr 0.000423 | ms/batch 5135.23 | loss 0.78 | bpc 1.12030
| epoch 150 step 102660 | 446 batches | lr 0.000423 | ms/batch 5137.53 | loss 0.75 | bpc 1.08659
| epoch 150 step 102670 | 456 batches | lr 0.000423 | ms/batch 5139.06 | loss 0.75 | bpc 1.08534
| epoch 150 step 102680 | 466 batches | lr 0.000423 | ms/batch 5137.39 | loss 0.74 | bpc 1.07061
| epoch 150 step 102690 | 476 batches | lr 0.000423 | ms/batch 5138.42 | loss 0.74 | bpc 1.06269
| epoch 150 step 102700 | 486 batches | lr 0.000423 | ms/batch 5138.35 | loss 0.75 | bpc 1.08875
| epoch 150 step 102710 | 496 batches | lr 0.000423 | ms/batch 5141.39 | loss 0.75 | bpc 1.07757
| epoch 150 step 102720 | 506 batches | lr 0.000423 | ms/batch 5137.85 | loss 0.73 | bpc 1.05529
| epoch 150 step 102730 | 516 batches | lr 0.000423 | ms/batch 5139.20 | loss 0.75 | bpc 1.07809
| epoch 150 step 102740 | 526 batches | lr 0.000423 | ms/batch 5137.88 | loss 0.76 | bpc 1.09121
| epoch 150 step 102750 | 536 batches | lr 0.000423 | ms/batch 5137.38 | loss 0.79 | bpc 1.13253
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 411 at step 102750 | time: 1353.82s | valid loss 0.74 | bpc 1.07328
----------------------------------------------------------------------------------------------------
| epoch 150 step 102760 | 546 batches | lr 0.000423 | ms/batch 11950.87 | loss 0.74 | bpc 1.06246
| epoch 150 step 102770 | 556 batches | lr 0.000423 | ms/batch 5137.11 | loss 0.76 | bpc 1.10258
| epoch 150 step 102780 | 566 batches | lr 0.000423 | ms/batch 5136.90 | loss 0.77 | bpc 1.11488
| epoch 150 step 102790 | 576 batches | lr 0.000423 | ms/batch 5139.94 | loss 0.73 | bpc 1.05619
| epoch 150 step 102800 | 586 batches | lr 0.000423 | ms/batch 5139.04 | loss 0.74 | bpc 1.06405
| epoch 150 step 102810 | 596 batches | lr 0.000423 | ms/batch 5135.58 | loss 0.77 | bpc 1.10450
| epoch 150 step 102820 | 606 batches | lr 0.000423 | ms/batch 5140.13 | loss 0.73 | bpc 1.05878
| epoch 150 step 102830 | 616 batches | lr 0.000423 | ms/batch 5138.46 | loss 0.73 | bpc 1.05758
| epoch 150 step 102840 | 626 batches | lr 0.000423 | ms/batch 5137.87 | loss 0.75 | bpc 1.08618
| epoch 150 step 102850 | 636 batches | lr 0.000423 | ms/batch 5140.59 | loss 0.76 | bpc 1.10063
| epoch 150 step 102860 | 646 batches | lr 0.000423 | ms/batch 5137.29 | loss 0.76 | bpc 1.09041
| epoch 150 step 102870 | 656 batches | lr 0.000423 | ms/batch 5133.88 | loss 0.78 | bpc 1.13098
| epoch 150 step 102880 | 666 batches | lr 0.000423 | ms/batch 5137.35 | loss 0.77 | bpc 1.11765
| epoch 150 step 102890 | 676 batches | lr 0.000423 | ms/batch 5140.09 | loss 0.70 | bpc 1.00554
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
| epoch 150 step 102900 | 686 batches | lr 0.000423 | ms/batch 5134.95 | loss 0.69 | bpc 0.99752
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 151 step 102910 | 10 batches | lr 0.000423 | ms/batch 5139.75 | loss 0.73 | bpc 1.04618
| epoch 151 step 102920 | 20 batches | lr 0.000423 | ms/batch 5119.34 | loss 0.76 | bpc 1.09059
| epoch 151 step 102930 | 30 batches | lr 0.000423 | ms/batch 5110.14 | loss 0.75 | bpc 1.08106
| epoch 151 step 102940 | 40 batches | lr 0.000423 | ms/batch 5110.55 | loss 0.71 | bpc 1.02695
| epoch 151 step 102950 | 50 batches | lr 0.000423 | ms/batch 5111.75 | loss 0.75 | bpc 1.08694
| epoch 151 step 102960 | 60 batches | lr 0.000423 | ms/batch 5109.43 | loss 0.76 | bpc 1.10329
| epoch 151 step 102970 | 70 batches | lr 0.000423 | ms/batch 5111.61 | loss 0.74 | bpc 1.06657
| epoch 151 step 102980 | 80 batches | lr 0.000423 | ms/batch 5111.21 | loss 0.70 | bpc 1.01150
| epoch 151 step 102990 | 90 batches | lr 0.000423 | ms/batch 5108.74 | loss 0.73 | bpc 1.05349
| epoch 151 step 103000 | 100 batches | lr 0.000423 | ms/batch 5118.25 | loss 0.70 | bpc 1.01611
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 412 at step 103000 | time: 1351.51s | valid loss 0.75 | bpc 1.07488
----------------------------------------------------------------------------------------------------
| epoch 151 step 103010 | 110 batches | lr 0.000423 | ms/batch 11952.96 | loss 0.76 | bpc 1.09961
| epoch 151 step 103020 | 120 batches | lr 0.000423 | ms/batch 5139.36 | loss 0.73 | bpc 1.05529
| epoch 151 step 103030 | 130 batches | lr 0.000423 | ms/batch 5134.96 | loss 0.77 | bpc 1.11581
| epoch 151 step 103040 | 140 batches | lr 0.000423 | ms/batch 5140.07 | loss 0.79 | bpc 1.13598
| epoch 151 step 103050 | 150 batches | lr 0.000422 | ms/batch 5137.01 | loss 0.75 | bpc 1.07724
| epoch 151 step 103060 | 160 batches | lr 0.000422 | ms/batch 5137.02 | loss 0.78 | bpc 1.12880
| epoch 151 step 103070 | 170 batches | lr 0.000422 | ms/batch 5140.95 | loss 0.73 | bpc 1.04954
| epoch 151 step 103080 | 180 batches | lr 0.000422 | ms/batch 5138.01 | loss 0.67 | bpc 0.96959
| epoch 151 step 103090 | 190 batches | lr 0.000422 | ms/batch 5139.04 | loss 0.67 | bpc 0.96959
| epoch 151 step 103100 | 200 batches | lr 0.000422 | ms/batch 5139.38 | loss 0.63 | bpc 0.90700
| epoch 151 step 103110 | 210 batches | lr 0.000422 | ms/batch 5131.17 | loss 0.70 | bpc 1.00334
| epoch 151 step 103120 | 220 batches | lr 0.000422 | ms/batch 5135.24 | loss 0.76 | bpc 1.09996
| epoch 151 step 103130 | 230 batches | lr 0.000422 | ms/batch 5140.48 | loss 0.74 | bpc 1.06099
| epoch 151 step 103140 | 240 batches | lr 0.000422 | ms/batch 5139.20 | loss 0.74 | bpc 1.06185
| epoch 151 step 103150 | 250 batches | lr 0.000422 | ms/batch 5134.84 | loss 0.78 | bpc 1.12509
| epoch 151 step 103160 | 260 batches | lr 0.000422 | ms/batch 5132.40 | loss 0.72 | bpc 1.03717
| epoch 151 step 103170 | 270 batches | lr 0.000422 | ms/batch 5132.43 | loss 0.70 | bpc 1.01367
| epoch 151 step 103180 | 280 batches | lr 0.000422 | ms/batch 5136.62 | loss 0.75 | bpc 1.07602
| epoch 151 step 103190 | 290 batches | lr 0.000422 | ms/batch 5132.69 | loss 0.77 | bpc 1.11605
| epoch 151 step 103200 | 300 batches | lr 0.000422 | ms/batch 5137.43 | loss 0.75 | bpc 1.07852
| epoch 151 step 103210 | 310 batches | lr 0.000422 | ms/batch 5137.77 | loss 0.77 | bpc 1.10769
| epoch 151 step 103220 | 320 batches | lr 0.000422 | ms/batch 5135.66 | loss 0.73 | bpc 1.05498
| epoch 151 step 103230 | 330 batches | lr 0.000422 | ms/batch 5137.72 | loss 0.76 | bpc 1.09529
| epoch 151 step 103240 | 340 batches | lr 0.000422 | ms/batch 5138.81 | loss 0.72 | bpc 1.03785
| epoch 151 step 103250 | 350 batches | lr 0.000422 | ms/batch 5138.30 | loss 0.74 | bpc 1.06289
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 413 at step 103250 | time: 1353.52s | valid loss 0.74 | bpc 1.07377
----------------------------------------------------------------------------------------------------
| epoch 151 step 103260 | 360 batches | lr 0.000422 | ms/batch 11941.76 | loss 0.78 | bpc 1.13150
| epoch 151 step 103270 | 370 batches | lr 0.000422 | ms/batch 5130.11 | loss 0.73 | bpc 1.04790
| epoch 151 step 103280 | 380 batches | lr 0.000422 | ms/batch 5125.95 | loss 0.74 | bpc 1.07403
| epoch 151 step 103290 | 390 batches | lr 0.000422 | ms/batch 5127.94 | loss 0.74 | bpc 1.07268
| epoch 151 step 103300 | 400 batches | lr 0.000422 | ms/batch 5126.95 | loss 0.78 | bpc 1.13113
| epoch 151 step 103310 | 410 batches | lr 0.000422 | ms/batch 5131.95 | loss 0.75 | bpc 1.08892
| epoch 151 step 103320 | 420 batches | lr 0.000422 | ms/batch 5137.38 | loss 0.73 | bpc 1.04918
| epoch 151 step 103330 | 430 batches | lr 0.000422 | ms/batch 5133.11 | loss 0.73 | bpc 1.05632
| epoch 151 step 103340 | 440 batches | lr 0.000422 | ms/batch 5126.41 | loss 0.77 | bpc 1.10454
| epoch 151 step 103350 | 450 batches | lr 0.000422 | ms/batch 5130.22 | loss 0.76 | bpc 1.09184
| epoch 151 step 103360 | 460 batches | lr 0.000422 | ms/batch 5125.86 | loss 0.75 | bpc 1.08643
| epoch 151 step 103370 | 470 batches | lr 0.000422 | ms/batch 5128.06 | loss 0.73 | bpc 1.04867
| epoch 151 step 103380 | 480 batches | lr 0.000422 | ms/batch 5129.65 | loss 0.76 | bpc 1.09629
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 524288.0
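# --- editorial note -------------------------------------------------------------
# The repeated "Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to ..."
# messages match the dynamic loss scaler used for fp16 training (the wording matches NVIDIA
# Apex amp): when the scaled gradients contain inf/NaN, the optimizer step is skipped and the
# loss scale is backed off (typically halved); the groups of four identical lines most likely
# come from four data-parallel workers printing the same event. A hedged sketch of the
# mechanism, not the author's exact code:
def dynamic_loss_scale_step(grads_have_overflow, loss_scale, backoff=2.0):
    """Return (take_step, new_loss_scale) for one iteration of dynamic loss scaling."""
    if grads_have_overflow:                # inf/NaN detected in the scaled gradients
        new_scale = loss_scale / backoff   # back off; the real scaler also re-grows the
        print(f'Gradient overflow. Skipping step, loss scaler 0 '
              f'reducing loss scale to {new_scale}')
        return False, new_scale            # skip optimizer.step() for this batch
    return True, loss_scale                # (growth after a run of clean steps omitted)
# ---------------------------------------------------------------------------------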
| epoch 151 step 103390 | 490 batches | lr 0.000422 | ms/batch 5126.37 | loss 0.74 | bpc 1.06864
| epoch 151 step 103400 | 500 batches | lr 0.000422 | ms/batch 5121.61 | loss 0.74 | bpc 1.06400
| epoch 151 step 103410 | 510 batches | lr 0.000422 | ms/batch 5130.20 | loss 0.74 | bpc 1.06747
| epoch 151 step 103420 | 520 batches | lr 0.000422 | ms/batch 5129.02 | loss 0.75 | bpc 1.08288
| epoch 151 step 103430 | 530 batches | lr 0.000422 | ms/batch 5128.04 | loss 0.77 | bpc 1.10727
| epoch 151 step 103440 | 540 batches | lr 0.000422 | ms/batch 5130.86 | loss 0.78 | bpc 1.11936
| epoch 151 step 103450 | 550 batches | lr 0.000422 | ms/batch 5125.68 | loss 0.73 | bpc 1.05748
| epoch 151 step 103460 | 560 batches | lr 0.000422 | ms/batch 5126.77 | loss 0.77 | bpc 1.10379
| epoch 151 step 103470 | 570 batches | lr 0.000422 | ms/batch 5132.81 | loss 0.76 | bpc 1.09157
| epoch 151 step 103480 | 580 batches | lr 0.000422 | ms/batch 5126.57 | loss 0.74 | bpc 1.06090
| epoch 151 step 103490 | 590 batches | lr 0.000422 | ms/batch 5129.01 | loss 0.75 | bpc 1.08078
| epoch 151 step 103500 | 600 batches | lr 0.000422 | ms/batch 5129.57 | loss 0.75 | bpc 1.08055
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 414 at step 103500 | time: 1350.36s | valid loss 0.74 | bpc 1.06895
----------------------------------------------------------------------------------------------------
| epoch 151 step 103510 | 610 batches | lr 0.000422 | ms/batch 11939.43 | loss 0.73 | bpc 1.05059
| epoch 151 step 103520 | 620 batches | lr 0.000422 | ms/batch 5129.54 | loss 0.75 | bpc 1.08378
| epoch 151 step 103530 | 630 batches | lr 0.000422 | ms/batch 5132.59 | loss 0.76 | bpc 1.10107
| epoch 151 step 103540 | 640 batches | lr 0.000422 | ms/batch 5129.20 | loss 0.75 | bpc 1.08040
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 262144.0
| epoch 151 step 103550 | 650 batches | lr 0.000422 | ms/batch 5128.56 | loss 0.79 | bpc 1.13640
| epoch 151 step 103560 | 660 batches | lr 0.000422 | ms/batch 5130.80 | loss 0.77 | bpc 1.10964
| epoch 151 step 103570 | 670 batches | lr 0.000422 | ms/batch 5131.29 | loss 0.75 | bpc 1.07749
| epoch 151 step 103580 | 680 batches | lr 0.000422 | ms/batch 5132.10 | loss 0.70 | bpc 1.00626
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
| epoch 152 step 103590 | 4 batches | lr 0.000422 | ms/batch 5134.33 | loss 0.70 | bpc 1.00281
| epoch 152 step 103600 | 14 batches | lr 0.000422 | ms/batch 5128.49 | loss 0.74 | bpc 1.06372
| epoch 152 step 103610 | 24 batches | lr 0.000422 | ms/batch 5131.75 | loss 0.77 | bpc 1.10664
| epoch 152 step 103620 | 34 batches | lr 0.000422 | ms/batch 5128.04 | loss 0.73 | bpc 1.04626
| epoch 152 step 103630 | 44 batches | lr 0.000422 | ms/batch 5130.41 | loss 0.72 | bpc 1.03885
| epoch 152 step 103640 | 54 batches | lr 0.000422 | ms/batch 5129.95 | loss 0.77 | bpc 1.10548
| epoch 152 step 103650 | 64 batches | lr 0.000422 | ms/batch 5131.20 | loss 0.75 | bpc 1.07841
| epoch 152 step 103660 | 74 batches | lr 0.000422 | ms/batch 5131.42 | loss 0.74 | bpc 1.06178
| epoch 152 step 103670 | 84 batches | lr 0.000422 | ms/batch 5131.41 | loss 0.70 | bpc 1.00912
| epoch 152 step 103680 | 94 batches | lr 0.000422 | ms/batch 5131.65 | loss 0.72 | bpc 1.03452
| epoch 152 step 103690 | 104 batches | lr 0.000422 | ms/batch 5130.02 | loss 0.74 | bpc 1.06794
| epoch 152 step 103700 | 114 batches | lr 0.000422 | ms/batch 5135.04 | loss 0.74 | bpc 1.07193
| epoch 152 step 103710 | 124 batches | lr 0.000422 | ms/batch 5132.36 | loss 0.76 | bpc 1.09026
| epoch 152 step 103720 | 134 batches | lr 0.000422 | ms/batch 5128.96 | loss 0.76 | bpc 1.09308
| epoch 152 step 103730 | 144 batches | lr 0.000422 | ms/batch 5133.66 | loss 0.78 | bpc 1.12686
| epoch 152 step 103740 | 154 batches | lr 0.000422 | ms/batch 5132.68 | loss 0.76 | bpc 1.09769
| epoch 152 step 103750 | 164 batches | lr 0.000421 | ms/batch 5130.79 | loss 0.78 | bpc 1.13182
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 415 at step 103750 | time: 1352.20s | valid loss 0.74 | bpc 1.06743
----------------------------------------------------------------------------------------------------
| epoch 152 step 103760 | 174 batches | lr 0.000421 | ms/batch 11942.94 | loss 0.68 | bpc 0.97746
| epoch 152 step 103770 | 184 batches | lr 0.000421 | ms/batch 5130.02 | loss 0.68 | bpc 0.97962
| epoch 152 step 103780 | 194 batches | lr 0.000421 | ms/batch 5134.07 | loss 0.69 | bpc 0.98881
| epoch 152 step 103790 | 204 batches | lr 0.000421 | ms/batch 5130.92 | loss 0.62 | bpc 0.88906
| epoch 152 step 103800 | 214 batches | lr 0.000421 | ms/batch 5129.07 | loss 0.73 | bpc 1.04803
| epoch 152 step 103810 | 224 batches | lr 0.000421 | ms/batch 5130.92 | loss 0.74 | bpc 1.06807
| epoch 152 step 103820 | 234 batches | lr 0.000421 | ms/batch 5131.33 | loss 0.74 | bpc 1.06042
| epoch 152 step 103830 | 244 batches | lr 0.000421 | ms/batch 5133.58 | loss 0.75 | bpc 1.08574
| epoch 152 step 103840 | 254 batches | lr 0.000421 | ms/batch 5132.21 | loss 0.78 | bpc 1.11812
| epoch 152 step 103850 | 264 batches | lr 0.000421 | ms/batch 5132.23 | loss 0.70 | bpc 1.00757
| epoch 152 step 103860 | 274 batches | lr 0.000421 | ms/batch 5131.37 | loss 0.72 | bpc 1.03820
| epoch 152 step 103870 | 284 batches | lr 0.000421 | ms/batch 5130.25 | loss 0.77 | bpc 1.10533
| epoch 152 step 103880 | 294 batches | lr 0.000421 | ms/batch 5131.31 | loss 0.76 | bpc 1.09128
| epoch 152 step 103890 | 304 batches | lr 0.000421 | ms/batch 5132.62 | loss 0.75 | bpc 1.08889
| epoch 152 step 103900 | 314 batches | lr 0.000421 | ms/batch 5130.52 | loss 0.76 | bpc 1.10196
| epoch 152 step 103910 | 324 batches | lr 0.000421 | ms/batch 5125.55 | loss 0.74 | bpc 1.07314
| epoch 152 step 103920 | 334 batches | lr 0.000421 | ms/batch 5113.13 | loss 0.75 | bpc 1.08465
| epoch 152 step 103930 | 344 batches | lr 0.000421 | ms/batch 5111.69 | loss 0.70 | bpc 1.01701
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
| epoch 152 step 103940 | 354 batches | lr 0.000421 | ms/batch 5110.87 | loss 0.77 | bpc 1.11808
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 65536.0
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 103950 | 364 batches | lr 0.000421 | ms/batch 5110.69 | loss nan | bpc nan
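# --- editorial note -------------------------------------------------------------
# Once the reported loss is nan, the "WARNING:root:NaN or Inf found in input tensor." lines
# are consistent with a tensorboardX-style summary writer being handed the NaN scalar and
# merely warning before moving on. A hedged sketch of a guard that would fail fast instead
# (safe_log_scalar is a hypothetical helper, not part of the author's script):
import math

def safe_log_scalar(writer, tag, value, step):
    # Abort (or skip logging) rather than writing NaN/Inf into the event file.
    if not math.isfinite(value):
        raise RuntimeError(f'{tag} is {value} at step {step}; aborting instead of logging it')
    writer.add_scalar(tag, value, step)
# ---------------------------------------------------------------------------------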
| epoch 152 step 103960 | 374 batches | lr 0.000421 | ms/batch 5109.98 | loss 0.72 | bpc 1.03865
| epoch 152 step 103970 | 384 batches | lr 0.000421 | ms/batch 5112.27 | loss 0.76 | bpc 1.09533
| epoch 152 step 103980 | 394 batches | lr 0.000421 | ms/batch 5111.61 | loss 0.75 | bpc 1.08365
| epoch 152 step 103990 | 404 batches | lr 0.000421 | ms/batch 5110.18 | loss 0.77 | bpc 1.10594
| epoch 152 step 104000 | 414 batches | lr 0.000421 | ms/batch 5121.36 | loss 0.75 | bpc 1.08615
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
| Eval 416 at step 104000 | time: 1349.28s | valid loss 0.74 | bpc 1.06833
----------------------------------------------------------------------------------------------------
| epoch 152 step 104010 | 424 batches | lr 0.000421 | ms/batch 11948.50 | loss 0.72 | bpc 1.03598
| epoch 152 step 104020 | 434 batches | lr 0.000421 | ms/batch 5138.66 | loss 0.76 | bpc 1.09152
| epoch 152 step 104030 | 444 batches | lr 0.000421 | ms/batch 5134.95 | loss 0.76 | bpc 1.09304
| epoch 152 step 104040 | 454 batches | lr 0.000421 | ms/batch 5133.48 | loss 0.77 | bpc 1.10379
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104050 | 464 batches | lr 0.000421 | ms/batch 5126.16 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.5
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.5
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.5
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.5
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104060 | 474 batches | lr 0.000421 | ms/batch 5123.08 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.03125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.03125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.03125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.03125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.015625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.015625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.015625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.015625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0078125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0078125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0078125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0078125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00390625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00390625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00390625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00390625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.001953125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.001953125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.001953125
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104070 | 484 batches | lr 0.000421 | ms/batch 5121.17 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.001953125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0009765625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0009765625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0009765625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0009765625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.000244140625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.000244140625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.000244140625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.000244140625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0001220703125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0001220703125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0001220703125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0001220703125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0517578125e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0517578125e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0517578125e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0517578125e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.814697265625e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.814697265625e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.814697265625e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.814697265625e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5367431640625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5367431640625e-07
WARNING:root:NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5367431640625e-07
WARNING:root:NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5367431640625e-07
| epoch 152 step 104080 | 494 batches | lr 0.000421 | ms/batch 5116.92 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1920928955078125e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1920928955078125e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1920928955078125e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1920928955078125e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.960464477539063e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.960464477539063e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.960464477539063e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.960464477539063e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9802322387695312e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9802322387695312e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9802322387695312e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9802322387695312e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4901161193847656e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4901161193847656e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4901161193847656e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4901161193847656e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.862645149230957e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.862645149230957e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.862645149230957e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.862645149230957e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.656612873077393e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.656612873077393e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.656612873077393e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.656612873077393e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1641532182693481e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1641532182693481e-10
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1641532182693481e-10
| epoch 152 step 104090 | 504 batches | lr 0.000421 | ms/batch 5115.58 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1641532182693481e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.820766091346741e-11
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.820766091346741e-11
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.820766091346741e-11
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.820766091346741e-11
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4551915228366852e-11
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4551915228366852e-11
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4551915228366852e-11
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4551915228366852e-11
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.637978807091713e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.637978807091713e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.637978807091713e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.637978807091713e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.094947017729282e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.094947017729282e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.094947017729282e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.094947017729282e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.547473508864641e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.547473508864641e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.547473508864641e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.547473508864641e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1368683772161603e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1368683772161603e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1368683772161603e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1368683772161603e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.684341886080802e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.684341886080802e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.684341886080802e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.684341886080802e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.842170943040401e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.842170943040401e-14
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104100 | 514 batches | lr 0.000421 | ms/batch 5120.08 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.842170943040401e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.842170943040401e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4210854715202004e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4210854715202004e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4210854715202004e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4210854715202004e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.552713678800501e-15
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.552713678800501e-15
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.552713678800501e-15
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.552713678800501e-15
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.881784197001252e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.881784197001252e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.881784197001252e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.881784197001252e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.551115123125783e-17
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.551115123125783e-17
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.551115123125783e-17
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.551115123125783e-17
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7755575615628914e-17
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7755575615628914e-17
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7755575615628914e-17
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7755575615628914e-17
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.734723475976807e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.734723475976807e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.734723475976807e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.734723475976807e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.336808689942018e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.336808689942018e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.336808689942018e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.336808689942018e-19
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104110 | 524 batches | lr 0.000421 | ms/batch 5118.74 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0842021724855044e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0842021724855044e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0842021724855044e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0842021724855044e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.421010862427522e-20
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.421010862427522e-20
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.421010862427522e-20
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.421010862427522e-20
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3552527156068805e-20
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3552527156068805e-20
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3552527156068805e-20
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3552527156068805e-20
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3881317890172014e-21
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3881317890172014e-21
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3881317890172014e-21
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3881317890172014e-21
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6940658945086007e-21
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6940658945086007e-21
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6940658945086007e-21
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6940658945086007e-21
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.235164736271502e-22
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.235164736271502e-22
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.235164736271502e-22
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.235164736271502e-22
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.117582368135751e-22
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.117582368135751e-22
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.117582368135751e-22
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.117582368135751e-22
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.293955920339377e-23
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.293955920339377e-23
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.293955920339377e-23
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.293955920339377e-23
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3234889800848443e-23
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3234889800848443e-23
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3234889800848443e-23
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104120 | 534 batches | lr 0.000421 | ms/batch 5117.64 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3234889800848443e-23
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.308722450212111e-24
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.308722450212111e-24
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.308722450212111e-24
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.308722450212111e-24
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.271806125530277e-25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.271806125530277e-25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.271806125530277e-25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.271806125530277e-25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0679515313825692e-25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0679515313825692e-25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0679515313825692e-25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0679515313825692e-25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.169878828456423e-26
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.169878828456423e-26
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.169878828456423e-26
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.169878828456423e-26
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5849394142282115e-26
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5849394142282115e-26
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5849394142282115e-26
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5849394142282115e-26
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.462348535570529e-27
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.462348535570529e-27
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.462348535570529e-27
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.462348535570529e-27
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6155871338926322e-27
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6155871338926322e-27
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6155871338926322e-27
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6155871338926322e-27
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104130 | 544 batches | lr 0.000421 | ms/batch 5124.81 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0389678347315804e-28
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0389678347315804e-28
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0389678347315804e-28
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0389678347315804e-28
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0097419586828951e-28
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0097419586828951e-28
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0097419586828951e-28
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0097419586828951e-28
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.048709793414476e-29
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.048709793414476e-29
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.048709793414476e-29
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.048709793414476e-29
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.262177448353619e-29
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.262177448353619e-29
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.262177448353619e-29
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.262177448353619e-29
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.310887241768095e-30
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.310887241768095e-30
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.310887241768095e-30
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.310887241768095e-30
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5777218104420236e-30
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5777218104420236e-30
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5777218104420236e-30
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5777218104420236e-30
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.888609052210118e-31
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.888609052210118e-31
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.888609052210118e-31
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.888609052210118e-31
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9721522630525295e-31
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9721522630525295e-31
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9721522630525295e-31
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104140 | 554 batches | lr 0.000421 | ms/batch 5121.15 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9721522630525295e-31
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.930380657631324e-32
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.930380657631324e-32
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.930380657631324e-32
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.930380657631324e-32
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.232595164407831e-32
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.232595164407831e-32
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.232595164407831e-32
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.232595164407831e-32
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0814879110195774e-33
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0814879110195774e-33
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0814879110195774e-33
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0814879110195774e-33
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5407439555097887e-33
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5407439555097887e-33
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5407439555097887e-33
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5407439555097887e-33
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.703719777548943e-34
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.703719777548943e-34
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.703719777548943e-34
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.703719777548943e-34
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.851859888774472e-34
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.851859888774472e-34
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.851859888774472e-34
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.851859888774472e-34
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.925929944387236e-34
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.925929944387236e-34
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.925929944387236e-34
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.925929944387236e-34
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.81482486096809e-35
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.81482486096809e-35
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.81482486096809e-35
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104150 | 564 batches | lr 0.000421 | ms/batch 5124.05 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.81482486096809e-35
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2037062152420224e-35
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2037062152420224e-35
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2037062152420224e-35
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2037062152420224e-35
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.009265538105056e-36
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.009265538105056e-36
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.009265538105056e-36
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.009265538105056e-36
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.504632769052528e-36
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.504632769052528e-36
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.504632769052528e-36
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.504632769052528e-36
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.76158192263132e-37
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.76158192263132e-37
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.76158192263132e-37
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.76158192263132e-37
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.88079096131566e-37
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.88079096131566e-37
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.88079096131566e-37
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.88079096131566e-37
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.70197740328915e-38
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.70197740328915e-38
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.70197740328915e-38
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.70197740328915e-38
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1754943508222875e-38
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1754943508222875e-38
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1754943508222875e-38
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1754943508222875e-38
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104160 | 574 batches | lr 0.000421 | ms/batch 5122.74 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.938735877055719e-39
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.938735877055719e-39
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.938735877055719e-39
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.938735877055719e-39
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4693679385278594e-39
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4693679385278594e-39
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4693679385278594e-39
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4693679385278594e-39
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6734198463196485e-40
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6734198463196485e-40
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6734198463196485e-40
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6734198463196485e-40
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.183549615799121e-41
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.183549615799121e-41
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.183549615799121e-41
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.183549615799121e-41
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2958874039497803e-41
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2958874039497803e-41
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2958874039497803e-41
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2958874039497803e-41
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1479437019748901e-41
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1479437019748901e-41
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1479437019748901e-41
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1479437019748901e-41
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8698592549372254e-42
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8698592549372254e-42
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8698592549372254e-42
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8698592549372254e-42
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.174648137343064e-43
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.174648137343064e-43
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.174648137343064e-43
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.174648137343064e-43
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.587324068671532e-43
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.587324068671532e-43
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.587324068671532e-43
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.587324068671532e-43
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.96831017167883e-44
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.96831017167883e-44
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.96831017167883e-44
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104170 | 584 batches | lr 0.000421 | ms/batch 5117.21 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.96831017167883e-44
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2420775429197073e-44
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2420775429197073e-44
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2420775429197073e-44
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2420775429197073e-44
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.605193857299268e-45
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.605193857299268e-45
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.605193857299268e-45
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.605193857299268e-45
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.401298464324817e-45
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.401298464324817e-45
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.401298464324817e-45
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.401298464324817e-45
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.503246160812043e-46
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.503246160812043e-46
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.503246160812043e-46
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.503246160812043e-46
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.758115402030107e-47
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.758115402030107e-47
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.758115402030107e-47
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.758115402030107e-47
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.3790577010150533e-47
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.3790577010150533e-47
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.3790577010150533e-47
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.3790577010150533e-47
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1895288505075267e-47
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1895288505075267e-47
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1895288505075267e-47
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1895288505075267e-47
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.473822126268817e-48
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.473822126268817e-48
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.473822126268817e-48
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.473822126268817e-48
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3684555315672042e-48
WARNING:root:NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3684555315672042e-48
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3684555315672042e-48
WARNING:root:NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3684555315672042e-48
| epoch 152 step 104180 | 594 batches | lr 0.000421 | ms/batch 5121.18 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4211388289180104e-49
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4211388289180104e-49
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4211388289180104e-49
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4211388289180104e-49
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.552847072295026e-50
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.552847072295026e-50
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.552847072295026e-50
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.552847072295026e-50
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1382117680737565e-50
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1382117680737565e-50
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1382117680737565e-50
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1382117680737565e-50
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.345529420184391e-51
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.345529420184391e-51
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.345529420184391e-51
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.345529420184391e-51
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3363823550460978e-51
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3363823550460978e-51
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3363823550460978e-51
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3363823550460978e-51
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3409558876152446e-52
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3409558876152446e-52
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3409558876152446e-52
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3409558876152446e-52
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.352389719038111e-53
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.352389719038111e-53
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.352389719038111e-53
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.352389719038111e-53
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.088097429759528e-53
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.088097429759528e-53
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.088097429759528e-53
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.088097429759528e-53
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.22024357439882e-54
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.22024357439882e-54
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.22024357439882e-54
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.22024357439882e-54
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.305060893599705e-54
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.305060893599705e-54
WARNING:root:NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.305060893599705e-54
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104190 | 604 batches | lr 0.000421 | ms/batch 5119.10 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.305060893599705e-54
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2626522339992623e-55
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2626522339992623e-55
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2626522339992623e-55
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2626522339992623e-55
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.156630584998156e-56
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.156630584998156e-56
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.156630584998156e-56
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.156630584998156e-56
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.039157646249539e-56
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.039157646249539e-56
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.039157646249539e-56
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.039157646249539e-56
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.0978941156238473e-57
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.0978941156238473e-57
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.0978941156238473e-57
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.0978941156238473e-57
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2744735289059618e-57
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2744735289059618e-57
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2744735289059618e-57
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2744735289059618e-57
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1861838222649046e-58
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1861838222649046e-58
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1861838222649046e-58
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1861838222649046e-58
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.965459555662261e-59
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.965459555662261e-59
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.965459555662261e-59
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.965459555662261e-59
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9913648889155653e-59
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9913648889155653e-59
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9913648889155653e-59
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9913648889155653e-59
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.9784122222889134e-60
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.9784122222889134e-60
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.9784122222889134e-60
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.9784122222889134e-60
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2446030555722283e-60
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2446030555722283e-60
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2446030555722283e-60
| epoch 152 step 104200 | 614 batches | lr 0.000421 | ms/batch 5120.82 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2446030555722283e-60
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.111507638930571e-61
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.111507638930571e-61
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.111507638930571e-61
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.111507638930571e-61
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.778769097326427e-62
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.778769097326427e-62
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.778769097326427e-62
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.778769097326427e-62
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9446922743316068e-62
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9446922743316068e-62
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9446922743316068e-62
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9446922743316068e-62
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.861730685829017e-63
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.861730685829017e-63
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.861730685829017e-63
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.861730685829017e-63
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2154326714572542e-63
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2154326714572542e-63
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2154326714572542e-63
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2154326714572542e-63
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0385816786431356e-64
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0385816786431356e-64
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0385816786431356e-64
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0385816786431356e-64
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.596454196607839e-65
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.596454196607839e-65
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.596454196607839e-65
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.596454196607839e-65
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8991135491519597e-65
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8991135491519597e-65
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8991135491519597e-65
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8991135491519597e-65
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.7477838728798994e-66
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.7477838728798994e-66
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.7477838728798994e-66
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.7477838728798994e-66
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1869459682199748e-66
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1869459682199748e-66
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1869459682199748e-66
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104210 | 624 batches | lr 0.000421 | ms/batch 5120.74 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1869459682199748e-66
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.967364920549937e-67
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.967364920549937e-67
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.967364920549937e-67
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.967364920549937e-67
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.418412301374843e-68
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.418412301374843e-68
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.418412301374843e-68
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.418412301374843e-68
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8546030753437107e-68
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8546030753437107e-68
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8546030753437107e-68
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8546030753437107e-68
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.636507688359277e-69
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.636507688359277e-69
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.636507688359277e-69
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.636507688359277e-69
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1591269220898192e-69
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1591269220898192e-69
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1591269220898192e-69
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1591269220898192e-69
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.897817305224548e-70
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.897817305224548e-70
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.897817305224548e-70
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.897817305224548e-70
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.24454326306137e-71
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.24454326306137e-71
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.24454326306137e-71
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.24454326306137e-71
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8111358157653425e-71
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8111358157653425e-71
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8111358157653425e-71
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8111358157653425e-71
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.527839539413356e-72
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.527839539413356e-72
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.527839539413356e-72
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.527839539413356e-72
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.131959884853339e-72
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.131959884853339e-72
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.131959884853339e-72
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104220 | 634 batches | lr 0.000421 | ms/batch 5123.10 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.131959884853339e-72
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8298997121333476e-73
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8298997121333476e-73
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8298997121333476e-73
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8298997121333476e-73
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.074749280333369e-74
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.074749280333369e-74
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.074749280333369e-74
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.074749280333369e-74
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7686873200833423e-74
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7686873200833423e-74
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7686873200833423e-74
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7686873200833423e-74
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.421718300208356e-75
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.421718300208356e-75
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.421718300208356e-75
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.421718300208356e-75
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.105429575052089e-75
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.105429575052089e-75
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.105429575052089e-75
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.105429575052089e-75
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7635739376302223e-76
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7635739376302223e-76
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7635739376302223e-76
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7635739376302223e-76
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.908934844075556e-77
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.908934844075556e-77
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.908934844075556e-77
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.908934844075556e-77
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.727233711018889e-77
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.727233711018889e-77
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.727233711018889e-77
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.727233711018889e-77
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.3180842775472223e-78
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.3180842775472223e-78
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.3180842775472223e-78
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.3180842775472223e-78
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0795210693868056e-78
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0795210693868056e-78
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0795210693868056e-78
| epoch 152 step 104230 | 644 batches | lr 0.000421 | ms/batch 5120.38 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0795210693868056e-78
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.698802673467014e-79
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.698802673467014e-79
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.698802673467014e-79
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.698802673467014e-79
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.747006683667535e-80
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.747006683667535e-80
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.747006683667535e-80
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.747006683667535e-80
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6867516709168837e-80
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6867516709168837e-80
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6867516709168837e-80
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6867516709168837e-80
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.2168791772922093e-81
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.2168791772922093e-81
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.2168791772922093e-81
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.2168791772922093e-81
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0542197943230523e-81
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0542197943230523e-81
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0542197943230523e-81
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0542197943230523e-81
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.635549485807631e-82
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.635549485807631e-82
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.635549485807631e-82
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.635549485807631e-82
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.588873714519077e-83
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.588873714519077e-83
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.588873714519077e-83
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.588873714519077e-83
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6472184286297693e-83
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6472184286297693e-83
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6472184286297693e-83
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6472184286297693e-83
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.118046071574423e-84
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.118046071574423e-84
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.118046071574423e-84
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.118046071574423e-84
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0295115178936058e-84
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0295115178936058e-84
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0295115178936058e-84
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104240 | 654 batches | lr 0.000421 | ms/batch 5122.52 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0295115178936058e-84
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5737787947340145e-85
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5737787947340145e-85
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5737787947340145e-85
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5737787947340145e-85
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.434446986835036e-86
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.434446986835036e-86
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.434446986835036e-86
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.434446986835036e-86
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.608611746708759e-86
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.608611746708759e-86
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.608611746708759e-86
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.608611746708759e-86
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.021529366771898e-87
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.021529366771898e-87
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.021529366771898e-87
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.021529366771898e-87
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0053823416929744e-87
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0053823416929744e-87
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0053823416929744e-87
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0053823416929744e-87
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.513455854232436e-88
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.513455854232436e-88
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.513455854232436e-88
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.513455854232436e-88
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.28363963558109e-89
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.28363963558109e-89
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.28363963558109e-89
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.28363963558109e-89
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5709099088952725e-89
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5709099088952725e-89
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5709099088952725e-89
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5709099088952725e-89
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9272747722381812e-90
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9272747722381812e-90
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9272747722381812e-90
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9272747722381812e-90
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.818186930595453e-91
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.818186930595453e-91
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.818186930595453e-91
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104250 | 664 batches | lr 0.000421 | ms/batch 5118.77 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.818186930595453e-91
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| Eval 417 at step 104250 | time: 1355.27s | valid loss nan | bpc nan
----------------------------------------------------------------------------------------------------
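# A minimal sketch, assuming an apex-style dynamic loss scaler (not this training
# script's actual code), of the mechanism behind the log lines above: each backward
# pass that produces inf/nan gradients skips the optimizer step and multiplies the
# loss scale by a backoff factor, so persistently non-finite gradients drive the
# scale toward zero while loss and bpc are reported as nan.
import torch

class DynamicLossScaler:
    """Skip optimizer steps and shrink the loss scale whenever gradients overflow."""

    def __init__(self, init_scale=2.0 ** 15, backoff=0.5, growth_interval=2000):
        self.scale = init_scale            # current loss scale
        self.backoff = backoff             # shrink factor applied on overflow
        self.growth_interval = growth_interval
        self.good_steps = 0                # consecutive overflow-free steps

    def step(self, model, optimizer, loss):
        optimizer.zero_grad()
        (loss * self.scale).backward()     # scale the loss before backward
        overflow = any(
            p.grad is not None and not torch.isfinite(p.grad).all()
            for p in model.parameters()
        )
        if overflow:
            self.scale *= self.backoff     # "... reducing loss scale to ..."
            self.good_steps = 0
            return False                   # "Gradient overflow. Skipping step"
        for p in model.parameters():
            if p.grad is not None:
                p.grad.div_(self.scale)    # unscale gradients before the real update
        optimizer.step()
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            self.scale *= 2.0              # cautiously grow the scale back
        return True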
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4545467326488633e-91
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4545467326488633e-91
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4545467326488633e-91
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4545467326488633e-91
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.136366831622158e-92
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.136366831622158e-92
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.136366831622158e-92
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.136366831622158e-92
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5340917079055395e-92
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5340917079055395e-92
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5340917079055395e-92
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5340917079055395e-92
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.835229269763849e-93
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.835229269763849e-93
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.835229269763849e-93
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.835229269763849e-93
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.588073174409622e-94
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.588073174409622e-94
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.588073174409622e-94
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.588073174409622e-94
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3970182936024055e-94
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3970182936024055e-94
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3970182936024055e-94
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3970182936024055e-94
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.992545734006014e-95
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.992545734006014e-95
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.992545734006014e-95
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.992545734006014e-95
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4981364335015035e-95
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4981364335015035e-95
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4981364335015035e-95
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4981364335015035e-95
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.745341083753759e-96
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.745341083753759e-96
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.745341083753759e-96
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.745341083753759e-96
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.363352709384397e-97
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104260 | 674 batches | lr 0.000421 | ms/batch 11943.73 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.363352709384397e-97
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.363352709384397e-97
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.363352709384397e-97
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3408381773460992e-97
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3408381773460992e-97
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3408381773460992e-97
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3408381773460992e-97
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.852095443365248e-98
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.852095443365248e-98
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.852095443365248e-98
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.852095443365248e-98
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.463023860841312e-98
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.463023860841312e-98
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.463023860841312e-98
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.463023860841312e-98
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.65755965210328e-99
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.65755965210328e-99
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.65755965210328e-99
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.65755965210328e-99
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.1438991302582e-100
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.1438991302582e-100
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.1438991302582e-100
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.1438991302582e-100
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.28597478256455e-100
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.28597478256455e-100
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.28597478256455e-100
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.28597478256455e-100
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.714936956411375e-101
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.714936956411375e-101
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.714936956411375e-101
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.714936956411375e-101
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4287342391028437e-101
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4287342391028437e-101
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4287342391028437e-101
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4287342391028437e-101
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5718355977571093e-102
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5718355977571093e-102
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5718355977571093e-102
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5718355977571093e-102
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.929588994392773e-103
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.929588994392773e-103
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.929588994392773e-103
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 152 step 104270 | 684 batches | lr 0.000421 | ms/batch 5119.44 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.929588994392773e-103
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2323972485981933e-103
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2323972485981933e-103
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2323972485981933e-103
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2323972485981933e-103
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.5809931214954833e-104
dropping batch of size torch.Size([10574, 2]) because not full seq_len of 16384
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.5809931214954833e-104
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.5809931214954833e-104
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.5809931214954833e-104
dropping batch of size torch.Size([10575, 2]) because not full seq_len of 16384
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3952482803738708e-104
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3952482803738708e-104
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3952482803738708e-104
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3952482803738708e-104
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.488120700934677e-105
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.488120700934677e-105
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.488120700934677e-105
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.488120700934677e-105
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.720301752336693e-106
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.720301752336693e-106
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.720301752336693e-106
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.720301752336693e-106
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1800754380841732e-106
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1800754380841732e-106
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1800754380841732e-106
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1800754380841732e-106
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.450188595210433e-107
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.450188595210433e-107
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.450188595210433e-107
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.450188595210433e-107
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3625471488026082e-107
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3625471488026082e-107
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3625471488026082e-107
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3625471488026082e-107
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4063678720065206e-108
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4063678720065206e-108
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4063678720065206e-108
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4063678720065206e-108
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.515919680016301e-109
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.515919680016301e-109
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.515919680016301e-109
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.515919680016301e-109
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104280 | 8 batches | lr 0.000421 | ms/batch 5118.13 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1289799200040754e-109
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1289799200040754e-109
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1289799200040754e-109
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1289799200040754e-109
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.3224498000101884e-110
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.3224498000101884e-110
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.3224498000101884e-110
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.3224498000101884e-110
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3306124500025471e-110
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3306124500025471e-110
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3306124500025471e-110
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3306124500025471e-110
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.326531125006368e-111
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.326531125006368e-111
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.326531125006368e-111
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.326531125006368e-111
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.31632781251592e-112
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.31632781251592e-112
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.31632781251592e-112
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.31632781251592e-112
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.07908195312898e-112
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.07908195312898e-112
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.07908195312898e-112
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.07908195312898e-112
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.19770488282245e-113
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.19770488282245e-113
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.19770488282245e-113
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.19770488282245e-113
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2994262207056124e-113
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2994262207056124e-113
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2994262207056124e-113
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2994262207056124e-113
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.248565551764031e-114
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.248565551764031e-114
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.248565551764031e-114
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.248565551764031e-114
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.121413879410078e-115
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.121413879410078e-115
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.121413879410078e-115
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104290 | 18 batches | lr 0.000421 | ms/batch 5118.66 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.121413879410078e-115
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0303534698525194e-115
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0303534698525194e-115
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0303534698525194e-115
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0303534698525194e-115
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.075883674631299e-116
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.075883674631299e-116
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.075883674631299e-116
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.075883674631299e-116
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2689709186578246e-116
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2689709186578246e-116
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2689709186578246e-116
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2689709186578246e-116
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1724272966445615e-117
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1724272966445615e-117
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1724272966445615e-117
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1724272966445615e-117
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.931068241611404e-118
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.931068241611404e-118
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.931068241611404e-118
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.931068241611404e-118
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.982767060402851e-118
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.982767060402851e-118
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.982767060402851e-118
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.982767060402851e-118
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.9569176510071274e-119
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.9569176510071274e-119
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.9569176510071274e-119
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.9569176510071274e-119
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2392294127517818e-119
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2392294127517818e-119
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2392294127517818e-119
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2392294127517818e-119
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0980735318794546e-120
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0980735318794546e-120
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0980735318794546e-120
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0980735318794546e-120
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.745183829698637e-121
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.745183829698637e-121
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.745183829698637e-121
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.745183829698637e-121
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104300 | 28 batches | lr 0.000421 | ms/batch 5117.96 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9362959574246591e-121
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9362959574246591e-121
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9362959574246591e-121
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9362959574246591e-121
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.840739893561648e-122
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.840739893561648e-122
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.840739893561648e-122
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.840739893561648e-122
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.210184973390412e-122
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.210184973390412e-122
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.210184973390412e-122
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.210184973390412e-122
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.02546243347603e-123
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.02546243347603e-123
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.02546243347603e-123
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.02546243347603e-123
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.563656083690075e-124
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.563656083690075e-124
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.563656083690075e-124
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.563656083690075e-124
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8909140209225187e-124
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8909140209225187e-124
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8909140209225187e-124
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8909140209225187e-124
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.727285052306297e-125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.727285052306297e-125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.727285052306297e-125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.727285052306297e-125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1818212630765742e-125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1818212630765742e-125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1818212630765742e-125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1818212630765742e-125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9545531576914354e-126
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9545531576914354e-126
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9545531576914354e-126
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9545531576914354e-126
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.386382894228589e-127
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.386382894228589e-127
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.386382894228589e-127
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104310 | 38 batches | lr 0.000421 | ms/batch 5120.30 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.386382894228589e-127
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8465957235571472e-127
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8465957235571472e-127
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8465957235571472e-127
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8465957235571472e-127
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.616489308892868e-128
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.616489308892868e-128
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.616489308892868e-128
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.616489308892868e-128
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.154122327223217e-128
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.154122327223217e-128
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.154122327223217e-128
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.154122327223217e-128
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8853058180580424e-129
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8853058180580424e-129
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8853058180580424e-129
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8853058180580424e-129
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.213264545145106e-130
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.213264545145106e-130
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.213264545145106e-130
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.213264545145106e-130
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8033161362862765e-130
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8033161362862765e-130
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8033161362862765e-130
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8033161362862765e-130
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5082903407156913e-131
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5082903407156913e-131
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5082903407156913e-131
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5082903407156913e-131
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1270725851789228e-131
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1270725851789228e-131
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1270725851789228e-131
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1270725851789228e-131
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.817681462947307e-132
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.817681462947307e-132
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.817681462947307e-132
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.817681462947307e-132
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.044203657368268e-133
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.044203657368268e-133
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104320 | 48 batches | lr 0.000421 | ms/batch 5119.29 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.044203657368268e-133
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.044203657368268e-133
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.761050914342067e-133
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.761050914342067e-133
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.761050914342067e-133
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.761050914342067e-133
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4026272858551673e-134
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4026272858551673e-134
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4026272858551673e-134
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4026272858551673e-134
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1006568214637918e-134
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1006568214637918e-134
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1006568214637918e-134
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1006568214637918e-134
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7516420536594796e-135
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7516420536594796e-135
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7516420536594796e-135
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7516420536594796e-135
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.879105134148699e-136
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.879105134148699e-136
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.879105134148699e-136
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.879105134148699e-136
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7197762835371747e-136
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7197762835371747e-136
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7197762835371747e-136
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7197762835371747e-136
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.299440708842937e-137
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.299440708842937e-137
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.299440708842937e-137
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.299440708842937e-137
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0748601772107342e-137
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0748601772107342e-137
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0748601772107342e-137
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0748601772107342e-137
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6871504430268355e-138
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6871504430268355e-138
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6871504430268355e-138
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6871504430268355e-138
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.717876107567089e-139
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.717876107567089e-139
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104330 | 58 batches | lr 0.000421 | ms/batch 5117.85 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.717876107567089e-139
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.717876107567089e-139
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6794690268917722e-139
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6794690268917722e-139
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6794690268917722e-139
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6794690268917722e-139
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.1986725672294305e-140
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.1986725672294305e-140
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.1986725672294305e-140
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.1986725672294305e-140
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0496681418073576e-140
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0496681418073576e-140
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0496681418073576e-140
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0496681418073576e-140
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.624170354518394e-141
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.624170354518394e-141
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.624170354518394e-141
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.624170354518394e-141
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.560425886295985e-142
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.560425886295985e-142
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.560425886295985e-142
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.560425886295985e-142
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6401064715739963e-142
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6401064715739963e-142
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6401064715739963e-142
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6401064715739963e-142
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.100266178934991e-143
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.100266178934991e-143
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.100266178934991e-143
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.100266178934991e-143
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0250665447337477e-143
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0250665447337477e-143
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0250665447337477e-143
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0250665447337477e-143
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5626663618343692e-144
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5626663618343692e-144
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5626663618343692e-144
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5626663618343692e-144
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.406665904585923e-145
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.406665904585923e-145
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.406665904585923e-145
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104340 | 68 batches | lr 0.000421 | ms/batch 5119.16 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.406665904585923e-145
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6016664761464807e-145
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6016664761464807e-145
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6016664761464807e-145
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6016664761464807e-145
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.004166190366202e-146
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.004166190366202e-146
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.004166190366202e-146
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.004166190366202e-146
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0010415475915505e-146
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0010415475915505e-146
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0010415475915505e-146
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0010415475915505e-146
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5026038689788762e-147
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5026038689788762e-147
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5026038689788762e-147
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5026038689788762e-147
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.256509672447191e-148
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.256509672447191e-148
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.256509672447191e-148
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.256509672447191e-148
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5641274181117976e-148
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5641274181117976e-148
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5641274181117976e-148
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5641274181117976e-148
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.910318545279494e-149
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.910318545279494e-149
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.910318545279494e-149
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.910318545279494e-149
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.775796363198735e-150
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.775796363198735e-150
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.775796363198735e-150
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.775796363198735e-150
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.443949090799684e-150
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.443949090799684e-150
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.443949090799684e-150
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.443949090799684e-150
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.10987272699921e-151
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.10987272699921e-151
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104350 | 78 batches | lr 0.000421 | ms/batch 5117.98 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.10987272699921e-151
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.10987272699921e-151
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5274681817498023e-151
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5274681817498023e-151
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5274681817498023e-151
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5274681817498023e-151
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.818670454374506e-152
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.818670454374506e-152
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.818670454374506e-152
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.818670454374506e-152
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.546676135936265e-153
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.546676135936265e-153
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.546676135936265e-153
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.546676135936265e-153
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3866690339840662e-153
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3866690339840662e-153
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3866690339840662e-153
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3866690339840662e-153
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.966672584960166e-154
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.966672584960166e-154
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.966672584960166e-154
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.966672584960166e-154
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4916681462400413e-154
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4916681462400413e-154
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4916681462400413e-154
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4916681462400413e-154
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.7291703656001034e-155
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.7291703656001034e-155
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.7291703656001034e-155
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.7291703656001034e-155
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.322925914000258e-156
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.322925914000258e-156
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.322925914000258e-156
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.322925914000258e-156
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3307314785000646e-156
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3307314785000646e-156
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3307314785000646e-156
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3307314785000646e-156
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.826828696250162e-157
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.826828696250162e-157
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.826828696250162e-157
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.826828696250162e-157
| epoch 153 step 104360 | 88 batches | lr 0.000421 | ms/batch 5117.80 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4567071740625404e-157
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4567071740625404e-157
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4567071740625404e-157
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4567071740625404e-157
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.641767935156351e-158
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.641767935156351e-158
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.641767935156351e-158
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.641767935156351e-158
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.104419837890877e-159
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.104419837890877e-159
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.104419837890877e-159
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.104419837890877e-159
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2761049594727193e-159
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2761049594727193e-159
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2761049594727193e-159
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2761049594727193e-159
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.6902623986817984e-160
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.6902623986817984e-160
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.6902623986817984e-160
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.6902623986817984e-160
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4225655996704496e-160
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4225655996704496e-160
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4225655996704496e-160
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4225655996704496e-160
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.556413999176124e-161
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.556413999176124e-161
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.556413999176124e-161
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.556413999176124e-161
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.89103499794031e-162
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.89103499794031e-162
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.89103499794031e-162
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.89103499794031e-162
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2227587494850775e-162
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2227587494850775e-162
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2227587494850775e-162
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2227587494850775e-162
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.556896873712694e-163
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.556896873712694e-163
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.556896873712694e-163
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.556896873712694e-163
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104370 | 98 batches | lr 0.000421 | ms/batch 5119.46 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3892242184281734e-163
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3892242184281734e-163
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3892242184281734e-163
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3892242184281734e-163
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4730605460704336e-164
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4730605460704336e-164
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4730605460704336e-164
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4730605460704336e-164
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.682651365176084e-165
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.682651365176084e-165
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.682651365176084e-165
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.682651365176084e-165
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.170662841294021e-165
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.170662841294021e-165
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.170662841294021e-165
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.170662841294021e-165
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.426657103235053e-166
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.426657103235053e-166
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.426657103235053e-166
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.426657103235053e-166
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3566642758087631e-166
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3566642758087631e-166
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3566642758087631e-166
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3566642758087631e-166
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.391660689521908e-167
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.391660689521908e-167
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.391660689521908e-167
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.391660689521908e-167
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.47915172380477e-168
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.47915172380477e-168
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.47915172380477e-168
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.47915172380477e-168
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1197879309511924e-168
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1197879309511924e-168
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1197879309511924e-168
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1197879309511924e-168
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.299469827377981e-169
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.299469827377981e-169
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.299469827377981e-169
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.299469827377981e-169
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104380 | 108 batches | lr 0.000421 | ms/batch 5121.43 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3248674568444952e-169
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3248674568444952e-169
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3248674568444952e-169
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3248674568444952e-169
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.312168642111238e-170
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.312168642111238e-170
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.312168642111238e-170
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.312168642111238e-170
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.280421605278095e-171
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.280421605278095e-171
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.280421605278095e-171
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.280421605278095e-171
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.070105401319524e-171
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.070105401319524e-171
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.070105401319524e-171
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.070105401319524e-171
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.17526350329881e-172
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.17526350329881e-172
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.17526350329881e-172
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.17526350329881e-172
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2938158758247024e-172
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2938158758247024e-172
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2938158758247024e-172
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2938158758247024e-172
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.234539689561756e-173
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.234539689561756e-173
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.234539689561756e-173
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.234539689561756e-173
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.08634922390439e-174
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.08634922390439e-174
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.08634922390439e-174
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.08634922390439e-174
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0215873059760975e-174
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0215873059760975e-174
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0215873059760975e-174
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0215873059760975e-174
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.053968264940244e-175
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.053968264940244e-175
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.053968264940244e-175
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104390 | 118 batches | lr 0.000421 | ms/batch 5118.86 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.053968264940244e-175
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.263492066235061e-175
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.263492066235061e-175
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.263492066235061e-175
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.263492066235061e-175
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1587301655876523e-176
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1587301655876523e-176
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1587301655876523e-176
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1587301655876523e-176
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.896825413969131e-177
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.896825413969131e-177
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.896825413969131e-177
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.896825413969131e-177
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9742063534922827e-177
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9742063534922827e-177
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9742063534922827e-177
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9742063534922827e-177
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.935515883730707e-178
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.935515883730707e-178
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.935515883730707e-178
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.935515883730707e-178
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2338789709326767e-178
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2338789709326767e-178
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2338789709326767e-178
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2338789709326767e-178
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.084697427331692e-179
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.084697427331692e-179
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.084697427331692e-179
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.084697427331692e-179
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.71174356832923e-180
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.71174356832923e-180
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.71174356832923e-180
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.71174356832923e-180
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9279358920823073e-180
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9279358920823073e-180
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9279358920823073e-180
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9279358920823073e-180
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.819839730205768e-181
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.819839730205768e-181
| epoch 153 step 104400 | 128 batches | lr 0.000421 | ms/batch 5118.50 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.819839730205768e-181
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.819839730205768e-181
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.204959932551442e-181
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.204959932551442e-181
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.204959932551442e-181
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.204959932551442e-181
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.012399831378605e-182
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.012399831378605e-182
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.012399831378605e-182
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.012399831378605e-182
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.530999578446513e-183
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.530999578446513e-183
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.530999578446513e-183
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.530999578446513e-183
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8827498946116282e-183
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8827498946116282e-183
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8827498946116282e-183
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8827498946116282e-183
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.706874736529071e-184
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.706874736529071e-184
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.706874736529071e-184
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.706874736529071e-184
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1767186841322676e-184
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1767186841322676e-184
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1767186841322676e-184
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1767186841322676e-184
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.941796710330669e-185
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.941796710330669e-185
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.941796710330669e-185
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.941796710330669e-185
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.354491775826673e-186
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.354491775826673e-186
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.354491775826673e-186
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.354491775826673e-186
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8386229439566682e-186
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8386229439566682e-186
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8386229439566682e-186
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8386229439566682e-186
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5965573598916705e-187
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5965573598916705e-187
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5965573598916705e-187
WARNING:root:NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5965573598916705e-187
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104410 | 138 batches | lr 0.000421 | ms/batch 5117.97 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1491393399729176e-187
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1491393399729176e-187
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1491393399729176e-187
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1491393399729176e-187
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.872848349932294e-188
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.872848349932294e-188
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.872848349932294e-188
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.872848349932294e-188
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.182120874830735e-189
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.182120874830735e-189
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.182120874830735e-189
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.182120874830735e-189
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7955302187076838e-189
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7955302187076838e-189
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7955302187076838e-189
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7955302187076838e-189
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4888255467692094e-190
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4888255467692094e-190
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4888255467692094e-190
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4888255467692094e-190
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1222063866923024e-190
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1222063866923024e-190
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1222063866923024e-190
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1222063866923024e-190
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.805515966730756e-191
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.805515966730756e-191
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.805515966730756e-191
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.805515966730756e-191
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.01378991682689e-192
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.01378991682689e-192
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.01378991682689e-192
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.01378991682689e-192
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7534474792067224e-192
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7534474792067224e-192
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7534474792067224e-192
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7534474792067224e-192
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.383618698016806e-193
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.383618698016806e-193
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.383618698016806e-193
WARNING:root:NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.383618698016806e-193
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104420 | 148 batches | lr 0.000421 | ms/batch 5118.21 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0959046745042015e-193
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0959046745042015e-193
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0959046745042015e-193
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0959046745042015e-193
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.739761686260504e-194
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.739761686260504e-194
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.739761686260504e-194
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.739761686260504e-194
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.84940421565126e-195
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.84940421565126e-195
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.84940421565126e-195
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.84940421565126e-195
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.712351053912815e-195
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.712351053912815e-195
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.712351053912815e-195
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.712351053912815e-195
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.280877634782037e-196
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.280877634782037e-196
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.280877634782037e-196
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.280877634782037e-196
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0702194086955093e-196
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0702194086955093e-196
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0702194086955093e-196
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0702194086955093e-196
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6755485217387732e-197
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6755485217387732e-197
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6755485217387732e-197
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6755485217387732e-197
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.688871304346933e-198
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.688871304346933e-198
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.688871304346933e-198
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.688871304346933e-198
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6722178260867333e-198
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6722178260867333e-198
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6722178260867333e-198
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6722178260867333e-198
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.180544565216833e-199
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.180544565216833e-199
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.180544565216833e-199
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104430 | 158 batches | lr 0.000421 | ms/batch 5118.15 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.180544565216833e-199
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0451361413042083e-199
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0451361413042083e-199
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0451361413042083e-199
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0451361413042083e-199
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.612840353260521e-200
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.612840353260521e-200
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.612840353260521e-200
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.612840353260521e-200
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.532100883151302e-201
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.532100883151302e-201
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.532100883151302e-201
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.532100883151302e-201
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6330252207878255e-201
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6330252207878255e-201
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6330252207878255e-201
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6330252207878255e-201
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.082563051969564e-202
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.082563051969564e-202
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.082563051969564e-202
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.082563051969564e-202
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.020640762992391e-202
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.020640762992391e-202
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.020640762992391e-202
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.020640762992391e-202
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5516019074809773e-203
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5516019074809773e-203
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5516019074809773e-203
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5516019074809773e-203
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.379004768702443e-204
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.379004768702443e-204
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.379004768702443e-204
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.379004768702443e-204
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5947511921756108e-204
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5947511921756108e-204
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5947511921756108e-204
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5947511921756108e-204
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.986877980439027e-205
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.986877980439027e-205
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.986877980439027e-205
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.986877980439027e-205
| epoch 153 step 104440 | 168 batches | lr 0.000421 | ms/batch 5117.66 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.967194951097568e-206
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.967194951097568e-206
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.967194951097568e-206
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.967194951097568e-206
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.491798737774392e-206
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.491798737774392e-206
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.491798737774392e-206
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.491798737774392e-206
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.22949684443598e-207
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.22949684443598e-207
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.22949684443598e-207
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.22949684443598e-207
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.557374211108995e-207
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.557374211108995e-207
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.557374211108995e-207
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.557374211108995e-207
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8934355277724873e-208
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8934355277724873e-208
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8934355277724873e-208
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8934355277724873e-208
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.733588819431218e-209
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.733588819431218e-209
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.733588819431218e-209
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.733588819431218e-209
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4333972048578046e-209
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4333972048578046e-209
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4333972048578046e-209
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4333972048578046e-209
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.083493012144512e-210
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.083493012144512e-210
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.083493012144512e-210
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.083493012144512e-210
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.520873253036128e-210
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.520873253036128e-210
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.520873253036128e-210
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.520873253036128e-210
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.80218313259032e-211
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.80218313259032e-211
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.80218313259032e-211
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.80218313259032e-211
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104450 | 178 batches | lr 0.00042 | ms/batch 5117.37 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5054578314758e-212
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5054578314758e-212
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5054578314758e-212
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5054578314758e-212
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.37636445786895e-212
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.37636445786895e-212
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.37636445786895e-212
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.37636445786895e-212
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.940911144672375e-213
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.940911144672375e-213
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.940911144672375e-213
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.940911144672375e-213
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4852277861680936e-213
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4852277861680936e-213
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4852277861680936e-213
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4852277861680936e-213
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.713069465420234e-214
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.713069465420234e-214
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.713069465420234e-214
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.713069465420234e-214
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.282673663550585e-215
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.282673663550585e-215
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.282673663550585e-215
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.282673663550585e-215
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3206684158876463e-215
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3206684158876463e-215
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3206684158876463e-215
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3206684158876463e-215
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.801671039719116e-216
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.801671039719116e-216
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.801671039719116e-216
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.801671039719116e-216
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.450417759929779e-216
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.450417759929779e-216
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.450417759929779e-216
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.450417759929779e-216
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6260443998244473e-217
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6260443998244473e-217
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6260443998244473e-217
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104460 | 188 batches | lr 0.00042 | ms/batch 5118.34 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6260443998244473e-217
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.065110999561118e-218
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.065110999561118e-218
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.065110999561118e-218
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.065110999561118e-218
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2662777498902796e-218
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2662777498902796e-218
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2662777498902796e-218
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2662777498902796e-218
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.665694374725699e-219
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.665694374725699e-219
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.665694374725699e-219
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.665694374725699e-219
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4164235936814247e-219
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4164235936814247e-219
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4164235936814247e-219
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4164235936814247e-219
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.541058984203562e-220
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.541058984203562e-220
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.541058984203562e-220
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.541058984203562e-220
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.852647460508905e-221
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.852647460508905e-221
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.852647460508905e-221
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.852647460508905e-221
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2131618651272261e-221
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2131618651272261e-221
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2131618651272261e-221
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2131618651272261e-221
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.5329046628180653e-222
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.5329046628180653e-222
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.5329046628180653e-222
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.5329046628180653e-222
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3832261657045163e-222
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3832261657045163e-222
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3832261657045163e-222
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3832261657045163e-222
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.458065414261291e-223
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.458065414261291e-223
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104470 | 198 batches | lr 0.00042 | ms/batch 5118.37 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.458065414261291e-223
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.458065414261291e-223
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.645163535653227e-224
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.645163535653227e-224
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.645163535653227e-224
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.645163535653227e-224
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.161290883913307e-224
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.161290883913307e-224
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.161290883913307e-224
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.161290883913307e-224
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.403227209783267e-225
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.403227209783267e-225
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.403227209783267e-225
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.403227209783267e-225
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3508068024458167e-225
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3508068024458167e-225
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3508068024458167e-225
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3508068024458167e-225
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.377017006114542e-226
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.377017006114542e-226
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.377017006114542e-226
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.377017006114542e-226
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.442542515286355e-227
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.442542515286355e-227
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.442542515286355e-227
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.442542515286355e-227
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1106356288215886e-227
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1106356288215886e-227
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1106356288215886e-227
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1106356288215886e-227
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.276589072053972e-228
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.276589072053972e-228
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.276589072053972e-228
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.276589072053972e-228
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.319147268013493e-228
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.319147268013493e-228
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.319147268013493e-228
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.319147268013493e-228
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2978681700337323e-229
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104480 | 208 batches | lr 0.00042 | ms/batch 5119.34 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2978681700337323e-229
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2978681700337323e-229
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2978681700337323e-229
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.244670425084331e-230
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.244670425084331e-230
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.244670425084331e-230
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.244670425084331e-230
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0611676062710827e-230
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0611676062710827e-230
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0611676062710827e-230
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0611676062710827e-230
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.152919015677707e-231
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.152919015677707e-231
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.152919015677707e-231
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.152919015677707e-231
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2882297539194267e-231
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2882297539194267e-231
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2882297539194267e-231
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2882297539194267e-231
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.220574384798567e-232
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.220574384798567e-232
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.220574384798567e-232
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.220574384798567e-232
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.051435961996417e-233
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.051435961996417e-233
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.051435961996417e-233
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.051435961996417e-233
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0128589904991042e-233
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0128589904991042e-233
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0128589904991042e-233
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0128589904991042e-233
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.0321474762477604e-234
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.0321474762477604e-234
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.0321474762477604e-234
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.0321474762477604e-234
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2580368690619401e-234
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2580368690619401e-234
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2580368690619401e-234
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2580368690619401e-234
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1450921726548502e-235
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1450921726548502e-235
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104490 | 218 batches | lr 0.00042 | ms/batch 5118.17 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1450921726548502e-235
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1450921726548502e-235
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.862730431637126e-236
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.862730431637126e-236
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.862730431637126e-236
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.862730431637126e-236
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9656826079092814e-236
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9656826079092814e-236
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9656826079092814e-236
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9656826079092814e-236
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.914206519773204e-237
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.914206519773204e-237
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.914206519773204e-237
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.914206519773204e-237
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.228551629943301e-237
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.228551629943301e-237
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.228551629943301e-237
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.228551629943301e-237
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0713790748582522e-238
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0713790748582522e-238
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0713790748582522e-238
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0713790748582522e-238
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.678447687145631e-239
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.678447687145631e-239
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.678447687145631e-239
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.678447687145631e-239
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9196119217864076e-239
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9196119217864076e-239
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9196119217864076e-239
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9196119217864076e-239
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.799029804466019e-240
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.799029804466019e-240
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.799029804466019e-240
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.799029804466019e-240
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1997574511165048e-240
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1997574511165048e-240
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1997574511165048e-240
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1997574511165048e-240
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.999393627791262e-241
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.999393627791262e-241
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104500 | 228 batches | lr 0.00042 | ms/batch 5120.18 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.999393627791262e-241
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.999393627791262e-241
dropping batch of size torch.Size([2407, 2]) because not full seq_len of 16384
----------------------------------------------------------------------------------------------------
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| Eval 418 at step 104500 | time: 1347.99s | valid loss nan | bpc nan
----------------------------------------------------------------------------------------------------
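# Editor's note, hedged: the wall of "Gradient overflow. Skipping step ..." messages
# in this log is dynamic loss scaling at work. Whenever the scaled gradients contain
# an inf/NaN, the optimizer step is skipped and the loss scale is multiplied by a
# backoff factor (0.5, judging from the successive values printed here). Below is a
# minimal sketch of that scheme in plain PyTorch; the class name, method names and
# exact factors are illustrative assumptions, not apex's actual API.
import torch

class DynamicScaler:
    def __init__(self, init_scale=2.0**15, backoff=0.5, growth=2.0, growth_interval=2000):
        self.scale = init_scale
        self.backoff = backoff
        self.growth = growth
        self.growth_interval = growth_interval
        self._good_steps = 0

    def scale_loss(self, loss):
        # Multiply the loss before backward so small fp16 gradients stay representable.
        return loss * self.scale

    def _has_overflow(self, params):
        # True if any gradient contains inf or NaN.
        return any(p.grad is not None and not torch.isfinite(p.grad).all() for p in params)

    def step(self, optimizer, params):
        if self._has_overflow(params):
            # Skip the update and shrink the scale -- exactly what the log lines report.
            self.scale *= self.backoff
            self._good_steps = 0
            return False
        for p in params:
            if p.grad is not None:
                p.grad.div_(self.scale)  # un-scale before the real update
        optimizer.step()
        self._good_steps += 1
        if self._good_steps % self.growth_interval == 0:
            self.scale *= self.growth  # grow the scale back after a run of clean steps
        return True

# Once the loss itself is NaN (the "loss nan | bpc nan" lines), every step "overflows",
# so no parameter update is ever applied again and the scale just keeps halving toward
# zero; this is divergence, not a transient overflow that the scaler can recover from.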
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.498484069478155e-242
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.498484069478155e-242
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.498484069478155e-242
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.498484069478155e-242
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8746210173695387e-242
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8746210173695387e-242
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8746210173695387e-242
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8746210173695387e-242
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.686552543423847e-243
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.686552543423847e-243
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.686552543423847e-243
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.686552543423847e-243
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1716381358559617e-243
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1716381358559617e-243
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1716381358559617e-243
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1716381358559617e-243
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9290953396399042e-244
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9290953396399042e-244
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9290953396399042e-244
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9290953396399042e-244
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.322738349099761e-245
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.322738349099761e-245
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.322738349099761e-245
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.322738349099761e-245
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8306845872749401e-245
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8306845872749401e-245
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8306845872749401e-245
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8306845872749401e-245
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5767114681873503e-246
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5767114681873503e-246
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5767114681873503e-246
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5767114681873503e-246
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1441778670468376e-246
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1441778670468376e-246
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1441778670468376e-246
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1441778670468376e-246
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.860444667617094e-247
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.860444667617094e-247
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.860444667617094e-247
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104510 | 238 batches | lr 0.00042 | ms/batch 11941.77 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.860444667617094e-247
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.151111669042735e-248
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.151111669042735e-248
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.151111669042735e-248
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.151111669042735e-248
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7877779172606837e-248
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7877779172606837e-248
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7877779172606837e-248
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7877779172606837e-248
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4694447931517093e-249
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4694447931517093e-249
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4694447931517093e-249
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4694447931517093e-249
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1173611982879273e-249
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1173611982879273e-249
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1173611982879273e-249
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1173611982879273e-249
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7934029957198183e-250
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7934029957198183e-250
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7934029957198183e-250
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7934029957198183e-250
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.983507489299546e-251
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.983507489299546e-251
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.983507489299546e-251
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.983507489299546e-251
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7458768723248864e-251
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7458768723248864e-251
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7458768723248864e-251
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7458768723248864e-251
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.364692180812216e-252
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.364692180812216e-252
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.364692180812216e-252
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.364692180812216e-252
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.091173045203054e-252
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.091173045203054e-252
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.091173045203054e-252
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.091173045203054e-252
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.727932613007635e-253
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104520 | 248 batches | lr 0.00042 | ms/batch 5118.78 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.727932613007635e-253
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.727932613007635e-253
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.727932613007635e-253
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.819831532519088e-254
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.819831532519088e-254
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.819831532519088e-254
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.819831532519088e-254
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.704957883129772e-254
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.704957883129772e-254
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.704957883129772e-254
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.704957883129772e-254
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.26239470782443e-255
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.26239470782443e-255
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.26239470782443e-255
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.26239470782443e-255
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0655986769561075e-255
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0655986769561075e-255
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0655986769561075e-255
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0655986769561075e-255
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6639966923902686e-256
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6639966923902686e-256
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6639966923902686e-256
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6639966923902686e-256
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.659991730975672e-257
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.659991730975672e-257
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.659991730975672e-257
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.659991730975672e-257
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.664997932743918e-257
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.664997932743918e-257
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.664997932743918e-257
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.664997932743918e-257
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.162494831859795e-258
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.162494831859795e-258
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.162494831859795e-258
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.162494831859795e-258
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0406237079649487e-258
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0406237079649487e-258
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0406237079649487e-258
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0406237079649487e-258
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6015592699123717e-259
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6015592699123717e-259
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6015592699123717e-259
| epoch 153 step 104530 | 258 batches | lr 0.00042 | ms/batch 5118.20 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6015592699123717e-259
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.503898174780929e-260
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.503898174780929e-260
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.503898174780929e-260
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.503898174780929e-260
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6259745436952323e-260
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6259745436952323e-260
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6259745436952323e-260
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6259745436952323e-260
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.064936359238081e-261
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.064936359238081e-261
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.064936359238081e-261
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.064936359238081e-261
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0162340898095202e-261
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0162340898095202e-261
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0162340898095202e-261
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0162340898095202e-261
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5405852245238005e-262
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5405852245238005e-262
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5405852245238005e-262
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5405852245238005e-262
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.351463061309501e-263
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.351463061309501e-263
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.351463061309501e-263
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.351463061309501e-263
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5878657653273753e-263
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5878657653273753e-263
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5878657653273753e-263
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5878657653273753e-263
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9696644133184383e-264
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9696644133184383e-264
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9696644133184383e-264
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9696644133184383e-264
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.924161033296096e-265
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.924161033296096e-265
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.924161033296096e-265
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.924161033296096e-265
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.481040258324024e-265
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.481040258324024e-265
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.481040258324024e-265
| epoch 153 step 104540 | 268 batches | lr 0.00042 | ms/batch 5120.17 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.481040258324024e-265
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.20260064581006e-266
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.20260064581006e-266
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.20260064581006e-266
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.20260064581006e-266
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.550650161452515e-266
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.550650161452515e-266
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.550650161452515e-266
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.550650161452515e-266
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8766254036312874e-267
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8766254036312874e-267
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8766254036312874e-267
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8766254036312874e-267
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.691563509078218e-268
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.691563509078218e-268
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.691563509078218e-268
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.691563509078218e-268
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4228908772695546e-268
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4228908772695546e-268
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4228908772695546e-268
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4228908772695546e-268
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.057227193173887e-269
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.057227193173887e-269
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.057227193173887e-269
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.057227193173887e-269
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5143067982934716e-269
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5143067982934716e-269
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5143067982934716e-269
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5143067982934716e-269
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.785766995733679e-270
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.785766995733679e-270
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.785766995733679e-270
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.785766995733679e-270
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.464417489334198e-271
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.464417489334198e-271
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.464417489334198e-271
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.464417489334198e-271
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3661043723335494e-271
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3661043723335494e-271
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3661043723335494e-271
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3661043723335494e-271
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104550 | 278 batches | lr 0.00042 | ms/batch 5118.63 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.915260930833874e-272
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.915260930833874e-272
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.915260930833874e-272
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.915260930833874e-272
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4788152327084684e-272
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4788152327084684e-272
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4788152327084684e-272
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4788152327084684e-272
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.697038081771171e-273
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.697038081771171e-273
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.697038081771171e-273
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.697038081771171e-273
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.242595204427927e-274
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.242595204427927e-274
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.242595204427927e-274
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.242595204427927e-274
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.310648801106982e-274
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.310648801106982e-274
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.310648801106982e-274
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.310648801106982e-274
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.776622002767455e-275
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.776622002767455e-275
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.776622002767455e-275
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.776622002767455e-275
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4441555006918637e-275
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4441555006918637e-275
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4441555006918637e-275
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4441555006918637e-275
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.610388751729659e-276
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.610388751729659e-276
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.610388751729659e-276
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.610388751729659e-276
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.025971879324148e-277
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.025971879324148e-277
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.025971879324148e-277
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.025971879324148e-277
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.256492969831037e-277
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.256492969831037e-277
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.256492969831037e-277
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.256492969831037e-277
| epoch 153 step 104560 | 288 batches | lr 0.00042 | ms/batch 5119.90 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.641232424577593e-278
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.641232424577593e-278
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.641232424577593e-278
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.641232424577593e-278
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4103081061443981e-278
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4103081061443981e-278
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4103081061443981e-278
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4103081061443981e-278
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5257702653609953e-279
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5257702653609953e-279
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5257702653609953e-279
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5257702653609953e-279
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.814425663402488e-280
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.814425663402488e-280
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.814425663402488e-280
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.814425663402488e-280
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.203606415850622e-280
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.203606415850622e-280
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.203606415850622e-280
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.203606415850622e-280
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.509016039626555e-281
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.509016039626555e-281
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.509016039626555e-281
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.509016039626555e-281
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3772540099066388e-281
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3772540099066388e-281
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3772540099066388e-281
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3772540099066388e-281
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.443135024766597e-282
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.443135024766597e-282
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.443135024766597e-282
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.443135024766597e-282
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.607837561916492e-283
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.607837561916492e-283
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.607837561916492e-283
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.607837561916492e-283
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.151959390479123e-283
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.151959390479123e-283
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104570 | 298 batches | lr 0.00042 | ms/batch 5118.48 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.151959390479123e-283
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.151959390479123e-283
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.379898476197808e-284
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.379898476197808e-284
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.379898476197808e-284
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.379898476197808e-284
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.344974619049452e-284
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.344974619049452e-284
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.344974619049452e-284
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.344974619049452e-284
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.36243654762363e-285
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.36243654762363e-285
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.36243654762363e-285
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.36243654762363e-285
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.406091369059075e-286
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.406091369059075e-286
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.406091369059075e-286
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.406091369059075e-286
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1015228422647686e-286
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1015228422647686e-286
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1015228422647686e-286
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1015228422647686e-286
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.253807105661922e-287
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.253807105661922e-287
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.253807105661922e-287
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.253807105661922e-287
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3134517764154804e-287
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3134517764154804e-287
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3134517764154804e-287
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3134517764154804e-287
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.283629441038701e-288
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.283629441038701e-288
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.283629441038701e-288
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.283629441038701e-288
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.209073602596753e-289
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.209073602596753e-289
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.209073602596753e-289
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.209073602596753e-289
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0522684006491881e-289
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0522684006491881e-289
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0522684006491881e-289
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0522684006491881e-289
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104580 | 308 batches | lr 0.00042 | ms/batch 5118.17 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.1306710016229703e-290
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.1306710016229703e-290
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.1306710016229703e-290
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.1306710016229703e-290
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2826677504057426e-290
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2826677504057426e-290
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2826677504057426e-290
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2826677504057426e-290
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2066693760143564e-291
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2066693760143564e-291
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2066693760143564e-291
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2066693760143564e-291
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.016673440035891e-292
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.016673440035891e-292
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.016673440035891e-292
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.016673440035891e-292
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.004168360008973e-292
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.004168360008973e-292
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.004168360008973e-292
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.004168360008973e-292
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.010420900022432e-293
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.010420900022432e-293
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.010420900022432e-293
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.010420900022432e-293
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.252605225005608e-293
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.252605225005608e-293
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.252605225005608e-293
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.252605225005608e-293
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.13151306251402e-294
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.13151306251402e-294
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.13151306251402e-294
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.13151306251402e-294
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.82878265628505e-295
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.82878265628505e-295
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.82878265628505e-295
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.82878265628505e-295
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9571956640712625e-295
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9571956640712625e-295
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104590 | 318 batches | lr 0.00042 | ms/batch 5117.51 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9571956640712625e-295
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9571956640712625e-295
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.892989160178156e-296
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.892989160178156e-296
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.892989160178156e-296
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.892989160178156e-296
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.223247290044539e-296
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.223247290044539e-296
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.223247290044539e-296
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.223247290044539e-296
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0581182251113476e-297
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0581182251113476e-297
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0581182251113476e-297
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0581182251113476e-297
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.645295562778369e-298
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.645295562778369e-298
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.645295562778369e-298
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.645295562778369e-298
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9113238906945923e-298
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9113238906945923e-298
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9113238906945923e-298
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9113238906945923e-298
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.778309726736481e-299
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.778309726736481e-299
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.778309726736481e-299
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.778309726736481e-299
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1945774316841202e-299
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1945774316841202e-299
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1945774316841202e-299
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1945774316841202e-299
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9864435792103004e-300
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9864435792103004e-300
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9864435792103004e-300
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9864435792103004e-300
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.466108948025751e-301
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.466108948025751e-301
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.466108948025751e-301
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.466108948025751e-301
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8665272370064378e-301
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8665272370064378e-301
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8665272370064378e-301
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104600 | 328 batches | lr 0.00042 | ms/batch 5117.56 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8665272370064378e-301
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.6663180925160944e-302
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.6663180925160944e-302
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.6663180925160944e-302
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.6663180925160944e-302
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1665795231290236e-302
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1665795231290236e-302
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1665795231290236e-302
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1665795231290236e-302
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.916448807822559e-303
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.916448807822559e-303
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.916448807822559e-303
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.916448807822559e-303
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.291122019556398e-304
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.291122019556398e-304
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.291122019556398e-304
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.291122019556398e-304
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8227805048890994e-304
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8227805048890994e-304
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8227805048890994e-304
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8227805048890994e-304
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5569512622227484e-305
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5569512622227484e-305
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5569512622227484e-305
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5569512622227484e-305
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1392378155556871e-305
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1392378155556871e-305
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1392378155556871e-305
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1392378155556871e-305
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.848094538889218e-306
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.848094538889218e-306
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.848094538889218e-306
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.848094538889218e-306
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.120236347223045e-307
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.120236347223045e-307
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.120236347223045e-307
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.120236347223045e-307
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7800590868057611e-307
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7800590868057611e-307
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7800590868057611e-307
| epoch 153 step 104610 | 338 batches | lr 0.00042 | ms/batch 5118.73 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7800590868057611e-307
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.450147717014403e-308
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.450147717014403e-308
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.450147717014403e-308
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.450147717014403e-308
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1125369292536007e-308
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1125369292536007e-308
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1125369292536007e-308
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1125369292536007e-308
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.781342323134e-309
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.781342323134e-309
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.781342323134e-309
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.781342323134e-309
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.953355807835e-310
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.953355807835e-310
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.953355807835e-310
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.953355807835e-310
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.73833895195875e-310
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.73833895195875e-310
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.73833895195875e-310
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.73833895195875e-310
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.345847379897e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.345847379897e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.345847379897e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.345847379897e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.086461844974e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.086461844974e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.086461844974e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.086461844974e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.716154612436e-312
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.716154612436e-312
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.716154612436e-312
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.716154612436e-312
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.7903865311e-313
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.7903865311e-313
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.7903865311e-313
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.7903865311e-313
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.69759663277e-313
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104620 | 348 batches | lr 0.00042 | ms/batch 5118.36 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.69759663277e-313
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.69759663277e-313
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.69759663277e-313
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.243991582e-314
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.243991582e-314
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.243991582e-314
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.243991582e-314
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0609978955e-314
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0609978955e-314
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0609978955e-314
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0609978955e-314
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.65249474e-315
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.65249474e-315
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.65249474e-315
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.65249474e-315
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.63123685e-316
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.63123685e-316
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.63123685e-316
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.63123685e-316
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6578092e-316
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6578092e-316
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6578092e-316
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6578092e-316
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.144523e-317
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.144523e-317
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.144523e-317
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.144523e-317
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.036131e-317
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.036131e-317
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.036131e-317
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.036131e-317
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.590327e-318
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.590327e-318
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.590327e-318
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.590327e-318
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.4758e-319
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.4758e-319
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.4758e-319
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.4758e-319
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.61895e-319
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.61895e-319
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.61895e-319
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.61895e-319
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
| epoch 153 step 104630 | 358 batches | lr 0.00042 | ms/batch 5118.32 | loss nan | bpc nan
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0474e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0474e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0474e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0474e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.012e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.012e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.012e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.012e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.53e-321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.53e-321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.53e-321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.53e-321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.3e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.3e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.3e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.3e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
Traceback (most recent call last):
  File "train.py", line 679, in <module>
    train()
  File "train.py", line 558, in train
    scaled_loss.backward()
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 182, in post_backward_with_master_weights
    models_are_masters=False)
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/scaler.py", line 117, in unscale
    1./scale)
ZeroDivisionError: float division by zero
Traceback (most recent call last):
  File "train.py", line 679, in <module>
    train()
  File "train.py", line 558, in train
    scaled_loss.backward()
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 182, in post_backward_with_master_weights
    models_are_masters=False)
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/scaler.py", line 117, in unscale
    1./scale)
ZeroDivisionError: float division by zero
Traceback (most recent call last):
  File "train.py", line 679, in <module>
    train()
  File "train.py", line 558, in train
    scaled_loss.backward()
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 182, in post_backward_with_master_weights
    models_are_masters=False)
  File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/scaler.py", line 117, in unscale
    1./scale)
ZeroDivisionError: float division by zero
Traceback (most recent call last):
File "train.py", line 679, in <module>
train()
File "train.py", line 558, in train
scaled_loss.backward()
File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/contextlib.py", line 119, in __exit__
next(self.gen)
File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
optimizer._post_amp_backward(loss_scaler)
File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 182, in post_backward_with_master_weights
models_are_masters=False)
File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/apex/amp/scaler.py", line 117, in unscale
1./scale)
ZeroDivisionError: float division by zero
Traceback (most recent call last):
File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/torch/distributed/launch.py", line 246, in <module>
main()
File "/home/armanc/miniconda3/envs/transformers/lib/python3.7/site-packages/torch/distributed/launch.py", line 242, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/armanc/miniconda3/envs/transformers/bin/python', '-u', 'train.py', '--local_rank=3', '--cuda', '--data', '../data/text8/', '--dataset', 'text8', '--n_layer', '12', '--d_model', '512', '--n_head', '8', '--d_head', '64', '--d_inner', '2048', '--dropout', '0.1', '--dropatt', '0.0', '--optim', 'adam', '--lr', '0.00025', '--warmup_step', '0', '--max_step', '400000', '--tgt_len', '512', '--mem_len', '512', '--eval_tgt_len', '128', '--batch_size', '22', '--multi_gpu', '--gpu0_bsz', '-1', '--work_dir', '/net/nfs.corp/s2-research/lm-long/exp_10_18/seq16k_w512_bsz4_nodilation_lr0.0005_batch8', '--attention_mode', 'tvm', '--mem_len', '0', '--tgt_len', '16384', '--eval_tgt_len', '16384', '--attention_window', '512', '--dilation_increase', '0', '--lr', '0.0005', '--batch_size', '2', '--warmup', '2000', '--fp16', '--eval-interval', '250', '--batch_chunk', '2']' returned non-zero exit status 1.
```
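A short note on the failure mode visible in the tail of this log (my reading of the output above, not part of the run itself): once the loss turns NaN, every step overflows, apex's dynamic loss scaler keeps shrinking the loss scale, the scale underflows through the smallest subnormal doubles (1.6e-322, 4e-323, 1e-323, ...) to 0.0, and the next unscale in apex/amp/scaler.py evaluates 1./scale, which is the ZeroDivisionError that kills every rank. The snippet below is a minimal, self-contained sketch of that arithmetic, not apex code; the function name and parameters are invented for illustration, and the halving factor is apex's documented per-overflow default rather than anything measured from this log.

```python
# Hypothetical sketch (not apex): repeatedly shrinking a dynamic loss scale on
# overflow drives it to 0.0, after which the unscale step (roughly grad *= 1./scale)
# raises the "float division by zero" seen in the traceback above.

def simulate_overflow_run(initial_scale=2.0 ** 15, max_overflow_steps=1200):
    scale = initial_scale
    for step in range(max_overflow_steps):
        # Assume every step overflows, as in the tail of this log.
        scale /= 2.0  # apex's default dynamic scaler halves the scale on overflow
        if scale == 0.0:
            print(f"loss scale underflowed to 0.0 after {step + 1} overflowed steps")
            break
    # The unscale is roughly grad.mul_(1. / scale); with scale == 0.0 it divides by zero.
    return 1.0 / scale


try:
    simulate_overflow_run()
except ZeroDivisionError as exc:
    print("ZeroDivisionError:", exc)
```

The usual ways to keep a run like this alive are to stop the loss from going NaN in the first place (lower learning rate, gradient clipping, fp32 for the unstable ops) or to floor the loss scale / use a static scale so it cannot collapse to zero; I have not verified which, if any, of those was applied after this crash.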